"#AI models training on #AI output will decrease in quality" - this claim was never fully convincing. Microsoft's #Phi4 just demonstrated the opposite: their latest small language model performs exceptionally well, with high-quality synthetic data playing a particularly important role.

Why this apparent contradiction? The answer is straightforward. LLMs learn from data, and early models trained on web content - which contains errors, to put it mildly 😅. We've long known that better training data leads to better models.

Blindly training on AI outputs without curation won't improve quality, since the AI's output can't be much better than the data it was trained on. And because AI makes mistakes, this leads to degradation over time - similar to Muller's ratchet in evolution, where unchecked mutations (which have mostly negative effects) accumulate into mutational meltdown.

Thankfully, evolution has selection - a process that preserves beneficial mutations while eliminating harmful ones. Similarly, a robust selection process for AI-generated training data prevents degradation and enables better models.

In other words: the source of the data, human or AI, matters less than its quality. In both cases, quality is maintained through rigorous processes. For fact-based matters, the best such process is science and logic. AI models will probably follow the same path.

https://lnkd.in/eb7bVWJf
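The generate → select → train loop described above can be sketched as a toy simulation. Everything here is an illustrative assumption (the noise model, the "keep the top half" filter, the scalar notion of quality), not Phi-4's actual data pipeline; the point is only to show how a selection step can offset the slightly-negative drift of uncurated synthetic data.

```python
import random


def generate_synthetic(model_quality: float, n: int, rng: random.Random) -> list[float]:
    # Toy "model outputs": each sample's quality equals the current model's
    # quality plus noise whose mean is slightly negative (mistakes dominate,
    # as with unchecked mutations). All numbers are illustrative assumptions.
    return [model_quality + rng.gauss(-0.05, 0.2) for _ in range(n)]


def train(samples: list[float]) -> float:
    # Toy "training": the next generation's quality is the mean sample quality.
    return sum(samples) / len(samples)


def iterate(generations: int, select: bool, seed: int = 0) -> float:
    # Run several model generations, each trained on the previous one's output.
    rng = random.Random(seed)
    quality = 1.0
    for _ in range(generations):
        samples = generate_synthetic(quality, 200, rng)
        if select:
            # Selection: keep only the top half of samples by quality score,
            # analogous to curating synthetic data before training on it.
            samples = sorted(samples, reverse=True)[:100]
        quality = train(samples)
    return quality
```

Running `iterate(20, select=False)` shows quality drifting downward generation after generation (the "mutational meltdown" case), while `iterate(20, select=True)` stays high: the filter more than compensates for the negative drift.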
Thank you for sharing this. I have one comment I'd appreciate your feedback on. You write, "The source of the data, human or AI, matters less than its quality." While this statement emphasizes quality over source, Phi-4's own methodology shows that quality fundamentally depends on human-generated data and curation: the high-quality synthetic data pipeline begins with human-generated seeds and human filtering. In other words, source and quality cannot be meaningfully separated, and both depend on human input.
Quality matters - a truism, but sometimes it's important to remember the basics :)
Professor EPFL, Co-Director EPFL AI Center
Another way out of Muller's ratchet is recombination, but I did not quite want to go there... https://en.wikipedia.org/wiki/Muller%27s_ratchet Maybe for the next generation of models 😅