Marcel Salathé’s Post


Professor EPFL, Co-Director EPFL AI Center

"#AI models training on #AI output will decrease in quality" - this claim was never fully convincing. Microsoft's #Phi4 just demonstrated the opposite. Their latest small language model performs exceptionally well, with high-quality synthetic data playing a particularly important role. Why this apparent contradiction? The answer is straightforward. LLMs learn from data, and early models trained on web content - which contains errors, to put it mildly 😅. We've long known that better training data leads to better models. Blindly training on AI outputs without curation won't improve quality, since the AI output won't be much better than what it was trained on. And because AI makes mistakes, this leads to a degradation over time - similar to Muller's ratchet in evolution, where unchecked mutations (which have mostly negative effects) spiral into mutational meltdown. Thankfully, evolution has selection - a process that preserves beneficial mutations while eliminating harmful ones. Similarly, a robust selection process for AI-generated training data prevents degradation and enables better models. In other words: The source of the data, human or AI, matters less than its quality. In both cases, quality is maintained through rigorous processes. For fact-based matters, the best process is science and logic. AI models will probably follow the same path. https://lnkd.in/eb7bVWJf

Marcel Salathé

Professor EPFL, Co-Director EPFL AI Center

2w

Another way out of Muller's ratchet is recombination, but I did not quite want to go there... https://en.wikipedia.org/wiki/Muller%27s_ratchet Maybe for the next generation of models 😅

Lindsey DeWitt Prat, PhD

Director of Research @ Bold Insight | Language & culture research | Global dot connector | Translator (Japanese to English) & author | Good friend & mountain lover

2w

Thank you for sharing this. I have one comment I'd appreciate your feedback on. You write, "The source of the data, human or AI, matters less than its quality." While this statement emphasizes quality over source, Phi4's own methodology shows that quality fundamentally depends on human-generated data and curation. The high-quality synthetic data pipeline begins with human-generated seeds and human filtering. In other words, source and quality cannot be meaningfully separated, and both are dependent upon human input.

Dr. Eva-Maria Hempe

Healthcare & Life Sciences Leader EMEA @NVIDIA | Supercharging healthcare with AI | Servant leader, high-energy speaker and avid rower

2w

Quality matters - a truism but sometimes it's important to remember the basics :)
