"#AI models training on #AI output will decrease in quality" - this claim was never fully convincing. Microsoft's #Phi4 just demonstrated the opposite: their latest small language model performs exceptionally well, with high-quality synthetic data playing a particularly important role.

Why this apparent contradiction? The answer is straightforward. LLMs learn from data, and early models trained on web content - which contains errors, to put it mildly 😅. We've long known that better training data leads to better models.

Blindly training on AI outputs without curation won't improve quality, since the AI's output can't be much better than the data it was trained on. And because AI makes mistakes, this leads to degradation over time - similar to Muller's ratchet in evolution, where unchecked mutations (which have mostly negative effects) accumulate into mutational meltdown.

Thankfully, evolution has selection - a process that preserves beneficial mutations while eliminating harmful ones. Similarly, a robust selection process for AI-generated training data prevents degradation and enables better models.

In other words: the source of the data, human or AI, matters less than its quality. In both cases, quality is maintained through rigorous processes. For fact-based matters, the best such process is science and logic. AI models will probably follow the same path.

https://lnkd.in/eb7bVWJf
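The generate → select → train loop described above can be sketched as a toy simulation. Everything here is an illustrative assumption (the noise model, the "keep the top half" filter, the scalar notion of quality), not Phi-4's actual data pipeline; the point is only to show how a selection step can offset the slightly-negative drift of uncurated synthetic data.

```python
import random


def generate_synthetic(model_quality: float, n: int, rng: random.Random) -> list[float]:
    # Toy "model outputs": each sample's quality equals the current model's
    # quality plus noise whose mean is slightly negative (mistakes dominate,
    # as with unchecked mutations). All numbers are illustrative assumptions.
    return [model_quality + rng.gauss(-0.05, 0.2) for _ in range(n)]


def train(samples: list[float]) -> float:
    # Toy "training": the next generation's quality is the mean sample quality.
    return sum(samples) / len(samples)


def iterate(generations: int, select: bool, seed: int = 0) -> float:
    # Run several model generations, each trained on the previous one's output.
    rng = random.Random(seed)
    quality = 1.0
    for _ in range(generations):
        samples = generate_synthetic(quality, 200, rng)
        if select:
            # Selection: keep only the top half of samples by quality score,
            # analogous to curating synthetic data before training on it.
            samples = sorted(samples, reverse=True)[:100]
        quality = train(samples)
    return quality
```

Running `iterate(20, select=False)` shows quality drifting downward generation after generation (the "mutational meltdown" case), while `iterate(20, select=True)` stays high: the filter more than compensates for the negative drift.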
Thank you for sharing this. I have one comment I'd appreciate your feedback on. You write, "The source of the data, human or AI, matters less than its quality." While this statement emphasizes quality over source, Phi-4's own methodology shows that quality fundamentally depends on human-generated data and curation: the high-quality synthetic data pipeline begins with human-generated seeds and human filtering. In other words, source and quality cannot be meaningfully separated, and both depend on human input.
Quality matters - a truism, but sometimes it's important to remember the basics :)
Professor EPFL, Co-Director EPFL AI Center
Another way out of Muller's ratchet is recombination, but I did not quite want to go there... https://en.wikipedia.org/wiki/Muller%27s_ratchet Maybe for the next generation of models 😅