Fine LLMs (Not fine-tuning, folks!)
The emergence of large language models (LLMs) has ignited a debate between two contrasting narratives: the notion that LLMs get a "free ride" on existing content versus the need to safeguard individual creators and their intellectual property. The central question: Can LLMs be granted unrestricted access to content for training purposes simply because that content is freely accessible to the public?
Proponents of open access argue that utilizing publicly available data ultimately benefits society by driving technological advancements. However, this stance is met with strong opposition from artists and news organizations who contend that their work, regardless of its public accessibility, remains their intellectual property and should not be exploited without proper authorization or compensation. This assertion is underscored by the recent open letter signed by over 10,500 artists, including prominent figures in Hollywood, music, and literature, protesting the unauthorized use of their creative works for AI model training.
Apprehension about the use of personal data for AI training adds another layer of complexity to this debate. Professionals are increasingly wary of how Microsoft-owned platforms such as LinkedIn intend to use their data to train AI models. The close relationship between Microsoft and OpenAI amplifies these concerns, raising questions about whether personal data could flow from LinkedIn into OpenAI's models.
Further intensifying this debate are the lawsuits filed by prominent news organizations, such as The Wall Street Journal and The New York Post, against Perplexity AI, alleging copyright infringement. These publications argue that Perplexity is essentially "free-riding" on their content by using it to train LLMs and even serving users complete articles, particularly subscribers to Perplexity's premium service. This practice, they argue, not only deprives them of potential revenue but also risks harming their brand reputation, because Perplexity's AI occasionally generates inaccurate or fabricated information and attributes it to their publications.
OpenAI, the company behind ChatGPT, has proactively secured licensing agreements with certain news outlets, including a reported $250 million deal with News Corp., the parent company of The Wall Street Journal and The New York Post. Even so, the legal actions against Perplexity underscore the persistent tension between AI companies seeking expansive training datasets and content creators demanding fair compensation and control over their intellectual property. That tension is further highlighted by similar lawsuits from other news organizations, such as The New York Times's suit against OpenAI, and by actions taken by Condé Nast and Amazon against Perplexity.
The lawsuits against Perplexity and the licensing agreement between OpenAI and News Corp. raise an intriguing possibility: Could LLM companies evolve into a novel revenue stream for news and content companies? If so, content creators could capitalize on the growing demand for training data by negotiating favorable agreements with LLM developers, potentially reshaping the relationship between these entities.
The future trajectory of LLMs hinges on striking a delicate balance between fostering innovation and safeguarding the rights of individuals and creators. The outcomes of these legal battles and the nature of future agreements will profoundly shape how AI interacts with creative content, influencing the advancement of LLMs and the livelihoods of those who generate the content upon which these models depend.
Furthermore, an intriguing question emerges: as LLMs increasingly generate content, will they eventually be trained on their own output, creating a self-referential loop? This scenario, akin to a snake eating its own tail and known in the research literature as "model collapse," could lead to various issues, including bias amplification, stagnation of creativity, and diminished accuracy and reliability. Addressing these challenges will require careful research and deliberate measures, such as diversifying training data, developing mechanisms to filter AI-generated content out of training corpora (see the sketch below), and continuously evaluating and refining LLM training processes.
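To make one of those mitigations concrete, here is a minimal Python sketch of filtering suspected AI-generated documents out of a corpus before the next training run. Everything in it is an illustrative assumption rather than any company's actual pipeline: the score_ai_likelihood detector is a hypothetical placeholder (a real system might plug in a trained classifier, provenance metadata, or a watermark check), and the 0.5 threshold is likewise arbitrary.

```python
from dataclasses import dataclass


@dataclass
class Document:
    text: str
    source: str  # e.g. "licensed_news", "web_crawl"


def score_ai_likelihood(doc: Document) -> float:
    """Hypothetical detector: returns a score in [0, 1] estimating how likely
    the document is machine-generated. The string heuristic below is a crude
    stand-in for illustration only; a production system would use a trained
    classifier or provenance/watermark signals instead."""
    return 0.9 if "as an ai language model" in doc.text.lower() else 0.1


def filter_corpus(docs: list[Document], threshold: float = 0.5) -> list[Document]:
    """Keep only documents whose AI-likelihood score falls below the threshold,
    so the next training round leans on human-authored material and the
    self-referential loop described above is dampened."""
    return [d for d in docs if score_ai_likelihood(d) < threshold]


if __name__ == "__main__":
    corpus = [
        Document("Reporting from the courthouse, the verdict came down...", "licensed_news"),
        Document("As an AI language model, I cannot provide...", "web_crawl"),
    ]
    kept = filter_corpus(corpus)
    print(f"Kept {len(kept)} of {len(corpus)} documents")  # -> Kept 1 of 2
```

The design point is simply that filtering happens upstream of training: each candidate document is scored once and excluded before it can contaminate the next model generation, rather than being corrected after the fact.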