The Significance of Human Input in Generative AI

Wikipedia serves as a prime example of human-machine interaction. Will this dynamic evolve with the advent of AI?

So far, Wikipedia has had a rather successful run. Despite occasional challenges over the past two decades, the English edition of the online encyclopedia, the largest and most widely used, has grown to over 6.7 million articles and a community of more than 118,000 active editors. It serves as the default information source for human readers and is relied upon by search engines and digital assistants alike.

Will everything change with the rise of generative AI? I don't believe it will.

"In early 2021, an insightful Wikipedia contributor gazed into the future and discerned what appeared to be a looming transformation: the emergence of GPT-3, a precursor to the latest chatbots from OpenAI," wrote author and journalist Jon Gertner in a recent article for The New York Times Magazine titled "Wikipedia’s Moment of Truth." "This dedicated Wikipedian, who operates under the username Barkeep49 on the platform, decided to test the new technology and observed that it exhibited unreliability. The bot often incorporated fictional elements, such as false names or fictitious academic citations, into otherwise factual and coherent responses. However, there was no doubt in his mind about its potential."

For a considerable time, I've heavily depended on Wikipedia as my primary online resource for exploring subjects I wish to delve into further. Additionally, I've frequently utilized Wikipedia articles as references in the blogs I publish.

Beyond Just an Encyclopedia

In the last decade, Wikipedia has evolved into something greater than a mere encyclopedia; it has transformed into "a sort of factual scaffolding that underpins the entire digital realm." Frequently, search engines such as Google and Bing, as well as digital assistants like Siri and Alexa, depend on Wikipedia to furnish the information required to respond to users' queries.

In recent times, Wikipedia has also become one of the primary sources of training data, comprising roughly 3% to 5% of the content used to train large language models (LLMs) and the chatbots built on them. This outsized contribution reflects the fact that its data is extensive, freely available, high-quality, and meticulously maintained.

Having experimented with GPT-3 in 2021, Barkeep49 penned an essay bearing the foreboding title "Death of Wikipedia." In this essay, he contemplated the potential scenario in which Wikipedia could relinquish its status as the foremost English-language encyclopedia. He mused, "For another encyclopedia to supplant Wikipedia, it would need to rival some of the advantages we have painstakingly accumulated over the years. Specifically, the fact that we boast millions of articles and that articles of utmost interest receive swift updates."

He further noted that it appeared improbable for any prospective successor encyclopedia to embrace the same level of dedication to principles such as transparency, non-commercial objectives, and the liberal licensing that permits the free reuse of its content. He emphasized, "These values have not only enhanced our reputation but have also generated immense value for global readers. I firmly believe the world would be disadvantaged by an encyclopedia that diverged from these essential principles."

Is the End Approaching?

"I find it unlikely that humans could close the gap with us. However, artificial intelligence could potentially do so. AI is currently making exponential advancements, and as I write this in October 2023, it has already demonstrated the ability to produce some remarkably proficient text." While he expressed optimism in the immediate future, Barkeep49 also cautioned that...

he held concerns that over the long term, AI might replace Wikipedia and its human editors in a manner akin to how Wikipedia had superseded Britannica.

"In the Wikipedia community, there exists a tentative optimism that AI, if appropriately handled, can facilitate the organization's enhancement instead of causing its downfall," as stated in the NYT Magazine article. "However, even if the editors were to triumph in the immediate future, one couldn't help but ponder: Wouldn't the machines ultimately emerge victorious?"

Predictions heralding the demise of Wikipedia have persisted ever since its establishment in January 2001. In the run-up to Wikipedia's 20th anniversary, Professor Joseph Reagle of Northeastern University penned "The many (reported) deaths of Wikipedia," a historical essay examining the recurrent prognostications of its demise over the past two decades and how Wikipedia managed to evolve and persevere in the face of them.

Reagle observed that in its early years, Wikipedia's critics and founders alike reasoned about its future in two main ways. First, they looked to analogous encyclopedia projects to gauge feasibility and found that even well-funded, established efforts, such as Microsoft's Encarta, had faltered in their attempts to build a lasting online encyclopedia. Second, they extrapolated from Wikipedia's arduous first six months, assuming that pace would serve as the standard for the next seven years.

Reagle noted, "The only model that remained unexplored was the concept of exponential growth, a trend that defined Wikipedia article creation until approximately 2007." During its inaugural year, Wikipedia enthusiasts aspired to eventually reach 100,000 articles, a scale surpassing that of most printed encyclopedias. They calculated that by generating 1,000 articles each month, they could approach this milestone in approximately seven years.

As it happened, by September 2007 the English Wikipedia had reached two million articles, a staggering twentyfold increase over the initial goal.

By 2009, it had become evident that the growth of new articles in the English Wikipedia had slowed or plateaued, and that activity was shifting toward experienced editors rather than drawing in fresh contributors. The number of active editors declined from a peak of 53,000 in 2007 to approximately 30,000 by 2014. A 2015 opinion piece in The New York Times asked whether Wikipedia could endure, speculating that its troubles were exacerbated by the rapid proliferation of smartphones, which make editing Wikipedia articles more cumbersome than working on a laptop.

"Nevertheless, it seems that the count of active editors has remained consistent since 2014, consistently staying above 29,000. This pattern of rapid growth followed by a plateau is a common occurrence in the world of wikis," Reagle wrote.

"The only forecast I would venture for the upcoming decade is that Wikipedia will undoubtedly remain in existence," he continued, "The platform and its community possess an enduring momentum that no alternative can easily replace. Furthermore, by that time, the Wikimedia Endowment, initiated in 2016, is expected to have successfully reached its objective of amassing $100+ million to sustain its projects 'in perpetuity.' While the English Wikipedia community will inevitably confront challenges and crises, as it has in the past, I do not anticipate a scenario where only a static collection of unaltered articles prevails."

As of September 2023, the English Wikipedia contains more than 6.7 million articles and has over 118,000 active editors, defined as contributors who have made one or more edits in the past 30 days.

The Human Element

As per Gertner's piece in The New York Times Magazine, "Wikipedia presently encompasses versions in 334 languages, with a cumulative article count surpassing 61 million. It maintains its position among the top 10 most-visited websites globally, but notably distinguishes itself from this exclusive set, which typically comprises profit-driven giants like Google, YouTube, and Facebook, by prioritizing a non-profit ethos."

What matters most about Wikipedia to generative AI, however, is that its knowledge is generated by human contributors.

"The recent AI chatbots have, for the most part, assimilated the entirety of Wikipedia's content," noted Gertner. "Nestled deep within their answers to inquiries lies the wealth of information and textual data sourced from Wikipedia, the result of years of meticulous contributions by human volunteers." Following a conference call with several Wikipedia editors, he further remarked, "One resounding message from the conference call was evident: We aspire to uphold a world where knowledge originates from human effort. But the question that lingers is whether it might already be too late for that ideal to persist?"

Ensuring that generative AI systems are trained on content crafted by human hands goes well beyond idealistic resistance to AI.

Without human-generated training data, AI systems are bound to degrade and malfunction.

A research paper published in May 2023, "The Curse of Recursion: Training on Generated Data Makes Models Forget," defined this phenomenon in detail and termed it "model collapse."

A more straightforward explanation of model collapse can be found in a recent TechTarget article titled "Model collapse explained: How synthetic training data disrupts AI." According to the article, model collapse occurs when new generative models are trained on AI-generated content and gradually deteriorate as a result. In this situation, the models begin to forget the true underlying data distribution, even if the distribution remains unchanged. Consequently, these models start losing information related to less common but still significant aspects of the data. As successive generations of AI models evolve, they tend to produce outputs that are increasingly similar and less diverse.

"Generative AI models require training on data created by humans to operate effectively. When trained on content generated by the model itself, new models demonstrate permanent flaws."

Their outputs become progressively more 'inaccurate' and uniform. Researchers discovered that even under the most favorable learning circumstances, model collapse was an unavoidable outcome.
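To make the mechanism concrete, here is a minimal, hypothetical Python sketch (not drawn from the paper or from the TechTarget article) of the feedback loop behind model collapse: each "generation" fits a simple Gaussian model to data sampled from the previous generation's model rather than from the original human data. Over successive generations the estimated spread tends to drift and shrink, the toy analogue of a model forgetting the rare, tail-end parts of the true distribution.

```python
import numpy as np

# Toy illustration of model collapse: each generation "trains" on
# synthetic data sampled from the previous generation's model
# instead of on the original human-generated data.
rng = np.random.default_rng(seed=0)

# "Human" data: a standard normal distribution (mean 0, std 1).
data = rng.normal(loc=0.0, scale=1.0, size=10_000)

n_generations = 20   # how many times models are trained on model output
sample_size = 200    # synthetic examples passed to the next generation

for gen in range(n_generations + 1):
    # "Train" generation `gen`: estimate the distribution's parameters.
    mu, sigma = data.mean(), data.std()
    print(f"generation {gen:2d}: mean = {mu:+.3f}, std = {sigma:.3f}")
    # The next generation never sees the human data, only samples
    # drawn from the model just fitted.
    data = rng.normal(loc=mu, scale=sigma, size=sample_size)
```

Run with different seeds, the estimated standard deviation wanders and, on average, contracts, so later generations assign less and less probability to values that were common in the tails of the original data. Real LLMs are vastly more complex, but the dynamic is the same: rare facts and phrasings are sampled less often than they actually occur, so each generation trained on synthetic text sees even less of them.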

The significance of model collapse lies in the transformative impact that generative AI is expected to have on digital content. Increasingly, online communications are being generated, in part or entirely, through AI tools. Broadly, this trend has the potential to generate widespread data pollution.

However efficiently large volumes of text can now be produced, the concept of model collapse suggests that such data will be of little value for training the next generation of AI models.

