Executive Summary

Welcome! We're the Wikimedia Foundation's Research team. We develop models and insights using scientific methods, and we strengthen the Wikimedia research communities. We do this work to support the technology and policy needs of the Wikimedia projects and to advance the understanding of the projects. This Research Report is an overview of our team's latest developments – an entry point that highlights existing and new work and details new collaborations and considerations.

Between January and June 2024, we worked on a variety of projects and initiatives:

We responded to the technology needs of the Wikimedia projects by scaling existing ML models to more languages (see the Add-a-link ML model, now extended to 19 more Wikipedia languages for a total of 297), investing in multilingual models that can reduce the number of models maintained in production (see the exploratory work for Add-a-link, which is expected to reduce the number of models in production from 300 to 5 or fewer), developing a new generation of patrolling models for Wikidata, testing the deployment of new ML models for product applications (see the Article Description ML model), informing decision making (see the recommendations for Wikifunctions's focus), and more.

We supported policy work on the Wikimedia projects by initiating a whitepaper to help Wikipedia researchers more effectively navigate the privacy needs and expectations of Wikipedia editors. We also wrote a whitepaper that positions Wikipedia as an antidote to disinformation.

We advanced the understanding of the Wikimedia projects by empirically showing (ACL 2024) that in many Wikipedia languages, much of the content is not accessible to readers with average reading ability. We then successfully advocated for a new line of product work to simplify text for Wikipedia readers.

We invested in further strengthening the Wikimedia research community by co-hosting the 11th edition of Wiki Workshop, awarding the Wikimedia Foundation Research Award of the Year, and distributing more than 250,000 USD to 9 research proposals as part of the Wikimedia Research Fund. We also co-organized the AI and Knowledge Commons event.

The above are only examples of what we achieved together as a team at the service of the Movement. We invite you to read on to learn more about the breadth and depth of our work. And remember: the Research team's work is always a work in progress. That is why we publish these Research Reports: to keep you updated on what we think is important to know about our most recent work and the work planned for the coming six months.

Projects

  • Address Knowledge Gaps

    We develop models and insights using scientific methods to identify, measure, and bridge Wikimedia’s knowledge gaps

    A deeper understanding of content contributions

    We expanded the evaluation and error analysis of Edisum, our large language model prototype trained to generate edit summary recommendations on English Wikipedia. This enabled us to further analyze its effectiveness and compare it with existing commercial LLMs. (Paper) (Learn more)

    An improved readership experience through better new user onboarding

    We tested our new link recommendation model after finalizing and incorporating the mwtokenizer library to support tokenization across Wikipedia languages, including languages that do not use whitespace to separate tokens, such as Japanese. Our results show that the model can now run in many more languages without encountering tokenization errors, and has substantially improved performance in many “non-whitespace” languages. (Learn more)
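
    To illustrate what this tokenization support looks like, here is a minimal sketch of tokenizing a Japanese sentence with mwtokenizer. The exact class and method names (Tokenizer, wordpunct_tokenize) are assumptions based on the package's documented usage and may differ from the released API.

        # Minimal sketch: tokenizing text in a non-whitespace language with mwtokenizer.
        # The class and method names below are assumptions; check the package docs.
        from mwtokenizer.tokenizer import Tokenizer

        tokenizer = Tokenizer(language_code="ja")  # Japanese: no whitespace between tokens
        text = "ウィキペディアは誰でも編集できる百科事典です。"
        tokens = list(tokenizer.wordpunct_tokenize(text))
        print(tokens)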

    A model to increase the visibility of articles

    We developed a multilingual model to help editors de-orphan articles on Wikipedia by adding links to them. This model identifies the best places in the text to add a link when suitable anchor text does not already exist. We demonstrated that our model outperforms other methods, including GPT-4, across 20 languages in a zero-shot setup. (Learn more)

    Metrics to measure knowledge gaps

    Our focus was on three primary areas: ensuring the adoption and usage of the knowledge gap metrics and measurements we have developed over the past years, improving accessibility to the relevant data, and continuing research to develop metrics for knowledge gaps that do not yet have one.

    We developed an expanded geography content gap metric, which counts the number of Wikipedia articles related to different geographic regions. One limitation of the previous metric was that it geolocated Wikipedia articles solely based on their Wikidata items’ geo coordinates. This approach limited the number of articles that could be geolocated and thus counted by the geographic gap metric. We deployed a new version of the metric, expanding the set of Wikidata properties used to geolocate articles (see cultural model). This resulted in almost doubling the number of geolocated articles (from 28% to 55%) and a more comprehensive geographic gap metric. This work was part of our contributions to one of the Foundation’s top annual plan goals, Infrastructure: Building the infrastructure of knowledge as a service. (Data)
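
    To illustrate the idea behind the expanded metric, the sketch below geolocates an article's Wikidata item by first checking its coordinate location (P625) and then falling back to other location-bearing properties. The fallback property list is illustrative only and is not necessarily the set used in the deployed model.

        # Illustrative sketch of property-based geolocation for a Wikidata item.
        # `claims` is the "claims" dict returned by the wbgetentities API for one item.
        # The fallback property list is an example, not the deployed model's list,
        # and edge cases (e.g. somevalue/novalue snaks) are ignored for brevity.
        FALLBACK_PROPERTIES = [
            "P17",   # country
            "P495",  # country of origin
            "P27",   # country of citizenship
            "P19",   # place of birth
        ]

        def geolocate(claims: dict):
            """Return ("coordinates", (lat, lon)), ("item", QID), or None."""
            if "P625" in claims:  # coordinate location
                value = claims["P625"][0]["mainsnak"]["datavalue"]["value"]
                return ("coordinates", (value["latitude"], value["longitude"]))
            for prop in FALLBACK_PROPERTIES:
                if prop in claims:
                    qid = claims[prop][0]["mainsnak"]["datavalue"]["value"]["id"]
                    return ("item", qid)  # resolve to a geographic region in a second step
            return None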

    We improved accessibility to our knowledge gaps data. We focused on publishing article quality scores, which are the backbone of all of our knowledge gaps metrics that measure gaps in quality across different categories. We created a dataset of language-agnostic feature values and quality scores for all revisions of all articles in the existing language versions of Wikipedia, which was accepted for publication at ICWSM 2024 (paper, dataset). Additionally, we worked on improving the current article quality model in preparation for its productization and deployment to Liftwing, our ML model serving platform. (Task)
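
    As a rough illustration of what language-agnostic quality features look like, the sketch below extracts a few structural signals from an article's wikitext with mwparserfromhell. These are example features only, not the exact feature set used by the article quality model.

        # Illustrative sketch: simple language-agnostic structural features from wikitext.
        import mwparserfromhell

        def structural_features(wikitext: str) -> dict:
            code = mwparserfromhell.parse(wikitext)
            return {
                "page_length": len(wikitext),
                "num_wikilinks": len(code.filter_wikilinks()),
                "num_templates": len(code.filter_templates()),
                "num_headings": len(code.filter_headings()),
                "num_refs": len(code.filter_tags(
                    matches=lambda t: str(t.tag).strip().lower() == "ref")),
            }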

    We continued research towards developing more content gap metrics. Due to other priorities, we mostly focused our attention on developing metrics for the language and readability gaps, which you can learn more about below.

    We continued developing measurements for language coverage of Wikimedia projects, and found that the number of pre-hosted projects (or ‘test wikis’) is much higher than the number of hosted projects. For the 20 most spoken languages, we also compared the number of Wikipedia articles with the number of language speakers, and found large imbalances in terms of representation. (Learn more)

    We continued advancing our research on the readability gap. A paper detailing our readability scoring model has been accepted for publication at ACL 2024. Our latest results show that for many languages on Wikipedia, the content is not accessible to those with average reading ability. We also continued our studies of subjective readability perception, and designed a revised survey in which we asked participants to evaluate the readability of text, explicitly taking respondents’ level of formal education into account in the analysis. (Learn more)

    Finally, we advanced survey work on three main fronts: Reader surveys, Community Insights surveys, and streamlining survey analysis:

    We started analyzing the results of the 2023 Global Reader Surveys. Among the main readers’ gaps, we found that: readers skew towards men, with about 63% of readers self-identifying as men; they are highly multilingual, with 55% indicating they speak two or more languages fluently; overall, readers are very well-educated (among non-students, 56% have a bachelor’s degree or more); and they skew young (37% of readers older than 18 are below 30 years of age). (Learn more)

    We launched the Community Insights survey across 29 languages over the whole month of April. We deduplicated and de-identified the data, calculated weights and response rates, and harmonized it with the data from previous editions of the survey. We have now started the data analysis. (Task)

    We built repeatable frameworks for analyzing survey data, via automation and other documentation practices and templates, and tested those as part of our analysis support for the December 2023 Developer Satisfaction Survey. Compared to the previous iteration of the survey, the time from survey close to completed visualizations was reduced from 18 weeks to 8 weeks. (Task) (Code)
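
    As a small example of the kind of step such a framework automates, the sketch below computes per-group response rates and simple post-stratification weights with pandas. The schema (one row per person, a grouping column) and column names are hypothetical, not the survey's actual data model.

        # Hypothetical sketch of one automated survey-analysis step:
        # per-group response rates and post-stratification weights with pandas.
        import pandas as pd

        def response_rates(invited: pd.DataFrame, responses: pd.DataFrame, group: str) -> pd.DataFrame:
            """invited/responses: one row per person, each with a `group` column (hypothetical schema)."""
            n_invited = invited.groupby(group).size().rename("invited")
            n_responded = responses.groupby(group).size().rename("responded")
            table = pd.concat([n_invited, n_responded], axis=1).fillna(0)
            table["response_rate"] = table["responded"] / table["invited"]
            # Post-stratification weight: up-weight groups that responded less than average.
            overall_rate = table["responded"].sum() / table["invited"].sum()
            table["weight"] = overall_rate / table["response_rate"]
            return table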

    A deeper understanding of the role of visual knowledge in Wikimedia Projects

    Due to other priorities, this project was paused during the period of this report. As we do not expect to prioritize work on this front in the coming six months, we will drop this project from future reports and will reintroduce it when we start investing time in it again.

    A deeper understanding of reader navigation

    As most research work has concluded on this front, we focused on disseminating existing work on reader navigation: we presented research on curious rhythms at ICWSM 2024 (Paper). The “google2wiki” project, our effort to create a dataset combining Google Trends and the Wikipedia clickstream, was paused during the period of this report due to other priorities.

    A unified framework for equitable article prioritization

    We are moving to a framework in which our multilingual topic models will be adopted by several product features to help contributors prioritize articles based on their topics of choice (see KR WE 2.1 in the WMF Annual Plan). We are working on expanding the model so that an article can be associated with the countries that are relevant to its content (Learn more). We are also expanding, in a more systematic way, the taxonomy of topics that the model can detect, to reflect the topical areas that are most relevant to our communities (Task). In addition, we keep iterating on feedback about our List-Building tool, which campaign organizers are testing to more easily build article worklists relevant to their topic areas. (Learn more)

    Large language models for text simplification

    We completed our first round of exploratory experiments with large language models for text simplification. We developed a model for automatic text simplification by fine-tuning Flan-T5 on pairs of articles from English Wikipedia (original) and Simple English Wikipedia (simplified), with promising results in line with the state of the art. We are now working on adapting this model to a multilingual setting. (Learn more)
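
    For readers curious about the setup, the sketch below shows the general shape of fine-tuning Flan-T5 on aligned original/simplified text pairs with Hugging Face Transformers. The model size, prompt prefix, hyperparameters, and the tiny in-memory dataset are placeholders, not the exact configuration we used.

        # Sketch of fine-tuning Flan-T5 for text simplification with Hugging Face Transformers.
        # Model size, prompt prefix, hyperparameters, and the dataset are placeholders.
        from datasets import Dataset
        from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq,
                                  Seq2SeqTrainer, Seq2SeqTrainingArguments)

        checkpoint = "google/flan-t5-base"
        tokenizer = AutoTokenizer.from_pretrained(checkpoint)
        model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

        # Hypothetical aligned pairs: English Wikipedia text -> Simple English Wikipedia text.
        pairs = Dataset.from_dict({
            "source": ["The feline was perched atop the sill of the window."],
            "target": ["The cat sat on the window sill."],
        })

        def preprocess(batch):
            inputs = tokenizer(["simplify: " + s for s in batch["source"]],
                               truncation=True, max_length=512)
            labels = tokenizer(text_target=batch["target"], truncation=True, max_length=512)
            inputs["labels"] = labels["input_ids"]
            return inputs

        tokenized = pairs.map(preprocess, batched=True, remove_columns=pairs.column_names)

        trainer = Seq2SeqTrainer(
            model=model,
            args=Seq2SeqTrainingArguments(output_dir="flan-t5-simplify", num_train_epochs=1,
                                          per_device_train_batch_size=8, learning_rate=3e-4),
            train_dataset=tokenized,
            data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
        )
        trainer.train()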

  • Improve Knowledge Integrity

    Reference quality in Wikipedia as a service

    Building on our research on reference quality in English Wikipedia, we started expanding the reference quality model to more languages. Our goal is to work with the ML team to bring the final model to production in support of Wikipedia Enterprise needs. (Task)

    Enhanced models for content patrolling

    We have finalized the research for the Wikidata Revert Risk model (Task), a natural extension of the Wikipedia Language-Agnostic Revert Risk model. Once in production, the model, which performs better than the existing model (ORES), can be used to improve the efficiency of edit patrolling by Wikidata editors.
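
    For context, the existing Language-Agnostic Revert Risk model can already be queried on Lift Wing; the sketch below shows such a request, and the forthcoming Wikidata model is expected to be served in a similar way. The endpoint name and payload follow the public API documentation as we understand it, so verify them before relying on this.

        # Sketch: scoring an edit with the language-agnostic Revert Risk model on Lift Wing.
        # Endpoint name and payload are based on our reading of the public API docs;
        # verify against https://api.wikimedia.org before use.
        import requests

        url = ("https://api.wikimedia.org/service/lw/inference/v1/models/"
               "revertrisk-language-agnostic:predict")
        payload = {"rev_id": 1215512461, "lang": "en"}  # example revision ID (hypothetical)
        response = requests.post(url, json=payload,
                                 headers={"User-Agent": "research-report-example"})
        print(response.json())  # e.g. a prediction plus class probabilities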

    A project to help develop critical readers

    Our analysis on knowledge networks of Wikipedia readers to understand and operationalize curiosity in self-motivated information seeking has concluded, and we revised the paper based on the reviews we received after submission to the journal Science Advances. We have now resubmitted the revised paper.

    Wikipedia Knowledge Integrity Risk Observatory

    As of last year, we concluded the research on this front. We continue to raise awareness about the outputs of this project: a private dashboard as well as the datasets. These outputs are currently used by the Trust and Safety and Moderator Tools teams at WMF and can serve other use cases as needs emerge.

    A model for understanding knowledge propagation across Wikimedia projects

    This project is concluded, and we do not expect to revisit it in the coming period due to other priorities.

  • Building the foundations

    Ethical AI

    Today many teams in WMF are interested in experimenting with or using AI models in products and features. In the past, the Research team would receive the use cases and develop or tune an existing AI model for each specific use case. As demand for such models has increased while the Research team's resources for developing them have remained constant, we identified an opportunity to develop a framework and a set of recommendations for models that teams can start using in products and features without relying heavily on our team. We will have more updates about this work in a future report. (Task) (Learn more)

    Survey support for WMF and the affiliates

    We continued our service to support survey work across the Wikimedia movement. We supported surveys from some Wikimedia affiliates (Wikimedia Mexico, Latin America Training Center, Wikimedia Uruguay, Wiki Loves Africa, Wikimedia User Group Nigeria, and Wikimedia Deutschland) by drafting, reviewing, or revising survey questions, sharing data privacy practices, and providing survey tooling. We continued to advocate for the maintenance and improvements of QuickSurveys, our on-wiki survey infrastructure.

    Frameworks for metric definition and dissemination

    We continued our contributions to Infrastructure: Building the infrastructure of knowledge as a service, one of the top-line goals of the Wikimedia Foundation’s annual plan for 2023-2024, on two fronts. First, we formalized the essential metrics definition, which has now been adopted by the Wikimedia Foundation. Second, we implemented the process for computing, visualizing, and presenting business-critical annual plan metrics (core metrics), and released public metrics reports for the first three quarters of the fiscal year. This project is now concluded.

    A data pipeline for improved efficiency

    We productionized the wikidiff pipeline. This is a dataset of unified diffs for the full Wikimedia revision history, which has traditionally been difficult and expensive to compute, but is frequently useful for research scientists on our team. The wikidiff dataset is now used in production by the Risk Observatory and the Revert Risk model training dataset pipelines, and research scientists are looking to use it in other projects. (Example notebook code)
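
    The core operation behind the pipeline, computing a unified diff between two revisions of the same page, can be illustrated with Python's standard library. The production pipeline runs this at scale over the full revision history; the snippet below is only a single-pair sketch with made-up revision text.

        # Single-pair sketch of the unified diff the wikidiff pipeline computes at scale.
        import difflib

        old_revision = "Paris is the capital of France.\nIt lies on the Seine.\n"
        new_revision = "Paris is the capital and largest city of France.\nIt lies on the Seine.\n"

        diff = difflib.unified_diff(
            old_revision.splitlines(keepends=True),
            new_revision.splitlines(keepends=True),
            fromfile="rev_100", tofile="rev_101",  # hypothetical revision IDs
        )
        print("".join(diff))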

    A proof-of-concept end-to-end model training pipeline: while we test and develop many ML/AI models, we do not have access to an end-to-end ML model training pipeline. This means researchers developing models have to hack together pipelines for specific projects, which is time-consuming and error-prone. We developed our first proof-of-concept end-to-end model training pipeline in Airflow (code), which will eventually allow us to take advantage of more state-of-the-art models (e.g., LLMs) with fresher data, address model quality and latency issues, and shorten the path from research models to product features.
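
    As a sketch of what such a pipeline skeleton looks like in Airflow, the DAG below chains placeholder tasks for data extraction, feature building, training, and evaluation. Task names, paths, and bodies are illustrative, not the actual pipeline.

        # Illustrative Airflow DAG skeleton for an end-to-end model training pipeline.
        # Task names and bodies are placeholders, not the actual production pipeline.
        from datetime import datetime
        from airflow.decorators import dag, task

        @dag(schedule="@weekly", start_date=datetime(2024, 1, 1), catchup=False)
        def model_training_pipeline():
            @task
            def extract_training_data() -> str:
                return "/tmp/training_data.parquet"  # placeholder path

            @task
            def build_features(data_path: str) -> str:
                return "/tmp/features.parquet"  # placeholder path

            @task
            def train_model(features_path: str) -> str:
                return "/tmp/model.bin"  # placeholder path

            @task
            def evaluate_model(model_path: str) -> None:
                pass  # placeholder: compute and log evaluation metrics

            evaluate_model(train_model(build_features(extract_training_data())))

        model_training_pipeline()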

  • Strengthening the research communities

    Research Community Mapping

    Over the years we have started and maintained multiple initiatives to build and strengthen the Wikimedia research communities. With the updated team mission, we took the opportunity to step back and reflect on what we mean when we say we want to strengthen the Wikimedia research communities. We also (re-)defined why we run each of these initiatives. We documented the result of this work in the Wikimedia Research Community Mapping, which we hope will help you learn more about what we do on this front, why we do it, and how you can join in.

    Wikimedia Hackathon

    We participated in the Wikimedia Hackathon, an annual event that brings together the global Wikimedia technical community to improve the technological infrastructure and software that powers and benefits the Wikimedia projects. We ran a session on ML models developed by the Research team and hosted on LiftWing, aimed at showing developers how they can use research models in their work. We also worked on a tool to easily visualize the dataset containing the differentially private country-level pageview data, and on creating a dataset for automatic summarization of discussions on Wikipedia.

    Research Showcases

    Our January showcase focused on Connecting Actions with Policy and featured a discussion on the impact of reliable-source guidelines on marginalized communities and on transparent stance detection in Wikipedia editor discussions. In February, we explored sociotechnical designs for democratic governance of social media and AI. March focused on gender gaps, with studies on recommender systems for content gaps and the importance of offline Wikipedia meetups. April's showcase was about multimedia support on Wikipedia; it addressed image accessibility and introduced Wiki2Story, an automatic approach to converting articles into mobile-friendly Web Stories. May's theme was the reader-to-editor pipeline, featuring research on journey transitions and the Wikimedia Foundation’s Growth features to improve the newcomer experience. (Learn more)

    Office Hour Series

    We continued to provide one-on-one office hours to answer questions regarding proposed or in-progress research work, provide information on how to access datasets or run an analysis, share insights about programming and initiatives led by the Wikimedia Research team, explore collaborations, and more. (Learn more)

    Mentorship through Outreachy

    We continue to participate in Outreachy and are excited to have a new intern onboard. Mahima Agarwal will be contributing towards building a data visualization tool for the evolution of Wikipedia articles maintained by WikiProjects. She will be working with Pablo Aragón, Isaac Johnson, and Caroline Myrick.

    Mentorship through Internships

    In February, Destinie James began working on improving our language-agnostic article quality model. Destinie’s work involved testing the accuracy of the model in more languages, adding more features, making our feature extraction pipeline more robust by switching from wikitext to HTML input, and improving our library for parsing HTML dumps.

    Presentations and keynotes

    We engaged with research audiences through the following presentations and keynotes during the past six months:

    In January, we participated in the Wikipedia Day NYC event. The special focus for this year was Artificial intelligence in Wikimedia projects and we took part in a panel conversation titled “Wikimedia AI Futures”. (Video)

    In February, we gave a talk at EPFL on “Research at the Wikimedia Foundation to advance knowledge equity”. We were invited by Robert West, head of the Data Science Lab at EPFL and Research Fellow of the Wikimedia Research Team. We presented recent research on better understanding and supporting Wikipedia’s readers and discussed ongoing and potential formal collaborations with members of the team.

    In March, we participated in the 3rd edition of the Free Culture Days (III Jornadas de Cultura Libre) at Universidad Rey Juan Carlos in Madrid (Spain). We were invited by the Office of Free Knowledge and Free Culture of this university for a talk on “Wikimedia Research in the age of Artificial Intelligence”. (Video)

    In May, we gave a talk on "Research to serve Wikimedia developers" at esLibre 2024, a conference in Valencia (Spain) on free software, free hardware, and free culture. The talk was the opening session of a track organized by the Wikimedia Spain chapter, which included other talks by WMF staff designing translation tools for Wikipedia and by academic researchers working on graphic galleries with Wikidata. (Website)

    In June, along with formal collaborators, we attended the ICWSM 2024 conference, where we made multiple contributions. First, we gave a tutorial on using public data from Wikipedia and its sister projects for academic research (tutorial website). We also presented 4 peer-reviewed papers (publications) and engaged with the computational social science community interested in research on Wikimedia projects.

    In June, we participated in a panel at the Wikipedia Zukunftskongress organized by Wikimedia Deutschland (in German, automatic translation). The panel discussed the future of Wikipedia, focusing on where the community is heading. We shared insights from our recent work on addressing knowledge gaps, such as the size of the gender gap among readers.

    Research Fund

    We received 76 applications as part of this year’s Research Fund grant cycle. Of those, 23 were desk-rejected. The remaining 53 applications went through the complete first-stage review process, and 13 were invited to Stage II. We accepted 9 of the 13 Stage II applications and are distributing a total of 270,687 USD among them. This year's round of funding continues our commitment to expanding and strengthening the global network of Wikimedia researchers. We aim to fund projects that support the technology and policy needs of the Wikimedia projects and advance our understanding of the Wikimedia projects.

    Wiki Workshop

    We celebrated the 11th edition of Wiki Workshop on June 20, 2024. More than 200 participants joined us in this year's edition, including researchers, Wikimedia volunteers, WMF staff, and more. We received 50 submissions on a variety of topics, and we had dedicated sessions on Governance & Policy, Representation and Gender, Inclusion of Communities, and more. A new track was created for staff members, affiliates, developers, and volunteers to hold thematic sessions that foster connections between the Wikimedia research community and the Wikimedia movement at large.

    If you missed the event, worry not! You can review the schedule, check out the accepted extended abstracts, and even watch the recorded sessions.

  • Additionally

    An updated framework for user-centered product development. In the past, product teams at WMF used user personas to think about, discuss, and develop for the users of their products. With the strategic focus of WMF Product and Technology teams shifting to supporting the projects in becoming multigenerational, there was a need to revisit the existing personas, and that is what we did. In collaboration with the Product Design team, we developed the first version of the Wikimedia Volunteer Archetypes. We conducted a survey and gathered data from roughly 1,400 respondents who volunteer their time online. We then analyzed and grouped the responses to arrive at 6 archetypes. (Task)
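
    The report does not specify how the responses were grouped. Purely as an illustration of one common way to group survey responses into a small number of archetypes, the sketch below clusters standardized numeric answers into six groups with k-means; it is not necessarily the method used here, and the synthetic data is fake.

        # Purely illustrative: clustering survey responses into 6 groups with k-means.
        # This is not necessarily the method used to derive the Volunteer Archetypes.
        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.preprocessing import StandardScaler

        rng = np.random.default_rng(0)
        responses = rng.integers(1, 6, size=(1400, 12)).astype(float)  # fake Likert-style answers

        scaled = StandardScaler().fit_transform(responses)
        labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(scaled)
        print(np.bincount(labels))  # respondents per (illustrative) archetype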

    Research to bridge knowledge gaps with MinT (Machine in Translation) for readers. We supported the Language Team via the MinT Research Project. Through research with readers and editors, we tested and drove iteration on new design concepts aimed at bridging language gaps by leveraging machine translation to help readers access content otherwise only available in some language versions of Wikipedia. Among other impacts, improvements based on the research results led to new entry points, including a footer entry point (task). Enabling this entry point on 23 pilot wikis resulted almost immediately in a 300% increase in translation requests, helping bridge the gap between readers and encyclopedic content.

Collaborations

The Research team's work has been made possible through the contributions of our past and present formal collaborators. During the last six months, we established the following new collaborations:

Events

  1. NLP for Wikipedia

    November 2024 (Miami, FL and virtual)
    We are co-organizing a workshop focused on natural language tasks and datasets related to Wikipedia at EMNLP (a premier natural language processing conference) in November 2024. Learn more
  2. Research Showcases

    Every 3rd Wednesday of the month (virtual)
    Join us for Wikimedia-related research presentations and discussions. The showcases are great entry points into the world of Wikimedia research and for connecting with other Wikimedia researchers. Learn more
  3. Research Office Hours

    Throughout the month (virtual)
    You can book a 1:1 consultation session with a member of the Research team to seek advice on your data- or research-related questions. All are welcome! Book a session
  4. Wikimania 2024

    August 2024 (Katowice)
    The 19th edition of Wikimania will take place in Katowice, Poland from 7–10 August. Members of the Research team will be in attendance, and the program will include a dedicated Research track. Program

Donors

Funds for the Research team are provided by donors who give to the Wikimedia Foundation in different ways. Thank you!

Keep in touch with us

The Wikimedia Foundation's Research team is part of a global network of researchers who advance our understanding of the Wikimedia projects. We invite everyone who wants to stay in touch with this network to join the public wiki-research-l mailing list.

In addition, we offer one-on-one conversation times with our team members. You are welcome to schedule one of our upcoming Public Office Hours.

We look forward to connecting with you.

Previous Report

Research Report Nº 9