#54: From WEIRD to Worldwide: Revolutionizing AI

Equitable AI: Raising the Bar for Non-English Language Models

LLM developers, beware! Overstating the multilingual capabilities of AI models creates significant risks in non-English contexts, from inaccurate information to failures in moderating harmful content. To address this, follow these crucial steps: avoid assuming that training in one language transfers to others; include unique benchmarks for specific languages; use benchmarks that are not machine-translated; disclose the volume and sources of training data per language; and test for vulnerabilities in non-English languages.

Foundation model developers claim impressive performance across multiple languages, but these claims often fall short, especially for "low-resource" languages with limited training data. Models are predominantly tested in English, with fewer and less robust non-English benchmarks. This disparity risks inappropriate deployment in non-English contexts, potentially causing issues like misleading information or inadequate content moderation.

CDT has previously highlighted the limitations of multilingual LLMs in non-English languages and suggested improvements. It now recommends that foundation model developers improve non-English benchmarking and transparency in five ways:

  1. Question Cross-Lingual Transfer Assumptions: Training models in one language doesn't ensure competence in others. Models trained mainly on English data show only modest performance in low-resource languages, which is often attributed to "cross-lingual transfer," a theory still under debate. Developers shouldn't assume this transfer ensures model safety across languages.
  2. Develop Unique Language Benchmarks: Most non-English benchmarks are translated from English, missing cultural nuances. Foundation model developers should create and use monolingual benchmarks for various languages to better assess performance in real-world contexts.
  3. Avoid Sole Reliance on Machine-Translated Benchmarks: Machine-translated benchmarks can misrepresent real language use. Models should be tested with a mix of human-written, human-translated, and machine-translated texts to ensure accuracy across languages.
  4. Disclose Training Data Details: Sharing information about the volume and sources of training data for each language helps developers fine-tune models for specific languages. Open weight models have been more transparent in this regard, setting an example for others.
  5. Test Multilingual Vulnerabilities: Models can be compromised with translated adversarial prompts. Developers should engage in multilingual red-teaming to identify and address safety issues in all languages, not just English.
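
One way to put recommendations 2 and 3 into practice is to report results per language and per benchmark origin rather than as a single multilingual average. The sketch below is only an illustration under stated assumptions: the record layout, the source labels ("human_written", "human_translated", "machine_translated") and both function names are hypothetical and do not come from CDT or any specific evaluation library.

```python
from collections import defaultdict

# Hypothetical source labels; a real evaluation suite may use different ones.
REQUIRED_SOURCES = ("human_written", "human_translated", "machine_translated")


def accuracy_by_language_and_source(records):
    """Return {language: {source: accuracy}}.

    `records` is an iterable of (language, benchmark_source, is_correct)
    tuples, one per evaluated benchmark item (an assumed layout).
    """
    # totals[language][source] holds [number correct, number seen].
    totals = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for language, source, is_correct in records:
        cell = totals[language][source]
        cell[0] += int(is_correct)
        cell[1] += 1
    return {
        language: {
            source: correct / seen
            for source, (correct, seen) in by_source.items()
        }
        for language, by_source in totals.items()
    }


def missing_sources(report, required=REQUIRED_SOURCES):
    """Return {language: [sources that language was never tested on]}."""
    gaps = {}
    for language, by_source in report.items():
        absent = [s for s in required if s not in by_source]
        if absent:
            gaps[language] = absent
    return gaps
```

Calling `missing_sources(accuracy_by_language_and_source(records))` would then flag, for example, a language that has only ever been scored on machine-translated items, which is exactly the gap recommendation 3 warns about.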

By adopting these practices, foundation model developers can ensure their models are reliable and effective across different languages, allowing for safer and more accurate applications worldwide.


From WEIRD to Worldly: Making AI Truly Global

Large language models (LLMs) have advanced significantly in generating and analyzing text. However, when comparing their performance to humans, it's crucial to ask, "Which humans?" Current literature often overlooks the cultural and psychological diversity of humans worldwide, which is not fully represented in the data LLMs are trained on.

This fascinating paper just out demonstrates that when AI researchers describe LLM performance by comparing with that of 'humans', they actually mean humans from WEIRD countries (Western, Educated, Industrialized, Rich and Democratic).

"We show that LLMs’ responses to psychological measures are an outlier compared with large-scale cross-cultural data, and that their performance on cognitive psychological tasks most resembles that of people from Western, Educated, Industrialized, Rich, and Democratic (WEIRD) societies but declines rapidly as we move away from these populations (r = -.70). Ignoring cross-cultural diversity in both human and machine psychology raises numerous scientific and ethical issues."

Research shows that LLMs' responses to psychological tests are outliers when compared to diverse global data. Their performance closely mirrors that of people from Western, Educated, Industrialized, Rich, and Democratic (WEIRD) societies but declines significantly with populations outside these groups (correlation of -0.70). This oversight of cross-cultural diversity in both human and machine psychology poses scientific and ethical concerns. The paper concludes by suggesting methods to reduce WEIRD bias in future LLMs.
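
To make the r = -.70 figure more concrete, here is a minimal sketch of how a correlation of that kind can be computed. It assumes the reported value is a Pearson correlation between a per-country measure of cultural distance and a per-country measure of how closely LLM responses resemble human responses; the excerpt above does not spell this out. The variable names and the numbers are toy values made up for illustration, not data from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(spread_x * spread_y)


# Toy values only: one entry per (hypothetical) country.
cultural_distance = [0.00, 0.05, 0.10, 0.15, 0.20]
llm_human_similarity = [0.90, 0.80, 0.85, 0.60, 0.50]

r = pearson_r(cultural_distance, llm_human_similarity)
print(f"r = {r:.2f}")  # negative: similarity falls as distance grows
```

A strongly negative r means the two quantities move in opposite directions: the further a population sits from the WEIRD reference point, the less the model's answers look like that population's answers.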

No Language Left Behind: Revolutionizing Global Translation

Newly published in Nature: No Language Left Behind is an AI model created by researchers at Meta capable of translation between 200 languages — including low-resource languages. NLLB-200 includes 200 languages, contains three times as many low-resource languages as high-resource languages and performs 44% better than prior systems. This work aims to give people the opportunity to access and share web content in their native language, and communicate with anyone, anywhere, regardless of their language preferences.

Neural machine translation (NMT) has made significant strides, enabling translation between multiple languages and even zero-shot translation (translating between language pairs without direct examples). However, high-quality NMT typically requires large amounts of parallel bilingual data, which are not available for most of the world's 7,000+ languages. The resulting focus on high-resource languages creates digital inequities by neglecting low-resource languages.

To address this, the No Language Left Behind project introduces a massively multilingual model leveraging transfer learning across languages. Using the Sparsely Gated Mixture of Experts architecture and new mining techniques for low-resource languages, the model was trained on vast data sets. Various improvements were implemented to prevent overfitting while training on thousands of tasks.
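
For readers curious about what "sparsely gated" means, the idea can be sketched in a few lines: a gate scores every expert, only the top few are actually run, and their outputs are mixed by the renormalized gate weights. This is a generic top-k gating illustration, not NLLB-200's implementation; the function names are hypothetical, and the gate scores are passed in directly here, whereas a real model computes them from the input with a learned router.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gate(gate_logits, k=2):
    """Return {expert_index: weight} for the k highest-scoring experts.

    Weights are renormalized so the selected experts sum to 1.
    """
    probs = softmax(gate_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


def moe_forward(x, experts, gate_logits, k=2):
    """Mix the outputs of the selected experts; the rest are never called."""
    weights = top_k_gate(gate_logits, k)
    return sum(weight * experts[i](x) for i, weight in weights.items())
```

Because only k experts run per input, the model can hold many experts' worth of parameters while keeping the compute per token roughly constant, which is what makes the approach attractive when covering many languages.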

The model's performance was evaluated on more than 40,000 translation directions using specialized tools: the FLORES-200 automatic benchmark, the XSTS human evaluation metric, and a comprehensive toxicity detector. Measured by BLEU score, the model showed a 44% improvement in translation quality over previous state-of-the-art models.
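
As a rough illustration of what sits behind a BLEU comparison, the sketch below computes a simplified, sentence-level BLEU (n-gram overlap up to 4-grams, with a brevity penalty) and the relative gain between two scores. It rests on several assumptions: the 44% figure is treated as a relative improvement, tokenization is plain whitespace splitting, there is a single reference and no smoothing. Published BLEU scores are computed at corpus level with standardized tooling, so this is not how the NLLB team produced their numbers.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Simplified single-reference, unsmoothed sentence BLEU in [0, 1]."""
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, one empty order zeroes the score
        log_precisions.append(math.log(overlap / total))
    # Penalize candidates that are shorter than the reference.
    if len(cand) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


def relative_improvement(new_score, old_score):
    """Relative gain of new_score over old_score, e.g. 0.44 for +44%."""
    return (new_score - old_score) / old_score
```

Under the relative reading, a hypothetical baseline BLEU of 25.0 and a new score of 36.0 would give `relative_improvement(36.0, 25.0)` of 0.44, i.e. a 44% gain; those two scores are invented purely to show the arithmetic.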

By demonstrating how to scale NMT to 200 languages and making these resources freely available for non-commercial use, this work sets the stage for developing a universal translation system.


GenAI Use Case Comparison

Generative AI is an enabler of specific use cases for the IT function, and CIOs are tasked with weighing the specifics of their own IT organization before moving forward. Inform strategic conversations and guide investment decisions with Gartner's AI use-case comparison.


Signing Off

Why did the AI go to language school?
Because it couldn't find the right algorithm to "speak" human!

Keep an eye on our upcoming editions for in-depth discussions on specific AI trends, expert insights, and answers to your most pressing AI questions!

Stay connected for more updates and insights in the dynamic world of AI.

For any feedback or topics you'd like us to cover, feel free to contact me via LinkedIn.

DEEPakAI: AI Demystified. Demystifying AI, one newsletter at a time!

P.S. The newsletter includes smart, prompt-based, LLM-generated content. The views and opinions expressed in the newsletter are my own.
