💪 Why Size Doesn't Matter to Your LLMs

💪 Why Size Doesn't Matter to Your LLMs

In this issue:

  1. David vs. Goliath in terms of training data
  2. Researchers in attack mode
  3. Giving Zero-shot Relation Extraction another try


S. AI needs high-quality synthetic data

Watching: MOSTLY AI (demo)

What problem does it solve? AI models are only as good as the data they’re trained on. Large enterprises need to use customer data to train their AI systems in a compliant way but traditional data anonymization techniques destroy data utility, lowering the resulting AI models' accuracy and business value. 

How does it solve the problem? Synthetic data generators can unlock even the most sensitive data assets for AI development to improve data quality via upsampling, bias correction, and imputation. One synthetic data generator, MOSTLY AI, improved ML performance by up to 6% when used for upsampling minority records in an unbalanced dataset, outperforming well-known upsampling methods, such as SMOTE-NC.

What’s next? Synthetic data will no doubt become the building block of AI systems. Upcoming regulations, like the EU's AI Act, might incorporate synthetic data into their auditing requirements. Synthetic data quality should be closely monitored for accuracy and privacy. The market-leading synthetic data generator, MOSTLY AI, offers automated privacy and the highest accuracy synthetic data available today. Get a custom demo of their platform.


1. Scalable Extraction of Training Data from (Production) Language Models

Watching: Extractable Memorization (paper)

What problem does it solve? Extractable memorization in the context of machine learning refers to the ability of adversaries to retrieve portions of training data from a model by querying it. The paper tackles the pressing issue of how open-source, semi-open, and closed models, including popular language models, are susceptible to attack by adversaries who can then extract substantial amounts of sensitive training data without any previous knowledge of it. This vulnerability not only poses privacy concerns but also questions the effectiveness of current alignment techniques in preventing such data leakage.

How does it solve the problem? To address the problem of extractable memorization, the study demonstrates that standard adversarial techniques are adequate to exploit the vulnerability of unaligned models. However, for more sophisticated aligned models like ChatGPT, the researchers developed an advanced ‘divergence attack’, which intentionally prompts the model to deviate from its usual outputs and reveal training data at much higher rates. This innovative attack method marks a significant advancement in understanding model vulnerabilities and showcases how attackers can force even aligned models to compromise their data integrity.

What’s next? The findings from this paper underscore a critical need for the development of new and improved alignment techniques and security measures. Since current methods seem to fall short in safeguarding against memorization exploits, future research will need to focus on creating robust mechanisms to protect models against such attacks. This could entail designing better training protocols, exploring differential privacy in trainings, or inventing novel defenses. As the stakes of model security and user privacy heighten, this research direction will become increasingly vital in the field of machine learning and NLP.


2. MoDS: Model-oriented Data Selection for Instruction Tuning

Watching: MoDS (paper/code)

What problem does it solve? Selecting high-quality and effective instructional data is critical when fine-tuning LLMs such as GPT-3, as using massive amounts of data can be inefficient and unnecessary. The current challenge is identifying which instructional examples are most useful to achieve a high level of performance no matter the LLM's size. Essentially, researchers are trying to optimize the instruction-tuning process to be more cost-effective and time-efficient by narrowing down the required dataset without sacrificing — and potentially even enhancing — the LLM's ability to follow instructions.

How does it solve the problem? The proposed MoDS approach hinges on discerning the quality, coverage, and necessity of instructional data. By evaluating the quality of available instructions with a specialized model, it filters out the most reliable examples. With an algorithmic approach, it then selects a subset that guarantees broad coverage of various scenarios. Finally, by identifying the instructions that the LLM performs poorly on, these are earmarked as most necessary for further improving the model. Through this meticulous and recursive selection process, MoDS distills a compact yet highly effective instruction dataset for fine-tuning.

Key results:

  • Performance Improvement with Fewer Instructions: The paper demonstrates that their MoDS approach, which uses only 1,000 selected instruction data, outperforms a model trained on the entire Alpaca dataset containing ~52,000 instructions
  • Comparison with Self-guided Selection: When compared with the self-guided approach proposed by Li et al. (2023a), MoDS shows comparable performance using only half the data (2000 vs. 1000)
  • Effectiveness of K-center Greedy Algorithm: The ablation study reveals that models fine-tuned with instruction data selected by the K-center greedy algorithm significantly surpass models fine-tuned with instruction data selected via a random sampling approach


3. Revisiting Large Language Models as Zero-shot Relation Extractors

Watching: SumAsk (paper)

What problem does it solve? The task of relation extraction (RE) in natural language processing typically requires labeled data. Studies have highlighted that large language models (LLMs) like ChatGPT can adeptly transition to new tasks using only a natural language prompt, without the need for additional training data. This work notably explores the idea of using LLMs for zero-shot relation extraction, aiming to address the limitations of existing RE approaches and the potential for drawing connections from text without conventional data and parameter tuning efforts.

How does it solve the problem? To enhance zero-shot RE, the study introduces a novel prompting approach named SumAsk, which employs LLMs to iteratively transform RE inputs into a question-answering format. This method seems to leverage the natural language understanding capabilities of LLMs, incorporating sophisticated prompt techniques such as the chain-of-thought (CoT), which guides models through a logical reasoning process. The SumAsk prompting aids the LLMs in better interpreting the task, allowing for improved performance in extracting relations pureply via prompting without the need for extra data.

Key results:

  • Relation Classification Improvements: The proposed SumAsk prompting approach improved performance in zero-shot relation classification tasks. For instance, on the TACRED dataset, SumAsk achieved a micro-F1 score of 79.6%, outperforming the second best zero-shot method, NLI-DeBERTa, which scored a micro-F1 of 73.9%
  • Robustness to Number of Unseen Relations: SumAsk demonstrated less performance degradation as the number of unseen relations increased than comparable methods
  • Overlapping Relation Extraction: For the task of extracting overlapping relations from the NYT dataset, SumAsk prompted models realized a high F1-score of 65.1% across sentences with various overlapping patterns. In comparison, a baseline method like T0 scored a lower F1-score of 46.8%


Papers of the Week:


Thank you to MostlyAI for sponsoring this issue of LLM Watch on LinkedIn. Sponsorships enable me to keep this newsletter 100% free for the readers.

Jim Ford

Head of Community @ Cyrannus | Investments | VC | Web3

1y
Like
Reply
Mahadev Hummanagol

Data Scientist | Generative AI,Hugging Face,NLP,Deep learning,Time series analysis,AWS sagemaker,Bedrock,lamda,Anomely detection,DataEngineering,DataAnalysis,Webapps | Pytorch,Databricks,Sklearn,Azure,Python,Git,Mysql

1y

Good source

To view or add a comment, sign in

More articles by Pascal Biese

  • 🧑🔬 AI Cutting Research Costs by 84%

    🧑🔬 AI Cutting Research Costs by 84%

    In this issue: AI helping researchers to be more efficient LLMs being unreliable when reasoning about time Evaluating…

    2 Comments
  • 🤗 AI Agents: Quick & Easy

    🤗 AI Agents: Quick & Easy

    In this issue: AI agents in a few lines of code An introduction to Graph Neural Networks LLMs for complex medical…

    5 Comments
  • 🎁 Meta Reveals New AI Architecture

    🎁 Meta Reveals New AI Architecture

    In this issue: How Meta wants to take LLMs to the next level A smaller, more transparent o1 alternative Graph agents…

    2 Comments
  • 🌱 Another ChatGPT Moment

    🌱 Another ChatGPT Moment

    In this issue: Simulation’s ChatGPT moment The new era of test-time compute A company with no humans Upgrade now 1…

    4 Comments
  • 🗣️ Microsoft's Best Small Language Model

    🗣️ Microsoft's Best Small Language Model

    In this issue: Microsoft’s best small language model Graph Networks learning without a lot of labels A new go-to…

    4 Comments
  • 🧪The First Fully AI-Designed Drug... Almost

    🧪The First Fully AI-Designed Drug... Almost

    In this issue: AI agents designed a new antibody against SARS-CoV-2 Semantic backpropagation for AI agents Amazon…

    2 Comments
  • 🥇 GraphRAG's Biggest Problem Solved

    🥇 GraphRAG's Biggest Problem Solved

    In this issue: A new standard for GraphRAG Replicating OpenAI’s strongest model LLM-”brained” agents for your devices…

    12 Comments
  • 🍓 Actually Open AI: A Free o1 Alternative

    🍓 Actually Open AI: A Free o1 Alternative

    In this issue: An open o1-like model The LLM Engineer Handbook NVIDIA mixing attention with state spaces Upgrade now 1.…

    4 Comments
  • 🤖 The Future of Designing AI Agents

    🤖 The Future of Designing AI Agents

    In this issue: Towards efficient graph foundation models From LLMOps to AgentOps A Text-to-SQL dataset that breaks LLMs…

    3 Comments
  • 💻 HTML > Plain Text for RAG

    💻 HTML > Plain Text for RAG

    In this issue: A foundation model for time series forecasting More intelligent tool usage for scientific LLMs HTML >…

    7 Comments

Insights from the community

Others also viewed

Explore topics