Understanding how language model performance varies with scale is critical to benchmark and algorithm development. Scaling laws are one approach to building this understanding, but the requirement of training models across many different scales has limited their use. We propose an alternative, observational approach that bypasses model training and instead builds scaling laws from ~80 publicly available models. Building a single scaling law from multiple model families is challenging due to large variations in their training compute efficiencies and capabilities. However, we show that these variations are consistent with a simple, generalized scaling law where language model performance is a function of a low-dimensional capability space, and model families only vary in their efficiency in converting training compute to capabilities. Using this approach, we show the surprising predictability of complex scaling phenomena: we show that several emergent phenomena follow a smooth, sigmoidal behavior and are predictable from small models; we show that the agent performance of models such as GPT-4 can be precisely predicted from simpler non-agentic benchmarks; and we show how to predict the impact of post-training interventions like Chain-of-Thought and Self-Consistency as language model capabilities continue to improve. #LanguageModels #ScalingLaws #ModelEfficiency #PerformancePrediction #EmergentPhenomena
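Roughly, the approach can be sketched as follows (a toy illustration on synthetic data; the variable names, sizes, and numbers are mine, not the paper's): reduce many benchmark scores to a low-dimensional capability score, then fit a sigmoid from that score to a harder downstream metric.

```python
# Toy sketch of an observational scaling-law fit (synthetic data, not the paper's code):
# 1) reduce many benchmark scores to a low-dimensional "capability" score via PCA,
# 2) fit a sigmoid mapping that score to a downstream metric of interest.
import numpy as np
from sklearn.decomposition import PCA
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)

# Synthetic stand-in data: rows = models, columns = standard benchmark scores.
n_models, n_benchmarks = 80, 6
latent = rng.normal(size=n_models)                          # hidden "capability"
benchmarks = latent[:, None] + 0.3 * rng.normal(size=(n_models, n_benchmarks))

# Downstream metric (e.g., an agentic benchmark) as a noisy sigmoid of capability.
downstream = 1 / (1 + np.exp(-2.0 * (latent - 0.5))) + 0.05 * rng.normal(size=n_models)

# Step 1: low-dimensional capability space (here, just the first principal component).
capability = PCA(n_components=1).fit_transform(benchmarks).ravel()

# Step 2: sigmoidal fit from capability score to downstream performance.
def sigmoid(x, a, b, lo, hi):
    return lo + (hi - lo) / (1 + np.exp(-a * (x - b)))

params, _ = curve_fit(sigmoid, capability, downstream,
                      p0=[1.0, 0.0, 0.0, 1.0], maxfev=10000)
print("fitted sigmoid parameters:", params)
# Predict downstream performance for a hypothetical stronger model (capability = 2.0).
print("predicted downstream score:", sigmoid(2.0, *params))
```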
-
-
https://lnkd.in/ddZFCnst The paper is about the performance of language models on rare, special cases. It shows that, for current-generation LLMs, handling rare cases with zero-shot learning might require exponentially large datasets. Hence it might not be a good idea to keep building bigger and bigger models.
-
Chunking Done Right

Retrieval-Augmented Generation (RAG) combines the power of retrieval systems with large language models (LLMs) to deliver responses that are both relevant and context-aware. By leveraging external knowledge from databases or documents, LLMs generate smarter and more insightful outputs.

But here's the catch: 🛑 LLMs have context window limits. If your data isn't properly chunked, you risk missing key information or overwhelming the model with irrelevant details.

💡 Why is Chunking Crucial?
✅ Efficiency: Process only the data that matters, reducing computational overhead.
✅ Relevance: Retrieve accurate and contextually aligned outputs.
✅ Preservation of context: Ensure coherent and meaningful responses.

🔑 The goal? Not to chunk for the sake of it, but to organize your data effectively so it's valuable and retrievable when needed. With the right chunking strategy, your RAG system becomes a powerhouse of precision and performance. Let us look at the chunking strategies in later posts. #learnai #RAG #chunking
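As a taste before those posts, here is a minimal sketch of one common strategy, fixed-size chunking with word overlap (the function name and sizes are illustrative, not from this post):

```python
# Illustrative fixed-size chunking with overlap for RAG ingestion (not from this post).
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into word-based chunks of up to `chunk_size` words, overlapping by
    `overlap` words so context spanning a boundary is preserved in both chunks."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunk = words[start:start + chunk_size]
        if chunk:
            chunks.append(" ".join(chunk))
        if start + chunk_size >= len(words):
            break
    return chunks

# Example: chunk a long document before embedding and indexing it.
doc = "word " * 1200
print(len(chunk_text(doc)))  # -> 3 chunks of up to 500 words, with 50-word overlap
```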
-
Earlier this year, tech companies were racing to develop large language models with billions of parameters, trained on trillions of tokens. Before GPT-4o, Gemini Flash marked a shift away from this trend. Andrej Karpathy's tweet explains this shift towards creating smaller, more efficient models trained on highly curated datasets. (https://lnkd.in/gEqZ7Ziy) Language models derive their ability to "think" from the datasets they're trained on. Smaller models can be seen as compressed versions of large models, retaining essential 'reasoning capabilities' without the need for vast amounts of information. This makes small models ideal for applications where cost and latency are critical, such as Retrieval-Augmented Generation (RAG) and agentic architectures. GPT-4o, for example, competes with Gemini Flash, which costs one-fifth of Gemini 1.5 Pro yet performs equally well in language processing, text-to-SQL tasks, and multimodal functions. The next step might be models like Microsoft's Phi, designed for edge-level deployment.
Andrej Karpathy (@karpathy) on X
twitter.com
-
Don't expect an LLM to navigate your computer and do 'everyday tasks' just yet. This paper gives us a way to know when we should start worrying. For now, we humans are still much better at everyday tasks than language models (but for how long?) #llm #vlm
Musing 21: OSWORLD: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
aiscientist.substack.com
-
Retrieval-augmented generation (RAG) lets large language models (LLMs) efficiently retrieve and generate relevant information. However, the quality of the output depends heavily on the accuracy and relevance of the underlying data: RAG-related errors can hinder the effectiveness of your model and lead to inaccurate or generic responses. To keep your model's responses grounded in real-world data, it's essential to identify and address these errors. Learn more about them and how to fix them in the slides below. #RAG #LLM
-
Large Language Models (LLMs) are impressive, but how do you get them to answer your questions perfectly? Here's how, through two popular methods: Prompt Engineering and Fine-Tuning. #promptengineering #finetuning #LLMs #machinehack
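To make the contrast concrete, here is a rough, generic illustration (the examples and record layout are mine, not from the attached post): prompt engineering shapes behaviour at inference time, while fine-tuning bakes it in through training examples.

```python
# Illustrative only: the same task approached two ways (generic structures, no specific vendor API).

# 1) Prompt engineering: steer behaviour at inference time with instructions and few-shot examples.
prompt = (
    "You are a support assistant. Answer in one short sentence.\n"
    "Q: How do I reset my password?\nA: Use the 'Forgot password' link on the login page.\n"
    "Q: How do I change my email address?\nA:"
)

# 2) Fine-tuning: steer behaviour at training time with labelled examples
#    (one record in a common chat-style layout; the exact schema varies by provider).
finetune_record = {
    "messages": [
        {"role": "user", "content": "How do I change my email address?"},
        {"role": "assistant", "content": "Open Settings > Account and edit the email field."},
    ]
}
```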
-
For those who are interested in how LLMs are developed, have a read below. For everyone else, take a high-level look at how the process works. Flow diagram included :) Read “Developing Large Language Models (LLMs): A Step-by-Step Guide from Concept to Deployment“ by Wasim Rajput on Medium: https://lnkd.in/gW9iVwfy
Developing Large Language Models (LLMs): A Step-by-Step Guide from Concept to Deployment
medium.com
-
Can LLMs Handle Unlimited Context? Google researchers introduced a new concept called Infini-attention in their latest paper, enabling LLMs to process inputs of any length. Typical transformers reset their attention memory after each context window to manage new data, losing previous context. For example, in a 500K token document split into 100K token windows, each segment starts fresh without memory from the others. Infini-attention, instead, retains and compresses the attention memory from all previous segments. This means in the same 500K document, each 100K window maintains access to the full document's context. The model compresses and reuses key-value states across all segments, allowing it to pull relevant information from any part of the document. #ai #ml #llm #google #infini #attention
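As a rough, simplified sketch of that compressive-memory idea (toy NumPy code written from the summary above, not the paper's implementation):

```python
# Toy sketch of a compressive memory over segments (simplified; not the paper's code).
import numpy as np

d = 64                                   # key/value dimension
memory = np.zeros((d, d))                # compressed memory of all previous segments
norm = np.zeros((d, 1))                  # running normalization term

def feature_map(x):
    # Positive feature map in the style of linear attention (ELU + 1).
    return np.where(x > 0, x + 1.0, np.exp(x))

def process_segment(K, V):
    """Fold one segment's keys/values into the running compressed memory."""
    global memory, norm
    sigma_K = feature_map(K)                          # (seq, d)
    memory = memory + sigma_K.T @ V                   # accumulate key-value associations
    norm = norm + sigma_K.sum(axis=0, keepdims=True).T

def retrieve(Q):
    """Read from memory: earlier segments stay accessible after their window has passed."""
    sigma_Q = feature_map(Q)                          # (seq, d)
    return (sigma_Q @ memory) / (sigma_Q @ norm + 1e-6)

# Stream a long document segment by segment; each query can still draw on all prior context.
for _ in range(5):                                    # e.g., 5 segments of a 500K-token document
    K, V, Q = (np.random.randn(128, d) for _ in range(3))
    process_segment(K, V)
    out = retrieve(Q)                                 # (128, d) readout conditioned on full history
```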
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
arxiv.org
-
Explore #AgentGen, which uses LLMs to synthesize diverse environments and planning tasks in a scalable way. The environments are synthesized using a corpus drawn from a variety of domain-specific texts. The planning tasks are subsequently generated (easy to difficult) and conditioned on the synthesized environments. The novelty seems to be in how the planning tasks are generated: a bidirectional evolution method that effectively automates and simplifies the process. Previously, the trajectories used to tune models were generated from manually designed planning tasks. The synthesized data is then used to instruction-tune an LLM, which enhances the planning abilities of the LLM-based agent. Results show that AgentGen improves an LLM's planning ability, e.g., an instruction-tuned Llama-3 8B surpasses GPT-3.5 in overall performance. It even outperforms GPT-4 in certain tasks. Paper: https://lnkd.in/gkgWMZsg
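As I read this summary, the data pipeline looks roughly like the sketch below (all function names and prompts are hypothetical placeholders, not the paper's code):

```python
# Hypothetical outline of an AgentGen-style data pipeline, inferred from the summary above.
def llm(prompt: str) -> str:
    """Placeholder for any text-generation call; swap in a real model client here."""
    return f"<model output for: {prompt[:40]}...>"

def synthesize_environment(domain_text: str) -> str:
    return llm(f"Design a planning environment inspired by this text:\n{domain_text}")

def evolve_task(task: str) -> list[str]:
    # "Bidirectional evolution": derive both an easier and a harder variant of a seed task.
    easier = llm(f"Simplify this planning task:\n{task}")
    harder = llm(f"Add constraints to make this planning task harder:\n{task}")
    return [easier, task, harder]

def build_instruction_data(domain_texts: list[str], seed_tasks: list[str]) -> list[dict]:
    data = []
    for text in domain_texts:
        env = synthesize_environment(text)
        for seed in seed_tasks:
            for task in evolve_task(seed):
                data.append({"environment": env, "task": task})
    return data  # trajectories collected on these tasks would then instruction-tune the agent LLM
```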