#167 Llama Just Raised The Bar!

<< Previous Edition: Flexing Bricks as Open Weights

Yesterday, when I wrote about Llama serving as the lowest common denominator among open LLMs, I wondered whether Meta might raise the bar. Today, Meta has done exactly that with the unveiling of Llama 3.

To add some color, let's revisit DBRX, the subject of our recent discussion. Below, I've outlined how DBRX surpassed Llama 2 but now falls short of Llama 3.

Comparison between DBRX, Llama 2, and Llama 3.

Enhanced Capabilities with Llama 3

Llama 3 is available in two sizes, 8B and 70B parameters, each in both pre-trained and instruction-fine-tuned versions. As Meta's announcement puts it:

The text-based models we are releasing today are the first in the Llama 3 collection of models. Our goal in the near future is to make Llama 3 multilingual and multimodal, have longer context, and continue to improve overall performance across core LLM capabilities such as reasoning and coding.

Moreover, Meta is currently developing a 400+ billion parameter model set to launch soon, purportedly poised to rival advanced closed LLMs such as GPT-4.

Model Architecture

Llama 3 is a decoder-only model, which simplifies its architecture by focusing on generating output conditioned on the input it receives; the Llama 2 family used the same decoder-only design. For clarity, there are three main types of architectures in language models:

1. Decoder Only: These models generate text based on the context they receive. They are optimized for tasks like text completion and generation, where the focus is on producing coherent and contextually relevant output. (This is essentially what we understand LLMs to do anyway).

2. Encoder Only: These models are primarily used for tasks that involve understanding input, such as text classification or sentiment analysis, where the model assesses and processes input without the need to generate new text. (This sounds so much like classic AI).

3. Encoder-Decoder: This architecture combines both encoding and decoding capabilities, enabling the model to understand input (encode) and generate output (decode). It's versatile for tasks like translation or summarization, where both understanding and generating text are necessary. (This sounds like combining RAG with LLMs as we know them).

In the current landscape, many leading large language models (LLMs) are indeed decoder-only, emphasizing their role in generating extensive, coherent text from prompts. These models do not consume raw text directly: the prompt is first tokenized, each token is mapped to a learned embedding inside the model, and the model then autoregressively predicts the next token based on everything it has seen so far.
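To make the decoder-only flow concrete, here is a minimal sketch using the Hugging Face transformers library. The model ID is Meta's published Llama 3 8B Instruct checkpoint (a gated repository, so it requires accepting Meta's license on Hugging Face); any decoder-only checkpoint, such as "gpt2", works the same way.

```python
# Minimal decoder-only generation sketch with Hugging Face transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # gated; "gpt2" also works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# The tokenizer maps text to token IDs; the embedding lookup happens
# inside the model. Generation is autoregressive: each new token is
# predicted from everything generated so far.
inputs = tokenizer("The three main LLM architectures are", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```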

Pre-Training

Llama 3 was trained on over 15 trillion tokens of content gleaned from publicly available sources. As readers may know, in pre-training, more data does not automatically equate to more intelligence. To address this, Meta employed a series of data-filtering pipelines, including heuristic filters, NSFW filters, semantic deduplication approaches, and text classifiers that predict data quality. In plain terms, Llama sniffs its food carefully before ingesting it.
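As a rough illustration, the toy pipeline below mirrors the shape of such filtering: a couple of heuristic checks plus exact-hash deduplication. Meta has not published its actual rules, so every threshold and function name here is an assumption; a real semantic-deduplication step would compare embeddings rather than hashes.

```python
# Toy pre-training data filter (illustrative; thresholds are assumptions).
import hashlib

seen_hashes = set()

def passes_heuristics(doc: str) -> bool:
    words = doc.split()
    if len(words) < 50:                      # too short to be useful
        return False
    if len(set(words)) / len(words) < 0.3:   # highly repetitive text
        return False
    return True

def is_duplicate(doc: str) -> bool:
    # Exact dedup via hashing; semantic dedup would compare embeddings.
    h = hashlib.sha256(doc.encode()).hexdigest()
    if h in seen_hashes:
        return True
    seen_hashes.add(h)
    return False

def keep(doc: str) -> bool:
    # NSFW filters and a learned quality classifier would slot in here.
    return passes_heuristics(doc) and not is_duplicate(doc)

corpus = ["some raw document ...", "another raw document ..."]
clean = [d for d in corpus if keep(d)]
```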

Another interesting point Meta mentioned is that model performance continued to improve log-linearly even beyond 15 trillion tokens (one might wonder what would happen if the token count exceeded the US GDP in dollars). Of course, larger models are less efficient at inference, so multiple model sizes are needed to meet different deployment needs.

Three Parallel Tracks

To maximize efficiency when training its largest models, Meta combined three parallelization tracks (a sketch of the first follows below):

  1. Data parallelization
  2. Model parallelization
  3. Pipeline parallelization

In addition, to maximize uptime, Meta built training infrastructure that automates error detection, handling, and maintenance.
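As a minimal sketch of the first track, here is data parallelism via PyTorch's DistributedDataParallel: each rank holds a full model replica and processes a different shard of the data, with gradients averaged across ranks. Model and pipeline parallelism, by contrast, split the model itself across devices. The loop below is illustrative only, not Meta's actual training stack.

```python
# Data parallelism sketch (illustrative, not Meta's stack).
# Launch with: torchrun --nproc_per_node=2 ddp_sketch.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")           # use "gloo" if running on CPU
    rank = dist.get_rank()                    # single node: rank == local rank
    device = torch.device(f"cuda:{rank}")
    model = torch.nn.Linear(4096, 4096).to(device)  # stand-in for a transformer
    ddp_model = DDP(model, device_ids=[rank])
    opt = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)

    x = torch.randn(8, 4096, device=device)  # each rank sees its own data shard
    loss = ddp_model(x).pow(2).mean()
    loss.backward()                           # DDP all-reduces gradients here
    opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```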

Instruction Fine-Tuning

Instruction fine-tuning, as outlined by Meta, represents a sophisticated phase in model training where traditional methods like Supervised Fine-Tuning (SFT) are augmented with reinforcement strategies such as Rejection Sampling, Proximal Policy Optimization (PPO), and Direct Preference Optimization (DPO). Essentially, this blended approach is poised to supersede the plain combination of SFT and the infamous Reinforcement Learning with Human Feedback (RLHF), aiming to refine the model's adeptness at following detailed user instructions.
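Of these, DPO is simple enough to show in a few lines. Below is a sketch of the DPO loss itself (function names and numbers are illustrative, not Meta's code): it nudges the policy toward the human-preferred response relative to a frozen reference model, without training a separate reward model.

```python
# DPO loss sketch. Inputs are summed token log-probabilities of the
# chosen/rejected responses under the policy being tuned and under a
# frozen reference model; all values below are made up for illustration.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # How much more the policy prefers "chosen" over "rejected",
    # relative to the reference model.
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    # Maximizing log-sigmoid of the scaled margin pushes the policy
    # toward the preferred response without a separate reward model.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy usage for a batch of two preference pairs:
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -9.0]),
                torch.tensor([-12.5, -9.8]), torch.tensor([-13.5, -9.2]))
print(loss)
```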

Conclusion

One key takeaway about benchmarks: they should be regarded seriously, not literally. Vendors naturally showcase their models in the best possible light, which explains the variance in metrics like the MMLU 5-shot across different sources. Ultimately, it's the overarching trend that counts, and here, Meta has significantly reshaped the landscape by setting an exceptionally high standard.

Comments
Dipanshu Mansingka (Principal Consultant / NITI's AIM/ATL Mentor):

To meet the needs of certain solutions, will there be a need for an engine that decides which model to route a request to? For example, when a person starts a dialogue, the system first tries to identify the intent based on the knowledge it has; once the goal is identified, it calls another model for that goal to get the result, and the cycle repeats.
Pete Grett (GEN AI Evangelist | #TechSherpa | #LiftOthersUp):

Meta is definitely making big waves in the world of open LLMs. Exciting times ahead for AI development. #AIevolution Rishi Yadav

400B 🤯 great to see Meta building SOTA open LLMs to close the gap on closed models!
