GenAI Weekly — Edition 8
Your Weekly Dose of Gen AI: News, Trends, and Breakthroughs
Stay at the forefront of the Gen AI revolution with Gen AI Weekly! Each week, we curate the most noteworthy news, insights, and breakthroughs in the field, equipping you with the knowledge you need to stay ahead of the curve.
Allen Institute for AI releases OLMo: A truly open LLM
The Allen Institute for AI (AI2) has released OLMo 7B, a truly open, state-of-the-art large language model, alongside its pre-training data and training code. This empowers researchers and developers to use the best open models to advance the science of language models collectively.
“Open foundation models have been critical in driving a burst of innovation and development around generative AI,” said Yann LeCun, Chief AI Scientist at Meta. “The vibrant community that comes from open source is the fastest and most effective way to build the future of AI.”
OLMo and its framework are designed to aid researchers in training and experimenting with large language models. Both are available for direct download on Hugging Face and on GitHub.
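If you want to kick the tires, the weights load like any other Hugging Face model. Here is a minimal sketch, assuming the allenai/OLMo-7B-hf checkpoint and a recent transformers release (the model id and generation settings are illustrative, not taken from the announcement):

```python
# Minimal sketch: load OLMo 7B from Hugging Face and generate a completion.
# The "allenai/OLMo-7B-hf" checkpoint name is an assumption; adjust it to
# whichever OLMo variant you want to try.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMo-7B-hf"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Language models are"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```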
As we’ve previously discussed here, “open” can mean many things when organizations describe their models, but this is as open as open gets:
“With OLMo, open actually means ‘open’ and everyone in the AI research community will have access to all aspects of model creation, including training code, evaluation methods, data, and so on,” said Noah Smith, OLMo project lead, a senior director of NLP Research at AI2, and a professor in the UW’s Allen School. “AI was once an open field centered on an active research community, but as models grew, became more expensive, and started turning into commercial products, AI work started to happen behind closed doors. With OLMo we hope to work against this trend and empower the research community to come together to better understand and engage with language models in a scientific way, leading to more responsible AI technology that benefits everyone.”
Intel launches Gaudi 3 AI accelerator chip
The Intel Gaudi 3 accelerator will deliver significant performance improvements for training and inference on leading GenAI models. Specifically, Intel projects that, on average, Gaudi 3 will outperform the Nvidia H100 on both training and inference workloads.
We’ve discussed Nvidia’s moat before in this newsletter. I guess that we’ll discuss it more and more—especially if it goes away—albeit very slowly.
The lifecycle of a code AI completion
Generative AI, whether for code, text, images, or other use cases, appears as a magic black box to many users. Users typically navigate to a website, install an app, or set up an extension and start seeing the results of the AI tool. But have you ever wondered what goes into this magic black box or how it really works?
In this post, we want to demystify what goes into a code AI completion for Cody, our code AI assistant that knows your entire codebase. Leveraging a Large Language Model (LLM) to generate a code AI response is fairly trivial, but doing so in a production-grade application that serves many different use cases, coding languages, workflows, and other variables while achieving a high level of completion acceptance and developer happiness is a whole other thing. We’ll cover the importance of the underlying LLM but also expand the implementation to a fully featured AI engineering system that features various pre- and post-processing steps, discuss the role of context and how to retrieve it, and more as we explore the lifecycle of a code AI completion. Let’s dive in!
A fantastic and detailed dive into the challenges of building real-world applications with the current LLM stack and how to potentially overcome them.
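To make the moving parts a bit more concrete, here is a rough sketch of the shape such a pipeline tends to take: retrieve context, assemble a prompt, call the model, and post-process the raw output. The names and structure below are our own illustration, not Cody’s actual implementation:

```python
# Illustrative sketch of a code-completion pipeline: retrieve context,
# build a prompt, call an LLM, then clean up the raw output.
# This does not mirror Cody's real internals; all names are hypothetical.
from dataclasses import dataclass

@dataclass
class CompletionRequest:
    file_path: str
    prefix: str  # code before the cursor
    suffix: str  # code after the cursor

def retrieve_context(request: CompletionRequest) -> list[str]:
    """Pre-processing: gather related snippets (open files, imports,
    nearby symbols) to ground the completion in the codebase."""
    # A real system might use embeddings, keyword search, or a code graph.
    return [f"// context from files related to {request.file_path}"]

def build_prompt(request: CompletionRequest, context: list[str]) -> str:
    """Assemble retrieved context plus the code around the cursor."""
    return "\n".join(context) + "\n" + request.prefix

def call_llm(prompt: str) -> str:
    """Stand-in for the actual model call (local or hosted LLM)."""
    return "return a + b  # model output placeholder"

def post_process(raw: str, request: CompletionRequest) -> str:
    """Post-processing: trim whitespace and, in a real system, drop
    half-finished lines or completions that duplicate the suffix."""
    return raw.strip()

def complete(request: CompletionRequest) -> str:
    context = retrieve_context(request)
    prompt = build_prompt(request, context)
    raw = call_llm(prompt)
    return post_process(raw, request)

if __name__ == "__main__":
    print(complete(CompletionRequest("math.py", "def add(a, b):\n    ", "")))
```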
Groq CEO: ‘We No Longer Sell Hardware’
Groq CEO Jonathan Ross is adamant his company no longer sells hardware—the data center AI chip startup is now an AI cloud services provider.
“Long term, we always wanted to go there, but the realization was, you cannot sell chips as a startup, it’s just too hard,” Ross told EE Times in a recent in-person interview. “The reason is the minimum quantity of purchase for it to make sense is high, the expense is high, and no-one wants to take the risk of buying a whole bunch of hardware—it doesn’t matter how amazing it is.”
Groq’s customer is now the AI developer. Following a number of viral social media posts showcasing the latency of its rack-scale AI inference systems, the company currently has 70,000 developers registered for its real-time large language model (LLM) inference cloud service, GroqCloud, with 19,000 new applications running.
“You get the sort of developer traction we’ve gotten, and people want to buy hardware, but we are no longer selling hardware, because why would we at this point?” Ross said. “It’s not a pivot—we always intended to have a cloud service, we just expected we would do both.”
Hardware is hard.
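For developers who want to see what “selling to the AI developer” looks like in practice, GroqCloud exposes an OpenAI-compatible HTTP API, so a request is an ordinary chat-completions call. A minimal sketch; the model name is an assumption and may differ from what GroqCloud currently serves:

```python
# Minimal sketch of a GroqCloud request via its OpenAI-compatible endpoint.
# The model name ("mixtral-8x7b-32768") is an assumption; check GroqCloud's
# model list for what is currently available.
import os
import requests

resp = requests.post(
    "https://api.groq.com/openai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['GROQ_API_KEY']}"},
    json={
        "model": "mixtral-8x7b-32768",
        "messages": [
            {"role": "user", "content": "Why does low latency matter for LLM apps?"}
        ],
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```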
How faithful is the output of various LLMs in book-length summarization?
While long-context large language models (LLMs) can technically summarize book-length documents (>100K tokens), the length and complexity of the documents have so far prohibited evaluations of input-dependent aspects like faithfulness. In this paper, we conduct the first large-scale human evaluation of faithfulness and content selection on LLM-generated summaries of fictional books. Our study mitigates the issue of data contamination by focusing on summaries of books published in 2023 or 2024, and we hire annotators who have fully read each book prior to the annotation task to minimize cost and cognitive burden. We collect FABLES, a dataset of annotations on 3,158 claims made in LLM-generated summaries of 26 books, at a cost of $5.2K USD, which allows us to rank LLM summarizers based on faithfulness: Claude-3-Opus significantly outperforms all closed-source LLMs, while the open-source Mixtral is on par with GPT-3.5-Turbo. An analysis of the annotations reveals that most unfaithful claims relate to events and character states, and they generally require indirect reasoning over the narrative to invalidate. While LLM-based auto-raters have proven reliable for factuality and coherence in other settings, we implement several LLM raters of faithfulness and find that none correlates strongly with human annotations, especially with regard to detecting unfaithful claims. Our experiments suggest that detecting unfaithful claims is an important future direction not only for summarization evaluation but also as a testbed for long-context understanding. Finally, we move beyond faithfulness by exploring content selection errors in book-length summarization: we develop a typology of omission errors related to crucial narrative elements and also identify a systematic over-emphasis on events occurring towards the end of the book.
LLMs are just like people—they differ in their ability to both “understand” and “speak”.
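For a sense of what an LLM-based faithfulness rater involves, here is a bare-bones sketch of the core loop: pair each claim extracted from a summary with source text and ask a judge model for a verdict. This is our own illustration, not the paper’s actual prompts or protocol:

```python
# Illustrative sketch of an LLM-based faithfulness rater: for each claim
# extracted from a summary, ask a judge model whether the source text
# supports it. This is not FABLES' actual setup or prompt wording.

JUDGE_PROMPT = """You are verifying a summary of a book.
Source excerpt:
{source}

Claim from the summary:
{claim}

Answer with exactly one word, "faithful" or "unfaithful"."""

def rate_claim(claim: str, source: str, llm_call) -> bool:
    """Return True if the judge model deems the claim faithful.
    `llm_call` is any function mapping a prompt string to a model reply."""
    reply = llm_call(JUDGE_PROMPT.format(source=source, claim=claim))
    return reply.strip().lower().startswith("faithful")

def rank_summarizers(claims_by_model: dict[str, list[tuple[str, str]]], llm_call):
    """Score each summarizer by the fraction of its claims judged faithful."""
    return {
        model: sum(rate_claim(c, s, llm_call) for c, s in pairs) / len(pairs)
        for model, pairs in claims_by_model.items()
    }
```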