Breaking down Gemma, Google’s new open-source AI model
Welcome to AI Decoded, Fast Company’s weekly LinkedIn newsletter that breaks down the most important news in the world of AI. I’m Mark Sullivan, a senior writer at Fast Company, covering emerging tech, AI, and tech policy.
This week, I’m focusing on why Google is releasing its new Gemma models as open-source. I also look at the implications and limitations of OpenAI’s new Sora video generator, as well as how Google Gemini’s huge context window might be used in practice.
Sign up to receive this newsletter every week via email here or through LinkedIn here. And if you have comments on this issue and/or ideas for future ones, drop me a line at sullivan@fastcompany.com, and follow me on X (formerly Twitter) @thesullivan.
Google revives its open-source game with Gemma models
Google announced today a set of new large language models, collectively called “Gemma,” and a return to the practice of releasing new research into the open-source ecosystem. The new models were developed by Google DeepMind and other teams within the company that already brought us the state-of-the-art Gemini models.
The Gemma models come in two sizes: one built on a neural network with 2 billion adjustable variables (called parameters) and one with 7 billion parameters. Both are significantly smaller than the largest Gemini model, “Ultra,” which is said to be well beyond a trillion parameters, and more in line with the 1.8B- and 3.25B-parameter Gemini Nano models. While Gemini Ultra is capable of handling large or nuanced requests, it requires data centers full of expensive servers.
The Gemma models, meanwhile, are small enough to run on a laptop or desktop workstation. Or they can run in the Google cloud, for a price. (Google says its researchers optimized the Gemma models to run on Nvidia GPUs and Google Cloud TPUs.)
The Gemma models will be released to developers on Hugging Face, accompanied by the model weights that resulted from pretraining. Google will also include the inference code and the code for fine-tuning the models. It is not supplying the data or code used during pretraining. Both Gemma sizes are released in two variants—one that’s been pretrained and the other that’s already been fine-tuned with pairs of questions and corresponding answers.
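In practice, that means developers should be able to pull a Gemma checkpoint down from Hugging Face and run it locally much like any other open model. Here’s a minimal sketch of what that could look like with the Hugging Face transformers library; the model ID and generation settings are my assumptions for illustration, not details from Google’s announcement.

```python
# A minimal sketch of loading a Gemma checkpoint from Hugging Face with the
# transformers library. The model ID and generation settings below are
# assumptions for illustration, not taken from Google's announcement.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2b"  # assumed Hugging Face model ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Run a short generation on the CPU or a single GPU, the kind of
# laptop-scale inference the smaller Gemma size is pitched at.
inputs = tokenizer("Explain what a context window is.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```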
But why is Google releasing open models in a climate where state-of-the-art LLMs are hidden away as proprietary? In short, Google is acknowledging that a great many developers, large and small, don’t just build their apps atop a third-party LLM (such as Google’s Gemini or OpenAI’s GPT-4) that they access via a paid API, but also use free and open-source models at certain times and for certain tasks.
The company would rather see non-API developers build with a Google model than move their app to Meta’s Llama or some other open-source model. Such a developer would remain in Google’s ecosystem and might be more likely to host their models on Google Cloud, for example. For the same reasons, Google built Gemma to work on a variety of common development platforms.
There’s of course a risk that bad actors will use open-source generative AI models to do harm. Google DeepMind director Tris Warkentin said during a call with media on Tuesday that Google researchers tried to simulate all the nasty ways bad actors might try to use Gemma, then used extensive fine-tuning and reinforcement learning to keep the model from doing those things.
OpenAI’s Sora video generator still has a way to go
Remember that scene in The Fly when the scientist Seth (played by Jeff Goldblum) tries to teleport a piece of steak from one pod to another but fails? “It tastes synthetic,” says science journalist Ronnie (Geena Davis). “The computer is rethinking it rather than reproducing it, and something’s getting lost in the translation,” Seth concludes. I was reminded of that scene, and that problem, last week when I was getting over my initial open-mouthed reaction to videos created by OpenAI’s new Sora tool.
Sora uses a hybrid architecture that combines the accuracy of diffusion models with the scalability of transformer models (meaning that the more computing power you give the model, the better the results). The resulting videos seem more realistic and visually pleasing than those created by the text-to-video generator from Runway, which has been the leader in that space.
But as I looked a bit closer at some of the Sora videos, the cracks began to show. The shapes and movements of things are no longer ridiculously, nightmarishly, wrong, but they’re still not quite right—enough so to break the spell. Objects in videos often move in unnatural ways. The generation of human hands remains a challenge in some cases. For all its flash appeal, Sora still has one foot in the Uncanny Valley.
The model still seems to lack a real understanding of the laws of physics that govern the play of light over objects and surfaces, the subtleties of facial expressions, and the textures of things. That’s why text-to-video AI still isn’t ready to put thousands of actors out of work. Still, it’s easy to imagine Sora being useful for producing “just in time” or “just good enough” videos, such as short-run ads for social media.
OpenAI has been able to rapidly improve the capabilities of its large language models by increasing their size, the amount of data they train on, and the amount of compute they use. A unique quality of the transformer architecture underpinning GPT-4 is that it scales up in predictable and (surprisingly) productive ways. Sora is built on the same transformer architecture, so over the next few years we may see the same rapid improvements in Sora that we’ve seen in the GPT language models.
Developers are doing crazy things with Google’s Gemini 1.5 Pro
Google announced last week that a new version of its Gemini LLM called Gemini 1.5 Pro offers a one-million-token (words or word parts) context window. This is far larger than the previous industry leader, Anthropic’s Claude 2, which offered a 200,000-token window. You can tell Gemini 1.5 Pro to digest an hour of video, or 11 hours of audio, or 30,000 lines of computer code, or 700,000 words.
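To make that concrete, here is a minimal sketch of what feeding a book-length document to Gemini 1.5 Pro in a single prompt could look like using the google-generativeai Python SDK; the model name, API key placeholder, and file path are my assumptions for illustration, not details from Google’s announcement.

```python
# A minimal sketch of sending a very long document to Gemini 1.5 Pro via the
# google-generativeai Python SDK. The model name, API key placeholder, and
# file path are assumptions for illustration.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # assumed placeholder
model = genai.GenerativeModel("gemini-1.5-pro-latest")  # assumed model name

# Read a large text file (say, a codebase dump or a book-length transcript)
# and ask for a summary in one shot, relying on the million-token context
# window rather than chunking the input.
with open("large_document.txt", "r", encoding="utf-8") as f:
    long_text = f.read()

response = model.generate_content(
    ["Summarize the key points of the following document:", long_text]
)
print(response.text)
```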
In the past, the “context window size” metric has been somewhat overplayed because, regardless of the prompt’s capacity for data, there’s no guarantee the LLM will be able to make sense of it all. As one developer told me, LLMs can become overwhelmed by large amounts of prompt data and start spitting out gibberish. This doesn’t seem to be the case with Gemini 1.5 Pro, however. Here are some of the things developers have been doing with the model and its context window: