FOD#32: Mixture of Experts – What is it?
It should be illegal to ship that many updates and releases so close to the holidays, but here we are, two weeks before Christmas, with our hands full of news and research papers (thank you, Conference on Neural Information Processing Systems (NeurIPS), very much!). Let’s dive in; it was truly a fascinating week.
But first, a reminder: we are piecing together expert views on the trajectory of ML & AI for 2024. Send your thoughts on what you believe 2024 will bring to ks@turingpost.com!
Many, many thanks to those who have already shared their views.
Now, to the news. Everybody is talking about Mixture of Experts these days, thanks to Mistral AI’s almost punkish release of their new model via torrent, announced with no fanfare at all.
The concept of MoE, though, has been around for a while. To be exact, it was first presented in 1988 at the Connectionist Summer School. The idea, introduced by Robert Jacobs and Geoffrey Hinton, is to use several specialized 'expert' networks, each handling a different kind of task, together with a gating network that chooses the right expert for each input. The motivation: a single network trained on everything tends to suffer from interference and slow learning, while dividing the work among experts makes learning faster and more efficient. This is the foundation of the Mixture of Experts model, which emphasizes specialized learning over a one-size-fits-all strategy in neural networks. The first paper, 'Adaptive Mixtures of Local Experts', was published in 1991.
Despite its initial promise, MoE's complexity and computational demands led to it being overshadowed by more straightforward algorithms during the early days of AI's resurgence. However, with the advent of more powerful computing resources and vast datasets, MoE has experienced a renaissance, proving integral to advancements in neural network architectures.
The MoE Framework
The essence of MoE lies in its unique structure. Unlike traditional neural networks that rely on a singular, monolithic approach to problem-solving, MoE employs a range of specialized sub-models. Each 'expert' is adept at handling specific types of data or tasks. A gating network then intelligently directs input data to the most appropriate expert(s). This division of labor not only enhances model accuracy but also scales efficiently, as experts can be trained in parallel.
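To make that division of labor concrete, here is a minimal, illustrative PyTorch sketch of an MoE layer: a handful of small feed-forward 'experts' plus a linear gating network that weights their outputs. The layer sizes, class name, and expert design are our own choices for illustration, not any particular production implementation.

```python
# Minimal sketch of a Mixture-of-Experts layer (illustrative only).
# Each "expert" is a small feed-forward network; a gating network
# produces a weight per expert, and the layer output is the
# weighted sum of the expert outputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, dim: int, hidden: int, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(dim, num_experts)  # gating network

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim) -> gate scores -> probabilities over experts
        weights = F.softmax(self.gate(x), dim=-1)                          # (batch, E)
        expert_outputs = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, E, dim)
        # Combine experts per input, weighted by the gate.
        return torch.einsum("be,bed->bd", weights, expert_outputs)

layer = MoELayer(dim=32, hidden=64)
out = layer(torch.randn(8, 32))  # (8, 32)
```

Because each expert is an independent sub-network, the experts can be trained in parallel and the gate learns which of them to trust for a given input.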
Google Research has been especially dedicated to researching the topic.
Sparse Mixture-of-Experts model Mixtral 8×7B by Mistral
This week, MoE is on the rise due to Mistral’s release of their open-source Sparse Mixture-of-Experts model, Mixtral 8x7B. The model outperforms Llama 2 70B on most benchmarks and matches or beats GPT-3.5, with inference roughly six times faster than Llama 2 70B. Licensed under Apache 2.0, Mixtral strikes an efficient balance between cost and performance. It handles five languages, excels in code generation, and can be fine-tuned for instruction-following tasks.
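The "sparse" part is the key efficiency trick: the gate picks only the top-k experts per token (Mixtral uses 2 of 8), so only those experts run and compute per token stays small even though total parameter count is large. Below is a simplified toy sketch of that top-k routing idea; the function, loop structure, and shapes are our own illustration, not Mistral's implementation.

```python
# Simplified sketch of sparse top-k routing (the idea behind sparse MoE
# models such as Mixtral). Only the k selected experts run per token.
import torch
import torch.nn.functional as F

def sparse_route(x, gate, experts, k=2):
    # x: (tokens, dim); gate: nn.Linear(dim, num_experts); experts: list of modules
    logits = gate(x)                              # (tokens, num_experts)
    topk_vals, topk_idx = logits.topk(k, dim=-1)  # keep only k experts per token
    weights = F.softmax(topk_vals, dim=-1)        # renormalize over the chosen k
    out = torch.zeros_like(x)
    for slot in range(k):
        for e_id, expert in enumerate(experts):
            mask = topk_idx[:, slot] == e_id      # tokens routed to this expert
            if mask.any():
                out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
    return out

experts = [torch.nn.Linear(16, 16) for _ in range(8)]
gate = torch.nn.Linear(16, 8)
y = sparse_route(torch.randn(4, 16), gate, experts, k=2)  # (4, 16)
```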
The community is buzzing with excitement, while Mistral is raising another $415 million at a $2 billion valuation and joining the AI Unicorn Family. (Welcome! We'll be covering you shortly.)
Additional reading: to nerd out more on Mixtral and MoE, please refer to Hugging Face’s blog, the Interconnects newsletter, and Mistral’s own release post.
News from The Usual Suspects ©
Elon Musk's "Grok" is available for Premium+
Google: A Mosaic of Success and Shortcomings
HoneyBee from Intel Labs and Mila
CoreWeave's Funding
Meta AI's Codec Avatars
EU's AI Act
Other news, categorized for your convenience
An exceptionally rich week! As always, we offer you only the freshest and most relevant research papers, carefully curated and organized by topic:
Language Models and Code Generation
Video and Image Synthesis
Advances in Learning and Training Methods
Multimodal and General AI
Reinforcement Learning and Reranking
Pathfinding and Reasoning
Thank you for reading! Please feel free to share this with your friends and colleagues. In the next couple of weeks, we will be announcing our referral program 🤍
Another week with fascinating innovations! We call this overview “Froth on the Daydream” – or simply, FOD. It’s a reference to the surrealistic and experimental novel by Boris Vian – after all, AI is experimental and feels quite surrealistic, and a lot of writing on this topic is just froth on the daydream.