How does OpenAI train the Strawberry 🍓 (o1) model to spend more time thinking?

After reading the report, I noticed it mainly focuses on the what: the impressive benchmark results. But when it comes to the how, the report gives us just one sentence: “Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses.”

I took some time to really break this down and created an animation to illustrate my understanding. Two key concepts from this sentence are Reinforcement Learning (RL) and Chain of Thought (CoT). From the list of contributors, two people stood out:

• Ilya Sutskever, long associated with RL with Human Feedback (RLHF) at OpenAI. Even though he recently left OpenAI to start Safe Superintelligence, his mention suggests RLHF is still part of training the Strawberry model.
• Jason Wei, the author of the famous Chain of Thought paper, who joined OpenAI last year after leaving Google Brain. His inclusion points to CoT being a major part of the RLHF alignment process.

Here’s what I aim to convey in my animation:

💡 In RLHF+CoT, the Chain of Thought tokens are fed to the reward model, which scores them to improve LLM alignment. This differs from traditional RLHF, where only the prompt and response were used for alignment. (A minimal code sketch of this idea follows below.)

💡 During inference, the model starts by generating Chain of Thought tokens (taking up to 30 seconds) before producing its final response. This is how the model is “thinking” more! (See the second sketch below.)

Of course, some technical details are still unclear, like how the reward model was trained and how human preferences for the “thinking process” were gathered.

Finally, a disclaimer: this animation represents my educated guess. I can’t fully verify its accuracy, but I’d love to hear feedback from someone at OpenAI to help us all learn more! 🙌

#OpenAI #Strawberry #AIByHand #RLHF
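
To make the first point concrete, here is a minimal Python sketch of the data flow I have in mind. Everything in it is hypothetical: the toy `reward_model` heuristic and the function names are my own illustration of the idea, not OpenAI's code.

```python
# Minimal sketch of RLHF+CoT vs. traditional RLHF reward scoring.
# All names and the toy heuristic below are made up for illustration.

def reward_model(prompt: str, completion: str) -> float:
    """Toy stand-in for a learned reward model: returns a scalar score.
    A real reward model would be a fine-tuned LLM with a scalar head;
    here we just reward step-marked reasoning as a placeholder."""
    return float(completion.count("Step")) + 0.001 * len(completion)

def traditional_rlhf_score(prompt: str, response: str) -> float:
    # Classic RLHF: the reward model sees only prompt + final response.
    return reward_model(prompt, response)

def rlhf_cot_score(prompt: str, cot_tokens: str, response: str) -> float:
    # RLHF+CoT (my guess at o1's setup): the chain-of-thought tokens are
    # included in the reward model's input, so the reasoning itself gets
    # scored, not just the final answer.
    return reward_model(prompt, cot_tokens + "\n" + response)

prompt = "What is 17 * 24?"
cot = "Step 1: 17 * 20 = 340\nStep 2: 17 * 4 = 68\nStep 3: 340 + 68 = 408"
response = "408"

print(traditional_rlhf_score(prompt, response))  # scores the answer only
print(rlhf_cot_score(prompt, cot, response))     # scores reasoning + answer
```

The only difference between the two scoring paths is what the reward model gets to see; that one change is what would let RL gradients reward good reasoning, not just good final answers.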
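And a sketch of the second point, the two-phase inference. Again, `toy_sampler`, the stop markers, and the token budgets are all placeholders I invented, not a real API:

```python
# Hypothetical two-phase decoding loop: generate hidden CoT tokens
# first, then condition on them to produce the final response.

def toy_sampler(prompt: str, stop: str, max_tokens: int) -> str:
    """Placeholder for autoregressive sampling up to a stop marker."""
    if stop == "<end_of_thought>":
        # Phase-1 behavior: emit reasoning tokens, possibly thousands.
        return "17*20=340, 17*4=68, so 340+68=408. <end_of_thought>"
    # Phase-2 behavior: emit the short user-facing answer.
    return "408"

def o1_style_inference(user_prompt: str) -> str:
    # Phase 1: "thinking" -- generate CoT tokens first. This is the
    # step that can take up to ~30 seconds.
    cot = toy_sampler(user_prompt, stop="<end_of_thought>", max_tokens=8192)

    # Phase 2: condition on prompt + CoT to produce the final response;
    # the CoT itself stays hidden from the user.
    return toy_sampler(user_prompt + "\n" + cot, stop="<end_of_text>",
                       max_tokens=512)

print(o1_style_inference("What is 17 * 24?"))  # -> "408"
```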