Muhammad Qasim’s Post

Chief Data Scientist at Cancer Clarity LLC | HONORARY Chief Agentic AI Officer at PIAIC & GIAIC | Chief AI Officer at Panacloud | We have 50,000 Developers Network :)

How does OpenAI train the Strawberry 🍓 (o1) model to spend more time thinking? After reading the report, I noticed it mainly focuses on the what: the impressive benchmark results. But when it comes to the how, the report gives us just one sentence: “Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses.”

I took some time to break this down and created an animation to illustrate my understanding. Two key concepts in that sentence are Reinforcement Learning (RL) and Chain of Thought (CoT).

From the list of contributors, two people stood out:

• Ilya Sutskever, a key contributor to Reinforcement Learning from Human Feedback (RLHF). Even though he recently left OpenAI to start Safe Superintelligence, his mention suggests RLHF is still part of training the Strawberry model.

• Jason Wei, author of the famous Chain of Thought paper, who joined OpenAI last year after leaving Google Brain. His inclusion points to CoT being a major part of the RLHF alignment process.

Here’s what I aim to convey in my animation:

💡 In RLHF+CoT, the Chain of Thought tokens are fed to the reward model, which scores them to improve LLM alignment. This differs from traditional RLHF, where only the prompt and response were used for alignment.

💡 During inference, the model first generates Chain of Thought tokens (taking up to 30 seconds) before producing its final response. This is how the model is “thinking” more!

Of course, some technical details remain unclear, such as how the reward model was trained and how human preferences for the “thinking process” were gathered.

Finally, a disclaimer: this animation represents my educated guess. I can’t fully verify its accuracy, but I’d love to hear feedback from someone at OpenAI to help us all learn more! 🙌

#OpenAI #Strawberry #AIByHand #RLHF

[diagram attached]
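The two 💡 points above can be sketched in toy Python. This is my own illustration of the guess in the post, not OpenAI code: every function name and scoring heuristic here is invented, and the real reward model is a learned network, not a formula.

```python
# Toy sketch of the post's two ideas (an educated guess, not OpenAI's code).
# All names and heuristics are invented stand-ins for illustration.

def reward_traditional(prompt: str, response: str) -> float:
    """Traditional RLHF: the reward model scores only (prompt, response)."""
    return min(len(response) / 100.0, 1.0)  # toy heuristic: longer = better

def reward_with_cot(prompt: str, cot: str, response: str) -> float:
    """RLHF+CoT: the Chain of Thought tokens are also fed to the reward
    model, so the reasoning process itself gets scored and reinforced."""
    bonus = 0.1 * cot.count("Step")  # toy heuristic: reward explicit steps
    return min(reward_traditional(prompt, response) + bonus, 1.0)

def think(prompt: str) -> str:
    """Stand-in for the hidden 'thinking' phase at inference time."""
    return f"Step 1: parse '{prompt}'. Step 2: plan. Step 3: draft."

def answer(prompt: str, cot: str) -> str:
    """Final response, conditioned on the CoT generated first."""
    return f"Final answer to '{prompt}' (after {cot.count('Step')} steps)."

def infer(prompt: str) -> str:
    """Inference: generate CoT tokens first (up to ~30s in o1), then answer."""
    return answer(prompt, think(prompt))
```

The point of the contrast is the extra `cot` argument: in the RLHF+CoT picture the reasoning tokens enter the reward signal during training, and at inference time the same model spends tokens on `think()` before `answer()`.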
Shahdin Salman🇵🇸

Experienced Web Developer | Helping Businesses & Startups Build Scalable & Impactful Digital Solutions | Specializing in AI Integration, UI/UX & Full-Stack Development

3mo

Your detailed breakdown of OpenAI's training methods for the Strawberry model is truly insightful, Muhammad Qasim. Your animation beautifully illustrates the complexities of reinforcement learning and the importance of refining the chain of thought. It's commendable how you've delved deep into the technical aspects while remaining open to feedback and learning.

AbuBakar Khan Lakhwera

Aspiring AI Engineer | AWS, Docker, Kubernetes, Linux Administration | Flask Developer | Relational Databases (MySQL, PostgreSQL) | Ansible, Jenkins | Git & Bash Scripting Enthusiast | Microsoft MLSA

4mo

Great insights into OpenAI’s Strawberry model! 🍓 Your animation sheds light on how RLHF and Chain of Thought are enhancing AI’s ‘thinking’ process. Excited to see how this evolves and would love to hear from OpenAI for more details! 🚀💡 #AI #RLHF #Strawberry

Abdul Raffay

| Ardent CEO • Entrepreneur • Technologist |

4mo

Interesting


Very informative

Iqra Ehsan

Web Designer | Web Developer | GIAIC Student | Frontend Web Developer | HTML | CSS | TypeScript | JavaScript | Content Writer

3mo

Amazing, super fast thinking speed from OpenAI!

Muhammad Faisal Sikander

Microsoft Certified Software Engineer

4mo

Insightful

Khan Wali

Data Science | Python | Presidential Initiative for Artificial Intelligence & Computing, Pakistan

4mo

Very informative

Shamas Liaqat

Building AI Agents | Expertise in Generative AI | Data-Driven Technologies | Agricultural Economics & Innovation

4mo

Insightful

Saad Iqbal

Transforming Businesses with Scalable Design-as-a-Service (DaaS), Global Tech Talent, and AI-Driven Automation | Executive Search & Managed Outsourcing for Rapid Growth

4mo

Insightful 🤔
