Btw, you can view your training loss across open source models AND Gemini models on OpenPipe!
OpenPipe
Software Development
Automatically convert unreliable LLM prompts into high-quality, fast fine-tuned models.
About us
Fine-tune models to replace your LLM prompts.
- Website: https://openpipe.ai
- Industry: Software Development
- Company size: 2-10 employees
- Type: Privately Held
Updates
-
OpenPipe reposted this
Was great to be back in Austin with Kyle Corbitt and Saumya G. for MLOps World '24. We had a lot of great conversations with engineering founders, leaders, and expert ICs building cutting-edge LLMOps infrastructure tooling and teaching the latest best practices in this exciting new field.

Speaking of LLMOps, OpenPipe is helping engineering teams and product owners take advantage of their product's human- and AI-generated feedback data to create a Data Flywheel. We can help you continuously plug that data into a reinforcement learning pipeline that dramatically improves the performance of your LLMs on your proprietary use case (and keeps improving it over time).

Interested in building (and growing) a sustainable competitive advantage backed by your data? Message me and let's set up a time to chat!

Some special shout-outs to folks we met on the conference trail 🤠 Skyler Saucedo, Marty Dytrych, Juan Diego Balbi, Martin Picovsky, Ramon Serrallonga, Stephan Broquie, Rahul Sheth, Jared Zoneraich, Aaron Cheng, Ph.D, Beatrice Lovely, Nitin Gupta, Stefan Krawczyk, Adam Probst ^ some incredibly smart folks we learned a thing or two from (and hopefully taught a thing or two back)
-
RLHF-curious? I’ve put together a very practical guide to building a task-specific reward model! It includes lots of tips on choosing the right metric and data, and all the code is included. Hope it’s helpful. 🙂 If your application has human feedback (regenerations, user choices, etc.), please DM me and I’d love to chat about how we can use RLHF to improve your response quality significantly with minimal marginal effort!
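To make the idea concrete, here is a minimal sketch of the pairwise setup a task-specific reward model typically uses: score the chosen and rejected responses from your human feedback and train with a Bradley-Terry loss. The base model, hyperparameters, and example data below are illustrative assumptions, not the guide's actual code.

```python
# Minimal sketch of a pairwise reward model trained on human feedback
# (regenerations, user choices, etc.). All names below are illustrative.
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

BASE = "distilbert-base-uncased"  # stand-in; any base model works (assumption)
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForSequenceClassification.from_pretrained(BASE, num_labels=1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def score(prompts, responses):
    # One scalar reward per (prompt, response) pair.
    batch = tokenizer(prompts, responses, padding=True, truncation=True,
                      return_tensors="pt")
    return model(**batch).logits.squeeze(-1)

def train_step(prompts, chosen, rejected):
    # Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected),
    # which pushes the preferred response's score above the rejected one's.
    loss = -F.logsigmoid(score(prompts, chosen) - score(prompts, rejected)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Illustrative usage with a single (hypothetical) preference pair:
train_step(["Summarize this resume..."],
           ["Concise, accurate summary."],
           ["Rambling, inaccurate summary."])
```

Once trained, a model like this can rank candidate generations or serve as the reward signal in an RL fine-tuning loop.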
-
Yesterday two major new model families became available for fine-tuning: Llama 3.1, which comes in 8B, 70B and 405B(!) variants, and GPT-4o mini. We’ve added them to the OpenPipe platform and ran all of them (except Llama 3.1 405B) through our evaluation harness.

The good news: all 3 models are extremely high quality. The bad news: they saturate most of the standard evals we ran, which makes comparing them difficult! In fact, both Llama 3.1 variants we tried saturate all 3 of the standard evals we ran, and GPT-4o mini saturated 2 of the 3.

What do we mean by saturate? For any given input, you can imagine there is a potential “perfect” output (or set of outputs) that cannot be improved upon. The more complex the task, the more difficult it is for a model to generate a perfect output. Once a model is strong enough to consistently generate a perfect output for a task, we consider the task saturated for that model. In our LLM-as-judge evals, this usually shows up as a cluster of models all doing about the same on the task, with no model significantly outperforming.

And in fact, that's what we see in the evaluations below. All 3 fine-tuned models do about as well as each other (win rates within 6%) on both the "Resume Summarization" and "Data Extraction" tasks. On "Chatbot Responses", however, both Llama 3.1 variants significantly outperform GPT-4o mini. So the "Chatbot Responses" task isn’t saturated for GPT-4o mini, but all the other task/model combinations are. This is very significant: we chose these tasks explicitly because older models on our platform, like Mistral 7B and Llama 3 8B, did not saturate them!

There are two main reasons why we’re seeing this saturation now:
- The new models we’re testing here are stronger than the previous generation of models available on-platform.
- Our benchmark models are now all trained on datasets relabeled with Mixture of Agents, which substantially improves the quality of the dataset and thus the fine-tuned model.

We’re working on developing better benchmarks, and once we have some higher-difficulty ones we’ll analyze Llama 3.1 405B as well. And again, you can try all these models out today on OpenPipe and run your own evaluations!
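For context on how win rates like these are typically produced, here is a minimal sketch of an LLM-as-judge head-to-head comparison. The judge model, prompt, and OpenAI client usage are assumptions for illustration; this is not OpenPipe's actual evaluation harness.

```python
# Minimal sketch of an LLM-as-judge head-to-head eval producing a win rate.
# Judge model and prompt are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are comparing two responses to the same input.
Input: {input}
Response A: {a}
Response B: {b}
Answer with exactly one word: A, B, or TIE."""

def judge(input_text: str, a: str, b: str) -> str:
    completion = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model for this sketch
        temperature=0,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(input=input_text, a=a, b=b)}],
    )
    return completion.choices[0].message.content.strip().upper()

def win_rate(rows) -> float:
    # rows: list of (input, candidate_output, baseline_output) tuples.
    # Ties count as half a win, so two equally strong models converge to ~50%.
    wins = 0.0
    for input_text, candidate, baseline in rows:
        verdict = judge(input_text, candidate, baseline)
        wins += 1.0 if verdict == "A" else 0.5 if verdict == "TIE" else 0.0
    return wins / len(rows)
```

Under a scheme like this, two models that both saturate a task each win about half the comparisons, which is exactly the clustering of near-equal win rates described above.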
-
OpenPipe reposted this
Fine-tuned Llama 3.1 8B is completely cracked. Just ran it through our fine-tuning test suite and it blows GPT-4o mini out of the water on every task. There has never been an open model this small, this good.
-
OpenPipe reposted this
One week away from All About Fine-Tuning LLMs 🛠 Join us next Tuesday, June 25th at 11 AM PDT on Zoom!

We're excited to announce two new panelists: 🤩
- Sophia Yang, Ph.D.: Head of Developer Relations at Mistral AI
- Aditya Jain: Applied Research Scientist at Meta

They'll be joining alongside:
- Kyle Corbitt: Co-founder, OpenPipe
- Wing Lian: Founder, Axolotl AI
- Benjamin Hamm: Senior Principal Product Manager at OctoAI

And our host: Naomi Chetrit Band, The GenAI Collective!

Don't miss this deep dive into fine-tuning models for optimal performance; you'll level up your tuning knowledge and pick up the strategies to help you tailor open source models.

Sign up here 👇 https://lnkd.in/e_3dasMu
🧠 GenAI Collective x OctoAI 🐙 All About Fine-Tuning LLMs · Zoom · Luma