🤔 OpenAI's latest structured outputs feature is a life-saver for building LLM-based apps, but it may come with performance trade-offs. A new paper titled "Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models" highlights this concern.

📖 Some insights:

⛳ The paper examines the effects of structured generation, where LLMs produce content in standardized formats like JSON or XML, which is common in real-world applications for extracting key information.

⛳ It shows that while structured generation simplifies parsing and integration into applications, it has a significant downside: LLMs exhibit a notable decline in reasoning ability when restricted to these formats, and stricter format constraints lead to greater performance degradation.

⛳ Looser format restrictions generally improve performance and reduce variance on reasoning tasks. Parsing errors, while not the primary cause of the performance gap, can be mitigated through corrective prompting.

👉 I'm not sure whether this applies to the latest OpenAI models, since the authors tested only GPT-3.5 and a few other models that may not be fully optimized for structured outputs. But it's definitely something to keep in mind and verify if you plan to rely on this feature heavily.

Link: https://lnkd.in/eHRURmSH
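The corrective prompting the paper mentions can be sketched as a simple retry loop: let the model answer, and if the output fails to parse, feed the parser error back and ask for a repaired answer. Here `model_call` is a hypothetical stand-in for whatever LLM client you use; this is a minimal sketch, not the paper's exact setup:

```python
import json

def parse_with_correction(model_call, prompt, max_retries=2):
    """Ask the model for JSON; on a parse failure, re-prompt with the
    error message appended (a form of corrective prompting).

    `model_call` is a hypothetical callable (prompt -> str) standing in
    for any LLM API; swap in a real client as needed.
    """
    reply = model_call(prompt)
    for _ in range(max_retries):
        try:
            return json.loads(reply)
        except json.JSONDecodeError as err:
            # Feed the parser error back so the model can repair its output.
            reply = model_call(
                f"{prompt}\n\nYour previous answer was not valid JSON "
                f"({err}). Previous answer:\n{reply}\n"
                "Return only valid JSON."
            )
    return json.loads(reply)  # raises if the final attempt still fails
```

The upside of this pattern over hard format constraints is that the model reasons in (relatively) free text first and only the broken cases pay the cost of an extra round trip.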
Is it because the instructions to generate structured output get mixed up with the prompt? I mean the hit to performance and reasoning.
This is an insightful perspective! The study really highlights the balance between structure and performance in LLM-based apps. Your expertise in Gen AI technology would add valuable insights to this discussion, Aishwarya Naresh Reganti.
Good to know!
Principal Engineer at VOICEplug | Conversational AI | Building Voice Assistants
Aishwarya Naresh Reganti I just learned this week that the reverse is also true: given a structured input and asking the LLM to generate a natural-sounding response (dialog), the prompt that was performing well with gpt-3.5-turbo gives terrible results on gpt-4o-mini.