Artificial Analysis

Technology, Information and Internet

Independent analysis of AI models and hosting providers: https://artificialanalysis.ai/

About us

Leading provider of independent analysis of AI models and providers. Understand the AI landscape to choose the best AI technologies for your use-case.

Website
https://artificialanalysis.ai/
Industry
Technology, Information and Internet
Company size
11-50 employees
Type
Privately Held


Updates

  • View organization page for Artificial Analysis

    6,992 followers

    Thanks for the support Andrew Ng! Completely agree: faster token generation will become increasingly important as a greater proportion of output tokens is consumed by models rather than read by people, such as in multi-step agentic workflows.

    Andrew Ng - Founder of DeepLearning.AI; Managing General Partner of AI Fund; Exec Chairman of Landing AI:

    Shoutout to the team that built https://lnkd.in/g3Y-Zj3W . Really neat site that benchmarks the speed of different LLM API providers to help developers pick which models to use. This nicely complements the LMSYS Chatbot Arena, Hugging Face open LLM leaderboards and Stanford's HELM that focus more on the quality of the outputs. I hope benchmarks like this encourage more providers to work on fast token generation, which is critical for agentic workflows!

    Model & API Providers Analysis | Artificial Analysis
    artificialanalysis.ai
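Output tokens/s figures like those benchmarked across these posts reduce to simple arithmetic over per-token arrival times on a streaming endpoint. A minimal sketch of that calculation; the convention of measuring from first to last token (excluding time-to-first-token) is an assumption here, and actual benchmark methodology may differ:

```python
def output_speed(timestamps):
    """Output tokens/s from per-token arrival times in seconds.

    Measured from first to last token, so time-to-first-token is
    excluded - one plausible convention; real methodologies may differ.
    """
    if len(timestamps) < 2:
        raise ValueError("need at least two tokens to measure a rate")
    return (len(timestamps) - 1) / (timestamps[-1] - timestamps[0])

# 181 tokens arriving every 12.5 ms after a 200 ms first-token delay
ts = [0.2 + i * 0.0125 for i in range(181)]
print(round(output_speed(ts)))  # 80
```

In a real benchmark the timestamps would be recorded as streamed chunks arrive from the provider's API, and runs would be repeated across prompts and times of day to smooth out load effects.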


    Wait - is the new GPT-4o a smaller, less intelligent model? We have completed our independent evals of OpenAI's GPT-4o release from yesterday and are consistently measuring materially lower eval scores than for the August release of GPT-4o.

    GPT-4o (Nov 20th release) vs GPT-4o (Aug):
    ➤ Artificial Analysis Quality Index decreased from 77 to 71 (now equal to GPT-4o mini)
    ➤ GPQA Diamond decreased from 51% to 39%; MATH decreased from 78% to 69%
    ➤ Output speed increased from ~80 tokens/s to ~180 tokens/s
    ➤ No pricing change

    Our Output Speed benchmarks are currently measuring ~180 output tokens/s for the Nov 20th model, while the August model shows ~80 tokens/s. We have generally observed significantly faster speeds on launch day for OpenAI models (likely because OpenAI provisions capacity ahead of adoption), but we have not previously seen a 2x speed difference.

    Based on this data, we conclude it is likely that OpenAI's Nov 20th GPT-4o is a smaller model than the August release. Given that OpenAI has not cut prices for the Nov 20th version, we recommend that developers not shift workloads away from the August version without careful testing.


    Cerebras is capable of offering Llama 3.1 405B at 969 output tokens/s, and has announced that a public inference endpoint is coming soon 🏁

    We have independently benchmarked a private endpoint shared by Cerebras Systems and measured 969 output tokens/s, >10X faster than the median provider we benchmark.

    Cerebras has announced it will conduct trials with select inference customers in Q4 '24 and will offer access to all customers of its serverless inference offering in Q1 '25. The endpoint will be priced at $6 / 1M input tokens & $12 / 1M output tokens. While the endpoint we benchmarked had a 16k context window, Cerebras has shared that it will support the model's full 128k context window.

    Link to our analysis of Llama 3.1 405B providers below 👇 We will add Cerebras' endpoint to the comparison as soon as it launches publicly.


    Nebius is now live on Artificial Analysis! Nebius is a leading AI-focused, 🇪🇺 EU-based cloud provider that has recently launched serverless inference endpoints.

    Nebius offers serverless inference endpoints with per-token pricing across a range of open-weight models, including Meta's Llama 3.1 series, Mistral's Mixtral 8x22B and Nemo, and Alibaba's Qwen. Nebius' endpoints are hosted in the EU, and the company has announced that a new US region is coming in 2025.

    See below for a link to our analysis of Nebius' endpoints, including comparisons across quality, speed, and price 👇


    Groq has launched a new Llama 3.1 70B endpoint with >6X faster output tokens/s than its current endpoint and >20X the median of other providers.

    Groq's new endpoint achieves 1,665 output tokens/s by leveraging speculative decoding. Speculative decoding is an inference optimization technique that, when correctly implemented, does not decrease the quality of responses. We have run evaluations on the endpoint and have independently verified that Groq's implementation has not impacted response quality.

    Speculative decoding uses a smaller draft model to generate token predictions, which the primary model then verifies. Since verifying tokens is much faster than generating them, this approach increases inference speed when the predictions match. This also means 'simpler' prompts run faster: on our "Write out the Gettysburg Address verbatim" prompt, we saw ~2,500 output tokens/s.

    Groq is offering the endpoint to paying customers (those not on its free plan) at $0.59/1M input tokens & $0.99/1M output tokens, with an 8k context window. Groq's LPU chips run on 14nm silicon, far from the latest 3nm process technology, leaving room for greater speeds from next-generation Groq LPU chips.

    See below for a link to our analysis of providers of Llama 3.1 70B 👇

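The draft-and-verify loop described above can be sketched with toy lookup-table "models". Everything here is an illustrative simplification: real implementations use neural LMs for both roles, sample rather than look up, and use a probabilistic acceptance rule instead of exact matching.

```python
# Toy next-token "models": the draft disagrees with the target at "brown".
TARGET = {"the": "quick", "quick": "brown", "brown": "fox", "fox": "jumps"}
DRAFT = {"the": "quick", "quick": "brown", "brown": "dog", "fox": "jumps"}

def speculative_decode(prompt, n_tokens, k=3):
    """Generate n_tokens after `prompt`, verifying up to k draft tokens per pass."""
    out = [prompt]
    while len(out) - 1 < n_tokens:
        # 1. The cheap draft model proposes up to k tokens.
        draft, cur = [], out[-1]
        for _ in range(k):
            nxt = DRAFT.get(cur)
            if nxt is None:
                break
            draft.append(nxt)
            cur = nxt
        # 2. The target model verifies the proposals in one pass,
        #    accepting the longest prefix that matches its own choices.
        cur = out[-1]
        for tok in draft:
            if TARGET.get(cur) == tok and len(out) - 1 < n_tokens:
                out.append(tok)
                cur = tok
            else:
                break
        # 3. On a miss (or an empty draft) the target emits one token
        #    itself, so every pass makes progress.
        if len(out) - 1 < n_tokens:
            fix = TARGET.get(out[-1])
            if fix is None:
                break  # dead end in the toy table
            out.append(fix)
    return out[1:]

print(speculative_decode("the", 4))  # ['quick', 'brown', 'fox', 'jumps']
```

Because the output is always the target model's own choice (either a verified draft token or the target's correction), quality matches target-only decoding; the speedup comes from verifying several tokens per target pass, which is why agreeable, 'simpler' prompts run faster.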

    SambaNova extends its Llama 3.1 405B inference speed lead - achieving 163 output tokens/s, >2X other providers.

    SambaNova Systems has rolled out speculative decoding on its 405B endpoint, now delivering speeds of up to 200 tokens/s depending on prompt complexity. As we've covered previously, speculative decoding is an inference optimization technique that uses a smaller draft model to generate speculative tokens for the primary model to verify. It does not impact quality when implemented with the full model verifying all output tokens. One implication is that 'simpler' prompts can run faster than 'harder' prompts, because a higher proportion of the draft model's output tokens are accepted as correct by the target model.

    We have independently benchmarked SambaNova's updated endpoint and can confirm we observed no quality degradation in the latest version of the API.

    See below for a link to our analysis of providers of Llama 3.1 405B 👇


    xAI's API has now launched! While Grok Beta's intelligence exceeds Llama 3.1 70B, its pricing is a barrier to its competitiveness.

    In our independently run suite of evaluations, Grok Beta achieves an Artificial Analysis Quality Index of 70. While this exceeds Llama 3.1 70B and Claude 3.5 Haiku, Grok Beta's pricing of $5/1M input tokens and $15/1M output tokens is significantly more expensive than those models. It is even more expensive than the leading frontier models GPT-4o and Claude 3.5 Sonnet (with o1-preview the exception). This makes it hard to justify choosing Grok Beta for production applications. Grok Beta's stated differentiation of less 'censorship' may mean it is best suited to use-cases with specific needs.

    Congratulations to xAI on the launch of their API offering. We are excited to see what comes next from their new 100k H100 cluster.

    See below for a link to our analysis 👇

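The pricing comparisons in these posts come down to per-token arithmetic. A sketch using prices quoted above; the 10M-input / 2M-output workload size is a made-up example:

```python
def workload_cost(input_tokens, output_tokens, price_in, price_out):
    """USD cost of a workload; prices are per 1M tokens, as providers quote them."""
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

# Published prices from the posts above: Grok Beta at $5/$15,
# Llama 3.1 70B on Groq at $0.59/$0.99 (input/output per 1M tokens).
grok = workload_cost(10e6, 2e6, 5.00, 15.00)
llama_groq = workload_cost(10e6, 2e6, 0.59, 0.99)
print(f"Grok Beta: ${grok:.2f}, Llama 3.1 70B on Groq: ${llama_groq:.2f}")
```

Note that output tokens are typically priced ~2-3x input tokens, so the input/output mix of a workload matters as much as the headline price when comparing providers.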

    Anthropic’s Claude 3.5 Haiku release is a significant jump in intelligence from Claude 3 Haiku, but its higher price makes it a tricky choice for developers.

    Claude 3.5 Haiku achieves an Artificial Analysis Quality Index of 69, substantially above Claude 3 Haiku’s 54 and just below Claude 3 Opus’ 70. However, Anthropic has quadrupled per-token pricing from Claude 3 Haiku, leaving Claude 3.5 Haiku nearly 10x more expensive than Google’s latest Gemini 1.5 Flash and OpenAI’s GPT-4o mini. Our initial speed measurements also show Anthropic's Claude 3.5 Haiku API delivering ~2x slower output speeds than Claude 3 Haiku. While a range of factors can drive both speed and pricing, we would speculate that these changes indicate Claude 3.5 Haiku is a larger model than Claude 3 Haiku.

    Seeing Haiku achieve near Opus-level intelligence a mere 8 months after the original launch of the Claude 3 family is incredible, but it enters a highly competitive market.

    See below for pricing comparisons, initial speed results, our full benchmark breakdown and a link to our analysis 👇


    Revealing red_panda as Recraft V3, the new frontier Text to Image model! Recraft V3 is the latest model from London-based AI graphic design start-up Recraft.

    Since launching Recraft V3 under the pseudonym ‘red_panda’ we have received >100k votes. With an ELO of 1172, Recraft V3 is preferred by users over every other model on the Artificial Analysis leaderboard. The Artificial Analysis Image Arena now has over one million votes, and its ELO score is the leading independent metric for comparing image generation models.

    See below for example images comparing Recraft V3 to other leading image models 👇


    Initial results from our AI video generation model arena are in! With almost 20k votes, we now have an initial ranking of video generation models.

    🥇 MiniMax's Hailuo is the clear leader, with an ELO of 1092 and a win rate of 67%
    🥈 Genmo's Mochi 1, released last week, takes the silver and is the leading open-source video generation model
    🥉 Runway, a long-time leader in the video generation space, takes bronze with Runway Gen 3 Alpha, which has an ELO of 1051 and a win rate of 61%

    The Video Arena compares video generation models across a wide variety of prompts. Each model has unique strengths, so we encourage you to test them based on your specific use case.

    Link below to contribute to the Artificial Analysis Video Arena 👇 After 30 votes you will also be able to see your own personalized ranking of the video generation models - feel free to share yours in the comments.

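Arena ELO scores like those above are typically derived from pairwise votes via the standard Elo update. A minimal sketch; the K-factor of 32 and the 1000-point starting rating are illustrative assumptions, and the arena's exact method may differ:

```python
def elo_update(r_winner, r_loser, k=32):
    """Standard logistic Elo update after one pairwise vote."""
    expected_win = 1 / (1 + 10 ** ((r_loser - r_winner) / 400))
    delta = k * (1 - expected_win)
    return r_winner + delta, r_loser - delta

# Two models enter at the same rating; model A wins ten straight votes.
a, b = 1000.0, 1000.0
for _ in range(10):
    a, b = elo_update(a, b)
print(round(a), round(b))
```

Each win moves fewer points as the gap widens, since beating a lower-rated opponent is increasingly expected; the update is zero-sum, so the two ratings always total the same amount.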
