Lepton AI

Software Development

Cupertino, California 1,971 followers

We are building a scalable and efficient AI application platform.

About us

We are building a scalable and efficient AI application platform.

Website
https://www.lepton.ai
Industry
Software Development
Company size
11-50 employees
Headquarters
Cupertino, California
Type
Privately Held
Founded
2023

Locations

  • Primary

    20863 Stevens Creek Blvd

    Cupertino, California 95014, US

Updates

  • Lepton AI

    "The only reason to give a speech is to change the world." We are bring native, real-time voice generation to all open source LLMs. Instead of separate, duck-taped modules, we build one single engine to deliver both the text and audio stream with time to first audio around 300 milliseconds or lower. More specifically, our engine: Seamless integration with any major open-source LLM, including Llama3.1-8B, 70B, and 405B Outpaces traditional text-to-speech workflows with up to 10x faster TTFA (Time to first audio) Delivers smooth, customizable dialogue, minimizing pauses and interruptions Fully customizable voice profiles. To learn more about how we did it, the blog post is at https://lnkd.in/g5c4F48M We are currently opening this up to existing Lepton AI customers and beta users, and will make it generally available soon. Shoot us an email at info@lepton.ai if you would like early access! Here's how it works:

  • Yangqing Jia

    Founder @ Lepton AI | Berkeley alumni | Cloud & Open-source AI leadership

    People often ask how prices like $2.8 per million tokens for Llama 405B, while being super fast, can still be profitable at Lepton AI. We've even been asked by a leading GPU provider! So I figured we should share some technical analysis. This information could benefit the community. We've taken these statistics and this analysis for granted, but they might not be obvious to everyone.

    1. Big batches: Each request receives an output of ~30 tokens/second. Batching (grouping multiple requests simultaneously) significantly improves total throughput, often 10x or higher than a single request. GPUs are more efficient with larger batches.

    2. Dynamic batching: This technique immediately adds a new request to an existing batch instead of making it wait, ensuring the GPU always works at high capacity.

    3. Input tokens: The ~30 tokens/second refers to output tokens. Input tokens are processed much faster (known as "prefilling"). Typically, the input is many times longer than the output (3x to 10x). This increases the total number of tokens processed, which explains why input and output are often billed separately.

    4. Quantization: Using 8-bit integers or 8-bit floats instead of 16-bit floats reduces memory usage and speeds up processing because the GPU accesses less memory. Newer GPUs also have hardware instructions for lower-bit numbers, increasing speed further. For example, the new Nvidia Blackwell GPU supports 4-bit floats (fp4). Quantization also saves memory, allowing even bigger batches (see point 1), making it more economical.

    5. Speculative decoding: This method uses a smaller model to predict the next token. For example, predicting "you" after "it is good to see" doesn't require a large model. Smaller models make such predictions faster. The Medusa algorithm by Tianle Cai is a specific example of this approach.

    6. Prompt caching: LLMs often encounter repeated prefixes, such as "you are a smart AI agent" in system prompts. Caching these prefilled prompts avoids recalculating them, speeding up repeated requests.

    7. Optimizing GPU setups: This involves using large GPUs for big models, small GPUs for small models, and matching GPUs to specific tasks - some are better for prefilling, others for decoding. There are many optimization opportunities here.

    This is not a complete list. We integrate these methods (and a growing number of others) in our runtime to ensure profitability with reasonable traffic. Lepton is created by experts who have developed key AI software over the past decade - Caffe, ONNX, PyTorch - alongside cloud experts like the creator of etcd and core contributors to Kubernetes. We provide not only LLM APIs, but also a full cloud-native experience to help you find, use, and optimize GPUs on our cloud platform. We love the open-source and open-access community. What AI technical explanation would you like to hear next?
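    To make the batching arithmetic above concrete, here is a back-of-the-envelope sketch. Only the ~30 tokens/second per-request speed and the $2.8 per million token price come from the post; the GPU hourly cost and batch size are illustrative assumptions, not Lepton AI's actual numbers.

    ```python
    # Back-of-the-envelope sketch of why batching makes per-token pricing viable.
    # All cost and batch-size numbers are illustrative assumptions.

    GPU_COST_PER_HOUR = 8 * 4.0      # hypothetical: 8 GPUs at $4/hour each
    BATCH_SIZE = 64                  # concurrent requests kept full by dynamic batching
    OUTPUT_TOKENS_PER_SEC = 30       # per-request decode speed quoted in the post
    PRICE_PER_MILLION = 2.8          # $ per million output tokens

    # Aggregate output throughput across the batch (prefill/input tokens ignored).
    tokens_per_hour = BATCH_SIZE * OUTPUT_TOKENS_PER_SEC * 3600
    revenue_per_hour = tokens_per_hour / 1e6 * PRICE_PER_MILLION

    print(f"output tokens/hour: {tokens_per_hour:,}")
    print(f"revenue/hour: ${revenue_per_hour:.2f} vs GPU cost ${GPU_COST_PER_HOUR:.2f}")
    # With these assumptions: ~6.9M output tokens/hour -> ~$19.35/hour, still below
    # the $32/hour GPU cost. Larger batches (enabled by quantization), separate
    # input-token billing, and the other points above are what close the gap.
    ```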

  • Yangqing Jia

    Founder @ Lepton AI | Berkeley alumni | Cloud & Open-source AI leadership

    Llama3 405B, API, Quantization, and Model Size

    Performance measurements of Llama 3.1 405B, orchestrated from OpenRouter, one of the leading LLM aggregation platforms. Here are my two cents on the model:

    - It's amazing to see the quick support of the model by almost all providers. Open source makes software and model co-development much easier. In our case, it took us minimal Python code change to support it (like, minutes).

    - Llama 3.1 405B is indeed a hard model to make profitable. Taking half a machine or a whole machine to run, its cost is significant and its speed is still so-so. Most providers keep it around 30 tokens/s (see pic) to make economic sense. In comparison, 70B models can go north of 150 tokens/s.

    - You will still be able to break even. Of course, this depends on good optimization and good workload saturation. To our VC friends: for a pure API service at this price tag, kindly do not expect an 80% profit margin like conventional SaaS, though.

    - In addition to top performance optimization, the Lepton AI API makes conscious trade-offs among the many parameters - speed, price, concurrency, cost - to make sure that it is sustainable.

    - Quantization is going to be a standard. Folks, forget about FP16. Int8/FP8 is the way to go. If you still feel uncomfortable, let me tell you that back in the day AI frameworks worried about precision and still support FP64. Have you ever used FP64 in your neural nets?

    - Quantization needs care. Gone are the days when one scale is enough for the whole tensor. You'll need to do channel-wise / group-wise quantization to make sure things do not degrade.

    - My bold prediction is that 405B adoption will still be limited by speed and price constraints. But I am not much worried, as I expect at least another 4x efficiency improvement over the next year or so.

    - I am looking forward to testing out Mistral Large 123B! Our Tuna engine supports it out of the box, although to honor the research license, we'll refrain from hosting a public API. If you are interested, let us know.

    - Andrej Karpathy has an awesome tweet about small models FTW. I totally agree. In vertical applications you probably don't need models that big. 70B is normally good enough, and in many cases 8B is really good with fine-tuning!

    - It's great that Llama 3.1 allows (and in some ways recommends) fine-tuning your own model.

    - I also want to give a shout-out to the vLLM project. We have our own engine, but vLLM is simply great. Our platform supports it too.

    Last but not least, the public API is one thing, but feel free to reach out to us for enterprise / dedicated deployments. We believe that AI is awesome beyond APIs, and we build a full AI cloud to serve your end-to-end needs.
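    Since the post notes that a single per-tensor scale is no longer enough, here is a toy sketch (synthetic data, for illustration only) contrasting per-tensor and channel-wise int8 weight quantization error.

    ```python
    # Toy comparison of per-tensor vs. channel-wise int8 weight quantization.
    # Synthetic weights only; real engines apply this per weight matrix.
    import numpy as np

    rng = np.random.default_rng(0)
    # Weight matrix [out_channels, in_channels] with very uneven channel ranges.
    w = rng.normal(size=(4, 256)) * np.array([[0.01], [0.1], [1.0], [10.0]])

    def dequantized(weights, scale):
        # Quantize to int8 with the given scale(s), then dequantize back.
        q = np.clip(np.round(weights / scale), -127, 127)
        return q * scale

    # Per-tensor: one scale derived from the global max.
    per_tensor = dequantized(w, np.abs(w).max() / 127.0)
    # Per-channel: one scale per output channel (row).
    per_channel = dequantized(w, np.abs(w).max(axis=1, keepdims=True) / 127.0)

    for name, wq in [("per-tensor", per_tensor), ("per-channel", per_channel)]:
        # Error on the smallest-magnitude channel, which a global scale crushes.
        err = np.abs(w[0] - wq[0]).max()
        print(f"{name:12s} max abs error on smallest channel: {err:.6f}")
    # Channel-wise scales keep each channel's error proportional to its own
    # range, which is why they avoid the degradation mentioned above.
    ```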

  • Yangqing Jia

    Founder @ Lepton AI | Berkeley alumni | Cloud & Open-source AI leadership

    Memory Matters for LLMs

    While everyone is rushing to provide serverless Llama3-405b model endpoints, I want to talk about one key choice that matters a lot, especially for dedicated enterprise deployments when traffic is not very high: memory.

    - The normal deployment of a model the size of 405B takes 8xH100 GPUs with a total of 640G memory. You'll quantize the weights to int8 or fp8, leaving about 230G memory for KV cache and others. Doable with care.

    - If you need to do fine-tuning (full fine-tuning, LoRA, or Medusa), memory size is going to be stressful. Your choices are probably (1) do quantized training with careful control of scale, or (2) go distributed; both require extra care.

    - AMD MI300 is a particularly interesting card for this scenario, as each card has 192G memory - 4 cards with a total of 768G memory will very comfortably host the model, while giving you a good amount of remaining memory for KV caching / prompt caching and other tricks.

    - Attached is a screenshot showing our runtime ("tuna") running the 405B model on 4xMI300 out of the box at Lepton AI. Speed is good.

    - We know there are a lot of claims out there saying one is faster than the other, but based on our experience, with reasonable quantization, continuous batching, chunked decoding and other known optimization techniques, MI300 and H100 exhibit on-par performance.

    - We haven't thoroughly tested some of the optimization techniques, such as Medusa, on the 405B models, so it is hard to say for sure which GPU takes the lead.

    - The upcoming Blackwell GPUs will have 192G memory as well, so we are definitely seeing appetite for larger models.

    - Large memory definitely gives you the opportunity to do more within one box: 1.536TB of memory per machine means you can do almost whatever you want with 405B-sized models: fine-tune them, serve multiple models at once, hot-swap LoRAs, etc.

    It's exciting times for models, and also exciting times for infra. (This is a repost of my Twitter post: https://lnkd.in/gj7s5xET )
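    To spell out the memory arithmetic, here is a small sketch using the figures from the post; it counts quantized weights only, so the headroom numbers ignore activations and framework overhead and are purely illustrative.

    ```python
    # Back-of-the-envelope memory budget for a 405B-parameter model at int8/fp8,
    # following the figures in the post. Approximate and for illustration only.

    PARAMS_B = 405           # billions of parameters
    BYTES_PER_PARAM = 1      # int8/fp8 weights: 1 byte per parameter
    weights_gb = PARAMS_B * BYTES_PER_PARAM   # ~405 GB of weights

    for name, gpus, mem_per_gpu_gb in [("8xH100", 8, 80), ("4xMI300", 4, 192)]:
        total_gb = gpus * mem_per_gpu_gb
        headroom_gb = total_gb - weights_gb   # left for KV cache, prompt cache, etc.
        print(f"{name:8s} total {total_gb} GB, weights ~{weights_gb} GB, "
              f"headroom ~{headroom_gb} GB")
    # 8xH100: 640 GB total, ~235 GB headroom (the post's "about 230G").
    # 4xMI300: 768 GB total, ~363 GB headroom - the extra room is what makes
    # fine-tuning, bigger KV caches, and prompt caching more comfortable.
    ```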

  • Lepton AI reposted this

    Yangqing Jia

    Founder @ Lepton AI | Berkeley alumni | Cloud & Open-source AI leadership

    Buying a GPU is not the hardest thing in the world now. But how do you maximize the return on your hardware investment? We're super excited to announce Lepton's free GPU pilot program for AI startups. If you are building AI apps, we'll give you access to our AI cloud and high-performance GPUs to help you 10x your productivity. Interested? Register and we'll be in touch!

    Read more at: https://lnkd.in/gzfqYbGJ
    Register at: https://lnkd.in/g4qY8icf

    Boost Your AI Productivity with Lepton’s Free GPU Pilot Program!

    blog.lepton.ai

  • Song Han

    Assoc. Prof. @MIT, distinguished scientist @NVIDIA, co-founder of DeePhi (now part of AMD) and OmniML (now part of NVIDIA). PhD @Stanford. Efficient AI Computing.

    DistriFusion: multi-GPU parallel diffusion model acceleration, at CVPR poster #232 (highlight poster)

  • Lepton AI reposted this

    Lu Zhang

    Founder & Managing Partner at Fusion Fund | Serial Entrepreneur & Board Member | Young Global Leader with WEF (Davos) | Lecturer at Stanford University

    Our portfolio company Lepton AI is unleashing the productivity of AI application development. The team recently demonstrated an impressive conversational search application using just 500 lines of Python code. The code has been open-sourced and it is currently trending at #1 on @GitHub! Check it out live at https://search.lepton.run/

    Congratulations to the Lepton AI team for this achievement. Fusion Fund is excited to be part of your journey to make AI technology more accessible! Learn more about how to run AI applications efficiently, at scale, and in minutes with a cloud-native platform here: https://www.lepton.ai/

    #AIInfrastructure #MLframework #LLM #artificialintelligence #opensource


Funding

Lepton AI: 1 total round

Last round: Seed

See more info on Crunchbase