Nyun AI’s Post

We just published a new blog on Medium about how to optimize large language models (LLMs) like Llama 3.1-8B using the W4A8KV4 quantization technique from LMQuant. In the blog, we walk through how this approach can reduce model size, improve throughput, and cut computational costs, especially for large-scale deployments!

Key takeaways:
• How to apply 4-bit weights (W4), 8-bit activations (A8), and 4-bit key-value caches (KV4). A toy numeric sketch of this scheme appears below the post.
• How the QServe system achieves up to 3.5x speed improvements on GPUs like the A100 and L40S.
• A step-by-step guide to model quantization, installation, and performance evaluation.

Link to the blog post: https://bit.ly/4dIWbJf
Link to the paper: https://bit.ly/4g787WK
Link to the Nyuntam repository: https://bit.ly/4ggbISl

Check out the blog to learn how you can maximize performance while minimizing computational overhead, and explore Nyuntam to start optimizing your models today!

#AI #MachineLearning #ModelCompression #Nyuntam #OpenSource
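For intuition, here is a minimal, self-contained sketch of what W4A8KV4 means numerically: per-group symmetric 4-bit quantization for weights, per-token 8-bit quantization for activations, and per-token 4-bit quantization for the KV cache. This is not the LMQuant/QServe implementation (their kernels use a more elaborate two-level scaling scheme); the function names, the group size of 128, and the max-based symmetric scaling here are assumptions chosen for clarity.

import numpy as np

def quantize_weights_w4(w, group_size=128):
    """W4: per-group symmetric 4-bit weight quantization.
    Returns integer codes in [-8, 7] plus one float scale per group."""
    out_features, in_features = w.shape
    assert in_features % group_size == 0
    grouped = w.reshape(out_features, in_features // group_size, group_size)
    scale = np.abs(grouped).max(axis=-1, keepdims=True) / 7.0
    codes = np.clip(np.round(grouped / scale), -8, 7).astype(np.int8)
    return codes, scale

def quantize_per_token(x, bits):
    """A8 (bits=8) and KV4 (bits=4): per-token symmetric quantization
    along the last (feature) axis, one float scale per token."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    codes = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return codes, scale

# Round-trip check: dequantized tensors should stay close to the originals.
rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
codes, scale = quantize_weights_w4(w)
w_hat = (codes * scale).reshape(w.shape)
print(f"W4  mean abs error: {np.abs(w - w_hat).mean():.4f}")

acts = rng.standard_normal((4, 256)).astype(np.float32)
a_codes, a_scale = quantize_per_token(acts, bits=8)
print(f"A8  mean abs error: {np.abs(acts - a_codes * a_scale).mean():.5f}")

kv = rng.standard_normal((4, 64)).astype(np.float32)
k_codes, k_scale = quantize_per_token(kv, bits=4)
print(f"KV4 mean abs error: {np.abs(kv - k_codes * k_scale).mean():.4f}")

The point of the per-token and per-group scales is that quantization error is bounded by the local dynamic range rather than the whole tensor's, which is what lets 4-bit weights and KV entries stay accurate enough for inference.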

No KV left unquantized: Achieving Faster inference than TensorRT-LLM (medium.com)
