10x more tokens per second 🔥
20% less GPU RAM 🤏
Same `generate` or `pipeline` interface you're used to 🙏

These are roughly the benefits you get when generating text with AutoGPTQ, compared to native 🤗 transformers with 4-bit quantization.

AutoGPTQ is a Python library that provides all the functionality you need to load GPTQ weights from the 🤗 Hub. Because it shares the same backbone as the 🤗 transformers models, you have access to the exact same text generation interfaces! Together with the collection of GPTQ weights on the Hub, you get the best of both worlds:
- low GPU memory usage and much higher throughput
- all the Hub and text generation features you're used to

As always, don't blindly trust what I write here, try it for yourself! If you have a GPU with >= 12GB of RAM, have a look at the example script I link below 🔥 If you have more modest resources, check the examples in the repository itself 🤗

GH repository: https://lnkd.in/dMPXHtsi
Hub weights: https://lnkd.in/dPS74Urh
Example script: https://lnkd.in/dNyPCBav

#llms #textgeneration #pytorch #llama
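
P.S. here's roughly what this looks like in code, a minimal sketch assuming `auto-gptq` is installed and a CUDA GPU is available. The model id is just an illustrative example, and some checkpoints may need extra `from_quantized` kwargs:

```python
from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM

model_id = "TheBloke/Llama-2-7B-GPTQ"  # illustrative GPTQ checkpoint from the Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoGPTQForCausalLM.from_quantized(model_id, device="cuda:0")

# Same `generate` interface as a native transformers model
inputs = tokenizer("The benefits of GPTQ quantization are", return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# ...or the same `pipeline` interface
pipe = TextGenerationPipeline(model=model, tokenizer=tokenizer)
print(pipe("The benefits of GPTQ quantization are")[0]["generated_text"])
```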