62% of developers are planning to deploy an LLM application to production within the next year, according to a recent survey. Here are 9 questions the STAC-AI inferencing benchmark and test harness can help you answer if you’re building an LLM-based system.

1. How quickly can your system load a large language model and get it ready to use?
2. What’s the trade-off between how fast users send requests and how quickly the model responds?
3. Can the system keep up with how fast people read during real-time interactions?
4. How does the system’s performance change with different context sizes?
5. What’s the highest number of requests the system can handle while maintaining an acceptable level of performance?
6. How does the system manage multiple users at once, and when should we add more GPUs to keep up with demand?
7. How closely do the model’s responses match a reference set of answers?
8. How much text does the system produce for every dollar spent?
9. How does the system perform in terms of energy, space, and cost efficiency?

To demonstrate this, we ran the benchmark on a stack with:
- Llama-3.1-8B-Instruct and Llama-3.1-70B-Instruct
- 8 x NVIDIA A100-SXM4-80GB GPUs
- 720 GiB of virtualized memory

The benchmark uses LLMs to analyze financial data from quarterly and annual reports filed by publicly traded companies, illustrating how latency and throughput vary with request rates. This analysis raises important questions about user satisfaction at different times of day and the trade-offs between resource allocation and the cost of inference.

The benchmark captures the following metrics, plus a whole lot more (a simple sketch of how some of these ratios are computed appears at the end of this post):
- Inferences per second
- Words generated per second
- Response smoothness
- Energy efficiency (words per kWh)
- Space efficiency (words per cubic foot of rack space)
- Price performance (words per USD)

Here are just some of our findings from the Llama-3.1-8B-Instruct test:
- The server loaded the model from storage into the GPUs and was ready for inference in 90 seconds.
- A 26% reduction in the rate of prompts hitting the system reduced the median response time by 38%.
- The system was 20x more efficient when processing a smaller-context dataset than when processing a larger-context dataset (EDGAR4a vs. EDGAR5a datasets in STAC-AI).
- At a peak request rate of 21.5 requests per second, the system achieved an output rate of 12.2 words per second, just below the typical maximum reading speed of 13 words per second for fast readers.

We will share even more data from these tests at the AI STAC conference in London on December 4th. If you’re interested in the finer details, we’ll be presenting these along with more information about STAC-AI at an AI workshop the following day. Register for either or both events if you want to learn more. They’re free to attend.

📅 Conference, December 4th
🔗 https://lnkd.in/eCwd4Q5a
📅 Workshop, December 5th
🔗 https://lnkd.in/eTKqfn6X
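
P.S. If you want a feel for how efficiency metrics like the ones listed above reduce to simple ratios, here is a minimal Python sketch. This is our own illustration with made-up numbers and field names, not the STAC-AI harness or its results:

# Illustrative only: hypothetical values, not STAC-AI code or data.
def efficiency_metrics(words_generated, elapsed_s, energy_kwh, rack_cubic_feet, cost_usd):
    """Reduce one benchmark run to the efficiency ratios named in the metrics list."""
    return {
        "words_per_second": words_generated / elapsed_s,            # output rate
        "words_per_kwh": words_generated / energy_kwh,              # energy efficiency
        "words_per_cubic_foot": words_generated / rack_cubic_feet,  # space efficiency
        "words_per_usd": words_generated / cost_usd,                # price performance
    }

# Example with invented values for a one-hour run:
print(efficiency_metrics(words_generated=44_000, elapsed_s=3_600,
                         energy_kwh=3.2, rack_cubic_feet=8.0, cost_usd=12.5))

The words_per_second figure is the one you would compare against a human reading speed, such as the ~13 words per second for fast readers mentioned in the findings above.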