GenAI: Struggling to choose the right foundation model?
Are you struggling to choose the right foundation model and infrastructure setup for your generative AI workload? AWS’s open-source FM Bench might be exactly what you need.
The Challenge of Model Selection
In today’s rapidly evolving generative AI landscape, organisations face a critical challenge: how do you select the optimal foundation model while balancing performance, cost, and accuracy? With numerous models available — from open-source options like Llama to proprietary solutions like Anthropic’s Claude — and various deployment options on AWS, making the right choice can feel overwhelming.
How can it help businesses?
FM Bench is an open-source tool from AWS that simplifies the selection and optimisation of foundation models for generative AI. It benchmarks models across cost, performance, and accuracy, supporting various AWS services, instance types, and inference containers.
With FM Bench, businesses can:
- Compare candidate foundation models on cost, performance, and accuracy
- Identify the most cost-effective instance type and serving configuration for a target latency and throughput
- Make data-driven deployment decisions instead of relying on guesswork
FM Bench generates comprehensive reports with visualisations, recommendations, and insights. It supports a wide range of models and is continuously updated with new features.
What is FM Bench?
FM Bench is AWS’s answer to this challenge — an open-source Python package that provides comprehensive benchmarking capabilities for any foundation model deployed on AWS’s generative AI services. What makes FM Bench particularly powerful is its ability to evaluate models across three critical dimensions:
Performance: Measures inference latency and transaction throughput
Cost: Calculates dollar cost per transaction
Accuracy: Evaluates model responses using a panel of LLM judges
FM Bench is a powerful, flexible tool designed to run performance benchmarks and accuracy tests for any foundation model deployed on AWS generative AI services. Whether you’re using Amazon SageMaker, Amazon Bedrock, Amazon EKS, or Amazon EC2, FM Bench provides a standardised way to evaluate and compare models.
Key Features
1. Broad Deployment Support: benchmark models served on Amazon SageMaker, Amazon Bedrock, Amazon EKS, or Amazon EC2
2. Flexible Model Support: works with open-source models such as Llama as well as proprietary models such as Anthropic’s Claude
3. Sophisticated Evaluation System: scores model responses using a panel of LLM judges
4. Automated Analysis: produces an auto-generated Markdown report with charts, visualisations, and recommendations
How It Works
FM Bench simplifies the benchmarking process into three main steps:
1. Deploy the model: stand up the model endpoint on your chosen AWS service and instance type
2. Run the benchmark: send prompts from the configured dataset at the specified concurrency levels and collect latency, throughput, and cost data
3. Analyse the results: generate a report with charts, recommendations, and accuracy evaluations
The tool handles everything from model deployment to data collection and analysis, providing you with actionable insights about which model and infrastructure combination best meets your requirements.
Getting Started with FM Bench
Ready to optimise your foundation model deployment? Here’s how to get started:
Follow these steps:
1. Install FM Bench:
pip install fmbench
2. Create a configuration file: Create a YAML configuration file specifying the models, instance types, and evaluation parameters you want to test. You can find example configuration files in the FM Bench GitHub repository.
3. Run FM Bench: Execute the benchmark, pointing it at your configuration file:
fmbench --config-file config-file-name.yml > fmbench.log 2>&1
4. Analyse the results: After the benchmarking run completes, FM Bench generates a Markdown report called report.md in the results directory. This report contains price-performance heat maps, latency versus prompt size charts at different concurrency levels, and recommendations on the most cost-effective instance configuration.
5. Interpret the results: The report provides insights such as the following[3].
The following figure shows a heat map of the price-performance metrics for running the Llama2-13B model on various Amazon SageMaker instance types. The data is based on benchmarking the model using prompts from the LongBench Q&A dataset, where the prompt lengths ranged from 3,000 to 3,840 tokens.
The key metrics displayed include:
- Inference latency (P95 latency threshold set at 3 seconds)
- Transactions per minute that can be supported
- Concurrency level (number of parallel requests)
The chart allows you to quickly identify the most cost-effective and performant instance type options for your specific workload requirements. For example, at 100 transactions per minute, a single p4d instance would be the optimal choice, providing the lowest cost per hour. However, as the throughput needs to scale to 1,000 transactions per minute, utilising multiple g5.2xlarge instances becomes the recommended configuration, balancing cost and instance count.
This granular price-performance data empowers you to make informed decisions on the right serving infrastructure for deploying your Llama2-13B model in production, ensuring it meets your latency, throughput, and cost targets.
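To make this kind of price-performance reasoning concrete, here is a minimal sketch of the arithmetic behind such a chart. The hourly prices and per-instance throughput figures below are illustrative placeholders only, not FM Bench output or current AWS pricing; substitute the numbers from your own benchmark run.
# Illustrative price-performance arithmetic (hypothetical numbers, not real AWS pricing).
# For each instance type, estimate how many instances are needed to sustain a target
# transactions-per-minute (TPM) load within the latency threshold, and what that costs per hour.
import math

# (hourly price in USD, transactions per minute a single instance sustained in your benchmark)
instance_profiles = {
    "ml.g5.2xlarge": (1.50, 120),      # placeholder values
    "ml.p4d.24xlarge": (33.00, 1500),  # placeholder values
}

def cost_for_load(target_tpm: float) -> dict:
    """Instance count and hourly cost, per instance type, needed to serve target_tpm."""
    results = {}
    for name, (price_per_hour, tpm_per_instance) in instance_profiles.items():
        count = math.ceil(target_tpm / tpm_per_instance)
        results[name] = {"instances": count, "cost_per_hour": round(count * price_per_hour, 2)}
    return results

for load in (100, 1000):
    print(f"{load} transactions/min -> {cost_for_load(load)}")
Repeating this calculation over every instance type and load level in your results is essentially what the heat map visualises.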
As another example, shown in the following figure, the benchmarking report also includes charts that illustrate the relationship between inference latency and prompt size across different concurrency levels. As expected, the inference latency tends to increase as the prompt size grows larger.
However, what’s particularly interesting to observe is that the rate of latency increase is much more pronounced at higher concurrency levels. In other words, as you scale up the number of parallel requests being processed, the latency starts to rise more steeply as the prompt size increases.
These detailed latency vs. prompt size charts provide valuable insights into how the model’s performance scales under different workload conditions. This information can help you make more informed decisions about provisioning the right infrastructure to meet your latency requirements, especially as the complexity of the input prompts changes. You can find more examples in the FM Bench GitHub repository.
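If you want to tabulate this behaviour from your own run, a small sketch along the following lines can help. It assumes you have exported per-request metrics to a CSV with prompt_tokens, concurrency, and latency_seconds columns; these file and column names are assumptions for illustration, not the exact artefacts FM Bench writes, so adapt them to your output.
# Sketch: summarise how P95 inference latency grows with prompt size at each concurrency level.
# Assumes a per-request metrics CSV with columns: prompt_tokens, concurrency, latency_seconds
# (hypothetical names; adapt them to the metrics your benchmark run actually produces).
import os
import sys
import pandas as pd

path = sys.argv[1] if len(sys.argv) > 1 else "per_request_metrics.csv"
if not os.path.exists(path):
    sys.exit(f"Metrics file not found: {path}")

df = pd.read_csv(path)

# Bucket prompts by token count, then compute P95 latency per (bucket, concurrency) cell.
df["prompt_bucket"] = pd.cut(df["prompt_tokens"], bins=[0, 1000, 2000, 3000, 4000])
summary = (
    df.groupby(["prompt_bucket", "concurrency"], observed=True)["latency_seconds"]
    .quantile(0.95)
    .unstack("concurrency")
)
print(summary.round(2))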
Customise for specific needs
You can modify the configuration file to benchmark specific models, use custom datasets, or evaluate fine-tuned models for your particular use case. To get started with FM Bench for benchmarking your own models, you can follow these steps:
1. Install FM Bench:
pip install fmbench
2. Create your configuration file: start from an example in the FM Bench GitHub repository and adapt it to your model, instance types, and datasets.
3. Prepare your data: supply the prompt dataset(s) you want to benchmark with, or use one of the supported datasets such as LongBench.
4. Run FM Bench:
fmbench --config-file your_config_file.yml > fmbench.log 2>&1
5. View results: open the auto-generated report.md in the results directory for the charts and recommendations.
Key points for benchmarking your own models:
- Use a configuration file tailored to your model, instance types, and datasets
- You can evaluate fine-tuned models and bring your own prompt datasets
- Adjust concurrency levels and latency constraints to reflect your production workload
Following are the key steps to create a configuration file for FM Bench:
1. Start from an example configuration file in the FM Bench GitHub repository
2. Specify the model details
3. Define the deployment settings
4. Specify the dataset(s) and prompts to benchmark with
5. Configure benchmarking parameters
6. Set constraints and metrics
7. Specify output settings
8. Add any custom parameters
9. Save the configuration as a YAML file
The config-llama2-7b-g5-quick.yml file provided in the FM Bench repository serves as a good annotated example to reference when creating your own configuration.
Essential parameters to include in an FM Bench configuration file:
1. Model details
2. Deployment settings
3. Benchmarking parameters
4. Constraints and metrics
5. Output settings
6. Custom parameters
The configuration file is written in YAML format. FM Bench provides example configuration files in its GitHub repository that can be used as templates and customised for specific benchmarking needs.
Key points: a sample configuration file brings the elements above together in a single YAML document, covering the model details, deployment settings, dataset and benchmarking parameters, constraints, and output settings.
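As a rough illustration of the shape such a file can take, here is a simplified, hypothetical sketch. The section and field names below are for illustration only; the annotated examples in the FM Bench GitHub repository, such as config-llama2-7b-g5-quick.yml, define the actual schema.
# Hypothetical, simplified configuration sketch; not the exact FM Bench schema.
general:
  name: llama2-7b-g5-benchmark            # a friendly name for this benchmarking run
experiments:
  - model_id: meta-llama/Llama-2-7b-hf    # model details: the model to deploy and benchmark
    instance_type: ml.g5.2xlarge          # deployment settings: SageMaker instance type
    concurrency_levels: [1, 2, 4]         # benchmarking parameters: parallel request levels
datasets:
  - source: longbench                     # dataset used to generate benchmark prompts
constraints:
  p95_latency_seconds: 3                  # latency threshold used in the price-performance analysis
report:
  output_dir: results                     # output settings: where report.md and charts are written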
Users can modify this template based on their specific benchmarking needs, adjusting parameters such as model names, instance types, constraints, and evaluation settings as required for their use case.
Real-World Benefits
Organisations using FM Bench can:
- Compare candidate models on cost, performance, and accuracy before committing to one
- Right-size serving infrastructure to meet latency and throughput targets at the lowest cost
- Replace guesswork with repeatable, data-driven benchmarks
Latest Enhancements
FM Bench is continuously updated with support for additional models and new features; check the project’s GitHub repository release notes for the latest changes.
Call to action
FM Bench represents a significant step forward in making foundation model selection and optimisation a more systematic and data-driven process. Whether you’re a platform team managing deployments at scale or an application team looking to optimise your specific workload, FM Bench provides the insights you need to make informed decisions about your generative AI infrastructure.
As the generative AI landscape continues to evolve, tools like FM Bench will play an increasingly important role in helping organisations navigate their AI infrastructure choices. The open-source nature and active development of FM Bench make it an invaluable resource for anyone working with foundation models on AWS.
Take the next step and start leveraging the power of FM Bench for your foundation model benchmarking and optimisation needs. Join the FM Bench interest channel to engage with the development team, share your feedback, and contribute to the growth of this essential open-source tool.
Don’t let your foundation model deployment decisions be driven by guesswork — empower your team with the data-driven insights provided by FM Bench. Start today and unlock the full potential of your generative AI workloads on AWS.