GenAI: Struggling to choose the right foundation model?

Are you struggling to choose the right foundation model and infrastructure setup for your generative AI workload? AWS’s open-source FM Bench might be exactly what you need.

The Challenge of Model Selection

In today’s rapidly evolving generative AI landscape, organisations face a critical challenge: how do you select the optimal foundation model while balancing performance, cost, and accuracy? With numerous models available — from open-source options like Llama to proprietary solutions like Anthropic’s Claude — and various deployment options on AWS, making the right choice can feel overwhelming.

How can it help businesses?

FM Bench is an open-source tool from AWS that simplifies the selection and optimization of foundation models for generative AI. It benchmarks models across cost, performance, and accuracy, supporting various AWS services, instance types, and inference containers.

With FM Bench, businesses can:

  • Identify cost-effective instance types
  • Make data-driven decisions about model selection and infrastructure
  • Test custom datasets and fine-tuned models
  • Compare different serving strategies
  • Validate performance across workload sizes

FM Bench generates comprehensive reports with visualisations, recommendations, and insights. It supports a wide range of models and is continuously updated with new features.

What is FM Bench?

FM Bench is AWS’s answer to this challenge — an open-source Python package that provides comprehensive benchmarking capabilities for any foundation model deployed on AWS’s generative AI services. What makes FM Bench particularly powerful is its ability to evaluate models across three critical dimensions:

Performance: Measures inference latency and transaction throughput

Cost: Calculates dollar cost per transaction

Accuracy: Evaluates model responses using a panel of LLM judges
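
To make the cost dimension concrete: if a hypothetical instance costs $1.50 per hour on demand and sustains 300 transactions per minute, it handles 18,000 transactions per hour, so the cost per transaction works out to roughly $1.50 / 18,000, or about $0.00008. FM Bench performs this kind of calculation for every model and instance combination it benchmarks, using the throughput it actually measures rather than assumed figures.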

FM Bench is a powerful, flexible tool designed to run performance benchmarks and accuracy tests for any foundation model deployed on AWS generative AI services. Whether you’re using Amazon SageMaker, Amazon Bedrock, Amazon EKS, or Amazon EC2, FM Bench provides a standardised way to evaluate and compare models.

Key Features

  1. Universal Compatibility

  • Works with any AWS service (SageMaker, Bedrock, EKS, EC2)
  • Supports various instance types (g5, p4d, p5, Inf2)
  • Compatible with multiple inference containers (DeepSpeed, TensorRT, HuggingFace TGI)

2. Flexible Model Support

  • Open-source models (Llama, Mistral)
  • Third-party models through Bedrock (Claude, Cohere)
  • Custom fine-tuned models
  • First-party AWS models (Titan)

3. Sophisticated Evaluation System

  • Uses a panel of three LLM judges (Claude 3 Sonnet, Cohere Command-R+, Llama 2 70B)
  • Implements majority voting for accuracy assessment (a response counts as correct when at least two of the three judges agree)
  • Supports custom evaluation datasets

4. Automated Analysis

  • Generates comprehensive HTML reports
  • Provides interactive visualisations
  • Creates heat maps for cost-performance analysis
  • Tracks accuracy trajectories across different prompt sizes

How It Works

FM Bench simplifies the benchmarking process into three main steps:

  1. Configuration: Create a YAML file specifying your benchmarking parameters (or use pre-built configurations)
  2. Execution: Run a single command to execute the benchmarking suite
  3. Analysis: Review the auto-generated report with detailed insights and recommendations

The tool handles everything from model deployment to data collection and analysis, providing you with actionable insights about which model and infrastructure combination best meets your requirements.

Getting Started with FM Bench

Ready to optimise your foundation model deployment? Here’s how to get started:

  • Visit the FM Bench GitHub repository and star the project.
  • Join the FM Bench interest channel to engage with the community and developers.
  • Try FM Bench with your own models and datasets for valuable insights.

Follow these steps:

  1. Installation: Install FMBench using pip:

pip install fmbench

2. Create a configuration file: Create a YAML configuration file specifying the models, instance types, and evaluation parameters you want to test. You can find example configuration files in the FMBench GitHub repository; a minimal illustrative sketch also follows the run command below.

3. Run FMBench: Execute the benchmarking suite:

fmbench --config-file config-file-name.yml > fmbench.log 2>&1
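
To give a feel for what such a configuration might contain, here is a minimal illustrative sketch. The key names below are simplified placeholders rather than the exact FMBench schema, so treat this as a conceptual outline and start from one of the annotated example files in the FMBench GitHub repository for the authoritative structure.

# Illustrative sketch only: key names are placeholders, not the real FMBench schema
model:
  name: Llama2-7b                               # model to benchmark (example)
deployment:
  service: sagemaker                            # SageMaker, Bedrock, EKS or EC2
  instance_types: [ml.g5.xlarge, ml.g5.2xlarge] # instance types to compare
benchmark:
  dataset: LongBench                            # or a path to your own dataset
  concurrency_levels: [1, 2, 4]                 # parallel requests to test
  latency_threshold_seconds: 3                  # P95 latency target

Once the file is ready, pass its name to the fmbench command shown above; the tool handles model deployment, load generation, and data collection from there.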

4. Analyse the results: After the benchmarking is complete, FMBench writes a report in Markdown format, report.md, to the results directory. This report contains:

  • Price-performance comparisons across different models and instance types
  • Accuracy evaluations using a panel of LLM judges
  • Visualisations like heat maps and charts

5. Interpret the results: The report provides insights such as:

  • Optimal model and serving stack based on price-performance requirements
  • Model accuracy across different prompt sizes
  • Latency and throughput metrics
  • Cost estimates for running the benchmarks

The following figure shows a heat map of price-performance metrics for running the Llama2-13B model on various Amazon SageMaker instance types. The data comes from benchmarking the model with prompts from the LongBench Q&A dataset, with prompt lengths ranging from 3,000 to 3,840 tokens.

The key metrics displayed include:

  • Inference latency (P95 latency threshold set at 3 seconds)
  • Transactions per minute that can be supported
  • Concurrency level (number of parallel requests)

The chart allows you to quickly identify the most cost-effective and performant instance type options for your specific workload requirements. For example, at 100 transactions per minute, a single P4d instance would be the optimal choice, providing the lowest cost per hour. However, as throughput needs scale to 1,000 transactions per minute, utilising multiple g5.2xlarge instances becomes the recommended configuration, balancing cost and instance count.

This granular price-performance data empowers you to make informed decisions about the right serving infrastructure for deploying your Llama2-13B model in production, ensuring it meets your latency, throughput, and cost targets.

As another example, the benchmarking report includes charts, shown in the following figure, that illustrate the relationship between inference latency and prompt size across different concurrency levels. As expected, inference latency tends to increase as the prompt size grows.

However, what’s particularly interesting to observe is that the rate of latency increase is much more pronounced at higher concurrency levels. In other words, as you scale up the number of parallel requests being processed, the latency starts to rise more steeply as the prompt size increases.

These detailed latency versus prompt size charts provide valuable insight into how model performance scales under different workload conditions. This information can help you make more informed decisions about provisioning the right infrastructure to meet your latency requirements, especially as the complexity of the input prompts changes.

Customise for specific needs

You can modify the configuration file to benchmark specific models, use custom datasets, or evaluate fine-tuned models for your particular use case. To get started with FM Bench for benchmarking your own models, you can follow these steps:

  1. Install FM Bench as discussed before.
  2. Create a configuration file:


  • Use one of the provided config files in the FM Bench GitHub repo as a template
  • Edit the config file to specify your model, deployment settings, and test parameters
  • A simple annotated config file example is provided in the repo (config-llama2-7b-g5-quick.yml)

3. Prepare your data:


  • FM Bench supports using datasets from LongBench
  • You can also use your own custom dataset

4. Run FM Bench:

fmbench --config-file your_config_file.yml > fmbench.log 2>&1

5. View results:

  • FM Bench will generate a benchmarking report as a markdown file (report.md) in the results directory
  • Metrics and other result files will be stored in an S3 bucket and downloaded locally to the results directory

Key points for benchmarking your own models:

  • FM Bench is flexible and can benchmark models deployed on SageMaker, Bedrock, EKS, or EC2
  • You can use the “Bring your own endpoint” mode to benchmark already deployed custom models
  • Customise the config file to specify your model, instance types, inference containers, and other parameters
  • Use your own dataset or fine-tuned model by specifying it in the config file

The following are the key steps to create a configuration file for FM Bench:

  1. Choose a base configuration file:

  • Use an existing config file from the configs folder in the FM Bench GitHub repository as a starting point
  • Or edit an existing config file to customise it for your specific requirements

2. Specify the model details:

  • Model name/type (e.g. Llama2-7b)
  • Model source (e.g. Hugging Face model ID)

3. Define the deployment settings:

  • AWS service to use (SageMaker, Bedrock, EKS, EC2)
  • Instance types to benchmark (e.g. ml.g5.xlarge, ml.g5.2xlarge)
  • Inference container (e.g. huggingface-pytorch-tgi-inference)

4. Configure benchmarking parameters:

  • Dataset to use (e.g. LongBench or custom dataset)
  • Prompt sizes/token ranges to test
  • Number of concurrent requests
  • Latency thresholds
  • Accuracy requirements

5. Set constraints and metrics:

  • Price/cost limits
  • Performance targets (latency, throughput)
  • Accuracy thresholds

6. Specify output settings:

  • S3 bucket to store results
  • Report format preferences

7. Add any custom parameters:

  • Model-specific settings
  • Advanced inference options (e.g. tensor parallelism)

8. Save the configuration as a YAML file

The config-llama2-7b-g5-quick.yml file provided in the FM Bench repository serves as a good annotated example to reference when creating your own configuration.

Essential parameters to include in an FM Bench configuration file:

  1. Model details:

  • Model name/type (e.g. Llama2-7b)
  • Model source (e.g. Hugging Face model ID)

2. Deployment settings:

  • AWS service to use (SageMaker, Bedrock, EKS, EC2)
  • Instance types to benchmark (e.g. ml.g5.xlarge, ml.g5.2xlarge)
  • Inference container (e.g. huggingface-pytorch-tgi-inference)

3. Benchmarking parameters:

  • Dataset to use (e.g. LongBench or custom dataset)
  • Prompt sizes/token ranges to test
  • Number of concurrent requests
  • Latency thresholds
  • Accuracy requirements

4. Constraints and metrics:

  • Price/cost limits
  • Performance targets (latency, throughput)
  • Accuracy thresholds

5. Output settings:

  • S3 bucket to store results
  • Report format preferences

6. Custom parameters:

  • Model-specific settings
  • Advanced inference options (e.g. tensor parallelism)

The configuration file is written in YAML format. FM Bench provides example configuration files in its GitHub repository that can be used as templates and customised for specific benchmarking needs.

Key points:

  • The config file specifies all the details needed to run the benchmark
  • It allows customising the models, deployment, testing parameters, and constraints
  • Example config files are provided that can be modified as needed
  • The file format is YAML

A sample configuration file should include the following information (an illustrative sketch follows the list):

  1. Model details: Specifies the model name and source.
  2. Deployment settings: Defines the AWS service, instance types, and inference container.
  3. Benchmarking parameters: Sets the dataset, prompt sizes, concurrent requests, and latency threshold.
  4. Constraints and metrics: Specifies price limits and accuracy thresholds.
  5. Output settings: Defines where to store results and report format.
  6. Custom parameters: Includes optional model-specific settings.
  7. LLM Judges: Lists the models to be used for evaluation.
  8. Evaluation settings: Specifies parameters for model evaluation.
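
To make this concrete, here is an illustrative sketch of what such a file might look like. The key names are simplified placeholders rather than the exact FMBench schema; use the annotated example configurations in the FMBench GitHub repository as the authoritative reference and adapt one of those to your needs.

# Illustrative sketch only: key names are simplified placeholders, not the exact FMBench schema
model:
  name: Llama2-7b                                 # model details
  source: meta-llama/Llama-2-7b-chat-hf           # Hugging Face model ID (example)
deployment:
  service: sagemaker                              # SageMaker, Bedrock, EKS or EC2
  instance_types: [ml.g5.xlarge, ml.g5.2xlarge]
  inference_container: huggingface-pytorch-tgi-inference
benchmark:
  dataset: LongBench                              # or a path to your own dataset
  prompt_token_ranges: [[1000, 2000], [3000, 3840]]
  concurrent_requests: [1, 2, 4, 8]
  latency_threshold_seconds: 3                    # P95 latency target
constraints:
  max_cost_per_hour_usd: 5.0                      # hypothetical budget limit
  min_accuracy: 0.8                               # hypothetical accuracy threshold
output:
  s3_bucket: my-fmbench-results-bucket            # hypothetical bucket name
  report_format: markdown
custom:
  tensor_parallel_degree: 1                       # example of a model-specific setting
evaluation:
  llm_judges: [claude-3-sonnet, cohere-command-r-plus, llama-70b]
  voting: majority                                # correct if most judges agree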

Users can modify this template based on their specific benchmarking needs, adjusting parameters such as model names, instance types, constraints, and evaluation settings as required for their use case.

Real-World Benefits

Organisations using FM Bench can:

  • Make data-driven decisions about model selection
  • Optimise infrastructure costs
  • Ensure accuracy requirements are met
  • Compare different serving strategies
  • Validate performance across varying workload sizes

Latest Enhancements

Recent updates to FM Bench include:

  • Support for NVIDIA Triton Inference Server
  • Model evaluation using a panel of LLM judges (Llama 70B, Claude, and Cohere)
  • Compilation support for AWS Inferentia and Trainium chips
  • A dedicated website with comprehensive documentation

Call to action

FM Bench represents a significant step forward in making foundation model selection and optimization a more systematic and data-driven process. Whether you’re a platform team managing deployments at scale or an application team looking to optimise your specific workload, FM Bench provides the insights you need to make informed decisions about your generative AI infrastructure.

As the generative AI landscape continues to evolve, tools like FM Bench will play an increasingly important role in helping organisations navigate their AI infrastructure choices. The open-source nature and active development of FM Bench make it an invaluable resource for anyone working with foundation models on AWS.

Take the next step and start leveraging the power of FM Bench for your foundation model benchmarking and optimization needs. Join the FM Bench interest channel to engage with the development team, share your feedback, and contribute to the growth of this essential open-source tool.

Don’t let your foundation model deployment decisions be driven by guesswork — empower your team with the data-driven insights provided by FM Bench. Start today and unlock the full potential of your generative AI workloads on AWS.

