Exploring the LLM Infra Stack, Part 2: The Model Layer
This is the second post in our 4-part series on the LLM infra stack. You can catch up on our last post on the Data Layer here. If you are building in this space, we’d love to hear from you at info@theory.ventures.
The Model Layer
The core component of any LLM system is the model itself. It provides the fundamental capabilities that enable new products.
LLMs are foundation models: models trained on a broad set of data that can be adapted to a wide range of tasks. They tend to fall into two categories: large, hosted, closed-source models like GPT-4 and PaLM 2, and smaller, open-source models like Llama 2 and Falcon.
It’s unclear which models will dominate. The state of the art is changing rapidly: a new best-in-class model is announced monthly. Few products have been built with LLMs so far, so the trade-offs remain largely untested. There are two emerging paths:
1. Large foundation models dominate: Closed-source LLMs continue to provide capabilities other models can’t match. A managed service offers simplicity & rapid cost reduction. These vendors solve security concerns.
2. Fine-tuned models provide the best value: Smaller, less-expensive models prove just as good for most applications after fine-tuning. Businesses prefer them because they control the model and can create intellectual property.
Large foundation models will work best for use cases that require broad knowledge and reasoning. Asking a model to plan a month-long vacation itinerary in Southeast Asia for a vegetarian couple interested in Buddhism requires planning and lots of working memory. These models are also best for new tasks where you don’t have data.
Smaller, fine-tuned models will excel in products composed of repeatable, well-defined tasks. They’ll perform just as well as larger models for a fraction of the cost. Need to categorize customer feedback as positive or negative? Need to extract details from an email into a spreadsheet? A smaller model will work great.
Which models are safer and more secure? Which are best for compliance and governance? It’s still unclear.
Many companies would prefer to self-host their models to avoid sending their data to a third party. However, emerging research suggests fine-tuning may compromise safety. Foundation models are tuned to avoid inappropriate language and content. LLMs, especially smaller ones, can forget these rules when fine-tuned. Large model providers may also find it easier to demonstrate that their models and data are compliant with nascent regulations. More work needs to be done to establish these conclusions.
In the long term, we can’t predict how LLMs will develop. A large research org could discover a new type of model that is an order of magnitude better than what we have today.
Progress in model development will drastically change the structure of the Model Layer. Regardless of the direction, it will have the following key components:
Core model:
Training LLMs from scratch (also known as pre-training) costs hundreds of thousands to tens of millions of dollars for each model. OpenAI is said to have spent over $100 million on GPT-4. Only companies with huge balance sheets can afford to develop them.
Conveniently, fine-tuning a pre-trained model with your own data can provide similar results at a fraction of the cost. This can cost well under $100 with data you already have on hand. During inference, fine-tuned models can be faster and an order of magnitude cheaper to boot.
Fine-tuning is supported by a rapidly improving set of open-source models. The most capable ones, like Meta’s Llama 2, perform similarly to the best LLMs from 6-12 months ago (e.g., GPT-3.5). When fine-tuned for a concrete task, open-source models often reach near-parity with today’s best closed-source LLMs (e.g., GPT-4).
Serving/compute:
Training LLMs is expensive, but inference isn’t cheap, either. LLMs require an extreme amount of memory and substantial compute. For a product owner experimenting on OpenAI, each GPT-3.5 query will cost $0.01-0.03. GPT-4 can be up to two orders of magnitude more expensive, at up to $3.00 per query.
If you try to self-host for privacy/security or cost reasons, you won’t find it easy to match OpenAI. Virtual machines with NVIDIA A100 GPUs (the best one for most large foundation models) run $3.67 to $40.55 per hour on Google Cloud. Availability of these machines is very limited and sporadic, even through major cloud providers.
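To put those hourly rates in per-query terms, here is a rough back-of-the-envelope sketch. The hourly prices are the ones quoted above; the throughput and utilization figures are assumptions of ours for illustration, not measurements.

```python
def monthly_vm_cost(hourly_rate_usd, hours_per_day=24, days=30):
    """Cost of keeping one GPU VM running for a month, in USD."""
    return hourly_rate_usd * hours_per_day * days


def self_hosted_cost_per_query(hourly_rate_usd, queries_per_hour, utilization=1.0):
    """Approximate cost of one query on a rented GPU VM.

    hourly_rate_usd:  VM price per hour (the range quoted in the text).
    queries_per_hour: ASSUMED peak throughput of the deployment.
    utilization:      ASSUMED fraction of that throughput actually used, in (0, 1].
    """
    if queries_per_hour <= 0 or not 0 < utilization <= 1:
        raise ValueError("queries_per_hour must be > 0 and utilization in (0, 1]")
    return hourly_rate_usd / (queries_per_hour * utilization)


LOW_RATE, HIGH_RATE = 3.67, 40.55  # USD per hour, from the text

for rate in (LOW_RATE, HIGH_RATE):
    # 1,000 queries/hour at 50% utilization is a made-up workload.
    per_query = self_hosted_cost_per_query(rate, queries_per_hour=1_000, utilization=0.5)
    print(f"${rate}/hr -> ${monthly_vm_cost(rate):,.2f}/month, ${per_query:.4f}/query")
```

Running a single VM around the clock works out to roughly $2,600 to $29,200 per month at these rates. The per-query figure depends almost entirely on how busy you can keep the GPU, which is why low or bursty traffic makes self-hosting hard to justify on cost alone.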
For LLM applications with thousands or millions of users, product owners will face new challenges. Unlike most classic SaaS applications, they’ll need to think about costs. At $3 per query, it’s hard to build a viable product. Similarly, users won’t hang around long if it takes 30 seconds to return an answer.
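A quick sketch of why per-query pricing matters at scale. The per-query prices are the ones cited above; the user count and usage pattern are hypothetical numbers chosen only to make the arithmetic concrete.

```python
def monthly_inference_cost(users, queries_per_user_per_day, cost_per_query_usd, days=30):
    """Total monthly spend on model calls, in USD."""
    return users * queries_per_user_per_day * days * cost_per_query_usd


# ASSUMED workload, for illustration only.
USERS = 10_000
QUERIES_PER_USER_PER_DAY = 5

# Per-query prices quoted in the text.
PRICES = [
    ("GPT-3.5 (low)", 0.01),
    ("GPT-3.5 (high)", 0.03),
    ("GPT-4 (worst case)", 3.00),
]

for label, price in PRICES:
    total = monthly_inference_cost(USERS, QUERIES_PER_USER_PER_DAY, price)
    print(f"{label}: ${total:,.0f} per month, ${total / USERS:,.2f} per user")
```

Under these assumptions each user makes 150 queries a month, which costs about $1.50 to $4.50 at the GPT-3.5 prices and about $450 at $3 per query. Few subscription products can absorb the latter.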
It will be critical to serve LLMs in a performant and cost-effective way. Even “smaller” models described above still have billions of parameters.
How can you reduce cost and latency in production? There are lots of approaches. On the inference side, batch queries or cache their results. Optimize memory allocation to reduce fragmentation. Use speculative decoding (a smaller model proposes tokens and a larger model checks and “accepts” them). On the infrastructure side, compare GPU prices in real time and send workloads to the cheapest option. Use spot pricing if you can manage spotty availability and failovers.
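Of the techniques above, speculative decoding is the least intuitive, so here is a toy sketch of the greedy variant. The two "models" are hypothetical stand-in functions that map a context to the next token; a real system would have the large model score all proposed positions in a single batched forward pass rather than one call per token.

```python
from typing import Callable, List

Token = str
NextToken = Callable[[List[Token]], Token]  # toy "model": context -> next token


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[Token],
                       max_new_tokens: int, k: int = 4) -> List[Token]:
    """Greedy speculative decoding with toy models.

    The draft proposes up to k tokens. The target keeps the longest prefix it
    agrees with and, at the first disagreement, substitutes its own token.
    The result matches what the target alone would have generated.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    out = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(out) < limit:
        # 1) The small model cheaply proposes a run of tokens.
        proposal: List[Token] = []
        for _ in range(min(k, limit - len(out))):
            proposal.append(draft(out + proposal))
        # 2) The large model verifies them in order.
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if tok != expected:
                # Keep the agreed prefix, then take the target's token instead.
                out.extend(proposal[:i])
                out.append(expected)
                break
        else:
            out.extend(proposal)  # every proposed token was accepted
    return out


def greedy_decode(model: NextToken, prompt: List[Token], max_new_tokens: int) -> List[Token]:
    """Baseline: the model generates one token per call."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(model(out))
    return out


# Hypothetical toy models: the target recites a sentence, the draft is wrong
# at every fourth position.
SENTENCE = "the quick brown fox jumps over the lazy dog".split()


def target_model(ctx: List[Token]) -> Token:
    return SENTENCE[len(ctx) % len(SENTENCE)]


def draft_model(ctx: List[Token]) -> Token:
    return "cat" if len(ctx) % 4 == 3 else target_model(ctx)


fast = speculative_decode(target_model, draft_model, [], max_new_tokens=12)
slow = greedy_decode(target_model, [], max_new_tokens=12)
print(fast == slow)
```

The savings come from the draft being right most of the time: each agreed run costs one verification pass of the large model instead of one pass per token.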
Most companies won’t have the in-house expertise to do all of this themselves. Startups will rise to fill this need.
Model routing/abstraction:
For some LLM applications, it will be obvious which model to use where. Summarize emails with one LLM. Transform the data to a spreadsheet with another.
But for others, it can be unclear. You might have a chat interface where some customer messages are simple, and others are complex. How will you know which should go to a large model vs. a small one?
For these types of applications, there may be an abstraction layer to analyze each request and route it to the best model.
As a product owner, it is already difficult to evaluate the behavior of an LLM in production. If your system might route to many different LLMs, that problem could get even more challenging.
However, if dynamic routing improves capabilities and performance, the trade-off might be worth it. This layer could almost be thought of as a separate model itself and evaluated on its own merit.
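As a minimal sketch of what such a routing layer might look like, the example below sends short, simple messages to a small model and everything else to a large one. The model names, keyword list, and length threshold are all hypothetical; a production router would more likely be a learned classifier evaluated on its own merit, as noted above.

```python
import re
from dataclasses import dataclass

# Hypothetical model identifiers, not real endpoints.
SMALL_MODEL = "small-fine-tuned"
LARGE_MODEL = "large-foundation"

# ASSUMED signals that a request needs broader knowledge or reasoning.
REASONING_HINTS = {"plan", "compare", "explain", "why", "itinerary"}


@dataclass(frozen=True)
class Route:
    model: str
    reason: str


def route_request(message: str, max_small_words: int = 40) -> Route:
    """Pick a model for one request using simple, inspectable rules."""
    words = re.findall(r"[a-z']+", message.lower())
    if len(words) > max_small_words:
        return Route(LARGE_MODEL, "long message")
    hits = REASONING_HINTS.intersection(words)
    if hits:
        return Route(LARGE_MODEL, f"reasoning hint: {sorted(hits)[0]}")
    return Route(SMALL_MODEL, "short and simple")


print(route_request("Is my order shipped yet?"))
print(route_request("Plan a month-long vacation itinerary in Southeast Asia."))
```

Returning a reason alongside the chosen model makes routing decisions easier to log and audit, which helps with the evaluation problem described above.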
Fine-tuning & optimization:
As discussed above, many product owners will fine-tune a model for their use case.
For most use cases, fine-tuning will not be a one-off endeavor. It will start in product development and continue indefinitely as the product is used.
The key to great fine-tuning will be fast feedback loops. Today, product owners triage user feedback/bugs in code and ask engineers to fix them. For LLM applications, product owners will monitor qualitative and quantitative data on LLM behavior. They will work with engineers, data ops teams, product/marketing, and evaluators to curate new data and re-fine-tune models.
We’ll discuss post-deployment monitoring more in our next post on the Deployment Layer.
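One way to picture the feedback loop described above is as a small curation step that turns reviewed production interactions into new training examples. The record fields (`prompt`, `model_output`, `rating`, `corrected_output`) and the prompt/completion output format are hypothetical; real pipelines and fine-tuning APIs will each have their own schema.

```python
import json


def curate_examples(interactions):
    """Turn reviewed interactions into deduplicated fine-tuning examples.

    Keeps outputs rated "good" as-is, and uses the reviewer's correction for
    outputs rated "bad". Unreviewed or uncorrected items are skipped.
    """
    examples = []
    seen = set()
    for item in interactions:
        rating = item.get("rating")
        if rating == "good":
            completion = item["model_output"]
        elif rating == "bad" and item.get("corrected_output"):
            completion = item["corrected_output"]
        else:
            continue
        key = (item["prompt"], completion)
        if key in seen:
            continue  # avoid over-weighting repeated interactions
        seen.add(key)
        examples.append({"prompt": item["prompt"], "completion": completion})
    return examples


def to_jsonl(examples):
    """Serialize examples one JSON object per line."""
    return "\n".join(json.dumps(example) for example in examples)


# Made-up log records for illustration.
LOGS = [
    {"prompt": "Classify: 'Love it'", "model_output": "positive", "rating": "good"},
    {"prompt": "Classify: 'Meh'", "model_output": "positive", "rating": "bad",
     "corrected_output": "negative"},
    {"prompt": "Classify: 'Broken'", "model_output": "positive", "rating": "bad"},
    {"prompt": "Classify: 'Love it'", "model_output": "positive", "rating": "good"},
]

print(to_jsonl(curate_examples(LOGS)))
```

The faster records like these flow from production back into a training set, the shorter the loop between spotting a bad behavior and shipping a re-tuned model.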
To fine-tune a model, there are two major components. There is an important cohort of LLM infrastructure companies serving each of them.
Key open questions:
Companies working on the Model Layer
If you are building in this space, we’d love to hear from you at info@theory.ventures! In our next post, we’ll explore the Deployment Layer. Subscribe here to follow along!