Aleph Alpha: Rome Wasn't Built in a Day—And Neither Was the Pharia-1-LLM-7B Model
In the fast-paced world of AI development, understanding and optimizing the performance of models is essential. Recently, I conducted an experiment to evaluate the inference times and GPU utilization of the Aleph-Alpha/Pharia-1-LLM-7B model. This test aimed to provide a clear, data-driven picture of how the model performs under different conditions and what kind of computational resources it requires.
The Pharia-1-LLM-7B Model
The Pharia-1-LLM-7B model family was developed by Aleph Alpha Research; we tested the foundation model Pharia-1-LLM-7B-control-aligned for this article. The model is designed for multilingual support and specifically optimized for German, French, and Spanish, making it culturally and linguistically versatile. The models are available under the Open Aleph License, permitting non-commercial research and educational use.
The Setup
For our test, the Pharia-1-LLM-7B model was run on a Gustav AGX Orin platform with 64 GB of GPU RAM, supported by a Braincell module. The AGX Orin is a low-energy GPU system built on an ARM 64-bit architecture. This setup allowed us to load and test the model almost instantly, making it easier to benchmark its performance in real time. Throughout the test, the system consumed approximately 30-35 watts of power, which is impressively efficient given the computational load. We had to custom cross-compile PyTorch for both JetPack 5.1 and JetPack 6 to make sure we could accelerate the workload on all of the GPU's CUDA cores.
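After cross-compiling, a quick sanity check confirms that the PyTorch build actually sees the Orin's CUDA cores. This is a minimal sketch, not the original validation script; the guarded import is only there so the snippet also runs on machines without PyTorch installed.

```python
# Sanity check for a cross-compiled PyTorch build on Jetson:
# the import must succeed and CUDA must be visible before the
# model can run on the GPU's CUDA cores.
def cuda_status():
    try:
        import torch
    except ImportError:
        return "PyTorch not installed"
    if torch.cuda.is_available():
        return f"PyTorch {torch.__version__}, CUDA device: {torch.cuda.get_device_name(0)}"
    return f"PyTorch {torch.__version__}, no CUDA device visible"

print(cuda_status())
```

On a correctly built JetPack wheel this should report the Orin as the CUDA device; a CPU-only wheel will import fine but report no visible device, which is the failure mode the cross-compilation step avoids.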
The Testing Process
The experiment involved processing a series of queries to measure how long it took for the model to generate responses. Importantly, the model was loaded into memory only once at the start of the test. After this initial load, multiple queries were sent to the model in a loop, simulating a real-world scenario where the model needs to handle a continuous stream of requests.
We enabled CPU profiling and GPU memory checks to capture detailed metrics about resource usage and to identify the most resource-intensive operations during inference.
Key steps in the process included looping through an array of different questions, timing each response, and capturing profiler output for the final query. The transcript below shows each question, the model's answer, and the time taken:
Q: When was Rome founded?
A: Rome was founded on April 21, 753 BC, according to tradition.
Time taken: 3.0334 seconds
Q: What is the capital of France?
A: The capital of France is Paris.
Time taken: 0.8495 seconds
Q: Who invented the telephone?
A: The telephone was invented by Alexander Graham Bell. He made the first successful telephone call on March 10, 1876, to his assistant, Thomas A. Watson, saying, "Mr. Watson, come here!"
Time taken: 5.2807 seconds
Q: How does photosynthesis work?
A: Photosynthesis is a process used by plants, algae, and some bacteria to convert sunlight into chemical energy in the form of glucose (a sugar). It occurs primarily in the chloroplasts, which contain chlorophyll (a green pigment)
Time taken: 5.2832 seconds
Q: What is quantum mechanics?
A: Quantum mechanics is a fundamental theory in physics that describes the behavior of matter and energy on microscopic scales, such as atoms and subatomic particles. It was developed in the early 20th century to overcome the limitations of classical physics in explaining phenomena.
Time taken: 5.2799 seconds
Q: Who was Albert Einstein?
A: Albert Einstein was a German-born theoretical physicist who is widely regarded as one of the most influential scientists of the 20th century. He is best known for his development of the theory of relativity and his mass-energy equivalence formula.
Time taken: 5.2893 seconds
Q: How many planets are there in the solar system?
A: In our solar system, there are eight planets: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, and Neptune. However, Pluto was reclassified as a dwarf planet in 2006 by the International Astronomical Union
Time taken: 5.2768 seconds
Q: What is the meaning of life?
A: The meaning of life varies for each individual, as it encompasses personal beliefs, values, and experiences. Generally, it refers to understanding one's purpose, finding happiness, and exploring the essence of existence.
Time taken: 5.2846 seconds
Q: How high is Mount Everest?
A: Mount Everest is approximately 8,848 meters (29,029 feet) tall. However, its height can vary slightly due to factors such as atmospheric pressure and climate.
Time taken: 4.3254 seconds
Q: What is artificial intelligence?
A: Artificial intelligence (AI) is the development of computer systems or software that can perform tasks that typically require human intelligence. These tasks may include problem-solving, learning, reasoning, perception, natural language understanding, and manipulation.
Time taken: 5.2832 seconds
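The query loop behind the transcript above is straightforward: load the model once, then time each generate call. The sketch below reproduces its shape with a stand-in generate function, since the real run wraps the tokenizer and model.generate; the helper names here are illustrative, not from the original script.

```python
import time

questions = [
    "When was Rome founded?",
    "What is the capital of France?",
    "Who invented the telephone?",
]

def timed_query(generate_fn, question):
    """Time a single inference call; generate_fn stands in for the model."""
    start = time.time()
    answer = generate_fn(question)
    elapsed = time.time() - start
    return answer, elapsed

# Stand-in so the sketch runs without the 7B model loaded;
# in the real test this wrapped tokenizer + model.generate.
def fake_generate(question):
    return f"Answer to: {question}"

for q in questions:
    answer, elapsed = timed_query(fake_generate, q)
    print(f"Q: {q}\nA: {answer}\nTime taken: {elapsed:.4f} seconds\n")
```

Because the model stays resident in GPU memory between iterations, only the first query pays the load cost; every subsequent timing reflects pure inference latency.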
Testing profiling:

import time
import torch
from torch.profiler import profile, record_function, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:
    with record_function("model_inference"):
        start_time = time.time()
        outputs = model.generate(**inputs, max_new_tokens=50)
        end_time = time.time()

# Detailed profiling for the last question
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
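Alongside the profiler we logged GPU memory. A minimal version of that check might look like the sketch below; it assumes the model is already resident on the CUDA device, and the guarded import keeps the snippet runnable on CPU-only machines.

```python
# GPU memory check used alongside the profiler run (a sketch; on the
# AGX Orin the 7B model occupies a large share of the 64 GB of GPU RAM).
def report_gpu_memory(tag):
    try:
        import torch
    except ImportError:
        print(f"[{tag}] PyTorch not installed")
        return
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1024**3      # GiB held by live tensors
        peak = torch.cuda.max_memory_allocated() / 1024**3       # GiB high-water mark
        print(f"[{tag}] allocated: {allocated:.2f} GiB, peak: {peak:.2f} GiB")
    else:
        print(f"[{tag}] no CUDA device visible")

report_gpu_memory("after inference")
```

Calling this before and after model.generate shows both the steady-state footprint of the weights and the transient peak from activations during decoding.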
Results: Inference times and GPU memory utilization
The results showed that the model's response times varied depending on the complexity of the queries: the shortest answer returned in about 0.85 seconds, while longer generations took roughly 5.28 seconds.
Conclusion
The Pharia-1-LLM-7B-control model from Aleph Alpha demonstrated solid performance across various queries, with response times that compare favorably to other models in the 7B to 8B parameter range. The additional alignment training in the Pharia-1-LLM-7B-control-aligned variant helps mitigate risks associated with model usage, making it suitable for proof-of-concept work, but not yet for critical applications.
A good first start, Aleph Alpha.