Aleph Alpha: Rome Wasn't Built in a Day, and Neither Was the Pharia-1-LLM-7B Model

In the fast-paced world of AI development, understanding and optimizing the performance of models is essential. Recently, I conducted an experiment to evaluate the inference times and GPU utilization of the Aleph-Alpha/Pharia-1-LLM-7B model. This test aimed to provide a clear, data-driven picture of how the model performs under different conditions and what kind of computational resources it requires.

The Pharia-1-LLM-7B Model

The Pharia-1-LLM-7B model family was developed by Aleph Alpha Research. For this article, we tested the foundation model Pharia-1-LLM-7B-control-aligned. The model is designed for multilingual support and is specifically optimized for German, French, and Spanish, making it culturally and linguistically versatile. The models are available under the Open Aleph License, permitting non-commercial research and educational use.
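For orientation, here is a minimal sketch of how such a checkpoint can be loaded with the Hugging Face transformers library. The repository id, dtype, and the trust_remote_code flag are assumptions about how the weights are published, not an excerpt from our test script:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Aleph-Alpha/Pharia-1-LLM-7B-control-aligned"  # assumed Hugging Face repository id
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # keeps the 7B weights well within 64 GB
    trust_remote_code=True,
).to("cuda")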

The Setup

For our test, the Pharia-1-LLM-7B model was run on a Gustav AGX Orin platform with 64 GB of GPU RAM, supported by a Braincell module. The AGX Orin is a low-energy GPU system built on an ARM 64-bit architecture. This setup allowed us to load and test the model almost instantly, making it easier to benchmark its performance in real time. Throughout the test, the system consumed approximately 30-35 watts of power, which is impressively efficient given the computational load. We had to custom cross-compile PyTorch to run the model under JetPack 5.1 as well as under JetPack 6, to make sure we could accelerate the workload on all of the GPU's CUDA cores.
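After installing the cross-compiled build, a quick sanity check (generic PyTorch, not specific to our script) confirms that the Orin's GPU is actually visible to PyTorch:

import torch

print(torch.__version__)
print(torch.cuda.is_available())      # should print True on a working JetPack install
print(torch.cuda.get_device_name(0))  # should report the Orin's integrated GPU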


The Testing Process

The experiment involved processing a series of queries to measure how long it took for the model to generate responses. Importantly, the model was loaded into memory only once at the start of the test. After this initial load, multiple queries were sent to the model in a loop, simulating a real-world scenario where the model needs to handle a continuous stream of requests.

We enabled CPU profiling and a GPU memory check to capture detailed metrics about resource usage and to identify the most resource-intensive operations during inference.

Key steps in the process included:

  1. Timing Inference: Recording the time taken by the model to generate a response for each query.
  2. Profiling: Focusing on CPU and GPU memory activities to monitor which operations consumed the most time and resources.
  3. Resource Utilization: Measuring both the active and reserved GPU memory usage to ensure the model was running efficiently.

Loop questions
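A minimal sketch of the benchmark loop, assuming model and tokenizer were loaded once as shown above; the question list is abbreviated here:

import time

# Array of different questions
questions = [
    "When was Rome founded?",
    "What is the capital of France?",
    "Who invented the telephone?",
    # ... the remaining questions from the transcript below
]

for question in questions:
    inputs = tokenizer(question, return_tensors="pt").to(model.device)
    start_time = time.time()
    outputs = model.generate(**inputs, max_new_tokens=50)
    end_time = time.time()
    answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"Q: {question}")
    print(f"A: {answer}")
    print(f"Time taken: {end_time - start_time:.4f} seconds")

The transcript of the run: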

Q: When was Rome founded?

A: Rome was founded on April 21, 753 BC, according to tradition.

Time taken: 3.0334 seconds


Q: What is the capital of France?

A: The capital of France is Paris.

Time taken: 0.8495 seconds


Q: Who invented the telephone?

A: The telephone was invented by Alexander Graham Bell. He made the first successful telephone call on March 10, 1876, to his assistant, Thomas A. Watson, saying, "Mr. Watson, come here!

Time taken: 5.2807 seconds


Q: How does photosynthesis work?

A: Photosynthesis is a process used by plants, algae, and some bacteria to convert sunlight into chemical energy in the form of glucose (a sugar). It occurs primarily in the chloroplasts, which contain chlorophyll (a green pigment)

Time taken: 5.2832 seconds


Q: What is quantum mechanics?

A: Quantum mechanics is a fundamental theory in physics that describes the behavior of matter and energy on microscopic scales, such as atoms and subatomic particles. It was developed in the early 20th century to overcome the limitations of classical physics in explaining phenomena.

Time taken: 5.2799 seconds


Q: Who was Albert Einstein?

A: Albert Einstein was a German-born theoretical physicist who is widely regarded as one of the most influential scientists of the 20th century. He is best known for his development of the theory of relativity and his mass-energy equivalence formula.

Time taken: 5.2893 seconds


Q: How many planets are there in the solar system?

A: In our solar system, there are eight planets: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, and Neptune. However, Pluto was reclassified as a dwarf planet in 2006 by the International Astronomical Union

Time taken: 5.2768 seconds


Q: What is the meaning of life?

A: The meaning of life varies for each individual, as it encompasses personal beliefs, values, and experiences. Generally, it refers to understanding one's purpose, finding happiness, and exploring the essence of existence.

Time taken: 5.2846 seconds


Q: How high is Mount Everest?

A: Mount Everest is approximately 8,848 meters (29,029 feet) tall. However, its height can vary slightly due to factors such as atmospheric pressure and climate.

Time taken: 4.3254 seconds


Q: What is artificial intelligence?

A: Artificial intelligence (AI) is the development of computer systems or software that can perform tasks that typically require human intelligence. These tasks may include problem-solving, learning, reasoning, perception, natural language understanding, and manipulation.

Time taken: 5.2832 seconds


Testing profiling:

import time
from torch.profiler import profile, record_function, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:
    with record_function("model_inference"):
        start_time = time.time()
        outputs = model.generate(**inputs, max_new_tokens=50)
        end_time = time.time()

# Detailed profiling for the last question
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))  # sort by CUDA time to see GPU-side hotspots
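The active and reserved GPU memory figures quoted in the results were read from PyTorch's memory counters; the exact calls are not part of the listing above, so this is a hedged reconstruction:

import torch

active_gb = torch.cuda.memory_allocated() / 1024**3   # memory occupied by live tensors
reserved_gb = torch.cuda.memory_reserved() / 1024**3  # memory held by the caching allocator
print(f"Active GPU memory:   {active_gb:.2f} GB")
print(f"Reserved GPU memory: {reserved_gb:.2f} GB")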

Results: Inference and GPU memory utilization

The results showed that the model's response times varied depending on the complexity of the queries:

  • Inference Times (on GPU): Response times ranged from 0.8434 seconds to 6.3229 seconds, scaling mainly with the length of the generated answer.
  • Profiling: Profiling revealed that operations such as aten::linear, aten::addmm, and aten::matmul were among the most time-consuming and were executing on the CPU rather than the GPU, which directly impacts overall model efficiency (see the sketch below).
  • GPU Memory Usage: The model consistently used around 13.5 GB of GPU memory, both active and reserved, indicating stable memory management across different queries.
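One plausible explanation for the CPU-bound aten ops is that part of the computation never reached the GPU. Below is a hedged sketch of pinning both the model and the inputs to CUDA; whether this was the actual bottleneck in our run was not verified:

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)                               # move all weights onto the GPU
inputs = {k: v.to(device) for k, v in inputs.items()}  # keep input tensors on the same device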

Conclusion

The Pharia-1-LLM-7B-control model from Aleph Alpha demonstrated solid first-run performance across various queries, with response times that compare favorably to other models in the 7B to 8B parameter range. The additional alignment training in the Pharia-1-LLM-7B-control-aligned variant helps mitigate risks associated with model usage, making it suitable for proofs of concept, but not yet for critical applications.

A good first start, Aleph Alpha.
