Kyle Leaders’ Post
https://lnkd.in/ga2HjkmJ Lately there's been a lot of talk about the environmental impact of Gen AI and LLMs, which got me thinking about what we could do to minimize it. I'm starting a blog series where I explore running various tiny LLMs on low-powered hardware, as well as alternatives to GPUs such as an embedded NPU.
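To give a sense of the kind of setup I mean, here's a minimal sketch of running a small quantized model entirely on CPU. It assumes the llama-cpp-python bindings and a locally downloaded GGUF file; the filename and the specific knobs are placeholders, not recommendations.

```python
# Minimal sketch: run a tiny quantized LLM entirely on CPU, no GPU needed.
# Assumes the llama-cpp-python bindings are installed and that
# "tiny-model-q4_k_m.gguf" is a placeholder for a small GGUF checkpoint
# downloaded locally.
from llama_cpp import Llama

llm = Llama(
    model_path="tiny-model-q4_k_m.gguf",  # placeholder path
    n_ctx=2048,      # a modest context window keeps memory use low
    n_threads=4,     # match the cores actually available on the board
    n_gpu_layers=0,  # keep every layer on the CPU
)

out = llm("Q: What is an NPU?\nA:", max_tokens=64, stop=["\n"])
print(out["choices"][0]["text"])
```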
More Relevant Posts
-
Generating the KV cache during #inference requires significant compute and memory, so using it efficiently is key to improving model responsiveness, accelerating inference, and increasing system throughput. TensorRT-LLM provides advanced reuse features to further optimize TTFT for peak performance. Start using TensorRT-LLM KV cache reuse with the documentation on GitHub ➡ https://lnkd.in/gHeHRcyr Technical blog ➡ https://lnkd.in/gP6WcFtN
Dive into how KV cache early reuse, fine-grained blocks, and efficient eviction algorithms can supercharge TTFT speeds. Efficient KV cache use is key to improving #LLM model response, speeding up #inference, and maximizing throughput. With NVIDIA TensorRT-LLM's advanced KV cache management features, developers can take inference performance to the next level. ➡️ https://nvda.ws/3YJzpe4
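For intuition about what block-level reuse buys you, here is a toy, library-free sketch of the idea: prompts are split into fixed-size token blocks, blocks whose prefix has already been computed are reused instead of re-prefilled, and an LRU policy evicts cold blocks. The block size, cache structure, and eviction policy below are illustrative stand-ins, not TensorRT-LLM's actual implementation; see the linked docs for that.

```python
from collections import OrderedDict
from typing import List, Tuple

BLOCK = 16  # tokens per KV block; size here is purely illustrative

class PrefixBlockCache:
    """Toy LRU cache keyed by token-prefix blocks.

    Real KV reuse stores the attention key/value tensors for each block;
    here we only track which prefix blocks are already 'computed'."""

    def __init__(self, capacity_blocks: int = 1024):
        self.capacity = capacity_blocks
        self.blocks: "OrderedDict[Tuple[int, ...], bool]" = OrderedDict()

    def lookup(self, tokens: List[int]) -> int:
        """Return how many leading tokens are covered by cached blocks."""
        reused = 0
        for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
            key = tuple(tokens[: i + BLOCK])  # key = full prefix up to this block
            if key in self.blocks:
                self.blocks.move_to_end(key)  # mark as recently used
                reused = i + BLOCK
            else:
                break
        return reused

    def insert(self, tokens: List[int]) -> None:
        """Register every complete prefix block, evicting LRU blocks if full."""
        for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
            key = tuple(tokens[: i + BLOCK])
            self.blocks[key] = True
            self.blocks.move_to_end(key)
            if len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)  # evict least recently used

# Example: the second request shares a long system prompt with the first,
# so most of its prefill (and hence its TTFT) can be skipped.
cache = PrefixBlockCache()
req1 = list(range(64))               # stand-in for "system prompt + question A"
req2 = list(range(48)) + [999] * 16  # same first 48 tokens, different tail
cache.insert(req1)
print(cache.lookup(req2), "of", len(req2), "tokens reusable")  # -> 48 of 64
```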
-
🌟 Significantly boost the performance of your #AI workloads on GPUs by using llama.cpp on RTX AI PCs. ➡️ https://nvda.ws/406X6zp 🦙 With llama.cpp, you gain access to a C++ implementation designed for LLM inference, packaged in a lightweight installation. 🔎 Explore and begin using llama.cpp through the RTX AI Toolkit. 🛠️
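As a rough illustration of what that looks like in practice, here is a short sketch using the llama-cpp-python bindings rather than the C++ API directly; it assumes a CUDA-enabled build of the bindings and a locally downloaded GGUF file (the filename is a placeholder).

```python
# Sketch of GPU-accelerated llama.cpp inference via the Python bindings.
# Assumes llama-cpp-python was built with CUDA support; the model path is
# a placeholder for any local GGUF file.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-3-8b-instruct-q4_k_m.gguf",  # placeholder
    n_gpu_layers=-1,  # offload all layers to the RTX GPU
    n_ctx=4096,
)

out = llm("Explain KV caching in one sentence.", max_tokens=48)
print(out["choices"][0]["text"])
```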
-
Introducing NVIDIA Modulus version 24.01, featuring updates to distributed utilities and samples for physics-informed DeepONet and GNNs.
-
Learn how the right parallelism technique increases #Llama 3.1 405B performance by 1.5x in throughput-sensitive scenarios on an NVIDIA HGX H200 system with NVLink and NVSwitch, and enables a 1.2x speedup in the MLPerf Inference v4.1 Llama 2 70B benchmark.
Boosting Llama 3.1 405B Throughput by Another 1.5x on NVIDIA H200 Tensor Core GPUs and NVLink Switch | NVIDIA Technical Blog
developer.nvidia.com
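As a hedged sketch of what choosing a parallelism layout can look like in code, the snippet below uses TensorRT-LLM's Python LLM API; the tensor_parallel_size and pipeline_parallel_size arguments, the model identifier, and the 4x2 split are assumptions for illustration only, and the linked blog post is the authority on which mapping actually wins on an HGX H200.

```python
# Hedged sketch: picking a tensor-parallel / pipeline-parallel split with
# TensorRT-LLM's Python LLM API. The argument names, model id, and the 4x2
# layout are assumptions for illustration; verify against the linked docs
# and sweep the split for your own throughput target.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-405B-Instruct",  # assumed model id
    tensor_parallel_size=4,    # split each layer's weights across 4 GPUs
    pipeline_parallel_size=2,  # stack two such groups as pipeline stages
)

outputs = llm.generate(
    ["Summarize what NVLink Switch provides, in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```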
-
At Embedded World, Pyramid Computer GmbH is presenting two new systems optimized for machine learning applications: the Akhet Server VarioFlex 5U with dual GPU and the high-performance computing platform Akhet VarioScaler xI with multiple dual GPUs. Alongside the two systems, a camera bar is on display that combines two 48MP cameras with an ARM Cortex-A73 featuring an integrated NPU. The machine vision solution recognizes objects based on a trained model. https://lnkd.in/eBnaCeVc #machinevision #imageprocessing #edgecomputing #embeddedsystems #machinelearning #ai #deeplearning
-
So, running the LLM, not training it? There's some interesting research on more narrowly trained LLMs and on LLMs trained with negative signals, which achieve better results at smaller sizes.