(Image credit: Nvidia)

Nvidia says its new open-source TensorRT-LLM software can dramatically boost the performance of large language models (LLMs) on its GPUs. According to the company, TensorRT-LLM doubles the performance of its H100 compute GPU on GPT-J, an LLM with six billion parameters. Importantly, the software delivers this improvement without re-training the model.

Nvidia developed TensorRT-LLM specifically to speed up LLM inference, and performance graphs provided by Nvidia indeed show a 2X speed boost for its H100 due to the software optimizations. A standout feature of TensorRT-LLM is its in-flight batching technique. This method addresses the dynamic and diverse workloads of LLMs, whose requests can vary greatly in their computational demands.

In-flight batching optimizes the scheduling of these workloads: rather than waiting for an entire batch to finish, the runtime evicts completed sequences from the batch and starts executing new requests while others are still in flight, which keeps the GPU better utilized. As a result, Nvidia says, real-world LLM requests on H100 Tensor Core GPUs see a doubling in throughput.
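To make the scheduling idea concrete, here is a toy Python simulation comparing static batching with in-flight batching. It is only a sketch under simplifying assumptions, not TensorRT-LLM's actual scheduler: each request is reduced to the number of tokens it must generate, every running request produces one token per step, and prompt processing cost is ignored. The function names and the sample workload are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Steps needed when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        # Short requests sit idle in their slot until the longest one is done.
        steps += max(lengths[i:i + batch_size])
    return steps


def inflight_batching_steps(lengths, batch_size):
    """Steps needed when a finished request's slot is refilled right away."""
    waiting = deque(lengths)  # requests not yet started (tokens to generate)
    active = []               # remaining tokens for each running request
    steps = 0
    while waiting or active:
        # Fill any free slots from the queue before running the next step.
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        steps += 1
        # Every running request emits one token; drop the ones that are done.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


# Hypothetical workload: short and long generations interleaved.
lengths = [2, 8, 2, 8, 2, 8, 2, 8]
batch_size = 2

print("static batching steps:   ", static_batching_steps(lengths, batch_size))
print("in-flight batching steps:", inflight_batching_steps(lengths, batch_size))
```

With a mixed workload like this one, the in-flight variant finishes in fewer steps because short requests no longer hold a slot idle while a long request in the same batch completes. The size of the gain depends entirely on how varied the request lengths are.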



Nvidia says that TensorRT-LLM integrates a deep learning compiler with optimized kernels, pre- and post-processing steps, and multi-GPU/multi-node communication primitives, ensuring that models run more efficiently on its GPUs. This integration is complemented by a modular Python API, which gives developers a friendly interface for extending the capabilities of the software and hardware without delving deep into complex programming languages. For example, MosaicML has seamlessly added specific features that it needed on top of TensorRT-LLM and integrated them into its inference serving.

“TensorRT-LLM is easy to use, feature-packed with streaming of tokens, in-flight batching, paged-attention, quantization, and more, and is efficient,” said Naveen Rao, vice president of engineering at Databricks. “It delivers state-of-the-art performance for LLM serving using NVIDIA GPUs and allows us to pass on the cost savings to our customers.”



The performance of Nvidia’s H100 when coupled with TensorRT-LLM is impressive. According to Nvidia, the Hopper-based H100 GPU paired with TensorRT-LLM outperforms the previous-generation A100 GPU by a factor of eight. Furthermore, when running the Llama 2 model developed by Meta, the H100 with TensorRT-LLM delivered 4.6x the inference performance of the A100. These figures underscore how much of the gain in AI inference can come from software as well as hardware.

Lastly, the H100 GPUs, when used in conjunction with TensorRT-LLM, support the FP8 format. According to Nvidia, this capability reduces memory consumption without any loss in model accuracy, which is beneficial for enterprises that have limited budgets and/or datacenter space and cannot install a sufficient number of servers to tune their LLMs.
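A rough back-of-the-envelope calculation shows where the memory saving comes from. The sketch below assumes that only the model weights are counted (activations and the attention cache are ignored) and uses the rounded six-billion-parameter figure for GPT-J quoted above. The helper name is hypothetical.

```python
# Bytes needed to store one parameter in each numeric format.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "fp8": 1}


def weight_memory_gb(num_params, fmt):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[fmt] / 1e9


gptj_params = 6_000_000_000  # rounded parameter count from the article

for fmt in ("fp16", "fp8"):
    print(f"{fmt}: {weight_memory_gb(gptj_params, fmt):.1f} GB of weights")
```

Under these assumptions, moving from 16-bit to 8-bit weights halves the weight footprint. Whether accuracy is preserved depends on the model and the quantization recipe, which this arithmetic does not capture.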


Anton Shilov is a Freelance News Writer at Tom’s Hardware US. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers and from modern process technologies and latest fab tools to high-tech industry trends.


The post Nvidia Claims Doubled Inference Performance with H100 | Tom’s Hardware first appeared on
