NVIDIA has announced improvements in large language model (LLM) inference performance, promising better return on investment for organisations deploying AI applications.
Dave Salvator, writing for NVIDIA's official blog on October 9, 2024, detailed the company's ongoing efforts to optimise LLM performance across its GPU platforms. These efforts are crucial for applications requiring high throughput and low latency in real-time environments.
"NVIDIA regularly optimises the state-of-the-art community models, including Meta's Llama, Google's Gemma, Microsoft's Phi and our own NVLM-D-72B," Salvator wrote. This continuous improvement allows customers to run more complex models or reduce the infrastructure needed to host existing ones.
A standout achievement highlighted in the blog post is the 3.5x improvement in minimum latency performance for the open-source Llama 70B model in less than a year. This progress is attributed to enhancements in NVIDIA's TensorRT-LLM library, which is specifically designed for LLM inference on NVIDIA GPUs.
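As a rough illustration of how developers typically drive that library, the sketch below uses TensorRT-LLM's high-level Python LLM API to generate text from a Llama-family checkpoint. The model name, sampling settings and argument names here are illustrative assumptions, not details from the blog post, and may differ between TensorRT-LLM releases.

```python
# Illustrative sketch only: assumes a recent TensorRT-LLM release that ships
# the high-level LLM API; check the installed version for exact argument names.
from tensorrt_llm import LLM, SamplingParams

# Hypothetical model choice for illustration; any supported Llama checkpoint works.
llm = LLM(model="meta-llama/Llama-2-70b-chat-hf")

params = SamplingParams(temperature=0.8, top_p=0.95)

# TensorRT-LLM builds an optimised engine for the target GPUs, then runs inference.
outputs = llm.generate(["Summarise the benefits of optimised LLM inference."], params)
for out in outputs:
    print(out.outputs[0].text)
```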
The article also touched on NVIDIA's recent submission to the MLPerf Inference 4.1 benchmark, featuring the new Blackwell platform. This submission demonstrated a 4x performance increase over the previous generation and marked the first-ever use of FP4 precision in MLPerf, showcasing Blackwell's advanced capabilities.
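To see why lower precision matters so much, the back-of-the-envelope calculation below estimates the weight-memory footprint of a 70B-parameter model at different precisions; it is a generic illustration, not a figure from the MLPerf submission.

```python
# Approximate weight-memory footprint of a 70B-parameter model by precision.
PARAMS = 70e9
for name, bits in [("FP16", 16), ("FP8", 8), ("FP4", 4)]:
    gib = PARAMS * bits / 8 / 1024**3
    print(f"{name}: ~{gib:.0f} GiB of weights")
# FP16: ~130 GiB, FP8: ~65 GiB, FP4: ~33 GiB. The memory freed by lower-precision
# weights can go to KV cache and larger batches, which is a big part of the speedup.
```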
Salvator emphasised the importance of parallelism techniques in LLM deployments, noting that the choice between tensor and pipeline parallelism depends on specific application requirements. For instance, tensor parallelism can deliver over 5x more throughput in minimum latency scenarios, while pipeline parallelism offers 50% more performance for maximum throughput use cases.
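As a rough sketch of how that choice surfaces in a deployment, the helper below picks a parallel mapping for a hypothetical eight-GPU server depending on whether latency or throughput is the goal. The function and field names are invented for illustration and are not part of NVIDIA's tooling.

```python
# Illustrative helper, not from the NVIDIA post: choose a parallel mapping for a
# hypothetical multi-GPU server based on whether latency or throughput matters most.
from dataclasses import dataclass

@dataclass
class ParallelConfig:
    tensor_parallel_size: int    # shards each layer across GPUs (favours low latency)
    pipeline_parallel_size: int  # splits layers into stages (favours high throughput)

def choose_parallelism(num_gpus: int, optimise_for: str) -> ParallelConfig:
    """Return a parallel mapping for the stated goal.

    'latency'    -> tensor parallelism, so every GPU works on each token.
    'throughput' -> pipeline parallelism, so stages serve different requests
                    concurrently with less inter-GPU synchronisation per token.
    """
    if optimise_for == "latency":
        return ParallelConfig(tensor_parallel_size=num_gpus, pipeline_parallel_size=1)
    if optimise_for == "throughput":
        return ParallelConfig(tensor_parallel_size=1, pipeline_parallel_size=num_gpus)
    raise ValueError("optimise_for must be 'latency' or 'throughput'")

print(choose_parallelism(8, "latency"))     # ParallelConfig(tensor_parallel_size=8, pipeline_parallel_size=1)
print(choose_parallelism(8, "throughput"))  # ParallelConfig(tensor_parallel_size=1, pipeline_parallel_size=8)
```

In practice, real deployments can also mix the two schemes, but the basic trade-off is the one Salvator describes: tensor parallelism spends interconnect bandwidth to cut per-token latency, while pipeline parallelism keeps more GPUs independently busy to raise aggregate throughput.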
"These improvements translate into additional value for customers who train and deploy on our platforms. They're able to create more capable models and applications and deploy their existing models using less infrastructure, enhancing their ROI," Salvator explained.