NVIDIA has released Mistral-NeMo-Minitron 8B, a new small language model that combines state-of-the-art accuracy with computational efficiency, challenging the traditional trade-off between model size and performance in generative AI.
The model, a miniaturised version of the open Mistral NeMo 12B, excels across multiple benchmarks for AI-powered applications while being compact enough to run on an NVIDIA RTX-powered workstation.
"We combined two different AI optimisation methods — pruning to shrink Mistral NeMo's 12 billion parameters into 8 billion, and distillation to improve accuracy," said Bryan Catanzaro, vice president of applied deep learning research at NVIDIA. "By doing so, Mistral-NeMo-Minitron 8B delivers comparable accuracy to the original model at lower computational cost."
This breakthrough allows organisations with limited resources to deploy generative AI capabilities across their infrastructure while optimising for cost, operational efficiency, and energy use. The model's ability to run locally on edge devices also offers security benefits by eliminating the need to transmit data to servers.
Mistral-NeMo-Minitron 8B leads on nine popular benchmarks for language models of its size, covering tasks such as language understanding, common-sense reasoning, mathematical reasoning, summarisation, coding, and generating truthful answers.
NVIDIA's innovative approach combines pruning, which removes less important model weights, with distillation, which retrains the pruned model on a small dataset to boost accuracy. This technique results in a smaller, more efficient model that maintains the predictive accuracy of its larger counterpart while requiring only a fraction of the original dataset for training.
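The two steps described above can be sketched in miniature. NVIDIA's actual pipeline uses structured depth/width pruning and large-scale distillation; the simple magnitude-pruning heuristic, function names, and toy matrices below are illustrative assumptions only, not the company's method:

```python
import numpy as np

def magnitude_prune(weights, keep_ratio):
    """Zero out the smallest-magnitude weights, keeping `keep_ratio` of them.
    (Illustrative unstructured pruning; NVIDIA uses structured pruning of
    whole layers/widths, which this sketch does not reproduce.)"""
    flat = np.abs(weights).ravel()
    k = int(len(flat) * keep_ratio)
    threshold = np.sort(flat)[-k]          # k-th largest magnitude
    mask = np.abs(weights) >= threshold    # keep only the largest weights
    return weights * mask

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between teacher and student soft targets (classic
    Hinton-style distillation); retraining the pruned student minimises
    this loss so it mimics the larger teacher's predictions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return float(np.sum(p * (np.log(p) - np.log(q))))

# Toy demo: prune a weight matrix down to half its entries, then measure
# how far the "student" predictions drift from the "teacher".
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
x = rng.normal(size=(1, 4))
W_pruned = magnitude_prune(W, keep_ratio=0.5)
print(np.count_nonzero(W_pruned))                 # 8 of 16 weights survive
print(distillation_loss(x @ W_pruned, x @ W))     # gap distillation would close
```

The demo shows the intuition: pruning alone creates a gap between student and teacher outputs, and the distillation loss quantifies that gap as a training target for the smaller model.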