Following last month's release of the Llama 3.2 1B and 3B models, Meta has announced quantised versions of both, achieving an average 56% reduction in model size and a 41% reduction in memory usage while maintaining the original models' quality and safety standards.
The newly optimised models demonstrate 2-4x faster inference, with testing on the OnePlus 12 Android device showing decode latency improved by 2.5x and prefill latency improved by 4.2x on average. Given the limited runtime memory available on mobile devices, Meta has prioritised short-context applications of up to 8K tokens for these new quantised models.
"We want to make it easier for more developers to build with Llama, without needing significant compute resources and expertise," Meta stated in its announcement. The company has developed these models using two distinct techniques: Quantisation-Aware Training with LoRA adaptors (QLoRA) for accuracy, and SpinQuant for portability.
The quantisation scheme involves three components (a minimal sketch follows the list):
- All linear layers in transformer blocks use 4-bit groupwise quantisation for weights and 8-bit per-token dynamic quantisation for activations
- The classification layer employs 8-bit per-channel quantisation for weights and 8-bit per-token dynamic quantisation for activations
- The embedding layer uses 8-bit per-channel quantisation
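The sketch below illustrates the first component, the W4A8 scheme used for the transformer linear layers: 4-bit groupwise weights, where each small group of values shares a scale so outliers only affect their neighbours, and dynamic 8-bit per-token activations, where each token's scale is computed at runtime from its actual range. The group size of 32 and symmetric ranges are illustrative assumptions; real on-device kernels also keep the matmul in the integer domain rather than dequantising to float as this reference does.

```python
import torch

def quantize_weights_w4_groupwise(w: torch.Tensor, group_size: int = 32):
    """Symmetric 4-bit groupwise weight quantisation: each group of
    `group_size` values along the input dimension gets its own scale."""
    out_f, in_f = w.shape
    groups = w.reshape(out_f, in_f // group_size, group_size)
    scales = groups.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0
    q = (groups / scales).round().clamp(-8, 7).to(torch.int8)  # int4 range
    return q, scales

def quantize_activations_a8_per_token(x: torch.Tensor):
    """Dynamic 8-bit per-token activation quantisation: one scale per
    token (row), computed on the fly from that token's actual values."""
    scales = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    q = (x / scales).round().clamp(-128, 127).to(torch.int8)
    return q, scales

def linear_w4a8_reference(x, q_w, w_scales):
    """Float reference of a W4A8 linear layer (dequantise, then matmul)."""
    q_x, x_scales = quantize_activations_a8_per_token(x)
    w = (q_w.float() * w_scales).reshape(q_w.shape[0], -1)  # dequantise weights
    return (q_x.float() * x_scales) @ w.T

torch.manual_seed(0)
w, x = torch.randn(16, 64), torch.randn(4, 64)
q_w, w_scales = quantize_weights_w4_groupwise(w)
print((linear_w4a8_reference(x, q_w, w_scales) - x @ w.T).abs().max())
```

Storing one scale per group of 32 weights, rather than one per tensor, is what keeps 4-bit weights usable: the printed error against the full-precision matmul stays small because no single outlier can stretch the quantisation range of the whole matrix.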
Through collaboration with industry partners, the models are now available on Qualcomm and MediaTek SoCs with Arm CPUs. Performance has been verified on the Samsung S24+ for both the 1B and 3B models, and on the Samsung S22 for the 1B model. While the models run with comparable accuracy on iOS devices, performance metrics haven't yet been evaluated on Apple's platform.