
Post-Training Quantization of LLMs with NVIDIA NeMo and NVIDIA TensorRT Model Optimizer


As large language models (LLMs) continue to grow, so does the cost of serving them, making easy-to-use and efficient deployment paths increasingly important. One way to reduce this cost is to apply post-training quantization (PTQ), a set of techniques that reduce the computational and memory requirements of serving trained models.

In this post, we provide an overview of how PTQ is implemented in NVIDIA NeMo. It builds on NVIDIA TensorRT Model Optimizer, a library that quantizes and compresses deep learning models for optimized inference on GPUs, and NVIDIA TensorRT-LLM, an open-source library for optimizing LLM inference. We present both accuracy and performance results for quantized models. Throughout the examples, we use the Llama 3 models.

PTQ is a natural extension of NeMo's LLM building and customization capabilities, providing seamless and efficient deployment paths through NVIDIA TensorRT Model Optimizer and NVIDIA TensorRT-LLM. As an example, NVIDIA NIM benefits from the PTQ workflow in NeMo.

From a technical perspective, quantization has several benefits: 

  • It reduces model size, making it possible to deploy on fewer GPUs or on GPUs with less total memory available (see the rough sizing sketch after this list).
  • It reduces memory bandwidth pressure by using fewer-bit data types.
  • It significantly speeds up matrix multiplication (GEMM) operations on NVIDIA architectures, for example, up to 2x for FP8 compared to the FP16/BF16 data types in microbenchmarks.
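
To make the first point concrete, here is a rough back-of-the-envelope sizing sketch. It counts only model weights and ignores the KV cache, activations, and runtime overheads, so treat the numbers as illustrative rather than measured.

# Rough weight-only memory estimate per precision (illustrative; ignores KV cache and activations)
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    return num_params * bits_per_param / 8 / 1e9

for precision, bits in [("FP16/BF16", 16), ("FP8/INT8", 8), ("INT4", 4)]:
    print(f"Llama 3 70B weights in {precision}: ~{weight_memory_gb(70e9, bits):.0f} GB")

# ~140 GB in FP16/BF16, ~70 GB in FP8/INT8, and ~35 GB in INT4: lower-precision formats
# let the same model fit on fewer GPUs or on GPUs with less memory.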

Overview of NeMo features

NVIDIA NeMo is an end-to-end platform for developing custom generative AI, anywhere. It includes tools for training, fine-tuning, retrieval-augmented generation, guardrailing, and data curation, as well as pretrained models, offering enterprises an easy, cost-effective, and fast way to adopt generative AI.

After you build a model in NeMo using a wide array of options offered by the toolkit, NeMo export and deployment tools can be used to apply PTQ methods and serve the optimized model.

The recent NeMo container release is a self-contained toolkit that ships with all the dependencies required to apply PTQ and deploy quantized LLMs.

NeMo and TensorRT Model Optimizer support quantization for a broad range of model families.

PTQ also comes with multi-node support for calibrating the largest LLMs using appropriate tensor and pipeline parallelism settings.

Quantizing and deploying NeMo models

At a high level, the PTQ workflow consists of the following steps:

  1. Loading a model.
  2. Calibrating the model to obtain scaling factors for lower-precision GEMMs and exporting the quantized model to a TensorRT-LLM checkpoint.
  3. Building the TensorRT-LLM engine.
  4. Deploying the model (for example, using PyTriton).

Figure 1. PTQ workflow in NeMo: loading, calibrating, exporting, building, and deploying

Loading the model

A typical PTQ use case starts with a model trained in a high-precision format, for example, FP16 or BF16, that should be served in a lower-precision data type, for example, FP8. The input model can be a foundation or instruction-tuned LLM obtained from previous pipeline steps.

NeMo also offers community model converters for a wide array of models that can be used to produce corresponding NeMo checkpoints.

Calibrating and exporting the quantized model

In PTQ, calibration is the process of obtaining scaling factors for the matrix multiplication operations performed in model layers so that they can be computed in lower-precision formats than those used for training.

This step can be conveniently launched directly from the NeMo container (using torchrun, for example) or with NeMo Framework Launcher on Slurm clusters for multi-node use cases.

In short, the quantization code boils down to the following example:

from nemo.collections.nlp.models.language_modeling.megatron_gpt_model import MegatronGPTModel
from nemo.export.quantize import Quantizer

# Set quantization and export configs appropriately, see https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/conf/megatron_gpt_ptq.yaml

quantizer = Quantizer(quantization_config, export_config)

model = MegatronGPTModel.restore_from(...)

dataloader = ...  # A dataloader that yields lists of strings

def forward_loop(model):
    # Model forward pass for collecting activation statistics for calibration
    ...

model = quantizer.quantize(model, forward_loop)

# Save the calibrated model as a TensorRT-LLM checkpoint, ready for engine building
quantizer.export(model)
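
To make the elided pieces concrete, the following is a minimal sketch rather than the exact script internals. It assumes a calibration dataloader that yields small lists of raw text strings and that the model's predict_step accepts such batches, similar in spirit to what megatron_gpt_ptq.py does; the dataset, sample count, and batch size are illustrative choices only.

# Minimal illustrative sketch of the "..." placeholders above (assumptions noted in the text)
from datasets import load_dataset  # Hugging Face datasets, used here only to obtain sample text

calib_samples = load_dataset("cnn_dailymail", "3.0.0", split="train").select(range(512))
texts = [sample["article"][:1024] for sample in calib_samples]

batch_size = 8
dataloader = [texts[i : i + batch_size] for i in range(0, len(texts), batch_size)]

def forward_loop(model):
    # Forward passes over the calibration batches let TensorRT Model Optimizer
    # record activation statistics for the layers being quantized
    for i, batch in enumerate(dataloader):
        model.predict_step(batch, i)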

The full megatron_gpt_ptq.py script is the entry point for the calibration workflow. Important quantization parameters are specified in the megatron_gpt_ptq.yaml config, and the default settings are recommended. Most importantly, the available low-precision formats and quantization algorithms are FP8, INT4 AWQ, and INT8 SQ.
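
As one possible way to construct the config objects passed to Quantizer, they can be loaded from that YAML file with OmegaConf. The section names below follow megatron_gpt_ptq.yaml as linked earlier but may differ between NeMo releases, so check them against the version you use.

# Illustrative only: section names follow megatron_gpt_ptq.yaml and may change between releases
from omegaconf import OmegaConf
from nemo.export.quantize import Quantizer

cfg = OmegaConf.load("examples/nlp/language_modeling/conf/megatron_gpt_ptq.yaml")
quantizer = Quantizer(cfg.quantization, cfg.export)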

Typically, the choice of dataset does not significantly impact accuracy. However, for highly domain-specific applications, such as code completion models like StarCoder2, using a code dataset is recommended to estimate calibration statistics accurately.

The final stage of calibration is to save the model in the TensorRT-LLM checkpoint format, which is suitable for building a TensorRT-LLM engine in the next step.

Overall, calibration takes a matter of minutes on an NVIDIA DGX H100 node, even for a moderately sized 70B-parameter model calibrated with tensor parallelism.

Building the TensorRT-LLM engine

Before running TensorRT-LLM, you build the inference engine by compiling a set of binaries that take into account optimizations for the specific GPU hardware, model architecture, and inference settings. 

Use the same API as for regular NeMo models to build engines for the quantized checkpoint obtained in the calibration step. To build FP8 engines, this step must be run on compute resources with FP8 support, for example, NVIDIA H100 (Hopper) or NVIDIA L40 (Ada Lovelace) GPUs.

The following Python commands show how to build a TensorRT-LLM engine and pass an example prompt through the model.

from nemo.export.tensorrt_llm import TensorRTLLM

trt_llm_exporter = TensorRTLLM(model_dir="path/to/trt_llm_engine")
trt_llm_exporter.export(
    nemo_checkpoint_path="path/to/model_qnemo",
    max_batch_size=8,
    max_input_len=2048,
    max_output_len=512,
)
trt_llm_exporter.forward(["How does PTQ work?"])

The export command typically takes several minutes to build the TensorRT-LLM engine and save it to the directory specified by the model_dir parameter.

Deploying the model

A given TensorRT-LLM engine can be conveniently deployed using PyTriton:

from nemo.deploy import DeployPyTriton
from nemo.export.tensorrt_llm import TensorRTLLM


trt_llm_exporter = TensorRTLLM(model_dir="path/to/trt_llm_engine")

nm = DeployPyTriton(
    model=trt_llm_exporter,
    triton_model_name="llama3_70b_fp8",
    port=8000,
)
nm.deploy()
nm.serve()

Finally, on the client side, the NeMo Framework provides a dedicated class for sending queries to the server. The following code example shows how to use it.

from nemo.deploy.nlp import NemoQueryLLM


nq = NemoQueryLLM(
    url="localhost:8000",
    model_name="llama3_70b_fp8",
)

nq.query_llm(
    prompts=["How does PTQ work?"],
    top_k=1,
)

Llama 3 PTQ example and results

For demonstration purposes, we present Llama 3 PTQ throughput and accuracy results for two pretrained Llama 3 model variants: 8B and 70B. We evaluated TensorRT-LLM engine performance and accuracy using the benchmark.py and mmlu.py scripts, respectively.

The following results were obtained for NVIDIA H100 80GB GPUs with TensorRT-LLM 0.12.0 and TensorRT Model Optimizer 0.17.0. The latest NeMo Framework container provides the complete software stack.

Accuracy results

Figure 2 shows MMLU results for two Llama 3 model sizes across different quantization methods, along with the baseline FP16 result.

Figure 2. MMLU accuracy results for Llama 3 8B and 70B; the FP16 baselines score 0.654 and 0.790, respectively

Notably, FP8 quantization preserves accuracy to the highest extent. In the case of INT8 SQ, for both Llama 3 model sizes, we found that tuning the SmoothQuant alpha parameter can improve accuracy. This parameter balances the quantization difficulty between weights and activations and can be conveniently set in the quantization config. For both Llama 3 model sizes, an intermediate value of alpha=0.8 yields the best MMLU results.

In Table 1, the percentage in parentheses is the fraction of the baseline FP16 score and measures how well a given scenario preserves accuracy.

| Model       | FP16  | FP8           | INT8 SQ       | INT4 AWQ      |
|-------------|-------|---------------|---------------|---------------|
| Llama 3 8B  | 0.654 | 0.649 (99.2%) | 0.629 (96.2%) | 0.629 (96.2%) |
| Llama 3 70B | 0.790 | 0.787 (99.6%) | 0.772 (97.7%) | 0.777 (98.4%) |

Table 1. MMLU accuracy results for Llama 3 models
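
For example, the percentage shown for the Llama 3 8B FP8 entry is simply the ratio of the quantized score to the FP16 baseline, both taken from Table 1:

# How the relative-accuracy percentages in Table 1 are computed (Llama 3 8B, FP8 entry)
baseline_fp16 = 0.654
fp8_mmlu = 0.649
print(f"{fp8_mmlu / baseline_fp16:.1%}")  # 99.2%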

Performance results

Figure 3 shows inference speedups defined as the throughput ratio of a quantized model over the FP16 baseline for different quantization methods and two Llama 3 model sizes. The exact throughput results achieved are detailed later in this post. 

In all the experiments, we used input and output lengths of 2048 and 512, respectively, to build TensorRT-LLM engines and collect performance data. These values can be considered representative parameters for text summarization scenarios.

Figure 3. Inference speedup over the FP16 baseline for FP8, INT8 SQ, and INT4 AWQ; the largest gains are 1.81x for FP8 and INT8 SQ at batch size 32 and 2.66x for INT4 AWQ at batch size 1, both for the Llama 3 70B model

The following tables show the number of GPUs used to build the engine for a given quantization format as well as the FP16 baseline results for two batch sizes, 32 and 1. The throughput is normalized with respect to the number of GPUs used.

| Model       | Format   | GPUs | Throughput [tokens/sec] | Speedup |
|-------------|----------|------|-------------------------|---------|
| Llama 3 8B  | FP16     | 1    | 2293.08                 | –       |
|             | FP8      | 1    | 3330.85                 | 1.45    |
|             | INT8 SQ  | 1    | 3203.50                 | 1.40    |
|             | INT4 AWQ | 1    | 2475.98                 | 1.08    |
| Llama 3 70B | FP16     | 4    | 256.10                  | –       |
|             | FP8      | 2    | 464.22                  | 1.81    |
|             | INT8 SQ  | 2    | 463.25                  | 1.81    |
|             | INT4 AWQ | 2    | 360.57                  | 1.41    |

Table 2. TensorRT-LLM engine throughput results for batch size = 32 for the baseline and quantized Llama 3 models
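
As a concrete check of how the speedup column follows from the GPU-normalized throughput values, here is the Llama 3 70B FP8 entry recomputed from the two numbers in Table 2:

# Speedup = quantized throughput / FP16 baseline throughput (both already normalized per GPU)
fp16_baseline = 256.10  # Llama 3 70B, FP16, 4 GPUs
fp8_quantized = 464.22  # Llama 3 70B, FP8, 2 GPUs
print(round(fp8_quantized / fp16_baseline, 2))  # 1.81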

We observed 1.45x, 1.40x, and 1.08x speedups for FP8, INT8 SQ, and INT4 AWQ, respectively, for the smaller Llama 3 variant. In the case of the larger model, the speedup is up to 1.81x for both FP8 and INT8 SQ.

INT4 AWQ is a weight-only quantization method that is recommended for small batch sizes. It mainly reduces memory bandwidth pressure, because weights are read in 4-bit precision, but the workload becomes compute-bound at larger batch sizes, which limits the benefit.
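
As a rough illustration of why weight-only quantization shines at small batch sizes, consider that at batch size 1 decoding each token requires streaming essentially all model weights from GPU memory. The sketch below is a crude bandwidth-bound estimate, not a measurement, and the ~3 TB/s figure is an assumed round number for an H100-class GPU.

# Back-of-the-envelope decoding ceiling at batch size 1 (illustrative assumptions only)
num_params = 70e9          # Llama 3 70B
hbm_bandwidth_gb_s = 3000  # assumed ~3 TB/s of GPU memory bandwidth (round number)

for fmt, bits in [("FP16", 16), ("INT4 AWQ", 4)]:
    weight_gb = num_params * bits / 8 / 1e9
    tokens_per_sec = hbm_bandwidth_gb_s / weight_gb
    print(f"{fmt}: ~{weight_gb:.0f} GB read per token -> at most ~{tokens_per_sec:.0f} tokens/sec")

# The 4-bit weights raise the bandwidth-bound ceiling by roughly 4x; at larger batch sizes
# the GEMMs become compute-bound instead, so the weight-only benefit shrinks.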

We present results for batch size = 1 for comparison. In this case, we obtained up to 1.56x and 2.66x performance benefits over the FP16 baseline for the Llama 3 8B and Llama 3 70B models, respectively. All the quantized variants of the Llama 3 70B model can be served using only one NVIDIA H100 GPU while the baseline FP16 precision requires at least two GPUs.

MODELQFORMATGPUsTHROUGHPUT [TOKENS/SEC]SPEEDUP
LLAMA 3 8BFP161135.79
FP81170.751.26
INT8 SQ1158.901.17
INT4 AWQ1211.501.56
LLAMA 3 70BFP16217.75
FP8132.641.84
INT8 SQ132.181.81
INT4 AWQ147.132.66
Table 3. TensorRT-LLM engine throughput results for batch size = 1 for the baseline and quantized Llama 3 models

The throughput numbers reported should not be considered peak performance, as they could be further improved using other TensorRT-LLM features, such as in-flight batching.

We also examined performance statistics using the TensorRT-LLM gptManagerBenchmark tool, focusing on the FP16 baseline and FP8 quantized engines for batch size = 32.

In the case of the Llama 3 8B model, the time to first token (TTFT) improvements and inter-token latency (ITL) speedups are roughly equivalent to the throughput-based speedups reported earlier in this post.

For the larger Llama 3 70B model, both the TTFT and ITL results achieved by quantized engines running on 2x fewer GPUs are similar to the baseline FP16 results. This directly translates into 2x savings for resources used. With PTQ, models can be served more efficiently using fewer GPUs.

Summary

This post showed you how to use PTQ in NeMo to build efficient TensorRT-LLM engines for LLM deployment. Looking ahead, the number of bits used to represent models will decrease further, as FP4 support arrives with the next-generation NVIDIA B100 Blackwell architecture.

It is also worth mentioning that for some applications PTQ may be sufficient, while others may require quantization-aware training (QAT) techniques to fine-tune quantized weights and maintain model accuracy. QAT is also available in NeMo to meet these needs.

For more information, see Post-Training Quantization (PTQ). The entry point for PTQ is the megatron_gpt_ptq.py script. You may also find the NeMo Framework Post-Training Quantization (PTQ) playbook useful. It guides you through the whole deployment process using two example models: Llama 3 and Nemotron-340b. 

As for QAT, the entry point is the megatron_gpt_qat.py script and the corresponding playbook is NeMo Framework Quantization Aware Training (QAT) for Llama2 SFT Model. For more information, see Best Practices for Tuning the Performance of TensorRT-LLM.

Acknowledgments

The help of many dedicated engineers across various teams at NVIDIA is greatly appreciated for their contributions to successful NeMo and TensorRT Model Optimizer integration, including Asma Kuriparambil Thekkumpate, Keval Morabia, Wei-Ming Chen, Huizi Mao, Ao Tang, Dong Hyuk Chang, Alexandros Koumparoulis, Enwei Zhu, and Simon Layton.
