NVIDIA Triton Inference Server Achieves Outstanding Performance in MLPerf Inference 4.1 Benchmarks

Six years ago, we set out to build, from the ground up, an AI inference serving solution designed specifically for high-throughput, time-sensitive production use cases. At the time, ML developers were deploying bespoke, framework-specific AI serving solutions that drove up operational costs and failed to meet their latency and throughput service-level agreements.

We made the decision early on to build a versatile open-source server capable of serving any model, irrespective of its AI backend framework. 

Today, NVIDIA Triton Inference Server stands as one of NVIDIA’s most widely downloaded open-source projects, used by some of the world’s leading organizations to deploy AI models in production, including Amazon, Microsoft, Oracle Cloud, American Express, Snap, Docusign, and many others. 

We are excited to announce that NVIDIA Triton, running on a system with eight H200 GPUs, has reached a significant milestone, delivering virtually identical performance to the NVIDIA bare-metal submission on the Llama 2 70B benchmark in MLPerf Inference v4.1. This shows that enterprises no longer need to choose between a feature-rich, production-grade AI inference server and peak throughput performance; they can achieve both simultaneously with NVIDIA Triton.

This post explores the NVIDIA Triton key features that have driven its rapid adoption and provides detailed insights into our MLPerf Inference v4.1 results.

NVIDIA Triton key features

NVIDIA Triton is an open-source AI model-serving platform that streamlines and accelerates the deployment of AI inference workloads in production. It helps ML developers and researchers reduce the complexity of model-serving infrastructure, shorten the time needed to deploy new AI models, and increase AI inferencing and prediction capacity.

NVIDIA Triton key features include the following:

  • Universal AI framework support
  • Seamless cloud integration
  • Business logic scripting
  • Model Ensembles
  • Model Analyzer

Universal AI framework support

When NVIDIA Triton launched in 2018, it initially supported the NVIDIA TensorRT backend, a high-performance SDK for running optimized AI models on NVIDIA GPUs. Since then, it has expanded its support to include CPUs and encompass all major frameworks:

  • TensorFlow
  • PyTorch
  • ONNX
  • OpenVINO
  • Python
  • RAPIDS FIL
  • TensorRT-LLM
  • vLLM

Today, developers use NVIDIA Triton in production to accelerate time to market for AI applications. Instead of deploying a new framework-specific server for each new use case that arises, you can seamlessly load a new model into an existing NVIDIA Triton production instance, regardless of its backend framework. This capability reduces the time to market for new use cases from months to mere minutes.
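As a minimal sketch of what this looks like in practice, the following Python snippet uses the Triton gRPC client to load a newly added model into a running server. It assumes the server was started with explicit model control (--model-control-mode=explicit), that the model's files are already in the model repository, and that the model name and server address shown are placeholders.

```python
# Minimal sketch: load a newly added model into a running Triton server
# without redeploying the server. Assumes --model-control-mode=explicit
# and that the model files already exist in the model repository.
# The model name and URL are placeholders.
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Ask Triton to load the new model. Its backend (ONNX, PyTorch,
# TensorRT-LLM, and so on) is declared in the model's config.pbtxt.
client.load_model("new_onnx_model")

# Confirm the model is ready to serve requests.
assert client.is_model_ready("new_onnx_model")
```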

NVIDIA Triton also empowers you to streamline operations by eliminating the need to patch, secure, and maintain multiple AI framework–specific inference servers. This reduction in overhead enhances efficiency and enables you to focus more on AI innovation rather than maintenance tasks.

Seamless cloud integration

We collaborated closely with every major cloud service provider to ensure that NVIDIA Triton can be seamlessly deployed in the cloud with minimal or no code required.

Whichever cloud you use, NVIDIA Triton integrates deeply with the AI tools that your IT teams are already certified and trained on. This integration saves valuable setup time, reduces costs, and accelerates developer productivity.

For instance, if you use the OCI Data Science platform, deploying NVIDIA Triton is as simple as passing Triton as an environment variable in your command-line arguments during model deployment, which instantly launches an NVIDIA Triton inference endpoint.

Likewise, with the Azure ML CLI, you can deploy NVIDIA Triton by adding triton_model to your YAML deployment configuration file. 

GCP provides a one-click deployment option through GKE-managed clusters, while AWS offers NVIDIA Triton in its AWS Deep Learning Containers.

NVIDIA Triton also supports popular serving standards such as the KServe inference protocols, ensuring that it scales automatically to meet the evolving needs of your users in a Kubernetes cluster.
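For illustration, here is a rough sketch of an inference request against Triton's KServe v2-compatible HTTP endpoint; the model name, tensor name, shape, and data values are placeholders.

```python
# Rough sketch of a KServe v2 protocol inference request against Triton's
# HTTP endpoint. Model name, tensor name, shape, and data are placeholders.
import requests

payload = {
    "inputs": [
        {
            "name": "INPUT0",        # must match the model's input name
            "shape": [1, 4],
            "datatype": "FP32",
            "data": [0.1, 0.2, 0.3, 0.4],
        }
    ]
}

resp = requests.post(
    "http://localhost:8000/v2/models/my_model/infer", json=payload
)
print(resp.json()["outputs"])
```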

If your organization has standardized on any of the major CSP MLOps tools, you will find a no-code or low-code path to deploying NVIDIA Triton on your favorite cloud.

Business logic scripting

Recognizing the need for organizations to incorporate custom logic and scripts into their AI workloads to differentiate their use cases and tailor them to their end users, we introduced business logic scripting (BLS). This collection of utility functions enables you to seamlessly integrate custom Python or C++ code into production pipelines. 
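As an illustrative sketch (not taken from a production pipeline), the following Python backend model.py uses BLS to call another model deployed on the same server and then applies custom post-processing. The model and tensor names are placeholders and must match the target model's configuration.

```python
# Minimal BLS sketch for Triton's Python backend (model.py). From inside
# execute(), it calls another model already loaded on the same server.
# Model and tensor names below are placeholders.
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            text = pb_utils.get_input_tensor_by_name(request, "TEXT")

            # BLS call: route the input to another model on this server.
            infer_request = pb_utils.InferenceRequest(
                model_name="classifier",
                requested_output_names=["SCORES"],
                inputs=[text],
            )
            infer_response = infer_request.exec()
            if infer_response.has_error():
                raise pb_utils.TritonModelException(
                    infer_response.error().message()
                )

            scores = pb_utils.get_output_tensor_by_name(infer_response, "SCORES")

            # Custom business logic: keep only the top score per request.
            top = np.max(scores.as_numpy(), axis=-1).astype(np.float32)
            out = pb_utils.Tensor("TOP_SCORE", top)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```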

Companies like Snap have used BLS to seamlessly transition workloads from notebooks to production environments.

Model Ensembles

Responding to feedback from users who operate integrated AI pipelines rather than standalone models in production, we developed Model Ensembles. This no-code development tool enables enterprises to effortlessly connect pre- and post-processing workflows into cohesive pipelines.

You can choose to run the pre- and post-processing steps on CPUs and the AI model on GPUs to optimize infrastructure costs, or run the entire pipeline on GPUs for ultra-low-latency applications.
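For illustration, a minimal ensemble definition in a model's config.pbtxt might look like the following sketch, which chains a hypothetical CPU preprocessing model into a GPU classifier; all model and tensor names are illustrative.

```
# Sketch of an ensemble config.pbtxt chaining two hypothetical models.
# All model and tensor names are illustrative.
name: "text_pipeline"
platform: "ensemble"
max_batch_size: 8
input [
  { name: "RAW_TEXT" data_type: TYPE_STRING dims: [ 1 ] }
]
output [
  { name: "SCORES" data_type: TYPE_FP32 dims: [ 2 ] }
]
ensemble_scheduling {
  step [
    {
      model_name: "preprocess"     # e.g., a Python backend model on CPU
      model_version: -1
      input_map { key: "TEXT" value: "RAW_TEXT" }
      output_map { key: "TOKENS" value: "tokenized" }
    },
    {
      model_name: "classifier"     # e.g., a TensorRT model on GPU
      model_version: -1
      input_map { key: "INPUT_IDS" value: "tokenized" }
      output_map { key: "OUTPUT" value: "SCORES" }
    }
  ]
}
```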

Model Analyzer

A standout feature of NVIDIA Triton, Model Analyzer enables you to experiment with various deployment configurations by adjusting the number of concurrent model instances loaded on the GPU and the number of requests that are batched together at inference time. It then visually maps these configurations on an intuitive chart, so you can quickly identify and deploy the most efficient setup for production use.
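The configurations Model Analyzer sweeps correspond to settings in each model's config.pbtxt, such as the instance count and dynamic batching parameters. A hand-written sketch of those knobs, with illustrative values, looks like this:

```
# Sketch of the config.pbtxt knobs that Model Analyzer sweeps: how many
# copies of the model run concurrently on the GPU, and how requests are
# batched together at run time. Values are illustrative.
instance_group [
  {
    count: 2          # two concurrent instances of the model on the GPU
    kind: KIND_GPU
  }
]
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
```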

For organizations deploying LLMs, we introduced GenAI-Perf, a benchmarking tool designed specifically to provide generative AI performance metrics, including first-token latency and token-to-token latency.

Exceptional throughput results in MLPerf Inference v4.1

At this year’s MLPerf Inference v4.1 benchmark, hosted by MLCommons, we showcased the performance of NVIDIA Triton serving a TensorRT-LLM-optimized Llama 2 70B model. We wanted to demonstrate that enterprises can use the advanced production-grade capabilities of NVIDIA Triton without incurring the high latency and throughput overhead typically associated with inference serving platforms.

In our submissions, NVIDIA Triton achieved virtually identical performance to our bare-metal submissions that did not use NVIDIA Triton. This demonstrates that enterprises no longer have to choose between a feature-rich, production-grade AI inference server and peak throughput performance. They can accomplish both at the same time with NVIDIA Triton, as demonstrated by the exceptional performance in MLPerf v4.1.

Figure 1. MLPerf Llama 2 70B performance with and without NVIDIA Triton Inference Server

MLPerf Inference v4.1 Closed, Data Center. Results retrieved from www.mlperf.org on August 28, 2024. All results use eight GPUs and are taken from entries 4.1-0048 and 4.1-0050. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. For more information, see www.mlcommons.org.

MLPerf benchmark submission details

Our submission consisted of the following scenarios: 

  • Offline: All the inputs of the workload are passed to the inference serving system at one time as a single batch.  
  • Server: Input requests are sent discretely to the inference serving system to mimic real-world production deployments. This scenario is more challenging as it imposes strict constraints on first-token latency and inter-token latency, placing stringent expectations on responsiveness and the speed of the inference system.

The NVIDIA Triton implementation of the benchmark included a client and a server. For the client, we used the NVIDIA Triton gRPC client to communicate with the NVIDIA Triton server instance and the benchmark’s load generator. This provided a simple, readable Pythonic interface to an LLM inference server.

For the server, we used NVIDIA Triton Inference Server to provide a gRPC endpoint for interacting with TensorRT-LLM, loosely coupled through the TensorRT-LLM backend. The server is agnostic to the TensorRT-LLM version, as implementation details are abstracted into the backend to provide peak performance.
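As a rough, stripped-down sketch of such a gRPC client (with placeholder model and tensor names, and with the MLPerf load generator and latency accounting omitted):

```python
# Stripped-down sketch of a Triton gRPC client in the spirit of the
# benchmark client described above. Model and tensor names are
# placeholders; MLPerf load generation and latency accounting are omitted.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Token IDs for one request (placeholder values).
input_ids = np.array([[101, 2023, 2003, 1037, 3231, 102]], dtype=np.int32)

inputs = [grpcclient.InferInput("input_ids", list(input_ids.shape), "INT32")]
inputs[0].set_data_from_numpy(input_ids)
outputs = [grpcclient.InferRequestedOutput("output_ids")]

result = client.infer(model_name="llama2_70b", inputs=inputs, outputs=outputs)
print(result.as_numpy("output_ids"))
```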

For round v4.1, we submitted our NVIDIA Triton implementation of the benchmark in the closed division, with both client and server running on the same node and communicating through a loopback interface. This implementation can easily be extended to multi-node scenarios, with the client running on a separate node from the server and the two communicating over gRPC.

NVIDIA Triton also supports an HTTP communication option that could be used for generative AI workloads.

Next in-person user meetup 

While we are excited about our achievements so far, the NVIDIA Triton journey continues.

To foster ongoing open-source innovation, we are thrilled to announce the next NVIDIA Triton user meetup on September 9, 2024, at the Fort Mason Center for Arts & Culture in San Francisco, where we will share new LLM features and envision the future together. Register for the NVIDIA Triton meetup now.

We look forward to seeing you at the event as we embark on the next phase of the NVIDIA Triton journey.
