Deploy NVIDIA Riva ASR on Kubernetes GPU Cluster in Just 5 Minutes

Community Article Published September 5, 2025

Voice is the new keyboard. In a world of smart speakers, voice assistants, and an ever-increasing need for hands-free interaction, Automatic Speech Recognition (ASR) is no longer a futuristic concept but a present-day necessity. From transcribing meetings to powering in-car navigation, ASR is at the heart of the conversational AI revolution.

But how do you, a developer or an MLOps engineer, bring this powerful technology into your own applications? What if I told you that you could deploy a state-of-the-art ASR service on your Kubernetes cluster in the time it takes to brew a cup of coffee?

Welcome to your 5-minute guide to deploying NVIDIA Riva, a powerful SDK for building and deploying speech AI applications. We'll use a GPU-accelerated Kubernetes cluster to get a high-performance ASR service up and running. Ready? Let's dive in!

What is Helm and Why Use It?

Throughout this guide, we'll be using Helm, which is often described as the package manager for Kubernetes. Think of it like apt or yum but for Kubernetes applications. Helm uses "charts" to define, install, and upgrade even the most complex Kubernetes applications. It simplifies deployment, making it repeatable and manageable.
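
If Helm is new to you, the day-to-day workflow comes down to a handful of commands. The repository and release names below are placeholders, not part of this tutorial:

helm repo add <repo-name> <repo-url>      # register a chart repository
helm search repo <keyword>                # find charts in the repositories you added
helm install <release-name> <chart>       # deploy a chart as a named release
helm upgrade <release-name> <chart>       # apply configuration or version changes
helm uninstall <release-name>             # remove the release and its resources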

Step 1: Validate Your Kubernetes Cluster

NVIDIA Riva leverages GPUs to deliver high-performance, low-latency ASR, so your cluster must have NVIDIA GPU support enabled.

Run this command to create a test pod that checks for GPU availability:

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
spec:
  restartPolicy: OnFailure
  containers:
  - name: nvidia-smi
    image: nvidia/cuda:12.0.0-base-ubuntu22.04
    args:
    - "nvidia-smi"
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

Then check the output with: kubectl logs nvidia-smi

If you see your GPU details, you’re all set to run Riva!
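
Once the check passes, the test pod has done its job and can be removed:

kubectl delete pod nvidia-smi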

Step 2: Grab the Riva Helm Chart

Before deploying, you must obtain an API key from NVIDIA's NGC catalog. Generate one here: NGC API Key Setup. Export your API key:

export NGC_API_KEY=<your_api_key>

Fetch the Riva API Helm chart with Helm:

helm fetch https://helm.ngc.nvidia.com/nvidia/riva/charts/riva-api-2.19.0.tgz \
  --username='$oauthtoken' --password=$NGC_API_KEY --untar

This creates a riva-api directory with the deployment files.

Step 3: Customize Your Deployment

Open the values.yaml file inside riva-api:

cd riva-api
nano values.yaml

Here you can enable or disable various ASR, speaker diarization, NMT, and TTS models. For example, uncomment the corresponding lines to enable:

Streaming ASR model (throughput-optimized):
nvidia/riva/rmir_asr_parakeet_0-6b_en_us_str_thr:2.19.0

Offline ASR model:
nvidia/riva/rmir_asr_conformer_en_us_ofl:2.19.0

Enable only the models you need to optimize resource usage.
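
For orientation, the model section of values.yaml looks roughly like the sketch below. The grouping keys can differ between Riva chart versions, so treat this as an illustration of the pattern rather than a copy-paste template:

modelRepoGenerator:
  # Models listed here are pulled from NGC and converted into Triton model repositories.
  ngcModelConfigs:
    # ...the intermediate grouping key varies by chart version...
      models:
        - nvidia/riva/rmir_asr_parakeet_0-6b_en_us_str_thr:2.19.0   # streaming ASR, enabled
        #- nvidia/riva/rmir_asr_conformer_en_us_ofl:2.19.0          # offline ASR, commented out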

Step 4: Deploy Riva!

Run the Helm install command:

helm install riva-api riva-api/ \
  --set ngcCredentials.password=$(echo -n $NGC_API_KEY | base64 -w0) \
  --set modelRepoGenerator.modelDeployKey=$(echo -n tlt_encode | base64 -w0) \
  -f path/to/values.yaml

To deploy in a specific namespace, add -n <namespace>.

Riva will now deploy the API and Triton Inference Server pods, downloading and optimizing models automatically.
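
You can confirm that Helm registered the release before digging into individual pods:

helm list
helm status riva-api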

Step 5: Check Pod Status

Check if pods are running:

kubectl get pods

Wait for pods like riva-api-riva-api-... and riva-api-riva-triton-... to reach Running state.
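
The first install downloads and optimizes the models, which can take a while. Watching the rollout and checking pod events helps if anything stays stuck in Pending or Init:

kubectl get pods -w
kubectl describe pod <pod-name>   # inspect events for a pod that is not progressing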

Step 6: Access the Riva API

Port forward the Riva service to your localhost:

kubectl port-forward service/riva-api 50051:50051 --address=0.0.0.0

The service is now accessible on localhost:50051.
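
To sanity-check the endpoint, you can run a quick offline transcription with the nvidia-riva-client Python package (pip install nvidia-riva-client). The snippet below is a minimal sketch, not part of the Riva deployment itself; sample.wav and the 16 kHz sample rate are placeholders for your own mono WAV file:

import riva.client

# Connect to the port-forwarded Riva server (plain gRPC, no TLS in this local setup).
auth = riva.client.Auth(uri="localhost:50051")
asr_service = riva.client.ASRService(auth)

config = riva.client.RecognitionConfig(
    encoding=riva.client.AudioEncoding.LINEAR_PCM,
    sample_rate_hertz=16000,          # must match the audio file
    language_code="en-US",
    max_alternatives=1,
    enable_automatic_punctuation=True,
)

# "sample.wav" is a placeholder for your own audio file.
with open("sample.wav", "rb") as fh:
    audio_bytes = fh.read()

response = asr_service.offline_recognize(audio_bytes, config)
for result in response.results:
    print(result.alternatives[0].transcript)

If the transcript prints, your own clients can point at the same localhost:50051 endpoint.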

Bonus: Test with Streamlit UI

You can find it on my Medium blog - click here

✨ If you’ve made it this far and found this guide useful, share it with someone who might need it. Let’s keep building smarter voice-powered systems together.

🤝 I’d love to connect and exchange ideas
