CLIP Inference with AMD Ryzen AI
This repository contains a Python script for running CLIP (Contrastive Language-Image Pre-training) model inference on the CIFAR-100 dataset using AMD Ryzen AI NPU or CPU.
The models, caches, and config files in this repo assume you are using Ryzen AI (RAI) 1.5.
Overview
The clip_inference.py script performs zero-shot image classification using OpenAI's CLIP model on the CIFAR-100 dataset. It supports both CPU and NPU (Neural Processing Unit) execution, allowing you to leverage AMD Ryzen AI acceleration for improved performance.
Key Features
- Zero-shot Classification: Classify CIFAR-100 images without fine-tuning
- Dual Execution Modes: Run on CPU or AMD Ryzen AI NPU
- Performance Metrics: Measures latency, throughput, and classification accuracy
- Flexible Dataset Size: Process from 1 to 10,000 images
- ONNX Runtime Integration: Uses optimized ONNX models for inference
Prerequisites
Environment Setup
AMD Ryzen AI Installation: Follow the Ryzen AI Installation Guide to prepare your environment.
Activate Conda Environment:
conda activate <env_name>
Install Dependencies:
pip install -r requirements.txt
Required Files
Ensure the following files are present in the same directory as clip_inference.py:
ONNX Model Files
- clip_text_model.onnx - ONNX text encoder model
- clip_vision_model.onnx - ONNX vision encoder model
Configuration Files (for NPU execution)
- vitisai_config.json - VitisAI configuration
Model Cache Directories
- clip_text_model_cache/ - Cached text model artifacts
- clip_vision_model_cache/ - Cached vision model artifacts
Cache Directory Structure
The cache directories contain pre-compiled model artifacts and optimization files for improved performance.
They eliminate the need for on-device model compilation, which can be time-consuming.
CLIP uses two models, and the caches for both are provided as zip files.
Unzip the cache files and make sure that the directories listed in the Model Cache Directories section above are in the same location as the inference script.
This may require moving the unzipped directories up one level in the directory hierarchy, as in the sketch below.
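For reference, unpacking can be done with a few lines of Python. This is only a sketch, and the archive names below are assumptions; replace them with the names of the zip files you actually downloaded:

```python
# Sketch: unpack the provided cache archives next to clip_inference.py.
# The archive names are assumptions; use the names of the zip files you downloaded.
import os
import zipfile

for archive in ("clip_text_model_cache.zip", "clip_vision_model_cache.zip"):
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(".")

# Confirm the cache directories now sit directly beside the script; if they ended up
# nested one level deeper, move them up manually as described above.
for cache_dir in ("clip_text_model_cache", "clip_vision_model_cache"):
    print(cache_dir, "found" if os.path.isdir(cache_dir) else "NOT found at top level")
```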
clip_text_model_cache/
├── aie_unsupported_original_ops.json
├── context.json
├── final-vaiml-pass-summary.txt
├── gops.csv
├── graph_nodes.json
├── graph_partition_trace.csv
├── original-info-signature.txt
├── original-model-signature.txt
├── partition_io_shapes.json
├── preliminary-vaiml-pass-summary.txt
├── tensor_shape.json
├── cache/
├── vaiml_par_0/
└── vaiml_partition_fe.flexml/

clip_vision_model_cache/
├── aie_unsupported_original_ops.json
├── context.json
├── final-vaiml-pass-summary.txt
├── gops.csv
├── graph_nodes.json
├── graph_partition_trace.csv
├── original-info-signature.txt
├── original-model-signature.txt
├── partition_io_shapes.json
├── preliminary-vaiml-pass-summary.txt
├── tensor_shape.json
├── cache/
├── vaiml_par_0/
└── vaiml_partition_fe.flexml/
Cache Directory Descriptions
- Root Level Files: Contain compilation metadata, graph analysis, and performance summaries
- cache/: Hash-based cache storage for model artifacts
- vaiml_par_0/: Contains compiled model artifacts, MLIR representations, and native libraries
- vaiml_partition_fe.flexml/: Contains optimized ONNX models and visualization files
Note: These cache directories are automatically generated during the first NPU compilation and significantly reduce subsequent startup times.
Usage
Command Line Interface
python clip_inference.py [-h] (--npu | --cpu) [--num_images NUM_IMAGES]
Arguments
Required (mutually exclusive):
- --cpu: Run inference on CPU using CPUExecutionProvider
- --npu: Run inference on NPU using VitisAIExecutionProvider
Optional:
- --num_images: Number of images to process from the CIFAR-100 test set (default: 50, max: 10,000)
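For reference, the device flags behave like a standard argparse mutually exclusive group. The following is a minimal sketch of the interface described above, not the script's exact code:

```python
# Sketch of the CLI above using argparse's mutually exclusive group (illustrative only).
import argparse

parser = argparse.ArgumentParser(description="CLIP zero-shot inference on CIFAR-100")
device = parser.add_mutually_exclusive_group(required=True)
device.add_argument("--cpu", action="store_true", help="run on CPUExecutionProvider")
device.add_argument("--npu", action="store_true", help="run on VitisAIExecutionProvider")
parser.add_argument("--num_images", type=int, default=50,
                    help="number of CIFAR-100 test images to process (max 10,000)")
args = parser.parse_args()
print("device:", "NPU" if args.npu else "CPU", "| images:", args.num_images)
```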
Examples
CPU inference with default settings (50 images):
python clip_inference.py --cpu
NPU inference with 100 images:
python clip_inference.py --npu --num_images 100
NPU inference on complete test dataset:
python clip_inference.py --npu --num_images 10000
How It Works
Model Architecture
- Text Encoder: Processes text descriptions ("a photo of a {class_name}")
- Vision Encoder: Processes CIFAR-100 images (32x32 RGB)
- Classification: Computes similarity between image and text embeddings
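Building the text prompts only requires the CIFAR-100 label names. Here is a short sketch using the Hugging Face datasets library (it downloads the dataset on first use):

```python
# Sketch: build the 100 "a photo of a {class_name}" prompts from CIFAR-100 label names.
from datasets import load_dataset

test_set = load_dataset("cifar100", split="test")
class_names = test_set.features["fine_label"].names   # 100 fine-grained class names
prompts = [f"a photo of a {name}" for name in class_names]
print(len(prompts), prompts[:3])
```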
Inference Pipeline
- Text Processing: Pre-compute text features for all 100 CIFAR-100 class labels
- Image Processing: Process each image through the vision encoder
- Classification: Compute cosine similarity between image and text features
- Prediction: Select the class with highest similarity score
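The classification step reduces to a normalized dot product between embeddings. The sketch below uses random stand-in features and hypothetical helper names rather than the script's actual variables:

```python
# Sketch of the zero-shot classification step: cosine similarity between one image
# embedding and the 100 pre-computed text embeddings (names here are illustrative).
import numpy as np

def classify(image_feature: np.ndarray, text_features: np.ndarray) -> int:
    # L2-normalize so the dot product equals cosine similarity
    image_feature = image_feature / np.linalg.norm(image_feature, axis=-1, keepdims=True)
    text_features = text_features / np.linalg.norm(text_features, axis=-1, keepdims=True)
    similarity = image_feature @ text_features.T   # shape (1, 100)
    return int(similarity.argmax())                # index of the best-matching class

rng = np.random.default_rng(0)
pred = classify(rng.standard_normal((1, 512)), rng.standard_normal((100, 512)))
print("predicted class index:", pred)
```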
Performance Optimization
- NPU Acceleration: Leverages AMD Ryzen AI NPU for faster inference
- Caching: Uses pre-compiled model caches for reduced startup time
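The switch between CPU and NPU comes down to the execution provider passed to ONNX Runtime. The sketch below follows the general Ryzen AI pattern; the VitisAI provider option names (config_file, cacheDir) are taken from AMD's documentation and may differ between releases, so treat them as assumptions rather than the script's exact code:

```python
# Sketch: creating an ONNX Runtime session on the NPU (VitisAI EP) vs. the CPU.
# Provider option names are assumptions based on Ryzen AI documentation.
import onnxruntime as ort

use_npu = True  # corresponds to the --npu flag

if use_npu:
    session = ort.InferenceSession(
        "clip_vision_model.onnx",
        providers=["VitisAIExecutionProvider"],
        provider_options=[{
            "config_file": "vitisai_config.json",    # VitisAI compilation config
            "cacheDir": "clip_vision_model_cache",   # pre-compiled artifacts shipped with the repo
        }],
    )
else:
    session = ort.InferenceSession(
        "clip_vision_model.onnx",
        providers=["CPUExecutionProvider"],
    )
print(session.get_providers())
```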
Output Metrics
The script reports the following performance metrics:
- Text Latency: Average time per text inference (ms)
- Text Throughput: Text inferences per second (inf/s)
- Vision Latency: Average time per image inference (ms)
- Vision Throughput: Image inferences per second (inf/s)
- Classification Accuracy: Percentage of correctly classified images
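These metrics come from wall-clock timing around each inference call. The helper below is a self-contained sketch of the calculation, with a dummy workload standing in for session.run:

```python
# Sketch: average latency (ms) and throughput (inf/s) from repeated timing of one call.
import time

def measure(run_once, n_iters=50):
    latencies = []
    for _ in range(n_iters):
        start = time.perf_counter()
        run_once()                                 # one inference (e.g. session.run(...))
        latencies.append(time.perf_counter() - start)
    avg_s = sum(latencies) / len(latencies)
    return avg_s * 1000.0, 1.0 / avg_s             # (latency in ms, throughput in inf/s)

lat_ms, thr = measure(lambda: sum(range(100_000)))  # dummy workload for illustration
print(f"latency: {lat_ms:.2f} ms, throughput: {thr:.2f} inf/s")
```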
Example Output
NPU Execution (50 images):
Compilation Done
Session on NPU
Processing images...
Image inference: 100%|███████████████████████████████████████████████████████| 50/50 [00:03<00:00, 13.45it/s]
Results:
Text latency: 26.65 ms
Text throughput: 37.52 inf/s
Vision latency: 73.46 ms
Vision throughput: 13.61 inf/s
Classification accuracy: 77.55%
Performance Benchmarks
Expected Results
| Execution Mode | Dataset Size | Accuracy (%) | Text Throughput (inf/s) | Text Latency (ms) | Vision Throughput (inf/s) | Vision Latency (ms) |
|---|---|---|---|---|---|---|
| NPU | 50 | 77.55 | 28.4 | 35.22 | 9.48 | 105.48 |
| NPU | 10,000 | 62.19 | 28.08 | 35.61 | 9.39 | 106.54 |
| CPU | 50 | 75.51 | 58.46 | 17.11 | 40.04 | 24.97 |
| CPU | 10,000 | 61.0 | 58.49 | 17.10 | 40.99 | 24.40 |
Model Specifications
- Image Size: 224x224 (resized from CIFAR-100's 32x32)
- Sequence Length: 77 tokens
- Batch Size: 1
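These shapes match what the Hugging Face CLIPProcessor produces for openai/clip-vit-base-patch32. The sketch below is illustrative; it assumes the ONNX models were exported from that checkpoint and downloads the processor config on first run:

```python
# Sketch: preprocessing a 32x32 CIFAR-100-style image and one prompt to the shapes above.
import numpy as np
from PIL import Image
from transformers import CLIPProcessor

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.fromarray(np.zeros((32, 32, 3), dtype=np.uint8))  # stand-in for a CIFAR-100 image
inputs = processor(
    text=["a photo of an apple"],
    images=image,
    return_tensors="np",
    padding="max_length",   # pad text to the model's 77-token context
)
print(inputs["pixel_values"].shape)  # (1, 3, 224, 224), resized from 32x32
print(inputs["input_ids"].shape)     # (1, 77)
```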
Technical Details
Dependencies
- transformers: Hugging Face transformers library
- datasets: Hugging Face datasets library
- onnxruntime: ONNX Runtime for model inference
- torch: PyTorch for tensor operations
- numpy: Numerical computing
- tqdm: Progress bars
Model Details
- Base Model: OpenAI CLIP ViT-Base-Patch32
- Text Encoder: Transformer-based language model
- Vision Encoder: Vision Transformer (ViT) with 32x32 patches
- Output: 512-dimensional feature vectors
Environment Variables
The script sets the following environment variables:
- XLNX_ENABLE_CACHE=0: Disables certain caching mechanisms
- PATH: Adds the FlexML runtime library path
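In Python this amounts to updating os.environ before any session is created. The FlexML path below is a placeholder, not the value the script uses:

```python
# Sketch: environment variables set before creating ONNX Runtime sessions.
# The FlexML runtime path is a placeholder; the script sets the real location.
import os

os.environ["XLNX_ENABLE_CACHE"] = "0"
os.environ["PATH"] = r"C:\path\to\flexml\runtime" + os.pathsep + os.environ.get("PATH", "")
```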
Troubleshooting
Common Issues
- Missing ONNX Models: Ensure clip_text_model.onnx and clip_vision_model.onnx are in the script directory
- NPU Compilation Errors: Verify VitisAI configuration files are present and correctly formatted
- Memory Issues: Reduce --num_images if encountering out-of-memory errors
- Accuracy Variations: Results may vary slightly due to random sampling and hardware differences
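If the NPU path fails to initialize, a quick check that the VitisAI execution provider is visible to the ONNX Runtime build in the active environment can narrow things down:

```python
# Sketch: confirm the VitisAI execution provider is available in this environment.
import onnxruntime as ort

providers = ort.get_available_providers()
print(providers)
if "VitisAIExecutionProvider" not in providers:
    print("VitisAIExecutionProvider not found; re-check the Ryzen AI install and conda env.")
```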
Performance Tips
- First Run: NPU execution includes compilation time on first run
- Warm-up: Performance metrics exclude warm-up iterations
- Batch Size: Current implementation uses batch size 1 for compatibility
- Cache Directory: Ensure cache directories have write permissions
License
This project uses the OpenAI CLIP model, which is subject to OpenAI's licensing terms. Please refer to the original CLIP repository for license details.