|
# CLIP Inference with AMD Ryzen AI |
|
|
|
This repository contains a Python script for running CLIP (Contrastive Language-Image Pre-training) model inference on the CIFAR-100 dataset using AMD Ryzen AI NPU or CPU. |
|
|
|
**The models, caches, and config files in this repo assume Ryzen AI (RAI) 1.5.**
|
|
|
## Overview |
|
|
|
The `clip_inference.py` script performs zero-shot image classification using OpenAI's CLIP model on the CIFAR-100 dataset. It supports both CPU and NPU (Neural Processing Unit) execution, allowing you to leverage AMD Ryzen AI acceleration for improved performance. |
|
|
|
### Key Features |
|
|
|
- **Zero-shot Classification**: Classify CIFAR-100 images without fine-tuning |
|
- **Dual Execution Modes**: Run on CPU or AMD Ryzen AI NPU |
|
- **Performance Metrics**: Measures latency, throughput, and classification accuracy |
|
- **Flexible Dataset Size**: Process from 1 to 10,000 images |
|
- **ONNX Runtime Integration**: Uses optimized ONNX models for inference |
|
|
|
## Prerequisites |
|
|
|
### Environment Setup |
|
|
|
1. **AMD Ryzen AI Installation**: Follow the [Ryzen AI Installation Guide](https://ryzenai.docs.amd.com/en/latest/inst.html) to prepare your environment. |
|
|
|
2. **Activate Conda Environment**: |
|
```bash |
|
conda activate <env_name> |
|
``` |
|
|
|
3. **Install Dependencies**: |
|
```bash |
|
pip install -r requirements.txt |
|
``` |
|
|
|
### Required Files |
|
|
|
Ensure the following files are present in the same directory as `clip_inference.py`: |
|
|
|
#### ONNX Model Files |
|
- `clip_text_model.onnx` - ONNX text encoder model |
|
- `clip_vision_model.onnx` - ONNX vision encoder model |
|
|
|
#### Configuration Files (for NPU execution) |
|
- `vitisai_config.json` - VitisAI configuration |
|
|
|
#### Model Cache Directories |
|
- `clip_text_model_cache/` - Cached text model artifacts |
|
- `clip_vision_model_cache/` - Cached vision model artifacts |
|
|
|
### Cache Directory Structure |
|
|
|
The cache directories contain pre-compiled model artifacts and optimization files for improved performance. |
|
|
|
They eliminate the need for on-device model compilation, which can be time-consuming.
|
|
|
CLIP uses two models (a text encoder and a vision encoder), so two model caches are provided as zip archives.
|
|
|
Please unzip the cache archives and make sure the directories listed under **Model Cache Directories** above are in the same location as the inference script.
Depending on how the archives extract, this may require moving the unzipped directories up one level in the directory hierarchy.
|
|
|
```
clip_text_model_cache/
├── aie_unsupported_original_ops.json
├── context.json
├── final-vaiml-pass-summary.txt
├── gops.csv
├── graph_nodes.json
├── graph_partition_trace.csv
├── original-info-signature.txt
├── original-model-signature.txt
├── partition_io_shapes.json
├── preliminary-vaiml-pass-summary.txt
├── tensor_shape.json
├── cache/
├── vaiml_par_0/
└── vaiml_partition_fe.flexml/

clip_vision_model_cache/
├── aie_unsupported_original_ops.json
├── context.json
├── final-vaiml-pass-summary.txt
├── gops.csv
├── graph_nodes.json
├── graph_partition_trace.csv
├── original-info-signature.txt
├── original-model-signature.txt
├── partition_io_shapes.json
├── preliminary-vaiml-pass-summary.txt
├── tensor_shape.json
├── cache/
├── vaiml_par_0/
└── vaiml_partition_fe.flexml/
```
|
|
|
#### Cache Directory Descriptions |
|
|
|
- **Root Level Files**: Contain compilation metadata, graph analysis, and performance summaries |
|
- **`cache/`**: Hash-based cache storage for model artifacts |
|
- **`vaiml_par_0/`**: Contains compiled model artifacts, MLIR representations, and native libraries |
|
- **`vaiml_partition_fe.flexml/`**: Contains optimized ONNX models and visualization files |
|
|
|
**Note**: These cache directories are generated automatically during the first NPU compilation; the pre-built caches shipped with this repo let you skip that step and significantly reduce startup time.
|
|
|
## Usage |
|
|
|
### Command Line Interface |
|
|
|
```bash |
|
python clip_inference.py [-h] (--npu | --cpu) [--num_images NUM_IMAGES] |
|
``` |
|
|
|
### Arguments |
|
|
|
**Required (mutually exclusive):** |
|
- `--cpu`: Run inference on CPU using CPUExecutionProvider |
|
- `--npu`: Run inference on NPU using VitisAIExecutionProvider |
|
|
|
**Optional:** |
|
- `--num_images`: Number of images to process from CIFAR-100 test set (default: 50, max: 10,000) |
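
Internally, the two flags select different ONNX Runtime execution providers. Below is a minimal sketch of how the sessions could be created, assuming the VitisAI provider is configured through its `config_file` option as described in the Ryzen AI documentation (the exact provider options used by `clip_inference.py` may differ):

```python
import onnxruntime as ort

def create_session(model_path: str, use_npu: bool) -> ort.InferenceSession:
    """Create an ONNX Runtime session on the Ryzen AI NPU or the CPU."""
    if use_npu:
        # VitisAIExecutionProvider compiles the model for the NPU and reuses
        # the pre-built cache directories on subsequent runs.
        return ort.InferenceSession(
            model_path,
            providers=["VitisAIExecutionProvider"],
            provider_options=[{"config_file": "vitisai_config.json"}],
        )
    # Plain CPU execution.
    return ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
```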
|
|
|
### Examples |
|
|
|
1. **CPU inference with default settings (50 images):** |
|
```bash |
|
python clip_inference.py --cpu |
|
``` |
|
|
|
2. **NPU inference with 100 images:** |
|
```bash |
|
python clip_inference.py --npu --num_images 100 |
|
``` |
|
|
|
3. **NPU inference on complete test dataset:** |
|
```bash |
|
python clip_inference.py --npu --num_images 10000 |
|
``` |
|
|
|
## How It Works |
|
|
|
### Model Architecture |
|
- **Text Encoder**: Processes text descriptions ("a photo of a {class_name}") |
|
- **Vision Encoder**: Processes CIFAR-100 images (32x32 RGB) |
|
- **Classification**: Computes similarity between image and text embeddings |
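
A sketch of the text side, assuming the standard Hugging Face tokenizer for this checkpoint (the script's exact preprocessing may differ):

```python
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

# CIFAR-100 label names (first few shown; the script uses all 100).
class_names = ["apple", "aquarium_fish", "baby", "bear", "beaver"]
prompts = [f"a photo of a {name}" for name in class_names]

# CLIP uses a fixed sequence length of 77 tokens.
text_inputs = tokenizer(prompts, padding="max_length", max_length=77, return_tensors="np")
print(text_inputs["input_ids"].shape)  # (num_prompts, 77)
```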
|
|
|
### Inference Pipeline |
|
1. **Text Processing**: Pre-compute text features for all 100 CIFAR-100 class labels |
|
2. **Image Processing**: Process each image through the vision encoder |
|
3. **Classification**: Compute cosine similarity between image and text features |
|
4. **Prediction**: Select the class with highest similarity score |
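
A minimal sketch of steps 2-4 for a single image, assuming the exported vision model takes a `pixel_values` input and that the text features for all 100 classes are already computed and L2-normalized (the real input/output names in the ONNX files may differ):

```python
import numpy as np

def classify_image(vision_session, pixel_values, text_features):
    """Return the predicted class index for one preprocessed image.

    pixel_values:  (1, 3, 224, 224) array from the CLIP image processor
    text_features: (100, 512) L2-normalized text embeddings, one per class
    """
    # Step 2: run the vision encoder ("pixel_values" is an assumed input name).
    image_features = vision_session.run(None, {"pixel_values": pixel_values})[0]
    image_features = image_features / np.linalg.norm(image_features, axis=-1, keepdims=True)

    # Step 3: cosine similarity reduces to a dot product after normalization.
    similarities = image_features @ text_features.T  # shape (1, 100)

    # Step 4: pick the class with the highest similarity.
    return int(similarities.argmax())
```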
|
|
|
### Performance Optimization |
|
- **NPU Acceleration**: Leverages AMD Ryzen AI NPU for faster inference |
|
- **Caching**: Uses pre-compiled model caches for reduced startup time |
|
|
|
## Output Metrics |
|
|
|
The script reports the following performance metrics: |
|
|
|
- **Text Latency**: Average time per text inference (ms) |
|
- **Text Throughput**: Text inferences per second (inf/s) |
|
- **Vision Latency**: Average time per image inference (ms) |
|
- **Vision Throughput**: Image inferences per second (inf/s) |
|
- **Classification Accuracy**: Percentage of correctly classified images |
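
With batch size 1, latency and throughput are reciprocals of each other, so both can be derived from one wall-clock timing loop. A sketch of how such numbers can be computed (not the script's exact bookkeeping):

```python
import time

def measure(run_once, num_iterations: int):
    """Return (average latency in ms, throughput in inferences per second)."""
    start = time.perf_counter()
    for _ in range(num_iterations):
        run_once()
    elapsed = time.perf_counter() - start
    return elapsed / num_iterations * 1000.0, num_iterations / elapsed
```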
|
|
|
### Example Output |
|
|
|
**NPU Execution (50 images):** |
|
```
Compilation Done
Session on NPU

Processing images...
Image inference: 100%|██████████| 50/50 [00:03<00:00, 13.45it/s]

Results:
Text latency: 26.65 ms
Text throughput: 37.52 inf/s
Vision latency: 73.46 ms
Vision throughput: 13.61 inf/s
Classification accuracy: 77.55%
```
|
|
|
## Performance Benchmarks |
|
|
|
### Expected Results |
|
|
|
| Execution Mode | Dataset Size | Accuracy (%) | Text Throughput (inf/s) | Text Latency (ms) | Vision Throughput (inf/s) | Vision Latency (ms) | |
|
|---|---|---|---|---|---|---| |
|
| NPU | 50 | 77.55 | 28.4 | 35.22 | 9.48 | 105.48 | |
|
| NPU | 10,000 | 62.19 | 28.08 | 35.61 | 9.39 | 106.54 | |
|
| CPU | 50 | 75.51 | 58.46 | 17.11 | 40.04 | 24.97 | |
|
| CPU | 10,000 | 61.0 | 58.49 | 17.10 | 40.99 | 24.40 | |
|
|
|
### Model Specifications |
|
- **Image Size**: 224x224 (resized from CIFAR-100's 32x32) |
|
- **Sequence Length**: 77 tokens |
|
- **Batch Size**: 1 |
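
A sketch of the image preprocessing, assuming the standard Hugging Face image processor for this checkpoint handles the 32x32 to 224x224 resize and normalization (the script may implement this differently):

```python
import numpy as np
from PIL import Image
from transformers import CLIPImageProcessor

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Dummy 32x32 RGB image standing in for a CIFAR-100 sample.
cifar_image = Image.fromarray(np.zeros((32, 32, 3), dtype=np.uint8))

# Resizing to 224x224 and normalization happen inside the processor.
pixel_values = processor(images=cifar_image, return_tensors="np")["pixel_values"]
print(pixel_values.shape)  # (1, 3, 224, 224)
```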
|
|
|
|
|
## Technical Details |
|
|
|
### Dependencies |
|
- `transformers`: Hugging Face transformers library |
|
- `datasets`: Hugging Face datasets library |
|
- `onnxruntime`: ONNX Runtime for model inference |
|
- `torch`: PyTorch for tensor operations |
|
- `numpy`: Numerical computing |
|
- `tqdm`: Progress bars |
|
|
|
### Model Details |
|
- **Base Model**: OpenAI CLIP ViT-Base-Patch32 |
|
- **Text Encoder**: Transformer-based language model |
|
- **Vision Encoder**: Vision Transformer (ViT) with 32x32 patches |
|
- **Output**: 512-dimensional feature vectors |
|
|
|
### Environment Variables |
|
The script sets the following environment variables: |
|
- `XLNX_ENABLE_CACHE=0`: Disables certain caching mechanisms |
|
- `PATH`: Adds FlexML runtime library path |
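
`clip_inference.py` sets these itself; replicating the setup in your own code would look roughly like this (the FlexML path below is illustrative, not the actual location on your system):

```python
import os

# Disable the XLNX cache mechanism, as the script does.
os.environ["XLNX_ENABLE_CACHE"] = "0"

# Prepend an (illustrative) FlexML runtime library directory to PATH.
flexml_lib = r"C:\path\to\flexml\runtime\lib"
os.environ["PATH"] = flexml_lib + os.pathsep + os.environ.get("PATH", "")
```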
|
|
|
## Troubleshooting |
|
|
|
### Common Issues |
|
|
|
1. **Missing ONNX Models**: Ensure `clip_text_model.onnx` and `clip_vision_model.onnx` are in the script directory |
|
2. **NPU Compilation Errors**: Verify VitisAI configuration files are present and correctly formatted |
|
3. **Memory Issues**: Reduce `--num_images` if encountering out-of-memory errors |
|
4. **Accuracy Variations**: Results may vary slightly due to random sampling and hardware differences |
|
|
|
### Performance Tips |
|
|
|
1. **First Run**: NPU execution includes compilation time on first run |
|
2. **Warm-up**: Performance metrics exclude warm-up iterations |
|
3. **Batch Size**: Current implementation uses batch size 1 for compatibility |
|
4. **Cache Directory**: Ensure cache directories have write permissions |
|
|
|
## License |
|
|
|
This project uses the OpenAI CLIP model, which is subject to OpenAI's licensing terms. Please refer to the original CLIP repository for license details. |
|
|
|
## References |
|
|
|
- [OpenAI CLIP Paper](https://arxiv.org/abs/2103.00020) |
|
- [AMD Ryzen AI Documentation](https://ryzenai.docs.amd.com/) |
|
- [CIFAR-100 Dataset](https://www.cs.toronto.edu/~kriz/cifar.html) |
|
- [Hugging Face CLIP Model](https://huggingface.co/openai/clip-vit-base-patch32) |
|
|