|
# CLIP Inference with AMD Ryzen AI |
|
|
|
This repository contains a Python script for running CLIP (Contrastive Language-Image Pre-training) model inference on the CIFAR-100 dataset using AMD Ryzen AI NPU or CPU. |
|
|
|
**The models, caches, and config files in this repo assume Ryzen AI (RAI) 1.5.**
|
|
|
## Overview |
|
|
|
The `clip_inference.py` script performs zero-shot image classification using OpenAI's CLIP model on the CIFAR-100 dataset. It supports both CPU and NPU (Neural Processing Unit) execution, allowing you to leverage AMD Ryzen AI acceleration for improved performance. |
|
|
|
### Key Features |
|
|
|
- **Zero-shot Classification**: Classify CIFAR-100 images without fine-tuning |
|
- **Dual Execution Modes**: Run on CPU or AMD Ryzen AI NPU |
|
- **Performance Metrics**: Measures latency, throughput, and classification accuracy |
|
- **Flexible Dataset Size**: Process from 1 to 10,000 images |
|
- **ONNX Runtime Integration**: Uses optimized ONNX models for inference |
|
|
|
## Prerequisites |
|
|
|
### Environment Setup |
|
|
|
1. **AMD Ryzen AI Installation**: Follow the [Ryzen AI Installation Guide](https://ryzenai.docs.amd.com/en/latest/inst.html) to prepare your environment. |
|
|
|
2. **Activate Conda Environment**: |
|
```bash |
|
conda activate <env_name> |
|
``` |
|
|
|
3. **Install Dependencies**: |
|
```bash |
|
pip install -r requirements.txt |
|
``` |
|
|
|
### Required Files |
|
|
|
Ensure the following files are present in the same directory as `clip_inference.py`: |
|
|
|
#### ONNX Model Files |
|
- `clip_text_model.onnx` - ONNX text encoder model |
|
- `clip_vision_model.onnx` - ONNX vision encoder model |
|
|
|
#### Configuration Files (for NPU execution) |
|
- `vitisai_config.json` - VitisAI configuration |
|
|
|
#### Model Cache Directories |
|
- `clip_text_model_cache/` - Cached text model artifacts |
|
- `clip_vision_model_cache/` - Cached vision model artifacts |
|
|
|
### Cache Directory Structure |
|
|
|
The cache directories contain pre-compiled model artifacts and optimization files for improved performance. |
|
|
|
They eliminate the need for on-device model compilation, which can be time-consuming.
|
|
|
CLIP uses two models (a text encoder and a vision encoder), so two model caches are provided as zip archives.
|
|
|
Please unzip the cache archives and make sure the directories listed under **Model Cache Directories** above are in the same location as the inference script.
Depending on how the archives extract, this may require moving the unzipped directories up one level in the directory hierarchy.
|
|
|
```
clip_text_model_cache/
├── aie_unsupported_original_ops.json
├── context.json
├── final-vaiml-pass-summary.txt
├── gops.csv
├── graph_nodes.json
├── graph_partition_trace.csv
├── original-info-signature.txt
├── original-model-signature.txt
├── partition_io_shapes.json
├── preliminary-vaiml-pass-summary.txt
├── tensor_shape.json
├── cache/
├── vaiml_par_0/
└── vaiml_partition_fe.flexml/

clip_vision_model_cache/
├── aie_unsupported_original_ops.json
├── context.json
├── final-vaiml-pass-summary.txt
├── gops.csv
├── graph_nodes.json
├── graph_partition_trace.csv
├── original-info-signature.txt
├── original-model-signature.txt
├── partition_io_shapes.json
├── preliminary-vaiml-pass-summary.txt
├── tensor_shape.json
├── cache/
├── vaiml_par_0/
└── vaiml_partition_fe.flexml/
```
|
|
|
#### Cache Directory Descriptions |
|
|
|
- **Root Level Files**: Contain compilation metadata, graph analysis, and performance summaries |
|
- **`cache/`**: Hash-based cache storage for model artifacts |
|
- **`vaiml_par_0/`**: Contains compiled model artifacts, MLIR representations, and native libraries |
|
- **`vaiml_partition_fe.flexml/`**: Contains optimized ONNX models and visualization files |
|
|
|
**Note**: These cache directories are generated automatically during the first NPU compilation; the pre-built caches shipped with this repo let you skip that step and significantly reduce startup time.
|
|
|
## Usage |
|
|
|
### Command Line Interface |
|
|
|
```bash |
|
python clip_inference.py [-h] (--npu | --cpu) [--num_images NUM_IMAGES] |
|
``` |
|
|
|
### Arguments |
|
|
|
**Required (mutually exclusive):** |
|
- `--cpu`: Run inference on CPU using CPUExecutionProvider |
|
- `--npu`: Run inference on NPU using VitisAIExecutionProvider |
|
|
|
**Optional:** |
|
- `--num_images`: Number of images to process from CIFAR-100 test set (default: 50, max: 10,000) |
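
Internally, the two flags select different ONNX Runtime execution providers. Below is a minimal sketch of how the sessions could be created, assuming the VitisAI provider is configured through its `config_file` option as described in the Ryzen AI documentation (the exact provider options used by `clip_inference.py` may differ):

```python
import onnxruntime as ort

def create_session(model_path: str, use_npu: bool) -> ort.InferenceSession:
    """Create an ONNX Runtime session on the Ryzen AI NPU or the CPU."""
    if use_npu:
        # VitisAIExecutionProvider compiles the model for the NPU and reuses
        # the pre-built cache directories on subsequent runs.
        return ort.InferenceSession(
            model_path,
            providers=["VitisAIExecutionProvider"],
            provider_options=[{"config_file": "vitisai_config.json"}],
        )
    # Plain CPU execution.
    return ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
```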
|
|
|
### Examples |
|
|
|
1. **CPU inference with default settings (50 images):** |
|
```bash |
|
python clip_inference.py --cpu |
|
``` |
|
|
|
2. **NPU inference with 100 images:** |
|
```bash |
|
python clip_inference.py --npu --num_images 100 |
|
``` |
|
|
|
3. **NPU inference on complete test dataset:** |
|
```bash |
|
python clip_inference.py --npu --num_images 10000 |
|
``` |
|
|
|
## How It Works |
|
|
|
### Model Architecture |
|
- **Text Encoder**: Processes text descriptions ("a photo of a {class_name}") |
|
- **Vision Encoder**: Processes CIFAR-100 images (32x32 RGB) |
|
- **Classification**: Computes similarity between image and text embeddings |
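
A sketch of the text side, assuming the standard Hugging Face tokenizer for this checkpoint (the script's exact preprocessing may differ):

```python
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

# CIFAR-100 label names (first few shown; the script uses all 100).
class_names = ["apple", "aquarium_fish", "baby", "bear", "beaver"]
prompts = [f"a photo of a {name}" for name in class_names]

# CLIP uses a fixed sequence length of 77 tokens.
text_inputs = tokenizer(prompts, padding="max_length", max_length=77, return_tensors="np")
print(text_inputs["input_ids"].shape)  # (num_prompts, 77)
```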
|
|
|
### Inference Pipeline |
|
1. **Text Processing**: Pre-compute text features for all 100 CIFAR-100 class labels |
|
2. **Image Processing**: Process each image through the vision encoder |
|
3. **Classification**: Compute cosine similarity between image and text features |
|
4. **Prediction**: Select the class with highest similarity score |
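
A minimal sketch of steps 2-4 for a single image, assuming the exported vision model takes a `pixel_values` input and that the text features for all 100 classes are already computed and L2-normalized (the real input/output names in the ONNX files may differ):

```python
import numpy as np

def classify_image(vision_session, pixel_values, text_features):
    """Return the predicted class index for one preprocessed image.

    pixel_values:  (1, 3, 224, 224) array from the CLIP image processor
    text_features: (100, 512) L2-normalized text embeddings, one per class
    """
    # Step 2: run the vision encoder ("pixel_values" is an assumed input name).
    image_features = vision_session.run(None, {"pixel_values": pixel_values})[0]
    image_features = image_features / np.linalg.norm(image_features, axis=-1, keepdims=True)

    # Step 3: cosine similarity reduces to a dot product after normalization.
    similarities = image_features @ text_features.T  # shape (1, 100)

    # Step 4: pick the class with the highest similarity.
    return int(similarities.argmax())
```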
|
|
|
### Performance Optimization |
|
- **NPU Acceleration**: Leverages AMD Ryzen AI NPU for faster inference |
|
- **Caching**: Uses pre-compiled model caches for reduced startup time |
|
|
|
## Output Metrics |
|
|
|
The script reports the following performance metrics: |
|
|
|
- **Text Latency**: Average time per text inference (ms) |
|
- **Text Throughput**: Text inferences per second (inf/s) |
|
- **Vision Latency**: Average time per image inference (ms) |
|
- **Vision Throughput**: Image inferences per second (inf/s) |
|
- **Classification Accuracy**: Percentage of correctly classified images |
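
With batch size 1, latency and throughput are reciprocals of each other, so both can be derived from one wall-clock timing loop. A sketch of how such numbers can be computed (not the script's exact bookkeeping):

```python
import time

def measure(run_once, num_iterations: int):
    """Return (average latency in ms, throughput in inferences per second)."""
    start = time.perf_counter()
    for _ in range(num_iterations):
        run_once()
    elapsed = time.perf_counter() - start
    return elapsed / num_iterations * 1000.0, num_iterations / elapsed
```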
|
|
|
### Example Output |
|
|
|
**NPU Execution (50 images):** |
|
```
Compilation Done
Session on NPU

Processing images...
Image inference: 100%|██████████| 50/50 [00:03<00:00, 13.45it/s]

Results:
Text latency: 26.65 ms
Text throughput: 37.52 inf/s
Vision latency: 73.46 ms
Vision throughput: 13.61 inf/s
Classification accuracy: 77.55%
```
|
|
|
## Performance Benchmarks |
|
|
|
### Expected Results |
|
|
|
| Execution Mode | Dataset Size | Accuracy (%) | Text Throughput (inf/s) | Text Latency (ms) | Vision Throughput (inf/s) | Vision Latency (ms) | |
|
|---|---|---|---|---|---|---| |
|
| NPU | 50 | 77.55 | 28.4 | 35.22 | 9.48 | 105.48 | |
|
| NPU | 10,000 | 62.19 | 28.08 | 35.61 | 9.39 | 106.54 | |
|
| CPU | 50 | 75.51 | 58.46 | 17.11 | 40.04 | 24.97 | |
|
| CPU | 10,000 | 61.0 | 58.49 | 17.10 | 40.99 | 24.40 | |
|
|
|
### Model Specifications |
|
- **Image Size**: 224x224 (resized from CIFAR-100's 32x32) |
|
- **Sequence Length**: 77 tokens |
|
- **Batch Size**: 1 |
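
A sketch of the image preprocessing, assuming the standard Hugging Face image processor for this checkpoint handles the 32x32 to 224x224 resize and normalization (the script may implement this differently):

```python
import numpy as np
from PIL import Image
from transformers import CLIPImageProcessor

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Dummy 32x32 RGB image standing in for a CIFAR-100 sample.
cifar_image = Image.fromarray(np.zeros((32, 32, 3), dtype=np.uint8))

# Resizing to 224x224 and normalization happen inside the processor.
pixel_values = processor(images=cifar_image, return_tensors="np")["pixel_values"]
print(pixel_values.shape)  # (1, 3, 224, 224)
```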
|
|
|
|
|
## Technical Details |
|
|
|
### Dependencies |
|
- `transformers`: Hugging Face transformers library |
|
- `datasets`: Hugging Face datasets library |
|
- `onnxruntime`: ONNX Runtime for model inference |
|
- `torch`: PyTorch for tensor operations |
|
- `numpy`: Numerical computing |
|
- `tqdm`: Progress bars |
|
|
|
### Model Details |
|
- **Base Model**: OpenAI CLIP ViT-Base-Patch32 |
|
- **Text Encoder**: Transformer-based language model |
|
- **Vision Encoder**: Vision Transformer (ViT) with 32x32 patches |
|
- **Output**: 512-dimensional feature vectors |
|
|
|
### Environment Variables |
|
The script sets the following environment variables: |
|
- `XLNX_ENABLE_CACHE=0`: Disables certain caching mechanisms |
|
- `PATH`: Adds FlexML runtime library path |
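
`clip_inference.py` sets these itself; replicating the setup in your own code would look roughly like this (the FlexML path below is illustrative, not the actual location on your system):

```python
import os

# Disable the XLNX cache mechanism, as the script does.
os.environ["XLNX_ENABLE_CACHE"] = "0"

# Prepend an (illustrative) FlexML runtime library directory to PATH.
flexml_lib = r"C:\path\to\flexml\runtime\lib"
os.environ["PATH"] = flexml_lib + os.pathsep + os.environ.get("PATH", "")
```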
|
|
|
## Troubleshooting |
|
|
|
### Common Issues |
|
|
|
1. **Missing ONNX Models**: Ensure `clip_text_model.onnx` and `clip_vision_model.onnx` are in the script directory |
|
2. **NPU Compilation Errors**: Verify VitisAI configuration files are present and correctly formatted |
|
3. **Memory Issues**: Reduce `--num_images` if encountering out-of-memory errors |
|
4. **Accuracy Variations**: Results may vary slightly due to random sampling and hardware differences |
|
|
|
### Performance Tips |
|
|
|
1. **First Run**: NPU execution includes compilation time on first run |
|
2. **Warm-up**: Performance metrics exclude warm-up iterations |
|
3. **Batch Size**: Current implementation uses batch size 1 for compatibility |
|
4. **Cache Directory**: Ensure cache directories have write permissions |
|
|
|
## License |
|
|
|
This project uses the OpenAI CLIP model, which is subject to OpenAI's licensing terms. Please refer to the original CLIP repository for license details. |
|
|
|
## References |
|
|
|
- [OpenAI CLIP Paper](https://arxiv.org/abs/2103.00020) |
|
- [AMD Ryzen AI Documentation](https://ryzenai.docs.amd.com/) |
|
- [CIFAR-100 Dataset](https://www.cs.toronto.edu/~kriz/cifar.html) |
|
- [Hugging Face CLIP Model](https://huggingface.co/openai/clip-vit-base-patch32) |
|
|