---
license: mit
tags:
- llama-cpp-python
- cuda
- gemma
- gemma-3
- windows
- wheel
- prebuilt
- .whl
- local-llm
---

# llama-cpp-python Prebuilt Wheel (Windows x64, CUDA 12.8, Gemma 3 Support)

---

🛠️ **Built with** [llama.cpp (b5192)](https://github.com/ggml-org/llama.cpp) + [CUDA 12.8](https://developer.nvidia.com/cuda-toolkit)

---

**Prebuilt `.whl` for llama-cpp-python 0.3.8 — CUDA 12.8 acceleration with full Gemma 3 model support (Windows x64).**
This repository provides a prebuilt Python wheel (`.whl`) file for **llama-cpp-python**, specifically compiled for Windows 10/11 (x64) with NVIDIA CUDA 12.8 acceleration enabled.

Building `llama-cpp-python` with CUDA support on Windows can be a complex process involving specific Visual Studio configurations, CUDA Toolkit setup, and environment variables. This prebuilt wheel aims to simplify installation for users with compatible systems.
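For context, a from-source CUDA build normally looks something like the sketch below. It is shown only for comparison, follows the upstream llama-cpp-python build instructions, and assumes Visual Studio Build Tools and the CUDA 12.8 Toolkit are already installed; with the prebuilt wheel you can skip this step entirely.

```bash
# Sketch of a from-source CUDA build (PowerShell) - the step this wheel replaces.
$env:CMAKE_ARGS = "-DGGML_CUDA=on"
pip install llama-cpp-python==0.3.8 --no-cache-dir --force-reinstall --verbose
```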
This build is based on version `0.3.8` of the **llama-cpp-python** bindings and on the **llama.cpp** source code as of **April 26, 2025**. It has been verified to work with **Gemma 3 models**, correctly offloading layers to the GPU.

---

## Features

- **Prebuilt for Windows x64**: Ready to install using `pip` on 64-bit Windows systems.
- **CUDA 12.8 Accelerated**: Leverages your NVIDIA GPU for faster inference.
- **Gemma 3 Support**: Verified compatibility with Gemma 3 models.
- **Based on llama-cpp-python version `0.3.8` bindings.**
- **Uses [llama.cpp release b5192](https://github.com/ggml-org/llama.cpp/releases/tag/b5192) from April 26, 2025.**

---
## Compatibility & Prerequisites

To use this wheel, you must have:

- An **NVIDIA GPU**.
- NVIDIA drivers compatible with **CUDA 12.8** installed.
- **Windows 10 or Windows 11 (x64)**.
- **Python 3.11** (the wheel is built for CPython 3.11 and tagged `cp311`, so it will not install under other Python versions).
- The **Visual C++ Redistributable for Visual Studio 2015-2022** installed.

You can quickly confirm the driver and interpreter versions from a terminal, as shown below.
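A quick sanity check before installing (`nvidia-smi` reports the installed driver and the highest CUDA version it supports):

```bash
# Driver check: the "CUDA Version" shown in the header should be 12.8 or newer.
nvidia-smi

# Interpreter check: expect a 64-bit Python 3.11.x.
python --version
python -c "import platform; print(platform.python_version(), platform.architecture()[0])"
```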
---

## Installation

It is highly recommended to install this wheel within a Python virtual environment.

1. Ensure you have met all the prerequisites listed above.
2. Create and activate a Python virtual environment:

```bash
python -m venv venv_llama
.\venv_llama\Scripts\activate
```

3. Download the `.whl` file from this repository's **Releases** section.
4. Open your Command Prompt or PowerShell.
5. Navigate to the directory where you downloaded the `.whl` file.
6. Install the wheel using `pip`:

```bash
pip install llama_cpp_python-0.3.8+cu128.gemma3-cp311-cp311-win_amd64.whl
```
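To confirm the wheel was installed into the active environment, check the reported version (it should be `0.3.8`):

```bash
pip show llama-cpp-python
python -c "import llama_cpp; print(llama_cpp.__version__)"
```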
---

## Verification (Check CUDA Usage)

To verify that `llama-cpp-python` is using your GPU via CUDA after installation, run a quick test from the command line:

```bash
python -c "from llama_cpp import Llama; print('Attempting to initialize Llama with GPU offload...'); Llama(model_path='path/to/a/small/model.gguf', n_gpu_layers=-1, verbose=True); print('Initialization attempted. Check the output above for GPU layers.')"
```

Note: Replace `path/to/a/small/model.gguf` with the actual path to a small `.gguf` model file; the command will raise an error if the path does not exist.

Look for output messages indicating layers being offloaded to the GPU, such as `assigned to device CUDA0` or CUDA memory buffer reports.
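If you want a model-free check as well, recent builds expose the low-level `llama_supports_gpu_offload` binding from llama.cpp. Assuming it is available in this 0.3.8 build, the following should print `True` for a CUDA-enabled wheel:

```bash
python -c "import llama_cpp; print('GPU offload supported:', llama_cpp.llama_supports_gpu_offload())"
```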
## Alternative Verification: Python Script

If you prefer, you can verify that `llama-cpp-python` is correctly using CUDA by running a small Python script inside your virtual environment.

Replace the placeholder paths below with your actual `.dll` and `.gguf` file locations:

```python
import os

# Optional: point llama-cpp-python at a custom-built llama.dll.
# This environment variable is only read at import time, so set it
# *before* importing llama_cpp (omit it to use the DLL bundled with the wheel).
os.environ['LLAMA_CPP_LIB'] = r'PATH_TO_YOUR_CUSTOM_LLAMA_DLL'

from llama_cpp import Llama

try:
    print('Attempting to initialize Llama with GPU offload (-1 layers)...')

    # Initialize the Llama model with full GPU offloading
    model = Llama(
        model_path=r'PATH_TO_YOUR_MODEL_FILE.gguf',
        n_gpu_layers=-1,
        verbose=True
    )

    print('Initialization attempted. Check the output above for CUDA device assignments (e.g., CUDA0, CUDA1).')

except FileNotFoundError:
    print('Error: Model file not found. Please double-check your model_path.')
except Exception as e:
    print(f'An error occurred during initialization: {e}')
```
**What to look for in the output:**

- Lines like `assigned to device CUDA0` or `assigned to device CUDA1`.
- VRAM buffer allocations such as `CUDA0 model buffer size = ... MiB`.
- Confirmation that your GPU(s) are being used for model layer offloading.
## Usage

Once installed and verified, you can use `llama-cpp-python` in your projects as you normally would. Refer to the [official llama-cpp-python documentation](https://llama-cpp-python.readthedocs.io/) for detailed usage instructions.
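As a minimal sketch of the high-level API (the model path and prompt are placeholders; the parameters shown follow the upstream documentation):

```python
from llama_cpp import Llama

# Load a GGUF model with all layers offloaded to the GPU.
llm = Llama(
    model_path=r"PATH_TO_YOUR_MODEL_FILE.gguf",  # e.g. a Gemma 3 GGUF file
    n_gpu_layers=-1,
    n_ctx=4096,
    verbose=False,
)

# Run a simple chat-style completion.
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain what a GGUF file is in one sentence."}],
    max_tokens=128,
)

print(response["choices"][0]["message"]["content"])
```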
## Acknowledgments

This prebuilt wheel is based on the excellent llama-cpp-python project by Andrei Betlen (@abetlen). All credit for the core library and Python bindings goes to the original maintainers and to llama.cpp by Georgi Gerganov (@ggerganov) and the ggml team.

This specific wheel was built by Bernard Peter Fitzgerald (@boneylizardwizard) using the source code from abetlen/llama-cpp-python, compiled with CUDA 12.8 support for Windows x64 systems, and verified for Gemma 3 model compatibility.
## License

This prebuilt wheel is distributed under the MIT License, the same license as the original llama-cpp-python project.

## Reporting Issues

If you encounter issues specifically with installing this prebuilt wheel or getting CUDA offloading to work using this wheel, please report them on this repository's issue tracker.

For general issues with llama-cpp-python itself, please report them upstream on the [official llama-cpp-python GitHub issues page](https://github.com/abetlen/llama-cpp-python/issues).