---
license: mit
tags:
- llama-cpp-python
- cuda
- gemma
- gemma-3
- windows
- wheel
- prebuilt
- .whl
- local-llm
---

# llama-cpp-python Prebuilt Wheel (Windows x64, CUDA 12.8, Gemma 3 Support)

---

🛠️ **Built with** [llama.cpp (b5192)](https://github.com/ggml-org/llama.cpp) + [CUDA 12.8](https://developer.nvidia.com/cuda-toolkit)

---

**Prebuilt `.whl` for llama-cpp-python 0.3.8 — CUDA 12.8 acceleration with full Gemma 3 model support (Windows x64).**

This repository provides a prebuilt Python wheel (`.whl`) file for **llama-cpp-python**, specifically compiled for Windows 10/11 (x64) with NVIDIA CUDA 12.8 acceleration enabled.

Building `llama-cpp-python` with CUDA support on Windows can be a complex process involving specific Visual Studio configurations, CUDA Toolkit setup, and environment variables. This prebuilt wheel aims to simplify installation for users with compatible systems.
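
For context, a from-source CUDA build on Windows typically requires the CUDA Toolkit plus the Visual Studio C++ build tools, with CMake flags passed through `pip`. A minimal sketch of that route (flag names as documented by `llama-cpp-python`; details vary by version and toolchain), shown only to illustrate what the prebuilt wheel saves you:

```bash
# PowerShell: enable the CUDA backend, then compile llama-cpp-python from source
$env:CMAKE_ARGS = "-DGGML_CUDA=on"
pip install llama-cpp-python --no-cache-dir
```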

This build is based on **llama-cpp-python** version `0.3.8` of the Python bindings, and the underlying **llama.cpp** source code as of **April 26, 2025**. It has been verified to work with **Gemma 3 models**, correctly offloading layers to the GPU.

---

## Features

- **Prebuilt for Windows x64**: Ready to install using `pip` on 64-bit Windows systems.
- **CUDA 12.8 Accelerated**: Leverages your NVIDIA GPU for faster inference.
- **Gemma 3 Support**: Verified compatibility with Gemma 3 models.
- **Based on llama-cpp-python version `0.3.8` bindings.**
- **Uses [llama.cpp release b5192](https://github.com/ggml-org/llama.cpp/releases/tag/b5192) from April 26, 2025.**

---

## Compatibility & Prerequisites

To use this wheel, you must have:

- An **NVIDIA GPU**.
- NVIDIA drivers compatible with **CUDA 12.8** installed (see the quick check below).
- **Windows 10 or Windows 11 (x64)**.
- **Python 3.11** (the wheel is tagged `cp311`, so it will not install on other Python versions).
- The **Visual C++ Redistributable for Visual Studio 2015-2022** installed.
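
A quick way to sanity-check the driver and Python prerequisites from a terminal (the `nvidia-smi` banner reports the highest CUDA version the installed driver supports):

```bash
# Driver check: the header should show "CUDA Version: 12.8" or newer
nvidia-smi

# Python check: this wheel targets CPython 3.11 (cp311)
python --version
```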

---

## Installation

It is highly recommended to install this wheel within a Python virtual environment.

1. Ensure you have met all the prerequisites listed above.
2. Create and activate a Python virtual environment:

   ```bash
   python -m venv venv_llama
   .\venv_llama\Scripts\activate
   ```

3. Download the `.whl` file from this repository's **Releases** section.
4. Open your Command Prompt or PowerShell.
5. Navigate to the directory where you downloaded the `.whl` file.
6. Install the wheel using `pip`:

   ```bash
   pip install llama_cpp_python-0.3.8+cu128.gemma3-cp311-cp311-win_amd64.whl
   ```
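
After installation, a quick sanity check is to import the package and print its version (this assumes `llama_cpp.__version__` is exposed, as it is in recent releases):

```bash
python -c "import llama_cpp; print(llama_cpp.__version__)"
```

This should print `0.3.8`; an import error usually means the wheel was installed into a different environment than the one currently active.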

---

## Verification (Check CUDA Usage)

To verify that `llama-cpp-python` is using your GPU via CUDA after installation, run a one-liner that loads a model with all layers offloaded:

```bash
python -c "from llama_cpp import Llama; print('Attempting to initialize Llama with GPU offload...'); model = Llama(model_path='path/to/a/small/model.gguf', n_gpu_layers=-1, verbose=True); print('Initialization attempted. Check the output above for GPU layers.')"
```

Note: Replace `path/to/a/small/model.gguf` with the actual path to a small `.gguf` model file; the command will raise an error if the file does not exist.

Look for output messages indicating layers being offloaded to the GPU, such as `assigned to device CUDA0` or memory buffer reports.

## Alternative Verification: Python Script

If you prefer, you can verify that `llama-cpp-python` is correctly using CUDA by running a small Python script inside your virtual environment.

Replace the placeholder paths below with your actual `.dll` and `.gguf` file locations:

```python
import os

# Point LLAMA_CPP_LIB at a custom-built llama.dll *before* importing llama_cpp,
# because the shared library is loaded at import time.
# This line is only needed if you want to override the DLL bundled with the wheel.
os.environ['LLAMA_CPP_LIB'] = r'PATH_TO_YOUR_CUSTOM_LLAMA_DLL'

from llama_cpp import Llama

try:
    print('Attempting to initialize Llama with GPU offload (-1 layers)...')

    # Initialize the Llama model with full GPU offloading
    model = Llama(
        model_path=r'PATH_TO_YOUR_MODEL_FILE.gguf',
        n_gpu_layers=-1,
        verbose=True
    )

    print('Initialization attempted. Check the output above for CUDA device assignments (e.g., CUDA0, CUDA1).')

except FileNotFoundError:
    print('Error: Model file not found. Please double-check your model_path.')
except Exception as e:
    print(f'An error occurred during initialization: {e}')
```

**What to look for in the output:**

- Lines like `assigned to device CUDA0` or `assigned to device CUDA1`.
- VRAM buffer allocations such as `CUDA0 model buffer size = ... MiB`.
- Confirmation that your GPU(s) are being used for model layer offloading.

## Usage

Once installed and verified, you can use `llama-cpp-python` in your projects as you normally would. Refer to the official llama-cpp-python documentation for detailed usage instructions.
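
As a quick-start sketch (this example is not taken from the upstream documentation; the model path and generation settings are placeholders to adapt), the snippet below loads a local GGUF model with full GPU offload and runs a single chat completion:

```python
from llama_cpp import Llama

# Placeholder path to a local GGUF model file (e.g. a Gemma 3 GGUF you have downloaded)
MODEL_PATH = r"PATH_TO_YOUR_MODEL_FILE.gguf"

llm = Llama(
    model_path=MODEL_PATH,
    n_gpu_layers=-1,  # offload all layers to the GPU
    n_ctx=4096,       # context window size
    verbose=False,
)

# Chat-style request via the OpenAI-compatible helper
response = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Explain in one sentence what a GGUF file is."}
    ],
    max_tokens=128,
)

print(response["choices"][0]["message"]["content"])
```

`create_chat_completion` returns an OpenAI-style response dictionary, which is why the generated text is read from `response["choices"][0]["message"]["content"]`.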

## Acknowledgments

This prebuilt wheel is based on the excellent llama-cpp-python project by Andrei Betlen (@abetlen). All credit for the core library and Python bindings goes to the original maintainers and to llama.cpp by Georgi Gerganov (@ggerganov).

This specific wheel was built by Bernard Peter Fitzgerald (@boneylizard) using the source code from abetlen/llama-cpp-python, compiled with CUDA 12.8 support for Windows x64 systems, and verified for Gemma 3 model compatibility.

## License

This prebuilt wheel is distributed under the MIT License, the same license as the original llama-cpp-python project.

## Reporting Issues

If you encounter issues specifically with installing this prebuilt wheel or getting CUDA offloading to work using this wheel, please report them on this repository's Issue Tracker.

For general issues with `llama-cpp-python` itself, please report them upstream to the abetlen/llama-cpp-python project; issues with the underlying engine belong on the [llama.cpp GitHub Issues page](https://github.com/ggml-org/llama.cpp/issues).