---
license: mit
tags:
- llama-cpp-python
- cuda
- gemma
- gemma-3
- windows
- wheel
- prebuilt
- .whl
- local-llm
---
# llama-cpp-python Prebuilt Wheel (Windows x64, CUDA 12.8, Gemma 3 Support)

---

🛠️ **Built with** [llama.cpp (b5192)](https://github.com/ggml-org/llama.cpp) + [CUDA 12.8](https://developer.nvidia.com/cuda-toolkit)

---

**Prebuilt `.whl` for llama-cpp-python 0.3.8 — CUDA 12.8 acceleration with full Gemma 3 model support (Windows x64).**

This repository provides a prebuilt Python wheel (`.whl`) file for **llama-cpp-python**, compiled specifically for Windows 10/11 (x64) with NVIDIA CUDA 12.8 acceleration enabled.

Building `llama-cpp-python` with CUDA support on Windows can be a complex process involving specific Visual Studio configurations, CUDA Toolkit setup, and environment variables. This prebuilt wheel simplifies installation for users with compatible systems.

This build uses version `0.3.8` of the **llama-cpp-python** Python bindings and the **llama.cpp** source code as of **April 26, 2025**. It has been verified to work with **Gemma 3 models**, correctly offloading layers to the GPU.

---

## Features

- **Prebuilt for Windows x64**: Ready to install using `pip` on 64-bit Windows systems.
- **CUDA 12.8 Accelerated**: Leverages your NVIDIA GPU for faster inference.
- **Gemma 3 Support**: Verified compatibility with Gemma 3 models.
- **Based on llama-cpp-python version `0.3.8` bindings.**
- **Uses [llama.cpp release b5192](https://github.com/ggml-org/llama.cpp/releases/tag/b5192) from April 26, 2025.**

---

## Compatibility & Prerequisites

To use this wheel, you must have:

- An **NVIDIA GPU**.
- NVIDIA drivers compatible with **CUDA 12.8** installed.
- **Windows 10 or Windows 11 (x64)**.
- **Python 3.11** (the wheel's `cp311` tag means `pip` will only install it on CPython 3.11).
- The **Visual C++ Redistributable for Visual Studio 2015-2022** installed.
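
A quick way to check the driver and interpreter prerequisites from a terminal (a minimal sketch; assumes `nvidia-smi` is on your `PATH`):

```bash
# Shows the installed NVIDIA driver and the highest CUDA version it supports
nvidia-smi

# The cp311 wheel tag requires a 64-bit Python 3.11 interpreter
python --version
python -c "import struct; print(struct.calcsize('P') * 8, 'bit')"
```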

---

## Installation

It is highly recommended to install this wheel within a Python virtual environment.

1. Ensure you have met all the prerequisites listed above.
2. Create and activate a Python virtual environment:

   ```bash
   python -m venv venv_llama
   .\venv_llama\Scripts\activate
   ```

3. Download the `.whl` file from this repository's **Releases** section.
4. Open your Command Prompt or PowerShell.
5. Navigate to the directory where you downloaded the `.whl` file.
6. Install the wheel using `pip`:

   ```bash
   pip install llama_cpp_python-0.3.8+cu128.gemma3-cp311-cp311-win_amd64.whl
   ```
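
To confirm the package is importable before testing GPU offload, a quick sanity check (a minimal sketch; assumes the wheel installed into the currently active environment):

```bash
python -c "import llama_cpp; print(llama_cpp.__version__)"
```

This should print `0.3.8`.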

---

## Verification (Check CUDA Usage)

To verify that `llama-cpp-python` is using your GPU via CUDA after installation, run a quick initialization with full GPU offload:

```bash
python -c "from llama_cpp import Llama; print('Attempting to initialize Llama with GPU offload...'); Llama(model_path='path/to/a/small/model.gguf', n_gpu_layers=-1, verbose=True)"
```

Note: Replace `path/to/a/small/model.gguf` with the actual path to a small `.gguf` model file. Even if the model file is not found, the library's initialization output may still indicate CUDA usage.

Look for output messages indicating layers being offloaded to the GPU, such as `assigned to device CUDA0` or memory buffer reports.

## Alternative Verification: Python Script

If you prefer, you can verify that `llama-cpp-python` is correctly using CUDA by running a small Python script inside your virtual environment.

Replace the placeholder paths below with your actual `.dll` and `.gguf` file locations:

```python
import os

# Set LLAMA_CPP_LIB *before* importing llama_cpp: the shared library is
# loaded at import time, so setting it afterwards has no effect. This
# override is only needed to point at a custom-built llama.dll.
os.environ['LLAMA_CPP_LIB'] = r'PATH_TO_YOUR_CUSTOM_LLAMA_DLL'

from llama_cpp import Llama

try:
    print('Attempting to initialize Llama with GPU offload (-1 layers)...')

    # Initialize the Llama model with full GPU offloading
    model = Llama(
        model_path=r'PATH_TO_YOUR_MODEL_FILE.gguf',
        n_gpu_layers=-1,
        verbose=True
    )

    print('Initialization attempted. Check the output above for CUDA device assignments (e.g., CUDA0, CUDA1).')

except FileNotFoundError:
    print('Error: Model file not found. Please double-check your model_path.')
except Exception as e:
    print(f'An error occurred during initialization: {e}')
```

**What to look for in the output:**

- Lines like `assigned to device CUDA0` or `assigned to device CUDA1`.
- VRAM buffer allocations such as `CUDA0 model buffer size = ... MiB`.
- Confirmation that your GPU(s) are being used for model layer offloading.
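
If you do not have a model file handy, you can also query the low-level bindings directly. As a minimal sketch (assuming the `llama_supports_gpu_offload` binding is exposed at module level, as in recent llama-cpp-python releases):

```bash
python -c "import llama_cpp; print('GPU offload supported:', llama_cpp.llama_supports_gpu_offload())"
```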

## Usage

Once installed and verified, you can use `llama-cpp-python` in your projects as you normally would. Refer to the official llama-cpp-python documentation for detailed usage instructions.
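
As a minimal sketch of the high-level API (the model path is a placeholder, and the prompt is arbitrary):

```python
from llama_cpp import Llama

# Load a GGUF model, offloading all layers to the GPU
llm = Llama(model_path=r'PATH_TO_YOUR_MODEL_FILE.gguf', n_gpu_layers=-1)

# Run a short completion and print the generated text
output = llm('Q: Name the planets in the solar system. A:', max_tokens=64)
print(output['choices'][0]['text'])
```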

## Acknowledgments

This prebuilt wheel is based on the excellent [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) project by Andrei Betlen (@abetlen). All credit for the core library and Python bindings goes to the original maintainers, and to [llama.cpp](https://github.com/ggml-org/llama.cpp) by Georgi Gerganov (@ggerganov).

This specific wheel was built by Bernard Peter Fitzgerald (@boneylizard) using the source code from abetlen/llama-cpp-python, compiled with CUDA 12.8 support for Windows x64 systems, and verified for Gemma 3 model compatibility.

## License

This prebuilt wheel is distributed under the MIT License, the same license as the original llama-cpp-python project.

## Reporting Issues

If you encounter issues specifically with installing this prebuilt wheel or with getting CUDA offloading to work using this wheel, please report them on this repository's Issue Tracker.

For general issues with `llama-cpp-python` itself, please report them upstream at the [official llama-cpp-python GitHub Issues page](https://github.com/abetlen/llama-cpp-python/issues).