---
license: mit
tags:
- llama-cpp-python
- cuda
- gemma
- gemma-3
- windows
- wheel
- prebuilt
- .whl
- local-llm
---

# llama-cpp-python Prebuilt Wheel (Windows x64, CUDA 12.8, Gemma 3 Support)

---

🛠️ **Built with** [llama.cpp (b5192)](https://github.com/ggml-org/llama.cpp) + [CUDA 12.8](https://developer.nvidia.com/cuda-toolkit)

---

**Prebuilt `.whl` for llama-cpp-python 0.3.8 — CUDA 12.8 acceleration with full Gemma 3 model support (Windows x64).**
This repository provides a prebuilt Python wheel (`.whl`) file for **llama-cpp-python**, specifically compiled for Windows 10/11 (x64) with NVIDIA CUDA 12.8 acceleration enabled.

Building `llama-cpp-python` with CUDA support on Windows can be a complex process involving specific Visual Studio configurations, CUDA Toolkit setup, and environment variables. This prebuilt wheel aims to simplify installation for users with compatible systems.
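For context, a from-source CUDA build normally looks something like the sketch below. It is shown only for comparison, follows the upstream llama-cpp-python build instructions, and assumes Visual Studio Build Tools and the CUDA 12.8 Toolkit are already installed; with the prebuilt wheel you can skip this step entirely.

```bash
# Sketch of a from-source CUDA build (PowerShell) - the step this wheel replaces.
$env:CMAKE_ARGS = "-DGGML_CUDA=on"
pip install llama-cpp-python==0.3.8 --no-cache-dir --force-reinstall --verbose
```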
This build is based on version `0.3.8` of the **llama-cpp-python** bindings and on the **llama.cpp** source code as of **April 26, 2025**. It has been verified to work with **Gemma 3 models**, correctly offloading layers to the GPU.

---

## Features

- **Prebuilt for Windows x64**: Ready to install using `pip` on 64-bit Windows systems.
- **CUDA 12.8 Accelerated**: Leverages your NVIDIA GPU for faster inference.
- **Gemma 3 Support**: Verified compatibility with Gemma 3 models.
- **Based on llama-cpp-python version `0.3.8` bindings.**
- **Uses [llama.cpp release b5192](https://github.com/ggml-org/llama.cpp/releases/tag/b5192) from April 26, 2025.**

---
## Compatibility & Prerequisites

To use this wheel, you must have:

- An **NVIDIA GPU**.
- NVIDIA drivers compatible with **CUDA 12.8** installed.
- **Windows 10 or Windows 11 (x64)**.
- **Python 3.11** (the wheel is built for CPython 3.11 and tagged `cp311`, so it will not install under other Python versions).
- The **Visual C++ Redistributable for Visual Studio 2015-2022** installed.

You can quickly confirm the driver and interpreter versions from a terminal, as shown below.
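A quick sanity check before installing (`nvidia-smi` reports the installed driver and the highest CUDA version it supports):

```bash
# Driver check: the "CUDA Version" shown in the header should be 12.8 or newer.
nvidia-smi

# Interpreter check: expect a 64-bit Python 3.11.x.
python --version
python -c "import platform; print(platform.python_version(), platform.architecture()[0])"
```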
---

## Installation

It is highly recommended to install this wheel within a Python virtual environment.

1. Ensure you have met all the prerequisites listed above.
2. Create and activate a Python virtual environment:

```bash
python -m venv venv_llama
.\venv_llama\Scripts\activate
```

3. Download the `.whl` file from this repository's **Releases** section.
4. Open your Command Prompt or PowerShell.
5. Navigate to the directory where you downloaded the `.whl` file.
6. Install the wheel using `pip`:

```bash
pip install llama_cpp_python-0.3.8+cu128.gemma3-cp311-cp311-win_amd64.whl
```
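To confirm the wheel was installed into the active environment, check the reported version (it should be `0.3.8`):

```bash
pip show llama-cpp-python
python -c "import llama_cpp; print(llama_cpp.__version__)"
```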
---

## Verification (Check CUDA Usage)

To verify that `llama-cpp-python` is using your GPU via CUDA after installation, run a quick test from the command line:

```bash
python -c "from llama_cpp import Llama; print('Attempting to initialize Llama with GPU offload...'); Llama(model_path='path/to/a/small/model.gguf', n_gpu_layers=-1, verbose=True); print('Initialization attempted. Check the output above for GPU layers.')"
```

Note: Replace `path/to/a/small/model.gguf` with the actual path to a small `.gguf` model file; the command will raise an error if the path does not exist.

Look for output messages indicating layers being offloaded to the GPU, such as `assigned to device CUDA0` or CUDA memory buffer reports.
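If you want a model-free check as well, recent builds expose the low-level `llama_supports_gpu_offload` binding from llama.cpp. Assuming it is available in this 0.3.8 build, the following should print `True` for a CUDA-enabled wheel:

```bash
python -c "import llama_cpp; print('GPU offload supported:', llama_cpp.llama_supports_gpu_offload())"
```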
## Alternative Verification: Python Script

If you prefer, you can verify that `llama-cpp-python` is correctly using CUDA by running a small Python script inside your virtual environment.

Replace the placeholder paths below with your actual `.dll` and `.gguf` file locations:

```python
import os

# Optional: point llama-cpp-python at a custom-built llama.dll.
# This environment variable is only read at import time, so set it
# *before* importing llama_cpp (omit it to use the DLL bundled with the wheel).
os.environ['LLAMA_CPP_LIB'] = r'PATH_TO_YOUR_CUSTOM_LLAMA_DLL'

from llama_cpp import Llama

try:
    print('Attempting to initialize Llama with GPU offload (-1 layers)...')

    # Initialize the Llama model with full GPU offloading
    model = Llama(
        model_path=r'PATH_TO_YOUR_MODEL_FILE.gguf',
        n_gpu_layers=-1,
        verbose=True
    )

    print('Initialization attempted. Check the output above for CUDA device assignments (e.g., CUDA0, CUDA1).')

except FileNotFoundError:
    print('Error: Model file not found. Please double-check your model_path.')
except Exception as e:
    print(f'An error occurred during initialization: {e}')
```
**What to look for in the output:**

- Lines like `assigned to device CUDA0` or `assigned to device CUDA1`.
- VRAM buffer allocations such as `CUDA0 model buffer size = ... MiB`.
- Confirmation that your GPU(s) are being used for model layer offloading.
## Usage

Once installed and verified, you can use `llama-cpp-python` in your projects as you normally would. Refer to the [official llama-cpp-python documentation](https://llama-cpp-python.readthedocs.io/) for detailed usage instructions.
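As a minimal sketch of the high-level API (the model path and prompt are placeholders; the parameters shown follow the upstream documentation):

```python
from llama_cpp import Llama

# Load a GGUF model with all layers offloaded to the GPU.
llm = Llama(
    model_path=r"PATH_TO_YOUR_MODEL_FILE.gguf",  # e.g. a Gemma 3 GGUF file
    n_gpu_layers=-1,
    n_ctx=4096,
    verbose=False,
)

# Run a simple chat-style completion.
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain what a GGUF file is in one sentence."}],
    max_tokens=128,
)

print(response["choices"][0]["message"]["content"])
```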
## Acknowledgments

This prebuilt wheel is based on the excellent llama-cpp-python project by Andrei Betlen (@abetlen). All credit for the core library and Python bindings goes to the original maintainers and to llama.cpp by Georgi Gerganov (@ggerganov) and the ggml team.

This specific wheel was built by Bernard Peter Fitzgerald (@boneylizardwizard) using the source code from abetlen/llama-cpp-python, compiled with CUDA 12.8 support for Windows x64 systems, and verified for Gemma 3 model compatibility.
## License

This prebuilt wheel is distributed under the MIT License, the same license as the original llama-cpp-python project.

## Reporting Issues

If you encounter issues specifically with installing this prebuilt wheel or getting CUDA offloading to work using this wheel, please report them on this repository's issue tracker.

For general issues with llama-cpp-python itself, please report them upstream on the [official llama-cpp-python GitHub issues page](https://github.com/abetlen/llama-cpp-python/issues).