---
license: gemma
base_model: google/Gemma-3-1B-IT
pipeline_tag: text-generation
tags:
- chat
extra_gated_heading: Access Gemma3-1B-IT on Hugging Face
extra_gated_prompt: >-
To access Gemma3-1B-IT on Hugging Face, you are required to review and agree
to the Gemma license. To do this, please ensure you are logged in to
Hugging Face and click below. Requests are processed immediately.
extra_gated_button_content: Acknowledge license
---
# litert-community/Gemma3-1B-IT
This repository provides several variants of
[google/Gemma-3-1B-IT](https://huggingface.co/google/Gemma-3-1B-IT) that are ready for
on-device deployment (Android, iOS, Web) using the
[LiteRT (formerly TFLite) stack](https://ai.google.dev/edge/litert) and the
[MediaPipe LLM Inference API](https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference).
## Use the models
### Colab
*Disclaimer: The target deployment surfaces for the LiteRT models are
Android/iOS/Web, and the stack has been optimized for performance on these
targets. Colab is an easy way to familiarize yourself with the LiteRT stack,
with the caveat that performance (memory and latency) on Colab can be much
worse than on a local device.*
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.sandbox.google.com/github/google-ai-edge/mediapipe-samples/blob/main/codelabs/litert_inference/gemma3_1b_tflite.ipynb)
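To experiment with the bundles directly (in Colab or locally), you can first fetch them from this repo with `huggingface_hub`. A minimal sketch; the filename below is the CPU int4 (block size 128, 1280 context) bundle linked in the performance tables, and because the repo is gated you may also need to authenticate (e.g. via an `HF_TOKEN`):

```python
# pip install huggingface_hub
from huggingface_hub import hf_hub_download

# Downloads the bundle into the local Hugging Face cache and returns its path.
model_path = hf_hub_download(
    repo_id="litert-community/Gemma3-1B-IT",
    filename="Gemma3-1B-IT_multi-prefill-seq_q4_block128_ekv1280.task",
)
print(model_path)
```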
### Customize
Fine-tune Gemma 3 1B and deploy it with either LiteRT or the MediaPipe LLM Inference API:
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/#fileId=https://github.com/google-ai-edge/mediapipe-samples/blob/main/codelabs/litert_inference/Gemma3_1b_fine_tune.ipynb)
### Android via Google AI Edge Gallery and MediaPipe
* Download and install
[the APK](https://github.com/google-ai-edge/gallery/releases/latest/download/ai-edge-gallery.apk).
* Follow the instructions in the app.
To build the demo app from source, please follow the [instructions](https://github.com/google-ai-edge/gallery/blob/main/README.md)
from the GitHub repository.
### Android or Desktop via LiteRT LM
Follow the LiteRT LM [instructions](https://github.com/google-ai-edge/LiteRT-LM/blob/main/README.md) to build the open-source LiteRT LM runtime, which can run LiteRT models on Android or desktop.
### iOS via MediaPipe
* Clone the [MediaPipe samples](https://github.com/google-ai-edge/mediapipe-samples)
repository and follow the [instructions](https://github.com/google-ai-edge/mediapipe-samples/tree/main/examples/llm_inference/ios/README.md)
to build the LLM Inference iOS Sample App using Xcode.
* Run the app via the iOS simulator or deploy to an iOS device.
## Performance
### Android via Google AI Edge Gallery and MediaPipe
Note that all benchmark stats are from a Samsung S24 Ultra and use models with multiple prefill signatures.
<table border="1">
<tr>
<th style="text-align: left">Backend</th>
<th style="text-align: left">Quantization scheme</th>
<th style="text-align: left">Context length</th>
<th style="text-align: left">Prefill (tokens/sec)</th>
<th style="text-align: left">Decode (tokens/sec)</th>
<th style="text-align: left">Time-to-first-token (sec)</th>
<th style="text-align: left">CPU Memory (RSS in MB)</th>
<th style="text-align: left">GPU Memory (RSS in MB)</th>
<th style="text-align: left">Model size (MB)</th>
<th></th>
</tr>
<tr>
<td rowspan="8"><p style="text-align: left">CPU</p></td>
<td><p style="text-align: left">fp32 (baseline)</p></td>
<td><p style="text-align: right">1280</p></td>
<td><p style="text-align: right">49 tk/s</p></td>
<td><p style="text-align: right">10 tk/s</p></td>
<td><p style="text-align: right">5.59 s</p></td>
<td><p style="text-align: right">4,123 MB</p></td>
<td><p style="text-align: right"></p></td>
<td><p style="text-align: right">3,824 MB</p></td>
<td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Gemma3-1B-IT/resolve/main/Gemma3-1B-IT_multi-prefill-seq_f32_ekv1280.task">&#128279;</a></p></td>
</tr>
<tr>
<td rowspan="2"><p style="text-align: left">dynamic_int4 (block size 128)</p></td>
<td><p style="text-align: right">1280</p></td>
<td><p style="text-align: right">138 tk/s</p></td>
<td><p style="text-align: right">50 tk/s</p></td>
<td><p style="text-align: right">2.33 s</p></td>
<td><p style="text-align: right">982 MB</p></td>
<td><p style="text-align: right"></p></td>
<td><p style="text-align: right">657 MB</p></td>
<td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Gemma3-1B-IT/resolve/main/Gemma3-1B-IT_multi-prefill-seq_q4_block128_ekv1280.task">&#128279;</a></p></td>
</tr>
<tr>
<td><p style="text-align: right">4096</p></td>
<td><p style="text-align: right">87 tk/s</p></td>
<td><p style="text-align: right">37 tk/s</p></td>
<td><p style="text-align: right">3.40 s</p></td>
<td><p style="text-align: right">1,145 MB</p></td>
<td><p style="text-align: right"></p></td>
<td><p style="text-align: right">657 MB</p></td>
<td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Gemma3-1B-IT/resolve/main/Gemma3-1B-IT_multi-prefill-seq_q4_block128_ekv4096.task">&#128279;</a></p></td>
</tr>
<tr>
<td rowspan="2"><p style="text-align: left">dynamic_int4 (block size 32)</p></td>
<td><p style="text-align: right">1280</p></td>
<td><p style="text-align: right">107 tk/s</p></td>
<td><p style="text-align: right">48 tk/s</p></td>
<td><p style="text-align: right">3.49 s</p></td>
<td><p style="text-align: right">1,045 MB</p></td>
<td><p style="text-align: right"></p></td>
<td><p style="text-align: right">688 MB</p></td>
<td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Gemma3-1B-IT/resolve/main/Gemma3-1B-IT_multi-prefill-seq_q4_block32_ekv1280.task">&#128279;</a></p></td>
</tr>
<tr>
<td><p style="text-align: right">4096</p></td>
<td><p style="text-align: right">79 tk/s</p></td>
<td><p style="text-align: right">36 tk/s</p></td>
<td><p style="text-align: right">4.40 s</p></td>
<td><p style="text-align: right">1,210 MB</p></td>
<td><p style="text-align: right"></p></td>
<td><p style="text-align: right">688 MB</p></td>
<td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Gemma3-1B-IT/resolve/main/Gemma3-1B-IT_multi-prefill-seq_q4_block32_ekv4096.task">&#128279;</a></p></td>
</tr>
<tr>
<td><p style="text-align: left">dynamic_int4 QAT</p></td>
<td><p style="text-align: right">2048</p></td>
<td><p style="text-align: right">322 tk/s</p></td>
<td><p style="text-align: right">47 tk/s</p></td>
<td><p style="text-align: right">3.10 s</p></td>
<td><p style="text-align: right">1,138 MB</p></td>
<td><p style="text-align: right"></p></td>
<td><p style="text-align: right">529 MB</p></td>
<td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Gemma3-1B-IT/resolve/main/gemma3-1b-it-int4.task">&#128279;</a></p></td>
</tr>
<tr>
<td rowspan="2"><p style="text-align: left">dynamic_int8</p></td>
<td><p style="text-align: right">1280</p></td>
<td><p style="text-align: right">177 tk/s</p></td>
<td><p style="text-align: right">33 tk/s</p></td>
<td><p style="text-align: right">1.69 s</p></td>
<td><p style="text-align: right">1,341 MB</p></td>
<td><p style="text-align: right"></p></td>
<td><p style="text-align: right">1,005 MB</p></td>
<td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Gemma3-1B-IT/resolve/main/Gemma3-1B-IT_multi-prefill-seq_q8_ekv1280.task">&#128279;</a></p></td>
</tr>
<tr>
<td><p style="text-align: right">4096</p></td>
<td><p style="text-align: right">123 tk/s</p></td>
<td><p style="text-align: right">29 tk/s</p></td>
<td><p style="text-align: right">2.34 s</p></td>
<td><p style="text-align: right">1,504 MB</p></td>
<td><p style="text-align: right"></p></td>
<td><p style="text-align: right">1,005 MB</p></td>
<td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Gemma3-1B-IT/resolve/main/Gemma3-1B-IT_multi-prefill-seq_q8_ekv4096.task">&#128279;</a></p></td>
</tr>
<tr>
<td rowspan="3"><p style="text-align: left">GPU</p></td>
<td><p style="text-align: left">dynamic_int4 QAT</p></td>
<td><p style="text-align: right">2048</p></td>
<td><p style="text-align: right">2585 tk/s</p></td>
<td><p style="text-align: right">56 tk/s</p></td>
<td><p style="text-align: right">4.50 s</p></td>
<td><p style="text-align: right">1,205 MB</p></td>
<td><p style="text-align: right"></p></td>
<td><p style="text-align: right">529 MB</p></td>
<td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Gemma3-1B-IT/resolve/main/gemma3-1b-it-int4.task">&#128279;</a></p></td>
</tr>
<tr>
<td rowspan="2"><p style="text-align: left">dynamic_int8</p></td>
<td><p style="text-align: right">1280</p></td>
<td><p style="text-align: right">1191 tk/s</p></td>
<td><p style="text-align: right">24 tk/s</p></td>
<td><p style="text-align: right">4.68 s</p></td>
<td><p style="text-align: right">2,164 MB</p></td>
<td><p style="text-align: right">1,059 MB</p></td>
<td><p style="text-align: right">1,005 MB</p></td>
<td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Gemma3-1B-IT/resolve/main/Gemma3-1B-IT_multi-prefill-seq_q8_ekv1280.task">&#128279;</a></p></td>
</tr>
<tr>
<td><p style="text-align: right">4096</p></td>
<td><p style="text-align: right">814 tk/s</p></td>
<td><p style="text-align: right">24 tk/s</p></td>
<td><p style="text-align: right">4.99 s</p></td>
<td><p style="text-align: right">2,167 MB</p></td>
<td><p style="text-align: right">1,181 MB</p></td>
<td><p style="text-align: right">1,005 MB</p></td>
<td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Gemma3-1B-IT/resolve/main/Gemma3-1B-IT_multi-prefill-seq_q8_ekv4096.task">&#128279;</a></p></td>
</tr>
</table>
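As a rough sanity check on the table above, time-to-first-token is approximately the prompt length divided by prefill throughput, plus fixed overhead. For example, taking the CPU dynamic_int4 QAT row and assuming a prompt of about 1024 tokens (the prompt length is an assumption; it is not stated for the Android runs):

```python
prefill_tps = 322     # tokens/sec, CPU dynamic_int4 QAT row above
prompt_tokens = 1024  # assumed prompt length; not stated for the Android runs
print(f"estimated TTFT: {prompt_tokens / prefill_tps:.2f} s")  # ~3.18 s vs. 3.10 s reported
```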
* For the list of supported quantization schemes, see
[supported-schemes](https://github.com/google-ai-edge/ai-edge-torch/tree/main/ai_edge_torch/generative/quantize#supported-schemes).
These models use prefill signature lengths of 32, 128, 512, and 1280 (see the
sketch after these notes).
* Model Size: measured by the size of the .tflite flatbuffer (serialization
format for LiteRT models)
* Memory: indicator of peak RAM usage
* The inference on CPU is accelerated via the LiteRT
[XNNPACK](https://github.com/google/XNNPACK) delegate with 4 threads
* Benchmarks are run with the cache enabled and initialized. During the first
run, the time to first token may differ.
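The multi-prefill-signature bundles compile the prefill graph at several fixed lengths so that short prompts do not pay the cost of the longest graph. A toy sketch of how a runtime might choose among them (illustrative only, not LiteRT's actual dispatch logic):

```python
PREFILL_LENGTHS = (32, 128, 512, 1280)  # signatures compiled into these bundles

def pick_prefill_length(prompt_tokens: int) -> int:
    """Pick the smallest compiled prefill signature that fits the prompt."""
    for length in PREFILL_LENGTHS:
        if prompt_tokens <= length:
            return length
    # Longer prompts can be prefilled in several chunks of the largest signature.
    return PREFILL_LENGTHS[-1]

assert pick_prefill_length(20) == 32
assert pick_prefill_length(300) == 512
```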
### Android via LiteRT LM
Note that all benchmark stats are from a Samsung S24 Ultra and use models with multiple prefill signatures.
<table border="1">
<tr>
<th style="text-align: left">Backend</th>
<th style="text-align: left">Quantization scheme</th>
<th style="text-align: left">Context length</th>
<th style="text-align: left">Prefill (tokens/sec)</th>
<th style="text-align: left">Decode (tokens/sec)</th>
<th style="text-align: left">Time-to-first-token (sec)</th>
<th style="text-align: left">CPU Memory (RSS in MB)</th>
<th style="text-align: left">GPU Memory (RSS in MB)</th>
<th style="text-align: left">Model size (MB)</th>
<th></th>
</tr>
<tr>
<td><p style="text-align: left">CPU</p></td>
<td><p style="text-align: left">dynamic_int4 QAT</p></td>
<td><p style="text-align: right">2048</p></td>
<td><p style="text-align: right">379 tk/s</p></td>
<td><p style="text-align: right">55 tk/s</p></td>
<td><p style="text-align: left"></p></td>
<td><p style="text-align: right">1,009 MB</p></td>
<td><p style="text-align: right"></p></td>
<td><p style="text-align: right">529 MB</p></td>
<td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Gemma3-1B-IT/resolve/main/gemma3-1b-it-int4.litertlm">&#128279;</a></p></td>
</tr>
<tr>
<td><p style="text-align: left">GPU</p></td>
<td><p style="text-align: left">dynamic_int4 QAT</p></td>
<td><p style="text-align: right">2048</p></td>
<td><p style="text-align: right">2531 tk/s</p></td>
<td><p style="text-align: right">49 tk/s</p></td>
<td><p style="text-align: left"></p></td>
<td><p style="text-align: right">1,205 MB</p></td>
<td><p style="text-align: right"></p></td>
<td><p style="text-align: right">529 MB</p></td>
<td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Gemma3-1B-IT/resolve/main/gemma3-1b-it-int4.litertlm">&#128279;</a></p></td>
</tr>
</table>
* For the list of supported quantization schemes, see
[supported-schemes](https://github.com/google-ai-edge/ai-edge-torch/tree/main/ai_edge_torch/generative/quantize#supported-schemes).
These models use prefill signature lengths of 32, 128, 512, and 1280.
* Model Size: measured by the size of the .tflite flatbuffer (serialization
format for LiteRT models)
* Memory: indicator of peak RAM usage
* The inference on CPU is accelerated via the LiteRT
[XNNPACK](https://github.com/google/XNNPACK) delegate with 4 threads
* Benchmarks are run with the cache enabled and initialized. During the first
run, the time to first token may differ.
#### Android via LiteRT LM with NPU
Note that the benchmark stats are from a Samsung S25 Ultra and use models that prefill in 128-token chunks up to 1024 tokens.
<table border="1">
<tr>
<th style="text-align: left">Backend</th>
<th style="text-align: left">Quantization scheme</th>
<th style="text-align: left">Context length</th>
<th style="text-align: left">Prefill (tokens/sec)</th>
<th style="text-align: left">Decode (tokens/sec)</th>
<th style="text-align: left">Time-to-first-token (sec)</th>
<th style="text-align: left">CPU Memory (RSS in MB)</th>
<th style="text-align: left">GPU Memory (RSS in MB)</th>
<th style="text-align: left">Model size (MB)</th>
<th></th>
</tr>
<tr>
<td><p style="text-align: left">NPU</p></td>
<td><p style="text-align: left">a16w4 QAT</p></td>
<td><p style="text-align: right">1280</p></td>
<td><p style="text-align: right">5836 tk/s</p></td>
<td><p style="text-align: right">85 tk/s</p></td>
<td><p style="text-align: left"></p></td>
<td><p style="text-align: right">626 MB</p></td>
<td><p style="text-align: right"></p></td>
<td><p style="text-align: right">689 MB</p></td>
<td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Gemma3-1B-IT/resolve/main/Gemma3-1B-IT_q4_ekv1280_sm8750.litertlm">&#128279;</a></p></td>
</tr>
</table>
* Model Size: measured by the size of the .tflite flatbuffer (serialization
format for LiteRT models)
* Memory: indicator of peak RAM usage from malloc.
### Web
Note that all benchmark stats are from a MacBook Pro (2024, Apple M4 Max) running with a 1280-token KV cache, 1024 prefill tokens, and 256 decode tokens.
<table border="1">
<tr>
<th style="text-align: left">Backend</th>
<th style="text-align: left">Quantization scheme</th>
<th style="text-align: left">Precision</th>
<th style="text-align: left">Prefill (tokens/sec)</th>
<th style="text-align: left">Decode (tokens/sec)</th>
<th style="text-align: left">Time-to-first-token (sec)</th>
<th style="text-align: left">CPU Memory</th>
<th style="text-align: left">GPU Memory</th>
<th style="text-align: left">Model size (MB)</th>
<th></th>
</tr>
<tr>
<td rowspan="5"><p style="text-align: left">GPU</p></td>
<td rowspan="2"><p style="text-align: left">dynamic_int4</p></td>
<td><p style="text-align: left">F16</p></td>
<td><p style="text-align: right">4339 tk/s</p></td>
<td><p style="text-align: right">133 tk/s</p></td>
<td><p style="text-align: right">0.51 s</p></td>
<td><p style="text-align: right">460 MB</p></td>
<td><p style="text-align: right">1,331 MB</p></td>
<td><p style="text-align: right">700 MB</p></td>
<td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Gemma3-1B-IT/resolve/main/gemma3-1b-it-int4-web.task">&#128279;</a></p></td>
</tr>
<tr>
<td><p style="text-align: left">F32</p></td>
<td><p style="text-align: right">2837 tk/s</p></td>
<td><p style="text-align: right">134 tk/s</p></td>
<td><p style="text-align: right">0.49 s</p></td>
<td><p style="text-align: right">481 MB</p></td>
<td><p style="text-align: right">1,331 MB</p></td>
<td><p style="text-align: right">700 MB</p></td>
<td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Gemma3-1B-IT/resolve/main/gemma3-1b-it-int4-web.task">&#128279;</a></p></td>
</tr>
<tr>
<td><p style="text-align: left">dynamic_int4 QAT</p></td>
<td><p style="text-align: left">F16</p></td>
<td><p style="text-align: right">1702 tk/s</p></td>
<td><p style="text-align: right">77 tk/s</p></td>
<td><p style="text-align: right"></p></td>
<td><p style="text-align: right"></p></td>
<td><p style="text-align: right"></p></td>
<td><p style="text-align: right">529 MB</p></td>
<td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Gemma3-1B-IT/resolve/main/gemma3-1b-it-int4.task">&#128279;</a></p></td>
</tr>
<tr>
<td rowspan="2"><p style="text-align: left">dynamic_int8</p></td>
<td><p style="text-align: left">F16</p></td>
<td><p style="text-align: right">4321 tk/s</p></td>
<td><p style="text-align: right">126 tk/s</p></td>
<td><p style="text-align: right">0.6 s</p></td>
<td><p style="text-align: right">471 MB</p></td>
<td><p style="text-align: right">1,740 MB</p></td>
<td><p style="text-align: right">1,011 MB</p></td>
<td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Gemma3-1B-IT/resolve/main/gemma3-1b-it-int8-web.task">&#128279;</a></p></td>
</tr>
<tr>
<td><p style="text-align: left">F32</p></td>
<td><p style="text-align: right">2805 tk/s</p></td>
<td><p style="text-align: right">129 tk/s</p></td>
<td><p style="text-align: right">0.58 s</p></td>
<td><p style="text-align: right">474 MB</p></td>
<td><p style="text-align: right">1,740 MB</p></td>
<td><p style="text-align: right">1,011 MB</p></td>
<td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Gemma3-1B-IT/resolve/main/gemma3-1b-it-int8-web.task">&#128279;</a></p></td>
</tr>
</table>
* Model size: measured by the size of the .tflite flatbuffer (serialization format for LiteRT models)
* dynamic_int4: quantized model with int4 weights and float activations.
* dynamic_int8: quantized model with int8 weights and float activations.
* a16w4: quantized model with int4 weights and int16 activations.
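To make the scheme names concrete, here is a toy NumPy sketch of "dynamic" (weight-only) quantization: weights are stored as integers with a scale, activations stay in float, and the matmul dequantizes weights on the fly. This illustrates the naming above, not LiteRT's actual kernels:

```python
import numpy as np

def quantize_weights(w: np.ndarray, bits: int = 8):
    """Symmetric per-tensor integer quantization (toy version)."""
    qmax = 2 ** (bits - 1) - 1  # 127 for int8, 7 for int4
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dynamic_matmul(x: np.ndarray, q: np.ndarray, scale: float) -> np.ndarray:
    # Activations stay in float; weights are dequantized on the fly.
    return x @ (q.astype(np.float32) * scale)

w = np.random.randn(16, 16).astype(np.float32)
x = np.random.randn(2, 16).astype(np.float32)
for bits in (8, 4):
    q, s = quantize_weights(w, bits)
    err = np.abs(x @ w - dynamic_matmul(x, q, s)).max()
    print(f"int{bits} weight-only matmul, max abs error: {err:.4f}")
```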