---
license: gemma
base_model: google/Gemma-3-1B-IT
pipeline_tag: text-generation
tags:
- chat
extra_gated_heading: Access Gemma3-1B-IT on Hugging Face
extra_gated_prompt: >-
  To access Gemma3-1B-IT on Hugging Face, you are required to review and agree
  to the gemma license. To do this, please ensure you are logged in to Hugging
  Face and click below. Requests are processed immediately.
extra_gated_button_content: Acknowledge license
---

# litert-community/Gemma3-1B-IT

This model provides a few variants of [google/Gemma-3-1B-IT](https://huggingface.co/google/Gemma-3-1B-IT) that are ready for deployment on Android using the [LiteRT (fka TFLite) stack](https://ai.google.dev/edge/litert) and the [MediaPipe LLM Inference API](https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference).

## Use the models

### Colab

*Disclaimer: The target deployment surfaces for the LiteRT models are Android, iOS, and the Web, and the stack has been optimized for performance on those targets. Trying the system out in Colab is an easier way to familiarize yourself with the LiteRT stack, with the caveat that performance (memory and latency) on Colab could be much worse than on a local device.*

[Open in Colab](https://colab.sandbox.google.com/github/google-ai-edge/mediapipe-samples/blob/main/codelabs/litert_inference/gemma3_1b_tflite.ipynb)

### Customize

Fine-tune Gemma 3 1B and deploy it with either LiteRT or the MediaPipe LLM Inference API:

[Open the fine-tuning notebook in Colab](https://colab.research.google.com/#fileId=https://github.com/google-ai-edge/mediapipe-samples/blob/main/codelabs/litert_inference/Gemma3_1b_fine_tune.ipynb)

### Android

* Download and install [the apk](https://github.com/google-ai-edge/gallery/releases/latest/download/ai-edge-gallery.apk).
* Follow the instructions in the app.

To build the demo app from source, please follow the [instructions](https://github.com/google-ai-edge/gallery/blob/main/README.md) in the GitHub repository. A minimal Kotlin sketch of the underlying LLM Inference API appears after the iOS instructions below.

### iOS

* Clone the [MediaPipe samples](https://github.com/google-ai-edge/mediapipe-samples) repository and follow the [instructions](https://github.com/google-ai-edge/mediapipe-samples/tree/main/examples/llm_inference/ios/README.md) to build the LLM Inference iOS Sample App using Xcode.
* Run the app via the iOS simulator or deploy it to an iOS device.
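To give a concrete picture of the app-level integration, here is a minimal Kotlin sketch of the MediaPipe LLM Inference API on Android. The model file name and on-device path are placeholders, and the option names reflect the current MediaPipe Tasks GenAI API; treat this as a starting point rather than the exact code used by the demo app.

```kotlin
import android.content.Context
import com.google.mediapipe.tasks.genai.llminference.LlmInference

// Minimal sketch: load a Gemma3-1B-IT LiteRT bundle and run a single prompt.
// The path below is a placeholder -- push the downloaded model file to the
// device (e.g. with adb) and point setModelPath at its actual location.
fun runGemma(context: Context, prompt: String): String {
    val options = LlmInference.LlmInferenceOptions.builder()
        .setModelPath("/data/local/tmp/llm/gemma3-1b-it.task")
        .setMaxTokens(1024) // total budget shared by prompt and response
        .build()

    // Creating the engine loads the model; reuse one instance across prompts.
    val llm = LlmInference.createFromOptions(context, options)
    val response = llm.generateResponse(prompt)
    llm.close()
    return response
}
```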
## Performance

### Android

Note that all benchmark stats below are from a Samsung S24 Ultra, with multiple prefill signatures enabled.

| Backend | Quantization scheme | Context length | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | CPU Memory (RSS in MB) | GPU Memory (RSS in MB) | Model size (MB) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CPU | fp32 (baseline) | 1280 | 49 | 10 | 5.59 | 4,123 | | 3,824 |
| | dynamic_int4 (block size 128) | 1280 | 138 | 50 | 2.33 | 982 | | 657 |
| | | 4096 | 87 | 37 | 3.40 | 1,145 | | 657 |
| | dynamic_int4 (block size 32) | 1280 | 107 | 48 | 3.49 | 1,045 | | 688 |
| | | 4096 | 79 | 36 | 4.40 | 1,210 | | 688 |
| | dynamic_int4 QAT | 2048 | 322 | 47 | 3.10 | 1,138 | | 529 |
| | dynamic_int8 | 1280 | 177 | 33 | 1.69 | 1,341 | | 1,005 |
| | | 4096 | 123 | 29 | 2.34 | 1,504 | | 1,005 |
| GPU | dynamic_int4 QAT | 2048 | 2585 | 56 | 4.50 | 1,205 | | 529 |
| | dynamic_int8 | 1280 | 1191 | 24 | 4.68 | 2,164 | 1,059 | 1,005 |
| | | 4096 | 814 | 24 | 4.99 | 2,167 | 1,181 | 1,005 |
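The CPU and GPU rows above are the same quantized bundles dispatched to different backends. As a hedged sketch of how an app would request the GPU path: `setPreferredBackend` and the `Backend` enum below are assumptions based on recent MediaPipe Tasks GenAI releases, so verify them against the version you depend on.

```kotlin
import com.google.mediapipe.tasks.genai.llminference.LlmInference

// Sketch: ask MediaPipe for the GPU backend benchmarked above. The model
// path is a placeholder, and setPreferredBackend is assumed from recent
// MediaPipe releases -- older versions may pick the backend automatically.
val gpuOptions = LlmInference.LlmInferenceOptions.builder()
    .setModelPath("/data/local/tmp/llm/gemma3-1b-it.task")
    .setPreferredBackend(LlmInference.Backend.GPU)
    .build()
```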
### Web

| Backend | Quantization scheme | Precision | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | CPU Memory (MB) | GPU Memory (MB) | Model size (MB) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPU | dynamic_int4 | F16 | 4339 | 133 | 0.51 | 460 | 1,331 | 700 |
| | | F32 | 2837 | 134 | 0.49 | 481 | 1,331 | 700 |
| | dynamic_int4 QAT | F16 | 1702 | 77 | | | | 529 |
| | dynamic_int8 | F16 | 4321 | 126 | 0.60 | 471 | 1,740 | 1,011 |
| | | F32 | 2805 | 129 | 0.58 | 474 | 1,740 | 1,011 |
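For context, the context-length figures in the Android table correspond to the total token budget (prompt plus response) configured when the engine is created, and streaming the decode makes responses visible well before the full generation finishes. A small sketch follows, assuming the `setResultListener`/`generateResponseAsync` pair from the current MediaPipe Android API:

```kotlin
import android.content.Context
import com.google.mediapipe.tasks.genai.llminference.LlmInference

// Sketch: mirror the 1280-token benchmark rows and stream partial results.
// setResultListener/generateResponseAsync are assumptions based on the
// current MediaPipe Tasks GenAI API; confirm against your release.
fun streamGemma(context: Context, prompt: String) {
    val options = LlmInference.LlmInferenceOptions.builder()
        .setModelPath("/data/local/tmp/llm/gemma3-1b-it.task") // placeholder
        .setMaxTokens(1280) // same context length as the benchmarked rows
        .setResultListener { partial, done ->
            print(partial)      // tokens arrive as they are decoded
            if (done) println()
        }
        .build()

    LlmInference.createFromOptions(context, options)
        .generateResponseAsync(prompt)
}
```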