---
license: gemma
base_model: google/Gemma-3-1B-IT
pipeline_tag: text-generation
tags:
- chat
extra_gated_heading: Access Gemma3-1B-IT on Hugging Face
extra_gated_prompt: >-
  To access Gemma3-1B-IT on Hugging Face, you are required to review and agree
  to the gemma license. To do this, please ensure you are logged in to Hugging
  Face and click below. Requests are processed immediately.
extra_gated_button_content: Acknowledge license
---

# litert-community/Gemma3-1B-IT

This model provides a few variants of [google/Gemma-3-1B-IT](https://huggingface.co/google/Gemma-3-1B-IT) that are ready for deployment on Android using the [LiteRT (fka TFLite) stack](https://ai.google.dev/edge/litert) and the [MediaPipe LLM Inference API](https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference).

## Use the models

### Colab

*Disclaimer: The target deployment surfaces for the LiteRT models are Android, iOS, and the Web, and the stack has been optimized for performance on those targets. Trying the system out in Colab is an easier way to familiarize yourself with the LiteRT stack, with the caveat that performance (memory and latency) on Colab could be much worse than on a local device.*

[Open in Colab](https://colab.sandbox.google.com/github/google-ai-edge/mediapipe-samples/blob/main/codelabs/litert_inference/gemma3_1b_tflite.ipynb)

### Customize

Fine-tune Gemma 3 1B and deploy it with either LiteRT or the MediaPipe LLM Inference API:

[Open the fine-tuning notebook in Colab](https://colab.research.google.com/#fileId=https://github.com/google-ai-edge/mediapipe-samples/blob/main/codelabs/litert_inference/Gemma3_1b_fine_tune.ipynb)

### Android

* Download and install [the apk](https://github.com/google-ai-edge/gallery/releases/latest/download/ai-edge-gallery.apk).
* Follow the instructions in the app.

To build the demo app from source, please follow the [instructions](https://github.com/google-ai-edge/gallery/blob/main/README.md) in the GitHub repository. A minimal Kotlin sketch of the underlying LLM Inference API appears after the iOS instructions below.

### iOS

* Clone the [MediaPipe samples](https://github.com/google-ai-edge/mediapipe-samples) repository and follow the [instructions](https://github.com/google-ai-edge/mediapipe-samples/tree/main/examples/llm_inference/ios/README.md) to build the LLM Inference iOS Sample App using Xcode.
* Run the app via the iOS simulator or deploy it to an iOS device.
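To give a concrete picture of the app-level integration, here is a minimal Kotlin sketch of the MediaPipe LLM Inference API on Android. The model file name and on-device path are placeholders, and the option names reflect the current MediaPipe Tasks GenAI API; treat this as a starting point rather than the exact code used by the demo app.

```kotlin
import android.content.Context
import com.google.mediapipe.tasks.genai.llminference.LlmInference

// Minimal sketch: load a Gemma3-1B-IT LiteRT bundle and run a single prompt.
// The path below is a placeholder -- push the downloaded model file to the
// device (e.g. with adb) and point setModelPath at its actual location.
fun runGemma(context: Context, prompt: String): String {
    val options = LlmInference.LlmInferenceOptions.builder()
        .setModelPath("/data/local/tmp/llm/gemma3-1b-it.task")
        .setMaxTokens(1024) // total budget shared by prompt and response
        .build()

    // Creating the engine loads the model; reuse one instance across prompts.
    val llm = LlmInference.createFromOptions(context, options)
    val response = llm.generateResponse(prompt)
    llm.close()
    return response
}
```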
## Performance

### Android

Note that all benchmark stats below are from a Samsung S24 Ultra, with multiple prefill signatures enabled.

| Backend | Quantization scheme | Context length | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | CPU Memory (RSS in MB) | GPU Memory (RSS in MB) | Model size (MB) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CPU | fp32 (baseline) | 1280 | 49 | 10 | 5.59 | 4,123 | | 3,824 |
| | dynamic_int4 (block size 128) | 1280 | 138 | 50 | 2.33 | 982 | | 657 |
| | | 4096 | 87 | 37 | 3.40 | 1,145 | | 657 |
| | dynamic_int4 (block size 32) | 1280 | 107 | 48 | 3.49 | 1,045 | | 688 |
| | | 4096 | 79 | 36 | 4.40 | 1,210 | | 688 |
| | dynamic_int4 QAT | 2048 | 322 | 47 | 3.10 | 1,138 | | 529 |
| | dynamic_int8 | 1280 | 177 | 33 | 1.69 | 1,341 | | 1,005 |
| | | 4096 | 123 | 29 | 2.34 | 1,504 | | 1,005 |
| GPU | dynamic_int4 QAT | 2048 | 2585 | 56 | 4.50 | 1,205 | | 529 |
| | dynamic_int8 | 1280 | 1191 | 24 | 4.68 | 2,164 | 1,059 | 1,005 |
| | | 4096 | 814 | 24 | 4.99 | 2,167 | 1,181 | 1,005 |
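The CPU and GPU rows above are the same quantized bundles dispatched to different backends. As a hedged sketch of how an app would request the GPU path: `setPreferredBackend` and the `Backend` enum below are assumptions based on recent MediaPipe Tasks GenAI releases, so verify them against the version you depend on.

```kotlin
import com.google.mediapipe.tasks.genai.llminference.LlmInference

// Sketch: ask MediaPipe for the GPU backend benchmarked above. The model
// path is a placeholder, and setPreferredBackend is assumed from recent
// MediaPipe releases -- older versions may pick the backend automatically.
val gpuOptions = LlmInference.LlmInferenceOptions.builder()
    .setModelPath("/data/local/tmp/llm/gemma3-1b-it.task")
    .setPreferredBackend(LlmInference.Backend.GPU)
    .build()
```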
### Web

| Backend | Quantization scheme | Precision | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | CPU Memory (MB) | GPU Memory (MB) | Model size (MB) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPU | dynamic_int4 | F16 | 4339 | 133 | 0.51 | 460 | 1,331 | 700 |
| | | F32 | 2837 | 134 | 0.49 | 481 | 1,331 | 700 |
| | dynamic_int4 QAT | F16 | 1702 | 77 | | | | 529 |
| | dynamic_int8 | F16 | 4321 | 126 | 0.60 | 471 | 1,740 | 1,011 |
| | | F32 | 2805 | 129 | 0.58 | 474 | 1,740 | 1,011 |
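For context, the context-length figures in the Android table correspond to the total token budget (prompt plus response) configured when the engine is created, and streaming the decode makes responses visible well before the full generation finishes. A small sketch follows, assuming the `setResultListener`/`generateResponseAsync` pair from the current MediaPipe Android API:

```kotlin
import android.content.Context
import com.google.mediapipe.tasks.genai.llminference.LlmInference

// Sketch: mirror the 1280-token benchmark rows and stream partial results.
// setResultListener/generateResponseAsync are assumptions based on the
// current MediaPipe Tasks GenAI API; confirm against your release.
fun streamGemma(context: Context, prompt: String) {
    val options = LlmInference.LlmInferenceOptions.builder()
        .setModelPath("/data/local/tmp/llm/gemma3-1b-it.task") // placeholder
        .setMaxTokens(1280) // same context length as the benchmarked rows
        .setResultListener { partial, done ->
            print(partial)      // tokens arrive as they are decoded
            if (done) println()
        }
        .build()

    LlmInference.createFromOptions(context, options)
        .generateResponseAsync(prompt)
}
```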