# RecurrentGemma - 2B & 2B-it

RecurrentGemma is a family of open language models built on a novel recurrent architecture developed at Google. Both pre-trained (2B) and instruction-tuned (2B-it) versions are available in English.

Like Gemma, [RecurrentGemma](https://huggingface.co/google/recurrentgemma-2b-it) models are well-suited for a variety of text generation tasks, including question answering, summarization, and reasoning. Because of its novel architecture, RecurrentGemma requires less memory than Gemma and achieves faster inference when generating long sequences.
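
As a rough illustration of the memory claim: a transformer with global attention caches keys and values for every previous token, so its cache grows linearly with sequence length. A model that combines a fixed-size recurrent state with attention over a bounded window stops growing once the window is full. The sketch below compares the two growth patterns. The function names, the layer and head dimensions, and the 2048-token window are illustrative assumptions, not RecurrentGemma's actual configuration, and the fixed-size recurrent state is left out for simplicity.

```python
def global_attention_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                                 bytes_per_value=2):
    """Key/value cache size when every past token is kept (float16 by default)."""
    # Factor 2 accounts for storing both keys and values.
    return 2 * num_layers * seq_len * num_kv_heads * head_dim * bytes_per_value


def bounded_cache_bytes(seq_len, window, num_layers, num_kv_heads, head_dim,
                        bytes_per_value=2):
    """Cache size when attention only looks at the last `window` tokens."""
    return global_attention_cache_bytes(
        min(seq_len, window), num_layers, num_kv_heads, head_dim, bytes_per_value
    )


# Hypothetical dimensions, chosen only to make the trend visible.
dims = dict(num_layers=8, num_kv_heads=1, head_dim=256)

for seq_len in (1024, 8192, 65536):
    full = global_attention_cache_bytes(seq_len, **dims)
    bounded = bounded_cache_bytes(seq_len, window=2048, **dims)
    print(f"{seq_len:>6} tokens: global={full / 2**20:.0f} MiB, bounded={bounded / 2**20:.0f} MiB")
```

Under these assumptions the global cache keeps growing with the sequence, while the bounded cache is constant for any sequence longer than the window.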

In [None]:
!pip install git+https://github.com/huggingface/transformers.git

Collecting transformers==4.40.0.dev0
  Downloading https://huggingface.co/datasets/reach-vb/random-wheels/resolve/main/transformers-4.40.0.dev0-py3-none-any.whl (8.8 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.38.2
    Uninstalling transformers-4.38.2:
      Successfully uninstalled transformers-4.38.2
Successfully installed transformers-4.40.0.dev0


## Load the model checkpoints

Before running the code below, make sure you have accepted the model's terms and conditions at https://huggingface.co/google/recurrentgemma-2b-it.


In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "google/recurrentgemma-2b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load the weights in half precision and move the model to the first GPU.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda:0")


## Prepare the input text with the chat template

The instruction-tuned models use a chat template that must be adhered to for conversational use. The easiest way to apply it is using the tokenizer's built-in chat template, as shown in the following snippet.

With the model and tokenizer already loaded, we can apply the chat template to a conversation. In this example, we'll start with a single user interaction:

In [None]:
chat = [
    { "role": "user", "content": "Write a hello world program" },
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
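
To make the template less of a black box, here is a minimal hand-written approximation of what it produces. It assumes the Gemma turn format visible in the generated output later in this notebook (`<start_of_turn>`, `<end_of_turn>`, and a `model` role for the assistant) plus a leading `<bos>` token. The helper `format_gemma_chat` is hypothetical and only for illustration; in real code, keep using `tokenizer.apply_chat_template`.

```python
def format_gemma_chat(chat, add_generation_prompt=True, bos_token="<bos>"):
    """Hypothetical re-implementation of the chat template, for illustration only."""
    parts = [bos_token]
    for message in chat:
        # Assumption: the template calls the assistant role "model".
        role = "model" if message["role"] == "assistant" else message["role"]
        content = message["content"].strip()
        parts.append(f"<start_of_turn>{role}\n{content}<end_of_turn>\n")
    if add_generation_prompt:
        # Open a model turn so generation continues as the model's reply.
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


chat = [{"role": "user", "content": "Write a hello world program"}]
manual_prompt = format_gemma_chat(chat)
print(repr(manual_prompt))
```

If the assumption about the leading `<bos>` holds, it also explains why the tokenization step passes `add_special_tokens=False`: the prompt string already carries its special tokens.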

## Tokenize the inputs

In [None]:
# The chat template has already formatted the prompt, so do not add special tokens again.
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

## Pass the input through the model and generate

In [None]:
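# `inputs` is already on the model's device; generate at most 150 new tokens.
outputs = model.generate(input_ids=inputs, max_new_tokens=150)
# The decoded text contains the prompt followed by the model's reply.
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))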

['<start_of_turn>user\nWrite a hello world program<end_of_turn>\n<start_of_turn>model\n```python\nprint("Hello, world!")\n```\n\nThis program will print the message "Hello, world!" to the console.\n\n**Explanation:**\n\n* `print()` is a built-in Python function that prints the given argument to the console.\n* `"Hello, world!"` is the string that will be printed.\n\n**Output:**\n\n```\nHello, world!\n```']
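
The decoded string above still contains the prompt and the turn markers. If you only want the model's reply, one option is to cut the text at the last model turn. The sketch below does that on a shortened stand-in for the output above; `extract_reply` is a hypothetical helper, and the marker strings are taken from the printed output.

```python
def extract_reply(decoded, model_marker="<start_of_turn>model\n",
                  end_marker="<end_of_turn>"):
    """Return the text after the last model turn marker (hypothetical helper)."""
    # rpartition keeps everything after the last marker; if the marker is
    # missing, the whole string is returned unchanged.
    _, _, reply = decoded.rpartition(model_marker)
    # Drop a closing turn marker and anything after it, if present.
    reply = reply.split(end_marker, 1)[0]
    return reply.strip()


# Shortened stand-in for the decoded output printed above.
decoded = (
    "<start_of_turn>user\nWrite a hello world program<end_of_turn>\n"
    '<start_of_turn>model\nprint("Hello, world!")'
)
reply = extract_reply(decoded)
print(reply)
```

An alternative that avoids string parsing is to slice off the prompt tokens before decoding, for example `outputs[:, inputs.shape[1]:]`, since `generate` returns the prompt tokens followed by the newly generated ones for this kind of model.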


Enjoy! There's much more you can do to control and improve the quality of the generated text. Check out the generation strategies guide: https://huggingface.co/docs/transformers/generation_strategies