---
library_name: keras-hub
license: apache-2.0
language:
- en
tags:
- text-generation-inference
- keras
pipeline_tag: text-generation
---
### Model Overview

Falcon-RW-1B is a 1B-parameter causal decoder-only model built by [TII](https://www.tii.ae/) and trained on 350B tokens of [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb). The architecture is adapted from the GPT-3 paper ([Brown et al., 2020](https://arxiv.org/abs/2005.14165)), but uses ALiBi.

## Links

* [Falcon Quickstart Notebook](https://www.kaggle.com/code/laxmareddypatlolla/falcon-quickstart-notebook)
* [Falcon API Documentation](https://keras.io/keras_hub/api/models/falcon/)
* [Falcon Model Card](https://huggingface.co/docs/transformers/en/model_doc/falcon)
* [KerasHub Beginner Guide](https://keras.io/guides/keras_hub/getting_started/)
* [KerasHub Model Publishing Guide](https://keras.io/guides/keras_hub/upload/)

## Presets

The following model checkpoints are provided by the Keras team. Full code examples for each are available below.
| Preset name    | Parameters | Description                                      |
|----------------|------------|--------------------------------------------------|
| falcon_refinedweb_1b_en |   1.31B  | 24-layer Falcon model, trained on 350B tokens of the RefinedWeb dataset.|


## Use

### Direct Use
Research on large language models, specifically the influence of adequately filtered and deduplicated web data on the properties of large language models (fairness, safety, limitations, capabilities, etc.).

### Out-of-scope Use
Production use without adequate assessment of risks and mitigation; any use cases which may be considered irresponsible or harmful.

## Bias, Risks, and Limitations

Falcon-RW-1B is trained on English data only, and will not generalize appropriately to other languages. Furthermore, as it was trained on a large-scale corpus representative of the web, it will carry the stereotypes and biases commonly encountered online.

## Recommendations

We recommend that users of Falcon-RW-1B consider fine-tuning it for their specific tasks of interest, and that guardrails and appropriate precautions be taken for any production use.
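
As a minimal sketch of what such fine-tuning could look like with KerasHub (the dataset, optimizer, and learning rate below are illustrative placeholders, not values from the original training run):

```python
import os

os.environ["KERAS_BACKEND"] = "jax"

import keras
import keras_hub

# Load the pretrained model; the attached preprocessor tokenizes raw strings.
causal_lm = keras_hub.models.FalconCausalLM.from_preset(
    "falcon_refinedweb_1b_en"
)

# Placeholder corpus: replace with your own task-specific text data.
train_texts = [
    "Example document one for the downstream task.",
    "Example document two for the downstream task.",
]

causal_lm.compile(
    optimizer=keras.optimizers.AdamW(learning_rate=5e-5),
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    weighted_metrics=[keras.metrics.SparseCategoricalAccuracy()],
)
causal_lm.fit(x=train_texts, batch_size=2, epochs=1)
```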

## Training Details

### Training Data
Falcon-RW-1B was trained on 350B tokens of RefinedWeb, a high-quality filtered and deduplicated web dataset. The data was tokenized with the GPT-2 tokenizer.
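The KerasHub preset bundles a matching tokenizer, so the vocabulary can be inspected directly (a small sketch; it assumes the preset's tokenizer assets rather than the original training pipeline):

```python
import keras_hub

# Load the tokenizer bundled with the KerasHub preset.
tokenizer = keras_hub.models.FalconTokenizer.from_preset(
    "falcon_refinedweb_1b_en"
)
token_ids = tokenizer("The quick brown fox.")
print(token_ids)                    # integer token IDs
print(tokenizer.vocabulary_size())  # size of the BPE vocabulary
```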

### Training Procedure
Falcon-RW-1B was trained on 32 A100 40GB GPUs, using only data parallelism with ZeRO.

### Training Hyperparameters
Hyperparameters were adapted from the GPT-3 paper ([Brown et al., 2020](https://arxiv.org/abs/2005.14165)).

| Hyperparameter | Value    | Comment                                   |
|----------------|----------|-------------------------------------------|
| Precision      | bfloat16 |                                           |
| Optimizer      | AdamW    |                                           |
| Learning rate  | 2e-4     | 500M tokens warm-up, cosine decay to 2e-5 |
| Weight decay   | 1e-1     |                                           |
| Batch size     | 512      | 4B tokens ramp-up                         |
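
For reference, a roughly equivalent optimizer configuration can be expressed in Keras as follows (a sketch only; the step counts are placeholder assumptions, since the original schedule was specified in tokens rather than steps):

```python
import keras

# Linear warm-up to 2e-4, then cosine decay to a floor of 2e-5 (alpha = 0.1).
# `warmup_steps` and `decay_steps` are placeholders; the original run warmed
# up over 500M tokens and decayed over the remainder of the 350B tokens.
schedule = keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.0,
    decay_steps=100_000,
    alpha=0.1,  # floor = 0.1 * 2e-4 = 2e-5
    warmup_target=2e-4,
    warmup_steps=1_000,
)
optimizer = keras.optimizers.AdamW(learning_rate=schedule, weight_decay=1e-1)
```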

### Speeds, Sizes, Times
Training happened in early December 2022 and took about six days.

### Evaluation
See the [paper on arXiv](https://arxiv.org/abs/2306.01116) for in-depth evaluation.

## Technical Specifications

### Model Architecture and Objective
Falcon-RW-1B is a causal decoder-only model trained on a causal language modeling task (i.e., predict the next token).

The architecture is adapted from the GPT-3 paper ([Brown et al., 2020](https://arxiv.org/abs/2005.14165)), but uses ALiBi ([Press et al., 2021](https://arxiv.org/abs/2108.12409)); a short sketch of the ALiBi bias follows the table below.

| **Hyperparameter** | **Value** |
|:------------------:|:---------:|
| Layers             | 24        |
| d_model            | 2048      |
| head_dim           | 64        |
| Vocabulary         | 50304     |
| Sequence length    | 2048      |
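
To make the ALiBi choice concrete, here is a small NumPy sketch of how the additive attention bias is typically constructed (illustrative only, not the KerasHub implementation):

```python
import numpy as np

def alibi_bias(num_heads: int, seq_len: int) -> np.ndarray:
    """Build an additive attention bias of shape (num_heads, seq_len, seq_len)."""
    # Per-head slopes form a geometric sequence: 2^(-8/n), 2^(-16/n), ..., 2^(-8).
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    # Relative position of key j with respect to query i: (j - i), <= 0 in the past.
    positions = np.arange(seq_len)
    relative = positions[None, :] - positions[:, None]
    # Penalize attention to distant past tokens linearly; mask out future tokens.
    bias = slopes[:, None, None] * relative[None, :, :]
    bias = np.where(relative[None, :, :] > 0, -np.inf, bias)
    return bias

# d_model=2048 with head_dim=64 implies 32 attention heads.
print(alibi_bias(num_heads=32, seq_len=4)[0])
```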

## Citation
```
@article{refinedweb,
  title={The {R}efined{W}eb dataset for {F}alcon {LLM}: outperforming curated corpora with web data, and web data only},
  author={Guilherme Penedo and Quentin Malartic and Daniel Hesslow and Ruxandra Cojocaru and Alessandro Cappelli and Hamza Alobeidli and Baptiste Pannier and Ebtesam Almazrouei and Julien Launay},
  journal={arXiv preprint arXiv:2306.01116},
  eprint={2306.01116},
  eprinttype = {arXiv},
  url={https://arxiv.org/abs/2306.01116},
  year={2023}
}
```

## Example Usage
```python

import os

os.environ["KERAS_BACKEND"] = "jax"

import keras
import keras_hub

# When running only inference, bfloat16 significantly reduces memory usage.
keras.config.set_floatx("bfloat16")

causal_lm = keras_hub.models.FalconCausalLM.from_preset(
    "falcon_refinedweb_1b_en"
)
causal_lm.summary()

outputs = causal_lm.generate([
    "What is Jax?",
    "Give me your best brownie recipe.",
], max_length=512)

```
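
The sampling strategy used by `generate()` can be changed via `compile()`; for example, a top-k sampler (shown here as a sketch, with `k=50` as an arbitrary choice):

```python
causal_lm.compile(sampler=keras_hub.samplers.TopKSampler(k=50))
outputs = causal_lm.generate("What is Jax?", max_length=256)
```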

## Example Usage with Hugging Face URI

```python

import os

os.environ["KERAS_BACKEND"] = "jax"

import keras
import keras_hub

# When running only inference, bfloat16 significantly reduces memory usage.
keras.config.set_floatx("bfloat16")

causal_lm = keras_hub.models.FalconCausalLM.from_preset(
    "hf://keras/falcon_refinedweb_1b_en"
)
causal_lm.summary()

outputs = causal_lm.generate([
    "What is Jax?",
    "Give me your best brownie recipe.",
], max_length=512)

```