|
--- |
|
library_name: keras-hub |
|
license: apache-2.0 |
|
language: |
|
- en |
|
tags: |
|
- text-generation-inference |
|
- keras |
|
pipeline_tag: text-generation |
|
--- |
|
## Model Overview
|
|
|
Falcon-RW-1B is a 1B-parameter causal decoder-only model built by [TII](https://www.tii.ae/) and trained on 350B tokens of [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb). The architecture is adapted from the GPT-3 paper ([Brown et al., 2020](https://arxiv.org/abs/2005.14165)), but uses ALiBi positional encoding ([Press et al., 2021](https://arxiv.org/abs/2108.12409)).
|
|
|
## Links |
|
|
|
* [Falcon Quickstart Notebook](https://www.kaggle.com/code/laxmareddypatlolla/falcon-quickstart-notebook) |
|
* [Falcon API Documentation](https://keras.io/keras_hub/api/models/falcon/) |
|
* [Falcon Model Card](https://huggingface.co/docs/transformers/en/model_doc/falcon) |
|
* [KerasHub Beginner Guide](https://keras.io/guides/keras_hub/getting_started/) |
|
* [KerasHub Model Publishing Guide](https://keras.io/guides/keras_hub/upload/) |
|
|
|
## Presets |
|
|
|
The following model checkpoints are provided by the Keras team. Full code examples for each are available below. |
|
| Preset name | Parameters | Description | |
|
|----------------|------------|--------------------------------------------------| |
|
| falcon_refinedweb_1b_en | 1.31B | 24-layer Falcon model (Falcon with 1B parameters), trained on 350B tokens of the RefinedWeb dataset. |
|
|
|
|
|
## Use |
|
|
|
### Direct Use |
|
Research on large language models, specifically the influence of adequately filtered and deduplicated web data on the properties of large language models (fairness, safety, limitations, capabilities, etc.). |
|
|
|
### Out-of-scope Use |
|
Production use without adequate assessment of risks and mitigation; any use cases which may be considered irresponsible or harmful. |
|
|
|
## Bias, Risks, and Limitations |
|
|
|
Falcon-RW-1B is trained on English data only and will not generalize appropriately to other languages. Furthermore, as it is trained on a large-scale corpus representative of the web, it will carry the stereotypes and biases commonly encountered online.
|
|
|
## Recommendations |
|
|
|
We recommend that users of Falcon-RW-1B consider fine-tuning it for their specific tasks of interest, and that guardrails and appropriate precautions be taken for any production use.
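A minimal sketch of such fine-tuning, assuming the generic KerasHub `CausalLM` workflow (models loaded with `from_preset` come compiled with default loss and optimizer, and accept raw strings that are preprocessed internally); the training strings below are purely illustrative placeholders:

```python
import os

os.environ["KERAS_BACKEND"] = "jax"

import keras_hub

# Load the pretrained Falcon causal LM (see the full usage examples below).
causal_lm = keras_hub.models.FalconCausalLM.from_preset(
    "falcon_refinedweb_1b_en"
)

# Illustrative placeholder data; substitute a domain-specific corpus.
features = [
    "Keras is a deep learning framework with multiple backends.",
    "RefinedWeb is a filtered and deduplicated web dataset.",
]

# Short fine-tuning run; tune batch size and epochs to the available hardware.
causal_lm.fit(x=features, batch_size=2, epochs=1)
```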
|
|
|
## Training Details |
|
|
|
### Training Data |
|
Falcon-RW-1B was trained on 350B tokens of RefinedWeb, a high-quality filtered and deduplicated web dataset. The data was tokenized with the GPT-2 tokenizer. |
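The same byte-pair-encoding vocabulary ships with the KerasHub preset. As a small sketch, assuming the `FalconTokenizer` class from the Falcon API linked above, tokenization can be inspected directly:

```python
import keras_hub

# GPT-2-style BPE tokenizer bundled with the preset.
tokenizer = keras_hub.models.FalconTokenizer.from_preset(
    "falcon_refinedweb_1b_en"
)

token_ids = tokenizer("The quick brown fox jumps over the lazy dog.")
print(token_ids)                         # integer token ids
print(tokenizer.detokenize(token_ids))   # round-trips back to text
```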
|
|
|
### Training Procedure |
|
Falcon-RW-1B was trained on 32 A100 40GB GPUs, using only data parallelism with ZeRO. |
|
|
|
### Training Hyperparameters |
|
Hyperparameters were adapted from the GPT-3 paper ([Brown et al., 2020](https://arxiv.org/abs/2005.14165)). |
|
|
|
| Hyperparameter | Value | Comment | |
|
|----------------|----------|-------------------------------------------| |
|
| Precision | bfloat16 | | |
|
| Optimizer | AdamW | | |
|
| Learning rate | 2e-4 | 500M tokens warm-up, cosine decay to 2e-5 | |
|
| Weight decay | 1e-1 | | |
|
| Batch size | 512 | 4B tokens ramp-up | |
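For reference, a schedule of this shape (linear warm-up to 2e-4 followed by cosine decay toward 2e-5) can be sketched with Keras built-ins; the step counts below are hypothetical placeholders, since the original run measured warm-up in tokens rather than steps:

```python
import keras

# Hypothetical step counts, for illustration only.
warmup_steps = 1_000
total_steps = 100_000

lr_schedule = keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.0,   # start of the linear warm-up
    warmup_target=2e-4,          # peak learning rate
    warmup_steps=warmup_steps,
    decay_steps=total_steps - warmup_steps,
    alpha=0.1,                   # decay floor at 0.1 * 2e-4 = 2e-5
)

optimizer = keras.optimizers.AdamW(
    learning_rate=lr_schedule,
    weight_decay=0.1,
)
```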
|
|
|
### Speeds, Sizes, Times |
|
Training happened in early December 2022 and took about six days. |
|
|
|
### Evaluation |
|
See the [paper on arXiv](https://arxiv.org/abs/2306.01116) for in-depth evaluation. |
|
|
|
## Technical Specifications |
|
|
|
### Model Architecture and Objective |
|
Falcon-RW-1B is a causal decoder-only model trained on a causal language modeling task (i.e., predict the next token). |
|
|
|
The architecture is adapted from the GPT-3 paper ([Brown et al., 2020](https://arxiv.org/abs/2005.14165)), but uses ALiBi ([Press et al., 2021](https://arxiv.org/abs/2108.12409)).
|
|
|
| **Hyperparameter** | **Value** | |
|
|:------------------:|:---------:| |
|
| Layers | 24 | |
|
| d_model | 2048 | |
|
| head_dim | 64 | |
|
| Vocabulary | 50304 | |
|
| Sequence length | 2048 | |
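These values can be cross-checked against the preset itself; the sketch below loads only the backbone, assuming the `FalconBackbone` class from the Falcon API linked above:

```python
import keras_hub

# Load the Transformer backbone without the language-modeling head setup.
backbone = keras_hub.models.FalconBackbone.from_preset(
    "falcon_refinedweb_1b_en"
)

# Layer-by-layer summary and the serialized architecture hyperparameters
# (number of layers, hidden size, vocabulary size, ...).
backbone.summary()
print(backbone.get_config())
```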
|
|
|
## Citation |
|
``` |
|
@article{refinedweb, |
|
title={The {R}efined{W}eb dataset for {F}alcon {LLM}: outperforming curated corpora with web data, and web data only}, |
|
author={Guilherme Penedo and Quentin Malartic and Daniel Hesslow and Ruxandra Cojocaru and Alessandro Cappelli and Hamza Alobeidli and Baptiste Pannier and Ebtesam Almazrouei and Julien Launay}, |
|
journal={arXiv preprint arXiv:2306.01116}, |
|
eprint={2306.01116}, |
|
eprinttype = {arXiv}, |
|
url={https://arxiv.org/abs/2306.01116}, |
|
year={2023} |
|
} |
|
``` |
|
|
|
## Example Usage |
|
```python
|
|
|
import os |
|
|
|
os.environ["KERAS_BACKEND"] = "jax" |
|
|
|
import keras |
|
import keras_hub |
|
|
|
# When running only inference, bfloat16 significantly reduces memory usage.
|
keras.config.set_floatx("bfloat16") |
|
|
|
causal_lm = keras_hub.models.FalconCausalLM.from_preset( |
|
"falcon_refinedweb_1b_en" |
|
) |
|
causal_lm.summary() |
|
|
|
outputs = causal_lm.generate([ |
|
"What is Jax?", |
|
"Give me your best brownie recipe.", |
|
], max_length=512) |
|
|
|
``` |
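Continuing from the snippet above, and assuming the generic KerasHub `CausalLM` compile/sampler API, the decoding strategy used by `generate()` can be swapped by recompiling with a different sampler:

```python
# Greedy decoding instead of the default sampler.
causal_lm.compile(sampler="greedy")
print(causal_lm.generate("What is Jax?", max_length=64))

# Or an explicit sampler object, e.g. top-k sampling.
causal_lm.compile(sampler=keras_hub.samplers.TopKSampler(k=50))
print(causal_lm.generate("What is Jax?", max_length=64))
```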
|
|
|
## Example Usage with Hugging Face URI |
|
|
|
```python
|
|
|
import os |
|
|
|
os.environ["KERAS_BACKEND"] = "jax" |
|
|
|
import keras |
|
import keras_hub |
|
|
|
# When running only inference, bfloat16 significantly reduces memory usage.
|
keras.config.set_floatx("bfloat16") |
|
|
|
causal_lm = keras_hub.models.FalconCausalLM.from_preset( |
|
"hf://keras/falcon_refinedweb_1b_en" |
|
) |
|
causal_lm.summary() |
|
|
|
outputs = causal_lm.generate([ |
|
"What is Jax?", |
|
"Give me your best brownie recipe.", |
|
], max_length=512) |
|
|
|
``` |
|
|