---
library_name: keras-hub
license: apache-2.0
language:
- en
tags:
- text-generation-inference
- keras
pipeline_tag: text-generation
---
## Model Overview
### Model Summary
Falcon-RW-1B is a 1B-parameter causal decoder-only model built by [TII](https://www.tii.ae/) and trained on 350B tokens of [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb). The architecture is adapted from the GPT-3 paper ([Brown et al., 2020](https://arxiv.org/abs/2005.14165)), but uses ALiBi (attention with linear biases) to encode position.
## Links
* [Falcon Quickstart Notebook](https://www.kaggle.com/code/laxmareddypatlolla/falcon-quickstart-notebook)
* [Falcon API Documentation](https://keras.io/keras_hub/api/models/falcon/)
* [Falcon Model Card](https://huggingface.co/docs/transformers/en/model_doc/falcon)
* [KerasHub Beginner Guide](https://keras.io/guides/keras_hub/getting_started/)
* [KerasHub Model Publishing Guide](https://keras.io/guides/keras_hub/upload/)
## Presets
The following model checkpoints are provided by the Keras team. Full code examples for each are available below.
| Preset name | Parameters | Description |
|----------------|------------|--------------------------------------------------|
| falcon_refinedweb_1b_en | 1.31B | 24-layer Falcon model (Falcon with 1B parameters), trained on 350B tokens of the RefinedWeb dataset. |
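The 1.31B figure in the table can be roughly reproduced from the dimensions listed under Technical Specifications (24 layers, `d_model` of 2048, vocabulary of 50304). The sketch below is a back-of-the-envelope estimate, not the exact count reported by KerasHub. It assumes a GPT-3-style block with a 4x MLP expansion, tied input and output embeddings, and no learned position embeddings (ALiBi adds none); biases and layer-norm weights are ignored. The function name `estimate_parameters` and the `mlp_ratio` argument are illustrative only.

```python
NUM_LAYERS = 24      # from the Technical Specifications table
D_MODEL = 2048       # from the Technical Specifications table
VOCAB_SIZE = 50304   # from the Technical Specifications table


def estimate_parameters(num_layers, d_model, vocab_size, mlp_ratio=4):
    """Rough weight count for a GPT-3-style decoder (assumed layout)."""
    # Query, key, value, and output projections: four d_model x d_model matrices.
    attention = 4 * d_model * d_model
    # Two MLP matrices: d_model -> mlp_ratio * d_model -> d_model (assumed 4x).
    mlp = 2 * mlp_ratio * d_model * d_model
    # Token embedding, assumed to be shared with the output projection.
    embeddings = vocab_size * d_model
    return num_layers * (attention + mlp) + embeddings


total = estimate_parameters(NUM_LAYERS, D_MODEL, VOCAB_SIZE)
print(f"{total / 1e9:.2f}B")  # prints 1.31B
```

Under these assumptions the transformer blocks account for about 1.21B weights and the embedding table for about 0.10B.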
## Use
### Direct Use
Research on large language models, specifically the influence of adequately filtered and deduplicated web data on the properties of large language models (fairness, safety, limitations, capabilities, etc.).
### Out-of-scope Use
Production use without adequate assessment of risks and mitigation; any use cases which may be considered irresponsible or harmful.
## Bias, Risks, and Limitations
Falcon-RW-1B is trained on English data only and will not generalize appropriately to other languages. Furthermore, as it is trained on a large-scale corpus representative of the web, it will carry the stereotypes and biases commonly encountered online.
## Recommendations
We recommend that users of Falcon-RW-1B consider fine-tuning it for their specific tasks of interest, and that guardrails and appropriate precautions be put in place for any production use.
## Training Details
### Training Data
Falcon-RW-1B was trained on 350B tokens of RefinedWeb, a high-quality filtered and deduplicated web dataset. The data was tokenized with the GPT-2 tokenizer.
### Training Procedure
Falcon-RW-1B was trained on 32 A100 40GB GPUs, using only data parallelism with ZeRO.
### Training Hyperparameters
Hyperparameters were adapted from the GPT-3 paper ([Brown et al., 2020](https://arxiv.org/abs/2005.14165)).
| Hyperparameter | Value | Comment |
|----------------|----------|-------------------------------------------|
| Precision | bfloat16 | |
| Optimizer | AdamW | |
| Learning rate | 2e-4 | 500M tokens warm-up, cosine decay to 2e-5 |
| Weight decay | 1e-1 | |
| Batch size | 512 | 4B tokens ramp-up |
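The learning-rate row above can be read as a schedule over tokens seen. The following is a minimal sketch, not the training code used by TII. It assumes the warm-up is linear (the table does not state its shape), that the cosine decay spans all remaining tokens up to the 350B total, and that the schedule is indexed by tokens rather than steps. The batch-size ramp-up is not modeled, and the function name `learning_rate` is illustrative.

```python
import math

PEAK_LR = 2e-4           # from the table
FINAL_LR = 2e-5          # from the table
WARMUP_TOKENS = 500e6    # from the table
TOTAL_TOKENS = 350e9     # from the Training Data section


def learning_rate(tokens_seen: float) -> float:
    """Warm-up (assumed linear) followed by cosine decay to FINAL_LR."""
    if tokens_seen < WARMUP_TOKENS:
        return PEAK_LR * tokens_seen / WARMUP_TOKENS
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (tokens_seen - WARMUP_TOKENS) / (TOTAL_TOKENS - WARMUP_TOKENS)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FINAL_LR + (PEAK_LR - FINAL_LR) * cosine


for tokens in (0, 250e6, 500e6, 175e9, 350e9):
    print(f"{tokens:>16,.0f} tokens -> lr = {learning_rate(tokens):.2e}")
```

Under these assumptions the rate peaks at 2e-4 once 500M tokens have been seen and reaches 2e-5 at the end of training.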
### Speeds, Sizes, Times
Training happened in early December 2022 and took about six days.
### Evaluation
See the [paper on arXiv](https://arxiv.org/abs/2306.01116) for in-depth evaluation.
## Technical Specifications
### Model Architecture and Objective
Falcon-RW-1B is a causal decoder-only model trained on a causal language modeling task (i.e., predict the next token).
The architecture is adapted from the GPT-3 paper ([Brown et al., 2020](https://arxiv.org/abs/2005.14165)), but uses ALiBi ([Press et al., 2021](https://arxiv.org/abs/2108.12409)).
| **Hyperparameter** | **Value** |
|:------------------:|:---------:|
| Layers | 24 |
| d_model | 2048 |
| head_dim | 64 |
| Vocabulary | 50304 |
| Sequence length | 2048 |
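To make the ALiBi reference concrete: instead of position embeddings, ALiBi adds a head-specific linear penalty to each attention logit, proportional to the distance between the query and key positions. The sketch below follows the formulation in the ALiBi paper and is not taken from the KerasHub implementation, which may differ in detail. The head count of 32 is derived here as `d_model / head_dim` and is an assumption rather than a value stated in the table; the function names are illustrative.

```python
import numpy as np

D_MODEL = 2048    # from the table
HEAD_DIM = 64     # from the table
NUM_HEADS = D_MODEL // HEAD_DIM  # 32; assumes d_model = num_heads * head_dim


def alibi_slopes(num_heads: int) -> np.ndarray:
    """Geometric per-head slopes from the ALiBi paper (power-of-two head counts)."""
    if num_heads & (num_heads - 1) != 0:
        raise ValueError("this sketch only covers power-of-two head counts")
    start = 2.0 ** (-8.0 / num_heads)
    # Slopes are start, start**2, ..., start**num_heads.
    return start ** np.arange(1, num_heads + 1)


def alibi_bias(num_heads: int, seq_len: int) -> np.ndarray:
    """Bias of shape (heads, query, key) to add to the attention logits."""
    positions = np.arange(seq_len)
    distance = positions[:, None] - positions[None, :]  # query_pos - key_pos
    # Future keys are removed by the causal mask, so their distance is set to 0.
    distance = np.maximum(distance, 0)
    slopes = alibi_slopes(num_heads)
    return -slopes[:, None, None] * distance[None, :, :]


bias = alibi_bias(NUM_HEADS, seq_len=8)
print(bias.shape)  # (32, 8, 8)
```

Heads with larger slopes penalize distant tokens more strongly and so attend more locally, while heads with small slopes retain a longer effective context.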
## Citation
```
@article{refinedweb,
title={The {R}efined{W}eb dataset for {F}alcon {LLM}: outperforming curated corpora with web data, and web data only},
author={Guilherme Penedo and Quentin Malartic and Daniel Hesslow and Ruxandra Cojocaru and Alessandro Cappelli and Hamza Alobeidli and Baptiste Pannier and Ebtesam Almazrouei and Julien Launay},
journal={arXiv preprint arXiv:2306.01116},
eprint={2306.01116},
eprinttype = {arXiv},
url={https://arxiv.org/abs/2306.01116},
year={2023}
}
```
## Example Usage
```Python
import os
os.environ["KERAS_BACKEND"] = "jax"
import keras
import keras_hub
# When running only inference, bfloat16 significantly reduces memory usage.
keras.config.set_floatx("bfloat16")
causal_lm = keras_hub.models.FalconCausalLM.from_preset(
    "falcon_refinedweb_1b_en"
)
causal_lm.summary()
outputs = causal_lm.generate([
    "What is Jax?",
    "Give me your best brownie recipe.",
], max_length=512)
```
## Example Usage with Hugging Face URI
```Python
import os
os.environ["KERAS_BACKEND"] = "jax"
import keras
import keras_hub
# When running only inference, bfloat16 significantly reduces memory usage.
keras.config.set_floatx("bfloat16")
causal_lm = keras_hub.models.FalconCausalLM.from_preset(
    "hf://keras/falcon_refinedweb_1b_en"
)
causal_lm.summary()
outputs = causal_lm.generate([
    "What is Jax?",
    "Give me your best brownie recipe.",
], max_length=512)
```