|
--- |
|
library_name: keras-hub |
|
license: apache-2.0 |
|
language: |
|
- en |
|
tags: |
|
- text-generation-inference |
|
- keras |
|
pipeline_tag: text-generation |
|
--- |
|
## Model Overview
|
|
|
Falcon-RW-1B is a 1B-parameter causal decoder-only model built by [TII](https://www.tii.ae/) and trained on 350B tokens of [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb). The architecture is adapted from the GPT-3 paper ([Brown et al., 2020](https://arxiv.org/abs/2005.14165)), but uses ALiBi positional encoding ([Press et al., 2021](https://arxiv.org/abs/2108.12409)).
|
|
|
## Links |
|
|
|
* [Falcon Quickstart Notebook](https://www.kaggle.com/code/laxmareddypatlolla/falcon-quickstart-notebook) |
|
* [Falcon API Documentation](https://keras.io/keras_hub/api/models/falcon/) |
|
* [Falcon Model Card](https://huggingface.co/docs/transformers/en/model_doc/falcon) |
|
* [KerasHub Beginner Guide](https://keras.io/guides/keras_hub/getting_started/) |
|
* [KerasHub Model Publishing Guide](https://keras.io/guides/keras_hub/upload/) |
|
|
|
## Presets |
|
|
|
The following model checkpoints are provided by the Keras team. Full code examples for each are available below. |
|
| Preset name | Parameters | Description | |
|
|----------------|------------|--------------------------------------------------| |
|
| falcon_refinedweb_1b_en | 1.31B | 24-layer Falcon model (Falcon with 1B parameters), trained on 350B tokens of the RefinedWeb dataset. |
|
|
|
|
|
## Use |
|
|
|
### Direct Use |
|
Research on large language models, specifically the influence of adequately filtered and deduplicated web data on the properties of large language models (fairness, safety, limitations, capabilities, etc.). |
|
|
|
### Out-of-scope Use |
|
Production use without adequate assessment of risks and mitigation; any use cases which may be considered irresponsible or harmful. |
|
|
|
## Bias, Risks, and Limitations |
|
|
|
Falcon-RW-1B is trained on English data only and will not generalize appropriately to other languages. Furthermore, as it is trained on a large-scale corpus representative of the web, it will carry the stereotypes and biases commonly encountered online.
|
|
|
## Recommendations |
|
|
|
We recommend that users of Falcon-RW-1B consider fine-tuning it for their specific tasks of interest, and that guardrails and appropriate precautions be taken for any production use.
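A minimal sketch of such fine-tuning, assuming the generic KerasHub `CausalLM` workflow (models loaded with `from_preset` come compiled with default loss and optimizer, and accept raw strings that are preprocessed internally); the training strings below are purely illustrative placeholders:

```python
import os

os.environ["KERAS_BACKEND"] = "jax"

import keras_hub

# Load the pretrained Falcon causal LM (see the full usage examples below).
causal_lm = keras_hub.models.FalconCausalLM.from_preset(
    "falcon_refinedweb_1b_en"
)

# Illustrative placeholder data; substitute a domain-specific corpus.
features = [
    "Keras is a deep learning framework with multiple backends.",
    "RefinedWeb is a filtered and deduplicated web dataset.",
]

# Short fine-tuning run; tune batch size and epochs to the available hardware.
causal_lm.fit(x=features, batch_size=2, epochs=1)
```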
|
|
|
## Training Details |
|
|
|
### Training Data |
|
Falcon-RW-1B was trained on 350B tokens of RefinedWeb, a high-quality filtered and deduplicated web dataset. The data was tokenized with the GPT-2 tokenizer. |
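The same byte-pair-encoding vocabulary ships with the KerasHub preset. As a small sketch, assuming the `FalconTokenizer` class from the Falcon API linked above, tokenization can be inspected directly:

```python
import keras_hub

# GPT-2-style BPE tokenizer bundled with the preset.
tokenizer = keras_hub.models.FalconTokenizer.from_preset(
    "falcon_refinedweb_1b_en"
)

token_ids = tokenizer("The quick brown fox jumps over the lazy dog.")
print(token_ids)                         # integer token ids
print(tokenizer.detokenize(token_ids))   # round-trips back to text
```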
|
|
|
### Training Procedure |
|
Falcon-RW-1B was trained on 32 A100 40GB GPUs, using only data parallelism with ZeRO. |
|
|
|
### Training Hyperparameters |
|
Hyperparameters were adapted from the GPT-3 paper ([Brown et al., 2020](https://arxiv.org/abs/2005.14165)). |
|
|
|
| Hyperparameter | Value | Comment | |
|
|----------------|----------|-------------------------------------------| |
|
| Precision | bfloat16 | | |
|
| Optimizer | AdamW | | |
|
| Learning rate | 2e-4 | 500M tokens warm-up, cosine decay to 2e-5 | |
|
| Weight decay | 1e-1 | | |
|
| Batch size | 512 | 4B tokens ramp-up | |
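For reference, a schedule of this shape (linear warm-up to 2e-4 followed by cosine decay toward 2e-5) can be sketched with Keras built-ins; the step counts below are hypothetical placeholders, since the original run measured warm-up in tokens rather than steps:

```python
import keras

# Hypothetical step counts, for illustration only.
warmup_steps = 1_000
total_steps = 100_000

lr_schedule = keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.0,   # start of the linear warm-up
    warmup_target=2e-4,          # peak learning rate
    warmup_steps=warmup_steps,
    decay_steps=total_steps - warmup_steps,
    alpha=0.1,                   # decay floor at 0.1 * 2e-4 = 2e-5
)

optimizer = keras.optimizers.AdamW(
    learning_rate=lr_schedule,
    weight_decay=0.1,
)
```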
|
|
|
### Speeds, Sizes, Times |
|
Training happened in early December 2022 and took about six days. |
|
|
|
### Evaluation |
|
See the [paper on arXiv](https://arxiv.org/abs/2306.01116) for in-depth evaluation. |
|
|
|
## Technical Specifications |
|
|
|
### Model Architecture and Objective |
|
Falcon-RW-1B is a causal decoder-only model trained on a causal language modeling task (i.e., predict the next token). |
|
|
|
The architecture is adapted from the GPT-3 paper ([Brown et al., 2020](https://arxiv.org/abs/2005.14165)), but uses ALiBi ([Press et al., 2021](https://arxiv.org/abs/2108.12409)).
|
|
|
| **Hyperparameter** | **Value** | |
|
|:------------------:|:---------:| |
|
| Layers | 24 | |
|
| d_model | 2048 | |
|
| head_dim | 64 | |
|
| Vocabulary | 50304 | |
|
| Sequence length | 2048 | |
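These values can be cross-checked against the preset itself; the sketch below loads only the backbone, assuming the `FalconBackbone` class from the Falcon API linked above:

```python
import keras_hub

# Load the Transformer backbone without the language-modeling head setup.
backbone = keras_hub.models.FalconBackbone.from_preset(
    "falcon_refinedweb_1b_en"
)

# Layer-by-layer summary and the serialized architecture hyperparameters
# (number of layers, hidden size, vocabulary size, ...).
backbone.summary()
print(backbone.get_config())
```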
|
|
|
## Citation |
|
``` |
|
@article{refinedweb, |
|
title={The {R}efined{W}eb dataset for {F}alcon {LLM}: outperforming curated corpora with web data, and web data only}, |
|
author={Guilherme Penedo and Quentin Malartic and Daniel Hesslow and Ruxandra Cojocaru and Alessandro Cappelli and Hamza Alobeidli and Baptiste Pannier and Ebtesam Almazrouei and Julien Launay}, |
|
journal={arXiv preprint arXiv:2306.01116}, |
|
eprint={2306.01116}, |
|
eprinttype = {arXiv}, |
|
url={https://arxiv.org/abs/2306.01116}, |
|
year={2023} |
|
} |
|
``` |
|
|
|
## Example Usage |
|
```python
|
|
|
import os |
|
|
|
os.environ["KERAS_BACKEND"] = "jax" |
|
|
|
import keras |
|
import keras_hub |
|
|
|
# When running only inference, bfloat16 significantly reduces memory usage.
|
keras.config.set_floatx("bfloat16") |
|
|
|
causal_lm = keras_hub.models.FalconCausalLM.from_preset( |
|
"falcon_refinedweb_1b_en" |
|
) |
|
causal_lm.summary() |
|
|
|
outputs = causal_lm.generate([ |
|
"What is Jax?", |
|
"Give me your best brownie recipe.", |
|
], max_length=512) |
|
|
|
``` |
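Continuing from the snippet above, and assuming the generic KerasHub `CausalLM` compile/sampler API, the decoding strategy used by `generate()` can be swapped by recompiling with a different sampler:

```python
# Greedy decoding instead of the default sampler.
causal_lm.compile(sampler="greedy")
print(causal_lm.generate("What is Jax?", max_length=64))

# Or an explicit sampler object, e.g. top-k sampling.
causal_lm.compile(sampler=keras_hub.samplers.TopKSampler(k=50))
print(causal_lm.generate("What is Jax?", max_length=64))
```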
|
|
|
## Example Usage with Hugging Face URI |
|
|
|
```python
|
|
|
import os |
|
|
|
os.environ["KERAS_BACKEND"] = "jax" |
|
|
|
import keras |
|
import keras_hub |
|
|
|
# When running only inference, bfloat16 significantly reduces memory usage.
|
keras.config.set_floatx("bfloat16") |
|
|
|
causal_lm = keras_hub.models.FalconCausalLM.from_preset( |
|
"hf://keras/falcon_refinedweb_1b_en" |
|
) |
|
causal_lm.summary() |
|
|
|
outputs = causal_lm.generate([ |
|
"What is Jax?", |
|
"Give me your best brownie recipe.", |
|
], max_length=512) |
|
|
|
``` |
|
|