Upload README.md with huggingface_hub
README.md
<!-- Provide a quick summary of what the model is/does. -->

This model is a reproduction of GPT-2 following Andrej Karpathy's GPT [tutorial series](https://www.youtube.com/watch?v=VMj-3S1tku0&list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ)
and the original [GPT-2 paper](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf).

The model was trained using the [nano-gpt](https://github.com/allenporter/nano-gpt/) library.
GPT-2 is a transformer model pretrained on a large corpus of English-only text
with no labeling. This is the smallest version of GPT-2, with 124M parameters.

The model follows the standard GPT-2 architecture with transformer blocks containing:

- Multi-head causal self-attention
- Layer normalization
- MLP blocks with GELU activation

The model was trained on a 10B-token sample of the FineWeb-EDU dataset, which
contains educational web pages.
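The block structure listed above can be sketched in NumPy. This is a minimal, illustrative forward pass with random weights, not the nano-gpt implementation; the shapes (12 heads, 768 channels) match the 124M GPT-2 configuration:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize the last dimension to zero mean and unit variance."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def gelu(x):
    """tanh approximation of GELU, as used in GPT-2."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def causal_self_attention(x, n_head):
    """Multi-head attention where each position attends only to earlier positions."""
    T, C = x.shape
    hs = C // n_head
    rng = np.random.default_rng(0)
    w_qkv = rng.normal(0, 0.02, (C, 3 * C))   # illustrative random weights
    w_proj = rng.normal(0, 0.02, (C, C))
    q, k, v = np.split(x @ w_qkv, 3, axis=-1)
    # reshape each to (n_head, T, head_size)
    q, k, v = (a.reshape(T, n_head, hs).transpose(1, 0, 2) for a in (q, k, v))
    att = q @ k.transpose(0, 2, 1) / np.sqrt(hs)
    att = np.where(np.tril(np.ones((T, T), bool)), att, -np.inf)  # causal mask
    att = np.exp(att - att.max(-1, keepdims=True))
    att /= att.sum(-1, keepdims=True)                             # softmax
    y = (att @ v).transpose(1, 0, 2).reshape(T, C)
    return y @ w_proj

def block(x, n_head=12):
    """Pre-norm transformer block: attention, then a 4x-wide GELU MLP."""
    x = x + causal_self_attention(layer_norm(x), n_head)
    T, C = x.shape
    rng = np.random.default_rng(1)
    w_fc = rng.normal(0, 0.02, (C, 4 * C))
    w_out = rng.normal(0, 0.02, (4 * C, C))
    return x + gelu(layer_norm(x) @ w_fc) @ w_out

x = np.random.default_rng(2).normal(size=(8, 768))
y = block(x)
print(y.shape)  # (8, 768)
```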
#### Preprocessing

The dataset was pre-processed and sharded to make data loading efficient: it was
pre-tokenized with the GPT-2 tokenizer using the `nano-gpt prepare_dataset`
command, and the resulting token file was split into manageable chunks.
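The sharding step can be sketched as follows (an illustration of the general pattern, not the actual `prepare_dataset` implementation; the shard size is a placeholder):

```python
import numpy as np

def shard_tokens(tokens, shard_size):
    """Split a flat token stream into fixed-size shards.
    GPT-2 token ids fit in 16 bits (vocab size 50,257), so uint16
    halves storage versus int32. In practice each shard would be
    written to its own file for the data loader to memory-map."""
    arr = np.asarray(tokens, dtype=np.uint16)
    return [arr[i:i + shard_size] for i in range(0, len(arr), shard_size)]

shards = shard_tokens(range(2500), shard_size=1000)
print([len(s) for s in shards])  # [1000, 1000, 500]
```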
#### Training Hyperparameters

- **Training regime:** See `train_config` in `config.json` for the hyperparameters.

The main features of the training process are:

- Learning rate scheduling with warmup
- Gradient clipping for stable training
- Model compilation for improved performance where available
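The first two bullets can be sketched as plain functions. This is the common warmup-plus-cosine-decay and global-norm-clipping pattern, not nano-gpt's exact code; the constants are illustrative defaults (similar to those used in Karpathy's series), not confirmed values from `train_config`:

```python
import math

def lr_at(step, max_lr=6e-4, min_lr=6e-5, warmup_steps=715, max_steps=19073):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step > max_steps:
        return min_lr
    ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1 + math.cos(math.pi * ratio))  # goes 1 -> 0
    return min_lr + coeff * (max_lr - min_lr)

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale all gradients down uniformly if their combined L2 norm
    exceeds max_norm, which prevents destabilizing update spikes."""
    total = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total + 1e-6))
    return [g * scale for g in grads]

print(lr_at(0) < lr_at(714))      # True: still warming up
print(lr_at(714) > lr_at(19073))  # True: decayed by the end
```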
#### Speeds, Sizes, Times

The model was trained using 8 x A100s. The model was run for one full epoch of
…

The model was evaluated using the HellaSwag dataset. TBD results.
The `nano-gpt train` command has built-in support for evaluating against the
validation dataset as well as HellaSwag between training steps. Every 500 steps
the model was evaluated against the validation dataset and HellaSwag, and each
worker generated anecdotal samples.
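HellaSwag is a four-way sentence-completion benchmark; a base language model is typically scored by picking the candidate ending whose tokens have the lowest average loss. A minimal sketch of that selection rule, with made-up per-token losses standing in for real model outputs:

```python
import numpy as np

def pick_completion(losses_per_candidate):
    """Return the index of the candidate ending with the lowest mean
    per-token loss, i.e. the completion the model finds most likely.
    Candidates can have different lengths, hence the mean."""
    means = [np.mean(losses) for losses in losses_per_candidate]
    return int(np.argmin(means))

# Made-up per-token cross-entropy losses for 4 candidate endings.
candidates = [
    [3.1, 2.9, 3.4],       # ending 0
    [1.2, 1.5, 1.1, 1.3],  # ending 1: lowest average, so it is chosen
    [2.8, 3.0],            # ending 2
    [4.0, 3.7, 3.9],       # ending 3
]
print(pick_completion(candidates))  # 1
```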
## Environmental Impact

<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

Carbon emissions can be estimated using the Machine Learning Impact calculator.
- **Hours used:** 2 hours
- **Cloud Provider:** Lambda Labs
- **Compute Region:** Arizona
- **Power estimate:** 8 GPUs × 0.325 kW/GPU = 2.6 kW; 2.6 kW × 2 hours = 5.2 kWh
- **CO2 estimate:** ~2.6 lbs of CO2 equivalent, assuming 500 lbs CO2e per MWh