allenporter committed
Commit fb104b4 · verified · 1 Parent(s): 383b298

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +21 -3
README.md CHANGED
@@ -16,7 +16,7 @@ metrics:
 
 <!-- Provide a quick summary of what the model is/does. -->
 
- This model is a reproduction of gpt2 following Andrej Karpathy's GPT tutorial series
+ This model is a reproduction of gpt2 following Andrej Karpathy's GPT [tutorial series](https://www.youtube.com/watch?v=VMj-3S1tku0&list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ)
 and the original [GPT-2 Paper](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
 
 The model was trained using the [nano-gpt](https://github.com/allenporter/nano-gpt/) library
@@ -30,6 +30,11 @@ packaging and infrastructure work to make it more maintainable and reusable.
 GPT-2 is a transformers model pretrained on a large corpus of english only text
 with no labeling. This is the smallest version of GPT-2, with 124M parameters.
 
+ The model follows the standard GPT-2 architecture with transformer blocks containing:
+ - Multi-head causal self-attention
+ - Layer normalization
+ - MLP blocks with GELU activation
+
 The model was trained using a sample of the FineWeb-EDU using 10B tokens. The
 dataset contains educational web pages.
 
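For context on the architecture bullets added above, a minimal PyTorch sketch of a GPT-2 style block with those components (pre-norm LayerNorm, multi-head causal self-attention, and a GELU MLP) could look like the following. This is only an illustration, not the nano-gpt source; the defaults `n_embd=768`, `n_head=12` are the standard GPT-2 small (124M) values.

```python
# Illustrative sketch only; module names and layout are assumptions,
# not the actual nano-gpt implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalSelfAttention(nn.Module):
    def __init__(self, n_embd: int = 768, n_head: int = 12):
        super().__init__()
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)   # query/key/value projections
        self.proj = nn.Linear(n_embd, n_embd)      # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # reshape to (B, n_head, T, head_dim) for multi-head attention
        q, k, v = (t.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) for t in (q, k, v))
        # the causal mask is applied inside scaled_dot_product_attention
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)


class Block(nn.Module):
    """One GPT-2 style transformer block: LayerNorm -> attention, LayerNorm -> GELU MLP."""

    def __init__(self, n_embd: int = 768, n_head: int = 12):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, n_head)
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(approximate="tanh"),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.ln_1(x))   # residual connection around attention
        x = x + self.mlp(self.ln_2(x))    # residual connection around MLP
        return x
```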
 
@@ -83,13 +88,19 @@ from the 10B token sample.
 
 #### Preprocessing
 
- The data was pre-tokenized using the `nano-gpt prepare_dataset` command
- line tool.
+ The dataset was pre-processed and sharded to make data loading efficient. It
+ was pre-tokenized with the GPT-2 tokenizer using the `nano-gpt prepare_dataset`
+ command and split into manageable chunks.
 
 #### Training Hyperparameters
 
 - **Training regime:** See `train_config` in `config.json` for the hyper parameters.
 
+ The main features of the training process are:
+ - Learning rate scheduling with warmup
+ - Gradient clipping for stable training
+ - Model compilation for improved performance where available
+
 #### Speeds, Sizes, Times
 
 The model was trained using 8 x A100s. The model was run for one full epoch of
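As an illustration of the pre-tokenization and sharding described in the hunk above (not the actual `nano-gpt prepare_dataset` implementation; the shard size and file names are made up), encoding text with the GPT-2 BPE tokenizer and writing fixed-size shards might look like:

```python
# Illustrative sketch of pre-tokenizing and sharding; shard size and file names
# are placeholders, not the nano-gpt prepare_dataset output format.
import numpy as np
import tiktoken

enc = tiktoken.get_encoding("gpt2")        # GPT-2 BPE tokenizer
shard_size = 1_000_000                     # tokens per shard (placeholder)
documents = ["An educational web page...", "Another document..."]

tokens: list[int] = []
for doc in documents:
    tokens.append(enc.eot_token)           # delimit documents with <|endoftext|>
    tokens.extend(enc.encode_ordinary(doc))

for i in range(0, len(tokens), shard_size):
    shard = np.array(tokens[i : i + shard_size], dtype=np.uint16)  # GPT-2 vocab fits in uint16
    np.save(f"shard_{i // shard_size:06d}.npy", shard)
```

The training features listed above map onto standard PyTorch patterns. A simplified sketch with placeholder hyperparameters and a dummy model (not the values from this model's `train_config`):

```python
# Simplified sketch of the listed training features (warmup LR schedule,
# gradient clipping, optional compilation); all values are placeholders.
import math
import torch
import torch.nn as nn

model = nn.Sequential(nn.Embedding(50257, 64), nn.Flatten(1), nn.Linear(64 * 8, 50257))
if hasattr(torch, "compile"):
    model = torch.compile(model)  # model compilation for improved performance where available

optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4)
warmup_steps, max_steps = 10, 100  # placeholder schedule lengths


def lr_at(step: int, max_lr: float = 6e-4, min_lr: float = 6e-5) -> float:
    """Linear warmup, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in range(max_steps):
    x = torch.randint(0, 50257, (4, 8))      # dummy token batch
    y = torch.randint(0, 50257, (4,))        # dummy next-token targets
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
    for group in optimizer.param_groups:     # apply the scheduled learning rate
        group["lr"] = lr_at(step)
    optimizer.step()
```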
@@ -104,6 +115,11 @@ hours.
 
 The model was evaluated using hellaswag dataset. TBD results.
 
+ The `nano-gpt train` command has built-in support for evaluating against
+ the val dataset as well as HellaSwag in between training steps. Every 500 steps
+ the model was evaluated against the val dataset, as well as HellaSwag, and
+ generated anecdotal samples from each worker.
+
 ## Environmental Impact
 
 <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
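Regarding the HellaSwag evaluation added above: HellaSwag presents a context with four candidate endings, and a common way to score it with a base language model (a reasonable approximation of an in-training evaluation, though not necessarily the exact `nano-gpt train` code) is to pick the ending with the lowest average per-token loss. In the sketch below, `model` and `enc` are assumed to be a GPT-2 style model returning `(batch, seq, vocab)` logits and a tiktoken-style tokenizer.

```python
# Sketch of completion-style HellaSwag scoring; `model` and `enc` are assumed
# interfaces, not objects defined by nano-gpt.
import torch
import torch.nn.functional as F


@torch.no_grad()
def pick_ending(model, enc, context: str, endings: list[str]) -> int:
    """Return the index of the ending with the lowest average per-token loss."""
    losses = []
    for ending in endings:
        ctx_ids = enc.encode_ordinary(context)
        end_ids = enc.encode_ordinary(" " + ending)
        ids = torch.tensor([ctx_ids + end_ids])
        logits = model(ids)                       # (1, T, vocab_size)
        # shift so position t predicts token t+1, then score only the ending tokens
        shift_logits = logits[:, :-1, :]
        shift_targets = ids[:, 1:]
        loss = F.cross_entropy(
            shift_logits.reshape(-1, shift_logits.size(-1)),
            shift_targets.reshape(-1),
            reduction="none",
        ).view(1, -1)
        ending_loss = loss[0, len(ctx_ids) - 1 :].mean()  # losses at ending positions
        losses.append(ending_loss.item())
    return losses.index(min(losses))
```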
@@ -114,3 +130,5 @@ Carbon emissions can be estimated using the [Machine Learning Impact calculator]
 - **Hours used:** 2 hours
 - **Cloud Provider:** Lambda Labs
 - **Compute Region:** Arizona
+ - **Power estimate:** 8 GPUs * 0.325 kW/GPU = 2.6 kW. 2.6 kW * 2 hours = 5.2 kWh
+ - **CO2 Estimate:** 2.6 pounds of CO2 equivalent assuming 500 lbs CO2e per MWh
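The added power and CO2 figures follow directly from the stated assumptions (0.325 kW per GPU and 500 lbs CO2e per MWh); a quick arithmetic check:

```python
# Reproduces the arithmetic in the estimates above.
gpus = 8
kw_per_gpu = 0.325            # assumed per-GPU power draw in kW
hours = 2
lbs_co2e_per_mwh = 500        # assumed grid carbon intensity

kwh = gpus * kw_per_gpu * hours              # 5.2 kWh
lbs_co2e = (kwh / 1000) * lbs_co2e_per_mwh   # 2.6 lbs CO2e
print(kwh, lbs_co2e)
```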
 
 
 