Update app/src/content/article.mdx

#1
by sergiopaniego (HF Staff) - opened
Files changed (1)
  1. app/src/content/article.mdx +6 -6
app/src/content/article.mdx CHANGED
@@ -88,7 +88,7 @@ The [transformers documentation](https://huggingface.co/docs/transformers/index)
 
  </Sidenote>
 
- Most of know the `transformers` library as the backbone of modern machine learning, but if we dig a little deeper, it's a powerful piece of education.
+ Most of us know the `transformers` library as the backbone of modern machine learning, but if we dig a little deeper, it's a powerful piece of education.
 
 If you don't know… transformers is the de facto implementation of the modern AI models that bear the same name: transformer models like the GPT, DeepSeek, and Claude series. `transformers` is a special project because it contains the implementations of all major open model architectures, and those architectures are modularized to reuse functionality from each other.
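To make that modular reuse concrete, here is a minimal, illustrative sketch of the pattern (the class names are invented for the example, not taken from the article): a new model's modular file subclasses building blocks from an existing model, and the flat modeling file is generated from it.

```python
# Illustrative only: the modular-transformers pattern, where a new architecture reuses
# another model's building blocks by subclassing them. The "MyTinyModel" names are made up.
from transformers.models.llama.modeling_llama import LlamaMLP, LlamaRMSNorm


class MyTinyModelRMSNorm(LlamaRMSNorm):
    # reused as-is; only the class name changes in the generated modeling file
    pass


class MyTinyModelMLP(LlamaMLP):
    # override only what differs from Llama; everything else is inherited
    pass
```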
 
@@ -157,11 +157,11 @@ Learn about [model quantization](https://huggingface.co/docs/transformers/en/qua
 - Quantize models in llama.cpp ($0)
 - Integrate models into the browser and WebGPU ($0)
 - SFT training in TRL/torch on Google Colab ($0)
- - RL training TRL/torch on Google Colab ($0 - $9)
- - Agentic RL in TRL on Google Colab ($0 - $9)
+ - RL training in TRL/torch on Google Colab (\$0 - \$9)
+ - Agentic RL in TRL on Google Colab (\$0 - \$9)


- Finally, training AI models is expensive. Running the nanochat `speedrun.sh` costs between $200 and $2k depending on the model size we use. Which is little compared to the millions of dollars invested by frontier labs. But that is still a significant sum for students, who always learn best by taking a few chances to fail and build experience.
+ Finally, training AI models is expensive. Running the nanochat `speedrun.sh` costs between \$200 and \$2k depending on the model size we use. That is little compared to the millions of dollars invested by frontier labs, but it is still a significant sum for students, who always learn best by taking a few chances to fail and build experience.
 
 <Sidenote>
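For the free-tier SFT item in the list above, a run on Colab looks roughly like the following sketch; the small model and dataset are stand-ins that fit a free GPU, not choices made by the article.

```python
# Minimal SFT sketch with TRL (illustrative defaults; model and dataset are stand-ins).
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")  # small chat-style dataset

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",  # small enough for a free Colab GPU
    train_dataset=dataset,
    args=SFTConfig(output_dir="sft-demo", max_steps=100),
)
trainer.train()
```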
 
@@ -181,7 +181,7 @@ The original [gpt.py](https://github.com/karpathy/nanochat/blob/master/nanochat/
 
 As described by Karpathy, nanochat uses an archetypal architecture that is common across the field, which makes it an excellent choice for an educational resource because folks get to learn from what works. The core model implementation demonstrates modern transformer architecture, with every design decision documented and justified.
 
- The configuration uses a single complexity slider: depth. Set `--depth=20` and everything else automatically adjusts. Model dimension equals depth × 64 (20 layers → 1,280 dimensions). Number of attention heads equals depth ÷ 2 (10 heads). Head dimension is fixed at 128. This "aspect ratio philosophy" simplifies scaling. So if you want a more capable model or have a bigger budget. Just increase depth to 26 ($300 budget) or 30 ($1,000 budget).
+ The configuration uses a single complexity slider: depth. Set `--depth=20` and everything else automatically adjusts. Model dimension equals depth × 64 (20 layers → 1,280 dimensions). Number of attention heads equals depth ÷ 2 (10 heads). Head dimension is fixed at 128. This "aspect ratio philosophy" simplifies scaling: if you want a more capable model or have a bigger budget, just increase depth to 26 (\$300 budget) or 30 (\$1,000 budget).
 
 The architecture incorporates five key improvements over vanilla transformers. Let's work through the components of this architecture and compare them across implementations:
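The depth arithmetic above fits in a few lines. This is just a restatement of the rule for clarity, not nanochat's actual configuration code:

```python
# Restating the "single complexity slider": every size is derived from depth.
def nanochat_dims(depth: int) -> dict:
    head_dim = 128           # fixed head dimension
    n_heads = depth // 2     # heads scale with depth
    model_dim = depth * 64   # equals n_heads * head_dim
    assert model_dim == n_heads * head_dim
    return {"n_layer": depth, "model_dim": model_dim, "n_heads": n_heads, "head_dim": head_dim}

print(nanochat_dims(20))  # {'n_layer': 20, 'model_dim': 1280, 'n_heads': 10, 'head_dim': 128}
print(nanochat_dims(26))  # the larger-budget runs mentioned above
print(nanochat_dims(30))
```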
 
@@ -570,7 +570,7 @@ In `modular_nanochat.py`, we don't need to write this logic at all. As seen in t
 
 It's very clear that Andrej Karpathy's implementation offers 10 times more to learn from than the transformers version, which inherits almost entirely from existing models and features. That said, we can still take more away from the inherited modular modeling implementation. Models like Llama, Llama4, Gemma2, Qwen3, and CLIP are all reused to create a genuinely canonical implementation of a transformer.
 
- Ok. Let's cut the philosphy and see what we can do with `nanochat` in transformers.
+ Ok. Let's cut the philosophy and see what we can do with `nanochat` in transformers.
 
 <Inference />
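To make "nanochat in transformers" concrete, here is a rough sketch of loading a checkpoint and generating with the standard auto classes, assuming the nanochat port is available in your installed transformers version and the tokenizer ships a chat template; the checkpoint id below is a placeholder, not a repository named in the article.

```python
# Hypothetical usage sketch: load a nanochat checkpoint via the standard auto classes.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "your-org/nanochat-d20"  # placeholder checkpoint id

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

messages = [{"role": "user", "content": "Why is the sky blue?"}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```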
 
 