Update app/src/content/article.mdx

#1
by sergiopaniego (HF Staff) - opened
Files changed (1)
  1. app/src/content/article.mdx +6 -6
app/src/content/article.mdx CHANGED
@@ -88,7 +88,7 @@ The [transformers documentation](https://huggingface.co/docs/transformers/index)
 
  </Sidenote>
 
- Most of know the `transformers` library as the backbone of modern machine learning, but if we dig a little deeper, it's a powerful piece of education.
+ Most of us know the `transformers` library as the backbone of modern machine learning, but if we dig a little deeper, it's a powerful piece of education.
 
 If you don't know… transformers is the de facto implementation of the modern AI models that bear the same name: transformer models like the GPT, DeepSeek, and Claude series. `transformers` is a special project because it contains the implementations of all major open model architectures, and those architectures are modularized to reuse functionality from each other.
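To make that modular reuse concrete, here is a minimal, illustrative sketch of the pattern (the class names are invented for the example, not taken from the article): a new model's modular file subclasses building blocks from an existing model, and the flat modeling file is generated from it.

```python
# Illustrative only: the modular-transformers pattern, where a new architecture reuses
# another model's building blocks by subclassing them. The "MyTinyModel" names are made up.
from transformers.models.llama.modeling_llama import LlamaMLP, LlamaRMSNorm


class MyTinyModelRMSNorm(LlamaRMSNorm):
    # reused as-is; only the class name changes in the generated modeling file
    pass


class MyTinyModelMLP(LlamaMLP):
    # override only what differs from Llama; everything else is inherited
    pass
```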
 
@@ -157,11 +157,11 @@ Learn about [model quantization](https://huggingface.co/docs/transformers/en/qua
 - Quantize models in llama.cpp ($0)
 - Integrate models into the browser and WebGPU ($0)
 - SFT training in TRL/torch on Google Colab ($0)
- - RL training TRL/torch on Google Colab ($0 - $9)
- - Agentic RL in TRL on Google Colab ($0 - $9)
+ - RL training in TRL/torch on Google Colab (\$0 - \$9)
+ - Agentic RL in TRL on Google Colab (\$0 - \$9)


- Finally, training AI models is expensive. Running the nanochat `speedrun.sh` costs between $200 and $2k depending on the model size we use. Which is little compared to the millions of dollars invested by frontier labs. But that is still a significant sum for students, who always learn best by taking a few chances to fail and build experience.
+ Finally, training AI models is expensive. Running the nanochat `speedrun.sh` costs between \$200 and \$2k depending on the model size we use. That is little compared to the millions of dollars invested by frontier labs, but it is still a significant sum for students, who always learn best by taking a few chances to fail and build experience.
 
 <Sidenote>
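For the free-tier SFT item in the list above, a run on Colab looks roughly like the following sketch; the small model and dataset are stand-ins that fit a free GPU, not choices made by the article.

```python
# Minimal SFT sketch with TRL (illustrative defaults; model and dataset are stand-ins).
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")  # small chat-style dataset

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",  # small enough for a free Colab GPU
    train_dataset=dataset,
    args=SFTConfig(output_dir="sft-demo", max_steps=100),
)
trainer.train()
```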
 
@@ -181,7 +181,7 @@ The original [gpt.py](https://github.com/karpathy/nanochat/blob/master/nanochat/
 
 As described by Karpathy, nanochat uses an archetypal architecture that is common across the field, which makes it an excellent choice for an educational resource because folks get to learn from what works. The core model implementation demonstrates modern transformer architecture, with every design decision documented and justified.
 
- The configuration uses a single complexity slider: depth. Set `--depth=20` and everything else automatically adjusts. Model dimension equals depth × 64 (20 layers → 1,280 dimensions). Number of attention heads equals depth ÷ 2 (10 heads). Head dimension is fixed at 128. This "aspect ratio philosophy" simplifies scaling. So if you want a more capable model or have a bigger budget. Just increase depth to 26 ($300 budget) or 30 ($1,000 budget).
+ The configuration uses a single complexity slider: depth. Set `--depth=20` and everything else automatically adjusts. Model dimension equals depth × 64 (20 layers → 1,280 dimensions). Number of attention heads equals depth ÷ 2 (10 heads). Head dimension is fixed at 128. This "aspect ratio philosophy" simplifies scaling: if you want a more capable model or have a bigger budget, just increase depth to 26 (\$300 budget) or 30 (\$1,000 budget).
 
 The architecture incorporates five key improvements over vanilla transformers. Let's work through the components of this architecture and compare them across implementations:
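The depth arithmetic above fits in a few lines. This is just a restatement of the rule for clarity, not nanochat's actual configuration code:

```python
# Restating the "single complexity slider": every size is derived from depth.
def nanochat_dims(depth: int) -> dict:
    head_dim = 128           # fixed head dimension
    n_heads = depth // 2     # heads scale with depth
    model_dim = depth * 64   # equals n_heads * head_dim
    assert model_dim == n_heads * head_dim
    return {"n_layer": depth, "model_dim": model_dim, "n_heads": n_heads, "head_dim": head_dim}

print(nanochat_dims(20))  # {'n_layer': 20, 'model_dim': 1280, 'n_heads': 10, 'head_dim': 128}
print(nanochat_dims(26))  # the larger-budget runs mentioned above
print(nanochat_dims(30))
```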
 
@@ -570,7 +570,7 @@ In `modular_nanochat.py`, we don't need to write this logic at all. As seen in t
 
 It's very clear that Andrej Karpathy's implementation offers 10 times more to learn from than the transformers version, which inherits almost entirely from existing models and features. That said, we can still take more away from the inherited modular modeling implementation. Models like Llama, Llama4, Gemma2, Qwen3, and CLIP are all reused to create a genuinely canonical implementation of a transformer.
 
- Ok. Let's cut the philosphy and see what we can do with `nanochat` in transformers.
+ Ok. Let's cut the philosophy and see what we can do with `nanochat` in transformers.
 
 <Inference />
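To make "nanochat in transformers" concrete, here is a rough sketch of loading a checkpoint and generating with the standard auto classes, assuming the nanochat port is available in your installed transformers version and the tokenizer ships a chat template; the checkpoint id below is a placeholder, not a repository named in the article.

```python
# Hypothetical usage sketch: load a nanochat checkpoint via the standard auto classes.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "your-org/nanochat-d20"  # placeholder checkpoint id

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

messages = [{"role": "user", "content": "Why is the sky blue?"}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```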
 
 