Activity Feed

AI & ML interests

Large-scale distributed AI model training, model parallelisation, low-level GPU acceleration, making GPUs go brrrrr

Recent Activity

lvwerra updated a Space 9 days ago: nanotron/README
julien-c updated a Space 10 days ago: nanotron/README
lvwerra, new activity 10 days ago in nanotron/book: Update README.md

lvwerra, new activity in nanotron/book 10 days ago:

Update README.md (#1, opened 10 days ago by lvwerra)
Update README.md (#2, opened 10 days ago by lvwerra)
Update README.md (#3, opened 10 days ago by lvwerra)
eliebak posted an update 18 days ago:
Kimi K2 tech report is full of gems as always. Here are my notes on it:

> MuonClip: pretty crazy how after ~70k steps training stabilizes and QK-clip is basically inactive. There is also no loss in performance with QK-clip, which is not trivial at all (shown at small scale, but with an aggressive threshold). Appendix E also has a cool explanation of why Muon makes the logits explode (tl;dr: Muon pushes the singular values of the update matrix higher). A rough sketch of the clipping mechanism follows this list.
> Sparsity scaling laws to justify their sparsity ratio: they have a very solid training infra that allows the model to be trained at this sparsity level. They could have increased it even more, but as sparsity goes up, training becomes less efficient.
> They reduce the number of attention heads to make the model more efficient for long context, since attention heads are a big bottleneck there. They also remove 2 of the 3 "first dense" layers from the DeepSeek-V3 arch.
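To make the QK-clip mechanism concrete, here is a rough PyTorch sketch of the idea: after an update, estimate the max attention logit and rescale W_q/W_k whenever it exceeds a threshold. The threshold value, the even split of the rescaling between the two matrices, and the single-head shapes are simplifying assumptions on my side, not the exact K2 recipe.

```python
import torch

@torch.no_grad()
def qk_clip_(W_q, W_k, x, tau=100.0):
    """Rough sketch of QK-clip (illustrative, not Kimi K2's exact recipe).

    Estimate the maximum attention logit this head produces on a batch of
    activations x; if it exceeds the threshold tau, rescale W_q and W_k in
    place so the max logit comes back down to roughly tau.
    tau=100.0 is a placeholder value, not the paper's.
    """
    head_dim = W_q.shape[0]
    q = x @ W_q.T                          # (tokens, head_dim)
    k = x @ W_k.T
    max_logit = (q @ k.T).max() / head_dim ** 0.5
    if max_logit > tau:
        # Logits are bilinear in (W_q, W_k), so scaling both by sqrt(tau/max)
        # shrinks the max logit by a factor of tau/max.
        scale = (tau / max_logit) ** 0.5
        W_q.mul_(scale)
        W_k.mul_(scale)

# toy usage
d_model, head_dim = 512, 64
W_q, W_k = torch.randn(head_dim, d_model), torch.randn(head_dim, d_model)
x = torch.randn(128, d_model)
qk_clip_(W_q, W_k, x)
```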

With the sparsity and the attention heads divided by 2, they achieve 83% increased FLOPs compared to the DeepSeek-V3 arch at 128k context.

> Data: rephrasing is KEY. They do a lot more synthetic data generation and rephrase their corpus into different styles; for longer documents they do it chunk by chunk (a toy chunked-rephrasing sketch follows this list). I'm (half) surprised that a SINGLE epoch over data rephrased 10 different ways (at the same total number of training tokens, I think) gives better accuracy than 10 epochs over the same data rephrased once.
> They do rewriting for math and knowledge data; for math they apply the SwallowMath recipe and instruct the model to rephrase in a "learning note" style.
> They talk about diversity and probably have some internal tooling/evals to test for it; as always, it's still a bit unclear to me how to measure that properly.
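As a toy illustration of rephrasing long documents chunk by chunk, here is a sketch using huggingface_hub's InferenceClient; the model name, the prompt, and the character-based chunking are placeholder choices on my side, not what the K2 team actually used.

```python
from huggingface_hub import InferenceClient

client = InferenceClient()  # expects an HF token in the environment

def rephrase_document(text: str, style: str = "learning note",
                      chunk_chars: int = 4000,
                      model: str = "Qwen/Qwen2.5-72B-Instruct") -> str:
    """Rephrase a long document chunk by chunk in a given style (toy sketch)."""
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    rewritten = []
    for chunk in chunks:
        out = client.chat_completion(
            model=model,
            messages=[{
                "role": "user",
                "content": f"Rewrite the following text in a {style} style, "
                           f"keeping every fact intact:\n\n{chunk}",
            }],
            max_tokens=1024,
        )
        rewritten.append(out.choices[0].message.content)
    return "\n".join(rewritten)
```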

The infra is also very nice; quick summary:
> PP=16 (1F1B schedule, a bit custom), EP=16, ZeRO-1
> No FP8 compute, but FP8 storage for specific layers; selective recomputation for inexpensive blocks; activation offloading to CPU (see the sketch below for those last two)
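For the last two points, here is what selective recomputation and CPU activation offloading look like with stock PyTorch APIs; this is just to illustrate the techniques, the K2 training stack is obviously custom.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    """Stand-in for an inexpensive sub-block we are happy to recompute."""
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.net(x)

device = "cuda" if torch.cuda.is_available() else "cpu"
blocks = nn.ModuleList([Block() for _ in range(4)]).to(device)
x = torch.randn(8, 1024, device=device, requires_grad=True)

# Selective recomputation: don't store these blocks' activations for backward,
# recompute them during the backward pass instead.
h = x
for block in blocks:
    h = checkpoint(block, h, use_reentrant=False)

# Activation offloading: tensors saved for backward are parked in (pinned) CPU
# memory and copied back to the device only when backward needs them.
with torch.autograd.graph.save_on_cpu(pin_memory=torch.cuda.is_available()):
    h = h * torch.sigmoid(h)

h.sum().backward()
```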
loubnabnl posted an update 3 months ago

last-edits (#110, opened 3 months ago by lvwerra)
update (#102, opened 5 months ago by nouamanetazi)
julien-c posted an update 4 months ago:
BOOOOM: Today I'm dropping TINY AGENTS

the 50-lines-of-code Agent in JavaScript 🔥

I spent the last few weeks working on this, so I hope you will like it.

I've been diving into MCP (Model Context Protocol) to understand what the hype was all about.

It is fairly simple, but still quite powerful: MCP is a standard API to expose sets of Tools that can be hooked to LLMs.

But while doing that, I had a second realization:

Once you have an MCP Client, an Agent is literally just a while loop on top of it. 🤯
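To make that concrete, here is a minimal sketch of that while loop in Python (the actual Tiny Agents code is JavaScript, and the `llm` / `mcp_client` interfaces below are hypothetical stand-ins, not the real MCP SDK or Tiny Agents API):

```python
import json

def run_agent(llm, mcp_client, user_message: str, max_turns: int = 10):
    """The whole agent: a loop that alternates LLM calls and MCP tool calls."""
    tools = mcp_client.list_tools()        # tool schemas exposed by the MCP server (hypothetical API)
    messages = [{"role": "user", "content": user_message}]

    for _ in range(max_turns):             # the "while loop"
        reply = llm.chat(messages=messages, tools=tools)   # hypothetical LLM client
        messages.append(reply)

        if not reply.get("tool_calls"):    # no tool requested -> the model answered
            return reply["content"]

        for call in reply["tool_calls"]:   # run each requested tool through MCP
            result = mcp_client.call_tool(call["name"], json.loads(call["arguments"]))
            messages.append({"role": "tool", "name": call["name"], "content": str(result)})

    return "Stopped after max_turns."
```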

โžก๏ธ read it exclusively on the official HF blog: https://huggingface.co/blog/tiny-agents