Activity Feed

AI & ML interests

Large-scale distributed AI model training, model parallelisation, low-level GPU acceleration, making GPUs go brrrrr

Recent Activity

lvwerra updated a Space 9 days ago: nanotron/README
julien-c updated a Space 10 days ago: nanotron/README
lvwerra, new activity 10 days ago in nanotron/book: Update README.md

lvwerra, new activity in nanotron/book 10 days ago:

Update README.md (#1, opened 10 days ago by lvwerra)
Update README.md (#2, opened 10 days ago by lvwerra)
Update README.md (#3, opened 10 days ago by lvwerra)
eliebak posted an update 18 days ago:
Kimi K2 tech report is full of gems as always. Here are my notes on it:

> MuonClip: pretty crazy how after ~70k steps training stabilizes and QK-clip is basically inactive. There is also no loss in performance with QK-clip, which is not trivial at all (shown at small scale, but with an aggressive threshold). Appendix E also has a cool explanation of why Muon makes the logits explode (tl;dr: Muon pushes the singular values of the update matrix higher). A rough sketch of the clipping mechanism follows this list.
> Sparsity scaling laws to justify their sparsity ratio: they have a very solid training infra that allows the model to be trained at this sparsity level. They could have increased it even more, but as sparsity goes up, training becomes less efficient.
> They reduce the number of attention heads to make the model more efficient for long context, since attention heads are a big bottleneck there. They also remove 2 of the 3 "first dense" layers from the DeepSeek-V3 arch.
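To make the QK-clip mechanism concrete, here is a rough PyTorch sketch of the idea: after an update, estimate the max attention logit and rescale W_q/W_k whenever it exceeds a threshold. The threshold value, the even split of the rescaling between the two matrices, and the single-head shapes are simplifying assumptions on my side, not the exact K2 recipe.

```python
import torch

@torch.no_grad()
def qk_clip_(W_q, W_k, x, tau=100.0):
    """Rough sketch of QK-clip (illustrative, not Kimi K2's exact recipe).

    Estimate the maximum attention logit this head produces on a batch of
    activations x; if it exceeds the threshold tau, rescale W_q and W_k in
    place so the max logit comes back down to roughly tau.
    tau=100.0 is a placeholder value, not the paper's.
    """
    head_dim = W_q.shape[0]
    q = x @ W_q.T                          # (tokens, head_dim)
    k = x @ W_k.T
    max_logit = (q @ k.T).max() / head_dim ** 0.5
    if max_logit > tau:
        # Logits are bilinear in (W_q, W_k), so scaling both by sqrt(tau/max)
        # shrinks the max logit by a factor of tau/max.
        scale = (tau / max_logit) ** 0.5
        W_q.mul_(scale)
        W_k.mul_(scale)

# toy usage
d_model, head_dim = 512, 64
W_q, W_k = torch.randn(head_dim, d_model), torch.randn(head_dim, d_model)
x = torch.randn(128, d_model)
qk_clip_(W_q, W_k, x)
```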

With the sparsity and the attention heads divided by 2, they achieve 83% increased FLOPs compared to the DeepSeek-V3 arch at 128k context.

> Data: rephrasing is KEY. They do a lot more synthetic data generation and rephrase their corpus into different styles; for longer documents they do it chunk by chunk (a toy chunked-rephrasing sketch follows this list). I'm (half) surprised that a SINGLE epoch over data rephrased 10 different ways (at the same total number of training tokens, I think) gives better accuracy than 10 epochs over the same data rephrased once.
> They do rewriting for math and knowledge data; for math they apply the SwallowMath recipe and instruct the model to rephrase in a "learning note" style.
> They talk about diversity and probably have some internal tooling/evals to test for it; as always, it's still a bit unclear to me how to measure that properly.
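As a toy illustration of rephrasing long documents chunk by chunk, here is a sketch using huggingface_hub's InferenceClient; the model name, the prompt, and the character-based chunking are placeholder choices on my side, not what the K2 team actually used.

```python
from huggingface_hub import InferenceClient

client = InferenceClient()  # expects an HF token in the environment

def rephrase_document(text: str, style: str = "learning note",
                      chunk_chars: int = 4000,
                      model: str = "Qwen/Qwen2.5-72B-Instruct") -> str:
    """Rephrase a long document chunk by chunk in a given style (toy sketch)."""
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    rewritten = []
    for chunk in chunks:
        out = client.chat_completion(
            model=model,
            messages=[{
                "role": "user",
                "content": f"Rewrite the following text in a {style} style, "
                           f"keeping every fact intact:\n\n{chunk}",
            }],
            max_tokens=1024,
        )
        rewritten.append(out.choices[0].message.content)
    return "\n".join(rewritten)
```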

The infra is also very nice; quick summary:
> PP=16 (1F1B schedule, a bit custom), EP=16, ZeRO-1
> No FP8 compute, but FP8 storage for specific layers; selective recomputation for inexpensive blocks; activation offloading to CPU (see the sketch below for those last two)
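For the last two points, here is what selective recomputation and CPU activation offloading look like with stock PyTorch APIs; this is just to illustrate the techniques, the K2 training stack is obviously custom.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    """Stand-in for an inexpensive sub-block we are happy to recompute."""
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.net(x)

device = "cuda" if torch.cuda.is_available() else "cpu"
blocks = nn.ModuleList([Block() for _ in range(4)]).to(device)
x = torch.randn(8, 1024, device=device, requires_grad=True)

# Selective recomputation: don't store these blocks' activations for backward,
# recompute them during the backward pass instead.
h = x
for block in blocks:
    h = checkpoint(block, h, use_reentrant=False)

# Activation offloading: tensors saved for backward are parked in (pinned) CPU
# memory and copied back to the device only when backward needs them.
with torch.autograd.graph.save_on_cpu(pin_memory=torch.cuda.is_available()):
    h = h * torch.sigmoid(h)

h.sum().backward()
```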
loubnabnl posted an update 3 months ago

last-edits (#110, opened 3 months ago by lvwerra)
update (#102, opened 5 months ago by nouamanetazi)
julien-c posted an update 4 months ago:
BOOOOM: Today I'm dropping TINY AGENTS

the 50-lines-of-code Agent in JavaScript 🔥

I spent the last few weeks working on this, so I hope you will like it.

I've been diving into MCP (Model Context Protocol) to understand what the hype was all about.

It is fairly simple, but still quite powerful: MCP is a standard API to expose sets of Tools that can be hooked to LLMs.

But while doing that, I had a second realization:

Once you have an MCP Client, an Agent is literally just a while loop on top of it. 🤯
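To make that concrete, here is a minimal sketch of that while loop in Python (the actual Tiny Agents code is JavaScript, and the `llm` / `mcp_client` interfaces below are hypothetical stand-ins, not the real MCP SDK or Tiny Agents API):

```python
import json

def run_agent(llm, mcp_client, user_message: str, max_turns: int = 10):
    """The whole agent: a loop that alternates LLM calls and MCP tool calls."""
    tools = mcp_client.list_tools()        # tool schemas exposed by the MCP server (hypothetical API)
    messages = [{"role": "user", "content": user_message}]

    for _ in range(max_turns):             # the "while loop"
        reply = llm.chat(messages=messages, tools=tools)   # hypothetical LLM client
        messages.append(reply)

        if not reply.get("tool_calls"):    # no tool requested -> the model answered
            return reply["content"]

        for call in reply["tool_calls"]:   # run each requested tool through MCP
            result = mcp_client.call_tool(call["name"], json.loads(call["arguments"]))
            messages.append({"role": "tool", "name": call["name"], "content": str(result)})

    return "Stopped after max_turns."
```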

โžก๏ธ read it exclusively on the official HF blog: https://huggingface.co/blog/tiny-agents