AI & ML interests

AI, NLP, Computer Vision, Transformers

nouamanetazi
posted an update about 1 month ago
After training SmolLM3 on 384 H100s for nearly a month, I've come to realize something most people overlook: infrastructure is the make-or-break factor in LLM training. 🔥

Everyone talks about model architecture and data quality. And yes, those matter immensely. But here's what nobody tells you: when your training run fails at 2 AM because of mysterious NCCL errors, or when your expensive GPU cluster is running at 60% efficiency, the problem isn't your model. It's most probably a misuse of the hardware. 🛠️

Questions that seemed simple but had no clear answers: Why is MoE training slower than dense models? Which NCCL flags should we actually set? How often should we checkpoint without killing throughput?

That's why we built The Smol Training Playbook 📖: a complete guide covering everything from model architecture and data curation to the SmolLM3 training marathon, post-training techniques, and crucially, the infrastructure layer that most teams get wrong.

We validated real vs theoretical bandwidth across the entire stack: HBM3 hitting 3 TB/s, NVLink 4.0 reaching 786 GB/s, PCIe Gen4 at 14.2 GB/s. Then we ran collective operations across 128 GPUs (16 nodes, 8xH100s each) and measured how performance degrades at scale: all-reduce drops from 480 GB/s on a single node to 320-350 GB/s across 16 nodes.
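
To put the all-reduce numbers in perspective, here is a minimal back-of-the-envelope sketch in Python. It assumes the figures quoted in this post (480 GB/s on a single node, 320-350 GB/s across 16 nodes of 8 GPUs each). The helper name `retained_fraction` is made up for illustration and does not come from the playbook.

```python
# Back-of-the-envelope check on the all-reduce scaling figures quoted in the post.
# All constants are taken from the post; the function name is hypothetical.

def retained_fraction(multi_node_gbps: float, single_node_gbps: float) -> float:
    """Fraction of single-node all-reduce bandwidth still available at scale."""
    return multi_node_gbps / single_node_gbps

NODES = 16
GPUS_PER_NODE = 8
SINGLE_NODE_ALLREDUCE = 480.0          # GB/s on one node
MULTI_NODE_ALLREDUCE = (320.0, 350.0)  # GB/s range across 16 nodes

total_gpus = NODES * GPUS_PER_NODE
low, high = (retained_fraction(bw, SINGLE_NODE_ALLREDUCE) for bw in MULTI_NODE_ALLREDUCE)

print(f"{total_gpus} GPUs: all-reduce retains {low:.0%} to {high:.0%} of single-node bandwidth")
```

In other words, going from 1 node to 16 nodes costs roughly a quarter to a third of the all-reduce bandwidth under these numbers, which is the kind of gap worth measuring on your own cluster before scaling up.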

If you've ever wondered why your training runs are slower than they should be, or you're planning to scale up and want to avoid expensive mistakes, this guide might save you weeks of debugging.

š“š”šž š’š¦šØš„ š“š«ššš¢š§š¢š§š  šš„ššš²š›šØšØš¤: https://lnkd.in/e5MKXUHS

Shared with ❤️ by the HuggingFace team
maghwa
updated a Space 10 months ago
nouamanetazi
posted an update almost 2 years ago