leonardlin's Collections: data
A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity
Paper
• 2305.13169
• Published • 4
A Survey on Data Selection for Language Models
Paper
• 2402.16827
• Published • 4
HuggingFaceFW/fineweb-edu
• Updated • 3.5B • 296k
• 998
Best Practices and Lessons Learned on Synthetic Data for Language Models
Paper
• 2404.07503
• Published • 31
Scaling Synthetic Data Creation with 1,000,000,000 Personas
Paper
• 2406.20094
• Published • 106
DDK: Distilling Domain Knowledge for Efficient Large Language Models
Paper
• 2407.16154
• Published • 22
Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models
Paper
• 2408.02085
• Published • 19
Better Alignment with Instruction Back-and-Forth Translation
Paper
• 2408.04614
• Published • 15
The ShareLM Collection and Plugin: Contributing Human-Model Chats for the Benefit of the Community
Paper
• 2408.08291
• Published • 11