The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text Paper • 2506.05209 • Published 2 days ago • 27
Emergent and Predictable Memorization in Large Language Models Paper • 2304.11158 • Published Apr 21, 2023
KMMLU: Measuring Massive Multitask Language Understanding in Korean Paper • 2402.11548 • Published Feb 18, 2024
Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence Paper • 2404.05892 • Published Apr 8, 2024 • 39
The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources Paper • 2406.16746 • Published Jun 24, 2024
Consent in Crisis: The Rapid Decline of the AI Data Commons Paper • 2407.14933 • Published Jul 20, 2024 • 12
Lessons from the Trenches on Reproducible Evaluation of Language Models Paper • 2405.14782 • Published May 23, 2024
Bridging the Data Provenance Gap Across Text, Speech and Video Paper • 2412.17847 • Published Dec 19, 2024 • 9
Recite, Reconstruct, Recollect: Memorization in LMs as a Multifaceted Phenomenon Paper • 2406.17746 • Published Jun 25, 2024
When AI Co-Scientists Fail: SPOT-a Benchmark for Automated Verification of Scientific Research Paper • 2505.11855 • Published 22 days ago • 9
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics Paper • 2506.01844 • Published 5 days ago • 75
Post • New in smolagents v1.17.0: Structured generation in CodeAgent 🧱; Streamable HTTP MCP support 🌐; Agent.run() returns rich RunResult 📦. Smarter agents, smoother workflows. Try it now: https://github.com/huggingface/smolagents/releases/tag/v1.17.0
Masader: Metadata Sourcing for Arabic Text and Speech Data Resources Paper • 2110.06744 • Published Oct 13, 2021
Rephrasing natural text data with different languages and quality levels for Large Language Model pre-training Paper • 2410.20796 • Published Oct 28, 2024
Ashaar: Automatic Analysis and Generation of Arabic Poetry Using Deep Learning Approaches Paper • 2307.06218 • Published Jul 12, 2023
MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs Paper • 2505.19800 • Published 13 days ago • 1