Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention • Paper 2510.04212 • Published 14 days ago
Grass: Compute Efficient Low-Memory LLM Training with Structured Sparse Gradients • Paper 2406.17660 • Published Jun 25, 2024