Article 2. Trainable Dynamic Mask Sparse Attention: Bridging Efficiency and Effectiveness in Long-Context Language Models