Representation & Optimization
Understanding representation sheds light on optimization
Paper • 2405.14544 • Published • 1
Note: The Cauchy-Schwarz inequality for matrices allows penalizing the element-wise Frobenius norm to encourage low-rank representations.
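A minimal sketch of my reading of that note (toy tensors, not the paper's exact loss): the element-wise squared Frobenius norm is cheap and differentiable, and can be added as a penalty on the representations.

```python
import torch

# Toy (batch, dim) representations; in practice these come from a network.
feats = torch.randn(32, 64, requires_grad=True)
frob_penalty = feats.pow(2).sum()       # ||feats||_F^2, computed element-wise
task_loss = feats.mean()                # stand-in for a real task loss
loss = task_loss + 1e-3 * frob_penalty  # penalty weight is a made-up value
loss.backward()
```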
Token embeddings violate the manifold hypothesis
Paper • 2504.01002 • Published • 1
Note: Some tokens have more synonyms than others.
-
Approximate Nullspace Augmented Finetuning for Robust Vision Transformers
Paper • 2403.10476 • Published • 1 -
ElaLoRA: Elastic & Learnable Low-Rank Adaptation for Efficient Model Fine-Tuning
Paper • 2504.00254 • Published • 1
FlexAttention: A Programming Model for Generating Optimized Attention Kernels
Paper • 2412.05496 • Published • 1
Note: Customizable attention masks with optimized performance comparable to FlashAttention.
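A quick illustration using the flex_attention API that ships in recent PyTorch (2.5+); shapes are toy values and the score_mod is the standard causal-mask example, nothing specific to this entry:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

# Toy shapes: (batch, heads, seq_len, head_dim)
q, k, v = (torch.randn(1, 4, 128, 64) for _ in range(3))

def causal(score, b, h, q_idx, kv_idx):
    # Mask out future positions by sending their scores to -inf.
    return torch.where(q_idx >= kv_idx, score, float("-inf"))

out = flex_attention(q, k, v, score_mod=causal)
```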
-
Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad
Paper • 2503.21934 • Published
Value Residual Learning For Alleviating Attention Concentration In Transformers
Paper • 2410.17897 • Published • 9
Note: Halves the KV cache by sharing value embeddings across attention blocks.
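A minimal sketch of the sharing idea as I read it (names and the fixed lam are mine; the paper learns the mixing weight): later blocks blend the first block's values with their own, so the cache keeps one value tensor instead of one per block.

```python
import torch

def residual_value(v_first: torch.Tensor, v_current: torch.Tensor,
                   lam: float = 0.5) -> torch.Tensor:
    # Reuse the first block's values as a residual anchor for this block.
    return lam * v_first + (1.0 - lam) * v_current
```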
-
Hogwild! Inference: Parallel LLM Generation via Concurrent Attention
Paper • 2504.06261 • Published • 110 -
EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test
Paper • 2503.01840 • Published • 5 -
Is the Reversal Curse a Binding Problem? Uncovering Limitations of Transformers from a Basic Generalization Failure
Paper • 2504.01928 • Published • 1 -
Gradient Surgery for Multi-Task Learning
Paper • 2001.06782 • Published • 1 -
SelfCP: Compressing Long Prompt to 1/12 Using the Frozen Large Language Model Itself
Paper • 2405.17052 • Published • 2 -
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Paper • 2403.19647 • Published • 4 -
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Paper • 2504.13837 • Published • 139 -
It's All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization
Paper • 2504.13173 • Published • 18 -
Representation Learning with Contrastive Predictive Coding
Paper • 1807.03748 • Published • 1 -
Training Large Language Models to Reason in a Continuous Latent Space
Paper • 2412.06769 • Published • 92 -
SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference
Paper • 2502.18137 • Published • 59 -
Generalized Neighborhood Attention: Multi-dimensional Sparse Attention at the Speed of Light
Paper • 2504.16922 • Published • 1 -
Interpreting Emergent Planning in Model-Free Reinforcement Learning
Paper • 2504.01871 • Published • 12
Enhancing Personalized Multi-Turn Dialogue with Curiosity Reward
Paper • 2504.03206 • Published • 1
Note: Potential-based reward shaping (PBRS) can be used for gated regularization.
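For reference, the classic PBRS form (Ng et al.); `phi` here is a hypothetical potential function over states:

```python
def shaped_reward(r: float, s, s_next, phi, gamma: float = 0.99) -> float:
    # The shaping term gamma * phi(s') - phi(s) telescopes over a trajectory,
    # so it changes the learning signal without changing the optimal policy.
    return r + gamma * phi(s_next) - phi(s)
```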
-
Overtrained Language Models Are Harder to Fine-Tune
Paper • 2503.19206 • Published • 2 -
Long Context In-Context Compression by Getting to the Gist of Gisting
Paper • 2504.08934 • Published • 1 -
Model Diffusion for Certifiable Few-shot Transfer Learning
Paper • 2502.06970 • Published • 1
Memorization-Compression Cycles Improve Generalization
Paper • 2505.08727 • Published • 5
Note: Occam's razor expressed in mathematical terms inspires new ways of training LLMs that rely less on sheer quantity of data.
-
Chain-of-Model Learning for Language Model
Paper • 2505.11820 • Published • 121 -
Shannon information and integrated information: message and meaning
Paper • 2412.10626 • Published • 1 -
Let's Predict Sentence by Sentence
Paper • 2505.22202 • Published • 19 -
Learning to Reason without External Rewards
Paper • 2505.19590 • Published • 29 -
Pre-trained Large Language Models Learn Hidden Markov Models In-context
Paper • 2506.07298 • Published • 26 -
The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
Paper • 2506.06941 • Published • 15 -
A projection-based framework for gradient-free and parallel learning
Paper • 2506.05878 • Published • 2 -
In-Context Learning Strategies Emerge Rationally
Paper • 2506.17859 • Published • 10 -
Global and Local Entailment Learning for Natural World Imagery
Paper • 2506.21476 • Published • 1 -
Radial Attention: O(n log n) Sparse Attention with Energy Decay for Long Video Generation
Paper • 2506.19852 • Published • 42
Data Efficacy for Language Model Training
Paper • 2506.21545 • Published • 11
Note: The 'learnability' metric requires training a small LM beforehand rather than being computed online; in that sense, selecting 'easy-to-learn' samples is an old idea.
Energy-Based Transformers are Scalable Learners and Thinkers
Paper • 2507.02092 • Published • 69
Note: Using a neural network to directly predict outputs makes inference fast but makes search-based reasoning at inference time feel unnatural. In contrast, training a network to predict a loss (energy) function naturally supports gradient-based search at inference time, which is more aligned with tasks like image generation in continuous domains. However, this approach is 3× heavier at both training and inference.
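A minimal sketch of that inference-time search (the tiny energy network and shapes are toy stand-ins, not the paper's model): instead of predicting y directly, a network E(x, y) scores compatibility, and inference refines a candidate y by gradient descent on the energy.

```python
import torch

energy_net = torch.nn.Sequential(torch.nn.Linear(8, 32), torch.nn.ReLU(),
                                 torch.nn.Linear(32, 1))
x = torch.randn(4)                       # input
y = torch.randn(4, requires_grad=True)   # candidate output, refined at test time
opt = torch.optim.SGD([y], lr=0.1)

for _ in range(20):                      # the search-based "thinking" loop
    opt.zero_grad()
    energy_net(torch.cat([x, y])).squeeze().backward()
    opt.step()                           # only y is updated; the net stays frozen
```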
-
Tensor Product Attention Is All You Need
Paper • 2501.06425 • Published • 90
Questioning Representational Optimism in Deep Learning: The Fractured Entangled Representation Hypothesis
Paper • 2505.11581 • Published • 3
Note: Deep learning tends to favor high-entropy representations.
-
Towards Distributed Neural Architectures
Paper • 2506.22389 • Published • 2 -
Scaling RL to Long Videos
Paper • 2507.07966 • Published • 159 -
Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs
Paper • 2507.07990 • Published • 45
StreamDiT: Real-Time Streaming Text-to-Video Generation
Paper • 2507.03745 • Published • 31
Note: Training to stream with a monotonically increasing noise level.
-
What Has a Foundation Model Found? Using Inductive Bias to Probe for World Models
Paper • 2507.06952 • Published • 7 -
Potemkin Understanding in Large Language Models
Paper • 2506.21521 • Published • 3
Large Language Diffusion Models
Paper • 2502.09992 • Published • 123
Note: Current-token prediction with the [Mask] token embedding. Iterative inference with re-masking of high-perplexity tokens.
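A minimal sketch of that decode loop under toy assumptions (hypothetical model interface; the confidence-based re-masking schedule is simplified): predict every position at once, keep confident fills, and re-mask the least-confident positions for the next pass.

```python
import torch

def diffusion_decode(model, ids: torch.Tensor, mask_id: int, steps: int = 8):
    masked = ids == mask_id
    for step in range(1, steps + 1):
        probs = model(ids).softmax(-1)        # (seq_len, vocab) toy interface
        conf, pred = probs.max(-1)
        ids = torch.where(masked, pred, ids)  # fill all masked positions
        # Re-mask a shrinking fraction of the least-confident filled positions.
        k = int(masked.sum().item() * (1 - step / steps))
        if k <= 0:
            break
        low = conf.masked_fill(~masked, float("inf")).topk(k, largest=False).indices
        ids[low] = mask_id
        masked = ids == mask_id
    return ids
```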
Beyond Masked and Unmasked: Discrete Diffusion Models via Partial Masking
Paper • 2505.18495 • Published • 1
Note: Current-token prediction with an interpolated [Mask]-to-predicted-token embedding.
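A minimal sketch of the interpolation as I read it (toy embedding table, made-up ids and alpha):

```python
import torch

vocab, dim = 100, 32
embed = torch.nn.Embedding(vocab + 1, dim)   # reserve the last row for [Mask]
mask_id, pred_id = vocab, 42                 # toy token ids
alpha = 0.3                                  # partial-masking level in [0, 1]
# Input embedding blends the [Mask] embedding with the predicted token's.
mixed = alpha * embed.weight[mask_id] + (1 - alpha) * embed.weight[pred_id]
```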
Anchored Diffusion Language Model
Paper • 2505.18456 • Published • 1
Note: Identifies the right problem for diffusion language modeling: I can't imagine what I'd say five words ahead without saying the five words first.
Fractal Generative Models
Paper • 2502.17437 • Published • 1
Note: Coarse-to-fine generation with recursive forward propagation.
-
∇NABLA: Neighborhood Adaptive Block-Level Attention
Paper • 2507.13546 • Published • 124 -
Agentic Reinforced Policy Optimization
Paper • 2507.19849 • Published • 158
The Serial Scaling Hypothesis
Paper • 2507.12549 • Published • 11
Note: Many computational processes can't be parallelized; speeding up serial computation is also desirable.
-
Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
Paper • 2508.01191 • Published • 238
Hierarchical Reasoning Model
Paper • 2506.21734 • Published • 46
Note: A recurrent model at multiple levels, aligned to different temporal scales.
On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification
Paper • 2508.05629 • Published • 180
Note: Down-scaling the impact of hard examples improves the cross-entropy loss in the post-training stage, a.k.a. "don't break anything, yo dog".
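A minimal sketch of one way to realize that down-scaling (my reading, with toy tensors; the paper's exact rectification may differ): weight each token's cross-entropy loss by the model's own detached probability, so hard (low-probability) tokens pull less.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(8, 100, requires_grad=True)   # (tokens, vocab), toy
targets = torch.randint(0, 100, (8,))
ce = F.cross_entropy(logits, targets, reduction="none")
p = logits.softmax(-1).gather(-1, targets[:, None]).squeeze(-1).detach()
loss = (p * ce).mean()   # low-probability (hard) tokens are down-weighted
loss.backward()
```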
-
Differentiable Causal Discovery For Latent Hierarchical Causal Models
Paper • 2411.19556 • Published • 1 -
Emergent Symbolic Mechanisms Support Abstract Reasoning in Large Language Models
Paper • 2502.20332 • Published • 2 -
Benchmarking Abstract and Reasoning Abilities Through A Theoretical Perspective
Paper • 2505.23833 • Published • 1 -
Untrained neural networks can demonstrate memorization-independent abstract reasoning
Paper • 2407.17791 • Published • 1 -
Building, Reusing, and Generalizing Abstract Representations from Concrete Sequences
Paper • 2410.21332 • Published • 1 -
What is an "Abstract Reasoner"? Revisiting Experiments and Arguments about Large Language Models
Paper • 2507.22457 • Published • 1 -
Residual Connections Harm Generative Representation Learning
Paper • 2404.10947 • Published • 1 -
XAttention: Block Sparse Attention with Antidiagonal Scoring
Paper • 2503.16428 • Published • 15
Group Sequence Policy Optimization
Paper • 2507.18071 • Published • 316
Note: GRPO training is unstable for MoE models because token_ratio = policy_new(token) / policy_old(token) easily spikes when the old and new policies make different routing choices. GSPO instead uses seq_ratio = policy_new(seq) / policy_old(seq) and finds this more stable.
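A minimal sketch contrasting the two ratios on toy log-probs (the length-normalized sequence ratio follows GSPO's geometric-mean form; names are mine):

```python
import torch

logp_new = torch.randn(4, 16)   # (num_seqs, seq_len) per-token log-probs, new policy
logp_old = torch.randn(4, 16)   # same shape, old policy
delta = logp_new - logp_old

token_ratio = delta.exp()          # GRPO: per-token, spikes under routing drift
seq_ratio = delta.mean(-1).exp()   # GSPO: geometric mean over the whole sequence
```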
-
TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling
Paper • 2508.17445 • Published • 80 -
Reasoning-Intensive Regression
Paper • 2508.21762 • Published • 2 -
Reasoning Vectors: Transferring Chain-of-Thought Capabilities via Task Arithmetic
Paper • 2509.01363 • Published • 58 -
SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning
Paper • 2509.02479 • Published • 83 -
GenCompositor: Generative Video Compositing with Diffusion Transformer
Paper • 2509.02460 • Published • 25
Towards a Unified View of Large Language Model Post-Training
Paper • 2509.04419 • Published • 75
Note: When SFT demo data and an RL signal are both available, add the two losses together and optimize; when the model is weak, weight SFT more, and vice versa. Duh.
Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning
Paper • 2509.03646 • Published • 32
Note: Good.
CoT-Space: A Theoretical Framework for Internal Slow-Thinking via Reinforcement Learning
Paper • 2509.04027 • Published • 2
Note: I think the "continuous manifold of semantic space as the reasoning-token count approaches infinity" breaks when they introduce a metric based on "prefix" and then "distance" ...
-
What Fundamental Structure in Reward Functions Enables Efficient Sparse-Reward Learning?
Paper • 2509.03790 • Published • 1 -
Differentiable Entropy Regularization for Geometry and Neural Networks
Paper • 2509.03733 • Published • 1 -
Language Models Do Not Follow Occam's Razor: A Benchmark for Inductive and Abductive Reasoning
Paper • 2509.03345 • Published -
Dynamic Speculative Agent Planning
Paper • 2509.01920 • Published • 6 -
Mixture of Contexts for Long Video Generation
Paper • 2508.21058 • Published • 35 -
Reinforcement Learning for Machine Learning Engineering Agents
Paper • 2509.01684 • Published • 1 -
BED-LLM: Intelligent Information Gathering with LLMs and Bayesian Experimental Design
Paper • 2508.21184 • Published • 2
RL's Razor: Why Online Reinforcement Learning Forgets Less
Paper • 2509.04259 • Published • 6
Note: Removing the explicit KL regularization loss is not the same as having no KL regularization. PPO-style clipping is a powerful implicit KL regularization technique; we'd be stupid to ignore it and then rediscover: hey! this thing reduces KL deviation ...
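For reference, the standard PPO clipped objective whose ratio clipping does that implicit regularization (toy values):

```python
import torch

ratio = torch.tensor([0.5, 1.0, 1.8])   # pi_new / pi_old per action, toy values
adv = torch.tensor([1.0, -0.5, 2.0])    # advantage estimates, toy values
eps = 0.2
# Clipping the ratio to [1 - eps, 1 + eps] caps each update's incentive to
# move probability mass, implicitly bounding drift from the old policy.
loss = -torch.min(ratio * adv, ratio.clamp(1 - eps, 1 + eps) * adv).mean()
```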
-
LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities
Paper • 2504.16078 • Published • 21 -
Distilled Pretraining: A modern lens of Data, In-Context Learning and Test-Time Scaling
Paper • 2509.01649 • Published • 2
Reverse-Engineered Reasoning for Open-Ended Generation
Paper • 2509.06160 • Published • 150
Note: Manual rollout and selection of the reasoning process against a perplexity target.
-
Scalable Reinforcement Post-Training Beyond Static Human Prompts: Evolving Alignment via Asymmetric Self-Play
Paper • 2411.00062 • Published • 1 -
LongLive: Real-time Interactive Long Video Generation
Paper • 2509.22622 • Published • 184 -
Mem-α: Learning Memory Construction via Reinforcement Learning
Paper • 2509.25911 • Published • 14 -
Thinking Sparks!: Emergent Attention Heads in Reasoning Models During Post Training
Paper • 2509.25758 • Published • 22 -
Generalized Parallel Scaling with Interdependent Generations
Paper • 2510.01143 • Published • 5 -
RLAD: Training LLMs to Discover Abstractions for Solving Reasoning Problems
Paper • 2510.02263 • Published • 8 -
Nonparametric Identification of Latent Concepts
Paper • 2510.00136 • Published -
The Three Regimes of Offline-to-Online Reinforcement Learning
Paper • 2510.01460 • Published • 1 -
RLP: Reinforcement as a Pretraining Objective
Paper • 2510.01265 • Published • 40 -
Agent Learning via Early Experience
Paper • 2510.08558 • Published • 270 -
Memory Retrieval and Consolidation in Large Language Models through Function Tokens
Paper • 2510.08203 • Published • 9 -
Artificial Hippocampus Networks for Efficient Long-Context Modeling
Paper • 2510.07318 • Published • 30 -
The End of Manual Decoding: Towards Truly End-to-End Language Models
Paper • 2510.26697 • Published • 116 -
Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
Paper • 2510.25992 • Published • 45 -
Scaling Latent Reasoning via Looped Language Models
Paper • 2510.25741 • Published • 221 -
Distribution Matching Variational AutoEncoder
Paper • 2512.07778 • Published • 28 -
Bolmo: Byteifying the Next Generation of Language Models
Paper • 2512.15586 • Published • 14 -
Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers
Paper • 2512.17351 • Published • 24 -
When Reasoning Meets Its Laws
Paper • 2512.17901 • Published • 54 -
Error-Free Linear Attention is a Free Lunch: Exact Solution from Continuous-Time Dynamics
Paper • 2512.12602 • Published • 41 -
Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning
Paper • 2512.20605 • Published • 59
Latent Implicit Visual Reasoning
Paper • 2512.21218 • Published • 63
Note: Creates a "path of least resistance" via a prior stage with bottleneck training to encourage latent dependency.
-
The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding
Paper • 2512.19693 • Published • 61 -
TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times
Paper • 2512.16093 • Published • 90 -
Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss
Paper • 2512.23447 • Published • 87 -
Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process
Paper • 2512.23988 • Published • 12
Pretraining Frame Preservation in Autoregressive Video Memory Compression
Paper • 2512.23851 • Published • 15
Note: "Random" masking of frames for reconstruction is important to avoid the "global reconstruction" trap, which ignores many local features that are obvious to human cognition.
An Information Theoretic Perspective on Agentic System Design
Paper • 2512.21720 • Published • 7
Note: A bit of an over-reach: essentially a teacher model "grades" the student model's summaries and a group-relative advantage is computed; claiming equivalence to mutual information is like claiming GRPO optimizes mutual information: fine, but redundant.