Less is More: Recursive Reasoning with Tiny Networks
Abstract
Tiny Recursive Model (TRM) achieves high generalization on complex puzzle tasks using a small, two-layer network with minimal parameters, outperforming larger language models.
Hierarchical Reasoning Model (HRM) is a novel approach using two small neural networks recursing at different frequencies. This biologically inspired method beats Large Language Models (LLMs) on hard puzzle tasks such as Sudoku, Maze, and ARC-AGI while trained with small models (27M parameters) on small data (around 1000 examples). HRM holds great promise for solving hard problems with small networks, but it is not yet well understood and may be suboptimal. We propose Tiny Recursive Model (TRM), a much simpler recursive reasoning approach that achieves significantly higher generalization than HRM, while using a single tiny network with only 2 layers. With only 7M parameters, TRM obtains 45% test accuracy on ARC-AGI-1 and 8% on ARC-AGI-2, higher than most LLMs (e.g., DeepSeek R1, o3-mini, Gemini 2.5 Pro) with less than 0.01% of the parameters.
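For readers curious what the recursion pattern looks like in practice, here is a minimal, simplified sketch of a TRM-style update loop: a single small network alternately refines a latent state z from the question x and the current answer y, then refines y from z. The module choice (a plain 2-layer MLP here), the dimensions, and the omission of deep supervision and halting are illustrative assumptions, not the authors' exact code.

```python
import torch
import torch.nn as nn

class TinyRecursiveSketch(nn.Module):
    """Illustrative sketch of a TRM-style recursion (not the authors' code).

    One small network is reused to (a) refine a latent state z from the
    question x and current answer y, and (b) refine the answer y from z.
    """
    def __init__(self, dim: int = 128, n_latent_steps: int = 6, n_answer_steps: int = 3):
        super().__init__()
        # Hypothetical 2-layer core; the real model uses a tiny attention/MLP block.
        self.core = nn.Sequential(nn.Linear(3 * dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.n_latent_steps = n_latent_steps
        self.n_answer_steps = n_answer_steps

    def forward(self, x, y, z):
        for _ in range(self.n_answer_steps):
            # Recursively refine the latent reasoning state given x and y.
            for _ in range(self.n_latent_steps):
                z = self.core(torch.cat([x, y, z], dim=-1))
            # Update the current answer from the refined latent (x masked out).
            y = self.core(torch.cat([torch.zeros_like(x), y, z], dim=-1))
        return y, z

# Usage with toy shapes: x, y, z are embeddings of the question, current answer, latent state.
x, y, z = torch.randn(8, 128), torch.zeros(8, 128), torch.zeros(8, 128)
model = TinyRecursiveSketch()
y_hat, z_hat = model(x, y, z)
```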
Community
Less is More
It seems to perform very well on task tuning, e.g. the Sudoku... Even if it's not AGI or anything like that, this could be revolutionary for small, domain-specific tasks.
would love to see this 'issue' solved: https://github.com/SamsungSAILMontreal/TinyRecursiveModels/issues/2
;)
How do normal supervised learning methods perform on these types of tasks? I'm confused about what differentiates this from a larger forward pass. Or is it simply the learning efficiency?
This sentence in the Conclusion implies this architecture is superior to a massive net without recursive reasoning:
the question of why recursion helps so much compared to using a larger and deeper network remains to be explained; we suspect it has to do with overfitting, but we have no theory to back this explaination (sic)
I don't understand this new trend of comparing models like HRM and TRM with LLMs?
How is that relevant? They're not LLMs. The term "reasoning" here has nothing to do with reasoning in LLMs. I don't even think these techniques are applicable to LLMs, are they?
Like, of course a specialized model, trained for a specific task, is going to perform better than an LLM trained for an entirely different class of problems, right?
For that matter, how is it relevant to test these models on ARC-AGI, which is a benchmark to evaluate the problem solving capabilities of LLMs?
It's apples to oranges, isn't it? Jet airplanes to weather balloons? The weather balloon is obviously way better at monitoring the weather, but jet airplanes have quite a few more uses.
Compare these models against other specialized models: are they significantly smaller or faster?
Give us data we can actually compare.
I honestly have no idea if there's anything truly novel about these models, because you haven't provided any relevant comparison to anything remotely similar. 🤷‍♂️
The ARC-AGI benchmarks evaluate the fluid intelligence of any AI, not only LLMs.
The fact that TRMs perform better than any known architecture on those benchmarks is interesting on its own.
I guess no one grasps the larger truth. This style of recursive reasoning is really belief-state engineering. TRM realizes it at the architectural level, although you can achieve something similar with an extra encoder in normal decoder-only LMs during training. Cool paper. Hope to see more like this, and I hope people extrapolate belief-state engineering to other research facets; one day it will replace RL.
Thanks, I'd never heard of belief-state engineering.
I found this definition from:
Hu, E. S., Ahn, K., Liu, Q., Xu, H., Tomar, M., Langford, A., Jayaraman, D., Lamb, A., and Langford, J. (2024). Learning to Achieve Goals with Belief State Transformers. arXiv preprint arXiv:2410.23506. https://doi.org/10.48550/arXiv.2410.23506
"Informally, a belief state is a sufficient amount of information from the past to predict the outcomes of all experiments in the future, which can be expressed as either a distribution over underlying world states or a distribution over future outcomes."
TRM's comparison to LLMs conflates two separate issues. First, the training regime: TRM uses 1000x data augmentation per example, about 1M effective samples, while Gemini and DeepSeek are zero-shot on ARC-AGI. A Transformer trained the same way would perform similarly. This isn't an architecture advantage, it's a data advantage.

Second, the task structure mismatch is fundamental. ARC-AGI has deterministic solutions in fixed-dimensional spaces. LLMs generate variable-length token sequences with discrete vocabulary constraints. Recursive latent refinement and autoregressive token generation operate in completely different optimization spaces. The recursive approach is conceptually interesting for LLMs (Meta's Coconut explores this), but you can't benchmark it fairly on ARC-AGI. A proper test would integrate TRM's latent recursion into an LLM's hidden states and measure actual language task performance, not geometric puzzle accuracy.
That completely changes the perception here. Data augmentation using domain rules, for example in the case of Sudoku, would basically yield an infinite supply of training data. That level of augmentation is impossible for most practical real-world datasets.
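To make the augmentation point concrete, here is a small illustrative sketch of validity-preserving Sudoku augmentations (digit relabeling, row shuffles within bands, transposition). This is a hypothetical example of domain-rule augmentation, not the paper's actual pipeline.

```python
import numpy as np

def augment_sudoku(grid: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Return a new valid puzzle from a 9x9 integer grid (0 = blank) using
    validity-preserving symmetries. Illustrative only; not the paper's pipeline."""
    g = grid.copy()
    # 1) Relabel digits with a random permutation of 1..9.
    perm = rng.permutation(np.arange(1, 10))
    filled = g > 0
    g[filled] = perm[g[filled] - 1]
    # 2) Shuffle rows within each 3-row band (keeps rows, columns, and boxes valid).
    for band in range(3):
        order = band * 3 + rng.permutation(3)
        g[band * 3 : band * 3 + 3] = g[order]
    # 3) Randomly transpose (rows and columns swap; still a valid Sudoku).
    if rng.random() < 0.5:
        g = g.T.copy()
    return g

# Usage: rng = np.random.default_rng(0); new_puzzle = augment_sudoku(puzzle, rng)
```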
Very interesting work. The first question that popped into my mind when I saw "recursive" was speed: how fast does each question get processed compared to the other models in this paper?
This, on a tiny, extremely efficient node, will make home AI systems possible for everyone. No need for a huge server rack in the basement.
Hi, I love the work @AlexiaJM. Have you considered adding register tokens? I think register tokens might give some improvement because they alleviate attention noise and outliers. Since the model relies on recursive depth, alleviating attention noise and outliers might lead to more stability. Wondering if this has been tested or tried before.
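For context, "register tokens" here means extra learnable tokens prepended to the sequence so attention has somewhere to dump spurious mass; their outputs are discarded afterwards. Below is a hedged sketch of how one might bolt this onto an attention block; the class name, parameters, and wiring are made up for illustration and are not part of the TRM codebase.

```python
import torch
import torch.nn as nn

class RegisterTokenBlock(nn.Module):
    """Hypothetical add-on (not in the TRM codebase): prepend learnable
    register tokens before self-attention, then drop their outputs, so they
    can soak up attention noise and outlier mass."""
    def __init__(self, dim: int = 128, n_heads: int = 4, n_registers: int = 4):
        super().__init__()
        self.registers = nn.Parameter(0.02 * torch.randn(1, n_registers, dim))
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.n_registers = n_registers

    def forward(self, tokens):                        # tokens: [B, T, D]
        reg = self.registers.expand(tokens.size(0), -1, -1)
        h = torch.cat([reg, tokens], dim=1)           # [B, R + T, D]
        h, _ = self.attn(h, h, h, need_weights=False)
        return h[:, self.n_registers:]                # keep only the original T positions
```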
Here is a bite-sized podcast I created with AI on the paper! https://open.spotify.com/episode/6OIKWXIFjw1a2PHSBR4Fm6?si=fb252fb2000d45e5