Pointer: Linear-Complexity Long-Range Modeling without Pre-training
Authors: Zixi Li, Noesis Lab · Sun Yat-sen University
Date: 5 August 2025
Paper: arXiv:2508.02631v1
Abstract
We introduce Pointer, a novel sequence model that attains linear complexity by replacing quadratic self-attention with layer-wise pointer chains. Pointer reaches 2–10× speed-ups on long sequences, keeps >95% accuracy on copy tasks at distances up to 2,048 tokens, and learns interpretable long-range patterns, all without any large-scale pre-training.
Key Contributions
- Linear complexity: \(O(n)\) runtime yields 2–10× faster training and inference
- No pre-training required: Effective from scratch on modest hardware
- Explicit long-range modeling: Pointer chains create token dependency paths
- Interpretability: Each token points to exactly one other token
Method in a Nutshell
Pointer Selection
For position \(i\) at layer \(\ell\), choose a single target:
\[
p_i^{(\ell)} = \arg\max_{j} \; s_{ij}^{(\ell)}
\]
where \(s_{ij}^{(\ell)}\) are learned pointer logits.
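To make the selection rule concrete, here is a minimal PyTorch-style sketch (the module and parameter names, such as `PointerSelection` and `d_model`, are illustrative, not from the paper). Materializing the full logit matrix, done here for clarity, is quadratic; the paper's linear-cost selection procedure is not reproduced.

```python
import torch
import torch.nn as nn

class PointerSelection(nn.Module):
    """Pick one target index per token via argmax over learned logits."""

    def __init__(self, d_model: int):
        super().__init__()
        self.query = nn.Linear(d_model, d_model)
        self.key = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, d_model)
        q, k = self.query(x), self.key(x)
        logits = q @ k.transpose(-2, -1)   # s_ij, shape (batch, n, n)
        return logits.argmax(dim=-1)       # p_i, shape (batch, n): one target per token
```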
Pointer Chaining
Re-embed the previous layer's pointer index into each token's representation. Deeper layers then form multi-hop chains that span longer ranges.
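A sketch of one chaining step, under the assumption that the pointed-to representation is concatenated with the token's own and re-projected; the paper's exact re-embedding rule may differ.

```python
import torch
import torch.nn as nn

def chain_step(x: torch.Tensor, ptr: torch.Tensor, proj: nn.Linear) -> torch.Tensor:
    """Fold each token's pointer target back into its representation.

    x:    (batch, n, d)   token representations
    ptr:  (batch, n)      pointer indices from the previous layer
    proj: nn.Linear(2 * d, d)
    """
    idx = ptr.unsqueeze(-1).expand(-1, -1, x.size(-1))  # (batch, n, d)
    pointed = torch.gather(x, dim=1, index=idx)         # representation each token points to
    return proj(torch.cat([x, pointed], dim=-1))        # re-embed the pointer target
```

Stacking such layers composes pointers: a token at layer \(\ell\) can reach the target of its target, so depth translates into multi-hop range.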
Complexity
- Compute cost: \(O(n)\) per layer
- Memory: \(O(n)\), vs. \(O(n^2)\) for attention (see the back-of-envelope comparison below)
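A back-of-envelope comparison at 2,048 tokens (assuming one dense fp32 score matrix per attention layer; the counts are illustrative):

```python
n = 2048
attn_entries = n * n   # dense attention scores: O(n^2)
ptr_entries = n        # one pointer index per token: O(n)
print(f"{attn_entries:,} vs {ptr_entries:,} -> {attn_entries // ptr_entries}x fewer entries")
# 4,194,304 vs 2,048 -> 2048x fewer entries
```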
Experiments
Throughput (tokens/sec):
| Sequence length | 256 | 512 | 1,024 | 2,048 |
|---|---|---|---|---|
| Pointer | 14,446 | 34,914 | 37,189 | 28,268 |
| Transformer | 30,320 | 29,427 | 19,703 | 11,549 |
→ Speed-up grows from 0.48× (256 tokens) to 2.45× (2,048 tokens)
Copy-Task Accuracy:
| Distance | 512 | 1,024 | 1,536 | 2,048 |
|---|---|---|---|---|
| Pointer | 98.38% | 97.50% | 96.38% | 95.25% |
| Transformer | 99.38% | 94.25% | 94.88% | 94.75% |
→ Pointer maintains >95% accuracy at a distance of 2,048 tokens, where the Transformer falls to 94.75%
Interpretability
Heatmaps reveal layer specialization:
- Early layers: local pointers (\(\sim\)47–58 tokens)
- Deep layers: long-range jumps (up to 483 tokens)
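Because each token points to exactly one other token, a dependency path can be read off directly. A hypothetical helper for tracing one chain through the layers:

```python
def trace_chain(ptrs_per_layer, start):
    """Follow pointers layer by layer from position `start`.

    ptrs_per_layer: list of (n,) index tensors, one per layer.
    Returns the list of visited positions.
    """
    path = [start]
    for ptr in ptrs_per_layer:
        path.append(int(ptr[path[-1]]))
    return path
```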
Limitations & Future Work
- Hardware constraints limited comparisons against Longformer
- Currently language-only; cross-domain tests planned
- Future: Multi-head pointers, hierarchical chains
TL;DR
Pointer replaces quadratic attention with layer-wise pointer chains, achieving:
- Linear complexity
- 2–10× speed-up on long sequences
- Interpretable dependency paths
- No pre-training required
Try it for long-context tasks on modest hardware!