Pointer: Linear-Complexity Long-Range Modeling without Pre-training

Community Article · Published August 7, 2025

Authors: Zixi Li, Noesis Lab · Sun Yat-sen University
Date: 5 August 2025
Paper: arXiv:2508.02631v1


Abstract

We introduce Pointer, a novel sequence model that attains linear O(NK) complexity by replacing quadratic self-attention with layer-wise pointer chains. Pointer reaches 2–10× speed-ups on long sequences, keeps >95% accuracy on copy tasks up to 2,048 tokens, and learns interpretable long-range patterns, all without any large-scale pre-training.


Key Contributions

  • Linear complexity: O(NK) runtime yields 2–10× faster training/inference
  • No pre-training required: Effective from scratch on modest hardware
  • Explicit long-range modeling: Pointer chains create 2^K+ token dependency paths
  • Interpretability: Each token points to exactly one other token

Method in a Nutshell

Pointer Selection

For position $i$ at layer $\ell$, choose a single target:
$$p_i^{(\ell)} = \arg\max_j s_{ij}^{(\ell)}$$
where $s_{ij}^{(\ell)}$ are the learned pointer logits.
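
Below is a minimal PyTorch sketch of this selection rule. All module and parameter names are assumptions, not the authors' code; for clarity it scores every candidate position j, which is quadratic, whereas the paper's linear-cost formulation avoids materializing the full score matrix. Training through the argmax would also need a relaxation (e.g. straight-through), which is omitted here.

```python
# Illustrative sketch of per-token pointer selection (assumed names, not the authors' code).
import torch
import torch.nn as nn

class PointerSelection(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.query = nn.Linear(d_model, d_model)
        self.key = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, d_model)
        q, k = self.query(x), self.key(x)
        # Pointer logits s_ij, shown densely for clarity (the dense matrix is O(N^2)).
        logits = (q @ k.transpose(-1, -2)) / (x.shape[-1] ** 0.5)  # (batch, N, N)
        # Each position i points to exactly one target j = argmax_j s_ij.
        return logits.argmax(dim=-1)  # (batch, N) integer pointers p_i
```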

Pointer Chaining

Re-embed the previous layer's pointer index into the token representations, so that deeper layers form multi-hop chains spanning progressively longer ranges.
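
A sketch of one possible chaining layer follows. The structure is an assumption: the chosen index is embedded, the pointed-to representation is gathered back in, and the two are fused; `PointerChainLayer`, `index_embed`, and the fusion step are illustrative choices, not details from the paper.

```python
# Illustrative chaining layer (assumed structure): re-embed the chosen pointer index
# and pull in the pointed-to token so the next layer can hop one step further.
import torch
import torch.nn as nn

class PointerChainLayer(nn.Module):
    def __init__(self, d_model: int, max_len: int):
        super().__init__()
        self.select = PointerSelection(d_model)          # from the sketch above
        self.index_embed = nn.Embedding(max_len, d_model)
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, x: torch.Tensor):
        p = self.select(x)                               # (batch, N) chosen targets
        # Gather the representation each token points to.
        target = torch.gather(x, 1, p.unsqueeze(-1).expand(-1, -1, x.size(-1)))
        # Fuse the token, its pointer-index embedding, and the pointed-to token.
        x = self.fuse(torch.cat([x + self.index_embed(p), target], dim=-1))
        return x, p                                      # stacking layers chains the hops
```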

Complexity

  • Compute cost: O(Nd) per layer
  • Memory: O(N) (vs. O(N^2) for attention; see the back-of-envelope comparison below)
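
A back-of-envelope illustration of the memory claim (rough arithmetic only, ignoring constants, heads, and activations):

```python
# At N = 2,048: attention stores an N×N score matrix, a pointer layer stores one index per token.
N = 2048
attention_scores = N * N       # O(N^2) entries
pointer_indices = N            # O(N) entries
print(attention_scores // pointer_indices)  # 2048× fewer entries at this length
```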

Experiments

Throughput (tokens/sec):

Sequence Length   256      512      1,024    2,048
Pointer           14,446   34,914   37,189   28,268
Transformer       30,320   29,427   19,703   11,549

→ Speed-up grows from 0.48× (256 tokens) to 2.45× (2,048 tokens)

Copy-Task Accuracy:

Distance      512      1,024    1,536    2,048
Pointer       98.38%   97.50%   96.38%   95.25%
Transformer   99.38%   94.25%   94.88%   94.75%

→ Maintains >95% accuracy at 2,048 tokens


Interpretability

Heatmaps reveal layer specialization:

  • Early layers: Local pointers (~47–58 tokens)
  • Deep layers: Long-range jumps (up to 483 tokens)

Limitations & Future Work

  • Hardware constraints limited comparisons with Longformer
  • Currently language-only; cross-domain tests planned
  • Future: Multi-head pointers, hierarchical chains

TL;DR

Pointer replaces quadratic attention with layer-wise pointer chains, achieving:

  • O(NK)O(NK) linear complexity
  • 2–10× speed-up on long sequences
  • Interpretable dependency paths
  • No pre-training required

Try it for long-context tasks on modest hardware!
