Pointer: Linear-Complexity Long-Range Modeling without Pre-training

Community Article · Published August 7, 2025

Authors: Zixi Li, Noesis Lab · Sun Yat-sen University
Date: 5 August 2025
Paper: arXiv:2508.02631v1


Abstract

We introduce Pointer, a novel sequence model that attains linear O(NK) complexity by replacing quadratic self-attention with layer-wise pointer chains. Pointer reaches 2–10× speed-ups on long sequences, keeps >95% accuracy on copy tasks up to 2,048 tokens, and learns interpretable long-range patterns, all without any large-scale pre-training.


Key Contributions

  • Linear complexity: O(NK) runtime yields 2–10× faster training/inference
  • No pre-training required: Effective from scratch on modest hardware
  • Explicit long-range modeling: Pointer chains create 2^K+ token dependency paths
  • Interpretability: Each token points to exactly one other token

Method in a Nutshell

Pointer Selection

For position $i$ at layer $\ell$, choose a single target:
$$p_i^{(\ell)} = \arg\max_j s_{ij}^{(\ell)}$$
where $s_{ij}^{(\ell)}$ are the learned pointer logits.
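
Below is a minimal PyTorch sketch of this selection rule. All module and parameter names are assumptions, not the authors' code; for clarity it scores every candidate position j, which is quadratic, whereas the paper's linear-cost formulation avoids materializing the full score matrix. Training through the argmax would also need a relaxation (e.g. straight-through), which is omitted here.

```python
# Illustrative sketch of per-token pointer selection (assumed names, not the authors' code).
import torch
import torch.nn as nn

class PointerSelection(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.query = nn.Linear(d_model, d_model)
        self.key = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, d_model)
        q, k = self.query(x), self.key(x)
        # Pointer logits s_ij, shown densely for clarity (the dense matrix is O(N^2)).
        logits = (q @ k.transpose(-1, -2)) / (x.shape[-1] ** 0.5)  # (batch, N, N)
        # Each position i points to exactly one target j = argmax_j s_ij.
        return logits.argmax(dim=-1)  # (batch, N) integer pointers p_i
```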

Pointer Chaining

Re-embed the previous layer's pointer index into the token representations, so that deeper layers form multi-hop chains spanning progressively longer ranges.
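
A sketch of one possible chaining layer follows. The structure is an assumption: the chosen index is embedded, the pointed-to representation is gathered back in, and the two are fused; `PointerChainLayer`, `index_embed`, and the fusion step are illustrative choices, not details from the paper.

```python
# Illustrative chaining layer (assumed structure): re-embed the chosen pointer index
# and pull in the pointed-to token so the next layer can hop one step further.
import torch
import torch.nn as nn

class PointerChainLayer(nn.Module):
    def __init__(self, d_model: int, max_len: int):
        super().__init__()
        self.select = PointerSelection(d_model)          # from the sketch above
        self.index_embed = nn.Embedding(max_len, d_model)
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, x: torch.Tensor):
        p = self.select(x)                               # (batch, N) chosen targets
        # Gather the representation each token points to.
        target = torch.gather(x, 1, p.unsqueeze(-1).expand(-1, -1, x.size(-1)))
        # Fuse the token, its pointer-index embedding, and the pointed-to token.
        x = self.fuse(torch.cat([x + self.index_embed(p), target], dim=-1))
        return x, p                                      # stacking layers chains the hops
```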

Complexity

  • Compute cost: O(Nd) per layer
  • Memory: O(N) (vs. O(N^2) for attention; see the back-of-envelope comparison below)
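
A back-of-envelope illustration of the memory claim (rough arithmetic only, ignoring constants, heads, and activations):

```python
# At N = 2,048: attention stores an N×N score matrix, a pointer layer stores one index per token.
N = 2048
attention_scores = N * N       # O(N^2) entries
pointer_indices = N            # O(N) entries
print(attention_scores // pointer_indices)  # 2048× fewer entries at this length
```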

Experiments

Throughput (tokens/sec):

Sequence Length   256      512      1,024    2,048
Pointer           14,446   34,914   37,189   28,268
Transformer       30,320   29,427   19,703   11,549

→ Speed-up grows from 0.48× (256 tokens) to 2.45× (2,048 tokens)

Copy-Task Accuracy:

Distance      512      1,024    1,536    2,048
Pointer       98.38%   97.50%   96.38%   95.25%
Transformer   99.38%   94.25%   94.88%   94.75%

→ Maintains >95% accuracy at 2,048 tokens


Interpretability

Heatmaps reveal layer specialization:

  • Early layers: Local pointers (~47–58 tokens)
  • Deep layers: Long-range jumps (up to 483 tokens)

Limitations & Future Work

  • Hardware constraints limited comparisons with Longformer
  • Currently language-only; cross-domain tests planned
  • Future: Multi-head pointers, hierarchical chains

TL;DR

Pointer replaces quadratic attention with layer-wise pointer chains, achieving:

  • O(NK)O(NK) linear complexity
  • 2–10× speed-up on long sequences
  • Interpretable dependency paths
  • No pre-training required

Try it for long-context tasks on modest hardware!
