arxiv:2510.14972

TokDrift: When LLM Speaks in Subwords but Code Speaks in Grammar

Published on Oct 16 · Submitted by Pengyu Nie on Oct 17

Abstract

Misaligned tokenization in large language models for code leads to inconsistent model behavior, necessitating grammar-aware tokenization.

AI-generated summary

Large language models (LLMs) for code rely on subword tokenizers, such as byte-pair encoding (BPE), learned from mixed natural language text and programming language code but driven by statistics rather than grammar. As a result, semantically identical code snippets can be tokenized differently depending on superficial factors such as whitespace or identifier naming. To measure the impact of this misalignment, we introduce TokDrift, a framework that applies semantic-preserving rewrite rules to create code variants differing only in tokenization. Across nine code LLMs, including large ones with over 30B parameters, even minor formatting changes can cause substantial shifts in model behavior. Layer-wise analysis shows that the issue originates in early embeddings, where subword segmentation fails to capture grammar token boundaries. Our findings identify misaligned tokenization as a hidden obstacle to reliable code understanding and generation, highlighting the need for grammar-aware tokenization for future code LLMs.
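To make the misalignment concrete, here is a minimal sketch (not part of TokDrift; the GPT-2 BPE tokenizer and the toy code variants are assumptions for illustration). It tokenizes two semantically identical lines that differ only in whitespace and prints the resulting subword sequences, which in general do not coincide:

```python
# Minimal illustration (not the TokDrift framework): two semantically
# identical code fragments, differing only in whitespace, are segmented
# into different subword sequences by a BPE tokenizer.
from transformers import AutoTokenizer

# Assumption: GPT-2's BPE tokenizer; any BPE-based code-LLM tokenizer behaves similarly.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

variant_a = "result=alpha+beta"
variant_b = "result = alpha + beta"  # same program, extra spaces

print(tokenizer.tokenize(variant_a))
print(tokenizer.tokenize(variant_b))
# The grammar tokens (identifier, '=', identifier, '+', identifier) are the
# same in both variants, but the subword sequences differ, so the model
# receives two different inputs for the same program.
```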

Community

Paper author · Paper submitter

1️⃣ LLMs' subword tokenizers don't align well with programming-language grammar: tiny whitespace or identifier-renaming tweaks -> different tokenization -> flipped outputs.

2️⃣ Our framework TokDrift systematically tests 9 code LLMs on 3 tasks, showing their sensitivity to tokenization changes: up to 60% of outputs change under a single semantic-preserving rewrite (see the sketch after this list).

3️⃣ If your win margin is ~1 percentage point, beware: spacing & naming choices can swing results.
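A rough sketch of the kind of measurement behind 2️⃣ (assumed model name, helper names, prompts, and rewrite rule; not the TokDrift code): generate greedily from a prompt and from a whitespace-rewritten copy, and count how often the two completions differ.

```python
# Sketch only: estimate how often a model's greedy output changes under a
# semantic-preserving rewrite. Model name, prompts, and the rewrite rule
# below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "bigcode/starcoder2-3b"  # assumption: any causal code LLM works here
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def greedy_completion(prompt: str, max_new_tokens: int = 64) -> str:
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

def rewrite_spacing(code: str) -> str:
    # Toy rewrite rule: add spaces around '+'. A real semantic-preserving
    # rewrite would operate on the parse tree to avoid touching string literals.
    return code.replace("+", " + ")

prompts = ["def add(a,b):\n    return a+b\n\n# call add on 1 and 2\nresult ="]
changed = sum(
    greedy_completion(p) != greedy_completion(rewrite_spacing(p)) for p in prompts
)
print(f"output changed on {changed}/{len(prompts)} prompts")
```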

