arxiv:2510.14972

TokDrift: When LLM Speaks in Subwords but Code Speaks in Grammar

Published on Oct 16 · Submitted by Pengyu Nie on Oct 17

Abstract

Misaligned tokenization in large language models for code leads to inconsistent model behavior, necessitating grammar-aware tokenization.

AI-generated summary

Large language models (LLMs) for code rely on subword tokenizers, such as byte-pair encoding (BPE), learned from mixed natural language text and programming language code but driven by statistics rather than grammar. As a result, semantically identical code snippets can be tokenized differently depending on superficial factors such as whitespace or identifier naming. To measure the impact of this misalignment, we introduce TokDrift, a framework that applies semantic-preserving rewrite rules to create code variants differing only in tokenization. Across nine code LLMs, including large ones with over 30B parameters, even minor formatting changes can cause substantial shifts in model behavior. Layer-wise analysis shows that the issue originates in early embeddings, where subword segmentation fails to capture grammar token boundaries. Our findings identify misaligned tokenization as a hidden obstacle to reliable code understanding and generation, highlighting the need for grammar-aware tokenization for future code LLMs.
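To make the misalignment concrete, here is a minimal sketch (not part of TokDrift; the GPT-2 BPE tokenizer and the toy code variants are assumptions for illustration). It tokenizes two semantically identical lines that differ only in whitespace and prints the resulting subword sequences, which in general do not coincide:

```python
# Minimal illustration (not the TokDrift framework): two semantically
# identical code fragments, differing only in whitespace, are segmented
# into different subword sequences by a BPE tokenizer.
from transformers import AutoTokenizer

# Assumption: GPT-2's BPE tokenizer; any BPE-based code-LLM tokenizer behaves similarly.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

variant_a = "result=alpha+beta"
variant_b = "result = alpha + beta"  # same program, extra spaces

print(tokenizer.tokenize(variant_a))
print(tokenizer.tokenize(variant_b))
# The grammar tokens (identifier, '=', identifier, '+', identifier) are the
# same in both variants, but the subword sequences differ, so the model
# receives two different inputs for the same program.
```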

Community

Paper author · Paper submitter

1️⃣ LLMs' subword tokenizers don't align well with programming-language grammar: tiny whitespace or identifier-renaming tweaks -> different tokenization -> flipped outputs.

2️⃣ Our framework TokDrift systematically tests 9 code LLMs on 3 tasks, showing their sensitivity to tokenization changes: up to 60% of outputs change under a single semantic-preserving rewrite (see the sketch after this list).

3️⃣ If your win margin is ~1 percentage point, beware: spacing & naming choices can swing results.
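A rough sketch of the kind of measurement behind 2️⃣ (assumed model name, helper names, prompts, and rewrite rule; not the TokDrift code): generate greedily from a prompt and from a whitespace-rewritten copy, and count how often the two completions differ.

```python
# Sketch only: estimate how often a model's greedy output changes under a
# semantic-preserving rewrite. Model name, prompts, and the rewrite rule
# below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "bigcode/starcoder2-3b"  # assumption: any causal code LLM works here
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def greedy_completion(prompt: str, max_new_tokens: int = 64) -> str:
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

def rewrite_spacing(code: str) -> str:
    # Toy rewrite rule: add spaces around '+'. A real semantic-preserving
    # rewrite would operate on the parse tree to avoid touching string literals.
    return code.replace("+", " + ")

prompts = ["def add(a,b):\n    return a+b\n\n# call add on 1 and 2\nresult ="]
changed = sum(
    greedy_completion(p) != greedy_completion(rewrite_spacing(p)) for p in prompts
)
print(f"output changed on {changed}/{len(prompts)} prompts")
```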

