arxiv:2602.19020

Learning to Detect Language Model Training Data via Active Reconstruction

Published on Feb 22 · Submitted by Junjie Oscar Yin on Feb 25
Abstract

Active Data Reconstruction Attack uses reinforcement learning to identify training data by measuring the reconstructibility of text from model behavior, outperforming existing membership inference attacks.

AI-generated summary

Detecting LLM training data is generally framed as a membership inference attack (MIA) problem. However, conventional MIAs operate passively on fixed model weights, using log-likelihoods or text generations. In this work, we introduce the Active Data Reconstruction Attack (ADRA), a family of MIAs that actively induces a model to reconstruct a given text through training. We hypothesize that training data are more reconstructible than non-members, and that this difference in reconstructibility can be exploited for membership inference. Motivated by findings that reinforcement learning (RL) sharpens behaviors already encoded in the weights, we leverage on-policy RL to actively elicit data reconstruction by finetuning a policy initialized from the target model. To use RL effectively for MIA, we design reconstruction metrics and contrastive rewards. The resulting algorithms, ADRA and its adaptive variant ADRA+, improve both reconstruction and detection given a pool of candidate data. Experiments show that our methods consistently outperform existing MIAs in detecting pre-training, post-training, and distillation data, with an average improvement of 10.7% over the previous runner-up. In particular, ADRA+ improves over Min-K%++ by 18.8% on BookMIA for pre-training detection and by 7.6% on AIME for post-training detection.
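The core detection loop described above — elicit a reconstruction of each candidate text from the (RL-finetuned) model, measure reconstructibility, and rank candidates so that likely members score highest — can be illustrated with a toy sketch. Everything here is a hypothetical stand-in, not the paper's implementation: `reconstruction_score` uses a generic string-similarity ratio rather than ADRA's actual reconstruction metrics, and `toy_reconstruct` fakes a model that reproduces memorized text verbatim.

```python
import difflib

def reconstruction_score(candidate: str, reconstruction: str) -> float:
    """Similarity between a candidate text and the model's reconstruction of it.
    A generic stand-in for the paper's reconstruction metrics."""
    return difflib.SequenceMatcher(None, candidate, reconstruction).ratio()

def rank_candidates(candidates, reconstruct):
    """Score each candidate by how well the model reconstructs it, then rank.
    Under the paper's hypothesis, training data (members) are more
    reconstructible, so higher scores suggest membership."""
    scored = [(c, reconstruction_score(c, reconstruct(c))) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy "model": reconstructs a memorized (member) text verbatim and
# garbles everything else. A real attack would query the RL-finetuned policy.
MEMORIZED = {"the quick brown fox jumps over the lazy dog"}

def toy_reconstruct(text: str) -> str:
    return text if text in MEMORIZED else text[::-1]

pool = [
    "an entirely unseen sentence about membership inference",
    "the quick brown fox jumps over the lazy dog",
]
ranking = rank_candidates(pool, toy_reconstruct)
print(ranking[0][0])  # the memorized text ranks first
```

In practice a threshold on the score (or a contrast against a reference reconstruction, as in the paper's contrastive rewards) would turn this ranking into a member/non-member decision.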

Community

Paper submitter

Many people are using RL to make models smarter.
We used RL to pull training data out of the models themselves.

Our results show that models know a lot more about their training data than most people think.

We develop Active Data Reconstruction Attack (ADRA) — a data detection method that uses RL to induce models to reconstruct data seen during training.
ADRA beats existing methods by an average of >10% across pre-training, post-training, and distillation data detection.

Our work, a collaboration with @uwnlp, @Cornell, and @Berkeley_ai, is now available.

Arxiv: https://arxiv.org/pdf/2602.19020

Joint work with @jxmnop @shmatikov @sewon__min @HannaHajishirzi


