BERT-Bytecode

Masked language model trained on Python 3.12 bytecode sequences. This model is intended for program mutation in Genetic Improvement (GI): it predicts plausible replacements for masked bytecode spans, which can be used to propose mutations that preserve or improve program behavior.

Model Summary

  • Architecture: BERT (Masked Language Modeling head)
  • Domain: Python bytecode sequences (bytes 0–255 serialized as space-separated integers)
  • Primary use: Propose bytecode-level mutations for Genetic Improvement
  • Tokenizer: Custom WordPiece over space-separated numeric byte tokens plus special tokens [PAD], [UNK], [CLS], [SEP], [MASK]
  • Checkpoint: best performing checkpoint at training step N (replace with your step)

Intended Uses and Limitations

Intended:

  • Replace masked spans of bytecode with plausible alternatives.
  • Guide mutation operators in GI pipelines to maintain syntactic/semantic plausibility.

Out of scope:

  • Natural-language understanding or source-level code completion.
  • Security- or safety-critical transformations without human review.

Limitations:

  • Trained only on the CPython 3.12 bytecode distribution described below; it does not generalize to other Python versions or to exotic opcodes.
  • Does not guarantee semantic equivalence; mutated programs may still break or degrade performance.

How to Use

import torch
from transformers import BertForMaskedLM, BertTokenizerFast

m = BertForMaskedLM.from_pretrained("lucapernice/BERT-Bytecode")
t = BertTokenizerFast.from_pretrained("lucapernice/BERT-Bytecode")

# Bytecode must be serialized as a space-separated string of integers (0..255)
byte_list = [100, 0, 90, 1, 23, 0, 83, 0]  # example
s = " ".join(map(str, byte_list))

enc = t(s, return_tensors="pt")
ids = enc["input_ids"].clone()

# Mask a span (example: token positions 4..5; position 0 is [CLS], so these
# correspond to byte_list[3:5])
ids[0, 4:6] = t.mask_token_id

with torch.no_grad():
    logits = m(input_ids=ids, attention_mask=enc["attention_mask"]).logits

# Fill only the masked positions with the model's top predictions
pred = ids.clone()
masked = ids == t.mask_token_id
pred[masked] = logits.argmax(-1)[masked]

decoded = t.decode(pred[0], skip_special_tokens=True)  # space-separated ints
new_bytes = bytes(map(int, decoded.split()))
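Argmax decoding yields a single deterministic proposal. For GI, a pool of candidate mutations is often more useful; the sketch below (reusing ids, logits, masked, and t from the snippet above) takes the i-th most likely token at each masked position for i = 0..k-1. This is an illustrative decoding strategy, not one prescribed by the model.

k = 5  # number of alternative fillings to propose
topk_ids = logits.topk(k, dim=-1).indices      # shape: (1, seq_len, k)

candidates = []
for i in range(k):
    cand = ids.clone()
    cand[masked] = topk_ids[..., i][masked]    # i-th best token per masked slot
    text = t.decode(cand[0], skip_special_tokens=True)
    candidates.append(bytes(map(int, text.split())))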

Dataset

Data construction summary (a pipeline sketch follows the notes below):

  • Source dataset: bigcode/the-stack-dedup (subset: data/python), loaded in streaming mode.
  • For each record, read the Python source from the content field.
  • Compile the source with compile(source, '', 'exec') under CPython 3.12; skip samples that raise SyntaxError/ValueError.
  • Extract raw bytecode bytes from compiled_code.co_code and convert to a list of integers in [0, 255].
  • Save as JSON Lines: one sample per line, each line a JSON array of integers.
  • Cap: up to 100,000,000 samples in this release (10% split off for validation).

Notes: Bytecode format is Python-version dependent (these samples use CPython 3.12, 2-byte instructions). No extra normalization or dedup beyond the source dataset. Any truncation/padding or chunking is handled at training time.
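
A minimal sketch of this pipeline, assuming the datasets library for streaming access. The load_dataset arguments, output filename, and the small cap are illustrative, not the exact release configuration.

import json
import sys
from datasets import load_dataset

assert sys.version_info[:2] == (3, 12)  # co_code layout is version-specific

# Streaming access to the Python subset (loading arguments are illustrative)
ds = load_dataset("bigcode/the-stack-dedup", data_dir="data/python",
                  split="train", streaming=True)

written = 0
with open("bytecode_samples.jsonl", "w") as f:
    for record in ds:
        try:
            code_obj = compile(record["content"], "", "exec")
        except (SyntaxError, ValueError):
            continue  # skip sources that do not compile under CPython 3.12
        f.write(json.dumps(list(code_obj.co_code)) + "\n")  # one JSON array per line
        written += 1
        if written >= 1_000:  # illustrative cap; the release uses up to 100,000,000
            break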

About Source Dataset

@article{Kocetkov2022TheStack,
  title={The Stack: 3 TB of permissively licensed source code},
  author={Kocetkov, Denis and Li, Raymond and Ben Allal, Loubna and Li, Jia and Mou, Chenghao and Muñoz Ferrandis, Carlos and Jernite, Yacine and Mitchell, Margaret and Hughes, Sean and Wolf, Thomas and Bahdanau, Dzmitry and von Werra, Leandro and de Vries, Harm},
  journal={Preprint},
  year={2022}
}

Preprocessing

Vocabulary is closed over the 256 possible byte values plus five special tokens; a tokenizer round-trip check is sketched after the list below.

  • Special tokens: [PAD], [UNK], [CLS], [SEP], [MASK].
  • Byte tokens: strings "0" through "255" (see vocab.txt).
  • No subword segmentation: each byte in co_code (opcode or argument) maps 1:1 to a token ID.
  • Input format: space-separated string of integers in [0, 255]. Example:
    • Bytes: [100, 0, 90, 1, 23, 0, 83, 0]
    • Text: "100 0 90 1 23 0 83 0"
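
The example above can be checked end-to-end with the published tokenizer; the assertions below are a minimal sketch of the 1:1 mapping described in this section.

from transformers import BertTokenizerFast

t = BertTokenizerFast.from_pretrained("lucapernice/BERT-Bytecode")

byte_list = [100, 0, 90, 1, 23, 0, 83, 0]
text = " ".join(map(str, byte_list))          # "100 0 90 1 23 0 83 0"

# One token per byte, no subword splits
assert t.tokenize(text) == [str(b) for b in byte_list]

# Encoding adds [CLS]/[SEP]; decoding drops them and recovers the bytes
ids = t(text)["input_ids"]
assert len(ids) == len(byte_list) + 2
assert bytes(map(int, t.decode(ids, skip_special_tokens=True).split())) == bytes(byte_list)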

Intended Integration: Genetic Improvement

This model supports GI by proposing bytecode mutations (a minimal mutation-operator sketch follows the list below):

  • Mask a random contiguous span in the bytecode.
  • Use the model to predict replacements.
  • Validate mutated programs with your test suite and fitness function.
  • Iterate selection/crossover/mutation as in standard GI pipelines.
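
A minimal mutation-operator sketch along these lines, with the model and tokenizer loaded as in How to Use. The names propose_mutation and span_len, and the commented-out run_tests/fitness_function helpers, are hypothetical placeholders, not part of this repository.

import random
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

m = BertForMaskedLM.from_pretrained("lucapernice/BERT-Bytecode")
t = BertTokenizerFast.from_pretrained("lucapernice/BERT-Bytecode")

def propose_mutation(byte_list, span_len=4):
    """Mask a random contiguous span of the bytecode and let the model fill it."""
    enc = t(" ".join(map(str, byte_list)), return_tensors="pt")
    ids = enc["input_ids"].clone()
    start = random.randint(1, ids.shape[1] - 1 - span_len)  # keep [CLS]/[SEP] intact
    ids[0, start:start + span_len] = t.mask_token_id
    with torch.no_grad():
        logits = m(input_ids=ids, attention_mask=enc["attention_mask"]).logits
    masked = ids == t.mask_token_id
    ids[masked] = logits.argmax(-1)[masked]
    return bytes(map(int, t.decode(ids[0], skip_special_tokens=True).split()))

# Selection and validation stay on the GI side, e.g. (hypothetical helpers):
# mutant = propose_mutation(list(code_obj.co_code))
# if run_tests(mutant):
#     score = fitness_function(mutant)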

Citation

If you use this model:

@software{bert_bytecode_2025,
  title        = {BERT-Bytecode: Masked LM for Python Bytecode Mutation},
  author       = {Luca Pernice},
  year         = {2025},
  publisher    = {Hugging Face},
  url          = {https://huggingface.co/lucapernice/BERT-Bytecode}
}