Introduction
We introduce haipai, a very small sequence-to-sequence model trained on the CNN/DailyMail (cnn/dm) dataset. We are open-sourcing both the training code and the model checkpoints.
Small-footprint seq2seq Transformer that completes sentences.
Highlights
- ~7.3M trainable parameters (4 encoder + 4 decoder layers, 288 hidden size, 6 heads).
- Factorised embeddings with linear projections back to the model dimension.
- Trained on 80k oracle/reference pairs drawn from news articles, with heavy denoising corruption (span masking, token drops, sentence shuffles).
- 32k shared subword vocabulary (trained with the tokenizers library).
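The actual corruption pipeline lives in the training code; as a rough illustration of the three corruptions named above, here is a minimal sketch (the function name, probabilities, and span length are assumptions, not the repo's real values):

```python
import random

MASK = "<mask>"

def corrupt(sentences, mask_prob=0.15, drop_prob=0.1, shuffle=True, rng=None):
    """Apply denoising corruptions to a list of tokenized sentences."""
    rng = rng or random.Random(0)  # deterministic default for the sketch
    out = []
    for sent in sentences:
        tokens, i = [], 0
        while i < len(sent):
            r = rng.random()
            if r < mask_prob:
                # span masking: replace a short span with a single mask token
                tokens.append(MASK)
                i += rng.randint(1, 3)
            elif r < mask_prob + drop_prob:
                # token drop: skip the token entirely
                i += 1
            else:
                tokens.append(sent[i])
                i += 1
        out.append(tokens)
    if shuffle:
        rng.shuffle(out)  # sentence shuffle
    return out
```

Setting both probabilities to zero and disabling shuffling returns the input unchanged, which makes the corruption strength easy to sweep.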
Files
- stage1_final.pt: checkpoint with model weights and config.
- stage1_best.pt: checkpoint of the best model.
- stage1_tokenizer.json: 32k BPE tokenizer shared across stages.
- stage1_infer.py: CLI for greedy reconstructions.
Architecture
Shared Subword Embedding (32k) →
Encoder (4 layers) →
Decoder (4 layers) →
Factorized output projection → vocab logits
- Encoder/decoder blocks are standard Transformer layers (multi-head attention + FFN + dropout).
- Embeddings start in a smaller space (embed_dim) and are linearly projected to the model dimension.
- Output logits reuse the embedding matrix (weight tying).
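The factorization and weight tying described above can be sketched in PyTorch as follows (the class name, the value of embed_dim, and the down-projection before the tied logits are assumptions for illustration, not the repo's actual module):

```python
import torch
import torch.nn as nn

class FactorizedTiedEmbedding(nn.Module):
    """Factorized input embedding with a tied output projection.

    Tokens embed into a small embed_dim; a linear layer projects up to
    model_dim for the Transformer, and output logits are scored against
    the same embedding matrix (weight tying).
    """
    def __init__(self, vocab_size=32000, embed_dim=128, model_dim=288):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.up = nn.Linear(embed_dim, model_dim, bias=False)    # embed_dim -> model_dim
        self.down = nn.Linear(model_dim, embed_dim, bias=False)  # model_dim -> embed_dim

    def forward(self, token_ids):
        # (batch, seq) -> (batch, seq, model_dim)
        return self.up(self.embed(token_ids))

    def logits(self, hidden):
        # project back down, then score against the tied embedding matrix
        return self.down(hidden) @ self.embed.weight.t()
```

The factorization replaces one vocab_size x model_dim matrix with a vocab_size x embed_dim table plus small projections, which is where most of the parameter savings in a 32k-vocabulary model comes from.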
Quick Inference
Stage 1 reconstructs text; feed it any summary to see what the autoencoder learned:
python -m src.stage1_infer \
--run-dir models \
--tokenizer-path models/stage1_tokenizer.json \
--config-checkpoint models/stage1_final.pt \
--input-text "US officials pledged immediate aid after the storm devastated the coastline."
You can also batch inputs from a file (--input-file path/to/texts.txt) or point to specific checkpoints with --checkpoints.
Expected Output
- Greedy decode tends to copy the input with light denoising (that’s the objective).
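Greedy decoding simply takes the highest-scoring token at every step. A self-contained sketch, with step_fn standing in for the decoder's forward pass (a hypothetical callable, not the repo's API):

```python
def greedy_decode(step_fn, bos_id, eos_id, max_len=64):
    """Greedy decoding: at each step append the argmax next token.

    step_fn(prefix) -> list of scores over the vocabulary; here it is a
    stand-in for running the decoder on the prefix so far.
    """
    prefix = [bos_id]
    for _ in range(max_len):
        scores = step_fn(prefix)
        next_id = max(range(len(scores)), key=scores.__getitem__)
        prefix.append(next_id)
        if next_id == eos_id:
            break
    return prefix
```

Because there is no sampling, the reconstruction is deterministic for a given input, which is why outputs track the input so closely under the denoising objective.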
Requirements
pip install torch tokenizers
The repo already contains the training/inference scripts; no extra setup is needed beyond installing dependencies.
Citation
If you use the model, please reference it as "Haipai-7M".