Introduction
We introduce haipai, a very small sequence-to-sequence model trained on the CNN/DailyMail (cnn/dm) dataset. We are open-sourcing both the training code and the model checkpoints.
Small-footprint seq2seq Transformer that completes sentences.
Highlights
- ~7.3M trainable parameters (4 encoder + 4 decoder layers, 288 hidden size, 6 heads).
- Factorised embeddings with linear projections back to the model dimension.
- Trained on 80k oracle/reference pairs drawn from news articles, with heavy denoising corruption (span masking, token drops, sentence shuffles).
- 32k shared subword vocabulary (trained with the tokenizers library).
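The actual corruption pipeline lives in the training code; as a rough illustration of the three corruptions named above, here is a minimal sketch (the function name, probabilities, and span length are assumptions, not the repo's real values):

```python
import random

MASK = "<mask>"

def corrupt(sentences, mask_prob=0.15, drop_prob=0.1, shuffle=True, rng=None):
    """Apply denoising corruptions to a list of tokenized sentences."""
    rng = rng or random.Random(0)  # deterministic default for the sketch
    out = []
    for sent in sentences:
        tokens, i = [], 0
        while i < len(sent):
            r = rng.random()
            if r < mask_prob:
                # span masking: replace a short span with a single mask token
                tokens.append(MASK)
                i += rng.randint(1, 3)
            elif r < mask_prob + drop_prob:
                # token drop: skip the token entirely
                i += 1
            else:
                tokens.append(sent[i])
                i += 1
        out.append(tokens)
    if shuffle:
        rng.shuffle(out)  # sentence shuffle
    return out
```

Setting both probabilities to zero and disabling shuffling returns the input unchanged, which makes the corruption strength easy to sweep.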
Files
- stage1_final.pt: checkpoint with model weights and config.
- stage1_best.pt: checkpoint of the best model.
- stage1_tokenizer.json: 32k BPE tokenizer shared across stages.
- stage1_infer.py: CLI for greedy reconstructions.
Architecture
Shared Subword Embedding (32k) →
Encoder (4 layers) →
Decoder (4 layers) →
Factorized output projection → vocab logits
- Encoder/decoder blocks are standard Transformer layers (multi-head attention + FFN + dropout).
- Embeddings start in a smaller space (embed_dim) and are linearly projected to the model dimension.
- Output logits reuse the embedding matrix (weight tying).
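The factorization and weight tying described above can be sketched in PyTorch as follows (the class name, the value of embed_dim, and the down-projection before the tied logits are assumptions for illustration, not the repo's actual module):

```python
import torch
import torch.nn as nn

class FactorizedTiedEmbedding(nn.Module):
    """Factorized input embedding with a tied output projection.

    Tokens embed into a small embed_dim; a linear layer projects up to
    model_dim for the Transformer, and output logits are scored against
    the same embedding matrix (weight tying).
    """
    def __init__(self, vocab_size=32000, embed_dim=128, model_dim=288):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.up = nn.Linear(embed_dim, model_dim, bias=False)    # embed_dim -> model_dim
        self.down = nn.Linear(model_dim, embed_dim, bias=False)  # model_dim -> embed_dim

    def forward(self, token_ids):
        # (batch, seq) -> (batch, seq, model_dim)
        return self.up(self.embed(token_ids))

    def logits(self, hidden):
        # project back down, then score against the tied embedding matrix
        return self.down(hidden) @ self.embed.weight.t()
```

The factorization replaces one vocab_size x model_dim matrix with a vocab_size x embed_dim table plus small projections, which is where most of the parameter savings in a 32k-vocabulary model comes from.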
Quick Inference
Stage 1 reconstructs text; feed it any summary to see what the autoencoder learned:
python -m src.stage1_infer \
--run-dir models \
--tokenizer-path models/stage1_tokenizer.json \
--config-checkpoint models/stage1_final.pt \
--input-text "US officials pledged immediate aid after the storm devastated the coastline."
You can also batch inputs from a file (--input-file path/to/texts.txt) or point to specific checkpoints with --checkpoints.
Expected Output
- Greedy decode tends to copy the input with light denoising (that’s the objective).
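Greedy decoding simply takes the highest-scoring token at every step. A self-contained sketch, with step_fn standing in for the decoder's forward pass (a hypothetical callable, not the repo's API):

```python
def greedy_decode(step_fn, bos_id, eos_id, max_len=64):
    """Greedy decoding: at each step append the argmax next token.

    step_fn(prefix) -> list of scores over the vocabulary; here it is a
    stand-in for running the decoder on the prefix so far.
    """
    prefix = [bos_id]
    for _ in range(max_len):
        scores = step_fn(prefix)
        next_id = max(range(len(scores)), key=scores.__getitem__)
        prefix.append(next_id)
        if next_id == eos_id:
            break
    return prefix
```

Because there is no sampling, the reconstruction is deterministic for a given input, which is why outputs track the input so closely under the denoising objective.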
Requirements
pip install torch tokenizers
The repo already contains the training/inference scripts; no extra setup is needed beyond installing dependencies.
Citation
If you use the model, please reference it as "Haipai-7M".