Introduction

We introduce haipai, a very small model trained on the CNN/DailyMail dataset. Both the training code and the model checkpoints are open-sourced.

Small-footprint seq2seq Transformer that completes sentences.

Highlights

  • ~7.3M trainable parameters (4 encoder + 4 decoder layers, 288 hidden size, 6 heads).
  • Factorized embeddings with linear projections back to the model dimension.
  • Trained on 80k oracle/reference pairs with heavy denoising corruption (span masking, drops, sentence shuffles).
  • All trained on news articles.
  • 32k shared subword vocabulary (trained with the Hugging Face `tokenizers` library).
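The corruption pipeline itself is not included in this card; as a rough illustration, the three operations named above could look like the following sketch (function names, rates, and span lengths are hypothetical, not the actual training configuration):

```python
import random

MASK = "<mask>"

def span_mask(tokens, rate=0.15, max_span=3):
    """Replace random contiguous spans with a single mask token."""
    out, i = [], 0
    while i < len(tokens):
        if random.random() < rate:
            out.append(MASK)
            i += random.randint(1, max_span)  # skip the rest of the masked span
        else:
            out.append(tokens[i])
            i += 1
    return out

def token_drop(tokens, rate=0.1):
    """Delete individual tokens at random (keep at least one)."""
    kept = [t for t in tokens if random.random() >= rate]
    return kept or tokens[:1]

def sentence_shuffle(sentences):
    """Permute sentence order within a document."""
    shuffled = sentences[:]
    random.shuffle(shuffled)
    return shuffled
```

The model is then trained to reconstruct the clean reference from the corrupted input, which is what makes the stage-1 objective a denoising autoencoder.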

Files

  • stage1_final.pt: final-epoch checkpoint containing model weights and config.
  • stage1_best.pt: checkpoint of the best-performing model seen during training.
  • stage1_tokenizer.json: 32k BPE tokenizer shared across stages.
  • stage1_infer.py: CLI for greedy reconstructions.

Architecture

Shared Subword Embedding (32k) →
  Encoder (4 layers) →
  Decoder (4 layers) →
  Factorized output projection → vocab logits

  • Encoder/decoder blocks are standard Transformer layers (multi-head attention + FFN + dropout).
  • Embeddings start in a smaller space (embed_dim) and are linearly projected to the model dimension.
  • Output logits reuse the embedding matrix (weight tying).
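A minimal PyTorch sketch of the factorized embedding and tied output head described above (the 288 model dimension and 32k vocabulary follow the Highlights; the 128 embedding dimension and all class/argument names are illustrative assumptions):

```python
import torch
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """Embed into a small space, project up to the model dimension,
    and reuse the same matrices for the output logits (weight tying)."""

    def __init__(self, vocab_size=32000, embed_dim=128, model_dim=288):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)      # 32000 x 128
        self.up = nn.Linear(embed_dim, model_dim, bias=False)  # 128 -> 288

    def forward(self, token_ids):
        # (batch, seq) -> (batch, seq, model_dim)
        return self.up(self.embed(token_ids))

    def logits(self, hidden):
        # Tied head: project back down, then multiply by the embedding matrix.
        small = hidden @ self.up.weight          # (.., 288) @ (288, 128) -> (.., 128)
        return small @ self.embed.weight.t()     # (.., 128) @ (128, 32000) -> vocab logits
```

Factorizing this way replaces one 32000×288 matrix with 32000×128 + 128×288 parameters, which is where much of the footprint saving comes from.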

Quick Inference

Stage 1 reconstructs text; feed it any summary to see what the autoencoder learned:

python -m src.stage1_infer \
  --run-dir models \
  --tokenizer-path models/stage1_tokenizer.json \
  --config-checkpoint models/stage1_final.pt \
  --input-text "US officials pledged immediate aid after the storm devastated the coastline."

You can also batch inputs from a file (--input-file path/to/texts.txt) or point to specific checkpoints with --checkpoints.

Expected Output

  • Greedy decoding tends to copy the input with light denoising, which is exactly what the reconstruction objective trains for.
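Greedy decoding itself is just repeated argmax over the decoder's next-token distribution. A generic sketch, using a placeholder `step_fn` rather than the repo's actual API:

```python
import torch

def greedy_decode(step_fn, bos_id, eos_id, max_len=64):
    """step_fn(ids) -> logits over the vocabulary for the next token."""
    ids = [bos_id]
    for _ in range(max_len):
        logits = step_fn(torch.tensor(ids))
        next_id = int(logits.argmax())   # greedy: always take the top token
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids
```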

Requirements

pip install torch tokenizers

The repo already contains the training/inference scripts; no extra setup is needed beyond installing dependencies.

Citation

If you use the model, please cite it as "Haipai-7M".

