---
language: en
license: mit
library_name: pytorch
tags:
- transformer
- adapters
- continual-learning
- dual-memory
- minimal
- educational
- nlp
- language-model
- online-learning
datasets:
- text8
- tinyshakespeare
model_name: "Microformer"
model_type: "stacked-adapter-transformer"
pipeline_tag: text-generation
widget:
- text: "Describe the internet"
- text: "Who is Buck?"
- text: "Call me Ishmael."
---

# Microformer

**Microformer** is a minimal, educational-scale transformer language model built from scratch in PyTorch.
Inspired by [nanoGPT](https://github.com/karpathy/nanoGPT) and OpenAI’s GPT-1, Microformer is designed for learning, experimentation, and prototyping on lightweight datasets like [text8](https://mattmahoney.net/dc/textdata.html) or Tiny Shakespeare.

---

## Features

- Decoder-only transformer (GPT-style) architecture
- **Stacked adapters per layer for dual-memory:**
  - **Long-term adapters** (for corpus/knowledge facts)
  - **Session adapters** (for rapid, online, user/session-specific learning)
- Choice of character-level **or** subword/BPE tokenization (configurable)
- Learnable positional encoding
- Multi-head self-attention
- Configurable depth, embedding size, sequence length, and attention heads
- Simple end-to-end pipeline: preprocessing, training, and text generation
- Modular, readable code ideal for educational use and tinkering
- Temperature and multinomial sampling in text generation
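
For orientation, the pieces above fit together roughly as in the sketch below: token embeddings plus a learnable positional embedding feed a stack of blocks, followed by a final LayerNorm and an LM head. This is an illustrative outline only, not the actual `models/model.py`; in particular, `block_factory` stands in for whatever per-layer block the model uses (the adapter-augmented block is sketched in the next section).

```python
# Illustrative top-level assembly (not the repo's actual model.py).
import torch
import torch.nn as nn

class Microformer(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_layers, max_seq_len, block_factory):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, embed_dim)
        # Learnable positional encoding: one trainable vector per position.
        self.pos_emb = nn.Parameter(torch.zeros(1, max_seq_len, embed_dim))
        self.blocks = nn.ModuleList([block_factory() for _ in range(num_layers)])
        self.ln_f = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, vocab_size, bias=False)

    def forward(self, idx):                                   # idx: (batch, seq) token IDs
        x = self.token_emb(idx) + self.pos_emb[:, : idx.size(1), :]
        for block in self.blocks:                             # per-layer blocks carry the adapters
            x = block(x)
        return self.head(self.ln_f(x))                        # logits over the vocabulary
```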

---

## What’s Unique: Stacked Adapters for Dual-Memory Learning

Microformer implements **two adapters in every transformer block**:

- **Long-term adapter:**
  Trained on your full corpus during batch/corpus training.
  Stores stable, general “knowledge” (e.g., literary style, factual info).

- **Session adapter:**
  Starts blank and is trained *on the fly* during chat or interactive teaching.
  Lets you rapidly “teach” new facts, styles, or user preferences without overwriting core knowledge.

At inference, the outputs of both adapters (plus the core transformer) are combined, giving the model both stable and flexible, session-specific memory, loosely analogous to a human brain’s “core memory” and “temporal lobe”. A minimal sketch of such a block is shown below.
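
The sketch assumes a standard pre-norm, GPT-style block; the class names (`Adapter`, `DualMemoryBlock`) and the exact wiring are illustrative and may differ from the real `models/model.py`.

```python
# Illustrative dual-memory block (not the repo's actual implementation).
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project."""
    def __init__(self, embed_dim: int, adapter_dim: int):
        super().__init__()
        self.down = nn.Linear(embed_dim, adapter_dim)
        self.up = nn.Linear(adapter_dim, embed_dim)

    def forward(self, x):
        return self.up(torch.relu(self.down(x)))   # residual connection added by the caller

class DualMemoryBlock(nn.Module):
    """Transformer block with stacked long-term and session adapters."""
    def __init__(self, embed_dim, num_heads, ff_dim, adapter_dim):
        super().__init__()
        self.ln1 = nn.LayerNorm(embed_dim)
        self.ln2 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(embed_dim, ff_dim), nn.GELU(), nn.Linear(ff_dim, embed_dim))
        self.long_term_adapter = Adapter(embed_dim, adapter_dim)  # trained by train.py
        self.session_adapter = Adapter(embed_dim, adapter_dim)    # trained on the fly at chat time

    def forward(self, x):
        # Causal self-attention: mask out future positions.
        seq_len = x.size(1)
        causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal)
        x = x + attn_out
        # Feed-forward, then both adapters read the same hidden state and add their outputs.
        h = self.ff(self.ln2(x))
        h = h + self.long_term_adapter(h) + self.session_adapter(h)
        return x + h
```

Keeping the two adapters as separate modules is what allows the training scripts to update one while leaving the other frozen.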

---

## Project Structure

```
microformer/
├── config.py                # Hyperparameters and model settings
├── data/
│   ├── corpus.txt           # Raw training text
│   ├── train.pt             # Preprocessed training tensor (token IDs)
│   ├── val.pt               # Validation tensor (token IDs)
│   ├── vocab.json           # Vocabulary (char or subword, stoi/itos mapping)
│   └── tokenizer.json       # (optional) BPE tokenizer file if using subwords
├── models/
│   └── model.py             # Transformer model definition (Microformer)
├── scripts/
│   ├── prepare_data.py      # Data preprocessing/tokenization
│   ├── train.py             # Training script (trains long-term adapters)
│   ├── generate_text.py     # Inference/generation + online learning (session adapters)
│   └── tokenizer_setup.py   # BPE tokenizer setup
└── README.md
```

---

## Quickstart

1. **Prepare your corpus**

   Place your text data in `data/corpus.txt`.

2. **Choose your tokenizer:**

   - **Character-level (default):**
     No extra steps needed.

   - **BPE/Subword (recommended for rich/modern text):**

     ```bash
     python scripts/tokenizer_setup.py --input data/corpus.txt --vocab_size 1000
     ```

     (A sketch of what this script does appears in the Requirements section below.)

3. **Prepare the dataset**

   ```bash
   python scripts/prepare_data.py
   ```

4. **Train the model (long-term knowledge)**

   ```bash
   python scripts/train.py
   ```

   - This trains only the **long-term adapters** and core weights.
   - Session adapters remain untrained (blank) until chat time.

5. **Generate text and teach interactively (session memory)**

   ```bash
   python scripts/generate_text.py
   ```

   - Loads your trained model.
   - Prompts for a seed string and temperature.
   - **Allows you to “teach” new facts on the fly!**
   - New knowledge is stored in the session adapters and does *not* overwrite long-term knowledge.

A minimal sketch of the temperature/multinomial sampling loop used in generation follows.
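
Such a loop might look like the following; the `generate` helper and its arguments are illustrative assumptions, not the actual interface of `generate_text.py`.

```python
# Illustrative sampling loop: temperature scaling + multinomial sampling.
# Assumes `model` returns logits of shape (batch, seq, vocab_size) and that the
# seed prompt has already been encoded to token IDs.
import torch

@torch.no_grad()
def generate(model, idx, max_new_tokens=200, temperature=0.8, max_seq_len=128):
    """idx: (1, T) tensor of token IDs used as the seed/prompt."""
    model.eval()
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -max_seq_len:]                    # crop to the context window
        logits = model(idx_cond)[:, -1, :] / temperature    # last position, scaled
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)   # sample instead of argmax
        idx = torch.cat([idx, next_id], dim=1)
    return idx
```

Lower temperatures make the output more deterministic; higher temperatures make it more varied.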

---

## Example Config (`config.py`)

```python
EMBED_DIM = 128
NUM_HEADS = 4
NUM_LAYERS = 2
FF_DIM = 256
MAX_SEQ_LEN = 128
BATCH_SIZE = 32
ADAPTER_DIM = 32   # Used for both long-term and session adapters
VOCAB_SIZE = 100   # Set automatically from tokenizer/vocab
```
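
Since `VOCAB_SIZE` is set from the vocabulary rather than by hand, it might be filled in along these lines (assuming `data/vocab.json` stores the `stoi`/`itos` mappings described in the project structure):

```python
# Sketch: derive VOCAB_SIZE from data/vocab.json instead of hard-coding it.
import json

with open("data/vocab.json", "r", encoding="utf-8") as f:
    vocab = json.load(f)

VOCAB_SIZE = len(vocab["stoi"])  # number of tokens in the vocabulary
```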

---

## Using the Dual-Memory System

- **Long-term adapters:**
  Learned during `train.py` and persisted between runs.

- **Session adapters:**
  Learned during interactive chat in `generate_text.py`; can optionally be reset between users/sessions.

- **Teach new facts by entering a prompt and providing your ideal answer.**
  The model will “remember” this for the rest of the session, even if the fact was never present in the training corpus. A rough sketch of this online update follows.
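
The sketch below illustrates the core idea: take a few gradient steps on the (prompt, ideal answer) pair while only the session-adapter parameters receive gradients, so long-term knowledge stays frozen. The `encode` helper and the `"session_adapter"` naming convention are assumptions about the codebase, not its actual API.

```python
# Illustrative online-teaching step: fine-tune ONLY the session adapters.
import torch
import torch.nn.functional as F

def teach(model, encode, prompt: str, answer: str, steps: int = 20, lr: float = 1e-3):
    # Freeze everything except parameters that belong to a session adapter.
    session_params = []
    for name, p in model.named_parameters():
        p.requires_grad = "session_adapter" in name
        if p.requires_grad:
            session_params.append(p)
    optimizer = torch.optim.Adam(session_params, lr=lr)

    ids = torch.tensor([encode(prompt + answer)], dtype=torch.long)   # (1, T)
    x, y = ids[:, :-1], ids[:, 1:]                                    # next-token targets

    model.train()
    for _ in range(steps):
        logits = model(x)                                             # (1, T-1, vocab_size)
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Resetting session memory between users can then be as simple as re-initializing the session-adapter weights.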

---

## Customization & Ideas

- Use BPE/subword tokenization for more expressive modeling (recommended for non-trivial datasets)
- Add more adapters or experiment with gating (e.g., blend adapters by context)
- Combine with a key-value retrieval or buffer for truly persistent “user memory”
- Visualize training with TensorBoard or wandb
- Tinker with alternative attention or memory mechanisms

---

## Requirements

- Python 3.8+
- [PyTorch](https://pytorch.org/)
- [tokenizers](https://github.com/huggingface/tokenizers) (for BPE/subword)

Install dependencies with:

```bash
pip install torch tokenizers
```
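
For reference (see Quickstart step 2), a BPE setup script built on the `tokenizers` library can be as small as the sketch below; the exact special tokens and flags in the real `scripts/tokenizer_setup.py` may differ.

```python
# Rough sketch of BPE tokenizer training with the Hugging Face `tokenizers` library.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(vocab_size=1000, special_tokens=["[UNK]"])
tokenizer.train(files=["data/corpus.txt"], trainer=trainer)
tokenizer.save("data/tokenizer.json")   # the optional file listed in the project structure
```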

---

## Credits

- Inspired by [nanoGPT](https://github.com/karpathy/nanoGPT) and [minGPT](https://github.com/karpathy/minGPT) by Andrej Karpathy
- Adapter and continual-learning inspiration from recent NLP research ([Houlsby et al. 2019](https://arxiv.org/abs/1902.00751))
- Built using concepts from the original [GPT-1 paper](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf)

---

## License

MIT License – Use freely for learning and experimentation.

---

**Happy tinkering with dual-memory transformers!**