---
language:
- uk
datasets:
- lang-uk/malyuk
tags:
- subtoken-statistics
- frequency-list
- aya-tokenizer
- ukraine
- corpus-linguistics
pretty_name: "Malyuk UK Subtoken Inventory"
---

## Repo Description

This repository hosts a **frequency-filtered inventory** of byte-level sub-tokens extracted from the [Malyuk Ukrainian corpus](https://huggingface.co/datasets/lang-uk/malyuk/tree/main) (38.9 M lines). The tokenizer inherits the Aya Expanse [tokenizer](https://huggingface.co/CohereLabs/aya-expanse-32b/blob/main/tokenizer.json): all of Aya’s special tokens are retained at the start of the vocabulary and are not removed by the frequency threshold.

Any sub-token with a **total count ≥ 500** in the corpus survives the filter, resulting in **654 023** unique entries.

> **Note:** This is *not* a plug-and-play LLM tokenizer, but rather a raw statistical resource.

## Simple example

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "transhumanist-already-exists/malyuk-uk-bpe-654k"
)
toks = tokenizer("Всі красиві зберігають оптимізм", add_special_tokens=False)
print(toks.input_ids)
# [11961, 41218, 33300, 63514]
```

## Contents

- **`tokenizer.json`** Byte-level tokenizer spec (vocab, merges, model settings).
- **`tokenizer_config.json`** Configuration metadata.
- **`special_tokens_map.json`** Mapping of special tokens (the same as Aya’s).
- **`readable_tokenizer_utf8.json`** Human-readable dump: UTF-8-decoded sub-tokens and merge rules, for corpus-linguistic inspection.

## Why publish a frequency list?

1. **Bootstrapping smaller/custom tokenizers**
   - Start from this *core* if you only need, say, the **top 256_000** or **top 50_256** sub-tokens: simply truncate the tail of the vocabulary in `tokenizer.json` (see the sketch after this list). Aya’s special tokens remain intact at the head.
   - Merge or interleave these Ukrainian sub-tokens with other language vocabularies to build **UK-centric** multilingual tokenizers.
2. **Computational-linguistic analyses** (see **`readable_tokenizer_utf8.json`**)
   - **Zipf curve plotting**, type–token ratio studies, morphological productivity analysis.
   - **Stop-word** and **keyword** list extraction.
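A minimal sketch of that tail truncation (not a ready-made utility): it assumes the standard Hugging Face `tokenizers` BPE serialization of `tokenizer.json` (`model.vocab` as a token → id map, `model.merges` as pairs stored either as `"left right"` strings or `[left, right]` lists) and that ids above the special-token block follow training order; `TOP_N` and the output filename are illustrative.

```python
import json

TOP_N = 256_000  # keep the 256_000 lowest ids; Aya's special tokens sit at the head

with open("tokenizer.json", encoding="utf-8") as f:
    spec = json.load(f)

# Keep only the tokens whose ids fall below the cut-off.
vocab = spec["model"]["vocab"]
kept = {tok: idx for tok, idx in vocab.items() if idx < TOP_N}

def pair(merge):
    # Normalize a merge rule to (left, right) regardless of serialization format.
    return tuple(merge.split(" ", 1)) if isinstance(merge, str) else tuple(merge)

# Drop merges whose parts or merged result were cut, so the remaining rules stay consistent.
kept_merges = [
    m for m in spec["model"]["merges"]
    if pair(m)[0] in kept and pair(m)[1] in kept and pair(m)[0] + pair(m)[1] in kept
]

spec["model"]["vocab"] = kept
spec["model"]["merges"] = kept_merges

with open("tokenizer_truncated.json", "w", encoding="utf-8") as f:
    json.dump(spec, f, ensure_ascii=False)
```

The truncated file can then be loaded with `tokenizers.Tokenizer.from_file("tokenizer_truncated.json")` or wrapped in `transformers.PreTrainedTokenizerFast(tokenizer_file=...)`.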
## Training the Aya-based Ukrainian tokenizer

Below is the Python script we used to shuffle the corpus, filter sub-tokens by frequency (≥ 500), and train the byte-level BPE tokenizer:

```python
import os

from datasets import load_dataset
from tokenizers.pre_tokenizers import ByteLevel
from transformers import AutoTokenizer

os.environ["TOKENIZERS_PARALLELISM"] = "true"

# Hyper-parameters
MAX_VOCAB_SIZE = 1_000_000
CORPUS_NAME = "lang-uk/malyuk"
SEED = 42
TEST_SET_SIZE = 100_000
MIN_FREQUENCY = 500
TOKENIZER_PATH = "./malyuk_uk_tokenizer"

# 1) Load the base Aya tokenizer and the corpus
tokenizer = AutoTokenizer.from_pretrained("CohereLabs/aya-expanse-32b")
full_ds = load_dataset(CORPUS_NAME, split="train", cache_dir="./ds")
ds = full_ds.remove_columns([c for c in full_ds.column_names if c != "text"])
ds = ds.shuffle(seed=SEED)

# 2) Skip the first TEST_SET_SIZE examples (held out for the test below)
ds = ds.select(range(TEST_SET_SIZE, len(ds)))

# 3) Define a streaming iterator over text batches
def batch_iterator(dataset, batch_size=500_000):
    for batch in dataset.iter(batch_size=batch_size):
        yield batch["text"]

# 4) Train the new tokenizer from the iterator
new_tok = tokenizer.train_new_from_iterator(
    batch_iterator(ds),
    vocab_size=MAX_VOCAB_SIZE,
    length=len(ds),
    new_special_tokens=list(tokenizer.added_tokens_encoder.keys()),
    min_frequency=MIN_FREQUENCY,
    initial_alphabet=ByteLevel.alphabet(),
)

# 5) Save locally
new_tok.save_pretrained(TOKENIZER_PATH)

# 6) Small test: count the tokens produced on the held-out texts
malyuk_uk_tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_PATH, trust_remote_code=True)
test_dataset = full_ds.select(range(0, TEST_SET_SIZE))

def tokenize_wrapper(tokenizer):
    def batch_fn(examples):
        outputs = tokenizer(
            examples["text"],
            padding=False,
            truncation=False,
        )
        # list of token counts, one per example
        return {"tokens_count": [len(ids) for ids in outputs["input_ids"]]}
    return batch_fn

ds = test_dataset.map(
    tokenize_wrapper(malyuk_uk_tokenizer),
    batched=True,
    batch_size=20_000,
)
print(f"malyuk_uk_tokenizer tokens count for 100_000 malyuk texts: {sum(ds['tokens_count'])}")
```

### Test results

| Tokenizer           | Tokens for 100 000 texts |
| ------------------- | -----------------------: |
| **Malyuk (custom)** |               33 959 222 |
| **Aya Expanse-32B** |               49 609 840 |

> *Please note: these are total token counts for the 100 000-text sample; per-word averages would be a more robust comparison and are left for future work.*

# Citation

**BibTeX:**

```bibtex
@misc{zaduha2025post9138,
  author       = {Bohdan Didenko},
  title        = {Post \#9138 on Telegram Channel Zaduha},
  howpublished = {\url{https://t.me/zaduha/9138}},
  month        = may,
  year         = {2025},
  note         = {[Online; accessed 22 May 2025]}
}
```