---
language:
- uk
datasets:
- lang-uk/malyuk
tags:
- subtoken-statistics
- frequency-list
- aya-tokenizer
- ukraine
- corpus-linguistics
pretty_name: "Malyuk UK Subtoken Inventory"
---

## Repo Description

This repository hosts a **frequency-filtered inventory** of byte-level sub-tokens extracted from the [Malyuk Ukrainian corpus](https://huggingface.co/datasets/lang-uk/malyuk/tree/main) (38.9 M lines). The tokenizer inherits the Aya Expanse [tokenizer](https://huggingface.co/CohereLabs/aya-expanse-32b/blob/main/tokenizer.json): all of Aya’s special tokens are retained at the start of the vocabulary and are not removed by the frequency threshold.

Any sub-token with a **total count ≥ 500** in the corpus survives the filter, resulting in **654 023** unique entries.

> **Note:** This is *not* a plug-and-play LLM tokenizer, but rather a raw statistical resource.

## Simple example

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "transhumanist-already-exists/malyuk-uk-bpe-654k"
)
toks = tokenizer("Всі красиві зберігають оптимізм", add_special_tokens=False)
print(toks.input_ids)
# [11961, 41218, 33300, 63514]
```

## Contents

- **`tokenizer.json`** Byte-level tokenizer spec (vocab, merges, model settings).
- **`tokenizer_config.json`** Configuration metadata.
- **`special_tokens_map.json`** Mapping of special tokens (the same as Aya’s).
- **`readable_tokenizer_utf8.json`** Human-readable dump: UTF-8-decoded sub-tokens and merge rules, for corpus-linguistic inspection.

## Why publish a frequency list?

1. **Bootstrapping smaller/custom tokenizers**
   - Start from this *core* if you only need, say, the **top 256_000** or **top 50_256** sub-tokens: simply truncate the tail of the vocabulary in `tokenizer.json` (see the sketch after this list). Aya’s special tokens remain intact at the head.
   - Merge or interleave these Ukrainian sub-tokens with other language vocabularies to build **UK-centric** multilingual tokenizers.
2. **Computational-linguistic analyses** (see **`readable_tokenizer_utf8.json`**)
   - **Zipf curve plotting**, type–token ratio studies, morphological productivity analysis.
   - **Stop-word** and **keyword** list extraction.
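A minimal sketch of that tail truncation (not a ready-made utility): it assumes the standard Hugging Face `tokenizers` BPE serialization of `tokenizer.json` (`model.vocab` as a token → id map, `model.merges` as pairs stored either as `"left right"` strings or `[left, right]` lists) and that ids above the special-token block follow training order; `TOP_N` and the output filename are illustrative.

```python
import json

TOP_N = 256_000  # keep the 256_000 lowest ids; Aya's special tokens sit at the head

with open("tokenizer.json", encoding="utf-8") as f:
    spec = json.load(f)

# Keep only the tokens whose ids fall below the cut-off.
vocab = spec["model"]["vocab"]
kept = {tok: idx for tok, idx in vocab.items() if idx < TOP_N}

def pair(merge):
    # Normalize a merge rule to (left, right) regardless of serialization format.
    return tuple(merge.split(" ", 1)) if isinstance(merge, str) else tuple(merge)

# Drop merges whose parts or merged result were cut, so the remaining rules stay consistent.
kept_merges = [
    m for m in spec["model"]["merges"]
    if pair(m)[0] in kept and pair(m)[1] in kept and pair(m)[0] + pair(m)[1] in kept
]

spec["model"]["vocab"] = kept
spec["model"]["merges"] = kept_merges

with open("tokenizer_truncated.json", "w", encoding="utf-8") as f:
    json.dump(spec, f, ensure_ascii=False)
```

The truncated file can then be loaded with `tokenizers.Tokenizer.from_file("tokenizer_truncated.json")` or wrapped in `transformers.PreTrainedTokenizerFast(tokenizer_file=...)`.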
## Training the Aya-based Ukrainian tokenizer

Below is the Python script we used to shuffle the corpus, filter sub-tokens by frequency (≥ 500), and train the byte-level BPE tokenizer:

```python
import os

from datasets import load_dataset
from tokenizers.pre_tokenizers import ByteLevel
from transformers import AutoTokenizer

os.environ["TOKENIZERS_PARALLELISM"] = "true"

# Hyper-parameters
MAX_VOCAB_SIZE = 1_000_000
CORPUS_NAME = "lang-uk/malyuk"
SEED = 42
TEST_SET_SIZE = 100_000
MIN_FREQUENCY = 500
TOKENIZER_PATH = "./malyuk_uk_tokenizer"

# 1) Load the base Aya tokenizer and the corpus
tokenizer = AutoTokenizer.from_pretrained("CohereLabs/aya-expanse-32b")
full_ds = load_dataset(CORPUS_NAME, split="train", cache_dir="./ds")
ds = full_ds.remove_columns([c for c in full_ds.column_names if c != "text"])
ds = ds.shuffle(seed=SEED)

# 2) Skip the first TEST_SET_SIZE examples (held out for the test below)
ds = ds.select(range(TEST_SET_SIZE, len(ds)))

# 3) Define a streaming iterator over text batches
def batch_iterator(dataset, batch_size=500_000):
    for batch in dataset.iter(batch_size=batch_size):
        yield batch["text"]

# 4) Train the new tokenizer from the iterator
new_tok = tokenizer.train_new_from_iterator(
    batch_iterator(ds),
    vocab_size=MAX_VOCAB_SIZE,
    length=len(ds),
    new_special_tokens=list(tokenizer.added_tokens_encoder.keys()),
    min_frequency=MIN_FREQUENCY,
    initial_alphabet=ByteLevel.alphabet(),
)

# 5) Save locally
new_tok.save_pretrained(TOKENIZER_PATH)

# 6) Small test: count the tokens produced on the held-out texts
malyuk_uk_tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_PATH, trust_remote_code=True)
test_dataset = full_ds.select(range(0, TEST_SET_SIZE))

def tokenize_wrapper(tokenizer):
    def batch_fn(examples):
        outputs = tokenizer(
            examples["text"],
            padding=False,
            truncation=False,
        )
        # list of token counts, one per example
        return {"tokens_count": [len(ids) for ids in outputs["input_ids"]]}
    return batch_fn

ds = test_dataset.map(
    tokenize_wrapper(malyuk_uk_tokenizer),
    batched=True,
    batch_size=20_000,
)
print(f"malyuk_uk_tokenizer tokens count for 100_000 malyuk texts: {sum(ds['tokens_count'])}")
```

### Test results

| Tokenizer           | Tokens for 100 000 texts |
| ------------------- | -----------------------: |
| **Malyuk (custom)** |               33 959 222 |
| **Aya Expanse-32B** |               49 609 840 |

> *Please note: these are total token counts for the 100 000-text sample; per-word averages would be a more robust comparison and are left for future work.*

# Citation

**BibTeX:**

```bibtex
@misc{zaduha2025post9138,
  author       = {Bohdan Didenko},
  title        = {Post \#9138 on Telegram Channel Zaduha},
  howpublished = {\url{https://t.me/zaduha/9138}},
  month        = may,
  year         = {2025},
  note         = {[Online; accessed 22 May 2025]}
}
```