In only 3 hours, on 10 billion+ tokens, I trained a custom BPE, tiktoken-style tokenizer using my new library microtok, and it matches Qwen3's token efficiency.
Tokenizers have always felt like black magic to me. We drop them into every LLM project, but actually training one from scratch? That always seemed way too complicated.
Turns out it doesn’t have to be.
microtok makes the whole process stupidly simple — literally just 3 lines of code. No heavy setup, no GPU required. I built it on top of the Hugging Face tokenizers library so it stays clean, fast, and actually understandable.
If you’ve ever wanted to look under the hood and build your own optimized vocabulary instead of just copying someone else’s, this is the entry point you’ve been waiting for.
I wrote up the full story, threw in a ready-to-run Colab template, and dropped the trained tokenizer on Hugging Face.
Blog → https://parveshiiii.github.io/blogs/microtok/
Trained tokenizer → Parveshiiii/microtok
GitHub repo → https://github.com/Parveshiiii/microtok
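For anyone curious what's actually happening under the hood, the core of BPE training is just: repeatedly merge the most frequent adjacent symbol pair. A toy pure-Python sketch of that loop (illustrative only, not microtok's implementation; real trainers like the Hugging Face tokenizers library are far faster):

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Toy BPE trainer: learns merge rules from a list of words.
    Illustrative only -- production trainers are heavily optimized."""
    # Represent each word as a tuple of single-character symbols.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge everywhere it occurs.
        merged = Counter()
        for word, freq in corpus.items():
            new_word, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    new_word.append(word[i] + word[i + 1])
                    i += 2
                else:
                    new_word.append(word[i])
                    i += 1
            merged[tuple(new_word)] += freq
        corpus = merged
    return merges

merges = train_bpe(["low", "lower", "lowest", "low"], 3)
# First learned merges on this tiny corpus: ('l','o'), then ('lo','w')
```

The real speedup in libraries like tiktoken comes from doing this pair counting incrementally instead of rescanning the corpus each iteration.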
What I mean is that I would expect the ASR side to worry only about transcription, not context, because an SLM/LLM could then fix the transcript. ASR models today try to do too much when they should do just one thing. If I say "I read a book" and the ASR picks up "I red a book", that's fine, because the output can always be post-processed. But today's ASR models are all-in-one, which makes them even harder to steer (e.g. wake word detection or out-of-vocabulary terms). So you end up with ASR models that carry a lot of overhead while you still need to post-process the output anyway, when they should instead stay dumb and leave post-processing to better-suited models.
Everything has shifted from fast word recognition to monolithic LLM-based context recognition. With IPA, ASR/STT models could focus solely on words and leave post-processing to other, more capable models. There hasn't really been a good, small ASR model that is truly capable of running locally on low-powered devices. I'm still using Vosk models because they are just good for what they are, but they're approaching the 7-year mark now, which is absurd.
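To make the "dumb ASR + separate post-processor" idea concrete, here is a toy sketch of the correction stage; the rule-based logic is a hypothetical stand-in for the SLM/LLM that would do this in a real pipeline:

```python
def fix_transcript(tokens):
    """Toy post-processor for raw ASR output: resolve a homophone
    error from context. A real system would use a small language
    model here instead of hand-written rules."""
    fixed = []
    for i, tok in enumerate(tokens):
        # Hypothetical context rule: "I red a book" -> "I read a book"
        if tok == "red" and i > 0 and tokens[i - 1] in {"i", "I"}:
            fixed.append("read")
        else:
            fixed.append(tok)
    return " ".join(fixed)

print(fix_transcript("I red a book".split()))  # -> "I read a book"
```

The point is the separation of concerns: the ASR layer can emit phonetically faithful but "wrong" words, and the correction layer owns context.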
Regarding the video, at first I thought it was a joke because they looked like tokenized words haha
The 10% speed and VRAM usage improvements sound absolutely revolutionary. It would really be a massive breakthrough if you pull it off.
Also, I commented on your post on Twitter, but I'll say it here too: this would work absolutely wonders for speech-to-text and text-to-speech since it also has baked in IPA phonemes. You should definitely consider exploring that angle, because those spaces desperately need improvement.
You just killed 23 dyslexic people (and counting) with that video, be ca use of the we ird wo rd split ting. hahaha
Jokes aside, this looks absolutely amazing, but I think tokenizers are there because this might not work fast enough at scale. I'd be excited and extremely happy to be proven wrong, because the concept is certainly great.
I'm training a 22M-parameter LLM right now to test this "thing", and it's already able to formulate coherent sentences 🤯
Bear in mind, this is a completely new, tokenizer-free LLM architecture with built-in language universality.
Check the explainer video to understand what's happening. Feedback welcome on this approach!