AI & ML interests

Wikilangs is an open-source initiative to democratize access to natural language processing models for every language represented on Wikipedia. A project by @OmarKamali, graciously sponsored by Featherless.ai.

Recent Activity

omarkamali updated a model 10 days ago: wikilangs/hu
omarkamali updated a model 10 days ago: wikilangs/ar
omarkamali updated a model 11 days ago: wikilangs/ceb

omarkamali posted an update 7 days ago
You're probably training on outdated Wikipedia data right now and don't know it. 💡

In June last year, a friend from the Moroccan Wikipedia community slid into my DMs: "Are you using the current version? The official dataset is severely outdated. We added so many articles nowhere to be found on HuggingFace."

He was right. I was running a 2023 snapshot. In 2025. The official Wikipedia dataset, the one hundreds of labs and researchers grab by default without a second thought, was frozen in time.
• For English, that's 700,000 missing articles.
• For Moroccan Arabic, 30% of the language's entire Wikipedia.
• For 31 other languages, there was literally no text corpus at all until recently.
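
If you want to check how stale your own snapshot is, here is a minimal sketch. It assumes your dataset config name starts with a dump date in YYYYMMDD form followed by a language code (for example `20231101.en`); the function names, the six-month threshold, and the example date are all hypothetical, chosen only for illustration.

```python
from datetime import date


def snapshot_age_months(config_name: str, today: date) -> int:
    """Return the age of a snapshot in whole months.

    Assumption: config_name begins with a YYYYMMDD dump date,
    optionally followed by ".<lang>", e.g. "20231101.en".
    """
    stamp = config_name.split(".", 1)[0]
    snapshot = date(int(stamp[:4]), int(stamp[4:6]), int(stamp[6:8]))
    months = (today.year - snapshot.year) * 12 + (today.month - snapshot.month)
    # Do not count the current month until the dump's day-of-month is reached.
    if today.day < snapshot.day:
        months -= 1
    return months


def is_stale(config_name: str, today: date, max_age_months: int = 6) -> bool:
    """Flag snapshots older than max_age_months (threshold is arbitrary)."""
    return snapshot_age_months(config_name, today) > max_age_months


# Illustrative date only: a 2023 snapshot checked in mid-2025.
check_day = date(2025, 6, 15)
print(snapshot_age_months("20231101.en", check_day))
print(is_stale("20231101.en", check_day))
```

Passing `date.today()` instead of a fixed date turns this into a quick pre-training sanity check.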

I could've shrugged and moved on. Instead I spent the next few months building a monthly automated pipeline for 340+ languages on my personal laptop, nearly killing it several times in the process (100% disk, frozen screen, the works).

Nous Research trained Hermes 4 on it. INRIA cited it. It's now three years ahead of what most people are training on.

Here's the full story of how I built Wikipedia Monthly 👇

https://omarkamali.com/blog/wikipedia-monthly-pipeline