Exciting updates to the Wikipedia Monthly dataset for November!
・ Fixed a bug to remove infobox leftovers and other wiki markers such as __TOC__
・ New Python package https://pypi.org/project/wikisets: a dataset builder with efficient sampling so you can combine the languages you want seamlessly for any date (ideal for pretraining data but works for any purpose)
・ Moved the pipeline to a large server. Much higher costs but with better reliability and predictability (let me know if you'd like to sponsor this!).
・ Dataset sizes are unfortunately missing for this month due to shenanigans with the migration, but should be back in December's update.
Check out the dataset:
omarkamali/wikipedia-monthly
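
For anyone curious what stripping markers like __TOC__ can look like, here is a minimal sketch. It is not the actual pipeline code: the function name `strip_behavior_switches` and the short list of switches are assumptions made for illustration only.

```python
import re

# Hypothetical subset of MediaWiki behavior switches, chosen for illustration.
# The list used by the real pipeline is not given in this post.
BEHAVIOR_SWITCHES = ("TOC", "NOTOC", "FORCETOC", "NOEDITSECTION")

# Matches e.g. "__TOC__" but not "__TOCX__" or plain "TOC".
_SWITCH_RE = re.compile(r"__(?:%s)__" % "|".join(BEHAVIOR_SWITCHES))


def strip_behavior_switches(text: str) -> str:
    """Remove behavior switches, then tidy the whitespace they leave behind."""
    cleaned = _SWITCH_RE.sub("", text)
    cleaned = re.sub(r"[ \t]+\n", "\n", cleaned)  # drop trailing spaces on lines
    cleaned = re.sub(r"\n{3,}", "\n\n", cleaned)  # collapse runs of blank lines
    return cleaned.strip()
```

Calling `strip_behavior_switches("__TOC__\nIntro text.")` leaves just the article text, while ordinary words such as "TOC" without the double underscores are untouched.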