---
language: ms
---

# roberta-base-bahasa-cased

Pretrained RoBERTa base language model for Malay.

## Pretraining Corpus

The `roberta-base-bahasa-cased` model was pretrained on ~400 million words. The training data is listed below:

1. IIUM Confession, https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean
2. local Instagram, https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean
3. local news, https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean
4. local parliament hansards, https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean
5. local research papers related to `kebudayaan` (culture), `keagamaan` (religion) and `etnik` (ethnicity), https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean
6. local Twitter, https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean
7. Malay Wattpad, https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean
8. Malay Wikipedia, https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean

## Pretraining details

- All steps can be reproduced from https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/roberta.

## Example using AutoModelForMaskedLM

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Load the pretrained masked language model and its tokenizer.
model = AutoModelForMaskedLM.from_pretrained('mesolitica/roberta-base-bahasa-cased')
tokenizer = AutoTokenizer.from_pretrained(
    'mesolitica/roberta-base-bahasa-cased',
    do_lower_case=False,
)

fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)

# The input must contain the tokenizer's mask token, which is `<mask>` for RoBERTa.
fill_mask('Permohonan Najib, anak untuk dengar isu perlembagaan <mask>.')
```

The output is:

```json
[{'score': 0.3368818759918213,
  'token': 746,
  'token_str': ' negara',
  'sequence': 'Permohonan Najib, anak untuk dengar isu perlembagaan negara.'},
 {'score': 0.09646568447351456,
  'token': 598,
  'token_str': ' Malaysia',
  'sequence': 'Permohonan Najib, anak untuk dengar isu perlembagaan Malaysia.'},
 {'score': 0.029483484104275703,
  'token': 3265,
  'token_str': ' UMNO',
  'sequence': 'Permohonan Najib, anak untuk dengar isu perlembagaan UMNO.'},
 {'score': 0.026470622047781944,
  'token': 2562,
  'token_str': ' parti',
  'sequence': 'Permohonan Najib, anak untuk dengar isu perlembagaan parti.'},
 {'score': 0.023237623274326324,
  'token': 391,
  'token_str': ' ini',
  'sequence': 'Permohonan Najib, anak untuk dengar isu perlembagaan ini.'}]
```
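
The pipeline returns a list of dictionaries with `score`, `token`, `token_str` and `sequence` keys, and `token_str` carries a leading space. As a minimal sketch of post-processing that output, the snippet below keeps only candidates above a score threshold and strips the whitespace. The helper name `top_candidates` and the `min_score` default are assumptions for illustration and are not part of the model or of `transformers`; the sample data is copied from the first three entries of the output above.

```python
# First three entries copied from the fill-mask output shown above.
results = [
    {'score': 0.3368818759918213, 'token': 746, 'token_str': ' negara',
     'sequence': 'Permohonan Najib, anak untuk dengar isu perlembagaan negara.'},
    {'score': 0.09646568447351456, 'token': 598, 'token_str': ' Malaysia',
     'sequence': 'Permohonan Najib, anak untuk dengar isu perlembagaan Malaysia.'},
    {'score': 0.029483484104275703, 'token': 3265, 'token_str': ' UMNO',
     'sequence': 'Permohonan Najib, anak untuk dengar isu perlembagaan UMNO.'},
]


def top_candidates(results, min_score=0.05):
    """Hypothetical helper: return (word, score) pairs with score >= min_score, best first."""
    kept = [
        (r['token_str'].strip(), r['score'])  # strip the leading space from the token
        for r in results
        if r['score'] >= min_score
    ]
    # Sort explicitly by score instead of relying on the input order.
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


candidates = top_candidates(results)
for word, score in candidates:
    print(f'{word}: {score:.3f}')
```

With the default threshold of 0.05, only `negara` and `Malaysia` are kept; lowering `min_score` to 0 returns every candidate.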