# TaMillion

This is the second version of a Tamil language model trained with
Google Research's [ELECTRA](https://github.com/google-research/electra).

Tokenization and pre-training Colab: https://colab.research.google.com/drive/1Pwia5HJIb6Ad4Hvbx5f-IjND-vCaJzSE?usp=sharing

V1: small model with GPU; 190,000 steps

V2 (current): base model with TPU and larger corpus; 224,000 steps
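
ELECTRA pretrains a discriminator on replaced token detection: a small generator swaps some input tokens, and the model learns to flag which tokens were replaced, rather than predicting masked tokens as BERT does. As a toy sketch of the corruption-and-labeling step (illustrative tokens and uniform random replacement, not the actual ELECTRA code, which uses a trained generator):

```python
import random

def corrupt(tokens, vocab, replace_prob=0.15, seed=0):
    """Randomly replace ~replace_prob of the tokens with other vocabulary
    items, returning the corrupted sequence and per-token labels
    (1 = replaced, 0 = original) for the discriminator to predict."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < replace_prob:
            corrupted.append(rng.choice(vocab))
            labels.append(1)
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels

tokens = ["தமிழ்", "ஒரு", "செம்", "##மொழி"]
corrupted, labels = corrupt(tokens, vocab=["மொழி", "நூல்"], replace_prob=0.5)
print(corrupted, labels)
```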

## Classification

Kaggle dataset: https://www.kaggle.com/sudalairajkumar/tamil-nlp

Notebook: https://colab.research.google.com/drive/1_rW9HZb6G87-5DraxHvhPOzGmSMUc67_?usp=sharing

The model outperformed mBERT on news classification:
(Random: 16.7%, mBERT: 53.0%, TaMillion: 75.1%)

The model slightly outperformed mBERT on movie reviews:
(RMSE - mBERT: 0.657, TaMillion: 0.626)

Equivalent accuracy on the Tirukkural topic task.
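
The movie-review comparison above is scored with RMSE on predicted ratings; for reference, the metric can be computed like this (the ratings below are made-up illustration data, not the evaluation set):

```python
import math

def rmse(preds, labels):
    """Root-mean-square error between predicted and true ratings."""
    assert len(preds) == len(labels) and preds
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(preds))

# Illustrative ratings only.
y_true = [4.0, 3.5, 1.0, 5.0]
y_pred = [3.5, 3.0, 1.5, 4.5]
print(rmse(y_pred, y_true))  # 0.5
```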

## Question Answering

I didn't find a Tamil-language question answering dataset, but this model could be finetuned
to train a QA model. See Hindi and Bengali examples here: https://colab.research.google.com/drive/1i6fidh2tItf_-IDkljMuaIGmEU6HT2Ar
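
An extractive QA head of the kind those notebooks train predicts a start score and an end score for every context token; the answer is the span that maximizes their sum. A minimal decoding sketch with made-up scores:

```python
def best_span(start_logits, end_logits, max_len=15):
    """Pick (start, end) maximizing start_logits[s] + end_logits[e],
    subject to s <= e and a maximum answer length."""
    best, best_score = (0, 0), float("-inf")
    for s, sl in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = sl + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

start = [0.1, 2.0, 0.3, 0.0]
end = [0.0, 0.2, 1.5, 0.1]
print(best_span(start, end))  # (1, 2)
```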

## Corpus

Trained on
IndicCorp Tamil (11GB) https://indicnlp.ai4bharat.org/corpora/
and the 1 October 2020 dump of https://ta.wikipedia.org (482MB)

## Vocabulary

Included as vocab.txt in the upload
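
vocab.txt is a WordPiece-style vocabulary (one token per line, continuation subwords prefixed with `##`). BERT-family tokenizers split each word by greedy longest-match-first lookup against it; a toy version of that lookup, with a hypothetical mini-vocabulary:

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first WordPiece split of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation subwords carry the ## prefix
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no subword match: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"tam", "##il", "##lion", "mil", "##l"}
print(wordpiece("tamillion", toy_vocab))  # ['tam', '##il', '##lion']
```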