---
language: ja
license: cc-by-sa-4.0
library_name: transformers
datasets:
- cc100
- mc4
- oscar
- wikipedia
- izumi-lab/cc100-ja
- izumi-lab/mc4-ja-filter-ja-normal
- izumi-lab/oscar2301-ja-filter-ja-normal
- izumi-lab/wikipedia-ja-20230720
- izumi-lab/wikinews-ja-20230728
widget:
- text: 東京大学で[MASK]の研究をしています。
---
# DeBERTa V2 base Japanese
This is a [DeBERTaV2](https://github.com/microsoft/DeBERTa) model pretrained on Japanese texts.
The code for pretraining is available at [retarfi/language-pretraining](https://github.com/retarfi/language-pretraining/releases/tag/v2.2.1).
## How to use
You can use this model for masked language modeling as follows:
```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("izumi-lab/deberta-v2-base-japanese")
model = AutoModelForMaskedLM.from_pretrained("izumi-lab/deberta-v2-base-japanese")
# Predict the token at the [MASK] position
inputs = tokenizer("東京大学で[MASK]の研究をしています。", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
mask_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
print(tokenizer.decode(logits[0, mask_index].argmax(dim=-1)))
```
## Tokenization
The model uses a SentencePiece-based tokenizer; the vocabulary was trained on Japanese Wikipedia using [sentencepiece](https://github.com/google/sentencepiece).
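For example, you can inspect the subword segmentation directly. This is a minimal check with an arbitrary sample sentence; the actual split depends on the trained vocabulary:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("izumi-lab/deberta-v2-base-japanese")
# Show the SentencePiece subword pieces for an arbitrary sentence
print(tokenizer.tokenize("東京大学で自然言語処理の研究をしています。"))
```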
## Training Data
We used the following corpora for pre-training (a loading sketch follows the lists):
- [Japanese portion of CC-100](https://huggingface.co/datasets/izumi-lab/cc100-ja)
- [Japanese portion of mC4](https://huggingface.co/datasets/izumi-lab/mc4-ja-filter-ja-normal)
- [Japanese portion of OSCAR2301](https://huggingface.co/datasets/izumi-lab/oscar2301-ja-filter-ja-normal)
- [Japanese Wikipedia as of July 20, 2023](https://huggingface.co/datasets/izumi-lab/wikipedia-ja-20230720)
- [Japanese Wikinews as of July 28, 2023](https://huggingface.co/datasets/izumi-lab/wikinews-ja-20230728)
We pretrained on the corpora above for 900k steps, then continued pretraining on the following financial corpora for an additional 100k steps (1,000,000 steps in total):
- Summaries of financial results from October 9, 2012, to December 31, 2022
- Securities reports from February 8, 2018, to December 31, 2022
- News articles
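The public corpora above are hosted on the Hugging Face Hub and can be loaded with the `datasets` library. A minimal sketch, assuming default configurations (streaming avoids downloading the full corpus):
```python
from datasets import load_dataset

# Stream one of the pretraining corpora listed above from the Hub
wiki = load_dataset("izumi-lab/wikipedia-ja-20230720", split="train", streaming=True)
print(next(iter(wiki)))
```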
## Training Parameters
The learning rate in parentheses is the one used for the additional pretraining on the financial corpora; an illustrative mapping of these settings to `TrainingArguments` follows the list.
- learning_rate: 2.4e-4 (6e-5)
- total_train_batch_size: 2,016
- max_seq_length: 512
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-06
- lr_scheduler_type: linear schedule with warmup
- training_steps: 1,000,000
- warmup_steps: 100,000
- precision: FP16
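For reference only, the settings above map roughly onto Hugging Face `TrainingArguments` as sketched below. The actual pretraining used the [language-pretraining](https://github.com/retarfi/language-pretraining) code linked above; `output_dir` and the per-device batch / gradient-accumulation split are hypothetical, and `max_seq_length` is applied when tokenizing the data rather than here.
```python
from transformers import TrainingArguments

# Sketch only: mirrors the hyperparameters listed above.
args = TrainingArguments(
    output_dir="deberta-v2-base-japanese",  # hypothetical
    learning_rate=2.4e-4,            # 6e-5 for the financial continued pretraining
    per_device_train_batch_size=42,  # hypothetical split of the total batch size
    gradient_accumulation_steps=48,  # 42 * 48 = 2,016 sequences per optimizer step
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-6,
    lr_scheduler_type="linear",
    max_steps=1_000_000,
    warmup_steps=100_000,
    fp16=True,
)
```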
## Fine-tuning on General NLU tasks
We evaluated our model and report the average over five seeds.
Scores for the other models are taken from the [JGLUE repository](https://github.com/yahoojapan/JGLUE). A minimal sketch of loading this checkpoint for such fine-tuning follows the table.
| Model                | JSTS (Pearson/Spearman) | JNLI (acc) | JCommonsenseQA (acc) |
|----------------------|-------------------------|------------|----------------------|
| **DeBERTaV2 base**   | **0.890/0.846**         | **0.xxx**  | **0.859**            |
| Waseda RoBERTa base  | 0.913/0.873             | 0.895      | 0.840                |
| Tohoku BERT base     | 0.909/0.868             | 0.899      | 0.808                |
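As an illustration only (this is not the JGLUE fine-tuning code), the checkpoint can be loaded with a task-specific head, e.g. a 3-way classifier for JNLI:
```python
from transformers import AutoModelForSequenceClassification

# Hypothetical fine-tuning setup: JNLI is a 3-class NLI task
model = AutoModelForSequenceClassification.from_pretrained(
    "izumi-lab/deberta-v2-base-japanese", num_labels=3
)
```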
## Citation
TBA
## Licenses
The pretrained models are distributed under the terms of the [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/).
## Acknowledgments
This work was supported in part by JSPS KAKENHI Grant Number JP21K12010 and JST-Mirai Program Grant Number JPMJMI20B1, Japan.