radinplaid committed on
Commit 5340c65 · verified · 1 Parent(s): 65d61c3

Upload folder using huggingface_hub

.ipynb_checkpoints/README-checkpoint.md ADDED
README.md CHANGED
@@ -1,3 +1,98 @@
- ---
- license: cc-by-4.0
- ---
---
language:
- en
- es
tags:
- translation
license: cc-by-4.0
datasets:
- quickmt/quickmt-train.es-en
model-index:
- name: quickmt-es-en
  results:
  - task:
      name: Translation spa-eng
      type: translation
      args: spa-eng
    dataset:
      name: flores101-devtest
      type: flores_101
      args: spa_Latn eng_Latn devtest
    metrics:
    - name: BLEU
      type: bleu
      value: 28.64
    - name: CHRF
      type: chrf
      value: 58.61
    - name: COMET
      type: comet
      value: 86.11
---

# `quickmt-es-en` Neural Machine Translation Model

`quickmt-es-en` is a reasonably fast and reasonably accurate neural machine translation model for translation from `es` into `en`.

## Model Information

* Trained using [`eole`](https://github.com/eole-nlp/eole)
* 185M parameter 'big' transformer with 8 encoder layers and 2 decoder layers
* 50k joint SentencePiece vocabulary
* Exported to [CTranslate2](https://github.com/OpenNMT/CTranslate2) format for fast inference
* Training data: https://huggingface.co/datasets/quickmt/quickmt-train.es-en/tree/main

See the `eole` model configuration in this repository for further details, and the `eole-model` directory for the raw `eole` (PyTorch) model.

## Usage with `quickmt`

Install the NVIDIA CUDA toolkit first if you want to run GPU inference.

Next, install the `quickmt` Python library and download the model:

```bash
git clone https://github.com/quickmt/quickmt.git
pip install ./quickmt/

quickmt-model-download quickmt/quickmt-es-en ./quickmt-es-en
```

Finally, use the model in Python:

```python
from quickmt import Translator

# Auto-detects GPU; set device="cpu" to force CPU inference
t = Translator("./quickmt-es-en/", device="auto")

# Translate - set beam_size to 1 for faster (but lower-quality) output
sample_text = 'La investigación todavía se ubica en su etapa inicial, conforme indicara el Dr. Ehud Ur, docente en la carrera de medicina de la Universidad de Dalhousie, en Halifax, Nueva Escocia, y director del departamento clínico y científico de la Asociación Canadiense de Diabetes.'
t(sample_text, beam_size=5)

> 'The research is still in its early stages, as indicated by Dr. Ehud Ur, a medical professor at the University of Dalhousie, Halifax, Nova Scotia, and director of the clinical and scientific department of the Canadian Diabetes Association.'

# Get alternative translations by sampling
# You can pass any CTranslate2 `translate_batch` arguments
t([sample_text], sampling_temperature=1.2, beam_size=1, sampling_topk=50, sampling_topp=0.9)

> 'The research is still in its initial stages as instructed by Dr. Ehud Ur, a professor at the medical degree, University of Dalhousie, Halifax, Nova Scotia, and director of the clinical and scientific department of the Canadian Diabetes Association.'
```

The model is in `ctranslate2` format and the tokenizers are `sentencepiece` models, so you can use `ctranslate2` directly instead of going through `quickmt`. It is also possible to use this model with, e.g., [LibreTranslate](https://libretranslate.com/), which also uses `ctranslate2` and `sentencepiece`.
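
For example, here is a minimal sketch of calling `ctranslate2` and `sentencepiece` directly. The paths assume the model was downloaded to `./quickmt-es-en` as above (the `src.spm.model` and `tgt.spm.model` files ship with this repository); `device="cpu"` is used here only so the snippet runs without a GPU:

```python
# Minimal sketch: bypass quickmt and call CTranslate2 + SentencePiece directly.
import ctranslate2
import sentencepiece as spm

translator = ctranslate2.Translator("./quickmt-es-en", device="cpu")
sp_src = spm.SentencePieceProcessor(model_file="./quickmt-es-en/src.spm.model")
sp_tgt = spm.SentencePieceProcessor(model_file="./quickmt-es-en/tgt.spm.model")

src = "La investigación todavía se ubica en su etapa inicial."
tokens = sp_src.encode(src, out_type=str)       # subword-tokenize the source
result = translator.translate_batch([tokens], beam_size=5)
print(sp_tgt.decode(result[0].hypotheses[0]))   # detokenize the best hypothesis
```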

## Metrics

`bleu` and `chrf2` are calculated with [sacrebleu](https://github.com/mjpost/sacrebleu) on the [Flores200 `devtest` test set](https://huggingface.co/datasets/facebook/flores) ("spa_Latn" -> "eng_Latn"). `comet22` is calculated with the [`comet`](https://github.com/Unbabel/COMET) library and the [default model](https://huggingface.co/Unbabel/wmt22-comet-da). "Time (s)" is the time in seconds to translate the flores-devtest dataset (1012 sentences) on an RTX 4070s GPU with batch size 32 (higher speed is possible with a larger batch size).

|                                  |   bleu |   chrf2 |   comet22 |   Time (s) |
|:---------------------------------|-------:|--------:|----------:|-----------:|
| quickmt/quickmt-es-en            |  28.64 |   58.61 |     86.11 |       1.33 |
| Helsinki-NLP/opus-mt-es-en       |  27.62 |   58.38 |     86.01 |       3.67 |
| facebook/nllb-200-distilled-600M |  30.02 |   59.71 |     86.55 |      21.99 |
| facebook/nllb-200-distilled-1.3B |  31.58 |   60.96 |     87.25 |      38.2  |
| facebook/m2m100_418M             |  22.85 |   55.04 |     82.9  |      18.83 |
| facebook/m2m100_1.2B             |  26.84 |   57.69 |     85.47 |      36.22 |
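
As a rough sketch of how these scores can be reproduced (the `srcs`/`hyps`/`refs` lists below are tiny placeholders standing in for the flores-devtest sources, system outputs, and references):

```python
# Sketch: score translations with sacrebleu (BLEU, chrF2) and COMET-22.
from sacrebleu.metrics import BLEU, CHRF
from comet import download_model, load_from_checkpoint

srcs = ["La investigación todavía se ubica en su etapa inicial."]
hyps = ["The research is still in its early stages."]
refs = ["Research is still in its early stages."]

print(BLEU().corpus_score(hyps, [refs]).score)   # corpus BLEU
print(CHRF().corpus_score(hyps, [refs]).score)   # chrF2 (sacrebleu default)

# COMET-22 also needs the source sentences
comet = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{"src": s, "mt": h, "ref": r} for s, h, r in zip(srcs, hyps, refs)]
print(comet.predict(data, batch_size=8).system_score)
```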
config.json ADDED
@@ -0,0 +1,10 @@
{
  "add_source_bos": false,
  "add_source_eos": false,
  "bos_token": "<s>",
  "decoder_start_token": "<s>",
  "eos_token": "</s>",
  "layer_norm_epsilon": 1e-06,
  "multi_query_attention": false,
  "unk_token": "<unk>"
}
eole-config.yaml ADDED
@@ -0,0 +1,99 @@
## IO
save_data: data
overwrite: True
seed: 1234
report_every: 100
valid_metrics: ["BLEU"]
tensorboard: true
tensorboard_log_dir: tensorboard

### Vocab
src_vocab: es.eole.vocab
tgt_vocab: en.eole.vocab
src_vocab_size: 20000
tgt_vocab_size: 20000
vocab_size_multiple: 8
share_vocab: false
n_sample: 0

data:
  corpus_1:
    # path_src: hf://quickmt/quickmt-train.es-en/es
    # path_tgt: hf://quickmt/quickmt-train.es-en/en
    # path_sco: hf://quickmt/quickmt-train.es-en/sco
    path_src: train.es
    path_tgt: train.en
  valid:
    path_src: dev.es
    path_tgt: dev.en

transforms: [sentencepiece, filtertoolong]
transforms_configs:
  sentencepiece:
    src_subword_model: "es.spm.model"
    tgt_subword_model: "en.spm.model"
  filtertoolong:
    src_seq_length: 256
    tgt_seq_length: 256

training:
  # Run configuration
  model_path: quickmt-es-en-eole-model
  train_from: quickmt-es-en-eole-model
  #train_from: model
  keep_checkpoint: 4
  train_steps: 100000
  save_checkpoint_steps: 5000
  valid_steps: 5000

  # Train on a single GPU
  world_size: 1
  gpu_ranks: [0]

  # Batching 10240
  batch_type: "tokens"
  batch_size: 6400
  valid_batch_size: 4096
  batch_size_multiple: 8
  accum_count: [12]
  accum_steps: [0]

  # Optimizer & Compute
  compute_dtype: "fp16"
  optim: "adamw"
  #use_amp: False
  learning_rate: 2.0
  warmup_steps: 4000
  decay_method: "noam"
  adam_beta2: 0.998

  # Data loading
  bucket_size: 128000
  num_workers: 4
  prefetch_factor: 32

  # Hyperparams
  dropout_steps: [0]
  dropout: [0.1]
  attention_dropout: [0.1]
  max_grad_norm: 0
  label_smoothing: 0.1
  average_decay: 0.0001
  param_init_method: xavier_uniform
  normalization: "tokens"

model:
  architecture: "transformer"
  share_embeddings: false
  share_decoder_embeddings: false
  hidden_size: 1024
  encoder:
    layers: 8
  decoder:
    layers: 2
  heads: 8
  transformer_ff: 4096
  embeddings:
    word_vec_size: 1024
    position_encoding_type: "SinusoidalInterleaved"
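
For reference, a small sketch of the "noam" learning-rate schedule named in this config, plugged with its values (learning_rate: 2.0, warmup_steps: 4000, hidden_size: 1024). This is the standard Transformer inverse-square-root schedule; eole's exact implementation may differ in minor details:

```python
# Inverse-square-root ("noam") schedule with linear warmup.
def noam_lr(step: int, base_lr: float = 2.0, hidden_size: int = 1024,
            warmup_steps: int = 4000) -> float:
    step = max(step, 1)
    return base_lr * hidden_size ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Rises during warmup, peaks around step 4000, then decays as 1/sqrt(step).
print(noam_lr(100), noam_lr(4000), noam_lr(100000))
```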
eole-model/config.json ADDED
@@ -0,0 +1,133 @@
{
  "tgt_vocab": "en.eole.vocab",
  "n_sample": 0,
  "overwrite": true,
  "valid_metrics": [
    "BLEU"
  ],
  "tgt_vocab_size": 20000,
  "tensorboard": true,
  "tensorboard_log_dir_dated": "tensorboard/Apr-28_20-08-59",
  "vocab_size_multiple": 8,
  "src_vocab_size": 20000,
  "save_data": "data",
  "share_vocab": false,
  "src_vocab": "es.eole.vocab",
  "transforms": [
    "sentencepiece",
    "filtertoolong"
  ],
  "tensorboard_log_dir": "tensorboard",
  "report_every": 100,
  "seed": 1234,
  "training": {
    "average_decay": 0.0001,
    "accum_steps": [
      0
    ],
    "accum_count": [
      12
    ],
    "attention_dropout": [
      0.1
    ],
    "train_steps": 100000,
    "warmup_steps": 4000,
    "normalization": "tokens",
    "bucket_size": 128000,
    "compute_dtype": "torch.float16",
    "max_grad_norm": 0.0,
    "batch_type": "tokens",
    "valid_batch_size": 4096,
    "optim": "adamw",
    "world_size": 1,
    "dropout_steps": [
      0
    ],
    "adam_beta2": 0.998,
    "train_from": "quickmt-es-en-eole-model",
    "gpu_ranks": [
      0
    ],
    "learning_rate": 2.0,
    "num_workers": 0,
    "dropout": [
      0.1
    ],
    "batch_size_multiple": 8,
    "label_smoothing": 0.1,
    "batch_size": 6400,
    "model_path": "quickmt-es-en-eole-model",
    "param_init_method": "xavier_uniform",
    "keep_checkpoint": 4,
    "prefetch_factor": 32,
    "decay_method": "noam",
    "valid_steps": 5000,
    "save_checkpoint_steps": 5000
  },
  "model": {
    "share_decoder_embeddings": false,
    "transformer_ff": 4096,
    "position_encoding_type": "SinusoidalInterleaved",
    "heads": 8,
    "share_embeddings": false,
    "hidden_size": 1024,
    "architecture": "transformer",
    "decoder": {
      "transformer_ff": 4096,
      "decoder_type": "transformer",
      "layers": 2,
      "position_encoding_type": "SinusoidalInterleaved",
      "heads": 8,
      "n_positions": null,
      "hidden_size": 1024,
      "tgt_word_vec_size": 1024
    },
    "embeddings": {
      "word_vec_size": 1024,
      "position_encoding_type": "SinusoidalInterleaved",
      "src_word_vec_size": 1024,
      "tgt_word_vec_size": 1024
    },
    "encoder": {
      "transformer_ff": 4096,
      "layers": 8,
      "position_encoding_type": "SinusoidalInterleaved",
      "heads": 8,
      "n_positions": null,
      "encoder_type": "transformer",
      "hidden_size": 1024,
      "src_word_vec_size": 1024
    }
  },
  "data": {
    "corpus_1": {
      "path_src": "train.es",
      "path_tgt": "train.en",
      "transforms": [
        "sentencepiece",
        "filtertoolong"
      ],
      "path_align": null
    },
    "valid": {
      "path_src": "dev.es",
      "path_tgt": "dev.en",
      "transforms": [
        "sentencepiece",
        "filtertoolong"
      ],
      "path_align": null
    }
  },
  "transforms_configs": {
    "filtertoolong": {
      "tgt_seq_length": 256,
      "src_seq_length": 256
    },
    "sentencepiece": {
      "tgt_subword_model": "${MODEL_PATH}/en.spm.model",
      "src_subword_model": "${MODEL_PATH}/es.spm.model"
    }
  }
}
eole-model/en.spm.model ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c26488c6db0bdca05f0e9e8edf43e8bdb4f78fc5c41c51749f88aefa6a1d030b
size 593820
eole-model/es.spm.model ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:515603821dd149cb66b99febbe4bbb05b9c7819943621d1f66c28ca2270a47e9
size 603700
eole-model/model.00.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9986aa5e396869b44721a504f83752570705bc23adccaba4345724d6fd2fc5e3
size 823882912
eole-model/vocab.json ADDED
The diff for this file is too large to render. See raw diff
 
model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:309bbb55ecb269d151a6cf72db8df85d8f28a5e79e0510f3b9cdcf2fdcac8cb8
size 401699775
source_vocabulary.json ADDED
The diff for this file is too large to render. See raw diff
 
src.spm.model ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:515603821dd149cb66b99febbe4bbb05b9c7819943621d1f66c28ca2270a47e9
size 603700
target_vocabulary.json ADDED
The diff for this file is too large to render. See raw diff
 
tgt.spm.model ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c26488c6db0bdca05f0e9e8edf43e8bdb4f78fc5c41c51749f88aefa6a1d030b
size 593820