radinplaid committed on
Commit 180f838 · verified · 1 Parent(s): baeb589

Upload folder using huggingface_hub

.ipynb_checkpoints/README-checkpoint.md ADDED
@@ -0,0 +1,157 @@
+ # `quickmt-zh-en` Neural Machine Translation Model
+
+ # Usage
+
+ ## Install `quickmt`
+
+ ```bash
+ git clone https://github.com/quickmt/quickmt.git
+ pip install ./quickmt/
+ ```
+
+ ## Download model
+
+ ```bash
+ quickmt-model-download quickmt/quickmt-zh-en ./quickmt-zh-en
+ ```
+
+ ## Use model
+
+ ```python
+ from quickmt import Translator
+
+ # Auto-detects GPU; set device="cpu" to force CPU inference
+ t = Translator("./quickmt-zh-en/", device="auto")
+
+ # Translate - set beam_size to 5 for higher quality (but slower speed)
+ t(["他补充道:“我们现在有 4 个月大没有糖尿病的老鼠,但它们曾经得过该病。”"], beam_size=1)
+
+ # Get alternative translations by sampling
+ # You can pass any CTranslate2 `translate_batch` arguments
+ t(["他补充道:“我们现在有 4 个月大没有糖尿病的老鼠,但它们曾经得过该病。”"], sampling_temperature=1.2, beam_size=1, sampling_topk=50, sampling_topp=0.9)
+ ```
+
+ # Model Information
+
+ * Trained using [`eole`](https://github.com/eole-nlp/eole)
+ * Exported for fast inference to [CTranslate2](https://github.com/OpenNMT/CTranslate2) format
+ * Training data: https://huggingface.co/datasets/quickmt/quickmt-train.zh-en/tree/main
+
+ ## Metrics
+
+ BLEU and CHRF2 are calculated with [sacrebleu](https://github.com/mjpost/sacrebleu) on the Flores200 `devtest` test set ("zho_Hans"->"eng_Latn").
+
+ | Model | BLEU | CHRF2 |
+ | ---- | ---- | ---- |
+ | quickmt/quickmt-zh-en | 28.58 | 57.46 |
+ | Helsinki-NLP/opus-mt-zh-en | 23.35 | 53.60 |
+ | facebook/m2m100_418M | 18.96 | 50.06 |
+ | facebook/m2m100_1.2B | 24.68 | 54.68 |
+ | facebook/nllb-200-distilled-600M | 26.22 | 55.17 |
+ | facebook/nllb-200-distilled-1.3B | 28.54 | 57.34 |
+ | google/madlad400-3b-mt | 28.74 | 58.01 |
+
+ ## Training Configuration
+
+ ```yaml
+ ## IO
+ save_data: zh_en/data_spm
+ overwrite: True
+ seed: 1234
+ report_every: 100
+ valid_metrics: ["BLEU"]
+ tensorboard: true
+ tensorboard_log_dir: tensorboard
+
+ ### Vocab
+ src_vocab: zh-en/src.eole.vocab
+ tgt_vocab: zh-en/tgt.eole.vocab
+ src_vocab_size: 20000
+ tgt_vocab_size: 20000
+ vocab_size_multiple: 8
+ share_vocab: False
+ n_sample: 0
+
+ data:
+   corpus_1:
+     path_src: hf://quickmt/quickmt-train-zh-en/zh
+     path_tgt: hf://quickmt/quickmt-train-zh-en/en
+     path_sco: hf://quickmt/quickmt-train-zh-en/sco
+
+   valid:
+     path_src: zh-en/dev.zho
+     path_tgt: zh-en/dev.eng
+
+ transforms: [sentencepiece, filtertoolong]
+ transforms_configs:
+   sentencepiece:
+     src_subword_model: "zh-en/src.spm.model"
+     tgt_subword_model: "zh-en/tgt.spm.model"
+   filtertoolong:
+     src_seq_length: 512
+     tgt_seq_length: 512
+
+ training:
+   # Run configuration
+   model_path: quickmt-zh-en
+   keep_checkpoint: 4
+   save_checkpoint_steps: 1000
+   train_steps: 200000
+   valid_steps: 1000
+
+   # Train on a single GPU
+   world_size: 1
+   gpu_ranks: [0]
+
+   # Batching
+   batch_type: "tokens"
+   batch_size: 13312
+   valid_batch_size: 13312
+   batch_size_multiple: 8
+   accum_count: [4]
+   accum_steps: [0]
+
+   # Optimizer & Compute
+   compute_dtype: "bfloat16"
+   optim: "pagedadamw8bit"
+   learning_rate: 1.0
+   warmup_steps: 10000
+   decay_method: "noam"
+   adam_beta2: 0.998
+
+   # Data loading
+   bucket_size: 262144
+   num_workers: 4
+   prefetch_factor: 100
+
+   # Hyperparams
+   dropout_steps: [0]
+   dropout: [0.1]
+   attention_dropout: [0.1]
+   max_grad_norm: 0
+   label_smoothing: 0.1
+   average_decay: 0.0001
+   param_init_method: xavier_uniform
+   normalization: "tokens"
+
+ model:
+   architecture: "transformer"
+   layer_norm: standard
+   share_embeddings: false
+   share_decoder_embeddings: true
+   add_ffnbias: true
+   mlp_activation_fn: gated-silu
+   add_estimator: false
+   add_qkvbias: false
+   norm_eps: 1e-6
+   hidden_size: 1024
+   encoder:
+     layers: 8
+   decoder:
+     layers: 2
+   heads: 16
+   transformer_ff: 4096
+   embeddings:
+     word_vec_size: 1024
+     position_encoding_type: "SinusoidalInterleaved"
+ ```
README.md CHANGED
@@ -1,3 +1,157 @@
- ---
- license: cc-by-4.0
- ---
+ # `quickmt-zh-en` Neural Machine Translation Model
+
+ # Usage
+
+ ## Install `quickmt`
+
+ ```bash
+ git clone https://github.com/quickmt/quickmt.git
+ pip install ./quickmt/
+ ```
+
+ ## Download model
+
+ ```bash
+ quickmt-model-download quickmt/quickmt-zh-en ./quickmt-zh-en
+ ```
+
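+ If you prefer, you can also fetch the model folder directly with `huggingface_hub` (a minimal sketch, separate from the `quickmt-model-download` helper; the `local_dir` path is just an example):
+
+ ```python
+ # Sketch: download the repository snapshot with huggingface_hub instead of the quickmt CLI.
+ from huggingface_hub import snapshot_download
+
+ snapshot_download(repo_id="quickmt/quickmt-zh-en", local_dir="./quickmt-zh-en")
+ ```
+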
+ ## Use model
+
+ Inference with `quickmt`:
+
+ ```python
+ from quickmt import Translator
+
+ # Auto-detects GPU; set device="cpu" to force CPU inference
+ t = Translator("./quickmt-zh-en/", device="auto")
+
+ # Translate - set beam_size to 5 for higher quality (but slower speed)
+ t(["他补充道:“我们现在有 4 个月大没有糖尿病的老鼠,但它们曾经得过该病。”"], beam_size=1)
+
+ # Get alternative translations by sampling
+ # You can pass any CTranslate2 `translate_batch` arguments
+ t(["他补充道:“我们现在有 4 个月大没有糖尿病的老鼠,但它们曾经得过该病。”"], sampling_temperature=1.2, beam_size=1, sampling_topk=50, sampling_topp=0.9)
+ ```
+
+ The model is in `ctranslate2` format and the tokenizers are `sentencepiece` models, so you can use the model files directly if you prefer; a sketch of direct CTranslate2 inference follows below. It should also be straightforward to use them with, for example, [LibreTranslate](https://libretranslate.com/), which likewise uses `ctranslate2` and `sentencepiece`.
+
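+ For illustration, here is a minimal sketch of using the exported files directly with the `ctranslate2` and `sentencepiece` Python packages (file names are the ones shipped in this repository; install both packages first):
+
+ ```python
+ # Sketch: direct inference without quickmt, using ctranslate2 + sentencepiece.
+ import ctranslate2
+ import sentencepiece as spm
+
+ translator = ctranslate2.Translator("./quickmt-zh-en", device="cpu")
+ sp_src = spm.SentencePieceProcessor(model_file="./quickmt-zh-en/src.spm.model")
+ sp_tgt = spm.SentencePieceProcessor(model_file="./quickmt-zh-en/tgt.spm.model")
+
+ src = "他补充道:“我们现在有 4 个月大没有糖尿病的老鼠,但它们曾经得过该病。”"
+ tokens = sp_src.encode(src, out_type=str)               # subword-tokenize the source
+ results = translator.translate_batch([tokens], beam_size=5)
+ translation = sp_tgt.decode(results[0].hypotheses[0])   # detokenize the best hypothesis
+ print(translation)
+ ```
+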
+ # Model Information
+
+ * Trained using [`eole`](https://github.com/eole-nlp/eole)
+   - Training took about 1 day on a single RTX 4090 on [vast.ai](https://cloud.vast.ai)
+ * Exported for fast inference to [CTranslate2](https://github.com/OpenNMT/CTranslate2) format
+ * Training data: https://huggingface.co/datasets/quickmt/quickmt-train.zh-en/tree/main
+
+ ## Metrics
+
+ BLEU and CHRF2 are calculated with [sacrebleu](https://github.com/mjpost/sacrebleu) on the Flores200 `devtest` test set ("zho_Hans"->"eng_Latn"); a scoring sketch follows below.
+
+ "Time" is the time taken to translate the following input with a single CPU core:
+
+ > 2019冠状病毒病(英語:Coronavirus disease 2019,缩写:COVID-19[17][18]),是一種由嚴重急性呼吸系統綜合症冠狀病毒2型(縮寫:SARS-CoV-2)引發的傳染病,导致了一场持续的疫情,成为人類歷史上致死人數最多的流行病之一。
+
+ | Model | BLEU | CHRF2 | Time (s) |
+ | -------------------------------- | ----- | ----- | -------- |
+ | quickmt/quickmt-zh-en | 28.58 | 57.46 | 0.670 |
+ | Helsinki-NLP/opus-mt-zh-en | 23.35 | 53.60 | 0.838 |
+ | facebook/m2m100_418M | 18.96 | 50.06 | 11.5 |
+ | facebook/nllb-200-distilled-600M | 26.22 | 55.17 | 13.2 |
+ | facebook/nllb-200-distilled-1.3B | 28.54 | 57.34 | 23.6 |
+ | facebook/m2m100_1.2B | 24.68 | 54.68 | 25.7 |
+ | google/madlad400-3b-mt | 28.74 | 58.01 | ??? |
+
+ `quickmt-zh-en` is the fastest model in this comparison and delivers fairly high translation quality.
+
+ Helsinki-NLP/opus-mt-zh-en is one of the most downloaded machine translation models on Hugging Face; `quickmt-zh-en` is considerably more accurate *and* a bit faster.
+
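+ As a rough illustration, scores of this kind can be computed with the `sacrebleu` Python API (a sketch only; the hypothesis and reference file names are hypothetical, and the Flores200 `devtest` source and reference files must be obtained separately):
+
+ ```python
+ # Sketch: score detokenized model output against Flores200 devtest references.
+ from sacrebleu.metrics import BLEU, CHRF
+
+ # Hypothetical files: one sentence per line, hypotheses aligned with references.
+ hyps = open("flores200.devtest.hyp.en", encoding="utf-8").read().splitlines()
+ refs = open("flores200.devtest.eng_Latn", encoding="utf-8").read().splitlines()
+
+ bleu = BLEU()   # default sacrebleu BLEU settings
+ chrf = CHRF()   # default CHRF is chrF2 (beta = 2)
+ print(bleu.corpus_score(hyps, [refs]))
+ print(chrf.corpus_score(hyps, [refs]))
+ ```
+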
+ ## Training Configuration
+
+ ```yaml
+ ### Vocab
+ src_vocab_size: 20000
+ tgt_vocab_size: 20000
+ share_vocab: False
+
+ data:
+   corpus_1:
+     path_src: hf://quickmt/quickmt-train-zh-en/zh
+     path_tgt: hf://quickmt/quickmt-train-zh-en/en
+     path_sco: hf://quickmt/quickmt-train-zh-en/sco
+   valid:
+     path_src: zh-en/dev.zho
+     path_tgt: zh-en/dev.eng
+
+ transforms: [sentencepiece, filtertoolong]
+ transforms_configs:
+   sentencepiece:
+     src_subword_model: "zh-en/src.spm.model"
+     tgt_subword_model: "zh-en/tgt.spm.model"
+   filtertoolong:
+     src_seq_length: 512
+     tgt_seq_length: 512
+
+ training:
+   # Run configuration
+   model_path: quickmt-zh-en
+   keep_checkpoint: 4
+   save_checkpoint_steps: 1000
+   train_steps: 104000
+   valid_steps: 1000
+
+   # Train on a single GPU
+   world_size: 1
+   gpu_ranks: [0]
+
+   # Batching
+   batch_type: "tokens"
+   batch_size: 13312
+   valid_batch_size: 13312
+   batch_size_multiple: 8
+   accum_count: [4]
+   accum_steps: [0]
+
+   # Optimizer & Compute
+   compute_dtype: "bfloat16"
+   optim: "pagedadamw8bit"
+   learning_rate: 1.0
+   warmup_steps: 10000
+   decay_method: "noam"
+   adam_beta2: 0.998
+
+   # Data loading
+   bucket_size: 262144
+   num_workers: 4
+   prefetch_factor: 100
+
+   # Hyperparams
+   dropout_steps: [0]
+   dropout: [0.1]
+   attention_dropout: [0.1]
+   max_grad_norm: 0
+   label_smoothing: 0.1
+   average_decay: 0.0001
+   param_init_method: xavier_uniform
+   normalization: "tokens"
+
+ model:
+   architecture: "transformer"
+   layer_norm: standard
+   share_embeddings: false
+   share_decoder_embeddings: true
+   add_ffnbias: true
+   mlp_activation_fn: gated-silu
+   add_estimator: false
+   add_qkvbias: false
+   norm_eps: 1e-6
+   hidden_size: 1024
+   encoder:
+     layers: 8
+   decoder:
+     layers: 2
+   heads: 16
+   transformer_ff: 4096
+   embeddings:
+     word_vec_size: 1024
+     position_encoding_type: "SinusoidalInterleaved"
+ ```
config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "add_source_bos": false,
+   "add_source_eos": false,
+   "bos_token": "<s>",
+   "decoder_start_token": "<s>",
+   "eos_token": "</s>",
+   "layer_norm_epsilon": 1e-06,
+   "multi_query_attention": false,
+   "unk_token": "<unk>"
+ }
model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:408f5484e3d983d52cabf867241f10e0159e4017b7cb05718fa580ab0f081b86
+ size 444765910
source_vocabulary.json ADDED
The diff for this file is too large to render. See raw diff
 
src.spm.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0631c1a3d400ac4f42c8d63fb94ae71c69ee00acd4648c05eb02d952e7f7d0ef
+ size 538185
target_vocabulary.json ADDED
The diff for this file is too large to render. See raw diff
 
tgt.spm.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:83dcd0d44ad898117ae6c7fe24d996f186940d97265c4e91a78e3e07f657bc9e
+ size 589008