radinplaid committed on
Commit 5340c65 · verified · 1 Parent(s): 65d61c3

Upload folder using huggingface_hub

.ipynb_checkpoints/README-checkpoint.md ADDED
README.md CHANGED
@@ -1,3 +1,98 @@
- ---
- license: cc-by-4.0
- ---
---
language:
- en
- es
tags:
- translation
license: cc-by-4.0
datasets:
- quickmt/quickmt-train.es-en
model-index:
- name: quickmt-es-en
  results:
  - task:
      name: Translation spa-eng
      type: translation
      args: spa-eng
    dataset:
      name: flores101-devtest
      type: flores_101
      args: spa_Latn eng_Latn devtest
    metrics:
    - name: BLEU
      type: bleu
      value: 28.64
    - name: CHRF
      type: chrf
      value: 58.61
    - name: COMET
      type: comet
      value: 86.11
---

# `quickmt-es-en` Neural Machine Translation Model

`quickmt-es-en` is a reasonably fast and reasonably accurate neural machine translation model for translation from `es` into `en`.

## Model Information

* Trained using [`eole`](https://github.com/eole-nlp/eole)
* 185M parameter 'big' transformer with 8 encoder layers and 2 decoder layers
* 50k joint SentencePiece vocabulary
* Exported to [CTranslate2](https://github.com/OpenNMT/CTranslate2) format for fast inference
* Training data: https://huggingface.co/datasets/quickmt/quickmt-train.es-en/tree/main

See the `eole` model configuration in this repository for further details, and the `eole-model` directory for the raw `eole` (PyTorch) model.

## Usage with `quickmt`

Install the NVIDIA CUDA toolkit first if you want to run GPU inference.

Next, install the `quickmt` Python library and download the model:

```bash
git clone https://github.com/quickmt/quickmt.git
pip install ./quickmt/

quickmt-model-download quickmt/quickmt-es-en ./quickmt-es-en
```

Finally, use the model in Python:

```python
from quickmt import Translator

# Auto-detects GPU; set device="cpu" to force CPU inference
t = Translator("./quickmt-es-en/", device="auto")

# Translate - set beam_size to 1 for faster (but lower-quality) output
sample_text = 'La investigación todavía se ubica en su etapa inicial, conforme indicara el Dr. Ehud Ur, docente en la carrera de medicina de la Universidad de Dalhousie, en Halifax, Nueva Escocia, y director del departamento clínico y científico de la Asociación Canadiense de Diabetes.'
t(sample_text, beam_size=5)

> 'The research is still in its early stages, as indicated by Dr. Ehud Ur, a medical professor at the University of Dalhousie, Halifax, Nova Scotia, and director of the clinical and scientific department of the Canadian Diabetes Association.'

# Get alternative translations by sampling
# You can pass any CTranslate2 `translate_batch` arguments
t([sample_text], sampling_temperature=1.2, beam_size=1, sampling_topk=50, sampling_topp=0.9)

> 'The research is still in its initial stages as instructed by Dr. Ehud Ur, a professor at the medical degree, University of Dalhousie, Halifax, Nova Scotia, and director of the clinical and scientific department of the Canadian Diabetes Association.'
```

The model is in `ctranslate2` format and the tokenizers are `sentencepiece` models, so you can use `ctranslate2` directly instead of going through `quickmt`. It is also possible to use this model with, e.g., [LibreTranslate](https://libretranslate.com/), which also uses `ctranslate2` and `sentencepiece`.
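
For example, here is a minimal sketch of calling `ctranslate2` and `sentencepiece` directly. The paths assume the model was downloaded to `./quickmt-es-en` as above (the `src.spm.model` and `tgt.spm.model` files ship with this repository); `device="cpu"` is used here only so the snippet runs without a GPU:

```python
# Minimal sketch: bypass quickmt and call CTranslate2 + SentencePiece directly.
import ctranslate2
import sentencepiece as spm

translator = ctranslate2.Translator("./quickmt-es-en", device="cpu")
sp_src = spm.SentencePieceProcessor(model_file="./quickmt-es-en/src.spm.model")
sp_tgt = spm.SentencePieceProcessor(model_file="./quickmt-es-en/tgt.spm.model")

src = "La investigación todavía se ubica en su etapa inicial."
tokens = sp_src.encode(src, out_type=str)       # subword-tokenize the source
result = translator.translate_batch([tokens], beam_size=5)
print(sp_tgt.decode(result[0].hypotheses[0]))   # detokenize the best hypothesis
```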

## Metrics

`bleu` and `chrf2` are calculated with [sacrebleu](https://github.com/mjpost/sacrebleu) on the [Flores200 `devtest` test set](https://huggingface.co/datasets/facebook/flores) ("spa_Latn" -> "eng_Latn"). `comet22` is calculated with the [`comet`](https://github.com/Unbabel/COMET) library and the [default model](https://huggingface.co/Unbabel/wmt22-comet-da). "Time (s)" is the time in seconds to translate the flores-devtest dataset (1012 sentences) on an RTX 4070s GPU with batch size 32 (higher speed is possible with a larger batch size).

|                                  |   bleu |   chrf2 |   comet22 |   Time (s) |
|:---------------------------------|-------:|--------:|----------:|-----------:|
| quickmt/quickmt-es-en            |  28.64 |   58.61 |     86.11 |       1.33 |
| Helsinki-NLP/opus-mt-es-en       |  27.62 |   58.38 |     86.01 |       3.67 |
| facebook/nllb-200-distilled-600M |  30.02 |   59.71 |     86.55 |      21.99 |
| facebook/nllb-200-distilled-1.3B |  31.58 |   60.96 |     87.25 |      38.2  |
| facebook/m2m100_418M             |  22.85 |   55.04 |     82.9  |      18.83 |
| facebook/m2m100_1.2B             |  26.84 |   57.69 |     85.47 |      36.22 |
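
As a rough sketch of how these scores can be reproduced (the `srcs`/`hyps`/`refs` lists below are tiny placeholders standing in for the flores-devtest sources, system outputs, and references):

```python
# Sketch: score translations with sacrebleu (BLEU, chrF2) and COMET-22.
from sacrebleu.metrics import BLEU, CHRF
from comet import download_model, load_from_checkpoint

srcs = ["La investigación todavía se ubica en su etapa inicial."]
hyps = ["The research is still in its early stages."]
refs = ["Research is still in its early stages."]

print(BLEU().corpus_score(hyps, [refs]).score)   # corpus BLEU
print(CHRF().corpus_score(hyps, [refs]).score)   # chrF2 (sacrebleu default)

# COMET-22 also needs the source sentences
comet = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{"src": s, "mt": h, "ref": r} for s, h, r in zip(srcs, hyps, refs)]
print(comet.predict(data, batch_size=8).system_score)
```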
config.json ADDED
@@ -0,0 +1,10 @@
{
  "add_source_bos": false,
  "add_source_eos": false,
  "bos_token": "<s>",
  "decoder_start_token": "<s>",
  "eos_token": "</s>",
  "layer_norm_epsilon": 1e-06,
  "multi_query_attention": false,
  "unk_token": "<unk>"
}
eole-config.yaml ADDED
@@ -0,0 +1,99 @@
## IO
save_data: data
overwrite: True
seed: 1234
report_every: 100
valid_metrics: ["BLEU"]
tensorboard: true
tensorboard_log_dir: tensorboard

### Vocab
src_vocab: es.eole.vocab
tgt_vocab: en.eole.vocab
src_vocab_size: 20000
tgt_vocab_size: 20000
vocab_size_multiple: 8
share_vocab: false
n_sample: 0

data:
  corpus_1:
    # path_src: hf://quickmt/quickmt-train.es-en/es
    # path_tgt: hf://quickmt/quickmt-train.es-en/en
    # path_sco: hf://quickmt/quickmt-train.es-en/sco
    path_src: train.es
    path_tgt: train.en
  valid:
    path_src: dev.es
    path_tgt: dev.en

transforms: [sentencepiece, filtertoolong]
transforms_configs:
  sentencepiece:
    src_subword_model: "es.spm.model"
    tgt_subword_model: "en.spm.model"
  filtertoolong:
    src_seq_length: 256
    tgt_seq_length: 256

training:
  # Run configuration
  model_path: quickmt-es-en-eole-model
  train_from: quickmt-es-en-eole-model
  #train_from: model
  keep_checkpoint: 4
  train_steps: 100000
  save_checkpoint_steps: 5000
  valid_steps: 5000

  # Train on a single GPU
  world_size: 1
  gpu_ranks: [0]

  # Batching 10240
  batch_type: "tokens"
  batch_size: 6400
  valid_batch_size: 4096
  batch_size_multiple: 8
  accum_count: [12]
  accum_steps: [0]

  # Optimizer & Compute
  compute_dtype: "fp16"
  optim: "adamw"
  #use_amp: False
  learning_rate: 2.0
  warmup_steps: 4000
  decay_method: "noam"
  adam_beta2: 0.998

  # Data loading
  bucket_size: 128000
  num_workers: 4
  prefetch_factor: 32

  # Hyperparams
  dropout_steps: [0]
  dropout: [0.1]
  attention_dropout: [0.1]
  max_grad_norm: 0
  label_smoothing: 0.1
  average_decay: 0.0001
  param_init_method: xavier_uniform
  normalization: "tokens"

model:
  architecture: "transformer"
  share_embeddings: false
  share_decoder_embeddings: false
  hidden_size: 1024
  encoder:
    layers: 8
  decoder:
    layers: 2
  heads: 8
  transformer_ff: 4096
  embeddings:
    word_vec_size: 1024
    position_encoding_type: "SinusoidalInterleaved"
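
For reference, a small sketch of the "noam" learning-rate schedule named in this config, plugged with its values (learning_rate: 2.0, warmup_steps: 4000, hidden_size: 1024). This is the standard Transformer inverse-square-root schedule; eole's exact implementation may differ in minor details:

```python
# Inverse-square-root ("noam") schedule with linear warmup.
def noam_lr(step: int, base_lr: float = 2.0, hidden_size: int = 1024,
            warmup_steps: int = 4000) -> float:
    step = max(step, 1)
    return base_lr * hidden_size ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Rises during warmup, peaks around step 4000, then decays as 1/sqrt(step).
print(noam_lr(100), noam_lr(4000), noam_lr(100000))
```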
eole-model/config.json ADDED
@@ -0,0 +1,133 @@
{
  "tgt_vocab": "en.eole.vocab",
  "n_sample": 0,
  "overwrite": true,
  "valid_metrics": [
    "BLEU"
  ],
  "tgt_vocab_size": 20000,
  "tensorboard": true,
  "tensorboard_log_dir_dated": "tensorboard/Apr-28_20-08-59",
  "vocab_size_multiple": 8,
  "src_vocab_size": 20000,
  "save_data": "data",
  "share_vocab": false,
  "src_vocab": "es.eole.vocab",
  "transforms": [
    "sentencepiece",
    "filtertoolong"
  ],
  "tensorboard_log_dir": "tensorboard",
  "report_every": 100,
  "seed": 1234,
  "training": {
    "average_decay": 0.0001,
    "accum_steps": [
      0
    ],
    "accum_count": [
      12
    ],
    "attention_dropout": [
      0.1
    ],
    "train_steps": 100000,
    "warmup_steps": 4000,
    "normalization": "tokens",
    "bucket_size": 128000,
    "compute_dtype": "torch.float16",
    "max_grad_norm": 0.0,
    "batch_type": "tokens",
    "valid_batch_size": 4096,
    "optim": "adamw",
    "world_size": 1,
    "dropout_steps": [
      0
    ],
    "adam_beta2": 0.998,
    "train_from": "quickmt-es-en-eole-model",
    "gpu_ranks": [
      0
    ],
    "learning_rate": 2.0,
    "num_workers": 0,
    "dropout": [
      0.1
    ],
    "batch_size_multiple": 8,
    "label_smoothing": 0.1,
    "batch_size": 6400,
    "model_path": "quickmt-es-en-eole-model",
    "param_init_method": "xavier_uniform",
    "keep_checkpoint": 4,
    "prefetch_factor": 32,
    "decay_method": "noam",
    "valid_steps": 5000,
    "save_checkpoint_steps": 5000
  },
  "model": {
    "share_decoder_embeddings": false,
    "transformer_ff": 4096,
    "position_encoding_type": "SinusoidalInterleaved",
    "heads": 8,
    "share_embeddings": false,
    "hidden_size": 1024,
    "architecture": "transformer",
    "decoder": {
      "transformer_ff": 4096,
      "decoder_type": "transformer",
      "layers": 2,
      "position_encoding_type": "SinusoidalInterleaved",
      "heads": 8,
      "n_positions": null,
      "hidden_size": 1024,
      "tgt_word_vec_size": 1024
    },
    "embeddings": {
      "word_vec_size": 1024,
      "position_encoding_type": "SinusoidalInterleaved",
      "src_word_vec_size": 1024,
      "tgt_word_vec_size": 1024
    },
    "encoder": {
      "transformer_ff": 4096,
      "layers": 8,
      "position_encoding_type": "SinusoidalInterleaved",
      "heads": 8,
      "n_positions": null,
      "encoder_type": "transformer",
      "hidden_size": 1024,
      "src_word_vec_size": 1024
    }
  },
  "data": {
    "corpus_1": {
      "path_src": "train.es",
      "path_tgt": "train.en",
      "transforms": [
        "sentencepiece",
        "filtertoolong"
      ],
      "path_align": null
    },
    "valid": {
      "path_src": "dev.es",
      "path_tgt": "dev.en",
      "transforms": [
        "sentencepiece",
        "filtertoolong"
      ],
      "path_align": null
    }
  },
  "transforms_configs": {
    "filtertoolong": {
      "tgt_seq_length": 256,
      "src_seq_length": 256
    },
    "sentencepiece": {
      "tgt_subword_model": "${MODEL_PATH}/en.spm.model",
      "src_subword_model": "${MODEL_PATH}/es.spm.model"
    }
  }
}
eole-model/en.spm.model ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c26488c6db0bdca05f0e9e8edf43e8bdb4f78fc5c41c51749f88aefa6a1d030b
size 593820
eole-model/es.spm.model ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:515603821dd149cb66b99febbe4bbb05b9c7819943621d1f66c28ca2270a47e9
size 603700
eole-model/model.00.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9986aa5e396869b44721a504f83752570705bc23adccaba4345724d6fd2fc5e3
size 823882912
eole-model/vocab.json ADDED
The diff for this file is too large to render. See raw diff
 
model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:309bbb55ecb269d151a6cf72db8df85d8f28a5e79e0510f3b9cdcf2fdcac8cb8
size 401699775
source_vocabulary.json ADDED
The diff for this file is too large to render. See raw diff
 
src.spm.model ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:515603821dd149cb66b99febbe4bbb05b9c7819943621d1f66c28ca2270a47e9
size 603700
target_vocabulary.json ADDED
The diff for this file is too large to render. See raw diff
 
tgt.spm.model ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c26488c6db0bdca05f0e9e8edf43e8bdb4f78fc5c41c51749f88aefa6a1d030b
size 593820