MarianMT Biblical Hebrew Vocalization Model

A fine-tuned MarianMT model for automatic Biblical Hebrew vocalization, converting consonantal (unvocalized) Biblical Hebrew text to fully vocalized text with niqqud (vowel marks).

Model Description

This model is fine-tuned from Helsinki-NLP/opus-mt-sem-sem to perform Biblical Hebrew vocalization—the task of adding niqqud (vowel signs) to consonantal Biblical Hebrew text. The model is trained in a single direction: consonantal → vocalized.

Key Features

  • Single-direction model: Converts consonantal Biblical Hebrew (>>heb_cons<<) to vocalized Biblical Hebrew (>>heb_voc<<)
  • Leverages pretrained Biblical Hebrew tokenization: Built on a model that already includes >>heb<< tokenization
  • High performance: Achieves 50.74 BLEU, 86.31 chrF, and 68.89% character accuracy on the test set
  • Biblical text optimized: Trained on Biblical Hebrew texts for accurate vocalization
  • Maqaf handling: Preserves maqaf (־) in the vocalized output; converts it to a space in consonantal input

Model Details

Model Information

  • Architecture: MarianMT (Transformer-based sequence-to-sequence)
  • Base Model: Helsinki-NLP/opus-mt-sem-sem
  • Parameters: 61,918,208 (~62M)
  • Vocabulary Size: 33,702 tokens
  • Language Tags:
    • Source: >>heb_cons<< (consonantal Biblical Hebrew)
    • Target: >>heb_voc<< (vocalized Biblical Hebrew)
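
You can quickly confirm that both tags are present in the tokenizer vocabulary (an unregistered string would fall back to the tokenizer's unk id):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("johnlockejrr/marianmt_heb_voc")

# Tags registered in the vocabulary get a real id, not unk_token_id
for tag in (">>heb_cons<<", ">>heb_voc<<"):
    tag_id = tokenizer.convert_tokens_to_ids(tag)
    print(tag, tag_id, tag_id != tokenizer.unk_token_id)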

Training Data

  • Source: Biblical Hebrew texts (vocalized text from which consonantal forms are derived)
  • Dataset Format: CSV with book|chapter|verse|content columns, where content holds the vocalized Biblical Hebrew verse
  • Text Processing:
    • Consonantal: Removes niqqud, cantillation, punctuation; converts maqaf to space
    • Vocalized: Keeps Hebrew letters, niqqud marks, and maqaf; removes other punctuation

Training Configuration

  • Batch Size: 8
  • Effective Batch Size: 32 (batch size 8 × 4 gradient accumulation steps)
  • Learning Rate: 1e-5
  • Max Input/Target Length: 384 tokens
  • Training Steps: 54,000
  • Epochs: 86.4
  • Optimizer: AdamW with cosine learning rate schedule
  • Precision: bfloat16
  • Early Stopping: patience of 5 evaluations without improvement
  • Best Checkpoint: Step 49,000

Performance

Best Validation Metrics (Step 49,000)

  • BLEU: 51.95
  • chrF: 86.95
  • Character Accuracy: 68.22%
  • Validation Loss: 0.1393

Final Test Metrics

  • BLEU: 50.74
  • chrF: 86.31
  • Character Accuracy: 68.89%
  • Test Loss: 0.1430

Usage

Direct Usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("johnlockejrr/marianmt_heb_voc")
model = AutoModelForSeq2SeqLM.from_pretrained("johnlockejrr/marianmt_heb_voc")

# Input: consonantal Biblical Hebrew text
text = "בראשית ברא אלהים את השמים ואת הארץ"

# Add language tag
input_text = f">>heb_cons<< {text}"

# Tokenize
inputs = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True, max_length=384)

# Generate
outputs = model.generate(**inputs, max_length=384, num_beams=4, length_penalty=0.6)

# Decode
vocalized = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(vocalized)

Using the Pipeline

from transformers import pipeline

vocalizer = pipeline("text2text-generation", 
                     model="johnlockejrr/marianmt_heb_voc",
                     tokenizer="johnlockejrr/marianmt_heb_voc")

# Input text (consonantal)
text = "בראשית ברא אלהים את השמים ואת הארץ"
input_text = f">>heb_cons<< {text}"

# Vocalize
result = vocalizer(input_text, max_length=384, num_beams=4, length_penalty=0.6)
print(result[0]['generated_text'])

Text Normalization

The model was trained on text in NFC (Normalization Form C) Unicode. The tokenizer's built-in normalization generally handles minor variations, but for best results, ensure your input is NFC-normalized before tokenization:

import unicodedata

def normalize_text(text: str) -> str:
    """Normalize text to NFC format."""
    return unicodedata.normalize("NFC", text)

# Normalize input before processing
text = normalize_text("בראשית ברא אלהים")

Input Cleaning

For optimal results, input text should contain only consonantal Biblical Hebrew characters. Matching the training preprocessing, clean inference input as follows (sketched after this list):

  • Remove niqqud (vowel marks)
  • Remove cantillation marks
  • Convert maqaf (־) to a space
  • Keep only Hebrew letters and spaces
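
A minimal implementation of this cleaning; the Unicode ranges used (U+0591–U+05C7 for marks, U+05D0–U+05EA for letters, U+05BE for maqaf) are the standard Hebrew blocks, though the exact ranges used in training are an assumption:

import re
import unicodedata

MAQAF = "\u05BE"
HEBREW_LETTERS = r"\u05D0-\u05EA"  # letters only; niqqud/cantillation live in U+0591–U+05C7

def to_consonantal(text: str) -> str:
    """Strip a verse down to bare consonants and spaces."""
    text = unicodedata.normalize("NFC", text)
    text = text.replace(MAQAF, " ")                    # maqaf -> space
    text = re.sub(rf"[^{HEBREW_LETTERS} ]", "", text)  # drop marks and punctuation
    return re.sub(r"\s+", " ", text).strip()

print(to_consonantal("בְּרֵאשִׁית בָּרָא אֱלֹהִים"))  # בראשית ברא אלהים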

Generation Parameters

Recommended generation parameters (combined in the sketch after this list):

  • num_beams: 4 (beam search for better quality)
  • length_penalty: 0.6 (beam-search length normalization; milder bias toward long outputs than the default 1.0)
  • early_stopping: True
  • max_length: 384 (matches training configuration)
  • do_sample: False (deterministic generation)
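
Combined with the objects from the Direct Usage example above:

# Assumes `model`, `tokenizer`, and `inputs` from the Direct Usage example
outputs = model.generate(
    **inputs,
    num_beams=4,          # beam search for better quality
    length_penalty=0.6,
    early_stopping=True,  # finish once enough beams are complete
    max_length=384,       # matches the training configuration
    do_sample=False,      # deterministic generation
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))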

Limitations and Bias

  • Domain Specificity: This model is trained primarily on Biblical Hebrew texts. Performance may vary on other domains (e.g., Modern Hebrew/Ivrit, non-biblical poetry and prose).
  • Single Direction: The model only vocalizes consonantal text. It does not perform the reverse operation (removing vocalization).
  • Length Constraints: Maximum input/output length is 384 tokens. Longer texts should be split into smaller segments (see the sketch after this list).
  • Character Accuracy: Character-level accuracy is ~69%, meaning some niqqud marks may be missing or incorrect in complex cases.
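
One way to stay under the 384-token limit is to split long passages before generation. The rough sketch below packs whitespace-separated words greedily; the helper name and safety margin are assumptions, not part of the released code, and when verse boundaries are available, splitting on them is the more natural choice:

def split_for_generation(text, tokenizer, max_tokens=384, margin=10):
    """Greedily pack words into chunks that stay under the token limit,
    counting the >>heb_cons<< tag against the budget."""
    budget = max_tokens - margin
    chunks, current = [], []
    for word in text.split():
        candidate = ">>heb_cons<< " + " ".join(current + [word])
        if current and len(tokenizer(candidate).input_ids) > budget:
            chunks.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks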

Training Procedure

Training Infrastructure

  • Hardware: GPU (CUDA)
  • Training Time: ~4.75 hours (17,110 seconds)
  • Framework: Hugging Face Transformers
  • Evaluation Frequency: Every 1,000 steps

Preprocessing

  • Text normalized to NFC Unicode format
  • Language tags (>>heb_cons<< and >>heb_voc<<) added to the tokenizer vocabulary (sketched below)
  • Tokenization using SentencePiece (inherited from base model)
  • Consonantal text: niqqud removed, maqaf converted to space
  • Vocalized text: niqqud and maqaf preserved
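
The training script itself is not reproduced in this card, but registering such tags typically looks like the following sketch:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-sem-sem")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-sem-sem")

# Add the new language tags and grow the embedding matrix to match
num_added = tokenizer.add_tokens([">>heb_cons<<", ">>heb_voc<<"])
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))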

Hyperparameters

{
  "learning_rate": 1e-5,
  "batch_size": 8,
  "gradient_accumulation_steps": 4,
  "num_epochs": 100,
  "max_input_length": 384,
  "max_target_length": 384,
  "warmup_steps": 1000,
  "weight_decay": 0.01,
  "eval_steps": 1000,
  "save_steps": 1000,
  "save_total_limit": 3
}
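
For reference, these values map onto transformers training arguments roughly as follows. This is a sketch rather than the exact training script; output_dir, eval_strategy, and the bf16 flag are assumptions based on the entries above:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="marianmt_heb_voc",    # assumed name
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,    # effective batch size 32
    num_train_epochs=100,
    lr_scheduler_type="cosine",
    warmup_steps=1000,
    weight_decay=0.01,
    eval_strategy="steps",
    eval_steps=1000,
    save_steps=1000,
    save_total_limit=3,
    bf16=True,                        # bfloat16 precision
    load_best_model_at_end=True,      # required for early stopping
    predict_with_generate=True,
)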

Evaluation

The model is evaluated using three metrics (a computation sketch follows the list):

  1. BLEU Score: Measures n-gram precision between generated and reference text
  2. chrF Score: Character-level F-score, more lenient than BLEU
  3. Character Accuracy: Exact character match percentage
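
A sketch of computing all three with sacrebleu; the character-accuracy definition here (position-wise matching against the reference) is an assumption, and the training-time implementation may differ:

import sacrebleu

def character_accuracy(pred: str, ref: str) -> float:
    """Fraction of reference positions whose character the prediction matches."""
    matches = sum(p == r for p, r in zip(pred, ref))
    return matches / max(len(ref), 1)

preds = ["בְּרֵאשִׁית בָּרָא אֱלֹהִים"]
refs = [["בְּרֵאשִׁית בָּרָא אֱלֹהִים"]]  # one reference stream

print(f"BLEU {sacrebleu.corpus_bleu(preds, refs).score:.2f}")
print(f"chrF {sacrebleu.corpus_chrf(preds, refs).score:.2f}")
print(f"Char Acc {character_accuracy(preds[0], refs[0][0]):.2%}")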

Evaluation Results

Metric     Validation (Best)   Test (Final)
BLEU       51.95               50.74
chrF       86.95               86.31
Char Acc   68.22%              68.89%
Loss       0.1393              0.1430

Citation

If you use this model, please cite:

@misc{marianmt_heb_voc,
  title={MarianMT Biblical Hebrew Vocalization Model},
  author={johnlockejrr},
  year={2025},
  howpublished={\url{https://huggingface.co/johnlockejrr/marianmt_heb_voc}},
  note={Fine-tuned from Helsinki-NLP/opus-mt-sem-sem}
}

Model Card Contact

For questions, issues, or contributions, please open an issue on the model repository.

License

This model is released under the Apache 2.0 license, consistent with the base model.
