MarianMT Biblical Hebrew Vocalization Model

A fine-tuned MarianMT model for automatic Biblical Hebrew vocalization, converting consonantal (unvocalized) Biblical Hebrew text to fully vocalized text with niqqud (vowel marks).

Model Description

This model is fine-tuned from Helsinki-NLP/opus-mt-sem-sem to perform Biblical Hebrew vocalization—the task of adding niqqud (vowel signs) to consonantal Biblical Hebrew text. The model is trained in a single direction: consonantal → vocalized.

Key Features

  • Single-direction model: Converts consonantal Biblical Hebrew (>>heb_cons<<) to vocalized Biblical Hebrew (>>heb_voc<<)
  • Leverages pretrained Biblical Hebrew tokenization: Built on a model that already includes >>heb<< tokenization
  • High performance: Achieves 50.74 BLEU, 86.31 chrF, and 68.89% character accuracy on the test set
  • Biblical text optimized: Trained on Biblical Hebrew texts for accurate vocalization
  • Maqaf handling: Preserves maqaf (־) in the vocalized output; converts it to a space in consonantal input

Model Details

Model Information

  • Architecture: MarianMT (Transformer-based sequence-to-sequence)
  • Base Model: Helsinki-NLP/opus-mt-sem-sem
  • Parameters: 61,918,208 (~62M)
  • Vocabulary Size: 33,702 tokens
  • Language Tags:
    • Source: >>heb_cons<< (consonantal Biblical Hebrew)
    • Target: >>heb_voc<< (vocalized Biblical Hebrew)
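
You can quickly confirm that both tags are present in the tokenizer vocabulary (an unregistered string would fall back to the tokenizer's unk id):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("johnlockejrr/marianmt_heb_voc")

# Tags registered in the vocabulary get a real id, not unk_token_id
for tag in (">>heb_cons<<", ">>heb_voc<<"):
    tag_id = tokenizer.convert_tokens_to_ids(tag)
    print(tag, tag_id, tag_id != tokenizer.unk_token_id)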

Training Data

  • Source: Biblical Hebrew texts (vocalized text from which consonantal forms are derived)
  • Dataset Format: CSV with book|chapter|verse|content columns, where content holds the vocalized Biblical Hebrew verse
  • Text Processing:
    • Consonantal: Removes niqqud, cantillation, punctuation; converts maqaf to space
    • Vocalized: Keeps Hebrew letters, niqqud marks, and maqaf; removes other punctuation

Training Configuration

  • Batch Size: 8
  • Effective Batch Size: 32 (batch size 8 × 4 gradient accumulation steps)
  • Learning Rate: 1e-5
  • Max Input/Target Length: 384 tokens
  • Training Steps: 54,000
  • Epochs: 86.4
  • Optimizer: AdamW with cosine learning rate schedule
  • Precision: bfloat16
  • Early Stopping: patience of 5 evaluations without improvement
  • Best Checkpoint: Step 49,000

Performance

Best Validation Metrics (Step 49,000)

  • BLEU: 51.95
  • chrF: 86.95
  • Character Accuracy: 68.22%
  • Validation Loss: 0.1393

Final Test Metrics

  • BLEU: 50.74
  • chrF: 86.31
  • Character Accuracy: 68.89%
  • Test Loss: 0.1430

Usage

Direct Usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("johnlockejrr/marianmt_heb_voc")
model = AutoModelForSeq2SeqLM.from_pretrained("johnlockejrr/marianmt_heb_voc")

# Input: consonantal Biblical Hebrew text
text = "בראשית ברא אלהים את השמים ואת הארץ"

# Add language tag
input_text = f">>heb_cons<< {text}"

# Tokenize
inputs = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True, max_length=384)

# Generate
outputs = model.generate(**inputs, max_length=384, num_beams=4, length_penalty=0.6)

# Decode
vocalized = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(vocalized)

Using the Pipeline

from transformers import pipeline

vocalizer = pipeline("text2text-generation", 
                     model="johnlockejrr/marianmt_heb_voc",
                     tokenizer="johnlockejrr/marianmt_heb_voc")

# Input text (consonantal)
text = "בראשית ברא אלהים את השמים ואת הארץ"
input_text = f">>heb_cons<< {text}"

# Vocalize
result = vocalizer(input_text, max_length=384, num_beams=4, length_penalty=0.6)
print(result[0]['generated_text'])

Text Normalization

The model was trained on text in NFC (Normalization Form C) Unicode. The tokenizer's built-in normalization generally handles minor variations, but for best results, ensure your input is NFC-normalized before tokenization:

import unicodedata

def normalize_text(text: str) -> str:
    """Normalize text to NFC format."""
    return unicodedata.normalize("NFC", text)

# Normalize input before processing
text = normalize_text("בראשית ברא אלהים")

Input Cleaning

For optimal results, input text should contain only consonantal Biblical Hebrew characters. Matching the training preprocessing, clean inference input as follows (sketched after this list):

  • Remove niqqud (vowel marks)
  • Remove cantillation marks
  • Convert maqaf (־) to a space
  • Keep only Hebrew letters and spaces
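
A minimal implementation of this cleaning; the Unicode ranges used (U+0591–U+05C7 for marks, U+05D0–U+05EA for letters, U+05BE for maqaf) are the standard Hebrew blocks, though the exact ranges used in training are an assumption:

import re
import unicodedata

MAQAF = "\u05BE"
HEBREW_LETTERS = r"\u05D0-\u05EA"  # letters only; niqqud/cantillation live in U+0591–U+05C7

def to_consonantal(text: str) -> str:
    """Strip a verse down to bare consonants and spaces."""
    text = unicodedata.normalize("NFC", text)
    text = text.replace(MAQAF, " ")                    # maqaf -> space
    text = re.sub(rf"[^{HEBREW_LETTERS} ]", "", text)  # drop marks and punctuation
    return re.sub(r"\s+", " ", text).strip()

print(to_consonantal("בְּרֵאשִׁית בָּרָא אֱלֹהִים"))  # בראשית ברא אלהים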

Generation Parameters

Recommended generation parameters (combined in the sketch after this list):

  • num_beams: 4 (beam search for better quality)
  • length_penalty: 0.6 (beam-search length normalization; milder bias toward long outputs than the default 1.0)
  • early_stopping: True
  • max_length: 384 (matches training configuration)
  • do_sample: False (deterministic generation)
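
Combined with the objects from the Direct Usage example above:

# Assumes `model`, `tokenizer`, and `inputs` from the Direct Usage example
outputs = model.generate(
    **inputs,
    num_beams=4,          # beam search for better quality
    length_penalty=0.6,
    early_stopping=True,  # finish once enough beams are complete
    max_length=384,       # matches the training configuration
    do_sample=False,      # deterministic generation
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))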

Limitations and Bias

  • Domain Specificity: This model is trained primarily on Biblical Hebrew texts. Performance may vary on other domains (e.g., Modern Hebrew/Ivrit, non-biblical poetry and prose).
  • Single Direction: The model only vocalizes consonantal text. It does not perform the reverse operation (removing vocalization).
  • Length Constraints: Maximum input/output length is 384 tokens. Longer texts should be split into smaller segments (see the sketch after this list).
  • Character Accuracy: Character-level accuracy is ~69%, meaning some niqqud marks may be missing or incorrect in complex cases.
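
One way to stay under the 384-token limit is to split long passages before generation. The rough sketch below packs whitespace-separated words greedily; the helper name and safety margin are assumptions, not part of the released code, and when verse boundaries are available, splitting on them is the more natural choice:

def split_for_generation(text, tokenizer, max_tokens=384, margin=10):
    """Greedily pack words into chunks that stay under the token limit,
    counting the >>heb_cons<< tag against the budget."""
    budget = max_tokens - margin
    chunks, current = [], []
    for word in text.split():
        candidate = ">>heb_cons<< " + " ".join(current + [word])
        if current and len(tokenizer(candidate).input_ids) > budget:
            chunks.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks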

Training Procedure

Training Infrastructure

  • Hardware: GPU (CUDA)
  • Training Time: ~4.75 hours (17,110 seconds)
  • Framework: Hugging Face Transformers
  • Evaluation Frequency: Every 1,000 steps

Preprocessing

  • Text normalized to NFC Unicode format
  • Language tags (>>heb_cons<< and >>heb_voc<<) added to the tokenizer vocabulary (sketched below)
  • Tokenization using SentencePiece (inherited from base model)
  • Consonantal text: niqqud removed, maqaf converted to space
  • Vocalized text: niqqud and maqaf preserved
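
The training script itself is not reproduced in this card, but registering such tags typically looks like the following sketch:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-sem-sem")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-sem-sem")

# Add the new language tags and grow the embedding matrix to match
num_added = tokenizer.add_tokens([">>heb_cons<<", ">>heb_voc<<"])
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))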

Hyperparameters

{
  "learning_rate": 1e-5,
  "batch_size": 8,
  "gradient_accumulation_steps": 4,
  "num_epochs": 100,
  "max_input_length": 384,
  "max_target_length": 384,
  "warmup_steps": 1000,
  "weight_decay": 0.01,
  "eval_steps": 1000,
  "save_steps": 1000,
  "save_total_limit": 3
}
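
For reference, these values map onto transformers training arguments roughly as follows. This is a sketch rather than the exact training script; output_dir, eval_strategy, and the bf16 flag are assumptions based on the entries above:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="marianmt_heb_voc",    # assumed name
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,    # effective batch size 32
    num_train_epochs=100,
    lr_scheduler_type="cosine",
    warmup_steps=1000,
    weight_decay=0.01,
    eval_strategy="steps",
    eval_steps=1000,
    save_steps=1000,
    save_total_limit=3,
    bf16=True,                        # bfloat16 precision
    load_best_model_at_end=True,      # required for early stopping
    predict_with_generate=True,
)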

Evaluation

The model is evaluated using three metrics (a computation sketch follows the list):

  1. BLEU Score: Measures n-gram precision between generated and reference text
  2. chrF Score: Character-level F-score, more lenient than BLEU
  3. Character Accuracy: Exact character match percentage
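
A sketch of computing all three with sacrebleu; the character-accuracy definition here (position-wise matching against the reference) is an assumption, and the training-time implementation may differ:

import sacrebleu

def character_accuracy(pred: str, ref: str) -> float:
    """Fraction of reference positions whose character the prediction matches."""
    matches = sum(p == r for p, r in zip(pred, ref))
    return matches / max(len(ref), 1)

preds = ["בְּרֵאשִׁית בָּרָא אֱלֹהִים"]
refs = [["בְּרֵאשִׁית בָּרָא אֱלֹהִים"]]  # one reference stream

print(f"BLEU {sacrebleu.corpus_bleu(preds, refs).score:.2f}")
print(f"chrF {sacrebleu.corpus_chrf(preds, refs).score:.2f}")
print(f"Char Acc {character_accuracy(preds[0], refs[0][0]):.2%}")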

Evaluation Results

Metric     Validation (Best)   Test (Final)
BLEU       51.95               50.74
chrF       86.95               86.31
Char Acc   68.22%              68.89%
Loss       0.1393              0.1430

Citation

If you use this model, please cite:

@misc{marianmt_heb_voc,
  title={MarianMT Biblical Hebrew Vocalization Model},
  author={johnlockejrr},
  year={2025},
  howpublished={\url{https://huggingface.co/johnlockejrr/marianmt_heb_voc}},
  note={Fine-tuned from Helsinki-NLP/opus-mt-sem-sem}
}

Model Card Contact

For questions, issues, or contributions, please open an issue on the model repository.

License

This model is released under the Apache 2.0 license, consistent with the base model.
