# GEM-RoBERTa Legal Bilingual: A Bilingual Greek-English Legal Language Model
## Model Description
GEM-RoBERTa Legal Bilingual is a RoBERTa-base model pre-trained from scratch on a 26GB bilingual corpus of Greek and English legal, parliamentary, and governmental text. To our knowledge, it is the first large-scale bilingual legal language model combining the Greek and English legal domains, enabling cross-lingual legal understanding and applications.
The model employs the RoBERTa architecture optimized for legal text understanding across both languages, with dynamic masking and focused Masked Language Modeling (MLM) training. The bilingual approach allows the model to leverage legal concepts and terminology from both the Greek and Anglo-American legal traditions.
This model builds upon legal datasets including portions of the Pile of Law collection from Hugging Face, combined with comprehensive Greek legal corpora to create a unique bilingual legal language resource.
## How to Get Started
You can use this model directly with the fill-mask pipeline:
```python
from transformers import pipeline

# Load the model
fill_mask = pipeline(
    "fill-mask",
    model="novelcore/gem-roberta-bilingual",
    tokenizer="novelcore/gem-roberta-bilingual",
)

# Example in Greek: "Mr. Mitsotakis <mask> that the government fully
# respects the decisions of the Council of State."
text_gr = "Ο κ. Μητσοτάκης <mask> ότι η κυβέρνηση σέβεται πλήρως τις αποφάσεις του Συμβουλίου της Επικρατείας."
predictions_gr = fill_mask(text_gr)
print("Greek predictions:", predictions_gr)
```
For downstream tasks:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# For bilingual legal document classification
# (the classification head is randomly initialized and must be fine-tuned)
tokenizer = AutoTokenizer.from_pretrained("novelcore/gem-roberta-bilingual")
model = AutoModelForSequenceClassification.from_pretrained("novelcore/gem-roberta-bilingual")

# Process texts in both languages
greek_text = "Το Συνταγματικό Δικαστήριο αποφάσισε..."  # "The Constitutional Court decided..."
english_text = "The Constitutional Court decided..."
inputs = tokenizer([greek_text, english_text], padding=True, return_tensors="pt")
logits = model(**inputs).logits
```
## Training Data
The model was pre-trained on a comprehensive 26GB bilingual corpus comprising 60.3% Greek legal content (13.85GB) and 39.7% English legal content (9.12GB), creating a balanced exposure to both legal traditions.
### Greek Legal Corpus (13.85 GB - 60.3%)
| Dataset | Size (GB) | Context | Rationale |
|---|---|---|---|
| FEK - Greek Government Gazette | 11.0 | Legal/Regulatory | Official government publications, regulatory language |
| Greek Parliament Proceedings | 2.9 | Legal/Parliamentary | Legislative discourse, policy language |
| Political Reports of Supreme Court | 1.2 | Legal/Judicial | High-level judicial reasoning, precedents |
| Eur-Lex (Greek Content) | 0.92 | Legal/EU | EU legal documents, multilingual legal terminology |
| Europarl (Greek Content) | 0.38 | Legal/Parliamentary | Parliamentary proceedings, EU legislative language |
| Raptarchis Legal Dictionary | 0.35 | Legal/Reference | Legal terminology, definitions |
### English Legal Corpus (9.12 GB - 39.7%)
| Dataset | Size (GB) | Context | Greek Equivalent |
|---|---|---|---|
| CourtListener Opinions | 4.2 | Legal/Judicial | Supreme Court Reports |
| EDGAR (SEC Filings) | 2.1 | Legal/Corporate | Corporate regulatory compliance |
| Eur-Lex (English) | 1.1 | Legal/EU | Direct parallel to Greek Eur-Lex |
| US Bills | 1.0 | Legal/Legislative | Parliamentary proceedings equivalent |
| CFR (Code of Federal Regulations) | 0.48 | Legal/Regulatory | Federal regulatory framework |
| Europarl (English) | 0.24 | Legal/Parliamentary | Direct parallel to Greek Europarl |
| Federal Register | 0.12 | Legal/Regulatory | Government gazette equivalent (FEK) |
Note: English legal datasets partially sourced from the Pile of Law collection on Hugging Face.
## Rationale for Bilingual Training
The 60:40 Greek-to-English ratio was designed to:
- Preserve Greek legal specificity while benefiting from English legal corpus diversity
- Enable cross-lingual transfer learning between common law and civil law traditions
- Support multilingual legal applications in EU contexts where both languages are relevant
- Leverage complementary legal concepts from both legal systems
## Training Procedure
### Model Architecture
The model uses the RoBERTa-base architecture with the following configuration:
- Hidden Size: 768
- Attention Heads: 12
- Hidden Layers: 12
- Parameters: ~125M
- Max Position Embeddings: 514
- Vocabulary Size: 50,264
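This configuration can be reproduced with `transformers` as a sketch; the values mirror the list above, and all remaining settings fall back to `RobertaConfig` defaults:

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Mirror of the architecture listed above (RoBERTa-base defaults otherwise)
config = RobertaConfig(
    vocab_size=50_264,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    max_position_embeddings=514,
)

# Randomly initialized model with roughly 125M parameters
model = RobertaForMaskedLM(config)
print(f"{model.num_parameters() / 1e6:.0f}M parameters")
```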
### Preprocessing
The text was tokenized using a custom ByteLevelBPE tokenizer trained from scratch on the bilingual Greek-English legal corpus. The tokenizer uses a vocabulary of 50,264 tokens optimized for both Greek and English legal terminology, enabling effective cross-lingual representation.
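As an illustration, a tokenizer of this kind can be trained with the `tokenizers` library. The snippet below is a minimal sketch: the toy in-memory corpus stands in for the real 26GB corpus, and the special-token list is an assumption based on standard RoBERTa conventions:

```python
from tokenizers import ByteLevelBPETokenizer

# Toy bilingual corpus standing in for the 26GB Greek-English legal corpus
corpus = [
    "The Constitutional Court decided on the appeal.",
    "Το Συνταγματικό Δικαστήριο αποφάσισε επί της αιτήσεως.",
]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    corpus,
    vocab_size=50_264,          # target size from the model card
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],  # assumed RoBERTa conventions
)

ids = tokenizer.encode("Constitutional Court").ids
```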
The data was processed into fixed-size chunks of 512 tokens, respecting document boundaries to ensure contextual coherence across both languages.
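The chunking step can be sketched in plain Python; `docs` is a hypothetical stand-in for the tokenized corpus, and keeping the short tail chunk of each document is an assumption (the exact policy is not specified):

```python
def chunk_documents(docs, chunk_size=512):
    """Split each tokenized document into chunks of at most `chunk_size`
    tokens, never merging tokens across document boundaries."""
    chunks = []
    for token_ids in docs:
        for start in range(0, len(token_ids), chunk_size):
            chunks.append(token_ids[start:start + chunk_size])
    return chunks

# Two toy "documents" of 600 and 100 token ids
docs = [list(range(600)), list(range(100))]
chunks = chunk_documents(docs)  # chunk lengths: 512 and 88 from doc 1, 100 from doc 2
```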
### Pre-training
The model was pre-trained from scratch for 150,000 steps on 8x NVIDIA H100 GPUs, using BFloat16 (bf16) mixed-precision for stability and speed. The training took approximately 25 hours and 7 minutes to complete.
The key hyperparameters used were:
- Learning Rate: 2e-4 (0.0002) with a linear warmup of 9,000 steps
- Batch Size: effective batch size of 256 (`per_device_train_batch_size`: 128, `gradient_accumulation_steps`: 2)
- Optimizer: AdamW with standard parameters
- Weight Decay: 0.01
- Max Sequence Length: 512
- Max Steps: 150,000
- Warmup Steps: 9,000
- MLM Probability: 0.15
- Max Gradient Norm: 1.0
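As a sketch, these hyperparameters map onto a Hugging Face `TrainingArguments` configuration roughly as follows (the actual training script is not published, and `output_dir` is a placeholder). The 0.15 MLM probability would be set on `DataCollatorForLanguageModeling` rather than here:

```python
from transformers import TrainingArguments

# Hyperparameters from the list above; output_dir is a placeholder
training_args = TrainingArguments(
    output_dir="gem-roberta-bilingual",
    learning_rate=2e-4,
    lr_scheduler_type="linear",
    warmup_steps=9_000,
    per_device_train_batch_size=128,
    gradient_accumulation_steps=2,
    weight_decay=0.01,
    max_steps=150_000,
    max_grad_norm=1.0,
    bf16=True,
)
```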
### Training Results
The model achieved the following performance metrics:
- Final Training Loss: 0.7479
- Final Evaluation Loss: 0.69405
- Training Infrastructure: 8x NVIDIA H100 GPUs
- Total Training Steps: 150,000
- Total Training Time: 25 hours 7 minutes
- Train/Validation Split: 95%/5%
- Total Training Data: 26GB bilingual corpus
## Model Tree
Base model for novelcore/gem-roberta-bilingual: FacebookAI/roberta-base