GEM-RoBERTa Legal Bilingual: A Bilingual Greek-English Legal Language Model

Model Description

GEM-RoBERTa Legal Bilingual is a RoBERTa-base model pre-trained from scratch on a comprehensive 26GB bilingual corpus of Greek and English legal, parliamentary, and governmental text. This model represents the first large-scale bilingual legal language model combining Greek and English legal domains, enabling cross-lingual legal understanding and applications.

The model employs the RoBERTa architecture optimized for legal text understanding across both languages, with dynamic masking and focused Masked Language Modeling (MLM) training. The bilingual approach allows the model to leverage legal concepts and terminology from both the Greek and Anglo-American legal traditions.
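The dynamic masking mentioned above can be sketched as follows. This is a minimal illustration of RoBERTa-style masking, not the model's actual training code: masks are resampled each time a batch is drawn, 15% of tokens are selected (matching the MLM probability used for this model), and the selected tokens follow the standard 80/10/10 replacement split. Function and variable names are illustrative.

```python
import random

def dynamic_mask(token_ids, vocab_size, mask_id, mlm_prob=0.15, rng=None):
    """RoBERTa-style dynamic masking, resampled every time a batch is drawn.

    Each token is selected with probability `mlm_prob`; of the selected
    tokens, 80% become the mask token, 10% a random token, and 10% stay
    unchanged. Unselected positions get label -100 so the loss ignores them.
    """
    rng = rng or random.Random()
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mlm_prob:
            labels[i] = tok          # predict the original token here
            roll = rng.random()
            if roll < 0.8:
                inputs[i] = mask_id  # 80%: replace with the mask token
            elif roll < 0.9:
                inputs[i] = rng.randrange(vocab_size)  # 10%: random token
            # remaining 10%: keep the original token
    return inputs, labels
```

Because the masking is a function of the batch rather than a fixed preprocessing step, the model sees different masked positions for the same sentence across epochs.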

This model builds upon legal datasets including portions of the Pile of Law collection from Hugging Face, combined with comprehensive Greek legal corpora to create a unique bilingual legal language resource.

How to Get Started

You can use this model directly with the fill-mask pipeline:

from transformers import pipeline

# Load the model
fill_mask = pipeline(
    "fill-mask",
    model="novelcore/gem-roberta-bilingual",
    tokenizer="novelcore/gem-roberta-bilingual"
)

# Example in Greek ("Mr. Mitsotakis <mask> that the government fully
# respects the decisions of the Council of State.")
text_gr = "Ο κ. Μητσοτάκης <mask> ότι η κυβέρνηση σέβεται πλήρως τις αποφάσεις του Συμβουλίου της Επικρατείας."
predictions_gr = fill_mask(text_gr)
print("Greek predictions:", predictions_gr)

# Example in English
text_en = "The court <mask> the appeal on procedural grounds."
predictions_en = fill_mask(text_en)
print("English predictions:", predictions_en)

For downstream tasks:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# For bilingual legal document classification.
# Note: the classification head is newly initialized and must be
# fine-tuned on labeled data before its outputs are meaningful.
tokenizer = AutoTokenizer.from_pretrained("novelcore/gem-roberta-bilingual")
model = AutoModelForSequenceClassification.from_pretrained("novelcore/gem-roberta-bilingual")

# Process texts in both languages
greek_text = "Το Συνταγματικό Δικαστήριο αποφάσισε..."  # "The Constitutional Court decided..."
english_text = "The Constitutional Court decided..."

inputs = tokenizer([greek_text, english_text], padding=True, truncation=True, return_tensors="pt")
logits = model(**inputs).logits

Training Data

The model was pre-trained on a comprehensive 26GB bilingual corpus comprising 60.3% Greek legal content (13.85GB) and 39.7% English legal content (9.12GB), creating a balanced exposure to both legal traditions.

Greek Legal Corpus (13.85 GB - 60.3%)

| Dataset | Size (GB) | Context | Rationale |
|---|---|---|---|
| FEK - Greek Government Gazette | 11.0 | Legal/Regulatory | Official government publications, regulatory language |
| Greek Parliament Proceedings | 2.9 | Legal/Parliamentary | Legislative discourse, policy language |
| Political Reports of Supreme Court | 1.2 | Legal/Judicial | High-level judicial reasoning, precedents |
| Eur-Lex (Greek Content) | 0.92 | Legal/EU | EU legal documents, multilingual legal terminology |
| Europarl (Greek Content) | 0.38 | Legal/Parliamentary | Parliamentary proceedings, EU legislative language |
| Raptarchis Legal Dictionary | 0.35 | Legal/Reference | Legal terminology, definitions |

English Legal Corpus (9.12 GB - 39.7%)

| Dataset | Size (GB) | Context | Greek Equivalent |
|---|---|---|---|
| CourtListener Opinions | 4.2 | Legal/Judicial | Supreme Court Reports |
| EDGAR (SEC Filings) | 2.1 | Legal/Corporate | Corporate regulatory compliance |
| Eur-Lex (English) | 1.1 | Legal/EU | Direct parallel to Greek Eur-Lex |
| US Bills | 1.0 | Legal/Legislative | Parliamentary proceedings equivalent |
| CFR (Code of Federal Regulations) | 0.48 | Legal/Regulatory | Federal regulatory framework |
| Europarl (English) | 0.24 | Legal/Parliamentary | Direct parallel to Greek Europarl |
| Federal Register | 0.12 | Legal/Regulatory | Government gazette equivalent (FEK) |

Note: English legal datasets partially sourced from the Pile of Law collection on Hugging Face.

Rationale for Bilingual Training

The 60:40 Greek-to-English ratio was designed to:

  • Preserve Greek legal specificity while benefiting from English legal corpus diversity
  • Enable cross-lingual transfer learning between common law and civil law traditions
  • Support multilingual legal applications in EU contexts where both languages are relevant
  • Leverage complementary legal concepts from both legal systems

Training Procedure

Model Architecture

The model uses the RoBERTa-base architecture with the following configuration:

  • Hidden Size: 768
  • Attention Heads: 12
  • Hidden Layers: 12
  • Parameters: ~125M
  • Max Position Embeddings: 514
  • Vocabulary Size: 50,264
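
The ~125M figure can be sanity-checked from the configuration above. Below is a back-of-the-envelope count that assumes the standard RoBERTa-base intermediate (FFN) size of 3072 and a token-type vocabulary of 1, neither of which is stated in this card:

```python
# Rough parameter count for the configuration listed above (encoder only).
V, P, H, L, I = 50_264, 514, 768, 12, 3_072  # vocab, positions, hidden, layers, FFN size

embeddings = V * H + P * H + 1 * H + 2 * H   # token + position + token-type + LayerNorm
attention  = 4 * (H * H + H)                 # Q, K, V, output projections (+ biases)
ffn        = 2 * H * I + I + H               # up- and down-projection (+ biases)
layer      = attention + ffn + 2 * (2 * H)   # plus two LayerNorms per layer

total = embeddings + L * layer
print(f"{total / 1e6:.1f}M parameters")      # ~124M; ~125M once the MLM head is included
```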

Preprocessing

The text was tokenized using a custom ByteLevelBPE tokenizer trained from scratch on the bilingual Greek-English legal corpus. The tokenizer uses a vocabulary of 50,264 tokens optimized for both Greek and English legal terminology, enabling effective cross-lingual representation.
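
The tokenizer-training step can be sketched with the `tokenizers` library. The tiny in-memory corpus and the special-token list below are illustrative stand-ins (the card does not specify either); in practice the trainer would stream the full 26GB bilingual corpus.

```python
from tokenizers import ByteLevelBPETokenizer

# Stand-in for streaming the bilingual Greek-English legal corpus.
corpus = [
    "Το Συνταγματικό Δικαστήριο αποφάσισε...",  # "The Constitutional Court decided..."
    "The court dismissed the appeal on procedural grounds.",
]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    corpus,
    vocab_size=50_264,  # matches the model's vocabulary size
    min_frequency=1,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],  # RoBERTa defaults
)

encoded = tokenizer.encode("The Constitutional Court decided...")
print(encoded.tokens)
```

Byte-level BPE falls back to raw bytes for unseen characters, which is what lets a single vocabulary cover both the Greek and Latin scripts without out-of-vocabulary tokens.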

The data was processed into fixed-size chunks of 512 tokens, respecting document boundaries to ensure contextual coherence across both languages.
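
A minimal sketch of the chunking step, assuming "respecting document boundaries" means a chunk never spans two documents (the helper name is hypothetical):

```python
def chunk_documents(tokenized_docs, max_len=512):
    """Split each tokenized document (a list of token ids) into chunks of
    at most `max_len` ids, never packing two documents into one chunk."""
    chunks = []
    for doc in tokenized_docs:
        for start in range(0, len(doc), max_len):
            chunks.append(doc[start:start + max_len])
    return chunks
```

Short trailing chunks would then be padded (or dropped); the card does not say which.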

Pre-training

The model was pre-trained from scratch for 150,000 steps on 8x NVIDIA H100 GPUs, using BFloat16 (bf16) mixed-precision for stability and speed. The training took approximately 25 hours and 7 minutes to complete.

The key hyperparameters used were:

  • Learning Rate: 2e-4 with a linear warmup over 9,000 steps
  • Batch Size: effective batch size of 256 per device (per_device_train_batch_size: 128 × gradient_accumulation_steps: 2)
  • Optimizer: AdamW with standard parameters
  • Weight Decay: 0.01
  • Max Sequence Length: 512
  • Max Steps: 150,000
  • Warmup Steps: 9,000
  • MLM Probability: 0.15
  • Max Gradient Norm: 1.0

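The learning-rate schedule implied by these hyperparameters can be written out as follows. The card only specifies the warmup; the linear decay to zero after warmup is assumed here, as in the default Hugging Face linear scheduler.

```python
def learning_rate(step, peak=2e-4, warmup=9_000, total=150_000):
    """Linear warmup to `peak` over `warmup` steps, then (assumed)
    linear decay to zero at `total` steps."""
    if step < warmup:
        return peak * step / warmup
    return peak * max(0.0, (total - step) / (total - warmup))
```

For example, halfway through warmup (step 4,500) the rate is 1e-4, and it peaks at 2e-4 at step 9,000.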
Training Results

The model achieved the following performance metrics:

  • Final Training Loss: 0.7479
  • Final Evaluation Loss: 0.69405
  • Training Infrastructure: 8x NVIDIA H100 GPUs
  • Total Training Steps: 150,000
  • Total Training Time: 25 hours 7 minutes
  • Train/Validation Split: 95%/5%
  • Total Training Data: 26GB bilingual corpus