GEM-RoBERTa Legal Bilingual: A Bilingual Greek-English Legal Language Model

Model Description

GEM-RoBERTa Legal Bilingual is a RoBERTa-base model pre-trained from scratch on a comprehensive 26GB bilingual corpus of Greek and English legal, parliamentary, and governmental text. This model represents the first large-scale bilingual legal language model combining Greek and English legal domains, enabling cross-lingual legal understanding and applications.

The model employs the RoBERTa architecture optimized for legal text understanding across both languages, with dynamic masking and focused Masked Language Modeling (MLM) training. The bilingual approach allows the model to leverage legal concepts and terminology from both the Greek and Anglo-American legal traditions.
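The dynamic masking mentioned above can be sketched as follows. This is a minimal illustration of RoBERTa-style masking, not the model's actual training code: masks are resampled each time a batch is drawn, 15% of tokens are selected (matching the MLM probability used for this model), and the selected tokens follow the standard 80/10/10 replacement split. Function and variable names are illustrative.

```python
import random

def dynamic_mask(token_ids, vocab_size, mask_id, mlm_prob=0.15, rng=None):
    """RoBERTa-style dynamic masking, resampled every time a batch is drawn.

    Each token is selected with probability `mlm_prob`; of the selected
    tokens, 80% become the mask token, 10% a random token, and 10% stay
    unchanged. Unselected positions get label -100 so the loss ignores them.
    """
    rng = rng or random.Random()
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mlm_prob:
            labels[i] = tok          # predict the original token here
            roll = rng.random()
            if roll < 0.8:
                inputs[i] = mask_id  # 80%: replace with the mask token
            elif roll < 0.9:
                inputs[i] = rng.randrange(vocab_size)  # 10%: random token
            # remaining 10%: keep the original token
    return inputs, labels
```

Because the masking is a function of the batch rather than a fixed preprocessing step, the model sees different masked positions for the same sentence across epochs.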

This model builds upon legal datasets including portions of the Pile of Law collection from Hugging Face, combined with comprehensive Greek legal corpora to create a unique bilingual legal language resource.

How to Get Started

You can use this model directly with the fill-mask pipeline:

from transformers import pipeline

# Load the model
fill_mask = pipeline(
    "fill-mask",
    model="novelcore/gem-roberta-bilingual",
    tokenizer="novelcore/gem-roberta-bilingual"
)

# Example in Greek ("Mr. Mitsotakis <mask> that the government fully
# respects the decisions of the Council of State.")
text_gr = "Ο κ. Μητσοτάκης <mask> ότι η κυβέρνηση σέβεται πλήρως τις αποφάσεις του Συμβουλίου της Επικρατείας."
predictions_gr = fill_mask(text_gr)
print("Greek predictions:", predictions_gr)

# Example in English
text_en = "The court <mask> the appeal on procedural grounds."
predictions_en = fill_mask(text_en)
print("English predictions:", predictions_en)

For downstream tasks:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# For bilingual legal document classification.
# Note: the classification head is newly initialized and must be
# fine-tuned on labeled data before its outputs are meaningful.
tokenizer = AutoTokenizer.from_pretrained("novelcore/gem-roberta-bilingual")
model = AutoModelForSequenceClassification.from_pretrained("novelcore/gem-roberta-bilingual")

# Process texts in both languages
greek_text = "Το Συνταγματικό Δικαστήριο αποφάσισε..."  # "The Constitutional Court decided..."
english_text = "The Constitutional Court decided..."

inputs = tokenizer([greek_text, english_text], padding=True, truncation=True, return_tensors="pt")
logits = model(**inputs).logits

Training Data

The model was pre-trained on a comprehensive 26GB bilingual corpus comprising 60.3% Greek legal content (13.85GB) and 39.7% English legal content (9.12GB), creating a balanced exposure to both legal traditions.

Greek Legal Corpus (13.85 GB - 60.3%)

| Dataset | Size (GB) | Context | Rationale |
|---|---|---|---|
| FEK - Greek Government Gazette | 11.0 | Legal/Regulatory | Official government publications, regulatory language |
| Greek Parliament Proceedings | 2.9 | Legal/Parliamentary | Legislative discourse, policy language |
| Political Reports of Supreme Court | 1.2 | Legal/Judicial | High-level judicial reasoning, precedents |
| Eur-Lex (Greek Content) | 0.92 | Legal/EU | EU legal documents, multilingual legal terminology |
| Europarl (Greek Content) | 0.38 | Legal/Parliamentary | Parliamentary proceedings, EU legislative language |
| Raptarchis Legal Dictionary | 0.35 | Legal/Reference | Legal terminology, definitions |

English Legal Corpus (9.12 GB - 39.7%)

| Dataset | Size (GB) | Context | Greek Equivalent |
|---|---|---|---|
| CourtListener Opinions | 4.2 | Legal/Judicial | Supreme Court Reports |
| EDGAR (SEC Filings) | 2.1 | Legal/Corporate | Corporate regulatory compliance |
| Eur-Lex (English) | 1.1 | Legal/EU | Direct parallel to Greek Eur-Lex |
| US Bills | 1.0 | Legal/Legislative | Parliamentary proceedings equivalent |
| CFR (Code of Federal Regulations) | 0.48 | Legal/Regulatory | Federal regulatory framework |
| Europarl (English) | 0.24 | Legal/Parliamentary | Direct parallel to Greek Europarl |
| Federal Register | 0.12 | Legal/Regulatory | Government gazette equivalent (FEK) |

Note: English legal datasets partially sourced from the Pile of Law collection on Hugging Face.

Rationale for Bilingual Training

The 60:40 Greek-to-English ratio was designed to:

  • Preserve Greek legal specificity while benefiting from English legal corpus diversity
  • Enable cross-lingual transfer learning between common law and civil law traditions
  • Support multilingual legal applications in EU contexts where both languages are relevant
  • Leverage complementary legal concepts from both legal systems

Training Procedure

Model Architecture

The model uses the RoBERTa-base architecture with the following configuration:

  • Hidden Size: 768
  • Attention Heads: 12
  • Hidden Layers: 12
  • Parameters: ~125M
  • Max Position Embeddings: 514
  • Vocabulary Size: 50,264
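
The ~125M figure can be sanity-checked from the configuration above. Below is a back-of-the-envelope count that assumes the standard RoBERTa-base intermediate (FFN) size of 3072 and a token-type vocabulary of 1, neither of which is stated in this card:

```python
# Rough parameter count for the configuration listed above (encoder only).
V, P, H, L, I = 50_264, 514, 768, 12, 3_072  # vocab, positions, hidden, layers, FFN size

embeddings = V * H + P * H + 1 * H + 2 * H   # token + position + token-type + LayerNorm
attention  = 4 * (H * H + H)                 # Q, K, V, output projections (+ biases)
ffn        = 2 * H * I + I + H               # up- and down-projection (+ biases)
layer      = attention + ffn + 2 * (2 * H)   # plus two LayerNorms per layer

total = embeddings + L * layer
print(f"{total / 1e6:.1f}M parameters")      # ~124M; ~125M once the MLM head is included
```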

Preprocessing

The text was tokenized using a custom ByteLevelBPE tokenizer trained from scratch on the bilingual Greek-English legal corpus. The tokenizer uses a vocabulary of 50,264 tokens optimized for both Greek and English legal terminology, enabling effective cross-lingual representation.
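
The tokenizer-training step can be sketched with the `tokenizers` library. The tiny in-memory corpus and the special-token list below are illustrative stand-ins (the card does not specify either); in practice the trainer would stream the full 26GB bilingual corpus.

```python
from tokenizers import ByteLevelBPETokenizer

# Stand-in for streaming the bilingual Greek-English legal corpus.
corpus = [
    "Το Συνταγματικό Δικαστήριο αποφάσισε...",  # "The Constitutional Court decided..."
    "The court dismissed the appeal on procedural grounds.",
]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    corpus,
    vocab_size=50_264,  # matches the model's vocabulary size
    min_frequency=1,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],  # RoBERTa defaults
)

encoded = tokenizer.encode("The Constitutional Court decided...")
print(encoded.tokens)
```

Byte-level BPE falls back to raw bytes for unseen characters, which is what lets a single vocabulary cover both the Greek and Latin scripts without out-of-vocabulary tokens.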

The data was processed into fixed-size chunks of 512 tokens, respecting document boundaries to ensure contextual coherence across both languages.
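
A minimal sketch of the chunking step, assuming "respecting document boundaries" means a chunk never spans two documents (the helper name is hypothetical):

```python
def chunk_documents(tokenized_docs, max_len=512):
    """Split each tokenized document (a list of token ids) into chunks of
    at most `max_len` ids, never packing two documents into one chunk."""
    chunks = []
    for doc in tokenized_docs:
        for start in range(0, len(doc), max_len):
            chunks.append(doc[start:start + max_len])
    return chunks
```

Short trailing chunks would then be padded (or dropped); the card does not say which.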

Pre-training

The model was pre-trained from scratch for 150,000 steps on 8x NVIDIA H100 GPUs, using BFloat16 (bf16) mixed-precision for stability and speed. The training took approximately 25 hours and 7 minutes to complete.

The key hyperparameters used were:

  • Learning Rate: 2e-4 with a linear warmup over 9,000 steps
  • Batch Size: effective batch size of 256 per device (per_device_train_batch_size: 128 × gradient_accumulation_steps: 2)
  • Optimizer: AdamW with standard parameters
  • Weight Decay: 0.01
  • Max Sequence Length: 512
  • Max Steps: 150,000
  • Warmup Steps: 9,000
  • MLM Probability: 0.15
  • Max Gradient Norm: 1.0

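The learning-rate schedule implied by these hyperparameters can be written out as follows. The card only specifies the warmup; the linear decay to zero after warmup is assumed here, as in the default Hugging Face linear scheduler.

```python
def learning_rate(step, peak=2e-4, warmup=9_000, total=150_000):
    """Linear warmup to `peak` over `warmup` steps, then (assumed)
    linear decay to zero at `total` steps."""
    if step < warmup:
        return peak * step / warmup
    return peak * max(0.0, (total - step) / (total - warmup))
```

For example, halfway through warmup (step 4,500) the rate is 1e-4, and it peaks at 2e-4 at step 9,000.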
Training Results

The model achieved the following performance metrics:

  • Final Training Loss: 0.7479
  • Final Evaluation Loss: 0.69405
  • Training Infrastructure: 8x NVIDIA H100 GPUs
  • Total Training Steps: 150,000
  • Total Training Time: 25 hours 7 minutes
  • Train/Validation Split: 95%/5%
  • Total Training Data: 26GB bilingual corpus