---
library_name: transformers
tags:
- legal
datasets:
- Moryjj/Iranian-Courts-Ruling
language:
- fa
metrics:
- rouge
- bertscore
base_model:
- Ahmad/parsT5-base
---

# 🧾 Persian Legal Text Simplification with ParsT5 + Unlimiformer

This model is part of the first benchmark for Persian legal text simplification. It fine-tunes the **ParsT5-base** encoder-decoder model with the **Unlimiformer** extension to handle long legal documents efficiently, without truncation.

🔗 **Project GitHub**: [mrjoneidi/Simplification-Legal-Texts](https://github.com/mrjoneidi/Simplification-Legal-Texts)

---

## 🧠 Model Description

- **Base Model**: [`Ahmad/parsT5-base`](https://huggingface.co/Ahmad/parsT5-base)
- **Extended with**: [Unlimiformer](https://arxiv.org/abs/2305.01625) for long-input attention
- **Training Data**: 5,000+ Persian legal rulings, simplified using GPT-4o prompts
- **Max Input Length**: ~16,000 tokens (with Unlimiformer)
- **Fine-tuned on**: a simplified dataset generated from judicial rulings
- **Task**: Text simplification (formal legal language → public-friendly Persian)

---

## ✨ Example Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("mrjoneidi/parsT5-legal-simplification")
tokenizer = AutoTokenizer.from_pretrained("mrjoneidi/parsT5-legal-simplification")

# A formal court ruling: "Having reviewed the available evidence, the court rules to dismiss the claim..."
input_text = "دادگاه با بررسی مدارک موجود، حکم به رد دعوی صادر می‌نماید..."

inputs = tokenizer(input_text, return_tensors="pt", truncation=True)
output_ids = model.generate(**inputs, max_new_tokens=512)

simplified = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(simplified)
```

## 🛠 Training Details

- **Platform**: Kaggle P100 GPU
- **Optimizer**: AdamW (best performance among AdamW, LAMB, and SGD)
- **Configurations tested**:
  - 1 vs. 3 unfrozen encoder-decoder blocks
  - Best results with the 3-block configuration

---

## More details on GitHub
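The usage example truncates inputs to the base model's context window; the Unlimiformer-extended setup handles ~16,000 tokens natively. When running the checkpoint with plain `transformers` instead, one common fallback is to split long rulings into overlapping windows and simplify each window separately. This is an illustrative sketch only: the `chunk_token_ids` helper is hypothetical and not part of the released code.

```python
def chunk_token_ids(ids, max_len=512, stride=64):
    """Split a long token-id sequence into overlapping windows.

    Each window fits within `max_len` tokens; consecutive windows
    overlap by `stride` tokens to preserve sentence continuity.
    """
    if len(ids) <= max_len:
        return [ids]
    chunks, start = [], 0
    while start < len(ids):
        chunks.append(ids[start:start + max_len])
        if start + max_len >= len(ids):
            break
        start += max_len - stride
    return chunks

# Synthetic stand-in for a tokenized 1000-token ruling
chunks = chunk_token_ids(list(range(1000)), max_len=512, stride=64)
print(len(chunks))  # 3 overlapping windows
```

Each window would then be passed through `model.generate` and the decoded outputs joined; the overlap region helps avoid cutting a sentence at a chunk boundary.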
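The partial-unfreezing configurations described under Training Details (1 vs. 3 unfrozen encoder-decoder blocks) can be sketched as follows. This is a minimal illustration, not the released training code: the `unfreeze_last_blocks` helper is an assumption, and a tiny randomly initialized T5 config stands in for `Ahmad/parsT5-base` so the snippet runs offline.

```python
from transformers import T5Config, T5ForConditionalGeneration

def unfreeze_last_blocks(model, n_blocks=3):
    """Freeze every parameter, then unfreeze only the last `n_blocks`
    transformer blocks of both the encoder and the decoder."""
    for p in model.parameters():
        p.requires_grad = False
    for block in list(model.encoder.block)[-n_blocks:] + list(model.decoder.block)[-n_blocks:]:
        for p in block.parameters():
            p.requires_grad = True
    return model

# Tiny random T5 as a stand-in for Ahmad/parsT5-base (same block layout)
config = T5Config(vocab_size=128, d_model=32, d_kv=16, d_ff=64, num_layers=4, num_heads=2)
model = unfreeze_last_blocks(T5ForConditionalGeneration(config), n_blocks=3)

trainable = {n for n, p in model.named_parameters() if p.requires_grad}
print(any(n.startswith("encoder.block.0") for n in trainable))  # False: the first block stays frozen
```

With `n_blocks=3` only the last three encoder and decoder blocks receive gradients, which matches the best-performing configuration reported above while keeping the memory footprint small enough for a single P100.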