---
title: Napolab Leaderboard
emoji: 🌎
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 5.38.2
app_file: app.py
pinned: true
python_version: '3.10'
tags:
  - nlp
  - portuguese
  - benchmarking
  - language-models
  - gradio
datasets:
  - ruanchaves/napolab
  - assin
  - assin2
  - ruanchaves/hatebr
  - ruanchaves/faquad-nli
short_description: The Natural Portuguese Language Benchmark
---

# Napolab Leaderboard - Gradio App

A comprehensive Gradio web application for exploring and benchmarking Portuguese language models using the Napolab dataset collection.

## Features

- 🏆 Benchmark Results: Single comprehensive table with one column per dataset and clickable model links
- 📈 Model Analysis: Radar chart showing model performance across all datasets
- ℹ️ About: Information about Napolab and citation details

## Installation

1. Navigate to the leaderboard directory:

   ```bash
   cd dev/napolab/leaderboard
   ```

2. Install the required dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Extract data from external sources (optional but recommended):

   ```bash
   # Extract data from Portuguese LLM Leaderboard
   python extract_portuguese_leaderboard.py

   # Download external models data
   python download_external_models.py
   ```

4. Run the Gradio app:

   ```bash
   python app.py
   ```

The app will be available at http://localhost:7860.

## Data Extraction Scripts

The leaderboard includes scripts to automatically extract and update data from external sources:

### `extract_portuguese_leaderboard.py`

This script extracts benchmark results from the Open Portuguese LLM Leaderboard:

- Fetches data from the Hugging Face Spaces leaderboard
- Updates the `portuguese_leaderboard.csv` file
- Includes both open-source and proprietary models
- Automatically handles data formatting and validation
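
For illustration, here is a minimal sketch of what this kind of extraction step can look like, assuming the leaderboard exposes its results as a downloadable CSV. The URL and the `model` column name below are placeholders, not the script's actual implementation:

```python
# Hedged sketch only -- not the real extract_portuguese_leaderboard.py.
# LEADERBOARD_CSV_URL and the "model" column name are assumptions.
import pandas as pd

LEADERBOARD_CSV_URL = "https://example.com/portuguese-llm-leaderboard.csv"

def extract_leaderboard(url: str = LEADERBOARD_CSV_URL) -> pd.DataFrame:
    """Fetch the remote results table and keep only well-formed rows."""
    df = pd.read_csv(url)             # pandas can read straight from a URL
    df = df.dropna(subset=["model"])  # drop rows without a model name
    score_cols = [c for c in df.columns if c != "model"]
    df[score_cols] = df[score_cols].apply(pd.to_numeric, errors="coerce")
    return df

if __name__ == "__main__":
    extract_leaderboard().to_csv("portuguese_leaderboard.csv", index=False)
```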

### `download_external_models.py`

This script downloads additional model data:

- Fetches model metadata from various sources
- Updates the `external_models.csv` file
- Includes model links and performance metrics
- Ensures data consistency with the main leaderboard
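
As a rough illustration of fetching model metadata from the Hub, the snippet below uses the `huggingface_hub` library; the repo list and CSV columns are assumptions, and the actual script may work differently:

```python
# Hedged sketch -- not the real download_external_models.py.
import csv
from huggingface_hub import HfApi

REPOS = ["neuralmind/bert-base-portuguese-cased"]  # example repo; real list assumed

def download_model_metadata(repos, out_path="external_models.csv"):
    api = HfApi()
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["model", "url", "downloads", "likes"])
        for repo_id in repos:
            info = api.model_info(repo_id)  # one network call per repo
            writer.writerow([repo_id, f"https://huggingface.co/{repo_id}",
                             info.downloads, info.likes])

if __name__ == "__main__":
    download_model_metadata(REPOS)
```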

**Note:** These scripts require an internet connection and may take a few minutes to complete. Run them periodically to keep the leaderboard data up to date.

## Usage

### Benchmark Results Tab

- Single comprehensive table: shows all models with one column per dataset
- Dataset columns: each dataset has its own column showing model performance scores
- Average column: shows the average performance across all datasets for each model
- Model column: clickable links to Hugging Face model pages
- Sorted results: models are sorted by overall average performance (descending)
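
A minimal sketch of how such a table can be assembled with pandas; the scores, model names, and column layout here are placeholders, not results from the leaderboard:

```python
import pandas as pd

# Placeholder scores -- illustrative only, not real benchmark results.
scores = pd.DataFrame(
    {"assin": [0.91, 0.88], "assin2": [0.93, 0.90]},
    index=["model-a", "model-b"],  # hypothetical model names
)

table = scores.copy()
table["Average"] = scores.mean(axis=1)                 # one average per model
table = table.sort_values("Average", ascending=False)  # best model first
# Markdown links render as clickable in a gr.Dataframe with datatype="markdown".
table.index = [f"[{m}](https://huggingface.co/{m})" for m in table.index]
print(table)
```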

### Model Analysis Tab

- Radar chart showing each model's performance across all datasets
- Default view: shows only the bertimbau-large and mdeberta-v3-base models
- Interactive legend: click to show/hide models, double-click to isolate one
- Each line represents one model; each point represents one dataset
- Color-coded by model architecture
- Interactive hover information with detailed performance metrics
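
The sketch below shows one way to draw such a radar chart with Plotly. This is a reasonable guess at the approach, not the app's actual code; the dataset names come from this README and the scores are placeholders:

```python
import plotly.graph_objects as go

datasets = ["ASSIN", "ASSIN 2", "HateBR", "FaQuAD-NLI"]

fig = go.Figure()
fig.add_trace(go.Scatterpolar(
    r=[0.90, 0.92, 0.85, 0.88],  # placeholder scores, one per dataset
    theta=datasets,
    fill="toself",
    name="model-a",              # hypothetical model name
))
fig.update_layout(polar=dict(radialaxis=dict(range=[0, 1])),
                  showlegend=True)  # click legend entries to toggle traces
fig.show()
```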

### Model Hub Tab

- Access links to pre-trained models on Hugging Face
- Models are organized by dataset and architecture type
- Direct links to model repositories

## Supported Datasets

The app includes all Napolab datasets:

- ASSIN: Semantic Similarity and Textual Entailment
- ASSIN 2: Semantic Similarity and Textual Entailment (v2)
- ReRelEM: Relation recognition between named entities
- HateBR: Hate Speech Detection
- ReLi-SA: Sentiment Analysis of book reviews
- FaQuAD-NLI: Natural Language Inference over question-answer pairs
- PorSimplesSent: Sentence Simplification (comparing the simplicity of sentence pairs)

## Model Architectures

The benchmark includes models based on:

- mDeBERTa v3: Multilingual DeBERTa v3
- BERT Large: Portuguese BERT (BERTimbau Large)
- BERT Base: Portuguese BERT (BERTimbau Base)

## Data Management

The app uses a single YAML configuration file (`data.yaml`) for all of its data, making it easy to edit and maintain.
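
Loading such a file is straightforward; a minimal sketch assuming PyYAML (the app's actual loading code may differ):

```python
import yaml

with open("data.yaml", encoding="utf-8") as f:
    data = yaml.safe_load(f)  # nested dicts/lists mirroring the YAML

print(list(data))  # top-level section names
```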

### Editing Data

Simply edit the `data.yaml` file to:

- Add new datasets
- Update benchmark results
- Add new models
- Modify model metadata

### Data Structure

The YAML file contains four main sections:

1. `datasets`: Information about each dataset
2. `benchmark_results`: Performance metrics for models on datasets
3. `model_metadata`: Model information (parameters, architecture, etc.)
4. `additional_models`: Additional models for the Model Hub
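
A hypothetical sketch of how these sections might be laid out; the field names are guesses based on the list above, not the file's actual schema:

```yaml
# Hypothetical layout -- field names and values are assumptions.
datasets:
  assin:
    display_name: ASSIN
    description: Semantic Similarity and Textual Entailment
benchmark_results:
  assin:
    model-a:
      accuracy: 0.90   # placeholder value
model_metadata:
  model-a:
    parameters: 110000000
    architecture: BERT Base
additional_models:
  - name: model-b
    url: https://huggingface.co/model-b
```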

### Data Management Tools

Use the `manage_data.py` script for data operations:

```bash
# Validate the data structure
python manage_data.py validate

# Add a new dataset
python manage_data.py add-dataset \
  --dataset-name "new_dataset" \
  --dataset-display-name "New Dataset" \
  --dataset-description "Description of the dataset" \
  --dataset-tasks "Classification" "Sentiment Analysis" \
  --dataset-url "https://huggingface.co/datasets/new_dataset"

# Add benchmark results
python manage_data.py add-benchmark \
  --dataset-name "assin" \
  --model-name "new-model" \
  --metrics "accuracy=0.92" "f1=0.91"

# Add model metadata
python manage_data.py add-model \
  --model-name "new-model" \
  --parameters 110000000 \
  --architecture "BERT Base" \
  --base-model "bert-base-uncased" \
  --task "Classification" \
  --huggingface-url "https://huggingface.co/new-model"
```

## Customization

To add new datasets or benchmark results:

1. Edit the `data.yaml` file directly, or
2. Use the `manage_data.py` script for structured additions.

The app will load the updated data the next time it is restarted.

## Troubleshooting

- Dataset loading errors: ensure you have an internet connection to access Hugging Face datasets
- Memory issues: reduce the number of samples in the Dataset Explorer
- Port conflicts: change the port in the `app.launch()` call, as shown in the sketch below
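
For example, a hedged sketch of changing the port, assuming `app.py` launches a Gradio app object named `demo` (`server_port` is a standard `launch()` parameter):

```python
# In app.py -- `demo` is an assumed variable name for the Gradio app.
demo.launch(server_name="0.0.0.0", server_port=7861)  # use 7861 instead of 7860
```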

## Contributing

Feel free to contribute by:

- Adding new datasets
- Improving visualizations
- Adding new features
- Reporting bugs

## License

This project follows the same license as the main Napolab repository.