Greek Climate News Classification

The cvcio/mediawatch-el-climate model is a fine-tuned RoBERTa-based model for Sequence Classification of Greek news articles, specifically for topics related to climate change and the environment.

It is part of the MediaWatch-EL project and is designed to automatically categorize Greek news content, enabling large-scale analysis of media coverage on this critical subject.


Model Overview

The model performs a single-label text classification task, assigning one of eight defined climate-related labels to a given text input.

  • Base Model: cvcio/roberta-el-news
      • This choice leverages a pre-trained language model optimized for the Greek language and news-related text.
  • Architecture: RobertaForSequenceClassification
  • Parameters: ~0.1B (F32 weights)
  • Language: Greek (el)
  • Task: Text Classification (categorization of climate-related news)

Classification Labels

The model classifies text into one of the following 8 labels, all of which represent distinct themes within climate and environmental reporting:

| Label | Greek Label | English Translation | Description |
|---|---|---|---|
| LABEL_0 | ΒΙΩΣΙΜΟΤΗΤΑ | Sustainability | Topics related to sustainable practices and development. |
| LABEL_1 | ΠΕΡΙΒΑΛΛΟΝ | Environment | Broad environmental topics, not solely climate. |
| LABEL_2 | ΚΛΙΜΑΤΙΚΗ ΑΛΛΑΓΗ | Climate Change | General mentions of climate change. |
| LABEL_3 | ΘΕΡΜΟΚΡΑΣΙΑ | Temperature | Specific mentions of temperature or heat-related events. |
| LABEL_4 | ΚΛΙΜΑΤΙΚΗ ΚΡΙΣΗ | Climate Crisis | Focus on the urgency or severity of the issue. |
| LABEL_5 | ΚΛΙΜΑ | Climate | Meteorological or general climate context. |
| LABEL_6 | ΡΥΠΑΝΣΗ | Pollution | Specific mentions of environmental contamination. |
| LABEL_7 | ΕΝΕΡΓΕΙΑ | Energy | Focus on energy sources, transitions, or policy. |
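If you work with raw model outputs rather than the pipeline, the class ids map to the Greek labels above. A small illustrative helper (the `ID2LABEL` dict simply mirrors the table; `label_name` is a hypothetical convenience function, not part of the released model):

```python
# Mapping of class ids (LABEL_0 .. LABEL_7) to Greek category names, per the table above.
ID2LABEL = {
    0: "ΒΙΩΣΙΜΟΤΗΤΑ",       # Sustainability
    1: "ΠΕΡΙΒΑΛΛΟΝ",        # Environment
    2: "ΚΛΙΜΑΤΙΚΗ ΑΛΛΑΓΗ",  # Climate Change
    3: "ΘΕΡΜΟΚΡΑΣΙΑ",       # Temperature
    4: "ΚΛΙΜΑΤΙΚΗ ΚΡΙΣΗ",   # Climate Crisis
    5: "ΚΛΙΜΑ",             # Climate
    6: "ΡΥΠΑΝΣΗ",           # Pollution
    7: "ΕΝΕΡΓΕΙΑ",          # Energy
}

def label_name(label_id: int) -> str:
    """Resolve a LABEL_k id to its Greek category name."""
    return ID2LABEL[label_id]
```

In practice the same mapping should also be available from the model config (`model.config.id2label`), which the pipeline uses to return label strings.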

Training Data and Annotation

Dataset

  • Size: Approximately 12,000 unique Greek news articles.
  • Source: The articles were collected from a wide range of Greek online media outlets.
  • Content: The full articles, not just titles, were used for fine-tuning.
  • Data Split: The dataset was split using an 80% training and 20% testing (evaluation) ratio.
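The exact split procedure is not published; a minimal sketch of a shuffled 80/20 split over ~12,000 articles, assuming the data is held as (text, label) pairs (the helper and seed are illustrative):

```python
import random

def train_test_split(examples, test_ratio=0.2, seed=42):
    """Shuffle the examples deterministically and split them into train/test lists."""
    rng = random.Random(seed)
    shuffled = examples[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

# Dummy stand-in for the ~12,000 annotated articles (text, label_id).
articles = [(f"article {i}", i % 8) for i in range(12000)]
train, test = train_test_split(articles)
print(len(train), len(test))  # 9600 2400
```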

Annotation Process

The dataset was annotated by a group of academics.

  • Methodology: The news articles were labeled with one of the 8 categories mentioned above.
  • Known Limitation (Annotation Quality): The annotation quality is acknowledged to be imperfect, owing to the inherent difficulty of the task, the complexity of journalistic text, and potential inter-annotator disagreement. This noise in the training data may limit the model's maximum achievable performance.

Training

The model was fine-tuned using a custom Python script (fine_tune_classifier.py) and the Hugging Face transformers library.

Key Hyperparameters

| Hyperparameter | Value |
|---|---|
| Base Model Checkpoint | cvcio/roberta-el-news |
| Number of Epochs | 4 |
| Batch Size (Train/Eval) | 64 |
| Weight Decay | 0.01 |
| Warmup Steps | 50 |
| Max Sequence Length | 512 |
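The fine-tuning script (fine_tune_classifier.py) is not published; the following is only a hedged sketch of how these hyperparameters map onto the Hugging Face `TrainingArguments` API. The `output_dir` is a placeholder, and the learning rate is not documented, so it is left at the library default:

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
)

# Hyperparameters from the table above.
training_args = TrainingArguments(
    output_dir="mediawatch-el-climate",   # placeholder
    num_train_epochs=4,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    weight_decay=0.01,
    warmup_steps=50,
)

# Base checkpoint with a fresh 8-label classification head.
model = AutoModelForSequenceClassification.from_pretrained(
    "cvcio/roberta-el-news", num_labels=8
)
tokenizer = AutoTokenizer.from_pretrained("cvcio/roberta-el-news")

# The max sequence length (512) is applied at tokenization time, e.g.:
# tokenizer(texts, truncation=True, max_length=512)
```

A `Trainer` would then wire the model, arguments, and tokenized train/eval splits together.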

Evaluation Metrics

  • Accuracy: The overall fraction of correct predictions.
  • F1-Score (Weighted): The F1 score, weighted by the number of true instances for each label. This is a critical metric for handling potential class imbalance in the dataset.
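These two metrics can be sketched in pure Python; the functions below are illustrative (the weighted F1 is intended to match scikit-learn's `f1_score(..., average="weighted")`, i.e. per-label F1 averaged with weights equal to each label's support):

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_f1(y_true, y_pred):
    """Per-label F1 scores averaged with weights = label support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label in support:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += (support[label] / total) * f1
    return score
```

Weighting by support matters here because the eight climate labels are unlikely to be equally frequent in news coverage.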

Intended Uses

This model is intended for research purposes and for automated media monitoring in the Greek language. Specific uses include:

  1. Categorization: Automatically classifying new Greek news articles into one of the 8 climate-related categories.
  2. Trend Analysis: Monitoring the frequency and shifts in media coverage across the different climate topics over time.
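For trend analysis, per-topic counts over time can be aggregated from the classifier's predictions. A minimal sketch (the `topic_trends` helper and the month/label pairs are hypothetical; in practice the labels would come from the pipeline output):

```python
from collections import Counter, defaultdict

def topic_trends(classified_articles):
    """Aggregate predicted labels per month; input is (month, label) pairs."""
    trends = defaultdict(Counter)
    for month, label in classified_articles:
        trends[month][label] += 1
    return trends

# Toy predictions keyed by publication month.
sample = [
    ("2024-01", "ΕΝΕΡΓΕΙΑ"),
    ("2024-01", "ΡΥΠΑΝΣΗ"),
    ("2024-02", "ΕΝΕΡΓΕΙΑ"),
]
trends = topic_trends(sample)
```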

Limitations and Biases

  • Annotation Quality: The primary limitation is the acknowledged "poor" quality of the academic annotations, which may lead to misclassifications, especially for ambiguous or intersectional articles.
  • Monolingual (Greek): The model is strictly intended for Greek language text.
  • Domain Specificity: It is fine-tuned only on news text. Its performance on other text types (e.g., social media, academic papers) will likely be lower.
  • General RoBERTa Biases: The model may inherit any biases present in the original cvcio/roberta-el-news pre-training data.

How to Use

You can use this model for sequence classification with the Hugging Face transformers library:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# Model ID on the Hugging Face Hub
model_name = "cvcio/mediawatch-el-climate"

# Load the model and the base model's tokenizer
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained("cvcio/roberta-el-news")

# Create a classification pipeline
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Example Greek text (a headline about water shortage and reservoir levels)
text = "Λειψυδρία: Σε ανησυχητικό επίπεδο η στάθμη του νερού σε Πηνειό και Μόρνο – Καμπανάκι ΕΥΔΑΠ για τα αποθέματα : Έχουμε λιγότερο από τα μισά του 2019"

# Run classification and print the result
result = classifier(text)
print(result)
# Expected output (e.g.): [{'label': 'ΚΛΙΜΑΤΙΚΗ ΑΛΛΑΓΗ', 'score': 0.98...}]
```