Greek Climate News Classification
The cvcio/mediawatch-el-climate model is a fine-tuned RoBERTa-based model for Sequence Classification of Greek news articles, specifically for topics related to climate change and the environment.
It is part of the MediaWatch-EL project and is designed to automatically categorize Greek news content, enabling large-scale analysis of media coverage on this critical subject.
Model Overview
The model performs a single-label text classification task, assigning one of eight defined climate-related labels to a given text input.
- Base Model: cvcio/roberta-el-news, a pre-trained language model optimized for Greek-language news text.
- Architecture: RobertaForSequenceClassification
- Language: Greek (el)
- Task: Text Classification (categorization of climate-related news)
Classification Labels
The model classifies text into one of the following 8 labels, all of which represent distinct themes within climate and environmental reporting:
| Label | Greek Label | English Translation | Description |
|---|---|---|---|
| LABEL_0 | ΒΙΩΣΙΜΟΤΗΤΑ | Sustainability | Topics related to sustainable practices and development. |
| LABEL_1 | ΠΕΡΙΒΑΛΛΟΝ | Environment | Broad environmental topics, not solely climate. |
| LABEL_2 | ΚΛΙΜΑΤΙΚΗ ΑΛΛΑΓΗ | Climate Change | General mention of climate change. |
| LABEL_3 | ΘΕΡΜΟΚΡΑΣΙΑ | Temperature | Specific mentions of temperature or heat-related events. |
| LABEL_4 | ΚΛΙΜΑΤΙΚΗ ΚΡΙΣΗ | Climate Crisis | Focus on the urgency or severity of the issue. |
| LABEL_5 | ΚΛΙΜΑ | Climate | Meteorological or general climate context. |
| LABEL_6 | ΡΥΠΑΝΣΗ | Pollution | Specific mentions of environmental contamination. |
| LABEL_7 | ΕΝΕΡΓΕΙΑ | Energy | Focus on energy sources, transitions, or policy. |
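The label table above can be expressed as an id-to-label mapping. The dictionary below is an illustrative sketch mirroring the table; the actual mapping shipped with the model would be exposed via `model.config.id2label`, which may differ from this hypothetical reconstruction.

```python
# Hypothetical id2label mapping reconstructed from the label table above.
ID2LABEL = {
    0: "ΒΙΩΣΙΜΟΤΗΤΑ",       # Sustainability
    1: "ΠΕΡΙΒΑΛΛΟΝ",        # Environment
    2: "ΚΛΙΜΑΤΙΚΗ ΑΛΛΑΓΗ",  # Climate Change
    3: "ΘΕΡΜΟΚΡΑΣΙΑ",       # Temperature
    4: "ΚΛΙΜΑΤΙΚΗ ΚΡΙΣΗ",   # Climate Crisis
    5: "ΚΛΙΜΑ",             # Climate
    6: "ΡΥΠΑΝΣΗ",           # Pollution
    7: "ΕΝΕΡΓΕΙΑ",          # Energy
}

# Inverse mapping, useful when preparing labeled training data.
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}
```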
Training Data and Annotation
Dataset
- Size: Approximately 12,000 unique Greek news articles.
- Source: The articles were collected from a wide range of Greek online media outlets.
- Content: The full articles, not just titles, were used for fine-tuning.
- Data Split: The dataset was split using an 80% training and 20% testing (evaluation) ratio.
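An 80/20 split like the one described above can be sketched with a seeded shuffle. This is an illustrative helper, not the project's actual splitting code; the `train_test_split` function and the seed are assumptions for the example.

```python
import random

def train_test_split(items, test_ratio=0.2, seed=42):
    """Shuffle a list deterministically and split it into train/test portions.

    Hypothetical helper illustrating the 80/20 split described above.
    """
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

# Placeholder stand-ins for the ~12,000 annotated articles.
articles = [f"article_{i}" for i in range(12_000)]
train, test = train_test_split(articles)
```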
Annotation Process
The dataset was annotated by a group of academics.
- Methodology: The news articles were labeled with one of the 8 categories mentioned above.
- Known Limitation (Poor Annotation): It is acknowledged that the annotation quality may not be optimal due to the inherent difficulty of the task, the complexity of journalistic text, and potential inter-annotator disagreement. This poor quality could introduce noise into the training data and potentially limit the model's maximum achievable performance.
Training
The model was fine-tuned using a custom Python script (fine_tune_classifier.py) and the Hugging Face transformers library.
Key Hyperparameters
| Hyperparameter | Value |
|---|---|
| Base Model Checkpoint | cvcio/roberta-el-news |
| Number of Epochs | 4 |
| Batch Size (Train/Eval) | 64 |
| Weight Decay | 0.01 |
| Warmup Steps | 50 |
| Max Sequence Length | 512 |
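The table above can be collected into a configuration dictionary of the kind typically passed to `transformers.TrainingArguments` in a fine-tuning script. The learning rate and other unstated options are deliberately omitted; only the values reported in the table appear here.

```python
# Sketch of the reported fine-tuning configuration; keys follow the
# transformers.TrainingArguments naming convention where one exists.
TRAINING_CONFIG = {
    "model_checkpoint": "cvcio/roberta-el-news",
    "num_train_epochs": 4,
    "per_device_train_batch_size": 64,
    "per_device_eval_batch_size": 64,
    "weight_decay": 0.01,
    "warmup_steps": 50,
    "max_seq_length": 512,  # tokenizer truncation length, not a TrainingArguments key
}
```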
Evaluation Metrics
- Accuracy: The overall fraction of correct predictions.
- F1-Score (Weighted): The F1 score, weighted by the number of true instances for each label. This is a critical metric for handling potential class imbalance in the dataset.
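A metrics function computing both quantities is sketched below, in the shape expected by the Hugging Face `Trainer`'s `compute_metrics` hook. This is an assumed implementation based on the two metrics listed above, using scikit-learn's standard scorers.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    """Compute accuracy and weighted F1 from (logits, labels), as a Trainer hook."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)  # pick the highest-scoring label per example
    return {
        "accuracy": accuracy_score(labels, preds),
        # Weighted averaging accounts for class imbalance across the 8 labels.
        "f1_weighted": f1_score(labels, preds, average="weighted"),
    }
```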
Intended Uses
This model is intended for research purposes and for automated media monitoring in the Greek language. Specific uses include:
- Categorization: Automatically classifying new Greek news articles into one of the 8 climate-related categories.
- Trend Analysis: Monitoring the frequency and shifts in media coverage across the different climate topics over time.
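Trend analysis of the kind described above amounts to counting predicted labels per time bucket. The sketch below uses hypothetical classified articles and a plain `Counter`; in practice the labels would come from the model's predictions over a dated corpus.

```python
from collections import Counter
from datetime import date

# Hypothetical classified articles: (publication date, predicted label).
predictions = [
    (date(2023, 7, 1), "ΘΕΡΜΟΚΡΑΣΙΑ"),
    (date(2023, 7, 12), "ΚΛΙΜΑΤΙΚΗ ΚΡΙΣΗ"),
    (date(2023, 8, 3), "ΘΕΡΜΟΚΡΑΣΙΑ"),
    (date(2023, 8, 20), "ΕΝΕΡΓΕΙΑ"),
]

# Count coverage per (year, month) and label to track shifts over time.
monthly = Counter(((d.year, d.month), label) for d, label in predictions)
```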
Limitations and Biases
- Annotation Quality: The primary limitation is the acknowledged "poor" quality of the academic annotations, which may lead to misclassifications, especially for ambiguous or intersectional articles.
- Monolingual (Greek): The model is strictly intended for Greek language text.
- Domain Specificity: It is fine-tuned only on news text. Its performance on other text types (e.g., social media, academic papers) will likely be lower.
- General RoBERTa Biases: The model may inherit any biases present in the cvcio/roberta-el-news pre-training data.
How to Use
You can use this model for sequence classification with the Hugging Face transformers library:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# Model ID on the Hugging Face Hub
model_name = "cvcio/mediawatch-el-climate"

# Load the model and the base model's tokenizer
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained("cvcio/roberta-el-news")

# Create a classification pipeline
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Example Greek headline ("Water scarcity: Water levels in Pineios and Mornos
# at worrying levels; EYDAP warns reserves are less than half of 2019 levels")
text = "Λειψυδρία: Σε ανησυχητικό επίπεδο η στάθμη του νερού σε Πηνειό και Μόρνο – Καμπανάκι ΕΥΔΑΠ για τα αποθέματα : Έχουμε λιγότερο από τα μισά του 2019"

# Run classification and print the result
result = classifier(text)
print(result)
# Expected output (e.g.): [{'label': 'ΚΛΙΜΑΤΙΚΗ ΑΛΛΑΓΗ', 'score': 0.98...}]
```