Greek Climate News Classification
The cvcio/mediawatch-el-climate model is a fine-tuned RoBERTa-based model for Sequence Classification of Greek news articles, specifically for topics related to climate change and the environment.
It is part of the MediaWatch-EL project and is designed to automatically categorize Greek news content, enabling large-scale analysis of media coverage on this critical subject.
Model Overview
The model performs a single-label text classification task, assigning one of eight defined climate-related labels to a given text input.
- Base Model: cvcio/roberta-el-news, a pre-trained language model optimized for Greek-language news text.
- Architecture: RobertaForSequenceClassification
- Language: Greek (el)
- Task: Text Classification (categorization of climate-related news)
Classification Labels
The model classifies text into one of the following 8 labels, all of which represent distinct themes within climate and environmental reporting:
| Label | Greek Label | English Translation | Description |
|---|---|---|---|
| LABEL_0 | ΒΙΩΣΙΜΟΤΗΤΑ | Sustainability | Topics related to sustainable practices and development. |
| LABEL_1 | ΠΕΡΙΒΑΛΛΟΝ | Environment | Broad environmental topics, not solely climate. |
| LABEL_2 | ΚΛΙΜΑΤΙΚΗ ΑΛΛΑΓΗ | Climate Change | General mention of climate change. |
| LABEL_3 | ΘΕΡΜΟΚΡΑΣΙΑ | Temperature | Specific mentions of temperature or heat-related events. |
| LABEL_4 | ΚΛΙΜΑΤΙΚΗ ΚΡΙΣΗ | Climate Crisis | Focus on the urgency or severity of the issue. |
| LABEL_5 | ΚΛΙΜΑ | Climate | Meteorological or general climate context. |
| LABEL_6 | ΡΥΠΑΝΣΗ | Pollution | Specific mentions of environmental contamination. |
| LABEL_7 | ΕΝΕΡΓΕΙΑ | Energy | Focus on energy sources, transitions, or policy. |
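The label table above can be expressed as an id-to-label mapping. The dictionary below is an illustrative sketch mirroring the table; the actual mapping shipped with the model would be exposed via `model.config.id2label`, which may differ from this hypothetical reconstruction.

```python
# Hypothetical id2label mapping reconstructed from the label table above.
ID2LABEL = {
    0: "ΒΙΩΣΙΜΟΤΗΤΑ",       # Sustainability
    1: "ΠΕΡΙΒΑΛΛΟΝ",        # Environment
    2: "ΚΛΙΜΑΤΙΚΗ ΑΛΛΑΓΗ",  # Climate Change
    3: "ΘΕΡΜΟΚΡΑΣΙΑ",       # Temperature
    4: "ΚΛΙΜΑΤΙΚΗ ΚΡΙΣΗ",   # Climate Crisis
    5: "ΚΛΙΜΑ",             # Climate
    6: "ΡΥΠΑΝΣΗ",           # Pollution
    7: "ΕΝΕΡΓΕΙΑ",          # Energy
}

# Inverse mapping, useful when preparing labeled training data.
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}
```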
Training Data and Annotation
Dataset
- Size: Approximately 12,000 unique Greek news articles.
- Source: The articles were collected from a wide range of Greek online media outlets.
- Content: The full articles, not just titles, were used for fine-tuning.
- Data Split: The dataset was split using an 80% training and 20% testing (evaluation) ratio.
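An 80/20 split like the one described above can be sketched with a seeded shuffle. This is an illustrative helper, not the project's actual splitting code; the `train_test_split` function and the seed are assumptions for the example.

```python
import random

def train_test_split(items, test_ratio=0.2, seed=42):
    """Shuffle a list deterministically and split it into train/test portions.

    Hypothetical helper illustrating the 80/20 split described above.
    """
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

# Placeholder stand-ins for the ~12,000 annotated articles.
articles = [f"article_{i}" for i in range(12_000)]
train, test = train_test_split(articles)
```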
Annotation Process
The dataset was annotated by a group of academics.
- Methodology: The news articles were labeled with one of the 8 categories mentioned above.
- Known Limitation (Poor Annotation): It is acknowledged that the annotation quality may not be optimal due to the inherent difficulty of the task, the complexity of journalistic text, and potential inter-annotator disagreement. This poor quality could introduce noise into the training data and potentially limit the model's maximum achievable performance.
Training
The model was fine-tuned using a custom Python script (fine_tune_classifier.py) and the Hugging Face transformers library.
Key Hyperparameters
| Hyperparameter | Value |
|---|---|
| Base Model Checkpoint | cvcio/roberta-el-news |
| Number of Epochs | 4 |
| Batch Size (Train/Eval) | 64 |
| Weight Decay | 0.01 |
| Warmup Steps | 50 |
| Max Sequence Length | 512 |
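The table above can be collected into a configuration dictionary of the kind typically passed to `transformers.TrainingArguments` in a fine-tuning script. The learning rate and other unstated options are deliberately omitted; only the values reported in the table appear here.

```python
# Sketch of the reported fine-tuning configuration; keys follow the
# transformers.TrainingArguments naming convention where one exists.
TRAINING_CONFIG = {
    "model_checkpoint": "cvcio/roberta-el-news",
    "num_train_epochs": 4,
    "per_device_train_batch_size": 64,
    "per_device_eval_batch_size": 64,
    "weight_decay": 0.01,
    "warmup_steps": 50,
    "max_seq_length": 512,  # tokenizer truncation length, not a TrainingArguments key
}
```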
Evaluation Metrics
- Accuracy: The overall fraction of correct predictions.
- F1-Score (Weighted): The F1 score, weighted by the number of true instances for each label. This is a critical metric for handling potential class imbalance in the dataset.
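A metrics function computing both quantities is sketched below, in the shape expected by the Hugging Face `Trainer`'s `compute_metrics` hook. This is an assumed implementation based on the two metrics listed above, using scikit-learn's standard scorers.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    """Compute accuracy and weighted F1 from (logits, labels), as a Trainer hook."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)  # pick the highest-scoring label per example
    return {
        "accuracy": accuracy_score(labels, preds),
        # Weighted averaging accounts for class imbalance across the 8 labels.
        "f1_weighted": f1_score(labels, preds, average="weighted"),
    }
```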
Intended Uses
This model is intended for research purposes and for automated media monitoring in the Greek language. Specific uses include:
- Categorization: Automatically classifying new Greek news articles into one of the 8 climate-related categories.
- Trend Analysis: Monitoring the frequency and shifts in media coverage across the different climate topics over time.
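Trend analysis of the kind described above amounts to counting predicted labels per time bucket. The sketch below uses hypothetical classified articles and a plain `Counter`; in practice the labels would come from the model's predictions over a dated corpus.

```python
from collections import Counter
from datetime import date

# Hypothetical classified articles: (publication date, predicted label).
predictions = [
    (date(2023, 7, 1), "ΘΕΡΜΟΚΡΑΣΙΑ"),
    (date(2023, 7, 12), "ΚΛΙΜΑΤΙΚΗ ΚΡΙΣΗ"),
    (date(2023, 8, 3), "ΘΕΡΜΟΚΡΑΣΙΑ"),
    (date(2023, 8, 20), "ΕΝΕΡΓΕΙΑ"),
]

# Count coverage per (year, month) and label to track shifts over time.
monthly = Counter(((d.year, d.month), label) for d, label in predictions)
```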
Limitations and Biases
- Annotation Quality: The primary limitation is the acknowledged "poor" quality of the academic annotations, which may lead to misclassifications, especially for ambiguous or intersectional articles.
- Monolingual (Greek): The model is strictly intended for Greek language text.
- Domain Specificity: It is fine-tuned only on news text. Its performance on other text types (e.g., social media, academic papers) will likely be lower.
- General RoBERTa Biases: The model may inherit any biases present in the cvcio/roberta-el-news pre-training data.
How to Use
You can use this model for sequence classification with the Hugging Face transformers library:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# Model ID on the Hugging Face Hub
model_name = "cvcio/mediawatch-el-climate"

# Load the model and the base model's tokenizer
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained("cvcio/roberta-el-news")

# Create a classification pipeline
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Example Greek headline ("Water scarcity: Water levels in Pineios and Mornos
# at worrying levels; EYDAP warns reserves are less than half of 2019 levels")
text = "Λειψυδρία: Σε ανησυχητικό επίπεδο η στάθμη του νερού σε Πηνειό και Μόρνο – Καμπανάκι ΕΥΔΑΠ για τα αποθέματα : Έχουμε λιγότερο από τα μισά του 2019"

# Run classification and print the result
result = classifier(text)
print(result)
# Expected output (e.g.): [{'label': 'ΚΛΙΜΑΤΙΚΗ ΑΛΛΑΓΗ', 'score': 0.98...}]
```