---
license: gpl-3.0
tags:
- greek
- text-classification
- sequence-classification
- roberta
- fine-tuned
- climate-change
- mediawatch
- greek-news
pipeline_tag: text-classification
datasets:
- custom-greek-climate-news-dataset
model_name: mediawatch-el-climate
base_model: cvcio/roberta-el-news
widget:
- text: "Η **κλιματική κρίση** και η **ρύπανση** των ωκεανών απασχόλησε τη σύνοδο κορυφής."
  example_title: Example Text in Greek
---

# Greek Climate News Classification

The **`cvcio/mediawatch-el-climate`** model is a fine-tuned RoBERTa-based model for **Sequence Classification** of Greek news articles, specifically for topics related to **climate change** and the **environment**.

It is part of the MediaWatch-EL project and is designed to automatically categorize Greek news content, enabling large-scale analysis of media coverage on this critical subject.

---

## Model Overview

The model performs a **single-label text classification** task, assigning one of eight defined climate-related labels to a given text input.

* **Base Model:** **`cvcio/roberta-el-news`**
    * This choice leverages a pre-trained language model optimized for the Greek language and news-related text.
* **Architecture:** **`RobertaForSequenceClassification`**
* **Language:** **Greek (el)**
* **Task:** Text Classification (Categorization of climate-related news).

### **Classification Labels**

The model classifies text into one of the following 8 labels, all of which represent distinct themes within climate and environmental reporting:

| Label | Greek Label | English Translation | Description |
| :--- | :--- | :--- | :--- |
| **LABEL_0** | **ΒΙΩΣΙΜΟΤΗΤΑ** | Sustainability | Topics related to sustainable practices and development. |
| **LABEL_1** | **ΠΕΡΙΒΑΛΛΟΝ** | Environment | Broad environmental topics, not solely climate. |
| **LABEL_2** | **ΚΛΙΜΑΤΙΚΗ ΑΛΛΑΓΗ** | Climate Change | General mention of climate change. |
| **LABEL_3** | **ΘΕΡΜΟΚΡΑΣΙΑ** | Temperature | Specific mentions of temperature or heat-related events. |
| **LABEL_4** | **ΚΛΙΜΑΤΙΚΗ ΚΡΙΣΗ** | Climate Crisis | Focus on the urgency or severity of the issue. |
| **LABEL_5** | **ΚΛΙΜΑ** | Climate | Meteorological or general climate context. |
| **LABEL_6** | **ΡΥΠΑΝΣΗ** | Pollution | Specific mentions of environmental contamination. |
| **LABEL_7** | **ΕΝΕΡΓΕΙΑ** | Energy | Focus on energy sources, transitions, or policy. |

---

## Training Data and Annotation

### **Dataset**

* **Size:** Approximately **12,000 unique Greek news articles**.
* **Source:** The articles were collected from a wide range of Greek online media outlets.
* **Content:** The full articles, not just titles, were used for fine-tuning.
* **Data Split:** The dataset was split using an **80% training** and **20% testing (evaluation)** ratio.

### **Annotation Process**

The dataset was annotated by a group of academics.

* **Methodology:** The news articles were labeled with one of the 8 categories mentioned above.
* **Known Limitation (Poor Annotation):** It is acknowledged that the annotation quality may not be optimal due to the inherent difficulty of the task, the complexity of journalistic text, and potential inter-annotator disagreement. This poor quality could introduce noise into the training data and potentially limit the model's maximum achievable performance.

---

## Training

The model was fine-tuned using a **custom Python script (`fine_tune_classifier.py`)** and the Hugging Face `transformers` library.

### **Key Hyperparameters**

| Hyperparameter | Value |
| :--- | :--- |
| **Base Model Checkpoint** | `cvcio/roberta-el-news` |
| **Number of Epochs** | 4 |
| **Batch Size (Train/Eval)** | 64 |
| **Weight Decay** | 0.01 |
| **Warmup Steps** | 50 |
| **Max Sequence Length** | 512 |

### **Evaluation Metrics**

* **Accuracy:** The overall fraction of correct predictions.
* **F1-Score (Weighted):** The per-label F1 score, averaged with each label weighted by its number of true instances. It complements accuracy when the classes in the dataset are imbalanced.

---

## Intended Use and Limitations

This model is intended for **research purposes** and for automated **media monitoring** in the Greek language. Specific uses include:

1. **Categorization:** Automatically classifying new Greek news articles into one of the 8 climate-related categories.
2. **Trend Analysis:** Monitoring the frequency of, and shifts in, media coverage across the different climate topics over time.

### **Limitations and Biases**

* **Annotation Quality:** The primary limitation is the acknowledged "poor" quality of the academic annotations, which may lead to misclassifications, especially for ambiguous articles or articles that span several categories.
* **Monolingual (Greek):** The model is strictly intended for Greek language text.
* **Domain Specificity:** It is fine-tuned only on news text. Its performance on other text types (e.g., social media, academic papers) will likely be lower.
* **General RoBERTa Biases:** The model may inherit any biases present in the original **`cvcio/roberta-el-news`** pre-training data.
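
As a companion to the two evaluation metrics named above (accuracy and weighted F1), the following is a minimal sketch of how they are computed from reference and predicted labels. The function names and the toy labels are illustrative assumptions; they are not taken from `fine_tune_classifier.py` and the numbers are not real evaluation results.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-label F1, averaged with weights equal to each label's support."""
    support = Counter(y_true)  # number of true instances per label
    total = 0.0
    for label, n_true in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        n_pred = sum(p == label for p in y_pred)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * n_true
    return total / len(y_true)


# Toy, made-up labels purely for illustration (not real evaluation data)
y_true = ["ΕΝΕΡΓΕΙΑ", "ΕΝΕΡΓΕΙΑ", "ΕΝΕΡΓΕΙΑ", "ΡΥΠΑΝΣΗ"]
y_pred = ["ΕΝΕΡΓΕΙΑ", "ΕΝΕΡΓΕΙΑ", "ΡΥΠΑΝΣΗ", "ΡΥΠΑΝΣΗ"]

print(accuracy(y_true, y_pred))     # 0.75
print(weighted_f1(y_true, y_pred))  # approximately 0.767
```

Because the weights follow the label frequencies, frequent labels dominate the weighted score, so it can look healthy even when a rare label is classified poorly.
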

---

## How to Use

You can easily use this model for sequence classification with the Hugging Face `transformers` library:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# Model ID on Hugging Face Hub
model_name = "cvcio/mediawatch-el-climate"

# Load model and tokenizer (using the base model's tokenizer)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained("cvcio/roberta-el-news")

# Create a classification pipeline
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Example Greek text
text = "Λειψυδρία: Σε ανησυχητικό επίπεδο η στάθμη του νερού σε Πηνειό και Μόρνο – Καμπανάκι ΕΥΔΑΠ για τα αποθέματα : Έχουμε λιγότερο από τα μισά του 2019"

# Run classification
result = classifier(text)

# Print the result
print(result)
# Expected output (e.g.): [{'label': 'ΚΛΙΜΑΤΙΚΗ ΑΛΛΑΓΗ', 'score': 0.98...}]
```
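
For the trend-analysis use case, predicted labels can be aggregated over time. The sketch below is one possible approach, not part of the MediaWatch-EL codebase: the function `monthly_label_counts` and the sample records are hypothetical, and it assumes each article has an ISO-formatted publication date paired with the top label from the classifier output (for example `classifier(article_text)[0]["label"]`).

```python
from collections import Counter, defaultdict


def monthly_label_counts(records):
    """Count predicted labels per calendar month.

    `records` is an iterable of (iso_date, label) pairs, where `iso_date`
    is a "YYYY-MM-DD" string and `label` is the predicted category.
    """
    counts = defaultdict(Counter)
    for iso_date, label in records:
        month = iso_date[:7]  # "YYYY-MM-DD" -> "YYYY-MM"
        counts[month][label] += 1
    # Return plain dicts, ordered chronologically by month
    return {month: dict(c) for month, c in sorted(counts.items())}


# Hypothetical predictions, made up for illustration only
records = [
    ("2024-06-03", "ΘΕΡΜΟΚΡΑΣΙΑ"),
    ("2024-06-17", "ΘΕΡΜΟΚΡΑΣΙΑ"),
    ("2024-06-21", "ΕΝΕΡΓΕΙΑ"),
    ("2024-07-02", "ΚΛΙΜΑΤΙΚΗ ΚΡΙΣΗ"),
]

for month, labels in monthly_label_counts(records).items():
    print(month, labels)
```

Given the annotation-noise caveat above, counts like these are better read as approximate indicators of coverage than as exact figures.
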