Commit 0403b6d by ragkasi · verified · 1 Parent(s): 677d607

Upload 20 files

.gitattributes ADDED
@@ -0,0 +1,3 @@
+ data/raw/fake-and-real-news-2/fake_and_real_news.csv filter=lfs diff=lfs merge=lfs -text
+ data/raw/fake-and-real-news/Fake.csv filter=lfs diff=lfs merge=lfs -text
+ data/raw/fake-and-real-news/True.csv filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,121 @@
+ venv
+ models
+
+ # Python virtual environments
+ venv/
+ .venv/
+ env/
+ .env/
+ ENV/
+ env.bak/
+ venv.bak/
+
+ # Python cache and compiled files
+ __pycache__/
+ *.py[cod]
+ *$py.class
+ *.so
+ .Python
+ build/
+ develop-eggs/
+ dist/
+ downloads/
+ eggs/
+ .eggs/
+ lib/
+ lib64/
+ parts/
+ sdist/
+ var/
+ wheels/
+ share/python-wheels/
+ *.egg-info/
+ .installed.cfg
+ *.egg
+ MANIFEST
+
+ # Jupyter Notebook checkpoints
+ .ipynb_checkpoints/
+ notebooks/.ipynb_checkpoints/
+ */.ipynb_checkpoints/
+
+ # Model files and training outputs
+ models/
+ *.h5
+ *.hdf5
+ *.pkl
+ *.pickle
+ *.joblib
+ *.pt
+ *.pth
+ *.ckpt
+ *.safetensors
+ wandb/
+ tensorboard_logs/
+ mlruns/
+
+ # Data files (uncomment if you have large datasets)
+ # data/
+ # *.csv
+ # *.json
+ # *.parquet
+ # *.feather
+
+ # Training and evaluation results
+ results/
+ outputs/
+ logs/
+ *.log
+
+ # Environment variables and configuration
+ .env
+ .env.local
+ .env.*.local
+ config.ini
+ secrets.json
+
+ # IDE and editor files
+ .vscode/
+ .idea/
+ *.swp
+ *.swo
+ *~
+ .DS_Store
+ Thumbs.db
+ .project
+ .pydevproject
+
+ # Temporary files
+ *.tmp
+ *.temp
+ temp/
+ tmp/
+
+ # Coverage and testing
+ .coverage
+ .pytest_cache/
+ .tox/
+ .nox/
+ .coverage.*
+ htmlcov/
+ .cache
+
+ # Documentation builds
+ docs/_build/
+ .sphinx/
+
+ # PyInstaller
+ *.manifest
+ *.spec
+
+ # Unit test / coverage reports
+ .coverage
+ .pytest_cache/
+ cover/
+
+ # Rope project settings
+ .ropeproject
+
+ # Spyder project settings
+ .spyderproject
+ .spyproject
.streamlit/config.toml ADDED
@@ -0,0 +1,23 @@
+ [server]
+ # Disable file watcher to prevent PyTorch compatibility issues
+ fileWatcherType = "none"
+
+ # Disable usage stats collection
+ gatherUsageStats = false
+
+ # Run headless: do not open a browser window on startup
+ headless = true
+
+ # Disable CORS protection for local development
+ enableCORS = false
+
+ [browser]
+ # Disable usage stats collection
+ gatherUsageStats = false
+
+ [theme]
+ # Optional: set a nice theme
+ primaryColor = "#1f77b4"
+ backgroundColor = "#ffffff"
+ secondaryBackgroundColor = "#f0f2f6"
+ textColor = "#262730"
README.md ADDED
@@ -0,0 +1,165 @@
+ # Fake News Detection Project
+
+ A machine learning project that classifies news articles as real or fake using both traditional NLP techniques and transformer models.
+
+ ## 🎯 Project Overview
+
+ This project implements multiple approaches to detect fake news:
+ - **Traditional ML**: TF-IDF vectorization with Logistic Regression
+ - **Deep Learning**: Fine-tuned BERT model for sequence classification
+
+ ## 📊 Performance Results
+
+ ### TF-IDF + Logistic Regression Model
+ - **Accuracy**: 98.62%
+ - **F1 Score**: 98.67%
+
+ #### Detailed Classification Report:
+ ```
+               precision    recall  f1-score   support
+
+            0       0.98      0.99      0.99      4284  (Real News)
+            1       0.99      0.98      0.99      4696  (Fake News)
+
+     accuracy                           0.99      8980
+    macro avg       0.99      0.99      0.99      8980
+ weighted avg       0.99      0.99      0.99      8980
+ ```
+
+ ## 📁 Project Structure
+
+ ```
+ FakeNewsDetector/
+ ├── README.md
+ ├── requirements.txt
+ ├── notebooks/
+ │   └── FakeNewsClassifier_HuggingFace.ipynb
+ ├── scripts/
+ │   └── train.py
+ ├── models/
+ │   └── bert-fake-news/ (generated after training)
+ ├── data/
+ ├── app/
+ └── venv/
+ ```
+
+ ## 🚀 Quick Start
+
+ ### 1. Clone and Setup
+ ```bash
+ git clone <repository-url>
+ cd FakeNewsDetector
+ ```
+
+ ### 2. Create Virtual Environment
+ ```bash
+ python -m venv venv
+
+ # Windows PowerShell
+ .\venv\Scripts\Activate.ps1
+
+ # Windows CMD
+ .\venv\Scripts\activate.bat
+
+ # Git Bash
+ source venv/Scripts/activate
+ ```
+
+ ### 3. Install Dependencies
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ ### 4. Launch Jupyter Notebook
+ ```bash
+ jupyter notebook
+ ```
+
+ ## 📚 Dataset
+
+ The project uses the `mrm8488/fake-news` dataset from Hugging Face, which contains:
+ - **Total articles**: 44,898
+ - **Training split**: 80% (35,918 articles)
+ - **Test split**: 20% (8,980 articles)
+ - **Classes**:
+   - 0: Real News
+   - 1: Fake News
+
+ ## 🔧 Models Implemented
+
+ ### 1. TF-IDF + Logistic Regression
+ - **Vectorizer**: TF-IDF with 5,000 max features, n-grams (1,2)
+ - **Classifier**: Logistic Regression with balanced class weights
+ - **Performance**: 98.62% accuracy
+
+ ### 2. BERT Fine-tuning
+ - **Base Model**: `bert-base-uncased`
+ - **Training**: 3 epochs with evaluation per epoch
+ - **Optimizer**: AdamW with learning rate 2e-5
+ - **Batch Size**: 8 per device
+
+ ## 🛠️ Usage
+
+ ### Running the Notebook
+ 1. Ensure your virtual environment is activated
+ 2. Start Jupyter: `jupyter notebook`
+ 3. Open `notebooks/FakeNewsClassifier_HuggingFace.ipynb`
+ 4. Make sure the kernel is set to "venv" or "FakeNewsDetector (venv)"
+ 5. Run all cells
+
+ ### Training BERT Model
+ ```bash
+ python scripts/train.py
+ ```
+
+ `scripts/train.py` saves the trained model to `models/bert-liar-fake-news/`.
+
+ ## 📋 Requirements
+
+ - Python 3.8+
+ - pandas
+ - scikit-learn
+ - datasets (Hugging Face)
+ - transformers
+ - torch
+ - matplotlib
+ - seaborn
+ - jupyter
+ - ipywidgets
+
+ ## 🎯 Key Features
+
+ - **High Accuracy**: Achieves 98.6% accuracy on the test set
+ - **Multiple Approaches**: Compares traditional ML vs. transformer models
+ - **Easy Setup**: Simple virtual environment setup
+ - **Comprehensive Analysis**: Includes confusion matrix and detailed metrics
+ - **Deployable**: Trained models can be saved and reloaded for inference
+
+ ## 🔍 Model Analysis
+
+ The TF-IDF + Logistic Regression model performs well on the held-out test set:
+ - **Balanced Performance**: Precision and recall are 0.98 or higher for both classes
+ - **Low False Positives**: 99% precision for fake news (few real articles flagged as fake)
+ - **Low False Negatives**: 98% recall for fake news (few fake articles missed)
+ - **Robust**: Balanced class weights offset the mild class imbalance (4,284 real vs. 4,696 fake test articles)
+
+ ## 🚀 Future Improvements
+
+ - [ ] Implement ensemble methods combining multiple models
+ - [ ] Add cross-validation for more robust evaluation
+ - [ ] Experiment with other transformer models (RoBERTa, DistilBERT)
+ - [ ] Deploy model as a web API
+ - [ ] Add real-time news article classification
+ - [ ] Implement explainability features (LIME, SHAP)
+
+ ## 🤝 Contributing
+
+ Contributions, issues, and feature requests are welcome! Feel free to check the [issues page](../../issues).
+
+ ## 📧 Contact
+
+ For questions or suggestions, please open an issue or contact the project maintainer.
+
+ ---
+
+ **Note**: This project is for educational and research purposes. Always verify news from multiple reliable sources.
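The precision and recall figures quoted in the README's Model Analysis section follow from confusion-matrix counts. As a minimal sketch, the helper below derives them for the positive (fake) class. The counts passed in are hypothetical placeholders chosen only for illustration, not the project's actual confusion matrix, and `binary_metrics` is a name introduced here.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall and F1 for the positive class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts (NOT taken from the project): 90 fake articles caught,
# 10 real articles wrongly flagged, 5 fake articles missed, 95 real kept.
metrics = binary_metrics(tp=90, fp=10, fn=5, tn=95)
print(metrics)
```

Precision drops as false positives (real articles flagged as fake) increase, while recall drops as false negatives (fake articles missed) increase, which is why the README reports both.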
app/app.py ADDED
@@ -0,0 +1,36 @@
+ import streamlit as st
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
+
+ # Set page config
+ st.set_page_config(page_title="Fake News Detector", page_icon="📰")
+
+ # Hugging Face model path (change this to your actual repo ID)
+ MODEL_DIR = "ragkasi/bert-fake-news"
+
+ @st.cache_resource
+ def load_pipeline():
+     tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
+     model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
+     return pipeline("text-classification", model=model, tokenizer=tokenizer)
+
+ classifier = load_pipeline()
+
+ # UI
+ st.title("📰 Fake News Detector")
+ st.markdown("Enter a news **headline** or **statement**, and this app will predict if it's **real** or **fake**.")
+
+ news_input = st.text_area("✏️ News Text", height=150)
+
+ if st.button("🔍 Check News"):
+     if news_input.strip():
+         result = classifier(news_input, truncation=True)[0]  # truncate inputs longer than the model limit
+         label = result["label"]
+         score = result["score"]
+
+         # Adjust label display
+         if label == "LABEL_1":
+             st.error(f"🚨 Likely **Fake News** (Confidence: `{score:.2f}`)")
+         else:
+             st.success(f"✅ Likely **Real News** (Confidence: `{score:.2f}`)")
+     else:
+         st.warning("⚠️ Please enter a news statement.")
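The app branches on the raw `LABEL_1` string returned by the pipeline. A small mapping helper can make that convention explicit. This is only a sketch, under the assumption (taken from the README) that class 1 is fake and class 0 is real. `ID2NAME` and `describe_prediction` are hypothetical names, and the sample dictionary is a made-up pipeline result.

```python
# Assumed label convention (from the README): 0 = real, 1 = fake.
ID2NAME = {"LABEL_0": "Real News", "LABEL_1": "Fake News"}


def describe_prediction(result: dict) -> str:
    """Turn one pipeline result, e.g. {'label': 'LABEL_1', 'score': 0.97}, into text."""
    name = ID2NAME.get(result["label"], f"Unknown label {result['label']}")
    return f"Likely {name} (confidence: {result['score']:.2f})"


# Made-up example result, in the shape a text-classification pipeline returns.
message = describe_prediction({"label": "LABEL_1", "score": 0.9712})
print(message)
```

Keeping the mapping in one dictionary means a model with different label names only requires changing `ID2NAME`, not the display logic.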
data/raw/fake-and-real-news-2/fake_and_real_news.csv ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1c27a9eba250b78e392310f35876fe47623768d447715d8784850f91936539be
+ size 25876225
data/raw/fake-and-real-news/Fake.csv ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:bebf8bcfe95678bf2c732bf413a2ce5f621af0102c82bf08083b2e5d3c693d0c
+ size 62789876
data/raw/fake-and-real-news/True.csv ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ba0844414a65dc6ae7402b8eee5306da24b6b56488d6767135af466c7dcb2775
+ size 53582940
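Each of the three CSV entries above is a Git LFS pointer rather than the data itself: a short text file holding the spec version, the SHA-256 object ID, and the size in bytes. As a rough sketch, such a pointer can be parsed with the standard library alone. `parse_lfs_pointer` is a hypothetical helper written for this illustration, not part of Git LFS.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid is written as '<hash-algorithm>:<hex digest>'.
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:ba0844414a65dc6ae7402b8eee5306da24b6b56488d6767135af466c7dcb2775
size 53582940
"""
info = parse_lfs_pointer(pointer)
print(info["algorithm"], info["size_bytes"])
```

The size field shows that `True.csv` is roughly 53.6 MB, which is why it is tracked through LFS instead of being committed directly.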
data/raw/liar/README ADDED
@@ -0,0 +1,41 @@
+ LIAR: A BENCHMARK DATASET FOR FAKE NEWS DETECTION
+
+ William Yang Wang, "Liar, Liar Pants on Fire": A New Benchmark Dataset for Fake News Detection, to appear in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), short paper, Vancouver, BC, Canada, July 30-August 4, ACL.
+ =====================================================================
+ Description of the TSV format:
+
+ Column 1: the ID of the statement ([ID].json).
+ Column 2: the label.
+ Column 3: the statement.
+ Column 4: the subject(s).
+ Column 5: the speaker.
+ Column 6: the speaker's job title.
+ Column 7: the state info.
+ Column 8: the party affiliation.
+ Column 9-13: the total credit history count, including the current statement.
+ 9: barely true counts.
+ 10: false counts.
+ 11: half true counts.
+ 12: mostly true counts.
+ 13: pants on fire counts.
+ Column 14: the context (venue / location of the speech or statement).
+
+ Note that we do not provide the full-text verdict report in this current version of the dataset,
+ but you can use the following command to access the full verdict report and links to the source documents:
+ wget http://www.politifact.com//api/v/2/statement/[ID]/?format=json
+
+ ======================================================================
+ The original sources retain the copyright of the data.
+
+ Note that there are absolutely no guarantees with this data,
+ and we provide this dataset "as is",
+ but you are welcome to report the issues of the preliminary version
+ of this data.
+
+ You are allowed to use this dataset for research purposes only.
+
+ For more questions about the dataset, please contact:
+ William Wang, william@cs.ucsb.edu
+
+ v1.0 04/23/2017
+
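The column layout described in the LIAR README maps directly onto a tab-separated reader. Below is a minimal sketch using Python's `csv` module. The field names in `LIAR_COLUMNS` are labels chosen here for illustration (the README only numbers the columns), and the sample row is invented, not taken from the dataset.

```python
import csv
import io

# Illustrative names for the 14 columns described in the LIAR README.
LIAR_COLUMNS = [
    "id", "label", "statement", "subjects", "speaker", "job_title",
    "state", "party", "barely_true_count", "false_count",
    "half_true_count", "mostly_true_count", "pants_on_fire_count", "context",
]


def read_liar_rows(handle):
    """Yield one dict per TSV row, converting the five count columns to int."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        record = dict(zip(LIAR_COLUMNS, row))
        for name in LIAR_COLUMNS[8:13]:
            # Missing or empty counts are treated as zero (an assumption).
            record[name] = int(record.get(name) or 0)
        yield record


# Invented example row (not real data), with tab-separated fields.
sample = ("0000.json\tfalse\tAn example statement.\ttaxes\tjane-doe\t"
          "Senator\tOhio\tnone\t1\t2\t3\t4\t5\ta speech\n")
rows = list(read_liar_rows(io.StringIO(sample)))
print(rows[0]["label"], rows[0]["false_count"])
```

`csv.QUOTE_NONE` is used so that quotation marks inside statements are kept as literal text rather than interpreted as field delimiters.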
data/raw/liar/test.tsv ADDED
The diff for this file is too large to render. See raw diff
 
data/raw/liar/train.tsv ADDED
The diff for this file is too large to render. See raw diff
 
data/raw/liar/valid.tsv ADDED
The diff for this file is too large to render. See raw diff
 
data/test.csv ADDED
File without changes
data/train.csv ADDED
File without changes
notebooks/FakeNewsClassifier_HuggingFace.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
notebooks/FakeNews_EDA.ipynb ADDED
@@ -0,0 +1,177 @@
+ {
+  "cells": [
+   {
+    "cell_type": "code",
+    "execution_count": 1,
+    "metadata": {},
+    "outputs": [
+     {
+      "name": "stderr",
+      "output_type": "stream",
+      "text": [
+       "c:\\Users\\super\\Documents\\CSE Projects\\FakeNewsDetector\\venv\\Lib\\site-packages\\tqdm\\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
+       " from .autonotebook import tqdm as notebook_tqdm\n",
+       "Generating train split: 100%|██████████| 44898/44898 [00:01<00:00, 25984.86 examples/s]\n"
+      ]
+     }
+    ],
+    "source": [
+     "from datasets import load_dataset\n",
+     "\n",
+     "dataset = load_dataset(\"mrm8488/fake-news\")\n",
+     "dataset = dataset['train'].train_test_split(test_size=0.2)\n",
+     "\n",
+     "# Split out sets\n",
+     "train_ds = dataset['train']\n",
+     "test_ds = dataset['test']\n"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": 2,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "import pandas as pd\n",
+     "\n",
+     "train_df = train_ds.to_pandas()\n",
+     "test_df = test_ds.to_pandas()\n"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": 4,
+    "metadata": {},
+    "outputs": [
+     {
+      "data": {
+       "image/png": "",
+       "text/plain": [
+        "<Figure size 640x480 with 1 Axes>"
+       ]
+      },
+      "metadata": {},
+      "output_type": "display_data"
+     },
+     {
+      "name": "stdout",
+      "output_type": "stream",
+      "text": [
+       "label\n",
+       "1 18743\n",
+       "0 17175\n",
+       "Name: count, dtype: int64\n"
+      ]
+     }
+    ],
+    "source": [
+     "import matplotlib.pyplot as plt\n",
+     "import seaborn as sns\n",
+     "\n",
+     "sns.countplot(data=train_df, x=\"label\")\n",
+     "plt.title(\"Class Distribution (0 = Real, 1 = Fake)\")\n",
+     "plt.show()\n",
+     "\n",
+     "# Optional: exact counts\n",
+     "print(train_df['label'].value_counts())\n"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": 5,
+    "metadata": {},
+    "outputs": [
+     {
+      "name": "stdout",
+      "output_type": "stream",
+      "text": [
+       "text 0\n",
+       "label 0\n",
+       "dtype: int64\n",
+       " text label\n",
+       "29516 A federal aid package was all set to pass the ... 1\n",
+       "26424 MOSCOW (Reuters) - Russia s lower house of par... 0\n",
+       "12155 The information below is disturbing and should... 1\n",
+       "14098 There are people out there who are giving the... 1\n",
+       "21622 CARACAS (Reuters) - Cuba’s main regional ally,... 0\n"
+      ]
+     }
+    ],
+    "source": [
+     "print(train_df.isnull().sum())\n",
+     "\n",
+     "# You can also visually inspect\n",
+     "print(train_df.sample(5))\n"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": 6,
+    "metadata": {},
+    "outputs": [
+     {
+      "data": {
+       "image/png": "",
+       "text/plain": [
+        "<Figure size 640x480 with 1 Axes>"
+       ]
+      },
+      "metadata": {},
+      "output_type": "display_data"
+     }
+    ],
+    "source": [
+     "# Add new column for text length\n",
+     "train_df['text_length'] = train_df['text'].apply(lambda x: len(x.split()))\n",
+     "\n",
+     "# Plot distribution\n",
+     "sns.histplot(train_df['text_length'], bins=50, kde=True)\n",
+     "plt.title(\"Article Length Distribution (in words)\")\n",
+     "plt.xlabel(\"Word Count\")\n",
+     "plt.show()\n"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": 7,
+    "metadata": {},
+    "outputs": [
+     {
+      "name": "stdout",
+      "output_type": "stream",
+      "text": [
+       "Min: 0\n",
+       "Max: 8135\n",
+       "Mean: 404.48017706999275\n"
+      ]
+     }
+    ],
+    "source": [
+     "print(\"Min:\", train_df['text_length'].min())\n",
+     "print(\"Max:\", train_df['text_length'].max())\n",
+     "print(\"Mean:\", train_df['text_length'].mean())\n"
+    ]
+   }
+  ],
+  "metadata": {
+   "kernelspec": {
+    "display_name": "venv",
+    "language": "python",
+    "name": "python3"
+   },
+   "language_info": {
+    "codemirror_mode": {
+     "name": "ipython",
+     "version": 3
+    },
+    "file_extension": ".py",
+    "mimetype": "text/x-python",
+    "name": "python",
+    "nbconvert_exporter": "python",
+    "pygments_lexer": "ipython3",
+    "version": "3.12.2"
+   }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 2
+ }
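The split sizes in this notebook can be checked by hand. Calling `train_test_split(test_size=0.2)` on 44,898 examples leaves 35,918 for training, which matches the class counts printed in the notebook output (18,743 + 17,175), and 8,980 for testing. The sketch below reproduces that arithmetic. It assumes the test size is rounded up and the train size rounded down, as scikit-learn-style splitters do, and `split_sizes` is a name introduced here.

```python
import math


def split_sizes(n_samples: int, test_fraction: float) -> tuple:
    """Return (n_train, n_test), assuming ceil for test and floor for train."""
    n_test = math.ceil(test_fraction * n_samples)
    n_train = math.floor((1 - test_fraction) * n_samples)
    return n_train, n_test


n_train, n_test = split_sizes(44898, 0.2)
print(n_train, n_test)
```

Because the notebook's split call sets no `seed`, the row counts are reproducible but the exact rows in each split (and so the per-class counts) can differ between runs.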
requirements.txt ADDED
@@ -0,0 +1,73 @@
+ aiohappyeyeballs==2.6.1
+ aiohttp==3.12.6
+ aiosignal==1.3.2
+ altair==5.5.0
+ attrs==25.3.0
+ blinker==1.9.0
+ cachetools==5.5.2
+ certifi==2025.4.26
+ charset-normalizer==3.4.2
+ click==8.2.1
+ colorama==0.4.6
+ contourpy==1.3.2
+ cycler==0.12.1
+ datasets==3.6.0
+ dill==0.3.8
+ filelock==3.18.0
+ fonttools==4.58.1
+ frozenlist==1.6.0
+ fsspec==2025.3.0
+ gitdb==4.0.12
+ GitPython==3.1.44
+ huggingface-hub==0.32.3
+ idna==3.10
+ Jinja2==3.1.6
+ joblib==1.5.1
+ jsonschema==4.24.0
+ jsonschema-specifications==2025.4.1
+ kiwisolver==1.4.8
+ MarkupSafe==3.0.2
+ matplotlib==3.10.3
+ mpmath==1.3.0
+ multidict==6.4.4
+ multiprocess==0.70.16
+ narwhals==1.41.0
+ networkx==3.5
+ numpy==2.2.6
+ packaging==24.2
+ pandas==2.2.3
+ pillow==11.2.1
+ propcache==0.3.1
+ protobuf==6.31.1
+ pyarrow==20.0.0
+ pydeck==0.9.1
+ pyparsing==3.2.3
+ python-dateutil==2.9.0.post0
+ pytz==2025.2
+ PyYAML==6.0.2
+ referencing==0.36.2
+ regex==2024.11.6
+ requests==2.32.3
+ rpds-py==0.25.1
+ safetensors==0.5.3
+ scikit-learn==1.6.1
+ scipy==1.15.3
+ setuptools==80.9.0
+ six==1.17.0
+ smmap==5.0.2
+ streamlit==1.45.1
+ sympy==1.14.0
+ tenacity==9.1.2
+ threadpoolctl==3.6.0
+ tokenizers==0.21.1
+ toml==0.10.2
+ torch==2.7.0
+ tornado==6.5.1
+ tqdm==4.67.1
+ transformers==4.52.4
+ typing_extensions==4.13.2
+ tzdata==2025.2
+ urllib3==2.4.0
+ watchdog==6.0.0
+ xxhash==3.5.0
+ yarl==1.20.0
scripts/evaluate.py ADDED
@@ -0,0 +1,29 @@
+ from datasets import load_dataset
+ from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
+                           TrainingArguments, Trainer)
+ # Reload dataset and tokenizer
+ dataset = load_dataset("liar")
+ dataset = dataset["train"].train_test_split(test_size=0.2, seed=42)
+
+ def simplify_label(example):
+     name = dataset["train"].features["label"].names[example["label"]]
+     example["label"] = int(name in ["pants-fire", "false", "barely-true"])
+     return example
+
+ dataset = dataset.map(simplify_label)
+
+ tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
+ # Tokenize the statement text
+ def tokenize(example):
+     return tokenizer(example["statement"], truncation=True, padding="max_length", max_length=128)
+ # Tokenize the dataset
+ tokenized_dataset = dataset.map(tokenize, batched=True)
+ tokenized_dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])
+ # Load model
+ model = AutoModelForSequenceClassification.from_pretrained("models/bert-liar-fake-news")
+ # Set up Trainer for evaluation
+ training_args = TrainingArguments(output_dir="./results", per_device_eval_batch_size=8)
+ trainer = Trainer(model=model, args=training_args)
+ # Evaluate
+ metrics = trainer.evaluate(eval_dataset=tokenized_dataset["test"])
+ print(metrics)
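As written, the script passes no `compute_metrics` function to `Trainer`, so the printed metrics will most likely contain only the evaluation loss and timing figures. The core of an accuracy metric is an argmax over each row of logits followed by a comparison with the true labels. The following is a dependency-free sketch of that logic. `accuracy_from_logits` is a hypothetical helper and the logits are invented numbers. Wiring it into `Trainer`, which supplies NumPy arrays rather than lists, is left as an assumption.

```python
def accuracy_from_logits(logits, labels) -> float:
    """Fraction of rows whose highest-scoring class equals the true label."""
    if len(logits) != len(labels):
        raise ValueError("logits and labels must have the same length")
    correct = 0
    for row, label in zip(logits, labels):
        predicted = max(range(len(row)), key=row.__getitem__)  # argmax
        correct += int(predicted == label)
    return correct / len(labels) if labels else 0.0


# Invented two-class logits for four statements, with their true labels.
logits = [[2.1, -0.3], [0.2, 1.5], [-1.0, 0.4], [0.9, 0.8]]
labels = [0, 1, 0, 0]
accuracy = accuracy_from_logits(logits, labels)
print(accuracy)
```

In this toy input the third row is predicted as class 1 but labelled 0, so three of four predictions are correct.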
scripts/predict.py ADDED
@@ -0,0 +1,14 @@
+ from transformers import pipeline
+ # Load the classifier
+ classifier = pipeline("text-classification", model="models/bert-liar-fake-news", tokenizer="bert-base-uncased")
+ # Define the predict function
+ def predict(text):
+     result = classifier(text)
+     label = result[0]['label']
+     score = result[0]['score']
+     return label, score
+ # Example usage
+ if __name__ == "__main__":
+     text = input("Enter a statement to evaluate:\n")
+     label, score = predict(text)
+     print(f"Prediction: {label} (Confidence: {score:.2f})")
scripts/train.py ADDED
@@ -0,0 +1,54 @@
+ from datasets import load_dataset
+ from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
+                           TrainingArguments, Trainer)
+ # Load LIAR and re-split its train split:
+ #   dataset['train']: 80%
+ #   dataset['test']: 20%
+ dataset = load_dataset("liar")
+ dataset = dataset["train"].train_test_split(test_size=0.2, seed=42)
+
+ def simplify_label(example):
+     name = dataset["train"].features["label"].names[example["label"]]
+     example["label"] = int(name in ["pants-fire", "false", "barely-true"])
+     return example
+
+ dataset = dataset.map(simplify_label)
+
+ # Load the tokenizer
+ tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
+
+ # Tokenize the statement text
+ def tokenize(example):
+     return tokenizer(example["statement"], truncation=True, padding="max_length", max_length=128)
+
+ tokenized_dataset = dataset.map(tokenize, batched=True)
+ # Set the format to torch and specify the columns to include
+ tokenized_dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])
+ # Load the model
+ model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
+
+ # Define the training arguments
+ training_args = TrainingArguments(
+     output_dir="./results",
+     eval_strategy="epoch",
+     save_strategy="epoch",
+     num_train_epochs=3,
+     per_device_train_batch_size=8,
+     per_device_eval_batch_size=8,
+     learning_rate=2e-5,
+     weight_decay=0.01,
+ )
+ # Initialize the Trainer
+ trainer = Trainer(
+     model=model,
+     args=training_args,
+     train_dataset=tokenized_dataset["train"],
+     eval_dataset=tokenized_dataset["test"],
+ )
+ # Train the model
+ trainer.train()
+ # Save the fine-tuned model
+ trainer.save_model("models/bert-liar-fake-news")
+
+
+
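`simplify_label` collapses LIAR's six truthfulness ratings into a binary target. The lookup is sensitive to the exact hyphen character in names such as `pants-fire`, so a version that normalises look-alike hyphens before comparing may be more forgiving. The sketch below assumes the rating names follow the Hugging Face `liar` dataset. `FAKE_LABELS` and `to_binary` are names introduced here.

```python
# Ratings treated as "fake" (1); every other rating maps to "real" (0).
FAKE_LABELS = {"pants-fire", "false", "barely-true"}

# Unicode hyphen look-alikes that sometimes replace '-' when code is copied.
_HYPHEN_VARIANTS = str.maketrans({"\u2010": "-", "\u2011": "-", "\u2013": "-"})


def to_binary(label_name: str) -> int:
    """Map a LIAR rating name to 1 (fake) or 0 (real)."""
    normalized = label_name.strip().lower().translate(_HYPHEN_VARIANTS)
    return int(normalized in FAKE_LABELS)


for name in ["pants-fire", "false", "barely-true",
             "half-true", "mostly-true", "true"]:
    print(name, to_binary(name))
```

Treating `barely-true` as fake and `half-true` as real is the script's own cut-off. Moving that boundary changes the class balance and therefore the reported metrics.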
scripts/utils.py ADDED
File without changes