---
title: Paper Classifier
emoji: π
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: "1.32.0"
app_file: app.py
pinned: false
---
# π Academic Paper Classifier
[View the Space on Hugging Face](https://huggingface.co/spaces/ssbars/ysdaml4)
This Streamlit application helps classify academic papers into different categories using a BERT-based model.
## Features
- **Text Classification**: Paste any paper text directly
- **PDF Support**: Upload PDF files for classification
- **Real-time Analysis**: Get instant classification results
- **Probability Distribution**: See confidence scores for each category
- **Multiple Categories**: Supports various academic fields
## How to Use
1. **Text Input**
   - Paste your paper's text (abstract or full content)
   - Click "Classify Text"
   - View results and probability distribution
2. **PDF Upload**
   - Upload a PDF file of your paper
   - Click "Classify PDF"
   - Get classification results
## Categories
The model classifies papers into the following categories:
- Computer Science
- Mathematics
- Physics
- Biology
- Economics
## Technical Details
- Built with Streamlit
- Uses BERT-based model for classification
- Supports PDF file processing
- Real-time classification
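The pieces above could fit together roughly as follows. This is a minimal sketch, not the actual `app.py`; the checkpoint name, label set, and widget labels are illustrative assumptions.

```python
import pandas as pd
import streamlit as st
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Labels assumed from the category list above.
LABELS = ["Computer Science", "Mathematics", "Physics", "Biology", "Economics"]

@st.cache_resource  # load the model once per Streamlit session
def load_model():
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=len(LABELS))
    return tokenizer, model

tokenizer, model = load_model()
text = st.text_area("Paste the paper abstract or full text")
if st.button("Classify Text") and text.strip():
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze(0)
    # Show the probability distribution over categories.
    st.bar_chart(pd.DataFrame({"probability": probs.tolist()}, index=LABELS))
```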
## About
This application is designed to help researchers, students, and academics quickly identify the primary field of academic papers. It uses state-of-the-art natural language processing to analyze paper content and provide accurate classifications.
---
Created with ❤️ using Streamlit and Transformers
## Setup
1. Install `uv` (if not already installed):
```bash
# Using pip
pip install uv
# Or using Homebrew on macOS
brew install uv
```
2. Create and activate a virtual environment:
```bash
uv venv
source .venv/bin/activate # On Unix/macOS
# OR
.venv\Scripts\activate # On Windows
```
3. Install the dependencies using uv:
```bash
uv pip install -r requirements.lock
```
4. Run the Streamlit application:
```bash
streamlit run app.py
```
## Usage
1. **Text Classification**
   - Paste the paper's text (abstract or content) into the text area
   - Click "Classify Text" to get results
2. **PDF Classification**
   - Upload a PDF file using the file uploader
   - Click "Classify PDF" to process and classify the document
## Model Information
The service uses a BERT-based model for classification with the following categories:
- Computer Science
- Mathematics
- Physics
- Biology
- Economics
## Note
The current implementation uses a base BERT model. For production use, you should:
1. Fine-tune the model on a dataset of academic papers
2. Adjust the categories based on your specific needs
3. Implement proper error handling and validation
4. Add authentication if needed
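For step 1, one possible starting point is the Hugging Face `Trainer` API. This is only a sketch under assumed data, label codes, and hyperparameters, not this project's actual training code:

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Toy in-memory dataset; replace with real paper texts and labels.
labels = ["cs", "math", "physics", "q-bio", "econ"]
label2id = {name: i for i, name in enumerate(labels)}
raw = Dataset.from_dict({
    "text": ["A new transformer architecture for ...", "On gaps between primes ..."],
    "label": [label2id["cs"], label2id["math"]],
})

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels))

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_dataset = raw.map(tokenize, batched=True)

args = TrainingArguments(output_dir="finetuned", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=train_dataset,
        tokenizer=tokenizer).train()
```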
## Package Management
This project uses `uv` as the package manager for faster and more reliable dependency management. The dependencies are locked in `requirements.lock` for reproducible installations.
To update dependencies:
```bash
# Update a single package
uv pip install --upgrade package_name
# Update all packages and regenerate lock file
uv pip compile requirements.txt -o requirements.lock
uv pip install -r requirements.lock
```
## Requirements
See `requirements.txt` for a complete list of dependencies.
# ArXiv Paper Classifier
This project implements a machine learning system for classifying academic papers into ArXiv categories using state-of-the-art transformer models.
## Project Overview
The system uses pre-trained transformer models to classify academic papers into one of the main ArXiv categories:
- Computer Science (cs)
- Mathematics (math)
- Physics (physics)
- Quantitative Biology (q-bio)
- Quantitative Finance (q-fin)
- Statistics (stat)
- Electrical Engineering and Systems Science (eess)
- Economics (econ)
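For reference, the category codes map to full names as follows (a plain dictionary for illustration; the project may store this mapping differently):

```python
ARXIV_CATEGORIES = {
    "cs": "Computer Science",
    "math": "Mathematics",
    "physics": "Physics",
    "q-bio": "Quantitative Biology",
    "q-fin": "Quantitative Finance",
    "stat": "Statistics",
    "eess": "Electrical Engineering and Systems Science",
    "econ": "Economics",
}
```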
## Features
- Multiple model support:
  - DistilBERT: Lightweight and fast model, good for testing
  - DeBERTa-v3: Advanced model with better performance
  - RoBERTa: Advanced model with strong performance
  - SciBERT: Specialized for scientific text
  - BERT: Classic model with good all-round performance
- Flexible input handling:
  - Can process both title and abstract
  - Handles text preprocessing and tokenization
  - Supports different maximum sequence lengths
- Robust error handling:
  - Multiple fallback mechanisms for tokenizer initialization
  - Graceful degradation to simpler models if needed
  - Detailed error messages and logging
## Installation
1. Clone the repository
2. Install dependencies:
```bash
pip install -r requirements.txt
```
## Usage
### Basic Usage
```python
from model import PaperClassifier
# Initialize classifier with default model (DistilBERT)
classifier = PaperClassifier()
# Classify a paper
result = classifier.classify_paper(
    title="Your paper title",
    abstract="Your paper abstract"
)
# Print results
print(result)
```
### Using Different Models
```python
# Initialize with DeBERTa-v3
classifier = PaperClassifier(model_type='deberta-v3')
# Initialize with RoBERTa
classifier = PaperClassifier(model_type='roberta')
# Initialize with SciBERT
classifier = PaperClassifier(model_type='scibert')
# Initialize with BERT
classifier = PaperClassifier(model_type='bert')
```
### Training on Custom Data
```python
# Prepare your training data
train_texts = ["paper1 title and abstract", "paper2 title and abstract", ...]
train_labels = ["cs", "math", ...]
# Train the model
classifier.train_on_arxiv(
    train_texts=train_texts,
    train_labels=train_labels,
    epochs=3,
    batch_size=16,
    learning_rate=2e-5
)
```
## Model Details
### Available Models
1. **DistilBERT** (`distilbert`)
   - Model: `distilbert-base-cased`
   - Max length: 512 tokens
   - Fast tokenizer
   - Good for testing and quick results
2. **DeBERTa-v3** (`deberta-v3`)
   - Model: `microsoft/deberta-v3-base`
   - Max length: 512 tokens
   - Uses DebertaV2TokenizerFast
   - Advanced performance
3. **RoBERTa** (`roberta`)
   - Model: `roberta-base`
   - Max length: 512 tokens
   - Strong performance on various tasks
4. **SciBERT** (`scibert`)
   - Model: `allenai/scibert_scivocab_uncased`
   - Max length: 512 tokens
   - Specialized for scientific text
5. **BERT** (`bert`)
   - Model: `bert-base-uncased`
   - Max length: 512 tokens
   - Classic model with good all-round performance
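Expressed as data, the list above corresponds to a mapping along these lines (illustrative only; the real registry lives inside `PaperClassifier` and may differ):

```python
# model_type value -> assumed checkpoint and maximum sequence length
MODEL_CONFIGS = {
    "distilbert": {"checkpoint": "distilbert-base-cased", "max_length": 512},
    "deberta-v3": {"checkpoint": "microsoft/deberta-v3-base", "max_length": 512},
    "roberta": {"checkpoint": "roberta-base", "max_length": 512},
    "scibert": {"checkpoint": "allenai/scibert_scivocab_uncased", "max_length": 512},
    "bert": {"checkpoint": "bert-base-uncased", "max_length": 512},
}
```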
## Error Handling
The system includes robust error handling mechanisms:
- Multiple fallback levels for tokenizer initialization
- Graceful degradation to simpler models
- Detailed error messages and logging
- Automatic fallback to BERT tokenizer if needed
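The fallback chain described above could look roughly like this. A minimal sketch, assuming `AutoTokenizer` is used throughout; it is not the project's exact implementation:

```python
from transformers import AutoTokenizer

def load_tokenizer(model_name: str):
    try:
        # Preferred: fast (Rust-backed) tokenizer for the requested checkpoint.
        return AutoTokenizer.from_pretrained(model_name, use_fast=True)
    except Exception as fast_err:
        print(f"Fast tokenizer failed for {model_name}: {fast_err}")
        try:
            # Fallback 1: slow (Python) tokenizer for the same checkpoint.
            return AutoTokenizer.from_pretrained(model_name, use_fast=False)
        except Exception as slow_err:
            print(f"Slow tokenizer failed for {model_name}: {slow_err}")
            # Fallback 2: plain BERT tokenizer as a last resort.
            return AutoTokenizer.from_pretrained("bert-base-uncased")
```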
## Requirements
- Python 3.7+
- PyTorch
- Transformers library
- NumPy
- Sacremoses (for tokenization support)