--- |
|
title: Paper Classifier |
|
emoji: 📄
|
colorFrom: blue |
|
colorTo: indigo |
|
sdk: streamlit |
|
sdk_version: "1.32.0" |
|
app_file: app.py |
|
pinned: false |
|
--- |
|
|
|
# 📄 Academic Paper Classifier
|
|
|
[Live demo on Hugging Face Spaces](https://huggingface.co/spaces/ssbars/ysdaml4)
|
|
|
This Streamlit application classifies academic papers into subject categories using a BERT-based model.
|
|
|
## Features |
|
|
|
- **Text Classification**: Paste any paper text directly |
|
- **PDF Support**: Upload PDF files for classification |
|
- **Real-time Analysis**: Get instant classification results |
|
- **Probability Distribution**: See confidence scores for each category (see the softmax sketch after this list)
|
- **Multiple Categories**: Supports various academic fields |
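In code, these confidence scores are just a softmax over the classifier's raw logits. A minimal sketch (the logit values and category order here are illustrative, not the app's actual output):

```python
import torch

# Illustrative raw scores (logits) for the five categories.
logits = torch.tensor([2.1, 0.3, -0.5, 0.0, 1.2])

# Softmax turns logits into probabilities that sum to 1.
probs = torch.softmax(logits, dim=-1)

categories = ["Computer Science", "Mathematics", "Physics", "Biology", "Economics"]
for name, p in zip(categories, probs.tolist()):
    print(f"{name}: {p:.3f}")
```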
|
|
|
## How to Use |
|
|
|
1. **Text Input** |
|
- Paste your paper's text (abstract or full content) |
|
- Click "Classify Text" |
|
- View results and probability distribution |
|
|
|
2. **PDF Upload** |
|
- Upload a PDF file of your paper |
|
- Click "Classify PDF" |
|
- Get classification results |
|
|
|
## Categories |
|
|
|
The model classifies papers into the following categories: |
|
- Computer Science |
|
- Mathematics |
|
- Physics |
|
- Biology |
|
- Economics |
|
|
|
## Technical Details |
|
|
|
- Built with Streamlit (a minimal app sketch follows this list)

- Uses a BERT-based model for classification

- Supports PDF file processing

- Real-time classification
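To make that wiring concrete, here is a minimal sketch of how such a Streamlit front end might be structured. The `classify_text` stub is a stand-in for the real model call; this README does not spell out the actual code in `app.py`:

```python
import pandas as pd
import streamlit as st

CATEGORIES = ["Computer Science", "Mathematics", "Physics", "Biology", "Economics"]

def classify_text(text):
    # Stand-in for the real BERT model: returns a uniform distribution.
    return {c: 1.0 / len(CATEGORIES) for c in CATEGORIES}

st.title("Academic Paper Classifier")
text = st.text_area("Paste the paper's abstract or full text")

if st.button("Classify Text") and text.strip():
    probs = classify_text(text)
    best = max(probs, key=probs.get)
    st.success(f"Predicted field: {best}")
    st.bar_chart(pd.Series(probs))  # per-category probability distribution
```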
|
|
|
## About |
|
|
|
This application is designed to help researchers, students, and academics quickly identify the primary field of an academic paper. It uses transformer-based natural language processing to analyze the paper's content and report the most likely field along with per-category confidence scores.
|
|
|
--- |
|
Created with ❤️ using Streamlit and Transformers
|
|
|
## Setup |
|
|
|
1. Install `uv` (if not already installed): |
|
```bash
# Using pip
pip install uv

# Or using Homebrew on macOS
brew install uv
```
|
|
|
2. Create and activate a virtual environment: |
|
```bash
uv venv
source .venv/bin/activate  # On Unix/macOS
# OR
.venv\Scripts\activate  # On Windows
```
|
|
|
3. Install the dependencies using uv: |
|
```bash
uv pip install -r requirements.lock
```
|
|
|
4. Run the Streamlit application: |
|
```bash
streamlit run app.py
```
|
|
|
## Usage |
|
|
|
1. **Text Classification** |
|
- Paste the paper's text (abstract or content) into the text area |
|
- Click "Classify Text" to get results |
|
|
|
2. **PDF Classification** |
|
- Upload a PDF file using the file uploader |
|
- Click "Classify PDF" to process and classify the document (a text-extraction sketch follows)
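The README does not pin down which library handles the PDFs; as a sketch, the extraction step could use `pypdf` (assumed here purely for illustration):

```python
from pypdf import PdfReader

def extract_pdf_text(file):
    """Concatenate the text of every page. Accepts a path or a file-like
    object, such as the one returned by Streamlit's file_uploader."""
    reader = PdfReader(file)
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```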
|
|
|
## Model Information |
|
|
|
The application uses a BERT-based model for classification with the following categories:
|
- Computer Science |
|
- Mathematics |
|
- Physics |
|
- Biology |
|
- Economics |
|
|
|
## Note |
|
|
|
The current implementation uses a base BERT model. For production use, you should: |
|
1. Fine-tune the model on a dataset of academic papers (a training sketch follows this list)
|
2. Adjust the categories based on your specific needs |
|
3. Implement proper error handling and validation |
|
4. Add authentication if needed |
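As a sketch of step 1, fine-tuning with the Hugging Face `Trainer` API might look like the following. The checkpoint, toy data, and label scheme are illustrative, not this repo's training code:

```python
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Toy training data; a real run needs thousands of labeled papers.
texts = ["A survey of graph neural networks",
         "On the distribution of prime numbers"]
labels = [0, 1]  # e.g. 0 = Computer Science, 1 = Mathematics

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encodings = tokenizer(texts, truncation=True, padding=True)

class PaperDataset(torch.utils.data.Dataset):
    def __len__(self):
        return len(labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in encodings.items()}
        item["labels"] = torch.tensor(labels[idx])
        return item

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=5)

args = TrainingArguments(output_dir="finetuned", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=PaperDataset()).train()
```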
|
|
|
## Package Management |
|
|
|
This project uses `uv` as the package manager for faster and more reliable dependency management. The dependencies are locked in `requirements.lock` for reproducible installations. |
|
|
|
To update dependencies: |
|
```bash
# Update a single package
uv pip install --upgrade package_name

# Update all packages and regenerate the lock file
uv pip compile requirements.txt -o requirements.lock
uv pip install -r requirements.lock
```
|
|
|
## Requirements |
|
|
|
See `requirements.txt` for a complete list of dependencies. |
|
|
|
# ArXiv Paper Classifier |
|
|
|
This project implements a machine learning system for classifying academic papers into ArXiv categories using state-of-the-art transformer models. |
|
|
|
## Project Overview |
|
|
|
The system uses pre-trained transformer models to classify academic papers into one of the main ArXiv categories (captured as a mapping in the sketch after this list):
|
- Computer Science (cs) |
|
- Mathematics (math) |
|
- Physics (physics) |
|
- Quantitative Biology (q-bio) |
|
- Quantitative Finance (q-fin) |
|
- Statistics (stat) |
|
- Electrical Engineering and Systems Science (eess) |
|
- Economics (econ) |
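In code, these labels might be carried as a simple mapping from arXiv prefix to display name (a sketch; the repo's actual label handling may differ):

```python
ARXIV_CATEGORIES = {
    "cs": "Computer Science",
    "math": "Mathematics",
    "physics": "Physics",
    "q-bio": "Quantitative Biology",
    "q-fin": "Quantitative Finance",
    "stat": "Statistics",
    "eess": "Electrical Engineering and Systems Science",
    "econ": "Economics",
}
```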
|
|
|
## Features |
|
|
|
- Multiple model support: |
|
- DistilBERT: Lightweight and fast model, good for testing |
|
  - DeBERTa-v3: Advanced model, typically the strongest performer of those listed
|
- RoBERTa: Advanced model with strong performance |
|
- SciBERT: Specialized for scientific text |
|
- BERT: Classic model with good all-round performance |
|
|
|
- Flexible input handling: |
|
  - Can process both title and abstract (see the tokenization sketch after this list)
|
- Handles text preprocessing and tokenization |
|
- Supports different maximum sequence lengths |
|
|
|
- Robust error handling: |
|
- Multiple fallback mechanisms for tokenizer initialization |
|
- Graceful degradation to simpler models if needed |
|
- Detailed error messages and logging |
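As a sketch of the input handling, title and abstract can be joined into a single passage and truncated to the model's limit. The joining scheme and checkpoint below are assumptions for illustration, not the exact code in `model.py`:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")

title = "Attention Is All You Need"
abstract = "The dominant sequence transduction models are based on recurrent networks..."

# Join title and abstract, then truncate to the 512-token maximum.
encoding = tokenizer(f"{title}. {abstract}", truncation=True,
                     max_length=512, return_tensors="pt")
print(encoding["input_ids"].shape)  # (1, sequence_length)
```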
|
|
|
## Installation |
|
|
|
1. Clone the repository |
|
2. Install dependencies: |
|
```bash
pip install -r requirements.txt
```
|
|
|
## Usage |
|
|
|
### Basic Usage |
|
|
|
```python
from model import PaperClassifier

# Initialize classifier with default model (DistilBERT)
classifier = PaperClassifier()

# Classify a paper
result = classifier.classify_paper(
    title="Your paper title",
    abstract="Your paper abstract"
)

# Print results
print(result)
```
|
|
|
### Using Different Models |
|
|
|
```python
# Initialize with DeBERTa-v3
classifier = PaperClassifier(model_type='deberta-v3')

# Initialize with RoBERTa
classifier = PaperClassifier(model_type='roberta')

# Initialize with SciBERT
classifier = PaperClassifier(model_type='scibert')

# Initialize with BERT
classifier = PaperClassifier(model_type='bert')
```
|
|
|
### Training on Custom Data |
|
|
|
```python
# Prepare your training data
train_texts = ["paper1 title and abstract", "paper2 title and abstract", ...]
train_labels = ["cs", "math", ...]

# Train the model
classifier.train_on_arxiv(
    train_texts=train_texts,
    train_labels=train_labels,
    epochs=3,
    batch_size=16,
    learning_rate=2e-5
)
```
|
|
|
## Model Details |
|
|
|
### Available Models |
|
|
|
1. **DistilBERT** (`distilbert`) |
|
- Model: `distilbert-base-cased` |
|
- Max length: 512 tokens |
|
- Fast tokenizer |
|
- Good for testing and quick results |
|
|
|
2. **DeBERTa-v3** (`deberta-v3`) |
|
- Model: `microsoft/deberta-v3-base` |
|
- Max length: 512 tokens |
|
- Uses DebertaV2TokenizerFast |
|
- Advanced performance |
|
|
|
3. **RoBERTa** (`roberta`) |
|
- Model: `roberta-base` |
|
- Max length: 512 tokens |
|
- Strong performance on various tasks |
|
|
|
4. **SciBERT** (`scibert`) |
|
- Model: `allenai/scibert_scivocab_uncased` |
|
- Max length: 512 tokens |
|
- Specialized for scientific text |
|
|
|
5. **BERT** (`bert`) |
|
- Model: `bert-base-uncased` |
|
- Max length: 512 tokens |
|
- Classic model with good all-round performance |
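The five configurations above could be captured in a single lookup table; a sketch (the actual structure in `model.py` may differ):

```python
MODEL_CONFIGS = {
    "distilbert": {"checkpoint": "distilbert-base-cased", "max_length": 512},
    "deberta-v3": {"checkpoint": "microsoft/deberta-v3-base", "max_length": 512},
    "roberta": {"checkpoint": "roberta-base", "max_length": 512},
    "scibert": {"checkpoint": "allenai/scibert_scivocab_uncased", "max_length": 512},
    "bert": {"checkpoint": "bert-base-uncased", "max_length": 512},
}
```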
|
|
|
## Error Handling |
|
|
|
The system includes robust error handling mechanisms, sketched after this list:
|
- Multiple fallback levels for tokenizer initialization |
|
- Graceful degradation to simpler models |
|
- Detailed error messages and logging |
|
- Automatic fallback to BERT tokenizer if needed |
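A sketch of the fallback pattern described above (illustrative; the real initialization logic in `model.py` may differ):

```python
from transformers import AutoTokenizer, BertTokenizer

def load_tokenizer(model_name):
    try:
        # Preferred path: the model's own (fast) tokenizer.
        return AutoTokenizer.from_pretrained(model_name)
    except Exception as err:
        print(f"Fast tokenizer failed for {model_name}: {err}")
    try:
        # First fallback: the slow tokenizer for the same checkpoint.
        return AutoTokenizer.from_pretrained(model_name, use_fast=False)
    except Exception as err:
        print(f"Slow tokenizer failed for {model_name}: {err}")
    # Last resort: the plain BERT tokenizer.
    return BertTokenizer.from_pretrained("bert-base-uncased")
```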
|
|
|
## Requirements |
|
|
|
- Python 3.7+ |
|
- PyTorch |
|
- Transformers library |
|
- NumPy |
|
- Sacremoses (for tokenization support) |