---
title: Paper Classifier
emoji: πŸ“š
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: "1.32.0"
app_file: app.py
pinned: false
---
# πŸ“š Academic Paper Classifier
Live Space: [ssbars/ysdaml4](https://huggingface.co/spaces/ssbars/ysdaml4)

This Streamlit application helps classify academic papers into different categories using a BERT-based model.
## Features
- **Text Classification**: Paste any paper text directly
- **PDF Support**: Upload PDF files for classification
- **Real-time Analysis**: Get instant classification results
- **Probability Distribution**: See confidence scores for each category
- **Multiple Categories**: Supports various academic fields
## How to Use
1. **Text Input**
   - Paste your paper's text (abstract or full content)
   - Click "Classify Text"
   - View results and probability distribution
2. **PDF Upload**
   - Upload a PDF file of your paper
   - Click "Classify PDF"
   - Get classification results
## Categories
The model classifies papers into the following categories:
- Computer Science
- Mathematics
- Physics
- Biology
- Economics
## Technical Details
- Built with Streamlit
- Uses a BERT-based model for classification
- Supports PDF file processing
- Real-time classification
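Below is a minimal sketch of how these pieces typically fit together. It is illustrative only: the actual `app.py` may differ, and both `pypdf` and the bare `bert-base-uncased` pipeline are assumptions rather than the app's confirmed implementation.

```python
import streamlit as st
from pypdf import PdfReader
from transformers import pipeline

# Load the model once; Streamlit reruns the script on every interaction.
@st.cache_resource
def load_classifier():
    return pipeline("text-classification", model="bert-base-uncased")

classifier = load_classifier()

text = st.text_area("Paste paper text")
pdf = st.file_uploader("...or upload a PDF", type="pdf")
if pdf is not None:
    # Concatenate the extracted text of every page.
    text = "\n".join(page.extract_text() or "" for page in PdfReader(pdf).pages)

if st.button("Classify") and text:
    # top_k=None returns a score for every label (the probability distribution).
    st.write(classifier(text, truncation=True, top_k=None))
```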
## About
This application is designed to help researchers, students, and academics quickly identify the primary field of an academic paper. It uses transformer-based natural language processing to analyze the paper's content and suggest a classification.
---
Created with ❀️ using Streamlit and Transformers
## Setup
1. Install `uv` (if not already installed):
```bash
# Using pip
pip install uv

# Or using Homebrew on macOS
brew install uv
```
2. Create and activate a virtual environment:
```bash
uv venv
source .venv/bin/activate # On Unix/macOS
# OR
.venv\Scripts\activate # On Windows
```
3. Install the dependencies using uv:
```bash
uv pip install -r requirements.lock
```
4. Run the Streamlit application:
```bash
streamlit run app.py
```
## Usage
1. **Text Classification**
   - Paste the paper's text (abstract or content) into the text area
   - Click "Classify Text" to get results
2. **PDF Classification**
   - Upload a PDF file using the file uploader
   - Click "Classify PDF" to process and classify the document
## Model Information
The service uses a BERT-based model for classification with the following categories:
- Computer Science
- Mathematics
- Physics
- Biology
- Economics
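The per-category confidence scores shown in the UI are typically obtained by applying a softmax to the model's raw logits. A sketch, assuming the five labels above in that order (the real label order lives in the model config):

```python
import torch

CATEGORIES = ["Computer Science", "Mathematics", "Physics", "Biology", "Economics"]

def to_distribution(logits: torch.Tensor) -> dict:
    """Convert a (1, num_labels) logits tensor into {category: probability}."""
    probs = torch.softmax(logits, dim=-1).squeeze(0)
    return {cat: round(p.item(), 3) for cat, p in zip(CATEGORIES, probs)}
```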
## Note
The current implementation uses a base BERT model. For production use, you should:
1. Fine-tune the model on a dataset of academic papers (a minimal sketch follows this list)
2. Adjust the categories based on your specific needs
3. Implement proper error handling and validation
4. Add authentication if needed
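For step 1, here is a minimal fine-tuning sketch using the Hugging Face `Trainer` API. Everything in it is a placeholder: `papers.csv`, its `text`/`label` columns, and the label set are illustrative, not files or names this repository ships.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

LABELS = ["Computer Science", "Mathematics", "Physics", "Biology", "Economics"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS)
)

# Placeholder dataset: a CSV with "text" and "label" columns.
ds = load_dataset("csv", data_files="papers.csv")["train"]
ds = ds.map(lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
            batched=True)
ds = ds.map(lambda row: {"labels": LABELS.index(row["label"])})

Trainer(
    model=model,
    args=TrainingArguments(output_dir="checkpoints", num_train_epochs=3),
    train_dataset=ds,
    tokenizer=tokenizer,  # enables padded batching via DataCollatorWithPadding
).train()
```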
## Package Management
This project uses `uv` as the package manager for faster and more reliable dependency management. The dependencies are locked in `requirements.lock` for reproducible installations.
To update dependencies:
```bash
# Update a single package
uv pip install --upgrade package_name

# Update all packages and regenerate lock file
uv pip compile requirements.txt -o requirements.lock
uv pip install -r requirements.lock
```
## Requirements
See `requirements.txt` for a complete list of dependencies.
# ArXiv Paper Classifier
This project implements a machine learning system for classifying academic papers into ArXiv categories using state-of-the-art transformer models.
## Project Overview
The system uses pre-trained transformer models to classify academic papers into one of the main ArXiv categories:
- Computer Science (cs)
- Mathematics (math)
- Physics (physics)
- Quantitative Biology (q-bio)
- Quantitative Finance (q-fin)
- Statistics (stat)
- Electrical Engineering and Systems Science (eess)
- Economics (econ)
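The category codes above, written out as a plain mapping of the kind one might use for a model's `id2label` config (a restatement of the list, not the repository's actual code):

```python
ARXIV_CATEGORIES = {
    "cs": "Computer Science",
    "math": "Mathematics",
    "physics": "Physics",
    "q-bio": "Quantitative Biology",
    "q-fin": "Quantitative Finance",
    "stat": "Statistics",
    "eess": "Electrical Engineering and Systems Science",
    "econ": "Economics",
}
```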
## Features
- Multiple model support:
  - DistilBERT: Lightweight and fast model, good for testing
  - DeBERTa-v3: Advanced model with better performance
  - RoBERTa: Advanced model with strong performance
  - SciBERT: Specialized for scientific text
  - BERT: Classic model with good all-round performance
- Flexible input handling (sketched after this list):
  - Can process both title and abstract
  - Handles text preprocessing and tokenization
  - Supports different maximum sequence lengths
- Robust error handling:
  - Multiple fallback mechanisms for tokenizer initialization
  - Graceful degradation to simpler models if needed
  - Detailed error messages and logging
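A sketch of the title-plus-abstract input handling described above. The real preprocessing lives in `model.py`; the function name and checkpoint here are illustrative:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")

def encode_paper(title: str, abstract: str, max_length: int = 512):
    """Join title and abstract into one sequence, then tokenize with truncation."""
    text = f"{title} {abstract}".strip()
    return tokenizer(text, truncation=True, max_length=max_length,
                     return_tensors="pt")
```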
## Installation
1. Clone the repository
2. Install dependencies:
```bash
pip install -r requirements.txt
```
## Usage
### Basic Usage
```python
from model import PaperClassifier

# Initialize classifier with default model (DistilBERT)
classifier = PaperClassifier()

# Classify a paper
result = classifier.classify_paper(
    title="Your paper title",
    abstract="Your paper abstract"
)

# Print results
print(result)
```
### Using Different Models
```python
# Initialize with DeBERTa-v3
classifier = PaperClassifier(model_type='deberta-v3')

# Initialize with RoBERTa
classifier = PaperClassifier(model_type='roberta')

# Initialize with SciBERT
classifier = PaperClassifier(model_type='scibert')

# Initialize with BERT
classifier = PaperClassifier(model_type='bert')
```
### Training on Custom Data
```python
# Prepare your training data
train_texts = ["paper1 title and abstract", "paper2 title and abstract", ...]
train_labels = ["cs", "math", ...]

# Train the model
classifier.train_on_arxiv(
    train_texts=train_texts,
    train_labels=train_labels,
    epochs=3,
    batch_size=16,
    learning_rate=2e-5
)
```
## Model Details
### Available Models
1. **DistilBERT** (`distilbert`)
   - Model: `distilbert-base-cased`
   - Max length: 512 tokens
   - Fast tokenizer
   - Good for testing and quick results
2. **DeBERTa-v3** (`deberta-v3`)
   - Model: `microsoft/deberta-v3-base`
   - Max length: 512 tokens
   - Uses `DebertaV2TokenizerFast`
   - Advanced performance
3. **RoBERTa** (`roberta`)
   - Model: `roberta-base`
   - Max length: 512 tokens
   - Strong performance on various tasks
4. **SciBERT** (`scibert`)
   - Model: `allenai/scibert_scivocab_uncased`
   - Max length: 512 tokens
   - Specialized for scientific text
5. **BERT** (`bert`)
   - Model: `bert-base-uncased`
   - Max length: 512 tokens
   - Classic model with good all-round performance
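Collecting the entries above, the `model_type` strings presumably resolve to these Hugging Face checkpoints (a summary of the list, not necessarily how `model.py` stores it):

```python
MODEL_CHECKPOINTS = {
    "distilbert": "distilbert-base-cased",
    "deberta-v3": "microsoft/deberta-v3-base",
    "roberta": "roberta-base",
    "scibert": "allenai/scibert_scivocab_uncased",
    "bert": "bert-base-uncased",
}
```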
## Error Handling
The system includes robust error handling mechanisms:
- Multiple fallback levels for tokenizer initialization
- Graceful degradation to simpler models
- Detailed error messages and logging
- Automatic fallback to BERT tokenizer if needed
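A sketch of the fallback chain described above; the actual logic lives in `model.py`, and this only illustrates the idea:

```python
from transformers import AutoTokenizer, BertTokenizer

def load_tokenizer(checkpoint: str):
    """Try the fast tokenizer, then the slow one, then fall back to BERT."""
    try:
        return AutoTokenizer.from_pretrained(checkpoint, use_fast=True)
    except Exception:
        pass
    try:
        return AutoTokenizer.from_pretrained(checkpoint, use_fast=False)
    except Exception:
        # Last resort, as noted above: the plain BERT tokenizer.
        return BertTokenizer.from_pretrained("bert-base-uncased")
```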
## Requirements
- Python 3.7+
- PyTorch
- Transformers library
- NumPy
- Sacremoses (for tokenization support)