---
title: Paper Classifier
emoji: πŸ“š
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: "1.32.0"
app_file: app.py
pinned: false
---
# πŸ“š Academic Paper Classifier
Live Space: [ssbars/ysdaml4](https://huggingface.co/spaces/ssbars/ysdaml4)

This Streamlit application helps classify academic papers into different categories using a BERT-based model.
## Features
- **Text Classification**: Paste any paper text directly
- **PDF Support**: Upload PDF files for classification
- **Real-time Analysis**: Get instant classification results
- **Probability Distribution**: See confidence scores for each category
- **Multiple Categories**: Supports various academic fields
## How to Use
1. **Text Input**
   - Paste your paper's text (abstract or full content)
   - Click "Classify Text"
   - View results and probability distribution
2. **PDF Upload**
   - Upload a PDF file of your paper
   - Click "Classify PDF"
   - Get classification results
## Categories
The model classifies papers into the following categories:
- Computer Science
- Mathematics
- Physics
- Biology
- Economics
## Technical Details
- Built with Streamlit
- Uses a BERT-based model for classification
- Supports PDF file processing
- Real-time classification
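Below is a minimal sketch of how these pieces typically fit together. It is illustrative only: the actual `app.py` may differ, and both `pypdf` and the bare `bert-base-uncased` pipeline are assumptions rather than the app's confirmed implementation.

```python
import streamlit as st
from pypdf import PdfReader
from transformers import pipeline

# Load the model once; Streamlit reruns the script on every interaction.
@st.cache_resource
def load_classifier():
    return pipeline("text-classification", model="bert-base-uncased")

classifier = load_classifier()

text = st.text_area("Paste paper text")
pdf = st.file_uploader("...or upload a PDF", type="pdf")
if pdf is not None:
    # Concatenate the extracted text of every page.
    text = "\n".join(page.extract_text() or "" for page in PdfReader(pdf).pages)

if st.button("Classify") and text:
    # top_k=None returns a score for every label (the probability distribution).
    st.write(classifier(text, truncation=True, top_k=None))
```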
## About
This application is designed to help researchers, students, and academics quickly identify the primary field of an academic paper. It uses transformer-based natural language processing to analyze the paper's content and suggest a classification.
---
Created with ❀️ using Streamlit and Transformers
## Setup
1. Install `uv` (if not already installed):
```bash
# Using pip
pip install uv

# Or using Homebrew on macOS
brew install uv
```
2. Create and activate a virtual environment:
```bash
uv venv
source .venv/bin/activate # On Unix/macOS
# OR
.venv\Scripts\activate # On Windows
```
3. Install the dependencies using uv:
```bash
uv pip install -r requirements.lock
```
4. Run the Streamlit application:
```bash
streamlit run app.py
```
## Usage
1. **Text Classification**
   - Paste the paper's text (abstract or content) into the text area
   - Click "Classify Text" to get results
2. **PDF Classification**
   - Upload a PDF file using the file uploader
   - Click "Classify PDF" to process and classify the document
## Model Information
The service uses a BERT-based model for classification with the following categories:
- Computer Science
- Mathematics
- Physics
- Biology
- Economics
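The per-category confidence scores shown in the UI are typically obtained by applying a softmax to the model's raw logits. A sketch, assuming the five labels above in that order (the real label order lives in the model config):

```python
import torch

CATEGORIES = ["Computer Science", "Mathematics", "Physics", "Biology", "Economics"]

def to_distribution(logits: torch.Tensor) -> dict:
    """Convert a (1, num_labels) logits tensor into {category: probability}."""
    probs = torch.softmax(logits, dim=-1).squeeze(0)
    return {cat: round(p.item(), 3) for cat, p in zip(CATEGORIES, probs)}
```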
## Note
The current implementation uses a base BERT model. For production use, you should:
1. Fine-tune the model on a dataset of academic papers (a minimal sketch follows this list)
2. Adjust the categories based on your specific needs
3. Implement proper error handling and validation
4. Add authentication if needed
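For step 1, here is a minimal fine-tuning sketch using the Hugging Face `Trainer` API. Everything in it is a placeholder: `papers.csv`, its `text`/`label` columns, and the label set are illustrative, not files or names this repository ships.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

LABELS = ["Computer Science", "Mathematics", "Physics", "Biology", "Economics"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS)
)

# Placeholder dataset: a CSV with "text" and "label" columns.
ds = load_dataset("csv", data_files="papers.csv")["train"]
ds = ds.map(lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
            batched=True)
ds = ds.map(lambda row: {"labels": LABELS.index(row["label"])})

Trainer(
    model=model,
    args=TrainingArguments(output_dir="checkpoints", num_train_epochs=3),
    train_dataset=ds,
    tokenizer=tokenizer,  # enables padded batching via DataCollatorWithPadding
).train()
```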
## Package Management
This project uses `uv` as the package manager for faster and more reliable dependency management. The dependencies are locked in `requirements.lock` for reproducible installations.
To update dependencies:
```bash
# Update a single package
uv pip install --upgrade package_name

# Update all packages and regenerate lock file
uv pip compile requirements.txt -o requirements.lock
uv pip install -r requirements.lock
```
## Requirements
See `requirements.txt` for a complete list of dependencies.
# ArXiv Paper Classifier
This project implements a machine learning system for classifying academic papers into ArXiv categories using state-of-the-art transformer models.
## Project Overview
The system uses pre-trained transformer models to classify academic papers into one of the main ArXiv categories:
- Computer Science (cs)
- Mathematics (math)
- Physics (physics)
- Quantitative Biology (q-bio)
- Quantitative Finance (q-fin)
- Statistics (stat)
- Electrical Engineering and Systems Science (eess)
- Economics (econ)
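The category codes above, written out as a plain mapping of the kind one might use for a model's `id2label` config (a restatement of the list, not the repository's actual code):

```python
ARXIV_CATEGORIES = {
    "cs": "Computer Science",
    "math": "Mathematics",
    "physics": "Physics",
    "q-bio": "Quantitative Biology",
    "q-fin": "Quantitative Finance",
    "stat": "Statistics",
    "eess": "Electrical Engineering and Systems Science",
    "econ": "Economics",
}
```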
## Features
- Multiple model support:
  - DistilBERT: Lightweight and fast model, good for testing
  - DeBERTa-v3: Advanced model with better performance
  - RoBERTa: Advanced model with strong performance
  - SciBERT: Specialized for scientific text
  - BERT: Classic model with good all-round performance
- Flexible input handling (sketched after this list):
  - Can process both title and abstract
  - Handles text preprocessing and tokenization
  - Supports different maximum sequence lengths
- Robust error handling:
  - Multiple fallback mechanisms for tokenizer initialization
  - Graceful degradation to simpler models if needed
  - Detailed error messages and logging
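A sketch of the title-plus-abstract input handling described above. The real preprocessing lives in `model.py`; the function name and checkpoint here are illustrative:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")

def encode_paper(title: str, abstract: str, max_length: int = 512):
    """Join title and abstract into one sequence, then tokenize with truncation."""
    text = f"{title} {abstract}".strip()
    return tokenizer(text, truncation=True, max_length=max_length,
                     return_tensors="pt")
```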
## Installation
1. Clone the repository
2. Install dependencies:
```bash
pip install -r requirements.txt
```
## Usage
### Basic Usage
```python
from model import PaperClassifier

# Initialize classifier with default model (DistilBERT)
classifier = PaperClassifier()

# Classify a paper
result = classifier.classify_paper(
    title="Your paper title",
    abstract="Your paper abstract"
)

# Print results
print(result)
```
### Using Different Models
```python
# Initialize with DeBERTa-v3
classifier = PaperClassifier(model_type='deberta-v3')

# Initialize with RoBERTa
classifier = PaperClassifier(model_type='roberta')

# Initialize with SciBERT
classifier = PaperClassifier(model_type='scibert')

# Initialize with BERT
classifier = PaperClassifier(model_type='bert')
```
### Training on Custom Data
```python
# Prepare your training data
train_texts = ["paper1 title and abstract", "paper2 title and abstract", ...]
train_labels = ["cs", "math", ...]

# Train the model
classifier.train_on_arxiv(
    train_texts=train_texts,
    train_labels=train_labels,
    epochs=3,
    batch_size=16,
    learning_rate=2e-5
)
```
## Model Details
### Available Models
1. **DistilBERT** (`distilbert`)
   - Model: `distilbert-base-cased`
   - Max length: 512 tokens
   - Fast tokenizer
   - Good for testing and quick results
2. **DeBERTa-v3** (`deberta-v3`)
   - Model: `microsoft/deberta-v3-base`
   - Max length: 512 tokens
   - Uses `DebertaV2TokenizerFast`
   - Advanced performance
3. **RoBERTa** (`roberta`)
   - Model: `roberta-base`
   - Max length: 512 tokens
   - Strong performance on various tasks
4. **SciBERT** (`scibert`)
   - Model: `allenai/scibert_scivocab_uncased`
   - Max length: 512 tokens
   - Specialized for scientific text
5. **BERT** (`bert`)
   - Model: `bert-base-uncased`
   - Max length: 512 tokens
   - Classic model with good all-round performance
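Collecting the entries above, the `model_type` strings presumably resolve to these Hugging Face checkpoints (a summary of the list, not necessarily how `model.py` stores it):

```python
MODEL_CHECKPOINTS = {
    "distilbert": "distilbert-base-cased",
    "deberta-v3": "microsoft/deberta-v3-base",
    "roberta": "roberta-base",
    "scibert": "allenai/scibert_scivocab_uncased",
    "bert": "bert-base-uncased",
}
```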
## Error Handling
The system includes robust error handling mechanisms:
- Multiple fallback levels for tokenizer initialization
- Graceful degradation to simpler models
- Detailed error messages and logging
- Automatic fallback to BERT tokenizer if needed
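A sketch of the fallback chain described above; the actual logic lives in `model.py`, and this only illustrates the idea:

```python
from transformers import AutoTokenizer, BertTokenizer

def load_tokenizer(checkpoint: str):
    """Try the fast tokenizer, then the slow one, then fall back to BERT."""
    try:
        return AutoTokenizer.from_pretrained(checkpoint, use_fast=True)
    except Exception:
        pass
    try:
        return AutoTokenizer.from_pretrained(checkpoint, use_fast=False)
    except Exception:
        # Last resort, as noted above: the plain BERT tokenizer.
        return BertTokenizer.from_pretrained("bert-base-uncased")
```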
## Requirements
- Python 3.7+
- PyTorch
- Transformers library
- NumPy
- Sacremoses (for tokenization support)