---
title: Paper Classifier
emoji: 📚
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: "1.32.0"
app_file: app.py
pinned: false
---

# 📚 Academic Paper Classifier

[Live demo on Hugging Face Spaces](https://huggingface.co/spaces/ssbars/ysdaml4)

This Streamlit application classifies academic papers into broad academic fields using a BERT-based model.

## Features

- **Text Classification**: Paste any paper text directly
- **PDF Support**: Upload PDF files for classification
- **Real-time Analysis**: Get instant classification results
- **Probability Distribution**: See confidence scores for each category
- **Multiple Categories**: Supports five academic fields (listed under Categories below)

## How to Use

1. **Text Input**
   - Paste your paper's text (abstract or full content)
   - Click "Classify Text"
   - View results and probability distribution

2. **PDF Upload**
   - Upload a PDF file of your paper
   - Click "Classify PDF"
   - Get classification results

## Categories

The model classifies papers into the following categories:
- Computer Science
- Mathematics
- Physics
- Biology
- Economics

## Technical Details

- Built with Streamlit
- Uses a BERT-based model for classification
- Supports PDF file processing
- Real-time classification
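
For orientation, here is a minimal sketch of how the app's text-classification flow might look in Streamlit. The `PaperClassifier` class and its `classify_paper` method come from the usage examples later in this README; the widget labels and the assumption that the result is a dict of per-category probabilities are illustrative, not the actual `app.py`.

```python
# A minimal sketch, not the actual app.py.
import streamlit as st

from model import PaperClassifier

st.title("Academic Paper Classifier")

@st.cache_resource  # load the model once and reuse it across reruns
def load_classifier():
    return PaperClassifier()

classifier = load_classifier()

text = st.text_area("Paste the paper's abstract or full text")
if st.button("Classify Text") and text.strip():
    # Assumes classify_paper returns {category: probability}.
    result = classifier.classify_paper(title="", abstract=text)
    st.bar_chart(result)
```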

## About

This application is designed to help researchers, students, and academics quickly identify the primary field of an academic paper. It uses transformer-based natural language processing to analyze paper content and estimate the most likely field.

---
Created with ❤️ using Streamlit and Transformers

## Setup

1. Install `uv` (if not already installed):
```bash
# Using pip
pip install uv

# Or using Homebrew on macOS
brew install uv
```

2. Create and activate a virtual environment:
```bash
uv venv
source .venv/bin/activate  # On Unix/macOS
# OR
.venv\Scripts\activate     # On Windows
```

3. Install the dependencies using uv:
```bash
uv pip install -r requirements.lock
```

4. Run the Streamlit application:
```bash
streamlit run app.py
```

## Usage

1. **Text Classification**
   - Paste the paper's text (abstract or content) into the text area
   - Click "Classify Text" to get results

2. **PDF Classification**
   - Upload a PDF file using the file uploader
   - Click "Classify PDF" to process and classify the document

## Model Information

The service uses a BERT-based model for classification with the following categories:
- Computer Science
- Mathematics
- Physics
- Biology
- Economics

## Note

The current implementation uses a base BERT model. For production use, you should:
1. Fine-tune the model on a dataset of academic papers
2. Adjust the categories based on your specific needs
3. Implement proper error handling and validation (see the sketch after this list)
4. Add authentication if needed
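
As one illustration of item 3, a minimal input-validation sketch (the threshold and message are illustrative assumptions, not this project's code):

```python
def validate_input(text: str, min_chars: int = 50) -> str:
    """Reject empty or very short inputs before calling the model."""
    cleaned = text.strip()
    if len(cleaned) < min_chars:
        raise ValueError(
            f"Input too short ({len(cleaned)} chars); need at least "
            f"{min_chars} characters for a meaningful classification."
        )
    return cleaned
```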

## Package Management

This project uses `uv` as the package manager for faster and more reliable dependency management. The dependencies are locked in `requirements.lock` for reproducible installations.

To update dependencies:
```bash
# Update a single package
uv pip install --upgrade package_name

# Update all packages and regenerate lock file
uv pip compile requirements.txt -o requirements.lock
uv pip install -r requirements.lock
```

## Requirements

See `requirements.txt` for a complete list of dependencies.

# ArXiv Paper Classifier

This project implements a machine learning system for classifying academic papers into ArXiv categories using state-of-the-art transformer models.

## Project Overview

The system uses pre-trained transformer models to classify academic papers into one of the main ArXiv categories:
- Computer Science (cs)
- Mathematics (math)
- Physics (physics)
- Quantitative Biology (q-bio)
- Quantitative Finance (q-fin)
- Statistics (stat)
- Electrical Engineering and Systems Science (eess)
- Economics (econ)
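
In code, this taxonomy maps naturally onto a small lookup table; a sketch (the variable name is an assumption):

```python
# Mapping from arXiv category codes to human-readable names.
ARXIV_CATEGORIES = {
    "cs": "Computer Science",
    "math": "Mathematics",
    "physics": "Physics",
    "q-bio": "Quantitative Biology",
    "q-fin": "Quantitative Finance",
    "stat": "Statistics",
    "eess": "Electrical Engineering and Systems Science",
    "econ": "Economics",
}
```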

## Features

- Multiple model support:
  - DistilBERT: Lightweight and fast model, good for testing
  - DeBERTa-v3: Advanced model with better performance
  - RoBERTa: Advanced model with strong performance
  - SciBERT: Specialized for scientific text
  - BERT: Classic model with good all-round performance

- Flexible input handling (see the tokenization sketch after this list):
  - Can process both title and abstract
  - Handles text preprocessing and tokenization
  - Supports different maximum sequence lengths

- Robust error handling:
  - Multiple fallback mechanisms for tokenizer initialization
  - Graceful degradation to simpler models if needed
  - Detailed error messages and logging
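
The title-and-abstract handling usually amounts to joining the two fields and letting the tokenizer truncate to the model's maximum length. Here is a minimal sketch using Hugging Face's `AutoTokenizer`; the separator and padding strategy are assumptions consistent with the model table below.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")

def encode_paper(title: str, abstract: str, max_length: int = 512):
    """Join title and abstract, then tokenize with truncation and padding."""
    text = f"{title} {tokenizer.sep_token} {abstract}"
    return tokenizer(
        text,
        truncation=True,
        padding="max_length",
        max_length=max_length,
        return_tensors="pt",
    )
```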

## Installation

1. Clone the repository
2. Install dependencies:
```bash
pip install -r requirements.txt
```

## Usage

### Basic Usage

```python
from model import PaperClassifier

# Initialize classifier with default model (DistilBERT)
classifier = PaperClassifier()

# Classify a paper
result = classifier.classify_paper(
    title="Your paper title",
    abstract="Your paper abstract"
)

# Print results
print(result)
```
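
The exact structure of `result` depends on the implementation; a common pattern is a dict mapping each category to a predicted probability, in which case the top category can be recovered with `max(result, key=result.get)`.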

### Using Different Models

```python
# Initialize with DeBERTa-v3
classifier = PaperClassifier(model_type='deberta-v3')

# Initialize with RoBERTa
classifier = PaperClassifier(model_type='roberta')

# Initialize with SciBERT
classifier = PaperClassifier(model_type='scibert')

# Initialize with BERT
classifier = PaperClassifier(model_type='bert')
```

### Training on Custom Data

```python
# Prepare your training data
train_texts = ["paper1 title and abstract", "paper2 title and abstract", ...]
train_labels = ["cs", "math", ...]

# Train the model
classifier.train_on_arxiv(
    train_texts=train_texts,
    train_labels=train_labels,
    epochs=3,
    batch_size=16,
    learning_rate=2e-5
)
```

## Model Details

### Available Models

1. **DistilBERT** (`distilbert`)
   - Model: `distilbert-base-cased`
   - Max length: 512 tokens
   - Fast tokenizer
   - Good for testing and quick results

2. **DeBERTa-v3** (`deberta-v3`)
   - Model: `microsoft/deberta-v3-base`
   - Max length: 512 tokens
   - Uses DebertaV2TokenizerFast
   - Advanced performance

3. **RoBERTa** (`roberta`)
   - Model: `roberta-base`
   - Max length: 512 tokens
   - Strong performance on various tasks

4. **SciBERT** (`scibert`)
   - Model: `allenai/scibert_scivocab_uncased`
   - Max length: 512 tokens
   - Specialized for scientific text

5. **BERT** (`bert`)
   - Model: `bert-base-uncased`
   - Max length: 512 tokens
   - Classic model with good all-round performance
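
A common way to wire up the table above is a single config dict keyed by the `model_type` strings; a sketch (the variable name and structure are assumptions, the checkpoints are those listed above):

```python
MODEL_CONFIGS = {
    "distilbert": {"checkpoint": "distilbert-base-cased", "max_length": 512},
    "deberta-v3": {"checkpoint": "microsoft/deberta-v3-base", "max_length": 512},
    "roberta": {"checkpoint": "roberta-base", "max_length": 512},
    "scibert": {"checkpoint": "allenai/scibert_scivocab_uncased", "max_length": 512},
    "bert": {"checkpoint": "bert-base-uncased", "max_length": 512},
}
```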

## Error Handling

The system includes robust error handling mechanisms:
- Multiple fallback levels for tokenizer initialization
- Graceful degradation to simpler models
- Detailed error messages and logging
- Automatic fallback to BERT tokenizer if needed
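
A minimal sketch of the fallback idea (the exact fallback chain in this project may differ):

```python
from transformers import AutoTokenizer

def load_tokenizer_with_fallback(checkpoint: str):
    """Try the model's own tokenizer first, then fall back to BERT's."""
    try:
        return AutoTokenizer.from_pretrained(checkpoint)
    except Exception as err:
        print(f"Could not load tokenizer for {checkpoint} ({err}); "
              "falling back to bert-base-uncased.")
        return AutoTokenizer.from_pretrained("bert-base-uncased")
```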

## Requirements

- Python 3.7+
- PyTorch
- Transformers library
- NumPy
- Sacremoses (for tokenization support)