# Python Multilingual BibTeX Generator

This Python script generates `multilingual_papers.bib` by filtering the original `anthology+abstracts.bib` file for multilingual NLP research papers.

## Features

- **Identical Logic**: Uses the same filtering logic as the JavaScript web application
- **Comprehensive Detection**: Detects multilingual papers using keywords and language names
- **LaTeX Cleaning**: Properly handles LaTeX commands and formatting
- **Statistics**: Provides detailed statistics about the filtering process
- **Safe Operation**: Checks for existing files and asks for confirmation before overwriting
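
As an illustration of the LaTeX cleaning step, here is a minimal sketch. The function name `clean_latex` and the exact rules are assumptions made for this example, not the script's actual implementation. It unwraps simple one-argument commands, drops accent commands (which loses the diacritic), and strips case-protecting braces.

```python
import re

# Hypothetical helper, for illustration only.
COMMAND = re.compile(r"\\[a-zA-Z]+\{([^{}]*)\}")  # e.g. \textit{word}
ACCENT = re.compile(r"""\\[`'"^~]""")             # e.g. \' or \"


def clean_latex(text):
    """Return text with common LaTeX markup removed (lossy for accents)."""
    # Unwrap commands from the inside out until nothing changes.
    previous = None
    while previous != text:
        previous = text
        text = COMMAND.sub(r"\1", text)
    # Drop accent commands but keep the letter that follows them.
    text = ACCENT.sub("", text)
    # Remove braces used to protect capitalization, e.g. {H}indi.
    text = text.replace("{", "").replace("}", "")
    # Collapse runs of whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


print(clean_latex(r"{M}ultilingual \textit{Parsing} for {H}indi"))
```

Cleaning titles and abstracts before matching means a protected capital such as `{H}indi` is still recognized as `hindi` after lowercasing.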

## Requirements

- Python 3.6 or higher
- No external dependencies (uses only standard library)

## Usage

1. **Place your files**: Ensure `anthology+abstracts.bib` is in the same directory as the script
2. **Run the script**:
   ```bash
   python generate_multilingual_bib.py
   ```
3. **Follow prompts**: The script will ask for confirmation if `multilingual_papers.bib` already exists
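
The confirmation step could look roughly like the sketch below. The function name `confirm_overwrite`, the prompt wording, and the accepted answers are assumptions for illustration; the real script may phrase and handle this differently.

```python
import os

OUTPUT_FILE = "multilingual_papers.bib"


def confirm_overwrite(path=OUTPUT_FILE, ask=input):
    """Return True if it is safe to write to path.

    `ask` is injectable so the prompt can be replaced in tests.
    """
    if not os.path.exists(path):
        # Nothing to overwrite, so no prompt is needed.
        return True
    answer = ask(f"{path} already exists. Overwrite? (y/n): ")
    return answer.strip().lower() in ("y", "yes")
```

Passing the prompt function as a parameter keeps the check testable without a terminal.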

## Output

The script will:
- Generate `multilingual_papers.bib` containing only multilingual papers
- Display statistics about the filtering process
- Show the top 10 most common keywords found

## Example Output

```
Reading anthology+abstracts.bib...
Parsing BibTeX entries...
Found 50000 total papers
Found 2500 multilingual papers
Generating BibTeX content...
Writing to multilingual_papers.bib...
Successfully generated multilingual_papers.bib with 2500 papers!

Statistics:
  Total papers processed: 50000
  Multilingual papers found: 2500
  Percentage multilingual: 5.0%

Top 10 keywords found:
  multilingual: 1200 papers
  chinese: 800 papers
  crosslingual: 600 papers
  hindi: 400 papers
  low-resource: 350 papers
  korean: 300 papers
  arabic: 250 papers
  japanese: 200 papers
  spanish: 180 papers
  french: 150 papers
```
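
A report in this shape could be produced along the following lines. This is a sketch under assumptions: `format_stats` is a hypothetical name, and it takes one list of matched keywords per multilingual paper. Because a single paper can match several keywords, the keyword counts can add up to more than the number of papers found.

```python
from collections import Counter


def format_stats(total, matched_keywords, top_n=10):
    """Build the statistics report as a single string.

    total            -- number of papers processed
    matched_keywords -- one list of matched keywords per multilingual paper
    """
    found = len(matched_keywords)
    percentage = (found / total * 100) if total else 0.0
    # Count each keyword at most once per paper.
    counts = Counter(kw for kws in matched_keywords for kw in set(kws))
    lines = [
        "Statistics:",
        f"  Total papers processed: {total}",
        f"  Multilingual papers found: {found}",
        f"  Percentage multilingual: {percentage:.1f}%",
        "",
        f"Top {top_n} keywords found:",
    ]
    lines += [f"  {kw}: {n} papers" for kw, n in counts.most_common(top_n)]
    return "\n".join(lines)


print(format_stats(4, [["hindi", "multilingual"], ["multilingual"]]))
```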

## Filtering Criteria

The script uses the same criteria as the web application:

### Multilingual Keywords
- multilingual, crosslingual, multi-lingual, cross-lingual
- low-resource language, low resource language
- low-resource, low resource

### Language Names
- More than 100 language names, including Hindi, Chinese, Korean, Arabic, Spanish, French, German, and Japanese
- Regional language variations and dialects
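
The matching itself can be sketched as follows. The term lists here are only the subset named above, and the use of whole-word matching is an assumption of this example (it stops `french` from matching inside `frenchman`); the script's actual matching rules may differ.

```python
import re

# Subset of the criteria listed above, for illustration only.
MULTILINGUAL_KEYWORDS = [
    "multilingual", "crosslingual", "multi-lingual", "cross-lingual",
    "low-resource language", "low resource language",
    "low-resource", "low resource",
]
LANGUAGE_NAMES = [
    "hindi", "chinese", "korean", "arabic",
    "spanish", "french", "german", "japanese",
]

# One whole-word pattern per term, compiled once up front.
PATTERNS = {
    term: re.compile(r"\b" + re.escape(term) + r"\b")
    for term in MULTILINGUAL_KEYWORDS + LANGUAGE_NAMES
}


def find_keywords(text):
    """Return every listed term that occurs in text, in list order."""
    lowered = text.lower()
    return [term for term, pattern in PATTERNS.items() if pattern.search(lowered)]


def is_multilingual(title, abstract=""):
    """A paper counts as multilingual if any term matches title or abstract."""
    return bool(find_keywords(f"{title} {abstract}"))


print(find_keywords("Cross-lingual Transfer for Hindi Parsing"))
```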

## Customization

You can modify the filtering criteria by editing the constants at the top of the script:

```python
MULTILINGUAL_KEYWORDS = [
    'multilingual', 'crosslingual', 'multi lingual',
    # Add your custom keywords here
]

LANGUAGE_NAMES = [
    'afrikaans', 'albanian', 'amharic', 'arabic',
    # Add more language names here
]
```

## Error Handling

The script includes robust error handling:
- Checks for input file existence
- Handles malformed BibTeX entries gracefully
- Provides clear error messages
- Asks for confirmation before overwriting existing files
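
For example, the input file check might be written like this. The helper name `check_input` is hypothetical; the message mirrors the one shown under Troubleshooting.

```python
import os
import sys

INPUT_FILE = "anthology+abstracts.bib"


def check_input(path=INPUT_FILE):
    """Return True if the input file exists, else print an error and return False."""
    if not os.path.isfile(path):
        print(f"Error: {path} not found in current directory.", file=sys.stderr)
        return False
    return True
```

A caller would typically stop early, for instance with `sys.exit(1)`, when this returns `False`.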

## Performance

- Efficient regex-based parsing
- Memory-efficient processing for large files
- Fast keyword matching using set operations
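
To give a sense of what regex-based parsing can look like, here is a simplified sketch that yields one entry at a time. The pattern and the name `iter_entries` are assumptions for this example; it treats any line starting with `@type{key,` as the beginning of a new entry, which is a simplification of real BibTeX syntax.

```python
import re

# Matches the start of an entry such as "@inproceedings{smith-2020-example,".
ENTRY_START = re.compile(r"^@(\w+)\s*\{\s*([^,\s]+)\s*,", re.MULTILINE)


def iter_entries(text):
    """Yield (entry_type, citation_key, raw_entry) for each entry in text."""
    previous = None
    for match in ENTRY_START.finditer(text):
        if previous is not None:
            # The previous entry runs up to the start of this one.
            raw = text[previous.start():match.start()].strip()
            yield previous.group(1).lower(), previous.group(2), raw
        previous = match
    if previous is not None:
        # The last entry runs to the end of the text.
        raw = text[previous.start():].strip()
        yield previous.group(1).lower(), previous.group(2), raw


sample = "@inproceedings{a-2020,\n  title = {One},\n}\n@article{b-2021,\n  title = {Two},\n}\n"
for entry_type, key, _ in iter_entries(sample):
    print(entry_type, key)
```

Yielding entries lazily means the caller can filter each one as it is produced instead of building a full list first.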

## Troubleshooting

### File Not Found
```
Error: anthology+abstracts.bib not found in current directory.
```
**Solution**: Ensure the input file is in the same directory as the script.

### No Papers Found
```
No multilingual papers found. Check your keywords and language lists.
```
**Solution**: Verify your BibTeX file contains papers with multilingual content, or adjust the keyword lists.

### Encoding Issues
The script uses UTF-8 encoding. If you encounter encoding errors, make sure your BibTeX file is saved as UTF-8.
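
To find out where a file stops being valid UTF-8, a small diagnostic like the one below can help. It is a standalone sketch, not part of the script; `read_bib_text` is a hypothetical name, and falling back to replacement characters is an assumption of this example.

```python
from pathlib import Path


def read_bib_text(path):
    """Read a file as UTF-8, reporting the first invalid byte if there is one."""
    data = Path(path).read_bytes()
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first undecodable byte in the file.
        print(f"Warning: invalid UTF-8 at byte offset {exc.start} in {path}")
        # Keep going, marking undecodable bytes with U+FFFD.
        return data.decode("utf-8", errors="replace")
```

Re-saving the file as UTF-8 in an editor is usually the cleaner fix; the replacement fallback only makes the damage visible.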

## Comparison with JavaScript Version

This Python script produces identical results to the JavaScript web application:
- Same filtering logic
- Same LaTeX cleaning
- Same BibTeX output format
- Same keyword detection

The main advantage is that it can be run independently without a web browser and provides detailed statistics about the filtering process.