# Python Multilingual BibTeX Generator
This Python script generates `multilingual_papers.bib` by filtering the original `anthology+abstracts.bib` file for multilingual NLP research papers.
## Features
- **Identical Logic**: Uses the same filtering logic as the JavaScript web application
- **Comprehensive Detection**: Detects multilingual papers using keywords and language names
- **LaTeX Cleaning**: Properly handles LaTeX commands and formatting
- **Statistics**: Provides detailed statistics about the filtering process
- **Safe Operation**: Checks for existing files and asks for confirmation before overwriting
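
As an illustration of the LaTeX cleaning step, here is a minimal sketch. The function name `clean_latex` and the exact rules are assumptions made for this example; the script's own cleaning may cover more cases.

```python
import re

def clean_latex(text):
    """Hypothetical helper: reduce common LaTeX markup to plain text."""
    # \command{arg} -> arg (only handles arguments without nested braces)
    text = re.sub(r"\\[a-zA-Z]+\s*\{([^{}]*)\}", r"\1", text)
    # Escaped special characters: \& -> &, \% -> %, \_ -> _, and so on
    text = re.sub(r"\\([&%_#$])", r"\1", text)
    # Drop the remaining case-protecting braces, e.g. {M}ultilingual
    text = text.replace("{", "").replace("}", "")
    # Collapse runs of whitespace left behind by the substitutions
    return re.sub(r"\s+", " ", text).strip()
```

Cleaning titles and abstracts before matching means a keyword such as "multilingual" is still found when BibTeX wraps part of it in braces.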
## Requirements
- Python 3.6 or higher
- No external dependencies (uses only standard library)
## Usage
1. **Place your files**: Ensure `anthology+abstracts.bib` is in the same directory as the script
2. **Run the script**:
```bash
python generate_multilingual_bib.py
```
3. **Follow prompts**: The script will ask for confirmation if `multilingual_papers.bib` already exists
## Output
The script will:
- Generate `multilingual_papers.bib` containing only multilingual papers
- Display statistics about the filtering process
- Show the top 10 most common keywords found
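
The statistics can be computed with `collections.Counter` from the standard library. The sketch below assumes the matched keywords have already been collected per paper; `summarize` and its parameters are hypothetical names, not taken from the script.

```python
from collections import Counter

def summarize(total_papers, matched_keywords_per_paper, top_n=10):
    """Hypothetical helper: return (papers found, percentage, top keywords)."""
    counts = Counter()
    for keywords in matched_keywords_per_paper:
        # Use a set so a keyword is counted once per paper
        counts.update(set(keywords))
    found = len(matched_keywords_per_paper)
    percentage = 100.0 * found / total_papers if total_papers else 0.0
    return found, percentage, counts.most_common(top_n)
```

With the figures from the example run, 2500 matches out of 50000 papers gives 100 × 2500 / 50000 = 5.0%.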
## Example Output
```
Reading anthology+abstracts.bib...
Parsing BibTeX entries...
Found 50000 total papers
Found 2500 multilingual papers
Generating BibTeX content...
Writing to multilingual_papers.bib...
Successfully generated multilingual_papers.bib with 2500 papers!
Statistics:
Total papers processed: 50000
Multilingual papers found: 2500
Percentage multilingual: 5.0%
Top 10 keywords found:
multilingual: 1200 papers
chinese: 800 papers
crosslingual: 600 papers
hindi: 400 papers
low-resource: 350 papers
korean: 300 papers
arabic: 250 papers
japanese: 200 papers
spanish: 180 papers
french: 150 papers
```
## Filtering Criteria
The script uses the same criteria as the web application:
### Multilingual Keywords
- multilingual, crosslingual, multi-lingual, cross-lingual
- low-resource language, low resource language
- low-resource, low resource
### Language Names
- 100+ language names including: Hindi, Chinese, Korean, Arabic, Spanish, French, German, Japanese, etc.
- Regional language variations and dialects
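
The criteria above can be sketched as a simple matcher. This is an assumption-laden illustration: the lists are shortened, the function names are hypothetical, and whole-word, case-insensitive matching is a choice made here rather than confirmed behavior of the script.

```python
import re

# Shortened lists for illustration; the script's lists are longer.
MULTILINGUAL_KEYWORDS = [
    "multilingual", "crosslingual", "multi-lingual", "cross-lingual",
    "low-resource", "low resource",
]
LANGUAGE_NAMES = ["hindi", "chinese", "korean", "arabic"]

def find_matches(text, terms):
    """Return the terms that occur in text as whole words, ignoring case."""
    lowered = text.lower()
    return [t for t in terms
            if re.search(r"\b" + re.escape(t) + r"\b", lowered)]

def is_multilingual(title, abstract=""):
    """A paper qualifies if any keyword or language name matches."""
    text = f"{title} {abstract}"
    return bool(find_matches(text, MULTILINGUAL_KEYWORDS)
                or find_matches(text, LANGUAGE_NAMES))
```

Whole-word matching avoids false positives from substrings, at the cost of missing inflected forms.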
## Customization
You can modify the filtering criteria by editing the constants at the top of the script:
```python
MULTILINGUAL_KEYWORDS = [
'multilingual', 'crosslingual', 'multi lingual',
# Add your custom keywords here
]
LANGUAGE_NAMES = [
'afrikaans', 'albanian', 'amharic', 'arabic',
# Add more language names here
]
```
## Error Handling
The script handles the following cases:
- Checks for input file existence
- Handles malformed BibTeX entries gracefully
- Provides clear error messages
- Asks for confirmation before overwriting existing files
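
The overwrite check can be sketched as follows. The function name, the prompt wording, and the `ask` parameter (which makes the prompt replaceable in tests) are assumptions for illustration.

```python
import os

def confirm_overwrite(path, ask=input):
    """Hypothetical helper: return True if it is safe to write to path."""
    if not os.path.exists(path):
        return True  # nothing to overwrite
    answer = ask(f"{path} already exists. Overwrite? (y/N): ")
    # Anything other than an explicit yes keeps the existing file
    return answer.strip().lower() in ("y", "yes")
```

Defaulting to "no" means an accidental Enter keypress does not destroy an existing `multilingual_papers.bib`.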
## Performance
- Efficient regex-based parsing
- Memory-efficient processing for large files
- Fast keyword matching using set operations
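
To illustrate regex-based parsing, the sketch below splits a BibTeX file into raw entries by locating each `@type{key,` header at the start of a line. The pattern and the function name are assumptions; the script's actual parser may differ and may handle more edge cases.

```python
import re

# Matches an entry header such as "@inproceedings{smith-2020," at line start
ENTRY_START = re.compile(r"^@(\w+)\s*\{\s*([^,\s]+)\s*,", re.MULTILINE)

def split_entries(bib_text):
    """Yield (entry_type, key, raw_entry) for each entry in bib_text."""
    starts = list(ENTRY_START.finditer(bib_text))
    for i, match in enumerate(starts):
        # An entry runs until the next header or the end of the file
        end = starts[i + 1].start() if i + 1 < len(starts) else len(bib_text)
        raw = bib_text[match.start():end].strip()
        yield match.group(1).lower(), match.group(2), raw
```

Because entries are yielded one at a time, a caller can filter and write them without holding parsed records for the whole file.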
## Troubleshooting
### File Not Found
```
Error: anthology+abstracts.bib not found in current directory.
```
**Solution**: Ensure the input file is in the same directory as the script.
### No Papers Found
```
No multilingual papers found. Check your keywords and language lists.
```
**Solution**: Verify your BibTeX file contains papers with multilingual content, or adjust the keyword lists.
### Encoding Issues
The script uses UTF-8 encoding. If you encounter encoding errors, ensure your BibTeX file is saved as UTF-8.
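
To locate the problem in a file that fails to decode, a small diagnostic like the following can help. It is not part of the script; `find_invalid_utf8` is a hypothetical helper.

```python
def find_invalid_utf8(path):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start  # offset of the first undecodable byte
    return None
```

Opening the file at the reported offset in an editor usually reveals whether it was saved in another encoding such as Latin-1.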
## Comparison with JavaScript Version
This Python script produces identical results to the JavaScript web application:
- Same filtering logic
- Same LaTeX cleaning
- Same BibTeX output format
- Same keyword detection
The main advantage is that it can be run independently without a web browser and provides detailed statistics about the filtering process.