# Python Multilingual BibTeX Generator

This Python script generates `multilingual_papers.bib` by filtering the original `anthology+abstracts.bib` file for multilingual NLP research papers.

## Features

- **Identical Logic**: Uses the same filtering logic as the JavaScript web application
- **Comprehensive Detection**: Detects multilingual papers using keywords and language names
- **LaTeX Cleaning**: Properly handles LaTeX commands and formatting
- **Statistics**: Provides detailed statistics about the filtering process
- **Safe Operation**: Checks for existing files and asks for confirmation before overwriting

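The script's actual cleaning rules are not reproduced in this README. As a rough illustration of what LaTeX cleaning can involve, here is a minimal sketch; the function name `clean_latex` and the specific rules are assumptions, and accent commands such as `\"{u}` are not handled.

```python
import re

def clean_latex(text):
    """Hypothetical helper: strip common LaTeX markup from a BibTeX field."""
    # Unwrap formatting commands such as \textit{...} or \emph{...}, keeping the argument.
    text = re.sub(r'\\[a-zA-Z]+\{([^{}]*)\}', r'\1', text)
    # Replace escaped characters such as \& or \% with the bare character.
    text = re.sub(r'\\([&%$#_])', r'\1', text)
    # Drop the braces BibTeX uses to protect capitalization, e.g. {H}indi.
    text = text.replace('{', '').replace('}', '')
    # Collapse runs of whitespace.
    return re.sub(r'\s+', ' ', text).strip()

print(clean_latex(r'{C}ross-lingual \textit{transfer} for {H}indi \& {U}rdu'))
# prints: Cross-lingual transfer for Hindi & Urdu
```

Cleaning before matching matters because a protected capital such as `{H}indi` would otherwise not match the plain language name `hindi`.
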
## Requirements

- Python 3.6 or higher
- No external dependencies (uses only the standard library)

## Usage

1. **Place your files**: Ensure `anthology+abstracts.bib` is in the same directory as the script
2. **Run the script**:
   ```bash
   python generate_multilingual_bib.py
   ```
3. **Follow the prompts**: The script will ask for confirmation if `multilingual_papers.bib` already exists

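The confirmation step in the last item could be implemented roughly as follows. This is a sketch only: the helper name `confirm_overwrite`, the prompt wording, and the accepted answers are assumptions, not taken from the script.

```python
import os

def confirm_overwrite(path, ask=input):
    """Hypothetical helper: return True if it is safe to write to `path`."""
    if not os.path.exists(path):
        return True  # Nothing to overwrite, so no prompt is needed.
    answer = ask(f"{path} already exists. Overwrite? (y/n): ")
    return answer.strip().lower() in ('y', 'yes')
```

A caller would write the output only when `confirm_overwrite('multilingual_papers.bib')` returns `True`. Passing the prompt function as `ask` keeps the helper testable without a terminal.
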
## Output

The script will:
- Generate `multilingual_papers.bib` containing only multilingual papers
- Display statistics about the filtering process
- Show the top 10 most common keywords found

## Example Output

```
Reading anthology+abstracts.bib...
Parsing BibTeX entries...
Found 50000 total papers
Found 2500 multilingual papers
Generating BibTeX content...
Writing to multilingual_papers.bib...
Successfully generated multilingual_papers.bib with 2500 papers!

Statistics:
Total papers processed: 50000
Multilingual papers found: 2500
Percentage multilingual: 5.0%

Top 10 keywords found:
multilingual: 1200 papers
chinese: 800 papers
crosslingual: 600 papers
hindi: 400 papers
low-resource: 350 papers
korean: 300 papers
arabic: 250 papers
japanese: 200 papers
spanish: 180 papers
french: 150 papers
```

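Statistics like those in the sample run could be computed along these lines. This is a sketch under assumptions: the helper `summarize` and the shape of `matches` (paper key mapped to its matched terms) are hypothetical, and the toy numbers are not real results.

```python
from collections import Counter

def summarize(total, matches):
    """Hypothetical helper: `matches` maps each matched paper key to its matched terms."""
    counts = Counter()
    for keywords in matches.values():
        counts.update(set(keywords))  # Count each term at most once per paper.
    percentage = 100 * len(matches) / total if total else 0.0
    return percentage, counts.most_common(10)

# Toy data for illustration only.
matches = {
    'paper-a': ['multilingual', 'hindi'],
    'paper-b': ['multilingual'],
}
percentage, top = summarize(total=40, matches=matches)
print(f"Percentage multilingual: {percentage:.1f}%")
for keyword, n in top:
    print(f"{keyword}: {n} papers")
```

Because one paper can match several terms, the per-keyword counts can add up to more than the number of multilingual papers, as they do in the sample run.
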
## Filtering Criteria

The script uses the same criteria as the web application:

### Multilingual Keywords
- multilingual, crosslingual, multi-lingual, cross-lingual
- low-resource language, low resource language
- low-resource, low resource

### Language Names
- 100+ language names including: Hindi, Chinese, Korean, Arabic, Spanish, French, German, Japanese, etc.
- Regional language variations and dialects

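To make the criteria concrete, here is a minimal sketch of how a paper might be tested against both lists. It assumes case-insensitive, whole-word matching over the title and abstract; the helper `find_matches` is hypothetical and the lists are small subsets, so the script's actual matching may differ.

```python
import re

# Small illustrative subsets; the real lists are longer.
MULTILINGUAL_KEYWORDS = ['multilingual', 'crosslingual', 'multi-lingual',
                         'cross-lingual', 'low-resource', 'low resource']
LANGUAGE_NAMES = ['hindi', 'chinese', 'korean', 'arabic']

def find_matches(title, abstract=''):
    """Hypothetical helper: return the keywords and language names found in a paper."""
    text = f"{title} {abstract}".lower()
    found = []
    for term in MULTILINGUAL_KEYWORDS + LANGUAGE_NAMES:
        # Word boundaries prevent matches inside longer words.
        if re.search(r'\b' + re.escape(term) + r'\b', text):
            found.append(term)
    return found

print(find_matches('Cross-lingual Transfer for Hindi Parsing'))
# prints: ['cross-lingual', 'hindi']

# Under this sketch, a paper counts as multilingual when at least one term matches.
print(bool(find_matches('A Survey of Parsing Algorithms')))
# prints: False
```
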
## Customization

You can modify the filtering criteria by editing the constants at the top of the script:

```python
MULTILINGUAL_KEYWORDS = [
    'multilingual', 'crosslingual', 'multi lingual',
    # Add your custom keywords here
]

LANGUAGE_NAMES = [
    'afrikaans', 'albanian', 'amharic', 'arabic',
    # Add more language names here
]
```

## Error Handling

The script includes robust error handling:
- Checks for input file existence
- Handles malformed BibTeX entries gracefully
- Provides clear error messages
- Asks for confirmation before overwriting existing files

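As an illustration of tolerating malformed entries, the sketch below skips any entry whose braces never balance instead of aborting. The parser `parse_entries` and its regex are assumptions, not the script's code, and rescanning to the end of the input for a broken entry is slow on very large files.

```python
import re

# Matches the start of an entry, e.g. "@article{some-key,".
ENTRY_START = re.compile(r'@(\w+)\s*\{\s*([^,\s]+)\s*,')

def parse_entries(content):
    """Hypothetical parser: yield (type, key, body), skipping entries with unbalanced braces."""
    for match in ENTRY_START.finditer(content):
        depth, pos = 1, match.end()
        while pos < len(content) and depth > 0:
            if content[pos] == '{':
                depth += 1
            elif content[pos] == '}':
                depth -= 1
            pos += 1
        if depth != 0:
            continue  # Malformed entry: its closing brace was never found.
        yield match.group(1), match.group(2), content[match.end():pos - 1]

sample = "@article{a, title={Good}}\n@article{b, title={Broken\n@article{c, title={Fine}}"
print([key for _, key, _ in parse_entries(sample)])
# prints: ['a', 'c']
```
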
## Performance

- Efficient regex-based parsing
- Memory-efficient processing for large files
- Fast keyword matching using set operations

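One way set operations can speed up matching is to tokenize the text once and intersect it with the set of language names, rather than searching for each name in turn. The sketch below assumes single-word names only; the helper `languages_in` is hypothetical, and multi-word phrases such as "low resource" would need separate handling.

```python
import re

LANGUAGE_NAMES = frozenset(['hindi', 'chinese', 'korean', 'arabic'])  # illustrative subset

def languages_in(text):
    """Hypothetical helper: find single-word language names via set intersection."""
    words = set(re.findall(r'[a-z]+', text.lower()))
    return sorted(words & LANGUAGE_NAMES)

print(languages_in('Parsing Hindi and Korean news text'))
# prints: ['hindi', 'korean']
```
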
## Troubleshooting

### File Not Found
```
Error: anthology+abstracts.bib not found in current directory.
```
**Solution**: Ensure the input file is in the same directory as the script.

### No Papers Found
```
No multilingual papers found. Check your keywords and language lists.
```
**Solution**: Verify your BibTeX file contains papers with multilingual content, or adjust the keyword lists.

### Encoding Issues
The script uses UTF-8 encoding. If you encounter encoding errors, ensure your BibTeX file is saved as UTF-8.

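To check whether a file is valid UTF-8 before running the script, a small diagnostic like the following can help. It is a sketch independent of the script; the helper name `first_bad_byte` is an assumption.

```python
def first_bad_byte(data):
    """Return None if `data` decodes as UTF-8, else the offset of the first undecodable byte."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError as exc:
        return exc.start
    return None

print(first_bad_byte('café'.encode('utf-8')))    # prints: None
print(first_bad_byte('café'.encode('latin-1')))  # prints: 3
```

To test a real file, read it in binary mode and pass the bytes, for example `first_bad_byte(open('anthology+abstracts.bib', 'rb').read())`. A non-`None` result points to the byte where the file stops being valid UTF-8.
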
## Comparison with JavaScript Version

This Python script produces identical results to the JavaScript web application:
- Same filtering logic
- Same LaTeX cleaning
- Same BibTeX output format
- Same keyword detection

The main advantage is that it can be run independently without a web browser and provides detailed statistics about the filtering process.