	# Python Multilingual BibTeX Generator

	This Python script generates `multilingual_papers.bib` by filtering the original `anthology+abstracts.bib` file for multilingual NLP research papers.

	## Features

	- Identical Logic: Uses the same filtering logic as the JavaScript web application
	- Comprehensive Detection: Detects multilingual papers using keywords and language names
	- LaTeX Cleaning: Properly handles LaTeX commands and formatting
	- Statistics: Provides detailed statistics about the filtering process
	- Safe Operation: Checks for existing files and asks for confirmation before overwriting
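
The cleaning rules themselves are not listed in this README. As a rough illustration of what LaTeX cleaning of a BibTeX field can involve, here is a minimal sketch. `clean_latex` is a hypothetical helper and its rules are assumptions, not the script's actual implementation.

```python
import re

def clean_latex(text):
    """Hypothetical helper: strip common LaTeX markup from a BibTeX field value."""
    # Unwrap simple formatting commands: \textbf{word} or \emph{word} -> word
    text = re.sub(r"\\[a-zA-Z]+\{([^{}]*)\}", r"\1", text)
    # Drop accent commands such as \' or \" and keep the base letter
    # (assumption: losing the diacritic is acceptable for keyword matching)
    text = re.sub(r"\\[`'^\"~]", "", text)
    # Remove braces used to protect capitalization, e.g. {M}ultilingual
    text = text.replace("{", "").replace("}", "")
    # Collapse runs of whitespace left behind by the removals
    return re.sub(r"\s+", " ", text).strip()

print(clean_latex(r"{M}ultilingual \textbf{BERT} for {H}indi"))
```

Nested commands and commands without arguments are not handled here. A real cleaner would need more rules.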

	## Requirements

	- Python 3.6 or higher
	- No external dependencies (uses only standard library)

	## Usage

	1. Place your files: Ensure `anthology+abstracts.bib` is in the same directory as the script
	2. Run the script:
	```bash
	python generate_multilingual_bib.py
	```
	3. Follow prompts: The script will ask for confirmation if `multilingual_papers.bib` already exists

	## Output

	The script will:
	- Generate `multilingual_papers.bib` containing only multilingual papers
	- Display statistics about the filtering process
	- Show the top 10 most common keywords found
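
The statistics listed above could be computed along these lines. This is a sketch under assumptions: `summarize` and its `keywords_per_paper` argument are hypothetical names, and the line formats simply mirror the Example Output section of this README.

```python
from collections import Counter

def summarize(total_papers, keywords_per_paper, top_n=10):
    """Hypothetical helper: build the statistics lines printed after filtering.

    keywords_per_paper holds one set of matched keywords per multilingual paper.
    """
    found = len(keywords_per_paper)
    # Guard against division by zero when the input file has no entries
    percentage = 100.0 * found / total_papers if total_papers else 0.0
    # Each paper counts at most once per keyword because the inner items are sets
    counts = Counter(kw for kws in keywords_per_paper for kw in kws)
    lines = [
        f"Total papers processed: {total_papers}",
        f"Multilingual papers found: {found}",
        f"Percentage multilingual: {percentage:.1f}%",
        f"Top {top_n} keywords found:",
    ]
    lines += [f"{kw}: {n} papers" for kw, n in counts.most_common(top_n)]
    return lines

# Made-up matches for three papers out of a made-up total of 60
matches = [{"multilingual", "hindi"}, {"multilingual"}, {"korean"}]
print("\n".join(summarize(total_papers=60, keywords_per_paper=matches)))
```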

	## Example Output

	```
	Reading anthology+abstracts.bib...
	Parsing BibTeX entries...
	Found 50000 total papers
	Found 2500 multilingual papers
	Generating BibTeX content...
	Writing to multilingual_papers.bib...
	Successfully generated multilingual_papers.bib with 2500 papers!

	Statistics:
	Total papers processed: 50000
	Multilingual papers found: 2500
	Percentage multilingual: 5.0%

	Top 10 keywords found:
	multilingual: 1200 papers
	chinese: 800 papers
	crosslingual: 600 papers
	hindi: 400 papers
	low-resource: 350 papers
	korean: 300 papers
	arabic: 250 papers
	japanese: 200 papers
	spanish: 180 papers
	french: 150 papers
	```

	## Filtering Criteria

	The script uses the same criteria as the web application:

	### Multilingual Keywords
	- multilingual, crosslingual, multi-lingual, cross-lingual
	- low-resource language, low resource language
	- low-resource, low resource

	### Language Names
	- More than 100 language names, including Hindi, Chinese, Korean, Arabic, Spanish, French, German, and Japanese
	- Regional language variations and dialects
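
A minimal sketch of how such criteria can be applied to a paper follows. The lists are abbreviated, `find_matches` and `is_multilingual` are hypothetical names, and two details are assumptions rather than documented behavior: that the title and abstract are the fields searched, and that matching is case-insensitive on whole words.

```python
import re

# Abbreviated, illustrative lists; the script's real constants are longer
MULTILINGUAL_KEYWORDS = ["multilingual", "crosslingual", "cross-lingual", "low-resource"]
LANGUAGE_NAMES = ["hindi", "chinese", "korean", "arabic"]

def _compile(terms):
    # Longest terms first, so a longer phrase is preferred over its prefix
    ordered = sorted(terms, key=len, reverse=True)
    # \b keeps "chinese" from matching inside a longer word
    return re.compile(r"\b(?:" + "|".join(map(re.escape, ordered)) + r")\b", re.IGNORECASE)

PATTERN = _compile(MULTILINGUAL_KEYWORDS + LANGUAGE_NAMES)

def find_matches(title, abstract=""):
    """Return the set of lowercase keywords and language names found in the text."""
    text = f"{title} {abstract}"
    return {m.group(0).lower() for m in PATTERN.finditer(text)}

def is_multilingual(title, abstract=""):
    """A paper qualifies if at least one keyword or language name matches."""
    return bool(find_matches(title, abstract))

print(sorted(find_matches("Cross-Lingual Transfer for Hindi and Korean")))
print(is_multilingual("Parsing English newswire"))
```

Whole-word matching trades recall for precision: it avoids false hits on substrings but misses inflected forms unless they are listed explicitly.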

	## Customization

	You can modify the filtering criteria by editing the constants at the top of the script:

	```python
	MULTILINGUAL_KEYWORDS = [
	    'multilingual', 'crosslingual', 'multi lingual',
	    # Add your custom keywords here
	]

	LANGUAGE_NAMES = [
	    'afrikaans', 'albanian', 'amharic', 'arabic',
	    # Add more language names here
	]
	```
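
Note that the snippet above writes `'multi lingual'` with a space, while the Filtering Criteria section lists `multi-lingual` with a hyphen. This README does not say whether the script normalizes hyphens before matching. If it does not, each variant of a custom keyword has to be listed separately. The sketch below shows one way to make a single spelling cover both forms; `normalize`, `matches_custom`, and `CUSTOM_KEYWORDS` are hypothetical names, not part of the script.

```python
import re

def normalize(text):
    """Hypothetical normalizer: lowercase, and treat hyphens like spaces."""
    return re.sub(r"[-\s]+", " ", text.lower()).strip()

# Only the spaced form is listed; normalization covers the hyphenated form
CUSTOM_KEYWORDS = ["multi lingual", "code switching"]

def matches_custom(text):
    """Return the custom keywords that occur as whole words in the text."""
    normalized = normalize(text)
    return [
        kw for kw in CUSTOM_KEYWORDS
        if re.search(r"\b" + re.escape(normalize(kw)) + r"\b", normalized)
    ]

print(matches_custom("Multi-Lingual and Code-Switching: a survey"))
```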

	## Error Handling

	The script includes robust error handling:
	- Checks for input file existence
	- Handles malformed BibTeX entries gracefully
	- Provides clear error messages
	- Asks for confirmation before overwriting existing files
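
The file checks above can be sketched as follows. `preflight` and `confirm_overwrite` are hypothetical names, and the overwrite prompt and cancellation message are assumptions; only the "not found" message is taken from this README.

```python
import os

INPUT_FILE = "anthology+abstracts.bib"
OUTPUT_FILE = "multilingual_papers.bib"

def confirm_overwrite(path, ask=input):
    """Return True if writing to path is allowed (hypothetical helper)."""
    if not os.path.exists(path):
        return True
    # Assumed prompt wording; anything other than y/yes is treated as "no"
    answer = ask(f"{path} already exists. Overwrite? (y/N): ")
    return answer.strip().lower() in ("y", "yes")

def preflight(input_path=INPUT_FILE, output_path=OUTPUT_FILE, ask=input):
    """Return an error message, or None when the run can proceed."""
    if not os.path.isfile(input_path):
        return f"Error: {input_path} not found in current directory."
    if not confirm_overwrite(output_path, ask):
        return "Operation cancelled."  # assumed wording
    return None
```

Passing `ask` as a parameter keeps the prompt testable: a caller can supply `lambda _: "y"` instead of reading from the terminal.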

	## Performance

	- Efficient regex-based parsing
	- Memory-efficient processing for large files
	- Fast keyword matching using set operations
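
To illustrate what regex-based, memory-efficient parsing can look like, the sketch below streams a file one entry at a time and extracts fields with a regular expression. It is not the script's parser: `iter_entries` and `parse_entry` are hypothetical names, the sample entries are made up, and nested braces or escaped quotes inside values are deliberately not handled.

```python
import io
import re

ENTRY_HEADER = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")
# Fields written as name = "value" or name = {value}; nested braces are not handled
FIELD = re.compile(r'(\w+)\s*=\s*(?:"([^"]*)"|\{([^{}]*)\})', re.DOTALL)

def iter_entries(lines):
    """Yield one raw entry string at a time instead of loading the whole file."""
    buffer = []
    for line in lines:
        # A line starting with "@" begins a new entry and closes the previous one
        if line.lstrip().startswith("@") and buffer:
            yield "".join(buffer)
            buffer = []
        buffer.append(line)
    if buffer:
        yield "".join(buffer)

def parse_entry(raw):
    """Return a dict for one entry, or None if no entry header is found."""
    header = ENTRY_HEADER.search(raw)
    if header is None:
        return None  # malformed entry: the caller can skip it
    entry = {"type": header.group(1).lower(), "key": header.group(2)}
    for name, quoted, braced in FIELD.findall(raw[header.end():]):
        entry[name.lower()] = quoted or braced
    return entry

# Made-up sample entries standing in for an open file handle
sample = io.StringIO(
    '@inproceedings{doe-2020-hindi,\n'
    '    title = "{H}indi Parsing",\n'
    '    year = "2020",\n'
    '}\n'
    '@article{roe-2021-korean,\n'
    '    title = {Korean NER},\n'
    '}\n'
)
for raw in iter_entries(sample):
    print(parse_entry(raw))
```

With a real file, `iter_entries` can be given the handle returned by `open(path, encoding="utf-8")`, so only one entry is held in memory at a time.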

	## Troubleshooting

	### File Not Found
	```
	Error: anthology+abstracts.bib not found in current directory.
	```
	Solution: Place `anthology+abstracts.bib` in the same directory as the script, and run the script from that directory.

	### No Papers Found
	```
	No multilingual papers found. Check your keywords and language lists.
	```
	Solution: Verify that your BibTeX file contains papers with multilingual content, or adjust the `MULTILINGUAL_KEYWORDS` and `LANGUAGE_NAMES` lists described under Customization.

	### Encoding Issues
	The script reads and writes files as UTF-8. If you encounter encoding errors, make sure your BibTeX file is saved as UTF-8.
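
One way to check a file before running the script is sketched below. Both functions are hypothetical helpers, not part of the script, and the source encoding passed to `reencode_as_utf8` is something you have to know or guess for your own file.

```python
def first_invalid_utf8_offset(path):
    """Return the byte offset of the first invalid UTF-8 sequence, or None if valid."""
    with open(path, "rb") as handle:
        data = handle.read()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None

def reencode_as_utf8(src, dst, source_encoding):
    """Copy src to dst, decoding with source_encoding and writing UTF-8."""
    with open(src, "r", encoding=source_encoding) as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(line)
```

For example, if `first_invalid_utf8_offset("anthology+abstracts.bib")` returns an offset and the file is known to be Latin-1, `reencode_as_utf8("anthology+abstracts.bib", "fixed.bib", "latin-1")` writes a UTF-8 copy; `fixed.bib` is only an illustrative name.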

	## Comparison with JavaScript Version

	This Python script produces identical results to the JavaScript web application:
	- Same filtering logic
	- Same LaTeX cleaning
	- Same BibTeX output format
	- Same keyword detection

	The main advantage is that it can be run independently without a web browser and provides detailed statistics about the filtering process.