# Generate paper JSON files from a collection XML file, with full-text extraction
This is a slightly rearranged version of Sotaro Takeshita's code, which is available at https://github.com/gengo-proj/data-factory.
## Requirements
- Docker
- Python>=3.10
- Python packages:
  - acl-anthology-py>=0.4.3
  - bs4
  - jsonschema
## Setup
Start the Grobid Docker container:
```bash
docker run --rm --init --ulimit core=0 -p 8070:8070 lfoppiano/grobid:0.8.0
```
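Before running the crawler, it can help to confirm that the container is actually accepting requests. The following is a minimal sketch, not part of this repository: it assumes Grobid is reachable on `http://localhost:8070` (the port mapped above) and that it exposes its `/api/isalive` health-check endpoint. The function name `grobid_is_alive` is hypothetical.

```python
import urllib.request


def grobid_is_alive(base_url: str = "http://localhost:8070", timeout: float = 5.0) -> bool:
    """Return True if a Grobid server answers its health-check endpoint.

    Assumption: the server exposes GET <base_url>/api/isalive.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/api/isalive", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP error responses
        # (urllib's URLError and HTTPError are subclasses of OSError).
        return False
```

Calling `grobid_is_alive()` right after `docker run` may return `False` for a short while, since the service needs some time to load its models.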
Get the metadata from the ACL Anthology:
```bash
git clone git@github.com:acl-org/acl-anthology.git
```
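To sanity-check the clone, the collection files can be inspected with the standard library alone. This is a rough sketch under stated assumptions, not the crawler's own logic: it assumes the collection XML files live under `acl-anthology/data/xml/` and that each paper is stored as a `<paper>` element. The function name `count_papers` is hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def count_papers(xml_dir: Path) -> dict[str, int]:
    """Map each collection XML file name to the number of <paper> elements in it.

    Assumption: every *.xml file in xml_dir is one collection file.
    """
    counts = {}
    for xml_file in sorted(xml_dir.glob("*.xml")):
        root = ET.parse(xml_file).getroot()
        # iter() walks the whole tree, so papers nested inside volumes are found.
        counts[xml_file.name] = sum(1 for _ in root.iter("paper"))
    return counts
```

For example, `count_papers(Path("acl-anthology/data/xml"))` would give a per-collection paper count, provided the clone has that layout.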
## Usage
```bash
python src/data/acl_anthology_crawler.py \
--base-output-dir <path/to/save/raw-paper.json> \
--pdf-output-dir <path/to/save/downloaded/paper.pdf> \
--anthology-data-dir ./acl-anthology/data/
```