	# Generate paper JSON files from a collection XML file, with full-text extraction

	This is a slightly rearranged version of Sotaro Takeshita's code, which is available at https://github.com/gengo-proj/data-factory.

	## Requirements

	- Docker
	- Python>=3.10
	- Python packages:
	  - acl-anthology-py>=0.4.3
	  - bs4
	  - jsonschema
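A quick way to confirm the Python side of these requirements before running anything is sketched below. The import names (`acl_anthology`, `bs4`, `jsonschema`) are assumed from the package names above, and the helper functions are hypothetical, not part of this repository. The check only tests that the modules can be found; it does not verify the `>=0.4.3` version constraint.

```python
import importlib.util
import sys

# Import names are assumptions derived from the package list above:
# acl-anthology-py -> acl_anthology, bs4 -> bs4, jsonschema -> jsonschema.
REQUIRED_MODULES = ["acl_anthology", "bs4", "jsonschema"]


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_is_recent_enough(minimum=(3, 10)):
    """Compare the running interpreter against the minimum version."""
    return sys.version_info >= minimum


if __name__ == "__main__":
    if not python_is_recent_enough():
        print("Python >= 3.10 is required")
    for name in missing_modules(REQUIRED_MODULES):
        print(f"missing package providing module: {name}")
```

If the script prints nothing, the interpreter version is sufficient and all three modules are importable.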

	## Setup

	Start the Grobid Docker container, which serves the Grobid API on port 8070:

	```bash
	docker run --rm --init --ulimit core=0 -p 8070:8070 lfoppiano/grobid:0.8.0
	```
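The container can take a while to become ready. As a minimal sketch, the following polls Grobid's `/api/isalive` health endpoint, assuming the service is reachable at `http://localhost:8070` as in the `docker run` command above. The function names are hypothetical and not part of this repository.

```python
import http.client
import urllib.request


def isalive_url(base_url="http://localhost:8070"):
    """Build the health-check URL for a Grobid server."""
    return base_url.rstrip("/") + "/api/isalive"


def grobid_is_alive(base_url="http://localhost:8070", timeout=5.0):
    """Return True if the server answers the health check with 'true'."""
    try:
        with urllib.request.urlopen(isalive_url(base_url), timeout=timeout) as resp:
            return resp.status == 200 and resp.read().strip() == b"true"
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, or a malformed reply.
        return False


if __name__ == "__main__":
    print("Grobid is up" if grobid_is_alive() else "Grobid is not reachable")
```

Running this before the crawler avoids starting a long download job against a server that is still booting.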

	Get the metadata from the ACL Anthology by cloning its repository:

	```bash
	git clone git@github.com:acl-org/acl-anthology.git
	```

	## Usage

	```bash
	python src/data/acl_anthology_crawler.py \
	--base-output-dir <path/to/save/raw-paper.json> \
	--pdf-output-dir <path/to/save/downloaded/paper.pdf> \
	--anthology-data-dir ./acl-anthology/data/
	```
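This README does not describe how the crawler turns Grobid output into JSON. As a rough illustration only: Grobid's full-text endpoint (`/api/processFulltextDocument`) returns TEI XML, and the sketch below pulls a title and body paragraphs out of such a document. It uses the standard-library XML parser to stay self-contained, whereas the actual script presumably relies on `bs4` (an assumption based on the requirements list). The function name and the output keys `title` and `paragraphs` are hypothetical and do not reflect the real output schema.

```python
import json
import xml.etree.ElementTree as ET

# Namespace used by TEI documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def tei_to_record(tei_xml):
    """Extract the main title and body paragraphs from a TEI XML string."""
    root = ET.fromstring(tei_xml)

    title_el = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    title = "".join(title_el.itertext()).strip() if title_el is not None else ""

    body = root.find(".//tei:body", TEI_NS)
    paragraphs = []
    if body is not None:
        for p in body.findall(".//tei:p", TEI_NS):
            # Join nested inline elements and collapse whitespace.
            text = " ".join("".join(p.itertext()).split())
            if text:
                paragraphs.append(text)

    return {"title": title, "paragraphs": paragraphs}


# Hand-written toy input, not real Grobid output.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt>
    <title level="a" type="main">A Toy Paper</title>
  </titleStmt></fileDesc></teiHeader>
  <text><body>
    <div><head>Introduction</head><p>First paragraph.</p></div>
    <div><head>Method</head><p>Second <hi>styled</hi> paragraph.</p></div>
  </body></text>
</TEI>"""

if __name__ == "__main__":
    print(json.dumps(tei_to_record(SAMPLE_TEI), indent=2))
```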