|
--- |
|
language: |
|
- de |
|
library_name: fasttext |
|
license: other |
|
license_name: open-aleph-license |
|
license_link: LICENSE |
|
--- |
|
# Aleph-Alpha-GermanWeb-Grammar-Classifier-fastText |
|
|
|
Aleph-Alpha-GermanWeb-Grammar-Classifier-fastText is a model that was used in the creation of [Aleph-Alpha-GermanWeb](https://huggingface.co/datasets/Aleph-Alpha/Aleph-Alpha-GermanWeb), a new German-language dataset that combines heuristic and model-based filtering techniques with synthetic data generation to achieve state-of-the-art performance on German-language benchmarks.
|
|
|
Here we provide one of our quality classification models, a fastText model, along with inference code. This model is released as part of a [collection of four text quality classification models](https://huggingface.co/collections/Aleph-Alpha/aleph-alpha-germanweb-68010b712bf06d3479055d49). |
|
|
|
To train Aleph-Alpha-GermanWeb-Grammar-Classifier-fastText, we used [LanguageTool](https://dev.languagetool.org/http-server) to annotate a random subset of 400,000 German FineWeb2 documents with the DE_AGREEMENT rule, which identifies text passages with grammatical disagreement. To train our classifier, we randomly selected 75,000 documents without identified grammar mistakes as high-quality examples. As low-quality examples, we took 75,000 random documents containing at least one identified grammar error.
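The labeling step above can be sketched as a small preprocessing function that turns error-annotated documents into fastText supervised-training lines. This is a minimal illustration, not the actual training pipeline: the per-document error counts are assumed to come from LanguageTool's DE_AGREEMENT rule, and the `__label__low_quality` name is an assumption inferred from the `high_quality` label used in the inference snippet below.

```python
def to_fasttext_line(text: str, n_grammar_errors: int) -> str:
    """Format one document as a fastText supervised-training line.

    Documents with no identified grammar errors are labeled high quality;
    documents with at least one error are labeled low quality.
    NOTE: "__label__low_quality" is a hypothetical label name.
    """
    label = "high_quality" if n_grammar_errors == 0 else "low_quality"
    # fastText expects one document per line, so newlines must be removed.
    return f"__label__{label} " + text.replace("\n", " ")


# Toy annotated corpus: (document text, number of DE_AGREEMENT hits).
annotated = [
    ("Der Hund bellt laut.", 0),  # no agreement errors -> high quality
    ("Die Hund bellt laut.", 1),  # one agreement error -> low quality
]

with open("train.txt", "w", encoding="utf-8") as f:
    for text, n_errors in annotated:
        f.write(to_fasttext_line(text, n_errors) + "\n")
```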
|
|
|
We trained Aleph-Alpha-GermanWeb-Grammar-Classifier-fastText on 95% of the data to classify the high- and low-quality examples, and used the remaining 5% for validation, reaching a precision of 63% and a recall of 63% on the validation set.
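The precision and recall figures above follow the standard definitions over the held-out set. As a quick reference, a minimal sketch of how they could be computed from true and predicted labels (the toy labels here are illustrative, not validation data from the paper):

```python
def precision_recall(y_true, y_pred, positive="high_quality"):
    """Compute precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy example: one true positive, one false positive, one false negative.
y_true = ["high_quality", "high_quality", "low_quality", "low_quality"]
y_pred = ["high_quality", "low_quality", "high_quality", "low_quality"]
print(precision_recall(y_true, y_pred))  # (0.5, 0.5)
```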
|
|
|
Further details can be found in our [accompanying paper](https://arxiv.org/abs/2505.00022). |
|
|
|
## Example Snippet |
|
|
|
```python
import fasttext
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="Aleph-Alpha/Aleph-Alpha-GermanWeb-Grammar-Classifier-fastText",
    filename="model.bin",
)
model = fasttext.load_model(model_path)

text = "Das ist ein Beispieltext, um die Grammatik zu überprüfen."

# fastText expects a single line of text, so strip newlines first.
pre_processed_document = text.replace("\n", " ")

predicted_class, prob = model.predict(pre_processed_document)
predicted_label = predicted_class[0].replace("__label__", "")
document_score = prob[0]

# Similar to https://github.com/NVIDIA/NeMo-Curator/blob/31c8171434205e62f6a7d38565ffd9cb4c2806b7/nemo_curator/filters/classifier_filter.py#L47,
# the document score is the probability of the predicted class if the
# predicted label is "high_quality"; otherwise it is 1 minus that probability.
if predicted_label != "high_quality":
    document_score = 1 - document_score

print(predicted_label, document_score)
```
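The score-flipping logic at the end of the snippet can be factored into a standalone helper when scoring many documents at once. This is a sketch built on the snippet above; the 0.5 cutoff is an illustrative choice, not a threshold from the paper:

```python
def document_quality_score(predicted_label: str, prob: float) -> float:
    """Map a fastText prediction to a score in [0, 1], where higher
    means more likely high quality (mirrors the snippet above)."""
    return prob if predicted_label == "high_quality" else 1 - prob


# Toy batch of (predicted_label, probability) pairs, e.g. from model.predict.
predictions = [("high_quality", 0.9), ("low_quality", 0.8), ("low_quality", 0.4)]
scores = [document_quality_score(lbl, p) for lbl, p in predictions]

# Keep documents whose score clears the (hypothetical) cutoff.
kept = [s for s in scores if s >= 0.5]
print(kept)
```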
|
|