|
--- |
|
language: |
|
- de |
|
library_name: fasttext |
|
license: other |
|
license_name: open-aleph-license |
|
license_link: LICENSE |
|
--- |
|
# Aleph-Alpha-GermanWeb-Grammar-Classifier-fastText |
|
|
|
Aleph-Alpha-GermanWeb-Grammar-Classifier-fastText is a model that was used in the creation of [Aleph-Alpha-GermanWeb](https://huggingface.co/datasets/Aleph-Alpha/Aleph-Alpha-GermanWeb), a new German-language dataset that combines heuristic and model-based filtering techniques with synthetic data generation to achieve state-of-the-art performance on German-language benchmarks.
|
|
|
Here we provide one of our quality classification models, a fastText model, along with inference code. This model is released as part of a [collection of four text quality classification models](https://huggingface.co/collections/Aleph-Alpha/aleph-alpha-germanweb-68010b712bf06d3479055d49). |
|
|
|
To train Aleph-Alpha-GermanWeb-Grammar-Classifier-fastText, we used [LanguageTool](https://dev.languagetool.org/http-server) to annotate a random subset of 400,000 German FineWeb2 documents with the DE_AGREEMENT rule, which identifies text passages with grammatical disagreement. To train our classifier, we randomly selected 75,000 documents without identified grammar mistakes as high-quality examples. As low-quality examples, we took 75,000 random documents containing at least one identified grammar error.
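The labeling step above can be sketched as a small preprocessing function that turns error-annotated documents into fastText supervised-training lines. This is a minimal illustration, not the actual training pipeline: the per-document error counts are assumed to come from LanguageTool's DE_AGREEMENT rule, and the `__label__low_quality` name is an assumption inferred from the `high_quality` label used in the inference snippet below.

```python
def to_fasttext_line(text: str, n_grammar_errors: int) -> str:
    """Format one document as a fastText supervised-training line.

    Documents with no identified grammar errors are labeled high quality;
    documents with at least one error are labeled low quality.
    NOTE: "__label__low_quality" is a hypothetical label name.
    """
    label = "high_quality" if n_grammar_errors == 0 else "low_quality"
    # fastText expects one document per line, so newlines must be removed.
    return f"__label__{label} " + text.replace("\n", " ")


# Toy annotated corpus: (document text, number of DE_AGREEMENT hits).
annotated = [
    ("Der Hund bellt laut.", 0),  # no agreement errors -> high quality
    ("Die Hund bellt laut.", 1),  # one agreement error -> low quality
]

with open("train.txt", "w", encoding="utf-8") as f:
    for text, n_errors in annotated:
        f.write(to_fasttext_line(text, n_errors) + "\n")
```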
|
|
|
We trained Aleph-Alpha-GermanWeb-Grammar-Classifier-fastText on 95% of the data to classify the high- and low-quality examples, and used the remaining 5% for validation, reaching a precision of 63% and a recall of 63% on the validation set.
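The precision and recall figures above follow the standard definitions over the held-out set. As a quick reference, a minimal sketch of how they could be computed from true and predicted labels (the toy labels here are illustrative, not validation data from the paper):

```python
def precision_recall(y_true, y_pred, positive="high_quality"):
    """Compute precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy example: one true positive, one false positive, one false negative.
y_true = ["high_quality", "high_quality", "low_quality", "low_quality"]
y_pred = ["high_quality", "low_quality", "high_quality", "low_quality"]
print(precision_recall(y_true, y_pred))  # (0.5, 0.5)
```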
|
|
|
Further details can be found in our [accompanying paper](https://arxiv.org/abs/2505.00022). |
|
|
|
## Example Snippet |
|
|
|
```python
import fasttext
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="Aleph-Alpha/Aleph-Alpha-GermanWeb-Grammar-Classifier-fastText",
    filename="model.bin",
)
model = fasttext.load_model(model_path)

text = "Das ist ein Beispieltext, um die Grammatik zu überprüfen."

# fastText expects a single line of text, so strip newlines first.
pre_processed_document = text.replace("\n", " ")

predicted_class, prob = model.predict(pre_processed_document)
predicted_label = predicted_class[0].replace("__label__", "")
document_score = prob[0]

# Similar to https://github.com/NVIDIA/NeMo-Curator/blob/31c8171434205e62f6a7d38565ffd9cb4c2806b7/nemo_curator/filters/classifier_filter.py#L47,
# the document score is the probability of the predicted class if the
# predicted label is "high_quality"; otherwise it is 1 minus that probability.
if predicted_label != "high_quality":
    document_score = 1 - document_score

print(predicted_label, document_score)
```
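The score-flipping logic at the end of the snippet can be factored into a standalone helper when scoring many documents at once. This is a sketch built on the snippet above; the 0.5 cutoff is an illustrative choice, not a threshold from the paper:

```python
def document_quality_score(predicted_label: str, prob: float) -> float:
    """Map a fastText prediction to a score in [0, 1], where higher
    means more likely high quality (mirrors the snippet above)."""
    return prob if predicted_label == "high_quality" else 1 - prob


# Toy batch of (predicted_label, probability) pairs, e.g. from model.predict.
predictions = [("high_quality", 0.9), ("low_quality", 0.8), ("low_quality", 0.4)]
scores = [document_quality_score(lbl, p) for lbl, p in predictions]

# Keep documents whose score clears the (hypothetical) cutoff.
kept = [s for s in scores if s >= 0.5]
print(kept)
```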
|
|