---
language:
- de
base_model:
- google-bert/bert-base-uncased
license: other
license_name: open-aleph-license
license_link: LICENSE
---
# Aleph-Alpha-GermanWeb-Grammar-Classifier-BERT

Aleph-Alpha-GermanWeb-Grammar-Classifier-BERT is a model that was used in the creation of [Aleph-Alpha-GermanWeb](https://huggingface.co/datasets/Aleph-Alpha/Aleph-Alpha-GermanWeb), a new German-language dataset that combines heuristic and model-based filtering techniques with synthetic data generation to achieve state-of-the-art (SOTA) performance on German-language benchmarks.

Here we provide one of our quality classification models, based on a BERT backbone, along with inference code. This model is released as part of a [collection of four text quality classification models](https://huggingface.co/collections/Aleph-Alpha/aleph-alpha-germanweb-68010b712bf06d3479055d49).

To train Aleph-Alpha-GermanWeb-Grammar-Classifier-BERT, we used [LanguageTool](https://dev.languagetool.org/http-server) to annotate a random subset of 400,000 German FineWeb2 documents with the DE_AGREEMENT rule, which flags text passages with grammatical disagreement. As high-quality examples, we randomly selected 75,000 documents without any identified grammar mistakes; as low-quality examples, we took 75,000 random documents containing at least one identified grammar error.
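As an illustration of this annotation step, here is a minimal sketch of querying a locally running LanguageTool HTTP server for the DE_AGREEMENT rule. The endpoint URL, port, and the `has_agreement_error` helper are assumptions for this example, not part of our pipeline.

```python
import requests

def has_agreement_error(text: str, endpoint: str = "http://localhost:8081/v2/check") -> bool:
    """Check a German text for DE_AGREEMENT matches via a local LanguageTool server.

    Assumes a LanguageTool HTTP server is running locally; the helper name
    and endpoint are illustrative, not part of our pipeline.
    """
    response = requests.post(
        endpoint,
        data={
            "text": text,
            "language": "de-DE",
            "enabledRules": "DE_AGREEMENT",  # restrict checking to this rule
            "enabledOnly": "true",
        },
        timeout=30,
    )
    response.raise_for_status()
    # Any returned match indicates at least one agreement error
    return len(response.json()["matches"]) > 0

# Example: the singular verb does not agree with the plural subject
print(has_agreement_error("Die Kinder spielt im Garten."))  # expected: True
```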

We trained Aleph-Alpha-GermanWeb-Grammar-Classifier-BERT on 95% of this data to classify the high- and low-quality examples, and used the remaining 5% for validation, reaching a precision of 67% and a recall of 66% on the validation set.
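For reference, a minimal sketch of such a 95/5 split and the validation metrics, using scikit-learn; the `predict_fn` interface is an assumption for illustration and does not reproduce our training code.

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

def evaluate_with_holdout(texts, labels, predict_fn):
    """Hold out 5% of the labelled documents and report precision/recall.

    `predict_fn` maps a list of texts to 0/1 predictions (e.g. the BERT
    classifier below); it is an assumed interface, not our training code.
    """
    train_texts, val_texts, train_labels, val_labels = train_test_split(
        texts, labels, test_size=0.05, stratify=labels, random_state=0
    )
    # ... fine-tune the classifier on train_texts / train_labels here ...
    val_preds = predict_fn(val_texts)
    precision = precision_score(val_labels, val_preds)  # reported: 67%
    recall = recall_score(val_labels, val_preds)        # reported: 66%
    return precision, recall
```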

Further details can be found in our [accompanying paper](https://arxiv.org/abs/2505.00022).

## Example Snippet

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

# Load the fine-tuned quality classifier (two labels: low vs. high quality)
model = BertForSequenceClassification.from_pretrained(
    "Aleph-Alpha/Aleph-Alpha-GermanWeb-Grammar-Classifier-BERT", num_labels=2
).to(device)
model.eval()  # disable dropout for deterministic inference

# Tokenizer of the bert-base-uncased backbone the classifier was trained on
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Disclaimer: short texts like this are not in the model's training distribution
text = 'Das ist ein Beispieltext, um die Grammatik zu überprüfen.'

target_names = ['Low Quality', 'High Quality']

with torch.no_grad():
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True).to(device)
    logits = model(**inputs).logits
    prediction = torch.argmax(logits, dim=-1).item()

print(target_names[prediction])
```
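
To use the classifier for filtering at scale, one can score documents in batches and keep those predicted as high quality. The following is a minimal sketch building on the snippet above; the `filter_high_quality` helper, batch size, and probability threshold are assumptions for illustration, not values from the paper.

```python
def filter_high_quality(texts, batch_size=32, threshold=0.5):
    """Keep documents the classifier assigns to the 'High Quality' class.

    Builds on `model`, `tokenizer`, and `device` from the snippet above;
    the helper name, batch size, and threshold are illustrative choices.
    """
    kept = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        inputs = tokenizer(batch, return_tensors='pt', truncation=True, padding=True).to(device)
        with torch.no_grad():
            probs = torch.softmax(model(**inputs).logits, dim=-1)
        # Column 1 corresponds to 'High Quality' in target_names above
        for doc, p_high in zip(batch, probs[:, 1].tolist()):
            if p_high >= threshold:
                kept.append(doc)
    return kept
```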