---
tags:
- masked-auto-encoding
- generated_from_trainer
model-index:
- name: pixel-base-german
results: []
paper: https://aclanthology.org/2025.coling-main.427/
license: apache-2.0
language:
- de
---
# PIXEL-base-german
`pixel-base-german` is a [PIXEL model](https://arxiv.org/abs/2207.06991) trained on the [German DBMDZ BERT Corpus](https://huggingface.co/datasets/stefan-it/german-dbmdz-bert-corpus).
We trained the model using the architecture and [codebase](https://github.com/xplip/pixel) proposed by Rust et al. (2023) in [Language Modelling with Pixels](https://arxiv.org/abs/2207.06991).
This German model was introduced and evaluated in the paper [Evaluating Pixel Language Models on Non-Standardized Languages](https://aclanthology.org/2025.coling-main.427/), presented at COLING 2025.
## Model description
*Description from [https://huggingface.co/Team-PIXEL/pixel-base](https://huggingface.co/Team-PIXEL/pixel-base)*
PIXEL consists of three major components: a text renderer, which draws text as an image; an encoder, which encodes the unmasked regions of the rendered image; and a decoder, which reconstructs the masked regions at the pixel level. It is built on ViT-MAE.
During pretraining, the renderer produces images containing the training sentences. Patches of these images are linearly projected to obtain patch embeddings (as opposed to using an embedding matrix as in, e.g., BERT), and 25% of the patches are masked out. The encoder, which is a Vision Transformer (ViT), then only processes the unmasked patches. The lightweight decoder with hidden size 512 and 8 transformer layers inserts learnable mask tokens into the encoder's output sequence and learns to reconstruct the raw pixel values at the masked positions.
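As a concrete illustration of the embedding-and-masking step, here is a minimal sketch (ours, not taken from the PIXEL codebase; all shapes are illustrative, and PIXEL itself uses structured span masking rather than the uniform random masking shown here):

```python
import torch

# Illustrative shapes: assume 529 patches of 16x16 RGB pixels, hidden size 768.
num_patches, patch_dim, hidden_size = 529, 16 * 16 * 3, 768

patches = torch.randn(1, num_patches, patch_dim)      # flattened image patches
projection = torch.nn.Linear(patch_dim, hidden_size)  # patch embedding, no vocabulary matrix
embeddings = projection(patches)

# Mask out 25% of the patches; the ViT encoder only sees the rest.
num_masked = int(0.25 * num_patches)
keep = torch.randperm(num_patches)[num_masked:]
visible = embeddings[:, keep, :]                      # encoder input
print(visible.shape)                                  # torch.Size([1, 397, 768])
```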
After pretraining, the decoder can be discarded, leaving an 86M-parameter encoder upon which task-specific classification heads can be stacked. Alternatively, the decoder can be retained and PIXEL can be used as a pixel-level generative language model (see Figures 3 and 6 in the paper for examples).
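To make the head-stacking pattern concrete, a hedged sketch (the class and pooling choice here are hypothetical; the encoder's forward signature is assumed to be ViT-like):

```python
import torch

class EncoderWithClassificationHead(torch.nn.Module):
    """Illustrative only: a linear task head on top of the retained encoder."""

    def __init__(self, encoder, hidden_size=768, num_labels=2):
        super().__init__()
        self.encoder = encoder  # the 86M-parameter encoder kept after pretraining
        self.head = torch.nn.Linear(hidden_size, num_labels)

    def forward(self, pixel_values):
        outputs = self.encoder(pixel_values=pixel_values)
        # Mean-pool the patch representations (one simple pooling choice).
        pooled = outputs.last_hidden_state.mean(dim=1)
        return self.head(pooled)
```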
For more details on how PIXEL works, please check the paper and the codebase linked above.
## Training and evaluation data
This model was trained on rendered German text from the [German DBMDZ BERT Corpus](https://huggingface.co/datasets/stefan-it/german-dbmdz-bert-corpus).
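If you want to inspect the pretraining corpus, it should load with the standard `datasets` API (a sketch; the `train` split name is an assumption):

```python
from datasets import load_dataset

# Assumes the corpus exposes a standard "train" split of raw German text.
corpus = load_dataset("stefan-it/german-dbmdz-bert-corpus", split="train", streaming=True)
print(next(iter(corpus)))
```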
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.00015
- train_batch_size: 256
- eval_batch_size: 32
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.05
- training_steps: 1500000
- mixed_precision_training: Apex, opt level O1
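For reference, these settings correspond roughly to the following optimizer and scheduler setup (a sketch using standard `transformers` helpers, not the exact pretraining script; `model` is assumed to be the PIXEL model being pretrained):

```python
import torch
from transformers import get_cosine_schedule_with_warmup

training_steps = 1_500_000
warmup_ratio = 0.05

# `model` is assumed to be defined (the PIXEL model being pretrained).
optimizer = torch.optim.Adam(model.parameters(), lr=1.5e-4, betas=(0.9, 0.999), eps=1e-8)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(warmup_ratio * training_steps),
    num_training_steps=training_steps,
)
```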
## How to use
Before using this model, you need to clone and configure the [codebase](https://github.com/xplip/pixel).
### Setup
*Instructions from [https://github.com/xplip/pixel](https://github.com/xplip/pixel)*
This codebase is built on [Transformers](https://github.com/huggingface/transformers) for PyTorch. We also took inspiration from the original [ViT-MAE codebase](https://github.com/facebookresearch/mae). The default font `GoNotoCurrent.ttf` that we used for all experiments is a merged Noto font built with [go-noto-universal](https://github.com/satbyy/go-noto-universal).
You can set up this codebase as follows to get started with using PIXEL models:
<details>
<summary><i>Show Instructions</i></summary>
1. Clone repo and initialize submodules
```bash
git clone https://github.com/xplip/pixel.git
cd pixel
git submodule update --init --recursive
```
2. Create a fresh conda environment
```bash
conda create -n pixel-env python=3.9
conda activate pixel-env
```
3. Install Python packages
```bash
conda install pytorch torchvision cudatoolkit=11.3 -c pytorch
conda install -c conda-forge pycairo pygobject manimpango
pip install --upgrade pip
pip install -r requirements.txt
pip install ./datasets
pip install -e .
```
4. (Optional) Install Nvidia Apex
```bash
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```
5. Verify the installation on Vietnamese POS tagging
```bash
# Create a folder in which we keep the data
mkdir -p data
# Get and extract the UD data for parsing and POS tagging
wget -qO- https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-4758/ud-treebanks-v2.10.tgz | tar xvz -C data
python scripts/training/run_pos.py \
--model_name_or_path="Team-PIXEL/pixel-base-finetuned-pos-ud-vietnamese-vtb" \
--data_dir="data/ud-treebanks-v2.10/UD_Vietnamese-VTB" \
--remove_unused_columns=False \
--output_dir="sanity_check" \
--do_eval \
--max_seq_length=256 \
--overwrite_cache
```
If everything is configured correctly, you should expect to see results similar to the following:
```bash
***** eval metrics *****
eval_accuracy = 0.8632
eval_loss = 1.2375
```
</details>
### Loading our model
Once everything is set up properly, `pixel-base-german` can be used with the codebase. To load the model:
```python
from pixel import PIXELConfig, PIXELForPreTraining
config = PIXELConfig.from_pretrained("amunozo/pixel-base-german")
model = PIXELForPreTraining.from_pretrained("amunozo/pixel-base-german", config=config)
```
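For a quick smoke test, a hedged inference sketch (assuming `PIXELForPreTraining` follows the ViT-MAE forward convention it is built on, where the reconstruction loss is computed directly from `pixel_values`; the rendered-image size below is illustrative, and in practice you would produce `pixel_values` with the codebase's text renderer):

```python
import torch

# Dummy input: one rendered "image" of text, height 16, width 8464
# (i.e. 529 patches of 16x16). These dimensions are illustrative.
pixel_values = torch.randn(1, 3, 16, 8464)

with torch.no_grad():
    outputs = model(pixel_values=pixel_values)
print(outputs.loss)  # self-supervised reconstruction loss, as in ViT-MAE
```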
## Framework versions
- Transformers 4.17.0
- PyTorch 2.0.1+cu117
- Datasets 2.14.5
- Tokenizers 0.13.3
## Citation
```bibtex
@inproceedings{munoz-ortiz-etal-2025-evaluating,
title = "Evaluating Pixel Language Models on Non-Standardized Languages",
author = "Mu{\~n}oz-Ortiz, Alberto and Blaschke, Verena and Plank, Barbara",
editor = "Rambow, Owen and Wanner, Leo and Apidianaki, Marianna and Al-Khalifa, Hend and Eugenio, Barbara Di and Schockaert, Steven",
booktitle = "Proceedings of the 31st International Conference on Computational Linguistics",
month = jan,
year = "2025",
address = "Abu Dhabi, UAE",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.coling-main.427/",
pages = "6412--6419",
}
```
## Acknowledgements
This work was funded by the European Research Council (ERC) Consolidator Grant DIALECT 101043235; SCANNER-UDC (PID2020-113230RB-C21) funded by MICIU/AEI/10.13039/501100011033; Xunta de Galicia (ED431C 2024/02); GAP (PID2022-139308OA-I00) funded by MICIU/AEI/10.13039/501100011033/ and by ERDF, EU; Grant PRE2021-097001 funded by MICIU/AEI/10.13039/501100011033 and by ESF+ (predoctoral training grant associated to project PID2020-113230RB-C21); LATCHING (PID2023-147129OB-C21) funded by MICIU/AEI/10.13039/501100011033 and ERDF; and Centro de Investigación de Galicia “CITIC”, funded by the Xunta de Galicia through the collaboration agreement between the Consellería de Cultura, Educación, Formación Profesional e Universidades and the Galician universities for the reinforcement of the research centres of the Galician University System (CIGUS).