Update README.md
README.md
CHANGED
@@ -6,3 +6,28 @@ language:
widget:
- text: "Hinder s'Hans-Heiris Huus hani hundert Hase ghöre hueschte."
---
This is the [**google/canine-s**](https://huggingface.co/google/canine-s) model ([Clark et al., TACL 2022](https://aclanthology.org/2022.tacl-1.5/)), adapted to Swiss German text via continued pre-training.

## Training Objective
We continued pre-training with the CANINE-S objective (a subword prediction loss), using the subword vocabulary of [SwissBERT](https://huggingface.co/ZurichNLP/swissbert).
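
As background, CANINE reads text as a sequence of Unicode code points rather than subword IDs. As we understand the CANINE paper, the subword vocabulary only supplies prediction targets during pre-training. The sketch below is a pure-Python stand-in for that input encoding, not the tokenizer itself. The helper names `encode_codepoints` and `decode_codepoints` are hypothetical, and the special code points `0xE000` (CLS) and `0xE001` (SEP) are assumed to follow the convention of the Hugging Face `CanineTokenizer`.

```python
# Illustrative sketch of CANINE-style character-level input encoding.
# Assumption: CLS/SEP live in the Unicode private-use area, as in the
# Hugging Face CanineTokenizer; the function names are hypothetical.
CLS = 0xE000
SEP = 0xE001


def encode_codepoints(text: str) -> list[int]:
    """Map each character to its Unicode code point and wrap with CLS/SEP."""
    return [CLS] + [ord(ch) for ch in text] + [SEP]


def decode_codepoints(ids: list[int]) -> str:
    """Invert encode_codepoints, dropping the two special code points."""
    return "".join(chr(i) for i in ids if i not in (CLS, SEP))


sentence = "Hinder s'Hans-Heiris Huus hani hundert Hase ghöre hueschte."
ids = encode_codepoints(sentence)

# One ID per character plus the two special symbols; no vocabulary lookup.
print(len(sentence), len(ids))
print(ids[:5])
```

Because every character maps to its own code point, dialect spellings such as "ghöre" never fall outside a vocabulary at input time; "ö" is simply code point 246.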

## Training Data
For continued pre-training, we used the following two datasets of written Swiss German:
1. [SwissCrawl](https://icosys.ch/swisscrawl) ([Linder et al., LREC 2020](https://aclanthology.org/2020.lrec-1.329)), a collection of Swiss German web text (forum discussions, social media).
2. A custom dataset of Swiss German tweets.

In addition, we trained the model on an equal amount of Standard German data, drawn from news articles retrieved from [Swissdox@LiRI](https://t.uzh.ch/1hI).

## License
Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).

## Citation
```bibtex
@inproceedings{vamvas-etal-2024-modular,
      title={Modular Adaptation of Multilingual Encoders to Written Swiss German Dialect},
      author={Jannis Vamvas and No{\"e}mi Aepli and Rico Sennrich},
      booktitle={First Workshop on Modular and Open Multilingual NLP},
      year={2024},
}
```