|
---
|
language: fr |
|
license: apache-2.0 |
|
tags: |
|
- masked-lm |
|
- camembert |
|
- transformers |
|
- tf |
|
- french |
|
- fill-mask |
|
--- |
|
|
|
# CamemBERT MLM - Fine-tuned Model |
|
|
|
This is a TensorFlow masked language model (MLM) built on the [camembert-base](https://huggingface.co/camembert-base) checkpoint, a RoBERTa-like model trained on French text.
|
|
|
## Model description |
|
|
|
This model uses the CamemBERT architecture, a RoBERTa-based transformer trained on large-scale French corpora (e.g., OSCAR, CCNet). It is designed for masked language modeling (MLM) tasks.
|
|
|
It was loaded and saved using the `transformers` library in TensorFlow (`TFAutoModelForMaskedLM`). It can be used for fill-in-the-blank tasks in French. |
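The load-and-save step described above can be sketched as follows (the local path `./my-dummy-model` is illustrative, not the actual directory used):

```python
from transformers import TFAutoModelForMaskedLM, AutoTokenizer

# Load the TensorFlow MLM head on top of the camembert-base encoder
model = TFAutoModelForMaskedLM.from_pretrained("camembert-base")
tokenizer = AutoTokenizer.from_pretrained("camembert-base")

# Save both model weights and tokenizer files to a local directory
model.save_pretrained("./my-dummy-model")
tokenizer.save_pretrained("./my-dummy-model")
```

The saved directory can then be pushed to the Hub or reloaded later with the same `from_pretrained` calls.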
|
|
|
## Intended uses & limitations |
|
|
|
### Intended uses |
|
- Fill-mask predictions in French |
|
- Feature extraction for NLP tasks |
|
- Fine-tuning on downstream tasks like text classification, NER, etc. |
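As an illustration of the feature-extraction use case, the encoder's last hidden states can serve as sentence features. This minimal sketch uses `camembert-base` directly and mean-pools token embeddings; the pooling choice is an assumption, not part of this model card:

```python
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("camembert-base")
model = TFAutoModel.from_pretrained("camembert-base")

inputs = tokenizer("Le chat dort sur le canapé.", return_tensors="tf")
outputs = model(**inputs)

# Token-level features: (batch_size, sequence_length, hidden_size)
features = outputs.last_hidden_state

# Mean-pool over tokens to get one fixed-size sentence embedding
sentence_embedding = tf.reduce_mean(features, axis=1)
print(sentence_embedding.shape)  # (1, 768) for camembert-base
```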
|
|
|
### Limitations |
|
- Works best with French text |
|
- May not generalize well to other languages |
|
- Not suitable for generative tasks such as translation or free-form text generation
|
|
|
## How to use |
|
|
|
```python
from transformers import TFAutoModelForMaskedLM, AutoTokenizer
import tensorflow as tf

model = TFAutoModelForMaskedLM.from_pretrained("Mhammad2023/my-dummy-model")
tokenizer = AutoTokenizer.from_pretrained("Mhammad2023/my-dummy-model")

# CamemBERT uses <mask> rather than [MASK]; use tokenizer.mask_token to be safe
text = f"J'aime le {tokenizer.mask_token} rouge."
inputs = tokenizer(text, return_tensors="tf")
outputs = model(**inputs)
logits = outputs.logits

# Find the position of the mask token in the input sequence
mask_index = int(tf.where(inputs["input_ids"][0] == tokenizer.mask_token_id)[0, 0])

# Pick the highest-scoring token at that position and decode it
predicted_token_id = int(tf.argmax(logits[0, mask_index]))
predicted_token = tokenizer.decode([predicted_token_id])

print(f"Predicted word: {predicted_token}")
```
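Alternatively, the `pipeline` API handles mask-token insertion, softmax scoring, and decoding automatically. A short sketch (`top_k` controls how many candidate words are returned):

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="Mhammad2023/my-dummy-model", framework="tf")

# The pipeline's tokenizer knows the model's own mask token (<mask> for CamemBERT)
results = fill_mask(f"J'aime le {fill_mask.tokenizer.mask_token} rouge.", top_k=3)
for r in results:
    print(r["token_str"], round(r["score"], 3))
```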
|
|
|
## Limitations and bias

This model inherits the limitations and biases of the camembert-base checkpoint, including:

- Potential biases from the training data (e.g., internet corpora)
- Inappropriate predictions for sensitive topics

Use with caution in production or sensitive applications.
|
|
|
## Training data

The model was not further fine-tuned; it is based directly on camembert-base, whose training data includes:

- OSCAR (Open Super-large Crawled ALMAnaCH coRpus)
- CCNet (a filtered, deduplicated subset of Common Crawl)
|
|
|
## Training procedure

No additional training was applied for this version. You can load the model and fine-tune it on your own task using the Keras API (for the TensorFlow weights) or the `Trainer` API (for the PyTorch weights).
|
|
|
## Evaluation results |
|
This version has not been evaluated on downstream tasks. For evaluation metrics and benchmarks, refer to the original camembert-base model card. |