---
language: fr
license: apache-2.0
tags:
  - masked-lm
  - camembert
  - transformers
  - tf
  - french
  - fill-mask
---

# My Dummy Model: CamemBERT MLM

This is a TensorFlow masked language model (MLM) built from the [camembert-base](https://huggingface.co/camembert-base) checkpoint, a RoBERTa-like model trained on French text.

## Model description

This model uses the CamemBERT architecture, a RoBERTa-based transformer trained on large-scale French corpora (e.g., OSCAR, CCNet), and is designed for masked language modeling.

It was loaded and saved using the `transformers` library in TensorFlow (`TFAutoModelForMaskedLM`). It can be used for fill-in-the-blank tasks in French.
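
As an illustration, the repository was plausibly produced along these lines (a sketch, not a recorded command history; the local path is hypothetical):

```python
from transformers import TFAutoModelForMaskedLM, AutoTokenizer

# Load the base checkpoint in TensorFlow, then save it locally.
model = TFAutoModelForMaskedLM.from_pretrained("camembert-base")
tokenizer = AutoTokenizer.from_pretrained("camembert-base")

model.save_pretrained("my-dummy-model")   # hypothetical local path
tokenizer.save_pretrained("my-dummy-model")
```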

## Intended uses & limitations

### Intended uses
- Fill-mask predictions in French
- Feature extraction for NLP tasks (see the sketch after this list)
- Fine-tuning on downstream tasks like text classification, NER, etc.
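
For the feature-extraction use above, a minimal sketch (the mean-pooling step is an assumption, not something this repo prescribes):

```python
from transformers import TFAutoModel, AutoTokenizer
import tensorflow as tf

# TFAutoModel exposes the encoder's hidden states without the MLM head;
# loading it from an MLM checkpoint simply drops the head weights.
encoder = TFAutoModel.from_pretrained("Mhammad2023/my-dummy-model")
tokenizer = AutoTokenizer.from_pretrained("Mhammad2023/my-dummy-model")

inputs = tokenizer("Le fromage est délicieux.", return_tensors="tf")
hidden = encoder(**inputs).last_hidden_state         # (1, seq_len, 768)

# Mean-pool token vectors into one fixed-size sentence embedding.
sentence_embedding = tf.reduce_mean(hidden, axis=1)  # (1, 768)
print(sentence_embedding.shape)
```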

### Limitations
- Works best with French text
- May not generalize well to other languages
- Cannot be used for generative tasks (e.g., translation, text generation)

## How to use

```python
from transformers import TFAutoModelForMaskedLM, AutoTokenizer
import tensorflow as tf

model = TFAutoModelForMaskedLM.from_pretrained("Mhammad2023/my-dummy-model")
tokenizer = AutoTokenizer.from_pretrained("Mhammad2023/my-dummy-model")

inputs = tokenizer(f"J'aime le {tokenizer.mask_token} rouge.", return_tensors="tf")
outputs = model(**inputs)
logits = outputs.logits

# CamemBERT's mask token is "<mask>", not "[MASK]"; find its position by id.
mask_index = tf.where(inputs["input_ids"][0] == tokenizer.mask_token_id)[0, 0]
predicted_token_id = int(tf.argmax(logits[0, mask_index]))
predicted_token = tokenizer.decode([predicted_token_id])

print(f"Predicted word: {predicted_token}")
```
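
For quick experiments, the `fill-mask` pipeline wraps the same steps in one call (the `top_k` value here is just an example):

```python
from transformers import pipeline

# The pipeline handles tokenization, inference, and decoding.
fill_mask = pipeline("fill-mask", model="Mhammad2023/my-dummy-model")

# CamemBERT's mask token is "<mask>".
for prediction in fill_mask("J'aime le <mask> rouge.", top_k=3):
    print(prediction["token_str"], round(prediction["score"], 3))
```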

## Limitations and bias
This model inherits the limitations and biases of the camembert-base checkpoint, including:

- Potential biases from the training data (e.g., internet corpora)
- Inappropriate predictions for sensitive topics

Use with caution in production or sensitive applications.

## Training data
The model was not further fine-tuned; it is based directly on camembert-base, which was trained on:

- OSCAR (Open Super-large Crawled ALMAnaCH coRpus)
- CCNet (a filtered extraction of Common Crawl)

## Training procedure
No additional training was applied for this version. You can load it and fine-tune it on your own task with the PyTorch `Trainer` API or the Keras `fit` API in TensorFlow (see the sketch below).
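
As a sketch of the Keras route (the dataset, labels, and hyperparameters below are placeholders, not values used for this checkpoint):

```python
from transformers import TFAutoModelForSequenceClassification, AutoTokenizer
import tensorflow as tf

# Hypothetical fine-tuning for binary French text classification;
# the classification head is freshly initialized on top of the encoder.
model = TFAutoModelForSequenceClassification.from_pretrained(
    "Mhammad2023/my-dummy-model", num_labels=2
)
tokenizer = AutoTokenizer.from_pretrained("Mhammad2023/my-dummy-model")

# Placeholder data; substitute a real labeled dataset.
texts = ["Très bon produit.", "Service décevant."]
labels = [1, 0]

encodings = tokenizer(texts, padding=True, truncation=True, return_tensors="tf")
dataset = tf.data.Dataset.from_tensor_slices((dict(encodings), labels)).batch(2)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
model.fit(dataset, epochs=1)
```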

## Evaluation results
This version has not been evaluated on downstream tasks. For evaluation metrics and benchmarks, refer to the original camembert-base model card.