---
license: cc-by-nc-4.0
language:
  - pt
tags:
  - ai-detection
  - text-classification
  - portuguese
  - bert
  - transformers
base_model: neuralmind/bert-base-portuguese-cased
pipeline_tag: text-classification
datasets:
  - wiki40b-pt                  # from consolidated human sources
  - oscar-pt                    # from consolidated human sources
  - cc100-pt                    # from consolidated human sources
  - europarl-pt                 # from consolidated human sources
  - opus-books-pt               # from consolidated human sources
  - Detecting-ai/ai_pt_corpus   # AI-generated corpus
model_type: bert
---

# 🇧🇷 pt-ai-detector

**pt-ai-detector** is a BERT-base model fine-tuned to classify whether a Portuguese sentence or paragraph was written by a *human* (`label = 0`) or generated by *AI* (`label = 1`).

| Metric                | Value                               |
| --------------------- | ----------------------------------- |
| **Train data**        | 1 000 000 human + 1 000 000 AI      |
| **Balanced test set** | 1 954 190 examples (½ human, ½ AI)  |
| **Accuracy**          | ≈ 99 %                              |
| **F1 (macro)**        | ≈ 0.99                              |
| **Model size**        | ≈ 110 M parameters (≈ 430 MB)       |

---

## 📖 Quick usage

**Try it live** at [detecting-ai.com](https://detecting-ai.com/pt) – our team at Detecting-ai built this model and demo so you can instantly test any Portuguese text online.

```python
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="Detecting-ai/pt-ai-detector",
    tokenizer="Detecting-ai/pt-ai-detector",
    device=0,  # set device=-1 for CPU
)

text = "A inteligência artificial está transformando a educação."
print(clf(text))
# → [{'label': 'AI', 'score': 0.987}]
```

| id | label |
| -- | ----- |
| 0  | Human |
| 1  | AI    |

---

## 🏋️‍♂️ Training details

* **Base model:** `neuralmind/bert-base-portuguese-cased`
* **Epochs:** 3 (fp16 on 1 × A100)
* **Batch size:** 32
* **Optimizer/LR:** AdamW, 2 × 10⁻⁵
* **Loss:** Cross-entropy

### Data sources

| Corpus                                                                    | Lines used  | Notes |
| ------------------------------------------------------------------------- | ----------- | ----- |
| Human corpus (wiki40b-pt, oscar-pt, cc100-pt, europarl-pt, opus-books-pt)  | 1 M sampled | Diverse Portuguese web, news, and book text |
| **AI corpus** (`Detecting-ai/ai_pt_corpus`)                                | 1 M         | Generated with **OpenAI models** (GPT-4 and GPT-3.5 variants); prompts cover essays, news, tweets, dialogues, and paraphrases; temperature 0.6–1.0 |

Datasets were **balanced 1 : 1** and shuffled before training.

---

## 🚦 Intended use

Detect AI-generated Portuguese text in essays, articles, chats, support tickets, etc.

### Limitations

* Not trained on code or non-Portuguese text.
* Accuracy may drop on texts shorter than 10 tokens or on heavily paraphrased AI output.
* **Commercial use is disallowed** (CC-BY-NC-4.0).

---

## ⚠️ Future work

* Evaluate on adversarial paraphrases.
* Distill/quantize for edge deployment.
* Extend to multilingual detection.

---

## 📜 Citation

```bibtex
@misc{abdurazzoqov2025ptaidetector,
  title        = {pt-ai-detector: Detecting AI-generated Portuguese Text},
  author       = {Abdulla Abdurazzoqov},
  year         = {2025},
  howpublished = {Hugging Face Hub},
  url          = {https://huggingface.co/Detecting-ai/pt-ai-detector}
}
```

---

## 💬 License

Creative Commons **CC-BY-NC-4.0** – free for research & personal use; commercial use requires written permission.
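
---

## 🔎 Appendix: manual inference (sketch)

The `pipeline` call in **Quick usage** is the supported path. If you want the raw class probabilities instead, the minimal sketch below loads the checkpoint directly, applies a softmax, and maps the predicted id through `config.id2label`. It assumes the config follows the 0 = Human / 1 = AI mapping shown in the table above; the second input sentence is just an illustrative human-written example.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "Detecting-ai/pt-ai-detector"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

texts = [
    "A inteligência artificial está transformando a educação.",      # example from Quick usage
    "Ontem fui à feira e comprei frutas fresquinhas para a semana.",  # illustrative human-style sentence
]

# Truncate long paragraphs to BERT's 512-token limit.
inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits        # shape: (batch, 2)
    probs = torch.softmax(logits, dim=-1)  # columns follow the id → label table above

for text, p in zip(texts, probs):
    label_id = int(p.argmax())
    label = model.config.id2label.get(label_id, str(label_id))  # falls back to the raw id
    print(f"{label:>5}  p={p[label_id]:.3f}  |  {text}")
```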
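
---

## 🛠️ Appendix: training-setup sketch

The hyperparameters listed under **Training details** (3 epochs, batch size 32, AdamW at 2 × 10⁻⁵, fp16, cross-entropy) map directly onto the Hugging Face `Trainer`. The sketch below is not the original training script: the tiny `Dataset.from_dict` placeholder stands in for the real balanced 1 : 1 corpus, and the column names (`text`, `label`) are assumptions.

```python
import numpy as np
from datasets import Dataset
from sklearn.metrics import accuracy_score, f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

base = "neuralmind/bert-base-portuguese-cased"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(
    base,
    num_labels=2,                     # binary Human/AI head, trained with cross-entropy
    id2label={0: "Human", 1: "AI"},
    label2id={"Human": 0, "AI": 1},
)

# Hypothetical placeholder: swap in the real balanced 1:1 corpus described above,
# with a "text" column and a "label" column (0 = Human, 1 = AI).
train_ds = Dataset.from_dict({
    "text": ["Exemplo escrito por uma pessoa.", "Exemplo gerado por um modelo de linguagem."],
    "label": [0, 1],
})

def tokenize(batch):
    # Truncate to BERT's 512-token limit.
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_tok = train_ds.map(tokenize, batched=True, remove_columns=["text"])

def compute_metrics(eval_pred):
    # Accuracy and macro F1, the two figures reported in the metrics table.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(labels, preds),
            "f1_macro": f1_score(labels, preds, average="macro")}

args = TrainingArguments(
    output_dir="pt-ai-detector",
    num_train_epochs=3,               # values from the Training details list
    per_device_train_batch_size=32,
    learning_rate=2e-5,               # Trainer's default optimizer is AdamW
    fp16=True,                        # requires a GPU, as in the reported 1 × A100 run
    logging_steps=500,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_tok,
    data_collator=DataCollatorWithPadding(tokenizer),
    compute_metrics=compute_metrics,  # used by trainer.evaluate(eval_dataset=...)
)
trainer.train()
```

On the real balanced corpus, `trainer.evaluate()` over a held-out split would report accuracy and macro F1 in the same form as the table at the top of this card.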