---
license: cc-by-nc-4.0
language:
  - pt
tags:
  - ai-detection
  - text-classification
  - portuguese
  - bert
  - transformers
base_model: neuralmind/bert-base-portuguese-cased
pipeline_tag: text-classification
datasets:
  - wiki40b-pt                  # from consolidated human sources
  - oscar-pt                    # from consolidated human sources
  - cc100-pt                    # from consolidated human sources
  - europarl-pt                 # from consolidated human sources
  - opus-books-pt               # from consolidated human sources
  - Detecting-ai/ai_pt_corpus   # AI-generated corpus
model_type: bert
---

# 🇧🇷 pt-ai-detector

**pt-ai-detector** is a BERT-base model fine-tuned to classify whether a Portuguese sentence or paragraph was written by a *human* (`label = 0`) or generated by *AI* (`label = 1`).

| Metric                | Value                               |
| --------------------- | ----------------------------------- |
| **Train data**        | 1 000 000 human + 1 000 000 AI      |
| **Balanced test set** | 1 954 190 examples (½ human, ½ AI)  |
| **Accuracy**          | ≈ 99 %                              |
| **F1 (macro)**        | ≈ 0.99                              |
| **Model size**        | ≈ 110 M parameters (≈ 430 MB)       |

---

## 📖 Quick usage

**Try it live** at [detecting-ai.com](https://detecting-ai.com/pt) – our team at Detecting-ai built this model and demo so you can instantly test any Portuguese text online.

```python
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="Detecting-ai/pt-ai-detector",
    tokenizer="Detecting-ai/pt-ai-detector",
    device=0,  # set device=-1 for CPU
)

text = "A inteligência artificial está transformando a educação."
print(clf(text))
# → [{'label': 'AI', 'score': 0.987}]
```

| id | label |
| -- | ----- |
| 0  | Human |
| 1  | AI    |

---

## 🏋️‍♂️ Training details

* **Base model:** `neuralmind/bert-base-portuguese-cased`
* **Epochs:** 3 (fp16 on 1 × A100)
* **Batch size:** 32
* **Optimizer/LR:** AdamW, 2 × 10⁻⁵
* **Loss:** Cross-entropy

### Data sources

| Corpus                                                                    | Lines used  | Notes |
| ------------------------------------------------------------------------- | ----------- | ----- |
| Human corpus (wiki40b-pt, oscar-pt, cc100-pt, europarl-pt, opus-books-pt)  | 1 M sampled | Diverse Portuguese web, news, and book text |
| **AI corpus** (`Detecting-ai/ai_pt_corpus`)                                | 1 M         | Generated with **OpenAI models** (GPT-4 and GPT-3.5 variants); prompts cover essays, news, tweets, dialogues, and paraphrases; temperature 0.6–1.0 |

Datasets were **balanced 1 : 1** and shuffled before training.

---

## 🚦 Intended use

Detect AI-generated Portuguese text in essays, articles, chats, support tickets, etc.

### Limitations

* Not trained on code or non-Portuguese text.
* Accuracy may drop on texts shorter than 10 tokens or on heavily paraphrased AI output.
* **Commercial use is disallowed** (CC-BY-NC-4.0).

---

## ⚠️ Future work

* Evaluate on adversarial paraphrases.
* Distill/quantize for edge deployment.
* Extend to multilingual detection.

---

## 📜 Citation

```bibtex
@misc{abdurazzoqov2025ptaidetector,
  title        = {pt-ai-detector: Detecting AI-generated Portuguese Text},
  author       = {Abdulla Abdurazzoqov},
  year         = {2025},
  howpublished = {Hugging Face Hub},
  url          = {https://huggingface.co/Detecting-ai/pt-ai-detector}
}
```

---

## 💬 License

Creative Commons **CC-BY-NC-4.0** – free for research & personal use; commercial use requires written permission.
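
---

## 🔎 Appendix: manual inference (sketch)

The `pipeline` call in **Quick usage** is the supported path. If you want the raw class probabilities instead, the minimal sketch below loads the checkpoint directly, applies a softmax, and maps the predicted id through `config.id2label`. It assumes the config follows the 0 = Human / 1 = AI mapping shown in the table above; the second input sentence is just an illustrative human-written example.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "Detecting-ai/pt-ai-detector"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

texts = [
    "A inteligência artificial está transformando a educação.",      # example from Quick usage
    "Ontem fui à feira e comprei frutas fresquinhas para a semana.",  # illustrative human-style sentence
]

# Truncate long paragraphs to BERT's 512-token limit.
inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits        # shape: (batch, 2)
    probs = torch.softmax(logits, dim=-1)  # columns follow the id → label table above

for text, p in zip(texts, probs):
    label_id = int(p.argmax())
    label = model.config.id2label.get(label_id, str(label_id))  # falls back to the raw id
    print(f"{label:>5}  p={p[label_id]:.3f}  |  {text}")
```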
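
---

## 🛠️ Appendix: training-setup sketch

The hyperparameters listed under **Training details** (3 epochs, batch size 32, AdamW at 2 × 10⁻⁵, fp16, cross-entropy) map directly onto the Hugging Face `Trainer`. The sketch below is not the original training script: the tiny `Dataset.from_dict` placeholder stands in for the real balanced 1 : 1 corpus, and the column names (`text`, `label`) are assumptions.

```python
import numpy as np
from datasets import Dataset
from sklearn.metrics import accuracy_score, f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

base = "neuralmind/bert-base-portuguese-cased"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(
    base,
    num_labels=2,                     # binary Human/AI head, trained with cross-entropy
    id2label={0: "Human", 1: "AI"},
    label2id={"Human": 0, "AI": 1},
)

# Hypothetical placeholder: swap in the real balanced 1:1 corpus described above,
# with a "text" column and a "label" column (0 = Human, 1 = AI).
train_ds = Dataset.from_dict({
    "text": ["Exemplo escrito por uma pessoa.", "Exemplo gerado por um modelo de linguagem."],
    "label": [0, 1],
})

def tokenize(batch):
    # Truncate to BERT's 512-token limit.
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_tok = train_ds.map(tokenize, batched=True, remove_columns=["text"])

def compute_metrics(eval_pred):
    # Accuracy and macro F1, the two figures reported in the metrics table.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(labels, preds),
            "f1_macro": f1_score(labels, preds, average="macro")}

args = TrainingArguments(
    output_dir="pt-ai-detector",
    num_train_epochs=3,               # values from the Training details list
    per_device_train_batch_size=32,
    learning_rate=2e-5,               # Trainer's default optimizer is AdamW
    fp16=True,                        # requires a GPU, as in the reported 1 × A100 run
    logging_steps=500,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_tok,
    data_collator=DataCollatorWithPadding(tokenizer),
    compute_metrics=compute_metrics,  # used by trainer.evaluate(eval_dataset=...)
)
trainer.train()
```

On the real balanced corpus, `trainer.evaluate()` over a held-out split would report accuracy and macro F1 in the same form as the table at the top of this card.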