---
language:
- tr
license: apache-2.0
tags:
- turkish
- diffusion
- masked-diffusion
- non-autoregressive
- foundation-model
- dllm
datasets:
- turkish-nlp-suite/Havadis
- turkish-nlp-suite/temiz-OSCAR
- wikimedia/wikipedia
metrics:
- perplexity
---

# DiffutronLM-0.3B-Base

**DiffutronLM-0.3B-Base** is the foundational Masked Diffusion Language Model (MDLM) of the Diffutron series, tailored specifically for the Turkish language. 

This model represents the completion of the **Continual Pre-training (CPT)** phase. It has successfully adapted the multilingual representations of its backbone to the agglutinative complexity and morphological nuances of Turkish. 

⚠️ **Note:** This is a base foundation model. It has **not** been instruction-tuned or aligned for chat capabilities. If you are looking for a model that follows prompts and answers questions, please use `DiffutronLM-0.3B-Instruct`.

## 📌 Model Details

* **Model Type:** Masked Diffusion Language Model (MDLM) Base
* **Base Architecture:** `jhu-clsp/mmBERT-base` (Multilingual Encoder)
* **Language:** Turkish
* **Parameter Count:** 307M (0.3B)
* **Context Length:** 512 tokens
* **Training Libraries:** `dllm`, PyTorch
* **Status:** Foundation / Base Model (Post-CPT)

## 🚀 Architecture & Continual Pre-training (CPT)

Unlike standard autoregressive models, Diffutron models text generation as a discrete diffusion process. To align the base encoder's latent space with the Turkish target distribution while preserving cross-lingual reasoning, this model underwent a specialized CPT pipeline:

* **Data Curation:** Trained on a composite dataset of approximately 2 million sequences (max length 512) sourced from:
  * **Havadis:** Comprehensive Turkish news articles.
  * **Temiz-OSCAR:** A cleaned, filtered subset of the Common Crawl-based Turkish OSCAR corpus.
  * **Turkish Wikipedia:** High-quality encyclopedic sequences.
* **Efficient Adaptation via LoRA:** Instead of full-parameter fine-tuning, which risks catastrophic forgetting, we applied Low-Rank Adaptation (LoRA) with a high rank ($r=256$, $\alpha=256$) targeting all linear modules (attention Q, K, V, O and the MLP input and output projections). This made ~14.94% of the parameters trainable.
* **Objective:** Masked Language Modeling (MLM).
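To make the LoRA setup above concrete, here is a minimal NumPy sketch of the low-rank update a LoRA adapter applies to one frozen linear layer. The dimensions are toy values chosen for illustration (the model itself uses $r = \alpha = 256$ across all linear modules); this is not the actual training code.

```python
import numpy as np

# Illustrative LoRA update: for a frozen weight W, LoRA learns a
# low-rank delta B @ A, scaled by alpha / r. Toy dimensions only.
d_out, d_in, r, alpha = 8, 8, 2, 2

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))   # frozen pretrained weight
A = rng.normal(size=(r, d_in))       # trainable down-projection
B = np.zeros((d_out, r))             # trainable up-projection, zero-init

# With B initialised to zero, the adapted layer starts identical to W,
# so training begins exactly from the pretrained model.
W_adapted = W + (alpha / r) * (B @ A)
assert np.allclose(W_adapted, W)

# Each adapted layer adds only r * (d_in + d_out) trainable parameters
# instead of the full d_in * d_out.
trainable = r * (d_in + d_out)
full = d_in * d_out
print(trainable, full)  # 32 vs 64 in this toy case
```

With a high rank like 256 the delta is far more expressive than typical LoRA configurations, which is how the adapter reaches roughly 15% of the model's parameter count while still leaving the backbone weights frozen.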

## 📊 Intrinsic Evaluation

To quantify the improvements gained from the CPT phase, we conducted an intrinsic evaluation using perplexity on the **Bilkent Turkish Writings Dataset** (evaluated with a masked language modeling probability of 0.15). 

The CPT process resulted in a significant reduction in perplexity, indicating a strong alignment with Turkish linguistic structures:

* **jhu-clsp/mmBERT-base (Pre-CPT):** 3.42
* **DiffutronLM-0.3B-Base (Post-CPT):** **2.75**

*(Note: Downstream task evaluations on the CETVEL benchmark were conducted on the Instruct-tuned versions of this model.)*
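The perplexity numbers above follow the standard masked-LM convention: exponentiate the mean negative log-likelihood of the masked tokens. A minimal sketch (the per-token losses here are hypothetical placeholders, not values from the evaluation):

```python
import math

# Hypothetical per-masked-token negative log-likelihoods in nats,
# as produced by a masked-LM cross-entropy loss.
nlls = [1.0, 1.2, 0.8]

# Masked-LM perplexity = exp(mean NLL over masked positions).
ppl = math.exp(sum(nlls) / len(nlls))
print(round(ppl, 3))  # exp(1.0) ≈ 2.718
```

Lower perplexity means the model assigns higher probability to the held-out masked tokens, which is why the drop from 3.42 to 2.75 indicates better fit to Turkish text.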

## 💻 Usage

As a base masked diffusion model, this checkpoint is ideal for:
1. **Further Fine-tuning:** Acting as a starting point for domain-specific continued pre-training or custom instruction tuning.
2. **Masked Token Prediction:** Filling in blanks or reconstructing corrupted text.
3. **Unconditional/Conditional Generation:** Generating text using a discrete diffusion sampling loop (e.g., via the `dllm` library).

Because it uses a non-autoregressive paradigm, standard `AutoModelForCausalLM.generate()` pipelines will not work with this checkpoint. Use a discrete diffusion generation strategy instead.
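To illustrate what "discrete diffusion sampling" means in practice, here is a toy unmasking loop in pure Python. The `toy_denoiser` is a stand-in for the model (a real MDLM returns per-position token distributions), and none of these names come from the `dllm` library; the sketch only shows the iterative commit-a-few-tokens-per-step structure.

```python
import random

MASK = "[MASK]"

def toy_denoiser(tokens):
    # Stand-in for the model: proposes a token for every masked slot.
    # A real MDLM would score the whole vocabulary at each position.
    vocab = ["bir", "güzel", "gün", "bugün"]
    return [random.choice(vocab) if t == MASK else t for t in tokens]

def diffusion_sample(length=4, steps=4, seed=0):
    """Minimal unmasking loop: start fully masked, and at each step
    commit a fraction of the still-masked positions from the model's
    proposal, re-masking the rest for the next iteration."""
    random.seed(seed)
    seq = [MASK] * length
    for step in range(steps, 0, -1):
        proposal = toy_denoiser(seq)
        masked = [i for i, t in enumerate(seq) if t == MASK]
        # Commit roughly 1/step of the remaining masked positions.
        keep = random.sample(masked, max(1, len(masked) // step))
        for i in keep:
            seq[i] = proposal[i]
    return seq

print(diffusion_sample())
```

Real samplers typically choose which positions to commit by model confidence rather than at random, but the overall loop, predict all masked positions and finalise only some per step, is the same.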

## ⚠️ Limitations

* **No Instruction Tuning:** Will not respond to QA prompts or instructions naturally.
* **Multilingual Backbone:** While heavily adapted to Turkish, it is built upon a multilingual encoder.
* **Context Window:** Restricted to a 512-token context window during the base phase.

## 📝 Citation

```bibtex
@misc{diffutron2026,
  author = {Kocabay, Şuayp Talha and Akkuş, Talha Rüzgar},
  title = {Diffutron: A Masked Diffusion Language Model for the Turkish Language},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/collections/diffutron/diffutronlm}}
}
```