Update to epoch 2 checkpoint with 3.36 validation perplexity
- README.md +69 -17
- model.safetensors +1 -1
README.md
CHANGED
|
@@ -11,22 +11,37 @@ tags:
|
|
| 11 |
datasets:
|
| 12 |
- humbleworth/registered-domains
|
| 13 |
base_model: google/canine-c
|
| 14 |
---
|
| 15 |
|
| 16 |
-
# Domain MLM - CANINE Character-Level Model for Domain Names
|
| 17 |
|
| 18 |
This model is a CANINE-based character-level language model that has been further pre-trained on domain names using masked language modeling (MLM). It's designed to understand and predict patterns in domain names at the character level.
|
| 19 |
|
| 20 |
## Model Description
|
| 21 |
|
| 22 |
-
This is a checkpoint from epoch
|
| 23 |
|
| 24 |
### Key Features
|
| 25 |
|
| 26 |
- **Character-level processing**: Works directly with Unicode code points, no tokenization required
|
| 27 |
- **Domain-specific**: Pre-trained on 255M registered domain names
|
| 28 |
-
- **Masked Language Modeling**: Trained to predict masked characters in domain names
|
| 29 |
- **Efficient**: 132M parameters, suitable for downstream fine-tuning
|
| 30 |
|
| 31 |
### Architecture
|
| 32 |
|
|
@@ -43,11 +58,27 @@ This is a checkpoint from epoch 1 of training CANINE-c on domain name data. The
|
|
| 43 |
- **Training Data**: humbleworth/registered-domains dataset (255M domains)
|
| 44 |
- **Training Objective**: Masked Language Modeling (MLM) with 25% masking probability
|
| 45 |
- **Masking Strategy**: Mix of contiguous spans (80%) and random characters (20%)
|
| 46 |
-
- **Optimizer**: AdamW with learning rate
|
| 47 |
-
- **Batch Size**:
|
| 48 |
-
- **Hardware**:
|
| 49 |
- **Mixed Precision**: BF16 automatic mixed precision
|
| 50 |
- **Training Framework**: PyTorch with custom training loop
|
| 51 |
|
| 52 |
## Intended Uses & Limitations
|
| 53 |
|
|
@@ -58,14 +89,15 @@ This is a checkpoint from epoch 1 of training CANINE-c on domain name data. The
|
|
| 58 |
- Feature extraction for domain-related tasks
|
| 59 |
- Fine-tuning for domain classification tasks
|
| 60 |
- Domain name generation (with additional fine-tuning)
|
|
|
|
| 61 |
|
| 62 |
### Limitations
|
| 63 |
|
| 64 |
-
- This is an early checkpoint (epoch 1) - later checkpoints may perform better
|
| 65 |
- Primarily trained on ASCII domain names
|
| 66 |
-
- Limited to domains up to
|
| 67 |
- Not suitable for general text understanding tasks
|
| 68 |
- Performance on internationalized domain names (IDN) may be limited
|
|
|
|
| 69 |
|
| 70 |
## How to Use
|
| 71 |
|
|
@@ -76,11 +108,11 @@ import torch
|
|
| 76 |
from transformers import CanineTokenizer, CanineModel, CanineConfig
|
| 77 |
|
| 78 |
# Load tokenizer
|
| 79 |
-
tokenizer = CanineTokenizer.from_pretrained('humbleworth/domain-mlm')
|
| 80 |
|
| 81 |
# Load base CANINE model
|
| 82 |
-
config = CanineConfig.from_pretrained('humbleworth/domain-mlm')
|
| 83 |
-
model = CanineModel.from_pretrained('humbleworth/domain-mlm')
|
| 84 |
|
| 85 |
# Encode a domain
|
| 86 |
domain = "example.com"
|
|
@@ -106,7 +138,7 @@ from train_mlm import CanineForMaskedLM
|
|
| 106 |
|
| 107 |
# Load model with MLM head
|
| 108 |
model = CanineForMaskedLM(config)
|
| 109 |
-
model.canine = CanineModel.from_pretrained('humbleworth/domain-mlm')
|
| 110 |
|
| 111 |
# Load MLM head weights
|
| 112 |
state_dict = torch.load('training_state.bin', map_location='cpu')
|
|
@@ -127,6 +159,7 @@ The model was trained on the [humbleworth/registered-domains](https://huggingfac
|
|
| 127 |
- **Source**: [Domains Project](https://domainsproject.org/)
|
| 128 |
- **Character Set**: 100% ASCII (no internationalized domains)
|
| 129 |
- **Average Length**: 15.9 characters (range: 4-77 characters)
|
|
|
|
| 130 |
|
| 131 |
### TLD Distribution
|
| 132 |
- **Total Unique TLDs**: 1,274
|
|
@@ -146,7 +179,23 @@ This comprehensive dataset provides excellent coverage of real-world domain patt
|
|
| 146 |
|
| 147 |
## Evaluation
|
| 148 |
|
| 149 |
-
|
| 150 |
|
| 151 |
## Technical Specifications
|
| 152 |
|
|
@@ -163,18 +212,20 @@ This is an intermediate checkpoint. Full evaluation metrics will be available wi
|
|
| 163 |
- PyTorch 2.0+
|
| 164 |
- Mixed precision training (BF16)
|
| 165 |
- Custom training loop implementation
|
| 166 |
|
| 167 |
## Citation
|
| 168 |
|
| 169 |
If you use this model, please cite:
|
| 170 |
|
| 171 |
```bibtex
|
| 172 |
-
@misc{domain-mlm-
|
| 173 |
title={Domain MLM: Character-Level Language Modeling for Domain Names},
|
| 174 |
author={humbleworth},
|
| 175 |
-
year={
|
| 176 |
publisher={Hugging Face},
|
| 177 |
-
howpublished={\url{https://huggingface.co/humbleworth/domain-mlm}}
|
| 178 |
}
|
| 179 |
```
|
| 180 |
|
|
@@ -186,4 +237,5 @@ This model is released under the Apache 2.0 license.
|
|
| 186 |
|
| 187 |
- Based on Google's CANINE-c model
|
| 188 |
- Trained using the humbleworth/registered-domains dataset
|
| 189 |
-
- Optimized training code for NVIDIA A100 GPUs
|
|
|
|
|
|
| 11 |
datasets:
|
| 12 |
- humbleworth/registered-domains
|
| 13 |
base_model: google/canine-c
|
| 14 |
+
model-index:
|
| 15 |
+
- name: domain-mlm-epoch-2
|
| 16 |
+
results:
|
| 17 |
+
- task:
|
| 18 |
+
type: fill-mask
|
| 19 |
+
name: Masked Language Modeling
|
| 20 |
+
dataset:
|
| 21 |
+
name: humbleworth/registered-domains
|
| 22 |
+
type: humbleworth/registered-domains
|
| 23 |
+
split: validation
|
| 24 |
+
metrics:
|
| 25 |
+
- type: perplexity
|
| 26 |
+
value: 3.36
|
| 27 |
+
name: Validation Perplexity
|
| 28 |
---
|
| 29 |
|
| 30 |
+
# Domain MLM - CANINE Character-Level Model for Domain Names (Epoch 2)
|
| 31 |
|
| 32 |
This model is a CANINE-based character-level language model that has been further pre-trained on domain names using masked language modeling (MLM). It's designed to understand and predict patterns in domain names at the character level.
|
| 33 |
|
| 34 |
## Model Description
|
| 35 |
|
| 36 |
+
This is a checkpoint from epoch 2 of training CANINE-c on domain name data. The model continues pretraining from Google's CANINE-c base model, adapting it specifically to domain name patterns through masked character prediction.
|
| 37 |
|
| 38 |
### Key Features
|
| 39 |
|
| 40 |
- **Character-level processing**: Works directly with Unicode code points, no tokenization required
|
| 41 |
- **Domain-specific**: Pre-trained on 255M registered domain names
|
| 42 |
+
- **Masked Language Modeling**: Trained to predict masked characters in domain names (25% masking probability)
|
| 43 |
- **Efficient**: 132M parameters, suitable for downstream fine-tuning
|
| 44 |
+
- **Performance**: 3.36 validation perplexity after epoch 2
|
| 45 |
|
| 46 |
### Architecture
|
| 47 |
|
|
|
|
| 58 |
- **Training Data**: humbleworth/registered-domains dataset (255M domains)
|
| 59 |
- **Training Objective**: Masked Language Modeling (MLM) with 25% masking probability
|
| 60 |
- **Masking Strategy**: Mix of contiguous spans (80%) and random characters (20%)
|
| 61 |
+
- **Optimizer**: AdamW with learning rate 3e-5, weight decay 0.01
|
| 62 |
+
- **Batch Size**: 512 per device with gradient accumulation steps of 3 (effective batch size: 1,536)
|
| 63 |
+
- **Hardware**: NVIDIA A100 40GB
|
| 64 |
- **Mixed Precision**: BF16 automatic mixed precision
|
| 65 |
- **Training Framework**: PyTorch with custom training loop
|
| 66 |
+
- **Warmup Steps**: 2,000
|
| 67 |
+
- **Total Steps**: ~830,000 (2 epochs completed at 332,200 steps)
|
| 68 |
+
- **Training Time**: ~36 hours for 2 epochs
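The masking strategy listed above (25% of characters, mixing contiguous spans and random single characters) could be sketched roughly as follows. This is not the project's training code, only an illustration under stated assumptions: the span-versus-random choice is assumed to be made once per domain, the maximum span length `max_span` is hypothetical, and `_` is a placeholder rather than the tokenizer's real mask symbol.

```python
import random

MASK_PROB = 0.25    # masking probability stated in the model card
SPAN_SHARE = 0.80   # share of span masking stated in the model card


def choose_mask_positions(text, rng, max_span=5):
    """Pick character indices to mask in `text`.

    Assumptions (not from the model card): the span/random choice is made
    once per domain, and spans are 1..max_span characters long.
    """
    n = len(text)
    if n == 0:
        return []
    budget = max(1, round(n * MASK_PROB))  # number of characters to mask
    masked = set()
    if rng.random() < SPAN_SHARE:
        # Contiguous spans: keep adding spans until the budget is used up.
        while len(masked) < budget:
            span = min(rng.randint(1, max_span), budget - len(masked))
            start = rng.randrange(n - span + 1)
            masked.update(range(start, start + span))
    else:
        # Random characters: sample distinct positions.
        masked.update(rng.sample(range(n), budget))
    return sorted(masked)


def apply_mask(text, positions, mask_char="_"):
    """Replace masked positions with a placeholder (hypothetical mask symbol)."""
    chars = list(text)
    for i in positions:
        chars[i] = mask_char
    return "".join(chars)


rng = random.Random(0)
domain = "example.com"
positions = choose_mask_positions(domain, rng)
print(positions, apply_mask(domain, positions))
```

For an 11-character domain such as `example.com`, the budget works out to 3 masked characters; which characters are chosen depends on the random seed.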
|
| 69 |
+
|
| 70 |
+
### Performance Metrics
|
| 71 |
+
|
| 72 |
+
**Epoch 2 Results**:
|
| 73 |
+
- **Training Loss**: 1.29
|
| 74 |
+
- **Training Perplexity**: 3.62
|
| 75 |
+
- **Validation Loss**: 1.21
|
| 76 |
+
- **Validation Perplexity**: 3.36
|
| 77 |
+
- **Best Training Perplexity**: 3.49 (achieved during epoch 2)
|
| 78 |
+
- **Processing Speed**: 4,037 samples/second
|
| 79 |
+
- **GPU Memory Usage**: 2.85 GB (highly optimized)
|
| 80 |
+
|
| 81 |
+
The model converged steadily, improving from an initial perplexity of 10.08 to 3.36 on validation data. A perplexity of 3.36 means the model is, on average, about as uncertain as if it were choosing uniformly among 3 to 4 candidate characters at each masked position.
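The reported numbers can be cross-checked with a few lines of arithmetic. The sketch below assumes the losses are mean natural-log cross-entropy per masked character, so that perplexity is `exp(loss)`; it also assumes the reported throughput held for the whole run. Under those assumptions the results agree with the reported perplexities up to rounding of the loss values, and with the stated training time of roughly 36 hours.

```python
import math

# Losses reported in the model card.
reported_losses = {
    "initial": 2.31,
    "training (epoch 2)": 1.29,
    "validation (epoch 2)": 1.21,
}


def perplexity(loss):
    """Perplexity for a mean cross-entropy loss measured in nats."""
    return math.exp(loss)


for name, loss in reported_losses.items():
    print(f"{name}: loss={loss:.2f} -> perplexity ~ {perplexity(loss):.2f}")

# Batch size, step count, and throughput reported in the model card.
effective_batch = 512 * 3        # per-device batch x gradient accumulation steps
steps_done = 332_200             # steps completed after 2 epochs
samples_per_second = 4_037

samples_seen = steps_done * effective_batch
hours = samples_seen / samples_per_second / 3600
print(f"effective batch: {effective_batch}")
print(f"samples seen: {samples_seen:,} (~{samples_seen / 2 / 1e6:.0f}M per epoch)")
print(f"estimated wall-clock time: ~{hours:.1f} hours")
```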
|
| 82 |
|
| 83 |
## Intended Uses & Limitations
|
| 84 |
|
|
|
|
| 89 |
- Feature extraction for domain-related tasks
|
| 90 |
- Fine-tuning for domain classification tasks
|
| 91 |
- Domain name generation (with additional fine-tuning)
|
| 92 |
+
- Character-level anomaly detection in domains
|
| 93 |
|
| 94 |
### Limitations
|
| 95 |
|
|
|
|
| 96 |
- Primarily trained on ASCII domain names
|
| 97 |
+
- Limited to domains up to 64 characters (training max_length)
|
| 98 |
- Not suitable for general text understanding tasks
|
| 99 |
- Performance on internationalized domain names (IDN) may be limited
|
| 100 |
+
- The model has learned strong biases toward common TLDs (.com, .net, .org)
|
| 101 |
|
| 102 |
## How to Use
|
| 103 |
|
|
|
|
| 108 |
from transformers import CanineTokenizer, CanineModel, CanineConfig
|
| 109 |
|
| 110 |
# Load tokenizer
|
| 111 |
+
tokenizer = CanineTokenizer.from_pretrained('humbleworth/domain-mlm-epoch-2')
|
| 112 |
|
| 113 |
# Load base CANINE model
|
| 114 |
+
config = CanineConfig.from_pretrained('humbleworth/domain-mlm-epoch-2')
|
| 115 |
+
model = CanineModel.from_pretrained('humbleworth/domain-mlm-epoch-2')
|
| 116 |
|
| 117 |
# Encode a domain
|
| 118 |
domain = "example.com"
|
|
|
|
| 138 |
|
| 139 |
# Load model with MLM head
|
| 140 |
model = CanineForMaskedLM(config)
|
| 141 |
+
model.canine = CanineModel.from_pretrained('humbleworth/domain-mlm-epoch-2')
|
| 142 |
|
| 143 |
# Load MLM head weights
|
| 144 |
state_dict = torch.load('training_state.bin', map_location='cpu')
|
|
|
|
| 159 |
- **Source**: [Domains Project](https://domainsproject.org/)
|
| 160 |
- **Character Set**: 100% ASCII (no internationalized domains)
|
| 161 |
- **Average Length**: 15.9 characters (range: 4-77 characters)
|
| 162 |
+
- **Training/Validation Split**: 99.9% / 0.1%
|
| 163 |
|
| 164 |
### TLD Distribution
|
| 165 |
- **Total Unique TLDs**: 1,274
|
|
|
|
| 179 |
|
| 180 |
## Evaluation
|
| 181 |
|
| 182 |
+
### Perplexity Analysis
|
| 183 |
+
|
| 184 |
+
The model achieved a validation perplexity of **3.36**, which means:
|
| 185 |
+
- On average, the model is about as uncertain as if it were choosing uniformly among ~3.36 characters at each masked position
|
| 186 |
+
- This represents excellent performance for domain name modeling
|
| 187 |
+
- The low perplexity indicates strong pattern learning, including:
|
| 188 |
+
- TLD patterns (high certainty after dots)
|
| 189 |
+
- Common domain prefixes and suffixes
|
| 190 |
+
- Valid character sequences in domain names
|
| 191 |
+
|
| 192 |
+
### Training Progression
|
| 193 |
+
- **Initial**: Loss=2.31, Perplexity=10.08
|
| 194 |
+
- **Epoch 1**: ~4.5-5.0 perplexity (estimated)
|
| 195 |
+
- **Epoch 2**: Loss=1.21, Perplexity=3.36
|
| 196 |
+
- **Best achieved**: Perplexity=3.49 (training), 3.36 (validation)
|
| 197 |
+
|
| 198 |
+
The model appears to be approaching a plateau at roughly 3.2-3.5 perplexity, suggesting it has captured most of the learnable patterns in the domain dataset.
|
| 199 |
|
| 200 |
## Technical Specifications
|
| 201 |
|
|
|
|
| 212 |
- PyTorch 2.0+
|
| 213 |
- Mixed precision training (BF16)
|
| 214 |
- Custom training loop implementation
|
| 215 |
+
- Gradient clipping: 1.0
|
| 216 |
+
- Training tracked with Weights & Biases
|
| 217 |
|
| 218 |
## Citation
|
| 219 |
|
| 220 |
If you use this model, please cite:
|
| 221 |
|
| 222 |
```bibtex
|
| 223 |
+
@misc{domain-mlm-2025,
|
| 224 |
title={Domain MLM: Character-Level Language Modeling for Domain Names},
|
| 225 |
author={humbleworth},
|
| 226 |
+
year={2025},
|
| 227 |
publisher={Hugging Face},
|
| 228 |
+
howpublished={\url{https://huggingface.co/humbleworth/domain-mlm-epoch-2}}
|
| 229 |
}
|
| 230 |
```
|
| 231 |
|
|
|
|
| 237 |
|
| 238 |
- Based on Google's CANINE-c model
|
| 239 |
- Trained using the humbleworth/registered-domains dataset
|
| 240 |
+
- Optimized training code for NVIDIA A100 GPUs
|
| 241 |
+
- Training infrastructure provided by Lambda Labs
|
model.safetensors
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
size 528359880
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:f475ef3e19f78b48fbbd7dec1c0b888f5bda4e1800f6ca5b1106f799c3acb6e9
|
| 3 |
size 528359880
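A downloaded `model.safetensors` can be checked against the Git LFS pointer above by comparing its size and SHA-256 digest. The following is a minimal sketch using only the Python standard library; the helper names (`parse_lfs_pointer`, `sha256_of_file`, `matches_pointer`) and the local file path are hypothetical.

```python
import hashlib
import os

# Pointer contents as shown in the new side of the diff.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:f475ef3e19f78b48fbbd7dec1c0b888f5bda4e1800f6ca5b1106f799c3acb6e9
size 528359880
"""


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, _, oid = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "oid": oid,
        "size": int(fields["size"]),
    }


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large checkpoints need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_pointer(path, pointer):
    """True if the file's size and SHA-256 digest match the pointer."""
    return (
        os.path.getsize(path) == pointer["size"]
        and sha256_of_file(path) == pointer["oid"]
    )


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["algo"], pointer["size"])
# Example call, assuming the file was downloaded to the working directory:
# print(matches_pointer("model.safetensors", pointer))
```

The size check is cheap and runs first, so a truncated download is rejected before the file is hashed.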
|