gpriday committed on
Commit 7f3b70a · verified · 1 parent: 43194c3

Update to epoch 2 checkpoint with 3.36 validation perplexity

Files changed (2)
  1. README.md +69 -17
  2. model.safetensors +1 -1
README.md CHANGED
@@ -11,22 +11,37 @@ tags:
  datasets:
  - humbleworth/registered-domains
  base_model: google/canine-c
  ---

- # Domain MLM - CANINE Character-Level Model for Domain Names

  This model is a CANINE-based character-level language model that has been further pre-trained on domain names using masked language modeling (MLM). It's designed to understand and predict patterns in domain names at the character level.

  ## Model Description

- This is a checkpoint from epoch 1 of training CANINE-c on domain name data. The model continues pretraining from Google's CANINE-c base model, adapting it specifically to domain name patterns through masked character prediction.

  ### Key Features

  - **Character-level processing**: Works directly with Unicode code points, no tokenization required
  - **Domain-specific**: Pre-trained on 255M registered domain names
- - **Masked Language Modeling**: Trained to predict masked characters in domain names
  - **Efficient**: 132M parameters, suitable for downstream fine-tuning

  ### Architecture

@@ -43,11 +58,27 @@ This is a checkpoint from epoch 1 of training CANINE-c on domain name data. The
  - **Training Data**: humbleworth/registered-domains dataset (255M domains)
  - **Training Objective**: Masked Language Modeling (MLM) with 25% masking probability
  - **Masking Strategy**: Mix of contiguous spans (80%) and random characters (20%)
- - **Optimizer**: AdamW with learning rate 1e-5
- - **Batch Size**: 256 per device with gradient accumulation (effective batch size: 512)
- - **Hardware**: Optimized for NVIDIA A100 40GB
  - **Mixed Precision**: BF16 automatic mixed precision
  - **Training Framework**: PyTorch with custom training loop

  ## Intended Uses & Limitations

@@ -58,14 +89,15 @@ This is a checkpoint from epoch 1 of training CANINE-c on domain name data. The
  - Feature extraction for domain-related tasks
  - Fine-tuning for domain classification tasks
  - Domain name generation (with additional fine-tuning)

  ### Limitations

- - This is an early checkpoint (epoch 1) - later checkpoints may perform better
  - Primarily trained on ASCII domain names
- - Limited to domains up to 128 characters
  - Not suitable for general text understanding tasks
  - Performance on internationalized domain names (IDN) may be limited

  ## How to Use

 
@@ -76,11 +108,11 @@ import torch
  from transformers import CanineTokenizer, CanineModel, CanineConfig

  # Load tokenizer
- tokenizer = CanineTokenizer.from_pretrained('humbleworth/domain-mlm')

  # Load base CANINE model
- config = CanineConfig.from_pretrained('humbleworth/domain-mlm')
- model = CanineModel.from_pretrained('humbleworth/domain-mlm')

  # Encode a domain
  domain = "example.com"
@@ -106,7 +138,7 @@ from train_mlm import CanineForMaskedLM

  # Load model with MLM head
  model = CanineForMaskedLM(config)
- model.canine = CanineModel.from_pretrained('humbleworth/domain-mlm')

  # Load MLM head weights
  state_dict = torch.load('training_state.bin', map_location='cpu')

@@ -127,6 +159,7 @@ The model was trained on the [humbleworth/registered-domains](https://huggingfac
  - **Source**: [Domains Project](https://domainsproject.org/)
  - **Character Set**: 100% ASCII (no internationalized domains)
  - **Average Length**: 15.9 characters (range: 4-77 characters)

  ### TLD Distribution
  - **Total Unique TLDs**: 1,274

@@ -146,7 +179,23 @@ This comprehensive dataset provides excellent coverage of real-world domain patt

  ## Evaluation

- This is an intermediate checkpoint. Full evaluation metrics will be available with the final model release. The model achieved reasonable perplexity on the validation set during training.

  ## Technical Specifications

 
@@ -163,18 +212,20 @@ This is an intermediate checkpoint. Full evaluation metrics will be available wi
  - PyTorch 2.0+
  - Mixed precision training (BF16)
  - Custom training loop implementation

  ## Citation

  If you use this model, please cite:

  ```bibtex
- @misc{domain-mlm-2024,
    title={Domain MLM: Character-Level Language Modeling for Domain Names},
    author={humbleworth},
-   year={2024},
    publisher={Hugging Face},
-   howpublished={\url{https://huggingface.co/humbleworth/domain-mlm}}
  }
  ```

@@ -186,4 +237,5 @@ This model is released under the Apache 2.0 license.

  - Based on Google's CANINE-c model
  - Trained using the humbleworth/registered-domains dataset
- - Optimized training code for NVIDIA A100 GPUs

  datasets:
  - humbleworth/registered-domains
  base_model: google/canine-c
+ model-index:
+ - name: domain-mlm-epoch-2
+   results:
+   - task:
+       type: fill-mask
+       name: Masked Language Modeling
+     dataset:
+       name: humbleworth/registered-domains
+       type: humbleworth/registered-domains
+       split: validation
+     metrics:
+     - type: perplexity
+       value: 3.36
+       name: Validation Perplexity
  ---

+ # Domain MLM - CANINE Character-Level Model for Domain Names (Epoch 2)

  This model is a CANINE-based character-level language model that has been further pre-trained on domain names using masked language modeling (MLM). It's designed to understand and predict patterns in domain names at the character level.

  ## Model Description

+ This is a checkpoint from epoch 2 of training CANINE-c on domain name data. The model continues pretraining from Google's CANINE-c base model, adapting it specifically to domain name patterns through masked character prediction.

  ### Key Features

  - **Character-level processing**: Works directly with Unicode code points, no tokenization required
  - **Domain-specific**: Pre-trained on 255M registered domain names
+ - **Masked Language Modeling**: Trained to predict masked characters in domain names (25% masking probability)
  - **Efficient**: 132M parameters, suitable for downstream fine-tuning
+ - **Strong Performance**: Achieved 3.36 validation perplexity

  ### Architecture

 
  - **Training Data**: humbleworth/registered-domains dataset (255M domains)
  - **Training Objective**: Masked Language Modeling (MLM) with 25% masking probability
  - **Masking Strategy**: Mix of contiguous spans (80%) and random characters (20%) (see the sketch after this list)
+ - **Optimizer**: AdamW with learning rate 3e-5, weight decay 0.01
+ - **Batch Size**: 512 per device with gradient accumulation steps of 3 (effective batch size: 1,536)
+ - **Hardware**: NVIDIA A100 40GB
  - **Mixed Precision**: BF16 automatic mixed precision
  - **Training Framework**: PyTorch with custom training loop
+ - **Warmup Steps**: 2,000
+ - **Total Steps**: ~830,000 (2 epochs completed at 332,200 steps)
+ - **Training Time**: ~36 hours for 2 epochs
+
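+ As a rough sketch of how this 80/20 span-versus-character masking mix could be implemented (the actual `train_mlm` code is not reproduced in this card, so the helper below and its `max_span` constant are illustrative assumptions):
+
+ ```python
+ import random
+
+ def mask_domain(ids, mask_id, mask_prob=0.25, span_frac=0.8, max_span=5):
+     """Illustrative span/random masking over a domain's character ids."""
+     n_mask = max(1, round(len(ids) * mask_prob))   # mask ~25% of characters
+     to_mask = set()
+     while len(to_mask) < n_mask:
+         if random.random() < span_frac:            # 80%: contiguous span
+             span = random.randint(2, max_span)
+             start = random.randrange(max(1, len(ids) - span + 1))
+             to_mask.update(range(start, min(start + span, len(ids))))
+         else:                                      # 20%: single character
+             to_mask.add(random.randrange(len(ids)))
+     to_mask = set(sorted(to_mask)[:n_mask])        # trim span overshoot
+     masked = [mask_id if i in to_mask else c for i, c in enumerate(ids)]
+     labels = [c if i in to_mask else -100 for i, c in enumerate(ids)]  # -100 = ignored by the loss
+     return masked, labels
+ ```
+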
+ ### Performance Metrics
+
+ **Epoch 2 Results**:
+ - **Training Loss**: 1.29
+ - **Training Perplexity**: 3.62
+ - **Validation Loss**: 1.21
+ - **Validation Perplexity**: 3.36
+ - **Best Training Perplexity**: 3.49 (achieved during epoch 2)
+ - **Processing Speed**: 4,037 samples/second
+ - **GPU Memory Usage**: 2.85 GB
+
+ The model converges well, improving from an initial validation perplexity of 10.08 to 3.36. A validation perplexity of 3.36 means the model narrows each masked character down to roughly 3-4 likely candidates on average.
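+
+ The loss/perplexity pairs above can be sanity-checked directly, since perplexity is just the exponential of the cross-entropy loss:
+
+ ```python
+ import math
+
+ # perplexity = exp(cross-entropy loss)
+ print(math.exp(1.29))  # ~3.63  -> training perplexity (reported as 3.62)
+ print(math.exp(1.21))  # ~3.35  -> validation perplexity (reported as 3.36)
+ print(math.exp(2.31))  # ~10.07 -> initial perplexity (reported as 10.08)
+ ```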
 
  ## Intended Uses & Limitations

  - Feature extraction for domain-related tasks
  - Fine-tuning for domain classification tasks (see the sketch below)
  - Domain name generation (with additional fine-tuning)
+ - Character-level anomaly detection in domains
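+
+ As an illustration of the classification use case, the checkpoint can be loaded into Transformers' stock `CanineForSequenceClassification` head; the two labels here are a made-up example:
+
+ ```python
+ import torch
+ from transformers import CanineTokenizer, CanineForSequenceClassification
+
+ tokenizer = CanineTokenizer.from_pretrained('humbleworth/domain-mlm-epoch-2')
+ model = CanineForSequenceClassification.from_pretrained(
+     'humbleworth/domain-mlm-epoch-2',
+     num_labels=2,  # e.g. benign vs. suspicious; hypothetical label set
+ )
+
+ inputs = tokenizer(["example.com", "paypa1-login.xyz"], padding=True, return_tensors="pt")
+ with torch.no_grad():
+     probs = model(**inputs).logits.softmax(-1)  # head is newly initialized: fine-tune before use
+ print(probs)
+ ```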
 
  ### Limitations

  - Primarily trained on ASCII domain names
+ - Limited to domains up to 64 characters (training max_length)
  - Not suitable for general text understanding tasks
  - Performance on internationalized domain names (IDN) may be limited
+ - The model has learned strong biases toward common TLDs (.com, .net, .org)

  ## How to Use

  from transformers import CanineTokenizer, CanineModel, CanineConfig

  # Load tokenizer
+ tokenizer = CanineTokenizer.from_pretrained('humbleworth/domain-mlm-epoch-2')

  # Load base CANINE model
+ config = CanineConfig.from_pretrained('humbleworth/domain-mlm-epoch-2')
+ model = CanineModel.from_pretrained('humbleworth/domain-mlm-epoch-2')

  # Encode a domain
  domain = "example.com"
 

  # Load model with MLM head
  model = CanineForMaskedLM(config)
+ model.canine = CanineModel.from_pretrained('humbleworth/domain-mlm-epoch-2')

  # Load MLM head weights
  state_dict = torch.load('training_state.bin', map_location='cpu')
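+
+ Once the head weights are loaded, masked prediction could look roughly like the sketch below. `CanineForMaskedLM` is this repo's custom class, so the forward call and its return shape (logits over Unicode code points) are assumptions, not a documented API:
+
+ ```python
+ model.eval()
+
+ # Mask one character; CanineTokenizer defines a private-use-area mask character
+ domain = "exampl" + tokenizer.mask_token + ".com"
+ inputs = tokenizer(domain, return_tensors="pt")
+
+ with torch.no_grad():
+     logits = model(**inputs)  # assumed shape: [1, seq_len, vocab]
+
+ # Find the masked position and decode the argmax code point
+ mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
+ print(chr(logits[0, mask_pos].argmax().item()))  # ideally 'e'
+ ```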
 
  - **Source**: [Domains Project](https://domainsproject.org/)
  - **Character Set**: 100% ASCII (no internationalized domains)
  - **Average Length**: 15.9 characters (range: 4-77 characters)
+ - **Training/Validation Split**: 99.9% / 0.1%

  ### TLD Distribution
  - **Total Unique TLDs**: 1,274

  ## Evaluation

+ ### Perplexity Analysis
+
+ The model achieved a validation perplexity of **3.36**, which means:
+ - At each position, the model's uncertainty is equivalent to choosing uniformly among ~3.36 characters
+ - This is strong performance for domain name modeling
+ - The low perplexity reflects learned patterns such as:
+   - TLD patterns (high certainty after dots)
+   - Common domain prefixes and suffixes
+   - Valid character sequences in domain names
+
+ ### Training Progression
+ - **Initial**: Loss=2.31, Perplexity=10.08
+ - **Epoch 1**: ~4.5-5.0 perplexity (estimated)
+ - **Epoch 2**: Loss=1.21, Perplexity=3.36
+ - **Best achieved**: Perplexity=3.49 (training), 3.36 (validation)
+
+ The model appears to be approaching asymptotic performance around 3.2-3.5 perplexity, suggesting it has learned most of the learnable patterns in the domain dataset.

  ## Technical Specifications

  - PyTorch 2.0+
  - Mixed precision training (BF16)
  - Custom training loop implementation
+ - Gradient clipping: 1.0
+ - Training tracked with Weights & Biases

  ## Citation

  If you use this model, please cite:

  ```bibtex
+ @misc{domain-mlm-2025,
    title={Domain MLM: Character-Level Language Modeling for Domain Names},
    author={humbleworth},
+   year={2025},
    publisher={Hugging Face},
+   howpublished={\url{https://huggingface.co/humbleworth/domain-mlm-epoch-2}}
  }
  ```

 
  - Based on Google's CANINE-c model
  - Trained using the humbleworth/registered-domains dataset
+ - Optimized training code for NVIDIA A100 GPUs
+ - Training infrastructure provided by Lambda Labs
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:12ff2a2f7c7820d20880edd8f7521069c57ba546a67a8269c23b8ec26fdd5225
+ oid sha256:f475ef3e19f78b48fbbd7dec1c0b888f5bda4e1800f6ca5b1106f799c3acb6e9
  size 528359880