bahree committed on
Commit e175925 · verified · 1 Parent(s): bac70cd

Add model card

Files changed (1): README.md +169 -68
README.md CHANGED
@@ -20,9 +20,9 @@ tags:
 
 A compact GPT-2 Small model (~117M params) **trained from scratch** on historical London texts (1500–1850). Fast to run on CPU, and supports NVIDIA (CUDA) and AMD (ROCm) GPUs.
 
- **Note**:
- * This model was **trained from scratch** - not fine-tuned from existing models.
- * This page includes simple **virtual-env setup**, **install choices for CPU/CUDA/ROCm**, and an **auto-device inference** example so anyone can get going quickly.
 
 ---
 
@@ -31,39 +31,17 @@ A compact GPT-2 Small model (~117M params) **trained from scratch** on historica
 This is a **Small Language Model (SLM)** version of the London Historical LLM, **trained from scratch** using GPT-2 Small architecture on historical London texts with a custom historical tokenizer. The model was built from the ground up, not fine-tuned from existing models.
 
 ### Key Features
- - ~117M parameters (vs ~354M in the full model)
- - Custom historical tokenizer (≈30k vocab)
- - London-specific context awareness and historical language patterns (e.g., *thou, thee, hath*)
- - Lower memory footprint and faster inference on commodity hardware
- - **Trained from scratch** - not fine-tuned from existing models
-
- ---
-
- ## Repository
-
- The complete source code, training scripts, and documentation for this model are available on GitHub:
-
- **[https://github.com/bahree/helloLondon](https://github.com/bahree/helloLondon)**
-
- This repository includes:
- - Complete data collection pipeline for 1500-1850 historical English
- - Custom tokenizer optimized for historical text
- - Training infrastructure with GPU optimization
- - Evaluation and deployment tools
- - Comprehensive documentation and examples
-
- ### Quick Start with Repository
- ```bash
- git clone https://github.com/bahree/helloLondon.git
- cd helloLondon
- python 06_inference/test_published_models.py --model_type slm
- ```
 
 ---
 
 ## 🧪 Intended Use & Limitations
 
- **Use cases:** historical-style narrative generation, prompt-based exploration of London themes (1500–1850), creative writing aids.
 **Limitations:** may produce anachronisms or historically inaccurate statements; smaller models have less complex reasoning than larger LLMs. Validate outputs before downstream use.
 
 ---
@@ -135,7 +113,61 @@ Upgrade basics, then install Hugging Face libs:
 
 ```bash
 python -m pip install -U pip setuptools wheel
- python -m pip install "transformers[torch]" accelerate safetensors
 ```
 
 ---
@@ -170,36 +202,20 @@ outputs = model.generate(
     no_repeat_ngram_size=3,
     pad_token_id=tokenizer.eos_token_id,
     eos_token_id=tokenizer.eos_token_id,
 )
 print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 ```
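
The `no_repeat_ngram_size=3` argument in the generation call above prevents any 3-gram from appearing twice in the output. A minimal pure-Python sketch of that constraint (an illustration only, not the actual `generate` implementation; the helper name is hypothetical):

```python
# Sketch of the no_repeat_ngram_size constraint: given the tokens generated
# so far, which candidate next tokens would recreate an n-gram that has
# already appeared? Illustration only.
def banned_next_tokens(tokens, n):
    """Tokens that may not follow `tokens` without repeating an n-gram."""
    if len(tokens) < n - 1:
        return set()
    prefix = tuple(tokens[-(n - 1):])  # the last n-1 generated tokens
    banned = set()
    for i in range(len(tokens) - n + 1):
        # every earlier occurrence of `prefix` bans the token that followed it
        if tuple(tokens[i:i + n - 1]) == prefix:
            banned.add(tokens[i + n - 1])
    return banned

seq = ["the", "streets", "of", "london", "and", "the", "streets"]
print(banned_next_tokens(seq, 3))  # {'of'} - "the streets of" already occurred
```

During decoding, the sampler simply pushes the logits of these banned tokens to negative infinity before drawing the next token.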
 
- ---
-
- ## 📖 **Sample Output**
-
- **Prompt:** "In the year 1834, I walked through the streets of London and witnessed"
-
- **Generated Text:**
- > "In the year 1834, I walked through the streets of London and witnessed a scene in which some of those who had no inclination to come in contact with him took part in his discourse. It was on this occasion that I perceived that he had been engaged in some new business connected with the house, but for some days it had not taken place, nor did he appear so desirous of pursuing any further display of interest. The result was, however, that if he came in contact witli any one else in company with him he must be regarded as an old acquaintance or companion, and when he came to the point of leaving, I had no leisure to take up his abode. The same evening, having ram ##bled about the streets, I observed that the young man who had just arrived from a neighbouring village at the time, was enjoying himself at a certain hour, and I thought that he would sleep quietly until morning, when he said in a low voice — " You are coming. Miss — I have come from the West Indies. " Then my father bade me go into the shop, and bid me put on his spectacles, which he had in his hand; but he replied no: the room was empty, and he did not want to see what had passed. When I asked him the cause of all this conversation, he answered in the affirmative, and turned away, saying that as soon as the lad could recover, the sight of him might be renewed. " Well, Mr. , " said I, " you have got a little more of your wages, do you? " " No, sir, thank 'ee kindly, " returned the boy, " but we don 't want to pay the poor rates. We"
-
- ---
-
 ## 🧪 **Testing Your Model**
 
- ### **Quick Testing (Recommended First)**
 ```bash
- # Test the published model with 10 automated prompts
 python 06_inference/test_published_models.py --model_type slm
 ```
 
- **What this does:**
- - Loads model from `bahree/london-historical-slm`
- - Tests 10 historical prompts automatically
- - Shows model info (vocab size, parameters, etc.)
- - Uses SLM-optimized generation parameters
- - **No user interaction** - just runs and reports results
-
 **Expected Output:**
 ```
 🧪 Testing SLM Model: bahree/london-historical-slm
@@ -214,13 +230,10 @@ python 06_inference/test_published_models.py --model_type slm
 Max length: 512
 
 🎯 Testing generation with 10 prompts...
-
- --- Test 1/10 ---
- Prompt: In the year 1834, I walked through the streets of London and witnessed
- Generated: . a scene in which some of those who had no inclination to come in contact with him took part in his discourse . It was on this occasion that I perceived that he had been engaged in some new business connected with the house , but for some days it had not taken place , nor did he appear so desirous of pursuing any further display of interest . The result was , however , that if he came in contact witli any one else in company with him he must be regarded as an old acquaintance or companion , and when he came to the point of leaving , I had no leisure to take up his abode . The same evening , having ram ##bled about the streets , I observed that the young man who had just arrived from a neighbouring village at the time , was enjoying himself at a certain hour , and I thought that he would sleep quietly until morning , when he said in a low voice — " You are coming . Miss — I have come from the West Indies . " Then my father bade me go into the shop , and bid me put on his spectacles , which he had in his hand ; but he replied no : the room was empty , and he did not want to see what had passed . When I asked him the cause of all this conversation , he answered in the affirmative , and turned away , saying that as soon as the lad could recover , the sight of him might be renewed . " Well , Mr . , " said I , " you have got a little more of your wages , do you ? " " No , sir , thank ' ee kindly , " returned the boy , " but we don ' t want to pay the poor rates . We
 ```
 
- ### **Interactive Testing (For Exploration)**
 ```bash
 # Interactive mode for custom prompts
 python 06_inference/inference_unified.py --published --model_type slm --interactive
@@ -229,20 +242,86 @@ python 06_inference/inference_unified.py --published --model_type slm --interact
 python 06_inference/inference_unified.py --published --model_type slm --prompt "In the year 1834, I walked through the streets of London and witnessed"
 ```
 
 ---
 
 ## 🛠️ Training Details
 
- * **Architecture:** Custom GPT-2 Small (built from scratch)
- * **Parameters:** ~117M
- * **Tokenizer:** Custom historical tokenizer (~30k vocab) with London-specific and historical tokens
- * **Data:** Historical London corpus (1500-1850) with proper segmentation
- * **Steps:** 4,000 steps (early stopping for SLM)
- * **Final Training Loss:** ~3.08 (good convergence)
- * **Final Validation Loss:** ~3.67 (good generalization)
- * **Training Time:** ~1.5 hours
- * **Hardware:** 1× GPU training
- * **Training Method:** **Train from scratch** using `04_training/train_model_slm.py`
 
 ---
 
@@ -281,6 +360,28 @@ If you use this model, please cite:
 
 ---
 
 ## 🧾 License
 
- MIT (see [LICENSE](https://huggingface.co/bahree/london-historical-slm/blob/main/LICENSE) in repo).
 
 
 A compact GPT-2 Small model (~117M params) **trained from scratch** on historical London texts (1500–1850). Fast to run on CPU, and supports NVIDIA (CUDA) and AMD (ROCm) GPUs.
 
+ > **Note**: This model was **trained from scratch** - not fine-tuned from existing models.
+
+ > This page includes simple **virtual-env setup**, **install choices for CPU/CUDA/ROCm**, and an **auto-device inference** example so anyone can get going quickly.
 
 ---
 
 This is a **Small Language Model (SLM)** version of the London Historical LLM, **trained from scratch** using GPT-2 Small architecture on historical London texts with a custom historical tokenizer. The model was built from the ground up, not fine-tuned from existing models.
 
 ### Key Features
+ - ~117M parameters (vs ~354M in the full model)
+ - Custom historical tokenizer (≈30k vocab)
+ - London-specific context awareness and historical language patterns (e.g., *thou, thee, hath*)
+ - Lower memory footprint and faster inference on commodity hardware
+ - **Trained from scratch** - not fine-tuned from existing models
 
 ---
 
 ## 🧪 Intended Use & Limitations
 
+ **Use cases:** historical-style narrative generation, prompt-based exploration of London themes (1500–1850), creative writing aids.
 **Limitations:** may produce anachronisms or historically inaccurate statements; smaller models have less complex reasoning than larger LLMs. Validate outputs before downstream use.
 
 ---
 
 
 ```bash
 python -m pip install -U pip setuptools wheel
+ python -m pip install "transformers" "accelerate" "safetensors"
+ ```
+
+ ---
+
+ ## Install **one** PyTorch variant (CPU / NVIDIA / AMD)
+
+ Use **one** of the commands below. For the most accurate command per OS/accelerator and version, prefer PyTorch's **Get Started** selector.
+
+ ### A) CPU-only (Linux/Windows/macOS)
+
+ ```bash
+ pip install torch --index-url https://download.pytorch.org/whl/cpu
+ ```
+
+ ### B) NVIDIA GPU (CUDA)
+
+ Pick the CUDA series that matches your system (examples below):
+
+ ```bash
+ # CUDA 12.6
+ pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
+
+ # CUDA 12.4
+ pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
+
+ # CUDA 11.8
+ pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
+ ```
+
+ ### C) AMD GPU (ROCm, **Linux-only**)
+
+ Install the ROCm build matching your ROCm runtime (examples):
+
+ ```bash
+ # ROCm 6.3
+ pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.3
+
+ # ROCm 6.2 (incl. 6.2.x)
+ pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2.4
+
+ # ROCm 6.1
+ pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.1
+ ```
+
+ **Quick sanity check**
+
+ ```bash
+ python - <<'PY'
+ import torch
+ print("torch:", torch.__version__)
+ print("GPU available:", torch.cuda.is_available())
+ if torch.cuda.is_available():
+     print("device:", torch.cuda.get_device_name(0))
+ PY
 ```
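
Picking the right `--index-url` above is mechanical; the helper below is a hypothetical illustration of that mapping (it is not part of the repo's tooling, and the URLs should be verified against PyTorch's Get Started selector):

```python
# Hypothetical helper: map an accelerator tag to the PyTorch wheel index URL
# used in the install commands above. Illustration only.
BASE = "https://download.pytorch.org/whl/"

def wheel_index(accelerator: str) -> str:
    """Return the pip --index-url suffix for a given accelerator tag."""
    known = {
        "cpu": "cpu",
        "cuda12.6": "cu126",
        "cuda12.4": "cu124",
        "cuda11.8": "cu118",
        "rocm6.3": "rocm6.3",
        "rocm6.2": "rocm6.2.4",
        "rocm6.1": "rocm6.1",
    }
    try:
        return BASE + known[accelerator]
    except KeyError:
        raise ValueError(f"unknown accelerator: {accelerator!r}")

print(wheel_index("cuda12.6"))  # https://download.pytorch.org/whl/cu126
print(wheel_index("cpu"))       # https://download.pytorch.org/whl/cpu
```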
 
 ---
 
     no_repeat_ngram_size=3,
     pad_token_id=tokenizer.eos_token_id,
     eos_token_id=tokenizer.eos_token_id,
+     early_stopping=True,
 )
+
 print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 ```
 
 ## 🧪 **Testing Your Model**
 
+ ### **Quick Testing (10 Automated Prompts)**
 ```bash
+ # Test with 10 automated historical prompts
 python 06_inference/test_published_models.py --model_type slm
 ```
 
 **Expected Output:**
 ```
 🧪 Testing SLM Model: bahree/london-historical-slm
 Max length: 512
 
 🎯 Testing generation with 10 prompts...
+ [10 automated tests with historical text generation]
 ```
 
+ ### **Interactive Testing**
 ```bash
 # Interactive mode for custom prompts
 python 06_inference/inference_unified.py --published --model_type slm --interactive
 
 python 06_inference/inference_unified.py --published --model_type slm --prompt "In the year 1834, I walked through the streets of London and witnessed"
 ```
 
+ **Need more headroom later?** Load with 🤗 Accelerate and `device_map="auto"` to spread layers across available devices/CPU automatically.
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ model_id = "bahree/london-historical-slm"
+ tok = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
+ ```
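
Conceptually, `device_map="auto"` walks the model's modules in order and assigns each to the first device with enough remaining memory, spilling later layers to CPU. A toy pure-Python sketch of that idea (not Accelerate's actual algorithm; the layer names and sizes are made up):

```python
# Toy sketch of budgeted layer placement, in the spirit of device_map="auto".
# Illustration only: real placement also handles tied weights, buffers, etc.
def place_layers(layer_sizes, budgets):
    """Greedily assign layers (in order) to devices with remaining capacity."""
    placement, remaining = {}, dict(budgets)
    devices = list(budgets)
    d = 0
    for name, size in layer_sizes:
        while d < len(devices) and remaining[devices[d]] < size:
            d += 1  # this device is full; spill to the next one
        if d == len(devices):
            raise MemoryError(f"no room for {name}")
        placement[name] = devices[d]
        remaining[devices[d]] -= size
    return placement

layers = [("wte", 4), ("h.0", 2), ("h.1", 2), ("h.2", 2), ("lm_head", 4)]
print(place_layers(layers, {"cuda:0": 8, "cpu": 16}))
```

Accelerate's real planner additionally respects per-device `max_memory` overrides, so treat this only as intuition.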
+
+ ---
+
+ ## 🪟 Windows Terminal one-liners
+
+ **PowerShell**
+
+ ```powershell
+ python -c "from transformers import AutoTokenizer,AutoModelForCausalLM; m='bahree/london-historical-slm'; t=AutoTokenizer.from_pretrained(m); model=AutoModelForCausalLM.from_pretrained(m); p='In the year 1834, I walked through the streets of London and witnessed'; i=t(p,return_tensors='pt'); print(t.decode(model.generate(i['input_ids'],max_new_tokens=50,do_sample=True)[0],skip_special_tokens=True))"
+ ```
+
+ **Command Prompt (CMD)**
+
+ ```cmd
+ python -c "from transformers import AutoTokenizer, AutoModelForCausalLM; m='bahree/london-historical-slm'; t=AutoTokenizer.from_pretrained(m); model=AutoModelForCausalLM.from_pretrained(m); p='In the year 1834, I walked through the streets of London and witnessed'; i=t(p, return_tensors='pt'); print(t.decode(model.generate(i['input_ids'], max_new_tokens=50, do_sample=True)[0], skip_special_tokens=True))"
+ ```
+
+ ---
+
+ ## 💡 Basic Usage (Python)
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ tokenizer = AutoTokenizer.from_pretrained("bahree/london-historical-slm")
+ model = AutoModelForCausalLM.from_pretrained("bahree/london-historical-slm")
+
+ if tokenizer.pad_token is None:
+     tokenizer.pad_token = tokenizer.eos_token
+
+ prompt = "In the year 1834, I walked through the streets of London and witnessed"
+ inputs = tokenizer(prompt, return_tensors="pt")
+ outputs = model.generate(
+     inputs["input_ids"],
+     max_new_tokens=50,
+     do_sample=True,
+     temperature=0.8,
+     top_p=0.95,
+     top_k=40,
+     repetition_penalty=1.2,
+     no_repeat_ngram_size=3,
+     pad_token_id=tokenizer.pad_token_id,
+     eos_token_id=tokenizer.eos_token_id,
+     early_stopping=True,
+ )
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
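
To give intuition for the `top_k` / `top_p` arguments above, here is a minimal pure-Python sketch of the filtering step applied to the next-token distribution (toy numbers; the real logic lives inside `generate`):

```python
# Minimal sketch of top-k / top-p (nucleus) filtering over a next-token
# probability distribution. Illustration only, with made-up probabilities.
def filter_top_k_top_p(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix of them
    whose cumulative probability reaches top_p; renormalize what is kept."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    kept, total = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        total += p
        if total >= top_p:
            break
    return {tok: p / total for tok, p in kept}

probs = {"the": 0.5, "a": 0.3, "thou": 0.15, "zebra": 0.05}
print(filter_top_k_top_p(probs, top_k=3, top_p=0.9))
```

The sampler then draws the next token from this renormalized distribution, which is why low-probability outliers like "zebra" never appear.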
+
+ ---
+
+ ## 🧰 Example Prompts
+
+ * **Tudor (1558):** "On this day in 1558, Queen Mary has died and …"
+ * **Stuart (1666):** "The Great Fire of London has consumed much of the city, and …"
+ * **Georgian/Victorian:** "As I journeyed through the streets of London, I observed …"
+ * **London specifics:** "Parliament sat in Westminster Hall …", "The Thames flowed dark and mysterious …"
+
 ---
 
 ## 🛠️ Training Details
 
+ * **Architecture:** GPT-2 Small (12 layers, hidden size 768)
+ * **Params:** ~117M
+ * **Tokenizer:** custom historical tokenizer (~30k vocab) with London-specific and historical tokens
+ * **Data:** historical London corpus (1500–1850)
+ * **Steps/Epochs:** 30,000 steps (extended training for better convergence)
+ * **Batch/LR:** 32, 3e-4 (optimized for segmented data)
+ * **Hardware:** 2× GPUs with Distributed Data Parallel
+ * **Final Training Loss:** 1.395 (43% improvement over the 20K-step checkpoint)
+ * **Model FLOPs Utilization (MFU):** 3.5%
+ * **Training Method:** **trained from scratch** - not fine-tuned
+ * **Context Length:** 256 tokens (optimized for historical text segments)
+ * **Status:** ✅ **published and tested**
 
 ---
 
 ---
 
+ ## Repository
+
+ The complete source code, training scripts, and documentation for this model are available on GitHub:
+
+ **🔗 [https://github.com/bahree/helloLondon](https://github.com/bahree/helloLondon)**
+
+ This repository includes:
+ - Complete data collection pipeline for 1500-1850 historical English
+ - Custom tokenizer optimized for historical text
+ - Training infrastructure with GPU optimization
+ - Evaluation and deployment tools
+ - Comprehensive documentation and examples
+
+ ### Quick Start with Repository
+ ```bash
+ git clone https://github.com/bahree/helloLondon.git
+ cd helloLondon
+ python 06_inference/test_published_models.py --model_type slm
+ ```
+
+ ---
+
 ## 🧾 License
 
+ MIT (see [LICENSE](https://github.com/bahree/helloLondon/blob/main/LICENSE) in repo).