---
license: apache-2.0
language:
- en
base_model:
- huawei-noah/TinyBERT_General_6L_768D
---
# 🏠 TinyBERT Indian Address NER Model

This model is a fine-tuned **TinyBERT** for **Named Entity Recognition (NER)** on Indian addresses. It extracts and classifies address components such as building name, locality, city, state, and pincode from free-text Indian addresses, reaching a micro-averaged F1 of 0.94 on the reported evaluation while keeping TinyBERT's compact six-layer architecture.

## 🎯 Model Description

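The model is a token classifier: each token receives one of 23 BIO labels covering 11 address component types, and consecutive tokens with matching labels are merged into entities.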

### Key Capabilities

- **Address Component Extraction**: Identify and classify various parts of Indian addresses
- **Multi-format Support**: Handle various Indian address formats and styles
- **Lightweight Architecture**: Built on TinyBERT's efficient transformer design
- **High Accuracy**: Fine-tuned on augmented Indian address dataset
- **Fast Inference**: Optimized TinyBERT for quick entity extraction
- **Robust Recognition**: Handles partial, incomplete, or informal addresses
- **Efficient Processing**: TinyBERT's compact design for better performance
- **Mobile-Friendly**: Smaller model size suitable for edge deployment
- **Resource Efficient**: Lower memory and computational requirements

## 📊 Model Architecture

- **Base Model**: huawei-noah/TinyBERT_General_6L_768D (TinyBERT)
- **Model Type**: Token Classification (NER)
- **Vocabulary Size**: 30,522 tokens
- **Hidden Size**: 768
- **Number of Layers**: 6
- **Attention Heads**: 12
- **Max Sequence Length**: 512 tokens
- **Number of Labels**: 23
- **Model Size**: ~761MB
- **Checkpoint**: 20793

## Performance Metrics

| Average Type     | Precision | Recall | F1-Score |
| :--------------- | :-------- | :----- | :------- |
| Micro Average    | 0.93      | 0.94   | 0.94     |
| Macro Average    | 0.80      | 0.80   | 0.80     |
| Weighted Average | 0.93      | 0.94   | 0.94     |
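
The gap between the micro average (0.94) and the macro average (0.80) is what one would expect if some entity types are both rare and harder to predict: micro averaging pools counts over all entities, while macro averaging gives every type equal weight. The sketch below illustrates the difference with invented counts for three classes; the numbers are assumptions for illustration and are not this model's per-class results.

```python
# Illustrative only: these counts are invented, not taken from this model's evaluation.
counts = {
    # label: (true positives, false positives, false negatives)
    "city": (950, 40, 50),
    "pincode": (980, 10, 20),
    "floor": (30, 25, 20),  # a rare, harder class
}

def f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Micro average: pool the counts over all classes, then compute one F1.
total_tp = sum(tp for tp, _, _ in counts.values())
total_fp = sum(fp for _, fp, _ in counts.values())
total_fn = sum(fn for _, _, fn in counts.values())
micro_f1 = f1(total_tp, total_fp, total_fn)

# Macro average: compute F1 per class, then take the unweighted mean.
per_class_f1 = {label: f1(*c) for label, c in counts.items()}
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

# Weighted average: per-class F1 weighted by support (tp + fn).
support = {label: tp + fn for label, (tp, _, fn) in counts.items()}
weighted_f1 = sum(per_class_f1[label] * support[label] for label in counts) / sum(support.values())

print(f"micro={micro_f1:.3f} macro={macro_f1:.3f} weighted={weighted_f1:.3f}")
```

Because the rare class contributes few entities, it barely moves the micro and weighted averages but pulls the macro average down noticeably.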

## 🚀 Usage Examples

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
import warnings
warnings.filterwarnings("ignore")

class IndianAddressNER:
    def __init__(self):
        model_name = "shiprocket-ai/open-tinybert-indian-address-ner"
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForTokenClassification.from_pretrained(model_name)
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)
        self.model.eval()
        
        # Entity mappings
        self.id2entity = {
            "0": "O",
            "1": "B-building_name",
            "2": "I-building_name",
            "3": "B-city",
            "4": "I-city",
            "5": "B-country",
            "6": "I-country",
            "7": "B-floor",
            "8": "I-floor",
            "9": "B-house_details",
            "10": "I-house_details",
            "11": "B-locality",
            "12": "I-locality",
            "13": "B-pincode",
            "14": "I-pincode",
            "15": "B-road",
            "16": "I-road",
            "17": "B-state",
            "18": "I-state",
            "19": "B-sub_locality",
            "20": "I-sub_locality",
            "21": "B-landmarks",
            "22": "I-landmarks"
        }
    
    def predict(self, address):
        """Extract entities from an Indian address.

        Returns a dict mapping entity type to a list of {"text", "confidence"} dicts.
        """
        if not address.strip():
            return {}
        
        # Tokenize with offset mapping for better text reconstruction
        inputs = self.tokenizer(
            address, 
            return_tensors="pt", 
            truncation=True, 
            padding=True, 
            max_length=128,
            return_offsets_mapping=True
        )
        
        # Extract offset mapping before moving to device
        offset_mapping = inputs.pop("offset_mapping")[0]
        inputs = {k: v.to(self.device) for k, v in inputs.items()}
        
        # Predict
        with torch.no_grad():
            outputs = self.model(**inputs)
            predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
            predicted_ids = torch.argmax(predictions, dim=-1)
            confidence_scores = torch.max(predictions, dim=-1)[0]
        
        # Extract entities using offset mapping
        entities = self.extract_entities_with_offsets(
            address, 
            predicted_ids[0], 
            confidence_scores[0], 
            offset_mapping
        )
        
        return entities
    
    def extract_entities_with_offsets(self, original_text, predicted_ids, confidences, offset_mapping):
        """Group BIO-tagged tokens into entities, using character offsets to recover the text."""
        entities = {}
        current_entity = None

        def flush(entity):
            # Store a finished entity under its type, with the mean of its token confidences
            if entity is None:
                return
            scores = entity["scores"]
            entities.setdefault(entity["type"], []).append({
                "text": original_text[entity["start"]:entity["end"]],
                "confidence": sum(scores) / len(scores)
            })

        # zip stops at the shortest input, so no explicit length check is needed
        for pred_id, conf, (start, end) in zip(predicted_ids, confidences, offset_mapping.tolist()):
            # Skip special and padding tokens (they have a (0, 0) offset mapping)
            if start == end == 0:
                continue

            label = self.id2entity.get(str(pred_id.item()), "O")

            if label.startswith("B-"):
                # Close the previous entity, if any, and start a new one
                flush(current_entity)
                current_entity = {
                    "type": label[2:],  # Remove "B-"
                    "start": start,
                    "end": end,
                    "scores": [conf.item()]
                }

            elif label.startswith("I-") and current_entity:
                # Extend the current entity only when the type matches
                if label[2:] == current_entity["type"]:
                    current_entity["end"] = end
                    current_entity["scores"].append(conf.item())

            elif label == "O" and current_entity:
                # An "O" tag ends the current entity
                flush(current_entity)
                current_entity = None

        # Add the final entity if one is still open
        flush(current_entity)
        return entities

# Usage example
ner = IndianAddressNER()

# Test addresses
test_addresses = [
    "Shop No 123, Sunshine Apartments, Andheri West, Mumbai, 400058",
    "DLF Cyber City, Sector 25, Gurgaon, Haryana",
    "Flat 201, MG Road, Bangalore, Karnataka, 560001",
    "Phoenix Mall, Kurla West, Mumbai"
]

print("🏠 INDIAN ADDRESS NER EXAMPLES")
print("=" * 50)

for address in test_addresses:
    print(f"\nπŸ“ Address: {address}")
    entities = ner.predict(address)
    
    if entities:
        for entity_type, entity_list in sorted(entities.items()):
            print(f"🏷️ {entity_type.replace('_', ' ').title()}:")
            for entity in entity_list:
                confidence = entity['confidence']
                text = entity['text']
                confidence_icon = "🟢" if confidence > 0.8 else "🟡" if confidence > 0.6 else "🔴"
                print(f"   {confidence_icon} {text} (confidence: {confidence:.3f})")
    else:
        print("❌ No entities found")
    print("-" * 40)
```

## 🏷️ Supported Entity Types

The model can identify and extract the following address components:

- **Building Name**: building_name
- **City**: city
- **Country**: country
- **Floor**: floor
- **House Details**: house_details
- **Landmarks**: landmarks
- **Locality**: locality
- **Pincode**: pincode
- **Road**: road
- **State**: state
- **Sub Locality**: sub_locality

## 📈 Performance Highlights

- **Indian Address Optimized**: Specialized for Indian address patterns and formats
- **TinyBERT Advantage**: Efficient and lightweight transformer architecture
- **High Precision**: Accurate entity boundary detection
- **Multi-component Recognition**: Identifies multiple entities in complex addresses
- **Confidence Scoring**: Provides confidence scores for each extracted entity
- **Fast Inference**: Optimized for real-time applications
- **Robust Handling**: Works with partial or informal address formats
- **Compact Architecture**: TinyBERT's efficient design for deployment
- **Resource Friendly**: Lower computational requirements

## 🔧 Training Details

- **Dataset**: 300% augmented Indian address dataset
- **Training Strategy**: Fine-tuned from pre-trained TinyBERT
- **Specialization**: Indian address entity extraction
- **Context Length**: 128 tokens
- **Version**: v1.0
- **Framework**: PyTorch + Transformers
- **BIO Tagging**: Uses Begin-Inside-Outside tagging scheme
- **Base Model Advantage**: TinyBERT's efficient architecture and compact size
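
To make the BIO scheme concrete, the sketch below tags a pre-tokenized address from labelled token spans. The `to_bio` helper and the example annotation are illustrative assumptions, not the project's actual data-preparation code.

```python
def to_bio(tokens, spans):
    """Convert (start, end, entity_type) token spans into BIO tags.

    `end` is exclusive; tokens outside every span are tagged "O".
    """
    tags = ["O"] * len(tokens)
    for start, end, entity_type in spans:
        tags[start] = f"B-{entity_type}"      # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{entity_type}"      # remaining tokens of the entity
    return tags

# Hypothetical annotation for illustration
tokens = ["Flat", "201", ",", "MG", "Road", ",", "Bangalore"]
spans = [(0, 2, "house_details"), (3, 5, "road"), (6, 7, "city")]

for token, tag in zip(tokens, to_bio(tokens, spans)):
    print(f"{token:10} {tag}")
```

At inference time the process runs in reverse: a `B-` tag opens an entity, matching `I-` tags extend it, and an `O` tag or a new `B-` tag closes it.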

## 💡 Use Cases

### 1. **Address Parsing & Standardization**
- Parse unstructured address text into components
- Standardize address formats for databases
- Extract specific components for validation
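
As a minimal sketch of the parsing step, the helper below flattens the dictionary returned by `IndianAddressNER.predict` (from the usage example above) into one value per component by keeping the highest-confidence candidate. The `to_record` name, the field order, and the sample prediction are assumptions for illustration; the scores are hand-written, not real model output.

```python
# Assumed column order for a structured address record
FIELDS = [
    "house_details", "floor", "building_name", "road", "landmarks",
    "sub_locality", "locality", "city", "state", "pincode", "country",
]

def to_record(entities):
    """Keep the most confident candidate per entity type; missing types map to None."""
    record = {}
    for field in FIELDS:
        candidates = entities.get(field, [])
        best = max(candidates, key=lambda e: e["confidence"], default=None)
        record[field] = best["text"].strip() if best else None
    return record

# Hand-written sample in the same shape as predict() output (not real model output)
sample = {
    "house_details": [{"text": "Flat 201", "confidence": 0.97}],
    "road": [{"text": "MG Road", "confidence": 0.95}],
    "city": [
        {"text": "Bangalore", "confidence": 0.99},
        {"text": "Karnataka", "confidence": 0.41},
    ],
    "pincode": [{"text": "560001", "confidence": 0.99}],
}

record = to_record(sample)
print(record["city"], record["pincode"], record["state"])
```

Keeping only the top candidate is a simplification; a pipeline that needs to preserve ambiguity could store all candidates instead.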

### 2. **Form Auto-completion**
- Auto-fill address forms by extracting components
- Validate address field completeness
- Suggest corrections for incomplete addresses

### 3. **Data Processing & Migration**
- Clean legacy address databases
- Extract structured data from unstructured text
- Migrate addresses between different systems

### 4. **Logistics & Delivery**
- Extract delivery-relevant components
- Validate address completeness for shipping
- Improve address accuracy for last-mile delivery
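
A completeness check for shipping could look like the sketch below. The set of required components is an assumption and should be adapted to your own delivery rules; the PIN code pattern reflects the standard six-digit format whose first digit is never zero. The samples are hand-written, not real model output.

```python
import re

# Assumed minimum for a deliverable address; adjust to your own requirements
REQUIRED = ("city", "pincode")
PINCODE_RE = re.compile(r"[1-9][0-9]{5}")

def check_completeness(entities):
    """Return a list of problems found in predict()-style output; an empty list means OK."""
    problems = [f"missing {name}" for name in REQUIRED if not entities.get(name)]
    for candidate in entities.get("pincode", []):
        # Tolerate "400 058"-style spacing before validating
        digits = candidate["text"].replace(" ", "")
        if not PINCODE_RE.fullmatch(digits):
            problems.append(f"invalid pincode: {candidate['text']!r}")
    return problems

# Hand-written samples (not real model output)
ok = {
    "city": [{"text": "Mumbai", "confidence": 0.98}],
    "pincode": [{"text": "400058", "confidence": 0.99}],
}
bad = {
    "locality": [{"text": "Kurla West", "confidence": 0.90}],
    "pincode": [{"text": "40058", "confidence": 0.70}],
}

print(check_completeness(ok))
print(check_completeness(bad))
```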

### 5. **Geocoding Preprocessing**
- Prepare addresses for geocoding APIs
- Extract location components for mapping
- Improve geocoding accuracy with clean components

### 6. **Mobile & Edge Deployment**
- Deploy on mobile devices with limited resources
- Run inference on edge computing devices
- Integrate into lightweight applications

## ⚡ Performance Tips

1. **Input Length**: Keep addresses under 128 tokens for optimal performance
2. **Batch Processing**: Process multiple addresses in batches for efficiency
3. **GPU Usage**: Use GPU for faster inference on large datasets
4. **Confidence Filtering**: Filter results by confidence score for higher precision
5. **Text Preprocessing**: Clean input text for better recognition
6. **TinyBERT Advantage**: Model benefits from efficient architecture optimizations
7. **Edge Deployment**: Suitable for mobile and edge computing scenarios
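
Tips 2 and 4 can be sketched with two small helpers. `chunked` yields fixed-size batches; each batch could then be passed as a list to the tokenizer with `padding=True` so that one forward pass covers several addresses. `filter_by_confidence` drops low-confidence entities from `predict`-style output. Both helper names, the batch size, and the 0.8 threshold are illustrative assumptions rather than recommended settings.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def filter_by_confidence(entities, threshold=0.8):
    """Keep only entities whose confidence is at least `threshold`."""
    filtered = {}
    for entity_type, candidates in entities.items():
        kept = [e for e in candidates if e["confidence"] >= threshold]
        if kept:
            filtered[entity_type] = kept
    return filtered

addresses = [
    "Shop No 123, Sunshine Apartments, Andheri West, Mumbai, 400058",
    "DLF Cyber City, Sector 25, Gurgaon, Haryana",
    "Phoenix Mall, Kurla West, Mumbai",
]
batches = list(chunked(addresses, 2))
print([len(batch) for batch in batches])

# Hand-written sample (not real model output)
sample = {
    "city": [{"text": "Mumbai", "confidence": 0.98}],
    "landmarks": [{"text": "Phoenix Mall", "confidence": 0.55}],
}
print(filter_by_confidence(sample))
```

Raising the threshold trades recall for precision, so it is worth choosing the value on a held-out sample of your own addresses.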

## ⚠️ Limitations

- **Language Support**: Primarily optimized for English Indian addresses
- **Regional Variations**: May struggle with highly regional or colloquial formats
- **New Localities**: Performance may vary on very recent developments
- **Complex Formatting**: May have difficulty with highly unstructured text
- **Context Dependency**: Works best with clear address context

## 📋 Entity Mapping

The model uses BIO (Begin-Inside-Outside) tagging scheme:

```json
{
  "entity2id": {
    "O": 0,
    "B-building_name": 1,
    "I-building_name": 2,
    "B-city": 3,
    "I-city": 4,
    "B-country": 5,
    "I-country": 6,
    "B-floor": 7,
    "I-floor": 8,
    "B-house_details": 9,
    "I-house_details": 10,
    "B-locality": 11,
    "I-locality": 12,
    "B-pincode": 13,
    "I-pincode": 14,
    "B-road": 15,
    "I-road": 16,
    "B-state": 17,
    "I-state": 18,
    "B-sub_locality": 19,
    "I-sub_locality": 20,
    "B-landmarks": 21,
    "I-landmarks": 22
  },
  "id2entity": {
    "0": "O",
    "1": "B-building_name",
    "2": "I-building_name",
    "3": "B-city",
    "4": "I-city",
    "5": "B-country",
    "6": "I-country",
    "7": "B-floor",
    "8": "I-floor",
    "9": "B-house_details",
    "10": "I-house_details",
    "11": "B-locality",
    "12": "I-locality",
    "13": "B-pincode",
    "14": "I-pincode",
    "15": "B-road",
    "16": "I-road",
    "17": "B-state",
    "18": "I-state",
    "19": "B-sub_locality",
    "20": "I-sub_locality",
    "21": "B-landmarks",
    "22": "I-landmarks"
  }
}
```
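
The mapping follows a regular pattern: `O` is 0, and the entity type at position `i` in the list gets its `B-` label at `2i + 1` and its `I-` label at `2i + 2`. The sketch below rebuilds both dictionaries from the ordered type list as an illustration; in practice, loading `entity_mappings.json` from the repository is the safer source of truth.

```python
# Entity types in the order implied by the mapping above
ENTITY_TYPES = [
    "building_name", "city", "country", "floor", "house_details", "locality",
    "pincode", "road", "state", "sub_locality", "landmarks",
]

entity2id = {"O": 0}
for i, entity_type in enumerate(ENTITY_TYPES):
    entity2id[f"B-{entity_type}"] = 2 * i + 1   # beginning of an entity
    entity2id[f"I-{entity_type}"] = 2 * i + 2   # inside (continuation of) an entity

# Keys are strings to match the JSON file, where object keys are always strings
id2entity = {str(idx): label for label, idx in entity2id.items()}

print(len(entity2id), entity2id["B-landmarks"], id2entity["14"])
```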

## 📋 Model Files

- `config.json`: Model configuration and hyperparameters  
- `pytorch_model.bin` / `model.safetensors`: Model weights
- `tokenizer.json`: Tokenizer configuration
- `tokenizer_config.json`: Tokenizer settings
- `vocab.txt`: Vocabulary file
- `entity_mappings.json`: Entity type mappings

## 🔄 Model Updates

- **Version**: v1.0 (Checkpoint 20793)
- **Last Updated**: 2025-06-19
- **Training Completion**: Based on augmented Indian address dataset
- **Base Model**: TinyBERT for efficient transformer architecture

## 📚 Citation

If you use this model in your research or applications, please cite:

```bibtex
@misc{open-tinybert-indian-address-ner,
  title={TinyBERT Indian Address NER Model},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/shiprocket-ai/open-tinybert-indian-address-ner}
}
```

## 📞 Support & Contact

For questions, issues, or feature requests:
- Open an issue in this repository
- Contact: shiprocket-ai team
- Documentation: See usage examples above

## 📜 License

This model is released under the Apache 2.0 License. See the LICENSE file for details.

---

*Specialized for Indian address entity recognition - Built with ❤️ by shiprocket-ai team using TinyBERT*