---
license: apache-2.0
language:
- en
base_model:
- huawei-noah/TinyBERT_General_6L_768D
---
# 🏠 TinyBERT Indian Address NER Model
This model is a fine-tuned **TinyBERT** for **Named Entity Recognition (NER)** on Indian addresses. It can extract and classify various address components from Indian address text with high accuracy, leveraging TinyBERT's efficient and lightweight architecture.
## 🎯 Model Description
TinyBERT (6 transformer layers, hidden size 768) fine-tuned for token-level Named Entity Recognition on Indian addresses, predicting 23 BIO labels.
### Key Capabilities
- **Address Component Extraction**: Identifies and classifies the parts of an Indian address
- **Multi-format Support**: Handles varied Indian address formats and styles
- **Robust Recognition**: Works on partial, incomplete, or informal addresses
- **High Accuracy**: Fine-tuned on an augmented Indian address dataset
- **Fast Inference**: TinyBERT's compact 6-layer design enables quick entity extraction
- **Resource Efficient**: Low memory and compute requirements, suitable for mobile and edge deployment
## 📊 Model Architecture
- **Base Model**: huawei-noah/TinyBERT_General_6L_768D (TinyBERT)
- **Model Type**: Token Classification (NER)
- **Vocabulary Size**: 30,522 tokens
- **Hidden Size**: 768
- **Number of Layers**: 6
- **Attention Heads**: 12
- **Max Sequence Length**: 512 tokens
- **Number of Labels**: 23
- **Model Size**: ~761MB
- **Checkpoint**: 20793
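
These numbers can be checked programmatically. A minimal sanity-check sketch, assuming the checkpoint's `config.json` carries the standard BERT fields and a populated `id2label` map:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("shiprocket-ai/open-tinybert-indian-address-ner")
print(config.hidden_size)          # expected: 768
print(config.num_hidden_layers)    # expected: 6
print(config.num_attention_heads)  # expected: 12
print(config.vocab_size)           # expected: 30522
print(len(config.id2label))        # expected: 23, if id2label is set in config.json
```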
## Performance Metrics
| Average Type | Precision | Recall | F1-Score |
| :-------------- | :-------- | :----- | :------- |
| Micro Average | 0.93 | 0.94 | 0.94 |
| Macro Average | 0.80 | 0.80 | 0.80 |
| Weighted Average | 0.93 | 0.94 | 0.94 |
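
For reference, span-level scores like these are typically computed with `seqeval`, which matches whole BIO-tagged spans rather than individual tokens. A minimal illustration on toy sequences (not the actual evaluation data):

```python
from seqeval.metrics import classification_report

# Toy gold and predicted BIO sequences, for illustration only
y_true = [["B-locality", "I-locality", "O", "B-city", "B-pincode"]]
y_pred = [["B-locality", "I-locality", "O", "B-city", "O"]]

# Prints per-entity precision/recall/F1 plus micro, macro, and weighted averages
print(classification_report(y_true, y_pred, digits=2))
```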
## 🚀 Usage Examples
```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
import warnings
warnings.filterwarnings("ignore")

class IndianAddressNER:
    def __init__(self):
        model_name = "shiprocket-ai/open-tinybert-indian-address-ner"
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForTokenClassification.from_pretrained(model_name)
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)
        self.model.eval()

        # Entity mappings (label id -> BIO tag)
        self.id2entity = {
            "0": "O",
            "1": "B-building_name",
            "2": "I-building_name",
            "3": "B-city",
            "4": "I-city",
            "5": "B-country",
            "6": "I-country",
            "7": "B-floor",
            "8": "I-floor",
            "9": "B-house_details",
            "10": "I-house_details",
            "11": "B-locality",
            "12": "I-locality",
            "13": "B-pincode",
            "14": "I-pincode",
            "15": "B-road",
            "16": "I-road",
            "17": "B-state",
            "18": "I-state",
            "19": "B-sub_locality",
            "20": "I-sub_locality",
            "21": "B-landmarks",
            "22": "I-landmarks"
        }

    def predict(self, address):
        """Extract entities from an Indian address."""
        if not address.strip():
            return {}

        # Tokenize with offset mapping so entity text can be recovered
        # directly from the original string
        inputs = self.tokenizer(
            address,
            return_tensors="pt",
            truncation=True,
            padding=True,
            max_length=128,
            return_offsets_mapping=True
        )

        # Extract the offset mapping before moving tensors to the device
        offset_mapping = inputs.pop("offset_mapping")[0]
        inputs = {k: v.to(self.device) for k, v in inputs.items()}

        # Predict
        with torch.no_grad():
            outputs = self.model(**inputs)
            predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
            predicted_ids = torch.argmax(predictions, dim=-1)
            confidence_scores = torch.max(predictions, dim=-1)[0]

        # Decode entities using the offset mapping
        return self.extract_entities_with_offsets(
            address,
            predicted_ids[0],
            confidence_scores[0],
            offset_mapping
        )

    def extract_entities_with_offsets(self, original_text, predicted_ids, confidences, offset_mapping):
        """Extract entities using offset mapping for accurate text reconstruction."""
        entities = {}
        current_entity = None

        def save(entity):
            # Append a finished span to the per-type result list
            entities.setdefault(entity["type"], []).append({
                "text": entity["text"],
                "confidence": entity["confidence"]
            })

        for i, (pred_id, conf) in enumerate(zip(predicted_ids, confidences)):
            if i >= len(offset_mapping):
                break
            start, end = offset_mapping[i].tolist()

            # Skip special tokens (they have a (0, 0) offset mapping)
            if start == end == 0:
                continue

            label = self.id2entity.get(str(pred_id.item()), "O")

            if label.startswith("B-"):
                # Save the previous entity, then start a new one
                if current_entity:
                    save(current_entity)
                current_entity = {
                    "type": label[2:],  # strip the "B-" prefix
                    "text": original_text[start:end],
                    "confidence": conf.item(),
                    "start": start,
                    "end": end
                }
            elif label.startswith("I-") and current_entity:
                # Continue the current entity if the types match;
                # mismatched I- tokens are skipped
                if label[2:] == current_entity["type"]:
                    # Extend the span and keep a running average confidence
                    current_entity["text"] = original_text[current_entity["start"]:end]
                    current_entity["confidence"] = (current_entity["confidence"] + conf.item()) / 2
                    current_entity["end"] = end
            elif label == "O" and current_entity:
                # End the current entity
                save(current_entity)
                current_entity = None

        # Save the final entity if one is still open
        if current_entity:
            save(current_entity)
        return entities


# Usage example
ner = IndianAddressNER()

# Test addresses
test_addresses = [
    "Shop No 123, Sunshine Apartments, Andheri West, Mumbai, 400058",
    "DLF Cyber City, Sector 25, Gurgaon, Haryana",
    "Flat 201, MG Road, Bangalore, Karnataka, 560001",
    "Phoenix Mall, Kurla West, Mumbai"
]

print("🏠 INDIAN ADDRESS NER EXAMPLES")
print("=" * 50)

for address in test_addresses:
    print(f"\n📍 Address: {address}")
    entities = ner.predict(address)
    if entities:
        for entity_type, entity_list in sorted(entities.items()):
            print(f"🏷️ {entity_type.replace('_', ' ').title()}:")
            for entity in entity_list:
                confidence = entity["confidence"]
                text = entity["text"]
                confidence_icon = "🟢" if confidence > 0.8 else "🟡" if confidence > 0.6 else "🔴"
                print(f"   {confidence_icon} {text} (confidence: {confidence:.3f})")
    else:
        print("❌ No entities found")
    print("-" * 40)
```
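
If the checkpoint's config includes the `id2label` map, the same extraction can be done in a few lines with the `transformers` pipeline, where word-level aggregation replaces the manual BIO decoding above:

```python
from transformers import pipeline

# aggregation_strategy="simple" merges B-/I- sub-tokens into whole spans
ner = pipeline(
    "token-classification",
    model="shiprocket-ai/open-tinybert-indian-address-ner",
    aggregation_strategy="simple",
)
print(ner("Flat 201, MG Road, Bangalore, Karnataka, 560001"))
```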
## 🏷️ Supported Entity Types
The model can identify and extract the following address components:
- **Building Name**: building_name
- **City**: city
- **Country**: country
- **Floor**: floor
- **House Details**: house_details
- **Landmarks**: landmarks
- **Locality**: locality
- **Pincode**: pincode
- **Road**: road
- **State**: state
- **Sub Locality**: sub_locality
## 📈 Performance Highlights
- **Indian Address Optimized**: Specialized for Indian address patterns and formats
- **High Precision**: Accurate entity boundary detection (0.93 micro-average precision)
- **Multi-component Recognition**: Identifies multiple entities in complex addresses
- **Confidence Scoring**: Provides a confidence score for each extracted entity
- **Fast Inference**: Suited to real-time applications
- **Robust Handling**: Works with partial or informal address formats
- **Compact and Resource Friendly**: TinyBERT's 6-layer design keeps memory and compute requirements low
## 🔧 Training Details
- **Dataset**: 300% augmented Indian address dataset
- **Training Strategy**: Fine-tuned from pre-trained TinyBERT
- **Specialization**: Indian address entity extraction
- **Context Length**: 128 tokens
- **Version**: v1.0
- **Framework**: PyTorch + Transformers
- **BIO Tagging**: Uses Begin-Inside-Outside tagging scheme
- **Base Model Advantage**: TinyBERT's efficient architecture and compact size
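
The training data itself is not published, but a run of this shape can be sketched with the standard `transformers` token-classification setup. A hypothetical reproduction sketch; the dataset layout, hyperparameters, and commented-out lines are illustrative assumptions, not the actual recipe:

```python
# Assumes a dataset where each example has "tokens" (list of words) and
# "ner_tags" (per-word label ids following the 23-label BIO scheme below).
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification,
                          TrainingArguments, Trainer)

base = "huawei-noah/TinyBERT_General_6L_768D"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForTokenClassification.from_pretrained(base, num_labels=23)

def tokenize_and_align(example):
    enc = tokenizer(example["tokens"], is_split_into_words=True,
                    truncation=True, max_length=128)
    labels, prev = [], None
    for word_id in enc.word_ids():
        if word_id is None:
            labels.append(-100)   # ignore special tokens in the loss
        elif word_id != prev:
            labels.append(example["ner_tags"][word_id])
        else:
            labels.append(-100)   # score only the first sub-token of a word
        prev = word_id
    enc["labels"] = labels
    return enc

# train_dataset = raw_dataset.map(tokenize_and_align)   # hypothetical dataset
args = TrainingArguments(output_dir="tinybert-address-ner",
                         per_device_train_batch_size=32,  # illustrative values
                         num_train_epochs=3)
# Trainer(model=model, args=args, train_dataset=train_dataset,
#         data_collator=DataCollatorForTokenClassification(tokenizer)).train()
```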
## 💡 Use Cases
### 1. **Address Parsing & Standardization**
- Parse unstructured address text into components
- Standardize address formats for databases
- Extract specific components for validation
### 2. **Form Auto-completion**
- Auto-fill address forms by extracting components
- Validate address field completeness
- Suggest corrections for incomplete addresses
### 3. **Data Processing & Migration**
- Clean legacy address databases
- Extract structured data from unstructured text
- Migrate addresses between different systems
### 4. **Logistics & Delivery**
- Extract delivery-relevant components
- Validate address completeness for shipping
- Improve address accuracy for last-mile delivery
### 5. **Geocoding Preprocessing**
- Prepare addresses for geocoding APIs
- Extract location components for mapping
- Improve geocoding accuracy with clean components
### 6. **Mobile & Edge Deployment**
- Deploy on mobile devices with limited resources
- Run inference on edge computing devices
- Integrate into lightweight applications
## ⚡ Performance Tips
1. **Input Length**: Keep addresses under 128 tokens; longer inputs are truncated
2. **Batch Processing**: Process multiple addresses per forward pass for throughput (see the sketch after this list)
3. **GPU Usage**: Use a GPU for faster inference on large datasets
4. **Confidence Filtering**: Filter results by confidence score for higher precision
5. **Text Preprocessing**: Clean input text for better recognition
6. **Edge Deployment**: TinyBERT's compact architecture suits mobile and edge computing scenarios
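
A minimal sketch combining tips 2 and 4, reusing the `IndianAddressNER` class from the usage example: it batches several addresses into one forward pass, then drops spans below a confidence threshold (the 0.6 cutoff is an arbitrary illustration):

```python
import torch

def predict_batch(ner, addresses, min_confidence=0.6):
    # Pad the whole list into one tensor; padding tokens get a (0, 0) offset
    # mapping, so the existing decoder skips them automatically
    enc = ner.tokenizer(addresses, return_tensors="pt", truncation=True,
                        padding=True, max_length=128,
                        return_offsets_mapping=True)
    offset_mappings = enc.pop("offset_mapping")
    enc = {k: v.to(ner.device) for k, v in enc.items()}

    with torch.no_grad():
        probs = torch.nn.functional.softmax(ner.model(**enc).logits, dim=-1)
    pred_ids = probs.argmax(dim=-1)
    confs = probs.max(dim=-1).values

    results = []
    for i, address in enumerate(addresses):
        entities = ner.extract_entities_with_offsets(
            address, pred_ids[i], confs[i], offset_mappings[i])
        # Confidence filtering (tip 4): keep only spans above the threshold
        filtered = {etype: [e for e in spans if e["confidence"] >= min_confidence]
                    for etype, spans in entities.items()}
        results.append({k: v for k, v in filtered.items() if v})
    return results
```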
## ⚠️ Limitations
- **Language Support**: Primarily optimized for English Indian addresses
- **Regional Variations**: May struggle with highly regional or colloquial formats
- **New Localities**: Performance may vary on recently developed localities not represented in training data
- **Complex Formatting**: May have difficulty with highly unstructured text
- **Context Dependency**: Works best with clear address context
## 📋 Entity Mapping
The model uses the BIO (Begin-Inside-Outside) tagging scheme:
```json
{
  "entity2id": {
    "O": 0,
    "B-building_name": 1,
    "I-building_name": 2,
    "B-city": 3,
    "I-city": 4,
    "B-country": 5,
    "I-country": 6,
    "B-floor": 7,
    "I-floor": 8,
    "B-house_details": 9,
    "I-house_details": 10,
    "B-locality": 11,
    "I-locality": 12,
    "B-pincode": 13,
    "I-pincode": 14,
    "B-road": 15,
    "I-road": 16,
    "B-state": 17,
    "I-state": 18,
    "B-sub_locality": 19,
    "I-sub_locality": 20,
    "B-landmarks": 21,
    "I-landmarks": 22
  },
  "id2entity": {
    "0": "O",
    "1": "B-building_name",
    "2": "I-building_name",
    "3": "B-city",
    "4": "I-city",
    "5": "B-country",
    "6": "I-country",
    "7": "B-floor",
    "8": "I-floor",
    "9": "B-house_details",
    "10": "I-house_details",
    "11": "B-locality",
    "12": "I-locality",
    "13": "B-pincode",
    "14": "I-pincode",
    "15": "B-road",
    "16": "I-road",
    "17": "B-state",
    "18": "I-state",
    "19": "B-sub_locality",
    "20": "I-sub_locality",
    "21": "B-landmarks",
    "22": "I-landmarks"
  }
}
```
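
To make the scheme concrete, here is how the tags could line up with tokens for one address (a hand-written illustration, not actual model output):

```python
tokens = ["Andheri", "West", ",", "Mumbai", ",", "400058"]
tags   = ["B-locality", "I-locality", "O", "B-city", "O", "B-pincode"]
# "B-" opens a span, "I-" continues it, "O" marks tokens outside any entity:
# locality -> "Andheri West", city -> "Mumbai", pincode -> "400058"
```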
## 📋 Model Files
- `config.json`: Model configuration and hyperparameters
- `pytorch_model.bin` / `model.safetensors`: Model weights
- `tokenizer.json`: Tokenizer configuration
- `tokenizer_config.json`: Tokenizer settings
- `vocab.txt`: Vocabulary file
- `entity_mappings.json`: Entity type mappings
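
Individual files such as `entity_mappings.json` can be fetched directly from the Hub, assuming they sit at the repository root:

```python
import json
from huggingface_hub import hf_hub_download

# Download just the entity mapping file into the local Hub cache
path = hf_hub_download(
    repo_id="shiprocket-ai/open-tinybert-indian-address-ner",
    filename="entity_mappings.json",
)
with open(path) as f:
    mappings = json.load(f)
print(mappings["id2entity"]["3"])  # "B-city"
```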
## 🔄 Model Updates
- **Version**: v1.0 (Checkpoint 20793)
- **Last Updated**: 2025-06-19
- **Training Data**: Augmented Indian address dataset
- **Base Model**: huawei-noah/TinyBERT_General_6L_768D
## 📚 Citation
If you use this model in your research or applications, please cite:
```bibtex
@misc{open-tinybert-indian-address-ner,
  title={TinyBERT Indian Address NER Model},
  author={shiprocket-ai},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/shiprocket-ai/open-tinybert-indian-address-ner}
}
```
## 📞 Support & Contact
For questions, issues, or feature requests:
- Open an issue in this repository
- Contact: shiprocket-ai team
- Documentation: See usage examples above
## 📜 License
This model is released under the Apache 2.0 License. See LICENSE file for details.
---
*Specialized for Indian address entity recognition - Built with ❤️ by the shiprocket-ai team using TinyBERT*