File size: 14,156 Bytes

---
license: apache-2.0
language:
- en
base_model:
- huawei-noah/TinyBERT_General_6L_768D
---
# 🏠 TinyBERT Indian Address NER Model

This model is a fine-tuned **TinyBERT** for **Named Entity Recognition (NER)** on Indian addresses. It can extract and classify various address components from Indian address text with high accuracy, leveraging TinyBERT's efficient and lightweight architecture.

## 🎯 Model Description

TinyBERT fine-tuned for Indian address Named Entity Recognition (NER)

### Key Capabilities

- **Address Component Extraction**: Identify and classify various parts of Indian addresses
- **Multi-format Support**: Handle various Indian address formats and styles
- **Lightweight Architecture**: Built on TinyBERT's efficient transformer design
- **High Accuracy**: Fine-tuned on augmented Indian address dataset
- **Fast Inference**: Optimized TinyBERT for quick entity extraction
- **Robust Recognition**: Handles partial, incomplete, or informal addresses
- **Efficient Processing**: TinyBERT's compact design for better performance
- **Mobile-Friendly**: Smaller model size suitable for edge deployment
- **Resource Efficient**: Lower memory and computational requirements

## 📊 Model Architecture

- **Base Model**: huawei-noah/TinyBERT_General_6L_768D (TinyBERT)
- **Model Type**: Token Classification (NER)
- **Vocabulary Size**: 30,522 tokens
- **Hidden Size**: 768
- **Number of Layers**: 6
- **Attention Heads**: 12
- **Max Sequence Length**: 512 tokens
- **Number of Labels**: 23
- **Model Size**: ~761MB
- **Checkpoint**: 20793

## Performance Metrics
| Average Type    | Precision | Recall | F1-Score |
| :-------------- | :-------- | :----- | :------- |
| Micro Average   | 0.93      | 0.94   | 0.94     |
| Macro Average   | 0.80      | 0.80   | 0.80     |
| Weighted Average | 0.93      | 0.94   | 0.94     |

## 🚀 Usage Examples

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
import warnings
warnings.filterwarnings("ignore")

class IndianAddressNER:
    def __init__(self):
        model_name = "shiprocket-ai/open-tinybert-indian-address-ner"
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForTokenClassification.from_pretrained(model_name)
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)
        self.model.eval()
        
        # Entity mappings
        self.id2entity = {
        "0": "O",
        "1": "B-building_name",
        "2": "I-building_name",
        "3": "B-city",
        "4": "I-city",
        "5": "B-country",
        "6": "I-country",
        "7": "B-floor",
        "8": "I-floor",
        "9": "B-house_details",
        "10": "I-house_details",
        "11": "B-locality",
        "12": "I-locality",
        "13": "B-pincode",
        "14": "I-pincode",
        "15": "B-road",
        "16": "I-road",
        "17": "B-state",
        "18": "I-state",
        "19": "B-sub_locality",
        "20": "I-sub_locality",
        "21": "B-landmarks",
        "22": "I-landmarks"
}
    
    def predict(self, address):
        """Extract entities from an Indian address - FIXED VERSION"""
        if not address.strip():
            return {}
        
        # Tokenize with offset mapping for better text reconstruction
        inputs = self.tokenizer(
            address, 
            return_tensors="pt", 
            truncation=True, 
            padding=True, 
            max_length=128,
            return_offsets_mapping=True
        )
        
        # Extract offset mapping before moving to device
        offset_mapping = inputs.pop("offset_mapping")[0]
        inputs = {k: v.to(self.device) for k, v in inputs.items()}
        
        # Predict
        with torch.no_grad():
            outputs = self.model(**inputs)
            predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
            predicted_ids = torch.argmax(predictions, dim=-1)
            confidence_scores = torch.max(predictions, dim=-1)[0]
        
        # Extract entities using offset mapping
        entities = self.extract_entities_with_offsets(
            address, 
            predicted_ids[0], 
            confidence_scores[0], 
            offset_mapping
        )
        
        return entities
    
    def extract_entities_with_offsets(self, original_text, predicted_ids, confidences, offset_mapping):
        """Extract entities using offset mapping for accurate text reconstruction"""
        entities = {}
        current_entity = None
        
        for i, (pred_id, conf) in enumerate(zip(predicted_ids, confidences)):
            if i >= len(offset_mapping):
                break
                
            start, end = offset_mapping[i]
            
            # Skip special tokens (they have (0,0) mapping)
            if start == end == 0:
                continue
                
            label = self.id2entity.get(str(pred_id.item()), "O")
            
            if label.startswith("B-"):
                # Save previous entity
                if current_entity:
                    entity_type = current_entity["type"]
                    if entity_type not in entities:
                        entities[entity_type] = []
                    entities[entity_type].append({
                        "text": current_entity["text"],
                        "confidence": current_entity["confidence"]
                    })
                
                # Start new entity
                entity_type = label[2:]  # Remove "B-"
                current_entity = {
                    "type": entity_type,
                    "text": original_text[start:end],
                    "confidence": conf.item(),
                    "start": start,
                    "end": end
                }
            
            elif label.startswith("I-") and current_entity:
                # Continue current entity
                entity_type = label[2:]  # Remove "I-"
                if entity_type == current_entity["type"]:
                    # Extend the entity to include this token
                    current_entity["text"] = original_text[current_entity["start"]:end]
                    current_entity["confidence"] = (current_entity["confidence"] + conf.item()) / 2
                    current_entity["end"] = end
            
            elif label == "O" and current_entity:
                # End current entity
                entity_type = current_entity["type"]
                if entity_type not in entities:
                    entities[entity_type] = []
                entities[entity_type].append({
                    "text": current_entity["text"],
                    "confidence": current_entity["confidence"]
                })
                current_entity = None
        
        # Add final entity if exists
        if current_entity:
            entity_type = current_entity["type"]
            if entity_type not in entities:
                entities[entity_type] = []
            entities[entity_type].append({
                "text": current_entity["text"],
                "confidence": current_entity["confidence"]
            })
        
        return entities

# Usage example
ner = IndianAddressNER()

# Test addresses
test_addresses = [
    "Shop No 123, Sunshine Apartments, Andheri West, Mumbai, 400058",
    "DLF Cyber City, Sector 25, Gurgaon, Haryana",
    "Flat 201, MG Road, Bangalore, Karnataka, 560001",
    "Phoenix Mall, Kurla West, Mumbai"
]

print("🏠 INDIAN ADDRESS NER EXAMPLES")
print("=" * 50)

for address in test_addresses:
    print(f"\n📍 Address: {address}")
    entities = ner.predict(address)
    
    if entities:
        for entity_type, entity_list in sorted(entities.items()):
            print(f"🏷️ {entity_type.replace('_', ' ').title()}:")
            for entity in entity_list:
                confidence = entity['confidence']
                text = entity['text']
                confidence_icon = "🟢" if confidence > 0.8 else "🟡" if confidence > 0.6 else "🔴"
                print(f"   {confidence_icon} {text} (confidence: {confidence:.3f})")
    else:
        print("❌ No entities found")
    print("-" * 40)
```

## 🏷️ Supported Entity Types

The model can identify and extract the following address components:

- **Building Name**: building_name
- **City**: city
- **Country**: country
- **Floor**: floor
- **House Details**: house_details
- **Landmarks**: landmarks
- **Locality**: locality
- **Pincode**: pincode
- **Road**: road
- **State**: state
- **Sub Locality**: sub_locality

## 📈 Performance Highlights

- **Indian Address Optimized**: Specialized for Indian address patterns and formats
- **TinyBERT Advantage**: Efficient and lightweight transformer architecture
- **High Precision**: Accurate entity boundary detection
- **Multi-component Recognition**: Identifies multiple entities in complex addresses
- **Confidence Scoring**: Provides confidence scores for each extracted entity
- **Fast Inference**: Optimized for real-time applications
- **Robust Handling**: Works with partial or informal address formats
- **Compact Architecture**: TinyBERT's efficient design for deployment
- **Resource Friendly**: Lower computational requirements

## 🔧 Training Details

- **Dataset**: 300% augmented Indian address dataset
- **Training Strategy**: Fine-tuned from pre-trained TinyBERT
- **Specialization**: Indian address entity extraction
- **Context Length**: 128 tokens
- **Version**: v1.0
- **Framework**: PyTorch + Transformers
- **BIO Tagging**: Uses Begin-Inside-Outside tagging scheme
- **Base Model Advantage**: TinyBERT's efficient architecture and compact size

## 💡 Use Cases

### 1. **Address Parsing & Standardization**
- Parse unstructured address text into components
- Standardize address formats for databases
- Extract specific components for validation

### 2. **Form Auto-completion**
- Auto-fill address forms by extracting components
- Validate address field completeness
- Suggest corrections for incomplete addresses

### 3. **Data Processing & Migration**
- Clean legacy address databases
- Extract structured data from unstructured text
- Migrate addresses between different systems

### 4. **Logistics & Delivery**
- Extract delivery-relevant components
- Validate address completeness for shipping
- Improve address accuracy for last-mile delivery

### 5. **Geocoding Preprocessing**
- Prepare addresses for geocoding APIs
- Extract location components for mapping
- Improve geocoding accuracy with clean components

### 6. **Mobile & Edge Deployment**
- Deploy on mobile devices with limited resources
- Run inference on edge computing devices
- Integrate into lightweight applications

## ⚡ Performance Tips

1. **Input Length**: Keep addresses under 128 tokens for optimal performance
2. **Batch Processing**: Process multiple addresses in batches for efficiency
3. **GPU Usage**: Use GPU for faster inference on large datasets
4. **Confidence Filtering**: Filter results by confidence score for higher precision
5. **Text Preprocessing**: Clean input text for better recognition
6. **TinyBERT Advantage**: Model benefits from efficient architecture optimizations
7. **Edge Deployment**: Suitable for mobile and edge computing scenarios

## ⚠️ Limitations

- **Language Support**: Primarily optimized for English Indian addresses
- **Regional Variations**: May struggle with highly regional or colloquial formats
- **New Localities**: Performance may vary on very recent developments
- **Complex Formatting**: May have difficulty with highly unstructured text
- **Context Dependency**: Works best with clear address context

## 📋 Entity Mapping

The model uses BIO (Begin-Inside-Outside) tagging scheme:

```json
{
  "entity2id": {
    "O": 0,
    "B-building_name": 1,
    "I-building_name": 2,
    "B-city": 3,
    "I-city": 4,
    "B-country": 5,
    "I-country": 6,
    "B-floor": 7,
    "I-floor": 8,
    "B-house_details": 9,
    "I-house_details": 10,
    "B-locality": 11,
    "I-locality": 12,
    "B-pincode": 13,
    "I-pincode": 14,
    "B-road": 15,
    "I-road": 16,
    "B-state": 17,
    "I-state": 18,
    "B-sub_locality": 19,
    "I-sub_locality": 20,
    "B-landmarks": 21,
    "I-landmarks": 22
  },
  "id2entity": {
    "0": "O",
    "1": "B-building_name",
    "2": "I-building_name",
    "3": "B-city",
    "4": "I-city",
    "5": "B-country",
    "6": "I-country",
    "7": "B-floor",
    "8": "I-floor",
    "9": "B-house_details",
    "10": "I-house_details",
    "11": "B-locality",
    "12": "I-locality",
    "13": "B-pincode",
    "14": "I-pincode",
    "15": "B-road",
    "16": "I-road",
    "17": "B-state",
    "18": "I-state",
    "19": "B-sub_locality",
    "20": "I-sub_locality",
    "21": "B-landmarks",
    "22": "I-landmarks"
  }
}
```

## 📋 Model Files

- `config.json`: Model configuration and hyperparameters  
- `pytorch_model.bin` / `model.safetensors`: Model weights
- `tokenizer.json`: Tokenizer configuration
- `tokenizer_config.json`: Tokenizer settings
- `vocab.txt`: Vocabulary file
- `entity_mappings.json`: Entity type mappings

## 🔄 Model Updates

- **Version**: v1.0 (Checkpoint 20793)
- **Last Updated**: 2025-06-19
- **Training Completion**: Based on augmented Indian address dataset
- **Base Model**: TinyBERT for efficient transformer architecture

## 📚 Citation

If you use this model in your research or applications, please cite:

```bibtex
@misc{open-tinybert-indian-address-ner,
  title={TinyBERT Indian Address NER Model},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/shiprocket-ai/open-tinybert-indian-address-ner}
}
```

## 📞 Support & Contact

For questions, issues, or feature requests:
- Open an issue in this repository
- Contact: shiprocket-ai team
- Documentation: See usage examples above

## 📜 License

This model is released under the Apache 2.0 License. See LICENSE file for details.

---

*Specialized for Indian address entity recognition - Built with ❤️ by shiprocket-ai team using TinyBERT*