# IndicBERT Indian Address NER Model
This model is a fine-tuned IndicBERT for Named Entity Recognition (NER) on Indian addresses. It extracts and classifies the components of Indian address text, leveraging IndicBERT's pre-training on Indian language data for a stronger grasp of Indian naming patterns and contexts.
## Model Description
IndicBERT fine-tuned for Indian address Named Entity Recognition (NER)
### Key Capabilities
- Address Component Extraction: Identifies and classifies the parts of an Indian address
- Multi-format Support: Handles varied Indian address formats and styles
- Indic Language Optimized: Built on IndicBERT for better understanding of Indian context
- High Accuracy: Fine-tuned on an augmented Indian address dataset
- Fast Inference: Optimized for quick entity extraction
- Robust Recognition: Handles partial, incomplete, or informal addresses
- Cultural Context: Better understanding of Indian naming conventions and locality patterns
## Model Architecture
- Base Model: ai4bharat/indic-bert (IndicBERT)
- Model Type: Token Classification (NER)
- Vocabulary Size: 200,000 tokens
- Hidden Size: 768
- Number of Layers: 12
- Attention Heads: 12
- Max Sequence Length: 512 tokens
- Number of Labels: 23
- Model Size: ~396 MB
- Checkpoint: 20793
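
These settings can be verified locally by inspecting the published configuration. A minimal sketch, assuming the standard `transformers` config fields:

```python
from transformers import AutoConfig

# Fetch the model configuration from the Hub and check key architecture fields.
config = AutoConfig.from_pretrained("shiprocket-ai/open-indicbert-indian-address-ner")
print(config.num_labels)               # 23 labels (22 BIO tags plus "O")
print(config.hidden_size)              # 768
print(config.max_position_embeddings)  # maximum sequence length
```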
## Usage Examples
```python
import warnings

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

warnings.filterwarnings("ignore")


class IndianAddressNER:
    def __init__(self):
        model_name = "shiprocket-ai/open-indicbert-indian-address-ner"
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForTokenClassification.from_pretrained(model_name)
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)
        self.model.eval()

        # Entity mappings (label id -> BIO tag)
        self.id2entity = {
            "0": "O",
            "1": "B-building_name",
            "2": "I-building_name",
            "3": "B-city",
            "4": "I-city",
            "5": "B-country",
            "6": "I-country",
            "7": "B-floor",
            "8": "I-floor",
            "9": "B-house_details",
            "10": "I-house_details",
            "11": "B-locality",
            "12": "I-locality",
            "13": "B-pincode",
            "14": "I-pincode",
            "15": "B-road",
            "16": "I-road",
            "17": "B-state",
            "18": "I-state",
            "19": "B-sub_locality",
            "20": "I-sub_locality",
            "21": "B-landmarks",
            "22": "I-landmarks",
        }

    def predict(self, address):
        """Extract entities from an Indian address."""
        if not address.strip():
            return {}

        # Tokenize the input address
        inputs = self.tokenizer(
            address,
            return_tensors="pt",
            truncation=True,
            padding=True,
            max_length=128,
        )
        inputs = {k: v.to(self.device) for k, v in inputs.items()}

        # Predict label probabilities for every token
        with torch.no_grad():
            outputs = self.model(**inputs)
            predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
            predicted_ids = torch.argmax(predictions, dim=-1)
            confidence_scores = torch.max(predictions, dim=-1)[0]

        # Convert ids back to tokens and labels
        tokens = self.tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
        predicted_labels = [
            self.id2entity.get(str(label_id.item()), "O")
            for label_id in predicted_ids[0]
        ]
        confidences = confidence_scores[0].cpu().numpy()

        # Group BIO-tagged tokens into complete entities
        return self.group_entities(tokens, predicted_labels, confidences)

    def group_entities(self, tokens, labels, confidences):
        """Group B- and I- tags into complete entities."""
        entities = {}
        current_entity = None

        def flush(entity):
            # Append a finished entity span to the result dict.
            entities.setdefault(entity["type"], []).append(
                {"text": entity["text"], "confidence": entity["confidence"]}
            )

        for token, label, conf in zip(tokens, labels, confidences):
            # Skip special tokens
            if token in ["[CLS]", "[SEP]", "[PAD]"]:
                continue

            if label.startswith("B-"):
                # Save the previous entity, then start a new one
                if current_entity:
                    flush(current_entity)
                current_entity = {
                    "type": label[2:],  # strip the "B-" prefix
                    "text": token[2:] if token.startswith("##") else token,
                    "confidence": conf,
                }
            elif label.startswith("I-") and current_entity:
                # Continue the current entity when the types match
                if label[2:] == current_entity["type"]:
                    # WordPiece subwords ("##...") attach directly; whole words need a space
                    if token.startswith("##"):
                        current_entity["text"] += token[2:]
                    else:
                        current_entity["text"] += " " + token
                    # Running average of token confidences
                    current_entity["confidence"] = (current_entity["confidence"] + conf) / 2
            elif label == "O" and current_entity:
                # An "O" tag ends the current entity
                flush(current_entity)
                current_entity = None

        # Flush the final entity, if any
        if current_entity:
            flush(current_entity)

        return entities


# Usage example
ner = IndianAddressNER()

# Test addresses
test_addresses = [
    "Shop No 123, Sunshine Apartments, Andheri West, Mumbai, 400058",
    "DLF Cyber City, Sector 25, Gurgaon, Haryana",
    "Flat 201, MG Road, Bangalore, Karnataka, 560001",
    "Phoenix Mall, Kurla West, Mumbai",
]

print("INDIAN ADDRESS NER EXAMPLES")
print("=" * 50)

for address in test_addresses:
    print(f"\nAddress: {address}")
    entities = ner.predict(address)
    if entities:
        for entity_type, entity_list in sorted(entities.items()):
            print(f"{entity_type.replace('_', ' ').title()}:")
            for entity in entity_list:
                confidence = entity["confidence"]
                text = entity["text"]
                # Colour-code confidence: green > 0.8, yellow > 0.6, red otherwise
                confidence_icon = "🟢" if confidence > 0.8 else "🟡" if confidence > 0.6 else "🔴"
                print(f"  {confidence_icon} {text} (confidence: {confidence:.3f})")
    else:
        print("No entities found")
    print("-" * 40)
```
## Supported Entity Types
The model can identify and extract the following address components:
- Building Name: building_name
- City: city
- Country: country
- Floor: floor
- House Details: house_details
- Landmarks: landmarks
- Locality: locality
- Pincode: pincode
- Road: road
- State: state
- Sub Locality: sub_locality
## Performance Highlights
- Indian Address Optimized: Specialized for Indian address patterns and formats
- IndicBERT Advantage: Better understanding of Indian linguistic patterns
- High Precision: Accurate entity boundary detection
- Multi-component Recognition: Identifies multiple entities in complex addresses
- Confidence Scoring: Provides confidence scores for each extracted entity
- Fast Inference: Optimized for real-time applications
- Robust Handling: Works with partial or informal address formats
- Cultural Awareness: Better recognition of Indian place names and conventions
## Training Details
- Dataset: Indian address dataset with 300% augmentation
- Training Strategy: Fine-tuned from pre-trained IndicBERT
- Specialization: Indian address entity extraction
- Context Length: 128 tokens
- Version: v1.0
- Framework: PyTorch + Transformers
- BIO Tagging: Uses the Begin-Inside-Outside tagging scheme (see Entity Mapping below)
- Base Model Advantage: IndicBERT's pre-training on Indian language data
## Use Cases
1. Address Parsing & Standardization (see the sketch after this list)
- Parse unstructured address text into components
- Standardize address formats for databases
- Extract specific components for validation
2. Form Auto-completion
- Auto-fill address forms by extracting components
- Validate address field completeness
- Suggest corrections for incomplete addresses
3. Data Processing & Migration
- Clean legacy address databases
- Extract structured data from unstructured text
- Migrate addresses between different systems
4. Logistics & Delivery
- Extract delivery-relevant components
- Validate address completeness for shipping
- Improve address accuracy for last-mile delivery
5. Geocoding Preprocessing
- Prepare addresses for geocoding APIs
- Extract location components for mapping
- Improve geocoding accuracy with clean components
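
As an illustration of the first use case, the sketch below flattens NER output into a standardized record. It reuses the `IndianAddressNER` class from the usage example; `standardize_address` and its one-value-per-field schema are illustrative choices, not a prescribed format.

```python
def standardize_address(ner, raw_address):
    """Flatten NER output into a flat dict with one value per address component."""
    entities = ner.predict(raw_address)
    record = {}
    for entity_type, spans in entities.items():
        # Keep only the highest-confidence span for each component.
        best = max(spans, key=lambda span: span["confidence"])
        record[entity_type] = best["text"]
    return record


print(standardize_address(ner, "Flat 201, MG Road, Bangalore, Karnataka, 560001"))
```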
## Performance Tips
- Input Length: Keep addresses under 128 tokens for optimal performance
- Batch Processing: Process multiple addresses in batches for efficiency (see the sketch after this list)
- GPU Usage: Use GPU for faster inference on large datasets
- Confidence Filtering: Filter results by confidence score for higher precision
- Text Preprocessing: Clean input text for better recognition
- IndicBERT Advantage: Model performs better on Indian language patterns
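
A minimal batched-inference sketch combining the batch-processing and confidence-filtering tips. It reuses the `IndianAddressNER` class from the usage example; `predict_batch`, the batch size, and the 0.6 threshold are illustrative, untuned choices.

```python
import torch


def predict_batch(ner, addresses, batch_size=32, min_confidence=0.6):
    """Tokenize chunks of addresses together and run one forward pass per chunk."""
    results = []
    for start in range(0, len(addresses), batch_size):
        chunk = addresses[start:start + batch_size]
        inputs = ner.tokenizer(
            chunk, return_tensors="pt", truncation=True, padding=True, max_length=128
        )
        inputs = {k: v.to(ner.device) for k, v in inputs.items()}
        with torch.no_grad():
            probs = torch.nn.functional.softmax(ner.model(**inputs).logits, dim=-1)
        pred_ids = probs.argmax(dim=-1)
        confs = probs.max(dim=-1).values
        for i in range(len(chunk)):
            tokens = ner.tokenizer.convert_ids_to_tokens(inputs["input_ids"][i])
            labels = [ner.id2entity.get(str(p.item()), "O") for p in pred_ids[i]]
            entities = ner.group_entities(tokens, labels, confs[i].cpu().numpy())
            # Drop spans below the confidence threshold for higher precision.
            entities = {
                etype: [e for e in spans if e["confidence"] >= min_confidence]
                for etype, spans in entities.items()
            }
            results.append({etype: spans for etype, spans in entities.items() if spans})
    return results
```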
## Limitations
- Language Support: Primarily optimized for Indian addresses written in English
- Regional Variations: May struggle with highly regional or colloquial formats
- New Localities: Performance may vary on very recent developments
- Complex Formatting: May have difficulty with highly unstructured text
- Context Dependency: Works best with clear address context
## Entity Mapping
The model uses the BIO (Begin-Inside-Outside) tagging scheme:
```json
{
  "entity2id": {
    "O": 0,
    "B-building_name": 1,
    "I-building_name": 2,
    "B-city": 3,
    "I-city": 4,
    "B-country": 5,
    "I-country": 6,
    "B-floor": 7,
    "I-floor": 8,
    "B-house_details": 9,
    "I-house_details": 10,
    "B-locality": 11,
    "I-locality": 12,
    "B-pincode": 13,
    "I-pincode": 14,
    "B-road": 15,
    "I-road": 16,
    "B-state": 17,
    "I-state": 18,
    "B-sub_locality": 19,
    "I-sub_locality": 20,
    "B-landmarks": 21,
    "I-landmarks": 22
  },
  "id2entity": {
    "0": "O",
    "1": "B-building_name",
    "2": "I-building_name",
    "3": "B-city",
    "4": "I-city",
    "5": "B-country",
    "6": "I-country",
    "7": "B-floor",
    "8": "I-floor",
    "9": "B-house_details",
    "10": "I-house_details",
    "11": "B-locality",
    "12": "I-locality",
    "13": "B-pincode",
    "14": "I-pincode",
    "15": "B-road",
    "16": "I-road",
    "17": "B-state",
    "18": "I-state",
    "19": "B-sub_locality",
    "20": "I-sub_locality",
    "21": "B-landmarks",
    "22": "I-landmarks"
  }
}
```
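
For example, a simple address would be tagged token by token roughly as follows (illustrative only; actual token boundaries depend on the tokenizer):

```python
# Illustrative BIO tagging; real tokenization may split words into subwords.
tokens = ["Flat", "201", ",", "MG", "Road", ",", "Bangalore", ",", "Karnataka", ",", "560001"]
labels = ["B-house_details", "I-house_details", "O", "B-road", "I-road", "O",
          "B-city", "O", "B-state", "O", "B-pincode"]
```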
## Model Files
- `config.json`: Model configuration and hyperparameters
- `pytorch_model.bin` / `model.safetensors`: Model weights
- `tokenizer.json`: Tokenizer configuration
- `tokenizer_config.json`: Tokenizer settings
- `vocab.txt`: Vocabulary file
- `entity_mappings.json`: Entity type mappings
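
The label mappings can be fetched directly from the repository. A small sketch using `huggingface_hub`, assuming `entity_mappings.json` sits at the repo root:

```python
import json

from huggingface_hub import hf_hub_download

# Download entity_mappings.json from the model repo and load the label mappings.
path = hf_hub_download(
    repo_id="shiprocket-ai/open-indicbert-indian-address-ner",
    filename="entity_mappings.json",
)
with open(path) as f:
    mappings = json.load(f)

print(mappings["id2entity"]["1"])  # "B-building_name"
```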
## Model Updates
- Version: v1.0 (Checkpoint 20793)
- Last Updated: 2025-06-18
- Training Data: Augmented Indian address dataset
- Base Model: IndicBERT for enhanced Indian context understanding
## Citation
If you use this model in your research or applications, please cite:
```bibtex
@misc{open-indicbert-indian-address-ner,
  title={IndicBERT Indian Address NER Model},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/shiprocket-ai/open-indicbert-indian-address-ner}
}
```
## Support & Contact
For questions, issues, or feature requests:
- Open an issue in this repository
- Contact: shiprocket-ai team
- Documentation: See usage examples above
## License
This model is released under the Apache 2.0 License. See LICENSE file for details.
*Specialized for Indian address entity recognition - built with ❤️ by the shiprocket-ai team using IndicBERT*