
🏠 IndicBERT Indian Address NER Model

This model is a fine-tuned IndicBERT for Named Entity Recognition (NER) on Indian addresses. It can extract and classify various address components from Indian address text with high accuracy, leveraging IndicBERT's superior understanding of Indian language patterns and contexts.

🎯 Model Description

IndicBERT fine-tuned for Indian address Named Entity Recognition (NER)

Key Capabilities

  • Address Component Extraction: Identify and classify various parts of Indian addresses
  • Multi-format Support: Handle various Indian address formats and styles
  • Indic Language Optimized: Built on IndicBERT for better Indian context understanding
  • High Accuracy: Fine-tuned on augmented Indian address dataset
  • Fast Inference: Optimized IndicBERT for quick entity extraction
  • Robust Recognition: Handles partial, incomplete, or informal addresses
  • Cultural Context: Better understanding of Indian naming conventions and locality patterns

📊 Model Architecture

  • Base Model: ai4bharat/indic-bert (IndicBERT)
  • Model Type: Token Classification (NER)
  • Vocabulary Size: 200,000 tokens
  • Hidden Size: 768
  • Number of Layers: 12
  • Attention Heads: 12
  • Max Sequence Length: 512 tokens
  • Number of Labels: 23
  • Model Size: ~396MB
  • Checkpoint: 20793
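
These numbers can be sanity-checked straight from the published configuration. A minimal sketch, assuming the standard transformers config attribute names:

from transformers import AutoConfig

# Load only the model configuration from the Hub (no weights are downloaded)
config = AutoConfig.from_pretrained("shiprocket-ai/open-indicbert-indian-address-ner")

print(config.vocab_size)           # expected: 200000
print(config.hidden_size)          # expected: 768
print(config.num_hidden_layers)    # expected: 12
print(config.num_attention_heads)  # expected: 12
print(config.num_labels)           # expected: 23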

🚀 Usage Examples

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
import warnings
warnings.filterwarnings("ignore")

class IndianAddressNER:
    def __init__(self):
        model_name = "shiprocket-ai/open-indicbert-indian-address-ner"
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForTokenClassification.from_pretrained(model_name)
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)
        self.model.eval()
        
        # Entity mappings (label id -> BIO tag; see "Entity Mapping" section below)
        self.id2entity = {
            "0": "O",
            "1": "B-building_name",
            "2": "I-building_name",
            "3": "B-city",
            "4": "I-city",
            "5": "B-country",
            "6": "I-country",
            "7": "B-floor",
            "8": "I-floor",
            "9": "B-house_details",
            "10": "I-house_details",
            "11": "B-locality",
            "12": "I-locality",
            "13": "B-pincode",
            "14": "I-pincode",
            "15": "B-road",
            "16": "I-road",
            "17": "B-state",
            "18": "I-state",
            "19": "B-sub_locality",
            "20": "I-sub_locality",
            "21": "B-landmarks",
            "22": "I-landmarks"
        }
    
    def predict(self, address):
        """Extract entities from an Indian address"""
        if not address.strip():
            return {}
        
        # Tokenize
        inputs = self.tokenizer(
            address, 
            return_tensors="pt", 
            truncation=True, 
            padding=True, 
            max_length=128
        )
        inputs = {k: v.to(self.device) for k, v in inputs.items()}
        
        # Predict
        with torch.no_grad():
            outputs = self.model(**inputs)
            predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
            predicted_ids = torch.argmax(predictions, dim=-1)
            confidence_scores = torch.max(predictions, dim=-1)[0]
        
        # Convert to tokens and labels
        tokens = self.tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
        predicted_labels = [self.id2entity.get(str(pred_id.item()), "O") for pred_id in predicted_ids[0]]
        confidences = confidence_scores[0].cpu().numpy()
        
        # Group entities
        entities = self.group_entities(tokens, predicted_labels, confidences)
        return entities
    
    def group_entities(self, tokens, labels, confidences):
        """Group B- and I- tags into complete entities"""
        entities = {}
        current_entity = None
        
        for token, label, conf in zip(tokens, labels, confidences):
            if token in ["[CLS]", "[SEP]", "[PAD]"]:
                continue
            
            if label.startswith("B-"):
                # Save previous entity
                if current_entity:
                    entity_type = current_entity["type"]
                    if entity_type not in entities:
                        entities[entity_type] = []
                    entities[entity_type].append({
                        "text": current_entity["text"].replace("##", ""),
                        "confidence": current_entity["confidence"]
                    })
                
                # Start new entity
                entity_type = label[2:]  # Remove "B-"
                current_entity = {
                    "type": entity_type,
                    "text": token,
                    "confidence": conf
                }
            
            elif label.startswith("I-") and current_entity:
                # Continue current entity
                entity_type = label[2:]  # Remove "I-"
                if entity_type == current_entity["type"]:
                    current_entity["text"] += token
                    current_entity["confidence"] = (current_entity["confidence"] + conf) / 2
            
            elif label == "O" and current_entity:
                # End current entity
                entity_type = current_entity["type"]
                if entity_type not in entities:
                    entities[entity_type] = []
                entities[entity_type].append({
                    "text": current_entity["text"].replace("##", ""),
                    "confidence": current_entity["confidence"]
                })
                current_entity = None
        
        # Add final entity if exists
        if current_entity:
            entity_type = current_entity["type"]
            if entity_type not in entities:
                entities[entity_type] = []
            entities[entity_type].append({
                "text": current_entity["text"].replace("##", ""),
                "confidence": current_entity["confidence"]
            })
        
        return entities

# Usage example
ner = IndianAddressNER()

# Test addresses
test_addresses = [
    "Shop No 123, Sunshine Apartments, Andheri West, Mumbai, 400058",
    "DLF Cyber City, Sector 25, Gurgaon, Haryana",
    "Flat 201, MG Road, Bangalore, Karnataka, 560001",
    "Phoenix Mall, Kurla West, Mumbai"
]

print("🏠 INDIAN ADDRESS NER EXAMPLES")
print("=" * 50)

for address in test_addresses:
    print(f"\nπŸ“ Address: {address}")
    entities = ner.predict(address)
    
    if entities:
        for entity_type, entity_list in sorted(entities.items()):
            print(f"🏷️ {entity_type.replace('_', ' ').title()}:")
            for entity in entity_list:
                confidence = entity['confidence']
                text = entity['text']
                confidence_icon = "🟒" if confidence > 0.8 else "🟑" if confidence > 0.6 else "πŸ”΄"
                print(f"   {confidence_icon} {text} (confidence: {confidence:.3f})")
    else:
        print("❌ No entities found")
    print("-" * 40)

🏷️ Supported Entity Types

The model can identify and extract the following address components:

  • Building Name: building_name
  • City: city
  • Country: country
  • Floor: floor
  • House Details: house_details
  • Landmarks: landmarks
  • Locality: locality
  • Pincode: pincode
  • Road: road
  • State: state
  • Sub Locality: sub_locality
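
The predict() method in the usage example above returns these components grouped by type. An illustrative result shape (the texts and confidence values below are hand-written examples, not actual model output):

# Shape of IndianAddressNER.predict() output for a hypothetical Mumbai address
{
    "locality": [{"text": "Andheri West", "confidence": 0.97}],
    "city": [{"text": "Mumbai", "confidence": 0.99}],
    "pincode": [{"text": "400058", "confidence": 0.98}]
}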

📈 Performance Highlights

  • Indian Address Optimized: Specialized for Indian address patterns and formats
  • IndicBERT Advantage: Better understanding of Indian linguistic patterns
  • High Precision: Accurate entity boundary detection
  • Multi-component Recognition: Identifies multiple entities in complex addresses
  • Confidence Scoring: Provides confidence scores for each extracted entity
  • Fast Inference: Optimized for real-time applications
  • Robust Handling: Works with partial or informal address formats
  • Cultural Awareness: Better recognition of Indian place names and conventions

🔧 Training Details

  • Dataset: Indian address dataset with 300% data augmentation
  • Training Strategy: Fine-tuned from pre-trained IndicBERT
  • Specialization: Indian address entity extraction
  • Context Length: 128 tokens
  • Version: v1.0
  • Framework: PyTorch + Transformers
  • BIO Tagging: Uses Begin-Inside-Outside tagging scheme
  • Base Model Advantage: IndicBERT's pre-training on Indian language data

💡 Use Cases

1. Address Parsing & Standardization

  • Parse unstructured address text into components
  • Standardize address formats for databases
  • Extract specific components for validation
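
A minimal sketch of this flow, building on the IndianAddressNER class from the usage example; the standardize helper and its keep-the-best-span policy are illustrative choices, not part of the model:

def standardize(ner, address):
    """Flatten NER output into one value per address component."""
    entities = ner.predict(address)
    record = {}
    for entity_type, spans in entities.items():
        # Keep the highest-confidence span for each component
        best = max(spans, key=lambda span: span["confidence"])
        record[entity_type] = best["text"]
    return record

print(standardize(ner, "Flat 201, MG Road, Bangalore, Karnataka, 560001"))
# e.g. {'house_details': 'Flat 201', 'road': 'MG Road', 'city': 'Bangalore', ...}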

2. Form Auto-completion

  • Auto-fill address forms by extracting components
  • Validate address field completeness
  • Suggest corrections for incomplete addresses

3. Data Processing & Migration

  • Clean legacy address databases
  • Extract structured data from unstructured text
  • Migrate addresses between different systems

4. Logistics & Delivery

  • Extract delivery-relevant components
  • Validate address completeness for shipping
  • Improve address accuracy for last-mile delivery
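
A hedged sketch of a completeness check on top of the same class; the required component set here is an assumption to adapt, not a shipping rule:

REQUIRED_FOR_DELIVERY = {"locality", "city", "state", "pincode"}

def missing_components(ner, address):
    """Return the delivery-relevant components the model did not find."""
    found = set(ner.predict(address).keys())
    return REQUIRED_FOR_DELIVERY - found

print(missing_components(ner, "Phoenix Mall, Kurla West, Mumbai"))
# e.g. {'state', 'pincode'} if those components are not detected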

5. Geocoding Preprocessing

  • Prepare addresses for geocoding APIs
  • Extract location components for mapping
  • Improve geocoding accuracy with clean components
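
A small sketch of this preprocessing step; the component ordering is an assumption chosen to resemble typical geocoder query formats:

def geocoding_query(ner, address):
    """Build a clean, ordered query string from extracted components."""
    entities = ner.predict(address)
    parts = []
    for component in ["locality", "city", "state", "pincode"]:
        if component in entities:
            parts.append(entities[component][0]["text"])
    return ", ".join(parts)

print(geocoding_query(ner, "DLF Cyber City, Sector 25, Gurgaon, Haryana"))
# e.g. "Gurgaon, Haryana" (plus whatever other components are detected)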

⚡ Performance Tips

  1. Input Length: Keep addresses under 128 tokens for optimal performance
  2. Batch Processing: Process multiple addresses in batches for efficiency (see the batching sketch after this list)
  3. GPU Usage: Use GPU for faster inference on large datasets
  4. Confidence Filtering: Filter results by confidence score for higher precision
  5. Text Preprocessing: Clean input text for better recognition
  6. IndicBERT Advantage: Model performs better on Indian language patterns
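
A minimal batching sketch combining tips 2 and 4, reusing the tokenizer, model, and group_entities from the IndianAddressNER class above; the batch size and the 0.8 threshold are assumptions to tune:

def predict_batch(ner, addresses, batch_size=32, min_confidence=0.8):
    results = []
    for start in range(0, len(addresses), batch_size):
        batch = addresses[start:start + batch_size]
        # One padded forward pass per batch instead of one per address
        inputs = ner.tokenizer(batch, return_tensors="pt", truncation=True,
                               padding=True, max_length=128)
        inputs = {k: v.to(ner.device) for k, v in inputs.items()}
        with torch.no_grad():
            probs = torch.nn.functional.softmax(ner.model(**inputs).logits, dim=-1)
        pred_ids = torch.argmax(probs, dim=-1)
        confidences = torch.max(probs, dim=-1)[0]
        for row in range(len(batch)):
            tokens = ner.tokenizer.convert_ids_to_tokens(inputs["input_ids"][row])
            labels = [ner.id2entity.get(str(p.item()), "O") for p in pred_ids[row]]
            entities = ner.group_entities(tokens, labels, confidences[row].cpu().numpy())
            # Tip 4: drop low-confidence entities for higher precision
            results.append({
                entity_type: [s for s in spans if s["confidence"] >= min_confidence]
                for entity_type, spans in entities.items()
            })
    return results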

⚠️ Limitations

  • Language Support: Primarily optimized for Indian addresses written in English
  • Regional Variations: May struggle with highly regional or colloquial formats
  • New Localities: Performance may vary on very recent developments
  • Complex Formatting: May have difficulty with highly unstructured text
  • Context Dependency: Works best with clear address context

📋 Entity Mapping

The model uses the BIO (Begin-Inside-Outside) tagging scheme:

{
  "entity2id": {
    "O": 0,
    "B-building_name": 1,
    "I-building_name": 2,
    "B-city": 3,
    "I-city": 4,
    "B-country": 5,
    "I-country": 6,
    "B-floor": 7,
    "I-floor": 8,
    "B-house_details": 9,
    "I-house_details": 10,
    "B-locality": 11,
    "I-locality": 12,
    "B-pincode": 13,
    "I-pincode": 14,
    "B-road": 15,
    "I-road": 16,
    "B-state": 17,
    "I-state": 18,
    "B-sub_locality": 19,
    "I-sub_locality": 20,
    "B-landmarks": 21,
    "I-landmarks": 22
  },
  "id2entity": {
    "0": "O",
    "1": "B-building_name",
    "2": "I-building_name",
    "3": "B-city",
    "4": "I-city",
    "5": "B-country",
    "6": "I-country",
    "7": "B-floor",
    "8": "I-floor",
    "9": "B-house_details",
    "10": "I-house_details",
    "11": "B-locality",
    "12": "I-locality",
    "13": "B-pincode",
    "14": "I-pincode",
    "15": "B-road",
    "16": "I-road",
    "17": "B-state",
    "18": "I-state",
    "19": "B-sub_locality",
    "20": "I-sub_locality",
    "21": "B-landmarks",
    "22": "I-landmarks"
  }
}
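
As a worked example of the scheme, here is an illustrative (hand-constructed) tagging of a short fragment, shown at word level for readability; the actual tokenizer may split words into subwords:

Token:  Andheri     West        ,  Mumbai  ,  400058
Tag:    B-locality  I-locality  O  B-city   O  B-pincode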

📋 Model Files

  • config.json: Model configuration and hyperparameters
  • pytorch_model.bin / model.safetensors: Model weights
  • tokenizer.json: Tokenizer configuration
  • tokenizer_config.json: Tokenizer settings
  • vocab.txt: Vocabulary file
  • entity_mappings.json: Entity type mappings

🔄 Model Updates

  • Version: v1.0 (Checkpoint 20793)
  • Last Updated: 2025-06-18
  • Training Data: Augmented Indian address dataset
  • Base Model: IndicBERT for enhanced Indian context understanding

📚 Citation

If you use this model in your research or applications, please cite:

@misc{open-indicbert-indian-address-ner,
  title={IndicBERT Indian Address NER Model},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/shiprocket-ai/open-indicbert-indian-address-ner}
}

📞 Support & Contact

For questions, issues, or feature requests:

  • Open an issue in this repository
  • Contact: shiprocket-ai team
  • Documentation: See usage examples above

📜 License

This model is released under the Apache 2.0 License. See LICENSE file for details.


Specialized for Indian address entity recognition - Built with ❤️ by the shiprocket-ai team using IndicBERT
