Self-Organizing Map (SOM) Model for Document Clustering

A trained Self-Organizing Map model for clustering and visualizing high-dimensional document embeddings. This model was trained on technical documentation and can be used for document similarity analysis, topic discovery, and semantic clustering.

📊 Model Details

Model Type: Self-Organizing Map (SOM)
Training Data: 11,412 records
Embedding Dimension: 3,072 (OpenAI Large Embedding Model)
Number of Clusters: 625
Grid Size: 25x25
Learning Rate: 0.1
Sigma: 1.0
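
Each of the 625 clusters corresponds to one unit on the 25x25 grid, and a document belongs to the unit whose weight vector lies closest to its embedding (the best matching unit, or BMU). The following minimal sketch illustrates that lookup in plain Python on a toy 2x2 grid with 3-dimensional weights. The names `find_bmu` and `toy_weights` are hypothetical, and Euclidean distance is an assumption (it is MiniSom's default) rather than something this card confirms.

```python
import math

def find_bmu(embedding, weights):
    """Return (row, col) of the grid unit whose weight vector is nearest.

    `weights` is a nested list indexed as weights[row][col] -> vector.
    Euclidean distance is assumed here.
    """
    best_pos, best_dist = None, math.inf
    for row, units in enumerate(weights):
        for col, unit in enumerate(units):
            dist = math.dist(embedding, unit)
            if dist < best_dist:
                best_pos, best_dist = (row, col), dist
    return best_pos

# Toy 2x2 grid with 3-dimensional weights (the real model is 25x25 with 3,072 dimensions)
toy_weights = [
    [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
    [[0.0, 1.0, 0.0], [1.0, 1.0, 1.0]],
]

row, col = find_bmu([0.9, 0.1, 0.0], toy_weights)
cluster_key = f"{row},{col}"  # "row,col" string, the key format used in Quick Start
print(cluster_key)
```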

🎯 Use Cases

Document Clustering: Group similar documents based on semantic similarity
Topic Discovery: Identify common themes and topics in large document collections
Semantic Search: Find related documents through vector similarity
Data Visualization: Interactive visualization of document relationships
Knowledge Organization: Structure and organize large knowledge bases

📁 Model Files

som_model.pkl: Trained SOM model weights and parameters
cluster_assignments.json: Document-to-cluster assignments for all 11,412 records
cluster_analysis.json: Detailed analysis of each cluster including keywords and topics
interactive_som_map.html: Interactive visualization of the SOM grid with cluster information

🚀 Quick Start

Installation

pip install numpy scikit-learn matplotlib plotly minisom pandas

Loading and Using the Model

import pickle
import json
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Load the trained SOM model
with open('som_model.pkl', 'rb') as f:
    som_model = pickle.load(f)

# Load cluster assignments
with open('cluster_assignments.json', 'r') as f:
    cluster_assignments = json.load(f)

# Load cluster analysis
with open('cluster_analysis.json', 'r') as f:
    cluster_analysis = json.load(f)

# Example: Get cluster for a new document embedding
def get_cluster_for_embedding(embedding, som_model):
    """Get the cluster assignment for a new document embedding"""
    # Find the best matching unit (BMU)
    bmu = som_model.winner(embedding)
    return f"{bmu[0]},{bmu[1]}"

# Example: Find documents that share a cluster with a new embedding
def find_similar_documents(embedding, cluster_assignments, som_model, top_k=5):
    """Return up to top_k documents from the embedding's cluster (unranked)"""
    cluster = get_cluster_for_embedding(embedding, som_model)

    # Get all documents in the same cluster
    cluster_docs = [doc for doc, doc_cluster in cluster_assignments.items()
                    if doc_cluster == cluster]

    # Members are returned in stored order, not sorted by similarity
    return cluster_docs[:top_k]
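
The helper above returns cluster members in stored order rather than ranked by similarity. If the original document embeddings are available, the members could be ranked by cosine similarity to the query, as in this sketch. The `doc_embeddings` mapping (document ID to vector) is hypothetical and is not one of the files shipped with the model. The similarity is computed with the standard library so the snippet runs on its own, though `sklearn.metrics.pairwise.cosine_similarity` would serve the same purpose.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_cluster_members(query, cluster, cluster_assignments, doc_embeddings, top_k=5):
    """Rank the documents of one cluster by cosine similarity to `query`."""
    members = [doc for doc, key in cluster_assignments.items() if key == cluster]
    scored = [(doc, cosine(query, doc_embeddings[doc])) for doc in members]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy data: three documents, two of them in cluster "0,1"
assignments = {"doc_a": "0,1", "doc_b": "0,1", "doc_c": "3,4"}
embeddings = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.6, 0.8],
    "doc_c": [0.0, 1.0],
}

ranked = rank_cluster_members([0.8, 0.6], "0,1", assignments, embeddings)
for doc, score in ranked:
    print(doc, round(score, 3))
```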

Interactive Visualization

Open interactive_som_map.html in a web browser to explore the SOM grid interactively. The visualization shows:

Cluster sizes and distributions
Top keywords for each cluster
Topic analysis
Document counts per cluster
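
The per-cluster document counts shown in the visualization can also be derived directly from `cluster_assignments.json`. This sketch assumes the file maps each document ID to a `"row,col"` cluster key, which is how the Quick Start code reads it; the sample dictionary is made up for illustration.

```python
from collections import Counter

def cluster_sizes(cluster_assignments):
    """Count how many documents fall into each cluster key."""
    return Counter(cluster_assignments.values())

# Made-up assignments standing in for the contents of cluster_assignments.json
sample_assignments = {
    "doc_1": "0,0",
    "doc_2": "0,0",
    "doc_3": "12,7",
    "doc_4": "24,24",
    "doc_5": "0,0",
}

sizes = cluster_sizes(sample_assignments)
# most_common() lists clusters from largest to smallest
for key, count in sizes.most_common():
    print(f"cluster {key}: {count} document(s)")

# Units with no documents never occur as values, so count them by subtraction
empty_units = 25 * 25 - len(sizes)
print("units without documents:", empty_units)
```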

📈 Model Performance

Based on the cluster analysis:

Total Documents: 11,412
Total Clusters: 625 (25x25 grid)
Silhouette Score: -0.0078
Calinski-Harabasz Score: 13.69
Davies-Bouldin Score: 2.33
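
Scores of this kind can be computed with scikit-learn once every document has a cluster label. The sketch below runs on small synthetic data, not on the model's embeddings, so its numbers are unrelated to the ones above, and the variable names are illustrative. As a reading aid: silhouette ranges from -1 to 1, with values near zero indicating overlapping clusters; a higher Calinski-Harabasz score is better; a lower Davies-Bouldin score is better.

```python
import numpy as np
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

def clustering_scores(embeddings, labels):
    """Compute three internal clustering metrics for labelled embeddings."""
    return {
        "silhouette": silhouette_score(embeddings, labels),
        "calinski_harabasz": calinski_harabasz_score(embeddings, labels),
        "davies_bouldin": davies_bouldin_score(embeddings, labels),
    }

# Synthetic stand-in data: two well-separated blobs of 50 points each
rng = np.random.default_rng(0)
embeddings = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(50, 8)),
    rng.normal(loc=10.0, scale=1.0, size=(50, 8)),
])
labels = np.array([0] * 50 + [1] * 50)

scores = clustering_scores(embeddings, labels)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```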

🔍 Cluster Analysis

The model identifies meaningful clusters with distinct topics. For example, one of the largest clusters (659 documents) focuses on:

Keywords: connector, anypoint, mule, studio, connectors
Topics: Configuration, API integration, MuleSoft platform usage

🛠️ Advanced Usage

Custom Clustering

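# Train a new SOM with different parameters (requires: pip install minisom)
from minisom import MiniSom

def train_custom_som(embeddings, grid_size=(20, 20), sigma=1.0, learning_rate=0.1,
                     num_iterations=100):
    """Train a SOM on a 2-D NumPy array of shape (n_documents, embedding_dim)"""
    som = MiniSom(grid_size[0], grid_size[1], embeddings.shape[1],
                  sigma=sigma, learning_rate=learning_rate, random_seed=42)
    # Each iteration updates the map with one randomly drawn sample,
    # so raise num_iterations for larger collections
    som.train_random(embeddings, num_iterations)
    return som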
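
After training, each document still has to be mapped to its grid unit to produce assignments like those in `cluster_assignments.json`. The sketch below does this for any object exposing a `winner(vector)` method that returns a `(row, col)` pair, as MiniSom does. `build_cluster_assignments` and the `NearestCorner` stand-in class are hypothetical names, used so the snippet runs without a trained model; the output format is assumed to match the `"row,col"` keys used in Quick Start.

```python
import json

def build_cluster_assignments(som, embeddings_by_id):
    """Map each document ID to the "row,col" key of its best matching unit."""
    assignments = {}
    for doc_id, vector in embeddings_by_id.items():
        row, col = som.winner(vector)
        assignments[doc_id] = f"{int(row)},{int(col)}"
    return assignments

class NearestCorner:
    """Stand-in for a trained SOM: a 2x2 grid over the unit square."""

    def winner(self, vector):
        # Threshold each coordinate at 0.5 to pick the nearest corner
        return (int(vector[0] >= 0.5), int(vector[1] >= 0.5))

toy_embeddings = {"doc_a": [0.1, 0.9], "doc_b": [0.8, 0.7], "doc_c": [0.2, 0.8]}
assignments = build_cluster_assignments(NearestCorner(), toy_embeddings)

# Serialise as a JSON object of document ID -> cluster key
print(json.dumps(assignments, indent=2))
```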

Cluster Analysis

def analyze_cluster(cluster_key, cluster_analysis):
    """Get detailed information about a specific cluster"""
    for cluster in cluster_analysis['top_clusters']:
        if cluster['cluster_key'] == cluster_key:
            return {
                'size': cluster['size'],
                'keywords': cluster['keywords'],
                'topics': cluster['topics']
            }
    return None
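
A related query is listing the largest clusters. This sketch assumes the same `top_clusters` layout that `analyze_cluster` reads (`cluster_key`, `size`, `keywords`, `topics`). The sample entries are invented, apart from the keywords, topics, and size of the 659-document cluster described in this card; its cluster key is not given here, so `"0,0"` is a placeholder.

```python
def largest_clusters(cluster_analysis, n=3):
    """Return the n biggest clusters as (cluster_key, size, keywords) tuples."""
    ranked = sorted(
        cluster_analysis["top_clusters"],
        key=lambda cluster: cluster["size"],
        reverse=True,
    )
    return [(c["cluster_key"], c["size"], c["keywords"]) for c in ranked[:n]]

# Invented sample in the layout assumed above; all cluster keys are placeholders
sample_analysis = {
    "top_clusters": [
        {"cluster_key": "3,4", "size": 120,
         "keywords": ["api", "policy"], "topics": ["API management"]},
        {"cluster_key": "0,0", "size": 659,
         "keywords": ["connector", "anypoint", "mule", "studio", "connectors"],
         "topics": ["Configuration", "API integration", "MuleSoft platform usage"]},
        {"cluster_key": "10,2", "size": 45,
         "keywords": ["deploy"], "topics": ["Deployment"]},
    ]
}

for key, size, keywords in largest_clusters(sample_analysis, n=2):
    print(f"{key}: {size} documents, keywords: {', '.join(keywords)}")
```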

📚 Dependencies

numpy: Numerical computations
scikit-learn: Machine learning utilities
minisom: Self-Organizing Map implementation
matplotlib: Static plotting
plotly: Interactive visualizations
pandas: Data manipulation

🤝 Contributing

This model is part of a larger document processing and clustering pipeline. For questions or contributions, please refer to the main project repository.

📄 License

This model is provided for research and educational purposes. Please ensure compliance with the original data source licenses when using this model.

🔗 Related Resources

Note: This model was trained on technical documentation and is likely to work best on similar content. New inputs must be 3,072-dimensional embeddings produced by the same embedding model that was used for training. For documents from a different domain, consider retraining the SOM on your own data (see Custom Clustering).
