# Self-Organizing Map (SOM) Model for Document Clustering

A trained Self-Organizing Map model for clustering and visualizing high-dimensional document embeddings. This model was trained on technical documentation and can be used for document similarity analysis, topic discovery, and semantic clustering.

## Model Details

- **Model Type**: Self-Organizing Map (SOM)
- **Training Data**: 11,412 records
- **Embedding Dimension**: 3,072 (OpenAI Large Embedding Model)
- **Number of Clusters**: 625 (one per grid node)
- **Grid Size**: 25x25
- **Learning Rate**: 0.1
- **Sigma**: 1.0 (neighborhood spread)

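
To make the learning rate and sigma concrete, the sketch below applies one textbook SOM update step with NumPy: it finds the best matching unit, then pulls every node toward the sample, weighted by a Gaussian over grid distance. It illustrates the general algorithm and is not the training code used for this model; `som_update_step` and the toy 3x3 grid are hypothetical, and libraries such as MiniSom typically also decay both parameters over the course of training.

```python
import numpy as np


def som_update_step(weights, sample, learning_rate=0.1, sigma=1.0):
    """Apply one textbook SOM update and return (new_weights, bmu)."""
    rows, cols, _ = weights.shape
    # Best matching unit: grid node whose weight vector is closest to the sample
    distances = np.linalg.norm(weights - sample, axis=2)
    bmu = np.unravel_index(np.argmin(distances), (rows, cols))
    # Gaussian neighborhood over grid coordinates, centered on the BMU
    grid_r, grid_c = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    grid_dist_sq = (grid_r - bmu[0]) ** 2 + (grid_c - bmu[1]) ** 2
    neighborhood = np.exp(-grid_dist_sq / (2 * sigma ** 2))
    # Move each node toward the sample, scaled by learning rate and neighborhood
    new_weights = weights + learning_rate * neighborhood[..., None] * (sample - weights)
    return new_weights, tuple(int(i) for i in bmu)


rng = np.random.default_rng(0)
toy_weights = rng.random((3, 3, 4))  # hypothetical 3x3 grid of 4-dim vectors
toy_sample = rng.random(4)
updated, bmu = som_update_step(toy_weights, toy_sample)
print("BMU:", bmu)
```

With a learning rate of 0.1, the best matching unit moves 10% of the way toward the sample, and nodes farther away on the grid move less.
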
## Use Cases

- **Document Clustering**: Group similar documents based on semantic similarity
- **Topic Discovery**: Identify common themes and topics in large document collections
- **Semantic Search**: Find related documents through vector similarity
- **Data Visualization**: Interactive visualization of document relationships
- **Knowledge Organization**: Structure and organize large knowledge bases

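
For the semantic search use case, here is a minimal sketch of ranking candidate documents by cosine similarity to a query embedding. The `doc_embeddings` mapping and the toy three-dimensional vectors are hypothetical; per-document embeddings would have to come from your own pipeline.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, doc_embeddings, top_k=5):
    """Return the top_k (doc_id, score) pairs, most similar first."""
    scored = [(doc_id, cosine_similarity(query, vec))
              for doc_id, vec in doc_embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy data; real embeddings for this model have 3,072 dimensions
doc_embeddings = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}
ranked = rank_by_similarity([1.0, 0.0, 0.0], doc_embeddings, top_k=2)
print(ranked)
```

Restricting `doc_embeddings` to the members of the query's cluster keeps the comparison cheap on large collections.
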
## Model Files

- `som_model.pkl`: Trained SOM model weights and parameters
- `cluster_assignments.json`: Document-to-cluster assignments for all 11,412 records
- `cluster_analysis.json`: Detailed analysis of each cluster including keywords and topics
- `interactive_som_map.html`: Interactive visualization of the SOM grid with cluster information

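
As a quick sanity check on `cluster_assignments.json`, the sketch below counts documents per cluster. It assumes the file maps document IDs to `"row,col"` cluster keys, which is the shape the loading code in this README relies on; the inline sample is hypothetical and stands in for the real file.

```python
from collections import Counter


def cluster_sizes(cluster_assignments):
    """Count how many documents fall into each cluster key."""
    return Counter(cluster_assignments.values())


# Hypothetical sample in the assumed {doc_id: "row,col"} format;
# in practice, load it with json.load() from cluster_assignments.json
sample_assignments = {
    "doc_001": "3,7",
    "doc_002": "3,7",
    "doc_003": "12,0",
}
sizes = cluster_sizes(sample_assignments)
for cluster_key, count in sizes.most_common():
    print(cluster_key, count)
```
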
## Quick Start

### Installation

```bash
pip install numpy scikit-learn minisom matplotlib plotly pandas
```

### Loading and Using the Model

```python
import json
import pickle

import numpy as np

# Load the trained SOM model
# (unpickling a MiniSom object requires `minisom` to be installed)
with open('som_model.pkl', 'rb') as f:
    som_model = pickle.load(f)

# Load cluster assignments (document ID -> "row,col" cluster key)
with open('cluster_assignments.json', 'r') as f:
    cluster_assignments = json.load(f)

# Load cluster analysis
with open('cluster_analysis.json', 'r') as f:
    cluster_analysis = json.load(f)


# Example: Get cluster for a new document embedding
def get_cluster_for_embedding(embedding, som_model):
    """Get the cluster assignment for a new document embedding."""
    # Find the best matching unit (BMU): the grid node closest to the embedding
    bmu = som_model.winner(np.asarray(embedding, dtype=float))
    return f"{bmu[0]},{bmu[1]}"


# Example: Find similar documents
def find_similar_documents(embedding, som_model, cluster_assignments, top_k=5):
    """Find similar documents based on cluster membership."""
    cluster = get_cluster_for_embedding(embedding, som_model)

    # Get all documents in the same cluster
    cluster_docs = [doc for doc, doc_cluster in cluster_assignments.items()
                    if doc_cluster == cluster]

    # Members come back in file order; they are not ranked by similarity
    return cluster_docs[:top_k]
```

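
A cluster can hold fewer than `top_k` documents. Because neighboring SOM nodes tend to hold similar inputs, one option is to widen the lookup to adjacent grid cells. The sketch below is a hypothetical extension, not part of the shipped code; it assumes a rectangular 25x25 grid and `"row,col"` cluster keys.

```python
def neighboring_cluster_keys(cluster_key, grid_size=25, radius=1):
    """Return keys of grid cells within `radius` of cluster_key (inclusive)."""
    row, col = (int(part) for part in cluster_key.split(","))
    keys = []
    # Clamp the window to the grid so edge cells get fewer neighbors
    for r in range(max(0, row - radius), min(grid_size, row + radius + 1)):
        for c in range(max(0, col - radius), min(grid_size, col + radius + 1)):
            keys.append(f"{r},{c}")
    return keys


def documents_near_cluster(cluster_key, cluster_assignments, grid_size=25, radius=1):
    """Collect documents assigned to cluster_key or to an adjacent cell."""
    nearby = set(neighboring_cluster_keys(cluster_key, grid_size, radius))
    return [doc for doc, key in cluster_assignments.items() if key in nearby]


# Hypothetical assignments in the {doc_id: "row,col"} format
sample_assignments = {"doc_a": "0,0", "doc_b": "0,1", "doc_c": "5,5"}
print(documents_near_cluster("0,0", sample_assignments))
```
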
### Interactive Visualization

Open `interactive_som_map.html` in a web browser to explore the SOM grid interactively. The visualization shows:

- Cluster sizes and distributions
- Top keywords for each cluster
- Topic analysis
- Document counts per cluster

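## Model Performance

Based on the cluster analysis:

- **Total Documents**: 11,412
- **Total Clusters**: 625 (25x25 grid)
- **Silhouette Score**: -0.0078
- **Calinski-Harabasz Score**: 13.69
- **Davies-Bouldin Score**: 2.33

A silhouette score close to zero means that neighboring clusters overlap, which is common when a SOM spreads a continuous embedding space over many adjacent nodes. Higher Calinski-Harabasz and lower Davies-Bouldin values indicate better-separated clusters.
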
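
Scores like these can be computed with scikit-learn from the embeddings and one integer cluster label per document. The sketch below runs on synthetic data, since the original evaluation script is not part of this README; `evaluate_clustering` is a hypothetical helper, and whether the reported numbers were produced this way is an assumption.

```python
import numpy as np
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)


def evaluate_clustering(embeddings, labels):
    """Compute three internal clustering metrics for labeled embeddings."""
    return {
        # Ranges from -1 to 1; higher is better
        "silhouette": silhouette_score(embeddings, labels),
        # Unbounded; higher is better
        "calinski_harabasz": calinski_harabasz_score(embeddings, labels),
        # Non-negative; lower is better
        "davies_bouldin": davies_bouldin_score(embeddings, labels),
    }


# Synthetic stand-in: two well-separated blobs in 8 dimensions
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(50, 8))
blob_b = rng.normal(loc=5.0, scale=0.1, size=(50, 8))
embeddings = np.vstack([blob_a, blob_b])
labels = np.array([0] * 50 + [1] * 50)

scores = evaluate_clustering(embeddings, labels)
print(scores)
```

For SOM output, a `"row,col"` key can be turned into an integer label with `row * 25 + col` before calling the helper.
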
## Cluster Analysis

Clusters group documents around recognizable topics, which `cluster_analysis.json` summarizes by size, keywords, and topics. For example, one of the largest clusters (659 documents) focuses on:

- **Keywords**: connector, anypoint, mule, studio, connectors
- **Topics**: Configuration, API integration, MuleSoft platform usage

## Advanced Usage

### Custom Clustering

```python
# Train a new SOM with different parameters
from minisom import MiniSom


def train_custom_som(embeddings, grid_size=(20, 20), sigma=1.0,
                     learning_rate=0.1, num_iterations=100):
    """Train a SOM on a 2-D NumPy array of shape (n_documents, embedding_dim)."""
    som = MiniSom(grid_size[0], grid_size[1], embeddings.shape[1],
                  sigma=sigma, learning_rate=learning_rate, random_seed=42)
    # Each iteration updates the map with one randomly chosen sample, so the
    # default of 100 visits at most 100 documents; raise it for real training
    som.train_random(embeddings, num_iterations)
    return som
```

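
After training a custom SOM, each document still needs a cluster key. Here is a minimal sketch, assuming a trained object that exposes MiniSom's `winner()` method and a list of document IDs aligned with the embedding rows. `build_cluster_assignments` and the stub class used for the demonstration are hypothetical.

```python
import json


def build_cluster_assignments(som, doc_ids, embeddings):
    """Map each document ID to the "row,col" key of its best matching unit."""
    assignments = {}
    for doc_id, embedding in zip(doc_ids, embeddings):
        row, col = som.winner(embedding)
        assignments[doc_id] = f"{int(row)},{int(col)}"
    return assignments


class _StubSom:
    """Stand-in for a trained SOM so the sketch runs without minisom."""

    def winner(self, embedding):
        # Toy rule: pick the cell from the first two vector components
        return int(embedding[0]), int(embedding[1])


assignments = build_cluster_assignments(
    _StubSom(), ["doc_a", "doc_b"], [[1.0, 2.0], [3.0, 4.0]]
)
print(json.dumps(assignments, indent=2))
```

With a real model, pass the object returned by `train_custom_som` in place of `_StubSom()`; the resulting dictionary has the same shape as `cluster_assignments.json`.
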
### Cluster Analysis

```python
def analyze_cluster(cluster_key, cluster_analysis):
    """Get detailed information about a specific cluster."""
    # Only clusters listed under 'top_clusters' can be looked up
    for cluster in cluster_analysis['top_clusters']:
        if cluster['cluster_key'] == cluster_key:
            return {
                'size': cluster['size'],
                'keywords': cluster['keywords'],
                'topics': cluster['topics']
            }
    return None  # cluster_key not found
```

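
When many lookups are needed, scanning `top_clusters` on every call is wasteful. The sketch below builds a dictionary index once and lists the largest clusters. It assumes the same `top_clusters` schema that `analyze_cluster` reads; the helper names and sample entries are hypothetical.

```python
def index_clusters(cluster_analysis):
    """Build a {cluster_key: cluster_record} lookup from 'top_clusters'."""
    return {c["cluster_key"]: c for c in cluster_analysis["top_clusters"]}


def largest_clusters(cluster_analysis, n=3):
    """Return the n largest cluster records, biggest first."""
    return sorted(cluster_analysis["top_clusters"],
                  key=lambda c: c["size"], reverse=True)[:n]


# Hypothetical sample data following the assumed schema
sample_analysis = {
    "top_clusters": [
        {"cluster_key": "0,1", "size": 12, "keywords": ["alpha"], "topics": ["Topic A"]},
        {"cluster_key": "4,2", "size": 40, "keywords": ["beta"], "topics": ["Topic B"]},
    ]
}
by_key = index_clusters(sample_analysis)
print(by_key["4,2"]["keywords"])
print([c["cluster_key"] for c in largest_clusters(sample_analysis, n=1)])
```
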
## Dependencies

- `numpy`: Numerical computations
- `scikit-learn`: Machine learning utilities
- `minisom`: Self-Organizing Map implementation
- `matplotlib`: Static plotting
- `plotly`: Interactive visualizations
- `pandas`: Data manipulation

## Contributing

This model is part of a larger document processing and clustering pipeline. For questions or contributions, please refer to the main project repository.

## License

This model is provided for research and educational purposes. Please ensure compliance with the original data source licenses when using this model.

## Related Resources

- [Self-Organizing Maps Tutorial](https://en.wikipedia.org/wiki/Self-organizing_map)
- [MiniSom Documentation](https://github.com/JustGlowing/minisom)
- [OpenAI Embeddings](https://platform.openai.com/docs/guides/embeddings)

---

**Note**: This model was trained on technical documentation and is likely to work best on similar content. New inputs must be 3,072-dimensional embeddings from the same embedding model used in training. For documents from a different domain, consider retraining the SOM on your own data.