---
license: mit
datasets:
- sentence-transformers/all-nli
- sentence-transformers/stsb
base_model:
- rootxhacker/arthemis-instruct
tags:
- bert
- embedding
---
|
# rootxhacker/arthemis-embedding |
|
|
|
This is a text embedding model fine-tuned from **arthemislm-base** on the **all-nli-pair**, **all-nli-pair-class**, **all-nli-pair-score**, **all-nli-triplet**, **stsb**, **quora**, and **natural-questions** datasets. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
|
|
|
The **Arthemis Embedding** model is a 155.8M-parameter text embedding model that incorporates **Spiking Neural Networks (SNNs)** and **Liquid Time Constants (LTCs)** for enhanced temporal dynamics and semantic representation learning. This neuromorphic architecture shows its clearest advantages on classification tasks (see the MTEB results below) while maintaining competitive performance across other text understanding benchmarks.
|
|
|
This embedding model performs on par with jinaai/jina-embeddings-v2-base-en on MTEB.
|
|
|
## Model Details |
|
|
|
- **Model Type**: Text Embedding
- **Supported Languages**: English
- **Number of Parameters**: 155.8M
- **Context Length**: 1024 tokens
- **Embedding Dimension**: 768
- **Base Model**: arthemislm-base
- **Training Data**: all-nli-pair, all-nli-pair-class, all-nli-pair-score, all-nli-triplet, stsb, quora, natural-questions
|
|
|
### Architecture Features |
|
- **Spiking Neural Networks** in attention mechanisms for temporal processing |
|
- **Liquid Time Constants** in feed-forward layers for adaptive dynamics |
|
- **12-layer transformer backbone** with neuromorphic enhancements |
|
- **RoPE positional encoding** for sequence understanding |
|
- **Surrogate gradient training** for differentiable spike computation (a minimal sketch follows this list)
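
The spiking and surrogate-gradient ideas above can be illustrated with a minimal, hypothetical PyTorch sketch: a hard threshold produces binary spikes in the forward pass, while a smooth fast-sigmoid derivative stands in for the non-differentiable threshold in the backward pass. The class name and surrogate slope are illustrative assumptions; the actual Arthemis implementation is in the inference gist linked below.

```python
import torch


class SurrogateSpike(torch.autograd.Function):
    """Heaviside spike in the forward pass, smooth surrogate gradient in the backward pass."""

    @staticmethod
    def forward(ctx, membrane_potential, threshold):
        ctx.save_for_backward(membrane_potential)
        ctx.threshold = threshold
        # Emit a binary spike wherever the membrane potential crosses the threshold
        return (membrane_potential >= threshold).float()

    @staticmethod
    def backward(ctx, grad_output):
        (membrane_potential,) = ctx.saved_tensors
        # Fast-sigmoid surrogate: a smooth bump centered on the firing threshold
        surrogate = 1.0 / (1.0 + 10.0 * (membrane_potential - ctx.threshold).abs()) ** 2
        return grad_output * surrogate, None  # no gradient w.r.t. the threshold


# Example: apply the spiking non-linearity to attention pre-activations
pre_activations = torch.randn(2, 12, 64, requires_grad=True)
spikes = SurrogateSpike.apply(pre_activations, 1.0)  # threshold 1.0, matching the spec below
spikes.sum().backward()  # gradients flow through the surrogate rather than the hard threshold
```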
|
|
|
|
|
## Inference |
|
|
|
Inference code for this embedding model is available in the following gist:

https://gist.github.com/harishsg993010/220c24f0b2c41a6287a8579cd17c838f
|
|
|
|
|
## Usage (Python) |
|
|
|
Using this model with the custom implementation: |
|
|
|
```python |
|
# MTEBLlamaSNNLTCEncoder is provided in the inference gist linked above
from mteb_benchmark_snn_ltc import MTEBLlamaSNNLTCEncoder

# Load the model
model = MTEBLlamaSNNLTCEncoder('rootxhacker/arthemis-embedding')

# Encode sentences
sentences = ["This is an example sentence", "Each sentence is converted"]
embeddings = model.encode(sentences, task_name="similarity")

print(f"Embeddings shape: {embeddings.shape}")        # (2, 768)
print(f"Embedding dimension: {embeddings.shape[1]}")  # 768
|
``` |
|
|
|
## Usage (Custom Implementation) |
|
|
|
For direct usage with the neuromorphic architecture: |
|
|
|
```python |
|
from transformers import AutoTokenizer
from scipy.spatial.distance import cosine

# MTEBLlamaSNNLTCEncoder is provided in the inference gist linked above
from mteb_benchmark_snn_ltc import MTEBLlamaSNNLTCEncoder

# Initialize the tokenizer used by the encoder (GPT-2/DialoGPT vocabulary, 50,257 tokens)
tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
tokenizer.pad_token = tokenizer.eos_token

# Load the model
model = MTEBLlamaSNNLTCEncoder('rootxhacker/arthemis-embedding')

# Process text
sentences = ['This is an example sentence', 'Each sentence is converted']
embeddings = model.encode(sentences, task_name="embedding_task")

# Use embeddings for similarity
similarity = 1 - cosine(embeddings[0], embeddings[1])
print(f"Cosine similarity: {similarity:.4f}")
|
``` |
|
|
|
## Evaluation |
|
|
|
The model has been evaluated on 41 tasks from the **MTEB (Massive Text Embedding Benchmark)**: |
|
|
|
### MTEB Performance |
|
|
|
| Task Type | Average Score | Tasks Count | Best Individual Score | |
|
|-----------|---------------|-------------|----------------------| |
|
| **Classification** | **42.78** | 8 | Amazon Counterfactual: 65.43 | |
|
| **STS** | **39.96** | 8 | STS17: 58.48 | |
|
| **Clustering** | **28.54** | 8 | ArXiv Hierarchical: 49.82 | |
|
| **Retrieval** | **12.41** | 5 | Twitter URL: 53.78 | |
|
| **Other** | **13.07** | 12 | Ask Ubuntu: 43.56 | |
|
|
|
**Overall MTEB Score: 27.05** (across 41 tasks) |
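
As a hedged reproduction sketch (not the exact evaluation script), numbers like those above can be regenerated with the `mteb` package, assuming the `MTEBLlamaSNNLTCEncoder` class from the gist exposes the `encode` interface that MTEB expects:

```python
from mteb import MTEB

# The encoder class is provided in the inference gist linked above
from mteb_benchmark_snn_ltc import MTEBLlamaSNNLTCEncoder

model = MTEBLlamaSNNLTCEncoder('rootxhacker/arthemis-embedding')

# Evaluate on a small subset of the tasks reported above
evaluation = MTEB(tasks=["STS17", "Banking77Classification", "ImdbClassification"])
results = evaluation.run(model, output_folder="results/arthemis-embedding")
```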
|
|
|
### Notable Individual Results |
|
|
|
| Task | Score | Task Type | |
|
|------|-------|-----------| |
|
| Amazon Counterfactual Classification | 65.43 | Classification | |
|
| STS17 | 58.48 | Semantic Similarity | |
|
| Toxic Conversations Classification | 55.54 | Classification | |
|
| IMDB Classification | 51.69 | Classification | |
|
| SICK-R | 49.24 | Semantic Similarity | |
|
| ArXiv Hierarchical Clustering | 49.82 | Clustering | |
|
| Banking77 Classification | 29.98 | Classification | |
|
| STSBenchmark | 36.82 | Semantic Similarity | |
|
|
|
## Model Strengths |
|
|
|
- **Classification Performance**: Classification is the model's strongest MTEB category, averaging 42.78 across 8 tasks

- **Semantic Understanding**: Solid semantic textual similarity results, averaging 39.96 across 8 STS tasks
|
- **Neuromorphic Advantages**: Unique spiking neural architecture provides enhanced pattern recognition |
|
- **Temporal Processing**: Liquid time constants enable adaptive sequence processing |
|
- **Robust Embeddings**: 768-dimensional vectors capture rich semantic representations |
|
|
|
## Applications |
|
|
|
- **Text Classification**: Financial intent detection, sentiment analysis, content moderation |
|
- **Semantic Search**: Document retrieval and similarity matching (see the retrieval sketch after this list)
|
- **Clustering**: Automatic text organization and topic discovery |
|
- **Content Safety**: Toxic content detection and content moderation |
|
- **Question Answering**: Similarity-based answer retrieval |
|
- **Paraphrase Mining**: Finding semantically equivalent text pairs |
|
- **Semantic Textual Similarity**: Measuring text similarity for various applications |
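
To make the semantic search use case concrete, here is a minimal, hypothetical retrieval sketch built on the encoder from the gist. The corpus, query, `task_name` value, and ranking helper are made up for the example:

```python
import numpy as np

# The encoder class is provided in the inference gist linked above
from mteb_benchmark_snn_ltc import MTEBLlamaSNNLTCEncoder

model = MTEBLlamaSNNLTCEncoder('rootxhacker/arthemis-embedding')

corpus = [
    "How do I reset my online banking password?",
    "The weather in Paris is mild in spring.",
    "Steps to recover a forgotten bank account password.",
]
query = "I forgot my bank password, how can I change it?"

corpus_emb = model.encode(corpus, task_name="retrieval")
query_emb = model.encode([query], task_name="retrieval")[0]


def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Rank corpus documents by cosine similarity to the query
scores = [cosine_sim(query_emb, doc) for doc in corpus_emb]
for rank, idx in enumerate(np.argsort(scores)[::-1], start=1):
    print(f"{rank}. ({scores[idx]:.3f}) {corpus[idx]}")
```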
|
|
|
## Training Details |
|
|
|
The model was fine-tuned from the **arthemislm-base** foundation model on the following datasets (a short sketch of how the triplet data is typically used follows the list):
|
|
|
- **all-nli-pair**: Natural Language Inference pair datasets |
|
- **all-nli-pair-class**: Classification variants of NLI pairs |
|
- **all-nli-pair-score**: Scored NLI pairs for similarity learning |
|
- **all-nli-triplet**: Triplet learning from NLI data |
|
- **stsb**: Semantic Textual Similarity Benchmark |
|
- **quora**: Quora Question Pairs for paraphrase detection |
|
- **natural-questions**: Google's Natural Questions dataset |
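
To give a sense of how the triplet-style data is typically used, below is a minimal, hypothetical sketch that loads the `triplet` subset of `sentence-transformers/all-nli` and computes a cosine-margin triplet loss. The margin value and loss form are illustrative assumptions, not the actual Arthemis training recipe:

```python
import torch
import torch.nn.functional as F
from datasets import load_dataset

# The "triplet" subset provides (anchor, positive, negative) text triples
triplets = load_dataset("sentence-transformers/all-nli", "triplet", split="train")
example = triplets[0]
print(example["anchor"], example["positive"], example["negative"], sep="\n")


def triplet_margin_loss(anchor_emb, positive_emb, negative_emb, margin=0.5):
    """Pull anchors toward positives and away from negatives in cosine space."""
    pos_dist = 1.0 - F.cosine_similarity(anchor_emb, positive_emb)
    neg_dist = 1.0 - F.cosine_similarity(anchor_emb, negative_emb)
    return F.relu(pos_dist - neg_dist + margin).mean()


# Toy check with random 768-dimensional embeddings standing in for model outputs
a, p, n = (torch.randn(4, 768) for _ in range(3))
print(triplet_margin_loss(a, p, n))
```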
|
|
|
The neuromorphic enhancements were integrated during training to provide: |
|
- Spiking neuron dynamics in attention layers |
|
- Liquid time constant adaptation in feed-forward networks (a minimal LTC sketch follows this list)
|
- Surrogate gradient optimization for spike-based learning |
|
- Enhanced temporal pattern recognition capabilities |
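
For intuition about the liquid-time-constant component, here is a minimal, hypothetical LTC-style cell in PyTorch: the hidden state decays toward an input-driven target with an input-dependent time constant rather than a fixed one. The sizes mirror the specifications below (768-dimensional inputs, LTC hidden size 256), but the parameterization itself is an illustrative assumption, not the actual Arthemis implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LiquidTimeConstantCell(nn.Module):
    """Toy LTC-style update: the decay rate (time constant) depends on the input."""

    def __init__(self, input_size=768, hidden_size=256):
        super().__init__()
        self.input_map = nn.Linear(input_size, hidden_size)
        self.recurrent_map = nn.Linear(hidden_size, hidden_size)
        self.tau_map = nn.Linear(input_size + hidden_size, hidden_size)

    def forward(self, x, h, dt=1.0):
        # Input-conditioned time constant: larger tau -> slower state change
        tau = F.softplus(self.tau_map(torch.cat([x, h], dim=-1))) + 1e-3
        target = torch.tanh(self.input_map(x) + self.recurrent_map(h))
        # One Euler step of dh/dt = (target - h) / tau
        return h + dt * (target - h) / tau


cell = LiquidTimeConstantCell()
x = torch.randn(2, 768)  # token representations from the transformer
h = torch.zeros(2, 256)  # liquid state
h = cell(x, h)
print(h.shape)  # torch.Size([2, 256])
```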
|
|
|
## Technical Specifications |
|
|
|
``` |
|
Architecture: Transformer with SNN/LTC enhancements |
|
Hidden Size: 768 |
|
Intermediate Size: 2048 |
|
Attention Heads: 12 |
|
Layers: 12 |
|
Max Position Embeddings: 1024 |
|
Vocabulary Size: 50,257 |
|
Spiking Threshold: 1.0 |
|
LTC Hidden Size: 256 |
|
Training Precision: FP32 |
|
``` |
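
For convenience, the same specifications can be mirrored in a small config object; the class and field names below are illustrative assumptions, not the model's actual configuration class:

```python
from dataclasses import dataclass


@dataclass
class ArthemisEmbeddingConfig:
    # Values copied from the specification block above
    hidden_size: int = 768
    intermediate_size: int = 2048
    num_attention_heads: int = 12
    num_hidden_layers: int = 12
    max_position_embeddings: int = 1024
    vocab_size: int = 50257
    spiking_threshold: float = 1.0
    ltc_hidden_size: int = 256


print(ArthemisEmbeddingConfig())
```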
|
|
|
## Citation |
|
|
|
```bibtex |
|
@misc{arthemis-embedding-2024, |
|
title={Arthemis Embedding: A Neuromorphic Text Embedding Model}, |
|
author={rootxhacker}, |
|
year={2024}, |
|
howpublished={\url{https://huggingface.co/rootxhacker/arthemis-embedding}} |
|
} |
|
``` |
|
|
|
## License |
|
|
|
This model is released under the MIT license, as declared in the model card metadata.