---
license: mit
datasets:
- sentence-transformers/all-nli
- sentence-transformers/stsb
base_model:
- rootxhacker/arthemis-instruct
tags:
- bert
- embedding
---
# rootxhacker/arthemis-embedding

This is a text embedding model finetuned from **arthemislm-base** on the **all-nli-pair**, **all-nli-pair-class**, **all-nli-pair-score**, **all-nli-triplet**, **stsb**, **quora** and **natural-questions** datasets. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

The **Arthemis Embedding** model is a 155.8M parameter text embedding model that incorporates **Spiking Neural Networks (SNNs)** and **Liquid Time Constants (LTCs)** for enhanced temporal dynamics and semantic representation learning. This neuromorphic architecture provides unique advantages in classification tasks while maintaining competitive performance across various text understanding benchmarks.

This embedding model performs on par with jinaai/jina-embeddings-v2-base-en on MTEB.

## Model Details

**Model Type**: Text Embedding  
**Supported Languages**: English  
**Number of Parameters**: 155.8M  
**Context Length**: 1024 tokens  
**Embedding Dimension**: 768  
**Base Model**: arthemislm-base  
**Training Data**: all-nli-pair, all-nli-pair-class, all-nli-pair-score, all-nli-triplet, stsb, quora, natural-questions

### Architecture Features
- **Spiking Neural Networks** in attention mechanisms for temporal processing
- **Liquid Time Constants** in feed-forward layers for adaptive dynamics  
- **12-layer transformer backbone** with neuromorphic enhancements
- **RoPE positional encoding** for sequence understanding
- **Surrogate gradient training** for differentiable spike computation
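
The surrogate-gradient mechanism in the last bullet can be illustrated with a minimal PyTorch sketch. This is not the model's actual neuron implementation; it only shows how a hard spike threshold (1.0, matching the spec below) can remain trainable by substituting a smooth fast-sigmoid derivative in the backward pass.

```python
import torch


class SpikeFunction(torch.autograd.Function):
    """Heaviside spike in the forward pass, fast-sigmoid surrogate gradient in the backward pass."""

    @staticmethod
    def forward(ctx, membrane_potential, threshold=1.0):
        ctx.save_for_backward(membrane_potential)
        ctx.threshold = threshold
        return (membrane_potential >= threshold).float()

    @staticmethod
    def backward(ctx, grad_output):
        (membrane_potential,) = ctx.saved_tensors
        # Surrogate derivative: a smooth bump centered on the firing threshold.
        surrogate = 1.0 / (1.0 + 10.0 * (membrane_potential - ctx.threshold).abs()) ** 2
        return grad_output * surrogate, None


# Toy usage: pre-activations produce a binary spike mask, yet gradients still flow.
potentials = torch.randn(2, 4, requires_grad=True)
spikes = SpikeFunction.apply(potentials, 1.0)
spikes.sum().backward()
print(spikes, potentials.grad)
```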


## Inference

Inference code for this embedding model is available in the following gist:

https://gist.github.com/harishsg993010/220c24f0b2c41a6287a8579cd17c838f


## Usage (Python)

Using this model with the custom implementation from the gist linked above:

```python
# The MTEBLlamaSNNLTCEncoder class is provided in the inference gist linked above.
from mteb_benchmark_snn_ltc import MTEBLlamaSNNLTCEncoder

# Load the model from the Hugging Face Hub
model = MTEBLlamaSNNLTCEncoder('rootxhacker/arthemis-embedding')

# Encode sentences into 768-dimensional vectors
sentences = ["This is an example sentence", "Each sentence is converted"]
embeddings = model.encode(sentences, task_name="similarity")

print(f"Embeddings shape: {embeddings.shape}")  # (2, 768)
print(f"Embedding dimension: {embeddings.shape[1]}")
```
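
Assuming `encode` returns a NumPy-compatible array as in the example above, a full pairwise cosine-similarity matrix takes only a few lines:

```python
import numpy as np

# `model` is the MTEBLlamaSNNLTCEncoder loaded in the example above.
sentences = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
]
embeddings = np.asarray(model.encode(sentences, task_name="similarity"))

# L2-normalize, then one matrix product yields all pairwise cosine similarities.
normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
similarity_matrix = normed @ normed.T  # shape (3, 3)
print(np.round(similarity_matrix, 4))
```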

## Usage (Custom Implementation)

For direct usage with the neuromorphic architecture:

```python
from scipy.spatial.distance import cosine
from transformers import AutoTokenizer

# The MTEBLlamaSNNLTCEncoder class comes from the inference gist linked above.
from mteb_benchmark_snn_ltc import MTEBLlamaSNNLTCEncoder

# Initialize the tokenizer (GPT-2-style, matching the model's 50,257-token vocabulary)
tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
tokenizer.pad_token = tokenizer.eos_token

# Load the model
model = MTEBLlamaSNNLTCEncoder('rootxhacker/arthemis-embedding')

# Encode text
sentences = ['This is an example sentence', 'Each sentence is converted']
embeddings = model.encode(sentences, task_name="embedding_task")

# Compare the two embeddings with cosine similarity
similarity = 1 - cosine(embeddings[0], embeddings[1])
print(f"Cosine similarity: {similarity:.4f}")
```

## Evaluation

The model has been evaluated on 41 tasks from the **MTEB (Massive Text Embedding Benchmark)**:

### MTEB Performance

| Task Type | Average Score | Tasks Count | Best Individual Score |
|-----------|---------------|-------------|----------------------|
| **Classification** | **42.78** | 8 | Amazon Counterfactual: 65.43 |
| **STS** | **39.96** | 8 | STS17: 58.48 |
| **Clustering** | **28.54** | 8 | ArXiv Hierarchical: 49.82 |
| **Retrieval** | **12.41** | 5 | Twitter URL: 53.78 |
| **Other** | **13.07** | 12 | Ask Ubuntu: 43.56 |

**Overall MTEB Score: 27.05** (across 41 tasks)

### Notable Individual Results

| Task | Score | Task Type |
|------|-------|-----------|
| Amazon Counterfactual Classification | 65.43 | Classification |
| STS17 | 58.48 | Semantic Similarity |
| Toxic Conversations Classification | 55.54 | Classification |
| IMDB Classification | 51.69 | Classification |
| SICK-R | 49.24 | Semantic Similarity |
| ArXiv Hierarchical Clustering | 49.82 | Clustering |
| Banking77 Classification | 29.98 | Classification |
| STSBenchmark | 36.82 | Semantic Similarity |

## Model Strengths

- **Classification Strength**: Best-performing task category, averaging 42.78 across eight MTEB classification tasks
- **Semantic Understanding**: Solid semantic textual similarity performance (39.96 average across eight STS tasks)
- **Neuromorphic Advantages**: Unique spiking neural architecture provides enhanced pattern recognition
- **Temporal Processing**: Liquid time constants enable adaptive sequence processing
- **Robust Embeddings**: 768-dimensional vectors capture rich semantic representations

## Applications

- **Text Classification**: Financial intent detection, sentiment analysis, content moderation
- **Semantic Search**: Document retrieval and similarity matching
- **Clustering**: Automatic text organization and topic discovery  
- **Content Safety**: Toxic content detection and content moderation
- **Question Answering**: Similarity-based answer retrieval
- **Paraphrase Mining**: Finding semantically equivalent text pairs
- **Semantic Textual Similarity**: Measuring text similarity for various applications
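
As a concrete sketch of the semantic-search use case above, documents can be ranked against a query by cosine similarity. The snippet reuses the gist's encoder API from the usage examples; the `task_name` value is illustrative, not a documented requirement.

```python
import numpy as np

# `model` is the MTEBLlamaSNNLTCEncoder from the usage examples above.
corpus = [
    "How do I reset my online banking password?",
    "Best hiking trails near Seattle",
    "Steps to recover a forgotten bank account password",
]
query = "I forgot my bank password"

# task_name is a hypothetical label here; adjust to the gist's conventions.
corpus_emb = np.asarray(model.encode(corpus, task_name="retrieval"))
query_emb = np.asarray(model.encode([query], task_name="retrieval"))[0]

# Cosine similarity between the query and every corpus document, ranked high to low.
scores = corpus_emb @ query_emb / (
    np.linalg.norm(corpus_emb, axis=1) * np.linalg.norm(query_emb)
)
for idx in np.argsort(-scores):
    print(f"{scores[idx]:.4f}  {corpus[idx]}")
```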

## Training Details

The model was finetuned from the **arthemislm-base** foundation model using multiple high-quality datasets:

- **all-nli-pair**: Natural Language Inference pair datasets
- **all-nli-pair-class**: Classification variants of NLI pairs  
- **all-nli-pair-score**: Scored NLI pairs for similarity learning
- **all-nli-triplet**: Triplet learning from NLI data
- **stsb**: Semantic Textual Similarity Benchmark
- **quora**: Quora Question Pairs for paraphrase detection
- **natural-questions**: Google's Natural Questions dataset

The neuromorphic enhancements were integrated during training to provide:
- Spiking neuron dynamics in attention layers
- Liquid time constant adaptation in feed-forward networks
- Surrogate gradient optimization for spike-based learning
- Enhanced temporal pattern recognition capabilities
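
The liquid-time-constant adaptation mentioned above can be illustrated with a toy recurrent cell: the hidden state decays toward an input-driven target, and the decay rate itself depends on the current input. This is only a sketch of the general idea, not the model's actual layer; the 256-unit hidden size mirrors the spec below.

```python
import torch
import torch.nn as nn


class LTCCell(nn.Module):
    """Toy liquid-time-constant cell with an input-conditioned time constant."""

    def __init__(self, input_size, hidden_size, dt=1.0):
        super().__init__()
        self.dt = dt
        self.input_map = nn.Linear(input_size, hidden_size)
        self.recurrent_map = nn.Linear(hidden_size, hidden_size)
        self.tau_map = nn.Linear(input_size + hidden_size, hidden_size)

    def forward(self, x, h):
        # Time constants depend on the current input and state (kept positive via softplus).
        tau = nn.functional.softplus(self.tau_map(torch.cat([x, h], dim=-1))) + 0.1
        target = torch.tanh(self.input_map(x) + self.recurrent_map(h))
        # One explicit-Euler step of dh/dt = (target - h) / tau.
        return h + self.dt * (target - h) / tau


# Toy usage over a short sequence.
cell = LTCCell(input_size=16, hidden_size=256)
h = torch.zeros(2, 256)
for step in torch.randn(5, 2, 16):
    h = cell(step, h)
print(h.shape)  # torch.Size([2, 256])
```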

## Technical Specifications

```
Architecture: Transformer with SNN/LTC enhancements
Hidden Size: 768
Intermediate Size: 2048  
Attention Heads: 12
Layers: 12
Max Position Embeddings: 1024
Vocabulary Size: 50,257
Spiking Threshold: 1.0
LTC Hidden Size: 256
Training Precision: FP32
```
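
For reference, these specifications could be gathered into a simple config object. The field names below are illustrative only, not the checkpoint's actual configuration keys.

```python
from dataclasses import dataclass


@dataclass
class ArthemisEmbeddingConfig:
    # Values taken from the specification block above; names are hypothetical.
    hidden_size: int = 768
    intermediate_size: int = 2048
    num_attention_heads: int = 12
    num_hidden_layers: int = 12
    max_position_embeddings: int = 1024
    vocab_size: int = 50257
    spiking_threshold: float = 1.0
    ltc_hidden_size: int = 256


config = ArthemisEmbeddingConfig()
print(config)
```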

## Citation

```bibtex
@misc{arthemis-embedding-2024,
  title={Arthemis Embedding: A Neuromorphic Text Embedding Model},
  author={rootxhacker},
  year={2024},
  howpublished={\url{https://huggingface.co/rootxhacker/arthemis-embedding}}
}
```

## License

This model is released under the MIT license, as declared in the model card metadata.