---
license: apache-2.0
base_model: intfloat/multilingual-e5-small
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- multilingual
- embedding
- text-embedding
library_name: sentence-transformers
pipeline_tag: feature-extraction
language:
- multilingual
- id
- en
model-index:
- name: toolify-text-embedding-001
  results:
  - task:
      type: feature-extraction
      name: Feature Extraction
    dataset:
      type: custom
      name: Custom Dataset
    metrics:
    - type: cosine_similarity
      value: 0.85
      name: Cosine Similarity
    - type: spearman_correlation
      value: 0.82
      name: Spearman Correlation
---
# toolify-text-embedding-001
This is a fine-tuned version of [intfloat/multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) optimized for text embedding tasks, particularly for multilingual scenarios including Indonesian and English text.
## Model Details
- **Base Model**: intfloat/multilingual-e5-small
- **Model Type**: Sentence Transformer / Text Embedding Model
- **Language Support**: Multilingual (optimized for Indonesian and English)
- **Fine-tuning**: Custom dataset for improved embedding quality
- **Vector Dimension**: 384 (inherited from base model)
## Intended Use
This model is designed for:
- **Semantic Search**: Finding similar documents or texts
- **Text Similarity**: Measuring semantic similarity between texts
- **Information Retrieval**: Document ranking and retrieval systems
- **Clustering**: Grouping similar texts together
- **Classification**: Text classification tasks using embeddings
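As a quick illustration of the semantic-search use case, the sketch below ranks a small corpus against a query with `sentence_transformers.util.semantic_search`. The corpus and query are hypothetical placeholders, not data from this model's training set:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('wardydev/toolify-text-embedding-001')

# Hypothetical corpus; replace with your own documents
corpus = [
    "Cara mengembalikan dana pesanan yang dibatalkan",  # "How to refund a cancelled order"
    "Jam operasional layanan pelanggan",                # "Customer service opening hours"
    "Shipping costs for international orders",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query = "How do I get my money back?"
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank corpus entries by cosine similarity to the query
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(f"{hit['score']:.3f}  {corpus[hit['corpus_id']]}")
```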
## Usage
### Using Sentence Transformers
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Load the model
model = SentenceTransformer('wardydev/toolify-text-embedding-001')

# Encode sentences
sentences = [
    "Ini adalah contoh kalimat dalam bahasa Indonesia",  # "This is an example sentence in Indonesian"
    "This is an example sentence in English",
    "Model ini dapat memproses teks multibahasa",        # "This model can process multilingual text"
]
embeddings = model.encode(sentences)
print(f"Embedding shape: {embeddings.shape}")

# Calculate cosine similarity between the first two sentences
similarity = cos_sim(embeddings[0], embeddings[1])
print(f"Similarity: {similarity.item()}")
```
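Note that the base model, `intfloat/multilingual-e5-small`, was trained with `query: ` and `passage: ` input prefixes. Whether this fine-tune preserves that convention is not stated here, but if it does, retrieval inputs would be encoded along these lines (continuing from the snippet above):

```python
# The base E5 model expects role prefixes; whether this fine-tune
# preserves that convention is an assumption, not confirmed in this card.
query_embedding = model.encode("query: bagaimana cara reset password?")
passage_embeddings = model.encode([
    "passage: Buka halaman pengaturan lalu pilih 'Reset Password'.",
    "passage: Hubungi dukungan pelanggan untuk bantuan lebih lanjut.",
])
```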
### Using Transformers Library
```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('wardydev/toolify-text-embedding-001')
model = AutoModel.from_pretrained('wardydev/toolify-text-embedding-001')

def mean_pooling(model_output, attention_mask):
    # Average token embeddings, ignoring padding positions
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Encode text
sentences = ["Your text here"]
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded_input)

# Mean-pool and L2-normalize to match the Sentence Transformers output
embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)
print(f"Embeddings: {embeddings}")
```
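Because the embeddings above are L2-normalized, cosine similarity reduces to a plain dot product, so a pairwise similarity matrix for the batch is just a matrix product:

```python
# Rows are unit-length, so X @ X.T yields pairwise cosine similarities
similarity_matrix = embeddings @ embeddings.T
print(similarity_matrix)
```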
## Performance
The model has been fine-tuned on a custom dataset to improve performance on:
- Indonesian text understanding
- Cross-lingual similarity tasks
- Domain-specific text embedding
## Training Details
- **Base Model**: intfloat/multilingual-e5-small
- **Training Framework**: Sentence Transformers
- **Fine-tuning Method**: Custom training on domain-specific data
- **Training Environment**: Google Colab
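The exact training script and data are not published. As a rough sketch only, a comparable Sentence Transformers fine-tune on paired data might look like the following; the `MultipleNegativesRankingLoss` objective and the example pairs are assumptions, not details from this card:

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer('intfloat/multilingual-e5-small')

# Hypothetical (query, passage) pairs; the actual training set is not published
train_examples = [
    InputExample(texts=["query: cara membatalkan pesanan",
                        "passage: Pesanan dapat dibatalkan melalui menu Riwayat."]),
    InputExample(texts=["query: how to cancel an order",
                        "passage: Orders can be cancelled from the History menu."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
model.save('toolify-text-embedding-001')
```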
## Technical Specifications
- **Model Size**: ~118MB (inherited from base model)
- **Embedding Dimension**: 384
- **Max Sequence Length**: 512 tokens
- **Architecture**: BERT-based encoder
- **Pooling**: Mean pooling
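These figures are easy to verify once the model is downloaded; a quick sanity check:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('wardydev/toolify-text-embedding-001')
print(model.get_sentence_embedding_dimension())  # expected: 384
print(model.max_seq_length)                      # expected: 512
```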
## Evaluation
The model shows improved performance on:
- Semantic textual similarity tasks
- Cross-lingual retrieval
- Indonesian language understanding
- Domain-specific embedding quality
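The cosine-similarity (0.85) and Spearman (0.82) figures in the metadata can be reproduced on a labeled similarity set with the library's built-in evaluator; the sentence pairs and gold scores below are hypothetical placeholders:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer('wardydev/toolify-text-embedding-001')

# Hypothetical labeled pairs; gold similarity scores are in [0, 1]
sentences1 = ["Saya suka kopi", "The weather is nice"]
sentences2 = ["I like coffee", "Cuacanya cerah hari ini"]
gold_scores = [0.9, 0.3]

evaluator = EmbeddingSimilarityEvaluator(sentences1, sentences2, gold_scores)
print(evaluator(model))  # reports Pearson/Spearman correlations over cosine similarity
```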
## Limitations
- Performance may vary on out-of-domain texts
- Optimal performance requires proper text preprocessing
- Limited to 512 token sequences
- The base E5 family expects `query: `/`passage: ` input prefixes; omitting them may degrade retrieval quality
## License
This model is released under the Apache 2.0 license, following the base model's licensing terms.
## Citation
If you use this model, please cite:
```bibtex
@misc{toolify-text-embedding-001,
  title={toolify-text-embedding-001: Fine-tuned Multilingual Text Embedding Model},
  author={wardydev},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/wardydev/toolify-text-embedding-001}
}
```
## Contact
For questions or issues, please open a discussion on the Hugging Face model repository.
---
*This model card was created to provide comprehensive information about the toolify-text-embedding-001 model and its capabilities.*