---
license: apache-2.0
base_model: intfloat/multilingual-e5-small
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- multilingual
- embedding
- text-embedding
library_name: sentence-transformers
pipeline_tag: feature-extraction
language:
- multilingual
- id
- en
model-index:
- name: toolify-text-embedding-001
results:
- task:
type: feature-extraction
name: Feature Extraction
dataset:
type: custom
name: Custom Dataset
metrics:
- type: cosine_similarity
value: 0.85
name: Cosine Similarity
- type: spearman_correlation
value: 0.82
name: Spearman Correlation
---
# toolify-text-embedding-001
This is a fine-tuned version of [intfloat/multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) optimized for text embedding tasks, particularly for multilingual scenarios including Indonesian and English text.
## Model Details
- **Base Model**: intfloat/multilingual-e5-small
- **Model Type**: Sentence Transformer / Text Embedding Model
- **Language Support**: Multilingual (optimized for Indonesian and English)
- **Fine-tuning**: Custom dataset for improved embedding quality
- **Vector Dimension**: 384 (inherited from base model)
## Intended Use
This model is designed for:
- **Semantic Search**: Finding similar documents or texts (see the sketch after this list)
- **Text Similarity**: Measuring semantic similarity between texts
- **Information Retrieval**: Document ranking and retrieval systems
- **Clustering**: Grouping similar texts together
- **Classification**: Text classification tasks using embeddings
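As a concrete example, here is a minimal semantic-search sketch using the `util.semantic_search` helper from sentence-transformers; the corpus and query are hypothetical placeholders:
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('wardydev/toolify-text-embedding-001')

# Hypothetical corpus mixing Indonesian and English entries
corpus = [
    "Cara mengatur ulang kata sandi akun",       # "How to reset an account password"
    "Jadwal keberangkatan kereta api hari ini",  # "Today's train departure schedule"
    "Resep masakan tradisional Indonesia",       # "Traditional Indonesian recipes"
]
query = "I forgot my password"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank corpus entries by cosine similarity to the query
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit['corpus_id']], round(hit['score'], 3))
```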
## Usage
### Using Sentence Transformers
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Load the model
model = SentenceTransformer('wardydev/toolify-text-embedding-001')

# Encode sentences
sentences = [
    "Ini adalah contoh kalimat dalam bahasa Indonesia",
    "This is an example sentence in English",
    "Model ini dapat memproses teks multibahasa",
]
embeddings = model.encode(sentences)
print(f"Embedding shape: {embeddings.shape}")

# Calculate cosine similarity between the first two sentences
similarity = cos_sim(embeddings[0], embeddings[1])
print(f"Similarity: {similarity.item()}")
```
### Using Transformers Library
```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('wardydev/toolify-text-embedding-001')
model = AutoModel.from_pretrained('wardydev/toolify-text-embedding-001')

def mean_pooling(model_output, attention_mask):
    # Average token embeddings, ignoring padding positions via the attention mask
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Encode text
sentences = ["Your text here"]
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded_input)

# Pool and L2-normalize so cosine similarity reduces to a dot product
embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)
print(f"Embeddings: {embeddings}")
```
## Performance
The model has been fine-tuned on a custom dataset to improve performance on:
- Indonesian text understanding
- Cross-lingual similarity tasks (a quick check follows this list)
- Domain-specific text embedding
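To illustrate the cross-lingual case, a quick sanity check (illustrative, not a benchmark) is to compare a translation pair against an unrelated pair:
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer('wardydev/toolify-text-embedding-001')
embeddings = model.encode([
    "Saya suka membaca buku",     # Indonesian: "I like reading books"
    "I like reading books",       # English translation
    "The weather is cold today",  # unrelated sentence
])

# The translation pair should score noticeably higher than the unrelated pair
print("translation pair:", cos_sim(embeddings[0], embeddings[1]).item())
print("unrelated pair:  ", cos_sim(embeddings[0], embeddings[2]).item())
```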
## Training Details
- **Base Model**: intfloat/multilingual-e5-small
- **Training Framework**: Sentence Transformers
- **Fine-tuning Method**: Custom training on domain-specific data (an illustrative sketch follows this list)
- **Training Environment**: Google Colab
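The exact training recipe is not published in this card; as a rough illustration of fine-tuning with the Sentence Transformers framework named above, the sketch below uses `MultipleNegativesRankingLoss` on hypothetical (query, passage) pairs. The loss choice, hyperparameters, and data are all assumptions, not the actual setup:
```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Start from the base model named in this card
model = SentenceTransformer('intfloat/multilingual-e5-small')

# Hypothetical training pairs; the real dataset is not published
train_examples = [
    InputExample(texts=["query: cara reset kata sandi",
                        "passage: Buka menu pengaturan, lalu pilih 'Atur ulang kata sandi'."]),
    InputExample(texts=["query: refund policy",
                        "passage: Refunds are processed within 14 business days."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# In-batch negatives ranking loss, a common choice for embedding fine-tuning
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1, warmup_steps=100)
```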
## Technical Specifications
- **Model Size**: ~118M parameters (inherited from base model)
- **Embedding Dimension**: 384 (verified in the snippet after this list)
- **Max Sequence Length**: 512 tokens
- **Architecture**: BERT-based encoder
- **Pooling**: Mean pooling
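Both the embedding dimension and the sequence limit can be checked directly from the loaded model:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('wardydev/toolify-text-embedding-001')

# Both values are inherited from intfloat/multilingual-e5-small
print(model.get_sentence_embedding_dimension())  # expected: 384
print(model.max_seq_length)                      # expected: 512
```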
## Evaluation
On the custom evaluation set reported in the metadata above (cosine similarity 0.85, Spearman correlation 0.82), the model shows improved performance on:
- Semantic textual similarity tasks
- Cross-lingual retrieval
- Indonesian language understanding
- Domain-specific embedding quality
## Limitations
- Performance may vary on out-of-domain texts
- Optimal performance requires proper text preprocessing
- Inputs longer than 512 tokens are truncated
- The base E5 family was trained with "query: " and "passage: " input prefixes, so this model may also work best when they are used (see the sketch after this list)
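Since the card does not state whether the fine-tuning data kept the E5 prefix convention, treat the following as a hedged sketch of the prefix usage inherited from the base model:
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer('wardydev/toolify-text-embedding-001')

# E5 convention: queries and passages receive distinct prefixes
query = model.encode("query: how do I reset my password?")
passage = model.encode("passage: Open the settings menu and choose 'Reset password'.")
print(cos_sim(query, passage).item())
```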
## License
This model is released under the Apache 2.0 license, following the base model's licensing terms.
## Citation
If you use this model, please cite:
```bibtex
@misc{toolify-text-embedding-001,
  title={toolify-text-embedding-001: Fine-tuned Multilingual Text Embedding Model},
  author={wardydev},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/wardydev/toolify-text-embedding-001}
}
```
## Contact
For questions or issues, please open a discussion in the Hugging Face model repository.
---
*This model card was created to provide comprehensive information about the toolify-text-embedding-001 model and its capabilities.*