|
---
license: apache-2.0
base_model: intfloat/multilingual-e5-small
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- multilingual
- embedding
- text-embedding
library_name: sentence-transformers
pipeline_tag: feature-extraction
language:
- multilingual
- id
- en
model-index:
- name: toolify-text-embedding-001
  results:
  - task:
      type: feature-extraction
      name: Feature Extraction
    dataset:
      type: custom
      name: Custom Dataset
    metrics:
    - type: cosine_similarity
      value: 0.85
      name: Cosine Similarity
    - type: spearman_correlation
      value: 0.82
      name: Spearman Correlation
---
|
|
|
# toolify-text-embedding-001 |
|
|
|
This model is a fine-tuned version of [intfloat/multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small), optimized for text embedding tasks, particularly multilingual scenarios involving Indonesian and English text.
|
|
|
## Model Details |
|
|
|
- **Base Model**: intfloat/multilingual-e5-small |
|
- **Model Type**: Sentence Transformer / Text Embedding Model |
|
- **Language Support**: Multilingual (optimized for Indonesian and English) |
|
- **Fine-tuning**: Custom dataset for improved embedding quality |
|
- **Vector Dimension**: 384 (inherited from base model) |
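
The 384-dimensional output noted above can be verified directly once the model is loaded. A minimal sketch, assuming the checkpoint loads as a standard Sentence Transformers model:

```python
from sentence_transformers import SentenceTransformer

# Load the published checkpoint and inspect its output dimension
model = SentenceTransformer('wardydev/toolify-text-embedding-001')
print(model.get_sentence_embedding_dimension())  # expected: 384
```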
|
|
|
## Intended Use |
|
|
|
This model is designed for: |
|
- **Semantic Search**: Finding similar documents or texts (see the sketch after this list)
|
- **Text Similarity**: Measuring semantic similarity between texts |
|
- **Information Retrieval**: Document ranking and retrieval systems |
|
- **Clustering**: Grouping similar texts together |
|
- **Classification**: Text classification tasks using embeddings |
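
For the semantic-search use case flagged above, here is a minimal sketch using `sentence_transformers.util.semantic_search`; the corpus and query are illustrative placeholders, not from the training data:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import semantic_search

model = SentenceTransformer('wardydev/toolify-text-embedding-001')

# Illustrative corpus and query; replace with your own documents
corpus = [
    "Cara membuat kopi yang enak",          # "How to make good coffee"
    "Resep masakan tradisional Indonesia",  # "Traditional Indonesian recipes"
    "Machine learning for beginners",
]
query = "How do I brew great coffee?"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode([query], convert_to_tensor=True)

# Returns, per query, the top_k corpus entries ranked by cosine similarity
hits = semantic_search(query_embedding, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(corpus[hit['corpus_id']], hit['score'])
```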
|
|
|
## Usage |
|
|
|
### Using Sentence Transformers |
|
|
|
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Load the model
model = SentenceTransformer('wardydev/toolify-text-embedding-001')

# Encode sentences (Indonesian and English)
sentences = [
    "Ini adalah contoh kalimat dalam bahasa Indonesia",  # "This is an example sentence in Indonesian"
    "This is an example sentence in English",
    "Model ini dapat memproses teks multibahasa"         # "This model can process multilingual text"
]

embeddings = model.encode(sentences)
print(f"Embedding shape: {embeddings.shape}")  # (3, 384)

# Calculate cosine similarity between the first two sentences
similarity = cos_sim(embeddings[0], embeddings[1])
print(f"Similarity: {similarity.item()}")
```
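
Note: the base E5 family was trained with `"query: "` and `"passage: "` input prefixes. Whether this fine-tune expects the same convention is not documented here, so the prefixed encoding below is an assumption worth comparing against unprefixed input on your own data:

```python
# Assumption: E5-style prefixes inherited from the base model; verify on your data
query_emb = model.encode(["query: bagaimana cara kerja model ini?"])  # "how does this model work?"
passage_emb = model.encode(["passage: Model ini menghasilkan vektor 384 dimensi."])  # "This model produces 384-dimensional vectors."
```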
|
|
|
### Using the Transformers Library
|
|
|
```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('wardydev/toolify-text-embedding-001')
model = AutoModel.from_pretrained('wardydev/toolify-text-embedding-001')

def mean_pooling(model_output, attention_mask):
    # Average token embeddings, ignoring padding tokens via the attention mask
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Encode text
sentences = ["Your text here"]
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded_input)

# Mean-pool token embeddings, then L2-normalize
embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)

print(f"Embeddings: {embeddings}")
```
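
Because the final step L2-normalizes the embeddings, cosine similarity between any two of them reduces to a plain dot product:

```python
# Valid only because the embeddings were L2-normalized above
scores = embeddings @ embeddings.T
print(scores)  # pairwise cosine-similarity matrix
```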
|
|
|
## Performance |
|
|
|
The model has been fine-tuned on a custom dataset to improve performance on: |
|
- Indonesian text understanding |
|
- Cross-lingual similarity tasks |
|
- Domain-specific text embedding |
|
|
|
## Training Details |
|
|
|
- **Base Model**: intfloat/multilingual-e5-small |
|
- **Training Framework**: Sentence Transformers |
|
- **Fine-tuning Method**: Custom training on domain-specific data |
|
- **Training Environment**: Google Colab |
|
|
|
## Technical Specifications |
|
|
|
- **Model Size**: ~118M parameters (inherited from base model)
|
- **Embedding Dimension**: 384 |
|
- **Max Sequence Length**: 512 tokens |
|
- **Architecture**: BERT-based encoder |
|
- **Pooling**: Mean pooling |
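
Both the sequence limit and the pooling setup can be checked from the loaded model. A minimal sketch, assuming the standard Sentence Transformers module layout:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('wardydev/toolify-text-embedding-001')
print(model.max_seq_length)  # expected: 512
print(model)                 # should show a Transformer module followed by mean Pooling
```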
|
|
|
## Evaluation |
|
|
|
The model shows improved performance on: |
|
- Semantic textual similarity tasks |
|
- Cross-lingual retrieval |
|
- Indonesian language understanding |
|
- Domain-specific embedding quality |
|
|
|
## Limitations |
|
|
|
- Performance may vary on out-of-domain texts

- Optimal performance requires consistent text preprocessing

- Inputs longer than 512 tokens are truncated (a chunking workaround is sketched below)

- May require E5-style `"query: "`/`"passage: "` prefixes for best results (see the note under Usage)
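
A common workaround for the 512-token limit is to split long documents into chunks, embed each chunk, and average the results. A minimal sketch; whether mean-of-chunks is adequate for your retrieval task should be validated:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('wardydev/toolify-text-embedding-001')

def embed_long_text(text: str, chunk_chars: int = 1000) -> np.ndarray:
    # Naive character-based chunking; a tokenizer-aware splitter would be more precise
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    chunk_embeddings = model.encode(chunks, normalize_embeddings=True)
    # Average the chunk vectors, then re-normalize to unit length
    doc_embedding = chunk_embeddings.mean(axis=0)
    return doc_embedding / np.linalg.norm(doc_embedding)
```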
|
|
|
## License |
|
|
|
This model is released under the Apache 2.0 license, following the base model's licensing terms. |
|
|
|
## Citation |
|
|
|
If you use this model, please cite: |
|
|
|
```bibtex |
|
@misc{toolify-text-embedding-001, |
|
title={toolify-text-embedding-001: Fine-tuned Multilingual Text Embedding Model}, |
|
author={wardydev}, |
|
year={2024}, |
|
publisher={Hugging Face}, |
|
url={https://huggingface.co/wardydev/toolify-text-embedding-001} |
|
} |
|
``` |
|
|
|
## Contact |
|
|
|
For questions or issues, please open a discussion on the Hugging Face model repository.
|
|
|
--- |
|
|
|
*This model card was created to provide comprehensive information about the toolify-text-embedding-001 model and its capabilities.* |