---
tags:
- sentence-transformers
- embeddings
- multilingual
- NLP
- Indic-languages
- semantic-search
- similarity
library_name: sentence-transformers
license: other
license_name: krutrim-community-license-agreement-version-1.0
license_link: LICENSE.md
---

# Vyakyarth: A Multilingual Sentence Embedding Model for Indic Languages

[Hugging Face](https://huggingface.co/krutrim-ai-labs/vyakyarth) · [GitHub](https://github.com/ola-krutrim/Vyakyarth) · [Krutrim Cloud](https://cloud.olakrutrim.com/console/inference-service?section=models&modelName=Krutrim&artifactName=Vyakyarth&artifactType=model) · [Krutrim AI Labs](https://ai-labs.olakrutrim.com/models/Vyakyarth-1-Indic-Embedding)

This is a [sentence-transformers](https://www.SBERT.net) model fine-tuned from [sentence-transformers/stsb-xlm-r-multilingual](https://huggingface.co/sentence-transformers/stsb-xlm-r-multilingual). It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

[Video on YouTube](https://www.youtube.com/watch?v=N1f8IlZCUi4)

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Download the model from the 🤗 Hub
model = SentenceTransformer("krutrim-ai-labs/vyakyarth")

# Encode a Hindi sentence, its English translation, and an unrelated sentence
sentences = ["मैं अपने दोस्त से मिला", "I met my friend", "I love you"]
embeddings = np.array(model.encode(sentences))

# Hindi sentence vs. its English translation
print(cosine_similarity([embeddings[0]], [embeddings[1]])[0][0])
# Score: 0.9861017

# Hindi sentence vs. the unrelated English sentence
print(cosine_similarity([embeddings[0]], [embeddings[2]])[0][0])
# Score: 0.26329127
```
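
Semantic search, one of the uses listed above, comes down to ranking corpus embeddings by cosine similarity to a query embedding. The following is a minimal sketch in plain Python so it runs without the model or any dependencies. The helper names `cosine` and `top_k` and the toy 3-dimensional vectors are illustrative assumptions, not part of this repository.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(query_vec, corpus_vecs, k=3):
    """Return (index, score) pairs for the k corpus vectors closest to the query."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional vectors stand in for the model's 768-dimensional output.
corpus = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
query = [1.0, 0.05, 0.0]

hits = top_k(query, corpus, k=2)
print(hits)  # the two nearest corpus entries, best match first
```

With the real model, the rows returned by `model.encode(...)` would take the place of `corpus` and `query`. For large corpora, the Sentence Transformers library provides `util.semantic_search`, which is likely a better fit than this loop.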
### Evaluation/Benchmarking

**Dataset:** FLORES cross-lingual sentence retrieval task of the IndicXTREME benchmark. The best score for each language is in bold.

| **Language** | **MuRIL** | **IndicBERT** | **Vyakyarth** | **jina-embeddings-v3** |
|--------------|-----------|---------------|---------------|------------------------|
| **Bengali** | 77.0 | 91.0 | **98.7** | 97.4 |
| **Gujarati** | 67.0 | 92.4 | **98.7** | 97.3 |
| **Hindi** | 84.2 | 90.5 | **99.9** | 98.8 |
| **Kannada** | 88.4 | 89.1 | **99.2** | 96.8 |
| **Malayalam** | 82.2 | 89.2 | **98.7** | 96.3 |
| **Marathi** | 83.9 | 92.5 | **98.8** | 97.1 |
| **Sanskrit** | 36.4 | 30.4 | **90.1** | 84.1 |
| **Tamil** | 79.4 | 90.0 | **97.9** | 95.8 |
| **Telugu** | 43.5 | 88.6 | **97.5** | 97.3 |

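
A cross-lingual sentence retrieval score of this kind can be computed roughly as follows. This is a sketch under the assumption that the metric is top-1 accuracy with cosine similarity over aligned sentence pairs; the benchmark's exact protocol is not described here. The function names and toy vectors are illustrative.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def retrieval_accuracy(src_vecs, tgt_vecs):
    """Percentage of source sentences whose nearest target is the aligned one.

    Assumes src_vecs[i] and tgt_vecs[i] embed translations of the same sentence.
    """
    src = [normalize(v) for v in src_vecs]
    tgt = [normalize(v) for v in tgt_vecs]
    correct = 0
    for i, s in enumerate(src):
        scores = [sum(a * b for a, b in zip(s, t)) for t in tgt]
        if scores.index(max(scores)) == i:
            correct += 1
    return 100.0 * correct / len(src)


# Toy 2-dimensional "embeddings"; only the second pair is retrieved correctly.
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt = [[0.5, 0.5], [0.0, 1.0], [1.0, 0.1]]
print(retrieval_accuracy(src, tgt))
```

In practice `src_vecs` and `tgt_vecs` would hold `model.encode(...)` outputs for the two language sides of the evaluation set.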

Loss parameters:

```json
{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}
```
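
The card does not name the loss that the `scale` and `similarity_fct` values above belong to. Assuming they configure an in-batch-negatives ranking loss (in Sentence Transformers, `MultipleNegativesRankingLoss` uses these parameter names with the same values as defaults), the two parameters would enter the computation roughly as sketched here. This is an illustration in plain Python, not the training code used for this model.

```python
import math


def cos_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def in_batch_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i].

    Every other positive in the batch serves as a negative for anchors[i].
    """
    losses = []
    for i, anchor in enumerate(anchors):
        # Scaled cosine similarities act as logits over all candidates.
        logits = [scale * cos_sim(anchor, pos) for pos in positives]
        # Numerically stable log-sum-exp.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


aligned = in_batch_ranking_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = in_batch_ranking_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned, swapped)  # near zero when pairs line up, large when they do not
```

Because cosine similarity is bounded to [-1, 1], the scale factor widens the range of the logits so the softmax can separate the correct pair from the in-batch negatives.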

### Model Sources

- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
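
The architecture printout sets `pooling_mode_mean_tokens` to `True`, meaning the sentence vector is the average of the token vectors, with padding positions excluded. A minimal sketch of that step in plain Python follows. The function name `mean_pool` and the toy 2-dimensional token vectors are illustrative, and the real model works on 768-dimensional vectors for up to 128 tokens.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1; padding is ignored."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for d in range(dim):
                totals[d] += vec[d]
    # Guard against an all-padding input to avoid division by zero.
    return [t / max(count, 1) for t in totals]


# Two real tokens followed by one padding token.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # [2.0, 3.0]
```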

## License

This code repository and the model weights are licensed under the [Krutrim Community License](LICENSE.md).

## Citation

```
@misc{vyakyarth2024,
  author = {Pushkar Singh and Sandeep Kumar Pandey and Rajkiran Panuganti},
  title = {Vyakyarth: A Multilingual Sentence Embedding Model for Indic Languages},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/ola-krutrim/Vyakyarth}}
}
```

## Contact

Contributions are welcome! If you have any improvements or suggestions, feel free to submit a pull request on GitHub.