---
tags:
- sentence-transformers
- embeddings
- multilingual
- NLP
- Indic-languages
- semantic-search
- similarity
library_name: sentence-transformers
license: other
license_name: krutrim-community-license-agreement-version-1.0
license_link: LICENSE.md
---
# Vyakyarth: A Multilingual Sentence Embedding Model for Indic Languages
[Hugging Face](https://huggingface.co/krutrim-ai-labs/vyakyarth) [GitHub](https://github.com/ola-krutrim/Vyakyarth) [Krutrim Cloud](https://cloud.olakrutrim.com/console/inference-service?section=models&modelName=Krutrim&artifactName=Vyakyarth&artifactType=model) [Krutrim AI Labs](https://ai-labs.olakrutrim.com/models/Vyakyarth-1-Indic-Embedding)
This is a [sentence-transformers](https://www.SBERT.net) model fine-tuned from [sentence-transformers/stsb-xlm-r-multilingual](https://huggingface.co/sentence-transformers/stsb-xlm-r-multilingual). It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
[Watch the video on YouTube](https://www.youtube.com/watch?v=N1f8IlZCUi4)
## Usage
### Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
# Download the model from the 🤗 Hub
model = SentenceTransformer("krutrim-ai-labs/vyakyarth")

# Encode one Hindi sentence and two English sentences
sentences = ["मैं अपने दोस्त से मिला", "I met my friend", "I love you"]
embeddings = np.array(model.encode(sentences))

# Hindi sentence vs. its English translation
print(cosine_similarity([embeddings[0]], [embeddings[1]])[0][0])
# Score: 0.9861017
# Hindi sentence vs. an unrelated English sentence
print(cosine_similarity([embeddings[0]], [embeddings[2]])[0][0])
# Score: 0.26329127
```
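For semantic search, the same cosine score can rank a set of candidate sentences against a query. The following is a minimal sketch, not part of the original card: `top_k_matches` is a hypothetical helper, and the 3-dimensional toy vectors stand in for the 768-dimensional arrays that `model.encode` would return.

```python
import numpy as np

def top_k_matches(query_emb, corpus_embs, k=2):
    """Return (index, cosine score) pairs for the k corpus rows closest to the query."""
    # Normalise to unit length so the dot product equals cosine similarity.
    query = query_emb / np.linalg.norm(query_emb)
    corpus = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    scores = corpus @ query
    # Sort by descending score and keep the first k indices.
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]

# Toy vectors (assumption): placeholders for real sentence embeddings.
corpus_embs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]])
query_emb = np.array([0.9, 0.1, 0.0])

matches = top_k_matches(query_emb, corpus_embs, k=2)
print(matches)  # corpus rows 0 and 2 rank highest for this query
```

With the real model, `corpus_embs` would be `model.encode(list_of_sentences)` and `query_emb` would be `model.encode(query)`.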
### Evaluation/Benchmarking
**Dataset:** Flores cross-lingual sentence retrieval task of the IndicXtreme benchmark. The highest score for each language is in bold.
| **Language** | **MuRIL** | **IndicBERT** | **Vyakyarth** | **jina-embeddings-v3** |
|--------------|------|----------|----------|----------------|
| **Bengali** | 77.0 | 91.0 | **98.7** | 97.4 |
| **Gujarati** | 67.0 | 92.4 | **98.7** | 97.3 |
| **Hindi** | 84.2 | 90.5 | **99.9** | 98.8 |
| **Kannada** | 88.4 | 89.1 | **99.2** | 96.8 |
| **Malayalam**| 82.2 | 89.2 | **98.7** | 96.3 |
| **Marathi** | 83.9 | 92.5 | **98.8** | 97.1 |
| **Sanskrit** | 36.4 | 30.4 | **90.1** | 84.1 |
| **Tamil** | 79.4 | 90.0 | **97.9** | 95.8 |
| **Telugu** | 43.5 | 88.6 | **97.5** | 97.3 |
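The card does not state which metric the table reports. Assuming it is top-1 retrieval accuracy (each source sentence should retrieve its aligned translation as the nearest neighbour), the computation can be sketched as follows. The function name and the 2-dimensional toy vectors are illustrative assumptions, not the benchmark's evaluation code.

```python
import numpy as np

def retrieval_accuracy(src_embs, tgt_embs):
    """Fraction of source rows whose nearest target row (by cosine) has the same index."""
    src = src_embs / np.linalg.norm(src_embs, axis=1, keepdims=True)
    tgt = tgt_embs / np.linalg.norm(tgt_embs, axis=1, keepdims=True)
    # sims[i, j] is the cosine similarity between source i and target j.
    sims = src @ tgt.T
    predicted = sims.argmax(axis=1)
    return float((predicted == np.arange(len(src))).mean())

# Toy aligned pairs (assumption): row i of src_embs translates row i of tgt_embs.
src_embs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
tgt_embs = np.array([[0.9, 0.1], [0.2, 0.8], [1.0, 0.2]])

accuracy = retrieval_accuracy(src_embs, tgt_embs)
print(f"{accuracy:.3f}")  # two of the three toy pairs are retrieved correctly
```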
```json
{
"scale": 20.0,
"similarity_fct": "cos_sim"
}
```
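The card does not say which loss these parameters belong to. They match the `scale` and `similarity_fct` arguments of the `MultipleNegativesRankingLoss` class in Sentence Transformers, so the sketch below assumes an in-batch ranking loss of that kind: cosine similarities are multiplied by the scale and passed through a softmax cross-entropy in which every other positive in the batch acts as a negative. It is a NumPy illustration under that assumption, not the authors' training code.

```python
import numpy as np

def scaled_cosine_ranking_loss(anchors, positives, scale=20.0):
    """In-batch softmax cross-entropy over scaled cosine similarities."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    # logits[i, j] = scale * cos_sim(anchor i, positive j)
    logits = scale * (a @ p.T)
    # Subtract the row maximum for a numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct "class" for anchor i is positive i, i.e. the diagonal.
    return float(-np.mean(np.diag(log_probs)))

# Toy batch (assumption): two orthogonal anchors.
anchors = np.array([[1.0, 0.0], [0.0, 1.0]])
aligned_loss = scaled_cosine_ranking_loss(anchors, anchors)        # each anchor matches its positive
swapped_loss = scaled_cosine_ranking_loss(anchors, anchors[::-1])  # positives are mismatched
print(aligned_loss < swapped_loss)
```

A larger scale sharpens the softmax, so a small gap in cosine similarity between the true pair and the in-batch negatives produces a larger change in the loss.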
### Model Sources
- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
### Full Model Architecture
```
SentenceTransformer(
(0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
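The pooling layer above has only `pooling_mode_mean_tokens` enabled, so a sentence embedding is the average of the transformer's token embeddings, with padding positions excluded. A minimal NumPy sketch of that operation, using a hypothetical `mean_pool` helper and 2-dimensional toy token vectors in place of the 768-dimensional ones:

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings over the sequence axis, ignoring padded positions.

    token_embeddings: (batch, seq_len, dim); attention_mask: (batch, seq_len) of 0/1.
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    # Guard against division by zero for an all-padding row.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

# One sentence with two real tokens and one padding token (toy values).
token_embeddings = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
attention_mask = np.array([[1, 1, 0]])

sentence_embedding = mean_pool(token_embeddings, attention_mask)
print(sentence_embedding)  # [[2. 3.]]
```

The padded token's large values do not affect the result because its mask entry is 0.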
## License
This code repository and the model weights are licensed under the [Krutrim Community License](LICENSE.md).
## Citation
```
@misc{vyakyarth2024,
author={Pushkar Singh and Sandeep Kumar Pandey and Rajkiran Panuganti},
title={Vyakyarth: A Multilingual Sentence Embedding Model for Indic Languages},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/ola-krutrim/Vyakyarth}}
}
```
## Contact
Contributions are welcome! If you have any improvements or suggestions, feel free to submit a pull request on GitHub.