---
tags:
- sentence-transformers
- embeddings
- multilingual
- NLP
- Indic-languages
- semantic-search
- similarity
library_name: sentence-transformers
license: other
license_name: krutrim-community-license-agreement-version-1.0
license_link: LICENSE.md
---

# Vyakyarth: A Multilingual Sentence Embedding Model for Indic Languages
[![Static Badge](https://img.shields.io/badge/Huggingface-Vyakyarth-yellow?logo=huggingface)](https://huggingface.co/krutrim-ai-labs/vyakyarth)	[![Static Badge](https://img.shields.io/badge/Github-Vyakyarth-green?logo=github)](https://github.com/ola-krutrim/Vyakyarth)	[![Static Badge](https://img.shields.io/badge/Krutrim_Cloud-Vyakyarth-orange?logo=)](https://cloud.olakrutrim.com/console/inference-service?section=models&modelName=Krutrim&artifactName=Vyakyarth&artifactType=model)	[![Static Badge](https://img.shields.io/badge/Krutrim_AI_Labs-Vyakyarth-blue?logo=)](https://ai-labs.olakrutrim.com/models/Vyakyarth-1-Indic-Embedding)

This is a [sentence-transformers](https://www.SBERT.net) model fine-tuned from [sentence-transformers/stsb-xlm-r-multilingual](https://huggingface.co/sentence-transformers/stsb-xlm-r-multilingual). It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

[![Vyakyarth](https://img.youtube.com/vi/N1f8IlZCUi4/0.jpg)](https://www.youtube.com/watch?v=N1f8IlZCUi4)


## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np


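# Download the model from the 🤗 Hub
model = SentenceTransformer("krutrim-ai-labs/vyakyarth")

# Encode a Hindi sentence, its English translation, and an unrelated sentence
sentences = ["मैं अपने दोस्त से मिला", "I met my friend", "I love you"]
embeddings = np.array(model.encode(sentences))

# Hindi sentence vs. its English translation: high similarity
print(cosine_similarity([embeddings[0]], [embeddings[1]])[0][0])
# Score: 0.9861017

# Hindi sentence vs. the unrelated English sentence: low similarity
print(cosine_similarity([embeddings[0]], [embeddings[2]])[0][0])
# Score: 0.26329127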
```
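
For semantic search, the same cosine similarity can rank a whole corpus against one query. The following is a minimal sketch, not part of the model's API: `rank_by_cosine` is a hypothetical helper, and the toy 3-dimensional vectors stand in for the 768-dimensional embeddings that `model.encode` would return.

```python
import numpy as np


def rank_by_cosine(query_vec, corpus_vecs):
    """Return corpus indices ordered from most to least similar, plus all scores."""
    query = np.asarray(query_vec, dtype=float)
    corpus = np.asarray(corpus_vecs, dtype=float)
    # Normalise to unit length so the dot product equals cosine similarity.
    query = query / np.linalg.norm(query)
    corpus = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = corpus @ query
    order = np.argsort(-scores)  # descending similarity
    return order, scores


# Toy vectors (assumption): placeholders for real sentence embeddings.
corpus_vecs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]])
query_vec = np.array([0.9, 0.1, 0.0])

order, scores = rank_by_cosine(query_vec, corpus_vecs)
print(order)  # corpus indices, best match first
```

With the real model, `corpus_vecs` and `query_vec` would come from `model.encode(corpus_sentences)` and `model.encode(query)`.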



### Evaluation/Benchmarking
**Dataset:** Flores Cross-Lingual Sentence Retrieval task of the IndicXtreme benchmark

| **Language**  | **MuRIL** | **IndicBERT** | **Vyakyarth** | **jina-embeddings-v3** |
|--------------|------|----------|----------|----------------|
| **Bengali**  | 77.0  | 91.0 | **98.7** | 97.4 |
| **Gujarati** | 67.0  | 92.4 | **98.7** | 97.3 |
| **Hindi**    | 84.2  | 90.5 | **99.9** | 98.8 |
| **Kannada**  | 88.4  | 89.1 | **99.2** | 96.8 |
| **Malayalam**| 82.2  | 89.2 | **98.7** | 96.3 |
| **Marathi**  | 83.9  | 92.5 | **98.8** | 97.1 |
| **Sanskrit** | 36.4  | 30.4 | **90.1** | 84.1 |
| **Tamil**    | 79.4  | 90.0 | **97.9** | 95.8 |
| **Telugu**   | 43.5  | 88.6 | **97.5** | 97.3 |
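
The table does not state how the scores are computed. A common way to score cross-lingual sentence retrieval is top-1 accuracy: each source-language sentence should retrieve its own translation as the nearest neighbour. The sketch below assumes that metric and assumes row `i` of both arrays holds a parallel sentence pair; `top1_retrieval_accuracy` is a hypothetical helper, not the benchmark's official scorer.

```python
import numpy as np


def top1_retrieval_accuracy(source_vecs, target_vecs):
    """Percentage of source rows whose nearest target row (by cosine) shares their index."""
    src = np.asarray(source_vecs, dtype=float)
    tgt = np.asarray(target_vecs, dtype=float)
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sims = src @ tgt.T               # (n_source, n_target) cosine matrix
    predicted = sims.argmax(axis=1)  # nearest target for each source sentence
    gold = np.arange(len(src))       # assumption: row i is parallel to row i
    return 100.0 * float((predicted == gold).mean())


# Toy embeddings: rows 0 and 1 retrieve their own translation, row 2 does not.
source_vecs = np.array([[1.0, 0.1], [0.1, 1.0], [1.0, 0.2]])
target_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])

accuracy = top1_retrieval_accuracy(source_vecs, target_vecs)
print(round(accuracy, 1))
```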


  ```json
  {
      "scale": 20.0,
      "similarity_fct": "cos_sim"
  }
  ```
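
The card does not name the loss these parameters belong to. Parameters of this shape are typical of contrastive losses with in-batch negatives, where cosine similarities are multiplied by `scale` before a softmax cross-entropy. The sketch below assumes that setup; `scaled_in_batch_loss` is a hypothetical NumPy illustration, not the training code used for this model.

```python
import numpy as np


def scaled_in_batch_loss(anchor_vecs, positive_vecs, scale=20.0):
    """Cross-entropy over scaled cosine similarities; column i is the positive for anchor i."""
    a = np.asarray(anchor_vecs, dtype=float)
    p = np.asarray(positive_vecs, dtype=float)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    logits = scale * (a @ p.T)  # "similarity_fct": cos_sim, multiplied by "scale"
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


# Toy batch: perfectly aligned pairs versus the same pairs shuffled.
aligned = np.eye(3)
loss_aligned = scaled_in_batch_loss(aligned, aligned)
loss_shuffled = scaled_in_batch_loss(aligned, aligned[[1, 2, 0]])
print(loss_aligned < loss_shuffled)
```

A larger `scale` sharpens the softmax, so mismatched pairs are penalised more heavily.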

### Model Sources

- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
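
The `Pooling` layer above has only `pooling_mode_mean_tokens` enabled, so a sentence embedding is the average of its token embeddings with padding positions excluded. The following is a minimal NumPy sketch of that operation, assuming toy 2-dimensional token vectors in place of the real 768-dimensional ones; `mean_pool` is a hypothetical helper, not the library's implementation.

```python
import numpy as np


def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings per sentence, ignoring positions where the mask is 0."""
    tokens = np.asarray(token_embeddings, dtype=float)         # (batch, seq_len, dim)
    mask = np.asarray(attention_mask, dtype=float)[..., None]  # (batch, seq_len, 1)
    summed = (tokens * mask).sum(axis=1)                       # sum over real tokens only
    counts = np.clip(mask.sum(axis=1), 1e-9, None)             # avoid division by zero
    return summed / counts


token_embeddings = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # third position is padding
    [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]],
])
attention_mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = mean_pool(token_embeddings, attention_mask)
print(pooled)  # one vector per sentence
```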

## License
This code repository and the model weights are licensed under the [Krutrim Community License](LICENSE.md).

## Citation

```
@misc{vyakyarth2024,
  author = {Pushkar Singh and Sandeep Kumar Pandey and Rajkiran Panuganti},
  title = {Vyakyarth: A Multilingual Sentence Embedding Model for Indic Languages},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/ola-krutrim/Vyakyarth}}
}
```

## Contact
Contributions are welcome! If you have any improvements or suggestions, feel free to submit a pull request on GitHub.