---
license: apache-2.0
tags:
  - setfit
  - sentence-transformers
  - text-classification
pipeline_tag: text-classification
library_name: sentence-transformers
metrics:
  - accuracy
  - f1
  - precision
  - recall
language:
  - en
  - fr
  - ko
  - zh
  - ja
  - pt
  - ru
datasets:
  - imdb
model-index:
  - name: germla/satoken
    results:
      - task: 
          type: text-classification
          name: sentiment-analysis
        dataset:
          type: imdb
          name: imdb
          split: test
        metrics:
          - type: accuracy
            value: 73.976
            name: Accuracy
          - type: f1
            value: 73.1667079105832
            name: F1
          - type: precision
            value: 75.51506895964584
            name: Precision
          - type: recall
            value: 70.96
            name: Recall
      - task:
          type: text-classification
          name: sentiment-analysis
        dataset:
          type: sepidmnorozy/Russian_sentiment
          name: sepidmnorozy/Russian_sentiment
          split: train
        metrics:
          - type: accuracy
            value: 75.66371681415929
            name: Accuracy
          - type: f1
            value: 83.64218714253031
            name: F1
          - type: precision
            value: 75.25730753396459
            name: Precision
          - type: recall
            value: 94.129763130793
            name: Recall
---

# Satoken

This is a [SetFit model](https://github.com/huggingface/setfit) trained for sentiment classification on the multilingual datasets listed below.

The model has been trained using an efficient few-shot learning technique, sketched in code below, that involves:

1. Fine-tuning a [Sentence Transformer](https://www.sbert.net) with contrastive learning.
2. Training a classification head with features from the fine-tuned Sentence Transformer.
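
A minimal sketch of both phases with SetFit's trainer API; the base Sentence Transformer checkpoint, the tiny IMDB slice, and the hyperparameters are illustrative assumptions rather than the exact values used for this model:

```python
from datasets import load_dataset
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, SetFitTrainer

# Assumed base checkpoint; the card does not state which one was used.
model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
)

train_ds = load_dataset("imdb", split="train[:1%]")  # small illustrative slice

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_ds,
    loss_class=CosineSimilarityLoss,  # loss for the contrastive phase
    num_iterations=20,  # contrastive text pairs generated per example
    num_epochs=1,
)
# train() runs both phases: contrastive fine-tuning of the embedding body,
# then fitting the classification head on the resulting embeddings.
trainer.train()
```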

It is used by [Germla](https://github.com/germla) in its feedback analysis tool, specifically for the sentiment analysis feature.

For language-specific models, see [here](https://github.com/germla/satoken#available-models).

# Usage

To use this model for inference, first install the SetFit library:

```bash
python -m pip install setfit
```

You can then run inference as follows:

```python
from setfit import SetFitModel

# Download from Hub and run inference
model = SetFitModel.from_pretrained("germla/satoken")
# Run inference
preds = model(["i loved the spiderman movie!", "pineapple on pizza is the worst 🤮"])
```
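
If you need scores rather than hard labels, `SetFitModel` also exposes class probabilities (assuming the default logistic-regression head):

```python
# Per-class probabilities, one row per input text
probs = model.predict_proba(["i loved the spiderman movie!"])
```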

# Training Details

## Training Data

- [IMDB](https://huggingface.co/datasets/imdb)
- [RuReviews](https://github.com/sismetanin/rureviews)
- [chABSA](https://github.com/chakki-works/chABSA-dataset)
- [Glyph](https://github.com/zhangxiangxiao/glyph)
- [nsmc](https://github.com/e9t/nsmc)
- [Allocine](https://huggingface.co/datasets/allocine)
- [Portuguese Tweets for Sentiment Analysis](https://www.kaggle.com/datasets/augustop/portuguese-tweets-for-sentiment-analysis)

## Training Procedure

We made sure the training data was balanced across sentiment classes.
The model was trained on only 35% (50% for Chinese) of the train split of each dataset, as sketched below.
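
A minimal sketch of that balanced subsampling for one dataset (IMDB shown; the shuffle seed is an illustrative assumption):

```python
from datasets import concatenate_datasets, load_dataset

ds = load_dataset("imdb", split="train").shuffle(seed=42)

# Split by label so each class contributes equally to the subset.
neg = ds.filter(lambda ex: ex["label"] == 0)
pos = ds.filter(lambda ex: ex["label"] == 1)

# Keep 35% per class (50% was used for the Chinese dataset).
n = int(0.35 * min(len(neg), len(pos)))
subset = concatenate_datasets(
    [neg.select(range(n)), pos.select(range(n))]
).shuffle(seed=42)
```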

### Preprocessing

- Basic cleaning (removal of duplicates, links, mentions, hashtags, etc.)
- Removal of stopwords using [nltk](https://www.nltk.org/) (both steps are sketched below)
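
The cleaning regexes here are illustrative assumptions, and stopword removal is shown for English (nltk ships stopword lists for several, but not all, of the card's languages):

```python
import re

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOPWORDS = set(stopwords.words("english"))  # swapped per language

def clean(text: str) -> str:
    text = re.sub(r"https?://\S+", " ", text)  # links
    text = re.sub(r"[@#]\w+", " ", text)       # mentions and hashtags
    tokens = [t for t in text.split() if t.lower() not in STOPWORDS]
    return " ".join(tokens)

def dedup(texts: list[str]) -> list[str]:
    """Drop exact duplicates while preserving order."""
    seen: set[str] = set()
    out = []
    for t in texts:
        if t not in seen:
            seen.add(t)
            out.append(t)
    return out
```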

### Speeds, Sizes, Times

Training took 6 hours on an NVIDIA T4 GPU.

## Evaluation

### Testing Data, Factors & Metrics

- [IMDB test split](https://huggingface.co/datasets/imdb)
- [sepidmnorozy/Russian_sentiment train split](https://huggingface.co/datasets/sepidmnorozy/Russian_sentiment) (a reproduction sketch follows)
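
A sketch of how the reported IMDB numbers can be reproduced; the percent scaling and the 0/1 label convention are assumptions inferred from the metric values above:

```python
from datasets import load_dataset
from setfit import SetFitModel
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

model = SetFitModel.from_pretrained("germla/satoken")
test = load_dataset("imdb", split="test")

preds = model(test["text"])  # assumed: 0 = negative, 1 = positive

for name, metric in [("accuracy", accuracy_score), ("f1", f1_score),
                     ("precision", precision_score), ("recall", recall_score)]:
    print(name, 100 * metric(test["label"], preds))
```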

# Environmental Impact

- Hardware Type: NVIDIA T4 GPU
- Hours used: 6
- Cloud Provider: Amazon Web Services
- Compute Region: ap-south-1 (Mumbai)
- Carbon Emitted: 0.39 [kg CO₂ eq.](https://mlco2.github.io/impact/#co2eq)