---
license: apache-2.0
tags:
- setfit
- sentence-transformers
- text-classification
pipeline_tag: text-classification
library_name: sentence-transformers
metrics:
- accuracy
- f1
- precision
- recall
language:
- en
- fr
- ko
- zh
- ja
- pt
- ru
datasets:
- imdb
model-index:
- name: germla/satoken
results:
- task:
type: text-classification
name: sentiment-analysis
dataset:
type: imdb
name: imdb
split: test
metrics:
- type: accuracy
value: 73.976
name: Accuracy
- type: f1
value: 73.1667079105832
name: F1
- type: precision
value: 75.51506895964584
name: Precision
- type: recall
value: 70.96
name: Recall
- task:
type: text-classification
name: sentiment-analysis
dataset:
type: sepidmnorozy/Russian_sentiment
name: sepidmnorozy/Russian_sentiment
split: train
metrics:
- type: accuracy
value: 75.66371681415929
name: Accuracy
- type: f1
value: 83.64218714253031
name: F1
- type: precision
value: 75.25730753396459
name: Precision
- type: recall
value: 94.129763130793
name: Recall
---
# Satoken
This is a [SetFit model](https://github.com/huggingface/setfit) trained on the multilingual datasets listed below for sentiment classification.
The model was trained with an efficient few-shot learning technique that involves two steps (sketched below):
1. Fine-tuning a [Sentence Transformer](https://www.sbert.net) with contrastive learning.
2. Training a classification head on features from the fine-tuned Sentence Transformer.
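A minimal sketch of this two-step procedure with the `setfit` library, using the `SetFitTrainer` API (newer setfit releases use `Trainer`/`TrainingArguments` instead). The base checkpoint, dataset, and hyperparameters here are illustrative assumptions, not the values used for this model:
```python
from datasets import Dataset
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, SetFitTrainer

# Tiny illustrative dataset; the real training data is listed
# under "Training Data" below.
train_ds = Dataset.from_dict({
    "text": ["i loved it", "terrible experience", "great value", "awful support"],
    "label": [1, 0, 1, 0],
})

# Placeholder base checkpoint; the actual base model is not stated in this card.
model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
)

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_ds,
    loss_class=CosineSimilarityLoss,  # step 1: contrastive fine-tuning
    num_iterations=20,                # text pairs generated per example
)
trainer.train()  # also fits the classification head (step 2)
```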
It is used by [Germla](https://github.com/germla) for its feedback analysis tool, specifically the sentiment analysis feature.
For language-specific models, see [here](https://github.com/germla/satoken#available-models).
# Usage
To use this model for inference, first install the SetFit library:
```bash
python -m pip install setfit
```
You can then run inference as follows:
```python
from setfit import SetFitModel
# Download from Hub and run inference
model = SetFitModel.from_pretrained("germla/satoken")
# Run inference
preds = model(["i loved the spiderman movie!", "pineapple on pizza is the worst 🤮"])
```
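The call above returns hard class labels. If you need scores instead, recent setfit releases also expose `predict_proba` on models whose head supports it; this is a hedged sketch and may not apply to every setfit version:
```python
# Per-class probabilities; requires a head that implements predict_proba,
# such as the default scikit-learn logistic-regression head.
probs = model.predict_proba(["i loved the spiderman movie!"])
```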
# Training Details
## Training Data
- [IMDB](https://huggingface.co/datasets/imdb)
- [RuReviews](https://github.com/sismetanin/rureviews)
- [chABSA](https://github.com/chakki-works/chABSA-dataset)
- [Glyph](https://github.com/zhangxiangxiao/glyph)
- [nsmc](https://github.com/e9t/nsmc)
- [Allocine](https://huggingface.co/datasets/allocine)
- [Portuguese Tweets for Sentiment Analysis](https://www.kaggle.com/datasets/augustop/portuguese-tweets-for-sentiment-analysis)
## Training Procedure
Each training set was balanced across the positive and negative classes.
The model was trained on only 35% of each dataset's train split (50% for Chinese), as re-created in the sketch below.
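A hedged re-creation of that subsampling for the IMDB portion; the fraction comes from the text above, but the seed and the exact procedure are assumptions:
```python
from datasets import load_dataset, concatenate_datasets

SEED, FRACTION = 42, 0.35  # seed is an assumption; fraction is from the text

imdb = load_dataset("imdb", split="train").shuffle(seed=SEED)

# Take the same fraction from each label to keep the classes balanced.
per_label = []
for label in (0, 1):
    subset = imdb.filter(lambda ex: ex["label"] == label)
    per_label.append(subset.select(range(int(FRACTION * len(subset)))))

train_ds = concatenate_datasets(per_label).shuffle(seed=SEED)
```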
### Preprocessing
- Basic cleaning (removal of duplicates, links, mentions, hashtags, etc.)
- Removal of stopwords using [nltk](https://www.nltk.org/), as sketched below
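A minimal sketch of this preprocessing, assuming simple regex-based cleaning and the English nltk stopword list (the other languages would use their own lists); the exact patterns used are not documented:
```python
import re

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOPWORDS = set(stopwords.words("english"))  # per-language lists in practice

def clean(text: str) -> str:
    text = re.sub(r"https?://\S+", " ", text)  # links
    text = re.sub(r"[@#]\w+", " ", text)       # mentions and hashtags
    tokens = [t for t in text.split() if t.lower() not in STOPWORDS]
    return " ".join(tokens)

# Duplicates were dropped at the dataset level, e.g.:
# texts = list(dict.fromkeys(texts))
```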
### Speeds, Sizes, Times
Training took 6 hours on an NVIDIA T4 GPU.
## Evaluation
### Testing Data, Factors & Metrics
- [IMDB test split](https://huggingface.co/datasets/imdb); the resulting accuracy, F1, precision, and recall are reported in the model index above (evaluation sketch below)
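A sketch of how the reported IMDB metrics could be reproduced with the `evaluate` library; this assumes the standard metric defaults and may differ from the exact evaluation script used:
```python
import evaluate
from datasets import load_dataset
from setfit import SetFitModel

model = SetFitModel.from_pretrained("germla/satoken")
test = load_dataset("imdb", split="test")

# One predicted label per input text.
preds = model.predict(test["text"])

for name in ("accuracy", "f1", "precision", "recall"):
    metric = evaluate.load(name)
    print(name, metric.compute(predictions=preds, references=test["label"]))
```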
# Environmental Impact
- Hardware Type: NVIDIA T4 GPU
- Hours used: 6
- Cloud Provider: Amazon Web Services
- Compute Region: ap-south-1 (Mumbai)
- Carbon Emitted: 0.39 [kg CO₂ eq.](https://mlco2.github.io/impact/#co2eq)