---
license: apache-2.0
tags:
- setfit
- sentence-transformers
- text-classification
pipeline_tag: text-classification
library_name: sentence-transformers
metrics:
- accuracy
- f1
- precision
- recall
language:
- en
- fr
- ko
- zh
- ja
- pt
- ru
datasets:
- imdb
model-index:
- name: germla/satoken
results:
- task:
type: text-classification
name: sentiment-analysis
dataset:
type: imdb
name: imdb
split: test
metrics:
- type: accuracy
value: 73.976
name: Accuracy
- type: f1
value: 73.1667079105832
name: F1
- type: precision
value: 75.51506895964584
name: Precision
- type: recall
value: 70.96
name: Recall
- task:
type: text-classification
name: sentiment-analysis
dataset:
type: sepidmnorozy/Russian_sentiment
name: sepidmnorozy/Russian_sentiment
split: train
metrics:
- type: accuracy
value: 75.66371681415929
name: Accuracy
- type: f1
value: 83.64218714253031
name: F1
- type: precision
value: 75.25730753396459
name: Precision
- type: recall
value: 94.129763130793
name: Recall
---
# Satoken
This is a [SetFit model](https://github.com/huggingface/setfit) trained on the multilingual datasets listed below for sentiment classification.
The model was trained with an efficient few-shot learning technique that involves:
1. Fine-tuning a [Sentence Transformer](https://www.sbert.net) with contrastive learning.
2. Training a classification head on features from the fine-tuned Sentence Transformer (a minimal training sketch is shown below).
It is used by [Germla](https://github.com/germla) for the sentiment analysis feature of its feedback analysis tool.
For other, language-specific models, see the [available models](https://github.com/germla/satoken#available-models).
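The two-step procedure above can be reproduced with the `setfit` trainer API. The snippet below is a minimal sketch, not the original training script: the base Sentence Transformer checkpoint, the sample size, and the hyperparameters are assumptions.
```python
from datasets import load_dataset
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, SetFitTrainer

# Assumed base checkpoint; the actual base model is not stated in this card.
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

# Small labeled sample, in the few-shot spirit of SetFit.
imdb = load_dataset("imdb")
train_ds = imdb["train"].shuffle(seed=42).select(range(64))

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_ds,
    eval_dataset=imdb["test"],
    loss_class=CosineSimilarityLoss,  # step 1: contrastive fine-tuning of the Sentence Transformer
    num_iterations=20,                # contrastive pairs generated per labeled example
    num_epochs=1,
)
trainer.train()            # fine-tunes the body, then fits the classification head (step 2)
print(trainer.evaluate())  # accuracy on the evaluation split
```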
# Usage
To use this model for inference, first install the SetFit library:
```bash
python -m pip install setfit
```
You can then run inference as follows:
```python
from setfit import SetFitModel
# Download from Hub and run inference
model = SetFitModel.from_pretrained("germla/satoken")
# Run inference
preds = model(["i loved the spiderman movie!", "pineapple on pizza is the worst 🤮"])
```
# Training Details
## Training Data
- [IMDB](https://huggingface.co/datasets/imdb)
- [RuReviews](https://github.com/sismetanin/rureviews)
- [chABSA](https://github.com/chakki-works/chABSA-dataset)
- [Glyph](https://github.com/zhangxiangxiao/glyph)
- [nsmc](https://github.com/e9t/nsmc)
- [Allocine](https://huggingface.co/datasets/allocine)
- [Portuguese Tweets for Sentiment Analysis](https://www.kaggle.com/datasets/augustop/portuguese-tweets-for-sentiment-analysis)
## Training Procedure
The training data was balanced across classes.
The model was trained on only 35% of the train split of each dataset (50% for the Chinese dataset).
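As an illustration, a balanced subsample of a train split can be drawn as follows. This is a sketch under assumptions: the original sampling code is not published here, and equal per-class sampling is just one way to keep the classes balanced.
```python
from datasets import concatenate_datasets, load_dataset

def balanced_subsample(dataset, label_column="label", fraction=0.35, seed=42):
    """Draw roughly `fraction` of `dataset`, with an equal number of examples per class."""
    labels = sorted(set(dataset[label_column]))
    per_class = int(len(dataset) * fraction / len(labels))
    parts = []
    for label in labels:
        subset = dataset.filter(lambda ex: ex[label_column] == label)
        parts.append(subset.shuffle(seed=seed).select(range(min(per_class, len(subset)))))
    return concatenate_datasets(parts).shuffle(seed=seed)

train = balanced_subsample(load_dataset("imdb", split="train"), fraction=0.35)
```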
### Preprocessing
- Basic cleaning (removal of duplicates, links, mentions, hashtags, etc.)
- Removal of stopwords using [nltk](https://www.nltk.org/)
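A rough sketch of these preprocessing steps is shown below. The exact regular expressions and the per-language stopword lists used during training are not documented, so treat the details as assumptions.
```python
import re

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
# Assumption: the matching NLTK stopword list was used for each training language.
STOPWORDS = set(stopwords.words("english"))

def clean(text: str) -> str:
    text = re.sub(r"https?://\S+", " ", text)  # links
    text = re.sub(r"[@#]\w+", " ", text)       # mentions and hashtags
    tokens = [t for t in text.split() if t.lower() not in STOPWORDS]
    return " ".join(tokens)

def preprocess(texts):
    cleaned = [clean(t) for t in texts]
    return list(dict.fromkeys(cleaned))  # drop exact duplicates, keep order
```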
### Speeds, Sizes, Times
Training took about 6 hours on an NVIDIA T4 GPU.
## Evaluation
### Testing Data, Factors & Metrics
- [IMDB](https://huggingface.co/datasets/imdb) test split
- [sepidmnorozy/Russian_sentiment](https://huggingface.co/datasets/sepidmnorozy/Russian_sentiment) train split
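The IMDB numbers reported in the metadata above can be reproduced along the following lines; exact preprocessing of the test split may differ, so this is a sketch rather than the original evaluation script.
```python
from datasets import load_dataset
from setfit import SetFitModel
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

model = SetFitModel.from_pretrained("germla/satoken")
test = load_dataset("imdb", split="test")

preds = model.predict(test["text"])
print({
    "accuracy": accuracy_score(test["label"], preds),
    "f1": f1_score(test["label"], preds),
    "precision": precision_score(test["label"], preds),
    "recall": recall_score(test["label"], preds),
})
```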
# Environmental Impact
- Hardware Type: NVIDIA T4 GPU
- Hours used: 6
- Cloud Provider: Amazon Web Services
- Compute Region: ap-south-1 (Mumbai)
- Carbon Emitted: 0.39 [kg CO₂ eq.](https://mlco2.github.io/impact/#co2eq)