---
license: apache-2.0
tags:
- setfit
- sentence-transformers
- text-classification
pipeline_tag: text-classification
library_name: sentence-transformers
metrics:
- accuracy
- f1
- precision
- recall
language:
- en
- fr
- ko
- zh
- ja
- pt
- ru
datasets:
- imdb
model-index:
- name: germla/satoken
  results:
  - task:
      type: text-classification
      name: sentiment-analysis
    dataset:
      type: imdb
      name: imdb
      split: test
    metrics:
    - type: accuracy
      value: 73.976
      name: Accuracy
    - type: f1
      value: 73.1667079105832
      name: F1
    - type: precision
      value: 75.51506895964584
      name: Precision
    - type: recall
      value: 70.96
      name: Recall
  - task:
      type: text-classification
      name: sentiment-analysis
    dataset:
      type: sepidmnorozy/Russian_sentiment
      name: sepidmnorozy/Russian_sentiment
      split: train
    metrics:
    - type: accuracy
      value: 75.66371681415929
      name: Accuracy
    - type: f1
      value: 83.64218714253031
      name: F1
    - type: precision
      value: 75.25730753396459
      name: Precision
    - type: recall
      value: 94.129763130793
      name: Recall
---

# Satoken

This is a [SetFit model](https://github.com/huggingface/setfit) trained on multilingual datasets (listed below) for sentiment classification.

The model was trained with an efficient few-shot learning technique that involves two steps (sketched in code after the list):

1. Fine-tuning a [Sentence Transformer](https://www.sbert.net) with contrastive learning.
2. Training a classification head on features from the fine-tuned Sentence Transformer.
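
A minimal sketch of these two steps with SetFit's trainer API; the base Sentence Transformer checkpoint, sample size, and hyperparameters below are illustrative assumptions, not Satoken's actual training configuration:

```python
from datasets import load_dataset
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, SetFitTrainer

# Assumed base checkpoint: the card does not state which
# Sentence Transformer Satoken was fine-tuned from.
model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
)

dataset = load_dataset("imdb")
train_ds = dataset["train"].shuffle(seed=42).select(range(64))  # few-shot subset
eval_ds = dataset["test"]

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    loss_class=CosineSimilarityLoss,  # step 1: contrastive fine-tuning of the body
    batch_size=16,
    num_iterations=20,  # contrastive text pairs generated per example
)
trainer.train()  # runs step 1, then fits the classification head (step 2)
print(trainer.evaluate())
```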

It is used by [Germla](https://github.com/germla) in its feedback-analysis tool (specifically the sentiment-analysis feature).

For language-specific models, see [the list here](https://github.com/germla/satoken#available-models).

# Usage

To use this model for inference, first install the SetFit library:

```bash
python -m pip install setfit
```

You can then run inference as follows:

```python
from setfit import SetFitModel

# Download the model from the Hub
model = SetFitModel.from_pretrained("germla/satoken")
# Run inference
preds = model(["i loved the spiderman movie!", "pineapple on pizza is the worst 🤮"])
```
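
The call returns one predicted sentiment label per input text.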

# Training Details

## Training Data

- [IMDB](https://huggingface.co/datasets/imdb)
- [RuReviews](https://github.com/sismetanin/rureviews)
- [chABSA](https://github.com/chakki-works/chABSA-dataset)
- [Glyph](https://github.com/zhangxiangxiao/glyph)
- [nsmc](https://github.com/e9t/nsmc)
- [Allocine](https://huggingface.co/datasets/allocine)
- [Portuguese Tweets for Sentiment Analysis](https://www.kaggle.com/datasets/augustop/portuguese-tweets-for-sentiment-analysis)

## Training Procedure

The training data was balanced across the positive and negative classes.
The model was trained on only 35% of each dataset's train split (50% for Chinese).
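
For illustration, a balanced subsample like this could be drawn with the 🤗 `datasets` library (the seed and sampling strategy here are assumptions, not Satoken's documented recipe):

```python
from datasets import concatenate_datasets, load_dataset

# Draw a balanced 35% sample of the IMDB train split.
train = load_dataset("imdb", split="train").shuffle(seed=42)
per_label = int(0.35 * len(train)) // 2  # half the budget per class
pos = train.filter(lambda ex: ex["label"] == 1).select(range(per_label))
neg = train.filter(lambda ex: ex["label"] == 0).select(range(per_label))
subsample = concatenate_datasets([pos, neg]).shuffle(seed=42)
```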

### Preprocessing

- Basic cleaning (removal of duplicates, links, mentions, hashtags, etc.)
- Removal of stopwords using [nltk](https://www.nltk.org/) (a sketch of both steps follows)
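
A rough sketch of this cleaning, assuming English text (nltk ships stopword lists for several of the card's languages, such as English, French, Portuguese, and Russian, but not all of them; the regexes are illustrative):

```python
import re

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOPWORDS = set(stopwords.words("english"))  # swap per language where nltk has a list

def clean(text: str) -> str:
    text = re.sub(r"https?://\S+", " ", text)  # links
    text = re.sub(r"[@#]\w+", " ", text)       # mentions and hashtags
    tokens = [t for t in text.split() if t.lower() not in STOPWORDS]
    return " ".join(tokens)

# Deduplicate while preserving order.
reviews = ["I loved it! https://t.co/xyz", "I loved it! https://t.co/xyz", "@user worst film #fail"]
reviews = list(dict.fromkeys(clean(r) for r in reviews))
```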

### Speeds, Sizes, Times

Training took 6 hours on an NVIDIA T4 GPU.

## Evaluation

### Testing Data, Factors & Metrics

- [IMDB test split](https://huggingface.co/datasets/imdb)
- [sepidmnorozy/Russian_sentiment train split](https://huggingface.co/datasets/sepidmnorozy/Russian_sentiment) (as reported in the model index above)
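
The reported IMDB numbers can be reproduced along these lines (a sketch; `scikit-learn` is assumed for the metric computation):

```python
from datasets import load_dataset
from setfit import SetFitModel
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

model = SetFitModel.from_pretrained("germla/satoken")
test = load_dataset("imdb", split="test")

preds = model(test["text"])  # one predicted label per review
print("accuracy :", accuracy_score(test["label"], preds))
print("f1       :", f1_score(test["label"], preds))
print("precision:", precision_score(test["label"], preds))
print("recall   :", recall_score(test["label"], preds))
```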

# Environmental Impact

- Hardware Type: NVIDIA T4 GPU
- Hours used: 6
- Cloud Provider: Amazon Web Services
- Compute Region: ap-south-1 (Mumbai)
- Carbon Emitted: 0.39 [kg CO₂ eq.](https://mlco2.github.io/impact/#co2eq)