---
license: apache-2.0
tags:
- setfit
- sentence-transformers
- text-classification
pipeline_tag: text-classification
library_name: sentence-transformers
metrics:
- accuracy
- f1
- precision
- recall
language:
- en
- fr
- ko
- zh
- ja
- pt
- ru
datasets:
- imdb
model-index:
- name: germla/satoken
  results:
  - task:
      type: text-classification
      name: sentiment-analysis
    dataset:
      type: imdb
      name: imdb
      split: test
    metrics:
    - type: accuracy
      value: 73.976
      name: Accuracy
    - type: f1
      value: 73.1667079105832
      name: F1
    - type: precision
      value: 75.51506895964584
      name: Precision
    - type: recall
      value: 70.96
      name: Recall
  - task:
      type: text-classification
      name: sentiment-analysis
    dataset:
      type: sepidmnorozy/Russian_sentiment
      name: sepidmnorozy/Russian_sentiment
      split: train
    metrics:
    - type: accuracy
      value: 75.66371681415929
      name: Accuracy
    - type: f1
      value: 83.64218714253031
      name: F1
    - type: precision
      value: 75.25730753396459
      name: Precision
    - type: recall
      value: 94.129763130793
      name: Recall
---

# Satoken

This is a [SetFit model](https://github.com/huggingface/setfit) trained on multilingual datasets (listed below) for sentiment classification.

The model was trained with an efficient few-shot learning technique that involves two steps (sketched in code after the list):

1. Fine-tuning a [Sentence Transformer](https://www.sbert.net) with contrastive learning.
2. Training a classification head on features from the fine-tuned Sentence Transformer.
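
A minimal sketch of these two steps with SetFit's trainer API; the base Sentence Transformer checkpoint, sample size, and hyperparameters below are illustrative assumptions, not Satoken's actual training configuration:

```python
from datasets import load_dataset
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, SetFitTrainer

# Assumed base checkpoint: the card does not state which
# Sentence Transformer Satoken was fine-tuned from.
model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
)

dataset = load_dataset("imdb")
train_ds = dataset["train"].shuffle(seed=42).select(range(64))  # few-shot subset
eval_ds = dataset["test"]

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    loss_class=CosineSimilarityLoss,  # step 1: contrastive fine-tuning of the body
    batch_size=16,
    num_iterations=20,  # contrastive text pairs generated per example
)
trainer.train()  # runs step 1, then fits the classification head (step 2)
print(trainer.evaluate())
```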

It is used by [Germla](https://github.com/germla) in its feedback-analysis tool (specifically the sentiment-analysis feature).

For language-specific models, see [the list here](https://github.com/germla/satoken#available-models).

# Usage

To use this model for inference, first install the SetFit library:

```bash
python -m pip install setfit
```

You can then run inference as follows:

```python
from setfit import SetFitModel

# Download the model from the Hub
model = SetFitModel.from_pretrained("germla/satoken")
# Run inference
preds = model(["i loved the spiderman movie!", "pineapple on pizza is the worst 🤮"])
```
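
The call returns one predicted sentiment label per input text.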

# Training Details

## Training Data

- [IMDB](https://huggingface.co/datasets/imdb)
- [RuReviews](https://github.com/sismetanin/rureviews)
- [chABSA](https://github.com/chakki-works/chABSA-dataset)
- [Glyph](https://github.com/zhangxiangxiao/glyph)
- [nsmc](https://github.com/e9t/nsmc)
- [Allocine](https://huggingface.co/datasets/allocine)
- [Portuguese Tweets for Sentiment Analysis](https://www.kaggle.com/datasets/augustop/portuguese-tweets-for-sentiment-analysis)

## Training Procedure

The training data was balanced across the positive and negative classes.
The model was trained on only 35% of each dataset's train split (50% for Chinese).
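
For illustration, a balanced subsample like this could be drawn with the 🤗 `datasets` library (the seed and sampling strategy here are assumptions, not Satoken's documented recipe):

```python
from datasets import concatenate_datasets, load_dataset

# Draw a balanced 35% sample of the IMDB train split.
train = load_dataset("imdb", split="train").shuffle(seed=42)
per_label = int(0.35 * len(train)) // 2  # half the budget per class
pos = train.filter(lambda ex: ex["label"] == 1).select(range(per_label))
neg = train.filter(lambda ex: ex["label"] == 0).select(range(per_label))
subsample = concatenate_datasets([pos, neg]).shuffle(seed=42)
```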

### Preprocessing

- Basic cleaning (removal of duplicates, links, mentions, hashtags, etc.)
- Removal of stopwords using [nltk](https://www.nltk.org/) (a sketch of both steps follows)
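
A rough sketch of this cleaning, assuming English text (nltk ships stopword lists for several of the card's languages, such as English, French, Portuguese, and Russian, but not all of them; the regexes are illustrative):

```python
import re

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOPWORDS = set(stopwords.words("english"))  # swap per language where nltk has a list

def clean(text: str) -> str:
    text = re.sub(r"https?://\S+", " ", text)  # links
    text = re.sub(r"[@#]\w+", " ", text)       # mentions and hashtags
    tokens = [t for t in text.split() if t.lower() not in STOPWORDS]
    return " ".join(tokens)

# Deduplicate while preserving order.
reviews = ["I loved it! https://t.co/xyz", "I loved it! https://t.co/xyz", "@user worst film #fail"]
reviews = list(dict.fromkeys(clean(r) for r in reviews))
```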

### Speeds, Sizes, Times

Training took 6 hours on an NVIDIA T4 GPU.

## Evaluation

### Testing Data, Factors & Metrics

- [IMDB test split](https://huggingface.co/datasets/imdb)
- [sepidmnorozy/Russian_sentiment train split](https://huggingface.co/datasets/sepidmnorozy/Russian_sentiment) (as reported in the model index above)
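
The reported IMDB numbers can be reproduced along these lines (a sketch; `scikit-learn` is assumed for the metric computation):

```python
from datasets import load_dataset
from setfit import SetFitModel
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

model = SetFitModel.from_pretrained("germla/satoken")
test = load_dataset("imdb", split="test")

preds = model(test["text"])  # one predicted label per review
print("accuracy :", accuracy_score(test["label"], preds))
print("f1       :", f1_score(test["label"], preds))
print("precision:", precision_score(test["label"], preds))
print("recall   :", recall_score(test["label"], preds))
```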

# Environmental Impact

- Hardware Type: NVIDIA T4 GPU
- Hours used: 6
- Cloud Provider: Amazon Web Services
- Compute Region: ap-south-1 (Mumbai)
- Carbon Emitted: 0.39 [kg CO₂ eq.](https://mlco2.github.io/impact/#co2eq)