|
--- |
|
license: apache-2.0 |
|
tags: |
|
- t5 |
|
- text-classification |
|
- open-source-ai |
|
- liar-dataset |
|
- educational |
|
- demo |
|
datasets: |
|
- liar |
|
- open-source-ai-liar |
|
language: |
|
- en |
|
base_model: |
|
- t5-small |
|
pipeline_tag: text2text-generation |
|
--- |
|
|
|
# open-source-ai-t5-liar-lens |
|
|
|
This is a fine-tuned version of [`t5-small`](https://huggingface.co/t5-small), trained to classify short claims and quotes in the style of the [LIAR dataset](https://huggingface.co/datasets/liar). It was developed as part of the *Open Source AI* book project by Jerry Cuomo and José De Jesús to demonstrate fine-tuning and evaluation techniques with a light, satirical touch. |
|
|
|
### What's Different? |
|
|
|
To test T5’s summarization-as-classification capabilities, we augmented the LIAR dataset with 225 synthetic examples drawn from our own book. These were written to echo the style of political claims—confident, compressible, and occasionally absurd. It’s a tongue-in-cheek benchmark, but a useful one. It lets us explore how a summarization model handles short-form reasoning, fake-ish news, and the delightful blur between fact and fiction in machine learning writing. |
|
|
|
So while the original LIAR dataset supplies factual claims from political discourse, our additions bring in quotes that parody open-source mantras, AI hype cycles, and technical one-liners. The result is a model that scores both campaign promises and keynote punchlines with equal scrutiny. |
|
|
|
### Task Format |
|
|
|
This model treats classification as a **text-to-text generation task**. Each input is a short claim or quote, and the model responds with one of six factuality labels, generated directly as a lowercase string: |
|
|
|
- `pants-fire` |
|
- `false` |
|
- `barely-true` |
|
- `half-true` |
|
- `mostly-true` |
|
- `true` |
|
|
|
The input format uses a summarization-style prefix to frame the task: |
|
|
|
**Example Input**: |
|
``` |
|
summarize: Python is the fastest programming language available. |
|
``` |
|
|
|
**Example Output**: |
|
``` |
|
half-true |
|
``` |
|
|
|
The label is produced as generated text, with no classification head involved: the model simply learns to emit one of the six strings above. A graded label like `half-true` shows it can place a claim between the extremes rather than making a binary call.
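Because the label arrives as free-form generated text, it's worth normalizing it before use. The helper below is our own suggestion, not part of the model: it snaps raw decoder output onto the six-label set, tolerating stray whitespace, casing, or a truncated token.

```python
# The six LIAR-style labels the model is trained to emit.
LABELS = ["pants-fire", "false", "barely-true", "half-true", "mostly-true", "true"]

def snap_to_label(generated: str) -> str:
    """Map raw generated text onto the nearest valid label.

    T5 usually emits a label verbatim, but decoding can add stray
    whitespace or cut a token short, so we normalize defensively.
    """
    text = generated.strip().lower()
    if text in LABELS:
        return text
    if text:
        # Fall back to prefix matching, longest label first,
        # so "mostly-true" wins over "true".
        for label in sorted(LABELS, key=len, reverse=True):
            if text.startswith(label) or label.startswith(text):
                return label
    return "half-true"  # neutral fallback for unrecognized output
```

For example, `snap_to_label(" Mostly-True ")` returns `mostly-true`, and a truncated `pants` still resolves to `pants-fire`.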
|
|
|
### Training Details |
|
|
|
- **Base model**: `t5-small` |
|
- **Datasets**: |
|
- [LIAR](https://huggingface.co/datasets/liar) |
|
- [Open Source AI LIAR-style CSV](https://github.com/OpenSourceAI-Book/code/blob/main/datasets/open_source_ai-liar.csv) |
|
- **Epochs**: 5 |
|
- **Batch size**: 4 |
|
- **Max input length**: 128 tokens |
|
- **Platform**: Google Colab |
|
- **Checkpoint**: `open-source-ai-t5-liar-lens` |
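To make the training setup concrete, here is a minimal sketch of how a LIAR-style row becomes a text-to-text pair. This is our reconstruction, not the exact training script, and the field names `statement` and `label` are assumptions about the CSV layout.

```python
def build_example(statement: str, label: str) -> dict:
    """Format one LIAR-style row as a T5 text-to-text pair.

    The input gets the same "summarize:" prefix used at inference
    time; the target is the factuality label as a lowercase string.
    """
    return {
        "input_text": f"summarize: {statement.strip()}",
        "target_text": label.strip().lower(),
    }

# Hypothetical row from the LIAR-style CSV.
row = {
    "statement": "Python is the fastest programming language available.",
    "label": "half-true",
}
example = build_example(row["statement"], row["label"])
```

Both strings would then be tokenized (inputs capped at 128 tokens, per the settings above) before being fed to the Hugging Face `Trainer`.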
|
|
|
### Intended Use |
|
|
|
This model is designed for educational use, particularly in demonstrating: |
|
- Lightweight fine-tuning with Hugging Face Transformers |
|
- Classification-as-generation using T5 |
|
- Transparent model publishing and benchmarking |
|
- The fuzziness of truth in machine learning culture |
|
|
|
It is **not** intended for production-grade fact-checking or regulatory enforcement. |
|
|
|
### Example Usage |
|
|
|
```python |
|
from transformers import T5ForConditionalGeneration, T5Tokenizer |
|
|
|
# Load the fine-tuned model and tokenizer |
|
model = T5ForConditionalGeneration.from_pretrained("gcuomo/open-source-ai-t5-liar-lens")

tokenizer = T5Tokenizer.from_pretrained("gcuomo/open-source-ai-t5-liar-lens")
|
|
|
# Prepare input |
|
statement = "Blockchain guarantees ethical outcomes in all AI systems." |
|
prompt = f"summarize: {statement}" |
|
inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True, max_length=128) |
|
|
|
# Generate prediction |
|
output = model.generate(**inputs, max_new_tokens=8) |
|
prediction = tokenizer.decode(output[0], skip_special_tokens=True).strip().lower() |
|
|
|
# Print result |
|
print("Predicted label:", prediction) |
|
|
|
``` |
|
|
|
### Citation |
|
|
|
If you reference this model or its training methodology, please cite: |
|
|
|
> Cuomo, J. & De Jesús, J. (2025). *Open Source AI*. No Starch Press. |
|
> Training datasets: |
|
> - [LIAR dataset](https://huggingface.co/datasets/liar) |
|
> - [Open Source AI LIAR-style dataset](https://github.com/OpenSourceAI-Book/code/blob/main/datasets/open_source_ai-liar.csv) |
|
|