	---
	library_name: transformers
	language:
	- ht
	license: apache-2.0
	base_model: openai/whisper-medium
	tags:
	- generated_from_trainer
	datasets:
	- jsbeaudry/creole-text-voice
	model-index:
	- name: whisper medium creole oswald
	results: []
	---



	# whisper-medium-creole-oswald

	This model is a fine-tuned version of [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) on the `jsbeaudry/creole-text-voice` dataset.
	The long-term objective is a Haitian Creole speech-to-text model that reaches 99% transcription accuracy on diverse Haitian voices across accents, regions, and speaking styles.
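The card does not say how accuracy is measured. A common convention for speech recognition is word error rate (WER), under which a 99% accuracy target would correspond roughly to a WER of 0.01. The sketch below illustrates that metric under this assumption; it is not the author's evaluation code, and `word_error_rate` is a hypothetical helper name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences only; they are not taken from the dataset.
reference = "mwen renmen peyi mwen"
hypothesis = "mwen renmen peyi m"
print(word_error_rate(reference, hypothesis))  # one wrong word out of four
```

Libraries such as `jiwer` or `evaluate` provide tested WER implementations; the hand-written version here only shows what the number means.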

	---

	## 🧠 Model description

	`whisper-medium-creole-oswald` is optimized for Haitian Creole automatic speech recognition (ASR). It adapts OpenAI's Whisper architecture to Haitian Creole by fine-tuning on a curated dataset of about 5 hours of Haitian Creole audio-text pairs.

	- Architecture: Whisper Medium
	- Fine-tuned for: Haitian Creole (Kreyòl Ayisyen)
	- Vocabulary: Based on Latin script (Creole orthography), preserving diacritics and linguistic nuances.
	- Voice types: Trained on synthetic female voices
	- Sampling rate: 16 kHz
	- Training objective: Maximize transcription accuracy for everyday Creole speech

	---


	### ✅ Intended uses
	- Transcribe Haitian Creole speech from:
	  - Voice notes
	  - Radio shows
	  - Interviews
	  - Public speeches
	  - Educational content
	  - Synthetic voices
	- Enable Creole voice interfaces in:
	  - Voice assistants
	  - Transcription services
	  - Language-learning tools
	  - Chatbots and accessibility platforms

	### ⚠️ Limitations
	- May struggle with:
	  - Heavily code-switched speech (Creole mixed with French or English)
	  - Extremely poor audio quality (e.g., heavy background noise)
	  - Very fast or mumbled speech in some dialects
	  - Long audio files
	- Not optimized for real-time transcription on low-resource devices
	- Fine-tuned on a specific dataset, so it may generalize less well to completely unseen voice types or rare accents
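Whisper's feature extractor works on windows of up to 30 seconds, which is one reason long files are listed above. A possible workaround is to split the waveform into 30-second pieces and transcribe each piece separately. The following is a minimal sketch under that assumption; `split_into_chunks` is a hypothetical helper, and a fixed-size split can cut a word at a chunk boundary.

```python
SAMPLE_RATE = 16_000  # matches the 16 kHz rate stated in this card


def split_into_chunks(waveform, chunk_seconds=30.0, sample_rate=SAMPLE_RATE):
    """Split a 1-D waveform into consecutive chunks of at most chunk_seconds.

    Works on any sliceable sequence, for example a list or a NumPy array.
    """
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    chunk_size = int(chunk_seconds * sample_rate)
    return [waveform[start:start + chunk_size]
            for start in range(0, len(waveform), chunk_size)]


# 70 seconds of silence stands in for a real recording.
audio = [0.0] * (70 * SAMPLE_RATE)
chunks = split_into_chunks(audio)
print([len(chunk) / SAMPLE_RATE for chunk in chunks])  # durations in seconds
```

Each chunk could then be passed to the transcription function from the usage script and the resulting texts joined. The `transformers` ASR pipeline also accepts a `chunk_length_s` argument that handles chunking internally, which may be the simpler option.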

	---

	## 📊 Training and evaluation data

	The model was trained on the creole-text-voice dataset, which includes:

	- 5 hours of synthetic Haitian Creole speech
	- Annotated, time-aligned text transcripts following standard Creole orthography

	### Planned data sources for next steps:
	- Public domain radio and podcast archives
	- Open-access interviews and spoken-word audio
	- Community-submitted voice samples

	### Preprocessing steps:
	- Voice Activity Detection (VAD)
	- Noise filtering and audio normalization
	- Manual transcript review and correction
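The card does not describe how these steps were implemented. As a rough illustration only, the sketch below shows peak normalization and a very simple energy-based voice activity check. The function names, the 25 ms frame size, and the threshold are all assumptions, and the code is kept dependency-free for readability; a real pipeline would more likely use NumPy and a dedicated VAD library.

```python
import math


def peak_normalize(samples, target_peak=0.95):
    """Scale samples so the largest absolute value equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silent input: nothing to scale
    return [s * target_peak / peak for s in samples]


def voiced_frames(samples, frame_size=400, threshold=0.02):
    """Return indices of full frames whose RMS energy exceeds threshold.

    frame_size=400 is 25 ms at 16 kHz; both defaults are illustrative.
    """
    voiced = []
    starts = range(0, len(samples) - frame_size + 1, frame_size)
    for index, start in enumerate(starts):
        frame = samples[start:start + frame_size]
        rms = math.sqrt(sum(s * s for s in frame) / frame_size)
        if rms > threshold:
            voiced.append(index)
    return voiced


# Synthetic signal: 25 ms of silence, 25 ms of a 440 Hz tone, 25 ms of silence.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16_000) for n in range(400)]
signal = [0.0] * 400 + tone + [0.0] * 400
normalized = peak_normalize(signal)
print(voiced_frames(normalized))  # only the middle frame contains energy
```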


	## Model usage script

	```python
	# Load the processor and model from the Hugging Face Hub
	from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
	import librosa
	import torch

	processor = AutoProcessor.from_pretrained("jsbeaudry/whisper-medium-oswald")
	model = AutoModelForSpeechSeq2Seq.from_pretrained("jsbeaudry/whisper-medium-oswald")

	def transcript(audio_file_path):
	    # 1. Load the audio as mono and resample it to the 16 kHz the model expects
	    speech_array, sampling_rate = librosa.load(audio_file_path, sr=16000)

	    # 2. Convert the waveform to the input features the model consumes
	    input_features = processor(speech_array, sampling_rate=sampling_rate, return_tensors="pt").input_features

	    # 3. Generate token ids (gradients are not needed for inference)
	    with torch.no_grad():
	        predicted_ids = model.generate(input_features)

	    # 4. Decode the token ids; batch_decode returns a list with one string per input
	    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
	    return transcription

	text = transcript("/path_audio")

	print(text)
	```


	## Model usage with gradio (UI)

	```python

	from transformers import pipeline
	import gradio as gr

	# Load the Whisper model as a speech-recognition pipeline
	print("Loading model...")
	pipe = pipeline("automatic-speech-recognition", model="jsbeaudry/whisper-medium-oswald")
	print("Model loaded successfully.")

	# Transcription function: prefer the recording if there is one, else the upload
	def transcribe(uploaded_path, recorded_path):
	    audio_path = recorded_path or uploaded_path
	    if audio_path is None:
	        return "Please upload or record an audio file first."
	    result = pipe(audio_path)
	    return result["text"]

	# Build Gradio interface
	def create_interface():
	    with gr.Blocks(title="Whisper Medium - Haitian Creole") as demo:
	        gr.Markdown("# 🎙️ Whisper Medium Creole ASR")
	        gr.Markdown(
	            "Upload an audio file or record your voice in Haitian Creole. "
	            "Then click Transcribe to see the result."
	        )

	        with gr.Row():
	            with gr.Column():
	                # `source=` is the Gradio 3.x argument; Gradio 4 and later use `sources=[...]`
	                audio_input = gr.Audio(source="upload", type="filepath", label="🎧 Upload Audio")
	                audio_input2 = gr.Audio(source="microphone", type="filepath", label="🎤 Record Audio")
	            with gr.Column():
	                transcribe_button = gr.Button("🔍 Transcribe")
	                output_text = gr.Textbox(label="📝 Transcribed Text", lines=4)

	        # One handler receives both inputs, so a single result is written to the textbox
	        transcribe_button.click(fn=transcribe, inputs=[audio_input, audio_input2], outputs=output_text)

	    return demo

	if __name__ == "__main__":
	    interface = create_interface()
	    interface.launch()
	```

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 1e-05
	- train_batch_size: 16
	- eval_batch_size: 8
	- seed: 42
	- optimizer: AdamW (`OptimizerNames.ADAMW_TORCH`) with betas=(0.9, 0.999) and epsilon=1e-08, no additional optimizer arguments
	- lr_scheduler_type: linear
	- lr_scheduler_warmup_steps: 500
	- num_epochs: 5
	- mixed_precision_training: Native AMP
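To make the `linear` scheduler with 500 warmup steps more concrete, the sketch below computes the learning rate at a given step: it rises linearly from 0 to the base rate during warmup, then decays linearly to 0. The total number of training steps is not reported in this card, so `total_steps=2000` is a placeholder assumption, and `linear_schedule_lr` is a hypothetical helper rather than the Trainer's own code.

```python
def linear_schedule_lr(step, base_lr=1e-5, warmup_steps=500, total_steps=2000):
    """Learning rate at `step` for linear warmup followed by linear decay.

    base_lr and warmup_steps come from the hyperparameter list; total_steps is
    a placeholder because the card does not report the number of training steps.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Print the schedule at a few points: start, mid-warmup, peak, mid-decay, end.
for step in (0, 250, 500, 1250, 2000):
    print(step, linear_schedule_lr(step))
```

With these placeholder values the rate peaks at 1e-05 at step 500 and is back to half of that at step 1250.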


	### Framework versions

	- Transformers 4.46.1
	- PyTorch 2.6.0+cu124
	- Datasets 3.5.0
	- Tokenizers 0.20.3



	## 📌 Citation

	If you use this model, please cite:

	```bibtex
	@misc{whispermediumcreoleoswald2025,
	  title={Whisper Medium Creole - Oswald},
	  author={Jean Sauvenel Beaudry},
	  year={2025},
	  howpublished={\url{https://huggingface.co/jsbeaudry}}
	}
	```