---
library_name: transformers
language:
- ht
license: apache-2.0
base_model: openai/whisper-medium
tags:
- generated_from_trainer
datasets:
- jsbeaudry/creole-text-voice
model-index:
- name: whisper small creole oswald
  results: []
---

# whisper-medium-creole-oswald

This model is a fine-tuned version of [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) on the **creole-text-voice** dataset.

The long-term goal is a Haitian Creole speech-to-text model that reaches **99% transcription accuracy** across Haitian accents, regions, and speaking styles. This is a target rather than a measured result: no evaluation metrics are reported for this checkpoint.

---

## Model description

**whisper-medium-creole-oswald** is optimized for Haitian Creole automatic speech recognition (ASR). It adapts OpenAI's Whisper architecture to Haitian Creole by fine-tuning the pretrained medium checkpoint on a curated dataset of about 5 hours of synthetic Haitian Creole audio-text pairs.

- **Architecture**: Whisper Medium
- **Fine-tuned for**: Haitian Creole (Kreyòl Ayisyen)
- **Vocabulary**: Based on Latin script (Creole orthography), preserving diacritics and linguistic nuances.
- **Voice types**: Trained on synthetic female voices.
- **Sampling rate**: 16 kHz
- **Training objective**: Maximize transcription accuracy for everyday Creole speech

---

### ✅ Intended uses

- Transcribe Haitian Creole speech from:
  - Voice notes
  - Radio shows
  - Interviews
  - Public speeches
  - Educational content
  - Synthetic voices
- Enable Creole voice interfaces in:
  - Voice assistants
  - Transcription services
  - Language-learning tools
  - Chatbots and accessibility platforms

### ⚠️ Limitations

- May struggle with:
  - Heavily code-switched speech (Creole mixed with French or English)
  - Extremely poor audio quality (e.g., heavy background noise)
  - Very fast or mumbled speech in some dialects
  - Long audio files (Whisper processes audio in 30-second windows, so longer recordings need to be split into chunks)
- Not optimized for **real-time transcription** on low-resource devices
- Fine-tuned on a small, specific dataset of synthetic female voices, so it may generalize poorly to unseen voice types (for example, male or natural voices) or rare accents
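
Because Whisper's encoder sees at most 30 seconds of audio at a time, long recordings are usually split into windows before transcription. The sketch below shows one way to do that with NumPy alone. The function name `split_into_chunks`, the 30-second window, and the 5-second overlap are illustrative assumptions, not part of this model's released code. The `transformers` speech-recognition pipeline also accepts a `chunk_length_s` argument that handles chunking internally.

```python
import numpy as np

SAMPLE_RATE = 16_000  # the model expects 16 kHz audio


def split_into_chunks(waveform, chunk_s=30.0, overlap_s=5.0, sample_rate=SAMPLE_RATE):
    """Split a 1-D waveform into fixed-length windows that overlap slightly.

    Hypothetical helper: the overlap gives neighbouring chunks shared context,
    so a word cut at one boundary appears whole in the adjacent chunk.
    """
    if overlap_s >= chunk_s:
        raise ValueError("overlap_s must be smaller than chunk_s")
    chunk_len = int(chunk_s * sample_rate)
    step = chunk_len - int(overlap_s * sample_rate)
    chunks = []
    for start in range(0, len(waveform), step):
        chunks.append(waveform[start:start + chunk_len])
        if start + chunk_len >= len(waveform):
            break  # this window already reaches the end of the audio
    return chunks


# Example: 70 seconds of silence gives windows starting at 0 s, 25 s, and 50 s
audio = np.zeros(70 * SAMPLE_RATE, dtype=np.float32)
chunks = split_into_chunks(audio)
print([len(c) / SAMPLE_RATE for c in chunks])
```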

---

## Training and evaluation data

The model was trained on the **creole-text-voice** dataset, which includes:

- **5 hours** of synthetic Haitian Creole speech
- Annotated, time-aligned text transcripts following standard Creole orthography
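
To track progress toward the accuracy target on a held-out split, ASR systems are commonly scored with word error rate (WER): the word-level edit distance between reference and hypothesis, divided by the number of reference words. The following is a minimal, dependency-free sketch. The function name and the sample sentences are illustrative, and the Unicode NFC normalization step is an assumption added so that composed and decomposed forms of accented letters such as `ò` compare equal.

```python
import unicodedata


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    # NFC normalization so accented letters (e.g. "ò") have one canonical form
    ref = unicodedata.normalize("NFC", reference).lower().split()
    hyp = unicodedata.normalize("NFC", hypothesis).lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences (not taken from the dataset): one wrong word out of four
print(word_error_rate("mwen renmen pale kreyòl", "mwen renmen pale kreyol"))  # 0.25
```

If accuracy is read as 1 − WER, the 99% target corresponds to a WER of about 0.01 on a representative test set.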

### Planned sources for future versions:

- Public domain radio and podcast archives
- Open-access interviews and spoken-word audio
- Community-submitted voice samples

### Preprocessing steps:

- Voice Activity Detection (VAD)
- Noise filtering and audio normalization
- Manual transcript review and correction
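
The card does not say which tools implemented these steps. As a rough illustration only, the sketch below shows peak normalization and a very simple energy-based VAD in NumPy. The function names, frame length, and threshold are assumptions; real pipelines typically rely on dedicated VAD models and noise-reduction tools instead.

```python
import numpy as np


def peak_normalize(waveform, target_peak=0.95):
    """Scale the waveform so its largest absolute sample equals target_peak."""
    peak = np.max(np.abs(waveform))
    if peak == 0:
        return waveform  # pure silence: nothing to scale
    return waveform * (target_peak / peak)


def energy_vad(waveform, sample_rate=16_000, frame_ms=30, threshold=0.01):
    """Return one boolean per frame: True where RMS energy exceeds the threshold."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(waveform) // frame_len
    # Drop the incomplete trailing frame, then view the audio as (frames, samples)
    frames = waveform[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    return rms > threshold


# Synthetic check: 1 s of silence followed by 1 s of a 440 Hz tone
sr = 16_000
t = np.arange(sr) / sr
audio = np.concatenate([np.zeros(sr), 0.3 * np.sin(2 * np.pi * 440 * t)]).astype(np.float32)

audio = peak_normalize(audio)
speech_frames = energy_vad(audio, sample_rate=sr)
print(f"{speech_frames.sum()} of {len(speech_frames)} frames flagged as active")
```

A fixed energy threshold like this only separates silence from sound; it cannot tell speech from music or noise, which is why it is shown here purely as a teaching example.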

## Model usage script

```python
# Load the processor and model directly
import librosa
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

MODEL_ID = "jsbeaudry/whisper-medium-oswald"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL_ID)
model.eval()


def transcript(audio_file_path):
    # 1. Load the audio as a mono waveform resampled to the 16 kHz rate Whisper expects
    speech_array, sampling_rate = librosa.load(audio_file_path, sr=16000)

    # 2. Convert the waveform to log-mel input features
    input_features = processor(
        speech_array, sampling_rate=sampling_rate, return_tensors="pt"
    ).input_features

    # 3. Generate token ids (no gradients are needed for inference)
    with torch.no_grad():
        predicted_ids = model.generate(input_features)

    # 4. Decode the token ids to text; batch_decode returns one string per input
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
    return transcription[0]


text = transcript("/path_audio")
print(text)
```

## Model usage with gradio (UI)

```python
from transformers import pipeline
import gradio as gr

# Load the fine-tuned Whisper model as a speech-recognition pipeline
print("Loading model...")
pipe = pipeline("automatic-speech-recognition", model="jsbeaudry/whisper-medium-oswald")
print("Model loaded successfully.")


# Transcription function
def transcribe(audio_path):
    if audio_path is None:
        return "Please upload or record an audio file first."
    result = pipe(audio_path)
    return result["text"]


# Build Gradio interface
def create_interface():
    with gr.Blocks(title="Whisper Medium - Haitian Creole") as demo:
        gr.Markdown("# Whisper Medium Creole ASR")
        gr.Markdown(
            "Upload an audio file or record your voice in Haitian Creole. "
            "Then click **Transcribe** to see the result."
        )

        with gr.Row():
            with gr.Column():
                # One component handles both upload and microphone input.
                # `sources=[...]` is the Gradio 4+ argument; Gradio 3.x used `source="upload"`.
                audio_input = gr.Audio(
                    sources=["upload", "microphone"],
                    type="filepath",
                    label="Upload or Record Audio",
                )
            with gr.Column():
                transcribe_button = gr.Button("Transcribe")
                output_text = gr.Textbox(label="Transcribed Text", lines=4)

        # A single click handler avoids two callbacks racing to write the same textbox
        transcribe_button.click(fn=transcribe, inputs=audio_input, outputs=output_text)

    return demo


if __name__ == "__main__":
    interface = create_interface()
    interface.launch()
```

### Training hyperparameters

The following hyperparameters were used during training:

- learning_rate: 1e-05
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- optimizer: AdamW (PyTorch implementation, `adamw_torch`) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- num_epochs: 5
- mixed_precision_training: Native AMP
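
With `lr_scheduler_type: linear` and 500 warmup steps, the learning rate ramps linearly from 0 to the peak value of 1e-05 and then decays linearly to 0 at the last step. The sketch below reproduces that shape in plain Python. `total_steps=2000` is a made-up value for illustration, since the card does not state the number of optimizer steps.

```python
def linear_schedule_lr(step, peak_lr=1e-5, warmup_steps=500, total_steps=2000):
    """Learning rate at a given optimizer step for linear warmup + linear decay.

    total_steps is a hypothetical value; the card does not report it.
    """
    if step < warmup_steps:
        # Warmup: ramp from 0 up to peak_lr
        return peak_lr * step / max(1, warmup_steps)
    # Decay: fall from peak_lr at the end of warmup to 0 at total_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 250, 500, 1250, 2000):
    print(step, linear_schedule_lr(step))
```

The rate peaks exactly at step 500 and is back to half the peak midway through the decay phase (step 1250 under the assumed total).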

### Framework versions

- Transformers 4.46.1
- PyTorch 2.6.0+cu124
- Datasets 3.5.0
- Tokenizers 0.20.3

## Citation

If you use this model, please cite:

```bibtex
@misc{whispermediumcreoleoswald2025,
  title={Whisper Medium Creole - Oswald},
  author={Jean Sauvenel Beaudry},
  year={2025},
  howpublished={\url{https://huggingface.co/jsbeaudry}}
}
```