import gradio as gr
from model import CnnVoiceClassifier
from glob import glob

# Load the trained classifier and define the Gradio input/output components.
model = CnnVoiceClassifier()
audio_component = gr.Audio(type='filepath', label='Upload your audio file here')
label_component = gr.Label(label='Gender classification result')

# Bundled example clips, picked up by filename prefix.
sample_female = glob('female_*.wav')
sample_male = glob('male_*.wav')

title = 'CNN Voice Classifier 👨👩'
description = '''
I created this AI model as a side project stemming from a larger one. While exploring the Mozilla Common Voice dataset to train a Text-to-Speech (TTS) model, I noticed that a significant number of audio samples carried incorrect gender labels, which would severely degrade the training of any TTS model built on them. My goal was a quick and easy way to determine a speaker's gender from their voice. To my surprise, "easy" solutions like the one found here weren't robust enough to handle background noise or poor microphone quality, while robust solutions, such as complex Transformer models like the one here, were too resource-intensive. I believed there had to be a relatively simple model, like a Convolutional Neural Network (CNN), that could reliably classify a speaker's gender despite environmental noise and recording artifacts.
When developing this CNN, I made dataset quality a top priority: I aimed to train on a wide variety of sound conditions and languages so the model would not become biased towards any pattern other than the speaker's gender.
To achieve this, I combined three distinct datasets:
By combining these diverse datasets with data augmentation techniques, such as adding background noise to the audio clips at training time, I was able to create a very robust model.
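The augmentation itself can be as simple as mixing a noise recording into each training clip at a random signal-to-noise ratio. Here is a minimal NumPy sketch of that idea; the function name, SNR range, and parameters are illustrative, not the actual training code:

```python
import numpy as np

def add_background_noise(speech: np.ndarray, noise: np.ndarray,
                         snr_db_range=(5.0, 20.0)) -> np.ndarray:
    """Mix a noise clip into a speech clip at a random SNR (illustrative sketch)."""
    # Tile or trim the noise so it matches the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]

    # Pick a random target signal-to-noise ratio in dB.
    snr_db = np.random.uniform(*snr_db_range)

    # Scale the noise so the mixture hits the target SNR.
    speech_power = np.mean(speech ** 2) + 1e-10
    noise_power = np.mean(noise ** 2) + 1e-10
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))

    return speech + scale * noise
```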
The architecture for this model was inspired by MobileNetV3, which is known for being lightweight and efficient. Although it was originally designed for images, it adapts easily to 1D signals like audio. The diagram below illustrates the core building block of the architecture, called the Universal Inverted Bottleneck. The block consists of two pointwise convolutional layers surrounding a depthwise convolution, plus a simple attention mechanism known as Squeeze-and-Excitation; a skip connection is added when the block's configuration allows it, i.e. when the input and output shapes match. The complete model architecture, shown on the right, is composed of many such blocks with varying configurations of skip connections, filters, and attention mechanisms.
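To make the block structure concrete, here is a minimal PyTorch sketch of a 1D inverted bottleneck with Squeeze-and-Excitation. It follows the shape described above, but the layer sizes, activation (Hardswish), and SE reduction factor are illustrative choices, not the repository's actual definitions:

```python
import torch.nn as nn

class SqueezeExcite1d(nn.Module):
    """Channel attention: global-pool, bottleneck MLP, sigmoid gate."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool1d(1),
            nn.Conv1d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv1d(channels // reduction, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)

class InvertedBottleneck1d(nn.Module):
    """Expand (1x1 conv) -> depthwise conv -> SE -> project (1x1 conv)."""
    def __init__(self, in_ch, out_ch, expand=4, kernel=5, stride=1, use_se=True):
        super().__init__()
        mid = in_ch * expand
        layers = [
            nn.Conv1d(in_ch, mid, 1, bias=False),          # pointwise expansion
            nn.BatchNorm1d(mid), nn.Hardswish(),
            nn.Conv1d(mid, mid, kernel, stride=stride,     # depthwise conv
                      padding=kernel // 2, groups=mid, bias=False),
            nn.BatchNorm1d(mid), nn.Hardswish(),
        ]
        if use_se:
            layers.append(SqueezeExcite1d(mid))
        layers += [
            nn.Conv1d(mid, out_ch, 1, bias=False),         # pointwise projection
            nn.BatchNorm1d(out_ch),
        ]
        self.block = nn.Sequential(*layers)
        # Skip connection only when input and output shapes match.
        self.use_skip = stride == 1 and in_ch == out_ch

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_skip else out
```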
This model is incredibly compact, weighing in at only 4 MB, which makes it well suited to running on mobile devices or even directly in web browsers. I trained it on Google Colab for roughly 114 epochs, using ReduceLROnPlateau to lower the learning rate when validation performance plateaued and EarlyStopping to halt training once it stopped improving. To ensure broad compatibility, I converted the model to the universal ONNX format, making it easy to deploy across platforms including Linux, Windows, macOS, Android, iOS, and web browsers via WebGPU. Below, you can see graphs of the model's accuracy, loss, and learning rate over the course of training.
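As an illustration of why the ONNX format is convenient, inference takes only a few lines with onnxruntime. The file name, input shape, and label order below are assumptions for the sketch, not values taken from this repository:

```python
import numpy as np
import onnxruntime as ort

# Load the exported model; ONNX Runtime picks the requested backend.
session = ort.InferenceSession('cnn_voice_classifier.onnx',
                               providers=['CPUExecutionProvider'])

# Assumed input: one batch of audio features, e.g. a mel spectrogram
# of shape (batch, n_mels, n_frames). Placeholder values here.
input_name = session.get_inputs()[0].name
features = np.random.randn(1, 64, 128).astype(np.float32)

scores = session.run(None, {input_name: features})[0]
print('female' if scores[0].argmax() == 0 else 'male')  # label order is an assumption
```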