import gradio as gr
from model import CnnVoiceClassifier
from glob import glob

# Load the trained classifier and define the Gradio input/output components.
model = CnnVoiceClassifier()
audio_component = gr.Audio(type='filepath', label='Upload your audio file here')
label_component = gr.Label(label='Gender classification result')

# Bundled example clips, picked up by filename prefix.
sample_female = glob('female_*.wav')
sample_male = glob('male_*.wav')

title = 'CNN Voice Classifier 👨👩'
description = '''
I created this AI model as a side project stemming from a larger one. While exploring the Mozilla Common Voice dataset to train a Text-to-Speech (TTS) model, I noticed that a significant number of audio samples carried incorrect gender labels, which would severely degrade the training of any TTS model built on them. My goal was a quick and easy way to determine a speaker's gender from their voice. To my surprise, "easy" solutions like the one found here weren't robust enough to handle background noise or poor microphone quality, while robust solutions, such as complex Transformer models like the one here, were too resource-intensive. I believed there had to be a relatively simple model, like a Convolutional Neural Network (CNN), that could reliably classify a speaker's gender despite environmental noise and recording artifacts.
When developing this CNN, I made dataset quality a top priority: I aimed to train on a wide variety of sound conditions and languages so the model would not become biased towards any pattern other than the speaker's gender.
To achieve this, I combined three distinct datasets:
By combining these diverse datasets with data augmentation techniques, such as adding background noise to the audio clips at training time, I was able to create a very robust model.
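The augmentation itself can be as simple as mixing a noise recording into each training clip at a random signal-to-noise ratio. Here is a minimal NumPy sketch of that idea; the function name, SNR range, and parameters are illustrative, not the actual training code:

```python
import numpy as np

def add_background_noise(speech: np.ndarray, noise: np.ndarray,
                         snr_db_range=(5.0, 20.0)) -> np.ndarray:
    """Mix a noise clip into a speech clip at a random SNR (illustrative sketch)."""
    # Tile or trim the noise so it matches the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]

    # Pick a random target signal-to-noise ratio in dB.
    snr_db = np.random.uniform(*snr_db_range)

    # Scale the noise so the mixture hits the target SNR.
    speech_power = np.mean(speech ** 2) + 1e-10
    noise_power = np.mean(noise ** 2) + 1e-10
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))

    return speech + scale * noise
```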
The architecture for this model was inspired by MobileNetV3, which is known for being lightweight and efficient. Although it was originally designed for images, it adapts easily to 1D signals like audio. The diagram below illustrates the core building block of the architecture, called the Universal Inverted Bottleneck. The block consists of two pointwise convolutional layers surrounding a depthwise convolution, plus a simple attention mechanism known as Squeeze-and-Excitation; a skip connection is added when the block's configuration allows it, i.e. when the input and output shapes match. The complete model architecture, shown on the right, is composed of many such blocks with varying configurations of skip connections, filters, and attention mechanisms.
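To make the block structure concrete, here is a minimal PyTorch sketch of a 1D inverted bottleneck with Squeeze-and-Excitation. It follows the shape described above, but the layer sizes, activation (Hardswish), and SE reduction factor are illustrative choices, not the repository's actual definitions:

```python
import torch.nn as nn

class SqueezeExcite1d(nn.Module):
    """Channel attention: global-pool, bottleneck MLP, sigmoid gate."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool1d(1),
            nn.Conv1d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv1d(channels // reduction, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)

class InvertedBottleneck1d(nn.Module):
    """Expand (1x1 conv) -> depthwise conv -> SE -> project (1x1 conv)."""
    def __init__(self, in_ch, out_ch, expand=4, kernel=5, stride=1, use_se=True):
        super().__init__()
        mid = in_ch * expand
        layers = [
            nn.Conv1d(in_ch, mid, 1, bias=False),          # pointwise expansion
            nn.BatchNorm1d(mid), nn.Hardswish(),
            nn.Conv1d(mid, mid, kernel, stride=stride,     # depthwise conv
                      padding=kernel // 2, groups=mid, bias=False),
            nn.BatchNorm1d(mid), nn.Hardswish(),
        ]
        if use_se:
            layers.append(SqueezeExcite1d(mid))
        layers += [
            nn.Conv1d(mid, out_ch, 1, bias=False),         # pointwise projection
            nn.BatchNorm1d(out_ch),
        ]
        self.block = nn.Sequential(*layers)
        # Skip connection only when input and output shapes match.
        self.use_skip = stride == 1 and in_ch == out_ch

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_skip else out
```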
This model is incredibly compact, weighing in at only 4 MB, which makes it well suited to running on mobile devices or even directly in web browsers. I trained it on Google Colab for roughly 114 epochs, using ReduceLROnPlateau to lower the learning rate when validation performance plateaued and EarlyStopping to halt training once it stopped improving. To ensure broad compatibility, I converted the model to the universal ONNX format, making it easy to deploy across platforms including Linux, Windows, macOS, Android, iOS, and web browsers via WebGPU. Below, you can see graphs of the model's accuracy, loss, and learning rate over the course of training.
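As an illustration of why the ONNX format is convenient, inference takes only a few lines with onnxruntime. The file name, input shape, and label order below are assumptions for the sketch, not values taken from this repository:

```python
import numpy as np
import onnxruntime as ort

# Load the exported model; ONNX Runtime picks the requested backend.
session = ort.InferenceSession('cnn_voice_classifier.onnx',
                               providers=['CPUExecutionProvider'])

# Assumed input: one batch of audio features, e.g. a mel spectrogram
# of shape (batch, n_mels, n_frames). Placeholder values here.
input_name = session.get_inputs()[0].name
features = np.random.randn(1, 64, 128).astype(np.float32)

scores = session.run(None, {input_name: features})[0]
print('female' if scores[0].argmax() == 0 else 'male')  # label order is an assumption
```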