---
license: mit
tags:
- audio
- voice-activity-detection
- coreml
- silero
- speech
- ios
- macos
- swift
library_name: coreml
pipeline_tag: voice-activity-detection
datasets:
- alexwengg/musan_mini50
- alexwengg/musan_mini100
metrics:
- accuracy
- f1
language:
- en
base_model:
- onnx-community/silero-vad
---

# **<span style="color:#5DAF8D">🧃 CoreML Silero VAD </span>**

[Discord](https://discord.gg/WNsvaCtmDe)
[GitHub](https://github.com/FluidInference/FluidAudio)

A CoreML implementation of the Silero Voice Activity Detection (VAD) model, optimized for Apple platforms (iOS/macOS). This repository contains pre-converted CoreML models ready for use in Swift applications.

## Model Description

**Developed by:** Silero Team (original), converted by FluidAudio

**Model type:** Voice Activity Detection

**License:** MIT

**Parent Model:** [silero-vad](https://github.com/snakers4/silero-vad)

### Model Details

- **Architecture:** STFT + Encoder + RNN Decoder pipeline
- **Input:** 16kHz mono audio chunks (512 samples / 32ms)
- **Output:** Voice activity probability (0.0-1.0)
- **Model size:** ~2MB total
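
The input format above (512 samples at 16kHz, or 32ms per chunk) implies that longer recordings must be cut into fixed-size chunks before inference. The following is a minimal sketch of that step in plain Swift. The function names `makeChunks` and `chunkDurationMs` are hypothetical, and zero-padding the final partial chunk is an assumption rather than documented FluidAudio behavior.

```swift
import Foundation

/// Splits 16 kHz mono samples into fixed-size chunks, zero-padding the last one.
/// (Hypothetical helper; padding with silence is an assumption.)
func makeChunks(_ samples: [Float], chunkSize: Int = 512) -> [[Float]] {
    precondition(chunkSize > 0, "chunkSize must be positive")
    var chunks: [[Float]] = []
    var start = 0
    while start < samples.count {
        let end = min(start + chunkSize, samples.count)
        var chunk = Array(samples[start..<end])
        // Pad the final partial chunk with silence so every chunk has chunkSize samples.
        if chunk.count < chunkSize {
            chunk.append(contentsOf: [Float](repeating: 0, count: chunkSize - chunk.count))
        }
        chunks.append(chunk)
        start = end
    }
    return chunks
}

/// Duration of one chunk in milliseconds.
func chunkDurationMs(chunkSize: Int = 512, sampleRate: Double = 16_000) -> Double {
    Double(chunkSize) / sampleRate * 1_000
}

// One second of audio at 16 kHz does not divide evenly into 512-sample chunks,
// so the last chunk is partly padding.
let oneSecond = [Float](repeating: 0.1, count: 16_000)
let chunks = makeChunks(oneSecond)
print("chunks: \(chunks.count), ms per chunk: \(chunkDurationMs())")
```

Each resulting chunk can then be passed to the VAD one at a time, in order, since the RNN decoder relies on temporal context.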

## Intended Use

### Primary Use Cases

- Real-time voice activity detection in iOS/macOS applications
- Speech preprocessing for ASR systems
- Audio segmentation and filtering

## How to Use

### Swift Integration

```swift
import FluidAudio

let config = VADConfig(
    threshold: 0.3,
    chunkSize: 512, // 512 samples (32 ms at 16 kHz) is the recommended chunk size
    sampleRate: 16000
)

let vadManager = VADManager(config: config)
try await vadManager.initialize()

// Process one chunk of 16 kHz mono audio (audioChunk holds 512 samples)
let result = try await vadManager.processChunk(audioChunk)
print("Voice probability: \(result.probability)")
print("Is voice active: \(result.isVoiceActive)")
```
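
For the audio segmentation use case, per-chunk probabilities usually need to be grouped into time ranges. The sketch below shows one way to do that in plain Swift, independent of FluidAudio. `SpeechSegment` and `speechSegments` are hypothetical names, and treating a chunk as speech when its probability is greater than or equal to the threshold is an assumption about how the threshold is applied.

```swift
import Foundation

/// A contiguous run of chunks classified as speech, in seconds.
struct SpeechSegment: Equatable {
    let start: Double
    let end: Double
}

/// Groups per-chunk probabilities into speech segments using a fixed threshold.
/// (Hypothetical helper; the `>=` comparison is an assumption.)
func speechSegments(
    probabilities: [Float],
    threshold: Float = 0.3,
    chunkSize: Int = 512,
    sampleRate: Double = 16_000
) -> [SpeechSegment] {
    let chunkDuration = Double(chunkSize) / sampleRate
    var segments: [SpeechSegment] = []
    var segmentStart: Int? = nil

    for (index, probability) in probabilities.enumerated() {
        let isSpeech = probability >= threshold
        if isSpeech, segmentStart == nil {
            segmentStart = index            // speech begins at this chunk
        } else if !isSpeech, let start = segmentStart {
            segments.append(SpeechSegment(start: Double(start) * chunkDuration,
                                          end: Double(index) * chunkDuration))
            segmentStart = nil              // speech ended before this chunk
        }
    }
    // Close a segment that runs to the end of the input.
    if let start = segmentStart {
        segments.append(SpeechSegment(start: Double(start) * chunkDuration,
                                      end: Double(probabilities.count) * chunkDuration))
    }
    return segments
}

// Illustrative probabilities, not real model output.
let segments = speechSegments(probabilities: [0.1, 0.8, 0.9, 0.2, 0.05, 0.6])
for segment in segments {
    print("speech from \(segment.start)s to \(segment.end)s")
}
```

A plain threshold can flicker on borderline chunks; merging segments separated by very short gaps is a common follow-up step.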

### Installation

Add FluidAudio to the dependencies in your `Package.swift`:

```swift
dependencies: [
    .package(url: "https://github.com/FluidAudio/FluidAudioSwift.git", from: "1.0.0")
]
```
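
For context, here is a sketch of a complete `Package.swift` that includes the dependency above. The package and target name `MyVADApp`, the tools version, and the platform minimums are placeholders. The product name `FluidAudio` is assumed from the `import FluidAudio` statement, and the package identity `FluidAudioSwift` is assumed from the repository URL.

```swift
// swift-tools-version:5.9
import PackageDescription

let package = Package(
    name: "MyVADApp",                       // hypothetical package name
    platforms: [.macOS(.v13), .iOS(.v16)],  // assumed minimum versions, adjust as needed
    dependencies: [
        .package(url: "https://github.com/FluidAudio/FluidAudioSwift.git", from: "1.0.0")
    ],
    targets: [
        .executableTarget(
            name: "MyVADApp",
            dependencies: [
                // Product name assumed from `import FluidAudio`; package identity from the URL.
                .product(name: "FluidAudio", package: "FluidAudioSwift")
            ]
        )
    ]
)
```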

## Performance

### Benchmarks on Apple Silicon (M1/M2)

| Metric           | Value               |
|------------------|---------------------|
| Latency          | <2ms per 32ms chunk |
| Real-time Factor | 0.02x               |
| Memory Usage     | ~15MB               |
| CPU Usage        | <5% (single core)   |
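
The real-time factor (RTF) is the processing time divided by the duration of the audio processed. As a rough check on the table: 2ms per 32ms chunk corresponds to an RTF of 0.0625, and an RTF of 0.02 corresponds to about 0.64ms per chunk, which is within the stated latency bound. The sketch below works through that arithmetic and shows one way to time a per-chunk call. `realTimeFactor` and `meanSeconds` are hypothetical helpers, and the timed closure is a stand-in workload, not the model.

```swift
import Foundation

/// Real-time factor: processing time divided by the duration of audio processed.
func realTimeFactor(processingSeconds: Double, audioSeconds: Double) -> Double {
    precondition(audioSeconds > 0, "audioSeconds must be positive")
    return processingSeconds / audioSeconds
}

/// Times `work` over `iterations` runs and returns the mean wall-clock seconds per run.
func meanSeconds(iterations: Int, _ work: () -> Void) -> Double {
    precondition(iterations > 0, "iterations must be positive")
    let start = Date()
    for _ in 0..<iterations { work() }
    return Date().timeIntervalSince(start) / Double(iterations)
}

let chunkSeconds = 512.0 / 16_000.0                 // 0.032 s of audio per chunk
let latencyBound = realTimeFactor(processingSeconds: 0.002, audioSeconds: chunkSeconds)
let impliedLatency = 0.02 * chunkSeconds            // seconds per chunk implied by 0.02x

// Stand-in workload; replace the closure body with a real per-chunk VAD call.
let buffer = [Float](repeating: 0.5, count: 512)
let measured = meanSeconds(iterations: 100) {
    _ = buffer.reduce(0, +)
}
print("RTF at 2 ms per chunk: \(latencyBound)")
print("Latency implied by 0.02x: \(impliedLatency * 1_000) ms")
print("Stand-in RTF: \(realTimeFactor(processingSeconds: measured, audioSeconds: chunkSeconds))")
```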

### Accuracy Metrics

Evaluated on common speech datasets:

- Precision: 94.2%
- Recall: 92.8%
- F1-Score: 93.5%
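
The F1-score is the harmonic mean of precision and recall, so the three figures above can be checked against each other. The sketch below does that and also shows how the metrics follow from confusion counts. `DetectionMetrics` and `harmonicMean` are hypothetical names, and the counts in the second example are invented for illustration, not taken from the evaluation.

```swift
import Foundation

/// Precision, recall, and F1 from chunk-level confusion counts.
struct DetectionMetrics {
    let truePositives: Int
    let falsePositives: Int
    let falseNegatives: Int

    var precision: Double {
        let denominator = truePositives + falsePositives
        return denominator == 0 ? 0 : Double(truePositives) / Double(denominator)
    }
    var recall: Double {
        let denominator = truePositives + falseNegatives
        return denominator == 0 ? 0 : Double(truePositives) / Double(denominator)
    }
    var f1: Double { harmonicMean(precision, recall) }
}

/// Harmonic mean of two rates; returns 0 when both are 0.
func harmonicMean(_ a: Double, _ b: Double) -> Double {
    a + b == 0 ? 0 : 2 * a * b / (a + b)
}

// The reported precision and recall reproduce the reported F1-score.
let reportedF1 = harmonicMean(0.942, 0.928)
print(String(format: "F1 = %.1f%%", reportedF1 * 100))

// Hypothetical counts, for illustration only.
let example = DetectionMetrics(truePositives: 90, falsePositives: 10, falseNegatives: 30)
print(example.precision, example.recall, example.f1)
```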

## Model Files

This repository contains three CoreML models that work together:

- `silero_stft.mlmodel` (650KB) - STFT feature extraction
- `silero_encoder.mlmodel` (254KB) - Feature encoding
- `silero_rnn_decoder.mlmodel` (527KB) - RNN-based classification

## Training Data

The original Silero VAD model was trained on a diverse dataset including:

- Clean speech audio
- Noisy speech with various background conditions
- Music and non-speech audio for negative samples

## Limitations and Bias

### Known Limitations

- Optimized for 16kHz audio; input at other sample rates may reduce accuracy
- May struggle with very quiet speech (below -30dB SNR)
- Performance varies with microphone quality and recording conditions
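
Because the model expects 16kHz input, audio captured at other rates (48kHz is common) should be resampled first. The sketch below uses plain linear interpolation to show the idea. `resampleLinear` is a hypothetical helper, and it applies no anti-aliasing filter, so it is not a substitute for a proper resampler such as `AVAudioConverter` on Apple platforms.

```swift
import Foundation

/// Resamples mono audio to `targetRate` using linear interpolation.
/// Note: no anti-aliasing filter is applied, so this is only a rough sketch.
func resampleLinear(_ input: [Float], from sourceRate: Double, to targetRate: Double = 16_000) -> [Float] {
    precondition(sourceRate > 0 && targetRate > 0, "sample rates must be positive")
    guard input.count > 1, sourceRate != targetRate else { return input }

    let outputCount = Int((Double(input.count) * targetRate / sourceRate).rounded(.down))
    let step = sourceRate / targetRate          // input samples advanced per output sample
    var output = [Float](repeating: 0, count: outputCount)

    for i in 0..<outputCount {
        let position = Double(i) * step
        let left = min(Int(position), input.count - 1)
        let right = min(left + 1, input.count - 1)
        let fraction = Float(position - Double(left))
        // Blend the two neighbouring input samples.
        output[i] = input[left] * (1 - fraction) + input[right] * fraction
    }
    return output
}

// A 48 kHz ramp downsampled to 16 kHz: the ratio is exactly 3, so output
// sample i lands on input sample 3 * i.
let ramp48k = (0..<48).map { Float($0) }
let ramp16k = resampleLinear(ramp48k, from: 48_000)
print(ramp16k.count, Array(ramp16k.prefix(4)))
```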

## Technical Details

### Model Architecture

```
Audio Input (512 samples, 16kHz)
        ↓
STFT Model (spectral features)
        ↓
Encoder Model (feature compression)
        ↓
RNN Decoder (temporal modeling)
        ↓
Voice Probability Output
```
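
The flow above can be read as three stages applied in sequence, with the RNN decoder carrying state from one chunk to the next. The following sketch illustrates that structure only. `VADPipeline` and its closure signatures are hypothetical, the real models' tensor shapes are not specified here, and the toy stage bodies are stand-ins so the example runs; they are not the Silero computations.

```swift
import Foundation

/// Hypothetical stage signatures; the real models' inputs and outputs may differ.
struct VADPipeline {
    var stft: ([Float]) -> [Float]                       // samples -> spectral features
    var encoder: ([Float]) -> [Float]                    // spectral features -> compact features
    var decoder: ([Float], [Float]) -> (Float, [Float])  // (features, state) -> (probability, new state)
    var state: [Float]

    /// Runs one chunk through all three stages and carries the recurrent state forward.
    mutating func process(_ chunk: [Float]) -> Float {
        let spectral = stft(chunk)
        let encoded = encoder(spectral)
        let (probability, newState) = decoder(encoded, state)
        state = newState
        return probability
    }
}

// Toy stand-ins so the sketch runs; they are not the Silero computations.
var pipeline = VADPipeline(
    stft: { chunk in chunk.map { abs($0) } },
    encoder: { features in [features.reduce(0, +) / Float(max(features.count, 1))] },
    decoder: { features, state in
        // Exponential smoothing as a stand-in for temporal modeling.
        let smoothed = 0.5 * state[0] + 0.5 * features[0]
        return (min(max(smoothed, 0), 1), [smoothed])
    },
    state: [0]
)

let loud = [Float](repeating: 0.8, count: 512)
let first = pipeline.process(loud)
let second = pipeline.process(loud)
print(first, second)
```

The point of the stateful design is that the output for a chunk depends on earlier chunks, which is why chunks must be processed in order and the state reset between unrelated recordings.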

## Citation

```bibtex
@misc{silero-vad-coreml,
  title={CoreML Silero VAD},
  author={FluidAudio Team},
  year={2024},
  url={https://huggingface.co/alexwengg/coreml-silero-vad}
}

@misc{silero-vad,
  title={Silero VAD},
  author={Silero Team},
  year={2021},
  url={https://github.com/snakers4/silero-vad}
}
```

## Related Models

Check out other CoreML audio models in this collection: https://huggingface.co/collections/bweng/coreml-685b12fd251f80552c08e2b9

- https://huggingface.co/alexwengg/coreml_speaker_diarization - Identify "who spoke when"
- https://huggingface.co/collections/bweng/coreml-685b12fd251f80552c08e2b9 - Speech-to-text for Apple platforms

## Repository and Support

- GitHub: https://github.com/FluidAudio/FluidAudioSwift
- Documentation: https://github.com/FluidAudio/FluidAudioSwift/wiki
- Issues: https://github.com/FluidAudio/FluidAudioSwift/issues
- Community: https://github.com/FluidAudio/FluidAudioSwift/discussions

## License

This project is licensed under the MIT License; see the LICENSE file for details.

The original Silero VAD model is also released under the MIT license. See https://github.com/snakers4/silero-vad/blob/master/LICENSE for details.