---
license: mit
tags:
- audio
- voice-activity-detection
- coreml
- silero
- speech
- ios
- macos
- swift
library_name: coreml
pipeline_tag: voice-activity-detection
datasets:
- alexwengg/musan_mini50
- alexwengg/musan_mini100
metrics:
- accuracy
- f1
language:
- en
base_model:
- onnx-community/silero-vad
---
# **<span style="color:#5DAF8D">🧃 CoreML Silero VAD </span>**
[Discord](https://discord.gg/WNsvaCtmDe)
[GitHub](https://github.com/FluidInference/FluidAudio)

A CoreML implementation of the Silero Voice Activity
Detection (VAD) model, optimized for Apple platforms
(iOS/macOS). This repository contains pre-converted
CoreML models ready for use in Swift applications.
## Model Description
**Developed by:** Silero Team (original), converted by
FluidAudio
**Model type:** Voice Activity Detection
**License:** MIT
**Parent Model:**
[silero-vad](https://github.com/snakers4/silero-vad)
### Model Details
- **Architecture:** STFT + Encoder + RNN Decoder pipeline
- **Input:** 16kHz mono audio chunks (512 samples / 32ms)
- **Output:** Voice activity probability (0.0-1.0)
- **Memory:** ~2MB total model size
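
The input figures above follow from simple arithmetic: 512 samples at 16 kHz is 32 ms of audio. As a minimal sketch of preparing input, the snippet below splits a sample buffer into 512-sample chunks. The helper `makeChunks` is hypothetical and not part of FluidAudio, and zero-padding the final short chunk is an assumption about how to handle the tail; the library may treat it differently.

```swift
import Foundation

/// Hypothetical helper: splits 16 kHz mono samples into fixed-size chunks.
/// A short final chunk is zero-padded (an assumption, not library behavior).
func makeChunks(_ samples: [Float], chunkSize: Int = 512) -> [[Float]] {
    precondition(chunkSize > 0, "chunkSize must be positive")
    var chunks: [[Float]] = []
    var start = 0
    while start < samples.count {
        let end = min(start + chunkSize, samples.count)
        var chunk = Array(samples[start..<end])
        // Pad with silence so every chunk has exactly chunkSize samples.
        if chunk.count < chunkSize {
            chunk.append(contentsOf: [Float](repeating: 0, count: chunkSize - chunk.count))
        }
        chunks.append(chunk)
        start = end
    }
    return chunks
}

// Duration of one chunk in milliseconds: 512 * 1000 / 16000.
let chunkDurationMs = 512.0 * 1000.0 / 16_000.0

// One second of silence as placeholder input.
let oneSecond = [Float](repeating: 0, count: 16_000)
let chunks = makeChunks(oneSecond)
print(chunkDurationMs, chunks.count)
```

One second of audio does not divide evenly into 512-sample chunks (16000 / 512 = 31.25), so the last chunk here carries 128 real samples plus padding.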
## Intended Use
### Primary Use Cases
- Real-time voice activity detection in iOS/macOS
applications
- Speech preprocessing for ASR systems
- Audio segmentation and filtering
## How to Use
### Swift Integration
```swift
import FluidAudio

let config = VADConfig(
    threshold: 0.3,
    chunkSize: 512, // 512 samples (32 ms at 16 kHz) works best
    sampleRate: 16000
)
let vadManager = VADManager(config: config)
try await vadManager.initialize()

// Process one audio chunk (512 samples of 16 kHz mono audio)
let result = try await vadManager.processChunk(audioChunk)
print("Voice probability: \(result.probability)")
print("Is voice active: \(result.isVoiceActive)")
```
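
For audio segmentation, the per-chunk probabilities can be grouped into speech segments. The following is a minimal sketch using only the standard library: `SpeechSegment` and `speechSegments` are hypothetical names, not FluidAudio APIs, and treating a probability at or above the threshold as speech is an assumption about how the threshold is applied.

```swift
import Foundation

/// Hypothetical type: a contiguous run of speech chunks, in seconds.
struct SpeechSegment: Equatable {
    let start: Double
    let end: Double
}

/// Hypothetical helper: groups per-chunk probabilities into segments
/// using a single threshold (no smoothing or hangover).
func speechSegments(
    probabilities: [Float],
    threshold: Float = 0.3,
    chunkSize: Int = 512,
    sampleRate: Double = 16_000
) -> [SpeechSegment] {
    let chunkDuration = Double(chunkSize) / sampleRate
    var segments: [SpeechSegment] = []
    var segmentStart: Int? = nil

    for (index, probability) in probabilities.enumerated() {
        let isSpeech = probability >= threshold
        if isSpeech, segmentStart == nil {
            // Speech begins at this chunk.
            segmentStart = index
        } else if !isSpeech, let start = segmentStart {
            // Speech ended at the previous chunk.
            segments.append(SpeechSegment(start: Double(start) * chunkDuration,
                                          end: Double(index) * chunkDuration))
            segmentStart = nil
        }
    }
    // Close a segment that runs to the end of the input.
    if let start = segmentStart {
        segments.append(SpeechSegment(start: Double(start) * chunkDuration,
                                      end: Double(probabilities.count) * chunkDuration))
    }
    return segments
}

// Placeholder probabilities, not real model output.
let probs: [Float] = [0.1, 0.8, 0.9, 0.2, 0.05, 0.6]
let segments = speechSegments(probabilities: probs)
for segment in segments {
    print(segment.start, segment.end)
}
```

A single threshold can flicker on borderline audio; a production pipeline would likely add a minimum segment length or separate on/off thresholds.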
### Installation
Add FluidAudio to your Swift project:

    dependencies: [
        .package(url: "https://github.com/FluidAudio/FluidAudioSwift.git", from: "1.0.0")
    ]

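
For context, the dependency entry above sits inside a `Package.swift` manifest. The sketch below shows one possible complete manifest. The package and target name `MyVADApp`, the platform minimums, and the product name `FluidAudio` (inferred from the `import FluidAudio` statement) are all assumptions; check the library's own manifest for the actual values.

```swift
// swift-tools-version:5.9
import PackageDescription

let package = Package(
    name: "MyVADApp",                       // hypothetical package name
    platforms: [.macOS(.v13), .iOS(.v16)],  // assumed minimums, not confirmed by the library
    dependencies: [
        .package(url: "https://github.com/FluidAudio/FluidAudioSwift.git", from: "1.0.0")
    ],
    targets: [
        .executableTarget(
            name: "MyVADApp",
            dependencies: [
                // Product name assumed to match the `FluidAudio` module.
                .product(name: "FluidAudio", package: "FluidAudioSwift")
            ]
        )
    ]
)
```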
## Performance
### Benchmarks on Apple Silicon (M1/M2)
| Metric | Value |
|------------------|---------------------|
| Latency | <2ms per 32ms chunk |
| Real-time Factor | 0.02x |
| Memory Usage | ~15MB |
| CPU Usage | <5% (single core) |
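
The latency and real-time factor rows are related by simple arithmetic: the real-time factor is processing time divided by the duration of the audio processed. As a rough check, and assuming the same 32 ms chunk throughout, the sketch below derives the factor implied by the 2 ms latency bound and the per-chunk compute time implied by 0.02x. The function name `realTimeFactor` is illustrative only.

```swift
import Foundation

/// Real-time factor: processing time divided by audio duration.
/// Values below 1.0 mean faster than real time.
func realTimeFactor(processingMs: Double, audioMs: Double) -> Double {
    processingMs / audioMs
}

// 512 samples at 16 kHz is 32 ms of audio per chunk.
let chunkMs = 512.0 * 1000.0 / 16_000.0

// The "<2ms per chunk" bound implies a real-time factor below 2 / 32.
let upperBound = realTimeFactor(processingMs: 2.0, audioMs: chunkMs)

// A factor of 0.02x corresponds to this much compute per chunk, in ms.
let computeMsAtReportedFactor = 0.02 * chunkMs

print(upperBound, computeMsAtReportedFactor)
```

So the 2 ms bound corresponds to a factor of at most 0.0625, and the reported 0.02x corresponds to roughly 0.64 ms of compute per chunk.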
### Accuracy Metrics
Evaluated on common speech datasets:
- Precision: 94.2%
- Recall: 92.8%
- F1-Score: 93.5%
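
The F1 score is the harmonic mean of precision and recall, so the three figures above can be cross-checked. A minimal sketch, with `f1Score` as an illustrative helper name:

```swift
import Foundation

/// F1 score: harmonic mean of precision and recall.
func f1Score(precision: Double, recall: Double) -> Double {
    // Avoid dividing by zero when both inputs are zero.
    guard precision + recall > 0 else { return 0 }
    return 2 * precision * recall / (precision + recall)
}

let f1 = f1Score(precision: 0.942, recall: 0.928)
print(String(format: "%.1f%%", f1 * 100)) // prints "93.5%"
```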
## Model Files
This repository contains three CoreML models that work
together:
- `silero_stft.mlmodel` (650KB) - STFT feature extraction
- `silero_encoder.mlmodel` (254KB) - Feature encoding
- `silero_rnn_decoder.mlmodel` (527KB) - RNN-based classification
## Training Data
The original Silero VAD model was trained on a diverse
dataset including:
- Clean speech audio
- Noisy speech with various background conditions
- Music and non-speech audio for negative samples
## Limitations and Bias
### Known Limitations
- Optimized for 16kHz sample rate (other rates may reduce
accuracy)
- May struggle with very quiet speech (<-30dB SNR)
- Performance varies with microphone quality and
recording conditions
## Technical Details
### Model Architecture

    Audio Input (512 samples, 16kHz)
        ↓
    STFT Model (spectral features)
        ↓
    Encoder Model (feature compression)
        ↓
    RNN Decoder (temporal modeling)
        ↓
    Voice Probability Output

## Citation

    @misc{silero-vad-coreml,
      title={CoreML Silero VAD},
      author={FluidAudio Team},
      year={2024},
      url={https://huggingface.co/alexwengg/coreml-silero-vad}
    }

    @misc{silero-vad,
      title={Silero VAD},
      author={Silero Team},
      year={2021},
      url={https://github.com/snakers4/silero-vad}
    }

## Related Models
Check out other CoreML audio models in this collection:
https://huggingface.co/collections/bweng/coreml-685b12fd251f80552c08e2b9
- https://huggingface.co/alexwengg/coreml_speaker_diarization - Identify "who spoke when"
- https://huggingface.co/collections/bweng/coreml-685b12fd251f80552c08e2b9 - Speech-to-text for Apple platforms
## Repository and Support
- GitHub: https://github.com/FluidAudio/FluidAudioSwift
- Documentation: https://github.com/FluidAudio/FluidAudioSwift/wiki
- Issues: https://github.com/FluidAudio/FluidAudioSwift/issues
- Community: https://github.com/FluidAudio/FluidAudioSwift/discussions
## License
This project is licensed under the MIT License - see the
LICENSE file for details.

The original Silero VAD model is also under the MIT license.
See https://github.com/snakers4/silero-vad/blob/master/LICENSE for details.