---
title: VAD Demo - Real-time Speech Detection
emoji: 🎤
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 4.42.0
app_file: app.py
pinned: false
license: mit
---
# VAD Demo: Real-time Speech Detection Framework
[Hugging Face Space](https://huggingface.co/spaces/gbibbo/vad_demo)
[WASPAA](https://waspaa.com)
> **Real-time multi-model voice activity detection with interactive visualization - optimized for CPU and free Hugging Face Spaces**
This demo showcases a comprehensive **speech removal framework** designed for privacy-preserving audio recordings, featuring **3 state-of-the-art AI models** with **real-time processing** and **interactive visualization**.
## **Live Demo Features**
### **Multi-Model Support**
Choose from three models and compare any two side by side:
| Model | Parameters | Speed | Accuracy | Best For |
|-------|------------|-------|----------|----------|
| **Silero-VAD** | 1.8M | ⚡⚡⚡ | ⭐⭐⭐⭐ | General purpose |
| **WebRTC-VAD** | <0.1M | ⚡⚡⚡⚡ | ⭐⭐⭐ | Ultra-fast processing |
| **E-PANNs** | 22M | ⚡⚡ | ⭐⭐⭐⭐ | Efficient AI (73% parameter reduction) |
### **Real-time Visualization**
- **Dual Analysis**: Compare two models simultaneously
- **Waveform Display**: Live audio visualization
- **Probability Charts**: Real-time speech detection confidence
- **Performance Metrics**: Processing time comparison across models
### **Privacy-Preserving Applications**
- **Smart Home Audio**: Remove personal conversations while preserving environmental sounds
- **GDPR Compliance**: Privacy-aware audio dataset processing
- **Real-time Processing**: Continuous analysis of 4-second chunks at 16 kHz
- **CPU Optimized**: Runs efficiently on standard hardware
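
The 4-second chunking at 16 kHz mentioned above can be sketched as follows. This is a minimal illustration, not the code in `app.py`: the function name `iter_chunks` is hypothetical, plain Python lists stand in for the audio buffer to keep the example dependency-free, and zero-padding the final chunk is an assumption.

```python
from typing import Iterator, List

SAMPLE_RATE = 16_000   # Hz, as stated in the feature list
CHUNK_SECONDS = 4      # seconds per analysis window, as stated in the feature list


def iter_chunks(samples: List[float],
                sample_rate: int = SAMPLE_RATE,
                chunk_seconds: float = CHUNK_SECONDS) -> Iterator[List[float]]:
    """Yield consecutive fixed-length chunks, zero-padding the final one."""
    chunk_len = int(sample_rate * chunk_seconds)
    for start in range(0, len(samples), chunk_len):
        chunk = samples[start:start + chunk_len]
        if len(chunk) < chunk_len:
            # Assumption: pad the tail with silence so every chunk has equal length.
            chunk = chunk + [0.0] * (chunk_len - len(chunk))
        yield chunk


# Ten seconds of placeholder audio split into 4-second windows.
ten_seconds = [0.0] * (10 * SAMPLE_RATE)
chunks = list(iter_chunks(ten_seconds))
```

Each chunk holds 64,000 samples (4 s × 16,000 samples/s), so a fixed-size model input can be assumed downstream.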
## **Quick Start**
### Option 1: Use Live Demo (Recommended)
Open the Hugging Face Space linked at the top of this page to try the demo instantly!
### Option 2: Run Locally
```bash
git clone https://huggingface.co/spaces/gbibbo/vad_demo
cd vad_demo
pip install -r requirements.txt
python app.py
```
## **How to Use**
1. **Record Audio**: Click the microphone and record 2-4 seconds of speech
2. **Select Models**: Choose a different model for Model A and Model B
3. **Adjust Threshold**: Set a value from 0.0 to 1.0; lower values make detection more sensitive
4. **Process**: Click "Process Audio" to run the analysis
5. **View Results**: Inspect the probability charts and detailed analysis
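
To make the effect of the threshold concrete, here is a small sketch of how per-frame speech probabilities could be turned into speech segments. The function `speech_segments`, the `>=` comparison, and the 0.5-second frame length are assumptions for illustration; the demo may post-process its probabilities differently.

```python
from typing import List, Tuple


def speech_segments(probs: List[float], threshold: float,
                    frame_seconds: float) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds where probability >= threshold."""
    segments = []
    start = None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i                      # speech begins at this frame
        elif p < threshold and start is not None:
            segments.append((start * frame_seconds, i * frame_seconds))
            start = None                   # speech ended before this frame
    if start is not None:                  # speech runs to the end of the clip
        segments.append((start * frame_seconds, len(probs) * frame_seconds))
    return segments


probs = [0.1, 0.4, 0.8, 0.9, 0.3, 0.6]
strict = speech_segments(probs, threshold=0.7, frame_seconds=0.5)
lenient = speech_segments(probs, threshold=0.35, frame_seconds=0.5)
```

With these made-up probabilities, the lower threshold marks more of the clip as speech, which is the "more sensitive" behaviour described in step 3.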
## **Technical Architecture**
### **CPU Optimization Strategies**
- **Lazy Loading**: Models load only when needed
- **Efficient Processing**: Optimized audio chunk processing
- **Memory Management**: Smart buffer management for continuous operation
- **Fallback Systems**: Graceful degradation when a model is unavailable
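
Lazy loading and fallback can be combined in one small registry. The sketch below is hypothetical: `LazyModelRegistry`, `energy_fallback`, and `broken_loader` are illustrative names, and the real app may organise its model wrappers differently.

```python
from typing import Callable, Dict, List

Detector = Callable[[List[float]], float]  # audio chunk -> speech probability


class LazyModelRegistry:
    """Build each detector on first use and cache it (hypothetical helper)."""

    def __init__(self, factories: Dict[str, Callable[[], Detector]],
                 fallback: Detector) -> None:
        self._factories = factories
        self._fallback = fallback
        self._cache: Dict[str, Detector] = {}

    def get(self, name: str) -> Detector:
        if name not in self._cache:
            try:
                # The factory runs only on the first request for this model.
                self._cache[name] = self._factories[name]()
            except Exception:
                # Graceful degradation: use the fallback if loading fails.
                self._cache[name] = self._fallback
        return self._cache[name]


def energy_fallback(chunk: List[float]) -> float:
    """Crude stand-in detector: mean absolute amplitude, clipped to 1.0."""
    if not chunk:
        return 0.0
    return min(1.0, sum(abs(x) for x in chunk) / len(chunk))


def broken_loader() -> Detector:
    raise ImportError("model package not installed")  # simulated failure


registry = LazyModelRegistry({"demo-model": broken_loader}, energy_fallback)
detector = registry.get("demo-model")
```

Because nothing is constructed until `get` is called, start-up cost and memory stay low when only one or two models are selected.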
### **Audio Processing Pipeline**
```
Audio Input (Microphone)
↓
Preprocessing (Normalization, Resampling)
↓
Feature Extraction (Spectrograms, MFCCs)
↓
Multi-Model Inference (Parallel Processing)
↓
Visualization (Interactive Plotly Dashboard)
```
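
The preprocessing stage of this pipeline can be illustrated with a dependency-free sketch. Both function names are hypothetical, and linear interpolation is used here only for simplicity: it applies no anti-aliasing filter, so a dedicated audio library would normally handle resampling in practice.

```python
from typing import List


def peak_normalize(samples: List[float]) -> List[float]:
    """Scale so the largest absolute sample is 1.0 (silence is left as-is)."""
    peak = max((abs(x) for x in samples), default=0.0)
    if peak == 0.0:
        return list(samples)
    return [x / peak for x in samples]


def resample_linear(samples: List[float], src_rate: int, dst_rate: int) -> List[float]:
    """Resample by linear interpolation (no anti-aliasing filter)."""
    if src_rate == dst_rate or not samples:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate       # fractional index into the input
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out


# One second of constant placeholder audio at 48 kHz, converted to 16 kHz.
one_second_48k = [0.25] * 48_000
prepared = peak_normalize(resample_linear(one_second_48k, 48_000, 16_000))
```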
### **Model Implementation Details**
#### **Silero-VAD** (Production Ready)
- **Source**: `torch.hub` official Silero model
- **Optimization**: Direct PyTorch inference
- **Memory**: ~50MB RAM usage
- **Latency**: ~30ms processing time
#### **WebRTC-VAD** (Ultra-Fast)
- **Source**: Google WebRTC project
- **Fallback**: Energy-based VAD when WebRTC unavailable
- **Latency**: <5ms processing time
- **Memory**: ~10MB RAM usage
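
An energy-based fallback of the kind mentioned above could look like the following sketch. The 30 ms frame length and the -35 dB threshold are illustrative assumptions, not values taken from the demo. Note that an energy detector flags any sufficiently loud sound, not only speech.

```python
import math
from typing import List


def frame_rms_db(frame: List[float]) -> float:
    """RMS level of one frame in dB relative to full scale (1.0)."""
    if not frame:
        return -math.inf
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    return 20.0 * math.log10(rms) if rms > 0.0 else -math.inf


def energy_vad(samples: List[float], sample_rate: int = 16_000,
               frame_ms: int = 30, threshold_db: float = -35.0) -> List[bool]:
    """Flag each full frame as active when its RMS level exceeds threshold_db."""
    frame_len = sample_rate * frame_ms // 1000
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        flags.append(frame_rms_db(samples[start:start + frame_len]) > threshold_db)
    return flags


# Synthetic input: 30 ms of silence followed by 30 ms of a 440 Hz tone.
silence = [0.0] * 480
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16_000) for n in range(480)]
flags = energy_vad(silence + tone)
```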
#### **E-PANNs** (Efficient Deep Learning)
- **Features**: Mel-spectrogram + MFCC analysis
- **Optimization**: Simplified neural architecture
- **Speed**: 2-3x faster than full PANNs
- **Memory**: ~150MB RAM usage
## **Performance Benchmarks**
Evaluated on **CHiME-Home dataset** (adapted for CPU):
| Model | F1-Score | RTF (CPU) | Memory | Use Case |
|-------|----------|-----------|--------|-----------|
| Silero-VAD | 0.806 | 0.065 | 50MB | Lightweight |
| WebRTC-VAD | 0.708 | 0.003 | 10MB | Ultra-fast |
| E-PANNs | 0.847 | 0.180 | 150MB | Balanced |
*RTF: Real-Time Factor (lower is better, <1.0 = real-time capable)*
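
The real-time factor is the processing time divided by the duration of the audio processed. The sketch below shows both directions: reading the table (what an RTF implies for one 4-second chunk) and measuring RTF for an arbitrary callable. The helper names are hypothetical, and the measured workload is a stand-in, not one of the models.

```python
import time
from typing import Callable, List


def real_time_factor(process: Callable[[List[float]], object],
                     samples: List[float], sample_rate: int) -> float:
    """RTF = wall-clock processing time / audio duration."""
    audio_seconds = len(samples) / sample_rate
    start = time.perf_counter()
    process(samples)
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds


def expected_processing_seconds(rtf: float, audio_seconds: float) -> float:
    """Invert the definition: processing time implied by a given RTF."""
    return rtf * audio_seconds


# Reading the table: time implied for one 4-second chunk.
silero_seconds = expected_processing_seconds(0.065, 4.0)
epanns_seconds = expected_processing_seconds(0.180, 4.0)

# Measuring a stand-in workload (summing samples) on 4 s of 16 kHz audio.
measured = real_time_factor(sum, [0.0] * 64_000, 16_000)
```

By this arithmetic, an RTF of 0.065 corresponds to roughly 0.26 s per 4-second chunk, and an RTF of 0.180 to roughly 0.72 s.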
## **Research Applications**
### **Privacy-Preserving Audio Processing**
- **Domestic Recordings**: Remove personal conversations
- **Smart Speakers**: Privacy-aware voice assistants
- **Audio Datasets**: GDPR-compliant data collection
- **Surveillance Systems**: Selective audio monitoring
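
As a rough illustration of speech removal, detected speech segments can simply be muted while the rest of the recording is left untouched. This is a simplified sketch under that assumption; the framework described in the paper may remove speech with a different strategy, and `remove_speech` is a hypothetical name.

```python
from typing import List, Tuple


def remove_speech(samples: List[float], segments: List[Tuple[float, float]],
                  sample_rate: int = 16_000) -> List[float]:
    """Return a copy with every (start, end) segment in seconds set to 0.0."""
    cleaned = list(samples)
    for start_s, end_s in segments:
        # Clamp segment boundaries to the valid sample range.
        start = max(0, int(round(start_s * sample_rate)))
        end = min(len(cleaned), int(round(end_s * sample_rate)))
        for i in range(start, end):
            cleaned[i] = 0.0
    return cleaned


audio = [0.2] * 32_000                      # 2 s of placeholder audio
cleaned = remove_speech(audio, [(0.5, 1.0)])
```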
### **Speech Technology Research**
- **Model Comparison**: Benchmark different VAD approaches
- **Real-time Systems**: Low-latency speech detection
- **Edge Computing**: CPU-efficient processing
- **Hybrid Systems**: Combine multiple detection methods
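
One simple way to combine detectors in a hybrid system is to average their frame-wise probabilities. The sketch below assumes all models report probabilities on the same frame grid, which the demo does not guarantee; `fuse_mean` and the example values are hypothetical.

```python
from typing import Dict, List


def fuse_mean(per_model: Dict[str, List[float]]) -> List[float]:
    """Average frame-wise speech probabilities across models."""
    tracks = list(per_model.values())
    if not tracks:
        return []
    n_frames = min(len(t) for t in tracks)   # align to the shortest track
    return [sum(t[i] for t in tracks) / len(tracks) for i in range(n_frames)]


fused = fuse_mean({
    "model_a": [0.2, 0.8, 0.9],
    "model_b": [0.4, 1.0],
})
```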
## **Technical Specifications**
### **System Requirements**
- **CPU**: 2+ cores (4+ recommended)
- **RAM**: 1GB minimum (2GB recommended)
- **Python**: 3.8+ (3.10+ recommended)
- **Browser**: Chrome/Firefox with microphone support
### **Hugging Face Spaces Optimization**
- **Memory Limit**: Designed for 16GB Spaces limit
- **CPU Cores**: Optimized for 8-core allocation
- **Storage**: <500MB model storage requirement
- **Networking**: Minimal external dependencies
### **Audio Specifications**
- **Input Format**: 16-bit PCM, mono/stereo
- **Sample Rates**: 8kHz, 16kHz, 32kHz, 48kHz (auto-conversion)
- **Chunk Size**: 4-second processing windows
- **Latency**: <200ms processing delay
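
Decoding the 16-bit PCM input into mono floating-point samples can be sketched with the standard library alone. Little-endian byte order (the WAV convention) and channel averaging for stereo are assumptions here, and `pcm16_to_mono_float` is a hypothetical helper.

```python
import struct
from typing import List


def pcm16_to_mono_float(raw: bytes, channels: int = 1) -> List[float]:
    """Decode little-endian 16-bit PCM and average channels into mono floats."""
    n_values = len(raw) // 2
    ints = struct.unpack("<%dh" % n_values, raw[:n_values * 2])
    n_frames = n_values // channels
    mono = []
    for f in range(n_frames):
        frame = ints[f * channels:(f + 1) * channels]
        # Dividing by 32768 maps the int16 range onto [-1.0, 1.0).
        mono.append(sum(frame) / (channels * 32768.0))
    return mono


# Two stereo frames: opposite-phase samples, then near full scale on both channels.
stereo_bytes = struct.pack("<4h", 16384, -16384, 32767, 32767)
mono = pcm16_to_mono_float(stereo_bytes, channels=2)
```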
## **Research Citation**
If you use this demo in your research, please cite:
```bibtex
@inproceedings{bibbo2025speech,
title={Speech Removal Framework for Privacy-Preserving Audio Recordings},
author={[Authors omitted for review]},
booktitle={2025 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)},
year={2025},
organization={IEEE}
}
```
## **Contributing**
We welcome contributions! Areas for improvement:
- **New Models**: Add state-of-the-art VAD models
- **Optimization**: Further CPU/memory optimizations
- **Features**: Additional visualization and analysis tools
- **Documentation**: Improve tutorials and examples
## **Support**
- **Issues**: [GitHub Issues](https://github.com/gbibbo/vad_demo/issues)
- **Discussions**: [Hugging Face Discussions](https://huggingface.co/spaces/gbibbo/vad_demo/discussions)
- **WASPAA 2025**: Visit our paper presentation
## **License**
This project is licensed under the **MIT License**.
## **Acknowledgments**
- **Silero-VAD**: Silero Team
- **WebRTC-VAD**: Google WebRTC Project
- **E-PANNs**: Efficient PANNs Implementation
- **Hugging Face**: Free Spaces hosting
- **Funding**: AI4S, University of Surrey, EPSRC, CVSSP
---
**Ready for WASPAA 2025 Demo** | **CPU Optimized** | **Free to Use** | **Hugging Face Spaces**