---
title: VAD Demo - Real-time Speech Detection
emoji: 🎤
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 4.42.0
app_file: app.py
pinned: false
license: mit
---

# 🎤 VAD Demo: Real-time Speech Detection Framework

[Hugging Face Space](https://huggingface.co/spaces/gbibbo/vad_demo)
[WASPAA 2025](https://waspaa.com)

> **Real-time multi-model voice activity detection with interactive visualization, optimized for CPU and free Hugging Face Spaces**

This demo showcases a **speech removal framework** for privacy-preserving audio recordings, featuring **three state-of-the-art AI models** with **real-time processing** and **interactive visualization**.

## 🎯 **Live Demo Features**

### 🤖 **Multi-Model Support**

Compare three different AI models side-by-side:

| Model | Parameters | Speed | Accuracy | Best For |
|-------|------------|-------|----------|----------|
| **Silero-VAD** | 1.8M | ⚡⚡⚡ | ⭐⭐⭐⭐ | General purpose |
| **WebRTC-VAD** | <0.1M | ⚡⚡⚡⚡ | ⭐⭐⭐ | Ultra-fast processing |
| **E-PANNs** | 22M | ⚡⚡ | ⭐⭐⭐⭐ | Efficient AI (73% parameter reduction) |

### 📊 **Real-time Visualization**

- **Dual Analysis**: Compare two models simultaneously
- **Waveform Display**: Live audio visualization
- **Probability Charts**: Real-time speech detection confidence
- **Performance Metrics**: Processing-time comparison across models

### 🔒 **Privacy-Preserving Applications**

- **Smart Home Audio**: Remove personal conversations while preserving environmental sounds
- **GDPR Compliance**: Privacy-aware audio dataset processing
- **Real-time Processing**: Continuous 4-second chunk analysis at 16 kHz
- **CPU Optimized**: Runs efficiently on standard hardware

## 🚀 **Quick Start**

### Option 1: Use the Live Demo (Recommended)

Click the Hugging Face Spaces badge above to try the demo instantly.

### Option 2: Run Locally

```bash
git clone https://huggingface.co/spaces/gbibbo/vad_demo
cd vad_demo
pip install -r requirements.txt
python app.py
```

## 🎛️ **How to Use**

1. **🎤 Record Audio**: Click the microphone and record 2-4 seconds of speech
2. **🔧 Select Models**: Choose different models for Model A and Model B comparison
3. **⚙️ Adjust Threshold**: A lower threshold means more sensitive detection (range 0.0-1.0)
4. **🎯 Process**: Click "Process Audio" to analyze
5. **📊 View Results**: Observe the probability charts and detailed analysis
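
The thresholding step above boils down to comparing each per-frame speech probability against the slider value. A minimal sketch (the function name and shapes are illustrative, not the app's actual API):

```python
def apply_threshold(probabilities, threshold=0.5):
    """Map per-frame speech probabilities to binary speech/non-speech flags.

    A lower threshold flags more frames as speech (more sensitive);
    a higher threshold flags fewer frames (more conservative).
    """
    return [p >= threshold for p in probabilities]

probs = [0.1, 0.4, 0.7, 0.9, 0.3]
print(apply_threshold(probs, threshold=0.5))  # [False, False, True, True, False]
print(apply_threshold(probs, threshold=0.3))  # [False, True, True, True, True]
```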

## 🏗️ **Technical Architecture**

### **CPU Optimization Strategies**

- **Lazy Loading**: Models load only when needed
- **Efficient Processing**: Optimized audio chunk processing
- **Memory Management**: Smart buffer management for continuous operation
- **Fallback Systems**: Graceful degradation when models are unavailable

### **Audio Processing Pipeline**

```
Audio Input (Microphone)
        ↓
Preprocessing (Normalization, Resampling)
        ↓
Feature Extraction (Spectrograms, MFCCs)
        ↓
Multi-Model Inference (Parallel Processing)
        ↓
Visualization (Interactive Plotly Dashboard)
```
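
The 4-second chunking stage feeding this pipeline can be sketched in plain Python. This is a sketch under the demo's stated parameters (16 kHz, 4-second windows); names are illustrative, and a real streaming implementation would buffer the final partial window rather than drop it:

```python
SAMPLE_RATE = 16_000   # Hz, as used by the demo
CHUNK_SECONDS = 4      # processing window length

def chunk_audio(samples, sample_rate=SAMPLE_RATE, chunk_seconds=CHUNK_SECONDS):
    """Split a 1-D sequence of samples into fixed-length analysis windows,
    discarding the trailing partial window."""
    chunk_len = sample_rate * chunk_seconds
    return [samples[i:i + chunk_len]
            for i in range(0, len(samples) - chunk_len + 1, chunk_len)]

audio = [0.0] * (10 * SAMPLE_RATE)   # 10 s of silence
chunks = chunk_audio(audio)
print(len(chunks), len(chunks[0]))   # 2 64000
```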

### **Model Implementation Details**

#### **Silero-VAD** (Production Ready)

- **Source**: official Silero model via `torch.hub`
- **Optimization**: Direct PyTorch inference
- **Memory**: ~50MB RAM usage
- **Latency**: ~30ms processing time

#### **WebRTC-VAD** (Ultra-Fast)

- **Source**: Google WebRTC project
- **Fallback**: Energy-based VAD when WebRTC is unavailable
- **Latency**: <5ms processing time
- **Memory**: ~10MB RAM usage
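
An energy-based fallback like the one mentioned above can be as simple as thresholding per-frame mean squared amplitude. This is a minimal sketch, not the app's actual fallback; the frame length and threshold are illustrative:

```python
def energy_vad(samples, frame_len=480, threshold=0.01):
    """Naive energy-based VAD: a frame is speech when its mean squared
    amplitude exceeds a fixed threshold.

    frame_len=480 corresponds to 30 ms at 16 kHz, a common VAD frame size.
    """
    decisions = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        decisions.append(energy > threshold)
    return decisions

quiet = [0.001] * 480   # near-silence frame
loud = [0.5] * 480      # high-energy frame
print(energy_vad(quiet + loud))  # [False, True]
```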

#### **E-PANNs** (Efficient Deep Learning)

- **Features**: Mel-spectrogram + MFCC analysis
- **Optimization**: Simplified neural architecture
- **Speed**: 2-3x faster than full PANNs
- **Memory**: ~150MB RAM usage

## 📈 **Performance Benchmarks**

Evaluated on the **CHiME-Home dataset** (adapted for CPU):

| Model | F1-Score | RTF (CPU) | Memory | Use Case |
|-------|----------|-----------|--------|----------|
| Silero-VAD | 0.806 | 0.065 | 50MB | Lightweight |
| WebRTC-VAD | 0.708 | 0.003 | 10MB | Ultra-fast |
| E-PANNs | 0.847 | 0.180 | 150MB | Balanced |

*RTF: Real-Time Factor (lower is better; <1.0 = real-time capable)*
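
RTF is simply processing time divided by the duration of the audio processed. For example (the 0.26 s figure below is illustrative, not a measured value):

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF < 1.0 means the system processes audio faster than real time."""
    return processing_seconds / audio_seconds

# e.g. taking 0.26 s to process one 4-second chunk:
rtf = real_time_factor(0.26, 4.0)
print(round(rtf, 3))  # 0.065
```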

## 🔬 **Research Applications**

### **Privacy-Preserving Audio Processing**

- **Domestic Recordings**: Remove personal conversations
- **Smart Speakers**: Privacy-aware voice assistants
- **Audio Datasets**: GDPR-compliant data collection
- **Surveillance Systems**: Selective audio monitoring

### **Speech Technology Research**

- **Model Comparison**: Benchmark different VAD approaches
- **Real-time Systems**: Low-latency speech detection
- **Edge Computing**: CPU-efficient processing
- **Hybrid Systems**: Combine multiple detection methods

## 📊 **Technical Specifications**

### **System Requirements**

- **CPU**: 2+ cores (4+ recommended)
- **RAM**: 1GB minimum (2GB recommended)
- **Python**: 3.8+ (3.10+ recommended)
- **Browser**: Chrome/Firefox with microphone support

### **Hugging Face Spaces Optimization**

- **Memory Limit**: Designed for the 16GB Spaces limit
- **CPU Cores**: Optimized for an 8-core allocation
- **Storage**: <500MB model storage requirement
- **Networking**: Minimal external dependencies

### **Audio Specifications**

- **Input Format**: 16-bit PCM, mono/stereo
- **Sample Rates**: 8 kHz, 16 kHz, 32 kHz, 48 kHz (auto-conversion)
- **Chunk Size**: 4-second processing windows
- **Latency**: <200ms processing delay
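
To illustrate the sample-rate auto-conversion, here is a naive linear-interpolation resampler. This is a sketch only; the demo's actual conversion is not specified here, and production pipelines should use a proper anti-aliased resampler (e.g. from librosa or torchaudio):

```python
def resample_linear(samples, src_rate, dst_rate):
    """Resample a 1-D signal by linear interpolation between neighbors.

    No anti-aliasing filter is applied, so downsampling with this
    function would alias; it is for illustration only.
    """
    if src_rate == dst_rate:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for j in range(n_out):
        pos = j * src_rate / dst_rate          # fractional source index
        i = int(pos)
        frac = pos - i
        right = samples[min(i + 1, len(samples) - 1)]
        out.append(samples[i] * (1 - frac) + right * frac)
    return out

ramp = [0.0, 1.0, 2.0, 3.0]                    # 4 samples at 8 kHz
up = resample_linear(ramp, 8_000, 16_000)      # upsample to 16 kHz
print(len(up))  # 8
```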

## 📚 **Research Citation**

If you use this demo in your research, please cite:

```bibtex
@inproceedings{bibbo2025speech,
  title={Speech Removal Framework for Privacy-Preserving Audio Recordings},
  author={[Authors omitted for review]},
  booktitle={2025 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)},
  year={2025},
  organization={IEEE}
}
```

## 🤝 **Contributing**

We welcome contributions! Areas for improvement:

- **New Models**: Add state-of-the-art VAD models
- **Optimization**: Further CPU/memory optimizations
- **Features**: Additional visualization and analysis tools
- **Documentation**: Improved tutorials and examples

## 📞 **Support**

- **Issues**: [GitHub Issues](https://github.com/gbibbo/vad_demo/issues)
- **Discussions**: [Hugging Face Discussions](https://huggingface.co/spaces/gbibbo/vad_demo/discussions)
- **WASPAA 2025**: Visit our paper presentation

## 📄 **License**

This project is licensed under the **MIT License**.

## 🙏 **Acknowledgments**

- **Silero-VAD**: Silero Team
- **WebRTC-VAD**: Google WebRTC Project
- **E-PANNs**: Efficient PANNs implementation
- **Hugging Face**: Free Spaces hosting
- **Funding**: AI4S, University of Surrey, EPSRC, CVSSP

---

**🎯 Ready for WASPAA 2025 Demo** | **⚡ CPU Optimized** | **🆓 Free to Use** | **🤗 Hugging Face Spaces**