Gabriel Bibbó
---
title: VAD Demo - Real-time Speech Detection
emoji: 🎤
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 4.42.0
app_file: app.py
pinned: false
license: mit
---
# 🎤 VAD Demo: Real-time Speech Detection Framework
[![Hugging Face Spaces](https://img.shields.io/badge/🤗%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/gbibbo/vad_demo)
[![WASPAA 2025](https://img.shields.io/badge/WASPAA-2025-green)](https://waspaa.com)
> **Real-time multi-model voice activity detection with interactive visualization, optimized for CPU and free Hugging Face Spaces**
This demo showcases a comprehensive **speech removal framework** designed for privacy-preserving audio recordings, featuring **3 state-of-the-art AI models** with **real-time processing** and **interactive visualization**.
## 🎯 **Live Demo Features**
### 🤖 **Multi-Model Support**
Choose from 3 models and compare any two of them side by side:
| Model | Parameters | Speed | Accuracy | Best For |
|-------|------------|-------|----------|----------|
| **Silero-VAD** | 1.8M | ⚡⚡⚡ | ⭐⭐⭐⭐ | General purpose |
| **WebRTC-VAD** | <0.1M | ⚡⚡⚡⚡ | ⭐⭐⭐ | Ultra-fast processing |
| **E-PANNs** | 22M | ⚡⚡ | ⭐⭐⭐⭐ | Efficient AI (73% parameter reduction) |
### 📊 **Real-time Visualization**
- **Dual Analysis**: Compare two models simultaneously
- **Waveform Display**: Live audio visualization
- **Probability Charts**: Real-time speech detection confidence
- **Performance Metrics**: Processing time comparison across models
### 🔒 **Privacy-Preserving Applications**
- **Smart Home Audio**: Remove personal conversations while preserving environmental sounds
- **GDPR Compliance**: Privacy-aware audio dataset processing
- **Real-time Processing**: Continuous 4-second chunk analysis at 16kHz
- **CPU Optimized**: Runs efficiently on standard hardware
## 🚀 **Quick Start**
### Option 1: Use Live Demo (Recommended)
Click the Hugging Face Spaces badge above to try the demo instantly!
### Option 2: Run Locally
```bash
git clone https://huggingface.co/spaces/gbibbo/vad_demo
cd vad_demo
pip install -r requirements.txt
python app.py
```
## 🎛️ **How to Use**
1. **🎤 Record Audio**: Click the microphone button and record 2-4 seconds of speech
2. **🔧 Select Models**: Choose different models for Model A and Model B comparison
3. **⚙️ Adjust Threshold**: Set a value between 0.0 and 1.0; lower values make detection more sensitive
4. **🎯 Process**: Click "Process Audio" to analyze
5. **📊 View Results**: Observe probability charts and detailed analysis
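The effect of the threshold in step 3 can be illustrated with a minimal sketch. It assumes the models emit one speech probability per frame and that a frame counts as speech when its probability is at or above the threshold; the function name `speech_segments`, the 0.5-second frame length, and the probability values are hypothetical and not taken from `app.py`.

```python
from typing import List, Tuple


def speech_segments(
    probs: List[float], threshold: float, frame_s: float
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds where probs >= threshold."""
    segments = []
    start = None  # index of the first frame of the current speech run
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i
        elif p < threshold and start is not None:
            segments.append((start * frame_s, i * frame_s))
            start = None
    if start is not None:  # speech runs to the end of the clip
        segments.append((start * frame_s, len(probs) * frame_s))
    return segments


# Hypothetical per-frame probabilities, one value per 0.5 s frame.
probs = [0.1, 0.4, 0.8, 0.9, 0.3, 0.6, 0.2, 0.1]
print(speech_segments(probs, threshold=0.5, frame_s=0.5))  # two short segments
print(speech_segments(probs, threshold=0.3, frame_s=0.5))  # one longer segment
```

With the lower threshold the two segments merge into one longer segment, which is what "more sensitive" means in practice.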
## 🏗️ **Technical Architecture**
### **CPU Optimization Strategies**
- **Lazy Loading**: Models load only when needed
- **Efficient Processing**: Optimized audio chunk processing
- **Memory Management**: Smart buffer management for continuous operation
- **Fallback Systems**: Graceful degradation when models are unavailable
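Lazy loading and fallback can be combined in a small registry. The sketch below is one possible shape, not the demo's actual code: `LazyModels`, `_load_heavy_model`, and `_energy_fallback` are hypothetical names, and the "heavy" loader deliberately raises to simulate a missing dependency.

```python
from typing import Callable, Dict, List

Detector = Callable[[List[float]], float]  # audio samples -> speech probability


def _load_heavy_model() -> Detector:
    # Stand-in for an expensive load; raising simulates a missing backend.
    raise ImportError("model backend not installed")


def _energy_fallback() -> Detector:
    # Trivial stand-in detector: mean absolute amplitude, clipped to 1.0.
    return lambda samples: min(
        1.0, sum(abs(s) for s in samples) / max(1, len(samples))
    )


class LazyModels:
    """Load each model on first request and cache it for later calls."""

    def __init__(
        self,
        loaders: Dict[str, Callable[[], Detector]],
        fallback: Callable[[], Detector],
    ) -> None:
        self._loaders = loaders
        self._fallback = fallback
        self._cache: Dict[str, Detector] = {}

    def get(self, name: str) -> Detector:
        if name not in self._cache:  # nothing is loaded until first use
            try:
                self._cache[name] = self._loaders[name]()
            except (ImportError, OSError, KeyError):
                # Degrade to the fallback detector instead of crashing.
                self._cache[name] = self._fallback()
        return self._cache[name]


models = LazyModels({"heavy": _load_heavy_model}, fallback=_energy_fallback)
detector = models.get("heavy")  # load fails, fallback is cached
print(detector([0.2, -0.4, 0.6]))
```

Caching the fallback under the requested name means a failed load is not retried on every call, which keeps per-request cost predictable on a CPU-only host.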
### **Audio Processing Pipeline**
```
Audio Input (Microphone)
    ↓
Preprocessing (Normalization, Resampling)
    ↓
Feature Extraction (Spectrograms, MFCCs)
    ↓
Multi-Model Inference (Parallel Processing)
    ↓
Visualization (Interactive Plotly Dashboard)
```
### **Model Implementation Details**
#### **Silero-VAD** (Production Ready)
- **Source**: `torch.hub` official Silero model
- **Optimization**: Direct PyTorch inference
- **Memory**: ~50MB RAM usage
- **Latency**: ~30ms processing time
#### **WebRTC-VAD** (Ultra-Fast)
- **Source**: Google WebRTC project
- **Fallback**: Energy-based VAD when WebRTC is unavailable
- **Latency**: <5ms processing time
- **Memory**: ~10MB RAM usage
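An energy-based fallback of the kind mentioned above can be sketched in a few lines. This is an illustrative version under stated assumptions, not the demo's implementation: samples are floats in [-1, 1], frames are 30 ms long, and the -40 dBFS threshold is an arbitrary example value.

```python
import math
from typing import List


def energy_vad(
    samples: List[float],
    sample_rate: int = 16000,
    frame_ms: int = 30,
    threshold_db: float = -40.0,
) -> List[bool]:
    """Flag a frame as speech when its RMS level exceeds threshold_db (dBFS)."""
    frame_len = sample_rate * frame_ms // 1000
    flags = []
    # Only complete frames are scored; a trailing partial frame is dropped.
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / frame_len)
        level_db = 20.0 * math.log10(max(rms, 1e-10))  # floor avoids log10(0)
        flags.append(level_db > threshold_db)
    return flags


# Synthetic check: 30 ms of silence followed by 30 ms of a 440 Hz tone.
sr = 16000
silence = [0.0] * 480
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sr) for n in range(480)]
print(energy_vad(silence + tone))  # [False, True]
```

An energy detector responds to any loud sound, not only speech, so it is a coarse substitute for a trained model.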
#### **E-PANNs** (Efficient Deep Learning)
- **Features**: Mel-spectrogram + MFCC analysis
- **Optimization**: Simplified neural architecture
- **Speed**: 2-3x faster than full PANNs
- **Memory**: ~150MB RAM usage
## 📈 **Performance Benchmarks**
Evaluated on the **CHiME-Home dataset** (adapted for CPU):
| Model | F1-Score | RTF (CPU) | Memory | Use Case |
|-------|----------|-----------|--------|-----------|
| Silero-VAD | 0.806 | 0.065 | 50MB | Lightweight |
| WebRTC-VAD | 0.708 | 0.003 | 10MB | Ultra-fast |
| E-PANNs | 0.847 | 0.180 | 150MB | Balanced |
*RTF: Real-Time Factor (lower is better, <1.0 = real-time capable)*
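RTF is processing time divided by the duration of the audio processed. The sketch below measures it for an arbitrary callable; `real_time_factor` is a hypothetical helper, and the sum-of-squares workload is only a stand-in for a real model, so the printed value says nothing about the figures in the table.

```python
import time
from typing import Callable, Sequence


def real_time_factor(
    process: Callable[[Sequence[float]], object],
    samples: Sequence[float],
    sample_rate: int,
) -> float:
    """Return processing time divided by audio duration (below 1.0 is real-time)."""
    audio_s = len(samples) / sample_rate
    t0 = time.perf_counter()
    process(samples)
    elapsed_s = time.perf_counter() - t0
    return elapsed_s / audio_s


# Stand-in workload: sum of squares over a 4-second chunk at 16 kHz.
chunk = [0.0] * (4 * 16000)
rtf = real_time_factor(lambda x: sum(s * s for s in x), chunk, 16000)
print(f"RTF = {rtf:.4f}")
```

Read the other way round, an RTF of 0.065 on a 4-second chunk corresponds to 0.065 × 4 = 0.26 seconds of compute.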
## 🔬 **Research Applications**
### **Privacy-Preserving Audio Processing**
- **Domestic Recordings**: Remove personal conversations
- **Smart Speakers**: Privacy-aware voice assistants
- **Audio Datasets**: GDPR-compliant data collection
- **Surveillance Systems**: Selective audio monitoring
### **Speech Technology Research**
- **Model Comparison**: Benchmark different VAD approaches
- **Real-time Systems**: Low-latency speech detection
- **Edge Computing**: CPU-efficient processing
- **Hybrid Systems**: Combine multiple detection methods
## 📊 **Technical Specifications**
### **System Requirements**
- **CPU**: 2+ cores (4+ recommended)
- **RAM**: 1GB minimum (2GB recommended)
- **Python**: 3.8+ (3.10+ recommended)
- **Browser**: Chrome/Firefox with microphone support
### **Hugging Face Spaces Optimization**
- **Memory Limit**: Designed to fit within the 16GB Spaces memory limit
- **CPU Cores**: Optimized for the 2-vCPU allocation of free CPU Spaces
- **Storage**: <500MB model storage requirement
- **Networking**: Minimal external dependencies
### **Audio Specifications**
- **Input Format**: 16-bit PCM, mono/stereo
- **Sample Rates**: 8kHz, 16kHz, 32kHz, 48kHz (auto-conversion)
- **Chunk Size**: 4-second processing windows
- **Latency**: <200ms processing delay
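The input format and chunk size above suggest a preparation step along the following lines. This is a sketch under stated assumptions: the PCM data is little-endian and interleaved, stereo is reduced to mono by averaging, and the last chunk is zero-padded. Resampling is left out, and `pcm16_to_mono` and `split_chunks` are hypothetical names rather than functions from `app.py`.

```python
import struct
from typing import List

TARGET_RATE = 16000   # Hz, the analysis rate quoted in this README
CHUNK_SECONDS = 4     # processing window length from the list above


def pcm16_to_mono(raw: bytes, channels: int) -> List[float]:
    """Decode little-endian 16-bit PCM and average channels to floats in [-1, 1)."""
    count = len(raw) // 2
    ints = struct.unpack(f"<{count}h", raw[: count * 2])
    frames = count // channels
    return [
        sum(ints[i * channels:(i + 1) * channels]) / (channels * 32768.0)
        for i in range(frames)
    ]


def split_chunks(samples: List[float], rate: int = TARGET_RATE) -> List[List[float]]:
    """Split into 4-second windows, zero-padding the last one to full length."""
    size = rate * CHUNK_SECONDS
    chunks = []
    for start in range(0, len(samples), size):
        chunk = samples[start:start + size]
        chunk += [0.0] * (size - len(chunk))
        chunks.append(chunk)
    return chunks


# Two non-silent stereo frames followed by silence, 5 seconds in total at 16 kHz.
raw = struct.pack("<4h", 16384, -16384, 32767, 32767)
raw += bytes(2 * 2 * (5 * TARGET_RATE - 2))
mono = pcm16_to_mono(raw, channels=2)
chunks = split_chunks(mono)
print(len(mono), len(chunks), len(chunks[0]))  # 80000 2 64000
```

Zero-padding keeps every window the same length, which simplifies batching; dropping or overlapping the final window are equally valid choices.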
## 📚 **Research Citation**
If you use this demo in your research, please cite:
```bibtex
@inproceedings{bibbo2025speech,
title={Speech Removal Framework for Privacy-Preserving Audio Recordings},
author={[Authors omitted for review]},
booktitle={2025 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)},
year={2025},
organization={IEEE}
}
```
## 🤝 **Contributing**
We welcome contributions! Areas for improvement:
- **New Models**: Add state-of-the-art VAD models
- **Optimization**: Further CPU/memory optimizations
- **Features**: Additional visualization and analysis tools
- **Documentation**: Improve tutorials and examples
## 📞 **Support**
- **Issues**: [GitHub Issues](https://github.com/gbibbo/vad_demo/issues)
- **Discussions**: [Hugging Face Discussions](https://huggingface.co/spaces/gbibbo/vad_demo/discussions)
- **WASPAA 2025**: Visit our paper presentation
## 📄 **License**
This project is licensed under the **MIT License**.
## 🙏 **Acknowledgments**
- **Silero-VAD**: Silero Team
- **WebRTC-VAD**: Google WebRTC Project
- **E-PANNs**: Efficient PANNs Implementation
- **Hugging Face**: Free Spaces hosting
- **Funding**: AI4S, University of Surrey, EPSRC, CVSSP
---
**🎯 Ready for WASPAA 2025 Demo** | **⚡ CPU Optimized** | **🆓 Free to Use** | **🤗 Hugging Face Spaces**