---
title: VAD Demo - Real-time Speech Detection
emoji: 🎀
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 4.42.0
app_file: app.py
pinned: false
license: mit
---

# 🎀 VAD Demo: Real-time Speech Detection Framework

[![Hugging Face Spaces](https://img.shields.io/badge/πŸ€—%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/gbibbo/vad_demo)
[![WASPAA 2025](https://img.shields.io/badge/WASPAA-2025-green)](https://waspaa.com)

> **Real-time multi-model voice activity detection with interactive visualization - optimized for CPU and free Hugging Face Spaces**

This demo showcases a comprehensive **speech removal framework** designed for privacy-preserving audio recordings, featuring **3 state-of-the-art AI models** with **real-time processing** and **interactive visualization**.

## 🎯 **Live Demo Features**

### πŸ€– **Multi-Model Support**
Compare three AI models side by side:

| Model | Parameters | Speed | Accuracy | Best For |
|-------|------------|-------|----------|----------|
| **Silero-VAD** | 1.8M | ⚑⚑⚑ | ⭐⭐⭐⭐ | General purpose |
| **WebRTC-VAD** | <0.1M | ⚑⚑⚑⚑ | ⭐⭐⭐ | Ultra-fast processing |
| **E-PANNs** | 22M | ⚑⚑ | ⭐⭐⭐⭐ | Efficient AI (73% parameter reduction) |

### πŸ“Š **Real-time Visualization**
- **Dual Analysis**: Compare two models simultaneously
- **Waveform Display**: Live audio visualization
- **Probability Charts**: Real-time speech detection confidence
- **Performance Metrics**: Processing time comparison across models

### πŸ”’ **Privacy-Preserving Applications**
- **Smart Home Audio**: Remove personal conversations while preserving environmental sounds
- **GDPR Compliance**: Privacy-aware audio dataset processing
- **Real-time Processing**: Continuous 4-second chunk analysis at 16kHz
- **CPU Optimized**: Runs efficiently on standard hardware

## πŸš€ **Quick Start**

### Option 1: Use Live Demo (Recommended)
Click the Hugging Face Spaces badge above to try the demo instantly!

### Option 2: Run Locally
```bash
git clone https://huggingface.co/spaces/gbibbo/vad_demo
cd vad_demo
pip install -r requirements.txt
python app.py
```

## πŸŽ›οΈ **How to Use**

1. **🎀 Record Audio**: Click the microphone and record 2-4 seconds of speech
2. **πŸ”§ Select Models**: Choose different models for Model A and Model B comparison
3. **βš™οΈ Adjust Threshold**: Lower values give more sensitive detection (range 0.0-1.0); a minimal thresholding sketch follows this list
4. **🎯 Process**: Click "Process Audio" to analyze
5. **πŸ“Š View Results**: Observe probability charts and detailed analysis
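
The threshold simply converts each model's per-frame speech probability into a binary speech/non-speech decision. A minimal sketch of that step (the probabilities here are placeholder values, not output from the demo):

```python
import numpy as np

def apply_threshold(probs: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Turn per-frame speech probabilities into binary speech/non-speech labels.

    Lowering `threshold` makes detection more sensitive (more frames flagged as speech).
    """
    return probs >= threshold

# Placeholder probabilities for four frames
probs = np.array([0.12, 0.55, 0.81, 0.43])
print(apply_threshold(probs, threshold=0.5))   # [False  True  True False]
```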

## πŸ—οΈ **Technical Architecture**

### **CPU Optimization Strategies**
- **Lazy Loading**: Models load only when needed
- **Efficient Processing**: Optimized audio chunk processing
- **Memory Management**: Smart buffer management for continuous operation
- **Fallback Systems**: Graceful degradation when a model is unavailable (lazy loading and the fallback are sketched below)
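
A minimal sketch of the lazy-loading plus fallback pattern (illustrative only; the loader callable and the energy threshold are placeholders, not the demo's actual code):

```python
import numpy as np

_MODELS = {}  # cache so each backend is built at most once

def _energy_vad(audio: np.ndarray, threshold: float = 0.01) -> bool:
    """Fallback detector: flag speech when the chunk's RMS energy exceeds a threshold."""
    return float(np.sqrt(np.mean(audio ** 2))) > threshold

def get_detector(name: str, loader=None):
    """Return a cached detector, building it lazily on first request.

    `loader` is a callable that constructs the real model (e.g. a torch.hub load);
    if it is missing or raises, degrade gracefully to the energy-based fallback.
    """
    if name not in _MODELS:
        try:
            _MODELS[name] = loader() if loader else _energy_vad
        except Exception:
            _MODELS[name] = _energy_vad
    return _MODELS[name]
```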

### **Audio Processing Pipeline**
```
Audio Input (Microphone) 
    ↓ 
Preprocessing (Normalization, Resampling)
    ↓
Feature Extraction (Spectrograms, MFCCs)
    ↓
Multi-Model Inference (Parallel Processing)
    ↓
Visualization (Interactive Plotly Dashboard)
```
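
A condensed sketch of the same pipeline in code (function names, the stand-in detector, and the normalization choice are illustrative, not the demo's implementation):

```python
import numpy as np

def preprocess(audio: np.ndarray) -> np.ndarray:
    """Peak-normalize one chunk; resampling to 16 kHz is assumed to happen upstream."""
    peak = np.max(np.abs(audio))
    return audio / peak if peak > 0 else audio

def run_pipeline(audio: np.ndarray, detectors: dict) -> dict:
    """Run every selected detector on the preprocessed chunk and collect its output."""
    x = preprocess(audio)
    return {name: detect(x) for name, detect in detectors.items()}

# Usage with a trivial stand-in detector on a 4-second chunk
detectors = {"energy": lambda x: float(np.sqrt(np.mean(x ** 2)) > 0.01)}
print(run_pipeline(np.random.randn(16000 * 4).astype(np.float32) * 0.1, detectors))
```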

### **Model Implementation Details**

#### **Silero-VAD** (Production Ready)
- **Source**: `torch.hub` official Silero model
- **Optimization**: Direct PyTorch inference
- **Memory**: ~50MB RAM usage
- **Latency**: ~30ms processing time
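
A minimal sketch of loading Silero-VAD through `torch.hub` and scoring a chunk (the repository name and helper signature follow the Silero project's documentation; verify the exact return format against the version you install):

```python
import torch

# Fetch the official Silero-VAD model and helper functions (cached after the first call)
model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
get_speech_timestamps = utils[0]

# Score a mono 16 kHz float tensor; returns speech regions in samples (empty for non-speech)
audio = torch.randn(16000 * 4) * 0.05   # 4 seconds of placeholder audio
print(get_speech_timestamps(audio, model, sampling_rate=16000))
```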

#### **WebRTC-VAD** (Ultra-Fast)  
- **Source**: Google WebRTC project
- **Fallback**: Energy-based VAD when WebRTC unavailable
- **Latency**: <5ms processing time
- **Memory**: ~10MB RAM usage
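
A small sketch of frame-level WebRTC-VAD with the energy-based fallback mentioned above (`webrtcvad.Vad` and `is_speech` are the library's documented calls; the fallback threshold is an arbitrary placeholder):

```python
import numpy as np

def is_speech_frame(frame: np.ndarray, sample_rate: int = 16000) -> bool:
    """Classify one 30 ms frame of 16-bit mono PCM; fall back to RMS energy if webrtcvad is absent."""
    try:
        import webrtcvad
        vad = webrtcvad.Vad(2)                    # aggressiveness 0 (lenient) to 3 (strict)
        return vad.is_speech(frame.tobytes(), sample_rate)
    except ImportError:
        rms = np.sqrt(np.mean((frame.astype(np.float32) / 32768.0) ** 2))
        return bool(rms > 0.01)                   # placeholder energy threshold

# 30 ms of silence at 16 kHz (480 int16 samples) should not be flagged as speech
print(is_speech_frame(np.zeros(480, dtype=np.int16)))
```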

#### **E-PANNs** (Efficient Deep Learning)
- **Features**: Mel-spectrogram + MFCC analysis  
- **Optimization**: Simplified neural architecture
- **Speed**: 2-3x faster than full PANNs
- **Memory**: ~150MB RAM usage
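
The E-PANNs checkpoint and architecture are not reproduced here; the sketch below only illustrates the kind of mel-spectrogram + MFCC front-end the description refers to, using standard `librosa` calls on a 4-second chunk:

```python
import numpy as np
import librosa

def extract_features(audio: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Stack log-mel and MFCC features; the real E-PANNs model has its own front-end and weights."""
    log_mel = librosa.power_to_db(librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=64))
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)
    return np.vstack([log_mel, mfcc])   # (64 + 13) coefficients x frames

features = extract_features(np.random.randn(16000 * 4).astype(np.float32) * 0.1)
print(features.shape)
```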

## πŸ“ˆ **Performance Benchmarks**

Evaluated on the **CHiME-Home dataset** (evaluation adapted for CPU-only processing):

| Model | F1-Score | RTF (CPU) | Memory | Use Case |
|-------|----------|-----------|--------|-----------|
| Silero-VAD | 0.806 | 0.065 | 50MB | Lightweight |
| WebRTC-VAD | 0.708 | 0.003 | 10MB | Ultra-fast |
| E-PANNs | 0.847 | 0.180 | 150MB | Balanced |

*RTF: Real-Time Factor (lower is better, <1.0 = real-time capable)*
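
RTF is simply wall-clock processing time divided by the duration of the audio processed; a generic way to measure it (the timing helper and stand-in detector below are not taken from the demo):

```python
import time
import numpy as np

def real_time_factor(detect, audio: np.ndarray, sr: int = 16000) -> float:
    """RTF = processing time / audio duration; values below 1.0 keep up with real time."""
    start = time.perf_counter()
    detect(audio)
    return (time.perf_counter() - start) / (len(audio) / sr)

# Usage with a trivial stand-in detector on a 4-second chunk
rtf = real_time_factor(lambda x: float(np.mean(np.abs(x))), np.random.randn(16000 * 4))
print(f"RTF: {rtf:.4f}")
```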

## πŸ”¬ **Research Applications**

### **Privacy-Preserving Audio Processing**
- **Domestic Recordings**: Remove personal conversations
- **Smart Speakers**: Privacy-aware voice assistants  
- **Audio Datasets**: GDPR-compliant data collection
- **Surveillance Systems**: Selective audio monitoring

### **Speech Technology Research**
- **Model Comparison**: Benchmark different VAD approaches
- **Real-time Systems**: Low-latency speech detection
- **Edge Computing**: CPU-efficient processing
- **Hybrid Systems**: Combine multiple detection methods

## πŸ“Š **Technical Specifications**

### **System Requirements**
- **CPU**: 2+ cores (4+ recommended)
- **RAM**: 1GB minimum (2GB recommended)  
- **Python**: 3.8+ (3.10+ recommended)
- **Browser**: Chrome/Firefox with microphone support

### **Hugging Face Spaces Optimization**
- **Memory Limit**: Designed to stay within the 16GB Spaces allocation
- **CPU Cores**: Optimized for 8-core allocation
- **Storage**: <500MB model storage requirement
- **Networking**: Minimal external dependencies

### **Audio Specifications**
- **Input Format**: 16-bit PCM, mono/stereo
- **Sample Rates**: 8kHz, 16kHz, 32kHz, 48kHz (auto-conversion)
- **Chunk Size**: 4-second processing windows (see the chunking sketch after this list)
- **Latency**: <200ms processing delay
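
A small sketch of the resample-and-chunk step these specifications imply (`librosa.resample` is a standard call; the zero-padding of the final window is an assumption about how partial chunks are handled):

```python
import numpy as np
import librosa

def to_chunks(audio: np.ndarray, orig_sr: int, target_sr: int = 16000, chunk_s: float = 4.0):
    """Resample to the processing rate and yield fixed 4-second windows (last window zero-padded)."""
    if orig_sr != target_sr:
        audio = librosa.resample(audio, orig_sr=orig_sr, target_sr=target_sr)
    hop = int(chunk_s * target_sr)
    for start in range(0, len(audio), hop):
        chunk = audio[start:start + hop]
        if len(chunk) < hop:
            chunk = np.pad(chunk, (0, hop - len(chunk)))
        yield chunk

# 10 s of 48 kHz placeholder audio becomes three 4-second chunks at 16 kHz
chunks = list(to_chunks(np.random.randn(48000 * 10).astype(np.float32), orig_sr=48000))
print(len(chunks), chunks[0].shape)
```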

## πŸ“š **Research Citation**

If you use this demo in your research, please cite:

```bibtex
@inproceedings{bibbo2025speech,
    title={Speech Removal Framework for Privacy-Preserving Audio Recordings},
    author={[Authors omitted for review]},
    booktitle={2025 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)},
    year={2025},
    organization={IEEE}
}
```

## 🀝 **Contributing**

We welcome contributions! Areas for improvement:
- **New Models**: Add state-of-the-art VAD models
- **Optimization**: Further CPU/memory optimizations  
- **Features**: Additional visualization and analysis tools
- **Documentation**: Improve tutorials and examples

## πŸ“ž **Support**

- **Issues**: [GitHub Issues](https://github.com/gbibbo/vad_demo/issues)
- **Discussions**: [Hugging Face Discussions](https://huggingface.co/spaces/gbibbo/vad_demo/discussions)
- **WASPAA 2025**: Visit our paper presentation

## πŸ“„ **License**

This project is licensed under the **MIT License**.

## πŸ™ **Acknowledgments**

- **Silero-VAD**: Silero Team
- **WebRTC-VAD**: Google WebRTC Project
- **E-PANNs**: Efficient PANNs Implementation
- **Hugging Face**: Free Spaces hosting
- **Funding**: AI4S, University of Surrey, EPSRC, CVSSP

---

**🎯 Ready for WASPAA 2025 Demo** | **⚑ CPU Optimized** | **πŸ†“ Free to Use** | **πŸ€— Hugging Face Spaces**