---
title: VAD Demo - Real-time Speech Detection
emoji: 🎀
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 4.42.0
app_file: app.py
pinned: false
license: mit
---

# 🎀 VAD Demo: Real-time Speech Detection Framework

[![Hugging Face Spaces](https://img.shields.io/badge/πŸ€—%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/gbibbo/vad_demo)
[![WASPAA 2025](https://img.shields.io/badge/WASPAA-2025-green)](https://waspaa.com)

> **Real-time multi-model voice activity detection with interactive visualization - optimized for CPU and free Hugging Face Spaces**

This demo showcases a comprehensive **speech removal framework** designed for privacy-preserving audio recordings, featuring **3 state-of-the-art AI models** with **real-time processing** and **interactive visualization**.

## 🎯 **Live Demo Features**

### πŸ€– **Multi-Model Support**
Compare three AI models side by side:

| Model | Parameters | Speed | Accuracy | Best For |
|-------|------------|-------|----------|----------|
| **Silero-VAD** | 1.8M | ⚑⚑⚑ | ⭐⭐⭐⭐ | General purpose |
| **WebRTC-VAD** | <0.1M | ⚑⚑⚑⚑ | ⭐⭐⭐ | Ultra-fast processing |
| **E-PANNs** | 22M | ⚑⚑ | ⭐⭐⭐⭐ | Efficient AI (73% parameter reduction) |

### πŸ“Š **Real-time Visualization**
- **Dual Analysis**: Compare two models simultaneously
- **Waveform Display**: Live audio visualization
- **Probability Charts**: Real-time speech detection confidence
- **Performance Metrics**: Processing time comparison across models

### πŸ”’ **Privacy-Preserving Applications**
- **Smart Home Audio**: Remove personal conversations while preserving environmental sounds
- **GDPR Compliance**: Privacy-aware audio dataset processing
- **Real-time Processing**: Continuous 4-second chunk analysis at 16kHz
- **CPU Optimized**: Runs efficiently on standard hardware

## πŸš€ **Quick Start**

### Option 1: Use Live Demo (Recommended)
Click the Hugging Face Spaces badge above to try the demo instantly!

### Option 2: Run Locally
```bash
git clone https://huggingface.co/spaces/gbibbo/vad_demo
cd vad_demo
pip install -r requirements.txt
python app.py
```

## πŸŽ›οΈ **How to Use**

1. **🎀 Record Audio**: Click the microphone and record 2-4 seconds of speech
2. **πŸ”§ Select Models**: Choose different models for Model A and Model B comparison
3. **βš™οΈ Adjust Threshold**: Lower values give more sensitive detection (range 0.0-1.0); a minimal thresholding sketch follows this list
4. **🎯 Process**: Click "Process Audio" to analyze
5. **πŸ“Š View Results**: Observe probability charts and detailed analysis
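
The threshold simply converts each model's per-frame speech probability into a binary speech/non-speech decision. A minimal sketch of that step (the probabilities here are placeholder values, not output from the demo):

```python
import numpy as np

def apply_threshold(probs: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Turn per-frame speech probabilities into binary speech/non-speech labels.

    Lowering `threshold` makes detection more sensitive (more frames flagged as speech).
    """
    return probs >= threshold

# Placeholder probabilities for four frames
probs = np.array([0.12, 0.55, 0.81, 0.43])
print(apply_threshold(probs, threshold=0.5))   # [False  True  True False]
```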

## πŸ—οΈ **Technical Architecture**

### **CPU Optimization Strategies**
- **Lazy Loading**: Models load only when needed
- **Efficient Processing**: Optimized audio chunk processing
- **Memory Management**: Smart buffer management for continuous operation
- **Fallback Systems**: Graceful degradation when a model is unavailable (lazy loading and the fallback are sketched below)
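
A minimal sketch of the lazy-loading plus fallback pattern (illustrative only; the loader callable and the energy threshold are placeholders, not the demo's actual code):

```python
import numpy as np

_MODELS = {}  # cache so each backend is built at most once

def _energy_vad(audio: np.ndarray, threshold: float = 0.01) -> bool:
    """Fallback detector: flag speech when the chunk's RMS energy exceeds a threshold."""
    return float(np.sqrt(np.mean(audio ** 2))) > threshold

def get_detector(name: str, loader=None):
    """Return a cached detector, building it lazily on first request.

    `loader` is a callable that constructs the real model (e.g. a torch.hub load);
    if it is missing or raises, degrade gracefully to the energy-based fallback.
    """
    if name not in _MODELS:
        try:
            _MODELS[name] = loader() if loader else _energy_vad
        except Exception:
            _MODELS[name] = _energy_vad
    return _MODELS[name]
```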

### **Audio Processing Pipeline**
```
Audio Input (Microphone) 
    ↓ 
Preprocessing (Normalization, Resampling)
    ↓
Feature Extraction (Spectrograms, MFCCs)
    ↓
Multi-Model Inference (Parallel Processing)
    ↓
Visualization (Interactive Plotly Dashboard)
```
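
A condensed sketch of the same pipeline in code (function names, the stand-in detector, and the normalization choice are illustrative, not the demo's implementation):

```python
import numpy as np

def preprocess(audio: np.ndarray) -> np.ndarray:
    """Peak-normalize one chunk; resampling to 16 kHz is assumed to happen upstream."""
    peak = np.max(np.abs(audio))
    return audio / peak if peak > 0 else audio

def run_pipeline(audio: np.ndarray, detectors: dict) -> dict:
    """Run every selected detector on the preprocessed chunk and collect its output."""
    x = preprocess(audio)
    return {name: detect(x) for name, detect in detectors.items()}

# Usage with a trivial stand-in detector on a 4-second chunk
detectors = {"energy": lambda x: float(np.sqrt(np.mean(x ** 2)) > 0.01)}
print(run_pipeline(np.random.randn(16000 * 4).astype(np.float32) * 0.1, detectors))
```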

### **Model Implementation Details**

#### **Silero-VAD** (Production Ready)
- **Source**: `torch.hub` official Silero model
- **Optimization**: Direct PyTorch inference
- **Memory**: ~50MB RAM usage
- **Latency**: ~30ms processing time
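
A minimal sketch of loading Silero-VAD through `torch.hub` and scoring a chunk (the repository name and helper signature follow the Silero project's documentation; verify the exact return format against the version you install):

```python
import torch

# Fetch the official Silero-VAD model and helper functions (cached after the first call)
model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
get_speech_timestamps = utils[0]

# Score a mono 16 kHz float tensor; returns speech regions in samples (empty for non-speech)
audio = torch.randn(16000 * 4) * 0.05   # 4 seconds of placeholder audio
print(get_speech_timestamps(audio, model, sampling_rate=16000))
```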

#### **WebRTC-VAD** (Ultra-Fast)  
- **Source**: Google WebRTC project
- **Fallback**: Energy-based VAD when WebRTC unavailable
- **Latency**: <5ms processing time
- **Memory**: ~10MB RAM usage
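
A small sketch of frame-level WebRTC-VAD with the energy-based fallback mentioned above (`webrtcvad.Vad` and `is_speech` are the library's documented calls; the fallback threshold is an arbitrary placeholder):

```python
import numpy as np

def is_speech_frame(frame: np.ndarray, sample_rate: int = 16000) -> bool:
    """Classify one 30 ms frame of 16-bit mono PCM; fall back to RMS energy if webrtcvad is absent."""
    try:
        import webrtcvad
        vad = webrtcvad.Vad(2)                    # aggressiveness 0 (lenient) to 3 (strict)
        return vad.is_speech(frame.tobytes(), sample_rate)
    except ImportError:
        rms = np.sqrt(np.mean((frame.astype(np.float32) / 32768.0) ** 2))
        return bool(rms > 0.01)                   # placeholder energy threshold

# 30 ms of silence at 16 kHz (480 int16 samples) should not be flagged as speech
print(is_speech_frame(np.zeros(480, dtype=np.int16)))
```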

#### **E-PANNs** (Efficient Deep Learning)
- **Features**: Mel-spectrogram + MFCC analysis  
- **Optimization**: Simplified neural architecture
- **Speed**: 2-3x faster than full PANNs
- **Memory**: ~150MB RAM usage
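
The E-PANNs checkpoint and architecture are not reproduced here; the sketch below only illustrates the kind of mel-spectrogram + MFCC front-end the description refers to, using standard `librosa` calls on a 4-second chunk:

```python
import numpy as np
import librosa

def extract_features(audio: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Stack log-mel and MFCC features; the real E-PANNs model has its own front-end and weights."""
    log_mel = librosa.power_to_db(librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=64))
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)
    return np.vstack([log_mel, mfcc])   # (64 + 13) coefficients x frames

features = extract_features(np.random.randn(16000 * 4).astype(np.float32) * 0.1)
print(features.shape)
```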

## πŸ“ˆ **Performance Benchmarks**

Evaluated on the **CHiME-Home dataset** (evaluation adapted for CPU-only processing):

| Model | F1-Score | RTF (CPU) | Memory | Use Case |
|-------|----------|-----------|--------|-----------|
| Silero-VAD | 0.806 | 0.065 | 50MB | Lightweight |
| WebRTC-VAD | 0.708 | 0.003 | 10MB | Ultra-fast |
| E-PANNs | 0.847 | 0.180 | 150MB | Balanced |

*RTF: Real-Time Factor (lower is better, <1.0 = real-time capable)*
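
RTF is simply wall-clock processing time divided by the duration of the audio processed; a generic way to measure it (the timing helper and stand-in detector below are not taken from the demo):

```python
import time
import numpy as np

def real_time_factor(detect, audio: np.ndarray, sr: int = 16000) -> float:
    """RTF = processing time / audio duration; values below 1.0 keep up with real time."""
    start = time.perf_counter()
    detect(audio)
    return (time.perf_counter() - start) / (len(audio) / sr)

# Usage with a trivial stand-in detector on a 4-second chunk
rtf = real_time_factor(lambda x: float(np.mean(np.abs(x))), np.random.randn(16000 * 4))
print(f"RTF: {rtf:.4f}")
```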

## πŸ”¬ **Research Applications**

### **Privacy-Preserving Audio Processing**
- **Domestic Recordings**: Remove personal conversations
- **Smart Speakers**: Privacy-aware voice assistants  
- **Audio Datasets**: GDPR-compliant data collection
- **Surveillance Systems**: Selective audio monitoring

### **Speech Technology Research**
- **Model Comparison**: Benchmark different VAD approaches
- **Real-time Systems**: Low-latency speech detection
- **Edge Computing**: CPU-efficient processing
- **Hybrid Systems**: Combine multiple detection methods

## πŸ“Š **Technical Specifications**

### **System Requirements**
- **CPU**: 2+ cores (4+ recommended)
- **RAM**: 1GB minimum (2GB recommended)  
- **Python**: 3.8+ (3.10+ recommended)
- **Browser**: Chrome/Firefox with microphone support

### **Hugging Face Spaces Optimization**
- **Memory Limit**: Designed to stay within the 16GB Spaces allocation
- **CPU Cores**: Optimized for 8-core allocation
- **Storage**: <500MB model storage requirement
- **Networking**: Minimal external dependencies

### **Audio Specifications**
- **Input Format**: 16-bit PCM, mono/stereo
- **Sample Rates**: 8kHz, 16kHz, 32kHz, 48kHz (auto-conversion)
- **Chunk Size**: 4-second processing windows (see the chunking sketch after this list)
- **Latency**: <200ms processing delay
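
A small sketch of the resample-and-chunk step these specifications imply (`librosa.resample` is a standard call; the zero-padding of the final window is an assumption about how partial chunks are handled):

```python
import numpy as np
import librosa

def to_chunks(audio: np.ndarray, orig_sr: int, target_sr: int = 16000, chunk_s: float = 4.0):
    """Resample to the processing rate and yield fixed 4-second windows (last window zero-padded)."""
    if orig_sr != target_sr:
        audio = librosa.resample(audio, orig_sr=orig_sr, target_sr=target_sr)
    hop = int(chunk_s * target_sr)
    for start in range(0, len(audio), hop):
        chunk = audio[start:start + hop]
        if len(chunk) < hop:
            chunk = np.pad(chunk, (0, hop - len(chunk)))
        yield chunk

# 10 s of 48 kHz placeholder audio becomes three 4-second chunks at 16 kHz
chunks = list(to_chunks(np.random.randn(48000 * 10).astype(np.float32), orig_sr=48000))
print(len(chunks), chunks[0].shape)
```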

## πŸ“š **Research Citation**

If you use this demo in your research, please cite:

```bibtex
@inproceedings{bibbo2025speech,
    title={Speech Removal Framework for Privacy-Preserving Audio Recordings},
    author={[Authors omitted for review]},
    booktitle={2025 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)},
    year={2025},
    organization={IEEE}
}
```

## 🀝 **Contributing**

We welcome contributions! Areas for improvement:
- **New Models**: Add state-of-the-art VAD models
- **Optimization**: Further CPU/memory optimizations  
- **Features**: Additional visualization and analysis tools
- **Documentation**: Improve tutorials and examples

## πŸ“ž **Support**

- **Issues**: [GitHub Issues](https://github.com/gbibbo/vad_demo/issues)
- **Discussions**: [Hugging Face Discussions](https://huggingface.co/spaces/gbibbo/vad_demo/discussions)
- **WASPAA 2025**: Visit our paper presentation

## πŸ“„ **License**

This project is licensed under the **MIT License**.

## πŸ™ **Acknowledgments**

- **Silero-VAD**: Silero Team
- **WebRTC-VAD**: Google WebRTC Project
- **E-PANNs**: Efficient PANNs Implementation
- **Hugging Face**: Free Spaces hosting
- **Funding**: AI4S, University of Surrey, EPSRC, CVSSP

---

**🎯 Ready for WASPAA 2025 Demo** | **⚑ CPU Optimized** | **πŸ†“ Free to Use** | **πŸ€— Hugging Face Spaces**