---
title: VAD Demo - Real-time Speech Detection
emoji: 🎤
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 4.42.0
app_file: app.py
pinned: false
license: mit
---

# 🎤 VAD Demo: Real-time Speech Detection Framework

[Hugging Face Space](https://huggingface.co/spaces/gbibbo/vad_demo)
[WASPAA 2025](https://waspaa.com)

> **Real-time multi-model voice activity detection with interactive visualization, optimized for CPU and free Hugging Face Spaces**

This demo showcases a **speech removal framework** for privacy-preserving audio recordings, featuring **three state-of-the-art AI models** with **real-time processing** and **interactive visualization**.

## 🎯 **Live Demo Features**

### 🤖 **Multi-Model Support**

Compare three different AI models side-by-side:

| Model | Parameters | Speed | Accuracy | Best For |
|-------|------------|-------|----------|----------|
| **Silero-VAD** | 1.8M | ⚡⚡⚡ | ⭐⭐⭐⭐ | General purpose |
| **WebRTC-VAD** | <0.1M | ⚡⚡⚡⚡ | ⭐⭐⭐ | Ultra-fast processing |
| **E-PANNs** | 22M | ⚡⚡ | ⭐⭐⭐⭐ | Efficient AI (73% parameter reduction) |

### 📊 **Real-time Visualization**

- **Dual Analysis**: Compare two models simultaneously
- **Waveform Display**: Live audio visualization
- **Probability Charts**: Real-time speech detection confidence
- **Performance Metrics**: Processing-time comparison across models

### 🔒 **Privacy-Preserving Applications**

- **Smart Home Audio**: Remove personal conversations while preserving environmental sounds
- **GDPR Compliance**: Privacy-aware audio dataset processing
- **Real-time Processing**: Continuous 4-second chunk analysis at 16 kHz
- **CPU Optimized**: Runs efficiently on standard hardware

## 🚀 **Quick Start**

### Option 1: Use the Live Demo (Recommended)

Click the Hugging Face Spaces badge above to try the demo instantly.

### Option 2: Run Locally

```bash
git clone https://huggingface.co/spaces/gbibbo/vad_demo
cd vad_demo
pip install -r requirements.txt
python app.py
```

## 🎛️ **How to Use**

1. **🎤 Record Audio**: Click the microphone and record 2-4 seconds of speech
2. **🔧 Select Models**: Choose different models for Model A and Model B comparison
3. **⚙️ Adjust Threshold**: A lower threshold means more sensitive detection (range 0.0-1.0)
4. **🎯 Process**: Click "Process Audio" to analyze
5. **📊 View Results**: Observe the probability charts and detailed analysis
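
The thresholding step above boils down to comparing each per-frame speech probability against the slider value. A minimal sketch (the function name and shapes are illustrative, not the app's actual API):

```python
def apply_threshold(probabilities, threshold=0.5):
    """Map per-frame speech probabilities to binary speech/non-speech flags.

    A lower threshold flags more frames as speech (more sensitive);
    a higher threshold flags fewer frames (more conservative).
    """
    return [p >= threshold for p in probabilities]

probs = [0.1, 0.4, 0.7, 0.9, 0.3]
print(apply_threshold(probs, threshold=0.5))  # [False, False, True, True, False]
print(apply_threshold(probs, threshold=0.3))  # [False, True, True, True, True]
```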

## 🏗️ **Technical Architecture**

### **CPU Optimization Strategies**

- **Lazy Loading**: Models load only when needed
- **Efficient Processing**: Optimized audio chunk processing
- **Memory Management**: Smart buffer management for continuous operation
- **Fallback Systems**: Graceful degradation when models are unavailable

### **Audio Processing Pipeline**

```
Audio Input (Microphone)
        ↓
Preprocessing (Normalization, Resampling)
        ↓
Feature Extraction (Spectrograms, MFCCs)
        ↓
Multi-Model Inference (Parallel Processing)
        ↓
Visualization (Interactive Plotly Dashboard)
```
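
The 4-second chunking stage feeding this pipeline can be sketched in plain Python. This is a sketch under the demo's stated parameters (16 kHz, 4-second windows); names are illustrative, and a real streaming implementation would buffer the final partial window rather than drop it:

```python
SAMPLE_RATE = 16_000   # Hz, as used by the demo
CHUNK_SECONDS = 4      # processing window length

def chunk_audio(samples, sample_rate=SAMPLE_RATE, chunk_seconds=CHUNK_SECONDS):
    """Split a 1-D sequence of samples into fixed-length analysis windows,
    discarding the trailing partial window."""
    chunk_len = sample_rate * chunk_seconds
    return [samples[i:i + chunk_len]
            for i in range(0, len(samples) - chunk_len + 1, chunk_len)]

audio = [0.0] * (10 * SAMPLE_RATE)   # 10 s of silence
chunks = chunk_audio(audio)
print(len(chunks), len(chunks[0]))   # 2 64000
```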

### **Model Implementation Details**

#### **Silero-VAD** (Production Ready)

- **Source**: official Silero model via `torch.hub`
- **Optimization**: Direct PyTorch inference
- **Memory**: ~50MB RAM usage
- **Latency**: ~30ms processing time

#### **WebRTC-VAD** (Ultra-Fast)

- **Source**: Google WebRTC project
- **Fallback**: Energy-based VAD when WebRTC is unavailable
- **Latency**: <5ms processing time
- **Memory**: ~10MB RAM usage
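
An energy-based fallback like the one mentioned above can be as simple as thresholding per-frame mean squared amplitude. This is a minimal sketch, not the app's actual fallback; the frame length and threshold are illustrative:

```python
def energy_vad(samples, frame_len=480, threshold=0.01):
    """Naive energy-based VAD: a frame is speech when its mean squared
    amplitude exceeds a fixed threshold.

    frame_len=480 corresponds to 30 ms at 16 kHz, a common VAD frame size.
    """
    decisions = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        decisions.append(energy > threshold)
    return decisions

quiet = [0.001] * 480   # near-silence frame
loud = [0.5] * 480      # high-energy frame
print(energy_vad(quiet + loud))  # [False, True]
```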

#### **E-PANNs** (Efficient Deep Learning)

- **Features**: Mel-spectrogram + MFCC analysis
- **Optimization**: Simplified neural architecture
- **Speed**: 2-3x faster than full PANNs
- **Memory**: ~150MB RAM usage

## 📈 **Performance Benchmarks**

Evaluated on the **CHiME-Home dataset** (adapted for CPU):

| Model | F1-Score | RTF (CPU) | Memory | Use Case |
|-------|----------|-----------|--------|----------|
| Silero-VAD | 0.806 | 0.065 | 50MB | Lightweight |
| WebRTC-VAD | 0.708 | 0.003 | 10MB | Ultra-fast |
| E-PANNs | 0.847 | 0.180 | 150MB | Balanced |

*RTF: Real-Time Factor (lower is better; <1.0 = real-time capable)*
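
RTF is simply processing time divided by the duration of the audio processed. For example (the 0.26 s figure below is illustrative, not a measured value):

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF < 1.0 means the system processes audio faster than real time."""
    return processing_seconds / audio_seconds

# e.g. taking 0.26 s to process one 4-second chunk:
rtf = real_time_factor(0.26, 4.0)
print(round(rtf, 3))  # 0.065
```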

## 🔬 **Research Applications**

### **Privacy-Preserving Audio Processing**

- **Domestic Recordings**: Remove personal conversations
- **Smart Speakers**: Privacy-aware voice assistants
- **Audio Datasets**: GDPR-compliant data collection
- **Surveillance Systems**: Selective audio monitoring

### **Speech Technology Research**

- **Model Comparison**: Benchmark different VAD approaches
- **Real-time Systems**: Low-latency speech detection
- **Edge Computing**: CPU-efficient processing
- **Hybrid Systems**: Combine multiple detection methods

## 📊 **Technical Specifications**

### **System Requirements**

- **CPU**: 2+ cores (4+ recommended)
- **RAM**: 1GB minimum (2GB recommended)
- **Python**: 3.8+ (3.10+ recommended)
- **Browser**: Chrome/Firefox with microphone support

### **Hugging Face Spaces Optimization**

- **Memory Limit**: Designed for the 16GB Spaces limit
- **CPU Cores**: Optimized for an 8-core allocation
- **Storage**: <500MB model storage requirement
- **Networking**: Minimal external dependencies

### **Audio Specifications**

- **Input Format**: 16-bit PCM, mono/stereo
- **Sample Rates**: 8 kHz, 16 kHz, 32 kHz, 48 kHz (auto-conversion)
- **Chunk Size**: 4-second processing windows
- **Latency**: <200ms processing delay
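
To illustrate the sample-rate auto-conversion, here is a naive linear-interpolation resampler. This is a sketch only; the demo's actual conversion is not specified here, and production pipelines should use a proper anti-aliased resampler (e.g. from librosa or torchaudio):

```python
def resample_linear(samples, src_rate, dst_rate):
    """Resample a 1-D signal by linear interpolation between neighbors.

    No anti-aliasing filter is applied, so downsampling with this
    function would alias; it is for illustration only.
    """
    if src_rate == dst_rate:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for j in range(n_out):
        pos = j * src_rate / dst_rate          # fractional source index
        i = int(pos)
        frac = pos - i
        right = samples[min(i + 1, len(samples) - 1)]
        out.append(samples[i] * (1 - frac) + right * frac)
    return out

ramp = [0.0, 1.0, 2.0, 3.0]                    # 4 samples at 8 kHz
up = resample_linear(ramp, 8_000, 16_000)      # upsample to 16 kHz
print(len(up))  # 8
```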

## 📚 **Research Citation**

If you use this demo in your research, please cite:

```bibtex
@inproceedings{bibbo2025speech,
  title={Speech Removal Framework for Privacy-Preserving Audio Recordings},
  author={[Authors omitted for review]},
  booktitle={2025 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)},
  year={2025},
  organization={IEEE}
}
```

## 🤝 **Contributing**

We welcome contributions! Areas for improvement:

- **New Models**: Add state-of-the-art VAD models
- **Optimization**: Further CPU/memory optimizations
- **Features**: Additional visualization and analysis tools
- **Documentation**: Improved tutorials and examples

## 📞 **Support**

- **Issues**: [GitHub Issues](https://github.com/gbibbo/vad_demo/issues)
- **Discussions**: [Hugging Face Discussions](https://huggingface.co/spaces/gbibbo/vad_demo/discussions)
- **WASPAA 2025**: Visit our paper presentation

## 📄 **License**

This project is licensed under the **MIT License**.

## 🙏 **Acknowledgments**

- **Silero-VAD**: Silero Team
- **WebRTC-VAD**: Google WebRTC Project
- **E-PANNs**: Efficient PANNs implementation
- **Hugging Face**: Free Spaces hosting
- **Funding**: AI4S, University of Surrey, EPSRC, CVSSP

---

**🎯 Ready for WASPAA 2025 Demo** | **⚡ CPU Optimized** | **🆓 Free to Use** | **🤗 Hugging Face Spaces**