Spaces:

KNipun
/

Whisper-AI-Psychiatric

Sleeping

App Files Files Community

Whisper-AI-Psychiatric / AUDIO_FEATURES.md

KNipun

Upload 14 files

cf0bb06 verified about 1 month ago

preview code

raw

history blame contribute delete

5.49 kB

	# Audio Features Documentation - Whisper AI-Psychiatric

	## Overview
	The Whisper AI-Psychiatric application now includes speech-to-text and text-to-speech capabilities to enhance user interaction through voice input and audio responses.

	## Features Added

	### 🎤 Speech-to-Text (STT)
	- Model: Whisper-tiny (located in `stt-model/whisper-tiny/`)
	- Functionality: Converts user voice input to text for chat interaction
	- Input Methods:
	- Real-time audio recording (using microphone)
	- Audio file upload (supports WAV, MP3, M4A, FLAC)

	### 🔊 Text-to-Speech (TTS)
	- Model: Kokoro-82M (located in `tts-model/Kokoro-82M/`)
	- Functionality: Converts AI responses to speech audio
	- Features:
	- Adjustable speech speed (0.5x to 2.0x)
	- Auto-play option for responses
	- Manual play button for each response

	## Installation Requirements

	### Required Packages
	Run one of the following to install audio processing packages:

	Option 1: Using batch file (Windows)
	```bash
	install_audio_packages.bat
	```

	Option 2: Using PowerShell (Windows)
	```powershell
	.\install_audio_packages.ps1
	```

	Option 3: Manual installation
	```bash
	pip install librosa>=0.10.0
	pip install soundfile>=0.12.0
	pip install audio-recorder-streamlit>=0.0.8
	pip install scipy>=1.9.0
	```

	### Updated requirements.txt
	The requirements.txt file has been updated to include:
	- `librosa>=0.10.0` - Audio processing library
	- `soundfile>=0.12.0` - Audio file I/O
	- `audio-recorder-streamlit>=0.0.8` - Streamlit audio recording component
	- `scipy>=1.9.0` - Scientific computing (audio processing support)

	## Usage Guide

	### Using Speech-to-Text

	1. Real-time Recording:
	- Click the microphone icon in the "Voice Input" section
	- Speak your question clearly
	- Click "Stop" when finished
	- Click "🔄 Transcribe Audio" to convert speech to text
	- The transcribed text will automatically be sent to the chat

	2. File Upload:
	- If the microphone recorder is not available, use the file uploader
	- Upload an audio file (WAV, MP3, M4A, FLAC)
	- Click "🔄 Transcribe Uploaded Audio"
	- The transcribed text will be processed

	### Using Text-to-Speech

	1. Enable/Disable TTS:
	- Use the "Enable Text-to-Speech" checkbox in the sidebar
	- Adjust "Audio Speed" slider (0.5x to 2.0x normal speed)

	2. Playing Responses:
	- Each AI response will have a "🔊 Play" button
	- Click to generate and play the audio version of the response
	- Audio will auto-play when generated

	## Technical Implementation

	### Speech-to-Text Pipeline
	1. Audio input captured/uploaded
	2. Audio processed using librosa (resampled to 16kHz)
	3. Whisper model processes audio features
	4. Generated transcription added to chat

	### Text-to-Speech Pipeline
	1. AI response text processed
	2. Kokoro-82M model generates speech audio
	3. Audio served through HTML5 audio player
	4. Supports speed adjustment and auto-play

	## Sidebar Features

	### Model Status Indicators
	- ✅ Whisper AI Model Loaded
	- ✅ FAISS Index Loaded
	- ✅ Speech-to-Text Loaded

	### Audio Settings
	- Enable Text-to-Speech: Toggle TTS functionality
	- Audio Speed: Adjust playback speed (0.5x - 2.0x)

	### Voice Input Tips
	- Speak clearly and distinctly
	- Minimize background noise
	- Keep recordings under 30 seconds for best results
	- Ensure good microphone quality

	## Troubleshooting

	### Common Issues

	1. Microphone Not Working:
	- Check browser permissions for microphone access
	- Use the file upload option as fallback
	- Ensure audio-recorder-streamlit is properly installed

	2. Audio Quality Issues:
	- Use a quiet environment
	- Speak clearly and at normal pace
	- Check microphone quality

	3. TTS Not Working:
	- Verify Kokoro-82M model is in correct directory
	- Check audio player compatibility in browser
	- Ensure scipy and audio libraries are installed

	4. Import Errors:
	- Run the installation scripts
	- Manually install missing packages
	- Check virtual environment activation

	### Model Paths
	Ensure the following model directories exist:
	- Speech-to-Text: `stt-model/whisper-tiny/`
	- Text-to-Speech: `tts-model/Kokoro-82M/`
	- Main AI Model: `model/Whisper-psychology-gemma-3-1b/`

	## Browser Compatibility

	### Recommended Browsers
	- Chrome (best support for audio features)
	- Firefox
	- Edge
	- Safari (may have limited microphone support)

	### Required Permissions
	- Microphone access for voice recording
	- Audio playback for TTS responses

	## Future Enhancements

	### Planned Features
	- Voice activity detection for hands-free operation
	- Multiple voice options for TTS
	- Real-time streaming transcription
	- Noise cancellation for better STT accuracy
	- Custom wake words for voice activation

	### Performance Optimizations
	- Model quantization for faster inference
	- Audio preprocessing optimization
	- Caching for frequently used TTS phrases
	- Background audio processing

	## Support

	For issues or questions:
	1. Check the troubleshooting section above
	2. Verify all dependencies are installed
	3. Test with simple audio files first
	4. Check browser console for error messages

	## Version Information
	- Version: 2.0 (Audio Features)
	- Added: Speech-to-Text and Text-to-Speech capabilities
	- Base Version: 1.0 (Text-only chat interface)