Audio Features Documentation - Whisper AI-Psychiatric
Overview
The Whisper AI-Psychiatric application now includes speech-to-text and text-to-speech capabilities to enhance user interaction through voice input and audio responses.
Features Added
🎤 Speech-to-Text (STT)
- Model: Whisper-tiny (located in stt-model/whisper-tiny/)
- Functionality: Converts user voice input to text for chat interaction
- Input Methods:
  - Real-time audio recording (using microphone)
  - Audio file upload (supports WAV, MP3, M4A, FLAC)
🔊 Text-to-Speech (TTS)
- Model: Kokoro-82M (located in tts-model/Kokoro-82M/)
- Functionality: Converts AI responses to speech audio
- Features:
  - Adjustable speech speed (0.5x to 2.0x)
  - Auto-play option for responses
  - Manual play button for each response
Installation Requirements
Required Packages
Run one of the following to install audio processing packages:
Option 1: Using batch file (Windows)
install_audio_packages.bat
Option 2: Using PowerShell (Windows)
.\install_audio_packages.ps1
Option 3: Manual installation
pip install "librosa>=0.10.0"
pip install "soundfile>=0.12.0"
pip install "audio-recorder-streamlit>=0.0.8"
pip install "scipy>=1.9.0"
Updated requirements.txt
The requirements.txt file has been updated to include:
- librosa>=0.10.0 - Audio processing library
- soundfile>=0.12.0 - Audio file I/O
- audio-recorder-streamlit>=0.0.8 - Streamlit audio recording component
- scipy>=1.9.0 - Scientific computing (audio processing support)
Usage Guide
Using Speech-to-Text
Real-time Recording:
- Click the microphone icon in the "Voice Input" section
- Speak your question clearly
- Click "Stop" when finished
- Click "🔄 Transcribe Audio" to convert speech to text
- The transcribed text will automatically be sent to the chat
File Upload:
- If the microphone recorder is not available, use the file uploader
- Upload an audio file (WAV, MP3, M4A, FLAC)
- Click "🔄 Transcribe Uploaded Audio"
- The transcribed text will be sent to the chat
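The accepted upload formats can be checked before any transcription is attempted. The following is a minimal sketch, assuming a hypothetical helper named `is_supported_audio` (it is not taken from the application code) that compares file extensions case-insensitively:

```python
from pathlib import Path

# Formats listed in this guide. The constant and helper names are
# hypothetical and used only for illustration.
SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".flac"}


def is_supported_audio(filename: str) -> bool:
    """Return True if the file extension is one of the documented formats."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```

An extension check only guards against obviously wrong uploads; it does not confirm that the file contents can actually be decoded.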
Using Text-to-Speech
Enable/Disable TTS:
- Use the "Enable Text-to-Speech" checkbox in the sidebar
- Adjust "Audio Speed" slider (0.5x to 2.0x normal speed)
Playing Responses:
- Each AI response will have a "🔊 Play" button
- Click to generate and play the audio version of the response
- Audio will auto-play when generated
Technical Implementation
Speech-to-Text Pipeline
- Audio input captured/uploaded
- Audio processed using librosa (resampled to 16 kHz)
- Whisper model processes audio features
- Generated transcription added to chat
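The pipeline above could look roughly like the sketch below. It assumes the Whisper-tiny checkpoint in stt-model/whisper-tiny/ is in Hugging Face Transformers format and is loaded with `WhisperProcessor` and `WhisperForConditionalGeneration`; the application may load it differently. The function name `transcribe` is hypothetical.

```python
from pathlib import Path

WHISPER_DIR = Path("stt-model/whisper-tiny")  # model path from this guide
TARGET_SR = 16_000  # sample rate the pipeline resamples to (16 kHz)


def transcribe(audio_path: str, model_dir: Path = WHISPER_DIR) -> str:
    """Load an audio file, resample it to 16 kHz, and return the transcription.

    Assumption: the checkpoint is a Transformers-format Whisper model.
    """
    # Heavy dependencies are imported lazily so the module imports quickly.
    import librosa
    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    # librosa resamples to TARGET_SR and downmixes to mono while loading.
    audio, _ = librosa.load(audio_path, sr=TARGET_SR, mono=True)

    processor = WhisperProcessor.from_pretrained(str(model_dir))
    model = WhisperForConditionalGeneration.from_pretrained(str(model_dir))

    # Convert the waveform to log-mel input features, then decode token ids.
    inputs = processor(audio, sampling_rate=TARGET_SR, return_tensors="pt")
    predicted_ids = model.generate(inputs.input_features)
    text = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
    return text.strip()
```

This sketch reloads the model on every call for simplicity. In a Streamlit app it would be more typical to load the processor and model once and reuse them across reruns.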
Text-to-Speech Pipeline
- AI response text processed
- Kokoro-82M model generates speech audio
- Audio served through HTML5 audio player
- Supports speed adjustment and auto-play
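One way to serve generated audio through an HTML5 player is to embed the WAV bytes in a data URI. The sketch below is an assumption about how this step could work, not the application's actual code; `audio_player_html` is a hypothetical helper and speech generation itself is left out.

```python
import base64


def audio_player_html(wav_bytes: bytes, autoplay: bool = False) -> str:
    """Embed WAV bytes in an HTML5 <audio> element via a base64 data URI."""
    encoded = base64.b64encode(wav_bytes).decode("ascii")
    autoplay_attr = " autoplay" if autoplay else ""
    return (
        f"<audio controls{autoplay_attr}>"
        f'<source src="data:audio/wav;base64,{encoded}" type="audio/wav">'
        "</audio>"
    )
```

In Streamlit, a string like this could be rendered with `st.markdown(html, unsafe_allow_html=True)`, and `st.audio` is a built-in alternative. Some browsers block autoplay until the user has interacted with the page, so the manual play button remains useful.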
Sidebar Features
Model Status Indicators
- ✅ Whisper AI Model Loaded
- ✅ FAISS Index Loaded
- ✅ Speech-to-Text Loaded
Audio Settings
- Enable Text-to-Speech: Toggle TTS functionality
- Audio Speed: Adjust playback speed (0.5x - 2.0x)
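The two settings above map naturally onto Streamlit sidebar widgets. This is a hedged sketch: the widget labels and the 0.5 to 2.0 range come from this guide, while the default values, the 0.1 step, and the function names are assumptions.

```python
MIN_SPEED, MAX_SPEED = 0.5, 2.0  # range documented in this guide


def clamp_speed(speed: float) -> float:
    """Keep a requested speed inside the documented 0.5x to 2.0x range."""
    return max(MIN_SPEED, min(MAX_SPEED, speed))


def render_audio_settings() -> tuple[bool, float]:
    """Draw the sidebar controls and return (tts_enabled, speed).

    Assumed defaults: TTS enabled, speed 1.0, slider step 0.1.
    """
    import streamlit as st  # imported lazily; only needed inside the app

    enabled = st.sidebar.checkbox("Enable Text-to-Speech", value=True)
    speed = st.sidebar.slider("Audio Speed", MIN_SPEED, MAX_SPEED, 1.0, 0.1)
    return enabled, clamp_speed(speed)
```

The slider already enforces the range in the UI; `clamp_speed` is a small safeguard for values that arrive from elsewhere, such as saved session state.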
Voice Input Tips
- Speak clearly and distinctly
- Minimize background noise
- Keep recordings under 30 seconds for best results
- Ensure good microphone quality
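The 30-second guideline can be checked programmatically for uncompressed WAV files using only the standard library. The helper names below are hypothetical, and MP3, M4A, or FLAC files would need a decoding library instead of `wave`.

```python
import wave

MAX_SECONDS = 30.0  # recommended upper bound from the tips above


def wav_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_within_limit(path: str, limit: float = MAX_SECONDS) -> bool:
    """Return True if the recording is no longer than the limit."""
    return wav_duration_seconds(path) <= limit
```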
Troubleshooting
Common Issues
Microphone Not Working:
- Check browser permissions for microphone access
- Use the file upload option as fallback
- Ensure audio-recorder-streamlit is properly installed
Audio Quality Issues:
- Use a quiet environment
- Speak clearly and at normal pace
- Check microphone quality
TTS Not Working:
- Verify Kokoro-82M model is in correct directory
- Check audio player compatibility in browser
- Ensure scipy and audio libraries are installed
Import Errors:
- Run the installation scripts
- Manually install missing packages
- Check virtual environment activation
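To see which audio packages are missing from the active environment, a short check with `importlib.metadata` can help. This is a sketch; the package names come from requirements.txt as listed in this guide, and the function names are hypothetical.

```python
from importlib import metadata

# Distribution names as listed in requirements.txt in this guide.
REQUIRED = ["librosa", "soundfile", "audio-recorder-streamlit", "scipy"]


def installed_versions(packages=REQUIRED) -> dict:
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def missing_packages(packages=REQUIRED) -> list:
    """Return the names of packages not installed in this environment."""
    found = installed_versions(packages)
    return [name for name, version in found.items() if version is None]
```

Running `print(missing_packages())` inside the same virtual environment that launches the app shows whether the installation scripts need to be rerun.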
Model Paths
Ensure the following model directories exist:
- Speech-to-Text: stt-model/whisper-tiny/
- Text-to-Speech: tts-model/Kokoro-82M/
- Main AI Model: model/Whisper-psychology-gemma-3-1b/
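A quick way to confirm the directories listed above is a small path check. The sketch assumes the paths are relative to the project root; `missing_model_dirs` is a hypothetical helper.

```python
from pathlib import Path

# Paths from this guide, assumed to be relative to the project root.
MODEL_DIRS = {
    "Speech-to-Text": "stt-model/whisper-tiny",
    "Text-to-Speech": "tts-model/Kokoro-82M",
    "Main AI Model": "model/Whisper-psychology-gemma-3-1b",
}


def missing_model_dirs(base_dir: str = ".") -> list:
    """Return the labels of model directories that do not exist under base_dir."""
    base = Path(base_dir)
    return [label for label, rel in MODEL_DIRS.items() if not (base / rel).is_dir()]
```

An empty list means all three directories are present. The check does not verify that the model files inside them are complete.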
Browser Compatibility
Recommended Browsers
- Chrome (best support for audio features)
- Firefox
- Edge
- Safari (may have limited microphone support)
Required Permissions
- Microphone access for voice recording
- Audio playback for TTS responses
Future Enhancements
Planned Features
- Voice activity detection for hands-free operation
- Multiple voice options for TTS
- Real-time streaming transcription
- Noise cancellation for better STT accuracy
- Custom wake words for voice activation
Performance Optimizations
- Model quantization for faster inference
- Audio preprocessing optimization
- Caching for frequently used TTS phrases
- Background audio processing
Support
For issues or questions:
- Check the troubleshooting section above
- Verify all dependencies are installed
- Test with simple audio files first
- Check browser console for error messages
Version Information
- Version: 2.0 (Audio Features)
- Added: Speech-to-Text and Text-to-Speech capabilities
- Base Version: 1.0 (Text-only chat interface)