Whisper-AI-Psychiatric / AUDIO_FEATURES.md
KNipun's picture
Upload 14 files
cf0bb06 verified
# Audio Features Documentation - Whisper AI-Psychiatric
## Overview
The Whisper AI-Psychiatric application now includes speech-to-text and text-to-speech capabilities to enhance user interaction through voice input and audio responses.
## Features Added
### 🎤 Speech-to-Text (STT)
- **Model**: Whisper-tiny (located in `stt-model/whisper-tiny/`)
- **Functionality**: Converts user voice input to text for chat interaction
- **Input Methods**:
- Real-time audio recording (using microphone)
- Audio file upload (supports WAV, MP3, M4A, FLAC)
### 🔊 Text-to-Speech (TTS)
- **Model**: Kokoro-82M (located in `tts-model/Kokoro-82M/`)
- **Functionality**: Converts AI responses to speech audio
- **Features**:
- Adjustable speech speed (0.5x to 2.0x)
- Auto-play option for responses
- Manual play button for each response
## Installation Requirements
### Required Packages
Run one of the following to install audio processing packages:
**Option 1: Using batch file (Windows)**
```bash
install_audio_packages.bat
```
**Option 2: Using PowerShell (Windows)**
```powershell
.\install_audio_packages.ps1
```
**Option 3: Manual installation**
```bash
pip install librosa>=0.10.0
pip install soundfile>=0.12.0
pip install audio-recorder-streamlit>=0.0.8
pip install scipy>=1.9.0
```
### Updated requirements.txt
The requirements.txt file has been updated to include:
- `librosa>=0.10.0` - Audio processing library
- `soundfile>=0.12.0` - Audio file I/O
- `audio-recorder-streamlit>=0.0.8` - Streamlit audio recording component
- `scipy>=1.9.0` - Scientific computing (audio processing support)
## Usage Guide
### Using Speech-to-Text
1. **Real-time Recording**:
- Click the microphone icon in the "Voice Input" section
- Speak your question clearly
- Click "Stop" when finished
- Click "🔄 Transcribe Audio" to convert speech to text
- The transcribed text will automatically be sent to the chat
2. **File Upload**:
- If the microphone recorder is not available, use the file uploader
- Upload an audio file (WAV, MP3, M4A, FLAC)
- Click "🔄 Transcribe Uploaded Audio"
- The transcribed text will be processed
### Using Text-to-Speech
1. **Enable/Disable TTS**:
- Use the "Enable Text-to-Speech" checkbox in the sidebar
- Adjust "Audio Speed" slider (0.5x to 2.0x normal speed)
2. **Playing Responses**:
- Each AI response will have a "🔊 Play" button
- Click to generate and play the audio version of the response
- Audio will auto-play when generated
## Technical Implementation
### Speech-to-Text Pipeline
1. Audio input captured/uploaded
2. Audio processed using librosa (resampled to 16kHz)
3. Whisper model processes audio features
4. Generated transcription added to chat
### Text-to-Speech Pipeline
1. AI response text processed
2. Kokoro-82M model generates speech audio
3. Audio served through HTML5 audio player
4. Supports speed adjustment and auto-play
## Sidebar Features
### Model Status Indicators
- ✅ Whisper AI Model Loaded
- ✅ FAISS Index Loaded
- ✅ Speech-to-Text Loaded
### Audio Settings
- **Enable Text-to-Speech**: Toggle TTS functionality
- **Audio Speed**: Adjust playback speed (0.5x - 2.0x)
### Voice Input Tips
- Speak clearly and distinctly
- Minimize background noise
- Keep recordings under 30 seconds for best results
- Ensure good microphone quality
## Troubleshooting
### Common Issues
1. **Microphone Not Working**:
- Check browser permissions for microphone access
- Use the file upload option as fallback
- Ensure audio-recorder-streamlit is properly installed
2. **Audio Quality Issues**:
- Use a quiet environment
- Speak clearly and at normal pace
- Check microphone quality
3. **TTS Not Working**:
- Verify Kokoro-82M model is in correct directory
- Check audio player compatibility in browser
- Ensure scipy and audio libraries are installed
4. **Import Errors**:
- Run the installation scripts
- Manually install missing packages
- Check virtual environment activation
### Model Paths
Ensure the following model directories exist:
- Speech-to-Text: `stt-model/whisper-tiny/`
- Text-to-Speech: `tts-model/Kokoro-82M/`
- Main AI Model: `model/Whisper-psychology-gemma-3-1b/`
## Browser Compatibility
### Recommended Browsers
- Chrome (best support for audio features)
- Firefox
- Edge
- Safari (may have limited microphone support)
### Required Permissions
- Microphone access for voice recording
- Audio playback for TTS responses
## Future Enhancements
### Planned Features
- Voice activity detection for hands-free operation
- Multiple voice options for TTS
- Real-time streaming transcription
- Noise cancellation for better STT accuracy
- Custom wake words for voice activation
### Performance Optimizations
- Model quantization for faster inference
- Audio preprocessing optimization
- Caching for frequently used TTS phrases
- Background audio processing
## Support
For issues or questions:
1. Check the troubleshooting section above
2. Verify all dependencies are installed
3. Test with simple audio files first
4. Check browser console for error messages
## Version Information
- **Version**: 2.0 (Audio Features)
- **Added**: Speech-to-Text and Text-to-Speech capabilities
- **Base Version**: 1.0 (Text-only chat interface)