Spaces:
Sleeping
Sleeping
File size: 5,493 Bytes
cf0bb06 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 |
# Audio Features Documentation - Whisper AI-Psychiatric
## Overview
The Whisper AI-Psychiatric application now includes speech-to-text and text-to-speech capabilities to enhance user interaction through voice input and audio responses.
## Features Added
### 🎤 Speech-to-Text (STT)
- **Model**: Whisper-tiny (located in `stt-model/whisper-tiny/`)
- **Functionality**: Converts user voice input to text for chat interaction
- **Input Methods**:
- Real-time audio recording (using microphone)
- Audio file upload (supports WAV, MP3, M4A, FLAC)
### 🔊 Text-to-Speech (TTS)
- **Model**: Kokoro-82M (located in `tts-model/Kokoro-82M/`)
- **Functionality**: Converts AI responses to speech audio
- **Features**:
- Adjustable speech speed (0.5x to 2.0x)
- Auto-play option for responses
- Manual play button for each response
## Installation Requirements
### Required Packages
Run one of the following to install audio processing packages:
**Option 1: Using batch file (Windows)**
```bash
install_audio_packages.bat
```
**Option 2: Using PowerShell (Windows)**
```powershell
.\install_audio_packages.ps1
```
**Option 3: Manual installation**
```bash
pip install librosa>=0.10.0
pip install soundfile>=0.12.0
pip install audio-recorder-streamlit>=0.0.8
pip install scipy>=1.9.0
```
### Updated requirements.txt
The requirements.txt file has been updated to include:
- `librosa>=0.10.0` - Audio processing library
- `soundfile>=0.12.0` - Audio file I/O
- `audio-recorder-streamlit>=0.0.8` - Streamlit audio recording component
- `scipy>=1.9.0` - Scientific computing (audio processing support)
## Usage Guide
### Using Speech-to-Text
1. **Real-time Recording**:
- Click the microphone icon in the "Voice Input" section
- Speak your question clearly
- Click "Stop" when finished
- Click "🔄 Transcribe Audio" to convert speech to text
- The transcribed text will automatically be sent to the chat
2. **File Upload**:
- If the microphone recorder is not available, use the file uploader
- Upload an audio file (WAV, MP3, M4A, FLAC)
- Click "🔄 Transcribe Uploaded Audio"
- The transcribed text will be processed
### Using Text-to-Speech
1. **Enable/Disable TTS**:
- Use the "Enable Text-to-Speech" checkbox in the sidebar
- Adjust "Audio Speed" slider (0.5x to 2.0x normal speed)
2. **Playing Responses**:
- Each AI response will have a "🔊 Play" button
- Click to generate and play the audio version of the response
- Audio will auto-play when generated
## Technical Implementation
### Speech-to-Text Pipeline
1. Audio input captured/uploaded
2. Audio processed using librosa (resampled to 16kHz)
3. Whisper model processes audio features
4. Generated transcription added to chat
### Text-to-Speech Pipeline
1. AI response text processed
2. Kokoro-82M model generates speech audio
3. Audio served through HTML5 audio player
4. Supports speed adjustment and auto-play
## Sidebar Features
### Model Status Indicators
- ✅ Whisper AI Model Loaded
- ✅ FAISS Index Loaded
- ✅ Speech-to-Text Loaded
### Audio Settings
- **Enable Text-to-Speech**: Toggle TTS functionality
- **Audio Speed**: Adjust playback speed (0.5x - 2.0x)
### Voice Input Tips
- Speak clearly and distinctly
- Minimize background noise
- Keep recordings under 30 seconds for best results
- Ensure good microphone quality
## Troubleshooting
### Common Issues
1. **Microphone Not Working**:
- Check browser permissions for microphone access
- Use the file upload option as fallback
- Ensure audio-recorder-streamlit is properly installed
2. **Audio Quality Issues**:
- Use a quiet environment
- Speak clearly and at normal pace
- Check microphone quality
3. **TTS Not Working**:
- Verify Kokoro-82M model is in correct directory
- Check audio player compatibility in browser
- Ensure scipy and audio libraries are installed
4. **Import Errors**:
- Run the installation scripts
- Manually install missing packages
- Check virtual environment activation
### Model Paths
Ensure the following model directories exist:
- Speech-to-Text: `stt-model/whisper-tiny/`
- Text-to-Speech: `tts-model/Kokoro-82M/`
- Main AI Model: `model/Whisper-psychology-gemma-3-1b/`
## Browser Compatibility
### Recommended Browsers
- Chrome (best support for audio features)
- Firefox
- Edge
- Safari (may have limited microphone support)
### Required Permissions
- Microphone access for voice recording
- Audio playback for TTS responses
## Future Enhancements
### Planned Features
- Voice activity detection for hands-free operation
- Multiple voice options for TTS
- Real-time streaming transcription
- Noise cancellation for better STT accuracy
- Custom wake words for voice activation
### Performance Optimizations
- Model quantization for faster inference
- Audio preprocessing optimization
- Caching for frequently used TTS phrases
- Background audio processing
## Support
For issues or questions:
1. Check the troubleshooting section above
2. Verify all dependencies are installed
3. Test with simple audio files first
4. Check browser console for error messages
## Version Information
- **Version**: 2.0 (Audio Features)
- **Added**: Speech-to-Text and Text-to-Speech capabilities
- **Base Version**: 1.0 (Text-only chat interface)
|