Audio Features Documentation - Whisper AI-Psychiatric

Overview

The Whisper AI-Psychiatric application now supports speech-to-text and text-to-speech, so users can speak their questions and listen to the AI's responses.

Features Added

🎤 Speech-to-Text (STT)

Model: Whisper-tiny (located in stt-model/whisper-tiny/)
Functionality: Converts user voice input to text for chat interaction
Input Methods:
- Real-time audio recording (using microphone)
- Audio file upload (supports WAV, MP3, M4A, FLAC)

🔊 Text-to-Speech (TTS)

Model: Kokoro-82M (located in tts-model/Kokoro-82M/)
Functionality: Converts AI responses to speech audio
Features:
- Adjustable speech speed (0.5x to 2.0x)
- Auto-play option for responses
- Manual play button for each response

Installation Requirements

Required Packages

Run one of the following to install audio processing packages:

Option 1: Using batch file (Windows)

install_audio_packages.bat

Option 2: Using PowerShell (Windows)

.\install_audio_packages.ps1

Option 3: Manual installation

pip install "librosa>=0.10.0"
pip install "soundfile>=0.12.0"
pip install "audio-recorder-streamlit>=0.0.8"
pip install "scipy>=1.9.0"

Updated requirements.txt

The requirements.txt file has been updated to include:

librosa>=0.10.0 - Audio processing library
soundfile>=0.12.0 - Audio file I/O
audio-recorder-streamlit>=0.0.8 - Streamlit audio recording component
scipy>=1.9.0 - Scientific computing (audio processing support)
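
To confirm that the installed packages satisfy these minimums, a small check script can help. The following is a minimal sketch that uses only the standard library; the helper names (version_tuple, meets_minimum, check_requirements) are illustrative and not part of the application, and the version comparison is deliberately simple (numeric components only).

```python
from importlib import metadata

# Minimum versions copied from the requirements list above.
REQUIRED = {
    "librosa": "0.10.0",
    "soundfile": "0.12.0",
    "audio-recorder-streamlit": "0.0.8",
    "scipy": "1.9.0",
}


def version_tuple(version: str) -> tuple:
    """Turn '0.10.0' into (0, 10, 0); parsing stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two versions after padding the shorter one with zeros."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_requirements(required=REQUIRED) -> dict:
    """Return {package: problem description} for every unmet requirement."""
    problems = {}
    for name, minimum in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if not meets_minimum(installed, minimum):
            problems[name] = f"found {installed}, need >= {minimum}"
    return problems


if __name__ == "__main__":
    for name, problem in check_requirements().items():
        print(f"{name}: {problem}")
```

An empty result from check_requirements() means every listed package is present at or above its minimum version.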

Usage Guide

Using Speech-to-Text

Real-time Recording:
- Click the microphone icon in the "Voice Input" section
- Speak your question clearly
- Click "Stop" when finished
- Click "🔄 Transcribe Audio" to convert speech to text
- The transcribed text will automatically be sent to the chat
File Upload:
- If the microphone recorder is not available, use the file uploader
- Upload an audio file (WAV, MP3, M4A, FLAC)
- Click "🔄 Transcribe Uploaded Audio"
- The transcribed text is sent to the chat, as with a live recording
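
Before transcribing an uploaded file, it can be worth rejecting unsupported formats early. The sketch below checks the file extension against the formats listed above; the function name is illustrative, and how the application itself validates uploads is not shown in this document.

```python
from pathlib import Path

# Formats listed in this guide; the app's actual uploader settings are not shown.
SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".flac"}


def is_supported_audio(filename: str) -> bool:
    """True if the file name ends in one of the supported audio extensions."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```

The check is case-insensitive, so "note.WAV" is accepted. It looks only at the name, not the file contents, so a mislabeled file would still fail later during decoding.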

Using Text-to-Speech

Enable/Disable TTS:
- Use the "Enable Text-to-Speech" checkbox in the sidebar
- Adjust "Audio Speed" slider (0.5x to 2.0x normal speed)
Playing Responses:
- Each AI response has a "🔊 Play" button
- Click it to generate and play the audio version of the response
- Playback starts automatically once the audio has been generated

Technical Implementation

Speech-to-Text Pipeline

Audio input is captured from the microphone or uploaded as a file
Audio is loaded with librosa and resampled to 16 kHz
The Whisper model processes the audio features
The generated transcription is added to the chat
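
The steps above can be sketched in a few lines. This example assumes the Whisper model is loaded through the Hugging Face transformers library (with PyTorch installed) from the directory given in this guide; the document does not state which loader the application actually uses, and the function name transcribe is illustrative.

```python
TARGET_SAMPLE_RATE = 16_000  # sample rate named in the pipeline above
MODEL_DIR = "stt-model/whisper-tiny/"  # path from this guide


def transcribe(audio_path: str, model_dir: str = MODEL_DIR) -> str:
    """Load an audio file, resample it, and return Whisper's transcription."""
    # Imported lazily so this module can be imported without the heavy
    # dependencies (librosa, transformers, torch) being present.
    import librosa
    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    # Assumption: the model directory is in Hugging Face format.
    processor = WhisperProcessor.from_pretrained(model_dir)
    model = WhisperForConditionalGeneration.from_pretrained(model_dir)

    # Steps 1-2: read the file and resample it to 16 kHz mono.
    waveform, _ = librosa.load(audio_path, sr=TARGET_SAMPLE_RATE, mono=True)

    # Step 3: convert the waveform to input features and run the model.
    inputs = processor(
        waveform, sampling_rate=TARGET_SAMPLE_RATE, return_tensors="pt"
    )
    token_ids = model.generate(inputs.input_features)

    # Step 4: decode token ids to text; the caller adds this to the chat.
    return processor.batch_decode(token_ids, skip_special_tokens=True)[0].strip()
```

In a real application the processor and model would be loaded once and reused rather than reloaded on every call.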

Text-to-Speech Pipeline

AI response text processed
Kokoro-82M model generates speech audio
Audio served through HTML5 audio player
Supports speed adjustment and auto-play
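
The serving step can be illustrated without the TTS model itself. The sketch below packs raw audio samples into an in-memory WAV file and embeds it in an HTML5 audio element as a base64 data URI. It is an assumption about how the player could be built, not the application's actual code; all function names are illustrative, and a generated tone stands in for the Kokoro-82M output. The clamp_speed helper only enforces the documented slider range; applying the speed during synthesis is not shown.

```python
import base64
import io
import math
import struct
import wave

MIN_SPEED, MAX_SPEED = 0.5, 2.0  # slider range from this guide


def clamp_speed(speed: float) -> float:
    """Keep a requested speed inside the documented 0.5x to 2.0x range."""
    return max(MIN_SPEED, min(MAX_SPEED, speed))


def pcm_to_wav_bytes(samples, sample_rate: int) -> bytes:
    """Pack float samples in [-1, 1] into a 16-bit mono WAV file in memory."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        wav.writeframes(frames)
    return buffer.getvalue()


def audio_tag(wav_bytes: bytes, autoplay: bool = False) -> str:
    """Build an HTML5 audio element with the WAV data embedded inline."""
    encoded = base64.b64encode(wav_bytes).decode("ascii")
    autoplay_attr = " autoplay" if autoplay else ""
    return (
        f"<audio controls{autoplay_attr}>"
        f'<source src="data:audio/wav;base64,{encoded}" type="audio/wav">'
        "</audio>"
    )


if __name__ == "__main__":
    # Stand-in for model output: half a second of a 440 Hz tone.
    # The sample rate here is chosen for the demo only.
    rate = 24_000
    tone = [0.3 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate // 2)]
    html = audio_tag(pcm_to_wav_bytes(tone, rate), autoplay=True)
    print(len(html), "characters of HTML")
```

In a Streamlit app, a string like this could be rendered with st.markdown(html, unsafe_allow_html=True).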

Sidebar Features

Model Status Indicators

✅ Whisper AI Model Loaded
✅ FAISS Index Loaded
✅ Speech-to-Text Loaded

Audio Settings

Enable Text-to-Speech: Toggle TTS functionality
Audio Speed: Adjust playback speed (0.5x to 2.0x)

Voice Input Tips

Speak clearly and distinctly
Minimize background noise
Keep recordings under 30 seconds for best results
Ensure good microphone quality
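
The 30-second recommendation can be checked programmatically before transcription. This is a minimal sketch for WAV data only, using the standard library; other formats would need a decoder such as librosa. The function names are illustrative and not part of the application.

```python
import io
import wave

MAX_SECONDS = 30.0  # recommendation from the tips above


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Duration of a WAV recording held in memory, in seconds."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_within_limit(wav_bytes: bytes, limit: float = MAX_SECONDS) -> bool:
    """True if the recording is no longer than the recommended limit."""
    return wav_duration_seconds(wav_bytes) <= limit
```

A recording that fails this check could be trimmed or split, or the user could be asked to record again.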

Troubleshooting

Common Issues

Microphone Not Working:
- Check browser permissions for microphone access
- Use the file upload option as fallback
- Ensure audio-recorder-streamlit is properly installed
Audio Quality Issues:
- Use a quiet environment
- Speak clearly and at normal pace
- Check microphone quality
TTS Not Working:
- Verify that the Kokoro-82M model is in tts-model/Kokoro-82M/
- Check audio player compatibility in the browser
- Ensure scipy, librosa, and soundfile are installed
Import Errors:
- Run an installation script (install_audio_packages.bat or install_audio_packages.ps1)
- Manually install any missing packages
- Check that the virtual environment is activated

Model Paths

Ensure the following model directories exist:

Speech-to-Text: stt-model/whisper-tiny/
Text-to-Speech: tts-model/Kokoro-82M/
Main AI Model: model/Whisper-psychology-gemma-3-1b/
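
A quick way to verify these directories is a short script run from the project root. The sketch below assumes the paths are relative to the directory the application is started from; the function name is illustrative.

```python
from pathlib import Path

# Directories listed above, relative to the project root.
MODEL_DIRS = {
    "Speech-to-Text": "stt-model/whisper-tiny/",
    "Text-to-Speech": "tts-model/Kokoro-82M/",
    "Main AI Model": "model/Whisper-psychology-gemma-3-1b/",
}


def missing_model_dirs(root=".", model_dirs=MODEL_DIRS) -> list:
    """Return the labels of model directories that do not exist under root."""
    base = Path(root)
    return [
        label for label, rel in model_dirs.items() if not (base / rel).is_dir()
    ]


if __name__ == "__main__":
    missing = missing_model_dirs()
    if missing:
        print("Missing model directories:", ", ".join(missing))
    else:
        print("All model directories found.")
```

This only confirms that the directories exist; it does not validate the model files inside them.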

Browser Compatibility

Recommended Browsers

Chrome (best support for audio features)
Firefox
Edge
Safari (may have limited microphone support)

Required Permissions

Microphone access for voice recording
Audio playback for TTS responses (some browsers block auto-play until the user has interacted with the page)

Future Enhancements

Planned Features

Voice activity detection for hands-free operation
Multiple voice options for TTS
Real-time streaming transcription
Noise cancellation for better STT accuracy
Custom wake words for voice activation

Performance Optimizations

Model quantization for faster inference
Audio preprocessing optimization
Caching for frequently used TTS phrases
Background audio processing

Support

For issues or questions:

Check the troubleshooting section above
Verify all dependencies are installed
Test with simple audio files first
Check browser console for error messages

Version Information

Version: 2.0 (Audio Features)
Added: Speech-to-Text and Text-to-Speech capabilities
Base Version: 1.0 (Text-only chat interface)