Whisper-AI-Psychiatric / AUDIO_FEATURES.md
KNipun's picture
Upload 14 files
cf0bb06 verified

A newer version of the Streamlit SDK is available: 1.49.1

Upgrade

Audio Features Documentation - Whisper AI-Psychiatric

Overview

The Whisper AI-Psychiatric application now includes speech-to-text and text-to-speech capabilities to enhance user interaction through voice input and audio responses.

Features Added

🎤 Speech-to-Text (STT)

  • Model: Whisper-tiny (located in stt-model/whisper-tiny/)
  • Functionality: Converts user voice input to text for chat interaction
  • Input Methods:
    • Real-time audio recording (using microphone)
    • Audio file upload (supports WAV, MP3, M4A, FLAC)

🔊 Text-to-Speech (TTS)

  • Model: Kokoro-82M (located in tts-model/Kokoro-82M/)
  • Functionality: Converts AI responses to speech audio
  • Features:
    • Adjustable speech speed (0.5x to 2.0x)
    • Auto-play option for responses
    • Manual play button for each response

Installation Requirements

Required Packages

Run one of the following to install audio processing packages:

Option 1: Using batch file (Windows)

install_audio_packages.bat

Option 2: Using PowerShell (Windows)

.\install_audio_packages.ps1

Option 3: Manual installation

pip install librosa>=0.10.0
pip install soundfile>=0.12.0
pip install audio-recorder-streamlit>=0.0.8
pip install scipy>=1.9.0

Updated requirements.txt

The requirements.txt file has been updated to include:

  • librosa>=0.10.0 - Audio processing library
  • soundfile>=0.12.0 - Audio file I/O
  • audio-recorder-streamlit>=0.0.8 - Streamlit audio recording component
  • scipy>=1.9.0 - Scientific computing (audio processing support)

Usage Guide

Using Speech-to-Text

  1. Real-time Recording:

    • Click the microphone icon in the "Voice Input" section
    • Speak your question clearly
    • Click "Stop" when finished
    • Click "🔄 Transcribe Audio" to convert speech to text
    • The transcribed text will automatically be sent to the chat
  2. File Upload:

    • If the microphone recorder is not available, use the file uploader
    • Upload an audio file (WAV, MP3, M4A, FLAC)
    • Click "🔄 Transcribe Uploaded Audio"
    • The transcribed text will be processed

Using Text-to-Speech

  1. Enable/Disable TTS:

    • Use the "Enable Text-to-Speech" checkbox in the sidebar
    • Adjust "Audio Speed" slider (0.5x to 2.0x normal speed)
  2. Playing Responses:

    • Each AI response will have a "🔊 Play" button
    • Click to generate and play the audio version of the response
    • Audio will auto-play when generated

Technical Implementation

Speech-to-Text Pipeline

  1. Audio input captured/uploaded
  2. Audio processed using librosa (resampled to 16kHz)
  3. Whisper model processes audio features
  4. Generated transcription added to chat

Text-to-Speech Pipeline

  1. AI response text processed
  2. Kokoro-82M model generates speech audio
  3. Audio served through HTML5 audio player
  4. Supports speed adjustment and auto-play

Sidebar Features

Model Status Indicators

  • ✅ Whisper AI Model Loaded
  • ✅ FAISS Index Loaded
  • ✅ Speech-to-Text Loaded

Audio Settings

  • Enable Text-to-Speech: Toggle TTS functionality
  • Audio Speed: Adjust playback speed (0.5x - 2.0x)

Voice Input Tips

  • Speak clearly and distinctly
  • Minimize background noise
  • Keep recordings under 30 seconds for best results
  • Ensure good microphone quality

Troubleshooting

Common Issues

  1. Microphone Not Working:

    • Check browser permissions for microphone access
    • Use the file upload option as fallback
    • Ensure audio-recorder-streamlit is properly installed
  2. Audio Quality Issues:

    • Use a quiet environment
    • Speak clearly and at normal pace
    • Check microphone quality
  3. TTS Not Working:

    • Verify Kokoro-82M model is in correct directory
    • Check audio player compatibility in browser
    • Ensure scipy and audio libraries are installed
  4. Import Errors:

    • Run the installation scripts
    • Manually install missing packages
    • Check virtual environment activation

Model Paths

Ensure the following model directories exist:

  • Speech-to-Text: stt-model/whisper-tiny/
  • Text-to-Speech: tts-model/Kokoro-82M/
  • Main AI Model: model/Whisper-psychology-gemma-3-1b/

Browser Compatibility

Recommended Browsers

  • Chrome (best support for audio features)
  • Firefox
  • Edge
  • Safari (may have limited microphone support)

Required Permissions

  • Microphone access for voice recording
  • Audio playback for TTS responses

Future Enhancements

Planned Features

  • Voice activity detection for hands-free operation
  • Multiple voice options for TTS
  • Real-time streaming transcription
  • Noise cancellation for better STT accuracy
  • Custom wake words for voice activation

Performance Optimizations

  • Model quantization for faster inference
  • Audio preprocessing optimization
  • Caching for frequently used TTS phrases
  • Background audio processing

Support

For issues or questions:

  1. Check the troubleshooting section above
  2. Verify all dependencies are installed
  3. Test with simple audio files first
  4. Check browser console for error messages

Version Information

  • Version: 2.0 (Audio Features)
  • Added: Speech-to-Text and Text-to-Speech capabilities
  • Base Version: 1.0 (Text-only chat interface)