Text-to-Speech (TTS) Setup Guide

Kokoro-82M Implementation

✅ Fixed Issues

  1. File Access Error: Fixed the "process cannot access the file" error by writing audio to in-memory BytesIO buffers instead of temporary files
  2. Proper Error Handling: The app falls back gracefully when Kokoro is not available
  3. Silent Fallback: When Kokoro fails, no error message is shown; the backup audio generator is used instead

🎯 Current Status

  • Primary TTS: Kokoro-82M (if fully configured)
  • Fallback TTS: Multi-harmonic tone generation with speech-like patterns
  • File Handling: Fixed using in-memory BytesIO buffers
  • Audio Format: WAV, 22050 Hz sample rate
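
As a rough illustration of the in-memory approach, the sketch below builds a short multi-harmonic tone and writes it as WAV into a `BytesIO` buffer using only the standard library. It is not the app's actual fallback code: the function name `synth_fallback_wav`, the harmonic weights, and the 4 Hz envelope are assumptions chosen for the example. Only the 22050 Hz sample rate and the use of `BytesIO` come from this guide.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 22050  # Hz, matching the audio format listed above


def synth_fallback_wav(duration_s: float = 0.5, base_hz: float = 140.0) -> bytes:
    """Return WAV bytes for a short multi-harmonic tone, built entirely in memory.

    Hypothetical helper for illustration; names and constants are assumptions.
    """
    n_samples = int(SAMPLE_RATE * duration_s)
    # (harmonic multiple, relative amplitude); illustrative values only
    harmonics = [(1, 0.6), (2, 0.3), (3, 0.1)]
    frames = bytearray()
    for i in range(n_samples):
        t = i / SAMPLE_RATE
        # Slow 4 Hz amplitude envelope for a loosely syllable-like rhythm
        envelope = 0.5 * (1.0 + math.sin(2 * math.pi * 4.0 * t))
        sample = envelope * sum(
            amp * math.sin(2 * math.pi * base_hz * k * t) for k, amp in harmonics
        )
        # Clip to [-1, 1] and convert to 16-bit little-endian PCM
        frames += struct.pack("<h", int(max(-1.0, min(1.0, sample)) * 32767))

    buffer = io.BytesIO()  # in-memory buffer instead of a temporary file
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 16-bit PCM
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(bytes(frames))
    return buffer.getvalue()
```

Because the bytes never touch disk, no file handle is left open for another process to collide with, which is the failure mode behind the "process cannot access the file" error.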

📦 Requirements

  • kokoro>=0.9.2 ✅ Installed
  • soundfile>=0.12.0 ✅ Already available
  • librosa>=0.10.0 ✅ Already available

🔧 Optional: Full Kokoro Setup

The app currently uses the fallback generator. To enable full Kokoro-82M TTS:

  1. Install espeak-ng (system-level):

    # Windows: Download from https://github.com/espeak-ng/espeak-ng/releases
    # Or use chocolatey: choco install espeak
    
    # Ubuntu/Debian:
    sudo apt-get install espeak-ng
    
    # macOS:
    brew install espeak-ng
    
  2. Test Kokoro Installation:

    # Import succeeds only if the kokoro package is installed
    from kokoro import KPipeline
    # lang_code='a' selects American English
    pipeline = KPipeline(lang_code='a')
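
Before running the test above, it can help to check whether the prerequisites are visible at all. The following is a minimal sketch, not part of the app: `kokoro_prerequisites` is a hypothetical helper, and it assumes that espeak-ng, when installed, is reachable on `PATH` (which may not hold for every Windows install).

```python
import importlib.util
import shutil


def kokoro_prerequisites() -> dict:
    """Report whether the pieces needed for full Kokoro TTS appear to be present.

    Hypothetical helper; it only looks for the package and binary, it does not
    load the model.
    """
    return {
        # True if the `kokoro` package can be located, without importing it
        "kokoro_package": importlib.util.find_spec("kokoro") is not None,
        # True if an espeak-ng (or legacy espeak) executable is on PATH
        "espeak_binary": any(shutil.which(name) for name in ("espeak-ng", "espeak")),
    }


if __name__ == "__main__":
    for name, ok in kokoro_prerequisites().items():
        print(f"{name}: {'found' if ok else 'missing'}")
```

A "missing" result here is only a hint; a binary installed outside `PATH` would be reported as missing even though Kokoro might still find it.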
    

🎵 Current Audio Features

  • Fallback Audio: Multi-harmonic synthesis simulating speech patterns
  • Speed Control: Adjustable speech speed (0.5x to 2.0x)
  • Text Cleaning: Removes markdown, emojis, and special characters
  • Length Limiting: Automatically truncates long text to 500 characters
  • In-Memory Processing: No temporary files, prevents file access errors
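
The text cleaning, length limiting, and speed range described above could be sketched roughly as follows. This is an illustrative approximation, not the app's implementation: the function names and the exact regular expressions are assumptions, and only the 500-character limit and the 0.5x to 2.0x range come from this guide.

```python
import re

MAX_CHARS = 500                  # length limit stated above
MIN_SPEED, MAX_SPEED = 0.5, 2.0  # speed range stated above


def clean_text_for_tts(text: str, max_chars: int = MAX_CHARS) -> str:
    """Strip markdown, emojis, and stray symbols, then truncate (hypothetical helper)."""
    text = re.sub(r"```.*?```", " ", text, flags=re.DOTALL)  # fenced code blocks
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)     # [label](url) -> label
    text = re.sub(r"[*_`#>~]", "", text)                     # markdown markers
    text = re.sub(r"[^\w\s.,!?;:'\"()-]", "", text)          # emojis, other symbols
    text = re.sub(r"\s+", " ", text).strip()                 # collapse whitespace
    if len(text) > max_chars:
        # Cut at the last word boundary that fits within the limit
        text = text[:max_chars].rsplit(" ", 1)[0]
    return text


def clamp_speed(speed: float) -> float:
    """Keep the requested speed inside the supported range."""
    return max(MIN_SPEED, min(MAX_SPEED, speed))
```

Truncating at a word boundary rather than at exactly 500 characters avoids sending a half-word to the synthesizer; whether the app does the same is not stated here.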

🔍 Troubleshooting

Issue: "process cannot access the file"

Status: ✅ FIXED - Now uses BytesIO instead of temporary files

Issue: Kokoro import errors

Solution: No action is needed; the app falls back to synthetic audio generation automatically. To use Kokoro itself, complete the optional Kokoro setup.
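
One way to structure such a fallback is sketched below. It is an assumption about the general pattern rather than the app's code: `kokoro_available` and `synthesize_with_fallback` are hypothetical names, and the engines are passed in as plain callables that take text and return audio bytes.

```python
from typing import Callable, Optional


def kokoro_available() -> bool:
    """Return True if the kokoro package imports cleanly (hypothetical helper)."""
    try:
        import kokoro  # noqa: F401  # may fail on missing package or system deps
        return True
    except Exception:
        return False


def synthesize_with_fallback(
    text: str,
    primary: Optional[Callable[[str], bytes]],
    fallback: Callable[[str], bytes],
) -> bytes:
    """Try the primary engine; on any failure, quietly use the fallback."""
    if primary is not None:
        try:
            audio = primary(text)
            if audio:  # treat empty output as a failure too
                return audio
        except Exception:
            pass  # silent fallback: no error is surfaced to the user
    return fallback(text)
```

A caller would pass `None` as `primary` when `kokoro_available()` is false, so the fallback engine is used without attempting the import on every request.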

Issue: No audio generated

Check that:

  1. Audio is enabled in the browser
  2. TTS is enabled in the sidebar settings
  3. The browser console shows no errors

🎯 Voice Features Available

  • Speech-to-Text: Whisper-tiny model ✅
  • Text-to-Speech: Kokoro-82M (fallback: synthetic) ✅
  • Speed Control: 0.5x to 2.0x ✅
  • Auto-processing: Speech → AI Response ✅

🔮 Future Improvements

  1. Enhanced Kokoro Setup: Complete espeak-ng integration
  2. Voice Selection: Multiple Kokoro voices (af_heart, etc.)
  3. Emotion Control: Emotional speech synthesis
  4. SSML Support: Speech Synthesis Markup Language
  5. Caching: Audio response caching for repeated text
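
For the caching idea, a small in-memory cache keyed on the text and speed might look like the sketch below. Nothing like this exists in the app yet; `AudioCache`, its `max_items` limit, and the oldest-first eviction policy are all assumptions made for illustration.

```python
import hashlib
from typing import Callable, Dict


class AudioCache:
    """Tiny in-memory cache mapping (text, speed) to generated audio bytes.

    Hypothetical class; `synthesize` is any callable taking (text, speed).
    """

    def __init__(self, synthesize: Callable[[str, float], bytes], max_items: int = 64):
        self._synthesize = synthesize
        self._max_items = max_items
        self._store: Dict[str, bytes] = {}

    @staticmethod
    def _key(text: str, speed: float) -> str:
        # Hash so that long responses do not become long dictionary keys
        return hashlib.sha256(f"{speed:.2f}|{text}".encode("utf-8")).hexdigest()

    def get(self, text: str, speed: float = 1.0) -> bytes:
        key = self._key(text, speed)
        if key not in self._store:
            if len(self._store) >= self._max_items:
                # Evict the oldest entry (dicts preserve insertion order)
                self._store.pop(next(iter(self._store)))
            self._store[key] = self._synthesize(text, speed)
        return self._store[key]
```

Replaying the same response at the same speed would then skip synthesis entirely, which matters most for the slower Kokoro path.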

📝 Usage

The TTS system works automatically:

  1. The AI generates a text response
  2. Click the "🔊 Play" button next to the response
  3. Audio is generated using the best available method (Kokoro → Fallback)
  4. The audio plays automatically in the browser

⚡ Performance

  • Fallback Audio: ~0.1-0.5 seconds generation time
  • Kokoro Audio: ~1-3 seconds generation time (when available)
  • Memory Usage: Minimal (in-memory processing)
  • File System: No temporary files created