Spaces:

KNipun
/

Whisper-AI-Psychiatric

Sleeping

File size: 5,493 Bytes

cf0bb06

# Audio Features Documentation - Whisper AI-Psychiatric

## Overview
The Whisper AI-Psychiatric application now includes speech-to-text and text-to-speech capabilities to enhance user interaction through voice input and audio responses.

## Features Added

### 🎤 Speech-to-Text (STT)
- **Model**: Whisper-tiny (located in `stt-model/whisper-tiny/`)
- **Functionality**: Converts user voice input to text for chat interaction
- **Input Methods**:
  - Real-time audio recording (using microphone)
  - Audio file upload (supports WAV, MP3, M4A, FLAC)

### 🔊 Text-to-Speech (TTS)
- **Model**: Kokoro-82M (located in `tts-model/Kokoro-82M/`)
- **Functionality**: Converts AI responses to speech audio
- **Features**:
  - Adjustable speech speed (0.5x to 2.0x)
  - Auto-play option for responses
  - Manual play button for each response

## Installation Requirements

### Required Packages
Run one of the following to install audio processing packages:

**Option 1: Using batch file (Windows)**
```bash

install_audio_packages.bat

```

**Option 2: Using PowerShell (Windows)**
```powershell

.\install_audio_packages.ps1

```

**Option 3: Manual installation**
```bash

pip install librosa>=0.10.0

pip install soundfile>=0.12.0

pip install audio-recorder-streamlit>=0.0.8

pip install scipy>=1.9.0

```

### Updated requirements.txt
The requirements.txt file has been updated to include:
- `librosa>=0.10.0` - Audio processing library
- `soundfile>=0.12.0` - Audio file I/O
- `audio-recorder-streamlit>=0.0.8` - Streamlit audio recording component
- `scipy>=1.9.0` - Scientific computing (audio processing support)

## Usage Guide

### Using Speech-to-Text

1. **Real-time Recording**:
   - Click the microphone icon in the "Voice Input" section
   - Speak your question clearly
   - Click "Stop" when finished
   - Click "🔄 Transcribe Audio" to convert speech to text
   - The transcribed text will automatically be sent to the chat

2. **File Upload**:
   - If the microphone recorder is not available, use the file uploader
   - Upload an audio file (WAV, MP3, M4A, FLAC)
   - Click "🔄 Transcribe Uploaded Audio"
   - The transcribed text will be processed

### Using Text-to-Speech

1. **Enable/Disable TTS**:
   - Use the "Enable Text-to-Speech" checkbox in the sidebar
   - Adjust "Audio Speed" slider (0.5x to 2.0x normal speed)

2. **Playing Responses**:
   - Each AI response will have a "🔊 Play" button
   - Click to generate and play the audio version of the response
   - Audio will auto-play when generated

## Technical Implementation

### Speech-to-Text Pipeline
1. Audio input captured/uploaded
2. Audio processed using librosa (resampled to 16kHz)
3. Whisper model processes audio features
4. Generated transcription added to chat

### Text-to-Speech Pipeline
1. AI response text processed
2. Kokoro-82M model generates speech audio
3. Audio served through HTML5 audio player
4. Supports speed adjustment and auto-play

## Sidebar Features

### Model Status Indicators
- ✅ Whisper AI Model Loaded
- ✅ FAISS Index Loaded  
- ✅ Speech-to-Text Loaded

### Audio Settings
- **Enable Text-to-Speech**: Toggle TTS functionality
- **Audio Speed**: Adjust playback speed (0.5x - 2.0x)

### Voice Input Tips
- Speak clearly and distinctly
- Minimize background noise
- Keep recordings under 30 seconds for best results
- Ensure good microphone quality

## Troubleshooting

### Common Issues

1. **Microphone Not Working**:
   - Check browser permissions for microphone access
   - Use the file upload option as fallback
   - Ensure audio-recorder-streamlit is properly installed

2. **Audio Quality Issues**:
   - Use a quiet environment
   - Speak clearly and at normal pace
   - Check microphone quality

3. **TTS Not Working**:
   - Verify Kokoro-82M model is in correct directory
   - Check audio player compatibility in browser
   - Ensure scipy and audio libraries are installed

4. **Import Errors**:
   - Run the installation scripts
   - Manually install missing packages
   - Check virtual environment activation

### Model Paths
Ensure the following model directories exist:
- Speech-to-Text: `stt-model/whisper-tiny/`
- Text-to-Speech: `tts-model/Kokoro-82M/`
- Main AI Model: `model/Whisper-psychology-gemma-3-1b/`

## Browser Compatibility

### Recommended Browsers
- Chrome (best support for audio features)
- Firefox
- Edge
- Safari (may have limited microphone support)

### Required Permissions
- Microphone access for voice recording
- Audio playback for TTS responses

## Future Enhancements

### Planned Features
- Voice activity detection for hands-free operation
- Multiple voice options for TTS
- Real-time streaming transcription
- Noise cancellation for better STT accuracy
- Custom wake words for voice activation

### Performance Optimizations
- Model quantization for faster inference
- Audio preprocessing optimization
- Caching for frequently used TTS phrases
- Background audio processing

## Support

For issues or questions:
1. Check the troubleshooting section above
2. Verify all dependencies are installed
3. Test with simple audio files first
4. Check browser console for error messages

## Version Information
- **Version**: 2.0 (Audio Features)
- **Added**: Speech-to-Text and Text-to-Speech capabilities
- **Base Version**: 1.0 (Text-only chat interface)