# Audio Features Documentation - Whisper AI-Psychiatric

## Overview
The Whisper AI-Psychiatric application now includes speech-to-text and text-to-speech capabilities to enhance user interaction through voice input and audio responses.

## Features Added

### 🎤 Speech-to-Text (STT)
- **Model**: Whisper-tiny (located in `stt-model/whisper-tiny/`)
- **Functionality**: Converts user voice input to text for chat interaction
- **Input Methods**:
  - Real-time audio recording (using microphone)
  - Audio file upload (supports WAV, MP3, M4A, FLAC)

### 🔊 Text-to-Speech (TTS)
- **Model**: Kokoro-82M (located in `tts-model/Kokoro-82M/`)
- **Functionality**: Converts AI responses to speech audio
- **Features**:
  - Adjustable speech speed (0.5x to 2.0x)
  - Auto-play option for responses
  - Manual play button for each response

## Installation Requirements

### Required Packages
Run one of the following to install audio processing packages:

**Option 1: Using batch file (Windows)**
```bat
install_audio_packages.bat
```

**Option 2: Using PowerShell (Windows)**
```powershell
.\install_audio_packages.ps1
```

**Option 3: Manual installation**
```bash
# Quote each specifier so the shell does not treat ">" as output redirection
pip install "librosa>=0.10.0"
pip install "soundfile>=0.12.0"
pip install "audio-recorder-streamlit>=0.0.8"
pip install "scipy>=1.9.0"
```

### Updated requirements.txt
The requirements.txt file has been updated to include:
- `librosa>=0.10.0` - Audio processing library
- `soundfile>=0.12.0` - Audio file I/O
- `audio-recorder-streamlit>=0.0.8` - Streamlit audio recording component
- `scipy>=1.9.0` - Scientific computing (audio processing support)
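
To confirm that the packages above are visible to the active Python environment, a small check along these lines may help. It is a sketch only: `AUDIO_MODULES` and `missing_modules` are illustrative names, not part of the application, and the import names are assumed to match the pip package names except for `audio-recorder-streamlit`, which is imported as `audio_recorder_streamlit`.

```python
import importlib.util

# Import names assumed for the four packages in requirements.txt.
# The pip name "audio-recorder-streamlit" is imported as
# "audio_recorder_streamlit".
AUDIO_MODULES = ["librosa", "soundfile", "audio_recorder_streamlit", "scipy"]


def missing_modules(names):
    """Return the module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(AUDIO_MODULES)
    if missing:
        print("Missing audio packages:", ", ".join(missing))
    else:
        print("All audio packages are importable.")
```

Running the script inside the project's virtual environment lists any package that still needs to be installed.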

## Usage Guide

### Using Speech-to-Text

1. **Real-time Recording**:
   - Click the microphone icon in the "Voice Input" section
   - Speak your question clearly
   - Click "Stop" when finished
   - Click "🔄 Transcribe Audio" to convert speech to text
   - The transcribed text will automatically be sent to the chat

2. **File Upload**:
   - If the microphone recorder is not available, use the file uploader
   - Upload an audio file (WAV, MP3, M4A, FLAC)
   - Click "🔄 Transcribe Uploaded Audio"
   - The transcribed text will be processed
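
The list of accepted upload formats can be expressed as a simple extension check. This is a minimal sketch: `SUPPORTED_EXTENSIONS` and `is_supported_audio` are hypothetical names, and the application's own validation may differ (for example, the upload widget may already filter by type).

```python
from pathlib import Path

# Extensions taken from the upload formats listed above.
SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".flac"}


def is_supported_audio(filename: str) -> bool:
    """Check the file extension against the supported formats, ignoring case."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```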

### Using Text-to-Speech

1. **Enable/Disable TTS**:
   - Use the "Enable Text-to-Speech" checkbox in the sidebar
   - Adjust "Audio Speed" slider (0.5x to 2.0x normal speed)

2. **Playing Responses**:
   - Each AI response will have a "🔊 Play" button
   - Click to generate and play the audio version of the response
   - Audio will auto-play when generated

## Technical Implementation

### Speech-to-Text Pipeline
1. Audio input captured/uploaded
2. Audio loaded with librosa and resampled to 16 kHz
3. Whisper model processes audio features
4. Generated transcription added to chat
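
The four steps above can be sketched roughly as follows. This is not the application's actual code: it assumes the Whisper-tiny checkpoint in `stt-model/whisper-tiny/` is in Hugging Face format and is loaded through the `transformers` library (with PyTorch installed), neither of which is listed in the requirements above. The function name `transcribe` is illustrative.

```python
SAMPLE_RATE = 16_000  # Whisper expects 16 kHz mono audio
MODEL_DIR = "stt-model/whisper-tiny/"  # path from this document


def transcribe(audio_path: str, model_dir: str = MODEL_DIR) -> str:
    """Load an audio file, resample it to 16 kHz, and return the transcription."""
    # Assumption: transformers and torch are installed. The imports are kept
    # local so this module can be loaded without them.
    import librosa
    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    # Steps 1-2: read the file and resample to 16 kHz mono.
    audio, _ = librosa.load(audio_path, sr=SAMPLE_RATE, mono=True)

    # Step 3: convert the waveform to log-mel features and run the model.
    processor = WhisperProcessor.from_pretrained(model_dir)
    model = WhisperForConditionalGeneration.from_pretrained(model_dir)
    inputs = processor(audio, sampling_rate=SAMPLE_RATE, return_tensors="pt")
    token_ids = model.generate(inputs.input_features)

    # Step 4: decode the generated token ids into text for the chat.
    return processor.batch_decode(token_ids, skip_special_tokens=True)[0].strip()
```

In a long-running app the processor and model would normally be loaded once and reused rather than reloaded on every call.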

### Text-to-Speech Pipeline
1. AI response text processed
2. Kokoro-82M model generates speech audio
3. Audio served through HTML5 audio player
4. Supports speed adjustment and auto-play
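
Step 3 of this pipeline, serving audio through an HTML5 player, can be illustrated with the standard library alone. The sketch below assumes the TTS output is available as WAV bytes; `clamp_speed`, `audio_tag`, and `silent_wav` are hypothetical helpers, and the silent clip only stands in for real Kokoro-82M output.

```python
import base64
import io
import wave


def clamp_speed(speed: float, low: float = 0.5, high: float = 2.0) -> float:
    """Keep the requested speed inside the documented 0.5x-2.0x range."""
    return max(low, min(high, speed))


def audio_tag(wav_bytes: bytes, autoplay: bool = False) -> str:
    """Embed WAV bytes in an HTML5 <audio> element as a base64 data URI."""
    encoded = base64.b64encode(wav_bytes).decode("ascii")
    autoplay_attr = " autoplay" if autoplay else ""
    return (
        f"<audio controls{autoplay_attr}>"
        f'<source src="data:audio/wav;base64,{encoded}" type="audio/wav">'
        "</audio>"
    )


def silent_wav(seconds: float = 0.1, sample_rate: int = 24_000) -> bytes:
    """Build a short silent 16-bit mono WAV as placeholder TTS output."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(sample_rate)  # arbitrary rate for the placeholder
        wav_file.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buffer.getvalue()
```

In a Streamlit app, the returned tag could be rendered with `st.markdown(tag, unsafe_allow_html=True)`. Note that some browsers block `autoplay` until the user has interacted with the page.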

## Sidebar Features

### Model Status Indicators
- ✅ Whisper AI Model Loaded
- ✅ FAISS Index Loaded
- ✅ Speech-to-Text Loaded

### Audio Settings
- **Enable Text-to-Speech**: Toggle TTS functionality
- **Audio Speed**: Adjust playback speed (0.5x - 2.0x)

### Voice Input Tips
- Speak clearly and distinctly
- Minimize background noise
- Keep recordings under 30 seconds for best results
- Ensure good microphone quality
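
The 30-second guideline can be checked before transcription. The following sketch works for uncompressed WAV input only and uses the standard-library `wave` module; `MAX_SECONDS`, `wav_duration_seconds`, and `is_short_enough` are illustrative names, not part of the application.

```python
import io
import wave

MAX_SECONDS = 30.0  # recommended upper bound from the tips above


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Return the duration of an uncompressed WAV recording in seconds."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_short_enough(wav_bytes: bytes, limit: float = MAX_SECONDS) -> bool:
    """Return True when the recording is within the recommended length."""
    return wav_duration_seconds(wav_bytes) <= limit
```

Compressed formats such as MP3 or M4A would need a decoder (for example librosa) to measure duration.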

## Troubleshooting

### Common Issues

1. **Microphone Not Working**:
   - Check browser permissions for microphone access
   - Use the file upload option as fallback
   - Ensure audio-recorder-streamlit is properly installed

2. **Audio Quality Issues**:
   - Use a quiet environment
   - Speak clearly and at normal pace
   - Check microphone quality

3. **TTS Not Working**:
   - Verify that the Kokoro-82M model is in `tts-model/Kokoro-82M/`
   - Check audio player compatibility in the browser
   - Ensure `scipy`, `librosa`, and `soundfile` are installed

4. **Import Errors**:
   - Run the installation scripts
   - Manually install missing packages
   - Check virtual environment activation

### Model Paths
Ensure the following model directories exist:
- Speech-to-Text: `stt-model/whisper-tiny/`
- Text-to-Speech: `tts-model/Kokoro-82M/`
- Main AI Model: `model/Whisper-psychology-gemma-3-1b/`
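
A quick way to verify these directories is a path check run from the project root. This is a sketch under the assumption that the paths are relative to the directory the app is started from; `MODEL_DIRS` and `missing_model_dirs` are hypothetical names.

```python
from pathlib import Path

# Directory names copied from the list above.
MODEL_DIRS = {
    "Speech-to-Text": "stt-model/whisper-tiny/",
    "Text-to-Speech": "tts-model/Kokoro-82M/",
    "Main AI Model": "model/Whisper-psychology-gemma-3-1b/",
}


def missing_model_dirs(root: str = ".") -> list[str]:
    """Return the labels of model directories that do not exist under `root`."""
    base = Path(root)
    return [label for label, rel in MODEL_DIRS.items() if not (base / rel).is_dir()]


if __name__ == "__main__":
    for label in missing_model_dirs():
        print(f"Missing model directory for {label}: {MODEL_DIRS[label]}")
```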

## Browser Compatibility

### Recommended Browsers
- Chrome (best support for audio features)
- Firefox
- Edge
- Safari (may have limited microphone support)

### Required Permissions
- Microphone access for voice recording
- Audio playback for TTS responses

## Future Enhancements

### Planned Features
- Voice activity detection for hands-free operation
- Multiple voice options for TTS
- Real-time streaming transcription
- Noise cancellation for better STT accuracy
- Custom wake words for voice activation

### Performance Optimizations
- Model quantization for faster inference
- Audio preprocessing optimization
- Caching for frequently used TTS phrases
- Background audio processing

## Support

For issues or questions:
1. Check the troubleshooting section above
2. Verify all dependencies are installed
3. Test with simple audio files first
4. Check browser console for error messages

## Version Information
- **Version**: 2.0 (Audio Features)
- **Added**: Speech-to-Text and Text-to-Speech capabilities
- **Base Version**: 1.0 (Text-only chat interface)