	---
	title: voice-assistant
	app_file: gradio_app.py
	sdk: gradio
	sdk_version: 5.29.1
	---
	# Real-time Conversational AI Chatbot Backend

	This project implements a Python-based backend for a real-time conversational AI chatbot. It features Speech-to-Text (STT), Language Model (LLM) processing via Google's Gemini API, and streaming Text-to-Speech (TTS) capabilities, all orchestrated through a FastAPI web server with WebSocket support for interactive conversations.

	## Core Features

	- Speech-to-Text (STT): Uses OpenAI's Whisper model to transcribe the user's spoken audio into text.
	- Language Model (LLM): Integrates with Google's Gemini API (e.g., `gemini-1.5-flash-latest`) for generating intelligent and contextual responses.
	- Text-to-Speech (TTS) with Streaming: Employs AI4Bharat's IndicParler-TTS model (via `parler-tts` library) with `ParlerTTSStreamer` to convert the LLM's text response into audible speech, streamed chunk by chunk for faster time-to-first-audio.
	- Real-time Interaction: A WebSocket endpoint (`/ws/conversation`) manages the live, bidirectional flow of audio and text data between the client and server.
	- Component Testing: Includes individual HTTP RESTful endpoints for testing STT, LLM, and TTS functionalities separately.
	- Basic Client Demo: Provides a simple HTML/JavaScript client served at the root (`/`) for demonstrating the WebSocket conversation flow.

	## Technologies Used

	- Backend Framework: FastAPI
	- ASR (STT): OpenAI Whisper
	- LLM: Google Gemini API (via `google-generativeai` SDK)
	- TTS: AI4Bharat IndicParler-TTS (via `parler-tts` and `transformers`)
	- Audio Processing: `soundfile`, `librosa`
	- Async & Concurrency: `asyncio`, `threading` (for ParlerTTSStreamer)
	- ML/DL: PyTorch
	- Web Server: Uvicorn
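
Pairing `asyncio` with `threading`, as listed above, usually means a blocking producer (here, the TTS streamer) runs in a worker thread while the async server consumes its output. The following is a minimal standard-library sketch of that hand-off, not the project's actual code: `stream_from_thread` and `produce_chunks` are hypothetical names, and the real implementation may be structured differently.

```python
import asyncio
import threading


async def stream_from_thread(produce_chunks, cancellation_event):
    """Yield items from a blocking iterator that runs in a worker thread.

    `produce_chunks` is a hypothetical zero-argument callable returning an
    iterator of audio chunks; `cancellation_event` is a threading.Event.
    """
    loop = asyncio.get_running_loop()
    queue = asyncio.Queue()
    done = object()  # sentinel marking the end of the stream

    def post(item):
        # asyncio.Queue is not thread-safe, so hand items over via the loop.
        try:
            loop.call_soon_threadsafe(queue.put_nowait, item)
        except RuntimeError:
            pass  # event loop already closed; nobody is listening

    def worker():
        try:
            for chunk in produce_chunks():
                if cancellation_event.is_set():
                    break  # consumer went away; stop producing
                post(chunk)
        finally:
            post(done)

    threading.Thread(target=worker, daemon=True).start()
    try:
        while True:
            item = await queue.get()
            if item is done:
                break
            yield item
    finally:
        # Runs on normal completion and when the consumer closes the generator.
        cancellation_event.set()
```

A consumer that stops early should close the generator explicitly (`await gen.aclose()`) so the `finally` clause runs promptly and the worker thread sees the event.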

	## Setup and Installation

	1. Clone the Repository (if applicable)

	```bash
	git clone <your-repo-url>
	cd <your-repo-name>
	```

	2. Create a Python Virtual Environment

	- Using `venv`:
	```bash
	python -m venv venv
	source venv/bin/activate # On Windows: venv\Scripts\activate
	```
	- Or using `conda`:
	```bash
	conda create -n voicebot_env python=3.10 # Or your preferred Python 3.9+
	conda activate voicebot_env
	```

	3. Install Dependencies

	```bash
	pip install -r requirements.txt
	```

	Ensure `ffmpeg` is installed on your system, as Whisper requires it (e.g., `sudo apt update && sudo apt install ffmpeg` on Debian/Ubuntu).

	4. Set Environment Variables
	- Gemini API Key: Obtain an API key from [Google AI Studio](https://aistudio.google.com/). Set it as an environment variable:
	```bash
	export GEMINI_API_KEY="YOUR_ACTUAL_GEMINI_API_KEY"
	```
	(For Windows PowerShell: `$env:GEMINI_API_KEY="YOUR_ACTUAL_GEMINI_API_KEY"`)
	- (Optional) Whisper Model Size:
	```bash
	export WHISPER_MODEL_SIZE="base" # (e.g., tiny, base, small, medium, large)
	```
	Defaults to "base" if not set.
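
A small preflight check can confirm the setup steps above before the server starts. This is a sketch under assumptions: `load_settings` and `ffmpeg_available` are hypothetical helpers, not part of this project, and the project itself may read these variables differently.

```python
import os
import shutil


def ffmpeg_available() -> bool:
    """Return True if an `ffmpeg` executable is found on PATH."""
    return shutil.which("ffmpeg") is not None


def load_settings(env=None) -> dict:
    """Read the two variables described above from a mapping.

    `env` defaults to os.environ; passing a plain dict makes testing easy.
    """
    env = os.environ if env is None else env

    api_key = env.get("GEMINI_API_KEY", "").strip()
    if not api_key:
        # Fail early with a clear message rather than on the first LLM call.
        raise RuntimeError("GEMINI_API_KEY is not set")

    # Fall back to "base" when the variable is unset or empty.
    model_size = env.get("WHISPER_MODEL_SIZE", "").strip() or "base"

    return {"gemini_api_key": api_key, "whisper_model_size": model_size}
```

Calling `load_settings()` with no arguments reads the real environment, so it can be run once at startup alongside `ffmpeg_available()`.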

	### HTTP RESTful Endpoints

	These are standard FastAPI path operations for testing individual components:

	- `POST /api/stt`: Upload an audio file to get its transcription.
	- `POST /api/llm`: Send text in a JSON payload to get a response from Gemini.
	- `POST /api/tts`: Send text in a JSON payload to get synthesized audio. This HTTP endpoint is non-streaming and returns a base64-encoded WAV.
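
On the client side, the base64-encoded WAV returned by `/api/tts` has to be decoded before it can be played or saved. The sketch below shows one way to do that with the standard library. The response key `audio_base64` is an assumption (the actual field name is not documented here), and the `wave` module only reads PCM WAV data, so a float-encoded file would need a different reader.

```python
import base64
import io
import json
import wave


def decode_tts_response(body: str, field: str = "audio_base64") -> dict:
    """Decode a JSON body carrying a base64 WAV and report its basic properties.

    `field` is a hypothetical key name; check the real response schema.
    """
    payload = json.loads(body)
    wav_bytes = base64.b64decode(payload[field])

    # Parse the WAV header from memory to recover the audio parameters.
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return {
            "wav_bytes": wav_bytes,
            "sample_rate": wav.getframerate(),
            "channels": wav.getnchannels(),
            "num_frames": wav.getnframes(),
        }
```

The returned `wav_bytes` can be written to a `.wav` file unchanged.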

	### WebSocket Endpoint: `/ws/conversation`

	This is the primary endpoint for real-time, bidirectional conversational interaction:

	- `@app.websocket("/ws/conversation")` defines the WebSocket route.
	- Connection Handling: Accepts new WebSocket connections.
	- Main Interaction Loop:
	  1. Receive Audio: Waits to receive audio data (bytes) from the client (`await websocket.receive_bytes()`).
	  2. STT: Calls `transcribe_audio_bytes()` to get text from the user's audio. Sends `USER_TRANSCRIPT: <text>` back to the client.
	  3. LLM: Calls `generate_gemini_response()` with the transcribed text. Sends `ASSISTANT_RESPONSE_TEXT: <text>` back to the client.
	  4. Streaming TTS:
	     - Sends a `TTS_STREAM_START: {<audio_params>}` message to the client, informing it about the sample rate, channels, and bit depth of the upcoming audio stream.
	     - Iterates through the `synthesize_speech_streaming()` asynchronous generator.
	     - For each `audio_chunk_bytes` yielded, it sends these raw audio bytes to the client using `await websocket.send_bytes()`.
	     - If `websocket.send_bytes()` fails (e.g., client disconnected), the loop breaks, and the `cancellation_event` is set to signal the TTS thread.
	     - After the stream is complete (or cancelled), it sends a `TTS_STREAM_END` message.
	- Error Handling: Includes `try...except WebSocketDisconnect` to handle client disconnections gracefully and a general exception handler.
	- Cleanup: The `finally` block ensures the `cancellation_event` for TTS is set and attempts to close the WebSocket.
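
From the client's point of view, the flow above is a mix of text frames (the prefixed messages) and binary frames (the audio chunks). The sketch below shows how a client might interpret them. It assumes the `TTS_STREAM_START` payload is JSON; the helper names and the parameter keys used in the usage example (`sample_rate`, `channels`, `bit_depth`) are illustrative and not confirmed by the server code.

```python
import json


def parse_server_message(message: str):
    """Split a text frame into (kind, payload) based on its prefix."""
    kind, _, rest = message.partition(":")
    kind, rest = kind.strip(), rest.strip()
    if kind == "TTS_STREAM_START":
        return kind, json.loads(rest)  # assumed to be a JSON object
    if kind == "TTS_STREAM_END":
        return kind, None
    if kind in ("USER_TRANSCRIPT", "ASSISTANT_RESPONSE_TEXT"):
        return kind, rest
    return "UNKNOWN", message


def collect_audio(frames):
    """Gather binary chunks that arrive between stream start and stream end.

    `frames` is any iterable of str (text frames) and bytes (binary frames).
    """
    params, chunks, streaming = None, [], False
    for frame in frames:
        if isinstance(frame, bytes):
            if streaming:
                chunks.append(frame)
            continue
        kind, payload = parse_server_message(frame)
        if kind == "TTS_STREAM_START":
            params, streaming = payload, True
        elif kind == "TTS_STREAM_END":
            break
    return params, b"".join(chunks)
```

For example, feeding it one turn of a conversation:

```python
frames = [
    "USER_TRANSCRIPT: hello",
    "ASSISTANT_RESPONSE_TEXT: Hi there",
    'TTS_STREAM_START: {"sample_rate": 44100, "channels": 1, "bit_depth": 16}',
    b"\x01\x02",
    b"\x03\x04",
    "TTS_STREAM_END",
]
params, audio = collect_audio(frames)
```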

	## How to Run

	1. Ensure all setup steps (environment, dependencies, API key) are complete.
	2. Execute the script:
	```bash
	python main.py
	```
	Or, for development with auto-reload:
	```bash
	uvicorn main:app --reload --host 0.0.0.0 --port 8000
	```
	3. The server will start, and you should see logs indicating that models are being loaded.