---
title: Voice Cloning
emoji: "πŸ—£οΈ"
colorFrom: purple
colorTo: pink
sdk: gradio
sdk_version: "3.0"
app_file: app.py
---
# Voice Cloning

This Space allows users to clone voices using a pre-trained model. Upload a reference audio file, type your text, and hear the result!

**Usage Instructions:**
1. Upload your reference voice file  
2. Enter text to synthesize  
3. Click **Submit** and listen to the cloned voice output

**Notes:**
- Requires moderate CPU resources; synthesis on CPU is slow.
- For faster performance, consider enabling GPU hardware under the Space's **Settings**.



# XTTS v2 Voice Cloning Demo (Coqui TTS)

This demo clones a speaker's voice from a short reference sample and synthesizes text in multiple languages using the XTTS v2 model.

Contents:
- `clone_voice.py` β€” CLI script to run voice cloning

Requirements:
- Python 3.9–3.11 recommended
- Windows, macOS, or Linux

## 1) Setup (recommended: virtual environment)

Windows (PowerShell):
```
python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
```

macOS/Linux (bash):
```
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
```

### CPU-only install
```
pip install TTS
```

### GPU (CUDA) install (Windows/Linux)
1) Install a CUDA-enabled PyTorch build compatible with your CUDA version. Example for CUDA 12.1:
```
pip install --index-url https://download.pytorch.org/whl/cu121 torch torchvision torchaudio
```
2) Then install Coqui TTS:
```
pip install TTS
```
3) Verify CUDA availability (optional):
```
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
```
If this prints False when you expected True, you likely installed a CPU-only PyTorch wheel or one built for a mismatched CUDA version.
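In a script, the same device choice can be automated. A minimal sketch (illustrative only; `clone_voice.py`'s actual logic may differ) that prefers CUDA and falls back to CPU, even on machines where PyTorch is not importable:

```python
def select_device() -> str:
    """Return "cuda" when a CUDA-enabled PyTorch build sees a GPU, else "cpu"."""
    try:
        import torch
        if torch.cuda.is_available():
            return "cuda"
    except ImportError:
        # PyTorch not installed at all; fall back to CPU.
        pass
    return "cpu"

print("Using device:", select_device())
```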

## 2) Prepare a reference voice sample
- Short clip: 6–15 seconds is usually enough.
- Clean speech, minimal background noise, no music.
- Mono WAV (16–48 kHz recommended). Many formats work, but WAV is safest.
- Place the file in this folder, e.g., `reference_voice.wav`.
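If you want to sanity-check a sample before running the demo, the standard-library `wave` module can verify the points above. A rough sketch (the thresholds follow the recommendations in this list, not anything XTTS itself enforces):

```python
import wave

def check_reference(path: str) -> list[str]:
    """Return a list of warnings about a reference WAV; empty means it looks OK."""
    warnings = []
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    if channels != 1:
        warnings.append(f"{channels} channels; mono is recommended")
    if not 16_000 <= rate <= 48_000:
        warnings.append(f"sample rate {rate} Hz; 16-48 kHz is recommended")
    if not 6 <= duration <= 15:
        warnings.append(f"duration {duration:.1f} s; 6-15 s is usually enough")
    return warnings
```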

## 3) Run the demo
From this `demotask` directory:

CPU:
```
python clone_voice.py --text "Ok signore, l'ho completato e qui ci sono i file WAV di riferimento." --speaker_wav "reference1.wav" --language it --output "output_it.wav" --device cpu
```

On first run, the model `tts_models/multilingual/multi-dataset/xtts_v2` is downloaded automatically. The result is saved to the path given by `--output` (here `output_it.wav`).

Common language codes: `en`, `it`, `es`, `fr`, `de`, `pt`, `pl`, `nl`, `tr`, `ru`, `zh`, `ja`, `ko`.
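A wrapper script could fail fast on an unsupported code before loading the model. A hypothetical guard (the set simply mirrors the list above; the loaded model also reports its supported languages at runtime):

```python
# Codes listed above; a real script could consult the loaded model instead.
XTTS_LANGUAGES = {"en", "it", "es", "fr", "de", "pt", "pl",
                  "nl", "tr", "ru", "zh", "ja", "ko"}

def validate_language(code: str) -> str:
    """Normalize a language code and raise if it is not in the supported set."""
    code = code.strip().lower()
    if code not in XTTS_LANGUAGES:
        raise ValueError(f"unsupported language {code!r}; "
                         f"choose one of {sorted(XTTS_LANGUAGES)}")
    return code
```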

## 4) Troubleshooting
- CUDA not used: Ensure you installed a CUDA-enabled PyTorch (see above) and your GPU drivers/CUDA runtime are installed. Then use `--device cuda`.
- Out of memory (OOM): Try CPU mode or shorter text; ensure no other GPU-heavy apps are running.
- Reference file not found: Check the `--speaker_wav` path.
- Bad audio quality: Use a cleaner/longer reference sample, reduce background noise, and avoid clipping. Try 16 kHz or 22.05/24/44.1 kHz mono WAV.
- Slow on CPU: This is expected. GPU is recommended for speed.
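For the OOM and slow-CPU cases, one common workaround is to synthesize long text sentence by sentence and join the resulting WAVs afterwards. A minimal, hypothetical splitter (not part of `clone_voice.py`):

```python
import re

def split_text(text: str, max_chars: int = 250) -> list[str]:
    """Split text at sentence boundaries into chunks of at most max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to the script separately, keeping peak memory per synthesis call small.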

## 5) Notes
- This script auto-selects CUDA if available when `--device` is not provided.
- For repeatable environments, consider pinning versions in a `requirements.txt`.
- Model: `tts_models/multilingual/multi-dataset/xtts_v2`.
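A starting point for such a `requirements.txt` (the pins shown are illustrative; replace them with the versions you actually tested):

```
TTS==0.22.0
torch==2.1.2
torchaudio==2.1.2
```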