Custom bark models
Hi,
how do I use your files with bark?
thanks!
Like this:
import json
import logging
import os
from typing import Any

import torch

logger = logging.getLogger(__name__)


def _clear_cuda_cache() -> None:
    """Free cached GPU memory if CUDA is available."""
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


def _load_filtered_model(
    config_path: str,
    weights_path: str,
    device: str = "cpu",
) -> Any:
    """
    Load a model from separate config and weights files.

    This is the replacement for _load_model_pth that works with the split files.

    Args:
        config_path: Path to the config JSON file
        weights_path: Path to the weights file (.pth or .safetensors)
        device: Device to load the model onto ("cpu", "cuda", etc.)

    Returns:
        The loaded model, or a dict containing model and tokenizer if applicable
    """
    # Check that both files exist before doing any work
    if not os.path.exists(config_path):
        raise FileNotFoundError(f"Config file not found: {config_path}")
    if not os.path.exists(weights_path):
        raise FileNotFoundError(f"Weights file not found: {weights_path}")

    # Load the config
    logger.info(f"Loading model config from {config_path}")
    with open(config_path, "r") as f:
        config_data = json.load(f)

    # Extract the model args
    model_args = config_data["model_config"]
    model_type = config_data.get("model_type", "text")

    # This is just an example - you'd need to adjust based on actual imports
    if model_type == "text":
        from bark.model import GPTConfig as ConfigClass, GPT as ModelClass
    elif model_type == "coarse":
        from bark.model import GPTConfig as ConfigClass, GPT as ModelClass
    elif model_type == "fine":
        from bark.model_fine import (
            FineGPTConfig as ConfigClass,
            FineGPT as ModelClass,
        )
    else:
        raise NotImplementedError(f"Model type {model_type} not supported")

    # Initialize the model with the config
    logger.info(f"Initializing {model_type} model")
    gptconf = ConfigClass(**model_args)
    model = ModelClass(gptconf)
    model.to(device)

    # Load the state dict
    logger.info(f"Loading model weights from {weights_path}")
    if weights_path.endswith(".safetensors"):
        from safetensors.torch import load_file

        state_dict = load_file(weights_path, device=device)
    else:
        state_dict = torch.load(weights_path, map_location=device)

    # Load the state dict into the model
    model.load_state_dict(state_dict, strict=False)
    if config_data.get("torch_dtype", "float32") == "bfloat16":
        model.bfloat16()

    # Log model stats
    n_params = sum(p.numel() for p in model.parameters())
    val_loss = model_args.get("best_val_loss", "N/A")
    logger.info(f"Model loaded: {round(n_params / 1e6, 1)}M params, {val_loss} loss")

    # Set the model to eval mode
    model.eval()

    # Clear memory
    del state_dict
    _clear_cuda_cache()

    # Special handling for text models that need tokenizers
    if model_type == "text":
        from transformers import BertTokenizer

        tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
        return {
            "model": model,
            "tokenizer": tokenizer,
        }
    return model


def _load_models(ckpt_dir: str, device: str = "cuda"):
    # ckpt_dir is used as a plain string prefix (not joined with os.path.join)
    extension = ".pth" if ckpt_dir.endswith("-pth") else ".safetensors"
    model_text_big = _load_filtered_model(
        ckpt_dir + "text_model_config.json",
        ckpt_dir + "text" + extension,
        device=device,
    )
    model_coarse_big = _load_filtered_model(
        ckpt_dir + "coarse_model_config.json",
        ckpt_dir + "coarse" + extension,
        device=device,
    )
    model_fine_big = _load_filtered_model(
        ckpt_dir + "fine_model_config.json",
        ckpt_dir + "fine" + extension,
        device=device,
    )
    return model_text_big, model_coarse_big, model_fine_big


def load_models_into_bark(ckpt_dir: str):
    model_text_big, model_coarse_big, model_fine_big = _load_models(ckpt_dir)
    # Overwrite the model cache that bark.generation keeps internally
    from bark.generation import models

    models["text"] = model_text_big
    models["coarse"] = model_coarse_big
    models["fine"] = model_fine_big
    _clear_cuda_cache()
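A minimal usage sketch (the checkpoint directory is a placeholder and should end with "/" since it is used as a plain string prefix above; the playback part follows the standard bark README):

# Hypothetical checkpoint directory holding the split config/weight files
load_models_into_bark("./bark-big-bf16/")

from bark import SAMPLE_RATE, generate_audio
from scipy.io.wavfile import write as write_wav

audio_array = generate_audio("Hello, this is a test.")
write_wav("out.wav", SAMPLE_RATE, audio_array)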
I can offer additional support and debugging on the TTS WebUI Discord server: https://discord.gg/V8BKTVRtJ9
thanks! But do your models still support bark's languages?
I wonder what you removed from the original models, and whether it's a new training run
or just a pruned version of the original. thanks!
Nothing was removed; it is still the original. git mylo alerted me to the fact that the original files were bloated. And because I had also done very basic (FP32-BF16) quantization on bark, I realized I could upload these for general use.
Basically, if you occasionally work with bark but want to use about 70% less disk space, these are for you.
I like bark for its deadpan simplicity: three chained GPTs, which makes it easy to research and experiment on.
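For reference, the FP32-to-BF16 conversion mentioned above takes only a few lines. A minimal sketch, assuming the original .pt checkpoint wraps the weights under a "model" key (file names are placeholders):

import torch
from safetensors.torch import save_file

# Load the original FP32 checkpoint (placeholder file name)
ckpt = torch.load("text_2.pt", map_location="cpu")
state_dict = ckpt["model"]

# Cast floating-point tensors to bfloat16; integer buffers stay as they are
state_dict = {
    k: v.bfloat16() if v.is_floating_point() else v
    for k, v in state_dict.items()
}

# Note: safetensors refuses tied/shared tensors, so those may need de-duplicating
save_file(state_dict, "text.safetensors")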
OK, understood. Yes, bark is nice, but it's very unstable when it comes to getting rid of weird voice and sound behavior! I've been fighting for days to get "normal" speech.
Do you think we can train more bark compatible models?
We could, but I do not think we should. We have better and more efficient methods now. For example, encodec is no longer the best audio encoder, so we could train a model that predicts better tokens than encodec does; such tokens would improve either speed or quality.
Additionally, you'd want a model that can do parallel decoding: instead of first generating all of the semantic tokens, then the audio tokens, and then the refined audio tokens, you'd generate 1 semantic token -> 1 audio token -> 1 refined token. This would increase speed significantly in cases where one model is too small to use the entire GPU (and bark's models are too small), as the sketch below illustrates.
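A toy sketch of the two control flows (the three stage functions are hypothetical stand-ins for the real models, which also don't map tokens 1:1 in practice):

# Stand-ins for the three bark stages (hypothetical)
def next_semantic(step: int) -> int:
    return step

def next_coarse(semantic_token: int) -> int:
    return semantic_token * 2

def next_fine(coarse_token: int) -> int:
    return coarse_token + 1

n_steps = 8

# Sequential, as bark works today: each stage finishes completely
# before the next one starts
semantic = [next_semantic(t) for t in range(n_steps)]
coarse = [next_coarse(s) for s in semantic]
fine = [next_fine(c) for c in coarse]

# Interleaved: every new semantic token flows through the other two
# stages immediately, so all three models can be kept busy at once
fine_interleaved = []
for t in range(n_steps):
    s = next_semantic(t)
    c = next_coarse(s)
    fine_interleaved.append(next_fine(c))

assert fine == fine_interleaved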
I'd rather suggest adding LoRA, ControlNets, or fine-tunes. Bark is able to generate very interesting audio; however, it does so randomly. Having control over it would make it a much more interesting model.
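For context, LoRA only needs a thin trainable low-rank wrapper around existing Linear layers. A minimal sketch, not tied to bark's actual module names:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen Linear layer with a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the original weights
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus the scaled low-rank update B @ A
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scale

You would swap this in for, e.g., the attention projections of bark's GPT blocks and train only lora_a and lora_b.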
Fine-tuning can also be used to reduce the size of the model. If you have a well-functioning fine-tuning pipeline, you can combine it with quantization to get a smaller, potentially faster model that still produces the same results.
For one, bark has a fairly large embedding size that could be reduced. With a vocabulary of around 10,000 tokens, it could be optimized for a single language and thus be boosted in speed. And although the non-English languages have reduced quality, foreign-language TTS is still rare even nowadays, so it warrants interest.
Thanks for all the details. I do indeed need all of bark's supported languages.
For now I would like to use your bark models with coqui-tts, but they don't seem to be accepted even when I change the default checksum and file name in the bark config.
I don't have a clue why your models don't work with coqui-tts bark. I tried all of them, and despite all the settings I changed to accept third-party models, nothing worked. Raging...
Any idea how to use a language other than English?
I tried putting [GERMAN] or [CHINESE] at the start of each sentence, but the result is no words, just funky sounds.
thanks
Usually just writing the text in the language you want is enough, ideally combined with a voice preset in that language.
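For example, with the standard bark API (assuming the models are already loaded as above; the preset name is one of the v2 speaker presets that ship with bark):

from bark import generate_audio

# Write the text directly in the target language; optionally pick a
# matching voice preset (bark ships presets like v2/de_speaker_0..9)
audio_array = generate_audio(
    "Hallo, wie geht es dir heute?",
    history_prompt="v2/de_speaker_3",
)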
I will get back to you once I launch chatterbox updates, which might still take a few days.
ok thanks! I'm afraid the issue comes from the bark model coqui-tts is using (erogol/bark).
I'm very curious how your version works with the 17 supported languages.
I really suspect that erogol/bark, suno/bark and the others are all the same model. I have heard some rumors, but I haven't actually seen any non-original barks.
Therefore, the model should be the same on CoquiTTS, here and on this demo page:
https://huggingface.co/spaces/suno/bark