Custom bark models
Hi,
how do I use your files with bark?
thanks!
Like this:
import json
import logging
import os
from typing import Any

import torch

logger = logging.getLogger(__name__)


def _clear_cuda_cache() -> None:
    """Free cached GPU memory if CUDA is available."""
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


def _load_filtered_model(
    config_path: str,
    weights_path: str,
    device: str = "cpu",
) -> Any:
    """
    Load a model from separate config and weights files.

    This is the replacement for _load_model_pth that works with the split files.

    Args:
        config_path: Path to the config JSON file
        weights_path: Path to the weights file (.pth or .safetensors)
        device: Device to load the model onto ("cpu", "cuda", etc.)

    Returns:
        The loaded model, or a dict containing model and tokenizer if applicable
    """
    # Check that both files exist before doing any work
    if not os.path.exists(config_path):
        raise FileNotFoundError(f"Config file not found: {config_path}")
    if not os.path.exists(weights_path):
        raise FileNotFoundError(f"Weights file not found: {weights_path}")

    # Load the config
    logger.info(f"Loading model config from {config_path}")
    with open(config_path, "r") as f:
        config_data = json.load(f)

    # Extract the model args
    model_args = config_data["model_config"]
    model_type = config_data.get("model_type", "text")

    # This is just an example - you'd need to adjust based on actual imports
    if model_type == "text":
        from bark.model import GPTConfig as ConfigClass, GPT as ModelClass
    elif model_type == "coarse":
        from bark.model import GPTConfig as ConfigClass, GPT as ModelClass
    elif model_type == "fine":
        from bark.model_fine import (
            FineGPTConfig as ConfigClass,
            FineGPT as ModelClass,
        )
    else:
        raise NotImplementedError(f"Model type {model_type} not supported")

    # Initialize the model with the config
    logger.info(f"Initializing {model_type} model")
    gptconf = ConfigClass(**model_args)
    model = ModelClass(gptconf)
    model.to(device)

    # Load the state dict
    logger.info(f"Loading model weights from {weights_path}")
    if weights_path.endswith(".safetensors"):
        from safetensors.torch import load_file

        state_dict = load_file(weights_path, device=device)
    else:
        state_dict = torch.load(weights_path, map_location=device)

    # Load the state dict into the model
    model.load_state_dict(state_dict, strict=False)
    if config_data.get("torch_dtype", "float32") == "bfloat16":
        model.bfloat16()

    # Log model stats
    n_params = sum(p.numel() for p in model.parameters())
    val_loss = model_args.get("best_val_loss", "N/A")
    logger.info(f"Model loaded: {round(n_params / 1e6, 1)}M params, {val_loss} loss")

    # Set the model to eval mode
    model.eval()

    # Clear memory
    del state_dict
    _clear_cuda_cache()

    # Special handling for text models that need tokenizers
    if model_type == "text":
        from transformers import BertTokenizer

        tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
        return {
            "model": model,
            "tokenizer": tokenizer,
        }
    return model


def _load_models(ckpt_dir: str, device: str = "cuda"):
    # ckpt_dir is used as a plain string prefix (not joined with os.path.join)
    extension = ".pth" if ckpt_dir.endswith("-pth") else ".safetensors"
    model_text_big = _load_filtered_model(
        ckpt_dir + "text_model_config.json",
        ckpt_dir + "text" + extension,
        device=device,
    )
    model_coarse_big = _load_filtered_model(
        ckpt_dir + "coarse_model_config.json",
        ckpt_dir + "coarse" + extension,
        device=device,
    )
    model_fine_big = _load_filtered_model(
        ckpt_dir + "fine_model_config.json",
        ckpt_dir + "fine" + extension,
        device=device,
    )
    return model_text_big, model_coarse_big, model_fine_big


def load_models_into_bark(ckpt_dir: str):
    model_text_big, model_coarse_big, model_fine_big = _load_models(ckpt_dir)
    # Overwrite the model cache that bark.generation keeps internally
    from bark.generation import models

    models["text"] = model_text_big
    models["coarse"] = model_coarse_big
    models["fine"] = model_fine_big
    _clear_cuda_cache()
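A minimal usage sketch (the checkpoint directory is a placeholder and should end with "/" since it is used as a plain string prefix above; the playback part follows the standard bark README):

# Hypothetical checkpoint directory holding the split config/weight files
load_models_into_bark("./bark-big-bf16/")

from bark import SAMPLE_RATE, generate_audio
from scipy.io.wavfile import write as write_wav

audio_array = generate_audio("Hello, this is a test.")
write_wav("out.wav", SAMPLE_RATE, audio_array)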
I can offer additional support and debugging on the TTS WebUI Discord server: https://discord.gg/V8BKTVRtJ9
thanks! But do your models still support bark's languages?
I wonder what you removed from the original models, and whether it's a new training run
or just a pruned version of the original. thanks!
Nothing was removed; it is still the original. git mylo alerted me to the fact that the original files were bloated. And because I had also done very basic (FP32-BF16) quantization on bark, I realized I could upload these for general use.
Basically, if you occasionally work with bark but want to use about 70% less disk space, these are for you.
I like bark for its deadpan simplicity: three chained GPTs, which makes it easy to research and experiment on.
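For reference, the FP32-to-BF16 conversion mentioned above takes only a few lines. A minimal sketch, assuming the original .pt checkpoint wraps the weights under a "model" key (file names are placeholders):

import torch
from safetensors.torch import save_file

# Load the original FP32 checkpoint (placeholder file name)
ckpt = torch.load("text_2.pt", map_location="cpu")
state_dict = ckpt["model"]

# Cast floating-point tensors to bfloat16; integer buffers stay as they are
state_dict = {
    k: v.bfloat16() if v.is_floating_point() else v
    for k, v in state_dict.items()
}

# Note: safetensors refuses tied/shared tensors, so those may need de-duplicating
save_file(state_dict, "text.safetensors")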
OK, understood. Yes, bark is nice, but it's very unstable when it comes to getting rid of weird voice and sound behavior! I've been fighting for days to get "normal" speech.
Do you think we can train more bark compatible models?
We could, but I do not think we should. We have better and more efficient methods now. For example, encodec is no longer the best audio encoder, so we could train a model that predicts better tokens than encodec does; such tokens would improve either speed or quality.
Additionally, you'd want a model that can do parallel decoding: instead of first generating all of the semantic tokens, then the audio tokens, and then the refined audio tokens, you'd generate 1 semantic token -> 1 audio token -> 1 refined token. This would increase speed significantly in cases where one model is too small to use the entire GPU (and bark's models are too small), as the sketch below illustrates.
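A toy sketch of the two control flows (the three stage functions are hypothetical stand-ins for the real models, which also don't map tokens 1:1 in practice):

# Stand-ins for the three bark stages (hypothetical)
def next_semantic(step: int) -> int:
    return step

def next_coarse(semantic_token: int) -> int:
    return semantic_token * 2

def next_fine(coarse_token: int) -> int:
    return coarse_token + 1

n_steps = 8

# Sequential, as bark works today: each stage finishes completely
# before the next one starts
semantic = [next_semantic(t) for t in range(n_steps)]
coarse = [next_coarse(s) for s in semantic]
fine = [next_fine(c) for c in coarse]

# Interleaved: every new semantic token flows through the other two
# stages immediately, so all three models can be kept busy at once
fine_interleaved = []
for t in range(n_steps):
    s = next_semantic(t)
    c = next_coarse(s)
    fine_interleaved.append(next_fine(c))

assert fine == fine_interleaved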
I'd rather suggest adding LoRA, ControlNets, or fine-tunes. Bark is able to generate very interesting audio; however, it does so randomly. Having control over it would make it a much more interesting model.
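For context, LoRA only needs a thin trainable low-rank wrapper around existing Linear layers. A minimal sketch, not tied to bark's actual module names:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen Linear layer with a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the original weights
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus the scaled low-rank update B @ A
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scale

You would swap this in for, e.g., the attention projections of bark's GPT blocks and train only lora_a and lora_b.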
Fine-tuning can also be used to reduce the size of the model. If you have a well-functioning fine-tuning pipeline, you can combine it with quantization to get a smaller, potentially faster model that still produces the same results.
For one, bark has a fairly large embedding size that could be reduced. With a vocabulary of around 10,000 tokens, it could be optimized for a single language and thus be boosted in speed. And although the non-English languages have reduced quality, foreign-language TTS is still rare even nowadays, so it warrants interest.
Thanks for all the details. I do indeed need all of bark's supported languages.
For now I would like to use your bark models with coqui-tts, but they don't seem to be accepted even when I change the default checksum and file name in the bark config.
I don't have a clue why your models don't work with coqui-tts bark. I tried all of them, and despite all the settings I changed to accept third-party models, nothing worked. Raging...
Any idea how to use a language other than English?
I tried putting [GERMAN] or [CHINESE] at the start of each sentence, but the result is no words, just funky sounds.
thanks
Usually just writing the text in the language you want is enough, ideally combined with a voice preset in that language.
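For example, with the standard bark API (assuming the models are already loaded as above; the preset name is one of the v2 speaker presets that ship with bark):

from bark import generate_audio

# Write the text directly in the target language; optionally pick a
# matching voice preset (bark ships presets like v2/de_speaker_0..9)
audio_array = generate_audio(
    "Hallo, wie geht es dir heute?",
    history_prompt="v2/de_speaker_3",
)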
I will get back to you once I launch chatterbox updates, which might still take a few days.
ok thanks! I'm afraid the issue comes from the bark model coqui-tts is using (erogol/bark).
I'm very curious how your version works with the 17 supported languages.
I really suspect that erogol/bark, suno/bark and the others are all the same model. I have heard some rumors, but I haven't actually seen any non-original barks.
Therefore, the model should be the same on CoquiTTS, here and on this demo page:
https://huggingface.co/spaces/suno/bark