TinyLlama-1.1B-Chat — ONNX (FP16)

ONNX export of TinyLlama-1.1B-Chat-v1.0 (1.1B parameters, FP16 weights) with KV cache support for efficient autoregressive generation.

Converted for use with inference4j, an inference-only AI library for Java.

Original Source

TinyLlama/TinyLlama-1.1B-Chat-v1.0 on Hugging Face.

Usage with inference4j

try (var gen = OnnxTextGenerator.tinyLlama().build()) {
    GenerationResult result = gen.generate("What is Java?");
    System.out.println(result.text());
}
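To give a sense of why the KV-cache export matters for autoregressive generation, the sketch below estimates the cache's memory footprint from the architecture figures in the Model Details table (22 layers, 4 KV heads, head dim 2048 / 32 = 64, 2 bytes per FP16 value). This is a back-of-the-envelope estimate using the standard KV-cache sizing formula, not a figure reported by the export itself.

```java
// Rough KV-cache size estimate for TinyLlama-1.1B-Chat (FP16).
// Figures come from the Model Details table below; the formula is the
// usual 2 (K and V) x layers x tokens x kv_heads x head_dim x bytes.
public class KvCacheSize {
    static long kvCacheBytes(int layers, int tokens, int kvHeads,
                             int headDim, int bytesPerValue) {
        return 2L * layers * tokens * kvHeads * headDim * bytesPerValue;
    }

    public static void main(String[] args) {
        int headDim = 2048 / 32; // hidden size / attention heads = 64
        long bytes = kvCacheBytes(22, 2048, 4, headDim, 2);
        System.out.println(bytes + " bytes = " + bytes / (1024 * 1024) + " MiB");
    }
}
```

At the full 2048-token context this works out to about 44 MiB on top of the model weights, and thanks to the 4 KV heads (grouped-query attention) it is an eighth of what a 32-KV-head cache would need.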

Model Details

Property            Value
Architecture        LlamaForCausalLM (1.1B parameters, 22 layers, 2048 hidden, 32 heads, 4 KV heads)
Task                Text generation (instruction-tuned, Zephyr chat template)
Precision           FP16
Context length      2048 tokens
Vocabulary          32,000 tokens (SentencePiece BPE)
Chat template       Zephyr (`<|system|>`, `<|user|>`, `<|assistant|>` markers)
Original framework  PyTorch (transformers)
Export method       Hugging Face Optimum (with KV cache, FP16)
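For callers that build prompts by hand rather than through a chat helper, the Zephyr template the model was tuned on looks roughly like the sketch below. The helper class is hypothetical (not part of inference4j); the marker tokens follow the standard Zephyr-style template used by TinyLlama-1.1B-Chat-v1.0.

```java
// Hypothetical helper illustrating the Zephyr-style chat template:
// each turn is a role marker, the text, and an end-of-sequence token,
// ending with the assistant marker that cues the model to reply.
public class ZephyrTemplate {
    public static String format(String system, String user) {
        return "<|system|>\n" + system + "</s>\n"
             + "<|user|>\n" + user + "</s>\n"
             + "<|assistant|>\n";
    }

    public static void main(String[] args) {
        System.out.println(format("You are a helpful assistant.", "What is Java?"));
    }
}
```

Sending raw, untemplated text to an instruction-tuned checkpoint tends to degrade output quality, so a prompt should normally pass through this template (or an equivalent chat-formatting layer) first.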

License

This model is licensed under the Apache License 2.0. Original model by TinyLlama.
