# Sutra-Instruct-350M
Sutra-Instruct-350M is a custom-built, 350-million-parameter causal language model trained from scratch using a nanoGPT-based architecture.
## Model Architecture & Details

- Architecture: Custom nanoGPT-based Transformer
- Parameter Count: 350M
- Format: `safetensors`
- Embeddings: Tied (`lm_head` and `wte` share memory)
- Creator: Abhiray
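Weight tying means the input embedding table and the output projection are literally the same tensor, which saves `vocab_size * n_embd` parameters. A minimal PyTorch sketch of the idea (the names `wte` and `lm_head` follow nanoGPT's convention; the dimensions here are illustrative, not Sutra's actual config):

```python
import torch.nn as nn

vocab_size, n_embd = 1000, 64  # illustrative sizes, not the real config

wte = nn.Embedding(vocab_size, n_embd)               # token embedding table
lm_head = nn.Linear(n_embd, vocab_size, bias=False)  # output projection

# Tie the weights: both modules now reference one Parameter object,
# so gradients and updates to one are seen by the other.
lm_head.weight = wte.weight

assert lm_head.weight is wte.weight
```

In nanoGPT this tying is done once in the model constructor, so the checkpoint stores the shared matrix only a single time.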
## Training Pipeline
This model was not fine-tuned from an existing corporate base model (like Llama or Mistral). Its weights were randomly initialized and trained through a rigorous two-phase pipeline:
### Phase 1: Pre-Training (The Foundation)

The base logic was built by streaming a highly curated mix of academic and coding datasets:

- `HuggingFaceFW/fineweb-edu` (high-level English and academic structure)
- `open-web-math/open-web-math` (mathematical logic and formatting)
- `bigcode/starcoderdata` (Python syntax and code structure)
- `roneneldan/TinyStories` (basic grammar and narrative flow)
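Streaming a mix like this is commonly done with the `datasets` library's `interleave_datasets`. A sketch under stated assumptions: the sampling weights below are hypothetical (the actual ratios used for Sutra are not published in this card), and the import is kept inside the function so the snippet can be read without the library or a network connection:

```python
# Hypothetical mixture weights -- the actual sampling ratios used for
# Sutra's pre-training are not stated in this card.
MIX_WEIGHTS = {
    "HuggingFaceFW/fineweb-edu": 0.5,
    "open-web-math/open-web-math": 0.2,
    "bigcode/starcoderdata": 0.2,
    "roneneldan/TinyStories": 0.1,
}


def build_pretraining_stream(seed: int = 42):
    """Interleave the four corpora as one streaming dataset (no local download)."""
    from datasets import interleave_datasets, load_dataset  # requires `datasets`

    streams = [
        load_dataset(name, split="train", streaming=True) for name in MIX_WEIGHTS
    ]
    return interleave_datasets(
        streams, probabilities=list(MIX_WEIGHTS.values()), seed=seed
    )
```

Iterating over the returned stream yields examples sampled from each corpus in proportion to its weight.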
### Phase 2: Supervised Fine-Tuning (SFT)

Once the model learned how to speak, it was fine-tuned on the `yahma/alpaca-cleaned` dataset to teach it the standard `Instruction:` / `Response:` conversational format.
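The exact prompt template Sutra was trained on is not reproduced in this card, but `alpaca-cleaned` conventionally uses the original Alpaca layout. A hedged sketch of that standard template (treat the wording as an assumption and match it to the repo's actual format before use):

```python
def format_alpaca(instruction: str, input_text: str = "") -> str:
    """Build a prompt in the standard Alpaca style.

    This follows the common alpaca-cleaned convention; the exact template
    Sutra expects may differ slightly.
    """
    if input_text:
        return (
            "Below is an instruction that describes a task, paired with an "
            "input that provides further context. Write a response that "
            "appropriately completes the request.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{input_text}\n\n"
            "### Response:\n"
        )
    return (
        "Below is an instruction that describes a task. Write a response "
        "that appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        "### Response:\n"
    )


prompt = format_alpaca("Summarize the following text.", "Transformers use attention.")
```

The model then generates its answer as the continuation after `### Response:`.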
## Recommended Generation Settings

Because this is a compact 350M-parameter model, default generation settings may cause looping or wild hallucinations. For the best outputs, use the following configuration:

- Temperature: `0.5`
- Top-K: `50`
- Repetition Penalty: `1.3`
- Max Length: `400-500` tokens

Alternatively, use the `generation_config.json` file included in the repo.
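These settings map directly onto Hugging Face `GenerationConfig` fields. A sketch of a config dict matching the recommendations above (whether the repo's `generation_config.json` uses exactly these key names is an assumption):

```python
import json

# Recommended sampling settings expressed as transformers-style
# GenerationConfig keys. max_new_tokens=500 reflects the upper end of
# the suggested 400-500 range.
generation_config = {
    "do_sample": True,          # sampling is required for temperature/top_k
    "temperature": 0.5,
    "top_k": 50,
    "repetition_penalty": 1.3,
    "max_new_tokens": 500,
}

print(json.dumps(generation_config, indent=2))
```

With `transformers`, these can be passed as keyword arguments to `model.generate(...)` or loaded automatically from the repo's `generation_config.json`.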
## Limitations & Bias
- Hallucinations: At 350M parameters, Sutra simply lacks the capacity to act as a factual encyclopedia. It will confidently hallucinate historical dates, math solutions, and trivia.
- Coding: While it understands Python syntax and will output beautifully formatted code blocks (thanks to StarCoder), complex logical scripts may fail.
- Best Use Case: Sutra excels at structural formatting, grammar, summarizing provided context, generating short stories, and acting as a lightweight, lightning-fast local testing model.