Asterisk: Hybrid ASPP-Attention Architecture

Asterisk is a research implementation that augments SmolLM2-135M with the ASPP (Adjacency-Structured Parallel Propagation) operator. Each decoder layer fuses graph-based local reasoning (ASPP) with standard global attention through a learned gate, aiming for improved expressiveness on structured reasoning tasks.

Model Description

  • Base Model: SmolLM2-135M-Instruct
  • Architecture: Hybrid ASPP-Attention (30 hybrid layers)
  • Parameters: 171.2M (35M additional ASPP parameters)
  • Training: Supervised Fine-Tuning on Capybara dataset
  • Framework: Transformers 4.57.6, TRL 0.27.0

Evaluation Results

Evaluated with LM-Evaluation-Harness:

Task           Metric    Score   Stderr
HellaSwag      acc_norm  0.4430  ±0.0157
ARC-Easy       acc_norm  0.5450  ±0.0158
ARC-Challenge  acc_norm  0.2884  ±0.0132
PIQA           acc_norm  0.6770  ±0.0148
WinoGrande     acc       0.5210  ±0.0158

Note: These are preliminary results with sample limits. Full evaluation pending.
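For reference, a run of roughly the following shape reproduces this task selection with the harness's Python API. This is a sketch only: the model path, dtype, and the limit value are placeholders, not taken from this card.

# Sketch of the evaluation run with the lm-evaluation-harness Python API
# (lm-eval >= 0.4). Model path and limit value are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=path/to/Asterisk,trust_remote_code=True,dtype=bfloat16",
    tasks=["hellaswag", "arc_easy", "arc_challenge", "piqa", "winogrande"],
    limit=500,  # per-task sample limit (value assumed)
)
print(results["results"])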

Key Innovation: The Asterisk Operator (★-operator)

The Asterisk Operator performs local parallel state evolution through point-wise transformations:

h_i^(t+1) = φ(h_i^(t))  [K-step iterative evolution]

This is then gated and fused with standard Llama attention outputs:

output = gate * ASPP(x) + (1-gate) * Attention(x)

Architecture

1. ASPPOperator (Point-wise Parallel Propagation)

class ASPPOperator:
    """

    Forward pass:
    1. Optional dimensionality reduction: h_t = down_proj(hidden_states)
    2. K-step evolution: h_t = h_t + α * φ(h_t)  [K times]
    3. Layer normalization after each step
    4. Optional projection back: output = up_proj(h_t)

    Parameters:
    - hidden_size: 576 (model dimension)
    - aspp_hidden_dim: 256 (internal ASPP dimension)
    - aspp_num_steps: 8 (evolution iterations)
    - aspp_dropout: 0.2
    """

Pseudocode:

function ASPP(hidden_states):
    # Optional dimensionality reduction
    if use_projection:
        h_t ← down_proj(hidden_states)
        h_t ← dropout(h_t)
    else:
        h_t ← hidden_states

    # Learnable number of steps
    k_steps ← max(1, int(sigmoid(k_logit) * num_steps))

    # K-step point-wise evolution
    for t = 1 to k_steps:
        # Point-wise update: φ(h_t) = MLP(h_t)
        h_t_next ← update_net(h_t)

        # Scaled residual connection
        h_t ← h_t + residual_scale * h_t_next
        h_t ← layer_norm(h_t)

    # Project back to original dimension
    if use_projection:
        h_t ← up_proj(h_t)
        h_t ← dropout(h_t)

    return h_t
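A minimal PyTorch sketch of the operator described by this pseudocode. The names down_proj, up_proj, update_net, residual_scale, and k_logit follow the pseudocode; the MLP shape and initial parameter values are assumptions, not taken from the released implementation.

import torch
import torch.nn as nn

class ASPPOperatorSketch(nn.Module):
    """Sketch of the K-step point-wise evolution (names follow the pseudocode)."""

    def __init__(self, hidden_size=576, aspp_hidden_dim=256, num_steps=8, dropout=0.2):
        super().__init__()
        self.use_projection = aspp_hidden_dim != hidden_size
        dim = aspp_hidden_dim if self.use_projection else hidden_size
        if self.use_projection:
            self.down_proj = nn.Linear(hidden_size, dim)
            self.up_proj = nn.Linear(dim, hidden_size)
        self.update_net = nn.Sequential(          # point-wise MLP φ(h_t)
            nn.Linear(dim, dim),
            nn.GELU(),
            nn.Linear(dim, dim),
        )
        self.layer_norm = nn.LayerNorm(dim)
        self.dropout = nn.Dropout(dropout)
        self.residual_scale = nn.Parameter(torch.tensor(0.1))  # α in h_t + α·φ(h_t)
        self.k_logit = nn.Parameter(torch.zeros(1))            # learnable step count
        self.num_steps = num_steps

    def forward(self, hidden_states):
        # Optional dimensionality reduction (576 -> 256 by default)
        h_t = hidden_states
        if self.use_projection:
            h_t = self.dropout(self.down_proj(h_t))
        # Learnable effective step count in [1, num_steps]
        k_steps = max(1, int(torch.sigmoid(self.k_logit).item() * self.num_steps))
        # K-step point-wise evolution with scaled residual updates
        for _ in range(k_steps):
            h_t = self.layer_norm(h_t + self.residual_scale * self.update_net(h_t))
        # Project back to the model dimension
        if self.use_projection:
            h_t = self.dropout(self.up_proj(h_t))
        return h_t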

2. HybridASPPAttentionLayer

class HybridASPPAttentionLayer(LlamaDecoderLayer):
    """
    Extends LlamaDecoderLayer with parallel ASPP branch

    Architecture:
    1. Input LayerNorm
    2. Parallel branches:
       - ASPP operator for local structured reasoning
       - Standard LlamaAttention for global context
    3. Gated fusion: gate * ASPP + (1-gate) * Attention
    4. Residual connection
    5. Feed-forward MLP
    """

Pseudocode:

function HybridLayer(hidden_states, attention_mask, ...):
    residual ← hidden_states
    hidden_states ← input_layernorm(hidden_states)

    # Parallel branches
    aspp_output ← aspp_operator(hidden_states)
    attn_output ← self_attention(hidden_states, attention_mask, ...)

    # Gated fusion
    fusion_input ← concat([aspp_output, attn_output])
    gate ← sigmoid(linear(dropout(fusion_input)))
    fused_output ← gate * aspp_output + (1 - gate) * attn_output

    # Residual connection
    hidden_states ← residual + fused_output

    # MLP block
    residual ← hidden_states
    hidden_states ← post_attention_layernorm(hidden_states)
    hidden_states ← mlp(hidden_states)
    hidden_states ← residual + hidden_states

    return hidden_states
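The fusion step in isolation, as a hedged sketch: the attention branch is the stock LlamaAttention and is not reproduced here, and the gate projection's name and output width (one gate value per hidden feature) are assumptions.

import torch
import torch.nn as nn

class GatedFusionSketch(nn.Module):
    """Sketch of the gated fusion between the ASPP branch and the attention branch."""

    def __init__(self, hidden_size=576, dropout=0.2):
        super().__init__()
        # Maps the concatenated branch outputs to a per-feature gate in (0, 1)
        self.gate_proj = nn.Linear(2 * hidden_size, hidden_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, aspp_output, attn_output):
        fusion_input = torch.cat([aspp_output, attn_output], dim=-1)
        gate = torch.sigmoid(self.gate_proj(self.dropout(fusion_input)))
        return gate * aspp_output + (1 - gate) * attn_output

A per-feature gate lets each hidden dimension interpolate independently between the local ASPP path and the global attention path, rather than committing the whole layer to one branch.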

3. AsteriskForCausalLM

class AsteriskForCausalLM(LlamaForCausalLM):
    """
    Main model class with custom model_type "asterisk"

    Configuration:
    - hybrid_layer_indices: None (all 30 layers are hybrid)
    - aspp_hidden_dim: 256 (reduces overfitting)
    - aspp_num_steps: 8 (learnable, actual steps ≈ 6)
    - aspp_dropout: 0.2
    """


Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "path/to/Asterisk",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("path/to/Asterisk")

# Generate text
messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=256,
    temperature=0.7,
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Training Details

Training Configuration

  • Dataset: Capybara (conversational instruction-following)
  • Optimizer: AdamW (lr=2e-5, weight_decay=0.01)
  • Batch Size: 4 per device, gradient accumulation=4 (effective batch=16)
  • Epochs: 2
  • Scheduler: Cosine with warmup (100 steps)
  • Mixed Precision: bfloat16
  • Gradient Checkpointing: Enabled

ASPP Configuration

aspp_hidden_dim = 256      # Internal dimension (vs 576 model hidden_size)
aspp_num_steps = 8         # Max evolution steps (learnable)
aspp_dropout = 0.2         # Regularization
hybrid_layer_indices = None  # All 30 layers

Model Creation from Base

import torch
from AsteriskForCausalLM import AsteriskForCausalLM

# Create Asterisk model from SmolLM2 base
model, base_model = AsteriskForCausalLM.from_pretrained_base(
    "HuggingFaceTB/SmolLM2-135M-Instruct",
    hybrid_layer_indices=None,  # None = all layers
    aspp_hidden_dim=256,        # Internal ASPP dimension
    aspp_num_steps=8,           # K-step evolution
    aspp_dropout=0.2,           # Dropout rate
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Base model parameters are transferred, ASPP parameters initialized randomly
model.load_state_dict(base_model.state_dict(), strict=False)

Theoretical Background

Universality (Theorem 2.1)

ASPP can simulate any Message-Passing Neural Network (MPNN) function on finite graphs in D steps, where D is the graph diameter.

Convergence (Theorem 2.2)

Exponential convergence to fixed points with rate c=0.76 under Lipschitz continuity.
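
Stated loosely (this reading is an assumption, not a quotation of Theorem 2.2): if the one-step ASPP update Φ is a contraction with Lipschitz constant c = 0.76 < 1, the Banach fixed-point theorem gives geometric convergence to a unique fixed point h*:

% Sketch of the convergence statement; the norm and the role of c are assumptions.
\| h^{(t)} - h^{*} \| \;\le\; c^{t} \, \| h^{(0)} - h^{*} \|, \qquad c = 0.76, \qquad h^{*} = \Phi(h^{*}).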

Turing Completeness

Proven via simulation of a cyclic tag system: given sufficient depth, ASPP can compute any Turing-computable function.

Implementation Note: This implementation simplifies the theoretical ASPP operator to point-wise evolution to reduce overfitting while retaining the benefits of iterative refinement.

Files in Checkpoint

Asterisk/
├── AsteriskForCausalLM.py    # Model implementation (required for trust_remote_code)
├── config.json                # Model configuration with auto_map
├── model.safetensors          # Model weights
├── tokenizer.json             # Tokenizer
├── generation_config.json     # Generation settings
└── README.md                  # This file

Dependencies

pip install torch>=2.0.0
pip install transformers>=4.40.0
pip install trl>=0.8.0
pip install datasets>=2.14.0
pip install accelerate>=0.25.0
pip install bitsandbytes

Citations

If you use this model, please cite:

@misc{asterisk2026,
  title={Asterisk: Hybrid ASPP-Attention Architecture for Enhanced Language Modeling},
  author={NoesisLab},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/NoesisLab/Asterisk}
}
@misc{vonwerra2022trl,
  title={{TRL: Transformer Reinforcement Learning}},
  author={Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallouédec},
  year={2020},
  journal={GitHub repository},
  publisher={GitHub},
  howpublished={\url{https://github.com/huggingface/trl}}
}
@misc{allal2024SmolLM2,
  title={SmolLM2 - with great data, comes great performance},
  author={Allal, Loubna Ben and Lozhkov, Anton and Penedo, Guilherme and Wolf, Thomas and von Werra, Leandro},
  year={2024}
}

License

This model inherits the Apache 2.0 license from SmolLM2-135M-Instruct.

Framework Versions

  • TRL: 0.27.0
  • Transformers: 4.57.6
  • PyTorch: 2.8.0+cu128
  • Datasets: 4.5.0
  • Tokenizers: 0.22.2

Acknowledgments

Built on top of SmolLM2-135M-Instruct by Hugging Face. Training powered by the TRL framework.
