SmollerLM2-360M-Instruct-Pruned

A structurally pruned version of the SmolLM2-360M-Instruct model.

This model was created as an experiment in pruning.

Pruning Methodology

The model underwent Structured Every-Nth Neuron Pruning. Unlike random dropout or unstructured pruning, this method maintains the dense matrix format required by standard hardware accelerators.

  • Target: Intermediate MLP (Feed-Forward) layers.
  • Strategy: Every 20th neuron was removed (1 in 20, i.e. 5% of the intermediate dimension).
  • Dimension Shift: Intermediate size reduced from 2560 to 2432.
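To make the method concrete, here is a minimal PyTorch sketch of every-Nth structured pruning on a toy pair of MLP projections. The layer names (`up_proj`, `down_proj`, in the style of a LLaMA-like block) are illustrative assumptions, not the exact tensors or code used to produce this checkpoint:

```python
import torch
import torch.nn as nn

hidden, intermediate, N = 960, 2560, 20  # SmolLM2-360M-style MLP sizes

up_proj = nn.Linear(hidden, intermediate, bias=False)    # hidden -> intermediate
down_proj = nn.Linear(intermediate, hidden, bias=False)  # intermediate -> hidden

# Keep every neuron except each N-th one (removes intermediate // N neurons)
keep = torch.tensor([i for i in range(intermediate) if i % N != N - 1])

# Slice the *rows* of up_proj and the *columns* of down_proj so both
# matrices stay dense and shape-consistent after pruning
up_proj_pruned = nn.Linear(hidden, len(keep), bias=False)
up_proj_pruned.weight.data = up_proj.weight.data[keep, :]

down_proj_pruned = nn.Linear(len(keep), hidden, bias=False)
down_proj_pruned.weight.data = down_proj.weight.data[:, keep]

print(len(keep))  # 2432 = 2560 - 2560 // 20
```

Because entire neurons (matrix rows/columns) are removed, the result is simply a smaller dense model, with no sparsity masks or special kernels required at inference time.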

Memory Efficiency

While the original model is distributed in FP32, this model provides an optimization that makes it significantly more accessible:

  • Precision Reduction (FP32 → FP16): We converted the weights to half-precision, halving the memory footprint.

Total Savings: 51.6% smaller than the original FP32 version (50% from the precision change, the remainder from pruning).
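The 51.6% figure can be sanity-checked with back-of-the-envelope arithmetic, assuming roughly 362M original parameters and that only the MLP projection weights (32 layers × 3 projections × 960 × 2560, per the SmolLM2-360M config) were pruned:

```python
total = 362_000_000                 # approx. original parameter count
mlp = 32 * 3 * 960 * 2560           # MLP projection weights across all layers
removed = mlp // 20                 # every-20th-neuron pruning
pruned_total = total - removed

orig_bytes = total * 4              # FP32: 4 bytes per weight
new_bytes = pruned_total * 2        # FP16: 2 bytes per weight

savings = 1 - new_bytes / orig_bytes
print(f"{savings:.1%}")             # ≈ 51.6%
```

Pruning alone removes only about 3.3% of the parameters; most of the savings comes from the FP32 → FP16 conversion.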

Recommended Usage

TECHNICAL NOTE: CPUs without native FP16 support must emulate half-precision arithmetic, which can make this model slower in tokens-per-second than the original. In its raw FP16 state the model is RAM-optimized, not necessarily CPU-latency optimized.

For best performance, run the model on a GPU or use 4-bit/8-bit quantization to avoid the CPU floating-point penalty.

You will first need to install the bitsandbytes library (pip install bitsandbytes). Note that bitsandbytes quantization generally expects a CUDA-capable GPU; its CPU backend is recent and still experimental.

GPU Loading (Fastest)

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_id = "Fu01978/SmollerLM2-360M-Instruct-Pruned"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load in 8-bit via bitsandbytes; modules that are not quantized stay in FP16
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)

CPU Loading

If running on lower-end machines, load the model in 4-bit to shrink the weights to roughly a quarter of their FP16 size and reduce memory-bandwidth pressure:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_id = "Fu01978/SmollerLM2-360M-Instruct-Pruned"
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto"
)

Generate

Use the following snippet to chat with the model. It applies the model's chat template before tokenizing.

# Define your message(s)
messages = [
    {"role": "user", "content": "Explain the concept of gravity."}
]

input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=150,
        temperature=0.2,
        do_sample=True,
        repetition_penalty=1.1
    )

response = tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
print(response)

Example Output (4-bit Quantized)

  • Prompt: "Explain the concept of gravity."
  • Output:

    Gravity is indeed one of the most fundamental concepts in physics and mathematics. It's essentially the "attraction" between two bodies or masses. According to Einstein's theory of general relativity, mass warps space-time around it, creating a gravitational field that attracts other objects with mass. This means that anything having mass has a gravitational pull on other matter, making them feel heavy. For example, when you drop an object, you're not really feeling its weight; rather, you're feeling the gravitational force exerted by the Earth. The actual weight of an object depends on how massive the object itself is, which can be calculated using formulas like F=G x m/r^2 where G is the gravitational constant, [hit 150 token limit]

Limitations & Bias

As a pruned version of SmolLM2, this model inherits the biases of its parent.

While the pruning was found to be stable, users may encounter slight regressions in mathematical reasoning compared to the full model.
