Bio-ACDC: Biological Sequence Model Coevolution

An adaptation of AC/DC (Assessment Coevolving with Diverse Capabilities) for biological language models.

Overview

Bio-ACDC coevolves populations of biological language models (for DNA, RNA, and Protein sequences) with synthetic sequence tasks to discover specialized model experts.

Core Components

1. Coevolution Loop (`core.py`)

Initialization: Seed models are evaluated on the base tasks
Offspring Generation: Parents are merged and optionally mutated
Evaluation: New models are tested on the current task pool
Archive Update: Dominated Novelty Search (DNS) maintains a diverse Pareto archive
Task Generation: New tasks target weaknesses discovered in the archive
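
The five steps can be read as a generic loop. The sketch below is only a skeleton under stated assumptions: `coevolve` and every callable passed to it are hypothetical names, not the `core.py` API, and the toy stand-ins (floats as "models", thresholds as "tasks") exist only to make the control flow runnable.

```python
def coevolve(seeds, tasks, evaluate, make_offspring, update_archive, make_tasks,
             generations=3):
    """Generic coevolution skeleton; every callable is supplied by the caller."""
    tasks = list(tasks)
    # 1. Initialization: score the seed models on the base tasks.
    archive = [(m, evaluate(m, tasks)) for m in seeds]
    for _ in range(generations):
        parents = [m for m, _ in archive]
        # 2. Offspring generation: merge (and optionally mutate) parents.
        children = make_offspring(parents)
        # 3. Evaluation: parents are re-scored so all scores use the same pool.
        scored = [(m, evaluate(m, tasks)) for m in parents + children]
        # 4. Archive update: keep a bounded set of models.
        archive = update_archive(scored)
        # 5. Task generation: extend the pool based on the archive.
        tasks.extend(make_tasks(archive, tasks))
    return archive, tasks


# Toy stand-ins: a "model" is a float, a "task" is a threshold it must reach.
evaluate = lambda m, ts: sum(m >= t for t in ts) / len(ts)
make_offspring = lambda ps: [sum(ps) / len(ps) + 0.5]
update_archive = lambda scored: sorted(scored, key=lambda s: s[1], reverse=True)[:2]
make_tasks = lambda arch, ts: [max(ts) + 1.0]

archive, tasks = coevolve([0.0, 1.0], [0.5, 1.5], evaluate, make_offspring,
                          update_archive, make_tasks)
```

Passing the steps in as callables keeps the loop independent of how merging, scoring, and task creation are actually implemented.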

2. Task Pool (`tasks.py`)

Generates and manages biological sequence tasks:

Protein tasks: Motif recognition, sequence completion, structure prediction
DNA tasks: Regulatory element detection, motif localization
RNA tasks: Secondary structure prediction, motif finding

New tasks are generated automatically to target weaknesses found in the archive.
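
As a rough illustration of what a rule-based motif-localization task might look like, the sketch below plants a known motif into a random DNA background and records its position as the answer. The function name `make_motif_task`, the dictionary fields, and the `TATAAA` example are illustrative assumptions, not the `tasks.py` API.

```python
import random

DNA_ALPHABET = "ACGT"


def make_motif_task(motif, length=60, seed=0):
    """Plant `motif` at a random position in a random DNA sequence.

    Hypothetical helper for illustration; not part of bio_acdc.
    """
    if len(motif) > length:
        raise ValueError("motif longer than sequence")
    rng = random.Random(seed)
    background = [rng.choice(DNA_ALPHABET) for _ in range(length)]
    start = rng.randrange(length - len(motif) + 1)
    # Overwrite a window of the background with the motif (length is unchanged).
    background[start:start + len(motif)] = motif
    return {
        "type": "dna_motif_localization",
        "sequence": "".join(background),
        "motif": motif,
        "answer": start,  # 0-based start of the planted motif
    }


task = make_motif_task("TATAAA", length=40, seed=42)
```

The random background may also contain the motif by chance, so a real task generator would probably need to check for extra occurrences before accepting a sequence.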

3. Model Merging (`mergers.py`)

Multiple merging strategies:

Linear Merge: Weighted average of parent parameters
SLERP: Spherical linear interpolation
Task Vector: Arithmetic with base model subtraction
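
The three strategies can be sketched on plain NumPy arrays. This is a minimal sketch, not the `mergers.py` implementation: it assumes both parents share the same architecture and parameter shapes, represents a state dict as a `dict` of arrays, and uses hypothetical function names.

```python
import numpy as np


def linear_merge(a, b, weight=0.5):
    """Weighted average of two state dicts with identical keys and shapes."""
    return {k: weight * a[k] + (1.0 - weight) * b[k] for k in a}


def slerp(a, b, t=0.5, eps=1e-8):
    """Spherical linear interpolation between two tensors of the same shape."""
    a_flat, b_flat = a.ravel(), b.ravel()
    a_unit = a_flat / (np.linalg.norm(a_flat) + eps)
    b_unit = b_flat / (np.linalg.norm(b_flat) + eps)
    dot = np.clip(np.dot(a_unit, b_unit), -1.0, 1.0)
    omega = np.arccos(dot)  # angle between the two weight vectors
    if np.sin(omega) < eps:
        # Nearly parallel or antiparallel: fall back to linear interpolation.
        return (1.0 - t) * a + t * b
    wa = np.sin((1.0 - t) * omega) / np.sin(omega)
    wb = np.sin(t * omega) / np.sin(omega)
    return (wa * a_flat + wb * b_flat).reshape(a.shape)


def task_vector_merge(base, finetuned, scale=1.0):
    """Add the scaled sum of task vectors (finetuned - base) back onto base."""
    merged = {}
    for k in base:
        delta = sum(ft[k] - base[k] for ft in finetuned)
        merged[k] = base[k] + scale * delta
    return merged
```

In practice `slerp` would be applied tensor by tensor across the state dict, in the same way `linear_merge` iterates over keys.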

4. Mutation (`mutators.py`)

Controlled perturbation operators:

Gaussian Noise: Add random noise to weights
Layer Scale: Randomly scale specific layers
Dropout: Structured pruning of weights
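
A minimal sketch of the three operators on a `dict` of NumPy arrays follows. The function names, default values, and the reading of "structured" as dropping whole slices along the first axis are assumptions made for illustration; `mutators.py` may differ.

```python
import numpy as np


def gaussian_noise(state, std=0.01, seed=0):
    """Add zero-mean Gaussian noise with standard deviation `std` to every tensor."""
    rng = np.random.default_rng(seed)
    return {k: v + rng.normal(0.0, std, size=v.shape) for k, v in state.items()}


def layer_scale(state, prob=0.5, low=0.9, high=1.1, seed=0):
    """With probability `prob`, multiply a tensor by a scalar drawn from [low, high)."""
    rng = np.random.default_rng(seed)
    out = {}
    for k, v in state.items():
        factor = rng.uniform(low, high) if rng.random() < prob else 1.0
        out[k] = v * factor
    return out


def structured_dropout(state, rate=0.1, seed=0):
    """Zero whole slices along the first axis (e.g. rows) with probability `rate`."""
    rng = np.random.default_rng(seed)
    out = {}
    for k, v in state.items():
        # One keep/drop decision per first-axis index, broadcast over the rest.
        mask_shape = (v.shape[0],) + (1,) * (v.ndim - 1)
        out[k] = v * (rng.random(mask_shape) >= rate)
    return out
```

Each function returns a new dict and leaves the parent's weights untouched, so a mutated offspring can be discarded without affecting the archive.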

5. Archive (`archive.py`)

Dominated Novelty Search maintains a Pareto archive:

Maximizes both fitness and novelty
Novelty is measured by the capabilities a model has that fitter solutions lack
Difficulty-aware weighting
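
One way to read these three points is sketched below. It is an interpretation, not the `archive.py` implementation: the pass/fail matrix, the weighting by `1 - pass rate`, the averaging over fitter models, and the names `dns_scores` and `select_archive` are all assumptions.

```python
import math


def dns_scores(passes, fitness):
    """Dominated-novelty score per model.

    passes[i][t] is True if model i solves task t; fitness[i] is a scalar.
    """
    n_models, n_tasks = len(passes), len(passes[0])
    # Difficulty-aware weight: tasks solved by fewer models count more.
    weights = [1.0 - sum(p[t] for p in passes) / n_models for t in range(n_tasks)]
    scores = []
    for i in range(n_models):
        fitter = [j for j in range(n_models) if fitness[j] > fitness[i]]
        if not fitter:
            scores.append(math.inf)  # no fitter model exists: always kept
            continue
        # Weighted count of tasks model i solves that each fitter model fails.
        unique = [
            sum(w for t, w in enumerate(weights) if passes[i][t] and not passes[j][t])
            for j in fitter
        ]
        scores.append(sum(unique) / len(unique))
    return scores


def select_archive(passes, fitness, archive_size):
    """Indices of the `archive_size` models with the highest scores."""
    scores = dns_scores(passes, fitness)
    order = sorted(range(len(scores)), key=lambda i: (scores[i], fitness[i]),
                   reverse=True)
    return sorted(order[:archive_size])


passes = [
    [True, True, True, False],    # strongest generalist
    [True, True, False, False],   # weaker, adds nothing new
    [False, False, False, True],  # weak overall, but solves a rare task
]
fitness = [0.75, 0.5, 0.25]
kept = select_archive(passes, fitness, archive_size=2)
```

In this toy case the low-fitness specialist is kept ahead of the redundant mid-fitness model, which is the behavior the archive is meant to encourage.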

6. Evaluator (`evaluator.py`)

Evaluates models on biological tasks:

Sequence identity/similarity
Motif containment
Perplexity
RNA structure prediction
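
Three of these metrics are simple enough to sketch directly. The function names below are hypothetical and the definitions are common textbook forms, not necessarily the ones used in `evaluator.py`; identity here is position-wise without alignment, and perplexity takes per-token natural-log probabilities as input.

```python
import math


def sequence_identity(pred, ref):
    """Fraction of matching positions, normalized by the longer sequence (no gaps)."""
    if not pred and not ref:
        return 1.0
    matches = sum(p == r for p, r in zip(pred, ref))
    return matches / max(len(pred), len(ref))


def contains_motif(seq, motif):
    """True if `motif` occurs anywhere in `seq` as an exact substring."""
    return motif in seq


def perplexity(token_log_probs):
    """exp of the mean negative log-probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```

For masked language models such as ESM-2, the log-probabilities would typically be obtained by masking one position at a time, which is often called pseudo-perplexity.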

Usage

Quick Start

from bio_acdc import BioACDC, BioACDCConfig
from bio_acdc.tasks import BioTaskPool
from bio_acdc.mergers import LinearMerge
from bio_acdc.mutators import GaussianNoiseMutator
from bio_acdc.evaluator import BioEvaluator

# Configuration
config = BioACDCConfig(
    seed_model_paths=[
        "facebook/esm2_t33_650M_UR50D",
        "InstaDeepAI/nucleotide-transformer-v2-500m-multi-species",
    ],
    archive_size=20,
    num_generations=10,
    offspring_per_gen=5,
    output_dir="./bio_acdc_output",
)

# Components
task_pool = BioTaskPool(seed=42)
evaluator = BioEvaluator()
merger = LinearMerge()
mutator = GaussianNoiseMutator(std=0.01)

# Create Bio-ACDC
bio_acdc = BioACDC(
    config=config,
    task_pool=task_pool,
    evaluator=evaluator,
    merger=merger,
    mutator=mutator,
)

# Run evolution
final_archive = bio_acdc.evolve()

# Best model
best = bio_acdc.archive.get_best()
print(f"Best model: {best.model_path}, Fitness: {best.fitness:.4f}")

Using Custom Models

from bio_acdc import BioACDCConfig

# For ESM-2 protein models
protein_config = BioACDCConfig(
    seed_model_paths=[
        "facebook/esm2_t33_650M_UR50D",
        "facebook/esm2_t30_150M_UR50D",
    ],
)

# For Nucleotide Transformer DNA models
dna_config = BioACDCConfig(
    seed_model_paths=[
        "InstaDeepAI/nucleotide-transformer-v2-500m-multi-species",
        "InstaDeepAI/nucleotide-transformer-v2-100m-multi-species",
    ],
)

Architecture Comparison: ACDC vs Bio-ACDC

| Feature | ACDC (SakanaAI) | Bio-ACDC |
| --- | --- | --- |
| Domain | General NLP (text) | Biological sequences |
| Seed Models | Qwen/Llama LLMs | ESM-2, NT, protein/DNA LMs |
| Tasks | Synthetic code/math/text | Motif detection, sequence completion, structure prediction |
| Evaluation | LLM-as-judge + code sandbox | Sequence similarity, biological metrics |
| Merging | SLERP, Linear, Task Vectors | Same (adapted for masked LMs) |
| Archive | Dominated Novelty Search | Same |
| Task Gen | LLM-based task creation | Rule-based + evolutionary targeting |

Key Biological Adaptations

Sequence-Specific Tasks: Motifs, regulatory elements, structure prediction
Token-Aware Evaluation: Handles amino acids (20 AA), nucleotides (4 NT)
Masked LM Support: Works with ESM-2 and Nucleotide Transformer (masked language models)
Motif-Based Difficulty: Tasks target specific biological motifs
Structure Evaluation: RNA secondary structure comparison
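
RNA secondary structures are commonly written in dot-bracket notation, and one standard way to compare two of them is the F1 score over their base pairs. The sketch below assumes that representation and that metric; the source does not say which comparison Bio-ACDC uses, and the function names are hypothetical. Pseudoknot brackets such as `[` and `]` are not handled.

```python
def base_pairs(dot_bracket):
    """Return the set of (i, j) paired positions in a dot-bracket string."""
    stack, pairs = [], set()
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            pairs.add((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced structure")
    return pairs


def base_pair_f1(pred, ref):
    """F1 score between predicted and reference base-pair sets."""
    p, r = base_pairs(pred), base_pairs(ref)
    if not p and not r:
        return 1.0  # both structures are fully unpaired
    tp = len(p & r)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(r)
    return 2 * precision * recall / (precision + recall)
```

Comparing pair sets rather than raw strings means a prediction is credited only when both partners of a pair are correct.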

Requirements

torch>=2.0
transformers>=4.30
datasets
numpy
safetensors
biopython  # Optional, for advanced sequence analysis

Citation

@software{bio_acdc_2024,
  title = {Bio-ACDC: Coevolution of Biological Language Models and Sequence Tasks},
  author = {Adapted from SakanaAI AC/DC},
  year = {2024},
  url = {https://acdc-llm.github.io}
}
