Bio-ACDC: Biological Sequence Model Coevolution

An adaptation of AC/DC (Assessment Coevolving with Diverse Capabilities) for biological language models.

Overview

Bio-ACDC coevolves populations of biological language models (for DNA, RNA, and protein sequences) with synthetic sequence tasks in order to discover specialized expert models.

Core Components

1. Coevolution Loop (core.py)

  • Initialization: Seed models are evaluated on base tasks
  • Offspring Generation: Parents are merged and optionally mutated
  • Evaluation: New models are tested on current task pool
  • Archive Update: Dominated Novelty Search (DNS) maintains a diverse Pareto archive
  • Task Generation: New tasks target weaknesses discovered in the archive
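
The five steps above can be sketched as a toy loop. This is a minimal illustration under stated assumptions, not the code in core.py: models are plain skill vectors, tasks are hypothetical (axis, difficulty) pairs, and the archive update simply keeps the fittest models instead of running DNS. All function names are invented for the example.

```python
import random

def evaluate(model, tasks):
    """Pass/fail vector: a task is solved if the model's skill on its axis meets the difficulty."""
    return [1 if model[axis] >= difficulty else 0 for axis, difficulty in tasks]

def fitness(model, tasks):
    results = evaluate(model, tasks)
    return sum(results) / len(results)

def merge(parent_a, parent_b, weight):
    """Weighted average of two parents (stand-in for a real merger)."""
    return [weight * a + (1 - weight) * b for a, b in zip(parent_a, parent_b)]

def mutate(model, rng, std=0.05):
    """Small Gaussian perturbation of every parameter."""
    return [w + rng.gauss(0.0, std) for w in model]

def new_task(archive, tasks, rng):
    """Create a task on the axis where the archive has the lowest pass rate."""
    def pass_rate(axis):
        axis_tasks = [t for t in tasks if t[0] == axis]
        if not axis_tasks:
            return 0.0  # untested axes count as the weakest
        solved = sum(sum(evaluate(m, axis_tasks)) for m in archive)
        return solved / (len(archive) * len(axis_tasks))
    weakest = min(range(len(archive[0])), key=pass_rate)
    return (weakest, rng.uniform(0.0, 1.0))

def coevolve(seeds, tasks, generations=3, offspring_per_gen=4, archive_size=4, seed=0):
    rng = random.Random(seed)
    tasks = list(tasks)
    archive = list(seeds)                              # 1. initialization
    for _ in range(generations):
        offspring = []
        for _ in range(offspring_per_gen):             # 2. offspring generation
            a, b = rng.sample(archive, 2)
            offspring.append(mutate(merge(a, b, rng.random()), rng))
        candidates = archive + offspring
        ranked = sorted(candidates,                    # 3. evaluation
                        key=lambda m: fitness(m, tasks), reverse=True)
        archive = ranked[:archive_size]                # 4. archive update (simplified, no DNS)
        tasks.append(new_task(archive, tasks, rng))    # 5. task generation
    return archive, tasks

archive, tasks = coevolve(
    seeds=[[0.2, 0.8, 0.1], [0.7, 0.1, 0.3]],
    tasks=[(0, 0.5), (1, 0.5)],
)
print(len(archive), len(tasks))
```

Each generation adds exactly one task, so the task pool grows alongside the model archive.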

2. Task Pool (tasks.py)

Generates and manages biological sequence tasks:

  • Protein tasks: Motif recognition, sequence completion, structure prediction
  • DNA tasks: Regulatory element detection, motif localization
  • RNA tasks: Secondary structure prediction, motif finding

New tasks are generated automatically to target weaknesses found in the archive.
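
As an illustration of rule-based task generation, the sketch below plants a known motif in a random background sequence and records where it was placed. The function name, the task dictionary layout, and the choice of TATAAA as an example motif are assumptions for this example, not the interface of tasks.py.

```python
import random

# Alphabets for the three sequence types (20 standard amino acids for proteins).
ALPHABETS = {
    "dna": "ACGT",
    "rna": "ACGU",
    "protein": "ACDEFGHIKLMNPQRSTVWY",
}

def make_motif_task(kind, motif, length, rng):
    """Build a motif-localization task: find the planted motif in a random sequence."""
    alphabet = ALPHABETS[kind]
    if any(ch not in alphabet for ch in motif):
        raise ValueError(f"motif {motif!r} is not a valid {kind} sequence")
    if len(motif) > length:
        raise ValueError("motif is longer than the requested sequence")
    start = rng.randrange(length - len(motif) + 1)
    residues = [rng.choice(alphabet) for _ in range(length)]
    residues[start:start + len(motif)] = motif  # overwrite background with the motif
    return {
        "kind": kind,
        "task": "motif_localization",
        "sequence": "".join(residues),
        "motif": motif,
        # Position of the planted copy; the motif may also occur elsewhere by chance.
        "answer": start,
    }

task = make_motif_task("dna", "TATAAA", length=40, rng=random.Random(42))
print(task["sequence"], task["answer"])
```

Difficulty could be raised by shortening the motif or lengthening the background, both of which make chance matches more likely.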

3. Model Merging (mergers.py)

Multiple merging strategies:

  • Linear Merge: Weighted average of parent parameters
  • SLERP: Spherical linear interpolation
  • Task Vector: Arithmetic on parameter deltas obtained by subtracting a base model
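
The three strategies can be written down compactly. The sketch below is a simplified illustration that treats a model as a dictionary of flat float lists; a real merger would operate on tensor state dicts. It assumes both parents share parameter names and shapes, and the function names are hypothetical rather than taken from mergers.py.

```python
import math

def linear_merge(a, b, t=0.5):
    """Weighted average: t=0 returns a, t=1 returns b."""
    return {k: [(1 - t) * x + t * y for x, y in zip(a[k], b[k])] for k in a}

def slerp_merge(a, b, t=0.5, eps=1e-8):
    """Spherical linear interpolation, applied per parameter."""
    merged = {}
    for k in a:
        x, y = a[k], b[k]
        norm_x = math.sqrt(sum(v * v for v in x))
        norm_y = math.sqrt(sum(v * v for v in y))
        if norm_x < eps or norm_y < eps:
            merged[k] = [(1 - t) * u + t * v for u, v in zip(x, y)]
            continue
        cos = sum(u * v for u, v in zip(x, y)) / (norm_x * norm_y)
        omega = math.acos(max(-1.0, min(1.0, cos)))  # angle between the two vectors
        if math.sin(omega) < eps:
            # Nearly parallel vectors: fall back to linear interpolation.
            merged[k] = [(1 - t) * u + t * v for u, v in zip(x, y)]
            continue
        wa = math.sin((1 - t) * omega) / math.sin(omega)
        wb = math.sin(t * omega) / math.sin(omega)
        merged[k] = [wa * u + wb * v for u, v in zip(x, y)]
    return merged

def task_vector_merge(base, finetuned, scale=1.0):
    """Add the scaled sum of (fine-tuned - base) deltas back onto the base model."""
    merged = {}
    for k in base:
        delta = [sum(m[k][i] - base[k][i] for m in finetuned) for i in range(len(base[k]))]
        merged[k] = [w + scale * d for w, d in zip(base[k], delta)]
    return merged

print(linear_merge({"w": [1.0, 3.0]}, {"w": [3.0, 5.0]}))
print(slerp_merge({"w": [1.0, 0.0]}, {"w": [0.0, 1.0]}))
print(task_vector_merge({"w": [1.0, 1.0]}, [{"w": [2.0, 1.0]}, {"w": [1.0, 3.0]}]))
```

Unlike the linear merge, SLERP preserves the norm when interpolating between two vectors of equal length, which is why the midpoint of two orthogonal unit vectors is again a unit vector.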

4. Mutation (mutators.py)

Controlled perturbation operators:

  • Gaussian Noise: Add random noise to weights
  • Layer Scale: Randomly scale specific layers
  • Dropout: Structured pruning of weights
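
A minimal sketch of the three operators follows, again on a toy model represented as a dictionary of float lists. The function names and parameters are assumptions made for illustration; in particular, the dropout variant here zeroes individual weights at random, which is a simpler stand-in for whatever structured scheme mutators.py uses.

```python
import random

def gaussian_noise(state, std, rng):
    """Add zero-mean Gaussian noise to every weight."""
    return {k: [w + rng.gauss(0.0, std) for w in v] for k, v in state.items()}

def layer_scale(state, prob, low, high, rng):
    """With probability `prob` per layer, multiply the whole layer by one random factor."""
    mutated = {}
    for name, weights in state.items():
        if rng.random() < prob:
            factor = rng.uniform(low, high)
            mutated[name] = [w * factor for w in weights]
        else:
            mutated[name] = list(weights)
    return mutated

def weight_dropout(state, rate, rng):
    """Zero each weight independently with probability `rate`."""
    return {k: [0.0 if rng.random() < rate else w for w in v] for k, v in state.items()}

rng = random.Random(0)
model = {"layer0": [0.5, -0.2], "layer1": [1.0, 2.0]}
print(gaussian_noise(model, std=0.01, rng=rng))
print(layer_scale(model, prob=0.5, low=0.9, high=1.1, rng=rng))
print(weight_dropout(model, rate=0.1, rng=rng))
```

All three functions return a new dictionary and leave the parent untouched, so a parent can be mutated several times to produce different offspring.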

5. Archive (archive.py)

Dominated Novelty Search maintains a Pareto archive:

  • Maximizes both fitness and novelty
  • Novelty is computed from a model's unique capabilities relative to fitter solutions
  • Difficulty-aware weighting
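
The selection rule can be sketched as follows. This is one reading of Dominated Novelty Search, not the code in archive.py: each model is described by its pass/fail vector over the task pool, its score is the mean distance to its k nearest strictly fitter neighbours, and a model with no fitter neighbour scores infinity. The optional per-task weights are an assumed way to express difficulty-aware weighting.

```python
import math

def weighted_hamming(a, b, weights):
    """Sum the weights of the tasks on which two pass/fail vectors disagree."""
    return sum(w for x, y, w in zip(a, b, weights) if x != y)

def dns_scores(fitness, descriptors, k=3, task_weights=None):
    """Score each model by its distance to the k nearest strictly fitter models."""
    if task_weights is None:
        task_weights = [1.0] * len(descriptors[0])
    scores = []
    for i, desc_i in enumerate(descriptors):
        distances = sorted(
            weighted_hamming(desc_i, desc_j, task_weights)
            for j, desc_j in enumerate(descriptors)
            if j != i and fitness[j] > fitness[i]
        )
        if not distances:
            scores.append(math.inf)  # nothing is fitter, so always keep
        else:
            nearest = distances[:k]
            scores.append(sum(nearest) / len(nearest))
    return scores

def select_archive(fitness, descriptors, archive_size, k=3, task_weights=None):
    """Return the indices of the models to keep, in their original order."""
    scores = dns_scores(fitness, descriptors, k, task_weights)
    ranked = sorted(range(len(fitness)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:archive_size])

fitness = [0.75, 0.5, 0.5, 0.25]
descriptors = [
    [1, 1, 1, 0],  # fittest model
    [1, 1, 0, 0],  # solves a subset of what the fittest model solves
    [0, 0, 1, 1],  # same fitness as above, but solves a task nobody fitter solves
    [1, 0, 0, 0],
]
print(dns_scores(fitness, descriptors, k=2))
print(select_archive(fitness, descriptors, archive_size=2, k=2))
```

In this example the second and third models have equal fitness, yet only the third survives alongside the fittest one, because it differs more from the solutions that dominate it.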

6. Evaluator (evaluator.py)

Evaluates models on biological tasks:

  • Sequence identity/similarity
  • Motif containment
  • Perplexity
  • RNA structure prediction
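
Each of these metrics has a simple reference form. The sketch below shows one plausible definition per metric; the function names are hypothetical, sequence identity is computed position by position without alignment, and RNA structures are assumed to be given in dot-bracket notation. The actual evaluator.py may define them differently.

```python
import math

def sequence_identity(predicted, target):
    """Fraction of positions that match, normalized by the longer sequence."""
    longest = max(len(predicted), len(target))
    if longest == 0:
        return 1.0
    return sum(p == t for p, t in zip(predicted, target)) / longest

def contains_motif(sequence, motif):
    """True if the exact motif occurs anywhere in the sequence."""
    return motif in sequence

def perplexity(token_log_probs):
    """exp of the mean negative natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def base_pairs(dot_bracket):
    """Parse a dot-bracket string into a set of (i, j) base-pair indices."""
    stack, pairs = [], set()
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            pairs.add((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced structure")
    return pairs

def base_pair_f1(predicted, target):
    """F1 score between predicted and reference base pairs."""
    pred, true = base_pairs(predicted), base_pairs(target)
    if not pred and not true:
        return 1.0
    hits = len(pred & true)
    if hits == 0:
        return 0.0
    precision, recall = hits / len(pred), hits / len(true)
    return 2 * precision * recall / (precision + recall)

print(sequence_identity("ACGT", "ACGA"))
print(contains_motif("GGTATAAAGG", "TATAAA"))
print(perplexity([math.log(0.25)] * 4))
print(base_pair_f1("((..))", "(....)"))
```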

Usage

Quick Start

from bio_acdc import BioACDC, BioACDCConfig
from bio_acdc.tasks import BioTaskPool
from bio_acdc.mergers import LinearMerge
from bio_acdc.mutators import GaussianNoiseMutator
from bio_acdc.evaluator import BioEvaluator

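# Configuration
# Note: weight-space merging generally assumes the parents share an architecture
# and parameter shapes; check that the merger you choose supports the seeds you mix.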
config = BioACDCConfig(
    seed_model_paths=[
        "facebook/esm2_t33_650M_UR50D",
        "InstaDeepAI/nucleotide-transformer-v2-500m-multi-species",
    ],
    archive_size=20,
    num_generations=10,
    offspring_per_gen=5,
    output_dir="./bio_acdc_output",
)

# Components
task_pool = BioTaskPool(seed=42)
evaluator = BioEvaluator()
merger = LinearMerge()
mutator = GaussianNoiseMutator(std=0.01)

# Create Bio-ACDC
bio_acdc = BioACDC(
    config=config,
    task_pool=task_pool,
    evaluator=evaluator,
    merger=merger,
    mutator=mutator,
)

# Run evolution
final_archive = bio_acdc.evolve()

# Best model
best = bio_acdc.archive.get_best()
print(f"Best model: {best.model_path}, Fitness: {best.fitness:.4f}")

Using Custom Models

# For ESM-2 protein models
config = BioACDCConfig(
    seed_model_paths=[
        "facebook/esm2_t33_650M_UR50D",
        "facebook/esm2_t30_150M_UR50D",
    ],
)

# For Nucleotide Transformer DNA models
config = BioACDCConfig(
    seed_model_paths=[
        "InstaDeepAI/nucleotide-transformer-v2-500m-multi-species",
        "InstaDeepAI/nucleotide-transformer-v2-100m-multi-species",
    ],
)

Architecture Comparison: AC/DC vs. Bio-ACDC

| Feature | AC/DC (SakanaAI) | Bio-ACDC |
|---|---|---|
| Domain | General NLP (text) | Biological sequences |
| Seed models | Qwen/Llama LLMs | ESM-2, Nucleotide Transformer, and other protein/DNA LMs |
| Tasks | Synthetic code/math/text | Motif detection, sequence completion, structure prediction |
| Evaluation | LLM-as-judge + code sandbox | Sequence similarity, biological metrics |
| Merging | SLERP, Linear, Task Vectors | Same (adapted for masked LMs) |
| Archive | Dominated Novelty Search | Same |
| Task generation | LLM-based task creation | Rule-based + evolutionary targeting |

Key Biological Adaptations

  1. Sequence-Specific Tasks: Motifs, regulatory elements, structure prediction
  2. Token-Aware Evaluation: Handles the 20 standard amino acids and the 4 nucleotides
  3. Masked LM Support: Works with ESM-2 and Nucleotide Transformer (masked language models)
  4. Motif-Based Difficulty: Tasks target specific biological motifs
  5. Structure Evaluation: RNA secondary structure comparison
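
Because masked language models do not define a left-to-right likelihood, a common substitute for perplexity is pseudo-perplexity: mask one position at a time and score the true token. The sketch below illustrates the idea with a hypothetical `predict` callable that returns a token-to-probability mapping for the masked position; `uniform_predict` is a toy stand-in for a real model such as ESM-2, and none of these names come from Bio-ACDC.

```python
import math

def pseudo_perplexity(sequence, predict, mask_token="<mask>"):
    """Mask each position in turn and average the log-probability of the true token."""
    tokens = list(sequence)
    total_log_prob = 0.0
    for i, true_token in enumerate(tokens):
        masked = tokens[:i] + [mask_token] + tokens[i + 1:]
        probs = predict(masked, i)  # assumed to return {token: probability}
        total_log_prob += math.log(probs[true_token])
    return math.exp(-total_log_prob / len(tokens))

def uniform_predict(masked_tokens, position, alphabet="ACGT"):
    """Toy predictor that ignores context and spreads probability evenly."""
    return {token: 1.0 / len(alphabet) for token in alphabet}

print(pseudo_perplexity("ACGTTGCA", uniform_predict))
```

A predictor with no information scores a pseudo-perplexity equal to the alphabet size (4 for nucleotides, 20 for amino acids), which gives a natural baseline for comparing models across sequence types.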

Requirements

torch>=2.0
transformers>=4.30
datasets
numpy
safetensors
biopython  # Optional, for advanced sequence analysis

Citation

@software{bio_acdc_2024,
  title = {Bio-ACDC: Coevolution of Biological Language Models and Sequence Tasks},
  author = {Adapted from SakanaAI AC/DC},
  year = {2024},
  url = {https://acdc-llm.github.io}
}