Bio-ACDC: Biological Sequence Model Coevolution
An adaptation of AC/DC (Assessment Coevolving with Diverse Capabilities) for biological language models.
Overview
Bio-ACDC coevolves populations of biological language models (for DNA, RNA, and protein sequences) with synthetic sequence tasks to discover specialized expert models.
Core Components
1. Coevolution Loop (core.py)
- Initialization: Seed models are evaluated on base tasks
- Offspring Generation: Parents are merged and optionally mutated
- Evaluation: New models are tested on the current task pool
- Archive Update: Dominated Novelty Search (DNS) maintains a diverse Pareto archive
- Task Generation: New tasks target weaknesses discovered in the archive
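The five steps above can be sketched as a single loop. This is a minimal, dependency-free illustration and not the code in core.py: the function name `coevolve` and every callable it takes (`evaluate`, `merge`, `mutate`, `select`, `generate_tasks`) are hypothetical stand-ins, and the "models" in the toy usage are plain floats.

```python
import random

def coevolve(seeds, tasks, evaluate, merge, mutate, select, generate_tasks,
             num_generations=3, offspring_per_gen=2, seed=0):
    """Toy coevolution loop; all callables are hypothetical stand-ins."""
    rng = random.Random(seed)
    tasks = list(tasks)
    # 1. Initialization: score the seed models on the base tasks.
    archive = [(model, evaluate(model, tasks)) for model in seeds]
    for _ in range(num_generations):
        # 2. Offspring generation: merge two archive members, then mutate.
        children = []
        for _ in range(offspring_per_gen):
            (parent_a, _), (parent_b, _) = rng.sample(archive, 2)
            children.append(mutate(merge(parent_a, parent_b), rng))
        # 3. Evaluation: score the new models on the current task pool.
        scored = [(child, evaluate(child, tasks)) for child in children]
        # 4. Archive update: keep a bounded subset (must keep at least two).
        archive = select(archive + scored)
        # 5. Task generation: extend the pool, then rescore the archive on it.
        new_tasks = generate_tasks(archive, tasks)
        if new_tasks:
            tasks.extend(new_tasks)
            archive = [(model, evaluate(model, tasks)) for model, _ in archive]
    return archive, tasks

# Toy usage: a "model" is a float, a "task" is a threshold it must reach.
def toy_evaluate(model, tasks):
    return [1.0 if model >= threshold else 0.0 for threshold in tasks]

def toy_merge(a, b):
    return (a + b) / 2

def toy_mutate(model, rng):
    return model + rng.gauss(0.0, 0.1)

def toy_select(population):
    return sorted(population, key=lambda entry: sum(entry[1]), reverse=True)[:3]

def toy_generate_tasks(archive, tasks):
    # Add a harder task once every archive member passes the newest one.
    if all(scores[-1] == 1.0 for _, scores in archive):
        return [max(tasks) + 0.5]
    return []

archive, tasks = coevolve([0.0, 1.0], [0.5], toy_evaluate, toy_merge,
                          toy_mutate, toy_select, toy_generate_tasks)
```

Rescoring the archive whenever the task pool grows keeps all score vectors the same length, which the selection step relies on.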
2. Task Pool (tasks.py)
Generates and manages biological sequence tasks:
- Protein tasks: Motif recognition, sequence completion, structure prediction
- DNA tasks: Regulatory element detection, motif localization
- RNA tasks: Secondary structure prediction, motif finding
New tasks are generated automatically to target weaknesses found in the archive.
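As a rough illustration of what a generated task and weakness targeting could look like, the sketch below plants a motif in a random DNA background and picks the task family with the lowest mean archive score. The names `make_motif_task` and `weakest_kind`, and the dictionary layout of a task, are assumptions made for this example and are not taken from tasks.py.

```python
import random

DNA_ALPHABET = "ACGT"

def make_motif_task(motif, length, rng, alphabet=DNA_ALPHABET):
    """Build a motif-localization task: random background with one planted motif."""
    if len(motif) > length:
        raise ValueError("motif is longer than the requested sequence")
    background = [rng.choice(alphabet) for _ in range(length)]
    start = rng.randrange(length - len(motif) + 1)
    background[start:start + len(motif)] = list(motif)
    # Note: the motif may also occur elsewhere by chance; `start` is the planted copy.
    return {
        "kind": "motif_localization",
        "sequence": "".join(background),
        "motif": motif,
        "start": start,
    }

def weakest_kind(scores_by_kind):
    """Return the task kind with the lowest mean score across archive models."""
    return min(scores_by_kind,
               key=lambda kind: sum(scores_by_kind[kind]) / len(scores_by_kind[kind]))

rng = random.Random(42)
task = make_motif_task("TATAAA", 40, rng)
target = weakest_kind({"motif_localization": [0.9, 0.8],
                       "structure_prediction": [0.2, 0.4]})
```

A generator along these lines could then sample more tasks of the `target` kind, or raise their difficulty, in the next generation.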
3. Model Merging (mergers.py)
Multiple merging strategies:
- Linear Merge: Weighted average of parent parameters
- SLERP: Spherical linear interpolation
- Task Vector: Add each parent's difference from a shared base model back onto that base
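The three strategies reduce to short formulas on parameter vectors. The sketch below uses flat Python lists in place of tensors so it runs without dependencies; the function names are illustrative and do not come from mergers.py. In practice each formula would be applied tensor by tensor, which assumes the parents share the same architecture and parameter shapes.

```python
import math

def linear_merge(a, b, alpha=0.5):
    """Weighted average: (1 - alpha) * a + alpha * b."""
    return [(1 - alpha) * x + alpha * y for x, y in zip(a, b)]

def slerp(a, b, t=0.5, eps=1e-8):
    """Spherical linear interpolation between two parameter vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a < eps or norm_b < eps:
        return linear_merge(a, b, t)
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    omega = math.acos(max(-1.0, min(1.0, cos_omega)))
    sin_omega = math.sin(omega)
    if abs(sin_omega) < eps:
        # Nearly parallel or anti-parallel vectors: fall back to linear merging.
        return linear_merge(a, b, t)
    weight_a = math.sin((1 - t) * omega) / sin_omega
    weight_b = math.sin(t * omega) / sin_omega
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]

def task_vector_merge(base, experts, scale=1.0):
    """Add the scaled sum of (expert - base) differences back onto the base."""
    merged = list(base)
    for expert in experts:
        for i, (e, b) in enumerate(zip(expert, base)):
            merged[i] += scale * (e - b)
    return merged

averaged = linear_merge([0.0, 2.0], [2.0, 0.0])
halfway = slerp([1.0, 0.0], [0.0, 1.0], t=0.5)
combined = task_vector_merge([1.0, 1.0], [[2.0, 1.0], [1.0, 3.0]])
```

Unlike the linear average, SLERP between two unit vectors returns another unit vector, which is one common reason for preferring it when parents point in different directions.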
4. Mutation (mutators.py)
Controlled perturbation operators:
- Gaussian Noise: Add random noise to weights
- Layer Scale: Randomly scale specific layers
- Dropout: Structured pruning of weights
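A minimal sketch of the three operators, assuming a model is represented as a dictionary that maps layer names to flat lists of weights. The function names, default ranges, and this representation are assumptions for illustration, not the interface of mutators.py. The dropout variant here zeroes individual weights for simplicity; a structured version would zero whole rows, heads, or layers instead.

```python
import random

def gaussian_noise(weights, std, rng):
    """Add zero-mean Gaussian noise with standard deviation `std` to every weight."""
    return {name: [w + rng.gauss(0.0, std) for w in values]
            for name, values in weights.items()}

def layer_scale(weights, rng, prob=0.5, low=0.9, high=1.1):
    """With probability `prob` per layer, multiply that layer by a random factor."""
    mutated = {}
    for name, values in weights.items():
        factor = rng.uniform(low, high) if rng.random() < prob else 1.0
        mutated[name] = [w * factor for w in values]
    return mutated

def weight_dropout(weights, rate, rng):
    """Zero each weight independently with probability `rate` (unstructured)."""
    return {name: [0.0 if rng.random() < rate else w for w in values]
            for name, values in weights.items()}

rng = random.Random(0)
parent = {"layer0": [1.0, 2.0], "layer1": [3.0]}
noisy = gaussian_noise(parent, std=0.01, rng=rng)
scaled = layer_scale(parent, rng)
pruned = weight_dropout(parent, rate=1.0, rng=rng)
```

Each operator returns a new dictionary and leaves the parent untouched, so the same parent can be reused for several offspring.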
5. Archive (archive.py)
Dominated Novelty Search maintains a Pareto archive:
- Maximizes fitness + novelty
- Novelty based on unique capabilities vs fitter solutions
- Difficulty-aware weighting
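One way to read the three points above is sketched here; it is an interpretation, not the logic in archive.py. Each model is described by a pass/fail vector over tasks, fitness is the difficulty-weighted number of passes, and novelty is the mean weighted count of tasks the model solves that its `k` closest fitter rivals do not. Models with no fitter rival always survive. The function names and the parameter `k` are hypothetical.

```python
def dominated_novelty(passes, difficulty, k=2):
    """Return (fitness, novelty) dicts for models given per-task pass vectors."""
    fitness = {model: sum(d * p for d, p in zip(difficulty, vector))
               for model, vector in passes.items()}
    novelty = {}
    for model, vector in passes.items():
        fitter = [other for other in passes if fitness[other] > fitness[model]]
        if not fitter:
            # Nothing is fitter than this model, so it is never crowded out.
            novelty[model] = float("inf")
            continue
        # Difficulty-weighted tasks this model solves that a fitter rival fails.
        gaps = sorted(
            sum(d for d, mine, theirs in zip(difficulty, vector, passes[other])
                if mine and not theirs)
            for other in fitter
        )
        nearest = gaps[:k]
        novelty[model] = sum(nearest) / len(nearest)
    return fitness, novelty

def select_archive(passes, difficulty, size, k=2):
    """Keep the `size` models with the highest dominated-novelty score."""
    _, novelty = dominated_novelty(passes, difficulty, k)
    return sorted(passes, key=lambda model: novelty[model], reverse=True)[:size]

passes = {
    "A": [1, 1, 1, 0],  # strongest overall
    "B": [1, 1, 0, 0],  # solves a subset of what A solves
    "C": [0, 0, 0, 1],  # weak overall, but solves the hard task A misses
}
difficulty = [1, 1, 1, 2]
fitness, novelty = dominated_novelty(passes, difficulty)
kept = select_archive(passes, difficulty, size=2)
```

In this toy case B and C have equal fitness, but C is kept because it contributes a capability that the fitter model A lacks, while B does not.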
6. Evaluator (evaluator.py)
Evaluates models on biological tasks:
- Sequence identity/similarity
- Motif containment
- Perplexity
- RNA structure prediction
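The metrics listed above can each be written in a few lines. The following is a sketch under stated assumptions rather than the code in evaluator.py: identity is computed position by position without alignment, perplexity takes precomputed per-token natural-log probabilities, and RNA structures are compared as sets of base pairs parsed from dot-bracket strings. All function names are illustrative.

```python
import math

def sequence_identity(predicted, reference):
    """Fraction of matching positions, normalized by the longer sequence."""
    if not predicted and not reference:
        return 1.0
    matches = sum(a == b for a, b in zip(predicted, reference))
    return matches / max(len(predicted), len(reference))

def contains_motif(sequence, motif):
    """True if the exact motif occurs anywhere in the sequence."""
    return motif in sequence

def perplexity(token_log_probs):
    """exp of the negative mean log-probability (natural log) over tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def base_pairs(dot_bracket):
    """Parse a dot-bracket string into a set of (open_index, close_index) pairs."""
    stack, pairs = [], set()
    for index, char in enumerate(dot_bracket):
        if char == "(":
            stack.append(index)
        elif char == ")":
            if not stack:
                raise ValueError("unbalanced structure: unexpected ')'")
            pairs.add((stack.pop(), index))
    if stack:
        raise ValueError("unbalanced structure: unclosed '('")
    return pairs

def base_pair_f1(predicted, reference):
    """F1 score between predicted and reference base-pair sets."""
    pred_pairs, ref_pairs = base_pairs(predicted), base_pairs(reference)
    if not pred_pairs and not ref_pairs:
        return 1.0
    true_positives = len(pred_pairs & ref_pairs)
    return 2 * true_positives / (len(pred_pairs) + len(ref_pairs))

identity = sequence_identity("ACGT", "ACGA")
has_motif = contains_motif("AUGGAGCU", "GGAG")
uniform_ppl = perplexity([math.log(0.25)] * 4)
structure_f1 = base_pair_f1("((..))", "(....)")
```

For masked language models such as ESM-2, the log-probabilities would typically come from masking one position at a time, which gives a pseudo-perplexity rather than an autoregressive one.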
Usage
Quick Start
from bio_acdc import BioACDC, BioACDCConfig
from bio_acdc.tasks import BioTaskPool
from bio_acdc.mergers import LinearMerge
from bio_acdc.mutators import GaussianNoiseMutator
from bio_acdc.evaluator import BioEvaluator
# Configuration
config = BioACDCConfig(
    seed_model_paths=[
        "facebook/esm2_t33_650M_UR50D",
        "InstaDeepAI/nucleotide-transformer-v2-500m-multi-species",
    ],
    archive_size=20,
    num_generations=10,
    offspring_per_gen=5,
    output_dir="./bio_acdc_output",
)
# Components
task_pool = BioTaskPool(seed=42)
evaluator = BioEvaluator()
merger = LinearMerge()
mutator = GaussianNoiseMutator(std=0.01)
# Create Bio-ACDC
bio_acdc = BioACDC(
    config=config,
    task_pool=task_pool,
    evaluator=evaluator,
    merger=merger,
    mutator=mutator,
)
# Run evolution
final_archive = bio_acdc.evolve()
# Best model
best = bio_acdc.archive.get_best()
print(f"Best model: {best.model_path}, Fitness: {best.fitness:.4f}")
Using with Custom Models
# For ESM-2 protein models
config = BioACDCConfig(
    seed_model_paths=[
        "facebook/esm2_t33_650M_UR50D",
        "facebook/esm2_t30_150M_UR50D",
    ],
)
# For Nucleotide Transformer DNA models
config = BioACDCConfig(
    seed_model_paths=[
        "InstaDeepAI/nucleotide-transformer-v2-500m-multi-species",
        "InstaDeepAI/nucleotide-transformer-v2-100m-multi-species",
    ],
)
Architecture Comparison: AC/DC vs Bio-ACDC
| Feature | AC/DC (SakanaAI) | Bio-ACDC |
|---|---|---|
| Domain | General NLP (text) | Biological sequences |
| Seed Models | Qwen/Llama LLMs | ESM-2, Nucleotide Transformer (protein/DNA LMs) |
| Tasks | Synthetic code/math/text | Motif detection, sequence completion, structure prediction |
| Evaluation | LLM-as-judge + code sandbox | Sequence similarity, biological metrics |
| Merging | SLERP, Linear, Task Vectors | Same (adapted for masked LMs) |
| Archive | Dominated Novelty Search | Same |
| Task Generation | LLM-based task creation | Rule-based + evolutionary targeting |
Key Biological Adaptations
- Sequence-Specific Tasks: Motifs, regulatory elements, structure prediction
- Token-Aware Evaluation: Handles the amino-acid alphabet (20 standard residues) and the nucleotide alphabet (4 bases)
- Masked LM Support: Works with ESM-2 and Nucleotide Transformer (masked language models)
- Motif-Based Difficulty: Tasks target specific biological motifs
- Structure Evaluation: RNA secondary structure comparison
Requirements
torch>=2.0
transformers>=4.30
datasets
numpy
safetensors
biopython # Optional, for advanced sequence analysis
Citation
@software{bio_acdc_2024,
  title = {Bio-ACDC: Coevolution of Biological Language Models and Sequence Tasks},
  author = {Adapted from SakanaAI AC/DC},
  year = {2024},
  url = {https://acdc-llm.github.io}
}