Joint NT-ESM2 DNA-Protein Models

This repository contains jointly trained Nucleotide Transformer (NT) and ESM2 models for DNA-protein sequence analysis.

Model Components

DNA Model (dna/)

  • Type: Nucleotide Transformer for DNA sequences
  • Context: 4096 tokens (see the config check after these lists)
  • Training: Transcript-specific coding sequences

Protein Model (protein/)

  • Type: ESM2 for protein sequences
  • Variant: Large model
  • Training: Corresponding protein sequences
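Each submodel ships with a standard Hugging Face config in its subfolder. As a quick sanity check of the context length and model size, you can inspect the configs directly. A minimal sketch; the attribute names assume the usual NT/ESM2 config conventions and may differ in this repo:

from transformers import AutoConfig

# Load the per-modality configs from their subfolders
dna_config = AutoConfig.from_pretrained("vsubasri/joint-nt-esm2-transcript-coding", subfolder="dna")
protein_config = AutoConfig.from_pretrained("vsubasri/joint-nt-esm2-transcript-coding", subfolder="protein")

# Context length and hidden size (attribute names follow standard NT/ESM2 configs)
print(dna_config.model_type, getattr(dna_config, "max_position_embeddings", "n/a"))
print(protein_config.model_type, getattr(protein_config, "hidden_size", "n/a"))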

Usage

from transformers import AutoModel, AutoTokenizer

# Load DNA model
dna_model = AutoModel.from_pretrained("vsubasri/joint-nt-esm2-transcript-coding", subfolder="dna")
dna_tokenizer = AutoTokenizer.from_pretrained("vsubasri/joint-nt-esm2-transcript-coding", subfolder="dna")

# Load protein model
protein_model = AutoModel.from_pretrained("vsubasri/joint-nt-esm2-transcript-coding", subfolder="protein")  
protein_tokenizer = AutoTokenizer.from_pretrained("vsubasri/joint-nt-esm2-transcript-coding", subfolder="protein")

# Example joint usage: a coding sequence and its translated protein
dna_seq = "ATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGA"
protein_seq = "MKRISTTITTTITITTGNGAG"  # translation of dna_seq (stop codon dropped)

# Tokenize each modality with its own tokenizer
dna_inputs = dna_tokenizer(dna_seq, return_tensors="pt")
protein_inputs = protein_tokenizer(protein_seq, return_tensors="pt")

# Forward passes produce per-token hidden states
dna_outputs = dna_model(**dna_inputs)
protein_outputs = protein_model(**protein_inputs)
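
AutoModel returns per-token hidden states rather than a single vector. For sequence-level joint analysis, one common choice (not prescribed by this card) is masked mean pooling over the last hidden state. A minimal sketch:

import torch

def mean_pool(outputs, attention_mask):
    # Average token embeddings, ignoring padding positions
    hidden = outputs.last_hidden_state                   # (batch, seq_len, dim)
    mask = attention_mask.unsqueeze(-1).type_as(hidden)  # (batch, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

dna_embedding = mean_pool(dna_outputs, dna_inputs["attention_mask"])
protein_embedding = mean_pool(protein_outputs, protein_inputs["attention_mask"])

# The NT and ESM2 hidden sizes generally differ, so concatenate rather than compare directly
joint_embedding = torch.cat([dna_embedding, protein_embedding], dim=-1)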

Training Details

  • Joint Training: The DNA and protein models were trained together so their representations align across modalities (an illustrative sketch follows this list)
  • Batch Size: 8
  • Data: Transcript-specific coding sequences paired with their corresponding proteins
  • Architecture: The original NT and ESM2 architectures are unchanged
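
The joint objective is not spelled out in this card. Purely as an illustration, a CLIP-style contrastive alignment step over paired, already-pooled DNA/protein embeddings might look like the following; the loss, temperature, and shared embedding dimension are assumptions, not a description of how these weights were actually trained:

import torch
import torch.nn.functional as F

def contrastive_step(dna_emb, protein_emb, temperature=0.07):
    # dna_emb, protein_emb: (batch, dim) pooled embeddings projected to a shared dimension
    dna_emb = F.normalize(dna_emb, dim=-1)
    protein_emb = F.normalize(protein_emb, dim=-1)
    logits = dna_emb @ protein_emb.T / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric InfoNCE: each DNA sequence should match its own protein, and vice versa
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

# Toy example with the card's batch size of 8 and a hypothetical shared dimension of 512
loss = contrastive_step(torch.randn(8, 512, requires_grad=True),
                        torch.randn(8, 512, requires_grad=True))
loss.backward()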

Repository Structure

├── dna/                    # NT DNA model
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   ├── vocab.txt
│   └── special_tokens_map.json
├── protein/                # ESM2 protein model
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   ├── vocab.txt
│   └── special_tokens_map.json
└── joint_config.json       # Joint model configuration

Citation

If you use these models, please cite the original Nucleotide Transformer and ESM2 papers, along with this repository's joint training methodology.
