
lit2vec-tldr-bart (DistilBART fine-tuned for chemistry TL;DRs)

lit2vec-tldr-bart is a DistilBART model fine-tuned on 19,992 CC-BY-licensed chemistry abstracts to produce concise TL;DR-style summaries aligned with methods → results → significance. It's designed for scientific abstractive summarization, semantic indexing, and knowledge-graph population in chemistry and related fields.


🧪 Evaluation (held-out test)

| Split | ROUGE-1 | ROUGE-2 | ROUGE-Lsum |
|-------|---------|---------|------------|
| Test  | 56.11   | 30.78   | 45.43      |

Validation ROUGE-Lsum: 46.05.
Metrics were computed with the evaluate library's rouge metric (NLTK sentence segmentation, use_stemmer=True).
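
For reference, a minimal sketch of this scoring setup (the exact evaluation script is not part of this card; the newline-joined sentence splitting follows the usual ROUGE-Lsum convention):

import evaluate
import nltk

nltk.download("punkt", quiet=True)  # NLTK sentence tokenizer used before ROUGE-Lsum scoring
rouge = evaluate.load("rouge")

def to_sentences(text):
    # ROUGE-Lsum expects newline-separated sentences
    return "\n".join(nltk.sent_tokenize(text))

predictions = [to_sentences(s) for s in ["Generated TL;DR ..."]]
references = [to_sentences(s) for s in ["Reference TL;DR ..."]]

scores = rouge.compute(predictions=predictions, references=references, use_stemmer=True)
print({k: round(v * 100, 2) for k, v in scores.items()})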


🚀 Quickstart

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, GenerationConfig

repo = "Bocklitz-Lab/lit2vec-tldr-bart"

tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSeq2SeqLM.from_pretrained(repo)
gen = GenerationConfig.from_pretrained(repo)  # loads default decoding params

text = "Proton exchange membrane fuel cells convert chemical energy into electricity..."
inputs = tok(text, return_tensors="pt", truncation=True, max_length=1024)

summary_ids = model.generate(**inputs, generation_config=gen)
print(tok.decode(summary_ids[0], skip_special_tokens=True))

Batch inference (PyTorch)

texts = [
  "Abstract 1 ...",
  "Abstract 2 ...",
]
batch = tok(texts, return_tensors="pt", padding=True, truncation=True, max_length=1024)
out = model.generate(**batch, generation_config=gen)
summaries = tok.batch_decode(out, skip_special_tokens=True)
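
On a GPU, the same batch example can be sped up by moving the model and tensors to CUDA; a minimal sketch (device handling is not part of the original example):

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

batch = {k: v.to(device) for k, v in batch.items()}  # move input tensors to the same device
out = model.generate(**batch, generation_config=gen)
summaries = tok.batch_decode(out, skip_special_tokens=True)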

🔧 Default decoding (saved in generation_config.json)

These are the defaults saved with the model; any of them can be overridden at generate() time (see the example after the JSON below):

{
  "max_length": 142,
  "min_length": 56,
  "early_stopping": true,
  "num_beams": 4,
  "length_penalty": 2.0,
  "no_repeat_ngram_size": 3,
  "forced_bos_token_id": 0,
  "forced_eos_token_id": 2
}
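
For example, a single call can use different decoding settings without touching the saved config (the values below are purely illustrative):

# Override the saved defaults for one call; illustrative values only
short_ids = model.generate(
    **inputs,
    generation_config=gen,
    num_beams=6,
    max_length=96,
    length_penalty=1.0,
)
print(tok.decode(short_ids[0], skip_special_tokens=True))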

📊 Training details

  • Base: sshleifer/distilbart-cnn-12-6 (Distilled BART)
  • Data: 19,992 CC-BY chemistry abstracts with TL;DR summaries
  • Splits: train=17,992 / val=999 / test=1,001
  • Max lengths: input 1024, target 128
  • Optimizer: AdamW, lr=2e-5
  • Batching: per-device train/eval batch size 4, gradient_accumulation_steps=4
  • Epochs: 5
  • Precision: fp16 (when CUDA available)
  • Hardware: single NVIDIA RTX 3090
  • Seed: 42
  • Libraries: 🤗 Transformers + Datasets, evaluate for ROUGE, NLTK for sentence splitting (the hyperparameters above are sketched as training arguments below)
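
A hedged sketch of how these hyperparameters map onto 🤗 Transformers Seq2SeqTrainingArguments (the original training script is not included in this card, so the output directory and any unlisted options are placeholders or library defaults):

from transformers import Seq2SeqTrainingArguments

# Mirrors the hyperparameters listed above; everything else is a default or placeholder.
training_args = Seq2SeqTrainingArguments(
    output_dir="lit2vec-tldr-bart",   # placeholder path
    learning_rate=2e-5,               # AdamW is the Trainer default optimizer
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=5,
    fp16=True,                        # enable only when CUDA is available
    seed=42,
    predict_with_generate=True,
)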

✅ Intended use

  • TL;DR abstractive summaries for chemistry and adjacent domains (materials science, chemical engineering, environmental science).
  • Semantic indexing, IR reranking, and knowledge graph ingestion where concise method/result statements are helpful.

Limitations & risks

  • May hallucinate details not present in the abstract (typical for abstractive models).
  • Not a substitute for expert judgment; avoid using summaries as sole evidence for scientific claims.
  • Trained on CC-BY English abstracts; performance may degrade on other domains/languages.

📦 Files

This repository contains:

  • config.json, pytorch_model.bin or model.safetensors
  • tokenizer.json, tokenizer_config.json, special_tokens_map.json, merges/vocab as applicable
  • generation_config.json (decoding defaults)

πŸ” Reproducibility

  • Dataset: Bocklitz-Lab/lit2vec-tldr-bart-dataset
  • Recommended preprocessing: truncate inputs at 1024 tokens and targets at 128 (see the sketch after this list).
  • ROUGE evaluation: evaluate.load("rouge"), NLTK sentence tokenization, use_stemmer=True.
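
A minimal sketch of loading the dataset and applying the recommended truncation (the column names "abstract" and "tldr" are assumptions and may differ from the actual schema; check the dataset card):

from datasets import load_dataset
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Bocklitz-Lab/lit2vec-tldr-bart")
ds = load_dataset("Bocklitz-Lab/lit2vec-tldr-bart-dataset")

def preprocess(batch):
    # Column names are assumptions; adjust to the dataset's actual fields.
    model_inputs = tok(batch["abstract"], max_length=1024, truncation=True)
    labels = tok(text_target=batch["tldr"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = ds.map(preprocess, batched=True)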

📚 Citation

If you use this model or dataset, please cite:

@software{lit2vec_tldr_bart_2025,
  title   = {lit2vec-tldr-bart: DistilBART fine-tuned for chemistry TL;DR summarization},
  author  = {Bocklitz Lab},
  year    = {2025},
  url     = {https://huggingface.co/Bocklitz-Lab/lit2vec-tldr-bart},
  note    = {Model trained on CC-BY chemistry abstracts; dataset at Bocklitz-Lab/lit2vec-tldr-bart-dataset}
}

Dataset:

@dataset{lit2vec_tldr_dataset_2025,
  title   = {Lit2Vec TL;DR Chemistry Dataset},
  author  = {Bocklitz Lab},
  year    = {2025},
  url     = {https://huggingface.co/datasets/Bocklitz-Lab/lit2vec-tldr-bart-dataset}
}

πŸ“ License

  • Model weights & code: Apache-2.0
  • Dataset: CC BY 4.0 (attribution in per-record metadata)

🙌 Acknowledgements

  • Base model: DistilBART (sshleifer/distilbart-cnn-12-6)
  • Licensing and OA links curated from publisher/aggregator sources; dataset restricted to CC-BY content.