---
license: apache-2.0
language:
  - en
base_model:
  - microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext
pipeline_tag: fill-mask
datasets:
  - SantiagoSanchezF/mgnify_study_descriptions
---

# Model Card

We fine-tuned BiomedBERT with masked language modeling (MLM) on unlabelled study descriptions from metagenomic projects sourced from MGnify. Adapting the model to this domain-specific text helps it capture the language and nuances of metagenomics study descriptions, which improves performance on downstream biome classification tasks.


## Model Details

### Model Description

- **Developed by:** SantiagoSanchezF
- **Model type:** Masked language model (BERT-based)
- **Language(s) (NLP):** English
- **License:** apache-2.0
- **Finetuned from model:** microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext

## Downstream Use

This model is the base of SantiagoSanchezF/trapiche-biome-classifier.

## Training Details

### Training Data

The training corpus is SantiagoSanchezF/mgnify_study_descriptions, a collection of study descriptions from metagenomic projects in MGnify.

### Training Procedure

The model was domain adapted by applying masked language modeling (MLM) to a corpus of study descriptions derived from metagenomic projects in MGnify. The input text was tokenized with a maximum sequence length of 256 tokens. A data collator was configured to randomly mask 15% of the input tokens for the MLM task. Training was performed with a batch size of 8, over 3 epochs, and with a learning rate of 5e-5.
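The core of the MLM collation step described above can be sketched as follows. This is a minimal illustration, not the training code: `MASK_TOKEN_ID` and the special-token ids are illustrative BERT-style values, and the helper name `mask_tokens` is invented for this sketch.

```python
import random

MASK_TOKEN_ID = 103      # [MASK] id in BERT-style vocabularies (assumption)
MLM_PROBABILITY = 0.15   # masking rate used for this model

def mask_tokens(token_ids, special_ids=frozenset({101, 102}), rng=None):
    """Randomly replace ~15% of non-special tokens with [MASK].

    Returns (masked_ids, labels): labels hold the original id at masked
    positions and -100 (the ignore index for the loss) everywhere else,
    so the loss is computed only on masked tokens.
    """
    rng = rng or random.Random()
    masked, labels = [], []
    for tid in token_ids:
        if tid not in special_ids and rng.random() < MLM_PROBABILITY:
            masked.append(MASK_TOKEN_ID)
            labels.append(tid)
        else:
            masked.append(tid)
            labels.append(-100)
    return masked, labels
```

In practice this behavior is provided by `DataCollatorForLanguageModeling` from the `transformers` library (with `mlm_probability=0.15`), which additionally applies BERT's 80/10/10 mask/random/keep split rather than masking every selected token.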

## Citation

TBD