---
license: apache-2.0
language:
  - en
base_model:
  - microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext
pipeline_tag: fill-mask
datasets:
  - SantiagoSanchezF/mgnify_study_descriptions
---

# Model Card

We fine-tuned BiomedBERT with masked language modeling (MLM) on unlabelled study descriptions from metagenomic projects sourced from MGnify. Adapting the model to this domain-specific text helps it capture the language and nuances of metagenomics study descriptions, which improves performance on downstream biome classification tasks.


## Model Details

### Model Description

- **Developed by:** SantiagoSanchezF
- **Model type:** Masked language model (BERT-based)
- **Language(s) (NLP):** English
- **License:** apache-2.0
- **Finetuned from model:** microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext

## Downstream Use

This model is the base of SantiagoSanchezF/trapiche-biome-classifier.

## Training Details

### Training Data

The training corpus is SantiagoSanchezF/mgnify_study_descriptions, a collection of study descriptions from metagenomic projects in MGnify.

### Training Procedure

The model was domain adapted by applying masked language modeling (MLM) to a corpus of study descriptions derived from metagenomic projects in MGnify. The input text was tokenized with a maximum sequence length of 256 tokens. A data collator was configured to randomly mask 15% of the input tokens for the MLM task. Training was performed with a batch size of 8, over 3 epochs, and with a learning rate of 5e-5.
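The core of the MLM collation step described above can be sketched as follows. This is a minimal illustration, not the training code: `MASK_TOKEN_ID` and the special-token ids are illustrative BERT-style values, and the helper name `mask_tokens` is invented for this sketch.

```python
import random

MASK_TOKEN_ID = 103      # [MASK] id in BERT-style vocabularies (assumption)
MLM_PROBABILITY = 0.15   # masking rate used for this model

def mask_tokens(token_ids, special_ids=frozenset({101, 102}), rng=None):
    """Randomly replace ~15% of non-special tokens with [MASK].

    Returns (masked_ids, labels): labels hold the original id at masked
    positions and -100 (the ignore index for the loss) everywhere else,
    so the loss is computed only on masked tokens.
    """
    rng = rng or random.Random()
    masked, labels = [], []
    for tid in token_ids:
        if tid not in special_ids and rng.random() < MLM_PROBABILITY:
            masked.append(MASK_TOKEN_ID)
            labels.append(tid)
        else:
            masked.append(tid)
            labels.append(-100)
    return masked, labels
```

In practice this behavior is provided by `DataCollatorForLanguageModeling` from the `transformers` library (with `mlm_probability=0.15`), which additionally applies BERT's 80/10/10 mask/random/keep split rather than masking every selected token.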

## Citation

TBD