Model documentation & parameters

Algorithm version: The model version to use. Note that any Hugging Face (HF) model can be wrapped as a KeyBERT model.

Text: The main text prompt to "understand", i.e., the document from which keywords are extracted.

Minimum keyphrase ngram: Lower bound for phrase size. Each keyword will have at least this many words.

Maximum keyphrase ngram: Upper bound for phrase size. Each keyword will have at most this many words.

Stop words: Stop words to remove from the document. If not provided, no stop words are removed.
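The interplay of the two n-gram bounds and the stop word list can be illustrated with a minimal, dependency-free sketch. This is not KeyBERT's own code: the function name `candidate_phrases`, the simple regex tokenizer, and the choice to drop stop words before building n-grams are assumptions made for illustration only.

```python
import re


def candidate_phrases(text, min_n=1, max_n=2, stop_words=None):
    """Return unique candidate phrases containing between min_n and max_n words.

    Hypothetical helper for illustration; the real app delegates this step
    to the underlying library.
    """
    stop = {w.lower() for w in (stop_words or [])}
    # Lowercase, split into word tokens, then drop stop words (assumed order).
    tokens = [t for t in re.findall(r"\w+", text.lower()) if t not in stop]
    phrases = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            if phrase not in phrases:  # keep first occurrence only
                phrases.append(phrase)
    return phrases


phrases = candidate_phrases(
    "The model extracts keywords from the text",
    min_n=1,
    max_n=2,
    stop_words=["the", "from"],
)
print(phrases)
```

With a minimum of 1 and a maximum of 2, every candidate in this sketch has one or two words, and none of them contains a stop word.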

Use MaxSum: Controls whether Max Sum Similarity is applied to the generated keywords. To diversify the results, we take the 2 x MaxSum candidates words/phrases that are most similar to the document. Then, we take all top_n combinations from those 2 x MaxSum candidates (top_n being the number of keywords) and extract the combination whose members are the least similar to each other by cosine similarity.

MaxSum candidates: Number of candidates considered when Use MaxSum is enabled.
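The two-step Max Sum selection described above can be sketched in plain Python. This is a simplified illustration, not the library's implementation: the function name `max_sum_keywords`, the parameter `pool_size` (the candidate pool is passed in directly; how the app derives it from MaxSum candidates is not modeled), and the toy 2-D vectors standing in for BERT embeddings are all assumptions.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def max_sum_keywords(doc_vec, phrase_vecs, top_n, pool_size):
    """Pick top_n phrases from the pool_size most relevant ones so that
    the selected phrases are as dissimilar to each other as possible."""
    # Step 1: keep the pool_size phrases most similar to the document.
    ranked = sorted(
        phrase_vecs,
        key=lambda p: cosine(doc_vec, phrase_vecs[p]),
        reverse=True,
    )
    pool = ranked[:pool_size]

    # Step 2: among all top_n-sized combinations of the pool, choose the
    # one with the lowest summed pairwise cosine similarity.
    def pairwise_sum(combo):
        return sum(
            cosine(phrase_vecs[a], phrase_vecs[b])
            for a, b in combinations(combo, 2)
        )

    return list(min(combinations(pool, top_n), key=pairwise_sum))


# Toy embeddings (assumed values, for illustration only).
doc = (1.0, 1.0)
vectors = {
    "neural network": (1.0, 0.9),
    "neural networks": (1.0, 0.8),
    "deep learning": (0.5, 1.0),
    "weather": (-1.0, 0.2),
}
max_sum_result = max_sum_keywords(doc, vectors, top_n=2, pool_size=3)
print(max_sum_result)
```

Because every combination in the pool is scored, the cost grows quickly with the pool size, which is one reason to keep the number of candidates small.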

Use Max. marginal relevance: To diversify the results, we can use Maximal Marginal Relevance (MMR) to create keywords / keyphrases. Like MaxSum, it is based on cosine similarity.

Diversity: How strongly the results are diversified when Use Max. marginal relevance is enabled.

Number of keywords: How many keywords should be generated (maximum 50).
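How the diversity value trades relevance against redundancy can be shown with a small MMR sketch. It is an illustration under stated assumptions rather than the library's code: the function name `mmr_keywords`, the toy 2-D vectors standing in for BERT embeddings, and the scoring rule `(1 - diversity) * relevance - diversity * redundancy` are chosen here for clarity.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mmr_keywords(doc_vec, phrase_vecs, top_n, diversity):
    """Greedy Maximal Marginal Relevance selection (hypothetical helper)."""
    remaining = list(phrase_vecs)
    # Start with the phrase most similar to the document.
    first = max(remaining, key=lambda p: cosine(doc_vec, phrase_vecs[p]))
    selected = [first]
    remaining.remove(first)

    while remaining and len(selected) < top_n:
        def score(p):
            relevance = cosine(doc_vec, phrase_vecs[p])
            # Similarity to the closest keyword that is already selected.
            redundancy = max(
                cosine(phrase_vecs[p], phrase_vecs[s]) for s in selected
            )
            return (1 - diversity) * relevance - diversity * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy embeddings (assumed values, for illustration only).
doc = (1.0, 1.0)
vectors = {
    "neural network": (1.0, 0.9),
    "neural networks": (1.0, 0.8),
    "deep learning": (0.5, 1.0),
}
low_diversity = mmr_keywords(doc, vectors, top_n=2, diversity=0.0)
high_diversity = mmr_keywords(doc, vectors, top_n=2, diversity=0.7)
print(low_diversity)
print(high_diversity)
```

In this sketch, a diversity of 0 ranks purely by similarity to the document, so near-duplicate phrases can both be returned, while a higher value penalizes phrases that resemble keywords already chosen.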

Model card -- KeywordBERT

Model Details: KeyBERT is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to create keywords and keyphrases that are most similar to a document.

Developers: Maarten Grootendorst.

Distributors: Original developer's code from https://github.com/MaartenGr/KeyBERT.

Model date: 2020.

Model type: Different BERT and SciBERT models, trained on CIRCA data.

Information about training algorithms, parameters, fairness constraints or other applied approaches, and features: N.A.

Paper or other resource for more information: The KeyBERT GitHub repo.

License: MIT

Where to send questions or comments about the model: Open an issue on the GT4SD repository.

Intended Use. Use cases that were envisioned during development: N.A.

Primary intended uses/users: N.A.

Out-of-scope use cases: Production-level inference.

Metrics: N.A.

Datasets: N.A.

Ethical Considerations: Unclear, please consult with original authors in case of questions.

Caveats and Recommendations: Unclear, please consult with original authors in case of questions.

Model card prototype inspired by Mitchell et al. (2019)

Citation

@misc{grootendorst2020keybert,
  author       = {Maarten Grootendorst},
  title        = {KeyBERT: Minimal keyword extraction with BERT.},
  year         = 2020,
  publisher    = {Zenodo},
  version      = {v0.3.0},
  doi          = {10.5281/zenodo.4461265},
  url          = {https://doi.org/10.5281/zenodo.4461265}
}