# Model documentation & parameters

**Algorithm version**: The model version to use. Note that *any* Hugging Face (HF) model can be wrapped into a `KeyBERT` model.

**Text**: The input document to "understand", i.e., the text from which keywords are generated.

**Minimum keyphrase ngram**: Lower bound for phrase size. Each keyphrase will contain at least this many words.

**Maximum keyphrase ngram**: Upper bound for phrase size. Each keyphrase will contain at most this many words.

**Stop words**: Stopwords to remove from the document. If not provided, no stop-word removal is performed.

**Use MaxSum**: Controls whether max sum similarity is used to diversify the generated keywords. When enabled, the `2 x MaxSum candidates` words/phrases most similar to the document are retrieved; among all `top_n`-sized combinations of these candidates, the combination whose members are least similar to each other (by cosine similarity) is extracted.

**MaxSum candidates**: Number of candidates considered when `Use MaxSum` is enabled.
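The MaxSum heuristic described above can be sketched in plain Python. This is an illustrative toy (the function name, the list-of-floats embedding format, and the example vectors are all assumptions for demonstration; KeyBERT computes embeddings with a transformer model):

```python
from itertools import combinations
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def max_sum_selection(doc_emb, cand_embs, candidates, top_n, nr_candidates):
    """Pick top_n keywords that are relevant to the document yet
    mutually dissimilar (max sum similarity heuristic)."""
    # 1. Keep the nr_candidates candidates most similar to the document.
    sims = [cosine(doc_emb, c) for c in cand_embs]
    pool = sorted(range(len(candidates)),
                  key=lambda i: sims[i], reverse=True)[:nr_candidates]
    # 2. Among all top_n-sized combinations of the pool, choose the one
    #    whose members have the smallest pairwise-similarity sum.
    best, best_score = None, float("inf")
    for combo in combinations(pool, top_n):
        score = sum(cosine(cand_embs[i], cand_embs[j])
                    for i, j in combinations(combo, 2))
        if score < best_score:
            best, best_score = combo, score
    return [candidates[i] for i in best]
```

With a document embedding of `[1.0, 0.0]` and candidate embeddings `[1.0, 0.0]`, `[1.0, 0.01]`, `[0.0, 1.0]`, the two near-duplicate candidates are never returned together: the selected pair is the first and third, the least mutually similar combination.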

**Use Max. marginal relevance**: To diversify the results, we can use Maximal Marginal Relevance (MMR) to create keywords/keyphrases, which is also based on cosine similarity.

**Diversity**: Diversity of the results when `Use Max. marginal relevance` is enabled.
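MMR can likewise be sketched as a greedy loop: at each step, score every remaining candidate by its similarity to the document minus its similarity to the keywords already picked, weighted by the diversity parameter. Again a toy sketch under assumed names and embedding format, not KeyBERT's actual implementation:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def mmr_selection(doc_emb, cand_embs, candidates, top_n, diversity):
    """Greedy Maximal Marginal Relevance: trade off similarity to the
    document against similarity to keywords already selected."""
    doc_sims = [cosine(doc_emb, c) for c in cand_embs]
    # Seed with the candidate closest to the document.
    selected = [max(range(len(candidates)), key=lambda i: doc_sims[i])]
    while len(selected) < top_n:
        best_i, best_score = None, -math.inf
        for i in range(len(candidates)):
            if i in selected:
                continue
            # Reward document similarity; penalise redundancy with picks.
            redundancy = max(cosine(cand_embs[i], cand_embs[j])
                             for j in selected)
            score = (1 - diversity) * doc_sims[i] - diversity * redundancy
            if score > best_score:
                best_i, best_score = i, score
        selected.append(best_i)
    return [candidates[i] for i in selected]
```

A diversity of 0 reduces the score to plain document similarity, while values near 1 aggressively penalise candidates resembling keywords already chosen.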

**Number of keywords**: How many keywords should be generated (maximum 50).


# Model card -- KeywordBERT

**Model Details**: KeyBERT is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to create keywords and keyphrases that are most similar to a document.

**Developers**: Maarten Grootendorst.

**Distributors**: Original developer's code from [https://github.com/MaartenGr/KeyBERT](https://github.com/MaartenGr/KeyBERT).

**Model date**: 2020.

**Model type**: Different BERT and SciBERT models, trained on [CIRCA data](https://circa.res.ibm.com/index.html).

**Information about training algorithms, parameters, fairness constraints or other applied approaches, and features**: 
N.A.

**Paper or other resource for more information**: 
The [KeyBERT GitHub repo](https://github.com/MaartenGr/KeyBERT).

**License**: MIT

**Where to send questions or comments about the model**: Open an issue on [GT4SD repository](https://github.com/GT4SD/gt4sd-core).

**Intended Use. Use cases that were envisioned during development**: N.A.

**Primary intended uses/users**: N.A.

**Out-of-scope use cases**: Production-level inference.

**Metrics**: N.A.

**Datasets**: N.A.

**Ethical Considerations**: Unclear, please consult with original authors in case of questions.

**Caveats and Recommendations**: Unclear, please consult with original authors in case of questions.

Model card prototype inspired by [Mitchell et al. (2019)](https://dl.acm.org/doi/abs/10.1145/3287560.3287596).

## Citation
```bibtex
@misc{grootendorst2020keybert,
  author       = {Maarten Grootendorst},
  title        = {KeyBERT: Minimal keyword extraction with BERT.},
  year         = 2020,
  publisher    = {Zenodo},
  version      = {v0.3.0},
  doi          = {10.5281/zenodo.4461265},
  url          = {https://doi.org/10.5281/zenodo.4461265}
}
```