journal_identification_german

This model is a fine-tuned version of deepset/gbert-base that was trained to identify and extract references to scientific journals in German news coverage. It was trained on a dataset of 8,082 annotated paragraphs from German print news articles that was created specifically for this task.

Model description

Similar to a Named Entity Recognition (NER) model, this model has been trained to detect a specific type of entity in texts: names of scientific journals. Each token in a text is classified as either irrelevant (not part of a journal name), the first part of a journal name, or a later part of a journal name. The model was developed as part of a research project at Karlsruhe Institute of Technology that investigated journalistic coverage of individual research results. The same project produced a similar model for identifying journal names in English news articles (journal_identification_english) as well as two models fine-tuned to detect German and English news articles that contain a reference to a research result (study_news_detection_german and study_news_detection_english).

  • Model type: token classification
  • Language: German
  • Finetuned from: deepset/gbert-base
  • Supported by: The author acknowledges support by the state of Baden-Württemberg through bwHPC.

Intended uses & limitations

The intended use of this model is to enable large-scale analyses of how journalists select scientific journals as sources for their coverage. It was used to extract journal names from more than 78k German news articles in order to study the dominance of individual sources in science news coverage.

How to use

You can use this model with a Transformers pipeline for token classification:

>>> from transformers import pipeline
>>> journal_identifier = pipeline('token-classification', model='nikoprom/journal_identification_german')
>>> sentences = ['Die Pflanze sei im Laufe der Zeit unscheinbarer geworden und damit für Menschen schwerer zu finden, berichten die Forscher im Fachmagazin Current Biology.']
>>> journal_identifier(sentences)

[[{'entity': 'J-Start',
   'score': np.float32(0.9984914),
   'index': 27,
   'word': 'Cur',
   'start': 138,
   'end': 141},
  {'entity': 'J-Start',
   'score': np.float32(0.9978611),
   'index': 28,
   'word': '##rent',
   'start': 141,
   'end': 145},
  {'entity': 'J-Inner',
   'score': np.float32(0.99738055),
   'index': 29,
   'word': 'Bio',
   'start': 146,
   'end': 149},
  {'entity': 'J-Inner',
   'score': np.float32(0.9970715),
   'index': 30,
   'word': '##log',
   'start': 149,
   'end': 152},
  {'entity': 'J-Inner',
   'score': np.float32(0.99715745),
   'index': 31,
   'word': '##y',
   'start': 152,
   'end': 153}]]

Text passed to the model should consist of whole paragraphs, or at least complete sentences, as this was the setting in which the model was fine-tuned.
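Since the training texts were split on line breaks (see the training procedure below), a reasonable way to process whole articles is to split them into paragraphs first. A minimal sketch, assuming article_text is a placeholder holding the raw article text:

>>> paragraphs = [p for p in article_text.split('\n') if p.strip()]
>>> results = journal_identifier(paragraphs)  # one list of tagged tokens per paragraph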

Limitations

The model was developed for a very narrow use case in a research project and fine-tuned on a rather small dataset with texts from a very specific context (see below). As a consequence, its performance could be much worse when applied to texts from other domains (e.g. types of texts other than news articles, texts from other periods of time).

In addition, model output should be checked and post-processed before further use, for at least three reasons (see the post-processing sketch below):

  • Sometimes only some subwords of a journal name are tagged as part of a name.
  • In related cases, tokens inside a journal name are occasionally not identified as part of the name, so a single name is detected as two separate names.
  • The model has a slight tendency to also extract names of non-scientific magazines or media outlets when they appear in a similar context.
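One way to address the first two issues is to merge tagged tokens whose character spans touch or nearly touch into complete journal names; the third issue can be mitigated by checking extracted names against a list of known scientific journals. The following sketch is illustrative, not part of the model: the helper name and the max_gap heuristic are assumptions.

def merge_journal_tokens(text, tokens, max_gap=1):
    # Merge tagged tokens into character spans; ignores the J-Start/J-Inner
    # distinction. A larger max_gap also bridges short tokens the model
    # missed inside a name (the second issue above).
    spans = []
    for tok in tokens:
        if spans and tok['start'] - spans[-1][1] <= max_gap:
            spans[-1][1] = tok['end']                 # extend the current name
        else:
            spans.append([tok['start'], tok['end']])  # start a new name
    return [text[start:end] for start, end in spans]

Applied to the pipeline output shown above:

>>> merge_journal_tokens(sentences[0], journal_identifier(sentences[0]))
['Current Biology']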

Training data

The training data was created as part of a larger manual content analysis in which the coverage of research results in print media from three countries (Germany, UK, US) was investigated. The dataset used for this model contained 601 articles mentioning a specific research result. These articles were published in 31 different German print outlets over a period of six years (2001, 2010, 2017-2020). All names of scientific journals (e.g. Nature, Cell Metabolism, PNAS) or preprint servers (e.g. medRxiv, SSRN) in the texts were marked by four human coders. Based on these annotations, each token was classified into one of three classes:

| Label | Meaning |
|---|---|
| O | No journal name |
| J-Start | First word of a journal name |
| J-Inner | Second (or later) word of a journal name |
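As an illustration, the word-level annotation for the example sentence from the usage section looks like this (shortened; variable names are illustrative):

words  = ['berichten', 'die', 'Forscher', 'im', 'Fachmagazin', 'Current', 'Biology', '.']
labels = ['O',         'O',   'O',        'O',  'O',           'J-Start', 'J-Inner', 'O']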

Training procedure

All texts were cleaned to remove some frequent formatting errors present in the original articles (e.g. Ã¤ instead of ä). Each text was split into paragraphs based on line breaks; paragraphs containing more than 300 words were additionally split into sentences to ensure that their number of tokens would not exceed the maximum length accepted by the model. 64 % of the paragraphs (5,173) were used for training, 16 % (1,293) for validation and 20 % (1,616) for testing.

Further preprocessing and fine-tuning largely followed the steps outlined in the notebook "Fine-tuning a model on a token classification task" provided by HuggingFace. The paragraphs were tokenized using the WordPiece tokenizer corresponding to the model (with a vocabulary size of 31,102 and without lower casing or accent removal). As this tokenizer splits words that are not in the vocabulary into subwords, the journal labels had to be aligned with the new tokens. The model was then fine-tuned using TensorFlow on a single NVIDIA Tesla V100-SXM2-32GB on the bwUniCluster 2.0.
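The alignment step maps subword tokens back to their source words via word_ids(). A minimal sketch of one common variant from the HuggingFace notebook, in which continuation subwords are masked with -100 so the loss ignores them (whether the authors used this variant or propagated labels to all subwords is not stated in this card):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('deepset/gbert-base')
label2id = {'O': 0, 'J-Start': 1, 'J-Inner': 2}

def tokenize_and_align(words, word_labels):
    # Re-tokenize pre-split words into subwords; word_ids() maps each
    # subword back to the index of the word it came from.
    enc = tokenizer(words, is_split_into_words=True, truncation=True)
    aligned, previous = [], None
    for word_id in enc.word_ids():
        if word_id is None:            # special tokens such as [CLS]/[SEP]
            aligned.append(-100)
        elif word_id != previous:      # first subword of a word: keep its label
            aligned.append(label2id[word_labels[word_id]])
        else:                          # later subword of the same word: mask out
            aligned.append(-100)
        previous = word_id
    enc['labels'] = aligned
    return enc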

Training hyperparameters

The following hyperparameters were used during training:

  • Batch size: 16
  • Number of epochs: 15
  • Optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • Learning rate: 2e-5
  • Weight decay rate: 0.01
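In code, this setup corresponds roughly to the following sketch. The warmup step count and the exact number of training steps are not stated in this card, so the values below are assumptions:

from transformers import TFAutoModelForTokenClassification, create_optimizer

model = TFAutoModelForTokenClassification.from_pretrained(
    'deepset/gbert-base', num_labels=3  # O, J-Start, J-Inner
)

batch_size, num_epochs = 16, 15
num_train_steps = (5173 // batch_size) * num_epochs  # 5,173 training paragraphs

# Adam (betas and epsilon as above) with decoupled weight decay and a
# linearly decaying learning rate; warmup assumed to be 0 here.
optimizer, lr_schedule = create_optimizer(
    init_lr=2e-5,
    num_train_steps=num_train_steps,
    num_warmup_steps=0,
    weight_decay_rate=0.01,
)
model.compile(optimizer=optimizer)  # falls back to the model's built-in loss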

Framework versions

  • Transformers 4.32.0
  • TensorFlow 2.14.0
  • Datasets 2.12.0
  • Tokenizers 0.13.3

Training results

During fine-tuning, average precision, average recall and average F1 score on the validation set were monitored using seqeval:

| Epoch | Training loss | Precision | Recall | F1 |
|---|---|---|---|---|
| 1 | 0.0238 | 0.0 | 0.0 | 0.0 |
| 5 | 7.3061e-04 | 0.516 | 0.485 | 0.500 |
| 10 | 1.9297e-04 | 0.611 | 0.667 | 0.638 |
| 15 | 1.2848e-04 | 0.833 | 0.909 | 0.870 |
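seqeval computes these metrics from per-paragraph lists of label strings; the sequences below are illustrative, and the exact aggregation settings used during training are not stated in this card:

from seqeval.metrics import precision_score, recall_score, f1_score

y_true = [['O', 'O', 'O', 'J-Start', 'J-Inner', 'O']]  # gold labels
y_pred = [['O', 'O', 'O', 'J-Start', 'O',       'O']]  # model predictions

print(precision_score(y_true, y_pred))  # average precision
print(recall_score(y_true, y_pred))     # average recall
print(f1_score(y_true, y_pred))         # average F1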

Evaluation

The model was evaluated on the test set of 1,616 paragraphs, again using precision, recall and F1 score:

| Class | Precision | Recall | F1 |
|---|---|---|---|
| J-Start | 0.793 | 0.767 | 0.780 |
| J-Inner | 0.737 | 0.737 | 0.737 |
| Overall | 0.771 | 0.755 | 0.763 |