Unlocking Public Catalogues: Instruction-Tuning LLMs for ICD Coding of German Tumor Diagnoses
Abstract
Instruction-based fine-tuning of open-weight LLMs using public datasets improves coding accuracy for German tumor diagnoses, particularly for ICD-10-GM and ICD-O-3, while reducing malformed code outputs.
Accurate coding of tumor diagnoses with ICD-10-GM and ICD-O-3 is essential for structured cancer documentation in Germany. Smaller open-weight LLMs are appealing for privacy-preserving automation but often struggle with coding accuracy in German-language contexts. This study investigates whether instruction-based fine-tuning on public datasets improves the coding accuracy of open-weight LLMs for German tumor diagnosis texts. The evaluation uses coded diagnoses from the local tumor documentation system as test data. In a systematic data quality assessment, the upper limit for ICD-10 coding performance was estimated at 60-79% for exact derivation and 81-94% for partial derivation (three-character codes only). As training data, over 500,000 question-answer pairs were created from the ICD-10-GM, ICD-O-3, and OPS catalogues. Eight open-weight models from the Qwen, Llama, and Mistral families (7-70B parameters) were fine-tuned. Exact ICD-10-GM accuracy rose from 1.4-24% to 41-58%, and partial accuracy from 31-74% to 73-83%. ICD-O-3 topography coding also improved but started from, and remained at, a considerably lower level, reaching 22-40% exact and 56-67% partial accuracy after fine-tuning. Malformed code outputs dropped to 0% for all models. Tumor-diagnosis recognition reached 99%. Accuracy correlated positively with model size, but the gap between small and large models narrowed after fine-tuning. The reasoning mode in Qwen3 generally yielded lower performance than fine-tuning and was over 100 times slower. Our findings highlight the potential of leveraging public catalogues to build instruction datasets that improve LLM performance on medical documentation tasks. The complete training dataset and the best-performing checkpoints of the fine-tuned models are available at https://huggingface.co/datasets/stefan-m-lenz/ICDOPS-QA-2024.
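To make the approach concrete, here is a minimal sketch (not the authors' released pipeline) of the two ideas named in the abstract: turning catalogue entries into instruction-style question-answer pairs, and scoring predictions with the exact vs. partial (three-character) match criteria. The file name, column layout, and question templates are illustrative assumptions.

```python
# Minimal sketch, assuming a flat catalogue file with two semicolon-separated
# columns (code; description). Templates and file name are hypothetical.
import csv

QUESTION_TEMPLATES = [
    "Which ICD-10-GM code corresponds to the diagnosis: {text}?",
    "Code the following German tumor diagnosis in ICD-10-GM: {text}",
]

def build_qa_pairs(catalogue_csv: str) -> list[dict]:
    """Turn (code, description) catalogue rows into instruction-tuning examples."""
    pairs = []
    with open(catalogue_csv, newline="", encoding="utf-8") as f:
        for code, description in csv.reader(f, delimiter=";"):
            for template in QUESTION_TEMPLATES:
                pairs.append({
                    "question": template.format(text=description),
                    "answer": code,
                })
    return pairs

def exact_match(pred: str, gold: str) -> bool:
    """Exact derivation: the full code must match."""
    return pred.strip().upper() == gold.strip().upper()

def partial_match(pred: str, gold: str) -> bool:
    """Partial derivation: only the three-character category must match."""
    return pred.strip().upper()[:3] == gold.strip().upper()[:3]

if __name__ == "__main__":
    print(partial_match("C50.1", "C50.9"))  # True: same three-character category
    print(exact_match("C50.1", "C50.9"))    # False: subcategories differ
```

Pairing each catalogue entry with several question templates is one plausible way to reach the dataset scale reported above; the released ICDOPS-QA-2024 dataset linked in the abstract shows the format the authors actually used.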
Community
Hi, apart from the interesting point of the work - fine-tuning vs. reasoning LLMs - I would suggest taking care with the terminology used: ICD, ICD-O, and OPS are not "catalogues", they are "classifications". Also, there is a lot of literature on automated ICD coding that I do not see cited...
Thank you very much for the feedback. Regarding the title: the wording may be too directly translated from German. In German, "ICD-10-Katalog" is fine, but we will check whether it can be improved. Regarding the literature: indeed, a lot of work has been done on this topic. What is new here is training LLMs with question-answer pairs created from these classification systems. We could not include a comprehensive literature review and had to focus on the research question, but if you think we left out important and relevant works, I would very much appreciate links to papers that we could include in a revised version.