Our paper Clinically Correct Report Generation from Chest X-Rays Using Templates, authored by Pablo Pino and Denis Parra from the Computer Science Department and Cecilia Besa and Claudio Lagos from the School of Medicine, PUC Chile, has been accepted at the MLMI workshop at MICCAI 2021.

:tada: The paper was chosen as one of 7 best paper candidates among the 70 papers accepted in the workshop :tada:

Read the paper at: https://link.springer.com/chapter/10.1007/978-3-030-87589-3_67

In this article, we introduce a method named CNN-TRG to automatically write a radiological report from an input chest X-ray. The method uses a deep Convolutional Neural Network (CNN) to visually encode the input image, and it generates the text report from templates manually curated with radiologists, rather than using a purely automatic Natural Language Generation (NLG) approach such as an RNN or an LSTM.
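The template-based idea can be sketched as follows: binary classifiers over the CNN encoding predict a set of abnormalities, and each prediction selects a curated template sentence. This is a minimal illustrative sketch, not the paper's actual implementation; all names and sentences are hypothetical.

```python
# Hypothetical sketch of template-based report generation: classifier
# outputs are mapped to curated sentences (illustrative, not the paper's code).

# Curated templates per abnormality: one sentence for "present", one for "absent".
TEMPLATES = {
    "cardiomegaly": {
        True: "The heart size is enlarged.",
        False: "The heart size is within normal limits.",
    },
    "pleural_effusion": {
        True: "There is a pleural effusion.",
        False: "No pleural effusion is seen.",
    },
}

def generate_report(predictions):
    """Concatenate one template sentence per predicted abnormality label."""
    sentences = [TEMPLATES[abn][present] for abn, present in predictions.items()]
    return " ".join(sentences)

# Example: the classifiers detect cardiomegaly but no pleural effusion.
report = generate_report({"cardiomegaly": True, "pleural_effusion": False})
```

Because every emitted sentence was reviewed by radiologists, fluency and clinical phrasing are guaranteed by construction, shifting the learning problem to classification.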

Our reasoning for following this template-based approach is that typical NLG approaches are usually evaluated with NLP/NLG metrics such as BLEU, ROUGE or CIDEr, but these do not guarantee clinical or factual correctness in the generated report (i.e., the right diagnosis). Measuring this is difficult without actual physicians, but some authors have proposed automatic metrics to detect specific abnormalities mentioned in a written report, namely the CheXpert Labeler and MIRQI. These evaluation methods have not been validated with expert clinicians, but they aim at detecting clinical facts. Our templates, based on a set of abnormalities, were jointly developed with radiologists to make sure they are useful in medical practice.

Our evaluation is conducted on the IU X-ray and MIMIC-CXR datasets. We compare our method against 3 naive baselines (constant, random and 1-NN) and against several SOTA models (CLARA, KERP, Liu et al., Lovelace et al., CoAtt, and more)1. We use both NLP/NLG metrics (BLEU, ROUGE-L, CIDEr-D) and clinical correctness metrics based on the CheXpert Labeler and MIRQI (with F-1, Precision and Recall measures).
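To make the clinical correctness metrics concrete, here is a toy illustration of how Precision, Recall and F-1 are computed over abnormality labels extracted from reports, in the spirit of CheXpert-Labeler-style evaluation (this is not the official evaluation code; the label vectors are made up).

```python
# Toy label-based evaluation: compare abnormality labels extracted from
# ground-truth reports vs. generated reports (illustrative example only).

def precision_recall_f1(gt, pred):
    """Compute Precision, Recall and F-1 for binary abnormality labels."""
    tp = sum(1 for g, p in zip(gt, pred) if g == 1 and p == 1)  # correctly reported
    fp = sum(1 for g, p in zip(gt, pred) if g == 0 and p == 1)  # falsely reported
    fn = sum(1 for g, p in zip(gt, pred) if g == 1 and p == 0)  # missed findings
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Hypothetical labels for 5 abnormalities (1 = present, 0 = absent).
gt_labels   = [1, 0, 1, 1, 0]
pred_labels = [1, 0, 0, 1, 1]
p, r, f = precision_recall_f1(gt_labels, pred_labels)
```

Unlike BLEU or ROUGE, this kind of metric rewards a report only when it states the correct findings, regardless of the exact wording.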

Our results show that although our method achieves lower scores on NLP metrics, the generated reports are still fluent and outperform all other methods in terms of clinical correctness (F-1, Precision and Recall on CheXpert and MIRQI).

In future work we aim to expand to more abnormalities, pathologies and types of medical images. We also want to handle multimodal inputs and produce multimodal representations. Finally, we will introduce some XAI features, which should be feasible since our approach is based on classifiers, allowing us to integrate CAM, Grad-CAM or similar approaches.

Useful links:

  1. We only re-implemented CoAtt; for the other models, we report the results from the original articles.