Automated Medical Image Analysis using AI: The Why, The How, and The What

By: Jakub Nalepa, Ph.D., D.Sc.
We have been witnessing an unprecedented level of success achieved by artificial intelligence (AI)-powered techniques in virtually all fields of science and industry, and medical image analysis is no exception. Thanks to the availability of high-performance hardware and a variety of software tools, benefiting from the recent advances in classical machine learning and deep learning has become easier than ever before and has opened new doors for deploying such approaches in clinical settings. This may raise our expectations and make us believe that AI will solve all the problems that clinicians face daily. Although we are still far from that point, there are indeed tasks that can be effectively automated using data-driven algorithms – they range from improving the clinical workflow through automated scheduling and operations, through enhancing the quality of medical images and their registration, segmentation, and classification, to radiomics, the process of transforming image data into mineable and quantifiable features that allow us to build diagnostic or predictive AI models [1].
Automated medical image analysis: why do we need automation?
Medical image analysis is undoubtedly one of the most important tasks performed by radiologists and medical physicists. Accurate delineation of abnormal tissue can be key in defining the treatment pathway, as it can be one of the factors that enables us to quantify and monitor the treatment’s efficacy and the patient’s response. Assume that we want to understand the volumetric characteristics of a brain tumor in a longitudinal study, in which a patient was repeatedly scanned and several magnetic resonance imaging (MRI) scans were manually delineated by a human reader. In a perfect world, an observed decrease of the tumor’s volume across the subsequent time points would mean that things are going well, but such a decrease can easily result from poor reproducibility and intra-rater variability (or inter-rater variability, if the MRIs are analyzed by several clinicians) that is inherent to manual image analysis. Things get even more challenging if we move to the three-dimensional analysis of full scans [4].
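As a toy illustration of the quantity at stake, here is a minimal sketch (hypothetical, with made-up numbers, not taken from any clinical pipeline) of how a tumor volume can be computed from a binary segmentation mask once the voxel spacing is known from the scan’s metadata:

```python
# Minimal sketch: estimating tumor volume from a binary segmentation mask
# and the voxel spacing (in millimetres) stored in the MRI header.
import numpy as np

def tumor_volume_cm3(mask: np.ndarray, voxel_spacing_mm) -> float:
    """Volume of the segmented region in cm^3, given spacing in millimetres."""
    voxel_volume_mm3 = float(np.prod(voxel_spacing_mm))
    return mask.astype(bool).sum() * voxel_volume_mm3 / 1000.0  # mm^3 -> cm^3

# Example: a toy 3D mask with 1 mm x 1 mm x 3 mm voxels.
mask = np.zeros((16, 16, 8), dtype=np.uint8)
mask[4:10, 4:10, 2:5] = 1
print(tumor_volume_cm3(mask, (1.0, 1.0, 3.0)))  # 6*6*3 voxels * 3 mm^3 = 0.324 cm^3
```

The arithmetic is trivial; the difficult part is obtaining a mask that is reproducible enough for such volumes to be comparable across time points.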

Fig. 1: AI algorithms have been blooming in numerous medical image analysis tasks in virtually all modalities and organs.
Wouldn’t it be wonderful to have a robust software tool that can automatically deliver – in a fully reproducible way – regions of interest that are ready to be interpreted? It definitely would, and this is where AI may come into play – in many different scenarios (Fig. 1).
How can we automate brain tumor segmentation from MRI?
There exist quite a number of medical image modalities – let’s focus on MRI, which plays a key role in modern cancer care because it allows us to non-invasively diagnose a patient, determine the cancer stage, monitor the treatment, assess and quantify its results, and understand its potential side effects in practically all organs. MRI may be exploited to better understand both the structural and functional characteristics of the tissue [2] – such detailed and clinically relevant analysis of an imaged tumor can help design more personalized treatment pathways and ultimately lead to better patient care. Additionally, MRI does not use damaging ionizing radiation and may be utilized to acquire images in different planes and orientations. Thus, MRI is the investigative tool of choice for neurological cancers such as brain tumors [3].
The approaches to brain tumor segmentation can be split into four main categories: atlas-based, image analysis-based, and machine learning-based techniques, and those which hybridize algorithms from the other groups (Fig. 2). This taxonomy clearly shows that the topic is very active in the literature, and new algorithms for this task continuously emerge.

Fig. 2: Brain tumors can be automatically segmented using a range of different algorithms. This figure is inspired by our previous work [6].
There are, obviously, important challenges that we need to face if we want to benefit from classical machine learning or deep learning for brain tumor segmentation. In the former approaches, we have to design appropriate extractors that capture discriminative image features which, in turn, should allow for distinguishing tumorous from healthy tissue. Commonly, such extractors compute a variety of features to reflect the tissue characteristics as well as possible – we may obviously quantify the distributions of the voxels’ intensity, but we can also try to capture characteristics that are not necessarily visible to the naked eye, such as texture. There is no doubt that designing effective extractors is not a piece of cake at all, and the result may be heavily human-dependent. Deep learning helps us tackle this issue – such models (of a multitude of architectures) exploit automated representation learning, meaning that the feature extractors are learned automatically during the training process.
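To make the classical route more tangible, here is a minimal, illustrative sketch (assuming scikit-image is available; the chosen descriptors are examples, not the ones used in any particular segmentation system) of what a hand-crafted extractor could compute for a single 2D slice:

```python
# Minimal sketch of hand-crafted feature extraction for a 2D MRI slice:
# first-order intensity statistics plus simple GLCM texture descriptors.
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def handcrafted_features(slice_2d: np.ndarray) -> dict:
    # First-order statistics of the voxel intensity distribution.
    feats = {
        "mean": float(slice_2d.mean()),
        "std": float(slice_2d.std()),
        "p10": float(np.percentile(slice_2d, 10)),
        "p90": float(np.percentile(slice_2d, 90)),
    }
    # Texture: quantize to 32 gray levels and build a gray-level co-occurrence matrix.
    quantized = np.digitize(slice_2d, np.linspace(slice_2d.min(), slice_2d.max(), 32)) - 1
    glcm = graycomatrix(quantized.astype(np.uint8), distances=[1],
                        angles=[0], levels=32, symmetric=True, normed=True)
    feats["glcm_contrast"] = float(graycoprops(glcm, "contrast")[0, 0])
    feats["glcm_homogeneity"] = float(graycoprops(glcm, "homogeneity")[0, 0])
    return feats
```

Such vectors would then be fed into a classical classifier trained to separate tumorous from healthy tissue; a deep network, in contrast, learns its own feature hierarchy directly from the imaging data.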
Sounds like the way to go? Indeed, deep learning has established the state of the art in brain tumor segmentation (see the Brain Tumor Segmentation Challenge organized yearly at MICCAI: https://www.med.upenn.edu/cbica/brats2021/), but such high-capacity learners require huge amounts of data to learn from if we employ supervised learning. We want to be able to accurately segment MRIs of varying quality, captured using different scanners, don’t we? That means we want to generalize well over unseen image data. In supervised learning, we need to provide training samples showing the algorithm which regions are tumorous and which are not. Imagine you are a teacher and the algorithm is a student – you want to present nice and representative examples that help your student understand which image characteristics are important (and which are redundant). Unfortunately, the world is not fully labeled – we need to somehow generate the ground-truth labels. Remember intra- and inter-rater variability? That is just one issue that makes the manual annotation process challenging. On top of that, it is costly, time-consuming, cumbersome, and not super exciting… Fortunately, we can try to synthesize artificial examples based on the existing training samples in the data augmentation process. Even very simple operations may increase the training set size considerably (Fig. 3).

Fig. 3: Data augmentation helps us significantly increase the size of our training set. This figure comes from our paper [8].
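To show how cheap such operations can be, here is a minimal sketch (using plain NumPy; real augmentation pipelines are usually far richer, as reviewed in [8]) of flipping and rotating an image-mask pair:

```python
# Minimal sketch of very simple geometric data augmentation for a 2D MRI slice
# and its ground-truth mask; elastic deformations, intensity shifts, or generative
# approaches would typically be added on top of this.
import numpy as np

def augment(image: np.ndarray, mask: np.ndarray):
    """Yield flipped and rotated copies of (image, mask); the mask is transformed identically."""
    yield image, mask
    yield np.fliplr(image), np.fliplr(mask)   # horizontal flip
    yield np.flipud(image), np.flipud(mask)   # vertical flip
    for k in (1, 2, 3):                       # 90, 180, 270 degree rotations
        yield np.rot90(image, k), np.rot90(mask, k)

# One labeled slice becomes six additional training samples "for free".
```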
Right, now we have a dataset prepared to be fed into the training process of our carefully selected and designed AI model. Are we ready to go and hit the market with a brand-new, ground-breaking medical image analysis software tool that will change the way we see and analyze human brains in MRI?
What can we automate in medical image analysis? The brain tumor segmentation case study: a CE-marked software tool
Fortunately, not yet. To be able to deploy the AI algorithm in the wild, we must prove that it has been designed, verified, and validated with care. This piece of software could ultimately affect the decisions that are taken by clinicians. Just imagine what could happen if an ill-designed tool delivered results that are fundamentally wrong.
We have been driving down this very route with Sens.AI – our automated tool for segmenting low- and high-grade gliomas from FLAIR MRI. Such software products must be CE-marked to be deployable in clinical settings (or FDA-cleared in the US), meaning that they must be risk-managed, well-documented, and thoroughly validated. This validation also concerns the data – we need to be sure that our training samples are of high quality and can actually be utilized to teach the algorithm how to operate. The heart of Sens.AI is a U-Net-based deep neural network – this architecture is used both for brain extraction (the process of removing the skull from the input FLAIR MRI) and for delineating tumors. Thankfully, the organizers of the BraTS challenge made their labeled data (with hundreds of patients scanned at a number of institutions) publicly accessible, hence the community can exploit it while developing automated segmentation techniques. Once the model is trained and the processing pipeline is in place, we can infer over unseen data, e.g., captured at your local hospital, hoping that Sens.AI is going to work charmingly well (Fig. 4). Can you quantify our hopes, though?

Fig. 4: A high-level flowchart of Sens.AI – from training, straight to segmenting unseen test data [6].
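The exact architecture behind Sens.AI is beyond the scope of this article, but to give a flavour of what "U-Net-based" means, here is a minimal, illustrative sketch in PyTorch (not the production code) of a U-Net-like encoder-decoder with skip connections:

```python
# Minimal sketch of a U-Net-like encoder-decoder for 2D segmentation.
# Illustrative only - it shows the general idea of a contracting path,
# an expanding path, and skip connections between the two.
import torch
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self, in_channels: int = 1, num_classes: int = 1):
        super().__init__()
        self.enc1, self.enc2 = conv_block(in_channels, 32), conv_block(32, 64)
        self.bottleneck = conv_block(64, 128)
        self.pool = nn.MaxPool2d(2)
        self.up2 = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
        self.dec2 = conv_block(128, 64)
        self.up1 = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)
        self.dec1 = conv_block(64, 32)
        self.head = nn.Conv2d(32, num_classes, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        e1 = self.enc1(x)                                       # encoder level 1
        e2 = self.enc2(self.pool(e1))                           # encoder level 2
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))     # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))    # skip connection
        return torch.sigmoid(self.head(d1))                     # per-voxel tumor probability

# model = TinyUNet(); model(torch.randn(1, 1, 128, 128)).shape -> (1, 1, 128, 128)
```

The skip connections are what make this family of models attractive for segmentation: they let the decoder combine coarse, semantic information with fine spatial detail.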
As we already know, delivering high-quality ground-truth segmentations is an expensive process which may lead to biased tumor delineations. Also, such gold- (or silver-)standard segmentations are rarely available for clinical data. Isn’t it tricky to quantify the performance of our segmentation technique, then? What if we want to use a standard overlap metric, e.g., the Dice coefficient, to compare the predicted and manual (ground-truth) segmentations, and the ground truth itself is incorrect? Can the model deliver a higher-quality delineation than the ground truth?
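For reference, such an overlap metric is trivial to compute – below is a minimal sketch of the Dice coefficient between two binary masks:

```python
# Minimal sketch of the Dice similarity coefficient between a predicted and a
# reference (ground-truth) binary segmentation: 2|A ∩ B| / (|A| + |B|).
import numpy as np

def dice(pred: np.ndarray, truth: np.ndarray, eps: float = 1e-8) -> float:
    pred, truth = pred.astype(bool), truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    return float(2.0 * intersection / (pred.sum() + truth.sum() + eps))

# Dice = 1 means perfect overlap, 0 means none.
```

The metric itself is simple; the hard part is trusting the reference mask it is computed against.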
We tackled those issues by combining quantitative, qualitative, and statistical analysis in our validation process. On the one hand, we calculated and analyzed classical quality metrics; on the other hand, we performed a mean opinion score (MOS) experiment. We segmented the test MRIs (never used for training) using Sens.AI and asked experienced readers to assess the segmentations using an ordinal scale with just four options, ranging from “Very low quality, I would not use it to support diagnosis” up to “Very high-quality segmentation, I would definitely use it”. The MOS experiment was performed in a blinded setting, meaning that the readers received segmentations without knowing the source dataset. Each reader assigned a single score to each segmentation – it will not be surprising if we say that the readers disagreed in many cases, will it? Also, the visual assessment is not an easy task – are jagged contours worse than smooth ones? Shall we include more hyperintense tissue in the tumor region? Quite a number of open questions here (Fig. 5). This experiment allowed us, however, to define the acceptance criteria (the percentage of all experts that would have used the automated segmentations) and to seamlessly combine quantitative and qualitative metrics in a comprehensive analysis process. Interestingly, we exploited a similar MOS-based approach to verify the quality of our training sets as well.

Fig. 5: Are jagged contours worse than the smooth ones? Shall we include larger hyperintense areas in the tumor regions? Deciding on the quality of a segmentation is not a trivial task (this figure comes from our paper [6]).
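As a toy illustration (with entirely made-up scores and a hypothetical 1-4 numeric encoding of the ordinal scale), this is how reader votes can be turned into an acceptance rate of the kind described above:

```python
# Hypothetical sketch: turning reader scores into an acceptance rate.
# Assume each reader assigns one score per segmentation on a 1-4 ordinal scale,
# where 3 and 4 mean "I would use this segmentation to support diagnosis".
import numpy as np

scores = np.array([          # rows: segmentations, columns: readers (toy data)
    [4, 3, 4],
    [2, 3, 3],
    [1, 2, 2],
])
would_use = scores >= 3
per_case_acceptance = would_use.mean(axis=1)   # fraction of accepting readers per segmentation
overall_acceptance = would_use.mean()          # fraction of all (segmentation, reader) votes
print(per_case_acceptance, overall_acceptance)
```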
The final remarks
AI algorithms not only help us build automated pipelines for the most tedious and time-consuming tasks, such as “simple” medical image analysis and segmentation, but can also make us see beyond the visible by extracting features that may be impossible to capture with our eyes. Such tools must, however, be thoroughly verified to make them applicable in the clinic, as their results affect all subsequent steps of the patient management chain, even though those results are ultimately interpreted by humans. Therefore, having a decent deep model is only a great first step – the next ones include designing and implementing the theoretical and experimental validation of the pivotal elements of the processing pipeline in a fully reproducible, quantifiable, thorough, and evidence-based way.
This is the right way to go.
References
[1] S. J. Lewis et al.: Artificial Intelligence in medical imaging practice: looking to the future. Journal of Medical Radiation Sciences, 2019, DOI: 10.1002/jmrs.369.
[2] D. A. Dickie et al.: Whole brain magnetic resonance image atlases: A systematic review of existing atlases and caveats for use in population imaging. Frontiers in Neuroinformatics, 11:1, 2017, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5244468/.
[3] J. E. Villanueva-Meyer, M. C. Mabray, and S. Cha: Current Clinical Brain Tumor Imaging. Neurosurgery, 81(3):397-415, 2017, https://academic.oup.com/neurosurgery/article/81/3/397/3806788.
[4] H. K. Bø et al.: Intra-rater variability in low-grade glioma segmentation. Journal of Neuro-Oncology, 131(2):393-402, 2017, DOI: 10.1007/s11060-016-2312-9, https://link.springer.com/article/10.1007/s11060-016-2312-9.
[5] A. Wadhwa, A. Bhardwaj, and V. S. Verma: A review on brain tumor segmentation of MRI images. Magnetic Resonance Imaging, 61:247-259, 2019, https://www.sciencedirect.com/science/article/pii/S0730725X19300347.
[6] J. Nalepa et al.: Fully-automated deep learning-powered system for DCE-MRI analysis of brain tumors. Artificial Intelligence in Medicine, 102:101769, 2020, https://www.sciencedirect.com/science/article/pii/S0933365718306638.
[7] K. Skogen et al.: Diagnostic performance of texture analysis on MRI in grading cerebral gliomas. European Journal of Radiology, 85(4):824-829, 2016, DOI: 10.1016/j.ejrad.2016.01.013.
[8] J. Nalepa, M. Marcinkiewicz, and M. Kawulok: Data Augmentation for Brain-Tumor Segmentation: A Review. Frontiers in Computational Neuroscience, 13:83, 2019, https://www.frontiersin.org/articles/10.3389/fncom.2019.00083/full.