Deep Learning and the Future of Biomedical Image Analysis

Revolutionary technological advances in the areas of autonomous vehicles, speech recognition, cybersecurity, and earthquake prediction all depend on the family of artificial intelligence techniques known as deep learning.


But no field stands to benefit more from this approach than biomedical image analysis, a painstaking task that currently falls to highly trained radiologists and pathologists. Imagine, for example, a computer program that can detect a suspicious mass in a mammogram or identify a handful of abnormal cells in a biopsy slide.


The application of deep learning algorithms to biomedical image analysis is still in its infancy. Yet researchers around the world are already achieving uncanny results, and it is only a matter of time before their as-yet-experimental models enter the clinic. Once that happens, proponents say, deep learning will enable earlier and more accurate disease detection, allow more precisely tailored treatment plans, and ultimately improve patient outcomes.


“In the next two to five years, I see for the first time the possibility that the engineering tools that we’re developing will actually affect the clinic and start to advance medicine,” says Hayit Greenspan, PhD, head of the Medical Image Processing and Analysis Lab at Tel Aviv University and co-editor of the book Deep Learning for Medical Image Analysis.


Yet before deep learning can realize its potential to transform biomedicine, it has some hurdles to clear. For one thing, the technology is still fairly limited and is currently better suited to rote tasks than to more advanced diagnostic ones. For another, there is some question regarding just how much artificial intelligence patients and doctors will accept in the clinic. And finally, there’s an urgent need for more training data to teach the deep learning models how to do their jobs, a problem that researchers are currently addressing in multiple ways.


Ultimately, how these issues are resolved will determine precisely how deep learning will enter the realm of clinical medicine. Nonetheless, if recent experimental studies are any indication, its ultimate impact is virtually assured: a question of how and when, rather than if.


Computer, Teach Thyself

Deep learning relies on sophisticated statistical models known as neural networks. Inspired by the human brain, these consist of virtual neurons—what Thijs Kooi, a PhD candidate in the Diagnostic Image Analysis Group at Radboud University Medical Center in the Netherlands, describes as “rudimentary elements of computation.” In a deep neural network, these are organized in multiple data-processing layers. Each layer transforms a piece of input and passes it to the next layer in such a way that the model can eventually learn to master complex tasks that it was never programmed to handle.
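
The layered structure described above can be illustrated with a minimal sketch in plain Python. The weights, layer sizes, and input values here are hypothetical and hand-picked for illustration; a real network would learn its weights from training data and would be far larger.

```python
import math

def relu(values):
    """Zero out negative activations, a common nonlinearity."""
    return [max(0.0, v) for v in values]

def dense_layer(inputs, weights, biases):
    """One layer: each virtual neuron computes a weighted sum of its inputs plus a bias."""
    return [
        sum(w * x for w, x in zip(neuron_weights, inputs)) + b
        for neuron_weights, b in zip(weights, biases)
    ]

def forward(inputs, layers):
    """Pass the input through each layer in turn; every layer feeds the next."""
    activations = inputs
    for index, (weights, biases) in enumerate(layers):
        activations = dense_layer(activations, weights, biases)
        if index < len(layers) - 1:  # hidden layers get a nonlinearity
            activations = relu(activations)
    return activations

def sigmoid(z):
    """Squash a raw score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights: 3 inputs -> 2 hidden neurons -> 1 output neuron.
LAYERS = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),
    ([[1.0, -1.0]], [0.2]),
]

raw_output = forward([0.9, 0.1, 0.4], LAYERS)[0]
score = sigmoid(raw_output)  # read as a probability-like score
```

Training, which this sketch omits, would consist of adjusting the numbers in LAYERS until the scores match known labels.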


Traditional machine learning algorithms must be told which image features to consider when classifying a tumor as malignant or benign. But if you feed enough images of malignant and benign tumors to a neural network—or more precisely, to a deep convolutional neural network (CNN), a species of deep learning model that is particularly well suited to image analysis—it will eventually learn to distinguish between them on its own.


Robot Doctors? Not So Fast: Short-Circuiting the Hype Cycle

In a paper published in the journal Nature in February of 2017, Andre Esteva and Brett Kuprel, PhD candidates in the department of electrical engineering at Stanford University, reported that a CNN they trained on more than 1.4 million images was ultimately able to detect and classify various forms of skin cancer as accurately as 21 board-certified dermatologists.


Recent work to diagnose and classify skin cancers using deep learning has proven remarkably successful. Image courtesy of the website of the National Cancer Institute.

Similarly, researchers at Google reported in 2016 in the Journal of the American Medical Association that they trained a CNN to diagnose diabetic retinopathy—an eye disease that afflicts almost one third of all diabetes patients and constitutes a leading cause of blindness—as accurately as seven board-certified ophthalmologists. More recently, several of the same Google researchers trained a CNN to match and even exceed the performance of a pathologist when identifying slide images of breast cancers that had metastasized to a patient’s lymph nodes.


Other groups, meanwhile, have used deep learning to identify signs of Alzheimer’s in MRI scans of the brain, to detect lung cancer, and to spot musculoskeletal abnormalities in bones and joints.


With successes like these, it might seem as if deep learning is on the cusp of rendering obsolete the highly educated human beings who are currently responsible for analyzing medical images. And advocates do anticipate that their autodidactic algorithms will soon undertake at least some of the tasks currently performed by flesh-and-blood doctors.


“I think it should be possible to replace routine image reading tasks in the next 5 to 10 years or so,” says Kooi.

But today’s physicians need not worry about their job security just yet. For one thing, claims of deep learning models matching, or besting, the performance of human beings can be misleading, Greenspan says.


Often, such horse races are conducted by inviting radiologists or other expert human readers into the lab and showing them the same simplified 2-D images that are fed to a model. But this does not resemble the real-life workflow of a radiologist; and under those circumstances, a human reader may prove far less reliable than he or she might in the clinic, making the algorithmic competition look far better by comparison.


It’s possible that deep learning is approaching the peak of a “hype cycle,” says Daniel Rubin, MD, MS, associate professor of biomedical data science, radiology, and medicine, who runs the Stanford Quantitative Imaging Laboratory. He collaborates on a variety of projects that employ a broad array of machine-learning methods and worries that deep learning is drawing attention away from other technologies that are still useful.


Deep learning algorithms have proven remarkably adept at standard radiological and pathological tasks such as segmentation, detection, and categorization. Yet as Greenspan explains, all of these are essentially problems of classification. For example, a deep neural network can be assigned the task of analyzing an x-ray, CT scan, or MRI at scales ranging from single pixel to region of interest or entire image, estimating the probability that it belongs to a particular class and labeling it accordingly: organ or surrounding tissue, normal or abnormal, cancer type A or cancer type B.
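
The classification step described above, estimating the probability that a pixel, region, or image belongs to each class and then labeling it, can be sketched as follows. The raw scores and class names are hypothetical stand-ins for what a trained network might emit; the softmax function is a standard way of turning such scores into probabilities, and is assumed here rather than taken from any of the studies mentioned.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(scores, class_names):
    """Return the most probable class and its probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best]

CLASSES = ["normal", "abnormal"]

# Hypothetical raw scores for three regions of interest in one image.
region_scores = [[2.0, 0.5], [0.1, 1.9], [1.0, 1.0]]
labels = [classify(scores, CLASSES) for scores in region_scores]
```

The same function applies at any scale: only what each score list describes (a single pixel, a region, or a whole image) changes.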


But as Rubin points out, much of what doctors do goes well beyond pattern recognition and image classification. It involves a complex combination of knowledge, reasoning, and inference. And while deep learning might eventually capture all of that, right now, it falls far short.


It is therefore likely that deep learning models, which do not grow bored or fatigued when forced to examine scads of mammograms or slide images, will initially be applied to relatively mundane tasks. Greenspan points out that this will allow physicians to become accustomed to the technology while enhancing their productivity and accuracy, freeing them to deal with more subtle problems, even as deep learning researchers gradually hone more advanced applications.



Shadi Albarqouni, PhD, a postdoctoral research associate at the Technical University of Munich, and colleagues recently trained a CNN to decompose chest x-rays in such a way that bony structures like the ribs and spine, which can obscure the soft tissue of the lungs, are eliminated from the picture. This would allow radiologists to more easily focus on areas of interest and perceive soft tissue abnormalities, thereby improving their chances of making a correct diagnosis.


And in a paper published in 2016, Albarqouni and others trained another CNN to analyze fluoroscopic x-ray images and identify the catheter electrodes that surgeons insert into patients during electrophysiology procedures, tagging them with colored labels and estimating their depth to enable precise placement.


Beyond such supporting roles, however, it is unclear exactly to what extent physicians—and their patients—will accept AI in the realm of healthcare.


Rubin himself has used CNNs to identify and grade the brain tumors known as gliomas in digital images of histopathology slides, and to identify and localize masses in mammograms.


Yet as he points out, “patients want a human being in the loop in their care.” So does the law, which requires that human beings, not algorithms, assume liability for medical decisions. “Can it be legally acceptable to have a computer practice medicine and replace the decision-making of a person?” he asks.


The Black Box Problem

Physicians themselves may also be uncomfortable with the results of deep learning because CNNs are black boxes. While they can determine which features are most useful for discriminating between different classes of images (e.g., tumors versus benign masses), the models do not reveal which of those features they rely upon, or how, precisely, they arrive at their decisions (for example, whether a tumor is malignant or not).


To diagnose diabetic retinopathy, doctors examine photographs of the back of the eye, or fundus, for hemorrhages. Deep learning models can be trained to recognize such signs as well. The leftmost column shows unannotated fundus images from a large dataset, hosted by an online platform for predictive modeling and analysis competitions, that researchers at Radboud University in the Netherlands used to train and test a CNN. The middle column shows the same images with hemorrhages marked by expert human annotators. The rightmost column shows the output of the CNN. © 2016 IEEE. Reprinted, with permission, from van Grinsven MJJP, van Ginneken B, Hoyng CB, Theelen T, Sánchez CI. Fast convolutional neural network training using selective data sampling: Application to hemorrhage detection in color fundus images. IEEE Trans Med Imaging 35 (5), 1273–1284 (2016).

“The nature of these models is such that we give them a raw framework for how we think a problem works, and they fill in all the details,” says Kooi, who develops deep learning models capable of detecting breast cancer in mammograms. “How it fills in all the details is something we don’t really have a lot of control over.”


Researchers are working on ways of peeking inside the models to understand how they select discriminant features. For example, a group of Stanford graduate students led by Avanti Shrikumar, a PhD candidate in computer science, recently developed an algorithm called DeepLIFT that attempts to determine which features are important by analyzing the activity of a model’s neurons when they are exposed to data. A team of engineers at the Technion-Israel Institute of Technology has devised a method of visualizing the neural activity of a network that resembles what one sees in fMRI of the human brain. And Rubin recently published a paper in which he and his colleagues trained a CNN to distinguish between benign and malignant breast tumors, and then used a visualization algorithm, called Directed Dream, to heighten and exaggerate specific details in order to maximize the images’ scores as either benign or malignant. The resulting “CNN-based hallucinations” effectively show how the CNN learns clinically relevant features, lending credibility to the results.


But much work remains to be done on this front. And Rubin suspects that while doctors might be willing to accept deep-learning input for straightforward screening tasks—so long as the network’s track record is strong—they would be far less likely to accept something more complex and consequential like a diagnosis from a model whose inner workings remained a mystery.


“Physicians will not accept the output of a decision support system that does not also provide explanations for its answers,” he says. For all these reasons, even if deep learning were to reach the point where it rivaled human intelligence, it wouldn’t necessarily represent the end of human doctors.


The Data Labeling Dilemma

Yet another obstacle must be surmounted if deep learning is to achieve its full potential: the quantity of the data that this particular flavor of AI requires in order to work its magic.


Deep neural networks can do more automatically than any prior class of machine-learning model. Feed them enough properly labeled data, and they will learn to perform a given task without human intervention.


But the phrase “enough properly labeled data” is a crucial one. A computer’s astonishing capacity to learn without human intervention comes at a price; namely, the need for a massive amount of annotated training data. A CNN, for example, may be capable of learning to distinguish between benign and malignant tumors all by itself. But to do so, it must first be fed thousands or perhaps millions of images that have already been correctly labeled benign or malignant, a process known as supervised learning.



“This is the ‘no free lunch principle,’” says Alex Ratner, a PhD candidate in computer science at Stanford. Deep learning, he explains, may do much more on its own than other machine learning methods; but “it needs more training data to make up for that extra complexity in the models.”


Unfortunately, large-scale annotated databases of biomedical images can be hard to come by, and having expert human annotators create new ones from scratch is laborious, costly, and time-consuming. Access to sufficient labeled training data is, therefore, a significant obstacle.


“We work closely with hospitals and radiologists to give us annotated data,” says Greenspan, who has used CNNs to detect metastatic liver cancer in 3-D CT scans, segment multiple sclerosis lesions in MRI images, and label pathologies in x-ray images, among other things. “Collecting the necessary data for these tasks is a slow and demanding process.”


In addition, says Kooi, “These models are still relatively stupid.” In particular, they don’t do nuance very well: Having learned to spot obvious examples of common cancers by sifting through a particular training dataset, for instance, they may stumble when confronted with rare or unusual ones, or with anomalies they have never seen before.


Dealing with Nuance: Data Augmentation

To some extent, researchers can compensate for a lack of labeled training data by using a technique known as data augmentation. This involves transforming some of the training data upon which the deep learning model hones its skills (e.g., rotating images, altering their color, simulating jitter, etc.), thereby preparing it for the kinds of variations and artifacts it might encounter when it is asked to process previously unseen data.
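
A minimal sketch of such transformations, using plain Python lists as stand-ins for grayscale images, might look like the following. The function names, the choice of transformations, and the jitter range are illustrative assumptions, not the pipeline used in any of the studies described here.

```python
import random

def rotate90(image):
    """Rotate a 2-D image (a list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]

def jitter_intensity(image, rng, max_shift=10):
    """Shift every pixel by one random offset, clamped to the 0-255 range."""
    shift = rng.randint(-max_shift, max_shift)
    return [[min(255, max(0, px + shift)) for px in row] for row in image]

def augment(image, rng, copies=4):
    """Produce randomly transformed variants of one training image."""
    variants = []
    for _ in range(copies):
        out = image
        for _ in range(rng.randrange(4)):  # 0 to 3 quarter turns
            out = rotate90(out)
        if rng.random() < 0.5:
            out = flip_horizontal(out)
        variants.append(jitter_intensity(out, rng))
    return variants

# Hypothetical 3x3 grayscale patch; each variant keeps the original's label.
patch = [[0, 50, 100], [150, 200, 250], [30, 60, 90]]
augmented = augment(patch, random.Random(0))
```

Because each variant inherits the label of the image it came from, one expert annotation yields several training examples.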

Data augmentation is used to enhance small collections of training data. Here, Thijs Kooi and his colleagues at Radboud University in the Netherlands employ it to generate variations on the images used to train a CNN to distinguish between breast cancer tumors and benign cysts. The three images in the top row are of normal breast tissue. The leftmost patch in the bottom row, on the other hand, contains a mass or cyst. Superimposing the three normal images from above over the abnormal one produces the remaining images in the bottom row, simulating the different amounts of tissue that might surround a lesion in a mammogram. Reprinted with permission from Kooi T, van Ginneken B, Karssemeijer N, den Heeten A, Discriminating solitary cysts from soft tissue lesions in mammography using a pretrained deep convolutional neural network, Medical Physics 44:3 (1017-1027) 2017.


In a paper that appeared in Medical Physics this past March, Kooi and his colleagues successfully used a CNN to distinguish benign cysts from malignant masses in standard digital mammograms, achieving results comparable to those attained with a cutting-edge imaging technique known as spectral mammography. They did so in part by processing some of their training data to mimic the natural variations in the amount of tissue that can surround a breast tumor, thereby altering its appearance in a mammogram.


Kooi’s CNN was required to analyze only small patches extracted from much larger images, however—patches that another model, known as a candidate detector (which was not itself a deep neural network), had already singled out as containing regions of interest.


Radiologists, on the other hand, typically examine complete mammograms, viewing suspicious areas in the context of the entire image. They also track changes in a patient’s scans over time, and note potentially significant symmetries and asymmetries between the left and right breasts. And they consider a whole wealth of information—a patient’s lab results, her medical history, her age and demographics—that is not contained within the images themselves.


Putting all of this into the hopper leads to better informed and more accurate diagnoses. Kooi and others are trying to incorporate such diverse sources of information into deep neural networks, in part by integrating them with other computational methods; but they are not there yet.


Relying on General Knowledge: Transfer Learning

Data augmentation represents one way of addressing the problem of limited training data. So-called transfer learning represents another: a CNN can initially be trained on a large dataset of standard images (dogs, cats, planes, umbrellas), and then retrained on a smaller dataset of biomedical ones (brain scans, chest x-rays, pathology slides). The idea is that the network will learn general, broadly applicable image features from the larger dataset, and transfer or apply that acquired knowledge to the smaller dataset, which will fine-tune it for a particular task such as segmenting organs or detecting lesions.
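
The two-stage idea can be sketched in miniature. In this simplified illustration, a fixed set of hypothetical "pretrained" filters stands in for layers learned on a large generic dataset, and only a small new classifier (a logistic regression "head") is trained on a tiny labeled dataset. This is one common variant of transfer learning, assumed here for brevity; in practice researchers typically use a deep learning framework and often continue to update some or all of the pretrained layers as well.

```python
import math

# Stand-in for layers already trained on a large generic dataset. They are
# frozen in this sketch: these hypothetical weights are never updated.
PRETRAINED_FILTERS = [[1.0, -1.0, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0]]

def extract_features(pixels):
    """Frozen feature extractor: fixed filters followed by ReLU."""
    return [max(0.0, sum(w * p for w, p in zip(f, pixels)))
            for f in PRETRAINED_FILTERS]

def predict(head, features):
    """New task-specific head: logistic regression on the frozen features."""
    z = head["bias"] + sum(w * f for w, f in zip(head["weights"], features))
    return 1.0 / (1.0 + math.exp(-z))

def fine_tune(samples, epochs=200, lr=0.5):
    """Train only the head on the small labeled dataset (gradient descent)."""
    head = {"weights": [0.0] * len(PRETRAINED_FILTERS), "bias": 0.0}
    for _ in range(epochs):
        for pixels, label in samples:
            feats = extract_features(pixels)
            error = predict(head, feats) - label  # gradient of the log loss
            head["weights"] = [w - lr * error * f
                               for w, f in zip(head["weights"], feats)]
            head["bias"] -= lr * error
    return head

# Tiny hypothetical "medical" dataset: 4-pixel patches, label 1 = lesion.
SMALL_DATASET = [
    ([0.9, 0.1, 0.2, 0.2], 1), ([0.8, 0.2, 0.5, 0.4], 1),
    ([0.2, 0.3, 0.1, 0.1], 0), ([0.4, 0.4, 0.3, 0.6], 0),
]
head = fine_tune(SMALL_DATASET)
```

The design point is that the small dataset only has to teach the final decision, not the general-purpose features underneath it.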


“The intuition behind transfer learning is that a radiologist doesn’t develop a whole new visual cortex every time he learns a new task, but relies on stuff that he already knows—and we can do the same,” says Kooi. “We can take a network that was trained on discriminating cats, for instance, then adapt the network to medical tasks.”


Esteva and Kuprel, for example, pre-trained their CNN on a set of approximately 1.28 million images, comprising 1,000 object categories, culled from ImageNet, a massive online visual database. They then retrained the model on 129,450 dermatologist-labeled clinical images drawn from clinician-curated online repositories, and from the Stanford University Medical Center. The end result: a model that performed as well as a clutch of human doctors.


In his 2017 Medical Physics paper, Kooi took the concept of transfer learning a step further: Instead of starting with generic images, he pre-trained his deep neural network on a large dataset of screening mammograms, essentially teaching the model to distinguish tumors from non-tumors, a medical task that was related to, but not quite the same as, the one he was really interested in. He then retrained the network on a much smaller dataset of diagnostic mammograms, and had it learn how to discriminate between tumors and cysts.


“Often, people train their models on ImageNet, and then fine-tune them with a medical data set,” Kooi says. “My argument is, ‘It’s always better to train the model using a task that is more related to the problem that we’re trying to solve.’”


In the end, his model nearly matched the performance of a system that used a different kind of statistical model—one that relied on features selected by human beings, rather than on deep learning—along with a more advanced form of mammography.


Outsourcing Annotation

There are many other ways of dealing with the paucity of biomedical training data, some of which take a creative approach to annotation itself. In a paper published last year, for example, Albarqouni and colleagues combined the ground-truth of expert annotations with the crowd-truth of nonexpert ones.


The goal was to improve the performance of a CNN that was trained to detect instances of mitosis, or cell division, in breast cancer biopsy slides. These visible signs of mitosis, known as mitotic figures, appear as small black dots under the microscope, and represent an important criterion for determining the aggressiveness of a tumor—and hence for establishing a patient’s prognosis and course of treatment.


The model was trained on expertly annotated images from only eight patients. But when it came time for the network to label previously unseen images, Albarqouni and his team introduced a new twist: whenever the model predicted that an image was more than 90 percent likely to contain mitotic figures, it cropped the region of interest and sent the resulting patch for annotation by at least 10 non-experts via a crowdsourcing platform. These nonexperts, who lacked any medical experience, were given a brief training session and a quiz designed to determine their accuracy. They were then asked to label the patches they received: Were the little black dots singled out by the CNN mitotic figures, or not?


Different users often arrive at different judgments, resulting in conflicting or noisy labels. To resolve such differences, Albarqouni and his colleagues built an “aggregation layer” that collected everyone’s annotations and arrived at a consensus label for each patch through majority voting, with each user’s vote weighted according to their accuracy.
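
The consensus step can be sketched as an accuracy-weighted majority vote. The annotator names, accuracies, and tie-breaking rule below are hypothetical, and the function is a simplified illustration of the idea rather than the team's actual aggregation layer.

```python
from collections import defaultdict

def aggregate(votes, accuracy):
    """Accuracy-weighted majority vote over one patch's crowd annotations.

    votes: mapping of annotator id -> label (True = mitotic figure).
    accuracy: mapping of annotator id -> quiz accuracy in [0, 1].
    """
    totals = defaultdict(float)
    for annotator, label in votes.items():
        totals[label] += accuracy[annotator]
    # Ties fall to "not mitotic"; that tie-break is an assumption.
    return totals[True] > totals[False]

# Hypothetical quiz accuracies and votes for a single patch.
ACCURACY = {"ann1": 0.95, "ann2": 0.40, "ann3": 0.45}
VOTES = {"ann1": True, "ann2": False, "ann3": False}

consensus = aggregate(VOTES, ACCURACY)              # weights: 0.95 vs 0.85
unweighted = sum(VOTES.values()) > len(VOTES) / 2   # simple head count
```

In this example the single highly accurate annotator outweighs two less reliable ones, so the weighted consensus differs from a simple head count.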


The results of that vote were returned to the deep learning model, which took the crowdsourced labels into account in its next round of predictions—a crowd-driven fine-tuning process that effectively boosted the network’s performance, as measured by the ratio of true positives to false positives, by 3 percent.


In a subsequent project, Albarqouni gamified the crowdsourcing component. He and his colleagues transformed the patches into 3-D stars whose shape, size, and color corresponded to the likelihood that they contained mitotic figures. They then asked users to play a game in which they used a virtual plane to collect the best candidates. This “playsourcing” platform performed 10 percent better than its non-gamified counterpart, an improvement that Albarqouni attributes to the motivating influence of gameplay.


“The player is trying to get a better score,” he says; and that translates into a performance boost for the model. It’s a win-win.


Automating Everything Noisily

One group is taking the crowdsourcing idea one step further: They’re generating (and de-noising) cheaper and messier training data with rules that people write and machines apply. It’s a method called data programming that was developed by Alex Ratner and others working in the lab of Stanford computer scientist Christopher Ré, PhD. Rather than requiring domain experts or nonexperts to hand-label large datasets for training purposes, data programming allows them (or their friendly neighborhood coders) to write small snippets of code that encapsulate the heuristics and rules of thumb that they would use to annotate the data themselves. Those bits of code—called labeling functions—are then used to develop a generative model that can automatically label large quantities of data for training purposes.
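
To make the idea of a labeling function concrete, here is a minimal sketch in plain Python. It does not use Snorkel's actual API. The feature names, thresholds, and rules are hypothetical, and a simple majority vote stands in for the generative model, which in data programming estimates how much to trust each function.

```python
ABSTAIN, BENIGN, MALIGNANT = -1, 0, 1

# Each labeling function encodes one rule of thumb. Feature names and
# thresholds are hypothetical, chosen only to make the sketch concrete.
def lf_irregular_margin(lesion):
    return MALIGNANT if lesion["margin"] == "irregular" else ABSTAIN

def lf_small_and_round(lesion):
    if lesion["diameter_mm"] < 5 and lesion["shape"] == "round":
        return BENIGN
    return ABSTAIN

def lf_large(lesion):
    return MALIGNANT if lesion["diameter_mm"] > 20 else ABSTAIN

LABELING_FUNCTIONS = [lf_irregular_margin, lf_small_and_round, lf_large]

def apply_lfs(examples):
    """Build the label matrix: one row per example, one column per function."""
    return [[lf(x) for lf in LABELING_FUNCTIONS] for x in examples]

def majority_label(row):
    """Stand-in for the generative model: majority of non-abstaining votes."""
    votes = [v for v in row if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    # Ties resolve to BENIGN here; that choice is an assumption.
    return MALIGNANT if sum(votes) * 2 > len(votes) else BENIGN

lesions = [
    {"margin": "irregular", "shape": "oval", "diameter_mm": 25},
    {"margin": "smooth", "shape": "round", "diameter_mm": 3},
    {"margin": "smooth", "shape": "oval", "diameter_mm": 12},
]
matrix = apply_lfs(lesions)
noisy_labels = [majority_label(row) for row in matrix]
```

Examples on which every function abstains simply receive no label and would be left out of the training set.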


Paroma Varma and her colleagues in Christopher Ré’s lab have developed a platform called Coral to help users label large training datasets of images and video for deep learning purposes. Such datasets can be used to teach a deep learning model how to recognize images of people on bicycles, or how to segment a potential tumor on a bone x-ray. Courtesy of Paroma Varma.

Because these labeling functions may overlap and conflict, the labels they produce are inaccurate, or noisy. (The process of training a model using such noisy labels is known as weakly supervised learning, or weak supervision.) But Ratner and his colleagues, who have developed an open-source data programming platform called Snorkel, use a variety of computational methods to compensate. The large volumes of noisily labeled data pumped out by the generative models created with Snorkel, which are not deep neural networks, can therefore be used to train high-performing discriminative models, which are.


Most of their early efforts involved building text-based datasets. But members of the Ré lab have begun applying data programming and other weak supervision techniques to images as well. And some of their most interesting work involves both.


Ratner, for example, has been working with Rubin and Stanford radiologist Lawrence Hoffman, MD, on various projects involving radiological images and their accompanying text reports. The images have not been labeled, but the information required to do so—a physician’s clinical judgment that a bone tumor is benign or malignant, for example, or that an arterial blockage has been cleared—is buried in the reports.


Ratner is therefore working on ways of writing labeling functions in Snorkel that can “read” radiology reports and extract labels that can be used to train an image model such as a CNN, enabling it to classify radiological images without the benefit of any text whatsoever. This “cross-modal” approach has succeeded with test data, and will soon be deployed on real clinical data.
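
The cross-modal idea, deriving image labels from the accompanying text, can be sketched with a few keyword rules. This is plain Python rather than Snorkel code; the report snippets, image identifiers, and regular expressions are all hypothetical, and real clinical language would require much more careful handling of negation and hedging.

```python
import re

ABSTAIN, BENIGN, MALIGNANT = -1, 0, 1
NEGATED = re.compile(r"\bno (evidence|sign)s? of malignan", re.I)

def lf_mentions_malignant(report):
    """Vote MALIGNANT on an un-negated mention of malignancy."""
    if NEGATED.search(report):
        return ABSTAIN
    return MALIGNANT if re.search(r"\bmalignan", report, re.I) else ABSTAIN

def lf_mentions_benign(report):
    return BENIGN if re.search(r"\bbenign\b", report, re.I) else ABSTAIN

def lf_negated_malignancy(report):
    return BENIGN if NEGATED.search(report) else ABSTAIN

REPORT_LFS = [lf_mentions_malignant, lf_mentions_benign, lf_negated_malignancy]

def label_report(report):
    """Keep a label only when the non-abstaining functions agree."""
    votes = {lf(report) for lf in REPORT_LFS} - {ABSTAIN}
    return votes.pop() if len(votes) == 1 else ABSTAIN

# Hypothetical (image id, report text) pairs.
STUDIES = [
    ("img_001", "Lytic lesion with features consistent with malignant tumor."),
    ("img_002", "Well-defined lesion, likely benign. No evidence of malignancy."),
    ("img_003", "Follow-up recommended."),
]

# Pair each image with the label read from its report; drop unlabeled ones.
training_pairs = []
for image_id, report in STUDIES:
    label = label_report(report)
    if label != ABSTAIN:
        training_pairs.append((image_id, label))
```

The resulting pairs could then be used to train an image model that never sees the report text itself.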


“If this works, then the model you’ve trained could look at an image before the radiologist has actually dictated the report, and come up with a classification,” says Ratner.


Paroma Varma, another doctoral candidate in the Ré lab, and colleagues have developed a different software platform called Coral to apply the idea of weak supervision directly to images and video. In a recent paper, she and one of Rubin’s PhD students, Darvin Yi, had Coral label tumors in mammograms as either malignant or benign, and used that data to train a CNN to distinguish between unlabeled examples of the two. Remarkably, this Coral-trained CNN proved almost as accurate as a CNN that had been trained on a small hand-labeled dataset.


Varma and another member of the Ré lab, PhD candidate Braden Hancock, are now attempting to crowdsource the act of creating labeling functions. In one proof-of-concept experiment, they posted images to the crowdsourcing platform Mechanical Turk and asked users not only to label them, but also to explain their reasons for doing so. They then used a language tool known as a semantic parser to convert those natural language explanations into code. The result: instant labeling functions, automatically generated from standard English. The images Varma and Hancock used were not medical ones, but if Albarqouni’s crowdsourcing work is any indication, the approach certainly holds promise.


Other automated solutions to the training data dilemma are also under development. Greenspan, for example, has been using deep learning models known as Generative Adversarial Networks (GANs) to generate synthetic training data using small training sets. Her work is still in the experimental phase, but if successful, it could enable one deep learning model to produce the data required to train another deep learning model.


Given the pace at which the field is moving, the application of deep learning to biomedical images is likely to keep plenty of doctors and computer scientists busy for the foreseeable future. Radiologists will be presented with more and more information to factor into their clinical decisions as artificial intelligence gradually enters the scene. And computer scientists, for their part, will continue to grapple with the challenges presented by data-hungry deep learning models and the noisy medical data required to feed them.