Good morning everyone. I hope all of you had a very good breakfast and are charged up for the first full talk at Anthill today. My name is Kiran, I work on medical imaging at a company called Predible Health, and my talk is going to be on unsupervised and semi-supervised deep learning techniques for medical imaging.

Before I dive deep into medical imaging, I'd like to give you a brief history of and introduction to cancer. How many of you here know at least one person who has suffered from cancer? It's pretty much everyone, I guess. Cancer is the deadliest disease known to human history at this point. I have an interesting statistic here. This graph shows the number of women who died of cancer in the US from 1930 to 2005. As you can see, there has hardly been any progress in bringing those numbers down. Despite all the technological advances we have made in the past 100 years, cancer research is still stagnant. There are a couple of trends here, though. If you look at this gray line, which is lung and bronchus cancer, it has actually been on the rise ever since 1970. That is because of the rise of cigarette smoking. And there's one more interesting trend. If you look at the blue line, which indicates stomach and colon cancer, it has shown a considerable decline. Can anyone guess the major innovation of the past 100 years that caused stomach cancer deaths to fall so dramatically? It was the fridge. Yeah. So this picture here is a perfect depiction of how poorly humans have understood cancer.

Let's dive a little deeper into cancer. What is cancer, and how does it occur? We are all made of trillions of cells. Each cell has a genetic code, which is like a source code that instructs how the cell should function: how the cell should divide, how it should convert food into energy, how it should prevent us from contracting diseases from bacteria, and so on. What happens in a cancerous cell is that the genetic code, the instruction that controls how the cell should divide, is corrupted, which results in uncontrollable cell division. That is what you see as a cancerous mass inside the bodies of cancer patients.

How would you treat cancer? Either you remove the cancerous mass using surgery, or you deliver the toxic drugs of chemotherapy, which essentially try to kill off the cancer cells. The adverse effect of chemotherapy is that it kills normal cells too, which is why you see cancer patients losing hair and becoming sick. And radiation therapy is where you deliver targeted doses of highly concentrated X-rays, which try to kill the cancer cells.

Now, if you have a lump of mass and you want to know whether it is cancer, the ideal scenario is to take a biopsy of the tissue, analyze it on a pathology slide, and then determine whether it is cancer or not. But for brain cancers especially, this is very difficult, because to take a biopsy of a brain cancer you have to create a hole in the skull and take tissue out of the mass inside the brain, which has adverse side effects: it can impair other brain functions. This is where medical imaging comes into the picture. Medical imaging allows us to see inside the body. The picture here shows an MRI scanner.
What happens in an MRI scanner is that as the patient goes inside, this cylindrical structure takes a 3D image of the interior of the body. When you visualize this 3D image on your computer screen, you see 2D slice sections from multiple directions: from the front, from the top, and from the side.

So how do we use this for spotting brain cancer? The picture on the left shows the MRI scan of a normal brain. The picture on the right shows the MRI scan of a cancerous patient. This particular cancer is called glioblastoma. It is the most malignant and most aggressive brain cancer known to humans. Only about 2% of patients survive, and the median survival is about 14 months: half of the patients die within 14 months of diagnosis. It's a very worrying statistic. Why is that? If you were to take a special MRI sequence of this cancer patient, this is what you would get. You can actually see that the cancer, like an earthquake, has an epicenter, and it has grown and occupied the entire left hemisphere.

Different MRI sequences show different phenotypes of the cancer. During treatment, you also need to know how heterogeneous the cancer is. A cancer is not just a single type of cell; it is extremely heterogeneous, and it does not respond to treatment uniformly. So in order to analyze treatment response, you need to map the different regions of the cancer within the brain. This picture shows the entire extent of the cancer, which is called the edema. This picture shows the necrotic core, which is the epicenter of the cancer. And this sequence shows two different regions from within the cancer: the active tumor, which is actively growing, and the central dark region, which is the necrotic cells, dead cancer cells. If you were to put all this information together and create a map out of it, this is what you would get. You have four tumor labels: each pixel in your MRI has to be labeled as brain or not, as tumor or not, and if it is tumor, which kind of tumor it belongs to: the edema, the necrotic core, the active tumor, or the non-enhancing tumor.

When a patient is diagnosed with brain cancer, the doctor prescribes a course of treatment. After a few months of treatment, the patient comes in for a re-scan and an MRI is taken. The doctor analyzes how the different regions of the cancer are responding to treatment, and that is extremely crucial for the survival of the patient.

If I were to pose this as a machine learning problem, we need to generate labels for every pixel. This pixel-wise labeling is carried out by doctors; it is extremely tedious, and it takes about two to four hours of a doctor's time. Once the pixel-wise labeling is done, the doctor visualizes the cancer in a 3D virtual environment, understands the heterogeneity of the cancer, and plans the treatment accordingly. This is where we hypothesize that the labeling can be done by deep neural networks.

But glioblastoma segmentation from MRI is extremely non-trivial, for the following reasons. We have a shortage of samples: the dataset available to us was just 300 scans, and only 2% of the pixels contain tumor. The tumors are extremely heterogeneous, so you have to learn the entire mapping of the cancer from just 300 patients. And there is the complexity of the data as well: you are no longer dealing with 2D images but with 3D images, and four sequences of these 3D images. It's essentially a 4D tensor, and you have to map the pixels of this 4D tensor to a 3D tensor of labels.
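To make those tensor shapes concrete, here is a minimal sketch, assuming PyTorch; the exact dimensions and the label numbering are illustrative, not the dataset's actual conventions:

```python
import torch

# Input: four MRI sequences (e.g. FLAIR, T1, T1-contrast, T2), each a 3D
# volume, stacked into one 4D tensor. Dimensions here are illustrative.
scan = torch.rand(4, 240, 240, 155)     # sequences x height x width x depth

# Target: a 3D tensor with one of five labels per pixel (numbering illustrative):
# 0 = normal brain, 1 = edema, 2 = non-enhancing tumor,
# 3 = necrotic core, 4 = active (enhancing) tumor.
labels = torch.randint(0, 5, (240, 240, 155))
```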
On top of that, dense annotations are extremely expensive. A couple of weeks ago, a neurosurgeon from Silicon Valley mailed me saying that he was willing to sell his densely labeled annotations for around $2,500 per scan. That roughly equates to one and a half lakh rupees here. Multiply that by 300 scans, and it is definitely not scalable for machine learning problems.

So this got us thinking: can we leverage unsupervised learning, so that we don't have to depend on labeled data? Can we use unsupervised learning to learn the underlying structure of these brain scans and come up with better diagnoses and segmentations for these problems?

This brings me to the first part of the talk: deep unsupervised feature extraction. How do you train deep networks in an unsupervised manner? Autoencoders are extremely useful for feature extraction, and they have been used for feature extraction in many deep learning problems. An autoencoder is the simplest MLP you can think of. It is just a three-layer neural network with two components, an encoder and a decoder. Both the encoder and the decoder are simple matrix multiplications followed by a linear or sigmoid activation. The purpose of the autoencoder is nothing but to reconstruct its input: you feed the input, and you ask the autoencoder to reconstruct it. It's as simple as that. If you take the MNIST problem, you have a 28 cross 28 patch; you rasterize this patch into a one-dimensional array, feed it to the autoencoder, and ask the autoencoder to reconstruct the input. The loss function is just the mean squared error between the autoencoder's output and the original input.

But there's a problem here. The autoencoder is susceptible to identity mapping: it can learn a one-to-one mapping rather than learning the underlying structure. How do we prevent this? We use a variant called the denoising autoencoder, where instead of feeding in the raw input, we feed a noised version of the input and ask the autoencoder to reconstruct the original input. This ensures that the network has to learn the underlying structure rather than doing a one-to-one mapping.

So how exactly does it learn the underlying structure? Take this example of learning a curve. If you were to train an autoencoder to learn this curve, you would take sample points from the curve, corrupt them, and ask the autoencoder to reconstruct the original input. As you feed more and more samples, the autoencoder is forced to learn the underlying structure; it essentially learns the reconstruction function of the curve and encodes it into its weights. For the MNIST problem, if you were to visualize the weights learnt by the autoencoder, this is what it would look like: it has learnt a lot of strokes, which are characteristic of the MNIST dataset. And as you can clearly see, I did not use any labeled data at all: just with unlabeled data, just by corrupting the input and asking the autoencoder to reconstruct it, the network has learnt the underlying structure of the data.
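As a concrete illustration, here is a minimal sketch of a single-layer denoising autoencoder, assuming PyTorch; the hidden size, noise level, and one-step training loop are illustrative, not the exact setup from the talk:

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    """Three-layer MLP: corrupt the input, encode, decode, reconstruct."""
    def __init__(self, n_in=28 * 28, n_hidden=500, drop_frac=0.2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_in, n_hidden), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(n_hidden, n_in), nn.Sigmoid())
        self.drop_frac = drop_frac

    def forward(self, x):
        # Masking noise: randomly zero out a fraction of the input pixels.
        mask = (torch.rand_like(x) > self.drop_frac).float()
        return self.decoder(self.encoder(x * mask))

dae = DenoisingAutoencoder()
optimizer = torch.optim.Adam(dae.parameters(), lr=1e-3)

x = torch.rand(64, 28 * 28)            # stand-in for rasterized MNIST patches
reconstruction = dae(x)
loss = nn.functional.mse_loss(reconstruction, x)   # reconstruct the CLEAN input
loss.backward()
optimizer.step()
```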
OK, so this works for MNIST, but how do we take this to brain MRI? We have the dataset here, which is four sequences of 3D MRI. In order to alleviate the data imbalance, we extract ROIs around the tumour and sample small patches from within each ROI. We extract 4 cross 21 cross 21 patches: the four corresponds to the four sequences, and we extract 2D 21 cross 21 patches and stack them on top of each other. The image on the right is a set of patches extracted from the ROI of a brain MRI. To summarize: you have these four sequences, you extract a patch, you noise the patch by dropping around 20% of the pixels, and you ask the simple autoencoder to reconstruct the original input. The mantra for training is as simple as extract, noise, reconstruct. Extract, noise, reconstruct. You do this until convergence is achieved.

But as a typical deep learning researcher, you are not happy with just a single-layer autoencoder. We always want to go deeper. Jokes apart, deep layers are extremely useful for learning a hierarchy of features: the lower layers learn local features, and as you go deeper and deeper, the network learns a hierarchy of global features, which are essential for understanding the image. There is a problem with training deep autoencoders, though: with the mean squared error loss, where you try to reconstruct the original input, they suffer from vanishing and exploding gradients. So instead you pre-train the autoencoder layer by layer. What I mean by layer by layer is this: I take the first autoencoder and train it by extracting patches, noising them, and reconstructing. Once the first autoencoder is trained, I take its encoding middle layer and use it to initialize a new, second autoencoder, which is then trained the same way. When the second autoencoder is trained, I take its middle layer and use it to initialize the third autoencoder, and this process goes on until I reach the number of layers I want. Once these autoencoders are trained, I decouple the encoders and decoders and put them together into a stack, called a stacked denoising autoencoder. The picture on the right shows a deep stacked denoising autoencoder.

How do we use this for feature extraction? You take the encoder alone out of this deep stacked denoising autoencoder and feed it the patch. During inference, in order to extract features, we don't noise the pixels; we feed the input as is and extract the features. The learnt representation is a 1D vector whose length corresponds to the number of neurons in the encoding layer of the final autoencoder. A sketch of this layer-wise training follows below.
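Here is a minimal sketch of the layer-wise pretraining and stacking just described, assuming PyTorch; the full-batch training loop and optimizer settings are illustrative:

```python
import torch
import torch.nn as nn

def pretrain_dae(data, n_hidden, drop_frac=0.2, steps=200):
    """Train one denoising autoencoder on (possibly already-encoded) data,
    following the extract-noise-reconstruct mantra, and return its encoder."""
    n_in = data.shape[1]
    encoder = nn.Sequential(nn.Linear(n_in, n_hidden), nn.Sigmoid())
    decoder = nn.Sequential(nn.Linear(n_hidden, n_in), nn.Sigmoid())
    opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)
    for _ in range(steps):
        mask = (torch.rand_like(data) > drop_frac).float()   # drop ~20% of pixels
        loss = nn.functional.mse_loss(decoder(encoder(data * mask)), data)
        opt.zero_grad(); loss.backward(); opt.step()
    return encoder

# 4 x 21 x 21 patches rasterized into 1764-d vectors (random stand-ins here).
patches = torch.rand(256, 4 * 21 * 21)

enc1 = pretrain_dae(patches, 3500)        # first autoencoder
with torch.no_grad():
    codes1 = enc1(patches)                # its codes train the second one
enc2 = pretrain_dae(codes1, 2000)         # second autoencoder
with torch.no_grad():
    codes2 = enc2(codes1)
enc3 = pretrain_dae(codes2, 1000)         # third autoencoder

# Decouple the encoders and stack them: the unsupervised feature extractor.
stacked_encoder = nn.Sequential(enc1, enc2, enc3)
features = stacked_encoder(torch.rand(1, 4 * 21 * 21))   # no noise at inference
```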
This brings me to the second part. Pre-training is done: we have trained our autoencoders layer by layer, we have prepared a stack of denoising autoencoders, and we use it as a feature extractor. How do we go from here to classification? We have the encoders, we have the input patch, and we add a logistic layer on top of the encoder and fine-tune the whole network for classification. It's just as simple as that: a multi-layer perceptron with one layer per autoencoder used, plus one logistic classification layer. This setup is called a stacked denoising autoencoder network.

Now, we have to classify our brain MRI pixels. How do we go about it? We have four sequences of MRI. We extract a 4 cross 21 cross 21 patch, rasterize it, feed it to this MLP, and ask the MLP to classify it into one of the five classes: brain, or which type of tumor it belongs to. Now I have classified this patch, but how do I start accumulating predictions? When I extracted the patch, I noted its central pixel, and I assign the class predicted by the MLP to that central pixel. So the mantra for fine-tuning is: extract a patch, classify the central pixel, stride by one pixel, extract the next patch, classify, and so on. It's a patch-wise classification network. The network we used is a five-layer network with three hidden layers: 3,500 neurons in the first layer, 2,000 in the second, 1,000 in the third, and five output neurons corresponding to the number of classes in the problem.

Here is how the segmentation results look. The picture on the left is the ground truth; the picture on the right is the classification done by this MLP. This is the second case, the third case is over here, and there is the fourth case. You can see that a very simple five-layer MLP, pre-trained with denoising autoencoders, has achieved excellent segmentation results compared to the ground truth.

Let's analyze the metrics of this semi-supervised learning architecture, where you pre-train on unlabeled data first and then fine-tune on labeled data. The metric for analysis is the DICE score, which is twice the intersection of the segmentation A and the ground truth B, divided by |A| + |B|; your DICE score is 1 if your segmentation matches the ground truth exactly. This is our model, a five-layer stacked denoising autoencoder network, a very simple MLP. And this is the state of the art, DeepMedic, an 11-layer deep, multi-pathway, fully convolutional, fully supervised 3D network. DeepMedic was trained on 220 scans in a fully supervised manner and had the DICE scores shown here for whole tumor, tumor core, and active tumor, whereas our network, a simple five-layer MLP pre-trained on 135 scans and fine-tuned on 135 scans, achieved better performance on tumor core and active tumor than the fully supervised network. It's just a five-layer MLP, compared to an 11-layer deep, fully convolutional, fully supervised 3D network, which is where the current trend in deep learning is going, I presume.

In order to stress-test the semi-supervised setting, we went aggressive on the second experiment. Instead of fine-tuning on 135 scans, we pre-trained our network on 135 scans and fine-tuned it on just 20 scans, just 20 scans of annotated data, and our active tumor score was still better than the state of the art's performance.
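For reference, a minimal sketch of the fine-tuned classifier and the DICE score, assuming PyTorch; the layer sizes follow the talk, everything else is illustrative:

```python
import torch
import torch.nn as nn

# Fine-tuning: the pre-trained encoders plus one logistic (softmax) layer.
# 1764 -> 3500 -> 2000 -> 1000 -> 5, as in the talk; the first three layers
# would be initialized from the pre-trained stacked encoders.
classifier = nn.Sequential(
    nn.Linear(4 * 21 * 21, 3500), nn.Sigmoid(),
    nn.Linear(3500, 2000), nn.Sigmoid(),
    nn.Linear(2000, 1000), nn.Sigmoid(),
    nn.Linear(1000, 5),                # logistic classification layer
)
loss_fn = nn.CrossEntropyLoss()        # supervised loss on the central pixel's class
logits = classifier(torch.rand(8, 4 * 21 * 21))
loss = loss_fn(logits, torch.randint(0, 5, (8,)))

def dice_score(pred, truth):
    """DICE = 2 |A ∩ B| / (|A| + |B|) for two binary masks; 1.0 = perfect match."""
    intersection = (pred & truth).sum().item()
    return 2.0 * intersection / (pred.sum().item() + truth.sum().item())

# Example with two boolean masks (random stand-ins for prediction and truth).
pred = torch.rand(240, 240) > 0.5
truth = torch.rand(240, 240) > 0.5
print(dice_score(pred, truth))
```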
From here I'd like to propose the next hypothesis we tried. So far we had a two-step training process, where we pre-trained the network in an unsupervised manner and fine-tuned it in a supervised manner. Can we do classification in a completely unsupervised manner, with no labeled data at all?

To test this hypothesis, we took the denoising autoencoder and trained it with patches extracted only from normal brain scans; we showed this denoising autoencoder nothing but normal brains. During testing, we plot the error map of the reconstruction, and we see an interesting characteristic: the network's error spikes on the regions where the cancer has occurred in the brain scan. If you think of this autoencoder like a kid, or like a human, it is detecting abnormalities in a scan just by virtue of having seen only normal scans. The autoencoder spikes in cancerous regions, and we can use the reconstruction error to very beautifully create a heat map of the cancer. The architecture we chose was just a simple three-layer autoencoder, nothing fancy about it, and we had a whole tumor DICE of 0.8 with just unlabeled data, no labeling at all. We had an accuracy of 0.8 when we binarized this error map to create a segmentation map out of it.

So how good is this denoising autoencoder as a novelty detector? To test this, we trained the novelty detector on the brain tumor segmentation dataset without showing it any brain tumors at all. During testing, we switched to a different dataset, for stroke lesions; stroke patients also undergo MRI scans, and we wanted to see if the network could detect stroke from the stroke dataset. The first two pictures, on the top and the bottom, are the raw MRIs; the picture in the middle is the heat map produced by the reconstruction error of this novelty detector; the green picture shows the binarized segmentation from this reconstruction error; and the red is the ground truth. To test the performance quantitatively, we took the DICE scores again and evaluated against the state of the art, which is again DeepMedic, a fully convolutional 3D network trained in a fully supervised manner. It had a DICE score of 0.66, whereas our novelty detector, trained in a completely unsupervised manner with no labeled data at all, achieved a DICE score of 0.64, which is as good as the state of the art, considering that we did not train it with any labeled data and evaluated on a completely different dataset.

We thought this novelty detector was really good, so could we fuse it with the classification MLP that we trained in the first two steps? The picture on the left shows the prediction done by the MLP; as you can see, it has some false positives, and these false positives can be removed by using the novelty detector's mask. We multiply by the mask and retain the connected component particular to the cancer, and the post-processed image is almost as good as the ground truth. This work has been published, and if any of you are interested, you can follow the link on this slide and read the paper if you want more details about the training methodologies used.
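A minimal sketch of the reconstruction-error novelty detection described above, assuming PyTorch; the stand-in autoencoder and the threshold are illustrative:

```python
import torch
import torch.nn as nn

def error_heatmap(encoder, decoder, patches):
    """Per-pixel squared reconstruction error. An autoencoder trained only on
    normal brain patches reconstructs normal tissue well, so the error spikes
    on abnormal (tumor) regions. Inputs are fed clean at test time, no noise."""
    with torch.no_grad():
        reconstruction = decoder(encoder(patches))
    return (patches - reconstruction) ** 2

def binarize(heatmap, threshold=0.05):
    """Threshold the heat map into a segmentation mask (threshold illustrative)."""
    return heatmap > threshold

# Example with a small stand-in autoencoder over rasterized 4 x 21 x 21 patches.
enc = nn.Sequential(nn.Linear(4 * 21 * 21, 500), nn.Sigmoid())
dec = nn.Sequential(nn.Linear(500, 4 * 21 * 21), nn.Sigmoid())
patches = torch.rand(8, 4 * 21 * 21)
mask = binarize(error_heatmap(enc, dec, patches))
```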
Until now we had a decoupled training process: an unsupervised pre-training step followed by a supervised classification step. Now we wanted to combine these into a single-step process, and here we can use hybrid architectures, which can train on both unlabeled and labeled data in the same fashion.

We take the stacked denoising autoencoder again. The shape has changed, but the essential functionality is the same: the green boxes correspond to the encoders, and the yellow boxes correspond to the decoders. When you have unlabeled data, you ask the network to reconstruct the input. When you have labeled data, you not only ask the network to reconstruct, you also ask it to classify, by adding an MLP at the bottom of the encoder, and you modify your loss function accordingly: reconstruction loss plus alpha times classification loss, where alpha is a tuning parameter depending on the ratio of labeled to unlabeled data you have. Additionally, you use skip connections to fuse features from earlier layers and aid in reconstruction or classification. This setup is called a ladder network; as you can see, it has a ladder structure to it because of the skip connections between the encoder and the decoder.

If you want to go for more efficient training, you can replace the fully connected networks with fully convolutional networks: you replace the fully connected layers with convolutions and max poolings. One cross one convolutions can be used instead of fully connected layers, as they result in highly efficient, non-overlapping forward and backward inference. So the encoders are replaced with a series of convolutions and max poolings, the decoders are replaced with deconvolution layers, you have skip connections which fuse features from earlier layers, and you get a reconstruction map out. Additionally, you can also use this network to produce a segmentation map, and you can add an MLP at the bottom of the encoders for classification, modifying your cost function accordingly: reconstruction loss plus alpha times classification loss plus beta times segmentation loss, where alpha and beta are again tuning parameters which you can set depending on the ratios of data available to you. Where this might be useful is the current brain tumor segmentation 2017 challenge, wherein you are not only asked to segment the brain tumor but also to predict the prognosis of the patient, that is, how long the patient can survive. You can use regression there to fuse information from the segmentation network and use it to aid in this prognosis prediction.
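A minimal sketch of the combined cost function, assuming PyTorch; the supervised terms are added only when the corresponding labels exist, and alpha and beta are the tuning parameters mentioned above:

```python
import torch.nn as nn

mse = nn.MSELoss()              # reconstruction term (unsupervised)
xent = nn.CrossEntropyLoss()    # classification / segmentation terms (supervised)

def hybrid_loss(recon, x, class_logits=None, class_labels=None,
                seg_logits=None, seg_labels=None, alpha=1.0, beta=1.0):
    """Reconstruction loss always applies; supervised terms are added only for
    labeled samples. alpha and beta trade the supervised terms off against the
    ratio of labeled to unlabeled data."""
    loss = mse(recon, x)
    if class_labels is not None:
        loss = loss + alpha * xent(class_logits, class_labels)
    if seg_labels is not None:
        loss = loss + beta * xent(seg_logits, seg_labels)
    return loss

# Unlabeled batch:  loss = hybrid_loss(recon, x)
# Labeled batch:    loss = hybrid_loss(recon, x, class_logits=logits,
#                                      class_labels=y, alpha=0.5)
```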
So how I envision the future of data-efficient learning is this: we have all these medical imaging centers all over India, and we can deploy these unsupervised learning engines, autoencoders, and also GANs, because GANs are also trained in an unsupervised fashion, and continually learn the underlying structure of the data as and when the scans come in. We don't have to wait for supervisors to come and annotate the data before we can train in a supervised fashion. But when supervised labels do become available, you can clone models out of this unsupervised learning engine, label the data, and fine-tune your network for the corresponding task you are trying to solve, be it brain cancer, breast cancer, liver cancer, or any kind of medical imaging problem.

I'd like to conclude by saying that I'm really hopeful that with the current deep learning revolution, and the genomic revolution going on in the biotechnology and biosciences fields, we will have technological advances which will enable us by providing us with weapons that can help us in our war against cancer.

I would also like to acknowledge my colleagues at IIT Madras who helped me with this project: Dr. Ganapathy Krishnamurthi for guiding me throughout the year, Dr. Keshav Das for providing the domain knowledge regarding brain tumor segmentation, and Varghese and Subramaniam, my partners in this project. Thank you. We have time for questions.

Q: Hi, I'm Pranitha. I have a question regarding the loss function. You mentioned that you use the reconstruction loss plus some parameter alpha times the classification loss. Can you talk about this? Is your reconstruction loss something like a sum of squared errors, and what classification error are you using?

A: The classification error would be a negative log likelihood. Your mean squared error would be on one scale and the log likelihood on a different scale, so you can tune the alpha according to that.

Q: It seems one of the assumptions in your architecture is that you are taking a 4x21x21 slice, so you are considering only that slice and nothing surrounding it. If I took a normal brain scan and shuffled all the slices, it would detect no anomaly at all; it doesn't depend on the arrangement of the slices. Is there a way you can improve it, perhaps so it understands whether the arrangement is correct?

A: Yes. Instead of extracting a 21x21 patch, you can probably extract a 21x21x21 3D patch, cross 4.

Q: But what would the cross 4 be in that?

A: The 4 is the number of sequences. MRI is special: you don't have only one image, you have multiple 3D images. So the 4 is the four sequences: FLAIR, T1, T2, and T1 contrast.

Q: So basically your answer is that you will just have to increase the slice size. OK, thanks.

Q: I'm curious about your labelled samples. You said you have only a few labelled samples; are they all positive, or are they a mix of positive and negative cases?

A: We had around 300 brain scans. Each brain scan has about 240 slices, and each slice is about 240 by 240, so it's roughly a 240x240x240 tensor, and around 2% of the pixels of this tensor are labelled as tumor. All of the 300 scans were annotated.

Q: Since for each patch you are predicting only the center pixel, it would introduce noise. How are you looking into that?

A: It will not introduce noise.

Q: You are predicting each pixel based on the autoencoders, so it might happen that there is a neighborhood which has tumors, but one of the pixels is not predicted correctly, so there would be noise in your prediction.

A: That would be peripheral noise, but if you can explain further what that noise would be, I might be able to answer.

Q: Let's say I have a band of 5 patches which are all one class, say positive, and my prediction says 4 of them are positive and 1 of them, say the center one, is not. When I give this to the doctors, I need to give them the whole band of 5, not 4 and 1. How do you remove that one mistake?

A: Those kinds of mistakes can be removed by morphological operations. You have dilation and erosion kernels; those are simple image processing algorithms which you can use to smoothen your predictions.

Q: But wouldn't that further increase your error?
A: The problem you are describing is not fundamental to the architecture; it's not fundamental to the pixel-level classification process described here. The noise you are describing is possible for any kind of segmentation algorithm.

Q: Let's say I am predicting a whole patch at a time; then the local neighborhood would come into the picture, and I would predict, say, a whole patch to be class 1 or class 0.

A: Your convolutional neural networks apply convolutions across kernels, so they have a particular context at which they look into the image and classify whether something is a lesion or not. That information is already taken into account by using convolutional neural networks. And also, as I said in the last few slides, your fully connected layers can be implemented as convolutional layers by using one cross one convolutions; it is the same thing.

Any more questions? All right, thank you.