Alright, thanks everybody for coming to the AI Village. Our next talk today is by Yisroel, on automated injection and removal of medical evidence in CT and MRI scans. Could we have a big round of applause?

Thank you everybody. Can you hear me? Okay, great. So this is my first DEF CON and I'm very happy to be here; this is quite an experience. My name is Yisroel Mirsky, and I'm a cybersecurity researcher at Ben-Gurion University. Maybe you've heard of deepfakes before, where people have used deep learning to do face swaps, or to create clones of individuals and make them do things. In this talk we're going to look at deepfakes in the medical domain, and specifically at how an attacker could inject or remove medical evidence in CT and MRI scans.

Okay, so a little bit of background. For those of you who don't know, MRI and CT scanners are medical equipment that take 3D volumetric scans of your body, and they're used to diagnose various medical conditions. MRI scans can be used to diagnose problems in your bones, joints, ligaments, cartilage, discs, and your brain as well. CT scans are mostly used for diagnosing cancer, heart disease, appendicitis, trauma, and so on. In the US in 2016 alone there were about 38 million MRI scans and almost 80 million CT scans, and those numbers keep going up. They're a very powerful and important tool in the medical community for diagnosing certain medical issues.

So what's the vulnerability? First of all, there are many radiology networks that are exposed to the internet, either intentionally or unintentionally. If you do a quick search on Censys.io, you find something like 10,000 of these networks.
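As an aside on how that kind of exposure gets found: DICOM services conventionally listen on TCP port 104 (or the IANA-registered 11112), so even a crude TCP connect can show that a scanner interface is reachable. This is my own minimal sketch, not the tooling used in the talk; search engines like Censys or Shodan do this at internet scale with proper protocol probes.

```python
import socket

def dicom_port_open(host: str, port: int = 104, timeout: float = 2.0) -> bool:
    """Crude reachability check for a host's DICOM service port.

    A real probe would attempt a DICOM association (an A-ASSOCIATE
    handshake); a bare TCP connect only shows that something is listening.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A plain connect can't tell a PACS server from any other service on that port, which is why the internet-wide search engines speak enough DICOM to confirm the protocol.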
And the healthcare industry in general has a poor security track record, as we've seen in the past. One good example: several months ago, McAfee researchers were able to get into one of these radiology networks, download a CT scan of a person's pelvis, and then make a 3D print of it. There are various reasons for this, but a big one is that medical infrastructure such as hospitals runs a lot of legacy systems, and those don't pair well with new technologies, so they tend to use very downgraded protocols. We'll take a look at that in a few moments.

So what's the threat? We're considering not just the case of somebody stealing your data and looking at your personal information, but the case where the attacker goes ahead and changes that data. Not just medical records like your blood test results, but the actual scans: manipulating those scans to affect your diagnosis.

Why would an attacker want to do that? There are a few aspects. The first is psychological: the attacker wants to cause some sort of trauma or life change. Think of a political leader the attacker wants to step down and rethink his life, or some act of global terrorism. Then there's the monetary aspect: maybe the attacker wants to sabotage or falsify research evidence, deploy ransomware, or, in the more likely case, commit insurance fraud. The attacker gets his hands on his own scan and injects a small fracture in his spine, or a small aneurysm, which is evidence that's very hard to refute through other tests; then he claims some sort of quality-of-life insurance payout and gets millions of dollars. And then there's the physical aspect, where the attacker wants to cause physical harm. I think those cases are pretty clear.

So what's the general approach? Let's take a look at how a CT scan is processed inside a hospital.
So when you go into a CT scanner and it scans you, it takes these slices, 2D images of your body. Those are stored in a standard format called DICOM and sent over the Ethernet network to a PACS server for later viewing. At some point the radiologists will pull those scans, analyze them, write a report, and forward it to the requesting doctor, whether it's an oncologist or a neurologist, depending on what they're looking at. That doctor then gives you some sort of diagnosis. So the most effective point of entry for the attacker is somewhere between where the scan was made and where the radiologist gets to make his or her report.

This is a very basic topology of what the network inside a hospital looks like. On the left side you have a bunch of different scanning modalities: X-ray machines, CT scanners, MRI scanners, and so on and so forth. These are all connected to PCs called modality workstations, which more often than not are Windows XP machines. They capture the raw data from the scanners, convert it to the DICOM format, and send it over the network to the PACS server. Inside this network, which is supposedly segregated, but not quite, you have other kinds of devices: radiologist workstations for reviewing those scans, and various administrative systems. And of course it's many times bridged over to the entire hospital's network, so that referring doctors can pull scans, look at them, and make their own prognosis. Those hospital networks also have their own Wi-Fi connections, and many times the network itself is connected to the internet so that radiologists in other countries can perform off-hours viewing of the scans. As you can see, many different attack vectors result from this.
Now, we have published a paper on arXiv, if you want to take a look afterwards, that goes into detail on all the different attack vectors. At a very high level there are three: from the internet, through Wi-Fi access points, and via physical intrusion. Once the attacker is in, he can plant malware, whether on the modality workstations themselves, as a man-in-the-middle device, on the PACS server, or wherever else he can place it. He then has full control over all the scans going through the network, and it's very easy to pick out a specific individual, because the DICOM format lists the patient's name and all their information, so the malware can automatically find which scan to tamper with. There are other locations where he has a chance of picking up the right scans too, such as deploying the malware inside the viewer itself.

Okay, so let's get to the interesting part, the deep fakery of it all, and talk about how exactly an attacker can automate the process of injecting or removing medical evidence. We're going to focus on lung cancer: how an attacker can inject or remove lung cancer in a CT scan.

So let's talk about training this model. The first step in training a model is data. In our research, we went to the free databases of CT scans you can get online, found one with lung cancer, and downloaded it. It already came with all the annotations of where the cancers are; annotations meaning the X, Y, Z coordinates.
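Back on the targeting step for a moment: I mentioned the malware can pick out a specific patient from the DICOM metadata. Here's a minimal sketch of that matching logic; the `FakeDataset` stand-in and the helper are my own illustration. In practice you would load files with a library like pydicom (`ds = pydicom.dcmread(path)`), whose datasets expose the same `PatientName` attribute, and the `Family^Given` name encoding comes from the DICOM standard.

```python
# Sketch: matching an intercepted DICOM dataset against a target patient.
from dataclasses import dataclass

@dataclass
class FakeDataset:
    """Hypothetical stand-in for a pydicom Dataset, to keep this self-contained."""
    PatientName: str

def is_target(dataset, target_name: str) -> bool:
    """True if the dataset belongs to the target patient.

    DICOM tag (0010,0010) stores the name as 'Family^Given',
    so we normalize the separator before comparing.
    """
    name = str(getattr(dataset, "PatientName", "")).replace("^", " ").lower()
    return target_name.lower() in name
```

Filtering on tags like this is why a man-in-the-middle on the scanner link can tamper with one victim's scan while passing everyone else's through untouched.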
What we did is extract cubes surrounding those areas, because you want the neural network to focus just on the region that's relevant. If you give it the entire scan, you're talking about almost a billion pixels; you're not going to be able to train a network that way, and it would be training on so much information that isn't relevant. So instead we cut out small cubes, the areas of interest, and build a dataset of those cubes. We perform some preprocessing steps: histogram equalization, which helps all the features pop out, kind of like adding contrast, and standard normalizations, zero-one normalization. That gives us a nice dataset of cubes of cancer samples. We also perform some data augmentation, because often we don't have that many samples; anywhere between 600 and 1,000 samples is very, very small, so we augment by rotating, shifting in different directions, and adding a little bit of noise, getting us to a few thousand. These cubes are about 32 by 32 by 32 voxels.

Okay, so now that we have our dataset, we can train our neural network, and the network we use is an autoencoder, a 3D autoencoder, so the inputs are three-dimensional. Before we put a sample through the network, we mask the center: we erase it with zeros, pass it through the neural network, and ask the network to predict what it thinks should go inside the masked area. This process, called in-painting, gets the neural network to guess what should be there based on the context of that blue area you see on the side.
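The cube preprocessing, augmentation, and center-masking just described can be sketched roughly like this. This is a minimal NumPy sketch; the exact equalization variant, noise level, augmentation set, and mask size are my assumptions, and the real pipeline may differ in detail.

```python
import numpy as np

def equalize(cube: np.ndarray, bins: int = 256) -> np.ndarray:
    """Histogram equalization: spreads intensities so features 'pop out'."""
    flat = cube.ravel()
    hist, edges = np.histogram(flat, bins=bins)
    cdf = hist.cumsum().astype(np.float64)
    cdf /= cdf[-1]                                  # normalized CDF in [0, 1]
    return np.interp(flat, edges[:-1], cdf).reshape(cube.shape)

def normalize(cube: np.ndarray) -> np.ndarray:
    """Zero-one normalization."""
    lo, hi = cube.min(), cube.max()
    return (cube - lo) / (hi - lo + 1e-8)

def augment(cube: np.ndarray, rng: np.random.Generator):
    """A few simple augmentations: rotation, flip, and additive noise."""
    yield np.rot90(cube, k=int(rng.integers(1, 4)), axes=(1, 2))
    yield np.flip(cube, axis=int(rng.integers(0, 3)))
    yield cube + rng.normal(0.0, 0.01, cube.shape)

def mask_center(cube: np.ndarray, frac: float = 0.5) -> np.ndarray:
    """Zero out the central region, leaving the surrounding context
    intact for the in-painting network to complete."""
    out = cube.copy()
    m = [int(s * (1 - frac) / 2) for s in cube.shape]
    out[m[0]:cube.shape[0] - m[0],
        m[1]:cube.shape[1] - m[1],
        m[2]:cube.shape[2] - m[2]] = 0.0
    return out

rng = np.random.default_rng(0)
cube = normalize(equalize(rng.random((32, 32, 32))))
samples = list(augment(cube, rng))
masked = mask_center(cube)
```

Each training pair is then (masked cube in, original cube out), which is what pushes the network to hallucinate plausible content into the hole.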
So that blue area acts as a kind of tissue context: the network can figure out, given the surrounding structure, how to complete the hole in a realistic way. And if you train this network on many, many cancer samples, then no matter what cube you put in, whether it actually contains cancer or not, once you erase the center and pass it through, the network will try to complete it with cancer; and vice versa if you trained it on benign samples.

This process works decently, but the generated images are often very blurry. To improve that, we add another neural network, as some of you have already recognized from this structure. This extra network is called the discriminator, and it's in charge of policing the outputs to make sure they look realistic, with no blurriness so to speak. It's trained on samples that are real and samples that are fake, generated by the first network, and it has to decide which is real and which is fake. During training, the generator gets a signal telling it what mistakes it made, where it slipped up and produced some artifact or blurriness, so that it can fix them. This general architecture is referred to as a generative adversarial network (GAN), and the particular network we're looking at is a pix2pix network used for in-painting, with 3D inputs and 3D filters. One more point: after training, I actually throw out the discriminator, because all we're really interested in is the bottom half, the generator.

Okay, so we've trained our generator; how do we actually deploy it in malware? How would an attacker actually go about doing that? There's a kind of attack pipeline, if you will. Once the malware sees that a scan with the right DICOM tags for the particular patient is going over the network, it
will use a simple algorithm to locate where it wants to inject or remove the evidence. It scales up that cube to the right proportions, because many scans are stored with different aspect ratios, goes through the whole preprocessing process, zeroes out the center, uses whichever neural network you want to either inject or remove the evidence, then reverses the whole preprocessing, adds some noise to cover up the interpolation blur, and pastes it right back in. The algorithm can repeat that multiple times if you want to inject multiple nodules, and eventually it writes everything back into the DICOM file.

So here are some sample results. This is cancer injection, and we're looking just at 2D slices here; obviously they're 3D. The left side is before, the right side is after. And this is removal. You can see it does a pretty good job.

And actually I have here a tool which you can play with; it lets you inject and remove cancer as you go. Let's give it a second to start up. Once you've trained the neural network, you don't need a GPU to execute it; it can run on the CPU just fine. So here we go: I've loaded up a full scan of somebody's lungs, and I'm just going to pause it. If I click somewhere, the first time takes a few seconds because it has to load the model, but you can see that just by clicking in different areas you can easily inject cancer. And the same algorithm we looked at before can be used to remove cancer, so if I click over here, and move this horrifically ugly GUI out of the way, and find a cancer... here's one, it's a big one. You have to click twice. Let's see if we can find another one. As you can see, it works pretty well.

That's nice, but it's not good enough just to say, okay, we can play around with neural networks and generate some content. Is it realistic enough? This is a 3D view; let me skip ahead. By the way, this is just showing that it doesn't
just work with lungs; it also works for all sorts of other conditions. This is showing reconstruction of brain tumors as well.

Okay, so can we fool the experts? We've shown how we can generate or remove medical evidence, but is it realistic enough? This is how we evaluated our system. We took 100 CT scans and split them up as follows: 35 of them actually contained malignant cancer, and we used the tool to remove all the cancer automatically; 35 of them were clean scans of completely healthy patients, and we injected one or more nodules into each of them; and we also added 30 untampered scans as a baseline. Then, with this dataset of 100 scans, we hired three expert radiologists and also got our own state-of-the-art lung cancer screening AI, downloaded freely online, which won the Kaggle 2017 competition.

The first evaluation was a blind trial: we didn't tell the radiologists anything fishy about these scans, we just said, go diagnose them and tell us what you think. For the second trial we took the last 20 scans and made it open: we told them, listen, we tampered with these scans; can you tell the difference between a tampered scan and an original one?

For the blind test, the results were quite staggering: we had a 99% attack success rate for injecting cancer and a 94% attack success rate for removing it. We also asked the radiologists what they thought about the tampered cancers, without telling them they were tampered, and they gave pretty high scores in terms of how malignant and dangerous they are; in most cases if not all, they required some sort of biopsy. Then, for the open test where we revealed the truth, the radiologists' accuracy was still very low, anywhere between 30 and 60%, which means they weren't really able to differentiate between a real cancer and a fake
cancer.

Okay, so we know the results are realistic, and we know how an attacker can go about this attack and what the attack vector might be. But how can we validate that this is actually possible? It sounds kind of sci-fi, that an attacker can get into a hospital and manipulate those scans. So we decided to take this one step further: let's do a covert penetration test on an actual hospital and show that we can get in there, deploy a man-in-the-middle device, and tamper with those scans.

So this is how it went down. We took a Raspberry Pi and a USB-to-Ethernet adapter, printed out a small little Philips label so it looks legit, and put it all together. It looks something like this: a man-in-the-middle device that you can plug in between the CT scanner and the rest of the PACS network. It intercepts everything while leaving no forensic evidence; we made sure it left no traces. It also acts as a very nice backdoor into the network, because from the waiting room we can connect to its access point.

This is a short video clip showing the penetration test. What I did basically is go to this hospital, with permission of course, but I didn't tell them I was coming. I waited for the cleaning staff to open the door and just walked right in; if you walk in like you belong there, no questions are asked. It took a few minutes, but I was able to find the radiologist workstations, so I could have planted my device there, but that wasn't the prize. Not far away was one of their CT scanning rooms, and there's the CT scanner. Once I found it, it took no more than 30 seconds to unplug the Ethernet cable, plug in my own little man-in-the-middle device, and very conveniently tuck it under this floor panel so nobody would find it. And that was it: I was able to intercept all the data, and as you'll see, I also got a nice strong Wi-Fi signal from the waiting room. Another
interesting thing is that I learned a lot from this penetration test. It wasn't just about intercepting the scan; I also found that the internal security of many hospitals is actually very poor. Because the network is somewhat segregated, they assume they can be quite relaxed about internal controls. After I showed them what I'd done and how I got into the network, they helped me perform a CT scan, and I was able to intercept it. I found that the scan was not only sent over in clear text; it was supposedly encrypted, but it was all in plain text, so I was able to manipulate the entire content. I also found that after something like three minutes I had collected 27 credentials that the hospital just freely broadcast over the network, because again, supposedly nobody except internal users could use this network. But anybody who walks through this hospital can see Ethernet ports on the wall, plug right in, and get access.

After this we went to the Washington Post, and they made an article. Through some excellent journalism they found that this may be the general case in many hospitals in the US and around Europe: the hospital's internal network has very poor security, and encryption is typically not enabled between the scanners on the network and the PACS server.

So, very briefly, some countermeasures. There are preventative countermeasures: we can try to secure the data in motion; at the very least, enabling encryption on the network would solve a lot of problems. But as I mentioned before, in a lot of hospitals, even though they could technically enable encryption, that would cause a lot of different components to fail, because they're using a lot of legacy components. That old X-ray machine they're still using doesn't support the latest protocols, so they'd rather just leave it all downgraded. Another thing they could do is staff awareness: if someone doesn't belong there, staff should ask questions, not just ignore them.
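Staff awareness and encryption are the preventative side; the DICOM standard also defines a detection mechanism, digital signatures, which comes up next. As a hedged illustration of how software could check for one: the `StubDataset` class is my own stand-in, but real code would use a DICOM library like pydicom, where the attribute keyword for tag (FFFA,FFFA) is `DigitalSignaturesSequence`. Cryptographically verifying the signature is a separate step not shown here.

```python
# Sketch: checking whether a DICOM dataset carries a Digital Signatures
# Sequence (tag FFFA,FFFA in the DICOM standard).
class StubDataset:
    """Hypothetical stand-in for a pydicom Dataset."""
    pass

def has_digital_signature(dataset) -> bool:
    """True if the dataset declares at least one digital signature."""
    seq = getattr(dataset, "DigitalSignaturesSequence", None)
    return bool(seq)
```

A PACS or viewer could run a check like this on ingest and flag unsigned scans, but that only helps if the scanners actually sign what they produce.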
And there are also ways of detecting tampering. In the DICOM standard you can actually enable digital signatures, so that the CT scanner puts a digital signature on the scan and you can verify that it hasn't been tampered with; but unfortunately, as far as I know, nobody uses that. There are also more advanced techniques such as watermarking and image tamper detection, but none of these methods are deployed on site as far as I know, and there's a lot of ongoing research on this subject as well.

So in summary: with deep learning it's possible to inject medical evidence into CT and MRI scans, or remove it, automatically and realistically. The model can fool a state-of-the-art AI 100% of the time and expert radiologists between 94 and 99% of the time. As we've seen, the attack is viable, and it can also be easily mitigated if the healthcare industry takes a few steps forward and cleans up its security hygiene. So I think maybe we have a few minutes for questions, if anybody has a question.

Yeah, I mean, what you saw here I just ran on my laptop, so any decent CPU can do it; the question is really how quickly you want to execute it. The trained model itself was about 500 megabytes, just the model, yeah.

So like I said before, all these scans are saved with different aspect ratios, I think for compression reasons, so the neural network has to see everything in the right units. You can't give it one scan that's half the measurement of the previous scan, because things just won't match up, so we have to scale everything. We just used simple interpolation in Python, and that causes some blurring, which is why, as you saw, we also add noise to cover up those sins. But that's all we did, really.

Yeah, the dataset was, I think, 800 scans total. We didn't use all of them, because we only wanted the ones with cancer, and the real dataset size is actually just how many nodules we extracted, which in total was something
around 600 before augmentation; after augmentation I think we got to probably 16,000. And even though they're 32 by 32 by 32 cubes, that's still a lot of pixels, so it took about 24 to 30 hours to train the network.

Yep, okay, last question. It can happen on the network itself, on the go: when the file is being sent from the scanner to the PACS server, it's in clear text, so the Raspberry Pi that I had just captured it, changed the pixels, and sent it along. You don't have to do it as every single packet goes by, because it just opens a TCP channel; you just pretend, as in a classic man-in-the-middle attack, gobble up all the data, process it in your own time, take a minute or two, and then pass it along.

Okay, so if anybody wants to, come to the AI Unwind and check it out and speak with me. Alright, thank you.