Hi, welcome everybody. Thanks a lot for coming to my talk. What you are about to see here today is pretty cool stuff. I had lots of fun working on this project, so I hope you'll find it cool too. So, in a nutshell, I'm about to show you how to set up a cache timing covert channel so that two VMs co-located on the same physical box, same socket, can talk together without being detected. "Without being detected" is important here, because this talk is all about a practical implementation, not just some proof of concept or theory stuff.

Before we begin, the usual disclaimer: this is a research project that was done on my own time, on my own network. This talk reflects my opinion, not my employer's. The information and code provided here should be used for educational purposes only.

All right, let me introduce myself and give you some context around how I ended up working on this project in the first place. My name is Etienne Martineau. I'm from Ottawa, Canada. I currently work for Cisco Systems, where I do Linux kernel and KVM stuff. As a kid I was fascinated by electronics and radio, especially the concept of modulation. It was only much later, during my engineering studies at Laval University, that I finally understood what was going on. That was cool. But then I got a job and ended up spending several years hacking on the Linux kernel, and as you may imagine, I kind of forgot all about modulation. Up until recently, when, as part of my work doing low-level performance analysis on KVM, I noticed something strange. Basically, I observed some sort of crosstalk going on between two virtual machines. The crosstalk was very subtle, but the tool I was using was smart enough to detect it. Doing more investigation, I realized that the two VMs were being assigned to the same physical core, but to different threads of execution. This "threads of execution" concept is typically known as SMT, or hyper-threading on Intel. I did some more research and found a nice diagram from Intel. Here we can clearly see that the two execution threads are actually sharing some common functional units, so some operations have to be serialized one after the other. That explained the results I got.

But then I had an idea. What if, on one of the hyper-threaded siblings, I modulate a contention pattern over the execution pipeline? Say a long instruction is a one and a short instruction is a zero. And what if, on the other hyper-threaded sibling, I try to detect that amount of contention over the execution pipeline by executing an instruction and measuring the time it takes? If it's slow, it's a one; if it's fast, it's a zero. I realized that with this technique I could do pretty cool stuff, such as sending information from one VM to another, assuming the two VMs are assigned to those hyper-threaded siblings. (I'll show a rough sketch of this idea in code in a moment.)

So I ended up spending quite a bit of time on this project. One of my goals was to see with my own eyes the quality of the signal, meaning the quality of the communication channel. So naturally I decided to try to transmit an image, so that I could see the resulting output on the other side. Obviously, the very first image I used back then was not this one; it was a picture of my kids, which I am obviously not going to show around here. So I used the DEF CON logo this time, reformatted at 640 by 480, VGA quality, one bit per pixel. And this is what I got on the other side.
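Here is that rough sketch of the pipeline-contention modulation. This is a minimal illustration, not the talk's actual code: it assumes the sender and receiver are pinned to the two SMT siblings of one core (for example with taskset), and the bit period, the divide-based "long instruction", and all names are my own choices.

```c
/* contention.c -- hyper-threading pipeline-contention sketch.
 * Build: gcc -O2 contention.c -o contention
 * Run "./contention tx" on one SMT sibling and "./contention" on the other. */
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>                   /* __rdtsc(), _mm_pause() */

#define BIT_PERIOD_CYCLES 100000ULL      /* assumed symbol time; tune per machine */

/* Sender: a '1' keeps the shared divider busy, a '0' mostly idles. */
static void send_bit(int bit)
{
    volatile uint64_t x = 0xdeadbeefULL, d = 3;
    uint64_t end = __rdtsc() + BIT_PERIOD_CYCLES;
    while (__rdtsc() < end) {
        if (bit)
            x /= d;                      /* long-latency divide: contention */
        else
            _mm_pause();                 /* cheap: execution units stay free */
    }
}

/* Receiver: time a fixed burst of divides; contention makes it slower. */
static uint64_t sample_latency(void)
{
    volatile uint64_t x = 0xdeadbeefULL, d = 3;
    uint64_t t0 = __rdtsc();
    for (int i = 0; i < 1000; i++)
        x /= d;
    return __rdtsc() - t0;
}

int main(int argc, char **argv)
{
    if (argc > 1)                        /* sender side: hammer out ones */
        for (;;)
            send_bit(1);

    for (int i = 0; i < 20; i++)         /* receiver side: watch samples jump */
        printf("sample %2d: %llu cycles\n", i,
               (unsigned long long)sample_latency());
    return 0;
}
```

Run the receiver alone to get a baseline, then start the sender on the sibling thread: the latency samples should jump, and that difference is your one versus zero.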
So that was pretty cool: I was able to see all the noise present in the channel and so on, but I was still able to get some information out of it. It was at that point that I essentially realized the security problem. I said to myself, there is a big issue with this type of stuff, and that is basically why I am here today.

By the way, before we move on to the core of this talk, just for fun here at DEF CON, I have a recording that shows what happens in real time when we take that image and send it over and over again at 15 frames per second. One thing I did in that recording is start and stop a noise generator in the background, which is essentially a compilation of the Linux kernel, and you will see its effect on the communication channel. One more thing: I am running the mp4 encoding software on the same machine where I am running this experiment, so on its own that thing generates quite a bit of noise. All right, let's take a look at the video. We see that when the Linux kernel compilation is running, the channel is completely saturated by noise. That's expected: the pipeline is running so many instructions at the same time, with so many context switches coming in, that nothing can get through to the other side.

All right, let's take a step back here. My goal was to come up with a practical implementation, not just some theory stuff. Why? Because I wanted to prove that this is a real issue and that we need to fix it. So in this talk we are going to go over the design and what it takes to build such a cache timing channel. Basically, we are going to go over the shared resources on x86 multi-core. I am going to show you how to encode and decode data using cache lines, and in doing that we will see the effect of the hardware prefetcher, plus a trick to get around it. Then we will see that the encoded data we put in the cache lines doesn't stay there very long, especially with VMs, because with VMs there is lots of noise. I will also show you how to find cache lines that are shared across VMs. Then I will show you my phase-locked loop implementation, which essentially enables two processes running in different VMs to synchronize together very, very precisely. Finally, we will cover some detection and mitigation aspects, and towards the end of this talk I will do some sort of bandwidth measurement on this channel and we will go over a reverse shell example.

All right. So when you have hyper-threading enabled, there are lots of possibilities for inter-VM modulation, assuming the two VMs are assigned to the two hyper-threaded siblings on the same core. You can do pipeline contention, which is the first example I showed you, but you can also do modulation in the L1 cache, or in the L2 cache. Now, if you have hyper-threading disabled... all right, maybe this one first.

[Goon interruption:] As you guys are all very familiar with by now, we have a fantastic tradition called Shot the Noob. Has this guy been doing good? That's exactly what I like to hear. All right, we're not going to hold him up too much longer. To new speakers, new attendees, and to DEF CON 23!

As I was saying: if you have hyper-threading disabled, which is typically the case because this type of issue was reported way back in 2005 (link at the bottom of this page), we can still do modulation, but this time in the L3 cache.
This is what this talk is all about. Obviously, if the VMs are assigned to different sockets, this cache timing modulation won't work, because the caches are not shared across sockets. But you see there is that bus that connects all the caches together, the cache coherency fabric, which could potentially be used as well. That's also interesting, but it's outside the scope of my talk today.

All right, now it's time to understand how we encode data in the cache. A cache line typically holds 64 bytes, so when you read a byte that is not in the cache, the whole cache line is brought in from memory. The basis of this trick is that we can measure very accurately the time it takes to read a byte: from L1 it's very fast, from L2 a bit slower, from L3 slower still, and from main memory it's very slow. So the way we encode a pattern is to load or flush particular cache lines: say a loaded line is a one and a flushed line is a zero. For decoding, we measure the time it takes to read a byte belonging to each cache line: if it's fast, it's a one, because that cache line was loaded; if it's slow, it's a zero. Sounds simple? (There's a sketch of this in code coming right up.)

All right, let's look at a practical example. When I started this, I wrote a simple client and server test program. There is no VM in the picture at this point; this thing runs directly on Linux on the host, and the cache lines come from shared memory. Here the client is encoding a pattern, the graphic you see at the bottom left. Once the pattern is encoded in the cache, the client signals on a mutex, the server wakes up and does the decoding. But here is something weird: this is what I got on the other side. There is clearly a pattern when you look at it — this is not pure noise — but it's not what I sent.

So I took a step back and wrote a simple test that flushes all the cache lines from zero to 100, and then measures the time it takes to load them back. I was expecting long latency for all of them, but something else was going on: some of the cache lines exhibited long latency, but lots of them were very fast. What's going on? This is when I learned about prefetching. Prefetching means bringing data or instructions from memory into the cache before they are needed. On the processor I'm using, a Xeon 5500, there is more than one prefetcher: one for L1, one for L2, and different algorithms besides. But at the end of the day, this thing is trying to predict which addresses will be needed in the future. Before I move on: the hardware prefetcher is one of those things you can enable or disable directly at the BIOS level. That is obviously not what I'm doing here, because we don't have that kind of access to the machine, so we have to work around it. So I came up with the idea of randomizing the cache line accesses within a page. Fair enough — but it turns out we also need to randomize the accesses at the page level.
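Before getting deeper into the prefetcher problem, here is the promised sketch of the flush+reload encode/decode scheme. This is my reconstruction, not the talk's code; the 120-cycle hit/miss threshold is an assumed, machine-dependent value you would have to calibrate.

```c
/* flushreload.c -- encode/decode one bit per cache line.
 * Build with: gcc -O2 -c flushreload.c */
#include <stdint.h>
#include <x86intrin.h>     /* _mm_clflush(), __rdtscp() */

#define LINE      64       /* bytes per cache line */
#define THRESHOLD 120      /* cycles; cache-hit vs. memory-miss cutoff (assumed) */

/* Encode: a loaded line is a one, a flushed line is a zero. */
static void encode_bit(volatile uint8_t *line, int bit)
{
    if (bit)
        (void)*line;                        /* touch: line brought into cache */
    else
        _mm_clflush((const void *)line);    /* evict: next read will be slow */
}

/* Decode: time a single read of the line; fast = one, slow = zero. */
static int decode_bit(volatile uint8_t *line)
{
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    (void)*line;                            /* the timed reload */
    uint64_t t1 = __rdtscp(&aux);
    return (t1 - t0) < THRESHOLD;
}
```

With buf pointing at memory visible to both sides, bit i simply lives in cache line i: encode_bit(&buf[i * LINE], bit) on one side, decode_bit(&buf[i * LINE]) on the other.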
Coming back to the prefetcher: in other words, you cannot just go with an incremental pattern at the page level either, because the hardware prefetcher will kick in, detect it, and load your cache lines in advance. Doing all that apparently tames the hardware prefetcher, at least on the machine I was running on. (A sketch of the randomized access order is coming up at the end of this section.)

Then I faced another problem: what happens if you wait longer before doing the decoding? Right now the client encodes and signals the server, the server wakes up and decodes — it's very fast. But what happens if you wait? Wait more? And wait even more? Well, we clearly see that the time from when you encode the data to when you decode it has to be very small. Otherwise, the other stuff running on the system kicks in, starts to pollute the cache, and essentially erases your data. In other words, the encoded data evaporates from the cache pretty quickly. And this is even more true when running in a VM, because with VMs there is lots of noise.

Speaking of noise, I've done a couple of experiments to characterize it. I have a test program built around a calibrated software loop that takes exactly two CPU cycles per iteration, and I run that loop 100,000 times, so I expect the execution time to be 200,000 cycles. I repeat that test 1,000 times. When you run on a bare-metal kernel with all interrupts disabled, there is no noise: the loop takes 200,000 cycles, over and over again. But if you run in user space, with processes running and so on, there is some noise, and it comes from the host operating system running interrupt handlers and all that stuff. By the way, the small spikes you see there are the per-CPU timer interrupts — this is a six-core machine — and the bigger spike is a network interrupt running on CPU zero. Now, if you run in the kernel inside a virtual machine with all interrupts disabled, there is still quite a bit of noise, because the host kernel is running all its interrupts, and you get extra noise from the hypervisor layer. And finally, if you run in VM user space, which is where this communication thing will live, there is quite a bit of noise: the guest kernel has its own timer interrupts and everything else, plus the hypervisor layer, plus the host. It looks really bad here, but if you do the math, the degradation comes down to about two percent, which is about what we expect for a compute load running in a VM.

All right: now we understand the noise, and we have a way to trick the hardware prefetcher. Time to put the client in VM one and the server in VM two — remember, in the first test everything was running directly on the host. But then I realized there was another problem: the cache lines I was using initially are not available anymore. The L2 and L3 caches are tagged by physical address, but in a VM, the physical address you see has nothing to do with the real host physical address that the cache is using. Why? Because there is that other translation layer. Before we get to the way around that, here is the promised prefetcher-defeating access order.
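This is a sketch of the randomized visiting order, using a stock Fisher-Yates shuffle; the buffer layout constants are my assumptions, and op stands for either the touch or the flush operation from the previous sketch.

```c
/* randorder.c -- visit cache lines in an order that is random both across
 * pages and within each page, so the stride prefetchers see no pattern. */
#include <stdint.h>
#include <stdlib.h>        /* rand() */

#define PAGE_SIZE 4096
#define NPAGES    64
#define NLINES    (PAGE_SIZE / 64)     /* 64 cache lines per 4 KiB page */

/* Standard Fisher-Yates shuffle of an int array. */
static void shuffle(int *a, int n)
{
    for (int i = n - 1; i > 0; i--) {
        int j = rand() % (i + 1);
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }
}

/* Apply 'op' (touch or flush) to every cache line in randomized order. */
static void visit_randomized(volatile uint8_t *buf,
                             void (*op)(volatile uint8_t *))
{
    int pages[NPAGES], lines[NLINES];
    for (int i = 0; i < NPAGES; i++) pages[i] = i;
    for (int i = 0; i < NLINES; i++) lines[i] = i;

    shuffle(pages, NPAGES);                      /* random page order... */
    for (int p = 0; p < NPAGES; p++) {
        shuffle(lines, NLINES);                  /* ...and line order per page */
        for (int l = 0; l < NLINES; l++)
            op(&buf[pages[p] * PAGE_SIZE + lines[l] * 64]);
    }
}
```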
Coming back to the translation problem: from inside the virtual machine, you don't have access to that information. It's a tricky problem to solve, though I don't think it's impossible. But fortunately for us, we don't have to worry about this issue at all, thanks to KSM — well, at least as long as KSM is enabled on those systems. So what is KSM? KSM, kernel same-page merging, is a kernel thread running on the host that scans the running processes and compares their memory. If it finds identical pages, KSM merges them into one single page. Obviously, if one of those programs later wants to modify one of those pages, KSM kicks in and does the unmerging. KSM is pretty useful because it saves a whole lot of memory with VMs, especially because the guest operating system image from one VM can be shared with the guest operating system of another VM.

Coming back to that slide: the idea I had was to create a per-page unique pattern in memory that is the same across the client and the server. The idea is that KSM on the host OS will scan those pages and eventually do the page deduplication for us. Note that "per page" is important here: if different pages have identical content, KSM will detect that and merge them on top of each other, and you will end up overlapping your cache lines, which is obviously not what you want. (A sketch of this follows below.)

Side comment: with KSM you can do pretty cool stuff, such as identifying the operating system or the applications that are running beside you. All you need to do is load into your own memory the image of what you think is running beside you. Then you wait a bit, because the KSM deduplication process takes time, and then you write to some of those pages and measure how long it takes. If the write takes much longer than a normal write inside your virtual machine, it means you got KSM involved all the way down on the host: it did the page deduplication for you, you have a match, and you have basically identified what is running beside you.

All right, coming back to that picture again, I realized there was another problem with my design: there is no synchronization primitive across processes running in different VMs. Remember, when I was running directly on the host, the client was signaling a mutex and the server was waking up and doing the decoding. Here there is no such thing. Well, in reality mechanisms for that do exist — for example, on Linux with KVM there is ivshmem, which lets you signal from one VM to another — but that stuff is not enabled in production. So we need something to replace the mutex. Why? Because we want the server to run right after the client so that it picks up the signal; remember, if we wait too long, all the data is gone, so we have to be fast. I went over a couple of options, because I did not really know how to attack this at first. Option number one: forget all about the synchronization aspect and hope for the best. With some error correction (ECC) we can still achieve some data transmission.
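Before going through the synchronization options, here is the promised sketch of the per-page pattern trick. It assumes the host actually runs KSM over the guests' RAM (QEMU typically marks guest memory as mergeable); the tag layout and names are my own illustration.

```c
/* ksmbait.c -- stamp pages so host KSM merges them across two VMs.
 * Both client and server run this same code, so page i has identical
 * content in both guests (merge across VMs), while every page differs
 * from every other page (no merging among our own pages). */
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define PAGE   4096
#define NPAGES 64

static uint8_t *make_ksm_bait(void)
{
    uint8_t *buf = mmap(NULL, NPAGES * PAGE, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;

    for (uint64_t p = 0; p < NPAGES; p++) {
        for (uint64_t off = 0; off < PAGE; off += sizeof(uint64_t)) {
            /* Deterministic 8-byte tag: unique per (page, offset), but
             * identical across the two sides. */
            uint64_t tag = 0x5ca1ab1e00000000ULL | (p << 16) | off;
            memcpy(buf + p * PAGE + off, &tag, sizeof(tag));
        }
    }
    /* Now wait: the KSM scan takes a while before deduplication happens.
     * From here on, only read or clflush these pages; writing would
     * trigger copy-on-write and break the cross-VM sharing. */
    return buf;
}
```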
Coming back to option number one: the gap between when the client runs and when the server runs is completely random, so this gives a very low bit rate. But the CPU consumption is low, and that's kind of cool, because we don't want to burn CPU — otherwise everybody is going to detect us, right?

Option number two: set up a busy loop on each side, with the client running slightly faster than the server, so that at some point the two overlap, the server picks up the signal, and the transmission happens. This gives an okay bit rate, but the CPU consumption is very high, and that's no good because we want to remain undetected; we would like to stay under one percent CPU usage.

Option number three: this is the phase-locked loop implementation I mentioned at the beginning. Let's define a common period on the server and the client, and have the two lock into phase. How did I do that? At the beginning of each period, the server sends out a sync pattern — very similar to the vertical sync in analog TV, the same kind of concept. The client runs a swept scan over that period and tries to detect the sync pattern. Once it detects it, it locks onto it: the client simply shifts its phase back, and now we are ready for transmission.

But there is a tricky problem: for this to work we need a monotonic pulse. In theory we can tolerate some jitter, but not too much, because in a VM there is lots of noise and the data evaporates out of the cache very quickly. So in theory all of this looks fine, but in practice it's a bit more tricky. How do we get a monotonic pulse? The first thing that comes to mind is a timer. Timers are good because we need to sleep anyway to avoid detection. But timers come with a big problem. This graph shows the latency distribution, in microseconds, of a timer running in a VM, on a log-log scale. We can see there is lots of jitter: it ranges from about 20 microseconds all the way up to almost 200 microseconds. And if you factor in the original design, this timer jitter hits both VMs at the same time, because they have the same kind of distribution. There is just too much jitter for this to work: the data will not persist, and the transmission will not happen, for sure.

So the idea I had was to compensate this timer in software, up to some value above the maximum jitter, so that in theory we get a nice monotonic signal. Here we need to be a bit careful, because the compensation itself is subject to noise. What I'm trying to say is: the longer you try to compensate, the more noise you accumulate from the underlying stuff running beneath you. The other thing to be careful with is that this compensation burns CPU. (A sketch of the compensation idea follows below.)
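Here is a minimal sketch of one way to do such compensation, assuming a constant, guest-visible TSC at a known rate; TSC_HZ, the 300-microsecond guard band, and the function name are all my assumptions to be calibrated, not the talk's actual implementation.

```c
/* waittsc.c -- coarse sleep plus TSC spin: low CPU, nanosecond precision. */
#include <stdint.h>
#include <time.h>          /* nanosleep() */
#include <x86intrin.h>     /* __rdtsc(), _mm_pause() */

#define TSC_HZ   2400000000ULL   /* 2.4 GHz, like the talk's machine */
#define GUARD_NS 300000ULL       /* wake ~300 us early: above max timer jitter */

/* Sleep through most of the interval, then burn the residual jitter
 * spinning on the TSC until the exact deadline. */
static void wait_until_tsc(uint64_t deadline)
{
    uint64_t now = __rdtsc();

    if (deadline > now) {
        uint64_t left_ns = (deadline - now) * 1000000000ULL / TSC_HZ;
        if (left_ns > GUARD_NS) {
            uint64_t coarse = left_ns - GUARD_NS;   /* coarse, jittery part */
            struct timespec ts = { coarse / 1000000000ULL,
                                   coarse % 1000000000ULL };
            nanosleep(&ts, NULL);
        }
    }
    while (__rdtsc() < deadline)   /* fine part: precise, but burns CPU */
        _mm_pause();
}
```

Each side then just steps its deadline by the common period — deadline += period_cycles; wait_until_tsc(deadline); — which is what gives the phase-locked loop its monotonic pulse.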
On the CPU aspect it's not too bad, because all we have to do is stretch the timer period to some higher number, and we still stay under 1% CPU usage. It's a tricky problem, but I believe that in the end I got it right. The compensation I'm using is basically a calibrated software loop that is kept in check against the TSC at every single point in time. And this is the result I got. My machine is a 2.4 GHz machine, and when it's running idle, this graph shows the jitter on my compensated timer, in cycles: roughly 50 cycles of jitter, which corresponds to 20 nanoseconds. Even on a loaded system I get roughly 300 cycles, which is 120 nanoseconds. That's pretty accurate. If you compare that with the original timer, there is obviously no comparison; even if we put the latency back on the original graph, the jitter doesn't even show up at that scale, because the original timer was on a microsecond scale and now I'm working on a nanosecond scale. As you may have understood by now, this synchronization aspect is the key to this design: it enables the communication to happen with very low noise, because it's very precise across the two processes, and at the same time it consumes little CPU.

Okay, let's recap what we have so far. We have an encoding and decoding scheme based on memory access time: fast is a one, slow is a zero. We managed to get around the hardware prefetcher, without disabling it in the BIOS, by randomizing the cache line accesses at the page level and below. We found cache lines that are shared across VMs, thanks to KSM. And we designed a phase-locked loop that gives very high precision across two processes running in different VMs.

Time for a demo now. With this technique I repeated the original experiment, which consists of sending that DEF CON logo from one VM to another. We can see that this technique offers a pretty high-quality, fairly low-noise communication channel, at least compared with the original pipeline-contention example I showed at the very beginning. There is no error correction running on the transmission channel, no retransmission, nothing; and if you look carefully, you'll see a couple of bits flipped here and there in the picture. That's expected — the channel is still a bit noisy.

Then, in the next experiment — again a recording — I repeated the exact same streaming experiment with my noise generator running on and off in the background. The first thing I want to mention, and it's kind of cool, is that when the transmitter is not running, the receiver picks up the noise from whatever is running on the operating system. To me, this could potentially be used to fingerprint the operating system running underneath. Also, in this recording, same as the previous one, I have the mp4 encoding software running on the same machine, so on its own that thing generates quite a bit of noise. But you will still see the effect when, say, I move a window around, and of course you will see the effect when I compile the Linux kernel — the noise going on and off.
One last thing — I may have mentioned it before: there is no compression, no retransmission, no protocol. What you see is essentially the raw capacity of the channel. Let's take a look at the video. That's the noise I was talking about; you see the compilation of the Linux kernel totally saturates the channel. On the left is the source, and on the right is the destination of the payload.

All right, let's do the math real quick. That video was transmitted at 60 fields per second, interlaced four times, so 15 full frames per second. One frame was VGA quality, 640 by 480 at 1 bit per pixel. If you do the math, that's roughly 4.5 megabits per second, and both sides were at 50% CPU utilization. Of course, if you use more CPU, you can crank up the bandwidth.

All right, now let's focus on something a bit more useful than streaming a picture from one VM to another: a reverse shell. The first thing I'm doing here — again with the same two VMs as before — is running the server in loopback mode, which essentially displays whatever was sent by the client; and the client, in that mode, is sending a bunch of lowercase a's. In the background I'm running my noise generator on and off again, and you will see the effect; you will see the synchronization happen at some point. All right, now the transmission is going. I'm going to hit the pause button for a second here. In my program, the reverse shell stuff, I have a way to turn on ECC on the communication channel, because as you can observe, there are some bits flipped here and there. So now I'm going to unpause and run the system with error correction turned on this time. All right, now we see that the output is clean.
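For illustration only: the channel here reportedly uses a 16-bit ECC over a 240-bit payload, which I'm not reproducing. A much simpler stand-in that conveys the forward-error-correction idea is a 3x repetition code with majority vote; send_raw and recv_raw stand for the raw one-bit cache-line channel from earlier.

```c
/* fec.c -- toy forward error correction: 3x repetition with majority vote.
 * Corrects any single flipped copy out of the three, at 3x the raw cost. */

/* Transmit: emit every payload bit three times over the raw channel. */
static void tx_bit_fec(void (*send_raw)(int), int bit)
{
    for (int i = 0; i < 3; i++)
        send_raw(bit);
}

/* Receive: majority vote over the three raw samples. */
static int rx_bit_fec(int (*recv_raw)(void))
{
    int votes = recv_raw() + recv_raw() + recv_raw();
    return votes >= 2;
}
```

A real deployment would use a proper block code (like the 16-over-240 scheme mentioned above) for far better efficiency; the repetition code just makes the mechanism obvious.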
Obviously, sometimes more bits get flipped than the ECC can correct, and in that case my program displays a couple of stars — the server is displaying that right there. We can of course crank up the error correction; I believe I'm using a 16-bit ECC over a 240-bit payload, so that can be increased easily. Now I'm going to remove the loopback mode, because I want to send commands to the other VM — that's the reverse shell mode; loopback was just a test mode. I'm just going to unpause that thing. Again, the lock was pretty fast — the "lock" is when the phase-locked loop synchronizes. Now I'm sending commands to the other VM, and the output is coming back: these are the processes running in the other VM. I'm going to hit the pause button again. Not very imaginative here, I named my program "timing channel", so "timing channel" is the program running in the other VM, and as you can see, the CPU time it has consumed so far is very minimal. And yeah, that's the reverse shell; here I'm looking, I believe, at the source code of this very program. All right. You can see the reverse shell is not super responsive, and that is because the server has been configured in a mode where I don't want it to burn too much CPU on the other side. In theory we could crank that up or down dynamically; right now I believe it's using half a percent of CPU in this demo, and you see the resulting level of responsiveness.

All right, that's it for the demo. So what can we do to prevent this stuff? The first thing, obviously, is to disable that page deduplication thing, or to set it up with a per-VM policy so that it doesn't cross the VM boundary. That takes care of those inter-VM shared read-only pages — it removes them all, so this flush-and-reload technique simply won't work — and it also takes care of the OS and application fingerprinting thing I mentioned earlier. Obviously, this comes at the cost of higher memory usage. Another thing that could be done: on x86 today, the CLFLUSH instruction is not privileged, so maybe Intel could make it privileged or something — I don't know whether that could be done with a microcode update. Obviously, you also have to revisit your co-location policy: what do you put on the same core, on the same socket, in the same box? Personally, I'm more a fan of trying to detect this type of communication, and there are many ways to do that. One of them: there is a bunch of hardware counters available on those chips for performance reasons, so one could do some pattern and noise analysis to detect spikes and very precise, periodic noise (see the sketch after this paragraph). Another is to try to detect inter-VM process scheduling patterns: say you have two processes running in different VMs that are always scheduled at the same time, somehow always overlapping — that could be detected. And one more is abnormal TSC usage: in the stuff I've been doing, there are lots of calls to RDTSC, because that's what the compensation does, so one could monitor and put some heuristics around the usage pattern of the RDTSC instruction and try to detect that.
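A hedged sketch of the hardware-counter idea — my illustration, not the talk's tooling: count last-level-cache misses on one CPU via perf_event_open(2) and flag abnormal rates. The one-million-per-second threshold is a placeholder; a real detector would look for the channel's periodic signature instead.

```c
/* llcwatch.c -- watch LLC read misses on CPU 0 (needs root/CAP_PERFMON).
 * Build: gcc -O2 llcwatch.c -o llcwatch */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type   = PERF_TYPE_HW_CACHE;
    attr.size   = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_LL |                 /* last-level cache */
                  (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                  (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);

    /* pid = -1, cpu = 0: count all tasks on CPU 0. */
    int fd = perf_event_open(&attr, -1, 0, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    uint64_t prev = 0;
    for (;;) {
        uint64_t count;
        sleep(1);
        if (read(fd, &count, sizeof(count)) != sizeof(count))
            break;
        uint64_t rate = count - prev;
        prev = count;
        printf("LLC read misses/s: %llu%s\n", (unsigned long long)rate,
               rate > 1000000 ? "   <-- suspicious?" : "");
    }
    return 0;
}
```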
This is really the area I'm working on now. And that's pretty much all I have. The source code for this is going to be on my GitHub; I'm going to push an update to this slide deck shortly. Thank you very much, everybody.