Great, thank you very much. I'm a software engineer here in New York at a company called Bluecore, but the content of this talk is work I did while I was at Twitter. I also want to say this was absolutely not just my work: a huge number of people were involved, so thank you to all of them, and thanks to Twitter for letting us talk about it.

This is a debugging story, which I personally really like, so hopefully you will too; bear with me. Like any good debugging story, it starts with a bug report. One of my coworkers sends me a message saying: hi, your service seems to be returning corrupt data, here's the exception. Uh-oh. Let's go investigate and see what's happening.

So we take a look. Here's our nice data center: server A on one side, server B on the other, and as a good software developer I like to be able to ignore the fact that there's this magic network in between. We see log entries indicating that server A sent a request saying, please get me the full name for the user @epcjones. We see log entries on server B saying it received that request. We see entries on server B saying it sent the correct answer, which is my full name, Evan Jones. And then we see log entries on the client saying: I got this crying cat emoji instead, I'm going to throw an exception. So we are very confused. How is this possibly happening? This doesn't seem to make sense; the network isn't supposed to do this.

So we check in with Twitter's operations team on the chat channel. It turns out there's a minor fire drill going on: we aren't the only people who have noticed this, it's happening all over the place. So what do you do when you're running a data center and all of a sudden stuff starts getting corrupted across the network?

Here's your nice high-level diagram of a data center. The gray boxes at the bottom are servers, grouped into racks. There's a switch in each rack that connects all of those servers, and the switches themselves are connected to yet more switches. If stuff is getting corrupted, what do you do? The emergency answer is that you start unplugging the broken things. Thankfully, Twitter's operations team had a way to cut off a whole rack and disconnect it from the network, and that's what they did. With the help of the teams that were on call, they figured out which racks seemed to be generating more errors, tried unplugging them one by one, and eventually unplugged enough racks that the errors stopped.

I would have assumed that this was the end of the emergency. Clearly you still want to go figure out what happened and fix the bug, but the bleeding has stopped, everybody can go home and sleep, right? It turns out corruption is really insidious. Here are our two servers again. When this corrupt message with the cat emoji comes back to server A, the best thing that can happen is that server A says: I don't like that, here's your exception. The worst thing is that it says: names can have emojis, people like emojis, we'll put emojis everywhere. The server then continues: it computes something from the corrupt value and stores that in Memcache, and then maybe it goes and updates some database entry or a log file, putting your cat emoji all over the place.
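To make that concrete, here's a minimal sketch, with hypothetical stand-in names, of how a single corrupt response poisons everything derived from it:

```python
# Minimal sketch of how one corrupt response poisons shared state.
# All the names here (cache, database, handle_lookup) are hypothetical.
cache: dict[str, str] = {}     # stand-in for Memcache
database: dict[str, str] = {}  # stand-in for a database table

def handle_lookup(user: str, network_response: bytes) -> None:
    full_name = network_response.decode("utf-8")  # no validation: emojis welcome
    # Everything derived from the corrupt value is now corrupt too.
    cache[f"greeting:{user}"] = f"Welcome back, {full_name}!"
    database[f"name:{user}"] = full_name  # persisted for all time

# The response should have been b"Evan Jones", but it arrived corrupted.
handle_lookup("epcjones", "Evan J\U0001F63Fnes".encode("utf-8"))
print(cache["greeting:epcjones"])  # every future reader now gets the emoji
```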
So now any other server that comes around and reads this data, for all time, gets this emoji and is probably unhappy. Maybe server A wasn't bothered by the emoji, but maybe some other server suddenly is. It turns out cleaning this mess up takes a week, maybe even two, and tons and tons of people. It was during that process that one of my coworkers said, excuse me, I think I'm going to go home and print all my bank statements. Wait, how is that relevant? He said, do you think banks never have bugs? Your account balance is basically one of these database entries that can get corrupted. If you're in finance, you'll have to find me and tell me whether that's true or not.

Okay, so we've cleaned up the mess somehow. We got there. Now let's talk about how on earth this actually happens, because it's not supposed to happen. First, let's talk about what is supposed to happen.

We have our application at the top. It wants to send some data, like my name, Evan Jones. It sticks the data in a packet, calls the right system call, and hands it to the operating system, which in most cases uses TCP. TCP puts a checksum at the front, computed from your data, because the designers of TCP back in the 70s realized that networks frequently corrupt data, so it would be nice to have a way to verify it was received correctly. The kernel then hands the packet to your network interface, which does the same thing: as it converts your TCP packet into electrical impulses on the wire, it computes a cyclic redundancy check, or CRC, and sticks it on the end as the packet goes out across the wire, for exactly the same purpose.

On the receiving side, after the packet comes through the switch, your network interface computes the CRC and verifies that it matches the value in the packet. If it matches, the interface hands the packet to the kernel. The kernel does the same thing: it recomputes the checksum from the data, and if it gets the same value, it says great, everything looks good, and hands the packet to your application, which happily receives correct data. If your data arrives with a cat emoji in it, your Ethernet card computes a different CRC and simply rejects the packet. Even if your kernel somehow receives it, the kernel computes a checksum that doesn't match. But our applications were, in fact, receiving corrupt data, so something in this chain was going wrong.
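As a quick aside, that TCP checksum is a genuinely simple calculation: a 16-bit ones'-complement sum over the data. Here's a minimal sketch of it in Python (the real thing also sums a pseudo-header containing the IP addresses, which I'm leaving out):

```python
import struct

def internet_checksum(data: bytes) -> int:
    """RFC 1071 ones'-complement checksum, as used by TCP and UDP.

    Sketch only: real TCP also covers a pseudo-header of IP addresses.
    """
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return ~total & 0xFFFF

packet = b"full_name=Evan Jones"
print(hex(internet_checksum(packet)))  # receiver recomputes and compares

# Flip a single bit in transit and the checksum no longer matches:
corrupt = bytes([packet[0] ^ 0x01]) + packet[1:]
assert internet_checksum(corrupt) != internet_checksum(packet)
```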
So it took a while, a long while, to figure out what was going wrong, and there turned out to be two things. The first was slightly more obvious: it was hardware-related, and it involved the switches.

Here's a high-level diagram of a network switch. There are a bunch of network interfaces connected to wires at the bottom, and something called the switch fabric at the top, which I don't really know the internals of, so bear with me. Network packets arrive on one interface. The interface takes the packet and hands it to the switch fabric. The fabric figures out which destination port the packet is supposed to go out on, maybe the one on the far right, sends it to that network interface, and that interface sends it back out on the other wire. So now your whole network can talk to each other.

There are two questions here that matter for this bug. The first is: what happens when another packet arrives, say, on a different interface, that needs to go out the same destination interface? You can't send two packets out on the same wire at the same time. So we meet our old friend, queues: there's a queue in front of each outbound network interface holding packets that need to be transmitted out that port, and if too many pile up, the switch eventually just throws them away.

The second question that matters for this bug is: what happens to the CRC attached to the packet? I assumed that since the packet going out the other side is exactly the same, all those bits would just travel through the whole fabric and out the other interface. It turns out that's not what happens, because lots of switches have all sorts of extra features. You don't always want to route packets blindly: sometimes you want to rewrite a packet, or encapsulate a packet in another packet. All sorts of things can happen inside the fabric that modify the packet, and if you modify the packet, you need to recompute the CRC.

In hardware, it's really nice to not have any exceptions. So what does the switch actually do? When a packet arrives, the ingress interface verifies the CRC. If it's bad, the packet is thrown out; somebody else's problem. If it's fine, the interface forwards just the data portion of the packet to the fabric, and the Ethernet interface on the egress side is responsible for computing a brand new CRC. This is totally safe, right? Valid data comes in, we're safely inside this nice switch box, and a valid CRC gets computed on the way out. No exceptions, no conditional branches, all plain hardware, all fast and free.

The bad part is: what happens when your packet picks up a cat emoji while it's sitting in the switch's memory, in one of those queues? The answer is that the network interface on the other side happily computes a correct CRC for your corrupt data. So this explains part of it. This is bug number one: the switch corrupts packets, then recomputes a valid CRC for the corrupt data. The lesson here is that Ethernet does have a CRC, but it's really only there to detect errors on the physical wire. It does nothing to protect you from errors inside your servers or your switches.
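You can see bug number one for yourself in a few lines. Here's a minimal sketch using Python's zlib.crc32, which uses the same CRC-32 polynomial as the Ethernet frame check sequence; the queue corruption is simulated by hand:

```python
import zlib

def add_fcs(payload: bytes) -> bytes:
    """Append a CRC-32, standing in for the Ethernet frame check sequence."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def fcs_ok(frame: bytes) -> bool:
    return zlib.crc32(frame[:-4]).to_bytes(4, "big") == frame[-4:]

# A valid frame arrives: the ingress interface checks the CRC...
frame = add_fcs(b"full_name=Evan Jones")
assert fcs_ok(frame)

# ...strips it, and the bare payload sits in an egress queue, where faulty
# memory flips some bits (simulated here by swapping in the cat emoji)...
payload = frame[:-4].replace(b"Evan", "\U0001F63F".encode("utf-8"))

# ...and the egress interface computes a fresh, perfectly valid CRC for it.
assert fcs_ok(add_fcs(payload))  # the corruption is now undetectable
```

The egress interface has no way of knowing that the bytes it was handed are not the bytes that originally arrived; as far as it's concerned, it's doing its job perfectly.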
Okay, we still have a problem. Let's go back to our corruption diagram. We've received this packet with the cat emoji, and it has a correct CRC, so our Ethernet adapter hands it to the kernel. The kernel should still be computing the TCP checksum and throwing this packet out. So why is that not happening?

Well, I simplified things a little earlier. On most modern systems, it's not actually your kernel that computes this. You're computing this checksum on every single packet, it's a fairly simple calculation, and the network interface is already computing the Ethernet CRC. It feels like exactly the kind of thing hardware would be really good at: just compute it on every single packet and report the answer. And that is what happens on modern systems: your Ethernet adapter does the TCP checksum computation too. So what the kernel actually receives is the packet plus a bit of information from the hardware: here's your packet, and by the way, I checked the TCP checksum and it's bad.

Bug number two is fairly mundane: under certain conditions, which involved running Linux containers on these systems, the kernel was simply not checking that bit. It said, I've got your data packet, here you go, and handed it to the application, ignoring the flag completely. So bug number two is the kernel ignoring the hardware's TCP checksum result. The fix was a one-line kernel patch, one of those classic errors: an equals that should have been a not-equals, or an if statement that shouldn't have been there, I don't really remember. It was literally a one-line patch, submitted in February 2016, so if you're running anything reasonably recent, it already has this fix and you don't have to worry about it.

That was the conclusion of about three months' worth of work, and there are two lessons here. The first is that network corruption does happen. It happens extremely rarely, but when it does, it's extremely bad and really costly. A bunch of people have contacted me since then saying, oh, I've had this happen too. And I say, oh really, tell me some details! What happened? And they say, I can't really talk about it; my employer would be unhappy if we talked about things like this. So I'm just putting it out there: don't blame the network, but sometimes it really does happen.

The other lesson is that you can solve this by adding end-to-end CRCs or encryption. If your application added its own CRC at the application level, or used encryption, say TLS or SSL, this problem would go away. That might well be worth the overhead in many applications.

Thank you very much. I have more details about this on my website, evanjones.ca, if you want the gory details. Thanks.
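One last bit of concreteness on that final point: below is a minimal sketch of application-level end-to-end protection. The names are my own illustration, not Twitter's actual fix, and I'm using an HMAC rather than a bare CRC; a plain application-level CRC-32 would also have caught this particular corruption, since nothing between the endpoints recomputes it, and in practice you'd most likely just turn on TLS and get integrity checking as a side effect.

```python
import hashlib
import hmac

SECRET = b"shared-by-both-endpoints"  # illustrative; use a real shared key

def seal(payload: bytes) -> bytes:
    """Attach an integrity tag computed by the application itself.

    The switch, the NIC, and the kernel never recompute this tag, so
    nothing in the middle can silently "fix it up" for corrupted bytes
    the way the egress port regenerated the Ethernet CRC.
    """
    return payload + hmac.new(SECRET, payload, hashlib.sha256).digest()

def unseal(message: bytes) -> bytes:
    payload, tag = message[:-32], message[-32:]
    expected = hmac.new(SECRET, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("end-to-end integrity check failed")
    return payload

assert unseal(seal(b"full_name=Evan Jones")) == b"full_name=Evan Jones"
```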