So my name is Angelo. Today I'm going to talk about one of the open source projects that I've been working on at Facebook. A short introduction about myself: that's me in one of our data centers, having a bit of a love affair with my dev server. I am a production engineer. If you are familiar with positions like SRE at Google, site reliability engineer, or DevOps: we are basically a hybrid between software developers and operations engineers. Our focus is reliability and performance, and we engage with software developers to make sure that their software will run at our scale. There are basically two types of production engineering teams. One type is embedded production engineers: people who are embedded in product teams like News Feed or the Photos team. We engage with them and make sure the software can operate at our scale. Then there is another type, which is 100% production engineers, and my team is one of those. That means we write the code, we support the code, we deploy the code, et cetera.

I joined around 2011, so like five years ago, more or less, in Dublin. My first team was called Site Reliability Operations. We were basically the first people responding to incidents at Facebook, so we needed to watch everything around the infrastructure, from the website to the API for the mobile phones, storage, photos, anything. Those two years were pretty scary. I was basically watching the site alone over the weekends, and millions of people relied on me watching the site. But eventually we managed to automate ourselves out of the job using Python. We wrote a tool called FBAR, Facebook Auto Remediation, which would scan our alerting system, find all the alerts, and try to remediate them automatically. Eventually the team was closed, and this is what you want: you want to automate your job. So most of us joined different teams, and I joined the cluster infrastructure team.

My team owns data center core services. By that I mean services like DHCP, TFTP, DNS, LDAP, stuff like that: anything basic that you need to run hardware, computers, at scale. We do that with automation; we use Python heavily, some C++ as well. One of the things we are responsible for is making sure that you can install operating systems on tens of thousands of servers, with basically no human intervention. We also own the automation to bring up capacity and decommission capacity. That is basically Python automation to make sure that when you have a new cluster made of tens of thousands of servers, you can press a button and everything will happen: all the machines will be configured, all the backends will know about this new cluster, et cetera.

Who has seen this before? Yeah, there is no cloud, there's just other people's computers, right? So raise your hand if you work in a data center environment or you own some physical infrastructure. All right, just a few people. Anyway, this talk should be interesting even if you don't deal with hardware, because it can give you an overview of what you need to do to deploy those servers. And yeah, somebody needs to provision them.
I also wanted to say that we run container infrastructure at Facebook, and my team is responsible for making sure that the hardware that runs the containers can be automatically deployed. So we built something like an AWS in-house. This is an Open Compute server, an old one from 2012, I think. Facebook is part of the board of the Open Compute Project, a nonprofit organization; it's our open source initiative for data center and server design. We open source designs of how we do data center electrical systems, how we design our motherboards, and stuff like that. You can check it out on the internet.

So my team is responsible for installing these automatically, and we have many, many of them. We have many data halls like this, each made of many, many racks, and we have buildings like this. This is a picture from one of our data centers in Sweden, up near the polar circle, taken while it was being built. We have a second building there now, similar to this one. And we have several around the world: a few in the US, the one in Sweden, which was our first physical data center outside of the US, and we are just in the process of building one in Dublin. In addition to that, we also have many POPs around the globe. We need POPs, or points of presence, to make the user experience faster; we want to be as close as possible to the end users. The locations of these POPs on the map are totally random, by the way. We don't have any in Africa, right? Maybe we have a few, but anyway.

So basically the mantra is this: hands-free provisioning. You can't have people physically going around the data center with USB sticks or CD-ROMs to install tens of thousands of servers. And provisioning is complex; there are many, many different variables at play: the version of the OS, different versions of the initrd, different versions of the kernel, whether you are dual stack or IPv6 only, whether you're dealing with a BIOS machine or a UEFI machine, which bootloader you are using, iPXE or GRUB 2, which server type you're deploying, all of these components. But today I'm going to talk about TFTP.

So, TFTP, in 2016, what the hell? Can't you use HTTP, come on? Well, it is still used. It's not used on the open internet, of course, because it's very unsafe, but it's very, very used in data center environments and ISP environments. The reason is that it's a simple protocol. The specification is very simple; you could write an implementation yourself in maybe one or two weeks if you want to support the different RFCs. So it's easy to implement. And it's UDP-based, which means you don't need a full TCP/IP stack, which means the code you produce is very small, and that means you can burn it into small chipsets; think network card chipsets and stuff like that. Because of that, it's usually used in embedded devices and network equipment. And TFTP is traditionally used together with DHCP as a way to netboot a machine from the network. Some people ask, why don't you burn iPXE directly onto the chips? We could do that, but it doesn't work at our scale, because we want to try different versions of network boot programs, of iPXE and GRUB, and to do that we want to be able to chainload a new version of the bootloader over the network. So we don't actually burn iPXE or any bootloader into chipsets. All right, so this is an overview of how the provisioning phase works.
It's divided in three parts. You power on the machine; there is a netboot phase, which we're going to go into later. After you get your network configuration, you download your initrd and your kernel and you bootstrap from there. We use CentOS, so an RPM-based operating system, and on those systems you usually have Anaconda. Anaconda will download the kickstart files and do all the things it needs to do: formatting drives, partitioning, installing RPMs, configuring the network, whatever. After that, the machine reboots and then we run Chef. Chef will basically converge the machine into the state we want. If Chef exits with a status code of zero, we mark the machine as provisioned, which means it can potentially take production traffic.

If you look into the netboot phase, it's divided in three parts. There is a DHCP phase at the beginning, where the machine doesn't have any network configuration and needs to fetch it: typically stuff like IP, default gateway, netmask, DNS; and we also pass the location of the network boot program. We use both DHCPv4 and DHCPv6, depending on the cluster we are in. Every new cluster that we turn up now is single stack, IPv6 only; there is no v4 at all. This had its own problems, and we worked on that for two years, but now we basically have no problems at all with it: we can bring up clusters in v6-only environments, and all the software is v6 only. Anyway, you get the location of your network boot program. Then the NIC downloads the network boot program and starts it; that download actually happens over TFTP, I should mention. Then the network boot program downloads its configuration, typically via TFTP; it can do HTTP if it's a recent bootloader. And then you download the initrd and the kernel. This again can happen over HTTP or TFTP; in our case, we just decided to go for TFTP because we wanted to use the same flow. Then the network boot program loads the initrd into memory, loads the kernel into memory, and just starts it.

So TFTP is a pretty old protocol, 30 plus years old, from 1981. But hey, I am from 1981 as well; that's a picture of me around two years old, so it's as old as me. And this is a diagram showing the protocol at a high level. The client connects over UDP to port 69 and asks for a file. The server allocates a random source port for the packets being sent back to the client; in the picture, that port is Y. As soon as you send an RRQ, a read request, you get back the first chunk of data. By default it's a very small block size of 512 bytes, which is ridiculous, and we're going to see the problems connected to that. Every time you receive a chunk of data, you send back an acknowledgement carrying the number of the block you just received, and the server then sends the next one; the conversation goes on like this. It's a lockstep protocol: you can't have the next chunk until you acknowledge the previous one. The end of the session is indicated by the size of the last block: if the last block is smaller than the block size, the session is over. If a packet is lost on the network, the client re-sends the previous acknowledgement to get the data packet back, and there is a number of retries after which the device gives up. As I said, the default block size is ridiculously small, and this can cause problems.
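To make that lockstep concrete, here is a minimal, illustrative sketch of the client side of a TFTP read session in Python. This is not fbtftp code, just the RRQ/DATA/ACK exchange described above, assuming octet mode and the default 512-byte block size, with no retransmission or duplicate-block handling:

```python
import socket
import struct

BLOCK_SIZE = 512  # RFC 1350 default


def tftp_fetch(host, filename):
    """Fetch a file via a TFTP read request. Simplified sketch: octet
    mode only, no timeout retries, no duplicate-block handling."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(5.0)

    # RRQ packet: opcode 1, then filename and mode, each NUL-terminated.
    rrq = struct.pack("!H", 1) + filename.encode() + b"\x00octet\x00"
    sock.sendto(rrq, (host, 69))

    data = b""
    while True:
        # The server replies from a freshly allocated port (its transfer
        # ID), so we keep answering whatever address recvfrom() reports.
        packet, server_addr = sock.recvfrom(4 + BLOCK_SIZE)
        opcode, block_no = struct.unpack("!HH", packet[:4])
        if opcode == 5:  # ERROR
            raise RuntimeError(packet[4:].rstrip(b"\x00").decode())
        assert opcode == 3  # DATA

        chunk = packet[4:]
        data += chunk

        # Lockstep: acknowledge this block to receive the next one.
        sock.sendto(struct.pack("!HH", 4, block_no), server_addr)

        # A block shorter than the block size ends the session.
        if len(chunk) < BLOCK_SIZE:
            return data
```

Every iteration of that loop costs a full network round trip, which is exactly what hurts on high-latency links.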
For example, say you have a POP somewhere on the other side of the Pacific Ocean. Okay, I put it in North Korea on the slide, but we don't have a POP in North Korea. Anyway, if you have a POP somewhere in Asia: back in 2014 we didn't have any TFTP servers in POPs, so we needed to connect to the closest origin data center, in this case in Oregon. If you have a relatively big file, like an initrd of 80 megabytes, and you need to download it with the default block size over a link with 150 milliseconds of latency, it will take you around 12 hours. Even if you increase the block size to 1400, it will take around five hours, 4.5. The lockstep is what kills you: 80 MB in 512-byte blocks is about 164,000 blocks, and each one costs a full round trip (see the back-of-envelope calculation below). This is not a problem inside the data center itself, because you have sub-millisecond latency, so in most cases you just wait around one minute. But it is a big problem in the POPs.

So in 2014 we had this deployment structure. Take the diagram on the right and copy-paste it for every cluster we had: every cluster was self-contained as far as TFTP goes. We used to have physical hardware load balancers exposing a TFTP VIP to the servers in the cluster, and behind it we ran the standard open source TFTP daemon you find in any Linux distribution, in an active/standby configuration. And then we needed to sync around seven gigabytes of stuff, because we had many initrds, many kernels, many Cisco images and things like that; so every time you deployed one of these, you needed to sync seven gigabytes and keep it in sync. Then we have our Python automation to provision bare metal, and this automation needed to be aware of which of the two instances was active, so it was a bit of a pain. And any of those things in the picture can fail.

So the problems with this: the physical load balancers, and the waste of resources, because in every cluster we had this pair of hosts that were basically doing nothing, when we could maybe have gone with a few tens of them per data center. The automation needed to be aware of which instance was active. We didn't have any stats; we tried tailing the TFTP daemon's log files, but we found that after about three weeks it would just stop logging, and I didn't want to go into the C code to fix it. As I said before, TFTP is a very bad protocol in high-latency environments. And, as I said, too many moving parts, and each one of them can fail.
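For the curious, here is the back-of-envelope arithmetic behind those transfer times, as a quick Python sketch. It assumes one full round trip per block (the lockstep) and a 300 ms round trip, i.e. roughly 150 ms each way across the Pacific, ignoring transmission time and packet loss:

```python
def lockstep_hours(file_bytes, block_size, rtt_seconds):
    """Transfer time when every block needs its own round trip."""
    blocks = file_bytes // block_size + 1   # include the final short block
    return blocks * rtt_seconds / 3600.0

initrd = 80 * 1024 * 1024   # an 80 MB initrd
rtt = 0.300                 # ~150 ms each way across the Pacific

print(round(lockstep_hours(initrd, 512, rtt), 1))    # ~13.7 hours
print(round(lockstep_hours(initrd, 1400, rtt), 1))   # ~5.0 hours
```

That comes out around 13.7 hours at 512 bytes and 5 hours at 1400 bytes, in the same ballpark as the numbers on the slide; with the sub-millisecond round trips inside a data center, the same math gives you around a minute.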
So how did we solve these problems? In general, we like to use open source as much as we can, but sometimes it doesn't work, so we come up with our own solution. Sometimes we can't open source the result, sometimes we do, and in this case I decided to do it, and it's on GitHub. So we wrote a framework to build dynamic TFTP servers. It supports only RRQ, the fetch request; there is no write support. If you're using TFTP for writing, just stop it, don't do it. It supports the main TFTP specification, and also other extensions like negotiation of the block size, negotiation of the timeout interval, and other stuff. And it is extensible: you can import the framework, override a few classes, and define your own logic. You can also define callbacks that will be called by the server to push your own statistics to your own infrastructure: monitoring infrastructure, alerting infrastructure, whatever it is.

This is an overview of how the framework works. There's a client; it sends an RRQ request; and there's a base class called BaseServer. This class implements the UDP acceptor: for any new connection, it does some parsing of the request, and after that it forks into a new process and calls get_handler(), a method that returns a BaseHandler, the object that deals with the session with the client. The BaseHandler itself calls a method called get_response_data(), which returns a file-like object: it implements read(), size() and close(), just like a normal file object would. The BaseHandler uses this object to fetch blocks of data and communicate with the client, implementing the protocol. When the session is over, the callback you provided in the constructor is executed. It gives you standard counters, but you can add your own counters and your own logic to push stats to your infrastructure; the BaseServer does the same, at intervals you can configure.

In the GitHub project there is an example directory you can look at, with the simplest server you can write: a server that serves files from a root directory in the file system. What follows is basically the example you find in the repo. First thing, as I said before, you have this ResponseData file-like object. Can you read the code? Sorry about the colors. You inherit from this object and implement read(), size() and close(). In this case, it's a very simple object, just a wrapper around a file object, but it can be anything; think about configuration files that you have to generate dynamically. You can write whatever you want here, as long as the interface is a file-like object. Then you implement the static handler, which basically just inherits from the base handler, calls the constructor, and stores the root directory and the path, which you need in get_response_data() to return the object we declared on the previous slide. Very easy. For the server, you do something similar. I forgot to say that in the constructor at the top, you have to accept an argument for the statistics callback. You inherit from BaseServer, call the constructor, store the root and the handler stats callback, and then you just override get_handler() to return the handler we declared on the previous slide. And this is the main: the top two functions are your callbacks. You get stats, which is a dictionary of statistics, and you can do whatever you want with it, use your internal libraries to push counters to your monitoring infrastructure. Then it just instantiates the server, calls the run() method, and that's it.
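For reference, this is roughly the static server example shipped in the fbtftp repository, lightly annotated. It's adapted from the repo's example directory, so treat the exact signatures and the root path as indicative rather than authoritative:

```python
import os

from fbtftp.base_handler import BaseHandler, ResponseData
from fbtftp.base_server import BaseServer


class FileResponseData(ResponseData):
    """A file-like object wrapping a real file on disk."""

    def __init__(self, path):
        self._size = os.stat(path).st_size
        self._reader = open(path, "rb")

    def read(self, n):
        return self._reader.read(n)

    def size(self):
        return self._size

    def close(self):
        self._reader.close()


def print_session_stats(stats):
    # Called at the end of every session; swap the print for calls
    # into your own monitoring libraries.
    print(stats)


def print_server_stats(stats):
    # Called periodically by the server with aggregate counters.
    print(stats.get_and_reset_all_counters())


class StaticHandler(BaseHandler):
    def __init__(self, server_addr, peer, path, options, root, stats_callback):
        self._root = root
        super().__init__(server_addr, peer, path, options, stats_callback)

    def get_response_data(self):
        # Map the requested path to a file under the root directory.
        return FileResponseData(os.path.join(self._root, self._path))


class StaticServer(BaseServer):
    def __init__(self, address, port, retries, timeout, root,
                 handler_stats_callback, server_stats_callback=None):
        self._root = root
        self._handler_stats_callback = handler_stats_callback
        super().__init__(address, port, retries, timeout, server_stats_callback)

    def get_handler(self, server_addr, peer, path, options):
        return StaticHandler(server_addr, peer, path, options,
                             self._root, self._handler_stats_callback)


def main():
    # Port 69 needs root privileges; '::' listens on IPv6.
    server = StaticServer(address="::", port=69, retries=3, timeout=5,
                          root="/var/tftproot",
                          handler_stats_callback=print_session_stats,
                          server_stats_callback=print_server_stats)
    try:
        server.run()
    except KeyboardInterrupt:
        server.close()


if __name__ == "__main__":
    main()
```

The point is that all the protocol machinery lives in the base classes; your code only decides what bytes a given path maps to and where the stats go.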
So how do we use it at Facebook? A while ago, two or three years ago, we decided to get rid of hardware load balancers, and we have our own software load balancer architecture based on Linux IPVS, which is kernel stuff, but the control plane is Python: we have Python controlling IPVS through a netlink API. So we got rid of the hardware load balancers and use the software ones. What we did is: we have a population of fbtftp servers in every region, and any server can respond to requests coming from any client. We implemented our own logic on top of the framework to stream the files from the closest HTTP repository, so we don't need to synchronize seven gigabytes of files anymore. If something is not in the local cache, we fetch the static file over HTTP, but we don't wait for the fetch to finish; we just start streaming bytes. We don't wait for 80 megabytes or whatever it is to land on disk before we serve it. Every time there is a new request, we do an HTTP HEAD request to the HTTP server to get the timestamp of the file; if it's newer, we start the streaming process again, otherwise we serve it from disk. When a request comes in, we have regular expressions: if the path matches a static file, we do what I just said and stream it from the closest HTTP repo; if it's a dynamic request, like a configuration file, we make calls to our backend systems to get the information we need to build the file response object, which is generated dynamically, and we serve the response back to the client.

So what are the improvements here? As I said: no more physical load balancers; no waste of resources, since any machine can serve traffic from anywhere; we have stats, finally, fancy dashboards and everything; the TFTP servers are dynamic, basically stateless; the config files for GRUB and iPXE are automatically generated; and the static files are streamed, so of course you don't need to synchronize seven gigabytes of data every time you deploy a new server. Because of all this, it's container friendly: we can just take the binary, put it in a container, and it will just run.

But I said that we removed load balancers completely, so how do we route TFTP traffic? DHCP is the guy that tells the machines which IP they need to connect to to find the TFTP server. We have a project called NetNORAD; you can find a recent article about it on our blog. It's not really 100% Python, it's Python and C++, and its goal is to ping every rack switch in the fleet from different locations in the fleet in order to measure latencies and packet loss. Ultimately, it produces a huge JSON map which tells you the latency between cluster X and cluster Y in the infrastructure. This huge JSON gets published every couple of minutes, and applications can subscribe to it. So our DHCP server, which I talked about last year, our own implementation of a DHCP server, fetches this map and combines it with the location the client request is arriving from, and also with the health checks we have internally, so it knows which TFTP servers are alive. We combine this information and just pick the closest TFTP server for a given client; we use this map of our network to do that. And this helps a lot with POPs: now we deploy fbtftp in every single POP.
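The production logic isn't public, but the streaming idea fits naturally into the same ResponseData interface. Here is a hypothetical minimal sketch, with no caching, freshness check or error handling, that starts serving TFTP blocks before the HTTP body has finished downloading; the class name and URL handling are my own illustration, not the real code:

```python
import urllib.request

from fbtftp.base_handler import ResponseData


class HTTPStreamingResponseData(ResponseData):
    """Hypothetical: serve a file over TFTP while it is still being
    fetched over HTTP."""

    def __init__(self, url):
        # urlopen() returns once the headers arrive, so we can start
        # handing out blocks before the whole body has been downloaded.
        self._response = urllib.request.urlopen(url)
        self._size = int(self._response.headers["Content-Length"])

    def read(self, n):
        return self._response.read(n)

    def size(self):
        return self._size

    def close(self):
        self._response.close()
```

A real handler's get_response_data() would combine something like this with the HEAD-based freshness check, a local cache, and the regular-expression routing between static files and dynamically generated configs described above.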
So we reduced the TFTP latency, because TFTP stays inside the same location and we stream the files from over the ocean. We no longer have those TFTP issues we were having back in 2014. Okay, so that's the end of my talk. You can find the project on GitHub, you can install it with pip, and we have a session tomorrow at 2:45 p.m. about how we use Python in production engineering at Facebook, so if you want to talk to us, feel free to come over. Now I'll take some questions.

Hello, one question. Hi, how much faster is it compared to the previous TFTP server? Can you say that again, sorry? How fast is it compared to the previous one? It's not faster, because that was a C implementation and this is Python with multiprocessing. I did some benchmarks; unfortunately legal didn't want me to put the benchmarks on the slides, but it is not as fast as the C implementation, and it works quite well with our infrastructure. Every quarter, I think, we do mass provisioning tests where we take a full cluster, drain it, and mass provision it, and it can cope with tens of thousands of machines. So to answer your question: it's not as fast as the C implementation, but it's fast enough for us. Another? Can it work on network hardware? Yeah, yeah, potentially, if you have a Python interpreter it can absolutely work. I don't know if you know, but we have Open Compute rack switches, which are basically Linux boxes with ASIC hardware in them, and they run like a normal Linux system. So potentially you can have fbtftp deployed in every rack switch and be able to serve that rack switch. So yeah, you can do that. If you can run an interpreter on your network device, then you can; it's pure Python 3, there are no dependencies on anything. It just uses sockets and multiprocessing, and that's it. And if you need to do what we do, the streaming stuff, you need an HTTP library, but you can use the embedded one in the default Python distribution. Okay, thank you for your talk; sorry for the technical problems, thank you for the interesting session, and thank you very much for coming. Have a nice day.