What I want to talk to you about today is the Facebook cache infrastructure and how we leverage Python to make things scale. The first question that should be on your mind at this point is, what do I mean by cache? Because I'm not talking about a CDN. I'm not talking about the level one, level two instruction caches on your CPU. I'm not talking about the block cache on your hard drives, or the query cache on your databases. I want to talk to you about a globally distributed, non-persistent data store that allows us to answer queries with one-millisecond latency.

Why one? Why not five, 10, 100? It's because of the high level of data dependency that we deal with. The very rich user experience that you interact with on the Facebook page means that we need to retrieve data before we know which data to retrieve. This is the graph of the subsequent data fetches that we need to perform in order to render one news feed story. We need to know who the author is, what his name is, his latest profile picture, the content of the story, the people who commented on and liked the story, what your friendship with them is, and what privacy settings they have for you seeing that story. If we are going to do that many data fetches to render one news feed story, and we want to render a full news feed page, we need to perform each of these fetches in at most one millisecond. This is where we get our first requirement for the caching system at Facebook.

The first part of the caching system that we have is memcache. We use memcache at Facebook pretty much like everybody else out there: as a key-value store, a look-aside cache. When a web server wants to render a page and is looking for a piece of data, we look in memcache to see if the data is there. If the data is not there, we go to the data store. It can be a database; it can be an expensive computation. We retrieve the value and store it in memcache, so subsequent data fetches are going to be fast.
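The look-aside pattern just described can be sketched in a few lines. This is a minimal illustration, using a plain dict as a stand-in for memcache and a callable as the backing data store; the class and names here are made up, not Facebook's actual client.

```python
# Minimal look-aside cache sketch: check the cache first, fall back to
# the data store on a miss, and fill the cache for subsequent fetches.

class LookAsideCache:
    def __init__(self, store_fetch):
        self.cache = {}                 # stand-in for memcache
        self.store_fetch = store_fetch  # database or expensive computation
        self.misses = 0

    def get(self, key):
        if key in self.cache:
            return self.cache[key]      # hit: served straight out of RAM
        self.misses += 1
        value = self.store_fetch(key)   # miss: go to the data store
        self.cache[key] = value         # fill so subsequent fetches are fast
        return value
```

The first fetch of a key misses and goes to the store; every later fetch of that key is served from memory.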
We want to have a large memory space because we want to store a lot of data in the cache; we're going to cover a bit of scale later on. We spread the load across all of our memcache machines inside one given pool using consistent hashing. I'm not going to cover that tonight, but if you're not familiar with the technique, look it up. It's industry standard, it's very powerful, and it gives us some key properties that let us scale to many servers.

We have hot spares: fully configured machines, ready to go. So when we lose a machine, we have one cache server ready to take its place. There would be other ways to deal with that, like shrinking the array and reconfiguring all the clients to know about the failure. We don't do that. We leave the dead server there until we get the new server in its place, and failover is typically good enough to take care of the missing machine.

We have one other problem to deal with: the speed of light is actually not that fast. Crossing the Atlantic is, on average, two orders of magnitude slower than the response time that we want to have in our caching system. That means that strong consistency is impossible. So our second requirement is that the caching system at Facebook has eventual consistency. How eventual? Well, that's what we deal with here in Vancouver.

We maintain consistency by leveraging MySQL replication. You're probably familiar with the fact that we use MySQL. You might not be familiar with the fact that we use pretty much the out-of-the-box MySQL replication to take data from one data center to the other. In the slave region, we have a daemon running on every MySQL database that is watching the binlog and transforming every write into a cache invalidation operation in our cache clusters and front-end clusters. It's interesting to note at this point that we don't send the new value to the cache machines. We invalidate what was in there, if it was in there.
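For those looking up consistent hashing, here is a toy version of the idea: servers and keys hash onto the same ring, and a key belongs to the first server clockwise from its hash. Real implementations add many virtual nodes per server to smooth the distribution; this sketch omits that.

```python
# Toy consistent hash ring. The key property: removing one server only
# remaps the keys that lived on it; every other key keeps its server,
# which is what makes failing over to a hot spare cheap.
import bisect
import hashlib

def ring_hash(value):
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, servers):
        self.ring = sorted((ring_hash(s), s) for s in servers)
        self.points = [p for p, _ in self.ring]

    def server_for(self, key):
        # first ring point at or after the key's hash, wrapping around
        index = bisect.bisect(self.points, ring_hash(key)) % len(self.ring)
        return self.ring[index][1]
```

The same key always lands on the same server as long as the ring membership doesn't change.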
One reason for that is that what is popular, hot content in North America might not be what's popular and hot in Europe. But the main reason we do it is that deleting an item from cache is idempotent. We can spool the deletes, and we can replay them if we're recovering from an outage. So if we're dealing with a slow memcache machine, that's okay: we can just keep the deletes in the daemon (the daemon we have here is called Aqueduct) and then replay them later on.

In terms of scale, this is an infrastructure composed of thousands of servers, over a billion cache operations per second, and over a trillion key-value pairs. We segment the memory space into multiple pools, and wildcard, the default pool, the largest of our server pools, has a hit rate of 98.1%. That means that even though we're using MySQL to store and persist data, the vast majority of the data that we use to run the site is fetched straight out of RAM.

So how do we use Python at that scale? Memcache itself is a C daemon: fairly low-level C code, very tight optimization of how it uses memory, very performant access to the network. Its release schedule is fairly slow, and that's why we want some very agile tools to manage how we deploy these memcache servers, because we want to be able to respond to changing demands: how many users we have, how heavily the application developers are hitting the cache. We use Python pretty much all over the place to manage the caching system, but I want to cover two use cases where we leverage it very effectively.

The first one is managing the pools themselves. If we want to grow some memcache pools, shrink them, or replace a server, internally at Facebook we have to contact about a dozen internal services. We need to create tickets for our site ops to repair the server after we detect the failure.
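The spool-and-replay idea works precisely because deletes are idempotent. Here is a sketch of that property under assumed, illustrative names; this is not Aqueduct's actual code, just the shape of the argument.

```python
# Deletes are idempotent, so an invalidation daemon can spool them when
# a cache box is slow or unreachable and replay them later, even twice.
from collections import deque

class FlakyCacheBox(dict):
    """Pretend memcache box that can be temporarily unreachable."""
    up = True

    def delete(self, key):
        if not self.up:
            raise ConnectionError("cache box unreachable")
        self.pop(key, None)             # deleting twice is harmless

class InvalidationSpool:
    def __init__(self, cache):
        self.cache = cache
        self.spool = deque()

    def invalidate(self, key):
        try:
            self.cache.delete(key)
        except ConnectionError:
            self.spool.append(key)      # keep the delete for later

    def replay(self):
        while self.spool:
            self.cache.delete(self.spool.popleft())
```

If we sent new values instead of deletes, replaying an old spool could resurrect stale data; replaying deletes can never make the cache wrong, only emptier.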
We need to inform all the clients of the cache system that the IP address of the server serving a particular shard has changed. That's hundreds of thousands of machines that need to be informed about that configuration change. So we use mcconf, the tool we built to configure these arrays, and mcconf is a great glue between all these different network protocols, letting us manage these changes.

The other thing that we do with Python is software upgrades. A software upgrade on a very stateful service like memcache is a very hard problem. If I reboot a memcache machine to install a new kernel, I lose my cache. If I kill my memcache daemon because a version upgrade changes the cache layout, I lose my cache. And I basically need that cache to be able to run my site fast enough. There are many things to take into consideration. First, these different pools have dependencies. How fast I can reboot my servers is going to depend on the load on the site and how fast I can warm up the working set to reach my desired hit rate. There's no clear solution to that; it's basically experimentation and using your gut feeling. So we love to use Python there, because we're able to test and see the results of different recipes or strategies to warm up the cache.

So this is memcache. Now let's take a look at the problem from a different angle. We represent the social graph as a set of nodes and edges. Users, comments, and pictures are objects, and they are linked together with typed edges, like friendship or having liked a story. So in this case, I have a post. My friend Marco commented on the post, and my friends Sarah and Nikhil liked that post. If we teach the caching system about that data format, then we can do queries that go much beyond simple key-value lookups. Think of retrieving a series of edges: all the comments attached to a post.
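A rough sketch of that data model: objects are nodes, and typed, timestamped edges (comments, likes) connect them. The names and layout here are illustrative only; the timestamps are what enable the range queries discussed next, where the cache can ask the store for just the edges newer than what it already has.

```python
# Social-graph sketch: typed, timestamped edges between object ids.
from collections import defaultdict

class SocialGraphSketch:
    def __init__(self):
        # (object_id, edge_type) -> list of (timestamp, other_id)
        self.edges = defaultdict(list)

    def add_edge(self, obj, edge_type, other, ts):
        self.edges[(obj, edge_type)].append((ts, other))

    def edges_since(self, obj, edge_type, since=0):
        """All edges of one type newer than `since`."""
        return [o for ts, o in self.edges[(obj, edge_type)] if ts > since]
```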
If you've seen that post an hour ago and you come back to see it, we typically want to show you only the newest comments. If the caching system performed that query five minutes ago, we can ask the database for only five minutes' worth of comments instead of one hour's worth. This is huge. The system at Facebook that does this, we call TAO.

TAO is a read-through, write-through cache that uses two layers. If a web server is looking for a piece of data, it's going to ask a TAO follower. TAO followers are co-located inside front-end clusters with the web servers. If the data is not in the TAO follower, the follower is going to ask a TAO leader, who is, in turn, responsible for asking the data store. Same thing for writes. The main advantage of having this two-layer caching system is that we get a better hit rate. But mostly, we solve many concurrency problems that otherwise require a lot of logic inside the memcache client. I'm going to come back to that if we have time a bit later on.

In terms of scale, we have a fairly large deployment of memcache machines, and we have a similarly impressive deployment of TAO servers. In terms of exact numbers, I would say that the TAO deployment as of today is probably about 20% larger than the memcache one. This number, a 97.5% hit rate, is really the surprising one. Where memcache has multiple pools where the memory space is segmented into different arrays, TAO offers only one unified view. And the vast majority of the social graph, everything about how users and their content relate to each other, is served essentially straight out of RAM.

We use Python for TAO in ways very similar to what we do for memcache, but there are a few interesting differences, and I want to cover one of them with you. So on November 6, 2012, a bit over a year ago, I was drinking a beer at home. It was a Sierra Nevada Pale Ale.
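The two-layer read path can be sketched like this. The class and method names are illustrative, not TAO's actual API: a follower miss goes to the leader, and only the leader ever talks to the data store.

```python
# Two-layer read-through sketch: follower -> leader -> data store.

class Leader:
    def __init__(self, store_fetch):
        self.cache = {}
        self.store_fetch = store_fetch
        self.store_reads = 0

    def get(self, key):
        if key not in self.cache:
            self.store_reads += 1
            self.cache[key] = self.store_fetch(key)  # only this layer
        return self.cache[key]                       # touches the store

class Follower:
    def __init__(self, leader):
        self.cache = {}
        self.leader = leader

    def get(self, key):
        if key not in self.cache:
            self.cache[key] = self.leader.get(key)   # miss: ask the leader
        return self.cache[key]
```

Note how two followers sharing one leader produce a single store read for the same key: that shared second layer is part of why the hit rate improves, and it gives one place to serialize concurrent fetches of the same item.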
If you guys are familiar with the Yakima Benelux, Ben, the master brewer there, told me that Sierra Nevada Pale Ale is actually the inspiration behind the Yakima. It's a pale ale, fairly nice and crisp, not too malty, very, very hoppy. This is going to become important at some point. One of my friends asked me, oh, are you excited about the elections? I said, ah, the elections. I'm Canadian; I cannot vote. I'd rather celebrate the greatness of the US and of American culture, with that great Sierra Nevada. And then somebody posted this picture. What's surprising, or should I say remarkable, about this picture is the four and a half million likes, about half of which happened within the first hour of the picture coming online. Did I mention that I was on call?

Let's go back for a moment. Consistent hashing is going to take one key and hash it; using the size of the array, it selects one lucky machine that is going to perform all the reads. Reads are typically a very cheap operation. But it also gets all the writes, which are channeled all the way to the database. A like here is creating an association, a typed edge. That's inserting a row in the database. That is slow. And one lucky machine had a really, really hard time.

So how did we deal with that? Well, there was a lot of, let's say, duct tape deployed that night. That's how we managed to serve that picture, but the long-term solution is a lot more interesting, and that's where Python really helped us. We have an extension to the consistent hashing that we call shard splitting. It should probably be called shard replication. We can elect some hot shards, some hot partitions of the dataset, and say these should be replicated. So instead of a key mapping to the index of one machine, it's going to map to the indexes of multiple machines that are going to be responsible for serving that shard. That may sound simple.
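Here is one way shard replication could look on top of plain hashing. This is a sketch under assumed details: the replica-group size of 3 and the "consecutive machines" placement are made up for illustration, not the production scheme.

```python
# Shard splitting (really: shard replication) sketch. Normally a key
# maps to exactly one machine; a shard elected as "hot" is instead
# served by a small group of machines, and each read picks one of them.
import hashlib
import random

class SplitRing:
    def __init__(self, servers, hot_shards=(), fanout=3):
        self.servers = servers
        self.hot = set(hot_shards)
        self.fanout = fanout

    def shard_for(self, key):
        digest = hashlib.md5(key.encode()).hexdigest()
        return int(digest, 16) % len(self.servers)

    def read_server(self, key):
        shard = self.shard_for(key)
        if shard in self.hot:
            # spread reads over `fanout` consecutive machines
            offset = random.randrange(self.fanout)
            return self.servers[(shard + offset) % len(self.servers)]
        return self.servers[shard]
```

The catch the talk mentions next is exactly what this sketch ignores: invalidations now have to be routed to every replica of a hot shard, not just one machine.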
But it gets a bit tricky when you take into account the routing of the invalidations that need to happen asynchronously on top of that. The other extremely interesting thing that Python allowed us to do is shard placement. We can take individual shards and manually map them to, shall I say, mostly idle servers. So we take really busy shards, and we send them to servers that, thanks to how the load spreads statistically, are not doing a lot of work. This uses gevent to monitor the status of all the cache machines in real time, and then, several times a minute, we remap shards. And this was a summer intern project. Pretty amazing, the things that you can do with Python in only one summer.

OK. There was also this time when we invalidated a good chunk of the cache. I don't know if you guys are familiar with PHP, but PHP arrays are ordered containers. They're also associative arrays, so you can map a key to a value and retrieve it in pretty much constant time. Python dictionaries also allow you to index a value and retrieve it in constant time, but Python dictionaries are fundamentally unordered. We have to deal with a bit of PHP at Facebook; we have some legacy systems that store serialized PHP arrays. But we like Python, so at some point we migrated a legacy system to Python, and we figured that the best mapping for a PHP array would be a dictionary. Well, it turns out that in this case, this particular PHP array was used as an ordered list and not as an associative container. And that resulted in, yes, basically invalidating a big chunk of the cache.

Fortunately, most configuration at Facebook consists of actual serialized Python objects or Python code, and we have a pretty sophisticated system called Configurator to do schema validation on the nesting of the objects. Think of your XML DOM schemas. We can also use logic to introspect these objects and check that the configuration is sane before deploying it to these numerous servers.
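The PHP pitfall in miniature: PHP serializes the list ['a', 'b', 'c'] as an associative array keyed 0, 1, 2. Mapping that to a dict keeps the values, but on the Python versions of the day (before 3.7) dict iteration order was arbitrary, so treating the dict's values as the original list was a bug. The fix is to rebuild the order from the integer keys.

```python
# A PHP list round-tripped through a dict: values survive, order may not.
php_list_as_dict = {0: "a", 1: "b", 2: "c"}

# Buggy on pre-3.7 Pythons: relies on dict iteration order.
maybe_scrambled = list(php_list_as_dict.values())

# Correct: recover the PHP array's order from its integer keys.
ordered = [php_list_as_dict[k] for k in sorted(php_list_as_dict)]
```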
We use Python for other things too: for remediation, and for deployment. Interestingly, we use BitTorrent heavily internally to deploy new binaries and new versions to all our servers. I understand that we covered a lot of ground, and I probably left many questions unanswered, like how we actually do failover, how we deal with thundering herds, and how we provide read-after-write semantics. We published a few papers on memcache and TAO recently, so I invite you to go read those papers if you want to learn more about these concepts or if you're interested in getting more numbers. That leaves us a few minutes for questions. But first of all, I would like to thank you for being here.