All right. Hello everyone. Today I'm going to talk about this paper about how Facebook uses memcache in order to handle enormous load. The reason we're reading this paper is that it's an experience paper: there aren't really any new concepts or ideas or techniques here, but it describes what a real, live company ran into when they were trying to build very high capacity infrastructure. There are a couple of ways you could read it. One is as a cautionary tale about what goes wrong if you don't take consistency seriously from the start. Another is as an impressive story about how to get extremely high capacity using mostly off-the-shelf software. Yet another is as an illustration of the fundamental struggle that a lot of systems face between getting very high performance, which you do with techniques like replication, and getting consistency, for which techniques like replication are really the enemy. And so we can argue about whether we like their design or think it's elegant or a good solution, but we can't really argue with how successful they've been, so we do need to take them seriously. For me, this paper, which I first read quite a few years ago, has been a source of ideas and understanding about these problems at many points. All right, so before talking about Facebook proper: they're an example of a pattern you see fairly often, that many people have experienced, in which someone is trying to build a website to do something, and the people building websites are not interested in building high-performance storage infrastructure. They're interested in building features that will make their users happy, or selling more advertisements, or something.
So they're not going to start by spending a year of effort building cool infrastructure; they're going to start by building features, and they'll only make the infrastructure better to the extent that they really have to, because that's the best use of their time. All right, so here's a typical starting scenario. When a website is very small, there's no point in starting with anything more than a single machine. Maybe you only have a couple of users sitting in front of their browsers, and they talk over the internet to your single machine. That machine might run the Apache web server, and you write the scripts that produce web pages using PHP or Python or some other convenient, easy-to-program scripting-style language; Facebook uses PHP. You need to store your data somewhere, and you can just download a standard database. Facebook happened to use MySQL. MySQL is a good choice because it implements the SQL query language, which is very powerful, along with ACID transactions, and it provides durable storage. So this is a very nice setup, and it'll actually take you a long way. But supposing you get successful: you get more and more users, more and more load, more and more people viewing your website and running whatever PHP stuff your website provides. At some point, almost certainly the first thing that's going to go wrong is that the PHP scripts take up too much CPU time. That's usually the first bottleneck people encounter if they start with a single server. So what you need is some way to get more horsepower for your PHP scripts, and that takes us to architecture number two for websites, in which you have lots and lots of users, or at least more users, and you need more CPU power for your PHP scripts.
So you run a bunch of front-end servers whose only job is to run the web servers that users' browsers talk to. These are usually called front-end servers, and they run Apache, the web server, and the PHP scripts. Now, your users are going to talk to different servers at different times, and maybe your users cooperate with each other: they message each other, they need to see each other's posts. So all these front-end servers are going to need to see the same back-end data. And in order to do that, at least for a while, you can just stick with one database server: a single machine running MySQL that handles all the queries and updates, reads and writes, from the front-end servers. If you possibly can, it's wise to use a single server here, because as soon as you go to two servers and somehow split your data over multiple database servers, life gets much more complicated. You need to worry about things like whether you need distributed transactions, or how the PHP scripts decide which database server to talk to. Again, you can get a long way with this second architecture: you get as much CPU power as you like by adding more front-end servers, and up to a point a single database server will actually be able to absorb the reads and writes of many front ends. But maybe you're very successful and get even more users, and the question is what's going to go wrong next. Since you can always add more web servers for CPU, what inevitably goes wrong after a while is that the database server runs out of steam. Okay, so what's the next architecture? This is web architecture three in the kind of standard evolution of big websites. Now we have thousands and thousands of users and lots and lots of front ends.
And now we're basically going to have to have multiple database servers, so behind the front ends we have a whole rack of database servers, each one running MySQL again. But we're going to shard the data: we're driven now to sharding the data over the database servers, so maybe the first one holds keys A through F and the second holds keys G through Q, whatever the sharding happens to be. On the front end, you have to teach your PHP scripts to look at the data they need and figure out which database server to talk to, and at different times, for different data, they'll talk to different servers. So this is sharding. The reason this gives you a boost is that now all the work of reading and writing is split up, hopefully evenly, between these servers. And since they hold different data, they're not replicas, right? We're sharding the data, and the servers can execute in parallel, so we have big parallel capacity to read and write data. It's a little bit painful: the PHP code has to know about the sharding. If you change the setup of the database servers, say you add a new database server, or you realize you need to split up the keys differently, you're going to have to modify the software running in the front ends in order for them to understand how to cut over to the new sharding. So there's some pain here. There's also this: if you need transactions, and many people use them, but the data involved in a single transaction is on more than one database server, you're probably going to need two-phase commit or some other distributed transaction scheme, and that's also a pain, and slow. All right, well, you can get fairly far with this arrangement.
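To make the sharding concrete, here's a small sketch of the kind of routine the front-end code would need in order to map a key to a database server. The key ranges and server names here are made up for illustration; real deployments often hash the key instead of using ranges.

```python
# Hypothetical shard map: first-letter ranges to (made-up) server names.
SHARDS = [
    ("a", "g", "db1"),  # keys starting a..f
    ("g", "q", "db2"),  # keys starting g..p
    ("q", "{", "db3"),  # keys starting q..z ("{" follows "z" in ASCII)
]

def shard_for(key):
    """Return the database server responsible for this key."""
    first = key[0].lower()
    for lo, hi, server in SHARDS:
        if lo <= first < hi:
            return server
    raise KeyError("no shard covers key %r" % key)
```

The pain the lecture mentions is visible here: if you change `SHARDS`, every front end has to pick up the new map before it can find the data again.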
However, it's quite expensive. MySQL, or any fully featured database server of the sort people like to use, is not particularly fast; you can probably only perform a couple hundred thousand reads per second, and far fewer writes. And websites tend to be read-heavy, so it's likely that you're going to run out of steam for reads before writes: the load on the database servers will be dominated by reads. So after a while, you can slice the data more and more thinly over more and more servers, but two things go wrong with that. One is that sometimes you have specific keys that are hot, that are used a lot, and no amount of slicing helps there, because each key lives on only a single server. If that key is very popular, that server is going to be overloaded no matter how much you partition or shard the data. The other problem with adding lots and lots of MySQL database servers for sharding is that it turns out to be a really expensive way to go. After a point, you're going to start to think that instead of spending a lot of money to add another database server running MySQL, you could take the same server, run something much faster on it, like, as it happens, memcached, and get a lot more reads per second out of the same hardware using caching than using databases. So the next architecture, and this is now starting to resemble what Facebook is using: we still have users, we still have front-end servers running web servers and PHP, by now maybe a vast number of front-end servers. We still have our database servers, because we need a system that will store data safely on disk for us and provide things like transactions. But in between, we're going to have a caching layer, and this is where memcached comes in.
Of course there are other things you could use than memcached, but memcached happens to be an extremely popular caching scheme. The idea now is that you have a whole bunch of these memcached servers, and when a front end needs to read some data, the first thing it does is ask one of the memcached servers: do you have the data I need? It sends a get request with some key to one of the memcached servers, and the memcached server checks; it's just got a table in memory. In fact, memcached is extremely simple, far, far simpler than your lab three. It just has a big hash table in memory. It checks whether that key is in the hash table, and if it is, it sends back the data: oh yeah, here's the value I've cached for that key. If the front end hits in this memcached server, great; it can then produce the web page with that data in it. If it misses in memcached, though, the front end has to re-request the data from the relevant database server, and the database server will say, oh, here's the data you need. At that point, in order to cache it for the next front end that needs it, the front end sends a put with the data it fetched from the database into that memcached server. And because memcached runs at least 10, and maybe more than 10, times faster for reads than the database for a given amount of hardware, it's often worth using a fair amount of that hardware for memcached as well as for the database servers. People use this arrangement a lot, and it just saves them money, because memcached is so much faster for reads than a database server. You still need to send writes to the database, because you want updates to be stored durably on the database's disk and to still be there if there's a crash or something. But you can serve the reads from the cache very much more quickly. Okay, so I have a question.
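As a sketch, the look-aside read path just described might look like this, with plain Python dicts standing in for the memcached server and the database (both stand-ins and the key name are illustrative assumptions, not Facebook's code):

```python
# Plain dicts stand in for a memcached server and the MySQL database.
cache = {}
database = {"user:42": "profile data for user 42"}

def read(key):
    value = cache.get(key)   # 1. ask memcached first (the "get")
    if value is not None:
        return value         # cache hit: done
    value = database[key]    # 2. miss: fall back to the database
    cache[key] = value       # 3. install it (the "put") for the next reader
    return value
```

Note that all the smarts live in the client: memcached itself never talks to the database.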
The question is: why wouldn't the memcached server execute the put on behalf of the front end and cache the response before responding to the front end? That's a great question. You could imagine a caching layer where you'd send it a get, and if it missed, the memcached layer would forward the request to the database, get the response, add the data to its tables, and then respond. The reason it doesn't work that way is that memcached is a completely separate piece of software that doesn't know anything about databases; it's not even necessarily used in conjunction with a database, although it often is. So we can't bake knowledge of the database into memcached. A deeper reason is that the front ends are often not storing one-for-one database records in memcached. Very frequently what's going on is that the front end will issue some requests to the database and then process the results somewhat: maybe take a few steps toward turning them into HTML, or collect together results from multiple queries on multiple rows in the database, and cache that partially processed information in memcached, just to save the next reader from having to do the same processing. For that reason, memcached really does not understand the relationship between what the front ends would like to see cached and how to derive that data from the database; that knowledge is only in the PHP code on the front end. So even though it could be an architecturally good idea, we can't have this integration, this direct contact between memcached and the database, although it might make the cache consistency story much more straightforward. And yes, this answers the next question, which is: what is the difference between a look-aside cache and a look-through cache?
The "look-aside" business is that the front end sort of looks aside to the cache to see if the data is there, and if it's not, it makes its own arrangements for getting the data on a miss. A look-through cache would forward the request to the database directly and handle the response itself. Part of the reason for memcached's popularity is that it is a look-aside cache, completely neutral about whether there's a database, what's in the database, or the relationship between the items in memcached and the items in the database. All right. So this is a very popular arrangement, very widely used. It's cost-effective because memcached is so much faster than the database. It's a bit complex, though. Every website that makes serious use of this faces the problem that, if you don't do something, the data stored in the caches will get out of sync with the data in the database. So everybody has to have a story for how to make sure that when you modify something in the database, you do something to memcached to take care of the fact that memcached may then be storing stale data that doesn't reflect the update. A lot of this paper is about what Facebook's story is for that, although other people have other schemes. This arrangement is also potentially a bit fragile. It allows you to scale up to far more users than you could have handled with databases alone, because memcached is so fast. But what that means is that you're going to end up with a system that's sustaining a load that's far, far higher, orders of magnitude higher, than what the databases alone could handle. And thus, if anything goes wrong, for example if one of your memcached servers were to fail, the front ends would now have to contact the database directly, because they'd miss: they couldn't use that server's cached data.
You're going to increase the load on the databases dramatically, right? Supposing memcached has a hit rate of 99%, or whatever it happens to be, memcached is going to be absorbing almost all the reads, and the database back ends are only going to see a few percent of the total reads. So any failure here is going to increase that few percent of the reads to maybe, I don't know, 50% of the reads or whatever, which is a huge, order-of-magnitude increase. So, as Facebook does, once you rely on this caching layer, you need to set up pretty serious measures to make sure that you never expose the database layer to anything like the full load that the caching layer is seeing. And you'll see in the paper that Facebook put quite a bit of thought into making sure the databases never see anything like the full load. Okay. So far this has all been generic. Now I'm going to switch to the big picture of what Facebook describes in the paper as their overall architecture. We have lots of users; every user has a friend list and status and posts and likes and photos. Facebook is very much oriented toward showing data to users, and a super important aspect of that is that fresh data is not absolutely necessary in that setting. Suppose that, due to caching, reads yield data that's a few seconds out of date, so you're showing your users not the very latest data but the data from a few seconds ago. The users are extremely unlikely to notice, except in special cases. If I'm looking at a news feed of today's news and I see the news from a few seconds ago versus the news from right now, it's no big deal. Nobody's going to notice; nobody's going to complain.
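To put rough numbers on that fragility (the figures here are illustrative, not from the paper): when the caches absorb almost all reads, even a modest drop in hit rate multiplies database load.

```python
# Illustrative numbers, not from the paper: database read load as a
# function of cache hit rate, for a fixed total read rate.
TOTAL_READS_PER_SEC = 1_000_000

def db_reads_per_sec(hit_rate):
    """Reads per second that miss the cache and reach the databases."""
    return TOTAL_READS_PER_SEC * (1.0 - hit_rate)

# At a 99% hit rate the databases see only 1% of reads; if failures
# push the hit rate down to 50%, database load goes up 50x.
```

This is why a failed memcached server is dangerous out of all proportion to its share of the hardware.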
That's not always true for all data, but for a lot of the data they have to deal with, super up-to-date consistency in the sense of linearizability is not actually important. What is important is that you don't cache stale data indefinitely. What they can't do is, by mistake, show users some data that's from yesterday, or last week, or even an hour ago; users really will start to notice that. So they don't care about consistency second by second, but they care a lot about not showing stale data from more than a little while ago. The one situation in which they do need to provide consistency is when a user updates their own data, or really when a user updates almost any data and then reads that same data, knowing they just updated it. It's extremely confusing for a user to see stale data when they know they just changed it. So in that specific case, the design is careful to make sure that if a user changes data, that user will see the changed data. Okay, so Facebook has multiple data centers, which they call regions. I think at the time this paper was written they had two regions: their primary region was on the West Coast, in California, and their secondary region was on the East Coast. The two data centers look pretty similar. Each one has a set of database servers running MySQL, with the data sharded over those MySQL servers; a bunch of memcached servers, which we'll see are actually arranged in independent clusters; and a bunch of front ends, again a separate arrangement in each data center. There are a couple of reasons for this. One is that their customers are scattered all over the country, and it's nice for performance that people on the East Coast can talk to a nearby data center, and people on the West Coast can too, which just reduces internet delays.
Now, the data centers were not symmetric, though each of them held a complete copy of all the data; they didn't shard the data across the data centers. The West Coast, I think, was the primary, holding the authoritative copy of the data, and the East Coast was a secondary. What that really means is that all writes had to be sent to the relevant database in the primary region. Any write gets sent there, and they use a feature of MySQL, an asynchronous log replication scheme, to have each database in the primary send every update to the corresponding database in the secondary region, so that, with a lag of maybe a few seconds, the secondary database servers would have identical content to the primaries. Reads, though, were local: when front ends need to find some data, they talk to memcached in their own data center, and if they miss in memcached, they read from the database in that same data center. Again, the databases are complete replicas, so all the data is in both of these data centers, both of these regions. So that's the overall picture. The next thing I want to talk about is a few details of what this look-aside caching actually looks like. There are reads and writes, and this is just what's shown in Figure 2 for a read, executing on a front end. To read any data that might be cached, the first thing the code in the front end does is make this get library call with the key of the data it wants, and get just generates an RPC to the relevant memcached server. The library routine hashes the key on the client to pick the memcached server and sends an RPC to that memcached server. Memcached will reply either yes, here's your data, or maybe nil, saying: I don't have that data, it's not cached.
If the reply is nil, then the front end will issue whatever SQL queries are required to fetch the data from the database, and then make another RPC call to the relevant memcached server to install the fetched data. So this is just the routine I talked through before; that's what look-aside caching does. And for a write: we have a key and a value, and this is a library routine on each front end. We send the new data to the database. As I mentioned before, the key and the value may be a little bit different: what's stored in the database is often in a somewhat different form from what's stored in memcached, but we'll imagine for now they're the same. Once the database has the new data, the write library routine sends an RPC to the relevant memcached server telling it: look, you've got to delete this key. So the writer is invalidating the key in memcached, and what that means is that the next front end that tries to read that key from memcached is going to get nil back, because it's no longer cached, and will fetch the updated value from the database and install it into memcached. So this is an invalidate scheme. You could imagine a scheme that would send the new data to memcached at this point, but it doesn't actually do that; instead it deletes it. In the context of Facebook's scheme, the real reason this delete is needed is so that front ends will see their own writes. Because in fact, in their scheme, the MySQL database servers also send deletes: whenever any front end writes something in the database, the database, using the mcsqueal mechanism the paper mentions, will send deletes to the memcached servers that might hold that key. So the database servers will actually invalidate stuff in memcached by and by, but it may take them a while.
So the front ends also delete the key themselves, so that a front end won't see a stale value for data it just updated. That's the background; this is pretty much how everybody uses memcached, and there's nothing really special here yet. Now, on the surface the paper is all about solving consistency problems, and indeed those are important. But the reason they ran into those consistency problems is in large part because they set up a design for extremely high performance, because they had extremely high load. They were desperate for performance, and they kind of struggled along behind the performance improvements in order to retain a reasonable level of consistency. Because the performance came first for them, I'm actually going to talk about their performance architecture before talking about how they fix the consistency. Okay, sorry, there have been a bunch of questions here that I haven't seen. Let me see. One question: does this mean that the replicated updates from the primary MySQL database to the secondary must also issue deletes? This is, I think, a reference to the previous architecture slide, and the answer is yes indeed. When a front end sends a write to the database server, the database updates its data on disk, and it will send an invalidate, a delete, to whatever memcached server in the local region, the local data center, might have held the key that was just updated. It also sends a representation of the update to the corresponding database server in the other region, which processes it and applies the write to its data on disk. That server, using the mcsqueal log-reading apparatus, also figures out which memcached server might hold the key that was just updated and sends a delete to that memcached server too. So if the key is cached, it's invalidated in both data centers.
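Here's a minimal sketch of that write-then-invalidate path from Figure 2, again with dicts standing in for the database and memcached (the key name and values are made up):

```python
# Dicts stand in for the MySQL database and a memcached server.
cache = {"x": "stale value"}
database = {"x": "stale value"}

def write(key, value):
    database[key] = value   # 1. send the new data to the database (durable)
    cache.pop(key, None)    # 2. delete the key from memcached (invalidate)

write("x", "fresh value")
# The next reader will miss in the cache and refetch the fresh value
# from the database, then reinstall it.
```

The order matters: the database is updated first, then the cached copy is removed, so a writer that immediately reads back its key is guaranteed to miss and see its own write.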
Okay, so another question: what would happen if we did the delete first in the write routine, and then sent the write to the database? With reference to this code: if you do the delete first, you're increasing the chances of trouble from other clients. Suppose you delete and then send the write to the database. If another client reads that same key in between, it's going to miss, fetch the old data from the database, and insert it into memcached; then your update lands, leaving memcached, for a while at least, with stale data. And then if the writing client reads the key again, it may see stale data even though it just updated it. Doing the delete second does leave open the possibility that somebody else will read during the window and see stale data, but they're not worried about stale data in general; what they really care about in this context is clients reading their own writes. So on balance, even though there's a consistency problem either way, doing the delete second ensures that clients will read their own writes. In either case, eventually the database server, as I mentioned, will send a delete for the written keys. Another question: I'm confused how writing the new value shows stale data but deleting doesn't. Let me see. I'm not really sure what the question is asking; if it's with reference to this code, once the write is done... Okay, maybe the question is: suppose we didn't do the delete at all, so that when a web front end wanted to update some data, it would just tell the database, but not explicitly delete the data from memcached. The problem with this is that if the client sent the write to the database and then immediately read the same data, that read would come out of memcached, and memcached still has the old data; memcached hasn't seen the write yet.
So a client that updated some data and then read it would update the database but read stale data from memcached: a client might update some data but still see the old data. Whereas if you do the delete, then a client that writes some data, deletes it from memcached, and then reads it again will miss in memcached because of the delete, go to the database, and the database will give it fresh data. Okay, so the next question is: why do we delete here? Why don't we, instead of this delete, have the client, since it knows the new data, just send a set RPC to memcached? This is a good question. What they have is an invalidate scheme; the alternative would often be called an update scheme. Let me try to cook up an example showing that, while this update scheme could probably be made to work, it doesn't work out of the box; you'd need some careful design to make it work. Now we have two clients reading and writing the same key, interleaved. Let's say client one tells the database to increment x, say from zero to one, so it sets x to one. After that, client one calls set with our key, which is x, and the value one, and writes that to memcached. Supposing, meanwhile, client two also wants to increment x. It's going to read the latest value from the database; and almost certainly these are in fact transactions, so if we were doing an increment, what client one would be sending would be some sort of increment transaction to the database, for correctness, since the database does support transactions.
Now imagine client two increments the value of x to two, sends that increment to the database, and also does the set, so it sets x in memcached to two; but client one's set arrives last. Now what we're left with is the value one in memcached, even though the correct value in the database is two. So if we do this update with set, even though it does save us some time, because we save somebody a miss in the future by setting directly instead of deleting, we also run the risk, if it's popular data, of leaving stale data in the cache. It's not that you couldn't get this to work somehow, but it does require some careful thought to fix this problem. All right, so that's why they use invalidate instead of update. Okay, so I was going to talk about performance. The root of how they get performance is parallelization, parallel execution. For a storage system, at a high level, there are really two ways you can get good performance. One is partition, which is sharding: you take your data and split it up into, say, ten pieces over ten servers, and those ten servers can hopefully run independently. The other way to use extra hardware to get higher performance is replication: instead of just one copy of the data, you keep multiple copies. For a given amount of hardware, you can choose whether to partition your data or replicate it. For memcached, partition means splitting the data over the available memcached servers by hashing the key, so every key lives on one memcached server. Replication for memcached means having each front end just talk to a single memcached server and send all its requests there, so that each memcached server serves only a subset of the front ends, but serves all of their needs.
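Going back to the invalidate-versus-update question for a moment: the lost-update race can be written out as one concrete interleaving. This is a sketch; real clients race over the network, but here we just order the steps by hand with dicts standing in for the database and memcached.

```python
cache = {}
db = {"x": 0}

# The "update" (set) scheme, with an unlucky interleaving:
db["x"] = 1       # client 1 increments x in the database (0 -> 1)
db["x"] = 2       # client 2 increments x in the database (1 -> 2)
cache["x"] = 2    # client 2's set reaches memcached first...
cache["x"] = 1    # ...then client 1's delayed set overwrites it: stale!
assert db["x"] == 2 and cache["x"] == 1   # cache and database disagree

# With invalidation, both clients just delete; arrival order can't matter:
cache.pop("x", None)
cache.pop("x", None)
# the next reader misses and refetches db["x"] == 2 from the database
```

Deletes are idempotent and order-insensitive, which is exactly why invalidation is the safer default here.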
Facebook actually uses a combination of both partition and replication. For partition, the things in its favor: one is that it's memory-efficient, because you store only a single copy of each item of data, whereas with replication you may store every piece of data on every server. On the downside, partition works pretty well as long as your keys are roughly equally popular, but if there are a few hot keys, partitioning doesn't really help much once those hot keys end up on different servers. If there's a single hot key, for example, no amount of partitioning helps, because no matter how much you partition, that hot key is still sitting on just one server. The other problem with partition is that if front ends need lots of data, lots of different keys, each front end is probably going to end up talking to lots of partitions. And at least if you use protocols like TCP, which keep state, there's significant overhead to this sort of n-squared communication pattern as you add more servers. Replication is fantastic if your problem is that a few keys are popular, because now you're making replicas of those hot keys and you can serve each replica of the same key in parallel. It's also good because there's less communication: it's not n-squared; each front end maybe only talks to one memcached server. But the bad thing is that, because there's a copy of the data on every server, you can cache far fewer distinct data items with replication than with partition, so less total data can be stored. Those are the generic pros and cons of the two main ways of using extra hardware to get higher performance. All right, so one context in which they use partition and replication is at the level of different regions.
So I just want to talk through why it is that they decided to have separate regions, with a complete data center holding all the data in each of the regions. Sorry, before I do that, there's a question: why can't we cache the same amount of data with replication? Okay, so suppose you have 10 machines, each with a gigabyte of RAM, and you can use these 10 machines for either partition or replication. If you use a partitioning scheme, where each server stores different data from the other servers, then you can store a total of 10 gigabytes of distinct data objects on your 10 servers. With partition, each byte of RAM is used for different data, so the total amount of RAM you have is how much distinct data you can store. With replication, assuming your users are more or less looking at the same stuff, each cache replica will end up storing roughly the same stuff as all the other caches. So you still have 10 gigabytes of RAM on your 10 machines, but each of those machines stores roughly the same data, so you end up with 10 copies of the same gigabyte of items. In this particular example, if you use replication you end up storing a tenth as many distinct data items. And, you know, that may actually be a good idea, depending on what your data is like, but it does mean that replication gives you less total data that's cached. And you can see there are points in the paper where they mention this tension. They don't come down on one side or the other, because they use both replication and sharding. Okay. So the highest level at which they're playing this game is between regions. And at this high level, each region has a complete replica of all the data. Right.
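The arithmetic in that answer is simple enough to write down explicitly, using the lecture's hypothetical numbers:

```python
# 10 machines, 1 GB of RAM each, used two different ways.
machines = 10
ram_per_machine_gb = 1

# Partitioning: every byte of RAM holds distinct data,
# so distinct capacity is the sum of all the RAM.
distinct_partitioned_gb = machines * ram_per_machine_gb

# Full replication: every machine ends up caching roughly the same
# popular items, so distinct capacity is just one machine's RAM.
distinct_replicated_gb = ram_per_machine_gb
```

Ten gigabytes of distinct items versus one gigabyte held in ten copies: replication trades away a factor of ten in distinct capacity for parallel serving of the same items.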
Each region has a complete set of database servers, with corresponding database servers for the same data, and assuming users are looking at more or less the same stuff, that means the memcache servers in the different regions are also storing more or less the same things, so you're basically replicating both the database servers and the memcache servers. And one point of this is that you want a complete copy of the site that's close to West Coast users, literally close on the internet, and another complete copy of the website that's close to users on the East Coast, again close on the internet. The internet's pretty fast, but coast to coast is 50 milliseconds or something, and if users have to wait through too many 50-millisecond intervals, they'll start to notice that amount of time. Another reason to replicate the data between the two regions is that these front ends, even to create a single web page for a user request, often need dozens or hundreds of distinct data items from the cache or the databases. And so the speed, the latency, the delay at which a front end can fetch these hundreds of items from memcache is quite important. So it's extremely important to have the front end only read from local memcache servers and local databases, so that it can do the hundreds of queries it needs for a web page very rapidly. If we instead partitioned the data between the two regions, then, you know, if I'm looking at my friends, and some of my friends are on the East Coast and some on the West Coast, that might require the front ends to actually make many requests, at 50 milliseconds each, to the other data center. And users would see this kind of latency and be very upset.
So another reason to replicate is to keep the front ends always close to all the data they need. This makes writes more expensive, because now a front end in the secondary region needs to send its writes all the way across the internet, but reads are far, far more frequent than writes, so it's a good trade-off. Although the paper doesn't mention it, it's possible that another reason for complete replication between the two sites is so that if the primary site goes down, perhaps they could switch the whole operation to the secondary site, but I don't know whether that's what they had in mind. Okay, so the story between regions is basically a story of replication between the two data centers. All right, now within a data center, within a region: in each region there's a single set of database servers. So at the database level, the data is sharded and not replicated inside each region. However, at the memcache level they actually use replication as well as sharding, through this notion of clusters. A given region actually supports multiple clusters of front ends; say there are two clusters in this region. Each cluster has a bunch of front ends and a bunch of memcache servers, and these are almost completely independent, so that a front end in cluster one sends all its reads to the local memcache servers, and on a miss it goes to the one shared set of database servers, and similarly each front end in the other cluster talks only to memcache servers in the same cluster. Why do they have multiple clusters? Why not just have essentially a single cluster, a single set of front end servers and a single set of memcache servers shared by all those front ends? One reason is that if you did that, then to scale up capacity you'd be adding more and more memcache servers and front ends to the same cluster.
And you don't get any performance win there for popular keys. The data in these memcache servers is sort of a mix: most of it is maybe only used by a small number of users, but there's some stuff that lots and lots of users need to look at. By using replication as well as sharding, they get multiple copies of the very popular keys, and therefore they get parallel serving of those keys across the different clusters. The reason not to let an individual cluster grow too large is that all the data within a cluster is partitioned over all the memcache servers, and any one front end is typically going to need data from probably every single memcache server eventually. So this means you have a sort of n-squared communication pattern between the front ends and the memcache servers. And to the extent that they're using TCP for the communication, that involves a lot of overhead, a lot of connection state for all the different TCP connections, so they wanted to limit the growth of this, and the way to do that is to make sure that no one cluster gets to be too big, so this n-squared doesn't get too large. And related to that is this incast congestion business they talk about: if a front end needs data from lots of memcache servers, it sends out the requests more or less all at the same time. That means this front end is going to get the responses from all the memcache servers at more or less the same time. And that may mean dozens or hundreds of packets arriving all at once, which, if you're not careful, will cause packet losses. That's incast congestion.
And in order to limit how bad that was, they had a bunch of techniques they talk about, but one of them was not making the clusters too large, so that the number of memcache servers a given front end talks to, which might be contributing to this incast, never got to be too large. Another reason the paper mentions is that behind all this is a big network in the data center, and it's hard to build networks that are both fast, as in many bits per second, and able to connect lots and lots of different computers. By breaking the data center up into these clusters, and having most of the communication stay within each cluster, they only need a modest-size, reasonably fast network for each cluster; they don't have to build a single network that can handle all of the traffic among all the computers of one giant cluster. So it limits how expensive the underlying network is. On the other hand, of course, they're replicating the data in the two clusters. And for items that aren't very popular and aren't really going to benefit from the performance win of having multiple copies, it's wasteful to sit on all this RAM, and we're talking about hundreds or thousands of servers here, so the amount of money they spend on RAM for the memcache servers is no joke. So, in addition to the pool of memcache servers inside each cluster, there's also this regional pool of memcache servers that's shared by all the clusters in a region. And they modified the software on the front end so that it knows when the data for a key is actually not used that often; instead of storing it on a memcache server in its own cluster, it stores that not-very-popular key on the appropriate memcache server of the regional pool.
And this is just sort of an admission that some data is not popular enough to warrant lots of replicas; they can save money by only caching a single copy. Right. So that's the kind of replication-versus-partitioning strategy they use inside each region. A difficulty they discuss is that when they want to create a new cluster in a data center, they actually have a sort of temporary performance problem as they're getting that cluster going. So suppose they decide to install a couple hundred machines to be a new cluster, with new front ends and new memcache servers, and then they fire it up and maybe cause half the users to start using the new cluster. Well, in the beginning there's nothing in these memcache servers, so all the front end servers are going to miss on the memcache servers and have to go to the databases. And at least at the beginning, until these memcache servers get populated with all the data that's used a lot, this is going to increase the load on the database servers absolutely enormously. Because before we added the new cluster, maybe the database servers only saw 1% of the reads, because maybe the memcache servers have a hit rate of, say, 99% for reads, so only 1% of all the reads went to the database servers. If we add a new cluster with nothing in its memcache servers and send half the traffic to it, it's going to see a 100% miss rate initially. Right. And so the overall miss rate will now be about 50%. So we've gone from these database servers serving 1% of the reads to them serving 50% of the reads. So at least in this imaginary example, by firing up this new cluster we may increase the load on the databases by a factor of 50.
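That factor-of-50 estimate follows from a little arithmetic we can check. The numbers here are the lecture's hypothetical 99% hit rate and a 50/50 traffic split, not measurements from the paper:

```python
warm_miss_rate = 0.01      # warm clusters: 99% hit rate, so 1% of reads hit the DB
cold_miss_rate = 1.0       # brand-new cluster: empty caches, everything misses
cold_traffic_share = 0.5   # half the users are sent to the new cluster

# Overall fraction of reads that now reach the database servers.
overall_miss_rate = (cold_traffic_share * cold_miss_rate
                     + (1 - cold_traffic_share) * warm_miss_rate)

# How much the database read load grows relative to before.
db_load_multiplier = overall_miss_rate / warm_miss_rate
```

The database load jumps from 1% of reads to roughly 50%, about a fifty-fold increase, which is why just flipping on an empty cluster would be a disaster.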
And chances are the database servers were running reasonably close to capacity, and certainly not a factor of 50 under capacity. So this would be the absolute end of the world if they just fired up a new cluster like that. Instead, they have this cold start idea, in which a new cluster is marked by some flag somewhere as being in a cold start state. In that situation, when a front end in the new cluster misses, first it checks its own local memcached. If that says no, I don't have the data, then the front end will ask the corresponding memcached in some warm cluster that already has the data; if it's popular data, chances are it will be cached there. The front end will get its data and then install it in the local memcached. Only if both the local memcached and the warm memcached don't have the data will this front end in the new cluster read from the database servers. And so they run in this kind of cold mode for a little while, the paper I think mentions a couple of hours, until the memcache servers in the new cluster start to have all the popular data, and then they can turn off this cold feature and just use the local cluster's memcache alone. All right. So another load problem the paper talks about, again deriving from this kind of look-aside caching strategy, is called a thundering herd. The scenario is this: there are lots of memcache servers, but some very popular piece of data is stored on this one memcached server. There's a whole bunch of front ends that are ordinarily reading that one piece of very popular data, so they're all constantly sending get requests for it. And of course there's some database server sitting back here that has the real copy of that data, but we're not bothering it because the data is cached.
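Here's a toy version of that cold-cluster read path. The function and variable names are mine, and the real client logic is more involved: check the local cluster's cache, then a warm cluster's cache, and only then the database, installing the result locally either way.

```python
def cold_cluster_read(key, local_cache, warm_cache, db):
    """Read path for a front end in a cluster marked 'cold'."""
    if key in local_cache:              # local hit: nothing special to do
        return local_cache[key]
    if key in warm_cache:               # borrow popular data from a warm cluster
        value = warm_cache[key]
    else:
        value = db[key]                 # only as a last resort hit the database
    local_cache[key] = value            # either way, warm up the local cache
    return value
```

After a couple of hours of this, the local caches hold the popular data and the cold flag can be turned off, reverting to the ordinary local-only read path.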
Well, suppose some front end comes along and modifies this very popular data. It's going to send a write to the database with the new data, and then it's going to send a delete to the memcached server, because that's the way writes work. So now we've just deleted this extremely popular data. We have all these front ends constantly sending gets for that data, and they're all going to miss at the same time. Having missed, they're all going to send a read request to the database, all at the same time. And so now this database is faced with maybe dozens or hundreds of simultaneous requests for this data. The load here is going to be pretty high, and it's particularly disappointing because we know all these requests are for the same key. The database is going to do the same work over and over again, responding with the latest written copy of that key, until finally the front ends get around to installing the new value in memcache and then people start hitting again. And so this is the thundering herd. What we'd really like, if there's a write and a delete and then a miss in memcache, is for the first front end that misses to fetch the data and install it, and for the other front ends to just take a deep breath and wait until the new data is cached. And that's just what the design does, with this thing called a lease, which is different from the leases we're used to, but that's what they call it. Let's start the scenario again from scratch. Suppose we have a popular piece of data. For the first front end that asks for data that's missing, the memcached will send back an error saying, no, I don't have the data in my cache, but it will install a lease, which is a big unique number. It picks a lease number, installs it in a table, and sends this lease token back to the front end.
And then other front ends that come in and ask for the same key will simply be told by the memcached to wait a quarter of a second or whatever, some reasonable amount of time, because the memcached will see that it has already issued a lease for that key. So there's potentially a lease per key. The server will notice it has already issued a lease for the key and tell these other front ends to wait. So only one of the front ends gets a lease. That front end then asks for the data from the database and gets the response back. Then it sends the put for the new data, with the key, the value it got, and the lease to prove that it was the one allowed to write the data. The memcached will look up the lease and say, ah, yes, you are the one the lease was granted to, and it will actually do the install. And by and by, these other front ends that were told to wait will reissue their reads, and now the data will be there. So we get just one request to the database instead of dozens or hundreds. And I think the sense in which this is a lease is that if the front end fails at an awkward moment and doesn't actually request the data from the database, or doesn't get around to installing it in memcached, eventually the memcached will delete the lease because it times out, and the next front end to ask will get a new lease, and we hope it will talk to the database and install new data. So yes, the answer to the question is that the lease does have a timeout in case the first front end fails. Okay, so these leases are the solution to that awkward problem. Another problem they had is: what if one of these memcache servers fails? If they don't do anything special when a memcache server fails, the front ends will send a request, they'll get back a timeout, and the network will say, gee, I couldn't contact that host, I never got a response. And what the read library software does then is send the request to the database.
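A toy model of this lease machinery might look like the following. It's simplified: real memcached leases also handle invalidation and rate limiting, and "wait" here is just a return value rather than an actual delay.

```python
import itertools

class LeasedCache:
    """Miss-time leases: the first misser fetches, later missers wait."""

    def __init__(self):
        self.cache = {}
        self.leases = {}                   # key -> outstanding lease token
        self._tokens = itertools.count(1)  # source of big unique numbers

    def get(self, key):
        if key in self.cache:
            return ("hit", self.cache[key])
        if key in self.leases:
            return ("wait", None)          # another client is already fetching
        token = next(self._tokens)
        self.leases[key] = token           # grant a lease to the first misser
        return ("miss", token)             # caller should read the DB, then set

    def set(self, key, value, token):
        if self.leases.get(key) != token:
            return False                   # no valid lease: ignore this set
        del self.leases[key]
        self.cache[key] = value
        return True
```

Only the holder of the token issued on the first miss can install the value, so the database sees one read instead of a thundering herd.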
So if a memcache server fails and we don't do anything special, the database is now going to be exposed directly to all the reads that memcache server was serving, and the memcache server may well have been serving, you know, a million reads per second. That may mean the database server would then be exposed to those million reads per second, and it's nowhere near fast enough to deal with them. Now, Facebook doesn't really talk about it in the paper, but they do have automated machinery to replace a failed memcache server. But it takes a while to set up a new memcache server and redirect all the front ends to the new server instead of the old one, so in the meantime they need a sort of temporary solution. And that's this gutter idea. So the scoop is that we have our front ends, the ordinary set of memcache servers, and the database. One of the memcache servers has failed, and we're waiting until the automatic replacement system replaces it. In the meantime, front ends are sending requests to it and getting a sort of server-did-not-respond error from the network. And there's presumably a small set of gutter servers, whose only purpose in life is to sit idle except when a real memcache server fails. When the front end gets an error back saying the get couldn't contact the memcache server, it'll send the same request to one of the gutter servers, and though the paper doesn't say, I imagine the front end will again hash the key in order to choose which gutter server to talk to. If the gutter server has the value, that's great. Otherwise, the front end will contact the database server to read the value, and then install it in the gutter server, in case somebody else asks for the same data while the main server is down.
The gutter servers will basically handle its requests, and there'll be a miss, handled by a lease, on each of the items that was in the failed memcache server, so at least there's no thundering herd. There will be some load on the database server, but hopefully this gutter server quickly picks up all the data that's in use, and by and by the failed server will be replaced and the front ends will know to talk to the replacement. And, and this is today's question, I think the reason they don't send the deletes to these gutter servers is that, since a gutter server could have taken over for any one, and maybe more than one, of the ordinary memcache servers, it could actually be caching any key. That would mean that whenever a front end needs to delete a key from memcache, or whenever the mcsqueal machinery on the database sends a delete for any key to the relevant memcache server, the natural design would be to also send a copy of that delete to every one of the gutter servers. And the same for front ends that are deleting data: they would delete from the memcache servers, but they would also have to delete potentially from any gutter server. That would double the amount of deletes that had to be sent around, even though most of the time these gutter servers aren't doing anything, don't cache anything, and it doesn't matter. And so in order to avoid all these extra deletes, they actually built the gutter servers so that they expire keys very rapidly, rather than hanging on to them until they're explicitly deleted. That was the answer to the question. All right, so I want to talk a bit about consistency, all at a super high level. The consistency problem is that there are lots of copies of the data. For any given piece of data, there's a primary database.
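The gutter fallback path a front end follows might be sketched like this. This is a guess at the shape of the client logic, not the actual implementation; the key point is that gutter entries expire on a short TTL instead of receiving explicit deletes.

```python
import time

def read_with_gutter(key, primary_get, gutter, db, gutter_ttl=1.0):
    """Try the normal memcache server; on failure, fall back to a gutter pool."""
    try:
        return primary_get(key)                 # ordinary memcache path
    except TimeoutError:                        # the primary server is down
        entry = gutter.get(key)
        if entry is not None:
            value, stored_at = entry
            if time.time() - stored_at < gutter_ttl:
                return value                    # still fresh: serve from gutter
        value = db[key]                         # otherwise read the database...
        gutter[key] = (value, time.time())      # ...and cache briefly in gutter
        return value
```

Because entries age out in about a second, stale gutter data can't survive long, which is what lets the system skip sending every delete to the gutter pool.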
There's a copy in the corresponding database server of each of the secondary regions. There's a copy of that key in one of the memcached servers in each local cluster. There may be copies of that key in the gutter servers. And there may be copies of the key in the memcache servers and gutter servers of each other region. So there are lots and lots of copies of every piece of data running around, and writes have to take effect on all those copies. Furthermore, the writes may come from multiple sources: the same key may be written at the same time by multiple front ends in this region, and maybe by front ends in other regions too. And so it's this concurrency, these multiple copies, and the multiple sources of writes, since there are multiple front ends, that creates a lot of opportunity not just for there to be stale data, but for stale data to be left in the system for long periods of time. And so I want to illustrate one of those problems. In a sense we've already talked a bit about this, when somebody asked why the front ends delete instead of updating; that's an instance of the same kind of problem: there are multiple sources of data, and so we have trouble enforcing correct order. But here's another example of a race, an update race, that, if they hadn't done something about it, would have left stale data indefinitely in memcache. And it's going to be a similar flavor to the previous example. So we have client one, C1. It wants to read a key, but memcache says it doesn't have the data; it's a miss. So C1 is going to read the data from the database, and let's say it gets back version one, v1. Meanwhile, client two, C2, wants to update this data. So it sends the write, key equals v2, to the database. And then the rule for writes, the code for writes that we saw, is that the next thing it does is the delete.
So C2 is going to delete the key from memcached, and that's safe, right? C2 doesn't really know what's in memcached, but deleting whatever is there is always safe, because a delete is certainly not going to cause stale data. And this is the sense in which the paper claims that delete is idempotent: it's always safe to delete. Now, recall the pseudocode for what a read does: if you miss and you read the data from the database, you're supposed to insert that data into memcached. So client one, which may have been slow, finally gets around to sending a set RPC to memcached, but it read version one, which is now an old, outdated version of the data from the database, and it's going to set that into memcache. One other thing that may have happened is that, as we know, whenever you write something, the database sends deletes to memcached, so maybe by this point the database will also have sent a delete for K to memcached, and so we get two deletes, but that doesn't really matter; these deletes may already have happened by the time client one gets around to setting this key. And so at this point, indefinitely, memcached will be caching a stale version of this data, and there's just no mechanism anymore, if the system worked in just this way, for the memcached to ever get the actual correct value; it's going to store and serve up stale data for key K forever. And because they ran into this, and while they're okay with data being somewhat out of date, they're not okay with data being out of date forever, because users will eventually notice that they're seeing ancient data, they had to make sure this scenario didn't happen.
And they solve this problem also with the lease mechanism, the same lease mechanism we described for the thundering herd, although with an extension that makes this case work. So what happens is that when memcached sends back a miss indication, saying the data wasn't in the cache, it's going to grant a lease. So we get the miss indication plus this lease, which is basically just a big unique number, and memcached remembers the association between this lease and this key; it knows that somebody out there has a lease to update this key. The new rule is that when the memcached server gets a delete, from either another client or from the database server, the memcached server, as well as deleting the item, is going to invalidate this lease. So as soon as either of these deletes comes in, assuming the deletes arrive first, the memcached server is going to delete this lease from its table of leases. The set carries the lease back from the front end. Now, when the set arrives, the memcached server will look at the lease and say: wait a minute, you don't have a lease for this key, or I invalidated that lease; I'm going to ignore this set. So if one of these deletes came in before the set, the lease would be invalidated, and the memcached server would ignore the set. That would mean the key would just stay missing from memcache, and the next client that tried to read that key would get a miss, would read the fresh data from the database, and would install it in memcache, and presumably the second time around, the second reader's lease would be valid. Okay, and indeed you should ask what happens if the order is different. So suppose these deletes, instead of happening before the set, were instead to happen after the set. We want to make sure this scheme still works then.
And the way things play out then is that, since the deletes were late and happened after the set, the memcached server wouldn't yet have deleted the lease from its table of leases. So the lease would still be there when the set came, and yes, indeed, it would accept the set, and we would be setting the key to a stale value. But our assumption this time was that the deletes are late, which means the deletes are yet to arrive, and when these deletes arrive, the stale data will be knocked out of the cache. So the stale data will be in the cache a little bit longer, but we won't have this situation where stale data is sitting in the cache indefinitely and never deleted. Any questions about this lease machinery? Okay. To wrap up: it's certainly fair to view a lot of the complexity of this system as stemming from the fact that it was put together out of pieces that didn't know about each other. It would be nice if, for example, memcached knew about the database, and memcached and the database cooperated in a consistency scheme. Perhaps if Facebook could have, at the very beginning, predicted how things would play out and what the problems would be, and if they had had enough engineers to work on it, they could have built from the beginning a system that provided all the things they needed: high performance, multi-data-center replication, partition, and everything. And there have been companies that have done that. The example I know of that's most directly comparable to the system in this paper, and if you care about this stuff you might want to look at it, is Yahoo's PNUTS storage system, which is a sort of designed-from-scratch system, different in many details, but it does provide multi-site replication with consistency and good performance.
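That race and its fix can be replayed in a few lines. This is a self-contained toy, separate from the real protocol: a delete that arrives before the stale set invalidates the lease, so the set is dropped; a delete that arrives after simply knocks the stale value out.

```python
cache, leases = {}, {}

def grant_lease(key, token):
    leases[key] = token            # memcached grants a lease on a miss

def delete(key):
    cache.pop(key, None)
    leases.pop(key, None)          # the new rule: a delete invalidates the lease

def leased_set(key, value, token):
    if leases.get(key) != token:
        return False               # lease invalidated: ignore the stale set
    del leases[key]
    cache[key] = value
    return True

# Order 1: C1 misses and is granted lease 7; C2's write causes a delete
# before C1's set arrives, so the stale set is rejected and the key stays
# missing, to be refilled with fresh data by the next reader.
grant_lease("K", 7)
delete("K")
assert leased_set("K", "v1", 7) is False and "K" not in cache

# Order 2: C1's set lands first (lease still valid), then the late delete
# removes the stale value, so it is not cached forever.
grant_lease("K", 8)
assert leased_set("K", "v1", 8) is True
delete("K")
assert "K" not in cache
```

In both orderings the invariant the lecture cares about holds: stale data may linger briefly, but it can never be cached indefinitely.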
So it's possible to do better; all the same issues are represented there, it just had a more integrated, perhaps more elegant, set of solutions. The takeaways for us from this paper: one is that, for them at least, and for many big operations, caching is absolutely vital to survive high load, and the caching is not so much about reducing latency; it's much more about hiding the enormous load from relatively slow storage servers. That's what caching is really doing for Facebook: concealing almost all the load from the database servers. Another takeaway is that in big systems you always need to be thinking about partition versus replication, and you need ways, either formal or informal, of deciding how much of your resources are going to be devoted to partitioning and how much to replication. And ideally, you'd do a better job than the system in this paper of integrating the different storage layers from the beginning, in order to achieve good consistency. Okay, that is all I have to say. Please ask me questions if you have any.