...a little bit about some of the work I did at Facebook. This is from a couple of years back, specifically around the architecture of the Facebook messaging system. Just to put some context on the talk: what you experience today as the Facebook messaging system actually used to be two systems, the chat system and a separate messaging system, and at some point they were unified. What I'm going to cover in this talk is not the chat system, but specifically the re-architecture of the messaging backend, which used to be based on MySQL, to a more modern implementation. One thing I want to make clear: I'm not a Facebook employee anymore. I left Facebook last year, so I'm going to be talking about a brief period of about six months when I was one of the key participants in the team that came up with the design. There is a lot of material on the net that you can find, and I've tried not to repeat that. There's a publication, I think from June, by the HBase team at Facebook, there have been Facebook engineering blog posts, and so on, so there's a lot of background material you can find on the web for free. I'll try to lend a little insight into how we came to our decision point, rather than focus on the implementation, because I was actually not part of the implementation; I moved on to other projects.

Having said that, this is a pretty hard problem. We've been hearing about a lot of hard problems, and this is definitely up there among the hardest. The stats are very simple. Facebook obviously wants to serve everybody on earth, so we were projecting that we'd have a billion users in short order, and that is almost true by now, I guess. Even if you assume very moderate usage of any email-like system, say 25 messages per day for every user and four kilobytes per message, and forget about the attachments, the photos, the Excel and Word files people send around, just do the math: it comes to about 10 terabytes a day, just for the message bodies.

But that's not all. If all your messages were just stored in one gigantic file, you still wouldn't be able to access them in a way that gives you the email interface you're all familiar with. So you have to add indexes. You have to maintain threads, like Gmail does. At Facebook, I think what they do today is keep an implicit thread for every friend, so you see all messages from a friend in one chain, plus threads for groups of people communicating. So there is this notion of threads, and you have to maintain indexes: what are the messages in a thread? You have to provide search, which means you have to maintain a keyword index. You have to maintain counts: when we log in, we expect to see an overall unread message count, and on every thread we want to see how many messages there are and how many are unread, and so on. So there's a variety of indexes and summaries you have to compute on top of the base data. And once you start adding those in, you realize that the actual amount of data you'd have to store is much more than 10 terabytes a day.
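Just to make the indexing requirements concrete, here is a minimal sketch of the kind of per-user index and summary structures described above: per-thread message lists, unread counts, and a keyword index. All of the names here (Thread, MailboxIndex, and so on) are hypothetical illustrations, not the actual Facebook schema.

```python
# Hypothetical sketch of the per-user mailbox metadata the talk describes:
# threads, per-thread message lists, unread counts, and a keyword index.
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Thread:
    thread_id: str
    message_ids: list = field(default_factory=list)  # ordered, newest last
    unread_count: int = 0

@dataclass
class MailboxIndex:
    threads: dict = field(default_factory=dict)  # thread_id -> Thread
    keyword_index: dict = field(default_factory=lambda: defaultdict(set))  # word -> message ids
    total_unread: int = 0

    def add_message(self, thread_id, message_id, body, unread=True):
        t = self.threads.setdefault(thread_id, Thread(thread_id))
        t.message_ids.append(message_id)
        if unread:
            t.unread_count += 1
            self.total_unread += 1
        # Crude keyword index; a real search index would be far more involved.
        for word in body.lower().split():
            self.keyword_index[word].add(message_id)
```

Even in this toy form you can see why the derived data (indexes, counts, summaries) ends up dwarfing the raw message bodies.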
Beyond the scale and the volume of the data, there are aspects that are standard for any company: we want the data to be replicated to a remote data center, so that it's available if the primary one goes down. Ideally, we'd want to concurrently update data from different data centers, so that whether you're logging in from India, Europe, or America, you hit the data center closest to you and we can make updates there. I think the abstract mentioned that this talk might cover consistency and the CAP stuff, and this is where it starts getting really hairy. Doing concurrent updates is very hard. The minimum we wanted was that there should be a copy of the data somewhere else, so that in case of disaster we can recover.

The other thing I wanted to mention: we talk about availability all the time, failures, five nines, four nines. One of the things people really liked internally at Facebook was not that things never went down. At the time I left we had maybe 5,000 databases or so, and what I used to hear was that every day we would lose three or four of them; they would just burn and crash, and somebody would bring up a replica manually. It wasn't automated. But none of you know about it, and none of you care, because from your point of view the service seems to be mostly up. The reality of the market is that if you bring down 1% of your service, nobody really cares; but if you bring down your entire service for five minutes, you get headlines in the New York Times: Facebook goes down, Amazon goes down, and so on. So a design requirement was that the system should never crash in its entirety. It should have independent points of failure. It's OK if some things are offline; losing all of it is not OK. And the overriding concern was that we can't lose data. If there's one way to undermine people's confidence in your messaging system, it's to end up losing their data. So that was priority number one.

One thing we did not have to solve was photos; I just took that out of the equation. The reason is that there's a system at Facebook already called Haystack; again, you can find a lot of documentation online. It's globally distributed, distributed within a data center, scales out linearly, and it's something they've perfected. It just works at this point, because Facebook is by far the largest photo store in the world. So the idea was: any attachment data, we'll just throw it there, and not worry about that problem.

The other key thing I want to highlight, and I cannot stress this enough, is that please take anything I say here with a grain of salt, because not all companies have the talent, engineering skill, and bandwidth that Facebook had. Our charter was: find the best design, something that is rock solid and will scale to a billion users. Don't worry too much about whether it works today. You can't start with something that's completely broken in concept, but if it's not perfect, if it needs features, then hey, we can hire anybody we want and get that built out.
And that's very different from an average startup, which has to choose something that just works, and get things done fast. There's always a tension between getting things perfect and moving fast, and moving fast is like the religion at Facebook. You've got to do things fast, including when the product changes. Zuck would just dream up something and say, OK guys, I want this shipped in two weeks, and you've got to do it. So that's the context here.

Let me move on to some of the more detailed, point-by-point stuff. If you look at the online documentation, you'll find that we ultimately settled on HBase, and I'll describe some of the other components. But how did we get to that point? There was a small team of six people. There was a lot of conflict; everybody had different ideas. But I think everybody agreed that whatever we chose had to have stupendous write throughput, because we were going to have messages pouring in like crazy. And we haven't even covered spam. For example, if you go to Yahoo Mail, you'll see a ton of spam that they don't throw away; they hold on to it for a month or so. So maybe that 25 messages per day was realistic for genuine messages, but if you include spam the number would be much higher. So: tremendous write throughput.

There were really only two choices. If you were going to go with a disk-based solution, you had to have a log-structured data structure to store data fast. There have been talks here about buffering and async and all of that, but at the end of the day, if you want to store data really fast on disk, you need a log-structured container. Again, there are tons of papers around; I'll talk a little about what we chose. The other way is to say, let's just put everything in flash. One of the designs that emerged internally at Facebook was a very clever one, where it was argued that if you look at your mailbox, say it has 10,000 or even 20,000 messages, and you store just the metadata, what time a message came, who it was from, that's not that much data. So maybe we could store the metadata of every mailbox in flash, potentially combined with memory, and be done. That might actually be an interesting solution, and I think it will get more and more tractable as flash prices go down. But the trouble is that it puts you in a bind: you can only store so much. Take a very simple example. When you open your mailbox or any thread, you see a bunch of messages pre-opened for you; it makes no sense to force you to click on every one of them. If you were implementing the backend of such a system, you'd want one IO, one request, to pull down everything the user wants to see in a single page view. With storing things on disk, you get that flexibility: you can store more stuff inline, small message bodies instead of just the metadata, and pull it all out and show it to the user really fast. Whereas with the flash solution, you were punting on the problem: you've got all these really small 4 KB messages, and who is going to store those? And then indexing is the other part.
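One concrete way to read the "store small bodies inline" argument is a record layout where messages below a size threshold keep their body next to the metadata, so one read of the recent data returns everything needed for a page view, while larger payloads live in a blob store (like Haystack) with only a reference kept inline. This is a purely illustrative sketch; the threshold, field names, and the blob_store object are assumptions, not the real design.

```python
# Illustrative only: small bodies stored inline with metadata, large payloads
# stored in an external blob store with a reference kept inline.
INLINE_THRESHOLD = 4 * 1024  # bytes; an assumed cutoff, not Facebook's actual one

def make_record(message_id, sender, timestamp, body, blob_store):
    record = {"id": message_id, "from": sender, "ts": timestamp}
    if len(body) <= INLINE_THRESHOLD:
        record["body"] = body                      # small message: body travels with metadata
    else:
        record["body_ref"] = blob_store.put(body)  # large message: keep only a pointer
    return record

def render(record, blob_store):
    # One lookup for small messages; an extra fetch only for large ones.
    if "body" in record:
        return record["body"]
    return blob_store.get(record["body_ref"])
```

The point is simply that a disk-based store gives you room to inline the common case, whereas a metadata-only flash design forces every body fetch somewhere else.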
I mean, you have to have the index. You can do bitmap indexes very efficiently, but it's very tricky to get this right with a general-purpose keyword index. So, moving on to more of the deep-down stuff. I wanted to show you this diagram. There's this thing called LSM trees, log-structured merge trees. I'm sure many people in the audience who have looked at these systems are aware of them. If you're not: the way it differs from a traditional RDBMS is that instead of storing one tree of all the data, with pointers from the root down to the leaf nodes, it stores things as a set of trees. The advantage is that as data comes in, you don't have to update older data; data just comes in and sits at the front, and then asynchronously, in the background, you go and merge it. All the hard work you punt on and do later. There's a lot of follow-on work from the original paper, but this is pretty much the standard design pattern for log-structured databases, and it gives very high write throughput because you're never accessing disk randomly; you're just appending at the front of your free space.

One of the coolest things about this is that it matches the messaging system's requirements very well. If you think about it, when you open your mailbox you tend to access messages you got recently, or old threads that just got a new message; those are the ones you click. And guess where those live? Everything that was updated or created recently lives at the head of this pile, in a small, separate data structure. That means in theory, if you got this organization right, you could cluster everything you need in a small part of your disk system and read it off very efficiently. The interesting part was that when we looked at the mailbox problem and thought about what the ideal disk organization should be, it was almost exactly an LSM tree. So we said, OK, let's find a system that is built around this concept and go with that. As many of you know, Bigtable was the most famous system that used LSM trees, and then HBase and Cassandra were both built around LSM trees. By the way, for MySQL users, there's Tokutek's TokuDB and a bunch of other storage engines that are now based on this pattern. The data structure is also inherently snapshotted: the last tree, the one on your rightmost side, holds most of the data, and every day you can just take the oldest tree and shovel it somewhere, like a NetApp snapshot or something. So it's very easy to back up.

Moving on quickly, because I don't have much time: the problem with this system is that writes are cheap but reads are expensive. The reason is that every time you come in and ask, hey, do you have this key, then instead of a traditional database that just walks from the root of one tree down to a leaf and finds it or misses it, you have to answer that question for every tree in the system. People have invented clever tricks like Bloom filters, but no matter what you do, you cannot get away from the fact that this is not a read-optimized system.
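To make the write path and the read penalty concrete, here is a minimal, purely illustrative LSM-style store: writes land in an in-memory table that is periodically frozen into an immutable segment, while reads have to consult the memtable and then every segment from newest to oldest. This is a toy sketch of the pattern, not HBase's or Cassandra's actual implementation, and it omits the on-disk format, write-ahead log, and Bloom filters entirely.

```python
# Toy LSM-style store: cheap appends, reads that may touch every segment.
class ToyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}          # newest data, in memory
        self.segments = []          # immutable "trees", newest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # Writes never touch old data; they only land in the memtable.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.segments.insert(0, dict(self.memtable))  # freeze as a new segment
            self.memtable = {}

    def get(self, key):
        # Reads must check the memtable, then every segment, newest first.
        if key in self.memtable:
            return self.memtable[key]
        for seg in self.segments:
            if key in seg:
                return seg[key]
        return None

    def compact(self):
        # Background merge: fold all segments into one; the newest value wins.
        merged = {}
        for seg in reversed(self.segments):   # oldest first, so newer updates overwrite
            merged.update(seg)
        self.segments = [merged] if merged else []
```

Notice how recently written keys are always found quickly (memtable or the first segment), which is exactly the mailbox access pattern the talk describes, while a cold key in the worst case costs one probe per segment.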
So the solution was: OK, this is great for write throughput; for read throughput, let's build an application server on top of this data store, and keep a very well-tuned cache in that app server of everything required to serve user page views and mailbox views. Now, once you start doing caching, you have cache coherency issues. Again, to keep things simple, we came up with a very simple policy: let's just have users bind to app servers. All updates to a mailbox and all reads of that mailbox go through this one app server, which also holds a nice in-memory cache, and there are no cache coherency issues. The only problem left to solve is: given a user, we have to find out which app server that user is hosted on. We knew how to solve this, but the actual solution was built after I left. I think right now, if you read the papers, you'll see there is a user discovery service built on ZooKeeper that is responsible for maintaining this mapping.

The other thing, going back to the LSM tree, one of my favorite topics: think about a user logging back in after a long time, when the cache is completely cold. How do you construct and populate that cache? If you were dumb about it, that could take a long time. But if you think about the tree organization I showed you, you realize that most of the data you need is going to live in the first of those trees, so if you were smart you could just scan the first tree, or the first few trees, and pretty much load everything you need to render the user's mailbox.

No single point of failure: we didn't want anything that could take everything down, but HDFS, as people may have mentioned, has a single name node and at the time didn't have high availability. OK, that's not too hard. Instead of having one gigantic HDFS and HBase instance with petabytes of data, let's have small ones. I think the current deployment has something like 100-node HBase clusters. I believe over time name node HA has also been added, but given the inherently federated and partitioned nature of the deployment, they actually don't care about it that much, which is almost ironic.

So OK, we decided on the data structure; we knew it had to be distributed; so we had a choice: Cassandra or HBase. At the time, and even now, there's also HyperTable, another good choice from a log-structured, distributed-store point of view. I didn't write it up because HyperTable's ecosystem is not that popular, so these are the two main systems. In a nutshell, and I don't know if I'll be able to go deep into this, we tested things out. We were in a mad scramble trying to get things out; we had to get the design right, but we actually had to ship things as well. So we tried out HBase, we tried out Cassandra. Big caveat: this was 2010, and everything has changed since then. And we tried out Facebook's internal version of Cassandra, not the open source tree, and maybe that was a mistake.
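As a toy sketch of the "users bind to app servers" idea: a directory maps each user to exactly one app server, so all reads and writes for that mailbox funnel through one process and its in-memory cache stays coherent by construction. The real system kept this mapping in a ZooKeeper-backed discovery service; the hash-based assignment and class below are only stand-ins for illustration.

```python
# Hypothetical user-to-app-server directory; the real mapping lives in a
# ZooKeeper-backed discovery service, not a hash function.
import hashlib

class UserDirectory:
    def __init__(self, app_servers):
        self.app_servers = list(app_servers)   # e.g. ["app-01", "app-02", ...]
        self.overrides = {}                    # explicit reassignments (failover, rebalancing)

    def lookup(self, user_id):
        if user_id in self.overrides:
            return self.overrides[user_id]
        h = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16)
        return self.app_servers[h % len(self.app_servers)]

    def reassign(self, user_id, server):
        # e.g. when a server dies and the user's mailbox is re-hosted elsewhere
        self.overrides[user_id] = server
```

The key property is simply that every request for a given user resolves to the same app server, so there is never a second copy of that user's cache to keep in sync.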
But in our testing, we had a really poor time with Cassandra, and this was just within one data center, not even across data centers. While we were doing all this, we said, OK, we really want to understand what's going on inside Cassandra. We started reading the papers and diving deep into the code, and I'll try to go a little deeper into this later, but for our use case, and that's a big if, HBase was the better choice and Cassandra wasn't.

Another thing I talked about was no data loss. One of the big worries we had was that if we screwed up, if we chose a system that ended up losing data, we'd basically all be fired; it's pretty simple. But we really trusted HDFS. It has its issues, it didn't have HA, it's kind of slow, and so on, but we had stored petabytes and petabytes on it for almost a couple of years by the time we were doing this evaluation, and we had never suffered any loss. In the spectrum of things we could think of, and I think Pramod was just saying that of all the things they looked at, MySQL was the most rock-solid thing, we had the same philosophy: we really, really trusted MySQL, but if we were going to trust anything after MySQL, it was probably HDFS. That definitely tilted the playing field quite a bit. And again, to reiterate, we could hire engineers, we could hire committers, so the main thing was that we could build out all the missing features, of which there were a lot.

To talk about disaster recovery: one of the things I really love about HBase, and this is common to many database technologies, is that it's really easy to do backups. You've got an inherently snapshotted system. Every night you take the full compacted snapshot tree of your entire system and ship it off to a remote data center. You're also shipping your write-ahead logs, the transaction logs that record the changes to the database, to the remote data center. And every night, once you've shipped the full compacted snapshot, you truncate your logs, because you don't need the stuff that's already reflected in the snapshot. This is so easy to do you could probably write a Python script around it, and you don't even have to get the timing exactly right: you don't have to synchronize steps two and three. As long as the truncation point is no later than the snapshot, so you never drop changes that haven't been reflected in the snapshot, you're OK, because the logs get replayed on recovery. And that's actually how things work: if you look at the papers, this is exactly how they're doing backups to a remote data center today.

Right, so we tested. This slide is actually not part of the original presentation; I fortunately still had my report from back then sitting in a PowerPoint somewhere on my laptop, and I just took a screenshot of it. No point going into the details, but what you can see is that we were running a real HBase cluster, we had an application server, and we simulated about 12 million users with a standard workload, because we wanted to test this whole concept out.
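Since the talk says you could "probably write a Python script around it," here is a minimal sketch of that nightly backup loop under assumed helpers. The functions on the cluster object (take_compacted_snapshot, list_wal_segments, truncate_wals_up_to) are hypothetical placeholders, not real HBase tooling; the only point is the ordering constraint on truncation.

```python
# Sketch of the nightly backup described above. All cluster/remote_dc methods
# are hypothetical placeholders standing in for the real copy mechanisms.
def nightly_backup(cluster, remote_dc):
    # 1. Take the full compacted snapshot and note the log position it covers.
    snapshot, snapshot_seq = cluster.take_compacted_snapshot()
    remote_dc.store(snapshot)

    # 2. Ship the write-ahead logs; this can happen continuously and does not
    #    need to be synchronized with step 1.
    for wal in cluster.list_wal_segments():
        remote_dc.store(wal)

    # 3. Truncate only logs already covered by the shipped snapshot. Anything
    #    newer must stay, because it is replayed on top of the snapshot during
    #    recovery.
    cluster.truncate_wals_up_to(snapshot_seq)
```

The safety argument is exactly the one in the talk: as long as you never truncate past the snapshot point, a remote snapshot plus replayed logs reconstructs the current state.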
Hey, is the app server going to hold up given a reasonable access pattern? Is it actually going to be able to shield HBase from most of the reads? And so on. This in itself was a very interesting experience for me; I realized that writing a cache in Java is probably not the best idea.

I'll have to move very quickly now. Flash: I wanted to talk a little bit about this. Flash is big at Facebook. Every time we design a storage system, the question comes up: how are you going to use flash? Because a few years from now, that's going to be the thing that takes over everything. So, a few different ideas. We quickly realized that, again, with LSM trees the recent data is clustered, so we can put that on flash. There's a cache; we can put that on flash. When the app server evicts things from its cache, we can put those on flash. So there are a lot of simple ways of exploiting flash.

Finishing off with things that, even after leaving the project, I kept feeling we hadn't solved. HDFS is too big. When you build systems in-house, you're better off building them out of small components, but here we have this pretty big code base that does one thing, and every time there's a bug or a performance issue, you have to touch large parts of the code. It would be nice if we actually had small pieces: a data node, a block manager, a name node. And if you look at this architecture closely, you'll realize that HBase doesn't actually need a name system; it has its own namespace. All it needs is a block manager. So yeah, things are not perfect. I wish these systems were built as smaller components so we could choose best of breed, but we had to live with what we had.

And, if you've followed the talk so far, we couldn't quite cover the cross-data-center concurrency case. We gave up on concurrent updates; we punted on it. We said, OK, in the future, if we really want to set up a data center in Europe or somewhere, where we want European users' data to live in Europe, what we're going to do is just federate users: these users go to the US, these users go to Europe, and we'll maintain a global registry that maps users to continents or data centers. So we punted the problem: who maintains this global registry, and how do you deal with concurrent updates and partitions on that? But the gut feeling we had was that it's a much simpler problem. It's a much, much smaller amount of data, the updates are very infrequent, and the updates are an optimization: if a European user keeps hitting an American data center, it's not a big deal. So we could do globally consistent writes at leisure and nobody would care.

So I think I'm out of time. Let me stop here, because I don't think I can cover the controversial part. Should I stop? Or... All right, OK, I'll show it. This is the really controversial part: Cassandra versus HBase. I'll try to be really quick about it. These systems evolved from different philosophies. Cassandra believes in a flat earth. If you look at Cassandra, it's a symmetric system: all nodes are in a ring, and there's no notion of "this is the stuff that lives in data center A versus data center B, or rack A versus rack B." But the world is not flat; the world is hierarchical.
You've got the PCI bus, you've got the rack, the data center, the region, the continent. You've got different partition properties at each of these boundaries, and different latencies. One of the best things you can find on the web is Jeff Dean's list of numbers every engineer should know: L1 cache latency, L2 cache latency, memory latency, same data center, cross-continent, and so on. I just wanted to put that out there; this is not a criticism by itself, it's just a starting philosophy. Another starting philosophy: Cassandra does not believe in centralization. Everything is independent. There are no special roles for any node. There is no central anything; there is no central commit log in the system. If you want to ask, what are the transactions that have flowed through my system in the last 24 hours, you'd actually have to read two or three copies of the commit logs and try to reconstruct that. I've been working a lot in Ruby, so I was thinking of this as the "do repeat your reads" paradigm instead of the DRY paradigm.

And philosophies have consequences; that's the unfortunate part. The only reasonable configuration in Cassandra that you can come up with is the quorum configuration: maintain three copies, and require two replicas for a successful read and two for a successful write, so that every read and every write have at least one replica in common and you can always see the most recent write. Now, the thing you would immediately notice is that I've been talking a lot about how we chose a write-optimized system that was not read-optimized, and we had to build a cache around it to give it good read properties. But here we have a backend system where, if we chose it, we would have to read twice to get a consistent read. And we just couldn't do it. We were bound by spindles, by IOs per second; we were not using flash when we started out. So we had to choose a system that had good read properties in addition to write throughput.

Again, if you go out and read the internet, you'll find people still debating whether strong consistency is actually possible in Cassandra or not, and there's no point in me trying to go into that. My general point is: if five scientists are debating it on Twitter all day, I cannot agree that it has strong consistency. And if the science has bugs, imagine what kind of code you would write from it. We can barely get the code right when we understand the science completely; in this case, it's almost unimaginable.

The other key thing you might want to take away from here is that a distributed storage system is not a distributed database. Simple example: you've got a bad disk or a bad block. You go to a system like HDFS and say, OK, Mr. HDFS, I lost a block, can you please recover it? And Mr. HDFS will say, yeah, I know about these three or four other replicas floating around; I'll just give you back one of them, and by the way, I'll also re-replicate it to cover for the failed drive or machine.
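To make the quorum point concrete: with N = 3 replicas, requiring W = 2 acknowledged writes and R = 2 read responses gives R + W > N, so every read overlaps at least one replica that saw the latest write. That overlap is exactly why a "consistent read" costs two replica reads instead of one. A tiny, purely illustrative sketch of the arithmetic and read path, assuming replicas expose a get(key) returning a (timestamp, value) pair:

```python
# Quorum arithmetic: with N replicas, a write quorum W and a read quorum R
# overlap on at least one replica whenever R + W > N.
N, W, R = 3, 2, 2
assert R + W > N   # 2 + 2 > 3: every read touches some replica with the latest write

def quorum_read(replicas, key, r=R):
    # Ask replicas until r of them answer, then return the newest version seen.
    # Each replica is assumed (hypothetically) to expose get(key) -> (timestamp, value) or None.
    responses = []
    for replica in replicas:
        result = replica.get(key)
        if result is not None:
            responses.append(result)
        if len(responses) >= r:
            break
    if len(responses) < r:
        raise RuntimeError("quorum not reached")
    return max(responses, key=lambda pair: pair[0])[1]   # newest timestamp wins
```

The cost the talk objects to is visible in the loop: on a spindle-bound cluster, every consistent read burns at least R disk-touching requests instead of one.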
If you go to a distributed database, it'll say: oh man, I had multiple databases, multiple trees, using this drive, and I really don't know which of them this block affected. So you've got this missing block, and lots of things were affected. Yes, you can recover, but how? You now have to merge those databases back together. There are very fancy ways of doing it; it can be done; they make nice academic papers. But the thing is that you just took a problem that was very, very simple and converted it into a problem that is very, very hard. And again, you can come up with the right science, but ultimately you have to ask: can I really implement that correctly, and how many bugs will I have if I don't? That's basic stuff; you've got to get it right.

Another thing: the vast majority of programmers, including here in this room, write programs like this: read a value from the database, do some transformation, write it back. If you were going to do an increment operation, you would read a counter, increment it, and write it back. Now imagine I come and tell you that you can't do that anymore, because you're working with an eventually consistent system. Sorry, in my hurry I've switched from talking about strong consistency to eventual consistency; the reason I switched is that you can get high write performance if you run in an eventually consistent mode. But what I'm trying to point out is that standard cookie-cutter programming becomes very difficult on an eventually consistent database or storage system. Now, some things are possible: you could take that increment operation and push it down into the database and say, can you please perform this increment correctly, and yes, that can be done. But you cannot take your arbitrary business logic in step number two and push it down into the database.

So one of the things I took away from this analysis and exercise was that you should not try to do conflict resolution at the row level, in the guts of the database. That is just the absolutely wrong approach. The reason is very simple. Even the system I've described to you is actually a very simple system; it's just a messaging store. Now imagine you were doing something more complicated, where you need to make updates to multiple rows as part of one thing that happened in your system, and that thing has to be atomic. For example, if I get a document into my collection, I have to put the document in some database and update three or four different indexes, and maybe those are all sharded differently and live in different data stores. So I've got this concept of a global transaction that has to be performed atomically. If every one of those individual updates is only eventually consistent, it doesn't help you, because what you want is for those five actions to all turn out the same way: either all of them succeed or all of them fail. It makes no sense for some to succeed and some to fail.
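The increment example in code: a naive read-modify-write loses updates when two clients race, whereas pushing the increment into the store lets the store apply the delta atomically. Real systems expose this as server-side atomic increments or counters; the ToyStore below is a deliberately simplified stand-in, not any actual database API, and step two of the "unsafe" path is exactly the arbitrary client-side logic you cannot push down in general.

```python
# Naive read-modify-write vs. a "pushed down" increment the store applies itself.
import threading

class ToyStore:
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def get(self, key):
        return self._data.get(key, 0)

    def put(self, key, value):
        self._data[key] = value

    def increment(self, key, amount=1):
        # The "pushed down" version: the store applies the delta atomically.
        with self._lock:
            self._data[key] = self._data.get(key, 0) + amount
            return self._data[key]

store = ToyStore()

def unsafe_increment(key):
    value = store.get(key)     # step 1: read
    value = value + 1          # step 2: arbitrary client-side logic
    store.put(key, value)      # step 3: write back; two racing clients can both
                               # read 7 and both write 8, losing an increment

def safe_increment(key):
    store.increment(key)       # the store itself performs the increment
```

The increment happens to be pushable because it is a single, commutative operation on one value; the multi-row, multi-index "global transaction" in the next paragraph is not, which is why conflict resolution has to move up a layer.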
So my general feeling, wearing more of an architect hat, is that you should try to resolve conflicts at the highest possible layer in your system at which you can describe the outcomes you want. This is almost like a transaction monitor; it's the layer where you would also do transactions. And this is actually the direction Facebook has gravitated in: there's a team that now works on transporting global events across data centers, across pipes, and that becomes the layer at which you say, OK, we had two conflicting events because they wanted to modify the same key, we optimistically allowed both to go ahead in their own data centers, and now that we've detected a conflict, we have to do some conflict resolution. And that resolution has to happen at the level of the transaction that spans multiple rows, multiple databases, whatever; it cannot be at the level of the database. All right, time's up? OK, yeah, I should stop. There's a lot of stuff on the web; you can find all of this. Unfortunately, there's no time for questions. All right. Thank you for that second wonderful talk on today's topic. Thank you so much.