Hi, my name is Nicholas Spiegelberg, I'm with Facebook Messages, and for the first couple of minutes before we get started here, I'm going to let you see my big, beautifully bald manager talk about the Facebook Messages product.

We built the Facebook Messages product, and we realized pretty quickly that the problems we face are actually bigger than just Facebook. Sharing should be simpler than it is. If I want to reach out to my cousin Danny, I have to use text messages, because he just graduated from high school and that's all he really uses. My grandmother only uses email; I can't send her a Facebook message and expect her to get back to me. I'm keeping this lookup table in my head of how to reach out to each person. Why isn't this easier? Why don't all these technologies work together? All you should need to send someone a message is the person and the message, and that's it. People should share however they want to share, and if they want to connect via email, they should be able to do that. So we're giving every user of Facebook the option of getting an @facebook.com email address. This product is not email, but it allows people who do use email to connect with the rest of us.

We've modeled this entire system after chat. There are no subject lines. There's no CC. There's no BCC. When you press the enter key, it sends your message right away, in real time. We want this to feel like a conversation among friends, and when you come back to your computer or your phone, you should be able to pick up right where you left off.

It's always seemed like a problem to me that when I look at my email, occasionally I see a message from my mother sandwiched between a bank statement and a bill. We should be able to do better than that. So we created the social inbox. When you log into Facebook and look at your messages, all you're going to see by default are messages from your friends and their friends, and that's it.

Once you give out your email address or phone number, it's just a matter of time before it winds up in the hands of the wrong person, and your only options at that point are to change it or live with the compromise. We believe people should have control over what gets delivered to their inbox, no matter what the medium. And so with the new Facebook Messages, if you change your privacy settings such that only friends can send you messages, then we'll actually bounce emails that come from anyone who's not your friend. By default, the inbox only shows messages from your friends and their friends. But if your grandmother who only uses email sends you a message and it finds its way into the Other folder, you can always promote it into your inbox, and from then on your conversations with her will be front and center. And of course, you can block anyone or any email address from sending you messages. Between this and our privacy settings, it's like your own personal do-not-email list.

The most important part of Messages is context. You know, email is organized by subject lines. But when we looked at the subject lines for Facebook messages, the top three were "no subject," "hi," and "yo." That's not a great way to organize conversations. We decided to organize conversations around people, not subjects. It also made sense to bring together all the different conversations you've ever had with one person into a single thread. Whether those happened in chat or SMS or email, they all now live in one place.
If you want to reach out to your friend via text and they want to respond via email, that's possible. If you want to use chat while they're using Facebook Messages, that'll work too. And all those different communications will live in one conversation. Imagine you had the entire history of your conversations with your boyfriend or your girlfriend, everything from "hey, you want to get coffee later?" all the way to "you've got to pick up the kids tonight at soccer practice." My grandmother had that. It was a box of letters written by my grandfather from when they were dating. That kind of thing is increasingly rare, and I'm left to ask, where is my box of letters? It's locked up on a phone. It's locked up in email. It's not in one place, until now.

Americans, we love to act sentimental, don't we? Hi, so my name is Nicholas Spiegelberg. I'm a software engineer with Facebook. I joined Facebook a year and a half ago, right when this Messages product was just getting off the ground, and I've been working on the new storage layer that we're using to power Messages and the high volume of data that goes through it.

So I want to talk a little bit about the technology behind Messages, and in particular the open source technology behind Messages. We use a lot of open source solutions, and we make sure that we publish our work on those open source solutions as quickly as possible. For us, it's really beneficial that we don't have all this secret sauce, that we're contributing back to the community and working together to build great, scalable technology solutions.

So, the Messages product. I don't know if you've seen the new one; again, I'm more on the infrastructure side than the product side. But the big thing is being able to carry on a chat conversation with somebody else who's on SMS, without having to really understand what medium the messages are traveling over, to have it all be seamless and all be focused on the people. The subject isn't "hey" or "yo"; the subject is my friend Will Bailey. That's the difference.

So these were the major pieces we needed for the new Facebook messaging product. We needed a client-side front-end product. We needed a mail pipeline to handle the new email; we already had sort of an XMPP layer with chat. We needed to add spam filters to make sure that you only got emails you cared about. Then we needed an application server: with smaller services you could just keep the load on PHP, but here we needed to move it to something like Java or C++ where we could handle the load better. And then the main area that I work on and have expertise in is the last two areas, which are storage and migration.

So what sorts of things do we need to store? The main ways we communicate within the Facebook Messages product are our traditional messages pipeline, which of course already had tons of data in it, and our chat, which went from being this temporary thing that goes away every week or so to being something persistent that's interleaved with your messages. You can chat with somebody who has an email address, you can chat with somebody who has SMS, and all that stuff is logged in one nice big personal history together. So what sort of volume is our infrastructure having to handle for this new Messages product?
We haven't taken measurements recently, but the most recent measurements we sat down and did were for the monthly data prior to launch, which was November. At the time, we were doing 15 billion messages a month with our legacy messages product, and we were doing 120 billion chat messages. Obviously our regular messages are a little bit larger; they were about 1 KB when we looked at it, and it was about 100 bytes for chat messages. But still, between those two (15 billion at roughly 1 KB is about 15 TB, plus 120 billion at roughly 100 bytes is another 12 TB or so), that means we're taking in around 25 terabytes worth of data every single month that we need to store.

The other interesting thing about this data is that it's constantly growing. Comparing the messages people add against the messages people delete, the deletions are a much smaller portion. So we need a data store solution that can not only handle this monthly volume, but can handle this monthly volume coming in for years, as the data store grows.

What's really interesting is, for our data back end, one thing that we did not choose was MySQL. We actually decided to use HBase for our small and medium data: your normal messages, your chats, a cached snapshot of your front page, and also your inverted search index. For attachments and large messages, we already had a really good existing infrastructure with Haystack, which is a separate system that team is working on making more public as well. We have it handle attachments for us so we can really focus on the small-to-medium data load that's constantly growing.

The basic gist of our architecture is that we have this client front end. If you're used to the MySQL world, what you realize is that when you go past the scale of a single server, you need to shard. And in Facebook terms, those shards mean thousands and thousands of servers sharding a single table. That's not really acceptable. We can handle smaller numbers of shards, but we really would like less granular sharding. We'd like the ability to grow our cluster as we need to, without needing to reshard our entire database.

So we have these cells for our HBase storage, and we have a user directory service which uses ZooKeeper to map every single Facebook ID to a cell, your traditional sharding map. The client asks, what's the cell for this user? The response comes back: cell one. After that, the client can talk directly to the cell, which has the application server, which has all the storage information. The application server can then contact Haystack or HBase as needed. And again, like I said, we have multiple shards of these cells, but versus MySQL, a single cell contains a lot more data. We have fewer shards, and I'll talk about that more in a minute.

So what sorts of open source solutions are we using here? There are a couple on here that are very popular already. We have memcached: you can throw a memcached box in front, and all of a sudden a lot of your read worries go away. So on the back end you need decent read performance, you need good write performance, and what you really have to worry about now are the maintenance headaches: having to shard, having hot regions, stuff like that. ZooKeeper is actually used for our user directory service, which does this mapping.
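To make that lookup concrete, here is a minimal sketch of what a directory-service read could look like, assuming a hypothetical znode layout, host name, and per-user mapping (none of this is Facebook's actual code; a real directory would more likely map ID ranges than individual users):

    import org.apache.zookeeper.ZooKeeper;

    // Toy user-directory lookup: ZooKeeper maps a user ID to the cell that
    // owns that user; the client then talks directly to that cell's
    // application server. Paths and host names below are made up.
    public class UserDirectorySketch {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("zkhost:2181", 30000, event -> { });

            // e.g. the data stored at /directory/12345 is the string "cell1"
            byte[] data = zk.getData("/directory/" + 12345L, false, null);
            String cell = new String(data, "UTF-8");

            // From here the client contacts cell1's application server,
            // which in turn talks to HBase or Haystack as needed.
            System.out.println("user 12345 lives in " + cell);
            zk.close();
        }
    }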
It's an eventually consistent data store, but it's really meant for small amounts of data; it's not meant for multiple terabytes worth of data. Of the five, HBase is really the most controversial. It's the one that we had to talk about the most, and it's the one where we had to do the most development to get it up to handling this sort of data volume. It's basically a database storage engine for message data. The buzzword is NoSQL, right? But basically, we had to evaluate a lot of different data stores, because we realized that this sharding problem with MySQL wasn't going to go away, and it was going to require a lot of effort if we wanted to stay on it.

Underneath, HBase uses HDFS, which is a distributed file system, so you can basically think of, say, 100 nodes as a single file system server. It handles things automatically: if a server dies, it handles it, doing multiple replication and making sure you have data locality between your application and your file system. And then finally, we have Hadoop for a lot of MapReduce jobs. We got a question a couple of months ago, like, how much do we use MapReduce? And my co-worker immediately responded, this is our bread and butter. We're going to go down if we don't have something like Hadoop to handle all the asynchronous tasks that we need for the application and migration: doing caching, doing snapshots.

So the other ones are fairly well known. The big mystery here is HBase. What exactly is HBase? HBase is a strictly consistent NoSQL architecture. You have a ZooKeeper cluster, which is doing eventual consistency, but you can have, say, five ZooKeeper nodes so you don't have a single point of failure there. The ZooKeeper cluster tracks who the current master is. The master's job, if a server dies, is to do recovery and to do load balancing. And the basic idea is you have these region servers, your individual data stores, but your actual data is sharded logically rather than physically. Each region server has a number of regions. So say you have 10 regions per region server: if a single server dies, you can distribute the load, and each remaining server only takes one-tenth of that load, instead of your traditional setup where maybe you have a master and a slave doing double reads, and all of a sudden you've got a 200% load increase if one of them dies.

The main ideas to keep in mind (I'll get a little more technical at the end) are these. You have this region server that's writing new data, and it's writing it to a log, so it's all doing sequential writes on disk. One of the problems we have with MySQL is that you can get into cases where you're doing a lot of random writes because of the tree updates, whereas here you have this log that's all sequential writes on disk. Then, as an asynchronous task, when the log gets too large, it flushes that data to basically a two-level B-tree-like file called a store file. So when you do a get, you're basically doing an n-way merge of multiple store files. And when that n gets too large, of course your get performance becomes problematic; you're having to do a lot of disk reads. That's where compaction comes in, to unify those n files into one.
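That write path is easier to see in miniature. Here's a toy sketch of the same log-structured idea (illustrative Java, not HBase's actual code): writes append to a sequential log and a sorted in-memory map, the map is flushed into immutable sorted store files, a get is an n-way merge across them, and compaction folds the store files back into one.

    import java.util.*;

    // Toy LSM-style store: sequential log + memstore + store files + compaction.
    public class ToyLsmStore {
        private final NavigableMap<String, String> memstore = new TreeMap<>();
        private final List<NavigableMap<String, String>> storeFiles = new ArrayList<>();
        private final List<String> log = new ArrayList<>(); // stands in for the on-disk sequential log
        private static final int FLUSH_THRESHOLD = 4;

        public void put(String key, String value) {
            log.add(key + "=" + value); // sequential append; replayed on crash recovery
            memstore.put(key, value);   // sorted in-memory copy
            if (memstore.size() >= FLUSH_THRESHOLD) flush(); // asynchronous in real HBase
        }

        private void flush() {
            storeFiles.add(new TreeMap<>(memstore)); // immutable sorted run ("store file")
            memstore.clear();
            log.clear(); // these entries are now durable in the store file
        }

        // A get is an n-way merge: memstore first, then store files newest to oldest.
        public String get(String key) {
            if (memstore.containsKey(key)) return memstore.get(key);
            for (int i = storeFiles.size() - 1; i >= 0; i--) {
                String v = storeFiles.get(i).get(key);
                if (v != null) return v;
            }
            return null;
        }

        // Compaction: merge the n store files into one so gets touch fewer files.
        public void compact() {
            NavigableMap<String, String> merged = new TreeMap<>();
            for (NavigableMap<String, String> sf : storeFiles) merged.putAll(sf); // oldest first, newer wins
            storeFiles.clear();
            storeFiles.add(merged);
        }
    }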
And then the last thing is that, like I said, a region server has multiple regions, and if a single region gets too large on a single server, it can split that region into two. Again, you get this scenario where maybe a region server starts off with only 10 regions and not a lot of data, so data recovery is really quick; there's not a lot of stuff to replay. Then your data starts growing, splitting kicks in, and now instead of 10 regions you have 40 regions. So now you have four times the amount of data, but you can distribute the recovery into 40 parallel tasks, so you can still have fast recovery when a single server goes down.

So, choosing HBase. It's a really hot topic, what sort of NoSQL data store you should use and what the trade-offs are. I think people tend to get very zealous about this; they have their favorites and they say why everything else is bad. We don't really look at it that way. We were looking for a very pragmatic solution with a very low time to market. Something that's interesting: one of the big competing NoSQL data stores is Cassandra, and the original developers of Cassandra are actually on the team that's using HBase. Cassandra is a great database system. It's just that when we sat down and evaluated, for this exact product, what do we need and what gives us the lowest time to market, we saw that HBase had a number of advantages.

The first advantage is that it has a strong consistency model. A lot of people think that eventual consistency is not that big a deal, that it's not that hard to program against. Good luck if you think that; you might not be doing it quite the correct way. It's really rough. Even with ZooKeeper, and having, say, a thousand lines of code in HBase to deal with its eventual consistency, we had a lot of problems, a lot of really subtle bugs that showed up. The biggest problem with eventual consistency is this: you do a write on X, then you immediately do a read. You really have no guarantee that you're up to date when you do the read, because your write doesn't go to all the servers at once, so you could be getting stale data immediately after you wrote it. So strong consistency was a great thing to us about HBase.
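Here's a tiny sketch of that stale-read hazard, under an assumed two-replica, delayed-propagation model (illustrative only, not any real system's replication code):

    import java.util.*;

    // Under eventual consistency, a write lands on one replica and the other
    // catches up "eventually". Reads may hit either replica, so a read right
    // after a write can return stale data.
    public class StaleReadDemo {
        static Map<String, String> replicaA = new HashMap<>();
        static Map<String, String> replicaB = new HashMap<>();
        static Random coin = new Random();

        static void eventuallyConsistentWrite(String k, String v) {
            replicaA.put(k, v);
            // replicaB.put(k, v) happens later, asynchronously
        }

        static String read(String k) {
            return coin.nextBoolean() ? replicaA.get(k) : replicaB.get(k);
        }

        public static void main(String[] args) {
            eventuallyConsistentWrite("x", "1");
            // Can print null: the read hit the replica the write hasn't
            // reached yet. HBase's model (a single region server owns each
            // row at any time) rules this out.
            System.out.println(read("x"));
        }
    }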
The second advantage is automatic failover. Like I said, you have these region servers, and if they die, HBase can automatically parallelize recovery. Here you're trying to get it down to seconds between when a server dies and when all the users are back up. This is really great when failure is the norm and you're on commodity-class machines that need a lot of servicing. With MySQL, you get into these issues where you do this master-slave thing and you shard it. Say you have a thousand servers: 500 masters, 500 slaves. One server dies. Well, you've got to get ops on it right away, because if, say, a master dies and then its slave dies too, you've just had data loss. Whereas here, it's automatically redistributing the load for you. We had situations in our clusters where 10% of the machines were down while we were doing a dark launch, and you just didn't notice it. It automatically failed over for you; it automatically handled the load balancing for you. And getting those 10% of machines back online was an asynchronous background task instead of something you really had to worry about right then, which is a huge benefit for us. So that covers the automatic failover over the multiple shards, preventing cascading failures because you can balance the load.

Then there's compression: we do LZO compression, and we get about a five-to-one ratio on disk, which is really great. There's read-modify-write operation support, like counter increments; we have a lot of applications that overwrite data or increment data, and a lot of HBase was optimized for that counter-stats use case. And MapReduce is supported out of the box. An interesting thing: if you look at the old SVN revisions, Doug Cutting, who is an old Hadoop guy, shows up in svn blame a couple of times. There's this whole ecosystem where the same people that worked on Hadoop and HDFS were also working on HBase. And we have a large data storage team, so it's great that we can take data storage people who have experience with these subsystems, easily bring them on, and have them help out with the HBase use case at times. You don't necessarily have to separate your data analytics team and your database team, which is good.

HBase uses HDFS, like I said, and HDFS has a lot of really attractive features. When we were looking at, say, Cassandra, we were going to have to build a strict consistency model ourselves, whereas HDFS just had this out of the box. It had checksums. I know Cassandra has been working on getting some of those things in. But the nice thing is that here we had developers with three years' experience who knew exactly what was wrong with HDFS, exactly where the little pain points were, and how to administer it. So we really only had to focus on the database side of things; we didn't have to focus on data loss due to file system replication. Like I said, it's battle-tested as well. It's running petabyte-scale clusters. For a lot of our Hive clusters, our Hadoop clusters, HDFS is the core file system technology underlying all of that. (That doesn't show up that well on the slide.)

An interesting thing: a lot of people ask about the cluster layout. Like I said, we have multiple clusters that we're using to handle this messaging product, but when we're talking about a server cluster, we're talking about 100 servers, and we arrange them so that you have a switch per rack and then a master switch connecting the racks, with 20 servers per rack. When you're having to write to three replicas, one of the big problems you get into is network I/O. So if you can get a lot of data locality and a lot of rack locality, maybe you're only having to send one replica over the master switch, versus something like the Cassandra model where you're writing all three simultaneously and might be flooding the master switch. So we basically separate it into 20 servers per rack, five racks. We try to isolate all of our components so that we can really trace what happened if a master goes down. As you saw, there's HBase, there's HDFS, and there's Hadoop, so there are a lot of big subsystems and processes all interacting; being able to isolate them is really good for debugging. For the region servers, we try to go for more hard drives, because it's better to have four one-terabyte hard drives than one four-terabyte hard drive: with one drive you get bottlenecked on disk writes, whereas with four one-terabyte drives you can parallelize four ways.
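Here's a toy sketch of that rack-aware placement, under an assumed topology (not Facebook's or HDFS's actual code): with a chained write pipeline, only one copy of the data has to cross the master switch, because the second and third replicas land in the same remote rack.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    // HDFS-style placement: local node, a node on a different rack,
    // then a second node on that same remote rack.
    public class RackAwarePlacement {
        static class Node {
            final String name, rack;
            Node(String name, String rack) { this.name = name; this.rack = rack; }
            public String toString() { return name; }
        }

        static List<Node> placeReplicas(Node writer, List<Node> cluster) {
            List<Node> replicas = new ArrayList<>();
            replicas.add(writer);                      // replica 1: local
            Node remote = null;
            for (Node n : cluster) {
                if (!n.rack.equals(writer.rack)) { remote = n; break; }
            }
            if (remote == null) return replicas;       // single-rack cluster
            replicas.add(remote);                      // replica 2: crosses master switch once
            for (Node n : cluster) {
                if (n != remote && n.rack.equals(remote.rack)) {
                    replicas.add(n);                   // replica 3: stays inside remote rack
                    break;
                }
            }
            return replicas; // pipeline: writer -> remote rack -> within that rack
        }

        public static void main(String[] args) {
            List<Node> cluster = Arrays.asList(
                    new Node("a1", "rack-a"), new Node("a2", "rack-a"),
                    new Node("b1", "rack-b"), new Node("b2", "rack-b"));
            // Prints [a1, b1, b2]: only one copy crossed the inter-rack switch.
            System.out.println(placeReplicas(cluster.get(0), cluster));
        }
    }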
So, data migration. It was a big problem. Like I said, you're getting 25 terabytes of data a month coming in, we have this existing messages product, and people expect their old messages to work in this new format. We're dealing with huge scale. This was all stored in MySQL in the back end, and we have to go to HBase for the win.

OK, so one of the biggest problems is that MySQL is normalized. We had three main tables that we basically had to denormalize to migrate a user over, and it's one insanely huge join of terabytes worth of data spread across many different machines. Slightly problematic. With multiple terabytes of data spread across machines, failure is the norm, and you get really odd problems. One thing that was kind of anecdotal that we discovered: as you saw, you basically have an inbox for your friends and then an inbox for other people who aren't your friends. Well, I guess we have these teenyboppers in America who don't want to be tagged in photos, so what they'll do is actually destroy their account when they leave and then recreate it when they log back on, so people can't tag them. That's great when you're having to move terabytes and terabytes of data over, and depending on what time you ask, they either have no friends and their account doesn't exist, or they have 500 friends. So we had a lot of interesting problems. And a lot of the time you ask, what's the reason for this failure? In that case, the failure wasn't anything operationally wrong; it was that users were destroying and creating accounts. And I guess we still let them do it, so more power to them.

So basically, we get a snapshot of the MySQL data and we store it into HBase, in a flat structure that's not really organized. Then we join the data with MapReduce to actually put it in the table in the organized format, and load it into the final cluster using bulk loaders. Something that's interesting: a lot of people will do HBase performance stats, and they'll talk about put operation costs. One of the benefits that HBase has is that if you're importing data while you're serving live cluster data, and you just do a bunch of puts, then you're worried about your synchronization locks; you're intermixing your live data with this background data that's not as performance-critical. It's great if, just off in the background, you can create these B-tree tables, these store files, yourself, and then just shove them in after you've created your entire shard of the database. So that's what we do with these bulk loaders, to take the load off our live data as we're importing millions of users at a time. (I'm not a migration guy, by the way; I just work with them.)
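As a rough sketch of that bulk-load flow, here's what it looks like with the 0.90-era HBase MapReduce bulk-load tooling; the table name, staging path, and the omitted join/denormalization mapper are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
    import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // A MapReduce job writes store files (HFiles) off to the side, then the
    // finished files are adopted by the live table without going through the
    // put path, so live serving stays unaffected.
    public class BulkLoadSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "messages");    // hypothetical table
            Path hfiles = new Path("/tmp/messages-hfiles"); // hypothetical staging dir

            Job job = new Job(conf, "denormalize-and-bulk-load");
            // job.setMapperClass(...): the join/denormalization logic would go here.
            job.setMapOutputKeyClass(ImmutableBytesWritable.class);
            job.setMapOutputValueClass(KeyValue.class);
            // Sets up total-order partitioning so each reducer writes HFiles
            // matching one region's key range.
            HFileOutputFormat.configureIncrementalLoad(job, table);
            FileOutputFormat.setOutputPath(job, hfiles);
            job.waitForCompletion(true);

            // Shove the finished HFiles into the serving regions.
            new LoadIncrementalHFiles(conf).doBulkLoad(hfiles, table);
        }
    }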
OK, working with HBase and HDFS. One of the big problems with a lot of these NoSQL databases is data loss. A lot of the algorithms sound wonderful on paper, and then you look into the gritty details of what happens if server A dies, followed by server B, followed by server C: oh, you get data loss, but that's not a big deal. Well, with Facebook Messages, people tend to have strong opinions about Facebook for whatever reason, so we've got to make sure we're not losing your data and we're protecting it. So the goal of zero data loss, that was the main thing we worked on and the main thing we were worried about: making sure that no users lost data, that in the face of normal failures, even crazy scenarios, we would still be able to recover from it.

One of the main problems was that HBase didn't have sync support, so if your server died, you lost, say, the last second's worth of data, because in HDFS the master and the slaves weren't communicating properly. So we added that in. We added ACID-compliant writes at row-level granularity, so that we could do ACID-style transactions. We did early log rolling, with a lot of tight integration between the file system and the database. Everything by default is stored in three replicas, but you could get into these weird scenarios where one slave dies, so now you have two replicas, but you go on your merry little way; now another slave dies, you have one replica, and you go on your merry little way, right? But now you have a JBOD disk with only one replica, and you could potentially have data loss. So we did a lot of tight integration to make sure that the second you went down to two replicas, we started triggering re-replication and rolling off to a new log, so you get a brand new three-way replication scheme.

And then we also did a lot of work on the HBase master redesign. One of the fundamental objections you hear a lot with HBase is, well, the master is a single point of failure, which is normally a huge problem. We've basically moved HBase from being a single point of failure to where you can have a master, you can have multiple backup masters ready to come in, and you store the critical information that needs to be preserved across server loss in ZooKeeper. Like I said, ZooKeeper's quorum was tricky for us because it's eventually consistent, but the great thing about it is that now we can have five servers. So you no longer have a single point of failure; you have to have five things die in a row before you're hosed, which is great.

OK, a lot of it was getting down to the nitty gritty. One of the problems you hear about a lot is that people will build this amazing new technology and they'll want to add all these cool features. I added a new feature here called Bloom filters, a probabilistic algorithm for skipping files; it sounds great, and I got to read a lot of PhD articles about it. And then I get a 2x performance gain. And then my boss comes around, starts looking at the decompression code and goes, oh, we should buffer those RPC calls: 20x performance gain.
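For reference, turning on that Bloom filter feature is a per-column-family setting in the 0.90-era admin API; the table and family names here are made up. A ROW Bloom filter lets a get skip store files that definitely don't contain the row, cutting disk seeks during the n-way merge.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.regionserver.StoreFile;

    public class BloomFilterSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);

            HTableDescriptor table = new HTableDescriptor("messages"); // hypothetical
            HColumnDescriptor family = new HColumnDescriptor("body");  // hypothetical
            family.setBloomFilterType(StoreFile.BloomType.ROW); // per-row membership test
            table.addFamily(family);
            admin.createTable(table);
        }
    }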
A lot of it is really understanding your system, understanding where the bottlenecks are, getting down to the nitty gritty, and making sure that you're not sitting there going, well, I'm not exactly sure how this works. We can't afford to say I'm not exactly sure how this works. If we get a random unit test failure that happens one out of every hundred times, it means we don't understand what's going on, and we've got to go figure that out. So we did a lot of work on really understanding the system and adding small availability and operational improvements to it: adding rolling restarts so we can upgrade versions without needing any downtime for the users, and being able to interrupt any long-running operation. A lot of times you'd go to shut down a server and it would just stall for a minute because something weird was going on, and you're like, I don't know what it is. We had to find all those little areas so that we could shut it down in a second and restart it when we needed to. We wrote HBase fsck to do background file system checking for us, because we're not just storing your data; we're storing your data, we're backing it up, and then we have separate utilities going through and scraping and making sure that both of those systems are saying the correct things. And like I said, performance: Bloom filters, column seeking. A lot of this was making sure we minimized our disk seeks and parallelized in the right locations, really looking at our exact use case, and instead of just adding a bunch of features that look cool, adding features that would help us launch Facebook Messages.

Another important thing that we did was feature isolation. Obviously, with a bunch of new systems that we were integrating together, HBase, HDFS, Hadoop, we really wanted to isolate problematic areas. If we don't know how a feature works, we shouldn't just keep it enabled in production; we should disable it until we learn how it works, and then enable it. One of the big ones was automatic splits: doing those splits was a temperamental sort of thing where you could have occasional data loss. We had a lot of people working on that, but we thought, well, that's not our focus; our focus is consistent data. We can just do the splits manually and control them ourselves. And that's really great when you have a 100,000-line log file and you have to analyze your regions over the past week. If your region just got renamed from this to this to this to this, it's a nightmare to debug. It's great if you can just have a consistent key so you can look up the region over a period of weeks and do performance measurement on it.

Let's see, other operational challenges. One of the big things we did here, since we had such a huge amount of infrastructure, was a dark launch that was basically mirroring our existing inbox traffic. So when we were at, say, 1% for the release, in dark launch we were at 10%. We were always making sure that we were scaling our dark launch above and beyond, trying to find failures well before the users would find them. And the best data to work with is real data. You can sit there and run benchmarks all day long, but in the end, if you can replicate real traffic, and have it running over months at a time, you can find all the little weird bugs that happen once a month. That's really the best way to do it. We also did a lot of deployment monitoring: tons of scripts, dashboards. One of the things that's great is you go and look at our HBase and we just have tons and tons of graphs, and we sit there and analyze the graphs. Since we're doing manual splitting, we can watch how a region behaves over a month at a time, do certain performance tweaks, and understand how the system works as data grows and as certain ratios between our files change.
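A minimal sketch of that manual-split approach, again with the 0.90-era API and a made-up table name: crank the automatic split threshold way up, then trigger splits yourself when you decide a region is too big.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class ManualSplitSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // Effectively disable automatic splits by making the threshold huge
            // (normally this lives in hbase-site.xml, not client code).
            conf.setLong("hbase.hregion.max.filesize", 100L * 1024 * 1024 * 1024);

            HBaseAdmin admin = new HBaseAdmin(conf);
            // Operator-controlled split: region names now change only when we
            // say so, which keeps weeks of per-region logs and graphs comparable.
            admin.split("messages");
        }
    }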
So the last area that I really want to talk about is working within the Apache community. My personal opinion is that Facebook is a very open source friendly company. I didn't have to go sign a whole bunch of waivers just to get started on this project; in fact, I didn't realize that companies made you sign waivers until a couple of months after I was working on this, and they said, oh, that's not us. And they really want you working closely with the community.

HBase wasn't this thing that Facebook built and Facebook made popular, right? This is a project that had already been going for four years. They already had this wonderful community of multiple companies with smart people working on the product. And we didn't want to come in and be the flash in the pan. We wanted to come in and work, because we thought we could get our product done faster, and help them get their products done faster, if we worked together in an open source environment. We had in-house expertise in both HDFS and HBase, which, like I said, needed to be tightly integrated, so we tried to bring in people from even outside of Messages within Facebook to help work with this open source community. When we were evaluating different suites, like Cassandra, like Hypertable, we weren't just looking at what features they had. We were looking at: what does the community want? What are the community's goals? Is what we're doing helping the community, and is what the community's doing helping us as well?

And we try to increase community involvement. There's active encouragement to submit diffs and patches whenever possible. I personally try to stay within a week of the actual open source branch. So what you're seeing when you go to hbase.apache.org, Facebook doesn't have this huge amount of secret sauce in the background; you're roughly seeing exactly what Facebook's using, and you're seeing patches roughly as we create them. We really want to work with the community. We think it's in our best interest and in the community's best interest if we're being open and sharing. Additionally, that means I'm a committer now on the HBase project, and we have another person who's on the PMC. And we're actively encouraging others: when we brought in co-ops, we tried to get them to work with it, and we tried to get people outside the project that had an interest in running HBase in their application to commit to the community, not just have it be the little HBase group doing all the work.

And there were massive feature improvements that we did with the community and couldn't have done without it. We had this HDFS 0.20 append branch, so we not only had to work with multiple organizations within HBase, we also had to work with multiple organizations within HDFS to integrate all this stuff together so that we could have zero data loss. We did an HBase master rewrite, which basically took two of our people and two of the people from the open source project; they sat there for hours hashing out the design and building it collaboratively.

And there's continual interaction. You can talk about major features, but at the end of the day, when you talk about working with the community, it's a lot of little day-to-day stuff. So, an interesting problem that we had about a week before our launch: we were getting extremely large responses, people texting up a storm, and they'd go over two gigabytes worth of data that we're now RPCing. We had to learn how to shard that. And it's a perfect example: we saw this and we went, ah, we have too large an RPC. Immediately what we do is go on IRC and start chatting with the other PMC developers: have you seen this problem before? What are your thoughts?
We have a committer with StumbleUpon who goes, oh yeah, that's actually a good feature to add; I've been wanting to add that for a couple of weeks, it was just next on my docket. So, you guys need it within a week, right? So he ends up writing the permanent feature. Meanwhile, we have another guy within our team doing a lightweight version and performance testing it to make sure this is a theoretically sound implementation before we delve into the full thing. He delves into it, emails us the patch. We do peer reviews on the patch and send it back to him. He does even more peer reviews on the patch. That's how you get a nice stable product out fast: you have people working together. It wasn't just Facebook wanting this feature; there were multiple companies that wanted it. We all wanted to work together and talk together. We didn't want to worry about who developed this and who gets the glory. We just wanted something working.

An interesting statistic: when we joined, HBase was at 0.20. The next rev, they bumped it up to 0.90, to reflect that they had append capability. There were over 1,000 patches between those two revs. So if you've tested with HBase 0.20 before, it's a little bit old now. And that's it. So, do you guys have any questions?

[Audience] You mentioned something problematic about operations; you have to move a large set of data around. How do you handle that?

So the question was: you have a lot of volatile operations (I think I was talking about the people creating and destroying accounts) and you have to migrate terabytes worth of data; how do you do that? I mean, carefully, right? To be honest, migration is the area of this that I have the least expertise in. A lot of it was doing large MapReduce scripts, handling a lot of failures, taking a lot of statistical measurements, and trying to measure every little odd error case that happens, the ones people normally take for granted and just return false on.

[Audience] It's not about the migration logic; how do you actually move the data within your network in a proper way? How do you structure it? A large pipe?

Yeah, OK. I mean, that's beyond how you do it programmatically, so that's a different question. Next?

[Audience] You just mentioned the functionality and the goals, the reasons why you used HBase, but you didn't say anything about performance, and I've read some reports about HBase's performance.

So the question was, what about HBase performance? What's interesting is that performance was definitely something we looked at, but it wasn't necessarily an immediate top goal. The big questions were: is the architecture theoretically sound? Is the development stable? Are the features stable? Do we not have data loss? You know, we sat down, and in two weeks we increased the write performance of one of our use cases by 300%. So why do we care exactly what the performance is today if we can triple our speed? What we care about is the zero data loss.

[Audience] What's your take on eventual consistency? There's a theorem which says that if you don't sacrifice consistency, you have to sacrifice either availability or partition tolerance.

Yeah, so the question was basically about the CAP theorem: consistency, availability, and partition tolerance.
What Cassandra does is more of an eventually consistent model. What you're sacrificing with HBase, as I recall, is the partition tolerance, where you cannot have a split-brain scenario. That's roughly what you're dealing with: if for some reason your network switch cuts off and you have two brains, you can't serve out of both clusters; you have to serve out of a single one. In reality, it's not a big deal, because like I said, you've got three replicas, and you're doing the replicas across racks. You can literally have a rack die and go offline, and it still works great; it still auto-rebalances for you.

[Audience, partly inaudible] What about different data centers?

So the other question is, what about different data centers? Replication support is already being worked on with HBase. In addition to having the three replicas, you can have three replicas across different data centers. Right now it's doing sort of a master-slave. We're working on that; it's a high priority for us. But the reality is those three replicas are great; you rarely get below two replicas, is the reality. And the cross-regional case, like having, say, a data center here in Europe, is really for the read latencies, right? Any other questions?

[Audience] For your monitoring, where do you store your metrics? In HBase as well? You were saying you have a lot of metrics and dashboards and stuff.

Right. Oh, sorry, let me repeat the question: what do we do about metrics, and where do we store them? We have an internal application that's very similar to Ganglia, the open source project that people use for storing metrics. HBase by default will export metrics on a variety of areas. We chose to export them with JMX, so we collect JMX data and show graphs that way. But it's very easy, again, to do with Ganglia; I believe Ganglia is what a couple of the major companies working with HBase use.

[Audience] I think Ganglia is great, but it has some problems, so I was wondering where you were actually storing the data and how you were getting it out. Because if you use something like Ganglia, you end up having your data in RRD files, and then you have no idea how you're going to extract it; you have to use RRD tooling, whatever.

Right, so the question was about some technical details of Ganglia. To be honest, like I said, ours is something that's very similar to Ganglia, so I can't really say; it's not my area of expertise. I just give them the data and they worry about it. But the HBase community, if you go on #hbase on IRC, they're very friendly and very willing to answer any questions. There are a lot of open source solutions, and they can tell you the trade-offs for those. So, other questions?
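(For reference, a minimal sketch of the JMX polling described above; the host, port, and MBean/attribute names are illustrative and vary by HBase version.)

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    // HBase exposes its metrics as JMX MBeans; a collector polls them remotely
    // and ships the values to a Ganglia-like store for graphing.
    public class JmxMetricsSketch {
        public static void main(String[] args) throws Exception {
            // Assumes the region server was started with JMX remoting on port 10102.
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://regionserver1:10102/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                // Illustrative MBean/attribute; real names differ across versions.
                ObjectName name = new ObjectName(
                        "hadoop:service=RegionServer,name=RegionServerStatistics");
                Object requests = mbs.getAttribute(name, "requests");
                System.out.println("requests = " + requests);
            }
        }
    }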