I'm Chief Architect at Basho Technologies. Have any of you heard of Basho? Cool. If you haven't heard of us, we make the Riak distributed database, and we've been working on that in one form or another since late 2007. But it was only recently that I started reading some of the distributed systems literature, some of the old distributed systems literature. Of course I had read some relevant stuff, but I had never tried to get a chronological history of how the field progressed and arrived at some of its fundamental problems. I'd never really looked at it over time, despite being about 15 years into my career. So this is the result of me taking a step back and looking at what we're really dealing with in terms of distributed systems. I think we get caught up in buzzwords like NoSQL, Big Data, Cloud. Scratch away all the hype, and what those things really mean is a move towards distributed systems. I don't think people have fully realized that, and I don't think it's gotten enough attention yet. Calling attention to it opens us up to a lot of resources and a lot of ideas for future work that can make all of our lives easier: as implementers of these systems, as operators, as consumers of these systems, and as end users of the other systems that these distributed systems power.

So, back in the day, do you remember these things? The web wasn't used for much. It was mostly static content. There was no way, except on a few very innovative sites, to post any data back to the web. It was all read-only, one direction. And because of that it was of limited utility, but it was neat. Everybody was fascinated at the time; I know I was. There was a lot of information on the web, but there wasn't much you could do with it. This is before e-commerce, before social networking. It was an innocent time. So people didn't really care when the little digging man came up, because you'd wait a little while and reload, whatever. The thing they were trying to read was probably stupid anyway and not of much value. It certainly wasn't a life-or-death situation. And you didn't know if a website was down or if your mother had just picked up the phone, started talking into the modem, and caused it to hang up. I might be dating myself a little here with the talk of modems; the Commodore 64 was the computer I came up on.

This is how Yahoo was: a static directory, curated. There was no such thing as a search engine. There wasn't a lot of distributed systems science, for lack of a better word, going on for the purpose of making the web itself better. Another one: that's Microsoft's first home page. I just put this one up because I thought it was hilarious. Yeah, "Microsoft's World Wide Web Server." So this wasn't that long ago. This was 15, maybe 20 years ago, not even 20 years ago. And it's just amazing how different things are nowadays. Things have changed for a number of reasons, and that has consequences for how we build systems, how we operate systems, how we test systems, how we design systems. So, yeah, 15 years is a long time. It's come a long, long way since the little digging guy.

So, the modern web. In the past 15 years a whole bunch of technological advances have happened, which give us new opportunities to make new and interesting applications.
And as those applications make the web more useful and a larger part of our everyday lives, users develop new and usually heightened expectations of what those apps are going to provide. And because of all this, we have to engineer our systems in new ways, design our systems in new ways, and think about what exactly it is we're doing, the craft or the science or the art of software development, however you happen to see it. It has big implications that I know I didn't fully realize, and I'm still realizing a lot of them.

So, what's happened? Most of these graphs just go up and to the right and represent something good happening, and you've probably seen a million of these graphs in various presentations, but this is upstream bandwidth growth. Downstream bandwidth has obviously grown tremendously as well, but upstream is interesting for some reasons I hopefully can get into later on. Upstream bandwidth about doubles every year, and if that continues, maybe 241 megabits by 2030. I think the capabilities that emerge when we have that kind of upstream bandwidth are going to take people by surprise. We got YouTube because our cable modems got faster; when you can do stuff like this, maybe some of the more esoteric peer-to-peer technology will become viable. Another one you've probably all seen: the cost of data storage. In, I think it was, 2000, a dollar got you 10 megabytes; nowadays, or whenever this graph ended, a dollar gets you 10 gigabytes. So, we all know, storage is very cheap. And smartphone growth. This isn't Mary Meeker's State of the Internet; I'll stop with the graphs in a second. The orange is growth of voice and the blue is growth of data. So, from barely anything when this graph starts in 2007 to nearly an exabyte a month, if I'm not wrong. What does that say? It's like 600, 800 petabytes, almost. That's another driver. So, a lot of things are changing, and that gives us new things we can do.

Of course, there's user-generated content. I don't remember when I first started to see user-generated content on the web. It's brought us great things like YouTube comments and Reddit and 4chan. But this is what the web is really about now. I mean, this is the core of social networking. Sites like Yelp, I remember coming out. Craigslist has been around for a while. That's when the web really started to become useful, at least in everyday people's lives. So, as much crap as people put on the internet, what you started to see, I would say maybe around 2005, is the direction of traffic on the web shifting from read-only, static websites to much more data coming from the user, where before it was just protocol data: give me this page. So the dynamics of the web were changing. And if you compare the old dynamics, a relatively read-only database, changed by a few people, consumed in a mostly read-only fashion by many users, that sounds a lot like the old way relational databases were accessed traditionally: batch updates, and then mostly read-only. The flow of user-generated content, and the growth in how interactive the web was, started causing data to come the other way, and obviously we had to do something with that data.
And that's when you started to see, at least, the papers written that went on to inspire the various NoSQL companies to make their databases. It's dealing with this new type of traffic, having to store it, having to query it, having to display it, that relational databases weren't designed for. If you go back to, I think it was, the Christmas of 2004, before Amazon had any crazy homegrown NoSQL databases yet, they were on Oracle, and they had a series of high-profile outages that I believe were traced back, in the long run, to Oracle not being able to keep up with the demands of an audience of users trying to shop. And that's where we started to see the change in thinking that led to NoSQL today, which is that the traditional values and invariants that databases like Oracle hold, namely, in Amazon's case, strict consistency over everything else, including scalability, might not be the best choices for the business. It wasn't a technological change. It's really all about money, the shift towards more distributed systems. And so that's when you started to see eventually consistent databases come out.

This is more of what we can do now. Now we can store all this data; we have new solutions to store it, pioneered by the companies who first had to deal with this user-generated content coming back. We can do crazy things with machine learning. I know the guy who worked on that People You May Know thing on LinkedIn. Have you ever tried that and seen how creepy it is? How could they know that? I'm often critical of the hype around Hadoop; you hear about companies saying "we need a big data solution," so they buy 50 machines, put Hadoop on them, and just let them sit there so they can check it off. I think there's a lot of that, but People You May Know is a creepily interesting application of what you can do when you can store petabytes and petabytes of data and then sic thousands of commodity servers on the problem cheaply. And there's a bunch of other things, too. You have location-based services now, near-field communication, and, like I said, as upstream speeds increase, some very interesting peer-to-peer networks.

So we're dependent on this stuff. I just got back from London, and I lost my iPhone the first night I was there. It was like being naked in a strange city. I depend on it for so much. It's how I get a cab: it's Hailo to get a cab in London. Oh, but it thought I was in New York City when I first got there; after I replaced the phone, it worked. In the States, I use Uber to get a cab. Google Maps: I'd be lost in strange cities without Google Maps, and I expect it there on my phone and I expect it fast, or else I get in a really bad mood and I probably wouldn't go out and explore beautiful places. And, you know, it says something that maybe we're a little too dependent on technology, but this is how it is.

This is well documented, but I think it's a point that isn't driven home enough, and it's really one of the reasons why we made Riak: higher latency has an impact on revenue, a direct impact on revenue, and people have quantified it. Marissa Mayer, a couple of years back: they did an A/B test where, what was it, they added 500 milliseconds to a page, and that was a 30% drop in what would have been ad-revenue-generating traffic. So, a bunch of money.
Amazon calculated that if they added 100 milliseconds to the load time of their front page, they'd see a 1% drop in sales, and if you've looked at Amazon's results, that's a ton of money. And there's a paper about high-frequency trading and the consequences of your locality to the exchange: firms fight with each other to get even a few feet closer to the computer they're trying to talk to, so their high-frequency algorithms get a small time advantage. One fund calculated that once you're 5 milliseconds behind, every millisecond costs 4 million dollars in revenue. So low latency, and more importantly predictably low latency, is really a priority.

Going back to Amazon: if you've read about the architecture of their site, it's made up of something like 170 different services. To render that site, I'm sure some of those services can be fetched in parallel, but others are probably dependent on the results of previous lookups. So variance in the latency of the data stores that back those services is incredibly harmful to the performance of a site like that in the long run. Not only is low latency important, but predictably low latency. And nowadays, again, if Facebook put up the construction-guy GIF, it would be all over Twitter or whatever. So yes, users are increasingly impatient. I have to catch myself sometimes and remind myself of how lucky I am when I'm sitting in San Francisco and the Uber app isn't loading quickly enough to get me a cab to a place I should probably walk to anyway. But we are impatient. And that's, again, because we've moved stuff that we used to do in an analog sense onto our phones and our computers. So if you are in the business of providing services you'd like people to use obsessively, then you'd better be focused on delivering a low-latency, predictable experience. And that means predictable across hardware upgrades and downtime and whatever the world throws at you.

So latency and availability are actually very intertwined. Latency is latency until the user gives up, at which point the thing might as well be down. And that nuance is fundamental to distributed computing, and it's the reason why distributed computing is frankly such a pain in the ass: you can't tell the difference, in bounded time, between a node that's just slow and a node that's down. Which makes things like reaching decisions in bounded time, and again providing low-latency answers, difficult. And most of the distributed systems research is really about exploring the upper and lower bounds of what you can guarantee in spite of the fact that that fundamental impossibility result exists.

So what is a distributed system? Everybody probably has their own slightly unique definition. The one I pulled from Wikipedia says a distributed system is a system of several autonomous computers, each with its own memory, that communicate by message passing. So that's uncontroversial and true enough; I think the definition is probably broader, but I'll take that one. A more famous one is Leslie Lamport's definition, which is that a distributed system is one in which the failure of a computer you didn't even know existed renders your own computer unusable. So this is the task of people developing distributed systems: to minimize dependence, to minimize single points of failure, among systems like this.
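To put rough numbers on that fan-out point: here's a minimal simulation sketch (Python, with made-up latencies and service counts rather than Amazon's real figures) of why variance, not the average, is what a page built from many service calls actually feels.

```python
import random

# Hypothetical numbers: each backend call usually takes 10 ms, but 1% of the
# time it stalls for 1000 ms (GC pause, slow disk, dying node -- and within a
# request deadline you often can't tell "slow" from "down").
def call_service():
    return 1000.0 if random.random() < 0.01 else 10.0

def render_page(parallel_fanout=50, serial_depth=5):
    # Parallel fan-out: the page waits for the *slowest* of N calls.
    fanout_latency = max(call_service() for _ in range(parallel_fanout))
    # Serial chain: each call depends on the previous result, so latencies add.
    chain_latency = sum(call_service() for _ in range(serial_depth))
    return fanout_latency + chain_latency

samples = sorted(render_page() for _ in range(10_000))
print("median page latency:", samples[len(samples) // 2], "ms")
print("p99 page latency:   ", samples[int(len(samples) * 0.99)], "ms")
# With a 50-wide fan-out, the 1-in-100 slow call shows up on roughly 40% of
# page loads, so the "rare" tail dominates what users actually experience.
```

The exact figures don't matter; the point is that once you fan out widely, the rare slow case stops being rare, and the page is only as fast as its slowest dependency.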
So whatever happened to Leslie probably didn't have to happen. So that's sort of the snarky definition. And one from my co-worker and CTO, Justin Sheehy, a qualitative one that I enjoy: a distributed system is one which is in a constant state of partial failure. So things are always failing. Gone are the days where we buy huge big iron from IBM, get a bigger box, and try real hard and spend a lot of money to keep that one machine up. Nowadays we buy whatever costs ten thousand dollars, sorry, two thousand dollars, from Dell, buy a bunch of them, and when they die, sometimes it's even questionable whether you should try to fix them, because who knows if the guy fixing it is going to mess it up, and it's depreciating anyway. So a lot of customers do what they call "rot in the rack," where they just shut the dead one off, keep adding new servers, and eventually go back and cull the old ones in bulk. That's the new model of horizontal scale, and it's radically different from the old one, and it has radical implications for how we have to write these systems.

So we have some new challenges. With all this change, given all the opportunity, what are we facing here? You might remember a paper back from 2005 called "The Free Lunch Is Over." Herb Sutter, big C++ guy, wrote that at what was essentially the dawn of the multi-core revolution, and what he was warning about was the exponential step up in complexity that ordinary programmers were going to have to conquer in order to make their programs continue to run fast. Up until 2005, Moore's law held on single chips, so single chips would continually get faster. It was a dynamic they called "Andy giveth and Bill taketh away," Andy being Andy Grove, the CEO of Intel, meaning the chips got faster and Microsoft Word got more bloated, and they had an easy, cheap time doing it because they didn't have to worry about threads and locking and mutexes and race conditions or anything like that. But for the most part, single cores stopped getting faster in 2005, and chips got wider instead, with more cores on the chip. So in order to leverage cores like that, not only do you have to deconstruct your problems in a way that's parallelizable (they have to be parallelizable to begin with), but now you introduce the element of coordination and control among things that used to just run along on their own just fine, without having to ask anybody's permission or protect a critical section or whatever.

Add to this the growing scope of the problems: the decrease in storage price that made storing lots of stuff cheap, the need for low latency, the need for geographic distribution and having servers everywhere, and the need for fault tolerance. We got rid of not only systems running on single machines, but even the idea, or the illusion, of trying to maintain a single system image of what is actually a distributed system. And cloud computing was a big part of it too, where it becomes easy to spin up a relatively large number of relatively low-powered machines as an operational expense instead of a capital expense. So now we've got all this great stuff, YouTube videos and all that, but, for example with Hailo, they hail cabs and they're in multiple countries, and at some point you need to provide a good experience globally, and that means having machines in multiple data centers. That is another huge problem. So, you know,
you start off with just sequential, single-threaded code, very easy to reason about. Add in threads, and now you have exponential possibilities of interleaving, and you have to control which ones occur so that only the valid ones happen. Then we added more computers, so you have multiple computers doing this over a link that's potentially lossy and has characteristics that vary over time, and you can't make the same guarantees about it that you could about your PCI bus or your L2 cache. And then, to add insult to injury, as if this wasn't hard enough, now we need to do this all over the world, on even lossier links, over longer distances, with higher latency.

So, again, back in the day: you just drive down the road, stop signs, go as fast as you want, nobody else to bother you. If you're watching this, it's very easy to reason about the state. There's a single automobile with a single human inside it, going a single speed, taking a single path. Then you get to the highway, where you have people still driving straight (I don't see any stop lights here), but reasoning about this, now you have a state explosion, especially when you consider that objects can sort of share state: if you're flicking off the driver next to you, your state is flicking him off and his state is receiving the bird. So it gets very complicated. But then when you add other systems, and you add public transportation, and you add car accidents and the stuff that happens in the real world, reasoning about this is impossible. And this is an example of why you can't just let things go like you would in a single-threaded world. This is obviously a failure of some sort of mutex or semaphore, but this is why we have stop lights: a stop light is a mutual exclusion primitive. We have consensus, too, and I think a lot of these distributed algorithms are very accessible when you think about them in human terms. Yielding to another driver: we have a protocol, you let the guy or the gal on the right go, and it works. It's a way we achieve consensus via an algorithm. But it's complex, and it slows things down, and sometimes the need for it can be subtle. Somebody pulls over on the side of the 101 and it slows down traffic for miles; there are add-on effects, there are emergent effects that happen. So, enough of the traffic metaphor. When you have to reason about all that stuff, it's really just more than one person can keep in their head. (I just needed an excuse to put this slide up.) But it's a lot to reason about.
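The stop-light-as-mutex idea maps directly onto code. Here's a contrived little sketch (Python, but any language with threads behaves the same way): two threads doing an unsynchronized read-modify-write on shared state can interleave and lose updates, and the lock is the stop light.

```python
import threading

counter = 0
lock = threading.Lock()

def add_unsafe(n):
    global counter
    for _ in range(n):
        counter += 1        # read-modify-write: another thread can sneak in between the read and the write

def add_safe(n):
    global counter
    for _ in range(n):
        with lock:          # the "stop light": only one thread in the critical section at a time
            counter += 1

def run(worker):
    global counter
    counter = 0
    threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter

print("without the lock:", run(add_unsafe), "(should be 400000, but that's not guaranteed)")
print("with the lock:   ", run(add_safe), "(always 400000)")
```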
Has anyone ever debugged a multi-threaded program? Was it fun? One person always says it's fun. And you can imagine, and probably have experienced, that a distributed system is like that, but worse. So you can't just let things run haphazardly, even though that single car on the road can cover a lot of ground unencumbered by speed limits or stop lights or other people or the need to talk to other people.

So, the tools we tend to have, and they sort of build on each other, and these are really the fundamental problems and tools in distributed systems: coordination, consensus, and consistency. We use coordination, which is a fancy term for machines talking together with the intent of doing something, to achieve consensus, which is machines agreeing on one thing. And that could be "fire the missile, and make sure you only fire one missile." It comes down to a couple of problems a lot of the time, like the leader election problem. Sometimes you want to make sure something happens at least once; sometimes you want it to happen at most once. Firing a few missiles is fine... think of something else disastrous. Providing those guarantees actually takes you through this circuitous route, if you really want to understand the dynamics of it, through these distributed systems concepts. So, at a higher level, what we use coordination for is to achieve consensus, and in a database, which is my specialty, that's the means by which a distributed database achieves consistency, which is also a valuable property. But the problem is that consistency comes at the expense of both latency and availability. If you go back to a human metaphor: if you're trying to make dinner plans, the more people you have to invite, the longer it takes, and if you can't get a quorum of people to go, sometimes you cancel it. So it doesn't always work out; it's not as available, or not as easy to do, or not as likely to succeed, as just going to dinner by yourself and saying, screw them.

So these are really the tools that we have to use. Coordination is the first one, and it's basically just overhead. Different distributed algorithms, consensus algorithms in particular, have different message complexity; Paxos, for example, takes two rounds. You need to think about the channels you use for coordination, you need to think about their latency, and you need to do some back-of-the-envelope math about what the latency impact of each additional party you have to talk to is going to be. We have some systems that do this. ZooKeeper is one of them, for example, a distributed coordination service. ZooKeeper sits up there with its five machines and decrees various things that everyone gets to believe to be true, which is a good solution, and better than we've ever had before. But, as I'll get to, we don't have a lot of reusable primitives for doing a lot of this fundamental distributed systems work.

So, using coordination, you build consensus, and there are various types. Paxos is the algorithm that's most talked about. It also has one of the worst histories in the literature of any algorithm: about 20 papers called "Paxos made easy," "Paxos made easier," and it's really not easy. I've yet to find one person, including computer science PhDs, who can sit down and explain it to me. But the idea is simple, and the way it's usually implemented is: you have some machines, say five of them, and what you need to do is propose something, and if three of those machines are up and can commit it to stable storage, then you've achieved consensus.
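That "three out of five" bit is really just majority-quorum arithmetic. This is nothing like a real Paxos implementation, just a sketch (Python) of the overlap property these protocols lean on: any two majorities of the same five machines share at least one member, so two conflicting decisions can never both be accepted by a quorum.

```python
from itertools import combinations

# Five machines; a quorum is any majority (3 of 5).
nodes = {"a", "b", "c", "d", "e"}
quorum_size = len(nodes) // 2 + 1   # 3

majorities = [set(q) for q in combinations(nodes, quorum_size)]

# The property Paxos-style protocols rely on: any two majorities intersect,
# so at least one machine has seen both proposals and can report the earlier one.
assert all(q1 & q2 for q1 in majorities for q2 in majorities)

# It's also why 5 machines tolerate 2 failures: 3 survivors still form a quorum.
print("quorums of", quorum_size, "out of", len(nodes),
      "machines; tolerates", len(nodes) - quorum_size, "failures")
```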
In the context of databases, again, this is: I propose that the value of this key is X, and that's how you observe strong consistency out of a database. Byzantine consensus is a little harder, and something you sometimes have to implement; ordinary consensus assumes that all the other processes are friendly and don't try to send you weird messages. Security in distributed systems is often overlooked, or left as something to think about later, but if you have malicious actors that can access your control infrastructure, you've just bought yourself another round trip and a higher number of machines needed to provide fault tolerance.

And then, finally, consistency. You've heard "eventual consistency" a million times. Dynamo was the eventually consistent database. Dynamo was Amazon's choice, a business choice, that they weren't going to insist on consistency for adding stuff to people's shopping carts. They said it was more valuable to always allow people to add something, and then, for the much more infrequent occurrence that they check out, you can use a more strongly consistent mechanism. So right there you see the need for varying levels of consistency, and databases that can provide a spectrum of these levels are just starting to become mature; I'll talk a little bit about our plans with Riak as well. You want as little consistency as possible, really. Computer science is very interested in questions like what's the weakest possible failure detector that can solve this problem. You want to choose the bare minimum in terms of consistency needs, which translates to coordination traffic, which translates to latency, which translates to the likelihood of downtime and unavailability. It's a snowball effect.

So Riak, at least as of version 2.0, will have its normal eventually consistent mode, meaning that as long as one machine is up it'll accept a write, and as long as one of the machines in your Riak cluster is up it'll serve a read. And you can tune it to the other side, where it uses a Paxos-like algorithm to provide strong consistency. So within the same app and within the same database you can make these choices, and again, they're really business choices. One of my fears is that people will just default to strong consistency because it's certainly an easier model to program against: with eventual consistency you have to deal with the likelihood of conflicts coming back, receiving stale data, and resolving conflicts, basically doing your concurrency control on the client side. It's a pain, but it's the price you pay for high availability.

So, just some pro tips here, stuff I've learned over the years; take them or leave them, but it's my advice. Understand your consistency needs early, and less is more: push to find the weakest possible consistency model you need to serve your data, if you want to minimize latency, maximize availability, and ultimately minimize cost as well. Too often, and this happens a lot in the literature and sometimes it's left as an afterthought in open source projects, dynamic membership in modern systems that need to grow and shrink is glossed over. It's often very hard, one of the harder problems there is, and there's really not a lot of literature on it, mainly because papers just say it's left as an exercise, or it wasn't considered, or it's out of scope. But for real, applied systems, you need to be able to have machines come and go when they die. These systems also have to pay very close attention to versioning of both data and protocols in order to support things like zero-downtime upgrades: you need your systems to be able to speak a couple of protocol versions ahead and behind, so you can run mixed clusters and not go down when you have to upgrade the software.
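That "weakest consistency you can get away with" advice, and the Riak knobs I mentioned, boil down to the usual Dynamo-style N/R/W arithmetic. A back-of-the-envelope sketch (Python; this is the rule of thumb, not Riak's actual implementation or defaults):

```python
def quorum_properties(n, r, w):
    """Rules of thumb for Dynamo-style replication: n replicas per key,
    r replicas that must answer a read, w that must acknowledge a write."""
    return {
        "reads_see_latest_write": r + w > n,    # read and write quorums must overlap
        "writes_tolerate_down_nodes": n - w,    # replicas that can be down and still accept a write
        "reads_tolerate_down_nodes": n - r,
    }

# n=3, w=1, r=1: maximally available -- any single live replica serves you,
# but a read may miss the latest write (eventual consistency).
print(quorum_properties(3, 1, 1))

# n=3, w=2, r=2: overlapping quorums -- reads see the latest acknowledged write,
# one node can still be down, at the cost of more coordination and latency.
print(quorum_properties(3, 2, 2))
```

Less consistency means less coordination traffic per request, which is exactly the latency and availability trade-off described above.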
And all of the decisions that you make in the design phase will have an eventual impact on operational expense and testability. This is one thing I've learned, especially at Basho, trying to test some of the more complex code: you really have to design for testability. It's a first-class thing you need to think about, and I think the general gist of service-oriented architecture supports that quite a bit, but at all phases you have to think about testability. When integrating some of our newer, fancier testing stuff, we had to rewrite some parts of the code to make them testable in the way that we needed.

In the implementation phase, choose languages and practices that enable safe, concurrent programming. Erlang is my choice. It's somewhat esoteric, and sometimes I get JVM envy, and it's a little obscure, but it's been around since before Java, and people often talk about its lightweight processes and its actor model and its sort of bulletproof history running Ericsson switches. A more subtle implementation technique is its philosophy of "let it crash": when an exception happens in your system, instead of trying to piece the whole world back together inside an exception handler and continue, the process just disappears, and then some process that's hierarchically above it restarts it. That's one of the most powerful things about Erlang, and I was thinking about why that is, and I really think it's because it means you put your failure recovery code and your initialization code in the same place. As you're developing a system, your initialization code is the code that's run the most often. The code inside the exception handler, the "this should never happen, log it and rethrow," or whatever your attempt was to recover from the exception, is the code that's run the least, and it's up to you to test it so that it ever runs, and it's code that also changes subtly as other parts of your system change. Erlang is also a functional language. I don't know if it's the functional part that I value with Erlang specifically; it's the sort of immutable nature of it. There's no shared memory access, processes only communicate through message passing, so you never have a dangling reference to another process's data. And garbage collection happens on a per-process basis, so you don't get VM-wide pauses, which, as I've been talking about latency, are obviously harmful for latency. Plus a whole bunch of software engineering features, owed to its history in running telco switches, that drew me to it. So don't be afraid to experiment with stuff. There's a lot of good stuff happening on the JVM too, and I think that's going to be a rich area in the future; there just isn't quite the language on it yet that I like.
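You don't need Erlang to get the shape of "let it crash." Here's a toy sketch of the idea in Python (nothing like OTP's supervisors, just the structure): the worker doesn't try to repair the world inside an exception handler; it dies, and a supervisor restarts it through the same initialization path that runs all the time anyway.

```python
import time

def worker(state=None):
    # Initialization *is* the recovery path: this code runs on every
    # (re)start, so it gets exercised constantly, not only after a crash.
    state = state or {"connected": True, "processed": 0}
    while True:
        state["processed"] += 1
        if state["processed"] % 1000 == 0:
            raise RuntimeError("the thing that 'should never happen'")

def supervise(child, max_restarts=5):
    # A toy supervisor: no handler deep inside the worker trying to piece
    # the world back together -- the child just dies and gets restarted.
    for attempt in range(max_restarts):
        try:
            child()
        except Exception as exc:
            print(f"child crashed ({exc!r}); restart #{attempt + 1}")
            time.sleep(0.1)   # back off a little before restarting
    print("giving up; escalate to the next supervisor up the hierarchy")

supervise(worker)
```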
The next two kind of go together: strive to build reusable components. Right now you really have nowhere to go; you can't go to GitHub and check out the best Paxos implementation and use it in your software, and there are a couple of reasons why. One of them is that the things on which you're establishing consensus are sometimes deeply embedded in the domain model of your app, but oftentimes they're not, and if you don't want to go install ZooKeeper, or tell your customers to install ZooKeeper, it would really help to have a reusable consensus implementation. VMS, the old operating system, actually used to have one in their VMS cluster product, and now we don't have one. There is some promising work out of Stanford called Raft, which is a consensus protocol where the number one goal, of the paper at least, was understandability, because of Paxos's really unfortunate history in terms of how it's been explained. Raft does a pretty good job, and they actually tested it quantitatively with students, A/B tested the whole thing, and there was a statistically significant improvement in comprehension and in the ability to explain the protocol afterwards with Raft. So I think that's cool, because there's a new consensus protocol on the block, but also because academics are recognizing the importance of things like that: we wouldn't need an understandable consensus protocol if people didn't need to understand it in order to write it.

And in that vein, prototype often. The more reusable components we have, the more we're able to prototype. When FUSE came out for Linux, you got a bunch of silly file systems, but you also got a lot of cool ones, and it really lowered the bar to playing with "hey, what can we expose as a file system in Linux?" If we had some of that for some of these distributed systems problems, we could prototype more easily and we wouldn't have to reinvent the wheel as much. I've been doing distributed systems stuff for 15 years, and until Basho, where we actually finally built some reusable stuff, at least for us, it was a whole lot of wheel reinvention every time we needed to do something. And again, I'll keep hammering on operational expense; that's what it really comes down to, and that's how we win sales at Basho. You have choices in implementation, and some of them are complex, some of them are trivial, and some of them are not choices at all, they're forced on you, but focus again on testability; in implementation it should be a first-class concern. And if you're designing a command-line interface to your software, debug the thing so the person who has to operate it can operate it effectively. These things really, really add up.

And for testing, unit tests really just are not enough for this kind of distributed systems stuff. I think developers have an ingrained and uncontrollable hesitance to break their own software, which makes unit tests not enough. And not only are they not enough, you just can't come up with enough input by hand to simulate what happens in the real world. You need an adversary; it's like a penetration test, it's like a security fuzzer. I won't get into what QuickCheck is, and there's probably a clone for your language, but it's a tool that generates a lot of test cases based on a specification, on assertions about invariants, about what your program is supposed to do, and you just let it run all night and it tries as hard as it can to break the program. Investment in a testing tool like that, although it costs money and takes time, is probably the single best improvement we made at Basho. And vary your workloads. If you write a database, and your tests only insert keys in lexicographic order, or only test read/write ratios with a single statistical distribution, you'll miss things: those variables have crazy interplay with the way virtual memory systems work on various machines and the way file systems work, depending on how you've implemented your database. So invest the time, or write your own tooling if you have to, in a testing plan that covers as many workloads as possible, preferably derived from traffic samples that you can correlate to real accesses to your database.
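For a sense of what a QuickCheck-style test looks like, here's a hand-rolled, toy-sized sketch (Python; real QuickCheck and its clones also shrink failing cases down to minimal counterexamples, which this doesn't): generate random workloads, run them against both the system under test and a trivially correct model, and assert that they agree.

```python
import random
import string

def random_ops(n):
    """A random workload: puts and gets against a small, contended key space."""
    keys = random.sample(string.ascii_lowercase, 5)
    ops = []
    for _ in range(n):
        key = random.choice(keys)
        if random.random() < 0.6:
            ops.append(("put", key, random.randint(0, 1000)))
        else:
            ops.append(("get", key))
    return ops

def check(make_store, trials=1000):
    """Property: the store agrees with a plain dict (the 'model') on every
    operation, for whatever workload the generator throws at it."""
    for _ in range(trials):
        ops, model, store = random_ops(50), {}, make_store()
        for op in ops:
            if op[0] == "put":
                model[op[1]] = op[2]
                store.put(op[1], op[2])
            else:
                assert store.get(op[1]) == model.get(op[1]), ops  # dump the failing workload

# check(MyStore)   # MyStore is hypothetical: whatever key-value store you're testing
```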
And again, that's a mistake on the slide. So, finally, operations. I started my career at Akamai Technologies, so operations is my favorite part here, and it's what I've been talking about with operational expense this whole time. Myself and Justin, our CTO, started off in the group that was responsible for writing the software to deploy Akamai's network. At the time that was something like 20,000 machines, and we managed to grow it by a factor of two or three without increasing the size of the team, through software. So I guess it was DevOps, 15 years ago. Riak is definitely written in that spirit, written to be automated, but it's also in that spirit in its design. Riak is a simple, homogeneous architecture, and this isn't just about Riak; a lot of systems exhibit this architecture. The more roles you have in a system, where something's the master node, and something's the shard node, and something's the whatever node, and you have five different types of things, each one of those adds cost to maintenance, because you have to monitor them differently, and when someone accidentally kicks one, the recovery procedure for that one is different from the other one, and you've got to figure out which one it is, and you're likely to mess up. So simple, homogeneous architectures, where all the machines are the same and losing one box is no more disastrous than losing another, are the way to go for cheap operations.

Beware of emergent properties, and this cycles back to the design, implementation, and testing phases. If you look at an Amazon outage report, it's very rarely ever just a single bug in a piece of software. It's usually some unforeseen interaction between a couple of different systems. So testing is hard enough, but you also have to learn who your neighbors are, systemically, and at least try to foresee bad things happening between systems that are nominally independent or separated. An example of that is TCP incast, which happens in a lot of many-to-one systems: if a lot of machines send responses to a single requester at roughly the same time and overflow the switch buffer, then all sorts of flow control kicks in, and you can see a 10-gigabit pipe get only one gigabit of utilization. It's really a pain to find and debug. The environment you deploy things into is very important to consider in all phases of the development cycle.

And finally, monitor intelligently. It's easy to throw up graphs of every single variable, but the best monitoring system I ever actually interacted with was, again, at Akamai, where it was just a SQL interface, and you could do things like "select all processes from the DNS servers where the resident size is over 75% of the system memory." You don't have to think of things like that at install time or configuration time, and you don't have to express them in some sort of configuration file. It's SQL; everybody can learn it, and managers would write these alerts. We just wrote SQL statements where, if they returned rows, they paged someone. I've yet to see an open source thing like that. I think it's doable, and if anyone wants to do it as a startup, let me know.
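The shape of that is simple enough to sketch. A toy version (Python plus SQLite, with a made-up schema and threshold; nothing like Akamai's actual system): agents load what they observe into tables, and an alert is just a query that pages someone if it returns any rows.

```python
import sqlite3

# Toy schema: whatever your agents report gets loaded into tables like these.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE hosts     (host TEXT, memory_bytes INTEGER);
    CREATE TABLE processes (host TEXT, name TEXT, resident_bytes INTEGER);
    INSERT INTO hosts VALUES ('dns1', 8000000000), ('dns2', 8000000000);
    INSERT INTO processes VALUES ('dns1', 'named', 7000000000),
                                 ('dns2', 'named', 1000000000);
""")

# An "alert" is just a query; if it returns rows, page someone.
alerts = {
    "dns process using >75% of system memory": """
        SELECT p.host, p.name, p.resident_bytes
        FROM processes p JOIN hosts h ON p.host = h.host
        WHERE p.resident_bytes > 0.75 * h.memory_bytes
    """,
}

for name, query in alerts.items():
    rows = db.execute(query).fetchall()
    if rows:
        print(f"PAGE: {name}: {rows}")   # swap in your real paging hook here
```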
And that's it. If you want to get in touch with me, I'd love to talk about this stuff. Basho is always hiring. I'm @argv0 on Twitter, and there's a copy of this presentation (all that futzing in the beginning was me deciding I needed to do an HTML5-based presentation) at newdistributiveworld.herokuapp.com, where you can get these slides. I think I'm out of time, but I don't see anybody waiting, so if there are any questions, I'd love to answer them. Come on.

"Did you explain Paxos?" No. The most I can explain it is: you have machines, and you have quorums that overlap, and you have stable storage, and it's magic, and it's hard. No, I really can't. I could do a better job of explaining Raft, but maybe not right now. Anyone else? Alrighty, well, enjoy the rest of your day. Again, check out our stuff: we make Riak, the database, and the Riak cloud storage product, which is basically an S3 clone implemented on top of Riak. All that stuff is on either our website, basho.com, or github.com.