My name is Shannon Kemp, and I am the Executive Editor of DATAVERSITY. We would like to thank you for joining today's DATAVERSITY webinar, Consistency in Distributed Systems, sponsored by Cloudant, an IBM company. Just a couple of points to get us started. Due to the large number of people attending these sessions, you will be muted during the webinar. We will be collecting questions via the Q&A section in the bottom right-hand corner of your screen. Or, if you'd like to tweet, we encourage you to share highlights or questions via Twitter using the hashtag #DATAVERSITY. As always, we will send a follow-up email within two business days containing links to the slides, the recording of this session, and any additional information requested throughout the webinar. Joining us today is Mike Miller. Mike is one of the original co-founders of Cloudant, which was recently acquired by IBM. Congratulations, Mike. As Chief Scientist, Mike is responsible for developing and evangelizing Cloudant's technical vision, managing long-term product research and development, and directing special projects. A former postdoctoral fellow, Mike co-founded Cloudant after cutting his teeth on petabyte-per-second problems at the Large Hadron Collider. Mike holds a B.S. in Physics and a B.A. in Philosophy from Michigan State, a Ph.D. in Physics from Yale, and is an affiliate professor of particle physics at the University of Washington. And with that, I will give the floor to Mike to start the presentation. Hello and welcome. That was a generous introduction, so I won't give you another one. It's a pleasure to be here. I'm going to talk about consistency in distributed systems. In particular, I'm going to focus on the portion that I think affects most of our lives these days, which really boils down to the application space, and that means databases. I hope to save enough time at the end for a decent amount of Q&A.
I'll also note that I realized today I had to cut a tremendous amount of this to package it into the time slot for a diverse audience without going too fast. So there's a set of follow-up and future work that I'll mention at the end, and I'm happy to go into it in more detail afterward. Before we move on to slide two, I'll just note that you're going to hear me talk about one author in particular, Peter Bailis. I always butcher his last name, and I apologize if he ever hears this. A lot of the fundamental research that's moving the field forward is coming out of a specific lab group at Berkeley, and we were lucky enough to grab Peter for a conference we have coming up next week in San Francisco, around the 17th, if you're interested in learning more from one of the experts in the field. Feel free to reach out to me for a discount pass; Peter is speaking on the morning of the 17th. So let's jump into it. Shannon gave you a pretty thorough introduction, but my background is in big systems. This is actually a picture of the last experiment I worked on in Geneva, Switzerland. There's a small human for scale at the very bottom of this four-story device. This is, I think, one of the great scientific achievements of our time, generally known as the Large Hadron Collider. A tremendous amount of engineering effort went into trying to understand the fundamental building blocks of nature and how they interact. In doing so, the field was forced to solve a lot of problems itself, largely because there were no solutions you could buy for various portions. So there's a tremendous amount of invention that happens in something like this. And that's where I found myself and my co-founders, and where many people in this space got started: moving into the realities of how you actually deal with a problem at that scale. It's over a petabyte per second of raw data coming off one of these devices.
And clearly, that's a volume you couldn't dream of digitizing, writing to disk, and distributing globally to a large consumer base. So there were a lot of strategies in custom hardware and custom software to get there. But some of the things we learned very early on, in kind of the first science lab in elementary school maybe, I think hold true all the way through, and we'll come back to those at the end when we start talking about data modeling. You know, you write everything down, and when something changes you cross it out with your pencil and record the update as it happens, without actually throwing the original data or the original notes away. And that's the general theme of a strategy that allows you to see what's going on. So, one slide here on motivation, and then we'll jump into some more detail. I saw this presented very nicely by the founder of Joyent, I believe, saying that these two things, big data and mobile, have, I think his words were, broken our model of computing. At the very least, when it comes to applications and data, they have greatly stressed our model of consistency and transactional reasoning. And for an architect or developer or somebody building a new product, these combine to make a large opportunity space for your product or your ideas, but they also make it very challenging. To go into more detail on whose problem this is, who really feels the pain: I think you feel the pain the minute you outgrow a single computer or a single server. When you begin replicating your data, say for redundancy purposes, or to try to scale, say by sending writes to one server and having reads served from other servers, or when you begin spreading data between data centers or actually across more than one device.
We're all very used now to having a single, seamless experience, whether we pick up a phone or a tablet or a laptop or, you know, a workstation, you name it. That's very challenging, especially when some of those devices are online or offline. When you have mixed read-write workloads and a high number of concurrent users, it's a problem any time you take an application across more than one process. So, slide six: my personal statement is that this is pretty much now everyone's problem if you're building a new application. Slide seven is that there's been a lot of market response to this problem over the last half-decade or so, and in particular I'm going to focus on NoSQL, NewSQL, and cloud, which are probably a part of that. And I think the great news is that there are new solutions hitting the market. Some of you on the phone are probably very familiar with them, with use cases and, you know, strategies to adopt. But I would say that this is still a fairly new field for the majority of folks. So I think it's worth peeling this back and looking at it from the view of the fundamental users of, you know, these types of technologies, distributed systems, and in particular databases. I think we're all very familiar with distributed systems, from the first time you used a telephone if you had a landline, or a mobile device the first time you got your hands on one. After all, we're all online right now on some type of distributed system for this webinar. But if you go a little bit down and look at it as the application developer: I want to build an application that actually has some state. Something like 80% of the data on the Internet now is user-created data. So that's what I'm usually talking about here when I'm talking about state, right? Maybe you write something so that it's there for somebody else to read, or for you at a later point in time. That means creation and updates and synchronization of all of that.
Once your data is spread across multiple machines or processes, all of that gets harder. So I'm going to slow down here and give you a four-slide Cloudant pitch, to give you an idea of where we're coming from. So bear with me real quick, and then I'll bring this back to what we're trying to learn. So again, this brand-new space has new challenges, which means it's also an opportunity, and we've been very fortunate to grow into it. Cloudant is a distributed database as a service, and a key differentiator is that it ships with a mobile strategy, so that there's no single write master. A big part of the thinking here is trying to make it very easy for developers to get involved and start using the service. So, starting with Cloudant, the hosted service: you come in, you sign up. Say your name is Janice; you could be janice.cloudant.com, and that's the URL to which you begin sending your JSON. It's that straightforward. But wait, there's more. It does the things that you expect a database to do: it stores JSON documents, it indexes them in B-trees like a regular database, and there are a variety of secondary indexes; you can do search and geospatial queries. And then, as of 2013, the big difference: I said it has a mobile strategy. So you can write locally, whether that is to SQLite on a phone, to a browser, or to the cloud. You kind of have this single API that allows you as a developer to write to all these different things at very different scales. And then, underneath, your data gets synchronized across the globe. So if you're a Samsung, with a billion-end-user base, this allows your users to have their data synchronized wherever they are on the globe, and it deals with those problems of scale and geo-dispersion. These deployments can be multi-tenant or dedicated, in the places where your app is located. Now, this is a pretty different model than what databases were originally built for. They were originally built to be embedded, maybe, in the application itself.
I think the same progression happened with databases. Eventually the database ran as a separate process, but on the same computer as the application itself. Then it moved to different computers that were very nearby. And now we're talking about a highly distributed, small world at great scale, with semi-connected devices like phones coming in and out, and the challenges you have are new. So, kind of wrapping up that pitch, and looking back to the developer perspective, pictured by the developer here, who tends to look very surprised and very shocked a lot of the time: once we explain that type of story to a developer, the first question we get is, so how do I write code for that world? Can you compare it to something that I know, or something that I know more about, like MongoDB? And then a long list of things, right? And part of that is, how do I start to reason transactionally? So, you know, I'm trying to bring the existing design patterns and the things that developers are comfortable with into this new world. And that's what we're here to talk about today. I think the overall message that I want everybody to take away, and I see we're near the ten-minute marker, which is good, is that you do need to understand the basics of your data store. There's really no replacement for that if you want to get the most out of the system and derive the best value. I'm going to talk a lot about NewSQL and NoSQL databases. They're very flexible, and they're very new. That doesn't mean they're not production-ready, but it does mean that a lot of the best practices and strategies and tooling and ecosystem around them, even the consultancies, are relatively fresh. And so this is part of a series of events that DATAVERSITY is running to try to help folks like yourselves understand how to really master them and get the most out of them. So, with that, I want to back up just a little bit and talk briefly about some of the history here.
A lot of the ideas you can trace back to a few central minds. I show two of them here who have written some of my favorite papers: Jeff Dean and Sanjay Ghemawat from Google. There's a nice link here to an article about them by Cade Metz in Wired, which it was a pleasure to contribute to. At least for me, and for a large fraction of the field and the cohort of other companies, like, you know, 10gen and others, I think these are two leaders who, to a large extent, really inspired the field. And there are maybe four pieces of literature, of thought leadership, that really define it; slide 17 shows the ones I would choose. The Google File System paper talks about how you take the notion of a file system and spread it across multiple devices. The MapReduce paper shows how to ease the way in which folks could interact with, query, and analyze that data in a batch system. BigTable, which inspired a lot of the database side of things, took those same primitives for distributed data and the standard toolkit of distributed systems and tried to make it more accessible to actually store data and application state. And then I'll sneak one in here from Amazon: the Dynamo paper, which was published by Werner Vogels, the CTO, and others, and which gave a really nice view into one of the systems that Amazon had built and used internally behind the shopping cart and other portions of the commerce business. In particular, that one was a favorite of folks like myself because it gave you operational insight into not just the architecture of the system, but what it takes to run it in production, and what the perspective is in terms of, you know, how they reason about consistency and availability in particular. These papers are really seminal, and next I show an infographic that I think was created by, or for, Cloudant.
The idea is that if you take those four papers and put them at the center, you can map out some of the early portions of the NoSQL, NewSQL, and big-data ecosystem. At the top you have the Google File System and MapReduce, which spawned quite a large number of things that were originally offline batch analytics and are now moving into real time, but are very much derived from those systems. Really, now, I think we just refer to that as the Hadoop ecosystem, along with everything being built on top of it. It has been incredibly disruptive to the analytics market, and I think it's going to be a big part of our lives for a long time going forward. The BigTable and Dynamo papers, you know, basically inspired a large fraction of the NoSQL databases, and a large fraction of the algorithms that are incorporated into some of the NewSQL databases, like NuoDB or VoltDB or things like that. I know I've condensed this down very quickly, but I'm glad to take more questions on it later on. I'll just talk about two different types of databases here in particular. There are those that are heavily Dynamo-inspired, and these are Cassandra, Cloudant, Riak, et cetera, which give you highly available, highly fault-tolerant systems. I'm also going to make occasional references to MongoDB, which adopts some of that philosophy, but not all; it's kind of stuck in here as its own entity, but it's probably much more familiar to folks who are used to running MySQL. So, I'm going to jump into the biggest challenges for developer reasoning, and I wonder how this will land with the SQL people. Okay, this is great. I know that I'm sounding a little tinny; let me know if there's anything I can do, or speak up, Shannon. Along with what you gain, it's also important to talk about some of the things that you give up in adopting new technologies, and some of the challenges.
So let's start with four high-level things here. Especially with Dynamo-inspired systems, you're looking to replace single points of failure, which gives you more reliable systems with better failure behavior; that's one thing. And the fact that you no longer, when running something like a Cassandra or a Riak or a Cloudant, have to worry about how you manually partition your data sets to fit them onto machines is very powerful. I pulled one quote here from the recent Google Spanner work, with the link, where they talk about the experience they had trying to reshard the manually partitioned MySQL database which stored their F1 data, which is, I believe, data that's really core to their advertising business. And that's actually a fairly modest data set by their standards. The most recent manual resharding that they performed, before they decided to abandon that system and build something else, took two years of coordination. And that's two years of incredible business risk. One thing that I want everybody to remember is that the fact that we no longer talk very much about users being responsible for partitioning their data on these distributed systems is a big step forward for the field, because it's an incredibly difficult thing. The next point is for folks who are used to building systems and dealing with high levels of concurrency, especially, or high-volume workloads. ACID-compliant systems stay consistent on disk by taking out locks, and moving to a Dynamo-inspired system was incredible because those locks kind of go away, and you end up with systems that are very tunable, available at all times; even in failure modes or at high levels of concurrency they will take your write. And that resonated, having worked on proprietary systems at the Large Hadron Collider, where we were pumping metadata into, like, Oracle, and it just couldn't keep up.
We eventually realized that we would rather know that the system will eventually reach consistency and will always take my write. You know, I don't want it to reject data, and I want to make sure that once a write hits, it's resilient and will be there for me to read later. And probably the most differentiating part of NoSQL in general is the fact that you don't have to start with a schema, right? It's optional. And I think Hadoop has done a good job here, in saying, okay, you can throw everything into HDFS and apply a schema on read to bring structure after the fact. For an application developer, again, it's so important to be able to get started easily and to deal with very dynamic datasets. I'll note that in some of these newer systems that enforce strong consistency there is a high degree of schema, of tables, and that structure is part of what lets the engine do those things efficiently; giving it up does take away some power. Slide 23 here just closes out this list of trade-offs, right.
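To make the schema-on-read idea concrete, here is a minimal sketch in Python. The records, field names, and defaults are invented for the example, not taken from any particular system: structure is applied when the data is read, not when it is written.

```python
import json

# Raw event records, ingested with no schema enforced at write time.
raw_records = [
    '{"user": "janice", "action": "login", "ts": 1400000000}',
    '{"user": "mike", "action": "purchase", "amount": 12.5}',
]

def read_with_schema(line, fields, defaults=None):
    """Project a raw JSON record onto the fields we care about,
    filling in defaults for anything a given record lacks."""
    defaults = defaults or {}
    doc = json.loads(line)
    return {f: doc.get(f, defaults.get(f)) for f in fields}

# The "schema" lives with the reader: different readers can impose
# different structure on the same heterogeneous records.
rows = [read_with_schema(r, ["user", "action", "amount"], {"amount": 0.0})
        for r in raw_records]
```

The point is simply that neither record had to match a table definition up front; the reader decides what shape it needs.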
The fact that, you know, the web runs on JSON means it's a natural fit for new web and mobile applications, while the enterprise sort of runs on similar things in rows and columns. I think any time you've worked with that in large quantity, you've felt the impedance mismatch for mobile or web engineering. Slide 24. And Shannon, do you mind just hopping in and letting me know if the sound is okay? (It sounds like you're on a speaker; I don't know if you have a handset?) No, I'm talking on my headset, same as before. Is it better if I talk like this? (It was fine during the sound check, but it just went a little wonky.) Okay, something must have changed since the sound check. So, we're on slide 24 now. That was a high-level review of why we're interested in consistency, why distributed systems are hard, how we got to the new systems that we have, and the power they offer an architect or developer. Now let's go a little deeper. I think there are two scaling models that folks are probably familiar with, which classify things. The familiar one is scaling reads: a primary, and then multiple secondaries. That's still a viable option, whether it's MySQL or MongoDB or running your own Solr. In that system you can split your workload by sending writes to the primary, while reads go to the secondaries, which will eventually see the data. And in the case of MongoDB... (It now sounds maybe a little too close to the mic, a little muddy.) Sorry; it's weird how it changes. Let's see what the audience says. Thank you for flagging the challenge, and thank you, everybody that's jumping in. So, I'm going to classify master-slave systems, and then I'm going to talk about quorum-based systems. And I'll note that if you're running a system that on a single server is fully ACID-compliant, like MySQL, and you introduce other servers for replication reasons, then you're in a situation where you're running a distributed
system, and those systems either have to lock until the data is replicated from the primary to the secondary, or most of them will trade that for latency on reads, or both. Or there are quorum-based systems, which are the Dynamo-inspired systems, Cassandra or Riak, et cetera, where the idea is that there is no master and you redundantly store data on different machines. A write gets hashed automatically to a partition, and that partition lives on three nodes; the write has succeeded when the document is flushed to two of the three nodes, in the default configuration. So I get data redundantly on three machines, and a write succeeds, by default, with two copies, even if the third write happens asynchronously or slowly. Underneath, you have a voting mechanism where nodes can report their state, so that even if one of them is slow, the system knows it's slow and can repair itself. And that's how you get consistency in a distributed system even in the face of failures, with some challenges that we'll come back to as well. I'll just note that you don't really get anything for free; any of these systems has its own disadvantages, so make sure that you are using the optimal data model for the one you pick. For the purposes of time, I'm going to focus more on the Dynamo-inspired side going forward, in part because there's a tremendous amount of literature there, and I think those systems are a little bit newer and more interesting in their scalability. You can see right here: suppose you have a write that comes in and a copy is written. There are a tremendous number of corner cases that you have to discuss when you're building an application, or when you're helping users adopting a service like this understand how to reason in this world. Cloudant, Cassandra, and the Dynamo-inspired systems use fine-grained locking at most, and some of them don't lock at all; instead of a lock, update, release-lock pattern, you have largely lock-free systems, which I can go into in more detail in the Q&A session. And moving into the consistency side of things, there is a nice
difference here, put very briefly: developers these days are kind of left with two choices. Either you get fast but possibly inconsistent results, or you use algorithms that deliver consistent results, say by locking, updating in place, and releasing, which are slow and worse under failure. That has a significant impact on a lot of large real-world systems and on the developers building them. And the reason that you have those two classes of choice is not fundamental; it comes down to the fact that things don't have to be consistent on disk for them to be consistent from the calling client's perspective. If you're interested in digging deeper into consistency models, and eventual consistency in particular, I urge you to read this one kind of high-level summary paper here by the same authors from Berkeley. It's a great overview, and it goes into the details of, you know, how you actually reason about this, and how it is that we have so much of the web built on these systems: we buy things today, we trade real money, we have shopping carts, we get material goods, and real dollars are being transacted based on systems like this. A lot of those systems don't have a fundamental guarantee of safety, right? Only an eventual guarantee of safety. How is it that it all works? And what's the catch? Another paper really goes into something that I like to do, which is to put a number on it. When we talk about eventual consistency, how eventual are we talking? Is it an hour? Is it a millisecond? The answer to that is going to, you know, define for me whether or not I'm comfortable with it, and it's also going to help me understand whether or not it's a good choice for what I'm building.
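The quorum mechanics described a moment ago (N copies of each key, a write acknowledged after W of them, a read consulting R) can be sketched in a few lines. The class and function names here are mine, purely illustrative, and not any real client library's API:

```python
# Toy sketch of a Dynamo-style quorum: N replicas, a write is acknowledged
# once W replicas have it, and a read consults R replicas and keeps the
# highest version. With W + R > N the read set always overlaps the write set.
N, W, R = 3, 2, 2

class Replica:
    def __init__(self):
        self.store = {}

    def write(self, key, value, version):
        self.store[key] = (version, value)

    def read(self, key):
        return self.store.get(key)

replicas = [Replica() for _ in range(N)]

def quorum_write(key, value, version):
    # In a real system all N replicas receive the write but only W must
    # acknowledge before success; here the last N - W simply lag behind.
    for r in replicas[:W]:
        r.write(key, value, version)
    return True  # acknowledged to the client after W copies are durable

def quorum_read(key):
    # Consult R replicas and return the value with the highest version.
    seen = [r.read(key) for r in replicas[:R]]
    seen = [s for s in seen if s is not None]
    return max(seen)[1] if seen else None
```

Because W + R = 4 > N = 3 here, at least one replica in every read set holds the newest acknowledged version, which is the overlap argument the default two-of-three configuration relies on.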
For instance, a counter that's being updated a billion times a second may not be the most natural fit for a Dynamo-style database, but something that's being updated at the rate of once per second is a great fit. How you model the data to strike that balance will depend strongly on the system you're on and where it falls on that curve. So, there's a lot going on here in slide 28, and I want to make sure we save time for questions, so I'll just show a few things. In that paper, they do a very nice analytic derivation and then look at real-world latency data from Yammer, I believe, before they were bought by Microsoft. At the top they show data for reads; at the bottom, they show data for writes. On the left you have the loosest, single-writer, single-reader requirements: that would be one of three writes or reads having to succeed before you respond to the client in a Dynamo system. That means you have an available, partition-tolerant system, because you're not in the strongly consistent regime; that's where you say, okay, if I'm storing three copies of my data, then I only succeed a write when all three copies are flushed. And then there are different curves here. Different colors represent the storage and network: SSD, which I think is the green triangle, and, at the other extreme, the case where at least one of the copies is across a wide-area network, say in a different data center. On the x-axis here are seconds, so 10 to the 0 would be one second. And then, as your consistency requirements for writes or reads march along, you can answer the question: how long is it going to take for the system to be fully consistent? Say I need a 50% probability of consistency. You would find the 0.5 mark on the y-axis, run that across until you hit a curve, and then look down on the x-axis and read off the seconds, or fractions of a second, that it takes to get there.
The general trends are what you would expect: running locally on SSD is much better than writing across, you know, two different data centers connected by a WAN. And the difference can run from literally a single millisecond to, I guess, hundreds of seconds. Marching forward a slide, you can also apply this to real-world systems and start to put some ballpark numbers on it. From the original paper I quoted, you know, they say you can go off and measure this. Amazon SimpleDB, kind of the first NoSQL database-as-a-service, from very early in their portfolio, was consistent within about 500 milliseconds. Amazon S3, which has a harder job, offering many nines of durability per object, takes about 10 seconds. From the work I just talked about, Riak is generally consistent, depending on, you know, spinning media versus SSD, on the order of tens to hundreds of milliseconds. So those are the ballparks that we're talking about here. So when you're designing one of these systems and thinking about the data model, what I think about is a figure of merit: what I'd like to maximize is the benefit from scalability minus, you know, the cost of being served stale data, which is the rate at which you return stale results times the penalty for each stale result. That's the thing that you need to have in your mind, and a lot of the time this is qualitative. I'll note that if you do have code, that code can be analyzed; I'm not sure if this is a tool yet or still academic research, but it can be analyzed to tell you what the consistency requirements of that code are and to help you match it to systems. So, I've been talking about what systems are and are not eventually consistent; I think that's always important.
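In that spirit of putting a number on "eventual," here is a toy Monte Carlo sketch. The replica-lag distribution (exponential, 10 ms mean) is entirely invented for illustration and is not from the paper's measurements; real numbers depend on your storage media and network, as the curves above show.

```python
import random

random.seed(7)

# After a write is acknowledged by W of N replicas, the stragglers catch up
# after some propagation delay. Question: if I read from R replicas
# t milliseconds after the write, what is the chance I see only stale data?
N, W, R = 3, 1, 1

def stale_read_probability(t_ms, trials=20000):
    stale = 0
    for _ in range(trials):
        # W replicas are current immediately (delay 0); the other N - W
        # receive the write after an exponentially distributed delay
        # with a 10 ms mean (an assumption made up for this sketch).
        delays = [0.0] * W + [random.expovariate(1 / 10.0)
                              for _ in range(N - W)]
        random.shuffle(delays)
        # A read samples R replicas; it is fresh if any has caught up.
        if all(d > t_ms for d in random.sample(delays, R)):
            stale += 1
    return stale / trials
```

Reading immediately after the write, roughly two thirds of single-replica reads miss the new value in this toy setup, while waiting 100 ms makes a stale read vanishingly rare; that shape (steep decay with time) is the qualitative story of the staleness curves.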
My favorite example is something that is eventually consistent in a way we can actually see. (Sounds like we lost audio. Can you still hear me, Shannon? Yeah, I can hear you fine.) That's terrible; I don't know if it's been garbled the whole time. (It's not perfect, but it's workable.) Okay, thanks. It sounds like it's worse for you than for me. So, one example of eventual consistency is the markets. There was an event about 14 months ago where the Associated Press Twitter account was hacked, causing a three-minute drop in the Dow Jones of over 100 points, which must have been billions of dollars; it has been called the most expensive tweet of all time. The interesting thing here is that there was a very real business cost to that, and the time for the market to converge back was about three minutes. There are more examples that we're all very familiar with. A checking account is kind of the most classic, original version of eventual consistency, where you write a check that can be turned into money before either party realizes whether or not you actually have the cash. And, you know, if you look at ATMs, they may not always be fully connected. I don't know, I never worked for a bank, but I think you can hear from folks in the industry that this is one of the reasons you have a daily limit on your ATM withdrawals, so that you can't game the system. For both of these things you need to think about that figure of merit, right: what's the penalty, what's the strategy, and what's the cost of being wrong. And the number one strategy from a developer's perspective here, and I'll lay down a few slides of actual guidance on this, is to adopt the idea of commutativity: the idea of not updating in place, but treating your application as a state machine. Let's go into more detail.
Thinking about that bank account: you may not want a single, you know, float that stores the value of your account. Instead you do exactly what you do with your own checkbook, which is you write the deltas, right? You write the changes: I'm taking X dollars out for this check, I'm taking Y dollars out for that check. And the example shows how you can model this in JSON documents: you can say, you know, take $137 U.S. out of account 1-2-3-4 and deposit it into account 6-7-8. That's the general gist of it. And if you're looking for maybe the most persuasive recent spokesperson for this idea, one who can really bring you up to date, there's a great talk by Rich Hickey called "The Value of Values" from one or two years ago; I think he gave it at a few conferences. He is the author of Clojure, the language, and of Datomic, the immutable database that runs on top of other data stores. And he talks about how some of the reasons we got into this situation are historical, the fact that you didn't used to have large amounts of storage, et cetera, and he gives you a nice walk-through to rethink it from the beginning: how would you do it now? It's not as simple as what I'm describing here, but it's very powerful. And this is one of the key ideas here: immutability. There's a great post, hyperlinked in the slides that we'll send around, from 2007, called "Accountants Don't Use Erasers." Immutability is a concept that's been at the core of functional programming languages for a long time: in a purely functional language, once you assign a value to a variable, you can never update it, right? That's the first thing a language can do to keep you from having, you know, race conditions and data loss. And in general, that's a very important thing as you start thinking about writing programs that scale, literally, with the number of cores on systems.
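As a minimal sketch of that checkbook-style, delta-based modeling (the account number and amounts are made up), notice that because addition commutes, replicas that apply the same deltas in different orders still converge to the same balance:

```python
import random

# Instead of updating a single balance in place, each change is its own
# immutable delta record, checkbook-style.
deltas = [
    {"account": "1234", "delta": 137.00},   # deposit
    {"account": "1234", "delta": -25.00},   # check X
    {"account": "1234", "delta": -12.50},   # check Y
]

def apply_all(entries):
    # Fold the deltas into a balance; order does not matter.
    balance = 0.0
    for e in entries:
        balance += e["delta"]
    return round(balance, 2)

replica_a = apply_all(deltas)      # one replica's arrival order
shuffled = list(deltas)
random.shuffle(shuffled)           # another replica's arrival order
replica_b = apply_all(shuffled)
```

That order-independence is exactly why delta documents are safe under replication, where different machines may see the same writes arrive in different sequences.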
This concept is deep in the core of file systems, too, especially modern distributed file systems like HDFS, where you'll hear the phrase append-only, and it's also bundled into the storage engines of databases themselves. So at the core of, you know, many different database storage engines, and databases like Datomic and HBase and Cloudant, these concepts make sure that you're never updating data in place. You can apply this at different levels, sometimes at the language level, but you can also do it at the data model level itself, as an architect or developer. And that's the part that I want to close with here, and then we can go into Q&A. So I'm going to give you three simple strategies and then talk briefly about future work. The biggest one is this: you don't want to update in place; you want to write only. So, you know, an example would be, if I need to change the value of some field within a document, I can either do a read and then write a whole new version, or I can even just write the change to that value; different systems handle it differently. But with versions, where each update makes a newer version, you can make sure that time is always moving forward, that you will converge to the same state, and that you are building something that is highly scalable under those concurrent writer and reader access patterns that I talked about when describing the Dynamo-inspired systems. As an example to make this very explicit: even deletes are done by writing. A delete writes a special tombstone type of record, in a row or in a new document; that's one of the fundamental mechanisms in, you know, many existing SQL RDBMS systems as well as the NoSQL systems. That's standard stuff. But then, down to more brass tacks, the second strategy, thinking about building a system within, say, Cloudant or Cassandra now: break out many-to-many or one-to-many relationships, using foreign keys and links, right?
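The write-only update and tombstone-delete pattern just described can be sketched like this; the function names and log layout are illustrative, not any particular database's API:

```python
from collections import defaultdict

# Append-only store sketch: every update and delete is a new versioned
# record; nothing is rewritten in place. A delete just writes a special
# "tombstone" record, so history is preserved and time only moves forward.
TOMBSTONE = "_deleted"
log = defaultdict(list)  # doc_id -> list of (version, value)

def put(doc_id, value):
    version = len(log[doc_id]) + 1
    log[doc_id].append((version, value))
    return version

def delete(doc_id):
    # A delete is itself a write: append a tombstone, keep the history.
    return put(doc_id, TOMBSTONE)

def get(doc_id):
    # The current state is simply the highest-versioned record.
    if not log[doc_id]:
        return None
    _, value = log[doc_id][-1]
    return None if value == TOMBSTONE else value
```

Because readers only ever see complete, immutable versions, concurrent readers never observe a half-applied update, which is the property that makes this pattern friendly to the quorum systems discussed earlier.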
Normalizing the data in a standard way is the best way to build a system that can scale and minimize contention when many clients are trying to read or write. This is, in broad strokes, a contrast to the MongoDB strategy, where you tend to bundle things up into rather fat documents and trade those over the wire, versus something like Cloudant or the Dynamo family in general. It also affects the independent indexing options you have. Take something like the relationship between a user and their friends: for people with thousands of friends, that document gets large and expensive to serialize and trade over the wire, and multiple clients trying to update it in place will stress the system; oftentimes that can bring it to its knees. So the second strategy is to move to a more normalized, star-schema-like layout, where you break relationships out into their own typed records at the data model level. Some object-relational mappers will do this for you on top of NoSQL and SQL systems; sometimes you can just do it yourself. Here, the transition from having the one-to-many relationship in a single document to breaking it into multiple documents goes left to right on the screen. I would break up my "mike miller" user document to hold just information about myself and a primary key, the underscore-id (_id). Every time I add a new friend, I just write a new document to the same system. Each of those acts as a link, an edge in the graph: it has vertices saying, okay, the source is user "mike miller," a specific document ID, and this target user, "a-hoff," is a friend of mine. If I have a thousand friends, a thousand updates to my friend list, that's a thousand of these small documents. They're very small.
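A minimal sketch of that left-to-right transition. The document shapes here are made up for illustration; only the `_id` convention is borrowed from CouchDB-style systems:

```python
# One small user document plus one tiny link document per friendship,
# instead of a single fat document holding the whole friend list.
user_doc = {"_id": "user/mike", "type": "user", "name": "Mike Miller"}

def friend_link(source: str, target: str) -> dict:
    # Each friendship is its own immutable document: an edge in the graph.
    return {"_id": f"link/{source}/{target}",
            "type": "friend",
            "source": source,
            "target": target}

# A thousand friends become a thousand small, independent writes.
links = [friend_link("user/mike", f"user/{i}") for i in range(1000)]
docs = [user_doc] + links

def friends_of(user_id: str, documents: list) -> list:
    # In a real system this scan would be a secondary-index query.
    return [d["target"] for d in documents
            if d["type"] == "friend" and d["source"] == user_id]

print(len(friends_of("user/mike", docs)))  # 1000
```

Adding a friend now touches one tiny document rather than rewriting one huge one, so concurrent updates to the friend list no longer conflict with each other.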
And we'll talk about the ordering in which those documents arrive on the next slide as we wrap up. But the important thing is that for many of these systems, and this is where it really gets down into the details of an individual system, the secondary indexing options are critical. If I have a materialized view, I can answer a query like "who are the friends of Mike Miller?", or reassemble the entire user state, with a single query to the system. Cassandra makes that easy with its notion of columns, column families, et cetera. So there are ways you can efficiently get all the data back in a single query without having to walk the relationships explicitly in multiple round trips. You might worry that breaking the data up makes things harder, but I would argue it's often the exact opposite: when you're designing for scale, you want to break the data up into the smallest pieces possible and lean on these secondary indexes to stitch it back together; that makes the problem easier. To point that out, let me return to the banking example. Thinking commutatively is very important, and in the checkbook you're just writing deltas, not new versions of the whole state. Adding a friend is nice, but with the friend relationship from the last slide, we didn't actually have a cumulative operation to answer something like: how many friends do I have in total? That type of aggregation will be very important for your checking account if you want to know the total value there. So I could store my checking account, on the left, as a single document with a list of transactions and a single float for the number of dollars in my account. Or I could have a fairly empty account document, on the right, that just labels the account number and its type, and then each transaction document identifies the amount, the source account, and the target account.
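The account-as-deltas layout can then be read back through a map/reduce-style view. This is an illustrative Python sketch of the idea, not any database's real API; the reduce gives the same answer regardless of the order deltas arrive in, because addition is commutative and associative:

```python
# CouchDB-style map/reduce view deriving an account balance from
# immutable transaction documents. Illustrative sketch only.
deltas = [
    {"type": "txn", "account": "1234", "amount": 137.00},
    {"type": "txn", "account": "1234", "amount": -25.00},
    {"type": "txn", "account": "5678", "amount": 10.00},
]

def map_fn(doc):
    # Emit one (key, value) pair per transaction document.
    if doc["type"] == "txn":
        yield doc["account"], doc["amount"]

def reduce_fn(values):
    # Addition commutes and associates, so the order in which deltas
    # arrived (or are combined) does not change the result.
    return sum(values)

rows = [kv for doc in deltas for kv in map_fn(doc)]
balance = reduce_fn(v for k, v in rows if k == "1234")
print(balance)  # 112.0
```

This is the materialized-view pattern: the mutable-looking "balance" is just a cached reduction over immutable facts.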
I'm going to come back to the ordering question, absolutely. The key point here is that you're then able to get the value of your account via a materialized view. The things to note are what happens to the order of operations, which is why we keep talking about commutativity, and what the actual workflow is for fully accepting a transaction. The account value may change underneath me, just like with an ATM, so in that process I might limit the user from making a large transaction before writing the result back. I can come back to this in the Q&A, but to break it down: you figure out the penalty for being wrong, you make an estimate of how long it will take for the system to become fully consistent, and based on the type of operation, you decide whether or not you're going to let somebody make a transaction that could take the account below zero, and what the penalty for that is. That's very analogous to what happens in the real world, and there's a whole body of literature on it. On thinking commutatively: there's a new line of work around something called commutative, or conflict-free, replicated data types (CRDTs), taking things like addition or shared counters, looking at how you build those on top of modern distributed systems, making them simpler, and wrapping them up in client libraries so that developers don't have to implement that underlying logic themselves. I won't go into the details of those; I'll just note that they're very popular right now in the Riak and Cassandra communities. Looking forward to things I'd love to talk about in the future, hopefully when the audio is working better: there would be more detail about the timing of document writes and reads in that friends example, which takes you into some of the deeper reasoning. I'll also just gesture at the fact that secondary indexes are very powerful in this immutable data-modeling paradigm.
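As a flavor of what those CRDT client libraries hide from you, here is a minimal grow-only counter (G-counter) sketch; this is the textbook construction, not any particular library's implementation:

```python
# Grow-only counter CRDT: each replica increments only its own slot;
# merge takes the per-replica maximum, so replicas converge no matter
# the order (or duplication) of merges.
class GCounter:
    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.counts = {}  # replica_id -> count

    def increment(self, n: int = 1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def merge(self, other: "GCounter"):
        # Element-wise max is commutative, associative, and idempotent.
        for rid, c in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), c)

    def value(self) -> int:
        return sum(self.counts.values())

a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(2)
a.merge(b); b.merge(a)              # merges in either order...
assert a.value() == b.value() == 5  # ...replicas converge
```

Shared counters like this are exactly the shape of problem (likes, follower counts, account deltas) where update-in-place fights the distributed system and commutative merges do not.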
Understanding the interplay between the primary index and the secondary indexes matters, and that's very different from system to system. There are new transactional systems as well, which examine and revisit everything that exists in previous generations of databases and ask: what can we relax, and when can we relax it? There's a really nice set of posts and recordings that will get you reading there if you want to. I'll just note that the Berkeley AMPLab, which stands for algorithms, machines, and people, has really the best writing on this. What they write about there becomes tomorrow's enterprise architecture, and for consistency, here's the link: read it bottom to top and you will be a world expert. I'll close there, say thanks, and hopefully we can do some Q&A.

Thank you, Mike, and apologies for the sound issues; technology is so great when it works. We do have some great questions that have come in, and this has been a great presentation despite the sound-quality issues we had at the top. One of the most popular questions we get is whether people will have access to the slides and the recording of the session. We will be sending that out by end of day Monday, including the additional information Mike has passed along; somebody was asking about the papers you mentioned, so I will add that and get it out to everyone. The first question is: Mike, on slide 29, what is the underlying technology, streaming plus in-memory? What are typical structures of a metamodel? I'm not very fond of relational, but I've seen many relational ones. Would you shed light on how to leverage semantics in a metamodel?

I'm not quite sure exactly what is meant by "metamodel," but I just want to make sure, on the eventual consistency point: are you asking about the numbers quoted for Amazon S3 versus SimpleDB versus Cassandra? I'll take a moment on that, and you can send me another question if not.
If that's the case, I'll just jump into it real quick and say that each of these systems has a different approach, but most of them make use of a tiered hierarchy where you have some type of stream coming in. They differ in places: Cloudant, say, doesn't have a write-ahead log; everything is committed to the append-only file directly, so it's always consistent on disk. Something like Cassandra will have a write-ahead log and can replay from it. But all of these systems use a tiered hierarchy, with hot data in the cache, and an append-only storage engine (LevelDB is at the core of many of these systems) is very cache-friendly. One of the great things about immutability that Hickey talks about is exactly how cache-friendly it is: if you're always writing something new, that data tends to still be in the cache when a write is followed by multiple reads. I'm not sure how to answer the question about metamodels in general, but I'll note that these approaches apply whether you're talking about streaming, in-memory, or on-disk systems.

You understood the question appropriately, going by the initial question, so that was good; that's perfect. And Mike, would you summarize a perspective, or a checklist, for consistency considerations?

I would generally begin by asking users: tell me about the relationships in your data. Do you know if you have one-to-many and many-to-many relationships? That's number one; most data is highly relational in one form or another. Then I ask: okay, in a one-to-many example where you have a user with friends, how many friends does that user have? Is it a hundred? A thousand? How does that impact whether I think about breaking that document up into multiple pieces, or whether we keep it in a single piece? But perhaps most importantly, in a Dynamo-inspired system like Cloudant
or Cassandra or Riak, I ask the question of how often you want to change the state of one of those documents. Then I compare that to the curves on the slide (I forget which, right after this one) showing the coherence time of a system like that. It depends very strongly on the deployment, whether it's multi-data-center, and you want to make sure you're in one of those regions where you're at 90% or above probability of consistency. If not, you'll be in a world of pain: you'll be reading stale data, introducing conflicts into the system, and the system will drag while it tries to repair those conflicts with the various anti-entropy mechanisms underneath. So those are the two big ones, probably three: tell me about the relationships, one-to-many and many-to-many; tell me about the size of those; and tell me how often they're going to change.

Great. And next: how do you really abstract data modeling for these NoSQL databases? All I've seen is a kind of hard-coding of actual values, versus the abstractions used in the relational world.

That's a big question. I'll note that, from the developer's perspective, my bias is very much towards object-relational mapping: I think in objects, or data structures, within the language I'm working in, and then I look for the most efficient way to write, read, and query those myself, or let the system do it for me. A lot of the web frameworks, Ruby on Rails, Python's Django, or whatever you use if you're working with Node.js, will just give you a default mapping of your objects in memory onto documents, or rows and columns, on disk, whether that's SQL or NoSQL. On the NoSQL side, if I want to define things I might, for instance, reach for a Python library, create classes, and define the relationships between those classes, and the ORM will turn those into document types and impose foreign-key management for me under the hood.
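A hypothetical sketch of what such an ORM-style mapping does under the hood; the class and function names here are invented for illustration and do not come from any real library:

```python
# Turn in-memory objects into document records, with relationships
# expressed as foreign-key-style link documents. Illustrative only.
from dataclasses import dataclass, asdict

@dataclass
class User:
    id: str
    name: str

@dataclass
class Friendship:
    source_id: str   # foreign key to User.id
    target_id: str   # foreign key to User.id

def to_documents(user: User, friendships: list) -> list:
    # One document per object; relationships become their own documents,
    # which is exactly the normalization strategy from earlier.
    docs = [{"type": "user", **asdict(user)}]
    docs += [{"type": "friend", **asdict(f)} for f in friendships]
    return docs

docs = to_documents(User("u1", "Mike"), [Friendship("u1", "u2")])
print(docs[1])  # {'type': 'friend', 'source_id': 'u1', 'target_id': 'u2'}
```

Real ORMs and ODMs add identity maps, lazy loading, and query generation on top, but the core object-to-document translation looks like this.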
The NoSQL databases themselves are just getting to the point where the thing to do is to provide tooling for foreign keys; that's one of the things you always got for free with RDBMS systems. As I mentioned, this is all nascent, and I think there's really no replacement here for finding experts in the field to help walk you through the data-modeling stages. The problem you're thinking about has probably been solved, in a pretty language-centric way, on Stack Overflow or somewhere like that. But as for specific tools that let you lay down some schema theory and turn that into a schema or document model, those are very young.

The next question is actually more of a request: can you show the strategy-number-one slide one more time? While you pull that up, I will ask the next question. What are the strategic differences between write-only and logs? What are the best use cases for write-only rather than logs?

I'm going to interpret "logs" as actual log data, say web-server logs or something like that. There does seem to be a convergence between what's happening in a lot of the database technology and file systems, and what actually happens on the media itself, and what's happening in the log world. But there are some key differences. In any of the database systems I talked about, there is the ability to go back and change the state of something that you've written. In a log-based system, as when you're storing time-series data, taking measurements from sensors in your data center and writing them to disk, those values will never be changed, ever. Whereas when we're talking about application state, the state will transition between different states as writes go into the database and invalidate what came before, so you have the ability to write a tombstone and delete things.
Databases in general carry significantly more overhead to give you the flexibility of retrieving data in some order other than log order in time: they're fully indexed, and primary and secondary indexes let you answer complex queries that aren't just based on scans. A large fraction of data, especially as you think about the Internet of Things, tends to be write-once time-series data that will never be superseded by something else, and for that a lot of people are reaching for databases when that is not necessarily the right choice. So what you'll see is systems keeping the same API from the developer's perspective while swapping out the storage engine for something really meant for pure time-series data, in that log-based approach.

There's one more question here for you. Would you share design patterns for indexing? We may not have time for it. It sounds like both columnar and more relational are choices in the same system.

The second part is a little awkward to answer, but yes. The primary and secondary indexing options are different across different databases, even within the Dynamo family: Cloudant is different from, say, Cassandra. They differ in the flexibility to build a new type of index at some point in the future. In columnar databases, and here we're talking BigTable-style columnar, not necessarily Vertica-style columnar, the layout on disk or in memory makes it very fast to retrieve data at a later point in time, but to index in new ways you may have to actually update the schema, in something like the BigTable family. That's the big difference right there, I think: the ability to always build a secondary index at any point in time in the future is what document databases let you do, versus columnar-style databases. Within the NoSQL family, with Cloudant we have the ability to build Lucene-based search indexes, and things like a true columnar layout are probably on the roadmap for this entire family of databases.
But the key point for you as a developer or data modeler is to understand what the system exposes to you and how you can best leverage it. As an explicit example: if you want to do aggregations, those types of CRDTs are very important in the immutable model I was talking about before. I don't have a sharper answer than that, but there's no replacement for making yourself a big cheat sheet. Look at the secondary indexing options, things like projection indexes, compound indexes, and pre-commit versus post-commit index updates; those are really critical things to think about as you take one of these solutions off the shelf and try to apply it to a problem.

This has been a great webinar, of course, even with the sound issues; you're always a great speaker, and we always appreciate the education you can provide our subscribers. So thank you, everyone, for your patience in working through it, and thanks for attending, as always. Thanks for the great participation and great questions throughout the webinar. I will be sending out links to the slides and to the recording by end of day Monday, and they will be posted to DataVersity.net, along with the additional information requested throughout the presentation. And Mike, you just had a request: would you please write a book on this topic?

That's great. We're happy to help. So thanks, everyone. I hope you have a great day, and again, Mike, thank you so much for this great presentation.

Thank you all, and thank you for getting the recording out to everyone. Thanks, everyone. Thank you.