So today I'm going to talk about how you combine a high-performance data platform that is great for transactions with performing analytics on it. Let me ask a question first: how many of you have heard of Aerospike before today? Okay, about 10 or 20 percent. For those of you who are not familiar with it, five years ago Brian and I founded this company to deal with the problem of high-performance read-write transactions, which we thought was going to become really important in the era of big data. This presentation is partly technology, but also our story of how the whole market developed, so feel free to ask questions as we go. The whole motivation for systems like Aerospike is the enormous growth of the internet. The key thing about internet enterprises is how much they have to delight their customers, and customers are impatient; they do not want to wait. It's a well-known observation that if a website gets about 20 percent slower, very soon it doesn't just lose 20 percent of its customers, it loses all of them to some other site. Google and Facebook have used this to train people's expectations. In fact, I've recently been talking with CIOs in financial services, and what they tell me is, "Google has spoiled all our customers." Everybody now expects, even on a financial website, the kind of responsiveness you see on internet applications. So this is extremely important to address if enterprises are to succeed in the new era. And the service can never go down; that's a key part of this. 
Here's a list of enterprises, liberally picked from Aerospike customers, but as you can see they're all well known. In terms of companies in India, there's Komli, there's PubMatic, there's Vizury, there's also InMobi. These are all people who have used our product, but they're also pioneers in their own right on the internet, and they use lots of other interesting technology too. Before I go into exactly what this technology is and how we do transactions and analytics, let's take a look at the entire database landscape. I'm an old-time database person who worked in the area of transactions, and a little bit in analytics, in the old days. What we have now is a counterpart on the right side of this slide: transactions on the top right, and on the bottom right, big data analytics, Hadoop and so on. If you look at the evolution of this market, in the original database era, thirty-odd years ago, the OLTP databases came first, followed by the OLAP ones, as they're called. The reverse happened in the new era: much of the analytics, Hadoop and so on, came before NoSQL and real-time big data actually took off. The interesting thing about the real-time transaction space, where we at Aerospike have been, is the importance of reads and writes, of 24x7 availability, and of running transactions at high throughput and low latency. All of these create a whole bunch of interesting problems, and I'll talk about how we solve them. One obligatory marketing slide. 
Gartner, for example, has been doing analysis of these spaces, and if you look at this quadrant, you probably can't see it very clearly, but there's a whole cluster of NoSQL players here: DataStax, Couchbase, MongoDB. Aerospike is alone in what is called the Visionaries quadrant. You can hardly see it, but I'll post these slides so you can. Why are we visionaries? Hopefully you'll get a sense of that as we go through this talk. So what does a typical real-time database deployment look like? If you look at any of our customers, or for that matter anybody on the internet, it doesn't have to be an Aerospike customer, what they do is run analysis as a batch process. They take all the data, typically logs of user behavior, generate what are called user profiles or other behavioral information, and stick it into a front-end database. Then, at the moment a user is actually at their website, that data is used. For example, take real-time bidding: every time any of you goes to a website, you are actually being auctioned. Each of us gets auctioned: a message goes out saying Srini has arrived on the website, or Sunil, the head of the Bangalore office, has arrived, and then they say, okay, Sunil is interested in fast cars, Srini is interested in fast bikes, whatever. They figure those things out and then show you ads. Not only that, they can work out what Srini has been looking at in the last week. 
Maybe I've been looking at flights somewhere, or my son is in college, so maybe I'm looking at a computer or a phone I want to buy. They know all of that, and they show you things that are relevant. Contrast this with search advertising, which Google pioneered: in search, the context is right there in what people type. In display advertising, as it's called, you don't have any context except what people do. So you have to observe people's behavior, and you have to decide at the moment they are at the website. That's what makes this interesting. This real-time access has to happen within 50 to 100 milliseconds, so you can capture the attention of the user who is at the website at that moment. You cannot do a whole bunch of analysis at that instant; the analysis happens beforehand, and the appropriate segment or behavioral information gets stored here, with algorithms written so that data relevant to that user, and maybe some similar users, can be retrieved very fast. And there's a huge amount of money flowing through this. The thing to realize is that all of these display advertising front ends are revenue critical for the companies that run them. If these systems are down, they don't make money. Financial services and banks are similar, in that if their systems go down they don't make money, but they trade in much higher values per transaction; if you're buying stock, it's not a small amount of money. That's the big difference. But these systems are revenue critical too. 
So what are the key challenges in building a system to tackle such internet applications? Here are the things we decided were important when we started this project. Essentially, you have to handle high rates of read-write transactions over persistent data. This is very important to note: if we did not have to handle the read-write rates the internet demands, the existing old databases would do fine. MySQL can do quite well; Oracle can do fine. That's exactly where the industry was 10 or 15 years ago. All kinds of patchwork was done on those databases, including some by me when I worked on them, to make them incrementally better. But the growth of the internet has pushed the requirements up by several orders of magnitude, and Moore's law alone won't close that gap; you have to change the basic algorithms and essentially rewrite the database. You've also got to avoid hotspots. To run a system that keeps running over time without problems, you have to avoid hotspots. If you've ever driven cross-country, you know that if you drive long enough, the car eventually breaks down somewhere, because some weakest link breaks first. That's a hotspot. A hotspot is easy to create in an online system, and once created, it will bring the entire system down; it will not stay localized, and that's very important to avoid. Next, and this comes from the database background some of us in the company have, including Sunil: we really believe a database needs to keep data consistent. It is important to be high performance. It's important to be scalable. 
It's important to be reliable and up all the time. But what is most paramount for a database is that it never loses data, and that makes things very difficult. If you look at some of the early NoSQL work, even the Dynamo paper, the whole premise is: if you want high performance and availability, we're going to give up consistency and give you eventual consistency instead. That's fine for a class of internet applications. But if you have to cross over from internet applications into standard enterprise applications that people's livelihoods run on, banks, telcos, networking, you really need to solve the consistency problem at high scale. On this list, the first three points are about application development: you want to handle them so that applications are easy to write. The next three are about service and operations. The bane of databases that run for a long time is maintenance tasks. You have to rebalance indexes, move data, add nodes to clusters. Those are all long-running tasks, and they interfere with short-running read-write transactions, the kind I talked about, where a user is at a website and within a few milliseconds you have to work out what the user is doing so you can show them an advertisement. That kind of work gets in the way if you've just added 10 nodes to the cluster and the system is busy moving all the data around. What happens to the transactions? How slow do they get? The important thing is not to let them get slower than what is acceptable, and we've proved you can do this over time. Scaling linearly is, of course, a given: you add nodes, you want to scale. 
The last thing you want is to add a node and create a bottleneck because the moving of data around doesn't work well. So adding capacity with no service interruption is another critical thing. The internet never sleeps, never stops, never goes down. You never know when a spike is going to come; you just have to deal with it. So, having given you the overview, I'm going to spend the next part of the talk on how we do small read-write transactions with 100 percent uptime. That rumbling sounded like an earthquake; I'm from California, so I'm very sensitive, you should all dive under the table. Okay, it stopped. So next I'm going to talk about how you build a system that performs high-throughput read-write transactions, and then how to extend that to what I call hot analytics: analytics on your read-write data. The basic idea for making sure a system never goes down is keeping copies of the data in multiple places. That's the fundamental thing. But how do you do that? The way we've split the problem is that we build clusters that are tightly coupled, as we call it. This is important because each of the cluster nodes is identical, so we can just add a new node and it will join the cluster. We make sure data is replicated synchronously within the cluster. And that's not enough. Say there's a cluster in a data center in Bangalore. You want a data center in Singapore as well, for some level of availability if the Bangalore one goes down: a power outage, or an earthquake, actually. 
The point is, if that cluster goes away, the other cluster needs to be able to take over. And that's what we built: a robust single-cluster design with synchronous replication, so that if nodes go down within a cluster, it continues to run safely; and if the whole cluster goes down, you can fail over across data centers. That's what is shown here. This is one of the first deployments of what is called our cross data center replication solution, developed out of Bangalore, actually. There is one cluster in Amsterdam, another in Ohio, another in Texas, another in San Jose. And if you look at it, there are different rings here: some data moves to all four clusters, some data is replicated in only two of them, other data in three. All of that flexibility is important for running such a system. So this is how we slice the pie: synchronous replication within the cluster, with homogeneous nodes, and asynchronous replication across data centers. That works and solves the practical problems of most deployments. Now, how do we distribute data so that we don't have hotspots? It's important to have a really robust hash function to eliminate hotspots. If you think about it, the Aerospike primary index is basically a DHT, a distributed hash table. We use the RIPEMD-160 hash function, which to this point has no known collisions; there are whole papers written on this. What that means is you can hash virtually anything and the digests of different keys will be different. That's really valuable for us. In fact, we rely on it by not storing the key itself in the database at all. 
We have since added a feature for storing the key as well, if the user wants to be really sure, say for financial transactions, but it is typically not necessary for advertising use cases and so on. The way the system works is: we take the key and produce the hash, and from the hash we derive a partition ID. The key space is split into a number of partitions; in our case, 4,096 partitions. Once the key space is split into those 4,096 partitions, we can map each partition to a node. What is shown below is a table, the partition map, which maps partitions to nodes and distributes the partitions in a uniform way across them. The table shown here is a two-copy system. If you want to store more than two copies, there will be one more column, telling you which node the third copy of a particular partition is on, and the whole algorithm works seamlessly. Yes? The question was: are partitions uniformly distributed across nodes? If you look at this table and go down a column, each node appears an equal number of times in that column, so the number of partitions on a node is approximately the total divided by n, where n is the number of nodes in the cluster. That's actually a great question. Our clusters are typically between one and 100 nodes. No, this will not scale indefinitely in node count; for much larger clusters you would start with more partitions, say 20,000. Yeah. 
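The digest-to-partition-to-node scheme just described can be sketched as a toy model. The helper names and the round-robin placement below are invented for illustration (Aerospike derives its placement differently), and SHA-1 stands in for RIPEMD-160 since RIPEMD-160 availability varies across Python builds; what matters is the balance property the question was about.

```python
import hashlib

N_PARTITIONS = 4096  # Aerospike's fixed partition count ("4K partitions")

def partition_id(user_key: bytes) -> int:
    # Hash the key to a digest, then fold the digest down to one of
    # 4,096 partition IDs. (SHA-1 here is a stand-in for RIPEMD-160.)
    digest = hashlib.sha1(user_key).digest()
    return int.from_bytes(digest[:2], "little") % N_PARTITIONS

def build_partition_map(nodes, copies=2):
    # Toy placement: master and replica for each partition chosen
    # round-robin, so every column of the map is evenly balanced.
    return [
        [nodes[(pid + c) % len(nodes)] for c in range(copies)]
        for pid in range(N_PARTITIONS)
    ]

pmap = build_partition_map(["A", "B", "C", "D"])
masters = [row[0] for row in pmap]
# Each of the 4 nodes masters exactly 4096 / 4 = 1024 partitions.
print({n: masters.count(n) for n in "ABCD"})
```

Going down the master column, each node appears the same number of times; a third copy is simply one more column in the map.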
If you're going to do a cluster of that size, which might actually be necessary for a petabyte database; we are basically in the dozens-of-terabytes range, and I'll talk a little about how we store data, so you'll see how far a 100-node cluster goes. You can handle pretty much all of North America's ad traffic in something like a six-node cluster. Some of these things are really powerful. Okay, so what do we have to do to write with immediate consistency when there are multiple nodes in the cluster? Say you end up with a master and a replica. We've already spread the data into partitions, so given a key, we can hash it to the digest and determine the partition. Once we know the partition, we look at the partition map and know which node holds the master and which node holds the replica. The write goes to the master. Then we do record locking, just like classic databases do, so if multiple writes come in for the same record, they serialize. That's important. Then the write is applied synchronously to the in-memory copy on the master and also, synchronously, to the in-memory copy on the replica. That's important. Then we queue the operations to disk and go back to the client. The client only gets an ack when both copies are written. That's important, and that's the classic write case. Now, what happens when you add a node? That is actually a much harder case, because when you add a node, you have to keep the transactions going while the cluster discovers the new node. 
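The synchronous write path just described can be sketched with a toy in-memory model. The class and method names are hypothetical, not the real server code; the point is the ordering: lock the record, write both in-memory copies synchronously, queue the disk write, then ack.

```python
import threading
from collections import defaultdict

class ToyReplica:
    """In-memory stand-in for a replica node."""
    def __init__(self):
        self.store = {}

    def write(self, key, value):
        self.store[key] = value

class ToyMaster(ToyReplica):
    """Master for a partition: lock, write both copies, queue disk I/O."""
    def __init__(self, replica):
        super().__init__()
        self.replica = replica
        self.flush_queue = []                       # drained to disk later
        self.locks = defaultdict(threading.Lock)

    def put(self, key, value):
        # Record lock: concurrent writes to the same record serialize.
        with self.locks[key]:
            self.write(key, value)                  # 1. master's memory copy
            self.replica.write(key, value)          # 2. synchronous replica write
            self.flush_queue.append((key, value))   # 3. disk write queued
        return "ack"  # client is acked only after both in-memory copies exist

replica = ToyReplica()
master = ToyMaster(replica)
print(master.put("user:1234", {"city": "BLR"}))  # prints "ack"
```

By the time the client sees the ack, both the master's and the replica's in-memory copies hold the value, while the disk write proceeds in the background.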
One of the things we've done is make node addition very efficient, so the new cluster gets constituted really fast. We make sure there isn't much of a glitch in the transaction load as a node comes up or drops out. To some extent, node addition is harder than a node dropping, if you think about it. Yes? No, we have a fixed set of partitions in the system; that was the question earlier. If you had to run much larger clusters, you would start with a larger partition count. When you add a node, the number of partitions in the system stays the same; the partitions just get mapped differently. You don't increase the number of partitions when you add a node. The mapping will change, exactly, and that change of mapping happens in a very short amount of time: all the cluster nodes agree on the new mapping. Then there is the current state of the cluster in terms of partitions, and the new state where all the data has to go, and data starts migrating around the system. In this case, there's a three-node cluster and a fourth node is added, so one quarter of the partitions have to move from the first three nodes to the fourth, so that each node ends up with 25 percent. That is the work that starts to happen. The important thing is that you have to do this while the partitions continue to handle reads and writes wherever they currently are. The tricky part is that you have the current map and the new map, and the client keeps track of where everything is, so you can move things over efficiently. The question was: do writes continue during rebalancing? Nothing ever stops in Aerospike. Things could slow down. 
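The arithmetic of that rebalance can be checked with a small sketch. The placement function here is invented (Aerospike computes placement from the partition table), but the property it illustrates is the one above: only the new node's fair share of partitions moves, a quarter in the three-to-four-node case, taken roughly equally from each existing node.

```python
N_PARTITIONS = 4096

def add_node(old_map, new_node):
    # Minimal-movement rebalance sketch: the new node takes an equal
    # share of the partitions; every other partition stays put.
    n_after = len(set(old_map)) + 1
    quota = N_PARTITIONS // n_after        # partitions the new node should own
    new_map = list(old_map)
    donors = list(range(0, N_PARTITIONS, n_after))  # spread takeover evenly
    for pid in donors[:quota]:
        new_map[pid] = new_node
    return new_map

# Three-node cluster, partition p owned by nodes[p % 3] (toy placement).
old = [["A", "B", "C"][p % 3] for p in range(N_PARTITIONS)]
new = add_node(old, "D")
moved = sum(o != n for o, n in zip(old, new))
print(moved, moved / N_PARTITIONS)  # -> 1024 0.25
```

Exactly 1,024 of the 4,096 partitions move, and each of A, B, and C donates about a third of them, which is why the incoming transaction load barely shifts during migration.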
You may see a little extra latency during rebalancing; you pay a small price in response times, but you can keep it under control by balancing the priority of the data migration. Make the migration priority lower and the migration takes longer, but you get better latency. Question? Yes? That's a very good point. No, it is not flushed to disk before the ack, and that is deliberate: if you waited for the disk flush, we could still provide the same throughput, but the latency would not be good enough. We could do it; it's just not something we've been asked to do. Yes? On both copies, that's important. There is no durability if you write to memory on only one copy; the D of ACID doesn't work unless you write multiple copies. We don't flush synchronously today, but every time we've had this discussion, including with financial services, I've been able to convince them, and it's not just me. Michael Stonebraker, if you're familiar with databases, has made the whole k-safety argument: if you want more durability, keep three copies, because the odds of all three nodes dying at the same moment, before that particular piece of data has reached disk on any of them, are infinitesimally small; with three copies it's pretty much non-existent. Oh, that's... well, the interesting thing about our key digest is there is nothing to be maintained, because it's a function. You give me a key, I can tell you the digest; the client can compute it, the server can compute it, a backup tool can compute it. It is based on a one-way hash function, RIPEMD-160, a very robust hash function. That's what we do. Yes? We move the partition data around. Is that the question, or have I not understood? No, we do move it. 
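The k-safety argument is just a product of independent failure probabilities. A back-of-the-envelope sketch, where the per-window failure probability is a made-up number purely for illustration:

```python
# Probability that a node dies during the short window before a given
# write reaches disk on that node; 1e-4 is an invented illustrative value.
p = 1e-4

# With k independent in-memory copies, the write is lost only if all k
# holding nodes die inside that same window.
for copies in (1, 2, 3):
    print(copies, p ** copies)
```

Two copies take a one-in-ten-thousand chance down to roughly one in a hundred million, and three copies to roughly one in a trillion, which is the sense in which simultaneous loss of all copies becomes pretty much non-existent.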
That's the whole point: as soon as the new node comes up, the cluster is alive, read and write transactions continue, and we move the partitions along in the background. We published a paper on this in VLDB a couple of years ago, three years ago I think, explaining it. One of the key things about Aerospike, which we discovered very early, is that it is important to have a smart client. It is very, very important that there be exactly one hop between the client and the node where the data is. So in steady state, when there is no flux, the partitions have all moved and the cluster is stable, Aerospike scales linearly. You can just prove it; nothing gets in the way except the amount of network bandwidth you have. The trick when you add or remove a node is to make sure you give the system enough capacity that things can stabilize and return to steady state as fast as they can. Those are the tough algorithms; it's all about real-time prioritization, and we patented a bunch of things there. As for the partition map: there are two protocols between the client and the server. One is the data protocol, which goes really fast. The other is the info protocol, as we call it, where the server exchanges information with the client: here is where all the partitions are. It's important for the client to pick up this information fairly soon, but not instantaneously, so it polls maybe once a second or once every two seconds. That's good enough. During the window when the client doesn't have the right partition information, it can send the request to any node in the cluster; since the cluster has the right information, that node can proxy the request, as we call it, to the node holding the data. It's more expensive because there's an extra hop. 
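The one-hop versus proxy behavior can be sketched with a toy model. The classes and hop counting below are invented for illustration, not the real client protocol; the point is that a stale cached map costs one extra hop until the next info-protocol refresh.

```python
class ToyCluster:
    """Authoritative partition map: partition ID -> owning node."""
    def __init__(self, pmap):
        self.pmap = dict(pmap)

class SmartClient:
    """Caches the partition map (refreshed via the info protocol,
    say once a second) so steady-state reads take exactly one hop."""
    def __init__(self, cluster):
        self.cluster = cluster
        self.cached = dict(cluster.pmap)

    def refresh(self):
        # Info protocol: pull the current map from the cluster.
        self.cached = dict(self.cluster.pmap)

    def read(self, pid):
        guess = self.cached.get(pid)
        owner = self.cluster.pmap[pid]
        if guess == owner:
            return owner, 1   # one hop, straight to the data
        # Stale map: whichever node we hit proxies to the real owner.
        return owner, 2       # extra hop until the client refreshes

cluster = ToyCluster({0: "A", 1: "B"})
client = SmartClient(cluster)
cluster.pmap[0] = "C"      # partition 0 migrates to a new node
print(client.read(0))      # stale cache: proxied, two hops
client.refresh()
print(client.read(0))      # back to a single hop
```

If every request were proxied this way, the extra hop would become a bottleneck, which is why the client refreshes its map and drops back to one hop as fast as it can.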
If everything were proxied, that would create a bottleneck, so we try to get out of proxy mode as soon as possible. Okay. Another key thing about Aerospike: if you think about its key contributions, you can divide them into three or four things. One is flash optimization; another is the clustering I talked about; there are more in a second. But this is an extremely important one. Adding flash to any database makes it go a few times faster. People do that with Oracle, with MySQL, with Mongo; you can do it with any database. So what did Aerospike do differently? What we realized when flash devices started being adopted, about five years ago, was that we could rewrite the data layer of a database to make it go a hundred times faster than traditional databases. So we invested a lot of time there. This slide isn't showing well, but all it says is that the current database stack has a file system, a page cache, and a block interface going to SSDs and rotational disks. Current database systems have a whole lot of machinery in them designed to work well when disks are really, really slow; the whole idea of the traditional architecture is to get data into memory and then operate on it. What we realized was that if we architected the database in a parallel way, we could execute directly out of SSDs using our parallel algorithms. 
Once you do that, you get a system that is a hundred times faster. It is not magic; there are algorithms involved, and that is what you see in our performance numbers and why we are so much better. The other NoSQL systems all say they are an order of magnitude faster than traditional databases, and that's all true. But we are an order of magnitude faster than that; that's the takeaway. There is also an interesting side effect of being high performance on SSDs, because SSDs have a cost advantage. It is again hard to see here, but this is a two-million-dollar system with main memory, versus an equivalent system at only about two hundred and sixty thousand dollars, and this comparison was actually done by one of our customers. You can get a roughly ten times cheaper deployment at ten times higher scale. The founding principle of Aerospike was that we would build technology so that people other than Google could reach Google scale without putting in the effort Google had to put in in the first place. And the SSDs themselves are not us; the industry produced them, with billions of dollars of investment. 
We have written software to leverage them in the best possible way, and we already have three or four customers who have reached Google scale: AppNexus, for example, eXelate, a whole bunch of people over the last four years using this kind of technology. There is no economic way such companies could have reached that scale and competed while Google and Facebook kept growing at the same time, so I think it's an important enabler. There are a bunch of companies in India benefiting from this kind of technology too: InMobi, Snapdeal. Snapdeal is going to be huge, like Alibaba, and Flipkart is going to do the same thing. They are going to need technology like this to speed up their deployments, because otherwise you lose your customers, and you just can't afford that. So what can you do with all this? A few slides on that. You can do one million transactions per second. I think Raj is sitting out there; there is a blog post coming out on Monday. Raj is one of our key engineers in Bangalore, a performance expert; you might want to talk to him in one of the breaks and ask him how he got one million TPS on a single server. Performance is something we continue to work on, but more important is doing it with native flash: the first slide's one-million number was in memory, and this flash slide is actually an older one; I think we have a revised one coming out on Monday, right, Raj? But you can see we are an order of magnitude better than other databases: Mongo, Couchbase, Cassandra and so on. The important thing, though, is not that we are faster, which by definition follows from what we did; the important thing is this one: the latency at high throughput. That is the hallmark of Aerospike. You run Aerospike, you will get that kind 
of flat latency day in and day out, month after month, year after year. When a node goes down there will be a little glitch and then it sticks right back to that level. That is hard; it is based on the real-time prioritization algorithms and so on that we had to build to make it happen. Admittedly, we have done all this on a simple API, key-value. We wanted to make sure we got the platform right; we didn't want to go build a whole bunch of APIs at the same time as the platform. Now that we have the platform right, I am going to talk about the analytics portion. This slide I will probably skip; all it says is that you can run the system, take a node out, and the other nodes immediately take over the basic transactions, seamlessly. Let's go a little into analytics. So far I have talked about how you do read-write transactions and data rebalancing; we also do fast writes using a log-structured file system, with defragmentation in the background. So that is how you do high-throughput reads and writes on a system that never stops. Can you do more than that? That is what we have been doing over the last year and a half. Everything I have talked about so far is about four years old, but for the last year and a half we have worked on adding features like secondary indexes. You cannot build real queries without efficient secondary indexes; you could do a scan, but that is not really a query. If you look at the database literature from the beginning, indexing always beats brute-force parallelism. It is almost a given: every query is interested in a certain aspect of the data, and if you can plan a little and build an index on the things that matter most to your applications, you always win. However, many times you are not 
able to think through in advance which aspects you will want to query, so you should be able to build indexes on the fly. Say you think of a new query: instead of scanning the database for a day, you could spend that day building the index, and then reuse the index every day after that, every second after that for all we know. That is the trade-off. So we do all the standard things, and the hard parts are the same ones I talked about for primary keys: we keep the index in memory and the data on disk, the whole SSD innovation I talked about, and the secondary index is also stored in DRAM. You have to make sure the index is balanced across the cluster, and the nice thing is that our data is distributed using a hash table with a robust hash function, which means it is already balanced really well. We co-locate the secondary index with the data on every node, which means that to run a query we send it to every node and each one executes it locally, so you get the benefit of parallelism. This indexing is therefore optimized for what are called low-selectivity, or high-cardinality, indexes: a query on the secondary index returns a number of records. If lookups are strictly one-to-one, you can do a better job with the primary-key system we already have; just so you know, that is what it is optimized for. So parallel processing works really well, the index is co-located with the data as I mentioned, and we also guarantee ACID properties: because the index is co-located on the same node as the data, we can give you a consistency guarantee that would otherwise be hard. As I said, for low-selectivity index queries, the query is sent to every node in the cluster, each node returns its data, and the client puts it together. And the interesting thing 
The interesting thing is that if you have multiple SSDs per node, we can read them in parallel. If you think about it, there are multiple cores in the machine and multiple disks in the machine, and all of them go in parallel. So if you had to get, say, a million keys back, and you had a 10-node cluster with 10 SSDs per node, you basically get 100-way parallelism right there.

Okay, here is an interesting thing. Everybody says "it's a NoSQL database"; what exactly does that mean? We use the NoSQL label too, but the way we see it, SQL is just a query language, which frankly you can add on top of any database. If anything, what all the new NoSQL databases are saying is that the traditional databases, which happen to be relational, have not redone their data layers and platforms to get the best performance out of the newest technology: the cores, the SSDs, the networking, everything else. That is what the whole NoSQL group has been doing. SQL happens to be just a query language from our point of view, and we can do SQL-like queries using the system I just talked about.

We're going to have a workshop this afternoon where we'll talk about how you write these queries. We use Lua as the programming language for user-defined functions, with which you can write map-reduce-style functions or other kinds of aggregations. As this slide shows, you can use a secondary index, you can do a range query on time if you want, and you can do arbitrary group-bys, order-bys, offsets, whatever you want: you just write Lua code and use the secondary index to limit the amount of data processed locally on each node. The data is then sent to the client for one final step of processing; for a group-by, for example, all the groups are sent to the client before the final merge, and then you get the answer.
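The group-by flow just described, per-node partial aggregation (which in Aerospike you would write as a Lua UDF) followed by a final merge on the client, looks roughly like this, sketched in Python with invented data rather than the real UDF API:

```python
from collections import Counter

# Per-node step: aggregate over the node's local slice of the data.
# In Aerospike this step would be a Lua user-defined function on the node.
def node_group_count(local_records, key):
    counts = Counter()
    for rec in local_records:
        counts[rec[key]] += 1
    return counts

# Client step: merge the partial groups sent back by every node.
def client_merge(partials):
    total = Counter()
    for p in partials:
        total.update(p)
    return dict(total)

# Three nodes, each holding a slice of an imaginary events set.
node_data = [
    [{"country": "IN"}, {"country": "US"}],
    [{"country": "IN"}],
    [{"country": "US"}, {"country": "IN"}],
]
partials = [node_group_count(rows, "country") for rows in node_data]
print(client_merge(partials))  # {'IN': 3, 'US': 2}
```

Each node only ships back its small table of partial groups, not the raw records, which is why the final merge on the client stays cheap.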
Question from the audience: do the secondary indexes have to be created using a separate tool or management interface? What we have is a command you give to the database, just like in a normal relational database, and we create the index dynamically in the background. Once all of the nodes have built the index, and the system knows when that has happened, it makes the index available and tells you so, and then you can run queries against it. So it's completely dynamic.

Okay, the question is what happens when we start building an index while more data is coming in. When you start building the index, first of all it starts a scan, to build the index over all the records in the database on all the nodes. At the same time, it also turns on a bit somewhere saying "this index is open for writes", so every time an update happens, the system creates the corresponding index entry as well. That way, by the time the scan is completed, every update made since the build started is properly represented in the index. During the scan you may re-scan a record for which an update just added an index entry, and the system makes sure all of that is reconciled. So at the end of the scan your index is complete.

The final join happens in the client; it has to happen in the client, and that's an interesting point. You could do it on one node in the cluster, but I go back to the hotspot issue I mentioned. It's better to do it outside the cluster, because our cluster cannot afford to ever, ever go down because of a hotspot. So we do local processing on each node, and we don't gather all the data from every other node onto one cluster node, because that node then becomes special: it's doing more work and it starts slowing down. We don't do that; we just keep the load extremely well distributed.
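The index-build protocol described above, a background scan plus an "index is open for writes" bit, works because adding the same index entry twice is harmless. Here is a toy Python model of that idea (the class and field names are invented; this is nothing like the real implementation, just the reconciliation logic):

```python
# Toy model of building an index while writes keep arriving: the scan and
# the write path may both insert the same entry, and set semantics make
# that reconciliation automatic.
class Table:
    def __init__(self):
        self.rows = {}              # record_id -> value
        self.index = None           # value -> set of record_ids
        self.index_writable = False

    def write(self, rid, value):
        old = self.rows.get(rid)
        self.rows[rid] = value
        if self.index_writable:     # the "open for writes" bit is on
            if old is not None:
                self.index.get(old, set()).discard(rid)
            self.index.setdefault(value, set()).add(rid)

    def build_index(self):
        self.index = {}
        self.index_writable = True  # from now on, every write indexes itself
        for rid, value in list(self.rows.items()):
            # Re-adding an entry a concurrent write already made is a no-op.
            self.index.setdefault(value, set()).add(rid)

t = Table()
for i in range(5):
    t.write(i, i % 2)               # writes before the index exists
t.build_index()
t.write(5, 1)                       # a write arriving around the build
print(sorted(t.index[1]))           # [1, 3, 5]
```

The key property is that the scan and the concurrent write path can both touch the same record and the index still ends up correct, which is what lets the build run fully online.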
If every node is evenly loaded, you're okay; if one node is heavily loaded and everybody else is light, you don't know when the next request is going to hit it. So we avoid that, and the final step is always done on the client.

Let me go through a few more slides. The one thing I want to say is that we are not a column database; that's why this slide exists. People say, "so you do analytics", but we don't do it that way: we are row-based, and we believe we can do a lot of interesting queries using row-based algorithms. Some of these we're going to go over this afternoon.

Here is an analytics query. It looks like SQL, but it's a wrapper; as I said, SQL is just a query language. You can take SQL and convert it into Java, and that's what we've done here: we take a SQL-like string, generate Java, and execute the query. For example, we'll show this afternoon how you run a query over a whole bunch of airline arrival data to find out which airlines had late flights. It ran in about 0.5 seconds; I think we probably need to improve that even more, and we will.

For the workshop, do you need internet access? Sunil, do they need internet access for the workshop? The answer is no.

Yes, we do that: we have special threads for queries which return more versus less. Yes, we have to. In fact, that's one of the things we found; Raj is smiling there, I remember Raj and I staring at it one day, maybe five months ago.

The only point to make here is that you can do these analytics while you are adding servers and rebalancing. All of the things I talked about for single-record transactions also apply to analytics, so we are an always-on system.

I'm going to skip past these lessons, or maybe I'll look at a couple of them. One of them is that you've got to keep the architecture simple when you do any of these things.
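The airline query mentioned above can be sketched as a one-pass, row-at-a-time aggregate; the real demo compiles a SQL-like string down to Java, but the equivalent idea in Python, with invented data and column names, looks like this:

```python
# Conceptual version of the "which airlines run late" demo: a SQL-style
# aggregate executed as plain row-at-a-time code over a row store.
flights = [
    {"airline": "AA", "arr_delay_min": 45},
    {"airline": "AA", "arr_delay_min": 5},
    {"airline": "BB", "arr_delay_min": 70},
    {"airline": "BB", "arr_delay_min": 60},
    {"airline": "CC", "arr_delay_min": 0},
]

# SELECT airline, AVG(arr_delay_min) FROM flights
# GROUP BY airline HAVING AVG(arr_delay_min) > 30
sums, counts = {}, {}
for row in flights:                       # one pass over the rows
    a = row["airline"]
    sums[a] = sums.get(a, 0) + row["arr_delay_min"]
    counts[a] = counts.get(a, 0) + 1

late = {a: sums[a] / counts[a] for a in sums if sums[a] / counts[a] > 30}
print(late)  # {'BB': 65.0}
```

This is the sense in which a row-based system can still serve analytics: the group-by needs only one pass and a small table of running totals, no column store required.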
The problem you're solving is already fairly complex; the last thing you want is for the architecture to be complex too, because that becomes another problem in itself. This is something we learned. This slide basically says: just don't use a DHT. On the partitioning question somebody asked about: we want the partition count fixed for the cluster, and that simplifies a whole bunch of things. We also want every node to be identical; it makes sure the whole thing keeps running when you add a node, and there is no confusion about one node having twice the capacity of another. All of that goes away, from an operational point of view and also from an algorithmic point of view.

Sizing clusters properly is another thing I want to point out. It's easy to make any system perform poorly if you use it in a way it's not supposed to be used, so you want to size the system such that it will run forever.

The only other things I want to point out here are two lessons from running a service. First, everything you know, everything I know, is going to be wrong pretty soon, maybe even in a day, so you might as well keep your mind open and plan as much as possible. Second, use the right data management tool for the job. One thing I want to emphasize is that Aerospike is not Hadoop. Aerospike is for real-time data; you can do analytics on the real-time data, but you don't want to misuse it by loading a whole bunch of historical data into it. We would probably be faster than something like Hadoop, but it doesn't matter, because what is the difference between 12 hours and 3 hours? It's just hours, right? What we want to do is things that finish within a second when you're doing something really substantial, or within milliseconds for read-write transactions. So use the right tool for the job. I'll skip past this; this is
just showing what the architecture looks like, along with a list of the features we have. We'll talk a little bit about it in the workshop this afternoon, so I can probably skip this. Thank you for attending this on a Saturday.