All right. Welcome. This is NoSQL with No Compromises, and it's targeted not so much at the broader sense of NoSQL that would include Hadoop, although we, GigaSpaces, have a very good integration story there as well. This is more about big data NoSQL databases: large clustered systems that try to emulate a conventional database to some degree and fall short in some ways in order to achieve scalability, performance, reliability, and so on. Those are the compromises. But before I get to all that, let's move along.

Okay, me: Dwayne Philpe, 25 years and counting in the industry, mostly on the development side, in various roles. Most recently I was an independent consultant partnering with GigaSpaces, and then I joined them. Now I'm a technical account manager, which means pre- and post-sales engineering, proofs of concept, customer architecture, and the occasional speaking engagement like this.

More importantly, I'd like to get an idea of the audience. How many of you are exploring NoSQL but not actively using it at the moment? All right, so I won't assume too much here. And how many are planning, meaning you're in pre-production or you're seriously going to implement NoSQL? Okay, great. And how many are actually using NoSQL in production, something like Cassandra, HBase, Mongo, et cetera? Okay, great. So let's proceed.

So why are we driven to NoSQL in the first place? Limitations of relational databases, and the biggest reason is bigness. Typically, from what I've seen, we've outgrown the data sets we can comfortably store in a relational database and we want to push the envelope, sometimes not by as much as the typical use cases that get held up, like Facebook or Twitter; few of us are dealing with data sets quite that large. We're looking for alternatives to handle more data, perhaps to crunch very large data sets from our website operations and so forth. We're attracted to easy scaling, as opposed to the cumbersome clustering technology in conventional SQL databases. Scaling goes along with bigness, of course, but elasticity implies scaling at runtime: growing a cluster to handle very large scale, scaling up as well as scaling down.

High availability: NoSQL solutions like Cassandra, MongoDB, HBase, or Riak are very highly available; you're not reading through a single point of failure. Reliability and self-healing go along with that: in a large cluster, if a node fails, data is replicated across the cluster. Actually, that replication is the source of some of the limitations we face with NoSQL, but on the plus side, these databases are highly reliable and the data gets moved around as needed.

Extreme write capacity: you can be driven to NoSQL by the need for high write capacity. This varies, but again it goes along with the parallelism of having a large cluster of nodes, all of which can write independently. Of course, the parallelism that produces that extreme write capacity ultimately leads to some of the compromises I'll be talking about.

And a flexible data model is another thing, although I'm not certain people are really being driven to NoSQL because of a flexible data model. Is that a prime driver for you?
Who would say a flexible data model is a prime driver? That's what I figured. It is a great virtue, but not necessarily what actually pushes people over the edge to get beyond the relational model.

So, the compromises. The fact is that NoSQL databases don't have transactions. There's a certain category of applications that absolutely must have transactions, and these are typically eliminated from consideration as applications that can use a NoSQL database. A typical example would be a financial application that needs to credit and debit accounts simultaneously. Without transactions, readers will obviously get inconsistent views, although that's a separate topic from read consistency. As I pointed out earlier, I'm not a DBA; my background is mostly on the development side. But as far as transaction isolation goes, you're essentially only guaranteed atomic writes on individual keys in these databases, not across multiple keys.

Read consistency. This is one of the bugaboos that comes from having high availability and high write scaling. In most of these databases, consistency is now tunable to some degree. Read consistency simply means that when you write data into the database, a reader is not guaranteed to read the value you just wrote, even if you wrote it to a single key. There are many, many applications where that's fine: it's fine to wait for information to propagate across a cluster, and read consistency doesn't matter. But there's also a large set of applications that require it, and that are built with that expectation in mind. There are ways around it that involve retooling your data model and architecture.

Okay, no stored procedures. Typically there's no logic being executed in the database. You could argue that's a virtue, right? In my career, putting logic in a database was always viewed as not the greatest thing in the world to do. But the fact remains that many, many systems do have logic embedded in the database in the form of stored procedures, and you simply don't get that with a NoSQL database. Of course, no triggers either; that goes along with no stored procedures. You're not going to get events; you're not going to get notified when something happens. And security, while it's being nibbled at around the edges, is immature at best; you're not typically going to find sophisticated role-based security or Kerberos on a NoSQL database.

A question from the audience: some NoSQL products do offer transactional consistency, so what about those? Well, the only way you're going to achieve that, and this is what I was getting at with read consistency, is that it's tunable. You can select consistent operations in several of these NoSQL databases, but performance is killed by it: you'll be waiting for replication across the network to multiple nodes, and there may be several hops involved, even on writes. So you can circumvent it, but at a performance cost. And part of the solution I'm presenting here, if you've read the description of the talk, is not just the provision of an ACID transactional layer, but also extremely high performance.

So the bottom line, even from the vendors themselves, is that there are limited use cases. They're usually limited to use cases that do not need consistency and transactions: statistical applications, data mining, and so forth. And that's fine.
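To make the "tunable consistency" point concrete, here is a minimal sketch of the quorum rule that underlies it in replicated stores like Cassandra. This is my own illustration of the general technique, not any vendor's actual API: with N replicas, a read is only guaranteed to see the latest write when the read and write acknowledgement counts overlap, that is, R + W > N.

```java
/**
 * Conceptual sketch (not any vendor's API) of the quorum rule behind
 * "tunable consistency": with N replicas, writes acknowledged by W nodes
 * and reads served from R nodes are guaranteed to overlap on at least
 * one up-to-date replica only when R + W > N.
 */
public class QuorumCheck {

    static boolean isStronglyConsistent(int replicas, int writeAcks, int readAcks) {
        return readAcks + writeAcks > replicas;
    }

    public static void main(String[] args) {
        int n = 3;
        // W=1, R=1: fastest, but a reader may miss the latest write.
        System.out.println("W=1, R=1 -> " + isStronglyConsistent(n, 1, 1)); // false
        // Quorum writes and reads: consistent, but every operation waits on a majority.
        System.out.println("W=2, R=2 -> " + isStronglyConsistent(n, 2, 2)); // true
        // Write to all, read from one: consistent, but writes stall if any replica is down.
        System.out.println("W=3, R=1 -> " + isStronglyConsistent(n, 3, 1)); // true
    }
}
```

The performance cost mentioned above falls out of the same rule: the stricter the quorum, the more network round trips each read or write must wait on.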
There are plenty of people who can live within those limitations. But for those who would like to take an existing application that talks to a relational database and use a NoSQL database instead, that's what this talk is about.

So what does a no-compromises architecture look like? ACID transactions. Fully consistent reads, strongly consistent reads, repeatable reads. Horizontal scalability, across the entire stack. Co-located native business logic: the equivalent of stored procedures, written in a native language such as Java, a .NET language, or a dynamic language such as Groovy or JRuby. Real-time eventing: the equivalent of triggers, provided by a capability for continuous query, and by continuous query I mean a query that's running all the time, in memory in our case, and that generates events when certain conditions are reached in the data store. Complex event processing: the ability to respond to different events, meaning data modifications, data updates, the arrival of data, and to launch asynchronous, multi-threaded processes. SQL queries: I realize this is not a new feature for most big data stores, but it's a feature of the overall platform. Fully elastic and self-healing through the whole stack, as opposed to a conventional stack with a web tier, app server tier, and so forth. Reads in the low tens of microseconds; this is a cache-read number, of course, and we'll get into the details of that in a bit. Redundant writes in the low hundreds of microseconds; these are redundant writes in memory. And role-based security for both the data and the management of the cluster itself.

A typical compromise architecture that I'm seeing now essentially boils down to two separate stacks, one to handle apps that can tolerate eventual consistency and the lack of transactions, and one for those that can't. Or maybe something like this, where the app tier is smart enough to know where to go and is actually combining the two, but that's as far as it goes, and we haven't solved scalability here. So the Franken-architecture: just pulling together a lot of pieces to produce what you need. Granted, some level of combining parts and pieces to construct a system seems almost inevitable from what I've seen, but I think we can do a lot better. It gets complicated: many moving parts, different vendors, many contracts, overlapping upgrades, different technology expertise, and so on. It's not elastic, certainly not in any uniform sense, and it's inefficient: you're going across several network hops in a typical request flow.

Okay, the super-cluster concept. This is really the end result of this architecture, and it's a very simple idea. How many of you are aware of GigaSpaces' XAP product and what it is? All right, I'll go into a bit more detail in a few minutes, but here's the short version. GigaSpaces XAP sits in front of the NoSQL cluster. XAP is an in-memory, horizontally scalable, parallel processing platform, and it's fully transactional; data and processing are stored in memory and scaled horizontally. The NoSQL cluster sits behind it and serves data to XAP. In effect, XAP serves as a sort of logical transactional veneer, or layer, on top of the NoSQL cluster.
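As a rough illustration of the continuous-query idea described above, here is a small sketch. The class and method names (ContinuousQueryStore, subscribe, write) are hypothetical, not the actual XAP API: a listener registers a predicate once, and every later write that matches it fires an event, which is what gives you trigger-like behavior without a database trigger.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

/**
 * Hypothetical sketch of a continuous query: a listener registers a predicate
 * once, and every subsequent write that matches it fires an event. The names
 * here are illustrative only, not the actual XAP API.
 */
public class ContinuousQueryDemo {

    static class Trade {
        final String symbol;
        final double price;
        Trade(String symbol, double price) { this.symbol = symbol; this.price = price; }
        @Override public String toString() { return symbol + "@" + price; }
    }

    static class ContinuousQueryStore {
        private final List<Predicate<Trade>> queries = new ArrayList<>();
        private final List<Consumer<Trade>> listeners = new ArrayList<>();

        // Equivalent of a trigger: the query keeps "running" against all future writes.
        void subscribe(Predicate<Trade> query, Consumer<Trade> listener) {
            queries.add(query);
            listeners.add(listener);
        }

        // On every write, notify each subscriber whose continuous query matches.
        void write(Trade t) {
            for (int i = 0; i < queries.size(); i++) {
                if (queries.get(i).test(t)) {
                    listeners.get(i).accept(t);
                }
            }
        }
    }

    public static void main(String[] args) {
        ContinuousQueryStore store = new ContinuousQueryStore();
        // "Notify me whenever a trade over 1,000 arrives" -- fires in real time.
        store.subscribe(t -> t.price > 1_000,
                        t -> System.out.println("large trade: " + t));
        store.write(new Trade("ACME", 250));    // no event
        store.write(new Trade("ACME", 5_000));  // event fires
    }
}
```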
And this is sort of a note to myself: the layering here is logical only. We have customers that actually intermix the GigaSpaces cluster with the NoSQL cluster on the same nodes, so it doesn't have to be a physical separation.

Okay, so there are some obvious synergies here. XAP, because of its nature, really is a NoSQL solution itself, by the definition of NoSQL used for this conference. It can be viewed as an in-memory, object-oriented, clustered database. It's an application platform, not merely a data store, and that's where we get the ability to execute code and do real-time event processing in front of the NoSQL back end. It's not disk-based, as NoSQL is; it's memory-based. In practical terms that limits its size, even though it has a very large upper limit: it's not going to hold a petabyte, because nobody's going to put a petabyte in memory in my lifetime, as far as I know. But it adds a transactional layer. Both are distributed data stores, both are highly available, both are elastically scalable independently, both are self-healing, and they're naturally complementary. This is just another view of it. So we have a lot of similarities and overlaps between the two, and when you put them together, you end up with a platform whose power I think is unparalleled, from the parallelism of the data access to the parallelism of the processing.

So here we go. XAP is an in-memory clustered federation: storage and processing, as I mentioned earlier, with transactions, strong consistency, high availability, and self-healing. The way that's achieved is through in-memory replication: there are at least two copies of every data element, normally kept on separate nodes so they can tolerate failover. It's horizontally scalable, with content-based load balancing on the front of it. Data is sharded across multiple servers and lives in memory. Memory is the primary storage medium from XAP's perspective; memory is the database. If all your data fits in memory, that's great. If it doesn't, as we're discussing here today, you have NoSQL needs. Typically we would have a relational database behind the scenes integrated with a store like this, but the combination of XAP with NoSQL is more compelling because of the scalability it adds to the NoSQL platform. And finally, yes, XAP is NoSQL, so you can combine them.

Here are a few slides showing how XAP takes a standard stack and collapses it. You could view this piece as the NoSQL database; it's drawn small not because it's unimportant, but because these slides are explaining XAP only. Take a standard messaging tier, say a clustered message broker, in an architecture like this. The messaging gets partitioned and co-located in memory; that eliminates one or more hops across the network to store and fetch messages if you have a messaging layer. The business logic tier then also joins the messaging tier in memory, eliminating another hop or two. And then even the web tier, the web applications, can be hosted on the XAP platform and managed by the same agent-based process management we use to provide high availability.
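To make the "content-based" sharding concrete, here is a minimal sketch of the usual routing rule. This is my own illustration of the general technique, not XAP's internal algorithm, and partitionFor is a made-up helper: a routing field on each object is hashed and mapped to one of a fixed number of partitions, so reads and writes for the same key always land on the same primary node and its backup.

```java
/**
 * Minimal sketch of content-based routing (an illustration of the general
 * technique, not XAP's internal algorithm): the routing field is hashed and
 * mapped to one of N partitions, so all operations on the same key go to
 * the same primary node (and its backup).
 */
public class RoutingDemo {

    static int partitionFor(Object routingValue, int partitionCount) {
        // Mask the sign bit instead of Math.abs, which overflows on Integer.MIN_VALUE.
        return (routingValue.hashCode() & Integer.MAX_VALUE) % partitionCount;
    }

    public static void main(String[] args) {
        int partitions = 4;
        for (String accountId : new String[] {"acct-1001", "acct-1002", "acct-1003"}) {
            System.out.println(accountId + " -> partition " + partitionFor(accountId, partitions));
        }
    }
}
```

Picking a routing field whose values spread evenly, an account ID rather than, say, a country code, is what keeps the partitions balanced as the cluster scales.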
And then the whole stack, now all joined in memory and clustered, can scale in and out, either manually or automatically based on performance metrics.

So, XAP, just some more details. In memory there are fewer physical tiers. The logical tiers of course still exist; logical tiers will always exist and should exist. But the typical organization of a full application stack into physical tiers has great disadvantages for scalability. Data partitioning: this is similar to what you would see in a NoSQL database. Data partitioning spreads storage and load; it's not executed the same way in XAP, but it has the same result, which is horizontal scalability and load balancing.

A question from the audience: so that's partitioning in memory? Yes, and as I said, it's content-based partitioning. When you define the data that's going to be stored in the grid, you can identify fields. One thing I didn't mention about XAP, because we don't really have time to get too detailed: XAP stores data as business objects and/or documents, and you can annotate certain fields as keys, as indexes, and as the routing indicator of the object, so you can pick values that will scale evenly. Another question: are the annotations at the instance level? Well, these are actually annotations on Java classes, in the case of Java; we have .NET as well, and there's also XML if you prefer to do it that way.

The business logic is also distributed across the cluster, in memory with the data, so every node is running all the business logic, and we also support a distributed RPC capability. Queries against memory flow to the NoSQL store on a miss. So, a cache miss: I mentioned before that we're clearly not storing all the data in memory, because the NoSQL store is far too large. When you run queries, and these are SQL queries, you run them against the data in memory; if the data is not resident in memory, the equivalent query is generated and run against the NoSQL cluster. Writes are persisted asynchronously to NoSQL if desired, meaning you can also store data in the grid that doesn't go to NoSQL if you want to. But generally writes are persisted asynchronously, queued in order, and fully transactional as a group. Continuous query triggers event listeners with logic: we showed earlier that you can have messaging in the cluster, and you can also have procedures and events that react to data state changes as data flows through or is replicated in the system. Management processes in the cluster detect failures, redistribute load, and restart failed processes, so the system is scalable both automatically and manually. And since user activity flows through XAP, the user is subject to XAP security; everything flows through this layer, which hides some of the missing pieces, the compromises, of the NoSQL system.

This is just a quick overview of the transactional write, just to give an example. A question: is the NoSQL here a third-party NoSQL, or is it yours? Oh, I'm sorry, yes: we have integrations now with NoSQL vendors. We are the in-memory data and processing cluster, and we integrate with Cassandra and Mongo. We even have an HDFS integration, although that's the Hadoop story and doesn't relate to this talk, and there are others coming.

So the client calls a service, which is exposed as a remote service from the XAP cluster. Copies are made to backup nodes in memory.
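Here is a minimal sketch of the asynchronous write-behind path described above. It is my own illustration of the pattern under stated assumptions, not the actual XAP mirror implementation, and the names (InMemoryGrid, NoSqlStore, WriteBehindDemo) are made up: the writer updates memory and returns immediately, while a background drainer persists the queued updates to the NoSQL store in arrival order.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

/**
 * Minimal sketch of asynchronous write-behind (an illustration of the pattern,
 * not the actual XAP mirror): the writer updates the in-memory grid and returns
 * immediately; a background drainer persists queued updates in arrival order.
 */
public class WriteBehindDemo {

    interface NoSqlStore { void persist(String key, String value); }

    static class InMemoryGrid {
        private final Map<String, String> memory = new ConcurrentHashMap<>();
        private final BlockingQueue<String[]> pending = new LinkedBlockingQueue<>();

        InMemoryGrid(NoSqlStore backend) {
            Thread drainer = new Thread(() -> {
                try {
                    while (true) {
                        String[] entry = pending.take();      // FIFO: preserves write order
                        backend.persist(entry[0], entry[1]);  // slow path, off the caller's thread
                    }
                } catch (InterruptedException ignored) { }
            });
            drainer.setDaemon(true);
            drainer.start();
        }

        // The caller only pays the in-memory latency; persistence happens later.
        void write(String key, String value) {
            memory.put(key, value);
            pending.add(new String[] {key, value});
        }

        String read(String key) { return memory.get(key); }
    }

    public static void main(String[] args) throws InterruptedException {
        InMemoryGrid grid = new InMemoryGrid(
                (k, v) -> System.out.println("persisted to NoSQL: " + k + "=" + v));
        grid.write("acct-1001", "balance=500");
        System.out.println("read from memory: " + grid.read("acct-1001"));
        Thread.sleep(200);  // give the background drainer time to run in this demo
    }
}
```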
The NoSQL write is queued, and the typical latency experienced by the writer into memory is under 200 microseconds. Concurrent readers experience strong consistency. So, as I've noted here, client reads experience strong consistency, meaning they read exactly what was written; they're locked out by transactions where that's desired; and they get essentially a relational-database-like experience.

A query with a miss is a little bit different. In this example we're running GigaSpaces XAP in an LRU caching mode, meaning we keep in memory the elements that have been accessed most recently. The client calls a service which queries, or the client queries directly; generally I advocate having a service layer on the grid so you're not directly accessing the cluster. Either way, if the data is not in memory, or an insufficient quantity is in memory, the SQL flows through and is translated into the native NoSQL query language on the back end. The data is returned to memory, the oldest existing data in memory is expired if there's no space for it, and data from both memory and NoSQL is returned to the caller. Typical latency is less than 200 microseconds for in-memory data; obviously, if we have to penetrate to the NoSQL layer, it's going to be bounded by the NoSQL database itself.

A question from the audience: so do you guarantee consistency between the two data stores, then? No, and that's an interesting point, because it doesn't matter. Let's put it this way: there's a certain assumption here that we're accessing the data through the in-memory data grid. Clearly, if you want a consistent view of the data, period, you have to go through that path. We're also assuming we're not running a strong consistency mode on the NoSQL tier. A follow-up: but if there's a miss on the memory cache and you go to the NoSQL store, while another read hits the memory cache, you'll get two different reads because you're looking at partial data, right? You can't get two different reads, because the data isn't in memory to begin with. But a portion of it is, right? Yes, the questioner's point was that if some of the data, not all of it, was in memory, say half of it, then a read of the rest could differ. Well, for a read in progress, I see what you're saying: the data that comes back on the cache miss, yes, it is quite possible that the remaining data could differ if you weren't doing a key-based selection but, say, a range-based selection with multiple possible matches on the NoSQL side. The idea, though, is that when you're writing into the data store, you can view the LRU memory storage as a queue to some extent, and reads against that store are going to be consistent. The only thing the data's lifetime in the cache has to exceed is the time it takes the underlying store to replicate, which is typically very fast anyway; we're talking sub-second issues. So that's what gives you the strong guarantee. We're coming up on time, but I think this is the key point I wanted to make here.
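To visualize the miss path, here is a minimal read-through LRU sketch. This is my own illustration of the general technique, not XAP's actual cache policy, and the names (ReadThroughLruDemo, the backingStore lookup standing in for the NoSQL query) are hypothetical: a hit is served from memory, a miss fetches from the backing store and caches the result, and the least recently used entry is expired when the cache is full.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/**
 * Minimal read-through LRU cache sketch (an illustration of the general
 * technique, not XAP's actual cache policy): hits are served from memory;
 * a miss loads from the backing NoSQL store, caches the value, and evicts
 * the least recently used entry once the cache is full.
 */
public class ReadThroughLruDemo {

    static class LruCache<K, V> extends LinkedHashMap<K, V> {
        private final int capacity;
        private final Function<K, V> backingStore;  // stand-in for the NoSQL lookup

        LruCache(int capacity, Function<K, V> backingStore) {
            super(16, 0.75f, true);                  // access order = LRU behavior
            this.capacity = capacity;
            this.backingStore = backingStore;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > capacity;                // expire the oldest entry on overflow
        }

        V read(K key) {
            V value = get(key);
            if (value == null) {                     // cache miss: go down to NoSQL
                value = backingStore.apply(key);
                put(key, value);                     // bring it into memory for next time
            }
            return value;
        }
    }

    public static void main(String[] args) {
        LruCache<String, String> cache = new LruCache<>(2,
                key -> { System.out.println("miss -> NoSQL fetch for " + key); return "value-of-" + key; });
        cache.read("a");   // miss
        cache.read("a");   // hit, served from memory
        cache.read("b");   // miss
        cache.read("c");   // miss, evicts least recently used ("a")
        cache.read("a");   // miss again
    }
}
```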
XAP really hides the lack of transactions and the eventual consistency from client apps, and frankly you just don't have to deal with it. If you're looking at simply replacing a relational database in an architecture, that's actually a possibility: you can eliminate the need for one in many architectures, maybe not all, but many. And of course XAP decorates NoSQL with real-time event processing, so you get all the benefits of in-memory data storage, a distributed RPC environment, and a complete stack. The client apps experience in-memory speed, and the entire architecture scales horizontally. You could really almost visualize this as one thing: I know there was a comment earlier about which NoSQL we put behind it, but these two are so complementary that you can almost imagine a single product that has both, because they're such natural matches for each other.

And I think this is important: this isn't just some idea I had, it's actually real. We have integrations currently with Cassandra, MongoDB, and HDFS as well; that's the Hadoop story, which is a different story. If you want to see more, or look at some of my recent activity around this integration effort, check out the GigaSpaces blog. There are a couple of recent posts, including "Cassandra on ACID" and the one with the elephants and eyeballs, which is an integration of the GigaSpaces grid with both Cassandra and Hadoop simultaneously; it's kind of an interesting use case. We're past time, but if you have any questions, I'm happy to take them. Thank you.