My name is Brian Bulkowski. I'm the CTO of Aerospike. Here's our picture of an aerospike. An aerospike is part of a rocket — an aerodynamic feature. It's a small disk that goes at the very top of a rocket. It creates a compression shockwave around the rocket itself and allows it to get past the sound barrier. With an aerospike disk, you have to tune it exactly for the size, shape, and speed of the rocket so that you get enhanced stability. This particular rocket also has an aerospike motor, a little curved device that allows the flame to hold together as well. That's why we named the company that: it's a small piece of technology that makes a big difference. What I'm going to talk to you about today is the role of flash and SSD technology within NoSQL and real-time databases.

So, a quick overview of where we see flash playing. In a lot of our installations, we see a big data landscape that looks about like this. There's an analytics tier — that's sort of our old-school big data. That's Hadoop, that's various analytics and query engines: Greenplum, Vertica (a lot of Vertica), and other systems. And you have a set of data scientists out there creating queries and creating new insights that come out of these analytics systems. These analytics systems, however, take minutes, hours, sometimes even days to run individual queries. So you need something on the front end of the system that can actually store those real-time patterns and act on the patterns that you need. I'll go through some examples of that in a minute.

Within this new ecosystem, we see a couple of different new pieces of technology developing. Besides the new analytics tiers like Hadoop and Vertica, within the app server domain we also see a new set of app servers becoming more interesting. Most importantly, Node and nginx are part of the event-oriented systems that app developers use to drive more scale within these kinds of big data systems. And at the tier where big data apps are required, you start seeing the new NoSQL players. I've also included Memcache there — I think of it as the father of a lot of this ecosystem, and it's still involved in much of it because of its in-memory performance — as well as a lot of the newer companies that are at this conference.

So what does this mean in terms of a typical deployment for big data in real time? We've got models that are usually built in these different analytics tiers. Because different databases are optimized for different query patterns, you'll see, say, VoltDB doing a lot of in-memory work where the data sets are very small, since it can't use flash; you'll see a lot of Hadoop for the large systems, and column-oriented stores as well. So you end up seeing a lot of these different databases back there. But what you have on the front side of your system usually has to be just one database, because with that one database you can make one query to look at your user, look at your interaction, and make a single decision. As these systems become richer, you'll actually see multiple database queries, because now an application developer has more power on this front side. Typically, there are writes that go into all of these systems in parallel, as well as interactions with a data store that's doing things like pattern matching.
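To make that concrete, here is a minimal sketch of the kind of real-time velocity check the fraud-detection example below calls for. The store (a plain dict), the key names, and the thresholds are illustrative assumptions, not Aerospike's client API; in a real deployment the read-modify-write would hit the front-tier database on every transaction.

```python
import time
from collections import defaultdict

# Illustrative in-memory stand-in for the real-time key-value store.
# In production this would be a networked database call, not a dict.
recent_geos = defaultdict(list)   # card_id -> [(timestamp, geography), ...]

WINDOW_SECONDS = 600   # look back 10 minutes (assumed threshold)
MAX_GEOGRAPHIES = 3    # more distinct geographies than this looks suspicious

def record_and_check(card_id, geography, now=None):
    """Write the new swipe, then read back the recent window and flag velocity fraud."""
    now = now or time.time()
    events = recent_geos[card_id]
    events.append((now, geography))
    # Keep only events inside the sliding window.
    recent_geos[card_id] = [(t, g) for (t, g) in events if now - t <= WINDOW_SECONDS]
    distinct_geos = {g for (_, g) in recent_geos[card_id]}
    return len(distinct_geos) > MAX_GEOGRAPHIES   # True means "looks fraudulent"

# Example: the same card used in four cities within minutes.
for city in ["SFO", "NYC", "LON", "SIN"]:
    flagged = record_and_check("card-123", city)
print("flag transaction:", flagged)
```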
So let's think of an example like fraud detection, where you're doing a velocity calculation — one of the cases we talk about commonly within fraud detection. If you have a credit card that is used too quickly within multiple geographies, you have to write all of those geographies all the time and read them back in real time, as well as feeding them back to your modelers.

What do we typically see at Aerospike in terms of deployments of flash databases for that front tier of the application? There's really been a radical change in flash over the last year. This time last year, we still saw a lot of one configuration of machine: eight-core Xeons with about 24 to 48 gig of RAM, a few 100-gig or 200-gig SATA drives per machine in parallel, the kind of throughput you get with Aerospike on that, and larger cluster sizes. This year there's really been a change in what's happening in the data center. We saw a new generation of Xeons, and we saw the price of RAM drop dramatically, giving machines that tend to be about five times denser in RAM, and flash that is nearly ten times denser on a machine-by-machine basis. The increase in throughput is also fairly remarkable with these systems. So now, instead of larger clusters, we see smaller clusters where each node is a bit bigger.

Sure — the question is whether SSD implies SATA. An SSD, a solid state disk, often uses SATA as its transport; you can use either SAS or SATA with an SSD. They are always flash drives, and the physical interconnect — the protocol the disk is speaking — will be either SATA or SAS. Let's move on; we can talk about it more later.

So where we are at Aerospike is that we saw the in-memory techniques of random access could be applied in a big data context if you use and optimize for flash. What are the kinds of problems you can satisfy using flash technology? Advertising optimization is a great case, because you often want to cover all of the cookies and all of the users that you see — billions of users, constantly. Fraud detection, we've found, is a great use case. As I say, fraud is everywhere. It's true in gaming — you want to look in-game at different user behaviors and histories. You want to look at it in financial transactions. You want to look at it in advertising: a website can present itself and say, hey, someone clicked on me, when in fact no one clicked there. Everywhere money is transferred there's a little bit of fraud, and there are people looking at transaction patterns and usage patterns to catch it. We also see financial positions and T-zero — which is to say one-to-four-second financial calculation — cases.

Specifically, what application patterns are in use? First of all, we're starting to see map-reduce in real time — map-reduce in milliseconds, as opposed to using a Hadoop system, which is more of a batch-oriented system — and cases where you want to track not just every cookie but every IP address and every search term that's been offered recently. Great cases of these are in advertising, where you really want to not just see what has happened recently but try to predict, for given users, where they're going to be in the future.

How do we know that flash is already big? I didn't get a chance to update this slide since last week. Anyone at the Flash Memory Summit last week? Yeah. You saw the Facebook keynote, right? Flash was everywhere.
The hall was full of flash — flash everywhere. So, Facebook? Yes. A question from the audience: when we say flash, we talk about flash everywhere — flash as server memory, as storage cache, as server cache, in arrays front-ending HDDs. When you're talking about the kind of flash being discussed here, what specifically are you talking about? We're talking about a generalized system to optimize databases for all of those: putting flash in your servers, whether that's SATA or SAS SSDs, whether that's PCIe-based like a Fusion-io card or some of the Micron cards, and the new flash PCIe interconnects — all of that. What is a generalized way to deal with flash in databases?

So, at the Flash Memory Summit there was a very interesting keynote — I think it was Wednesday morning, maybe Thursday — where this number turned out to be old. What we knew from Fusion-io is that Facebook and Apple had spent about $200 million just on flash hardware. Now, they run their own software on top of that; they're using their own stuff. In that keynote they announced that, with their architecture, they were actually making $1.2 billion worth of flash purchases this year. Why are they doing this? For all the reasons I was just describing: because of random access and application design, to build a rich application you really want to be using flash. Those guys figured that out. They had to build their own software, and their own distribution mechanisms, in order to manage that much flash.

So, some of the common preconceptions — misconceptions — about SSDs. First: I can just take SSDs and flash, put them in my database, and it goes faster. How many people are using SSDs and flash in their databases today? And it goes faster, right? Yeah, it does. Two or three times faster? Maybe a little more. But SSDs are so much faster than that. Instead of having 200 or 300 IOPS per disk, you've now got 20,000, 50,000, 100,000 IOPS — even more than that. So why is it only going two or three times faster? Well, that's because SSDs are simply different from any device we have seen so far in storage history. The way they're different is what I call an asymmetric use pattern. Reads are very fast, and you can do a lot of them — very small reads as well — but writes are still relatively slow, and you have to use log-based, file-system-style approaches in order to write efficiently. With RAM, it's fast to read, fast to write, and it doesn't matter what the size is. With disk, reads and writes are both slow, and you always want to stream. Reads and writes are very different with SSDs, which is why you have to specifically optimize your database for flash.

The other point about writes with SSDs is that if you keep your writes to large blocks, you get less write amplification created by the device itself. Otherwise the device has to go through and pick out all those little pieces, bring them together, and push them back and forth — which is similar, in software, to garbage collection and defragmentation. If you do this at the database level, knowing the application's use pattern, you can do it more efficiently than the device can. Now, the good news at the device layer is that the device knows its chip layout.
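To make the database-side approach concrete, here is a minimal sketch of coalescing many small records into large block writes — an illustration of the log-style pattern just described, not Aerospike's actual storage engine. Block size, file layout, and the length-prefixed record format are assumptions.

```python
import os

WRITE_BLOCK_SIZE = 128 * 1024   # assumed flash-friendly write size (e.g. 128 KB)

class LogStructuredWriter:
    """Coalesce many small records into large blocks so the SSD sees big writes."""

    def __init__(self, path):
        # Wear leveling, syncing, and crash recovery are ignored in this sketch.
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
        self.buffer = bytearray()
        self.device_offset = 0

    def put(self, record):
        # Length-prefix each record so the block can be scanned back later.
        self.buffer += len(record).to_bytes(4, "little") + record
        while len(self.buffer) >= WRITE_BLOCK_SIZE:
            self._flush_block()

    def _flush_block(self):
        # One large aligned write instead of many small ones keeps
        # device-level write amplification down.
        os.pwrite(self.fd, bytes(self.buffer[:WRITE_BLOCK_SIZE]), self.device_offset)
        del self.buffer[:WRITE_BLOCK_SIZE]
        self.device_offset += WRITE_BLOCK_SIZE

writer = LogStructuredWriter("/tmp/datafile")
for i in range(10000):
    writer.put(f"record-{i},user-data-payload".encode())
```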
Sometimes it's better to let the device handle that defragmentation itself, and I'll talk a little later about systems for doing so. Another important factor with flash and SSDs is the parallelization you have to get out of the application layer. In the old-school database way, what does that mean? It means async I/O, and it means a lot of reactor models that let you process I/Os very efficiently — as opposed to a true in-memory, just-blast-your-thread-through-with-no-locks kind of strategy. So you have to code a little differently to make sure you get enough parallelism. A lot of these devices are happiest at 256 to 512 simultaneous I/Os in flight, so you have to write your code with threads and I/O patterns that are capable of that. If you do, you'll end up seeing performance factors much higher than simply three or four times faster.

Another preconception: aren't SSDs expensive? Again, this is an old slide. The last time I looked at Dell's site, I found prices for RAM at about $30 a gigabyte; it's come down a little since then. What we were seeing back in January 2012 was about $10 a gigabyte for flash storage, which has come down now to about $1 per gigabyte. With a Micron M500, we're now down at 50 cents. And the rumors I've heard about what Apple is buying internal to their infrastructure is that, with their pricing power, they've also been able to get down to about 50 cents per gigabyte. So that's really starting to encroach on the territory where you would even start thinking about rotational drives. I'll also show you some results from very high-performance flash — the highest-performance flash I currently know about — at $8 a gig. So compared to in-memory, it's really a lot cheaper for the same characteristics; compared to rotational disks, there are also choices within the same price band.

Aren't SSDs slower than RAM? Well, yes, they are. I'll show you later some of the drives we've tested and where we see different drives fitting into a system. But they are so much faster than rotational disks. One of the nice configurations these days is SATA drives — you can get 20 of them in a chassis. Once you get 20 of these things all lined up, assuming you're doing your I/Os correctly and you've got your RAID card configured correctly, you can get some pretty awesome performance — performance to the point where the bottleneck starts shifting away from flash and storage, where it's been traditionally for nearly 30 years, and toward the network. That's a very interesting result.

So, taking some of that and pointing out what you can now do with SSDs versus rotational disks: I just took something from 3PAR's price list — 5K TPS, and that's considered a pretty hot system. With SSD, you get 4 terabytes of storage and something like 500K database TPS for about $16K. RAM is faster, but look at the difference in price: instead of $16K for 4 terabytes, you're at $20K for half a terabyte. Another example — this is the kind of thing I show customers when I'm trying to express the difference between in-memory with flash and in-memory with RAM — is a project budget. I don't know about you guys, but I often have to talk my boss into a new project, and I have trouble starting a project at $4 million in the initial hardware buy.
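A quick back-of-the-envelope version of that comparison, using the rough per-gigabyte prices quoted above. Treat the numbers as illustrative: this only prices the raw media for a replicated data set, while the Dell quote below presumably covers whole servers, power, and space, so the real gap is even wider.

```python
# Rough per-gigabyte prices quoted in the talk (illustrative, circa 2013).
PRICE_PER_GB = {
    "RAM": 30.00,         # ~$30/GB from a server vendor's price list
    "PCIe flash": 8.00,   # high-end PCIe flash, P320h-class pricing
    "SATA flash": 1.00,   # commodity SATA SSD
    "dense flash": 0.50,  # newest dense SSDs, M500-class pricing
}

def media_cost(capacity_tb, media, replication=2):
    """Raw media cost for a replicated (high-availability) data set --
    servers, power, and floor space are deliberately left out."""
    return capacity_tb * 1024 * replication * PRICE_PER_GB[media]

for media in PRICE_PER_GB:
    print(f"10 TB, HA, on {media:12s}: ${media_cost(10, media):>10,.0f}")
```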
For a high-availability 10 terabyte system, this is what I got out of Dell when I was trying to build a reasonable 10 terabyte high-availability system in RAM. Whereas with SSDs — this is with the Micron P320h, which is actually a very expensive drive at $8 a gigabyte — I came out to only $200,000. And you know, I could probably talk a boss into $200,000; that's within my power of persuasion. Now, some customers say, well, I don't really need 10 terabytes. This is the kind of price savings you still see. I was at a whiteboard with some guys in Asia and they said, well, we're just starting out with a 200 gig system. What does 200 gig in Redis with RAM look like versus a flash deployment? Ten servers of the particular kind they had, versus three servers, while still hitting the same performance goals they wanted to hit. So that's $20,000 saved right off the bat in the initial buy.

So let's talk for a second about what it means for a database to optimize for SSDs. I was talking a little before about the kinds of I/O patterns you need to use. The primary one is using those large write patterns and also keeping indexes in RAM. Indexes in RAM have two positive benefits. One is that the index is usually the choke point of parallelism in a database: that's where you've got multiple applications coming in, all trying to find exactly what they're looking for, and RAM works better for parallelism — there are a lot of great lock-free algorithms for RAM. So you can get the parallelism you need at the device level if you keep your indexes in RAM. The other important point about keeping your indexes in RAM is that there are a lot of small writes that occur when you're writing to an index, and as I was saying, you don't want to do those small writes. So you do your large writes using log-based file systems, and if you keep those indexes in RAM, you'll get the high performance the device can actually give you.

Another problem you have to deal with is background defragmentation. When you have a log-based file system, that means you have to defragment. Instead of relying on the file system to do this — we've found that's a real impediment to high performance, in Linux especially — we use a native device layout, similar to what Oracle does; Oracle uses native device layouts as well. This is the common way to build databases. But then you have to constantly defragment. If you put defragmentation off until later and run for a while without it, you'll slow down and you'll have unpredictable performance. So having that consistent, constant defragmentation as part of your database is key.

We've also found that RAID hardware creates a lot of unpredictability, because most of it is not yet built for flash. RAID cards try to make these big, long stripe sizes to make sequential access efficient. Well, guess what — that's not how SSDs work. So instead, what you have to do is route your I/Os to individual devices, knowing the device structure, and get the RAID card out of the way. Those things are a huge bottleneck right now, and we always have to be very careful, especially with the LSI configuration tools, to make sure we can really get enough throughput through those devices. From the application's perspective, we — the database — manage those devices directly, and that's how we get fast enough storage.
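Here is a very small sketch of how those two ideas — the primary index in RAM and continuous defragmentation of large on-device blocks — fit together. The block size, threshold, digest, and data structures are assumptions for illustration, not Aerospike internals.

```python
import hashlib

BLOCK_SIZE = 128 * 1024      # large on-device write blocks (assumed size)
DEFRAG_THRESHOLD = 0.5       # rewrite a block once less than half of it is live

# Primary index lives in RAM: key digest -> (block_id, offset_in_block, size).
index = {}
# Per-block accounting of live bytes, used by the background defragmenter.
block_live_bytes = {}

def key_digest(key):
    # A fixed-size digest keeps the in-RAM index entries small and uniform.
    return hashlib.sha1(key.encode()).digest()

def index_put(key, block_id, offset, size):
    """Point the in-RAM index at the newest copy of the record on flash."""
    d = key_digest(key)
    old = index.get(d)
    if old is not None:
        # The previous copy becomes dead space in its old block.
        block_live_bytes[old[0]] -= old[2]
    index[d] = (block_id, offset, size)
    block_live_bytes[block_id] = block_live_bytes.get(block_id, 0) + size

def blocks_to_defragment():
    """Blocks whose live data fell below the threshold: their surviving records
    get re-appended to the current write block, and the old block is freed."""
    return [b for b, live in block_live_bytes.items()
            if live / BLOCK_SIZE < DEFRAG_THRESHOLD]

# Example: rewriting the same record twice leaves dead space behind.
index_put("user:42", block_id=0, offset=0, size=80_000)
index_put("user:42", block_id=1, offset=0, size=80_000)
index_put("user:42", block_id=2, offset=0, size=80_000)
print(blocks_to_defragment())   # blocks 0 and 1 are now entirely dead
```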
Another interesting optimization: there's a lot of talk in SSDs about TRIM support. We have found, in general, that TRIM tends to slow things down and create a lot of pauses within devices. There's no good reason for that — it's simply an implementation issue that device manufacturers will eventually overcome. So we typically don't use it; we have a configuration option that allows it, and maybe someday we'll find a device that handles it well.

There are also next-generation interfaces available for flash, and at Aerospike we were the first database to use the open NVM KV interface. What the KV interface does is support a primary-key, key-value, hash-table-style interface directly from the flash card. So instead of using block-level interfaces — where you have to do all that defragmentation and carry extra space — you can use the KV interface within the device itself. It saves you RAM, and it saves you that defragmentation, allowing you to use less storage and pay less for the flash cards. So that's a very interesting thing. Fusion-io developed it originally and has open sourced the interface, and we expect to see more device makers pull that interface into their systems. We're quite excited about it. RDMA is another interesting optimization; it lets you deal with some network issues.

One thing we've really learned about flash is that it's simply a new technology. Optimizing for flash means knowing which flash devices really work at any given time. A manufacturer — say Micron — comes out with a new device. Is it really fast? Does it have predictable performance? Of course the datasheet says it does, but does it really? We decided we had to create our own certification tool for SSDs and promote it to the flash makers, because we would have deployment situations where we would measure that the flash and storage subsystem was unpredictable and slow, and then our customers would ask us to go prove it, get the flash guys in line, and get a firmware fix. We found that the only way we could convince them the problem was theirs was to give them a bunch of source code they could analyze and then actually see the problem themselves. The workload we tend to configure for this tool is, again, one where there are large block writes and small reads, and you measure the latency of those reads — because read latency is really what slows down a lot of database access. We tend to see a lot of use cases at about 1.5 kilobytes per object; that's very common in behavioral cases. I expect that will go up a lot as the cost of flash drops.

So let me show you the kind of results we can get out of these tools. This is a particular device, a PCIe device called the Micron P320h. It's not cheap; on the other hand, it's still less expensive than Fusion-io. The way to read this table is that, across a six-hour period, it shows how many reads took more than a millisecond at a level of 150,000 read IOPS, with 200 megabytes of writes going on simultaneously. Compare that to some of the other players and devices available right now. The Intel S3700 is a great drive if you can get them. They're priced a bit lower — about $3 a gigabyte — and you see a real difference in the latency they give, although at much lower throughput rates. Now, you can stack a lot more of these in a chassis.
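Those latency tables come from exactly this kind of certification workload: large block writes running in the background while many small reads are timed. A stripped-down sketch of such a test follows — the file path, sizes, durations, and thread counts are all assumptions, and this is not the actual certification tool.

```python
import os, random, threading, time

PATH = "/tmp/ssd_test_file"       # point this at a raw device or large file (assumption)
FILE_SIZE = 256 * 1024 * 1024     # area the test exercises
READ_SIZE = 1536                  # ~1.5 KB objects, as in the behavioral use cases
WRITE_BLOCK = 128 * 1024          # large-block writes running in the background
READERS = 64                      # keep plenty of reads in flight at once
over_1ms, total = 0, 0
lock = threading.Lock()

def writer(stop):
    fd = os.open(PATH, os.O_WRONLY)
    block = os.urandom(WRITE_BLOCK)
    while not stop.is_set():
        os.pwrite(fd, block, random.randrange(0, FILE_SIZE, WRITE_BLOCK))

def reader(stop):
    global over_1ms, total
    fd = os.open(PATH, os.O_RDONLY)
    while not stop.is_set():
        offset = random.randrange(0, FILE_SIZE - READ_SIZE)
        start = time.perf_counter()
        os.pread(fd, READ_SIZE, offset)
        elapsed = time.perf_counter() - start
        with lock:
            total += 1
            if elapsed > 0.001:
                over_1ms += 1

# Pre-size the file, run the mixed workload briefly (the real runs last hours),
# then report how many reads crossed the one-millisecond line.
with open(PATH, "wb") as f:
    f.truncate(FILE_SIZE)
stop = threading.Event()
threads = [threading.Thread(target=writer, args=(stop,))]
threads += [threading.Thread(target=reader, args=(stop,)) for _ in range(READERS)]
for t in threads:
    t.start()
time.sleep(10)
stop.set()
for t in threads:
    t.join()
print(f"{total} reads, {100.0 * over_1ms / max(total, 1):.2f}% over 1 ms")
```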
So we end up going through, with each customer, what kind of chassis they use, how much floor space they're dealing with, whether PCIe is a better solution for them than SATA, and so on — there are a lot of ins and outs to it. The Intel can be a very competitive device compared to the Micron, depending on the exact latency characteristics you want. And the Fusion-io ioDrive2 — a 700 gig MLC part priced at $8 a gigabyte — doesn't really fare as well in the same test scenario.

One thing we see is people trying to use true consumer-grade devices in these chassis, and you know, there are firmware bugs in some of them, especially when you're dealing with higher throughput. I don't really mean to put OCZ on the spot, but this was an email we got from a customer who didn't take our recommendations to heart and was seeing — what was his phrase? — "we're currently in the process of replacing..." I think he actually said something about them dropping like flies. This is one of the things that gives SSDs a bad rap within the big data and high-performance community: someone has an experience like this, and suddenly SSDs and flash are "unreliable." More precisely, the Vertex 4 — and possibly the particular batch he got — was like that. OCZ uses a lot of different flash chips because they buy from whoever they can find at the time, compared to people like Intel, who have their own chip lines and a different level of quality control at a higher price. So these are things you have to watch out for. To put some more perspective on it, these are some of the devices we saw more than a year ago — starting nearly three years ago now, with the original Intel X25 — where we saw a fair number of 64-millisecond requests. That's a very different world from where we are now. The Samsung device was really a very solid device at the time, and Intel had a pretty good device with its 320; we recommended a lot of those. That should give you an idea of some of the trade-offs we try to help our customers through.

So let's talk a little bit about how Aerospike actually approaches the high-availability problem, because once you're dealing with high uptime and high availability, as well as SSDs, one of the great benefits is being able to have a distributed system. I'm going to skip over a little bit here. Our architecture allows us to uniquely do the kind of SSD optimizations I was just talking about. An independent company called Thumbtack did a set of benchmarks, and you can read through the details of what they did online. They're a system integrator in New York that deploys a lot of these different NoSQL solutions, and they wanted to see which databases work better in which scenarios. We were pretty happy with what they found for us on their SSD-based systems, as well as the latency they measured. One of the benefits of our system and our design, which I'll mention in a minute, is the ability to continue operating even in the face of failure. So this is a graph from their tests that shows what happens when one node drops out and when a node arrives again.

The system we use to get those benefits is a shared-nothing architecture. There's no extra server; there's no name node, no mongos in the middle directing traffic. Instead, every cluster node — every piece of software — has the capability of doing the routing and the Paxos-based configuration, so that it is a shared-nothing system.
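A rough sketch of what that shared-nothing, client-side routing can look like, assuming a fixed number of partitions and a hash of the key. The partition-to-node map here is a hand-built stand-in for the partition table a real smart client keeps in sync with the cluster.

```python
import hashlib

N_PARTITIONS = 4096   # fixed partition count, an assumption for illustration

class ClusterClient:
    """Smart client: hash the key to a partition, look up the current master."""

    def __init__(self, partition_map):
        # partition_id -> (master_node, replica_node); a real client refreshes
        # this map whenever it learns the cluster has reconfigured.
        self.partition_map = partition_map

    def partition_for(self, key):
        digest = hashlib.sha1(key.encode()).digest()
        return int.from_bytes(digest[:4], "little") % N_PARTITIONS

    def node_for(self, key, replica=False):
        master, repl = self.partition_map[self.partition_for(key)]
        return repl if replica else master

# Example: three nodes with partitions spread round-robin. A real cluster
# computes this assignment with its consensus and rebalance logic instead.
nodes = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
pmap = {p: (nodes[p % 3], nodes[(p + 1) % 3]) for p in range(N_PARTITIONS)}
client = ClusterClient(pmap)
print("send the write for 'user:42' to", client.node_for("user:42"))
```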
The way we do this is that when new nodes are detected, we run distributed consensus algorithms to figure out which nodes should be in the cluster, and they are added to the cluster. During that period we elect a cluster master, and that one machine determines what data moves where and which node has which responsibility. From an application perspective it all looks the same from the outside, but internally there is a master and there are specific responsibilities within the system. That allows us to have a very high level of reliability in our data writes.

What happens in our system is that for any given row in the database there is a master and there is a replica. That's all randomly distributed: every node is a certain percentage master and a certain percentage replica, and you don't know from outside the cluster what is master and what is replica. If there's a failure within the system — a hardware failure — or if you add more capacity, these responsibilities may shift, depending on whether you want to automatically promote or to pause and wait for the system to come back up. Having masters and replicas for individual data writes is the only way to get a high level of consistency in your system. And yes, in answer to the question: you can have as many replicas as you want — of course, that slows down your writes.

What we've also found is that having a client which knows about the data layout within the cluster at any given moment is of critical importance. We've seen more of the NoSQL systems move to this — for example, Hector for Cassandra is capable of tracking the cluster — and we've had it from the beginning. It's really very important to being able to tolerate failure. So within the client library, within the process itself, there is tracking of where data is and who the master is on a moment-by-moment basis. You don't have to reconfigure; you don't have to specify a list of IP addresses the way you might with a manual sharding system. You simply have the client running, and it automatically learns cluster reconfigurations as they happen.

We've also found that being able to replicate your data between multiple data centers is critical for high availability. If you're only in one data center, in some sense you're not serious about availability, right? Because data centers go offline — it happens. So what we have come up with at Aerospike is both the capability of doing a single-master star topology and also using multi-master rings. One particular customer deployment has one ring across three data centers, another ring across four, and a very small ring with just two data centers. Every time they do a write, the replica of that write copies itself throughout that set of data centers. With different data pools, that's always very important.

Monitoring is crucial. For people who are getting started with their system and want to see that their cluster is performing well, we ship a monitoring tool. What we recommend for our higher-scale production customers is that they integrate with their own monitoring system, because when you're having performance problems you usually want to do a full-stack analysis: is my problem in my app, is it in the data layer, is it in the network? Still, having a simple tool that lets you see what's going on is often the first step in diagnosis.

So, one thing we're doing here at NoSQL Now is our new Aerospike 3 capability, which allows us to do queries and adds extensibility to the database.
Where we've started at Aerospike in this release is the ability to do queries on indexed values, or columns, just like in a standard relational system. Columns, of course, can be defined on the fly — there's no need to pre-create your columns. However, if you want an indexed column, you do have to tell the cluster that you want an index on it. We support background recomputation, as well as allowing the index to be created on the fly, synchronously and with consistency.

The scheme we use for this is that the index lives on the same machine as the data itself. Now, there are pluses and minuses to this; there's another way of doing indexes that we'll also get to, probably in the next six months. The benefit of this scheme is that the random distribution we get across the cluster is maintained within the index as well. When you do an insert, you're touching the machines that hold that data, and you're updating the indexes that are in RAM on those same machines — remember all the benefits I mentioned of having indexes in RAM. The minus of this kind of architecture is that every query has to touch all machines. As long as you're touching, say, hundreds of rows and your cluster is tens or hundreds of nodes in size, it's actually efficient to do that, and it removes hotspots from the system. In a system where you group similar data together, it's very easy to generate hotspots. A classic case is a time-oriented index, where all the data from one period of time — all the inserts — ends up hitting one server. You tend to see that hotspot roll around your cluster on a second-by-second basis; you're not actually getting parallelism, and you're not getting the scale-out capabilities. Similarly, if you're only querying, say, the last two or three seconds, you can see those queries hitting one server after another — again, not actually scaling out. So the benefit of this design is that queries go to all machines. Okay, that's not necessarily the best, but on the other hand each machine is doing less work and you've got no hotspots. We think this is a good design for this system.

User-defined functions: a classic database technique, now in Aerospike. If you take, for example, the work we've seen out of VoltDB and a lot of Stonebraker's ideas, this level of database optimization is something we're very excited about, because so many applications have their own very specific needs in terms of the data sources they have, the data structures they have, and their optimal data patterns. By writing a small user-defined function, you can often get a much higher level of optimization in your database. It's like having your own query language: being able to say, hey, I've got this document and I'm looking for all of the H1s within it. You can write a user-defined function to pull those out for you, instead of having to build that into the database. We look forward to using this kind of extensibility to cover things like all of the Redis operations, among others.

Once you have indexes and you have user-defined functions, you get the ability to do map-reduce. Within the new Aerospike 3 system, one of the benefits is that we use the secondary index to touch less data. So imagine that you only want to map-reduce over the last minute — or even the last few seconds — of your data.
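Before finishing that example, here is a minimal sketch of the scatter-gather pattern described a moment ago: each node answers from its own in-memory secondary index over its own slice of the data, and the client merges the results. All names and structures here are illustrative assumptions, not Aerospike's API.

```python
from collections import defaultdict

class Node:
    """One cluster node: holds its share of records plus a local secondary index."""
    def __init__(self):
        self.records = {}                  # primary key -> record (dict of bins)
        self.sindex = defaultdict(set)     # indexed value -> set of primary keys

    def put(self, key, record, indexed_bin):
        self.records[key] = record
        self.sindex[record[indexed_bin]].add(key)

    def query(self, value):
        # Local index lookup only touches this node's slice of the data.
        return [self.records[k] for k in self.sindex.get(value, ())]

def cluster_query(nodes, value):
    """The query fans out to every node; each does a little work, no hotspots."""
    results = []
    for node in nodes:                     # a real system runs these in parallel
        results.extend(node.query(value))
    return results

nodes = [Node() for _ in range(4)]
for i in range(1000):
    owner = nodes[hash(f"user-{i}") % 4]   # stand-in for hash partitioning
    owner.put(f"user-{i}", {"user": f"user-{i}", "campaign": i % 10}, "campaign")
print(len(cluster_query(nodes, 3)), "records for campaign 3")
```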
Compare that to a query that goes into a Hadoop system, where you often need to touch all of the data in big table scans. By only touching the few things that have changed or been updated — or, say in advertising or fraud detection, only the things for a particular class of address — you can do a range scan, hit those things, and reduce over that particular portion instead of having to touch all the data. Touching data really is one of the primary bottlenecks right now in map-reduce.

So, some of the benefits of Aerospike: the tunable consistency, the flash support we're very excited about, and the fact that we've been used by so many great customers over the last few years with very high availability. Aerospike is available in a Community Edition today, including the Aerospike 3 functionality and APIs, as well as an Enterprise Edition which has the full set of functionality. The Community Edition is only limited by cluster size and the amount of data; other than that, all of the APIs are available.

So, thanks very much for listening — any questions? Great. As I mentioned at the top of the talk, I know a lot of people are here from other NoSQL companies, and this is exciting to you. We are always hiring, and we have a great engineering team both here and in Bangalore, so feel free to drop by and talk with me. Thank you very much.