You mentioned a little bit about my background. I just wanted to mention that I'm a big advocate of standards, and I'm also very much involved in the NoSQL community, trying to promote education and help match the right business problems with the right NoSQL database. My wife Ann Kelley and I are co-authors of a book being published by Manning. You can actually get it today from manning.com/mccreary. Our goal is to have about a 40-hour curriculum to help solution architects come up to speed and recognize the right solutions. I'm going to cover a little bit of the book today, but if you're curious, you can go to manning.com/mccreary and download the entire first chapter, and then you can purchase the book in PDF; it will also be available in print, and right now we're looking at the end of August. So let's talk about what we're going to cover today. Our focus is really going to be on understanding the architectural patterns that are introduced by NoSQL, looking at what problems they address, and then matching the right business problems with the right pattern. To make sure everybody's aware of what's going on here: in the 80s and early 90s, we saw relational databases as the dominant database pattern. Then from about 1995 to 2010, we saw databases that used the OLAP analytical model. Now we're starting to see the third phase emerging, and that's what we call the NoSQL world. Before the NoSQL movement, we really had two dominant patterns in most databases. We had the traditional pattern where we put data in a table, sometimes called the row-store pattern, where every time you add a row to a table, you get a consistent set of records. I drew the columns in different colors, meaning that within a column everything is the same data type. So that's the traditional SQL relational database.
The second one that was really common was the analytical, often called the OLAP pattern. There we have a central fact table with maybe a history of all your transactions, and around the edge you have the star schema with all the dimensions, such as time and product and things like that. It was similar to the relational in that you store data in tables, but unlike the relational, you had a different language, MDX, and you really focused not on transactions but on doing analysis of historical data. The problem was that a lot of people were using these patterns, but they broke in certain areas. They broke in scalability. They weren't document-centric. You had to do a lot of data modeling up front. So there were a lot of different pressures, and we're not going to spend a lot of time on the historical drivers for NoSQL. What we really want to say is that we've now added four new patterns to the traditional two: starting out with relational and analytical, we've added key-value stores, column family stores, graph stores, and document stores. And what we're really talking about for this hour is these four new patterns. We're not going to spend a lot of time on relational and analytical beyond mentioning where their strengths are, but we're really going to go into each of these four new ones and talk about what types of business problems they're best at solving. When you think of key-value stores, the icon in the upper right there shows that keys are basically small strings, and I often display the values as black, opaque values; that is, you can't really see what's inside the values. Think of putting something like a URL in the key, and then the blob of data would be in the value. It's a very simple structure that doesn't have a complex query language, and we'll talk about how that is one of its strengths.
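To make that key-value model concrete, here's a minimal in-memory sketch in Python. It's purely illustrative: the class, method names, and the URL key are made up for this example, not any particular product's API.

```python
class KeyValueStore:
    """Minimal key-value store: small string keys map to opaque byte values."""

    def __init__(self):
        self._data = {}

    def put(self, key: str, value: bytes) -> None:
        # The store never inspects the value; it is treated as an opaque blob.
        self._data[key] = value

    def get(self, key: str) -> bytes:
        # The only query supported: exact lookup by key.
        # No joins, no WHERE clauses, no scanning inside values.
        return self._data[key]

store = KeyValueStore()
store.put("http://example.com/page1", b"<html>...</html>")
print(store.get("http://example.com/page1"))
```

The simplicity is the point: because the interface is just put and get, implementations are free to optimize for speed and scale-out.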
The next one we're going to talk about is column family stores, and I always think about this as keys into a matrix. We'll go through and describe all that, but basically you have a very large table that can have millions of rows and millions of columns, and you can't really select all columns very quickly because they may be distributed over a very large number of nodes. That's the type of thing you're starting to see in Cassandra and HBase. The third new NoSQL pattern is the graph, and this has actually been around for quite a while; the characteristic of that is that you're focusing on the relationships between data and on graph traversals. The last one is document stores. Document stores are one of the more popular ones; they're the ones that are really promoting agile development and search and retrieval, and in documents we have a hierarchy of objects that are clustered together in logical groupings. So we're going to go through each of these and find which ones are useful in different areas. All right, so why is it important to understand these? In the past, it seemed like a lot of the organizations I consulted for had a very binary world. If it looked like a document, they'd say, let's put it in Microsoft Office or a spreadsheet. If it was a little bit bigger than a document or didn't fit well, then they'd say, let's hire a team of modelers and do relational data modeling. We'll create our data definitions, and it was a big project. Now what we're starting to see is that we have a lot of different types of data nowadays, and we have a lot of it. This is just getting you familiar with the fact that there's not just transactional data; on the left-hand side we have a lot of read-only data, or data we write once and read many times, not just transactional and analytical databases, but also lots of log files and events.
One event in a very large log file can be critical, and it needs to bubble to the top quickly. We have a lot of unstructured data: documents, XML, and JSON. We have lots of open linked data, and we're starting to see that each of these different systems handles different types of data well and can also be configured in different ways, using different types of hardware, to be optimal. So those are the different types of data. There are also different types of uses. We're starting to see not just transactions and analysis, the first two, but now people want to do very high-quality, Google-like search and findability. They want to make changes very quickly. They want to discover and find insights in large volumes of data. They want speed and reliability. They're starting to do streaming, where the data is never stored to disk and only goes through memory, but they still want the consistency and availability that they get in transactional databases. One of the things that we found is that when people select a pattern, sometimes they're unconsciously steered towards existing solutions that they're very familiar with. I just attended the Carnegie Mellon SATURN conference, and there were some wonderful presentations about why people tend to pick these older relational and analytical systems rather than looking at the new ones. They anchor to the old systems. There's the black-and-white effect. They're only looking at a subset of the information out there. So there are all these different biases; we could spend an entire hour just talking about why people are still deciding to use these systems. I have a blog post about these, but what I want to do is try to overcome some of this selection bias and help people get more information so they can pick the right solution. One of the things I have to remind people of is that many NoSQL solutions have a radically different architecture, like these key-value stores.
They don't necessarily have to conform to all the ANSI SQL standards; they take one task and try to do it very well, and sometimes they really focus on scalability. One of my favorite examples is Amazon Dynamo, which is a very simple key-value store, but it scales, and Amazon will give you 44 million transactions per month free because they really want to show off their scalability. So sometimes having a simple query language or a very simple interface helps you excel in other areas. One of the things about NoSQL systems is that NoSQL has really become a design style. People are looking at the entire code base of the product they're using and they're throwing things out. They say, you know, we don't really need transactional control. We don't need all the same ACID guarantees. We don't need all these update features. We just want to have one simple database, just like a touch screen has one interface, and we want to repurpose it for multiple things. So that simplicity design style is very popular. I wanted to make sure that everybody knows that this isn't just a little trend. If you go and look at Google Trends and search for both NoSQL and RDBMS, you'll find that NoSQL is really growing quite a bit, and RDBMS articles and blogs are starting to flatten out. So it's really becoming a mainstream movement. But it's not that you want to just take your relational database problems, your general ledger accounting systems and the like, and say, let's move them over to NoSQL. Those systems are often highly tuned and highly designed to be optimized for a certain type of online transaction. The whole point of this is to look at areas that those systems weren't designed around. Eric Evans from Rackspace, who has been very influential in this movement, makes a very good point that the whole purpose is to look for problems that don't fit well into relational databases.
There are a lot of companies; if you come to our conference at the end of August, you'll meet a lot of these vendors. They're all taking very different approaches, but one thing you find is that they have a lot of innovative approaches to both functionality and agility, as well as interesting ways of doing things such as streaming data. So there are a lot of different companies, and I've tried to group some of them into the high-level taxonomy that we have for database architectures, but they often move around. For example, Couchbase, one of the document stores in the upper right, used to be mostly a key-value store, and they just added JSON functionality in their latest release, so they're now much more firmly in the document store camp. We often see new products start as simple key-value stores and then add more and more features so they can have more complicated structures that can be queried. So let's now go through these patterns and start to look at what they're good at. I'm going to mention relational very quickly. Most people are probably familiar with Oracle, MySQL, Postgres, Microsoft SQL Server, and IBM DB2. These are good examples of very solid relational databases that all adhere to a standard language, but to use them, you need to do a lot of upfront data modeling. And once you do that, you're going to be using operations called joins to merge data from multiple tables. They are very mature. They have very good control of transactions and fine-grained security, so you can allow a subgroup of people to see certain columns and certain rows in these tables. But they do require upfront modeling, and they also tend to have a challenge with scaling. Just to help people visualize that: if you have a distributed database and you have your products on one server in your network and your orders on another, you're going to have to do a join operation over the network.
And those join operations send data back and forth between the servers. So what the NoSQL systems tend to say is, let's avoid moving data across the network. What would happen if we grouped orders and products together on these servers and sent the queries across the network instead? So instead of moving data back and forth, we move queries back and forth. It's a little bit different approach. Some of the things that those NoSQL systems tend to be used for a lot are highly available systems and highly scalable systems. But we shouldn't forget the fact that most of the people that are doing high-end analytics already have very good tools in place that work and solve business problems very well. I'm talking about products like Cognos and Hyperion and MicroStrategy; Pentaho is a very good open source data warehouse system these days; and Microsoft, Oracle, and Business Objects all have very, very good analytical systems. As I mentioned, these are based on the concept of a cube. They're really optimized for read-mostly workloads, they're very fast, and they allow non-technical users to do queries by simply dragging and dropping things like categories and measures using a graphical interface. The problem is that these systems aren't really optimized for transactions and updates, and they don't really deal with documents that well. You can't really go in and do keyword searches on full-text databases. So they certainly have their place, and now let's go into some of the big patterns. When we talk about key-value stores, good examples would be systems like Berkeley DB, which has actually been around for quite a long time as part of the initial Berkeley Unix releases. Memcached, DynamoDB, S3, Redis, and Riak are all good examples of solid key-value store systems.
What's important about these key-value stores is that you don't really get to see inside the values; the design is that, given a key, it returns a value very quickly, and it stays fast even if you have lots and lots of key-value pairs. They tend to be very scalable, but they don't really have complex query languages. If you're coming from relational tables, you can think of a key-value store as a table that has only two columns: the first would be a simple string, and the second would be a blob, where you're storing any type of byte array in that value. The key point is that you're never going to do joins at any point. You can certainly use a relational database to store key-value data initially and then migrate to these services if you want to, so you can always build a hybrid approach to these things. One way to look at it is the locker metaphor. With lockers, you can see all the keys that you have, but you're never going to be able to open one up and see the value inside unless you're presented with the key. If you don't have the key, you can't get the value out. So it's a very strict and rigid system, unless you have a small dataset where you're going to be doing string analysis of those values. Key-value stores are very much like dictionaries, where the key is like the headword of a dictionary entry, and the value is everything in that entry: all of the definitions together would be the value. With relational databases, you often think: I have a set of items, and I can select any subset of those items by adding things like WHERE clauses to my queries. Well, key-value stores have a different model. You can only grab one item at a time out of your store, and you can only do it if you're given the key.
Now, there are a lot of variations of this; people have sorted key-value stores, and maybe values that contain a collection of keys so you can create a pseudo-folder structure. But this is the core model of key-value stores. You're seeing a lot of variations: things like eventually consistent key-value stores, where, when they're distributed, you can write to any node and it takes a while for those updates to propagate. There are hierarchical key-value stores, and key-value stores that store to hard disk or to solid-state drives. Some of them focus on high availability, and some do allow full-text search within the values. So they're not pure key-value stores; most of these systems allow you to do something beyond simple gets, but they don't support a full query language. I'll give you a little example of this, and that's Memcached. Memcached was one of the very early open source key-value stores, and it was created by the group behind LiveJournal. LiveJournal had this problem: lots of users were coming to their site and reading the same data over and over, and they had different web servers, so they cached data in each web server. But sometimes you'd have duplication of the data in each of those caches. So what they did was come up with a very simple protocol to allow these RAM caches to share key-value data. You can say: I've received a request for this data item; does any of my neighboring caches have it? Oh yes, this one has it, and it returns it out of memory, so it avoided all those disk accesses. So Memcached is a very good example of a simple key-value store that ended up making a big difference. Riak is another good example of a key-value store, and it's created by a company called Basho out on the East Coast.
Riak is an example of what we call a Dynamo-inspired database. Amazon wrote a paper called Dynamo that gets quoted quite a lot, and it has inspired a lot of different organizations. The key thing about these systems is that they really focus on high availability and fault tolerance. You may have a cluster of ten nodes; if any one, or sometimes two, of those nodes fail, they know how to duplicate the data; the data is replicated automatically for you. If a node failed, it would automatically reshard and rebalance so that no data was lost. They also integrate with a lot of these high-volume transformation tools like MapReduce, and they also support full-text search on some of those values. Riak is really interesting because it's written in Erlang. Erlang is a functional programming language, and it really takes the burden out of distributing these queries around a potentially fault-prone network and makes the job really easy. So there are a lot of different systems in the NoSQL space that are written in languages like Erlang. Redis has a little bit different spin on this. Redis focuses on fast reads and writes, mostly as an in-memory key-value store. They also give you a much larger palette of structures you can use inside those values, such as lists and sets and hashes. And they have a lot of features that developers really like, like automatically expiring things after a certain time. They support transactions. They even have these mini queuing systems where you can publish or subscribe to different types of messages on different topics. So Redis is a very good example of a NoSQL key-value store that's gone in a different direction. And DynamoDB, as I mentioned earlier, is one of the earlier architectures, but it didn't become available as a service until last year. You can actually go to Amazon and sign up for the Dynamo system, and it gives you a lot of the benefits of cloud-based storage; you don't have to worry about it.
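That automatic-expiration feature can be sketched in a few lines of plain Python. This is just the idea of a time-to-live cache, the kind of thing Memcached and Redis provide, not the actual client API of either product; the class and key names are invented for illustration.

```python
import time

class ExpiringCache:
    """Sketch of a TTL cache: entries vanish after ttl seconds."""

    def __init__(self):
        self._data = {}  # key -> (value, expiry timestamp or None)

    def set(self, key, value, ttl=None):
        expires = time.monotonic() + ttl if ttl is not None else None
        self._data[key] = (value, expires)

    def get(self, key):
        value, expires = self._data.get(key, (None, None))
        if expires is not None and time.monotonic() > expires:
            del self._data[key]   # lazy expiration: evict on read
            return None
        return value

cache = ExpiringCache()
cache.set("session:42", "alice", ttl=0.05)
print(cache.get("session:42"))   # "alice" while still fresh
time.sleep(0.1)
print(cache.get("session:42"))   # None once the TTL has passed
```

Real systems add details like eviction policies and background cleanup, but the developer-facing contract is this simple: set with a TTL, and stale reads come back empty.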
It's Amazon's fastest-growing product. It's interesting that DynamoDB is strictly designed to work on solid-state drives. They designed it this way because they want to guarantee the number of reads and writes per second even though you're scaling or may have huge peaks in demand over very short times. They focus on throughput, not necessarily storage, and they have very strong integration with Amazon's other products like the Amazon S3 system and Elastic MapReduce. Now let's go on to column family stores, the second major pattern. I mentioned earlier that I like to think of a column family store as a grid of keys, where the key is composed of the row and the column and the value is stored as a blob. Most of the column family stores are a little bit more mature than simple key-value stores; they have some of the same limitations but also some more advanced features. Some examples of column family stores: Cassandra, HBase, Hypertable, Apache Accumulo, which has gotten a lot of press because it's the system used by the NSA and has very high security with fine-grained control, and the original Google Bigtable all use the column family pattern. I should mention really quickly that many people confuse these column family stores with a variation of the analytical database, the column-oriented store. Column-oriented stores still store things in tables, but instead of grouping all the data in a row together, they group the data in a column together. That's more of an implementation detail: the model you use for a column-oriented database is still SQL, with pretty much the same semantics; it just changes the storage pattern. But column family stores are really unique. They are really the champs at scale-out, and they have multiple versioning: you can have multiple versions of each blob in each cell, and they're really the primary choice for a lot of people that are doing web harvesting. So say you have a new version of a website.
You just put it in the same cell; the URL might be the row ID, and the column might describe that version of that web page. The only problem with this is that, just like key-value stores, you can't really query the content in the blob. So you have to be pretty careful about designing things and putting the data in the right section. And this doesn't have to be just binary blob data: you can have lots and lots of small data items and millions of different column names; you just won't have the ability to do the same types of queries, although a lot of people are starting to build SQL and SQL-like interfaces to these systems. These Bigtable-style systems can be used on very large clusters. I think the minimal HBase cluster in production is usually about seven nodes. They have redundancy built in and automatic replication. They're often tightly coupled with MapReduce. Many of them use the Hadoop Distributed File System, HDFS, which has a replication factor of three by default: every time you store something, it's stored on three different nodes. Although you can certainly do testing and development on a single laptop, this is really one of those systems for people who have a lot of data, want a little bit more flexibility than a key-value store, and also want to scale out. They also have this mental model of a table, which still tends to work well because you are kind of looking at a grid layout. You can't do the traditional joins, but you do have some query languages, such as Pig and the HBase query language. The way I like to visualize this: if you think of a spreadsheet as having a key which is a combination of the row number and the column letter, we can think of Bigtable storage as kind of like that, except the key has a little bit more information.
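As a preview of that richer key in code, the whole table can be pictured as a map from a composite tuple to an opaque value. This is an illustrative Python model, not how Cassandra or HBase actually lay data out on disk, and the row, family, and column names are made up.

```python
# Illustrative sketch: a column family "table" as a map from a composite key
# (row_key, column_family, column_name, timestamp) to an opaque value.
table = {}

def put(row, family, column, timestamp, value):
    table[(row, family, column, timestamp)] = value

def get_versions(row, family, column):
    """Return all stored versions of one cell, newest timestamp first."""
    versions = [(ts, v) for (r, f, c, ts), v in table.items()
                if (r, f, c) == (row, family, column)]
    return sorted(versions, reverse=True)

# Web harvesting example: multiple versions of a page live in one cell.
put("com.example/index", "content", "html", 1, b"<html>v1</html>")
put("com.example/index", "content", "html", 2, b"<html>v2</html>")
print(get_versions("com.example/index", "content", "html")[0])  # newest version
```

Note that the only lookups are by the composite key; nothing here can search inside the byte values, which mirrors the blob limitation just described.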
It has not just the row identifier, but also a column family, which is a grouping of columns, as well as the column name, just like a column name in a relational database. It also has timestamps, so you can store multiple versions of that value. The column family store is nice because you can keep adding more and more data, keep making it more distributed, and it allows you to continue to add data as your systems grow. The column families really allow you to group related things together, and many of these systems still have features such as triggers that can fire for you. So there are a lot of different options here. Okay, when I think of column families, I like to think of a tree. Many of the systems have this concept of columns and column families, and you're putting your data into certain columns. Designing with these systems is similar in some ways to designing hierarchical documents, but it's also similar in some ways to the relational model, where you group common and similar data into similar structures. Among the systems that implement these column families, HBase is popular; it came out of the Hadoop world, originally driven by Yahoo, and a huge amount of development has gone into it. These systems have Java interfaces, they run on JVMs, and they have strong support from a lot of different vendors in the community: Hortonworks, Cloudera, MapR, Intel, and IBM are all very active in those areas. One of the biggest successes coming from the column family camp is the Cassandra system. Cassandra, which is an open source Apache product but is also supported by the company DataStax, is very interesting: instead of a master-slave model, it has what's called a peer-to-peer distribution model.
This is wonderful for people that are scaling out, because each node in the cluster has almost a full set of the data it needs to keep running. So if multiple nodes fail, it automatically recovers; there's no single point of failure. Cassandra and DataStax have done very, very well recently, with a lot of good documentation and a lot of good support and monitoring, and they've taken a lot of business away from some of the bigger relational databases. Cassandra is also written in Java and works very well not only on its own but also with the integrated components, the standard HDFS and MapReduce. One of the slides that I wanted to highlight is one that Adrian Cockcroft presented at our conference last year; he will be a keynote speaker this year. Netflix has done an amazing job with Cassandra at helping people understand how to build tools that will scale out, and at proving that these tools scale out. What they do is set up an infrastructure where you can lease a couple hundred nodes in an Amazon cluster, in this case 288 large instances, simulate a load of up to a million transactions per second for about an hour, and then shut it all down, to show that these systems really do scale. Creating these benchmarks requires a lot of infrastructure that Netflix has not only created but also made open source, and I think Adrian is one of the best speakers around who really talks about how this peer-to-peer architecture that underlies Cassandra has been a great benefit for a lot of companies that need scalability. So we're going to move on to graph stores now. Graph stores are where we start to focus on relationships and the properties of those relationships. Examples of these include Neo4j, one of the biggest ones; AllegroGraph and InfiniteGraph; and several of the RDF triple stores are all types of graph stores.
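To make the graph idea concrete before going further, here's a tiny friend-of-friend traversal over an adjacency list in plain Python. The social graph and the names in it are invented for illustration, and real graph databases optimize traversals far beyond a simple dictionary.

```python
# Hypothetical social graph as an adjacency list: person -> set of friends.
friends = {
    "ann": {"bob", "sue"},
    "bob": {"ann", "joe"},
    "sue": {"ann", "joe", "kim"},
    "joe": {"bob", "sue"},
    "kim": {"sue"},
}

def friends_of_friends(person):
    """Traverse two hops out and return people exactly two steps away."""
    direct = friends.get(person, set())
    two_hops = set()
    for friend in direct:
        two_hops |= friends.get(friend, set())
    # Exclude the person themselves and their direct friends.
    return two_hops - direct - {person}

print(sorted(friends_of_friends("ann")))  # → ['joe', 'kim']
```

This is the kind of query that would take multiple self-joins in SQL but is a natural hop-by-hop walk in a graph store.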
The interesting thing that's different about graph stores is that when you do queries, you're really not looking at projections that give you a subset of rows; you're starting to do things that almost look like walking through the graph. They call them graph traversals, and they are very, very fast when you want to search networks, and they're very good at pulling in publicly linked data. The disadvantage of graph stores comes from the very thing they're about, relationships: because there are so many relationships, just like relational database joins, you have graphs that need to span multiple nodes in a network, so they really need to share data across those nodes, and you get a slowdown when these queries have to go across the network. They also have very specialized languages; a lot of the RDF databases use SPARQL, which is a W3C standard for querying graph data. You use graph stores when you really have a focus on relationships and the types of relationships between things. They're fantastic for social network queries: how many friends of my friends have visited this restaurant, or how many friends who live in this area have these properties? They're also very useful for logical inference and rules engines, and they're heavily used in pattern recognition and fraud detection. Our book has a case study on YarcData, which is Cray spelled backwards; they have a hardware box with huge, terabyte-scale shared memory, and thousands of processors all doing graph traversals on centralized arrays. So they can really scale up to terabytes of graphs, and these queries can really be used for fraud detection and a lot of other discovery where you're looking for complex patterns. So how do these systems work? Well, when most of these systems merge data together, they require that you identify a node with some kind of unique identifier.
In this case, the example is that you have two nodes, and they're joined because each dataset uses the same identifier for person one-two-three. So then you can almost do inference: you can ask a question like, do any of the books have an author named Dan, just by joining two disparate datasets together. The quick tip is: how do you make sure your nodes have the same name? That's why a lot of organizations are starting to use URIs for that. You can see that there's a lot of data out there. This is one of the pictures from the linked open data community. Each of these circles is in fact a dataset; some of them are media data; some of them, the yellow in the lower left, are geographic data; the upper right is more publication-type data; and life sciences is down in the lower right. All of these datasets are designed to have things called SPARQL endpoints. These endpoints allow you, in a sense, to send a query to them and get a graph back out. DBpedia is kind of in the center of this diagram because they have a lot of links, and the big news that happened just recently is that Wikidata has now taken over a lot of the job of managing these links, to have consistent links and rule checks; Wikipedia data has been used as a source for a lot of this linked data. One of the most popular systems here is the Neo4j database. Neo4j has really gotten a big percentage of the graph market because they have really appealed to developers. It's open source, although they do have a supported version, and it also has things like ACID transactions. People have often built these small Neo4j databases, done specific queries, and then integrated them with their other systems. So they may not be doing all of their storage and retrieval and searching off of Neo4j, but they certainly have done a lot with it. So let's move on to our last one, which is document stores. Document stores are kind of the darling of the venture capital world.
There's a lot of funding going into this area. This is the area where you're storing your data in nested hierarchies. You're putting logically related data together; so, for example, an order and all of its line items would be represented in one tree. Examples of these products would be MarkLogic, MongoDB, Couchbase, CouchDB, and one that I use almost every day, eXist-db. The key advantage is that you don't have to have an object-relational mapping layer. They're really very compelling for search. They are a little bit more complicated to implement because of their query languages, and they aren't necessarily compatible with a lot of SQL; they use path expressions rather than SQL queries. But they've also been associated with a lot of accelerated development. There are two major subtypes of document stores, JSON and XML, and they really are kind of similar to the old object stores that we used to see: you're storing a serialized version of an object, and all the objects contained in it, all under the same structure. They are maturing very quickly to add a lot more features. You're starting to see more support for ACID transactions, and a lot more revenue coming into these companies as they get funded. Here's a little chart showing in red some of the companies that are really focused completely on NoSQL systems, and I wanted to point out that MarkLogic, 10gen, and Couchbase, where 10gen is the company behind MongoDB, are some of the biggest leaders. This chart shows that they're still tiny compared to the revenues of the big companies that are doing support and service, but we're starting to see a lot more of these document store models become very popular. So why is this important? Let's look at the problem of moving data in and out of a relational database. This is what I call the four-transformation model: you have four different transforms that have to occur.
T1 gets data out of your web page into middle-tier objects in Java or .NET; the transforms in T2 and T3 are usually done automatically for you by Hibernate or some other object-relational mapping tool; and the way to get the data back out of the objects is to serialize it back into HTML. The key thing about the document models is that you don't need to do this every time you send data back and forth, and you'll also be able to build more transforms directly off your objects. A lot of people have called this transformation layer a kind of quagmire that you get stuck in: many projects get lost there, and getting in and out of the relational database becomes the most important problem rather than solving your business problem. Document stores can store the entire document that the application layer builds in the web browser. You might just do one call to your document store and pull out an entire JSON object that populates all of your web browser information. There's no complex middle tier, there's no shredding, there's no reassembly. Really simple. One of the things that I have done a lot of work in is XML standards, a lot of work with transforms and XForms. XForms has a model where the data is all in one object; the save button goes directly to the database, so you can save that straight into your native XML database. With one REST query, you can rebuild all the data in your forms. There's no translation layer between your forms and the native XML database. It makes applications much easier to build and maintain. Document stores are also the champion of being schema-free, where you don't have to do data modeling beforehand. You don't have to know much about your data; you just receive your data, there's metadata inside it, and that metadata builds up the indexes. So you can load the data and then you can still validate it. There's still a consistent schema you can use to validate the data, but you don't have to do your modeling before you load. 
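The "no shredding, no reassembly" point can be sketched with a toy in-memory document store in Python (the `save`/`load` names here are hypothetical, not any product's API): the whole nested object is serialized and retrieved in one call, with no object-relational mapping layer in between.

```python
import json

# Hypothetical in-memory stand-in for a document store's save/get calls.
store = {}

def save(doc_id, doc):
    """Serialize the entire nested document in one operation."""
    store[doc_id] = json.dumps(doc)

def load(doc_id):
    """One call rebuilds the full object: no joins, no reassembly."""
    return json.loads(store[doc_id])

# The whole order, line items and all, travels as a single document.
order = {"id": 1001, "customer": "Dan",
         "items": [{"sku": "A1", "qty": 2}, {"sku": "B2", "qty": 1}]}
save("order/1001", order)
assert load("order/1001") == order
```

Contrast this with the four-transformation model above, where the same order would be shredded into orders and line-item tables on the way in and rejoined on the way out.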
Schema-free systems are also very nice when you're integrating multiple versions of documents. You don't really care if there are extra elements in there; the system just absorbs them anyway, and then you can do queries on the data that is consistent. So a lot of companies are starting to adopt these because they have such dynamic systems that are changing so frequently. No upfront modeling means you don't have to decide exactly how your data is going to be structured before it's loaded; there's no data definition language to write first. This is one of the reasons that agility has really gone up in recent years. I should mention that you still have data modeling tools. This is an example of an XML Schema diagram done with a tool called oXygen that we use quite a bit. It's pretty easy to read: the black solid lines at the top represent required fields, the thin lines are optional fields, and data types and cardinality are shown directly in the diagram. You can teach business analysts how to use these to model and validate your data and check that it's consistent. There are also other tools that build these schemas directly from sample data, inferring the structure of the schema from the instances. The company most often associated with large-scale XML databases is MarkLogic. They've been around for a long time; they have a lot of the enterprise-scalable features and ACID compliance, and they're heavily used by federal agencies as well as document publishing organizations and organizations that have high-visibility data. We have two case studies for MarkLogic in our book, focused on the financial industry, so they're definitely one of the most successful companies. One that I think has gotten a lot more momentum and is much more visible is the MongoDB database, the open source JSON data store created by 10gen. Mongo uses a very powerful master-slave distribution model. 
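The idea of inferring a schema from sample instances rather than writing DDL first can be sketched in a few lines of Python. This is a deliberately simplified illustration, not how any particular tool works: it just collects the field names and value types it sees across sample documents.

```python
def infer_schema(docs):
    """Infer a flat field -> sorted-type-names map from sample documents."""
    schema = {}
    for doc in docs:
        for field, value in doc.items():
            # Record every type observed for this field across the samples.
            schema.setdefault(field, set()).add(type(value).__name__)
    return {field: sorted(types) for field, types in schema.items()}

samples = [{"title": "NoSQL", "pages": 300},
           {"title": "XQuery", "pages": 250, "isbn": "123"}]
print(infer_schema(samples))
# {'title': ['str'], 'pages': ['int'], 'isbn': ['str']}
```

Note how the second document's extra `isbn` field is simply absorbed into the schema, which is exactly the versioning tolerance described above: you load first, and validate against a schema afterward if you need to.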
They have a fantastically strong developer community and do a great job at building alliances and partner relationships. They have sharding and automatic movement of data. People just seem to like it because the JSON data storage is so familiar. They have drivers for a lot of different languages, so you're not really tied to a query language directly; you can use whatever programming system you want. Couchbase is the one that has really started to come from behind. They used to be a simple key-value store, but recently they've built out a full document store, and they've also been very, very good at scalability and high availability. It's written in Erlang and has a lot of the nice features of high-availability systems. I just want to make sure that everybody knows that CouchDB is a very different product than Couchbase. CouchDB is the original Apache project; a lot of the people who worked on it left and are now working for Couchbase, but the two have almost completely different source code bases now. They are both written in Erlang, but CouchDB has other markets that they're going after. A lot of consumers are using it, and there are companies supporting it as one of the most widely distributed JSON stores in the world, a very fast and scalable system. They also have small versions of the database services that run on your cell phones, so it's easy to sync those up. I also just wanted to mention eXist. eXist doesn't get a lot of publicity in the U.S., but it's very big in the digital humanities area. It's an open source native XML database with strong support for XQuery, and a lot of people are using it for doing queries on annotated data: things where you have people, places, and dates all annotated. It has a lot of nice features, although it does face some of the same scalability issues. One of the things to mention about document stores is that they allow you to retain structure. 
When you retain the structure, you can then do search where keywords that appear in a title, for example, float directly to the top of a search result page. So it's much easier to implement high-precision search and retrieval using document stores. You can have a hierarchy of documents, and you can have different ranking rules for each element in that hierarchy. And to wrap up: you'll find that it's rarely just one system out there. There are a lot of different architectures that are combinations. You might use document stores, you might use key-value stores, you might use search and retrieval, you might use MapReduce, you might use OLAP for reporting. They can all work together. I also wanted to mention some tools to help you pick out the right systems; we spent a lot of time on this in the last chapter of our book. This is the Carnegie Mellon Architecture Tradeoff Analysis Method (ATAM) process. It brings your business drivers and your architectural options together, so in this case you can see there are insert, query, and publish efforts. We add them all up and we can help analyze which of these architectures might be best for a certain type of problem. We build these things called utility and quality attribute trees: we talk about how useful each system is, and we score those against the business drivers. We have even released some open source software on GitHub that helps you build your quality attribute trees. So, just to close up and open it for questions: our book is available at Manning, and I would really love to hear from people about what topics they're interested in. And with that, I think we're going to open it up for questions. Let me just make a sound check again, given some of the audio problems we've had on the broadcast today. You can hear me fine, I think? You're just fine, Dan. Let me just mention to everybody, given the number of questions: 
The slides and the recordings will be distributed as soon as the recording is available from WebEx, which is generally within 24 to 48 hours; we'll be distributing that to everybody who is registered. We have numerous questions, some relatively short. One concerns a slide toward the very end of your presentation: would you consider Lucene and Solr examples of key-value stores? Not quite, although I have to say that Solr is really starting to become a very strong NoSQL solution. But I think it's more closely related to a document store: what they really focus on is indexing documents, so it's more of a document store than a key-value store. But I can see the similarities. Okay. Certainly its ability to fit into the NoSQL picture and integrate with a lot of other solutions, I think, was the point you were getting at in your slide. Right. Yeah, it's often part of other projects for the full-text search piece; almost all the NoSQL products have integration points with Lucene and Solr. Right. Now, there are a couple of questions about ACID and whether you see ACID being incorporated into NoSQL technologies in the near future. And again, you sort of addressed this as you went through the presentation, but part of this question comes from people who have looked at products like Mongo, for example, which has not provided ACID support. Do you see it coming into various NoSQL solutions in the near future? That's a good point. I do want to say that Mongo does have a variation of ACID: if you make a series of changes within the same document, you'll have many of the benefits of an ACID commit. It's only when you're making multiple changes across multiple documents that it doesn't come out of the box. Of course, there are other ways to do that within your application, so you can still get the same benefits from a NoSQL database for reliable transactions. 
Most of these NoSQL products have something called BASE: basically available, soft state, eventually consistent. What they allow you to do is tune how many servers must respond to get a consistent result. So what you end up doing is using different types of APIs: instead of a simple begin and end transaction as in SQL, you use parts of their APIs to make sure you're getting consistent, reliable results. Many of these systems also support advanced features like cross-data-center replication. Say you have two data centers, one on the east coast and one on the west coast; you can say, don't respond to the user until you get a write acknowledgment from both data centers. Those things are much harder to tune in some relational databases. So I think what you see is different ways of getting reliability and high availability. Certainly some of the newer NoSQL products don't have ACID commits, while systems like MarkLogic have had ACID commits for a long time. So it really depends on your budget: how much you're willing to pay for commercial licenses versus how much time you're willing to have application developers spend building that in. I do think the most important number to remember is that only about 5% of transactions really need ACID guarantees, so if you have a project, you can focus your developers on those and get around some of these problems. For some of those applications that really do need transactions, relational databases might still be the right way to go. Sure. There's a question here about whether you see any relational users, data warehouses for example, or OLTP systems, migrating to NoSQL, and why or why not. Yes, I do see some of these large systems migrating to NoSQL because of the scalability. A lot of these NoSQL systems have very sophisticated ways of doing hashing, using hash algorithms to do reliable distribution over clusters. 
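The hash-based distribution mentioned here can be sketched in a few lines. This is a simplified modulo scheme for illustration only: production systems use consistent hashing so that adding or removing a node moves far fewer keys, but the core idea, that any node can compute where a key lives without a central directory, is the same. The `servers` list and `locate` function are hypothetical names.

```python
import hashlib

# Hypothetical three-node cluster.
servers = ["node-a", "node-b", "node-c"]

def locate(key):
    """Hash the key to a stable number, then pick a server by modulo.

    Simplified partitioning for illustration; real systems use
    consistent hashing to minimize data movement when nodes change.
    """
    digest = hashlib.md5(key.encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]

# The same key always maps to the same server, deterministically.
assert locate("order/1001") == locate("order/1001")
# Many keys spread out across the whole cluster.
homes = {locate(f"key-{i}") for i in range(100)}
print(sorted(homes))
```

This determinism is what lets these systems scale reads and writes horizontally: clients or coordinating nodes route each request straight to the replica set that owns the key.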
An OLTP system moving to NoSQL gets fewer single points of failure and also gains scalability. What these systems haven't yet added is the automatic aggregation, automatically computing sums and totals, that the very mature OLAP products do. I think you're starting to see some of the benefits in these larger systems as they add more aggregate features. What people really want is fast results for sums and totals, and each of these major NoSQL database vendors is starting to add those already, as well as allowing other systems to, in effect, send standard reporting queries through an interface that looks similar to a relational ODBC interface, while the back end really is a NoSQL system. Okay. And there's one question here that I wanted to make sure we mention: DB2. The questioner is wondering, isn't IBM implementing NoSQL functionality on the Mongo model? That's kind of a complicated question. IBM has had a native XML version of DB2 since DB2 version 9, where they store large XML documents and immediately index them, and they also support a subset of the full XQuery system. So I would say that's kind of true: IBM does have a lot of the document store model that Mongo has. What they don't yet have, which MarkLogic and eXist and some of these other companies that support JSON queries do, is a really nice, easy, transparent way for developers to store and retrieve their JSON objects without ever having to go through the hoops of SQL and object-relational mapping. Most of the object-relational tools don't yet work cleanly with the XML stored in IBM DB2. If that's changed, please let me know, but the last time I looked at it, you couldn't get Hibernate to automatically store your JSON directly into the BLOBs. So the object-relational mapping wasn't as elegant, but I think IBM is certainly moving in that direction and has seen some of the benefits of these JSON document stores. Right. I think IBM may well be making some major announcements in this area shortly. 
There are three or four other questions here; we'll try to keep them short. If possible, could we get a little more discussion on the difference between the graph model versus the document model? Yeah. The key thing is that they both have nodes that are in structures. Graphs really allow for very large interconnected things, whereas documents, I think, are complex structures that are more like leaves on trees, which can have lookups between other documents. That may be kind of a subtle difference, but document stores are really ideal for naturally related documents, whereas graphs are much more about chaotic data that may be coming off the Internet, data with no predefined structure of relationships. Okay, a short answer on a complex subject. You mentioned data modeling. Pardon me, Dan. Okay, go ahead, Tony. You mentioned that data modeling is not on the critical path of creating NoSQL stores. Is it required at some point at all, and if so, in what way? You know, it all depends on what you're using it for and how important it is for you to have what I call canonicalized data, which is very, very consistent data. You certainly don't have to do it beforehand, but if you really have business rules where certain objects have to have certain constraints and rules and relationships, then you are going to start to want to do data modeling. And I think the key is that you have so many different options, so many different ways of doing it in document stores, whereas you're kind of forced to do it up front if you're going to use relational databases. Okay. And this is a bit of a trick question about matching products to use cases with the products available today. I'd say it's actually a higher number than that. Is there a reference model for matching business requirements to products? There is no one published model that I know of. 
What we do as consultants is we have a series of best practices and a series of if-then-else rules, and we have a process of taking people through them using the Carnegie Mellon ATAM process. You saw some of these rules in the presentation: if search is really critical and you want to rank titles of documents higher, then you need to retain the structure, and document stores are important. So what I'd say is it's probably about 200 to 300 if-then-else statements that will help you rank different things; the formal requirements process helps you answer those questions and then rank the different architectures first. Once you have an architecture, then you can start to look into products that have the features that are important for you. That's really what the consulting business around this NoSQL matching is, and that's really what the solutions architecture groups of these new companies are trying to focus on: understanding those rules in the context of a certain situation. Indeed, we'll be taking up these questions at the conference in August. I think that's all that we have for today, Dan. I'd like to thank you very much. I'd like to also thank our audience for persevering through some awkward audio. We will be distributing both the slides and the links to the recording within the next couple of days, as soon as the recording is available. We will also be in contact shortly with someone who has won a free ticket to the NoSQL Now conference in August; we'll be doing that drawing after the event today. I thank everybody for attending today's webinar. Thank you again, Dan, for contributing your expertise today, and we hope to see you all soon at the next in our series on NoSQL. Thanks very much. Bye-bye. Thanks, Tony. Bye.