Hello and welcome. My name is Shannon Kemp and I'm the Executive Editor of DATAVERSITY. I'd like to thank you for joining today's DATAVERSITY webinar, "Databases: CAP, ACID, BASE, NoSQL, Oh My!", sponsored today by MarkLogic. Just a couple of points to get us started. Due to the large number of people attending these sessions, you will be muted during the webinar. For questions, we will be collecting them via the public chat or the Q&A in the bottom right-hand corner of your screen. Or, if you'd like to tweet, we encourage you to share highlights or questions via Twitter using hashtag DATAVERSITY. As always, we will send a follow-up email within two business days containing links to the slides, the recording of this session, and any additional information requested throughout the webinar. We have two great speakers with us today from MarkLogic: Diane Burley, the Chief Content Strategist, and Jason Hunter, the Chief Architect at MarkLogic. Let me spend a moment to introduce Diane, and then she will introduce Jason in today's webinar. Diane Burley is a former business and technology reporter and online media executive. Burley believes that regardless of industry or audience, be it internal or external communication, engagement is engagement, and unless the content is highly relevant and perceived as valuable by the individual, it is worthless. And with that, I will give the floor to Diane to start the presentation. Hello, Diane, welcome. Thanks, Shannon, and thank you so much. So today my colleague Jason Hunter and I are going to spend our time talking about this new category of databases that, ironically enough, is defined by what it is not rather than what it is. And within the category of NoSQL are many different types of databases. We'll talk about the types, some of the promises these databases make, the myths that are out there, and whether or not they can be relied upon by your enterprise.
The format that Jason and I prefer is a conversation. So we have some slides for you to refer back to later on, but we want this to be as interactive and informative as possible. As Shannon said, fire away with questions using the chat box in the lower right; we're going to weave them into the conversation. So, let's get started. Jason Hunter is our chief architect, as Shannon said, and our chief really-smart-guy at MarkLogic, which means we pack him up and send him off whenever we need to do some evangelizing or some teaching. He's also on the strategic product development side. He's an author, was one of the original contributors to Apache Tomcat, and at heart he is a developer. So, hello, Jason. How are you today? So, I do these talks because you say such nice things. You're going to get stuck doing more, too. Good. Let's talk a little bit about how you got into this whole journey of NoSQL. We laughingly talk about the fact that it's hard to describe it except by what it's not. So maybe you can just tell us a little bit about your journey into this universe. Okay. So, I work on a lot of different stuff, and I'm the guy in charge of JDOM, a Java library for working with XML. I spent a lot of time, and this is roughly the 2001 timeframe, trying to figure out how to scale up JDOM so it could handle gigabytes or more of XML, looking at optimizing memory and everything. And as part of a consulting job I stumbled upon this company, which became our company, back when there were just four people, with this product. I was able to load up gigabytes of XML and ask a deep question, and it gave me an immediate answer. And it was one of those moments that really shifts how you think about things, because I'd never seen something like that before.
It was like the first time I saw the Internet and connected to Japan, transferring files freely to a foreign country, something that should have cost a ton of money by voice. I cocked my head to the side and thought, all right, this is going to change things. And that was basically me seeing NoSQL before there was NoSQL. The wave that has come since hadn't happened yet, but it was just me thinking: all right, this is a way to store data that's not relational, not in a relational database, because I'm able to make node-level modifications, it's transactional, and I can store different things in here. And I didn't have to do any DDL; I didn't define any data up front. I just loaded a bunch of data and then asked a question. That's just weird. I wonder what I could do with this. And so that got me started. Then I used it as a customer on contracts, and then I took a day job about a decade ago, so now I've been at MarkLogic 10 years. And it's kind of gratifying to see NoSQL take off, having seen it before it got buzzy. So you came from the RDBMS world. You'd been doing that for some period of time. Did you have to abandon a lot of prior knowledge in order to take this leap, or was it fairly easy to just get into it? Well, you have to learn some things, but I found it quite freeing. There are a lot of parts of relational that I always found frustrating. The classic one is, I would model something like a person, and I'd give the person a birthday, so there's a column, birthday. And then one time you find a person who has more than one birthday. And how is that possible? You know, my model is so perfect. Well, sometimes a person lies. Or we actually have two sources for their birthday, and we might not know which one's real; I'm just trying to deduce it. Or sometimes I lie myself, like to Facebook, because I don't trust those guys.
So if you model me, you have to understand all the different lies I tell people about my birthday. And a birthday is usually a security question; the last thing I want to do is put the real one out there publicly. Especially when you do counterterrorism or genealogy: people are lying, the records are sloppy. So for whatever reason, birthday becomes something with a cardinality greater than one. Now what do you do? Every query breaks. Suddenly you have to do a join where it's not really natural, and every query that ever existed has to change. And what I'd want to do is bribe the person so that the problem goes away: I will give you $1,000 to forget that second birthday. Because I'd prefer to have my data model be clean and perfect, and I wish the real world were perfect, but the real world usually isn't. And that's one of the reasons NoSQL came about: to handle the messiness that the real world has. But you do things differently. You think differently. It's just like learning programming languages, though. When you start programming in C and you move to Java, you do things differently, and you can tell a C programmer who's now a Java programmer but hasn't learned the idioms. When you go from Java to Ruby, it's a similar thing. Different languages have their own dialects, and NoSQL has its own. You know, when I was building this deck, you were joking with me. I was trying to come up with some characteristics of NoSQL that we could all agree on, and there really wasn't a whole lot out there we could agree on, except for the three bullets on the screen right now. I thought this might be a good time to do a level set and push out a poll to find out what others are thinking, because I think there's a lot of confusion around it. I've spent a lot of time with developers who are not acquainted with MarkLogic, and I think the message is fairly clear that there is this constant confusion.
And I think it's because defining it as "not only SQL" doesn't really help you figure out what it is. So how do we put some body on the shape-shifting database spaces that are out there, and all the flavors of them? So let's take the poll. Is that up now? Okay, the poll question is there; everybody take a moment to look at it. What are you wondering most about NoSQL? What are some of the top-of-mind concerns? Meanwhile, maybe we can talk a little bit about the fact that this term "not only SQL" is fairly recent. MarkLogic is over a decade old, as you said, older than the terminology, and we struggled internally to figure out how to describe ourselves. So why don't you riff a little bit on your thoughts on NoSQL? Yeah, it's a frustrating term. You ask, what does every NoSQL database have in common? That it's not SQL. It's as if someone said there are only reptiles in the world, and suddenly there are all these different kinds of plants and animals and mammals, and what do they all have in common? They're not reptiles. That's what they have in common. For so long, relational databases dominated the way you store data that people defined a whole new category by what it is not: the not-relational stuff. And there are a lot of kinds of not-relational stuff. They call it NoSQL; it could have been called "not relational," but NoSQL is catchier, I guess. So you get a key-value store. You get an object database. You get a document database. They're all just different approaches, and it's legitimizing the alternative ways to store your data. Each of these has its reason for existing and its market. MarkLogic is a document database, which tends to involve data that's still somewhat structured, but more semi-structured: changing structure, textual. Whereas a key-value store is an alternative way to write applications, and I'll say, you know, we never compete with them in a market.
A key-value store is like a hash table in the sky: if you can solve your problem with a big hash table, then that's the way to go. But it all came about, and it's legitimized the market. We were in internal meetings trying to figure out how to say what we are, and we kept coming back to: we're for the stuff that doesn't fit very well in Oracle. When you're done putting everything you can in Oracle, there's still stuff left over. And in fact, by some estimates, 80% of the world's information doesn't fit well in relational. So it's the vast majority of stuff that doesn't fit in there, and that often can fit in a NoSQL database. But because SQL is still a very dominant way to declare what you want, people added SQL interfaces to NoSQL databases, and then kind of retroactively said, well, let's call it "not only SQL." That's where the name comes from. But the other part has been a legitimization of alternative approaches to problems. So with the combination of big data and NoSQL, the market has finally created the terms that we searched for as a company 10 years ago to describe what we were busy doing. As more people enter a space, the market starts creating terminology you can leverage, so we can call ourselves an enterprise NoSQL database and people now know what that means. Let's talk about some of the problems that NoSQL solutions solve out there. I'm noticing that the poll is revealing that people want to know what kinds of NoSQL use cases are out there, and the differences between NoSQL and RDBMS. And I think some people are going to be pretty intrigued when we show them what some people are doing with different successful deployments that are out there. So that's good; we can definitely address all of those issues. But right now, you've touched on it before: there are several different flavors of NoSQL. Let's dive into them a little bit.
Jason, you were saying there are key-value stores, column stores, graph stores. And some of the key-value ones, like Cassandra, people are looking at and saying, hey, what can I solve with that, that maybe I would have solved with a relational database before? Key-value stores are the simple ones. Just like an in-memory hash table, I'll store something by a key, and I'll be able to look it up by that key later to find its value. If you can get that at massive scale, it actually becomes kind of interesting. And if you can make it a persistent store, then it's a good way to remember huge amounts of data about people. Think of remembering the last logon of half a billion people; Facebook is bigger than that now. Are you going to put that in a table? That's a good fit for a key-value store. And I think it's Facebook that even uses a key-value store to do the search for messages. What's interesting there is the combination. What they did is they put each individual message, as a binary, as a value inside the key-value store. Then, when it comes time for you to search your messages, they do a key-value lookup saying, hey, the key is this user's inbox, and the value is a little search-engine database for that inbox, and then they can run against that database to do the search you just requested. So the value doesn't have to be a simple little string or something; in that case the value is basically an index. And where else are you going to put half a billion people's search-engine indexes, right? And so then you take this new kind of technology that solves a hard problem in an easier way, and that's how this new wave happens: people start looking and saying, hey, could I use some of that technology for my own problems?
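The key-value pattern Jason describes, a durable hash table in the sky whose value can even be a prebuilt per-user index, can be sketched in a few lines. This is a toy illustration; the key naming scheme and data are invented, not any particular product's API.

```python
# A key-value store is conceptually a durable hash table: one opaque
# value per key, retrievable only by that key (no secondary indexes).
store = {}

# Remembering the last logon time of half a billion users fits this shape.
store["user:1017:last_logon"] = "2013-06-04T09:12:33Z"

# The message-search trick: the *value* can itself be a prebuilt search
# index over one user's inbox, fetched in a single key lookup.
store["user:1017:inbox_index"] = {
    "shipment": ["msg-3", "msg-9"],   # word -> message ids
    "invoice": ["msg-9"],
}

def search_inbox(user_id, word):
    # One key-value lookup, then a local query against the fetched index.
    index = store.get("user:%s:inbox_index" % user_id, {})
    return index.get(word, [])

print(search_inbox(1017, "invoice"))  # ['msg-9']
```

The store itself never looks inside the value; all the query smarts live in what the application chose to put there, which is exactly the limitation Jason turns to next.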
But key-value stores have the limitation that there's no looking at the value to find it. There's no "show me all the ones that look like this." There's no index on the value; they call those "secondary indexes," which kind of tells you it's not considered the normal way to look at stuff. And so a document store, like MarkLogic, says: you store a bunch of documents, which are self-describing data formats such as XML, JSON, binaries possibly; we use all of those. You can still do a key-value-style lookup, which is: based on the document ID, find it and give it back to me. But what the document-centric ones focus on is: let's index the value, which is the document, so that you can find documents based on their own attributes. Find me people who live within this geography, who are in this height range, born around this time frame, and whose description has this sort of text. Whatever it is that you're modeling. If you're modeling packages, here's an example of relational versus document-centric: a logistics company that did shipping of packages. It sounds like a pretty easy relational problem, but it's really complicated, because there are so many different attributes that might be present on a package. If you're shipping wine, there are all these temperature controls: what the temperature was everywhere, the jostling of it, and everything. And so to do a report of "tell me about this package" was a matter of joining a huge, complex schema. They found it slow, and they found it hard for a new programmer to really understand their schema. So how would you approach it using a document-centric data model? We'd have a document for every package. It's a really simple data model; I mean, it fits on the back of a napkin. It defines the things that matter for that package. And for the attributes, we'd model each one as an XML element that might or might not be present. And if one is absent, that's fine; there's no wasted index space.
But all the attributes that are present get indexed and are queryable. And then if you want to, say, find packages that went through these two places, that's a really quick query; there's not even a join involved. If you're looking for packages where the temperature was out of range at any time, or for a certain duration, all that knowledge is encoded in the same document, so it's really efficient. You're not doing all the joining again; you're not contorting the schema trying to minimize the number of required joins. And it keeps the package whole: when it's delivered, a week later you can delete it. You just delete the document; that's the full representation of that package. You're not trying to figure out how to do a big cleanup where the data is scattered all over the place. So it was just natural, and things were a lot faster. The writes could be a little slower, because to do a write you have to go find the document and modify the document, rather than just adding another row somewhere. So there is a performance trade-off: how many times are you doing a read versus doing a write? How important is third normal form, and how important is it just to have a serialized representation of the data? In that situation, the application was just developed so much more quickly, I would say, because if you want to add new metadata about a package, you just say, okay, the data is going to include this, versus everyone having to figure out how to join that into an ever more complex data model. So MarkLogic has oftentimes reduced the number of DBAs required on a project by up to, you know, a nine-times reduction. I'll stand in front of DBAs and say, yeah, your job will get a lot easier, which is what DBAs want to hear, surprisingly. Is it true?
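The package example can be sketched as one self-describing document per package, queried on whatever attributes happen to be present, with no joins. The field names and data here are invented for illustration; they are not the customer's actual schema.

```python
# One document per package; each carries only the attributes that
# apply to it (wine shipments get temperature readings, books don't).
packages = [
    {"id": "pkg-1", "contents": "wine",
     "route": ["Oakland", "Denver", "Chicago"],
     "temps_f": [55, 71, 56]},
    {"id": "pkg-2", "contents": "books",
     "route": ["Newark", "Chicago"]},
]

def packages_through(stops):
    # "Find packages that went through these two places": no join,
    # because the whole shipping history lives in each document.
    return [p["id"] for p in packages
            if all(s in p["route"] for s in stops)]

def temp_out_of_range(low, high):
    # Absent attributes simply don't match; no NULL-padded columns.
    return [p["id"] for p in packages
            if any(t < low or t > high for t in p.get("temps_f", []))]

print(packages_through(["Denver", "Chicago"]))  # ['pkg-1']
print(temp_out_of_range(45, 65))                # ['pkg-1']
```

Deleting a delivered package is just deleting its one document, which is the cleanup point made above; a real document database would of course index these attributes rather than scan a list.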
So for the databases out there, is the primary distinction really the data you want to store and the queries you're going to want to use to get at that data, or is it more complicated than that? That's a good way to look at it, yeah: what kind of data, and what kind of questions will you ask of the data? How will you retrieve the data; are you trying to analyze the data? And then there's also a bit of, and this ties into the enterprise-grade features we'll talk about and how we differentiate, how important is that data? Is it okay to lose it? Because a lot of the NoSQL products are still of the mindset that it's okay to maybe sometimes lose some data. We don't think that's okay, so we take pains with the transactional subsystem so it doesn't let you get yourself in trouble, and it gives consistent results. We'll talk about that when the time is right. So: how okay is it to lose your data, what kind of questions do you plan to ask, and what kind of policies do you need in place about data retention? And the document-centric data store is probably the most flexible, versus the column store or the graph store or the key-value store that you were mentioning before? Probably. I may have a selection bias, because I spend most of my time working on problems where that is a good selection, right? If you have a simple key-value lookup problem, well, a document store can do a key-value lookup, but it might be overkill if you're simply managing the last login time for somebody, because that's easier to solve with a big hash table in the sky. You could do that in a document store, but you don't need to. Object databases, obviously, were around for a long time but haven't taken off much, which is kind of interesting. Someone will have to figure out exactly why that is. Maybe it's that they were programmer-centric; maybe it's that the schemas were still so tight that evolution was hard. And the mapping tools for going between objects and relational were sometimes cumbersome.
One of the things documents have going for them, by the way, is that for content on the web, the back end tends to be structured markup, and the front end tends to be markup too. Having a database that glues into the front end using markup, and understanding markup, is a lot easier, because there's no impedance mismatch. You go to a normal Java conference, and you go to these talks, and it's all about mapping relational tables into objects. Then you go to the next talk, and it's mapping objects into HTML. So we have three different data models involved in writing these applications, right? And I'm spending all my time going objects to tables to markup. It's complicated, and I found it a lot easier to do markup, markup everywhere. Well, plus it's very slow; your time to market is impacted. That is exactly the type of thing that slows development and time to market. Yeah. MarkLogic tends to thrive on the need for agility, when people say, you know, I know what I need to solve today, but I'm not a hundred percent sure about tomorrow. I need the flexibility to change gears. I want to be able to create new products faster, and I don't want to wed myself too much to something. It's one thing if you know exactly what you're looking for, but if you've got to change direction, different technologies have different characteristics for how easily you can turn left or right, and that favors the NoSQL approach. You don't want to paint yourself into a corner, which is typically what happens in the RDBMS world: you can't easily start to add new data sources. And I think that's really benefited MarkLogic, too. Over the last couple of years there's been this whole big data thing that you mentioned earlier: the volume, the variety. It really is the volume and the variety of data that companies are having to deal with now. It's not straightforward.
People say, okay, you've got a ton of single-purpose systems out there collecting all types of data that have to correlate with one another. You've got an ERP system over here and some other system over there, and connections between them; you've got log files coming in. Everybody's looking for that little edge that's going to boost their margin over somebody else's, and it's in the data. But can you get the data to actually play well together? That wasn't a question, so I'll just say: okay. You can tell I'm not used to this. So anyway, someone asked a question earlier: how can you be sure that a search on a document for a particular attribute will return the exact number of hits? Well, that's what we're paid to do, really, right? We have an index we call the Universal Index, and the way it works is that we notice all sorts of facts about every document as we load it. We don't index what we're told we might see; we just index what we do see. So if we see that there's a certain element with a certain value, we keep an index entry for that, even though you didn't tell us anything about your schema in advance. We do this for all sorts of elements and attributes, and we remember lots of little facts, just like a search engine remembers all the words. So the simple way to think of MarkLogic is that we're a database built like a search engine, using search-engine-style indexes instead of classical relational indexes, and that's how we manage to get speed and scale.
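The indexing idea Jason outlines, record every observed (element, value) fact at load time and answer queries by intersecting sets of document IDs, can be sketched roughly like this. This is a toy model of an inverted index, not MarkLogic's actual internals, and the sample documents are invented.

```python
from collections import defaultdict

# Inverted index: each observed fact -> the set of document ids having it.
# Nothing about the schema is declared up front; we index what we see.
index = defaultdict(set)

def load(doc_id, doc):
    # Notice facts about the document as it loads.
    for element, value in doc.items():
        index[(element, value)].add(doc_id)

load(1, {"name": "Ada", "city": "London", "height": "tall"})
load(2, {"name": "Ada", "city": "Paris"})
load(3, {"name": "Bob", "city": "London", "height": "tall"})

def query(**facts):
    # Intersect one posting set per constraint; the result, and its
    # count, is exact, with no brute-force scan of non-matching docs.
    sets = [index[(e, v)] for e, v in facts.items()]
    return set.intersection(*sets) if sets else set()

print(query(name="Ada", city="London"))          # {1}
print(len(query(city="London", height="tall")))  # 2
```

Every constraint contributes its own set, which is the contrast Jason draws next with the relational habit of picking one selective index and brute-forcing the rest.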
So we notice all these facts about a document, and for every fact we remember which documents have that fact. Then the question becomes: which facts would have to be true about documents matching the query? So say you're doing a counterterrorism scan, trying to find a bad guy, and you're describing all their attributes: birth date, name, height, weight, whatever, and we can do bounds on them. For every one of those traits, the index tells us which documents have that fact, and then we start intersecting, just like when you're doing a phrase search. What we end up with is the set of documents that have all those facts true. And it's strange to me when I talk to relational people, because they're always trying to find the one index to solve the query, so you tend to have indexes that are very query-specific, whereas we have lots of indexes brought together to solve the problem by intersecting. Here's the kind of thing I've found: if I have a table with ten columns and I do a query where I specify that some of those columns are equal to some values, and I don't know in advance which columns and values, I just want the flexibility to ask that question, which is a kind of metadata problem: find objects, people, or things where these things are all true. A relational system will only use one index to resolve the query. So if I'm specifying ten columns, I only get to use one of my constraints to limit the candidates, and the rest will be brute force. It's so wasteful and so inefficient. And, to the person's question here of how can I be sure that my count is right: my approach is different, because we have indexes on all these things, and we're always intersecting them together. We'll figure out which documents, in our world, have every one of those facts; we'll intersect document identifiers, and the result is the set of documents that have all those facts true, and it'll be exactly
correct. And then counting those, which internally is a set of integers, is something we can do quickly, and we can run other analytics against it as needed and still be accurate, without having to brute-force things. And our indexes are different: instead of expecting compound indexes, we don't have compound indexes; we combine simple indexes with ANDs and ORs to solve a problem. So instead of one big index designed for one problem, you have this very flexible set of indexes that together solve the problem, and as a result it's more flexible and more accurate. And although we don't tackle the relational problem, it is a different approach to that kind of problem. I was talking for a bit with a relational guy about the way they do it now: you choose an index, one that's pretty selective, and then you just brute-force the rest to get the answers. And he said, well, but their indexes point basically directly at the row locations, whereas we're always doing things based on document identifiers, so it's easier to do the intersections if you assume documents. As for use cases, which we'll look at a little later: a common one is that the data is not perfect, but it's important, and I still have questions I want to ask of it. We have a question here that I want to get to, because you've just touched on it, and people are confused enough already. You were saying the use of NoSQL is informed by three questions: what is the data, what questions will be asked of that data, and what is the retention policy. Can you give an example of how those answers might change and make a difference in the decision? Okay. Is it structured and repeating? Is it tabular? That's the classic domain of the relational database. It's the kind of world that accountants can create, where everything fits and it's just perfect, if it's perfect, or you force the world to be perfect. And we have a
pretty good example in equities: I bought stock at this price, on this date, from this party. It's very regular, and you can have tens of thousands of those flying by every second; that's the stock market. If you need enterprise grade, you buy Oracle; if you need less enterprise, you use MySQL or something. The queries you run tend to be "find me the one where..." If you're running more analytical queries, then you go into OLAP, where a hypercube index is going to let you answer more analytical, data-warehouse-style questions. But then the content gets a little messier, and you have to start looking at alternatives, or you're just banging your head against Oracle, which we see a lot of people do; they'll just keep banging their head until they're done, and people like to do that. So what's kind of messy data? Derivative contracts are messy. Trading stock is really simple, because a stock is just a ticker of a known class; derivatives are a lot harder to model, because they're not always the same. So it's less pure data and more contract, and if you're going to handle that, you've got to figure out some better way to model it. The lack of good modeling of that is part of what caused the 2008 financial crisis, because no one knew what they had: what's my exposure if Lehman goes under? It's a stack of paper over there, and I can't ask a question against that paper, so I don't know, so I'm not willing to do any trade. The world's positions were not queryable to the extent they needed to be. And one of our use cases is that we have now made that queryable. Contracts are complicated, but they're very much document-centric, because they're contracts between people, and you can make advanced queries, analytical queries, against these
documents. So that's an example where alternative models come into play. But there could be other ones, like "I'm going to be doing photo management at massive scale" or something. And the questions about the data are not just data retention, but things like: what kind of questions will you ask, and how fast do you need the answer? I'll advise sales reps at MarkLogic to ask people how fast they need the answer. If they answer "tomorrow," then sometimes a Hadoop MapReduce job doing full batch processing can answer the question. Like when you go to LinkedIn and they say you might know these people: they didn't calculate that on the fly, they calculated it last night in a big batch job, and then they just look up the results. But suppose I go to LinkedIn and ask a deep question, say, within three degrees, people who worked at this company, with these kinds of attributes, who overlapped with me once before and are interested in this technology. The person asking expects an answer right away, and that's where you have to start doing a richer data thing. MarkLogic tends to focus on harder problems like that. How high an uptime do you need? Do you need rebalancing? Things like that, which is how MarkLogic differentiates from other document stores. I'm glad you mentioned that, because conventional wisdom is that NoSQL cannot be any of the things you just mentioned. So what is conventional wisdom? Yes, conventional wisdom is that NoSQL is open source, was always open source. And there's no mandatory reason why that's the case. If you look at a lot of NoSQL systems, they are often open source, but MarkLogic is closed source; we are a software company that licenses software. It's an alternative approach, but it can have a set of features good enough to justify someone actually paying for it. So there's nothing that says NoSQL has to be open source. It's just one of the assumptions people make.
It's like saying, well, an alternative data model has to have this licensing scheme. Of course not. Or saying you couldn't have enterprise-grade features, you couldn't have transactions, which is kind of what you're showing on the screen just now. In the relational world, we're used to a lot of enterprise-grade features, and with transactions everything can be atomic, consistent, isolated, durable: if you make a change, it happens or it doesn't; it won't break the rules of the data; you're not allowed to see it halfway done; and once it's done, it's actually done. Some NoSQL systems, by breaking from the relational model, also said, let's not do that either; let's not have transactions. And so they went with BASE, which is the opposite of acid in chemistry, so the acronym is of course meant to be cute: Basically Available, Soft state, Eventually consistent. You might read some old data for a while, but it'll eventually get there. Instead of writes being isolated, you can have a collision, and if there's a collision, "last write wins" is often the policy; or you let the programmer deal with it, instead of the database taking on the pain of dealing with it. We can talk about CAP after this, but in general, ACID pushes on consistency, while BASE favors availability and partition tolerance, handling the consistency question in a different way, as I'll explain. ACID gives you a robust database and simpler application code; BASE gives you a simpler database and demands more robust code, forcing that effort onto the programmer. So in NoSQL, most of the systems went more BASE, but there's nothing that says you have to; there's nothing that says you can't have transactions with this data model. What we thought as a company was: wouldn't it be cool if you could have a really robust data format, with indexes, including a lot of text-aware indexes, and
keep the enterprise great features like transactions and high availability and security and kind of a role based really rich security not just like everyone sees everything or see anything you know security kind of would be a combination so that's a build and you know if you look at the NoSQL market that everybody's private it's known to really talk about revenue but the numbers that are predicted is that logic has more revenue than all the others put together and I think that happens when you have the flexibility of NoSQL with the enterprise great features that people are used to with Oracle you get the best what you mentioned textual indexes what do you mean by that maintaining the hierarchy of the the information came in the data came in what do you mean by that because more logic is a database built like a search engine with search engine style indexes an easy built in indexes is text understanding and I see a lot of customers that focus on not just messy data but messy more content the human content those are things get to people the messier stuff gets right that's why we're looking also perfect for this pure world of numbers like some sort of finance report but then if you're saying let's do let's start doing some of the use cases let's do something that involves people and messiness still rely on a purely stream match to be a match that's less effective and all that you can do is is inner out of the set that's as good as if you can do relevance order data an example that I'm working on right now is on immigration where you're trying to match name and you know the old system name has to be perfect it's a perfect match but if it's at all not perfect it will be seen as different people and that's a problem when you're trying to do a match against the bad guy database so when you're matching against the bad guy database wouldn't a little bit of fuzz be helpful it's a little bit of relevance ordering be helpful so that I can say well here's an old list of people 
that this might be maybe you should check a little deeper because this looks like it really is this person this might be this person right well in this case these aren't good at fuzzy matching and relevance ordering or results good advantage I like to exercise imagine if Google was like historical right if you typed in anything wrong no results you kind of read your mind starting to trust that the system could be better at fuzzy matching and relevance ordering and Marsik tries to do that but in the database world so still understand human text and still understand the spelling and fuzziness and things like that but being a transactional database underneath at the same time I want to move through some of these bullet points of the rest of the NoSQL community against MarkLogic and you've touched upon most of these but let's go a little bit deeper on MarkLogic versus on RDBMast again touched on these I'm really going through these so when people get the the deck later they'll be able to go through and know these different points that you made are actually written down so they didn't have to take copious notes here the speed of light but back a little bit more about about this we can about scale out on commodity hardware versus scaling off what do you say about that and why is that important in the center today for a little database just to say the kind of older technology if you had a run bigger and bigger they tended to recommend that you buy a bigger box just to get really expensive when you're hitting the high end accidentally more expensive and it makes it very hard to just add a little bit more capacity and for a lot of most NoSQL databases have in common is the real thing that it would be better to add new nodes in all age just send more servers to be in capacity but they need a different way of indexing and so the approach of indexing and the approach of other noSQL senders tends to be more friendly to that kind of it like growing rather than old secretaries of 
relational databases. You can start on one server, then two, then five, then maybe eight, and as your load gets bigger — your data grows, your users grow — you add more hardware to handle the base load and the increased demand. You don't have to say: I need to buy a bigger box, then an even bigger box, and throw away the old one each time. Customers really prefer that, especially with the cloud becoming more common, which is well suited to it. You don't have to guess in advance how big you're going to get. Most of our customers can feel comfortable that we have customers bigger than them, so we scale bigger than they're ever likely to go. It's the safer bet: these guys can handle that level, so I can start small, know I'm safe if I need to grow, and not spend all my money on some big Oracle hardware to get there.

Talk a little bit about schema-agnostic. A lot of these systems say they're schema-agnostic, but do they really mean it? When we say we're schema-agnostic, do we mean you can go in and it's just a free-for-all? There's something in between, and it's a point of distress, because we say we can load data as is and you don't have to do all of that modeling — but if you're going to make some sense of your content and do the types of querying you want, there is some data manipulation going on. Maybe you can talk about the difference between data modeling and getting data ready — that prep work — and some of the tools that might streamline it.

Ah, now we get to the part where you ask the hard questions. Okay. What we mean by schema-agnostic is that you don't have to declare in advance what's going to come in, and we still index it and resolve queries quickly. Now, how messy will your data be? On one end there's Oxford University Press — you know, they do the Oxford English Dictionary and whatnot — and their content is beautiful. I told them I almost cried, it was so beautifully marked up: every time a person was ever mentioned, it was marked up with attributes and a canonical identifier, so that this John can be cleanly separated from that John. The range of applications you could build on top of their data was breathtaking, because they spent the time to create that wonderful markup — and they're happy because MarkLogic let them unleash the power of that markup for the first time, where other tools couldn't really leverage it. On the other end is the intelligence community, who are drowning in data, trying to make sense of a huge amount of messy input. If you're not at the Oxford level — and we work with both, and we help you if you start with messier data — then step one is: load it and see what we've got, without having to go through a bunch of pain first. Should we clean it up? Maybe we can normalize this part. Maybe places can be geotagged, maybe entities can be extracted — whatever it is. It depends on what questions you're trying to answer.

Are they using machines to clean it up, or people, or a combination of both? Are they using Hadoop for some of this processing?

Cleanup is typically done by machine when you're talking about intelligence, because the volume is too big for people — although in places a person can request a human translation of something. Say this document isn't in English: I can say, hey, why don't we have a professional translator do a better job than the automated translation, because it looks pretty interesting. And because it's a transactional database, I can mark data as "this needs to be translated" and MarkLogic can drive the queue of what gets translated next. At a MarkLogic conference, as we were talking about unstructured data, someone held up a piece of paper found in the field and said: you think you have unstructured data? This is what we start with. The idea was to digitize it, translate it from Arabic to English, entity-identify it, and try to get some actionable intelligence out of a mass of messy paper. MarkLogic is there every step of the way, because being document-centric, you can store the binary, store the text, add markup saying this is a person, this is a place, this is a phone number — whatever it is — and ask questions using the database. The database isn't the last place the data ends up; it can be the place it goes first. But messy data means it's a lot harder to query. That's why we have something I call "pocket lint": he had a number in his pocket, and we don't know what that number means. This is where the real world is not as friendly as we'd like. Should I put that in? I don't know exactly what that number means. You can just put it into the document — as pocket lint, in whatever element you want — and then you have indexes that let you say: find me any document, anywhere, that has this number anywhere in it. You're not constrained by where you have to look; you can look everywhere. With these kinds of indexes you can ask a lot of questions, which is more flexible than a relational system where you have to specify what column something is in — and I don't know what column it's in.

So I put up another slide looking at MarkLogic versus search engines, which has surprised people. We recently appeared in a Magic Quadrant for search, which I think surprised a lot of people, since our flagship product is a NoSQL database. But really our roots were about building a database that enables more granular search. Maybe you can go through quickly how we compare to most search engines out there.

It's like asking how a car compares to a car engine. MarkLogic includes within it all the
search capabilities, but it has a transactional store underneath to hold the data, an environment for executing code next to the data, and the other things listed on the slide. We're probably the largest independent search vendor left, though we don't tend to market ourselves that way — search is a key component in a lot of our deployments because of the flexibility it brings. That's what this slide is talking about.

What's the largest index we've ever built, in number of characters? Do you know? I'm curious.

I don't know the answer. I know some of the largest search queries, but not the biggest index, and that's because of how we solve the problem. Our index is: this fact is present in these documents; this fact is present in these documents; and so on. Some facts, like the word "the," are everywhere, and because of how the index is encoded, that's actually highly efficient, so it's still okay. Other facts are rarer. We do what a search engine does, but then we go beyond the words to the structure and the values and the relationships of everything, which is why it still acts like a database even though it's built like a search engine. And there are very large queries: we have people running queries with thousands, probably even millions, of constraints, and that's okay. Remember you and I in Texas talking with a partner about the largest query size anyone had seen — that's how we'd measure things: how big a query can you ask, and how quickly can you get the answer. But an individual index entry tends to be fairly small.

Let's talk about how the FAA is doing NoSQL. What are they doing, and what's the problem?

Well, I have a flight right after this, so I care. When something happens — bad weather, a runway closure, whatever it is — you need to make fast decisions, and make them against a wide variety of data. The slide shows weather data, internal FAA data, airline data, SharePoint document data, geospatial data, social media — all of it needs to be usable as the basis for these decisions. So you need a database that can handle all of that, and that can handle the new data feed that arrives next week without adding it being a major project: load it in one place, then ask deep questions against it. MarkLogic was chosen to do that. Weather data is stored as documents, internal data as documents, all in one place where the queries take place, so they can answer these complex questions. And the system needs to stay up — you can't say, sorry, we're down for a few hours right now, please have your crisis later. And, as with everything these days, they wanted to build it cheaply, so a system that could be stood up fast and well mattered: a proof of concept in two weeks, as it says on the slide, and purchase to beta in six months. That was critical. It's a good fit because the data doesn't relationalize easily, it's a mission-critical problem, and it had to be brought in quickly.

Right. We're running a little long on time, so I want to get to the Obamacare work, if you can talk about that real quick, and then we can segue into how they're also using it for anti-fraud, because I know a lot of insurance companies and banks are on the phone and they might be interested in that. Let's talk about the data exchange first.

Okay, and we'll get to the other one next. So CMS, the Centers for Medicare and Medicaid Services — their challenge was building the health insurance exchange across 50 states. Wouldn't it be nice if every state used the exact same formats and schemas?
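The situation Jason describes — the same concept arriving in different shapes from different sources — can be sketched with a toy schema-agnostic store. This is illustrative Python, not MarkLogic's API: documents are loaded as is, every leaf value is indexed regardless of where it lives, and you can ask "find any document containing this value" without declaring a schema first. The plan documents are invented for the example.

```python
# Hypothetical documents from two states using different schemas
# for the same concept -- the situation described above.
state_a_plan = {"planName": "Silver 2000", "premium": 310, "state": "CA"}
state_b_plan = {"plan": {"title": "Silver Choice", "monthly_cost": 295},
                "region": "TX"}

class DocStore:
    """Sketch of schema-agnostic storage: load as is, index every
    leaf value wherever it appears, no schema declared up front."""
    def __init__(self):
        self.docs = []
        self.index = {}                    # value -> set of doc ids

    def load(self, doc):
        doc_id = len(self.docs)
        self.docs.append(doc)
        for value in self._leaves(doc):
            self.index.setdefault(value, set()).add(doc_id)
        return doc_id

    def _leaves(self, node):
        # walk dicts and lists, yielding every leaf value
        if isinstance(node, dict):
            for v in node.values():
                yield from self._leaves(v)
        elif isinstance(node, list):
            for v in node:
                yield from self._leaves(v)
        else:
            yield node

    def find(self, value):
        # any document containing the value anywhere, whatever
        # element or "column" it happens to live in
        return [self.docs[i] for i in sorted(self.index.get(value, ()))]

store = DocStore()
store.load(state_a_plan)
store.load(state_b_plan)
print(store.find("Silver Choice"))   # found without knowing its path
```

A real document database maintains this kind of universal index incrementally and at scale; the sketch only shows why no up-front schema is required for the lookup to work.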
But it's another case where the real world isn't so beautiful — which really should be MarkLogic's marketing line, right? Designed for real-world data. They're able to take the data as is from all these insurance providers and all these different states and agencies, put it in one database, and drive the insurance exchange out of it in a cost-effective manner. The main goal is for people to be able to see what insurance plans are available to them, but I think it will extend beyond that into more medical-record kinds of data. And as you pointed out, there's an anti-fraud aspect too, because if you have all this data in one place, it's easier to recognize fraud. Fraud is a good example where there's a lot of data that can be brought to bear — the more data at your fingertips, the better, and the latest data is probably the most important. A lot of fraud detection is about whether what happened in the last couple of minutes indicates that this new thing is fraud. That's a little less true for healthcare and a little more true for something like credit-card fraud. So if you have a system that can handle a wide variety of data, give you real-time access to it, and do in-database analytics — we didn't talk about MarkLogic's analytics and analytical indexes, but they exist and they're useful for things like fraud — you can bring all these claims together. And especially, from what I'm told, by unifying across the states you get, for the first time, a global view of people's activities, so something that might not have looked like fraud in isolation will look like fraud in the bigger view. If this person seems to be doing a lot of this across different states, there's probably fraud going on.
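That cross-state unification can be sketched in a few lines. This is a hypothetical illustration — the claim records, person IDs, and thresholds are invented — showing how a pattern invisible within any single state's data becomes obvious once the claims sit in one place:

```python
from collections import defaultdict
from datetime import date

# Invented claims drawn from several states' systems.  Viewed one
# state at a time, each looks routine; unified, a pattern appears.
claims = [
    {"person": "P-1017", "state": "CA", "date": date(2013, 6, 1), "amount": 900},
    {"person": "P-1017", "state": "NV", "date": date(2013, 6, 2), "amount": 850},
    {"person": "P-1017", "state": "AZ", "date": date(2013, 6, 2), "amount": 920},
    {"person": "P-2044", "state": "CA", "date": date(2013, 6, 3), "amount": 120},
]

def flag_multistate(claims, max_states=1, window_days=7):
    """Flag people whose claims span more than max_states states
    within a short window -- a simple cross-state correlation."""
    by_person = defaultdict(list)
    for c in claims:
        by_person[c["person"]].append(c)
    flagged = []
    for person, cs in by_person.items():
        cs.sort(key=lambda c: c["date"])
        span_days = (cs[-1]["date"] - cs[0]["date"]).days
        states = {c["state"] for c in cs}
        if len(states) > max_states and span_days <= window_days:
            flagged.append(person)
    return flagged

print(flag_multistate(claims))   # ['P-1017']
```

The point is not the specific rule — real fraud analytics test many candidate correlations — but that the query only becomes possible once the data is unified.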
What underscores that is that a lot of fraud work — the forensic side of it — is really about hunches: trying to figure out which correlations will pan out, and how to test all those correlations. If bringing in a new data set is going to take you a couple of months, you run a real risk of wasting time and resources on something with no payoff. Whereas now you can get it in much more quickly, see the correlations or not, and move on. You're saving money.

So let's talk about the BBC, and the next phase of NoSQL, which is really about taking the document store — and that pretty markup we talked about before, picking out the triples within it — and combining it with a triple store. I'm looking at the clock, so you've got to talk fast.

Did I not already talk fast? Your question is what's coming next, right? The BBC is a good example of what's coming next. MarkLogic ran the BBC's London Olympics data platform — all the statistics, the highlights, the video, everything, all on MarkLogic. It's really one of the core demos you can give; it's one of those things I can show my wife, because backend systems aren't always as exciting as what's on your screen while the Olympics are on. They used MarkLogic for the flexible model, the real-time access, the high-speed requirements of everything. And if you throw a couple more slides forward — there are big numbers on this one. One thing they didn't use MarkLogic for — they used a triple store, which is another NoSQL database — was to understand the world: this player plays for this team, which plays in this league, which plays over here. And this person, David Beckham, is married to Victoria Beckham, who was Posh Spice, who was a Spice Girl, which connects to these movies. Once you understand all that, you can make your site better by showing the right things on each page. If you're interested in some sport or other, everything is understood: when things are happening, who's in which finals, and so on. I went there, and they showed me how they used MarkLogic for the data and this other triple store for the relationships between the data, and they said: why don't you do triples in MarkLogic? You guys should do it. They, and some others, convinced us that it was time for triples, RDF, and semantics to be commercialized — the thing that had been academic for a long time was becoming ready for prime time. So in the new MarkLogic 7 we have a triple store, with a custom specialized index designed to make triples really fast and able to scale to billions and billions of triples, so other people can do what we saw the BBC and others do. We're really excited about it, because it opens a new avenue. It feels like we were spelunking in a cave, squeezed through a little passage, and on the other side is this huge, unexplored cavern — because now you have real choices in data modeling. Sometimes I'll model data as documents, sometimes as small triples, but in the same database, where I can unify them. I can keep the metadata about the documents inside the documents or external to them. At a conference like this one, you'd have all the transcripts and videos as documents, but you'd also have: this speaker works at this company, this speaker used to work at that company — the two brought together.
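The kind of relationship data described here can be sketched as subject–predicate–object triples with wildcard pattern matching — a toy version of what a triple index does, with illustrative facts rather than the BBC's actual data model:

```python
# Facts as (subject, predicate, object) triples, in the spirit of
# the example above.  Names and relations are illustrative only.
triples = [
    ("David Beckham", "playedFor", "LA Galaxy"),
    ("LA Galaxy", "playsIn", "MLS"),
    ("David Beckham", "marriedTo", "Victoria Beckham"),
    ("Victoria Beckham", "memberOf", "Spice Girls"),
]

def match(pattern, store=triples):
    """Return triples matching (s, p, o); None matches anything."""
    return [t for t in store
            if all(q is None or q == v for q, v in zip(pattern, t))]

# Who is David Beckham married to?
print(match(("David Beckham", "marriedTo", None)))

# Follow a chain: which league does Beckham's team play in?
teams = [o for _, _, o in match(("David Beckham", "playedFor", None))]
leagues = [o for t in teams for _, _, o in match((t, "playsIn", None))]
print(leagues)   # ['MLS']
```

A real triple store answers such chained patterns through a specialized index rather than a linear scan, which is what makes billions of triples practical; the sketch only shows the query model.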
You have a real deep understanding of the world. That's one place MarkLogic went next: when you look at the NoSQL database landscape — with graph stores and document stores among the categories — we're going to be two of them, both a graph store and a document store.

I know we're at the top of the hour, and Shannon's probably chomping at the bit for us to stop talking. But go try MarkLogic for free — you can download it. Jason's got a great paper on the site as well, and we'll be sending that out to you in a couple of days. The question always comes up: will this be recorded? It is, and we'll be sending something out shortly. But really quickly — I mean really quickly — what does it take to get started? What kind of skill sets do you need? If you've got a group of DBAs, people who understand SQL and T-SQL, what do they need to do to start playing with it?

Generally, you want to be a programmer — a programmer who's open-minded enough to try something a different way. When I've taught classes, sometimes by the end of the week people are writing code that's really impressive to me, better than I would write. So it's not that hard to learn. But I do want someone who isn't stuck in their ways, saying: I've done it this way for 20 years; I want this to be a relational database. Because it's not a relational database — it's a different way of looking at things. Think of it as a giant file system in the sky, with billions and billions of files — documents — where the file contents are indexed, so you can ask questions not just by looking things up by name but by looking them up by their contents. What kind of system could you build fast if you had that on a highly reliable, enterprise-hardened platform? You can build a lot of great stuff. But it is a different way of looking at it.
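That "giant file system in the sky" idea — files findable by their contents, not just their names — rests on an inverted index mapping each term to the documents containing it, the "this fact is present in these documents" structure mentioned earlier. A minimal sketch, with invented file names and text, and not MarkLogic's actual index format:

```python
from collections import defaultdict

class ContentIndexedStore:
    """Documents looked up by what's inside them, not by name."""
    def __init__(self):
        self.docs = {}
        self.postings = defaultdict(set)   # word -> names of docs with it

    def put(self, name, text):
        self.docs[name] = text
        for word in text.lower().split():
            self.postings[word].add(name)

    def search(self, query):
        """Documents containing every word in the query (AND)."""
        sets = [self.postings[w] for w in query.lower().split()]
        return sorted(set.intersection(*sets)) if sets else []

store = ContentIndexedStore()
store.put("report-1.txt", "phone number found in pocket")
store.put("report-2.txt", "phone call intercepted at noon")
print(store.search("phone pocket"))   # ['report-1.txt']
print(store.search("phone"))          # ['report-1.txt', 'report-2.txt']
```

Production systems add stemming, positions, compression of the posting lists, and many other index types on top of this basic shape, but the lookup-by-content principle is the same.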
Can you leverage your SQL programmers — can they use a SQL interface into MarkLogic, and then have application developers on top of that make it all come together?

No, not that way. MarkLogic's SQL interface is designed for integration with BI tools more than as a primary means of access for a programmer — that's not a good way to go about it. You should use the REST APIs, the internal programming APIs, things like that, more than the SQL. The SQL, by the way, is read-only. Because SQL is by nature structured, we use it to synthetically create a structured view on top of unstructured data. So it's okay for some things, like metadata fields, but it can't express the richness of what MarkLogic can do, because that query language was never designed to express it.

To tie it together here: how about one takeaway, one takeaway that everybody should have from this?

Probably the same moment I had when I first saw it: you know what, there's another way to do it. That's the excitement of NoSQL — there are alternative ways of doing this. The one-size-fits-all era is over; there's a famous paper you can look up making the case that it's not true that everything should be relational and B-tree-driven. Look to understand your problem, and then understand what tools are out there. There might be a tool far better designed for your problem than a relational database. MarkLogic is one, but there are a lot of others. It's a fun time to be involved in databases — these things were boring 10 years ago; now it's kind of fun.

I'm going to let that be the final word. Shannon, thank you so much for letting us converse here.

Thank you, this has been great. And thank you to the attendees for staying and being so interactive — some great questions from the audience. And again, thank you both; I love the format you had with the conversation.
But I'm afraid that's all we have time for today. Just to remind everyone, we will be posting the recording of the webinar and the slides to dataversity.net within two business days, and I will send out a follow-up email with the links and other requested information. A big thanks to MarkLogic for sponsoring today's webinar. You can see MarkLogic and learn more about their product at our NoSQL Now conference in San Jose, August 20th through 22nd. Thank you again for attending today's webinar, and I hope everyone has a great day. Jason and Diane, again, thank you so much. This was great. Thank you. Bye-bye.