Hello and welcome. My name is Shannon Kemp and I am the executive editor of DataVersity. We would like to thank you for joining this DataVersity webinar, "Building on Multi-Model Databases," sponsored today by MarkLogic. Just a couple of points to get us started. Due to the large number of people attending these sessions, you will be muted during the webinar. For questions, we'll be collecting them through the Q&A in the bottom right-hand corner of your screen. Or if you'd like to tweet, we encourage you to share highlights or questions on Twitter using hashtag DataVersity. As always, we will send a follow-up email within two business days containing links to the slides, the recording of the session, and additional information requested throughout the webinar. Now, let me introduce our speaker for today, John Biedebach. John joined MarkLogic in 2015 and has 25-plus years of experience in BI and data warehousing, including 10 years at Oracle. He works in the field organization helping customers and prospects understand strategies and applications of multi-model databases. He has specific industry expertise in manufacturing, energy, and healthcare, but often deals with organizations outside these verticals. He lives with his family in Dallas, Texas, and in his spare time works as a volunteer firefighter and paramedic. John, it's a pleasure to have you with us.

Great, hello and welcome. I appreciate the invite. I've included my contact information here, and I'll show it again at the end of the presentation. So if anyone wants to follow up with me directly, that's fine, or you can always get to us at marklogic.com, and I'm sure somebody can help you out there, too. But if you want a friendly face, I'm happy to answer any questions. Looking at the agenda, we're going to take kind of a broad view today, so we're not going to drill down too deep into any one of the technologies. Certainly if you want follow-up sessions and that type of thing, those can be arranged. But this is going to be more of a general overview, looking at how to get disparate data into a single database, and then looking at the implications of that. And this is based on a book by a guy named Pete Aven called Building on Multi-Model Databases, and it's free from O'Reilly. So you can go out to their site and download it, or you can get it from the MarkLogic site. If you want more specific information, feel free to go grab that book.

So if we were doing this live, I would ask for a show of hands and say, when do you think the first database was invented? Typically the answers I get are sometime in the 1940s or 50s, and people have this image of Admiral Grace Hopper taking out her broom and physically debugging the computer by sweeping the bugs away. But actually the first database was in 1890, long before we had computers, built by a guy named Herman Hollerith. He worked for the Census Bureau, and he used punch card technology. Punch cards had been around for a number of years in things like player pianos and looms, but Hollerith was the first one who invented a machine that could read cards, interpret the data, and keep running tallies. He actually finished the 1890 census much faster than anybody could have done it by hand. And we really took that technology and didn't change it a whole lot until we got to the IBM punch card. It's still essentially the same technology.
All we did was automate the reading of that information. And if you think about what kind of database this is, this is literally, not in the way that people use the word literally today when they really mean figuratively, but this is literally a document database. Each one of these records was contained on a physical document. And if you look at the advancements we made over the next 40 years beyond the invention of the IBM punch card, we really just sped up the reading of it and had larger storage arrays and those types of things. It wasn't until the early 70s that a guy named E.F. Codd thought, maybe there's a better way. One of the problems with mainframe databases is that the database and the application were co-resident, so you couldn't leverage that data outside of that mainframe database. Those mainframe databases were contained in proprietary structures like VSAM and ISAM, and they had COBOL programs that sat right on top of them, and you really couldn't leverage it. So Codd said, if we could take the data and establish the relationships up front, and this is the old Scott/Tiger schema that you might have learned to work with, then we could create a structured query language to go in, interpret the data, and construct queries against it. And this is how most of us learned in the last 30 or 40 years of database technology.

There's a challenge with it, though. I've been told that this audience is largely data modelers, and this is probably what you all do for a living. The problem is that it's inflexible. You look at the data, and maybe in step one you don't really know what all the fields are. Maybe this was a dataset that was passed to you, or you collected it from a number of different disparate datasets, but you haven't had a chance to fully understand the implications of all the data. So you design the model as best you can and you perform ETL. But you have to make a number of compromises when you start to join disparate data systems, and we're going to look specifically at that. If you have two files that are similar but not the same, the challenge is that you've got to decide what information to include, and if you don't include some information, what to do with the leftovers. You create indexes, and that's an evolutionary process, and then you build the application. And then the business requirements change. We call this the cost of asking the second question, because inevitably what happens is that the minute they get the first answer, they have another question, and the answer to that question may or may not be incorporated into your model.

If we look at just a very simple transaction, like going to the grocery store, the information contained on this receipt probably fits in four tables, maybe more. You've got a transaction header, and that's going to have all of the header information about the transaction: date, store location, that type of thing. And you may have tables hung off of this, because you probably want to store the store information only once and refer to it by store number through a primary key/foreign key join. Then you've got the transaction detail, and even then you're probably going to join that out to a product table. So if you want to reconstruct this transaction, it's not really that big a deal yet.
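To make that concrete, here's roughly the kind of reconstruction query that implies; the table and column names are just made up for illustration, not from any particular schema:

    -- reassemble one receipt from the four normalized tables
    SELECT h.txn_id, h.txn_date, s.store_name,
           p.product_name, d.quantity, d.price
    FROM   txn_header h
    JOIN   store      s ON s.store_id   = h.store_id
    JOIN   txn_detail d ON d.txn_id     = h.txn_id
    JOIN   product    p ON p.product_id = d.product_id
    WHERE  h.txn_id = 1001;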
You go through a couple of joins and you've got it. But for the uninitiated this can be a challenge, because if they don't understand the schema and they don't understand how all the data is joined together, they can get a Cartesian product. They can get a lot of different kinds of bad joins, and that's a frequent problem with uninitiated business users trying to work with the data. So one of the things the industry has done is go back to the old standard. Remember the mainframe, remember the punch card: go back to the document, so that we're storing the transaction all together. And yes, there is a trade-off here too, because one of the things that Codd tried to do when he created the relational database was reduce redundancy, save space, save compute, those types of things. In the 1970s and 80s, that was a very important thing to do because compute and storage were very expensive. But now, if you look at this USB thumb drive I have with 128 gigs of storage that I bought for $15, storage is very cheap, compute is very cheap, and going to the cloud is only going to accelerate that. So we can afford to be a little more liberal now with the way we store data.

So if you look at a document, this happens to be an XML document, I can store all of that data in an XML document. I've got my header section and I've got my detail section. And the nice thing about this is that I can have a variable-length record. Every record can look fundamentally different from the record before it, which is a limitation in relational databases, because there every record has to look essentially the same. This is especially important when we start talking about disparate data, data that's similar but not the same. So in this particular case, I've got two address files here, and we'll take a look at them real quick. This one has a city, state, and ZIP, a Social Security number, and a phone number. If I come down here, again, they're similar but they're not the same. Here I'm referring to it as F name and L name; here I'm referring to it as given name and family name, and that's okay, we can deal with that. But now I've got two address lines here. If I'm going to have a joined address table, am I going to have two address records for this customer or just one? Which one am I going to pick? Am I going to populate this address twice? All of a sudden you can see we've got to start making compromises. You are all skilled data modelers, so this is not going to be intimidating for you at all. But if the business comes back and says, how do I know which one is the billing address, and you've only collected one address, then you're kind of forced to use it for both.

So we begin to make those compromises, and I describe it this way: I've been married for a long time, and what if I had to have every fight I was ever going to have throughout the entire length of my marriage in the first year of that marriage? It probably wouldn't last very long. And that's a lot of times what we get into in the data modeling exercise. We have religious arguments, very passionate arguments, about things that may not take effect until years into the future, where we can't fully understand the implications. But we have to make all the decisions up front, because changing those schemas over time is a very expensive thing to do, because then you have dependencies and all sorts of stuff.
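Going back to that grocery receipt for a second, here's a rough sketch of what the same transaction can look like stored as one document instead of four joined tables; the element names are invented for illustration:

    <receipt>
      <header>
        <store-number>4471</store-number>
        <date>2017-06-12</date>
      </header>
      <details>
        <item><sku>00412</sku><description>Milk, 1 gal</description><price>3.49</price></item>
        <item><sku>00981</sku><description>Bread</description><price>2.29</price></item>
      </details>
      <total>5.78</total>
    </receipt>

Everything you need to reconstruct the transaction travels together in the header and detail sections of that one record.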
So I took some screenshots of a little demo application just to show you the document model and the flexibility it gives you over relational. I imported some data, and it's very similar to this: I've got two different name-and-address structures. As I go over here, you can see that I have all my names in here, so Amy Adams and Sean Adams and Beverly Adams. And if I look at one of these records, we'll look at Amy's record, you can see that we're storing this as first_name and last_name, with Amy and Adams. I've got a ZIP in here, but I don't have city or state. That may be hard to go back and reconstruct, but we'll worry about that later. I've got some redaction in here, and that's because a lot of times with document databases, just by keying in on an element, I can go through and say mask this element, or even redact it so that you don't see it there at all. And this particular record can be stored as XML or JSON; those are the two most prevalent document types, and there are a lot of technologies that can do the same type of thing. So I've got first_name and last_name, and I've got all the data here. Then if we look at another record, we'll look at Julie Adams, you can see, again, it's similar, but it's not the same. In this record we've got first_name and last_name, and over here we've got FName and LName. Again, that's not really a problem, because we can deal with that. But we have other data now: we have a birthday in this one, and we had a Social Security number in the other one. If I were to construct a relational diagram, a relational schema, I would have to have a column in there for birthday if I wanted to maintain that data, and for the half of the records that don't have it, that column would just be null. So I'd have a bunch of nulls to deal with, and those are expensive as well. That data sparsity can create certain problems.

With the document database, though, because we have the flexibility of the document, we can ingest all of the data as is, because you remember the document is self-describing. If I've got a first name field there, the computer doesn't know that it's a first name, but it knows that I've called it FName or first_name or whatnot. So I can have any number of fields; I can have any number of records that all have a different layout. And then if I apply some search technology on top of that, what I can do, without even understanding the schema, is go out and look for any record in the database that has the word Adams or its root, because we can use stemming technology to look for the root. So anything that is even close to Adams: we're going to grab Amy Adams and Sean Adams, and it's also going to grab Adam Bowman. Now, I don't necessarily want Adam Bowman, so I might put quotes around this or use some other Boolean operator to exclude that, just like you would in Google. So I've got all my documents loaded and I'm using search technology to go over them. But this isn't going to get me very far on its own, so I have to begin to harmonize this data. Now, you notice I said harmonize, not normalize. The difference between the two is that harmonization makes subtle changes to the records over time, in the places where I need them. I don't have to conform these records and make them all look exactly the same right now. What I can do is just begin to make the changes.
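Just to picture those two "similar but not the same" records side by side, here's a hypothetical pair of JSON documents along the lines of the ones in the demo; the property names are illustrative only:

    { "first_name": "Amy", "last_name": "Adams",
      "ssn": "###-##-####", "zip": "75201" }

    { "FName": "Julie", "LName": "Adams",
      "birthday": "1974-03-02",
      "address1": "500 Main St", "address2": "Apt 12" }

Both live happily in the same database with no shared schema, and harmonization is just about closing that gap one field at a time.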
So I'm going to make a few changes to the name field. As I come over here, you'll see again: I'm going to ingest data as is, access the data, and now I'm going to harmonize and enrich it. If I go to my record, you can see I've got first_name and last_name. What I'm going to do is change the structure of this record using something we call the envelope pattern. And the envelope pattern is just like it sounds: imagine a big FedEx envelope. I take this document and I put it inside the FedEx envelope, and I can put a number of things in there and contain them all within the envelope. And because this is XML, or because it's JSON, it's all referential, and I can go back and find it again by using that element name or that property name. So I run a little script, and we can get together another time and I can show you the code on exactly how I do this; for now we'll just assume it's demo magic. As I run my little demo magic, I create that envelope. So I've got this envelope section here, and I've got the header section underneath that, and that's where I'm going to put my data lineage. This is important because I am capturing a couple of things in this envelope. I'm capturing the original source data; that's down there under the content root, and that's always going to be with me. When I begin to harmonize this, I never lose that original source data, so if I do need to go back, make changes to it, and grab something, I can. And up in the header section, now I've got my data lineage: I know who loaded it, I know what process loaded it, I know when it was loaded. So if there's a problem, like I load hundreds or thousands of records from tens of sources and one of my sources is bad, I can go back, grab that, and back it out, because I have that traceability in the document. And I'm going to be able to get this information out in any form that I need it. If I want to put it in a table and look at it through Tableau or something like that, I can do that. If I want to make a REST call to get to this data, if I want to write an API to get to this data, if I even want to write a SQL statement to get to this data, I can do that.

So I've got the headers section and the content section. Now what I'm going to do is enrich this data: I'm going to grab those first names and standardize on the construction first_name. Amy, in her original record, already had first_name and last_name, so that's not an issue. But you remember our other friend Julie didn't have that. So now, in Julie's header section, I've got first_name and last_name, and I can begin to write queries where all the records are the same for those two fields. Right now I'm not really concerned about email address; I'm not really concerned about city, state, ZIP, although we're going to get to that in just a second. So this is the first type. If we're talking about multi-model, what that means is multiple data models in the same record. This is the first data model type that I'm going to have: I've got XML or JSON, and that's going to be the structure of my document, that envelope that I'm going to shove other things into. But I can use additional data model types in this document.
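Here's a minimal sketch of what that envelope can look like for Julie's record; the element names and lineage fields are hypothetical, just to show the shape:

    <envelope>
      <headers>
        <lineage>
          <source>crm_extract_07.csv</source>
          <loaded-by>jbiedebach</loaded-by>
          <loaded-on>2017-06-12T09:14:00</loaded-on>
        </lineage>
        <!-- harmonized, canonical fields added over time -->
        <first_name>Julie</first_name>
        <last_name>Adams</last_name>
      </headers>
      <content>
        <!-- original source record, kept exactly as ingested -->
        <FName>Julie</FName>
        <LName>Adams</LName>
        <birthday>1974-03-02</birthday>
      </content>
    </envelope>

Later enrichment steps can drop additional sections into that same envelope, say a geospatial point derived from the ZIP code, without ever touching the original content.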
So it gives me a lot of flexibility, especially with all the different data sources that we have now. I might want to include some GIS information. So I'm going to do some master data management, and because I have a ZIP code in this record, I'm going to go out to a third-party source and grab some geo data, some geospatial data. Again, this is another data type. With this data type I've got lat/long, and I've got all of the specific geo functions in there to be able to do things like chart it on a map. And now I'm passing the database this geospatial polygon: any record that exists within this polygon, return that record. You can see that out of the 2,000 records in this database, I'm limiting it to the 124 that exist within this polygon, not by constructing an XML or JSON query, but by constructing a GIS query that passes geospatial information, lat/long information, back to the database, which I have in my enriched record. So now not only do I have an XML or JSON document, I've also got geospatial information contained in my envelope. Multiple different types.

Now, ultimately I'm going to have to join that data, because I can't just put everything in one record; that would be kind of crazy. In fact, we call this an entity. So if the customer is my entity, and entities can be anything, entities can be transactions, they can be a residence, a property record, a sales record, anything, I may want to maintain two, three, four different types of entities in my database and then have a way to join them together. So I need a technology to join those. Now, I could do a primary key/foreign key or a key-value approach; I could do that. But I actually have a different way that I'm going to use, and that is triples. RDF stands for Resource Description Framework, and if any of this stuff is new to you, you can either grab the multi-model book or look it up on the Internet. But basically, triples are a way to describe joins. I've got my subject, I've got my predicate, and I've got my object. So just like we would construct a sentence in the English language, I'm going to be able to tell the computer that John is a member of the pre-sales organization. And I messed this slide up, I apologize; it would be a lot more compelling if it said what it should say, which is that pre-sales is a sub-organization of the field organization. If I have those two facts, that John is a member of pre-sales and pre-sales is a sub-organization of the field organization, then I can infer, and the computer will actually do this for me, it will make the inference, that John is also a member of the field organization. I've got another example here talking about me and my boss, Derek. John is a member of pre-sales. Derek is also a member of pre-sales, but Derek is also a manager of pre-sales. The nice thing about these joins is, if you go back and look at our relational join structure, it tells me whether it's a one-to-one or a one-to-many or a many-to-many, and that tells me a little bit, but it doesn't give me any kind of quality about the join. I don't know how locations is related to the department. But by using triples, I get that qualitative information. So not only is Derek a member of pre-sales, he's also a manager of pre-sales. He has a dual role there.
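Written out as triples, those facts look roughly like this; this is Turtle notation with invented URIs, and the inference is just stated in a comment:

    @prefix ex: <http://example.org/org#> .

    ex:John     ex:memberOf  ex:PreSales .
    ex:PreSales ex:subOrgOf  ex:FieldOrganization .
    ex:Derek    ex:memberOf  ex:PreSales .
    ex:Derek    ex:managerOf ex:PreSales .

    # with a rule that memberOf propagates up subOrgOf, the engine can infer:
    #   ex:John ex:memberOf ex:FieldOrganization .

And here's a sketch of the kind of query you can run over them, in SPARQL, the standard query language for RDF:

    PREFIX ex: <http://example.org/org#>
    SELECT ?person
    WHERE {
      ?person ex:memberOf  ex:PreSales .
      ?person ex:managerOf ex:PreSales .
    }
    # returns ex:Derek but not ex:John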
And now I can go look, and if I want to grab all of the employees in pre-sales who are also managers, I'm only going to grab Derek; I'm not going to grab John. And I can do that in as many different ways as I want and visualize it however I want. So now, within my envelope, I have my original data and I've got my triples section. Jacqueline Gibson is related to ABC Manufacturing. And as I visualize that, the computer is going to figure out that Jacqueline Gibson is related to ABC Manufacturing, and so is Amy Torres, and so is Bruce Gordon. Now, using a query language called SPARQL, I can actually run queries that will show me what all the relationships are, with all the inferences. So now, inside of my envelope, I've got documents, I've got geospatial, and I also have triples. And what that allows me to do is begin combining these in really interesting ways.

This next chart is going to look really crazy, but I'm going to take some time and explain it to you. Say I have a really complex problem; this is a law enforcement problem. I've got this character here, and I know by the icon that this is Dobby, and Dobby is a person. The triples probably told me that Dobby is a person, and he's related to all of these different things. He's related to all these phone calls; well, I can have recordings of the phone calls that I can store in the database. I can have tweets that I can store in the database. I can have documents, so maybe this is a document describing an encounter, written in XML, maybe, or text or PDF. He can also be connected to objects: he's connected to a phone, he's connected to other people. So I can begin to construct what's going on here. I can link all the documents and link all the geospatial and link all of the other records together using triples and create something that is pretty special.

So that's what we mean by multi-model. We've got documents, XML or JSON. We've got geospatial. We've got RDF triples. And then I can go grab all sorts of other data: binary files like videos and photos, PDF documents, Word documents, social media. Now, a lot of the data that we see is actually very structured. When people start thinking about document databases, they typically start thinking about unstructured data because of the flexibility of XML, and that's not entirely the way we use it. We use it in a lot of very structured ways. So you might look at some of these documents and they look exactly like a relational record and they all look the same. What we're doing is using the scalability and some of the flexibility to capture the structured records. I can maintain it in whatever form I need to, dump all that information into the database, and then use search and the different APIs to get it out.

And just to do a little time check, we're a little bit over halfway through the presentation. So we've talked about how to get unified views across disparate data models and formats within a single database. If I were looking at this from a relational perspective, I would have to spend a lot of time up front, sometimes weeks or months or even years, designing a data model that could accommodate every different type of format.
And just like with me and my dear sweet wife, we would have to have every fight of our entire marriage in that first design phase, because if we didn't, and we didn't get those things locked down, they would come back to bite us later. But with the flexibility of the document database, I can literally dump everything in there without regard to what it is or what it's used for, and then begin to refine it over time, because I'm maintaining that original content. I'm never changing the original content. And then I have these different sections, so if I find out later that I made a mistake, I just go back to my document section, maybe wipe that whole section of the document out, and reconstruct it a different way. The record itself is intact, the original data is intact, all the other components are intact, and it's very, very easy to change.

Then there are the benefits of a single-product versus multi-product, multi-model approach to data integration. I don't have any slides for that, but you can begin to understand that if I have a document technology, and I have a triple store, and I have a search technology, these things get very complicated when you have to manage all the relationships across them. MarkLogic happens to be a product that will do all of those things under the covers. But even if you take different pieces and parts, you can get some of the benefits. In constructing this, what we are trying to promote are products that can do all of that under one umbrella. The nice thing about that is you've got a single security model and a single distribution model, so as you start talking about how to scale this out, it's all on a single server.

And now the importance of agility in data access and delivery through APIs, interfaces, and indexes, because this is the quintessential problem. What happens is the business goes to IT and says, we want data. And IT says, great, we have data. What data do you want? The business says, I don't know, what do you got? And they play this kind of who's-on-first routine back and forth and don't get a whole lot done. The business puts something together, IT puts something together, and the business is never happy with it. What we really need is a system where we can take anything that we have, if we have geo data, if we have triples, if we have XML, dump it in there, and then whoever needs to get at it can get at it. And this is really about the proliferation of web-based and cloud-based applications. It used to be, if we were talking about EAI, we might take records from one database and ship them over; we might even convert them into XML. So if we think about SOAP, that's the Simple Object Access Protocol, that's XML over HTTP. Well, now we have REST, which here is essentially JSON over HTTP. So we can use the REST API, and we can use programmatic APIs. But we can also, if we want to flatten this data out, flatten it and run SQL against it. We can run SPARQL against it for the triples. And we can go after it with the native JavaScript and XQuery languages. REST is where a lot of people default, because they don't want to switch to any of these; they want to be able to use R or Python or some other data access methodology, or MapReduce; we also have a MapReduce connector to go get that data. So if I'm coming at this with .NET, I can just use the REST API to get that.
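Just to show the shape of that, here's a rough sketch of a REST search call, JSON over HTTP; the endpoint, parameters, and response are illustrative of the pattern rather than any specific product's API:

    GET /v1/search?q=Adams&format=json HTTP/1.1
    Host: data-hub.example.com
    Accept: application/json

    HTTP/1.1 200 OK
    Content-Type: application/json

    { "total": 3,
      "results": [
        { "uri": "/customers/amy-adams.json",   "score": 2048 },
        { "uri": "/customers/julie-adams.json", "score": 1792 },
        { "uri": "/customers/adam-bowman.json", "score": 1024 } ] }

Any client that can speak HTTP, whether that's .NET, Python, R, or a browser, can consume that.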
And that's just going to go into transactional apps, analytical apps, downstream apps. And again, by having all of those located in one server, I can begin to scale that out, and I don't have to worry about my slowest-query problem. The slowest-query problem is that when I have multiple technologies, my whole query only performs as well as the slowest one. By having it in a single server, everything has the same performance layer.

Flexible deployment. And this is really, really important now, especially since a lot of your data, I'm sure, is going up into the cloud. Maybe not all of the important stuff, maybe just test and dev today. But the nice thing is, if you have the right setup, you can write it once and then move it around. I can move these documents, whole forests of documents or directories of documents, to the cloud. I can have a hybrid model where some data is on physical premises and some data is in the cloud. I can have things like tiered storage, so I might move all of my really old records up into the cloud on S3 storage and keep the really new records in a physical environment on premises. I worked with a large health insurance company helping to redesign their claims system. What's funny about that is that when they receive a claim, it comes in an EDI format called X12 837i or 837p, and it is a hierarchy. They were taking that hierarchy and splitting it out into 70 relational tables. What we did is take that hierarchy, put it into a single document, and then, because 90-plus percent of the claims that they query are claims that are less than a week old, we can have the latest and greatest claims queried on very fast storage on the physical side, and over time we can ship all the old claims off to very cheap storage, whether that's HDFS or S3 or whatever it is.

Now, this is something that people don't often talk about: document databases and ACID transactions, because a lot of document databases didn't start off with respecting transactions in mind. But you will find some out there, and MarkLogic happens to be one of them. The way this really happens is multi-version concurrency control. When we receive an update to a document, we don't ever delete or update that document in place; what we do is expire it. By expiring that document, we can keep n copies of it. Originally, when the document gets put into the database, it's got an expiration date of infinity. And when I was a kid, I think we all thought we could count to infinity, but we all know now we can't. So that record will never expire until I receive an update. And when I receive an update, I will wholesale expire that document and rewrite the new document. We can actually do this at the sub-document level, though it gets a little trickier there, so a lot of times we'll end up rewriting the whole document. And then we can keep n copies of this. Maybe that number is zero, and we don't keep any copies at all, and it's just highly fluid, because we're not worried about the statefulness of the record and I just want the latest and greatest. But one of the things we can also do by keeping n copies is something called bitemporality. You can think about this in terms of Ralph Kimball and slowly changing dimensions and things like that; it's very similar. So this might be, we'll call this, copy one.
And then I might have copy one prime, which is the old version. And I could actually do a bitemporal query, which looks at both versions of the document and tells me all the differences between the two, so I can see how the data has changed over time. This becomes very, very important. It also enables organizations to do transactional apps, and we have a number of high-performance transactional apps that rely on that multi-version concurrency control in order to do that. So you can see we're starting to build a list of requirements: we have a flexible architecture, we've got ACID transactions, and then it's got to be secure.

Now, one of the things you always hear, for instance, is that Macs are a lot less susceptible to hacking attacks and viruses and malware, in part because Windows is by far and away the most popular PC platform out there, with all the Windows browsers and whatnot. It's similar with document and multi-model databases versus relational databases. If I'm going to do an attack, I'm probably going to do something like a SQL injection attack. Well, SQL injection attacks don't really work against document databases, because the SQL API is not exposed at that level; you can't just pass SQL to a document database and get records back. And just by the nature of a multi-model database, the structure of the data is not consistent from one record to the next, so that secures it a little bit; it's much harder for people to find their way around. But then we have things like role-based access control at the sub-document level. So if I have multiple sections of a document, I can identify, by element or by value, who I want to see that. When you do a query and I do a query, we could get two different result sets, based on the fact that you can see more fields than I can. When I run that query, it won't even return those fields; I won't even know that they're in the dataset. And every query is run against the security index, so I'm not even returning information that people aren't allowed to see. And because it's a document database, and because in this particular case there's embedded search technology, every value in the database, every word, every character, every phrase, is indexed, so that the search goes very, very fast.

So we've got confidentiality and we've got ACID compliance. As we begin to create requirements for our multi-model database, it's got to have all of the capabilities that a very mature database technology like Oracle or SQL Server also has. It's got the scalability and elasticity. It's got the flexible deployment, the ACID transactions, high availability and disaster recovery, tiered storage, bitemporal, and we talked about security. Because if you're going to start constructing multi-model databases, and especially if you're going to start constructing them out of multiple pieces and parts, then you are always going to be subject to the most vulnerable technology and the slowest technology. And that's why having all this in one shell is a really important thing if you're going to build applications on top of it.
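Circling back to the bitemporal point for a second, here's a minimal sketch of what copy one and copy one prime can look like as two retained versions of the same record; the system-time property names are hypothetical:

    { "customer": { "first_name": "Julie", "zip": "75201" },
      "systemStart": "2016-01-04T00:00:00", "systemEnd": "2017-06-12T09:14:00" }

    { "customer": { "first_name": "Julie", "zip": "78704" },
      "systemStart": "2017-06-12T09:14:00", "systemEnd": "9999-12-31T23:59:59" }

A bitemporal query across both shows exactly when the ZIP code changed, and a full bitemporal setup adds a second, valid-time axis alongside the system time shown here.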
But there are lots of different technologies out there, and they are always evolving, so it is absolutely a requirement, too, that beyond this we play nice with other technologies. We have to play nice with relational, because there's no way we're going to come out and wholesale replace relational, or replace some of these other NoSQL technologies that are out there. So we looked at how to get unified views across disparate data models and formats within a single database, and we do that using the envelope pattern. We talked about the benefits of a single-product versus multi-product, multi-model approach to data integration, and there are some single-product solutions out there that do things very, very well. For instance, if you don't have a multi-model problem, then you may not need a multi-model solution. There are graph databases out there that do things like shortest-path queries, and they do them much better than any of the other RDF technologies, because they're single-point solutions. There are columnar databases and Hadoop databases and lots of other kinds of NoSQL databases that do individual things very, very well, maybe even better than some aspects of these multi-model databases. But if you're really looking for something that can do all of those things, or you don't know what you're going to need, that's when you need the single product that has all of the different approaches. So whether you need a point solution or more of a platform solution, you need to make that decision. Then the importance of agility in data access and delivery through APIs: you've got to be able to get anything in, get anything out, and send it anywhere else. It's got to be able to scale, go into the cloud, and provide ACID capabilities and security.

A lot of these may be no-duh moments for you, but the big question, now that you've been edumacated on multi-model databases, is: does it fit for you? Is it right for you? A lot of the fight that you're going to have over a multi-model database is really just political. If you have a project that is a great fit for relational, maybe it's coming out of a relational system, or maybe it's coming out of a CSV, and it's not that complex, the joins aren't that complex, and you've got good conformity, you're going to get a lot more institutional support for relational than you ever will for any of these emerging technologies. So things that are a great fit for relational are often a poor fit for multi-model, not because multi-model can't do it, but because of the institutional momentum. I turn 50 this year, and I have 25, 30 years of experience with Oracle and relational databases, going back to some of the really, really old relational stuff. If it's a great fit for relational, it just is, and you're not going to be able to change that. Heavy reliance on traditional BI tools is another one. One of the things that none of the NoSQL tools really have today is a great, robust set of BI tools that can sit on top of them and really understand and resolve a lot of the strange, unconventional data schemas and data systems that live in multi-model databases. I think those will evolve. But if you just want to dump some data into Access or Oracle and put MicroStrategy or Tableau on top of it and rock and roll, then that's probably a good use for that.
And ultimately it comes down to lack of institutional willingness. What are good fits? If you can convince your organization to do it: anytime you have disparate data, and this can all be structured data, right? You might have 100% structured data; you're not bringing in tweets or videos or anything like that, you're just bringing in disparate data. Hopefully you've been able to see that delaying those religious arguments, delaying those usage arguments until down the road, when you can really understand the ramifications, is so much healthier and so much faster. This is not to say that we're going to do away with data modelers. We are not going to do away with data modelers. What we're going to do is make data modelers very, very strategic in the overall operation, because long term they're going to be the ones who guide and direct the direction of the application, because they understand how it all fits together. A lot of times in a data modeling project, all your work is up front: you do six, eight, ten months up front, and then once the application is launched, you move on to another project. Whereas with this, we're going to have the need for that evolution over time. Anywhere you have a changing schema or multiple data types, if it's hard to model, it fits great in a document database, a multi-model database, because you construct all the documents however the data fits, and then you can have all of those documents linked together using triples. You can embed GIS data, and it's really straightforward. The same if you have multiple targets. And master data management is something that is really evolving with multi-model databases, because you have the flexibility to keep that source data, and then, as you're doing all your lookups and all of your different master data changes, you can just change that record and keep all the data in sync.

So hopefully what this has done is bust some myths, because I think there are a lot of people out there who will tell you that multi-model, and what they're really talking about is a lot of the open source technology, is not secure; some of them don't even have security layers on them. What I'll tell you is that there are some multi-model databases out there that are very secure, MarkLogic among them. Multi-model is not good for transactions? Well, if you have ACID compliance, then it's great for transactions; as long as you have that ACID compliance, you're good. Multi-model is hard? Obviously I'm an excellent presenter with awesome skills, and I've made it look very easy, and I expect the downloads to jump off the charts because I've made it look so easy. It's different; it's not hard. I'm sure the first time you sat down and tried to join four or five relational tables, with all of the inners and outers and whatnot, it was very complex for you, and maybe it still can be today. Multi-model has the same kinds of challenges. It's not magic; there's no magic box. This isn't IBM Watson where you just dump all your data in and Watson figures it out, wins Jeopardy, and does everything for you. There's a ton of information and effort that needs to go into designing these things. But it's not hard; it's just different. And a lot of people will say, well, I don't know how they run in the cloud. You'll find multi-model databases out there that run great in the cloud. Again, the book is Building on Multi-Model Databases by Pete Aven; it's free at O'Reilly. My contact info as well.
Hit me up and cash me outside, how about that? And my phone number, my email, and I will take your questions.

John, thank you so much for the fabulous presentation. What a great presentation it is. And we've got a lot of great questions coming in already. If you have a question, feel free to submit it. Oh, we just went to a big screen there. Okay, so if you've got any questions, feel free to submit them in the Q&A in the bottom right-hand corner of your screen. And just a reminder, to answer the most commonly asked questions, I will be sending a follow-up email by end of day Thursday with links to the slides, links to the recording of this session, and anything else requested throughout the webinar. So, diving right in, John: how do you line up the disparate attributes as your entity changes over time? Hashtag data management nightmare.

Well, it is and it isn't, but it's no different than relational, right? I think the hardest part about relational is that as you're building that data dictionary, and you have all of those source data columns coming in and you're putting them into one, you've got to be able to go back, trace, and understand: I took 10 different tables and put them in here, and now I'm calling this column phone number, but there were really 10 source columns that constructed it. By maintaining the original state of the data, and if you remember, we'll go back to probably right around here, in my envelope I've got the original state of the data right here. So I always have what the original state of the data was, and then I'm making a data dictionary just like anything else. So, yeah, you can say hashtag data management nightmare; I would say hashtag data dictionary, because you're building a data dictionary that you can also store in a document, that tells you what you're doing, and you're not doing it all at one time. You're only doing it over the amount of time that you need to. In this case, I'm only harmonizing first name and last name, and to me, doing it incrementally is a lot simpler than having to do everything like you would in relational. And again, it's not hard, it's just different.

So how do you account for performance goals with a single multi-model database? Well, you can do it with indexing and with the types of queries, and because you have different structures, you can do all sorts of different things. For instance, you may have something that's hard to model, something you originally modeled in documents, and you may go back and remodel that in triples to make it more performant. What you really want to do is let search drive your result set, so you want to make the query as constrained as possible up front. And we can combine search technologies: because in this particular case I have embedded search technology, I can do a term search against a whole dataset, come up with a subset of documents, and then begin to search those documents without ever having opened the documents at all, because I'm simply hitting the index. Indexing is the key with this. Again, it's a series of trade-offs. If you think about the amount of data that is indexed in a relational database, it's probably, I don't know, 2% in a large database; 100% of the data in a system like MarkLogic is indexed. So what are the trade-offs? You need a lot more storage. You need a lot more memory.
You probably even need a little bit more compute, but all those things are dramatically cheaper than they were 10 or 15 years ago.

So what would you consider a traditional BI tool? We're looking into the use of Power BI.

Yeah. So, you know, I'm old school, right? I came up during Cognos and Business Objects and MicroStrategy. And now there are a lot of newer ones, like Qlik and Power BI and Tableau, that have a lot of good advancements. Tableau is the one I'm most familiar with, because they have a REST interface, and we're actually doing some things to construct those tables. The biggest problem in using BI tools is this: if I have something that's first name, last name, email, gender, Social Security number, that's easy, right? That's very columnar; I can throw that into a query and get a result set out just like that. But something that's nested is more difficult, and so what you have to do is flatten that structure out. What we do is something called template-based extraction, where we create an extraction template, pull a table structure out of the documents, and then feed that table structure to the BI tool. I know MongoDB uses a Postgres front end, where they have some sort of layer between Mongo and Postgres, dumping the data into Postgres and creating relational views on top of that. So there are a lot of different strategies, but it is a struggle. I would have told you that the BI war was won years ago, right? Because Cognos and Business Objects, these are massive companies, but yet you see the market disruption that Qlik and Tableau, and I'm not as familiar with Power BI, caused, really because of the web interface and driving all of that to the web.

There is a question that I saw in here from Belosh, Belosh Dash, that I wanted to answer, if you don't have a problem with that. Sure. So the question is, how is fine-grained security defined on a multi-model database, especially with documents having nested structure? And it's exactly because of the nested structure that you can get the fine-grained security. What we do is look at patterns: we look for element names, we look for element properties, and we construct a logical case argument, and then we create a security index which accounts for that logical case argument, so that as we're evaluating the data and it satisfies that criteria, either you are allowed to see it or you're not allowed to see it, you're allowed to update it or you're not allowed to update it. We go with a set of tasks and a set of privileges and a set of rights, we create roles based on that, and then we can tie them back to LDAP roles and other things like that. I'm happy to do a follow-up on that; we actually do a lot of government work. Search is a really good example of this: you don't want a situation where a search query returns results and then you just redact the result. You want to not return the result at all if they don't have the ability to see it. That's why every query we do goes through our security index, where we're building those policies and comparing the data against that index of policies. But that's something I'd be happy to follow up on. And somebody asked where the book is located. We'll go to that slide right now so that if you want to look at it, you can.
It's oreilly.com slash data slash free slash building on multi-model databases. You can also get it at marklogic.com; there's a link up there.

All right, so we've got a lot of great questions coming in. Is there any place to download the platform you presented?

There is, actually, yeah. If you go out to marklogic.com, there's a download link out there, and you are free to download a copy of MarkLogic. We are now on version 9.03. You can download a developer copy, and there are some terms that come along with that, so it's not open source, but you're free to play around with it and create applications that you can run in development. Then when you move to production, that's when you need to begin licensing it. And there is a term: there's no expiration on the software itself, but there is an expiration in the terms, if you look at the terms. And that's the biggest lie, right? "Yes, I accept these terms and conditions; I have read the terms and conditions." Nobody ever does that, but that is one of the things in there.

So, going back to the questions here, from Susan: how do you write meaningful reports across multiple sparsely populated columns?

Well, the nice thing about that is that I'm only returning the values that exist within the database. If I have a very, very sparse result set, then if you look at it in a table, it's going to look odd, right? It's going to look very sparse, because that's the limitation of tables: every row has to look the same. But if I look at it as a result set, I'm only going to have those values which return. If a value is not populated, it won't return, so you have to write your application to be tolerant of that. If you have an application that has to get the same number of values every single time, then that's something you need to take into account, and you need to return those values whether they're there or not. But if I'm just constructing a result set and passing it off, for instance, I may have name, address, and phone number, but I might not have phone numbers for a bunch of different folks. When I write my application, and that's the nice thing about having that structured result, I can just return that JSON result, and then I can have a little writer that writes the JSON document out into my web page or my report or whatever, and it only features the elements that are there. So I don't have to decide whether it's sparse or not; the data that isn't there just doesn't come back. Now, if I have a tabular format, that is one of the limitations of a flat format, and that's one of the things that we really struggle with.

So, John, back to the education piece. As an experienced business analyst, I'm looking to better understand these technologies, but I'm having trouble finding a good, relevant starting point. What would you recommend in researching it? Would it be that book, or would you recommend just investigating NoSQL technologies in general: JSON, R, object DBs, et cetera?

So if you are a fan of the For Dummies series, and I don't want to insult anybody, but if you are a fan of the For Dummies series and you want something that's written for business analysts and non-technical people, there's a great book called NoSQL For Dummies. It's available from the MarkLogic website, and that's a great place to start.
It goes through and talks about all of these different technologies, and I would say Building on Multi-Model Databases is the companion to that: after you go through NoSQL For Dummies, probably Building on Multi-Model Databases. We also have another one called Semantics For Dummies, if you really want to get educated on semantic triples and RDF technologies. And just a sideline here: I'm noticing a lot of great questions, some from the same people. So if you have a lot of questions, like B-Losh, it looks like you've got a lot of great questions, feel free to hit me up in an e-mail or call me, and I'd love to discuss it, follow up with you, and give you more demos and more information. And that goes for anybody.

I love it. Thank you for being available. Yeah, we've got a lot of great questions, as you say, still coming in, and I think we have time for a couple more as well. So: top three reasons that you have for why multi-model is better than relational. How would you convince relational database fans?

Yeah, it's hard, right? I mean, I came from Oracle, I was there for 10 years, and I would have told you, being from Oracle, that I knew about NoSQL technology, because we sold something called NoSQL. In fact, the name of the product is Oracle NoSQL, and it's really a columnar database; it's based on Cassandra. We also sold a Hadoop variant as well. So I would have told you that I was very, very schooled in that. And what I can tell you, after being at MarkLogic for two years, is that I really knew nothing about the technology. The hardest thing that I've had to do is forget the relational constraints, because when I come at a problem, my go-to is relational, and when you do that, you start to layer on a ton of work. The way I describe it is like this: imagine you drove to work on the same streets every day for years, and then they put a freeway through, but you didn't know to take the freeway; you just kept going down the same set of streets. Your trip is never going to get any faster. The speed comes from eliminating the work you have to do, all those stoplights of the relational model. In my old commute, I always had to hit all those stoplights: take a current snapshot, design the new data model, create the indexes, build the application, and every project takes the same amount of time. But by taking a different approach, ingesting the data as is and then being able to really play around with a model that you can search, I can begin making those decisions as I go, and this is really where all the speed comes in. And we have multiple projects, and I can point you to them; if you go to marklogic.com and look at the customer stories, you'll find time after time. I'll give you a quick one. Deutsche Bank had 32 trading systems, and they tried for two years to construct a relational model that would account for all of those trading systems, and they couldn't do it, because the downstream systems were changing so fast. With MarkLogic, they were able to create, in six months, an operational data store that contained the data from all 32 trading systems and got the reports out to the regulators like they needed. And the only way they were able to do that is with the flexible data structure. And I've got story after story after story just like that.
So, top three, and I know we have about 10 seconds left: anytime you have disparate data, anytime you have a lot of data systems that change, and anytime you're not exactly sure what the data is or what it's going to be used for, multi-model is a great solution for that.

What a great way to end, John. Thank you so much for this fabulous presentation, and thanks to MarkLogic for sponsoring today's webinar. Just a reminder, I will send a follow-up email within two business days, so by end of day Thursday for this webinar, with links to the slides and the recording. I'll make sure John's contact information is nice and bold in there, along with everything else that was requested throughout, including links to the book and such. And a little note from MarkLogic: they say there will be a link to the book as well as a developer license in the follow-up email for you all. Again, thanks to all of our attendees for being so engaged in everything we do; we just love all the questions that came in today. And John, thank you again so much. I hope everyone has a great day. All right. Thanks, John. All right.