Hello and welcome. My name is Shannon Kemp and I'm the Chief Digital Manager at DATAVERSITY. We would like to thank you for attending Database Now Online, the first occurrence of this online conference produced by DATAVERSITY. We are very excited to kick off the event and have a great lineup of sessions for you today. And of course a special thanks to all of our sponsors who help make it happen. Just a couple of points to get us started. Due to the large number of people attending these sessions, you will be muted during the event. For questions, we will have a short Q&A at the end of each presentation today, and we will be collecting questions via the Q&A panel in the bottom right-hand corner of your screen. Or if you'd like to tweet, we encourage you to share highlights or questions via Twitter using the hashtag DBNow. If you'd like to chat with us and with each other, we certainly encourage you to do so; just click the chat icon in the top right-hand corner for that feature. And for this event we will send a follow-up email next Monday to all registrants containing your unique login to access the recording and the slides from today's presentations. With that, may I introduce our very first speaker for today, Dan McCreary, who will be discussing data lakes and data hubs, and which one is right for you. Dan is an enterprise data architect with a focus on NoSQL database architectures. His interests include topics such as distributed computing, programming languages, databases, and XML standards. He's the co-founder of the NoSQL Now conference and a co-author of the book Making Sense of NoSQL from Manning Publications. Dan was also a principal consultant for MarkLogic. Dan is currently an independent consultant with a focus on AI and natural language query front ends to data hubs. And with that, I will give the floor to Dan to get today's conference started.

Hello and welcome. Thank you, Shannon. I'm looking forward to presenting; this is a great topic and I think we have a great audience today. So let me just give a brief overview of what we'll be discussing here. The first thing we're going to talk about is an introduction to the terminology: what is a data lake, what are the definitions. We're going to really emphasize why some of these architectures are superior to relational databases for certain business problems. I'm assuming that most of the audience here is familiar with relational databases and their strengths and weaknesses. What we're going to do is talk about data lakes and data hubs and their common attributes, then talk a little bit about how they're different, and then we'll go into a little bit of a case study of an infrastructure for feeding data into these AI and deep learning systems that have become very popular right now. We'll wrap up with a proposed objective selection process to help you try to minimize the bias in the selection process. When I start these presentations I often ask how many people have actually heard of the data lake and data hub terms. I think most people have heard of data lakes, and a growing number of people now have also heard of data hubs. In general, about a third of the people that I talk to are using them, or maybe a subset of the technologies, but I'd say at least two thirds are either building or considering building a data lake or a data hub. So it's starting to grow in popularity.
And just a quick background for some of the concepts that I'll introduce today: they are discussed in much more detail in our book, Making Sense of NoSQL, and this is really a guide for solution architects. So we're going to take what's called the solution architect approach today: given a business problem, what's the best objective way to come up with the right architecture based on your business requirements, not necessarily on your legacy systems or what your IT organization has done in the last 10 years, but an objective matching of architecture to a business problem.

So let's review what these patterns are. Relational databases we have in the upper left-hand corner, where we store data in what's called a row store. We add one row at a time, and the constraint is that all the columns have to have a consistent structure. For example, if you have dates in a certain column, all the things in there have to be dates. The second one is the analytical pattern, where we have star schemas with a large fact table in the center that could be billions of rows long, and outside of that you have dimensions for dimensional analysis. Those are the first two, traditional patterns, and the four after that are considered the "not only SQL" patterns. We're going to briefly talk about them and then talk about how we use various ones for data analytics. The key-value store is the most basic, and everybody uses one; a UNIX file system or the Hadoop Distributed File System is a key-value store. You have a key, which is the path to the file, and a value, but the value is not indexed. There's no way to search for the data that's inside it; you can't say "select all data where the value is..." anything like that. It's the most straightforward and easiest system to set up. We're also going to talk about column family stores, graph stores, and document stores, and we'll go into a little bit more detail on each of those as we need to.

So just to make sure everybody knows a little bit of the trends: I mentioned that data lakes and data hubs are both popular terms and growing in popularity. They've had peaks at various times, but both of them have been around for a while and continue to grow in people's awareness. And the biggest thing we want to understand about these two architectures is what's called horizontal scalability. Horizontal scalability means that as I add more data, or more documents, or more information about the systems I want to analyze, we simply add new nodes to a shared-nothing cluster, and as we add more nodes, we get increased computing power. This is different from many of the other systems among those six; specifically, relational databases have a characteristic where they do joins. There's nothing wrong with the SQL language; it's really the joins between those structures. What it means is that joins need to move data around, and the more data you have, the harder they are to scale. So we're going to talk a lot about what we call that scalability factor.

So let's talk first about the data lake definition. What do we mean by a data lake? I'm going to use the definition from TechTarget, which I think is brief and concise: it's a storage repository that holds a vast amount of raw data in its native format until it is needed. I'm going to focus on three words. First of all, vast. Vast is of course a relative term, and every year the definition of what is meant by vast seems to go up, but in general I'd say data lakes usually have more than 10 terabytes of data.
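A minimal sketch of the key-value pattern described above, with a plain Python dictionary standing in for the distributed file system and invented claim-file keys, shows why retrieval works only by key and never by the contents of the value:

```python
# Minimal sketch of the key-value pattern: the key is a path, the value is an
# opaque blob. You can fetch by key, but you cannot query inside the value.
# The keys and CSV fragments are made up for illustration.

store = {}  # stand-in for a distributed file system or object store

def put(key, value):
    """Store raw bytes under a path-like key."""
    store[key] = value

def get(key):
    """The only lookup the pattern guarantees: retrieval by exact key."""
    return store[key]

put("claims/2019/claim-0001.csv", b"42,F,55435,...")
put("claims/2018/claim-0987.csv", b"17,M,10001,...")

print(get("claims/2019/claim-0001.csv"))
# There is no "select * where value contains 'F'" operation; to search inside
# the values you would have to scan and index them yourself.
```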
Remember, you can buy a 10 terabyte drive now for $300, so that doesn't mean you're spending a lot of money; it just means that you have a lot of data. We also want to focus on the term raw, that is, data that's not processed, and it's in its native format. So if you dump it from a relational database into CSV files and you have a lot of numeric codes, those numeric codes are going to be stored untouched, unchanged, and untransformed.

So let's talk a little bit about some assumptions whenever we talk about a data lake architecture. This is what they have in common. First of all, they all have the ability to scale out, but they're not designed to do joins between systems. They're designed for consistent performance: if I'm reading or writing data, regardless of how much data and even under a heavy load, you should be able to get consistent performance over these clusters. We also want high availability, so if one node in a cluster of 100 nodes goes down, the entire system is still available. There's replication; by default most of these systems have what's called replication factor 3, although that's very easy to change. That means when you store a document, you're actually storing it in three different places. The other thing is that there are no secondary indexes. By indexes I mean, just like we have a back-of-the-book index or we index words in a search engine, data lakes don't, by default, index your data. And they're also very low cost, as measured in dollars per terabyte per year. This morning I went and looked at the Amazon S3 pricing, and they're running at 2.3 cents per gigabyte per month. So if you multiply that by 12, that's about 27.6 cents per gigabyte per year, or $276 per terabyte per year. That's the baseline pricing everybody measures against. If your organization isn't doing chargebacks, you may not know what it costs your organization to store a terabyte of data for a year, but it's something that people who are building cost-effective data lakes really are concerned about, because storing your data in a relational database tends to be several orders of magnitude more expensive than this. A couple things about Amazon S3 as a baseline: it is designed for five nines of availability, so you can read your data at any point in time, and eleven nines of durability, which is the probability that your data will not be lost. I haven't ever talked to anybody who has lost data in Amazon S3, because it's replicated on solid-state drives very aggressively. And an interesting thing: Amazon's pricing has never gone up. It's only gone down as they use more efficient drives, and they've passed those savings on to customers.

So how are they implemented? In general you can implement a data lake in any technology you want, but the most common is probably Hadoop. Hadoop has what's called a distributed file system, often abbreviated as HDFS. You can also implement this on any object store, and there's a growing number of these distributed file systems, so you're not really locked into just HDFS. They are getting more popular, with better documentation, and they're easy to set up, and a lot of them actually adhere to the APIs that Amazon S3 has established. There's a very good Wikipedia page that has a list of these distributed file systems. So that's a quick overview of how we define a data lake.
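To make the terabytes-per-year arithmetic above concrete, here is a tiny worked example using the quoted object-store list price; the relational chargeback figure used for comparison is invented purely for illustration:

```python
# Worked example of the storage-cost arithmetic above.
# Assumes the quoted object-store list price of $0.023 per GB per month.
price_per_gb_month = 0.023

price_per_gb_year = price_per_gb_month * 12       # 0.276 $/GB/year
price_per_tb_year = price_per_gb_year * 1000      # 276 $/TB/year

print(f"${price_per_gb_year:.3f} per GB per year")   # $0.276
print(f"${price_per_tb_year:.0f} per TB per year")   # $276

# If your internal chargeback for relational storage were, say, $30,000 per
# terabyte per year (a made-up figure), the ratio shows the orders-of-magnitude
# gap the talk refers to:
relational_chargeback = 30_000
print(f"{relational_chargeback / price_per_tb_year:.0f}x more expensive")
```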
Now let's define what we mean by a data hub. The definition from Wikipedia is: a collection of data from multiple sources organized for distribution, sharing, and subsetting. So this is a very different definition. It doesn't really talk about storing raw data; it talks about making sure the data is useful for data subscribers. Being able to subset it often means you want to do it quickly: you want to generate web pages in real time. This is all about indexing the data and all about caching the results of those queries, but still benefiting from this distributed architecture. And that's really the goal of the data hub. It's also very different because we try to make the data consistent, the homogenization process of building what are called canonical models. That's not a trivial thing to do, especially if you have a large organization that can't agree politically on the definition of data, but that's really what we're trying to do when we build data hubs.

So let's just compare the data flows here. In the data lake, we take applications and we're going to either get CSV dumps, or, if you're doing Internet of Things and you have log files from all these sensors, you're going to take all these things and start to dump the raw, unfiltered, unprocessed data into a distributed file system, and then it's ready for batch analysis. This is not a real-time process. The focus there is analyzing data that is relatively immutable: it doesn't change a lot, it's not ongoing, constantly changing transactions. Whereas the data hub is a much more ambitious architecture. We have these applications all injecting data into this data hub, where we're going to try to make it meaningful, convert it to a consistent format, and then provide high-availability, sub-millisecond responses to web pages. And you can also do things like search; you can manage your transactions, and you can do analytics, all from the same data set. So, very different use cases.

So let's talk a little bit about how we actually go about doing some of these things. And by the way, I should mention that data hubs and data lakes can work together. You can standardize your data and then dump it back into the data lake. So it's not that you have to have just one or the other; they can work together, and they frequently do in large companies. Let's talk a little bit about how these data hubs actually handle their data. They also take the raw data and put it in what's called a staging area. A staging area is also very familiar to anybody building a data warehouse. And the most common pattern there is, every time you have a row in a spreadsheet, in a database, or in a log file, you create one document per row in these data hub architectures. You then transform those into what's called one document per object. So what do we mean by an object? An object is an enterprise entity. It might be a person, an organization, a product, a sales transaction, an event, or a claim if you're in insurance, and it's all the data that's associated with that thing. So if we think of relational databases as having lots of one-to-many relationships and having to do all those joins, we eliminate the need to do joins by collecting all the information for an enterprise object and putting it together. We can then do things like validate the data against standard document schemas, check it against metadata and reference data, and then create extracts from that indexed data and create output from it. All right, so that's a little bit about what these architectures start to look like.
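A minimal sketch of that staging-to-canonical step, one document per row folded into one document per object, with hypothetical table and field names:

```python
# Sketch of the "one document per row" -> "one document per object" step.
# The tables and fields are hypothetical; a real pipeline would read from the
# data hub's staging area rather than in-memory lists.

person_rows = [
    {"person_id": 1, "last_name": "Smith", "gender_code": "2"},
]
address_rows = [
    {"person_id": 1, "city": "Minneapolis", "state": "MN"},
    {"person_id": 1, "city": "St. Paul", "state": "MN"},
]

# Staging: one document per row, stored as-is.
staging_docs = (
    [{"source": "person", "row": r} for r in person_rows]
    + [{"source": "address", "row": r} for r in address_rows]
)

# Canonical: one document per enterprise object (a person), with the
# one-to-many address relationship folded in so no join is needed at read time.
def build_person_document(person_id):
    person = next(r for r in person_rows if r["person_id"] == person_id)
    return {
        "person": {
            "id": person["person_id"],
            "lastName": person["last_name"],
            "genderCode": person["gender_code"],
            "addresses": [a for a in address_rows if a["person_id"] == person_id],
        }
    }

print(build_person_document(1))
```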
So what I've been doing is specifically focusing on AI-driven strategies, and by AI-driven we mean a lot of these systems that are doing deep learning, where we're taking a lot of data but we need to format it in the correct structure. We call this feature scaling, and it means normalization: instead of having data with huge ranges, all the data sets might range between, for example, negative one and one, and everything is scaled to that. We can then take training sets out of that, train our predictive models, bring them back, and then use those models to do the predictions and build AI applications.

So one of the things I always ask people is: if you have the question "what is the answer to life, the universe and everything," the data lake would answer 42, just like in the book The Hitchhiker's Guide to the Galaxy. And the problem here is, what does 42 mean? That question of meaning should make you think about semantics. Semantics is the process of assigning meaning to this low-level numeric or binary data. So I'm going to propose seven levels of semantics here, from stage one, where we get raw comma-separated values and the column name may not mean anything to us. For example, one cryptic column might actually be the individual gender code, but the numeric code once again doesn't mean a lot. In stage two we can change the names to more human-readable names. We can then associate those attributes with an enterprise object like a person. And in stage four we can go from short numeric codes to maybe short letter codes, but maybe we actually want to have searchable strings, so that when people type in "find all female physicians within 10 miles of my house," the word female can be used in a search string. Then we can start to associate each of those elements with a formal namespace for your company. And the last stage, the ultimate in semantics, is what we call the RDF model. So let me show you what that RDF model means. It means that for every element in our data hub we have a subject, which is a full URI that points to the person in our metadata registry. It has the gender code, which is also a pointer, called an IRI, and that is the predicate. And then there's the object; sometimes objects are literal strings and sometimes they're also pointers. So RDF is the ultimate in semantics, but it is very verbose, and most organizations just don't have the people and time and patience to go to that full RDF system.

So what we talk about now is the spectrum of semantics. Not everybody is going to be able to move their data to full RDF, but what we are starting to say is that data lakes focus on cost: they focus on getting things up and running quickly so that we have lots of data to run our analytics on. Whereas data hubs have a much more ambitious goal of doing incremental harmonization of that data and then optimizing it for queries and indexing it. So there's a second set of processes around that. All right, so one of the things about these data hubs is that they also help you build integrated views of your customers, or maybe an integrated view of all your hardware devices, or whatever. And the key point is that if you have 10 different ways to represent a person or a claim or something like that and you put them in your data hub unchanged, it's going to be very difficult to query, because your query is going to have to get this data, get that data, get the other data, and then apply the if-then-else rules for all of them.
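A minimal sketch of the RDF triple idea above, subject, predicate, and object with full URIs, using invented example.org namespaces rather than a real metadata registry or triple store:

```python
# Sketch of the RDF triple idea: every fact is (subject, predicate, object),
# where subject and predicate are full URIs into a metadata registry.
# The example.org namespaces are invented; a real system would use your
# company's registered namespaces and an actual triple store.

PERSON = "http://example.org/registry/person/12345"
GENDER_CODE = "http://example.org/registry/def/GenderCode"
FEMALE = "http://example.org/registry/code/gender/Female"

triples = [
    (PERSON, GENDER_CODE, FEMALE),                                   # object is a URI
    (PERSON, "http://example.org/registry/def/LastName", "Smith"),   # object is a literal
]

for subject, predicate, obj in triples:
    print(f"<{subject}> <{predicate}> {obj!r} .")
```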
If you really want integrated views of your customers, we need to start to talk about these data hubs and how they work. A couple things about implementation here. People ask, wait a minute, we have this wonderful storage area network we've spent millions of dollars on; can I build data hubs and data lakes on this internal shared storage, these LANs and SANs? And the answer is really no. These data hubs and data lakes are designed to scale on shared nothing, on the far right-hand side in the green box. Every single node in your cluster, and you may have 100 nodes, has its own disk, its own RAM, and its own processor. The other architectures, where you have shared memory and disk, may be good for certain things; maybe your SAN does automatic compression of your data sets and things like that. But what you find is that these data hubs and data lakes need to have their own hardware. They need to be independent and highly specialized, and especially in data hubs, where we have lots and lots of writes to indexes, you have to have very fast write throughput using storage devices with lots of very fast solid-state drives.

A couple things about the 360 view and integration. I just want to really emphasize that many studies show that people who try to do these integration tasks on relational systems have huge cost overruns and huge delays. IT Business Edge did a survey of people who tried to use relational databases for integration hubs, and they had a 100% failure rate in that survey, and the reason is variability. If you start to gather one set of data and it has some small variations, you can model it in your ER model, but what about the second data set? Well, you try to standardize it, and it has its own variations, so you have to go back and remodel the data. Then you get the third data set, and you end up in this constant cycle where people have to keep changing that enterprise relational model. So the key thing is, when you talk about variability, make sure you know that you can't predict future variability, so you should strongly consider an architecture, documents and graphs, that supports that variability.

So just to summarize what we've been talking about: both data lakes and data hubs have that consistent horizontal scalability. They both have agility and flexibility; we call that schema agnostic. They don't care what the structure of the incoming schema is; they will just absorb that data and store it. Both of them leverage the low cost of shared-nothing racks of commodity computers, maybe not cheap but certainly commodity, so that you can use generic components, and they both know how to reliably distribute the data and distribute the computing loads. So if you're doing a transform of your data and you have 100 nodes, the CPU load on all of those 100 nodes should be evenly distributed. And both of them help organizations understand the challenges of distributed computing. What do I mean by that last point? One of my favorite articles is the Peter Deutsch article called the Eight Fallacies of Distributed Computing: the network is reliable, latency is zero, bandwidth is infinite, the network is secure, topology doesn't change, there is one administrator, the cost of transporting data around the network is zero, and the network is homogeneous. In general, you don't see these things hold true.
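A rough sketch of the shared-nothing intuition, hashing each document key to pick a node so data and CPU load spread evenly across the cluster; real products use consistent hashing and rebalancing, so this only illustrates the idea:

```python
# Intuition for shared-nothing distribution: each document's key is hashed to
# pick a node, so data and work spread roughly evenly as you add nodes.
# Real clusters use consistent hashing and rebalancing; this is only a sketch.
import hashlib
from collections import Counter

NUM_NODES = 100

def node_for(doc_id: str) -> int:
    digest = hashlib.md5(doc_id.encode()).hexdigest()
    return int(digest, 16) % NUM_NODES

placement = Counter(node_for(f"doc-{i}") for i in range(100_000))
print(min(placement.values()), max(placement.values()))
# With 100,000 documents across 100 nodes, each node ends up holding roughly
# 1,000 of them, which is the even spread of data and load described above.
```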
Those fallacies are all challenges that people who have never run distributed clusters and replication between data centers simply underestimate. So one of the things I'm saying is, if you don't really have a business need to do distributed computing, try not to do it, because it is complicated and expensive. But if you are really trying to build a 360 view of the customer, then it is a very good thing.

So let's talk a little bit more about the difference in philosophies, and let me zero in on how data hub and data lake philosophies are different. With data hubs, we ingest everything, we index everything, we analyze everything from those indexes, we run data profiles, that is, we do statistical analysis of data variability, and then we can track data quality and we can incrementally harmonize. Remember, several people who set up these things think, oh, I have to translate every single element of every database into a consistent representation, and that's not the case. We use what's called the wrapper pattern: we put the data that's been harmonized in a header, and we can put the source data in the body of the document, but we don't have to harmonize everything from the get-go. We also want to promote strong data governance and stewardship, and we want to make it easy to do transactions, search, and analytics on this harmonized data.

So let me talk now about using both graphs and documents. These are the last two patterns, and one of the things that I have found is that if you use graphs and documents, a mix of the two, they're really the ideal structure for loading data in, and specifically for helping you get data out faster. We'll talk a little bit about why not having to do joins really does make your performance go up and makes it much easier for your users to access this data. Just to introduce some terminology here: if you've never heard of the term enterprise canonical model, it's important that you understand what it is. It's a standard pattern; you can go to the Enterprise Integration Patterns book and look up canonical data model, and what you'll see is that this is the way we minimize dependencies when we're integrating applications that use different formats. We effectively have this canonical data model, we translate incoming data into that standard structure, and if people need different structures on the way out, we can transform it. So this is an important pattern that everybody should understand; it is not necessarily an inexpensive pattern, and you need to know when to apply it.

One of the things about this is understanding your cost models. Remember, if you look at a traditional, what I call three-component system, where you have your transactions, your data warehouse, and your search on separate computers, with separate people managing them and separate security models, your applications will store their data in the transactional database, but then you have to make copies of it every night into those other systems, and then you'll run your BI and analytics off your data warehouse and your search indexes.
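A minimal sketch of the wrapper pattern mentioned a moment ago, a harmonized header alongside the untouched source record, with hypothetical field names:

```python
# Sketch of the wrapper pattern: harmonized, canonical fields go in a header,
# and the untouched source record rides along in the body, so harmonization can
# be incremental. All field names here are hypothetical.

source_record = {"IND_GNDR_CD": "2", "LST_NM": "Smith"}   # raw extract, as received

wrapped_document = {
    "header": {                       # canonical, query-friendly view
        "gender": "Female",
        "lastName": "Smith",
        "source": "claims-extract",
        "qualityScore": 70,
    },
    "body": source_record,            # original data, preserved unchanged
}

# Queries and indexes work against the header; the body stays available for
# audit, reprocessing, or harmonizing more fields later.
print(wrapped_document["header"]["gender"])
```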
So to get the cost of that traditional three-component architecture, you have to add it all up: the cost of the relational storage, the cost of storing the data warehouse data, the cost of storing the search data, as well as the two ETL processes. And then I ask the questions: where do you store your metadata, is it consistent, what are your chargebacks, and do you have consistent security models? If you're dealing with highly audited organizations, and healthcare data is a very common case, the HIPAA rules say that anybody who touches any of these databases has to be audited and you have to have the logs to show it, so it gets to be pretty complicated. The big problem with this is the ETL. ETL can become the monster of complexity: it is very brittle, it's not suited for dealing with unstructured data, it doesn't respond easily and quickly when you add new columns, and it also doesn't scale. It's very hard to distribute ETL, because ETL often uses joins, and those joins suffer from the same problems. So in my experience it has been the biggest linchpin that has made the cost of traditional data warehouses go through the roof and caused a huge amount of people, time, and overhead.

So let's talk about the ideal NoSQL system. If you have a cluster of computers and each node in that cluster has the ability to handle transactions and search and analytics, then you don't have to move data around. I work with a CIO who frequently pounds his fist on the table and says, every time I move data around it costs me money; how do you minimize moving this data around? And the data hub architecture is a way to approach those cost savings, by storing the data in a single structure that supports all three of these things: transactions, analytics, and search. We can do a lot more. Remember, I mentioned that example earlier of finding all female doctors within 30 miles of your house: if you don't convert the numeric codes into those strings, you can't do search. And coming up with canonical models that support all three of those things requires some relatively senior people to build those models.

All right, so document stores, let's just go into them a little bit. They're ideal when we have highly variable datasets. The reason is that every document can grow new branches on its tree as the variability comes in. There's no rule that says you have to store things in certain structures, you can add optional elements at any time, and we don't have the same problems with joins. So moving to document stores has been the principal thing that helps people lower the cost of building data hubs. They're very different because they handle data in a flexible way. One of the most important case studies around is the healthcare.gov site, where they tried for almost three years to do it with a relational database: they had 37 different states, so they had 37 variations and different business rules. But they could do it in a document store, because those stores handle that variability. And there are a lot of diverse examples: having a document store that has good security, where you can audit personal healthcare information, where it has good role-based access, and where we have that volume scale-out, those are really the key things about using document stores. So to summarize there, an operational data hub is really a single system that does three things: it does the transactions, it does the search, and it does the analytics. And the idea is, if you can do it in one box, then you start to minimize the cost.
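A small sketch of how a document collection absorbs that variability, two provider documents with different optional elements living side by side without any remodeling; all fields are hypothetical:

```python
# Sketch of why document stores absorb variability: two documents about the
# same kind of entity can carry different optional elements, and nothing has
# to be remodeled when a new variation shows up. All fields are hypothetical.

providers = [
    {   # one state's feed includes specialties and languages
        "name": "Dr. Ada Jones",
        "gender": "Female",
        "specialties": ["cardiology"],
        "languages": ["en", "es"],
    },
    {   # another state's feed has a different shape entirely
        "name": "Dr. Lee Chen",
        "gender": "Female",
        "telehealth": True,
        "officeHours": {"mon": "8-5"},
    },
]

# A query can still work across both shapes without if-then-else rules per source.
female_providers = [p["name"] for p in providers if p.get("gender") == "Female"]
print(female_providers)
```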
So when people start to go through the cost of these data hub systems, they say, wait a minute, isn't all this data going to cost a lot of money? One of the things I try to mention is that data storage costs are going down, and they continue to go down. I went to eBay this morning and noted that you can now buy a 10 terabyte hard drive for under $300. So it's not really about the cost anymore; storage is becoming so cheap. So why not index everything, and why not index the data as it comes in, so that you can do analysis on it, as well as index it again after you've transformed it?

Let's talk about this thing called normalization. Normalization is when we put the data into specific tables, and then in order to get the data out we have to do joins. Document stores store things in what's called a denormalized form, where they effectively unfold all the joins, they do them all up front, and they put the result in one document model. So instead of doing 16 levels of joins on 17 different tables, and that's actually a real case study from a data warehouse that I worked on, it's one line of code. And because the data has all been pre-aggregated about each person, those business entities, your performance goes up, there are no joins, you can distribute your documents, and it's so much easier for people to write code that accesses this data.

All right, let's talk a little bit about data quality and data governance. Data quality is really when we can start to apply a schema against valid canonical data. Now remember, I said that data hubs and data lakes are schema agnostic; they don't care about the structure of what they load. But that doesn't mean we can't come up with business rules and store them in documents that validate documents for consistency: what are the required elements, what are the optional elements, what are the data types, what's the order? All those things can be stored in what's called an XML Schema document, and that means we can validate those documents and then assign a numeric score for data quality, so every document has a numeric score. Then you can do things like only allow data out through search where the data quality score is above, say, 60 on a scale of 0 to 100, and anything less than 40 has to go through a data quality process where we continue to enrich and make that data better.

So just to go through this, we've talked about these different sections, and I want to make sure that everybody has clear definitions of each of them. Staging is where we put the raw data. Canonical is where we start to denormalize and enrich it and make it semantically clear. And egress is the area where we start to pull the data out; that's where we make sure that we have the right indexes and that the people accessing the data have SLAs that are going to be met, both for their read access times and for their high availability. I should mention that many of the projects I've worked on have huge requirements for reference data. I'm not going to spend too much time on that, but I want to make sure everybody knows that reference data is one of the most difficult things; on many projects, once they get underway, half your budget can be spent on getting your reference data up and running and consistent for reporting.
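A minimal sketch of the validate-and-score idea, with a few hand-written rules standing in for a real XML Schema and with invented weights and thresholds:

```python
# Sketch of the validate-then-score idea above. Real data hubs would validate
# against an XML Schema or JSON Schema; here a few hand-written rules stand in.
# The rules, weights, and thresholds are made up for illustration.

REQUIRED = ["id", "lastName", "genderCode"]
VALID_GENDER_CODES = {"Female", "Male", "Unknown"}

def quality_score(doc: dict) -> int:
    score = 100
    for field in REQUIRED:
        if field not in doc:
            score -= 30                      # missing required element
    if doc.get("genderCode") not in VALID_GENDER_CODES:
        score -= 20                          # un-harmonized code value
    return max(score, 0)

def visible_in_search(doc: dict) -> bool:
    return quality_score(doc) >= 60          # the egress gate from the talk

good = {"id": 1, "lastName": "Smith", "genderCode": "Female"}
bad = {"id": 2, "genderCode": "2"}           # missing lastName, raw numeric code

print(quality_score(good), visible_in_search(good))   # 100 True
print(quality_score(bad), visible_in_search(bad))     # 50 False
```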
So how do we do this? We take the muddy data in our data lake and we try to clear it up; we give it continuous enrichment. The pipeline at the bottom says we're going to validate it, we're going to enrich it, and we're going to run that cycle over and over until we get a score that's reasonable. In the EIP book there's a very good section that talks about the enrichment pattern, where we take resources and enrich them. One of the best examples: you want to search by how far people are away from you, find all physicians within 10 miles. Well, they may have an address, but you have to enrich that with longitude and latitude and index it so you can do distance calculations. Those are all examples of enrichments. And every time you have numeric codes, you can think of converting those numeric codes into high-precision representations, with very clear URIs associated with both the data element names and their values, as part of that continuous enrichment process.

What we then talk about is data opaqueness. In the muddy systems you don't have the definitions, it's expensive to understand them, and there's no data stewardship or change control. Once we have a lot of these processes in place, we start to clear up our data and make it easier and more transparent for people to access: we have great APIs that are well documented, and all of our processes are designed to make things easier. I should just mention that there are a lot of very good tools for doing data profiling. That's really essential: when you load your data, you index it even in staging, so that you can run analytics on your data and find out what the variability is in the way people represent, say, a gender code. If it's 10 different ways, you know you have a lot of work; if it's all just one of two values, then you know your transformation process is going to be much easier.

What I want to do is make sure people get out of the trap of doing data archaeology on the data in your data lake. I've certainly seen projects where they have numeric codes, the numeric codes changed with every release of the software, they don't have the back versions of that software, and they'll literally spend tens of thousands of dollars only to find out that they can't figure out what a numeric code meant. That can be very, very expensive. So what we want to do is make sure that your downstream systems archive any reference data and timestamp it if it changes, so you can know the time ranges of the meaning changes and the semantic drift over time. In general, what we find is that if you don't have strong meaning, the value of data has a half-life, typically about 18 months. So data that's 18 months old is worth half as much as data that just got imported, because the software that created the newer data is much more up to date. I also want to mention that when we use the word semantics, there are two different meanings. There's the uppercase-S Semantics, meaning the Semantic Web stack, versus the lowercase-s semantics, which is about having data governance and a registry. So they're different things.

I'm going to quickly go through this, just making sure people understand that search is very complicated. We don't want you to have to build your own search engines. Use an existing search engine, because they really do do a lot of the math of how to do good search relevancy and word density ranking; that's not something you want to do on your own. I also want to mention that you should make sure you understand this concept of coupling between an application and a database: you have to decouple application code from the data that it lives on.
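A small sketch of the geospatial enrichment example above, adding latitude and longitude and then filtering by distance with the haversine formula; the coordinates are invented, and a real pipeline would use a geocoding service and the database's geospatial index rather than scanning documents:

```python
# Sketch of the enrichment example above: add latitude/longitude to a record,
# then answer "find all female physicians within 10 miles of me."
from math import radians, sin, cos, asin, sqrt

def miles_between(lat1, lon1, lat2, lon2):
    """Great-circle distance via the haversine formula (Earth radius in miles)."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 3959 * 2 * asin(sqrt(a))

physicians = [
    {"name": "Dr. Ada Jones", "gender": "Female", "lat": 44.98, "lon": -93.27},
    {"name": "Dr. Lee Chen", "gender": "Female", "lat": 46.80, "lon": -92.10},
]
me = (44.95, -93.09)   # the searcher's enriched home location (invented)

nearby = [p["name"] for p in physicians
          if p["gender"] == "Female"
          and miles_between(me[0], me[1], p["lat"], p["lon"]) <= 10]
print(nearby)
```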
The biggest thing I want to mention about this, as we wrap up, is that when you start to build data hubs, you're not going to see a lot of value as you start to spin them up. Metcalfe's Law says that the value of any network goes up as more things are in it and the more highly those things are connected, so the easier it is to provide value. So data hubs are low in value when they start, but as more data goes in and gets structured, the value goes up. What's also true is that the incremental cost of adding a new dataset tends to go down, because people often know the structures, know the codes, know the reference data, have all these things. So it does take a while to do those things.

So as I wrap up, I just want to mention that if your job is to be a solution architect, to give an unbiased view of which is the right architecture, it really means that you need to know more than just one pattern. You need to know more than just relational and maybe a little analytical. You need to know both data hubs and data lakes, know when to use them, what their strengths and weaknesses are, and know your own bias as well as the bias of the people around you. If you don't have people in your selection process who have built these systems in production, you're risking not providing value with your solution. So I encourage you to go out and reach out to other people who have built these systems. I do have a process I recommend, and this is called the ATAM process, the Architecture Tradeoff Analysis Method. It's a standard process out of Carnegie Mellon, and what it means is that when you consider your architectural alternatives, that box in the middle that has the picture of the six models, you consider all of them. And in parallel you build what's called a quality attribute utility tree. So what is that? If you think of a tree, a utility tree, that utility tree describes how important a feature is to your customer, maybe just as a high, medium, or low setting, and then how easy it is to implement that feature on a certain architecture, again high, medium, or low. These utility trees will show you the fitness of an architecture. So if you think of a glove fitting a hand, it's really how well the architecture fits the business problem.

So in summary, what do data lakes and data hubs have in common? They're both great at distributed computing. They both have horizontal scalability. They both leverage low-cost storage systems; make sure you're measuring that in dollars per terabyte per year. They don't work well with storage area networks; they like to have their own hard drives that they manage themselves. They don't require upfront data modeling, although you may design the structure of folders in certain data lakes. And they're very different in terms of semantics, indexing, caching, and security models. So my recommendations are: be very cautious about doing integration in a relational database. Make sure that you understand the role of data variability before you do these projects, not just variability today, but variability in the distant future. Look at data lakes for lowering the cost of analytics. Use documents and graphs for building integration data hubs. And use data hubs when you care about search and analytics and eliminating the need for expensive ETL. So with that, I've said pretty much everything I want to. And Shannon, would you like to open it up for questions here?

Dan, thank you so much for kicking it off in such a great way. What a great presentation. So I will jump in. Just a reminder, to answer the most commonly asked question:
I will be sending a follow-up email on Monday for this event with a unique login for everybody to access the slides and the recordings. So let me jump into the Q&A, because we don't have much time here for it. Is a data sandbox more akin to a data hub than a data lake?

Boy, I would call it a data lake. Sandbox has a very informal nature to it and doesn't really have anything to say about making your data consistent or indexing it. So I'd say it's much more like a data lake.

Alrighty. And if you have questions, feel free to submit them in the bottom right-hand corner in the Q&A section. So, is data modeling not required at all for data lakes and data hubs?

I would say in general, to get started with a data lake, there's very little modeling. The only thing is, remember that it's a key-value store and the key can contain a folder path, so you may design folders. For example, you may want to put this year's claims in one folder and prior claims in another folder and then be able to narrow things down. So you might design some folders in a data lake. Data hubs do require you to do modeling at the document level, so there is a lot of data modeling process, but sometimes your organization may actually already have a canonical model, so you might use that too. In general, data lakes involve relatively little modeling, and data hubs involve much more document, and sometimes graph, modeling too.

Sure. And what is the difference between a traditional staging area in a data warehouse and one in a data lake or data hub, more specifically a data hub?

Yeah, that's a great question. Both of those systems use staging areas, and they are a separate copy of the data that's kind of blocked off from user access, usually replicated copies of the data that are dumped from the relational system. So I'd say they're very similar. In a data hub, though, as we move towards document stores for that flexibility, we use this pattern called one document per row. So from your relational databases, you're going to convert each row of every table into one document. And you may think that's going to be millions and millions of documents, and that's true. But remember, if you have 100 servers in your cluster and you have a million total documents, one one-hundredth of them are going to be stored on each server. So in general, we don't worry about the scalability of lots of small documents in the staging area. The summary is, they have a very similar function, but the implementations are slightly different.

Sure. So how does a data hub reduce data anomalies on create, update, and delete, since it is denormalized?

Let's see, I'm not sure I understand the question. When you talk about data anomalies, let me just assume what the user is saying is that you have data that has unusual data elements in it, data elements that you haven't seen before. Data hubs do store and index that, and it's always immediately searchable, but it may not validate against your schema, and it may not be included in some of your reports. So that's what the data profiling tool really does: it detects anomalies, and that's what schemas do as well. One of my favorite jokes I hear is that data hubs allow us to ingest the data that we actually get, not what people said they would send us, right? A lot of times people say, well, we're going to give you this extract from our database and here's what the fields are going to be, but then somebody adds a new column to it some week and you get that column.
And you've never seen it before. So your validation can quickly say, hey, I got a new element, I don't understand it, we're going to put the data quality score down to 10. We're going to look at that, we're going to say, oh, that makes sense, we'll add it to our schema, we'll validate it, we'll boost the score up to 70, and then it's searchable. So there are two things, data profiling for the first set of data and data validation for the continuing feeds, that will help us detect anomalies in the data sets that are coming in. I hope I answered that, and if I didn't, send me an email and I'll try to answer the other meanings of the question.

Love it. And so, where integration is heavily needed, what are the general challenges of moving relational data to NoSQL?

Great question. It's something that I have been obsessing over for almost two years: how do we get data out of relational and into these data hubs? In general, there is a very important pattern called the load-as-is pattern, where we just move the data directly into the staging area unchanged. It's very important to use that load-as-is pattern because we can do data profiling analytics on it. So load as-is is really the first of a five-step process of getting data ready to be transformed into standard formats. Here's the problem: once we do the transformation, that transformation is somewhat of a manual process. But wait, there's this new technology called artificial intelligence and machine learning, and what we're starting to realize is that we can use AI and machine learning to partially automate the conversion of data. Partially, because it's a huge, complex problem. I've heard of several research projects that have seen a doubling or tripling, and sometimes a tenfold increase, in the speed of converting relational data into other formats by applying machine learning to it. Effectively, what we're doing is matching: here's a person's last name in the relational system, here's person last name in our canonical model, and we have a vast amount of statistics. What's the probability that the first letter of a last name is going to be an A? We can store those statistics. When we sample our new data, we can put those statistics through machine learning, and it can give us a probabilistic estimate: is this in fact a person's last name? And it can do that based on lexical information, what's the column name; semantic information, what's the definition we got from the source; as well as statistics, which come from the data profiles. So the number one thing is that right now, moving data from relational into document stores and data lakes in a way that's meaningful is an expensive and slow process; the data transformation process takes about 70% of the budget. In the next five years I suspect we'll see at least one and maybe two orders of magnitude lower costs in automating data conversion because of machine learning, and that's certainly one of the topics I'm most interested in. So anybody else who is interested in using machine learning to do data mapping, let me know; I've been working on that for quite a while.
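A rough sketch of the statistics-based matching described above, comparing first-letter frequency profiles of a known canonical field and an unknown incoming column; a real project would combine this with lexical and semantic signals and an actual trained model, and the sample values and threshold are invented:

```python
# Sketch of statistics-based column matching: build a first-letter frequency
# profile for a known canonical field, profile an unknown incoming column the
# same way, and score the similarity of the two distributions.
from collections import Counter

def first_letter_profile(values):
    letters = [v[0].upper() for v in values if v]
    counts = Counter(letters)
    total = sum(counts.values())
    return {letter: n / total for letter, n in counts.items()}

def profile_similarity(p, q):
    """Overlap of two probability distributions (1.0 = identical)."""
    letters = set(p) | set(q)
    return sum(min(p.get(letter, 0.0), q.get(letter, 0.0)) for letter in letters)

# Known statistics for the canonical person.lastName field (sample data).
canonical_last_names = ["Smith", "Sanchez", "Johnson", "Jones", "Anderson", "Lee"]
canonical_profile = first_letter_profile(canonical_last_names)

# An unknown column arriving from a relational extract.
unknown_column = ["Smith", "Singh", "Jackson", "Larson", "Olson", "Adams"]
score = profile_similarity(canonical_profile, first_letter_profile(unknown_column))

print(f"match score: {score:.2f}")
if score > 0.5:                       # invented threshold
    print("candidate match: person.lastName")
```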
Dan, thank you so much. There are so many great questions coming in, but I'm afraid that's all we have time for in the Q&A. Dan, thank you so much for kicking off the event today with such a great presentation. Thanks to our sponsors and thanks to all the attendees who have joined us so far. We now have a 10-minute break, where we encourage you to network with each other while we set up the next speaker. The next session will begin at 12 p.m. Eastern, where we will hear Donna Burbank discuss the latest in database and metadata relationships. Dan, thank you so much, and thanks to all of our attendees so far. Thank you very much, Dan.