And welcome, my name is Shannon Kemp, and I'm the Chief Digital Manager at DATAVERSITY. We'd like to thank you for joining today's DM Radio deep dive, Are Data Lakes for Business Users?, sponsored by Arcadia Data. It is a deep dive continuing the conversation from a live DM Radio broadcast a few weeks ago, which, if you missed it, you can listen to on demand at dmradio.biz under podcasts. Just a couple of points to get us started. Due to the large number of people that attend these sessions, you will be muted during the webinar. If you'd like to chat with us or with each other, we certainly encourage you to do so; just click the chat icon in the upper right-hand corner for that feature. For questions, we'll be collecting them via the Q&A section in the bottom right-hand corner of your screen. Or if you'd like to tweet, we encourage you to share highlights or questions via Twitter using hashtag DMRadio. As always, we will send a follow-up email within two business days containing links to the slides, the recording of the session, and additional information requested throughout the webinar. Now let me turn the webinar over to Eric Kavanaugh, the host of DM Radio, to introduce today's webinar and speakers. Eric, hello and welcome. Hello and welcome, everybody. Thank you, Shannon. Yes, indeed, it's time for another DM Radio deep dive. My name is Eric Kavanaugh, and I will be your humble, if excitable, host. Let's dive right in. Are Data Lakes for Business Users? Obviously, that is a slightly rhetorical question. The apparent goal, I think, for any organization is to have that answer be yes. Of course, that's what we want. So our featured speakers today: Steve Wooledge of Arcadia Data, you can see him there in the middle, and my good buddy Wayne Eckerson of Eckerson Group. He and I go way, way back to the old TDWI days, so we've had a long history together focused on things like data warehousing and now, of course, data lakes.
And there's a concept I want to touch on very quickly before I hand it over to Wayne to give us some results from his assessment, and that is this whole concept of data science. I keep hearing about data science. This is one of my favorite quotes from the movie Nacho Libre, from Esqueleto, who claims, "I only believe in science," right? We hear all about science these days, and data science as well. We all know that numbers don't lie, but they sure can be misused or misrepresented. But I wanted to point out just a couple of quick things to keep this whole topic in perspective, regarding what we're trying to accomplish here today and what we're trying to accomplish in the broader business intelligence, analytics, and big data market. We're trying to use data to get insights, to be able to make better decisions for our business. The mission is the same. The mission has not changed. The tools have gotten much more powerful. We now talk about data lakes as opposed to data warehouses. And they are, in fact, very different things. They're designed very differently. They were developed in different eras of this industry. And let's face it, there was a whole set of constraints many, many years ago around which data warehouses were designed and built. Processors were slow, pipes were thin, memory was expensive, for example. So all of these factors really dictated what had to happen in terms of creating data warehouses. They were also extremely expensive. If you compare a data warehouse deployment today versus one 25 years ago, the price difference is astonishing. We've gone from millions of dollars to hundreds of thousands of dollars, even $100,000 or less depending upon the use case and the complexity. But I want to point out that communication is hard to do. I think people take communication for granted.
And at the end of the day, if you're not communicating clearly with your team, with your business users, what it is that you've learned, what you've gleaned from the data, then really you've been on a fool's errand. So that's something to be considered here. So science, data science, I think it is applicable these days. I think that is an accurate term that we can use to describe some of the more robust, well-thought-out, and efficient environments for managing data. But I think there's a significant disconnect in our culture today when we think of this term science. I think a lot of people believe that science represents a virtually infallible representation of reality, and that's just not true at all. Science is a discipline, and it relies on a methodology, a.k.a. the scientific method, which, if applied appropriately, effectively, efficiently, and responsibly, can give us great insights about the world around us. But remember, axiomatic to the scientific method, fundamental and intrinsic to it, is a commitment to forever question your data, your processes, your hypotheses, and even your conclusions. So again, my point is that we need to take the term science with a bit of a grain of salt here, and scientists change their minds. One of my favorite references is this whole story about eggs being bad for you. Remember how it really hit a fever pitch about 10 or 15 years ago? All the cholesterol in eggs, you're going to get a heart attack. And then what happened? Then the story came out about good cholesterol and bad cholesterol, right? Okay, what does that mean? I think the point is that scientists will change their minds about things, and scientists frankly can also be paid by large organizations to say things that they probably believe, but that are then used to distort the reality that we're all trying to better understand. So as just one quick example of how much we really don't know these days: when will the lava flow stop in Hawaii?
The answer is we just don't know. And the reason we don't know is because the Earth is a really large environment, and things like volcanoes are extremely hard to predict. They're very powerful, very complex, and we just don't fully understand what's going on: the magnitude of the problem space, if you will, trying to understand where the lava is going to flow, where the fissures will come from next, what that volcano may do next. We just don't know. And so I think it just pays to remember that data will always require analysis. No matter how efficient you are with a data lake management project, for example, you're still going to need to analyze that data to put it into context, to view it in reference to your current situation and the historical data that you may have. No data is ever going to give you the complete and total answer of the story, because you have to come up with the story yourself. So leveraging what you know is important. It takes savvy, it takes moxie. Analytics and big data are all very useful and valuable if we understand what we know and know roughly what we're doing. And so with that, I'm going to hand it off to my good buddy Wayne Eckerson, who is going to talk about the assessment that we've done on behalf of Arcadia Data and our end users about data lakes and the value they provide for business users. So with that, Wayne Eckerson, I hand it over to you. Thank you, Eric. It's great to be here with you and DATAVERSITY and everyone in the audience once again. Whoa, that didn't work. That image for some reason is not showing up. That's an image of a data lake. Hey, Shannon, can you flip over to the other one for him? There you go. Okay, take it away, Wayne. Yeah, all right. So there's the data lake. Yeah, so this webcast is about the business value of data lakes.
And as Eric mentioned, data lakes arose a number of years ago, almost 10 years ago when Cloudera was founded, really to address a lot of frustration on the business side with data warehousing. As Eric mentioned: too slow, too hard to design, too hard to change, very costly. Scalability capped out at a couple of terabytes, really, and it didn't handle unstructured data very well. So fast forward to the data lake and Hadoop, and that made a lot of people happy. However, it did not replace the data warehouse. What the data lake became very quickly, in essence, was not a data warehouse replacement, but really the ultimate sandbox for data scientists or power users, who really wanted what they've always wanted historically: a big, giant data dump, with IT out of the way. And the data lake essentially was that: just put all the data in one place, and then let me go in and navigate it and manipulate it and manage it and analyze it and create models from it. So the data lake, really, in its first incarnation, turned out to be great for data scientists and data analysts, power users, who wanted access to the raw data. It really wasn't a replacement for a data warehouse that supported standard dashboards and reports for what I would call casual users: executives, managers, frontline workers who really needed tailored access to information.
So, you know, we asked recently, when we got together with Arcadia, what about data lakes and regular users? Users who don't know SQL, Python, or Java, which are the tools of choice for Hadoop-type processing and analytics; who need a graphical interface to analyze data, also known as a BI tool; who require clean, curated, aggregated data, in other words, someone, typically in IT, to go in and take the raw data and manipulate it, clean it, and integrate it so that casual users can make sense of it without having to do all that manipulation themselves; and who need consistent query performance and views of data in reports and dashboards that are highly tailored to their needs. So they only see what they need and nothing that they don't, with predefined drill paths that really meet their needs to glance at KPIs and take action if things are awry. So my hypothesis, frankly, was that data lakes have not been a good thing for regular users, regular Joes, if you will: executives, managers, frontline workers, even customers and suppliers. But we got together with Arcadia and decided, let's test this, let's do an assessment and figure out if this is still the case, if data lakes are still just for power users or not. So we did an assessment and came up with a survey of 22 questions. It took about five minutes to complete. Once respondents completed it, Eckerson Group assessments or surveys generate a dynamic report, as you can see here on the right. It's personalized: it gives them a score and compares them to everyone else, overall and by category, with recommendations for next steps based on their rank in the scoring. So that assessment is still running now, and I encourage you to go out and take it at the link below to assess the value of your data lake, if you have one, for your regular business users.
So, as of April 20th, when I put these slides together, almost 200 people had started the assessment, 162 had completed it, and 93 had a data lake in production; those are the folks we really wanted to focus on. Of those, 74% were from North America, and about half were fairly large organizations with more than 10,000 employees. So the data that you'll see here in the charts I'm going to present is based on that subset of the respondent base. I think we're up to almost 250 respondents now. We'd love to have you get us up to 300. So write down that URL and go; it only takes five minutes or less to complete the assessment, and you get your own personal free report. So what did we find? First of all, a little surprisingly, most people, almost two-thirds, are using Hadoop for the data lake. I suppose that's not too surprising; data lake has become synonymous with Hadoop. But in the last couple of years, people have rushed to move these data lakes into the cloud and replace Hadoop with cloud object stores, which are currently running at 14% of respondents in our pool. 17% are running their data lake in a relational database. Some of you might think that's an anomaly, but truly, if you use the Inmon method of designing a data warehouse, it has always called for a staging area, essentially a place where you put your raw data before you turn it into third normal form and before you create and push out data marts from that. Also, 6% said a NoSQL database. A NoSQL database is not an analytical database by any means, but it certainly can hold a heck of a lot of data that can be used for analysis. The second question here was how users query the data lake, and this was very surprising. Data scientists tend to prefer tools like Python, Perl, Java, and other coding-type languages, or in the Hadoop world, Pig, Hive, tools like that, or just plain SQL if the data in Hadoop is written to Parquet files in columnar format.
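The Inmon-style staging flow Wayne mentions, where raw data lands first and is only later cleaned into normalized form and pushed out as data marts, can be sketched in a few lines. This is a toy illustration using SQLite; the table names, columns, and data are invented for the example:

```python
import sqlite3

# A minimal sketch of the Inmon-style flow described above:
# raw data lands in a staging area, is cleaned into a typed,
# deduplicated core table, and a data mart is pushed out from that.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# 1. Staging area: raw data exactly as it arrived, duplicates and all,
#    everything stored as text (a stand-in for "raw data in the lake").
cur.execute("CREATE TABLE stg_orders (order_id TEXT, region TEXT, amount TEXT)")
cur.executemany(
    "INSERT INTO stg_orders VALUES (?, ?, ?)",
    [("1", "east", "100.0"), ("1", "east", "100.0"), ("2", "west", "250.5")],
)

# 2. Core table: deduplicated and typed, the third-normal-form layer.
cur.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)
cur.execute(
    "INSERT INTO orders "
    "SELECT DISTINCT CAST(order_id AS INTEGER), region, CAST(amount AS REAL) "
    "FROM stg_orders"
)

# 3. Data mart: a pre-aggregated view pushed out for casual users.
cur.execute(
    "CREATE TABLE mart_sales_by_region AS "
    "SELECT region, SUM(amount) AS total FROM orders GROUP BY region"
)
rows = dict(cur.execute("SELECT region, total FROM mart_sales_by_region"))
print(rows)  # {'east': 100.0, 'west': 250.5}
```

The point of the sketch is the shape of the pipeline, not the engine: a relational database used this way is acting as a staging area, which is why 17% of respondents running their "data lake" in a relational database is less anomalous than it sounds.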
So we were actually pleasantly surprised that more than half are using a point-and-click visual BI tool to query the data lake. That was surprising. Now, I will say that the Bloor Group, Eckerson Group, and Arcadia all promoted this survey, and we each delivered an equivalent number of respondents. So there may be some bias in there, but I don't think too much. So I think we can trust that this data is generally representative of the marketplace. Okay. Then we asked where you have deployed your data lake. And you can see here that the largest percentage is still on premises. Public cloud ranges between 18 and 20%, so less than 20% there, and between 15 and 26% have a hybrid environment, both on premises and cloud. Now, I just said that we're seeing a large gravitation towards the cloud for running data lakes. But this chart actually contradicts that: it shows that companies that have deployed data lakes in the last two years are more likely to deploy on premises. I'm not sure I quite understand that. It runs counter to what we're seeing generally out there, at least anecdotally. But numbers don't lie, as Eric likes to say. So we'll have to discuss that in a little bit. Can business users explore data to get the views they want? This is part and parcel of what power users always do, and casual users to some extent do. And you can see here that more than half, almost two-thirds, agree or strongly agree with that statement. So the data lake really is an exploration area, a discovery area. And if users are using BI tools, then we have to admit that a large percentage of those users are casual users who do want to do exploration. We're seeing here that the data lake, far from being a data swamp, is actually providing information and data that users find trustworthy and that enables them to make better decisions. And of course, that's the whole point of using data: to improve your decision-making and improve outcomes for the business.
So it's great to see that over 50% agree or strongly agree with that statement. This is another surprising one: we asked about query performance, and 50% agreed or strongly agreed with the statement that the data lake provides consistently fast performance. When you think about it, Hadoop was designed as a batch environment and has only recently become interactive with SQL interfaces. So things are moving very fast in the data lake world, which is now able to support fast query performance and response times. The next question, about the accuracy of analytics in the data lake, is another reinforcement of the notion that these data lakes aren't data swamps, and that people with BI tools can not only make good decisions but trust the data that they're working with there. We also did a lot of analysis by company size, and we didn't find much variation between large and small companies, although this chart will show you that very large organizations with over 100,000 employees are a little bit more advanced: 47% strongly agree that business users can explore data to get the views they want, whereas in very small companies with fewer than 100 employees, a good 40% disagree with that statement. We did a lot more analysis, and we're writing a report up on the results. But in general, what we're seeing, according to the data from this recent assessment, is that most data lakes today run on Hadoop on premises. We're seeing that data lakes are not data swamps, contrary to what some gurus out in the industry have claimed, and that companies are able to maintain high-quality data in their data lakes. And most importantly, they're not just for data scientists. Graphical BI tools are being used heavily and provide fast performance for queries and exploration. And finally, the quality of data in the lakes is suitable for regular business users.
So I must admit that these results in summary, and we do have more details in the data, were a little bit surprising to me, but I think it's a good testament to how far and how fast we've come with this new technology, Hadoop and now the cloud. And I think that is probably a good segue to our next speaker, Steve Wooledge, who can talk about how they're supporting both regular users and power users in data lakes using their visual BI tool. So I'm going to pass it back to you, Shannon or Eric. Yeah, real quick, if I could jump in here and just ask you a couple of questions, Wayne. I'm curious to know what your take is on the people who were involved in these projects. In other words, do you find that the people who were on the data warehousing team are the same people who are working on data lakes? Are they different teams? Can you offer any context on that from your experience? Yeah, you know, I think in the early days, a lot of the data lakes were started by advanced analytics teams, kind of as experiments, to create an analytical sandbox to fast-track the creation and delivery of analytical models, predictive models, what we're calling machine learning models today. I think very quickly, as those things scaled up or failed (and a lot of them did not work out), IT took over that infrastructure, which makes sense, as it's an enterprise environment that can support either a lot of users across the enterprise or a very important segment of users, the power users and data scientists. So administering that environment became largely, though not entirely, the domain of IT. Now, Steve may disagree with me, but that's what I've seen to date. Yeah, I would agree. Go ahead, Steve, yeah. I would say that we've seen, as Wayne mentioned, the IT teams take over these data lakes as they have matured, and that includes other groups.
I would just mention data governance and stewards, and you see the BI Competency Center being involved, and there's a process of choosing standards for these platforms as well. So we'll talk more about that, but definitely, as it's becoming mainstream, it gets woven into the fabric of the organization. I do find that a lot of organizations struggle to reconcile their expenditures on data warehouses and their expenditures on Hadoop. Hadoop, obviously, is less expensive per terabyte, and a lot of business people look at the budget or the bottom line of these environments and want to replace the data warehouse, but technically that has not really been feasible. There are things that companies are offloading from the data warehouse that probably never belonged there in the first place, like ETL or detailed data. And we're starting to see this bifurcation, at least for now (things do change quickly): the data warehouse is well suited for supporting large numbers of concurrent users who need to do basic reporting and dashboarding, whereas the data lake is suitable for power users and for BI SWAT-type environments, to build things really quickly, prototype them, experiment, test them, deploy them. But now we're starting to see a lot of deployment of standard applications, analytic applications, also happening in Hadoop. So I think these two environments are co-opting each other. They're quickly developing the capabilities that the other one has, and they're becoming more and more alike. They'll never be identical, but the dividing line between them is getting fuzzier. And we are seeing Hadoop, or the data lake, taking over more and more of the analytics functionality. Yeah, and Steve, I'll just kind of throw this over to you real quick before you jump into your presentation. Really, you do want these two environments to be coordinating and collaborating.
You want there to be a lot of overlap between them, and it seems to me, and I know that you guys are kind of playing in this space, but from my perspective, and granted I'm in the analyst space, on the outside of all of this, I see a resurgence in business intelligence. It's almost like we went down the road of big data analytics, we learned some interesting things, but maybe we weren't as tethered to the core business objectives as the world of business intelligence was. And I now see a resurgence of BI tools enabled by more powerful infrastructure underneath that can tap into traditional data warehouse environments, but also pull insights from data lakes and from these new environments. Is that what you're seeing, or what's your take on all that? Yeah, definitely. I think we have the power user, and there are terms like citizen data scientist floating around, which are kind of interesting, because a lot of the excitement around Hadoop was getting at all the granular data; there's not some IT department that's pre-processing it and telling you what you should be analyzing. It's sort of an exploration area. But I think what's been missing is how we give that same power to the business users. And then you've got things like machine learning being adopted by the BI analytics tools out there, which can speed up that discovery process and put more power in the hands of these power users or citizen data scientists. So I think it's just been this natural evolution, as kind of the next generation of data people start using technologies like Hadoop and cloud. I don't think it's too philosophical, but there's sort of this generational growth in technologies that need to keep up with the demands of these different types of users.
But I do see it coming back to, at the end of the day, SQL is the language that people want to speak, and if you've got a GUI-based tool that can generate SQL that can be utilized by the data warehouse or the data lake, I think that becomes a standard through which you can do your analysis. That makes sense. Okay, good. And folks, just as a note here: that assessment which Wayne talked about, from which we got all the data we were just sharing a moment ago, you will get a link to it in your follow-up email later this week. So we hope you take a look at that and dive right in, and use it not just to understand where you are, but to see how you compare to other companies. You can even do analysis of companies your size, in your region, in your industry, and so on and so forth. It was designed to provide some really nice granular detail, to give you some perspective on where you are in your organization, and to give you some advice on which direction you should take. So I think it's a very powerful tool, and I would recommend checking out that assessment. So with that, Steve Wooledge, take it away. Yeah, thanks, Eric and Wayne. I'm really pleased with the survey, because I've honestly been in this industry 17 years overall. I've been looking at Hadoop, big data, and data lakes for, I don't know, the past 8 to 10 years, and I've never seen research, frankly, that really gets into the adoption, the usage, the platforms, et cetera, of data lakes. So it's really cool to see this research coming out, and as Eric said, I think there are a lot more people out there using it, and I'd love to get the perspectives from folks. But what I'd like to talk a little bit about is what I've seen change in the technology. I've been at traditional BI companies in my past. I've worked for large database companies like Teradata.
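The point about a GUI-based tool generating SQL can be made concrete with a toy sketch. The dict below stands in for the selections a user makes by pointing and clicking in a BI tool; the table and column names are invented for illustration:

```python
# A toy illustration of a BI tool as a SQL generator: the user's
# point-and-click selections (dimensions, measures, filters) become
# a SQL string that either a data warehouse or a SQL-on-Hadoop
# engine could execute. All names here are hypothetical.

def build_sql(selection):
    """Translate a dict of user selections into a SQL query string."""
    dims = ", ".join(selection["dimensions"])
    measures = ", ".join(
        f"{fn}({col}) AS {fn.lower()}_{col}"
        for fn, col in selection["measures"]
    )
    sql = f"SELECT {dims}, {measures} FROM {selection['table']}"
    if selection.get("filters"):
        sql += " WHERE " + " AND ".join(selection["filters"])
    sql += f" GROUP BY {dims}"
    return sql

query = build_sql({
    "table": "web_logs",
    "dimensions": ["region", "day"],
    "measures": [("SUM", "bytes"), ("COUNT", "session_id")],
    "filters": ["day >= '2018-04-01'"],
})
print(query)
```

Because the output is plain SQL, the same front-end selections can be pointed at different back ends, which is exactly why SQL generation works as the common standard Steve describes.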
I've worked at Hadoop distribution vendors, and now with Arcadia Data, we were really built to focus on that challenge of how we put the power of BI into the hands of people that want to go after these modern data platforms, if you will. And really, what we're starting to see now is that large enterprises, as I mentioned, these BI competency centers, are choosing new BI standards for their data lake, which are separate from and really not competitive with their data warehouse BI infrastructure. Because, as Eric mentioned at the beginning, the technology for BI that came out around the concept of the data warehouse was really based on the processing power and memory that we had then. And I think there's a whole new world around big data, which is obviously the size of it, but also the variety of the data, the speed at which it comes in, the need for people to have more real-time access, as well as just distributed systems. And there's the whole concept of figuring out what questions you need to ask, true data discovery, versus, again, having the IT department try to curate and build cubes and things like that that are based on business requirements, but maybe not opening up all the granular detail data to the exploration that some of these citizen data scientists or power users want to do on all these new sources of information they now have access to. So that's a long-winded way to say that I think times are changing and there is an inflection point. And if you look at the technology history, it's kind of interesting, because the data warehouse, relational technology, as Eric mentioned, was built at a time when processing hardware was really expensive, memory was really expensive, and there was a lot of optimization done at the software level to integrate very, very tightly with the hardware to make sure you're maximizing resource utilization. So those systems tend to be proprietary, which is not a bad thing.
They're actually super high performance. But you couldn't take BI server software and run it in that same software layer with the database, because it was so engineered for performance. So that's why you've got traditional BI tools that sit on servers or on desktops and access data in the data warehouse. And there's nothing wrong with that; that's how it was set up. But when you look at the analytical process, you've got to create physical optimizations of the data and how it's stored physically on disk, and there are aggregates that are created. You've, of course, got these semantic layers at the BI tool level, which connect to different data sources. But a lot of times that data has to be secured and loaded in two different places. And when you start talking about real-time insights, the laws of physics state that there's going to be latency as you're moving data across the wire from one system to the other, not to mention the overhead of multiple security layers and models and row-based access controls that need to be connected and kept in sync between these different systems. So when you start to throw big data into this type of an architecture, you've got semi-structured data, you've got these massively parallel systems like Hadoop and cloud object stores, and with just the volume of data and the time it takes to move it, et cetera, you lose the ability to connect natively and do real-time analysis on the system. So when we founded Arcadia Data back in 2012, it was really to solve that problem. For people that have the data lake in place, how can we give large numbers of concurrent business users access to that information? And the big aha was, you know, rather than having it work as a separate server, let's do what they said Hadoop was all about. Let's bring the processing to the data. Let's build a BI server that is fully distributed and runs in parallel across all the data nodes.
So rather than having a separate BI server, we said let's use the servers that are already in place. We'll install and run our software natively on each of those nodes. And when we talk about native BI, that's what we're talking about: a BI server that takes advantage of the open nature of open source software and of modern data architectures like the cloud, where you've got lots of processing engines that can run on those data nodes and take advantage of low-cost commodity hardware. And, you know, the resource utilization may not be as highly optimized, but the cost is so much lower that you can just continue to throw machines at it at a fairly low cost and scale extremely well. So that was the big change we made on the architecture side, which also has huge advantages on the overhead side: you don't have to optimize physical layers twice, you create a semantic layer once, you can connect natively to semi-structured data, security is done once (we inherit security from the underlying file system and security systems like Apache Sentry and Ranger), and you're only moving data once. You put it in the lake and it's there; you don't have to bring it into a separate analytical layer, which just by nature gives you more real-time access to the data. So that's really the architecture, and the results look like this. This is a proof of concept that we did for someone who's now a current customer, a teleconferencing platform that I am not allowed to name. Their requirement was that they needed 30 concurrent customer success managers to be able to analyze the log information around the use of this teleconferencing service, to look for bottlenecks or issues when service would go bad and things like that. So they had to be complex queries, it had to be a BI tool, and it had to support 30 concurrent users.
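The "bring the processing to the data" idea above can be sketched as a scatter-gather pattern: each data node computes a partial aggregate over its local slice of the data, and only the small partial results travel over the wire to be merged. This is a simplified, single-process illustration with made-up data, not Arcadia's actual implementation:

```python
# Sketch of distributed partial aggregation: instead of shipping all
# rows to a central BI server, each node aggregates locally and only
# the partials cross the network. Node contents are invented.
from collections import Counter

def local_aggregate(partition):
    """Runs on each data node, next to its own slice of the data."""
    counts = Counter()
    for region, amount in partition:
        counts[region] += amount
    return counts

def merge(partials):
    """Only these small partial results cross the network to be merged."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return dict(total)

node_partitions = [
    [("east", 10), ("west", 5)],   # data node 1
    [("east", 7)],                 # data node 2
    [("west", 3), ("east", 1)],    # data node 3
]
result = merge(local_aggregate(p) for p in node_partitions)
print(result)  # {'east': 18, 'west': 8}
```

The design point is that the bytes moved are proportional to the number of groups, not the number of rows, which is why the approach scales with commodity hardware as more nodes are added.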
And I took away the names of the different tools because I'm not trying to point out issues with any particular SQL-on-Hadoop engine, but the issue was they were trying to take a traditional BI tool and connect it to a SQL-on-Hadoop engine. There were three different engines they tried, in blue, gray, and yellow, and once you got above five concurrent users, the performance degraded significantly and results were not returning. So again, the concept here is that the Arcadia Data BI platform is not a SQL-on-Hadoop engine. It's not just doing scans. It's actually optimizing performance and thinking like a BI server that runs in the data platform, which gives you the ability to support lots of concurrent business users and accelerate existing BI tools, or we provide our own BI tools, as I'll show in a second. And of course, data doesn't only sit in the data lake, because, as we talked about, the data warehouse is not going away; it serves a very strong purpose, and workloads that didn't belong there are moving to other systems. We've also got things like event streaming, which is really popular now. People want to be able to stream data from IoT sensors out in the field and connected cars, which I'll show a quick demo of in a second. They need to be alerted, but also to see data as it's happening in real time and be able to respond to the business as it's happening, with the ability to drill to detail in the data lake or connect to other systems, whether NoSQL or relational, and be able to visualize all that in one place. That's a requirement, of course, that you would have in any BI tool, and native BI tools are no different and can support that. So just one last thing on Arcadia specifically. The other thing we really thought about was, I talked about cubes, OLAP cubes, and this idea that IT builds these with the business based on business requirements in advance, and they can be fairly complex projects to take on.
You build a cube and you're trying to teach people how to fish, but you're only handing them a certain number of fish within that cube, and every time they ask for more information you've got to go hand them more fish or recreate that cube. So what we thought about is: how can we give end users granular access to all the data for ad hoc queries in the data lake, and provide optimizations as we go, on the fly? So we do this using some machine learning and a recommendation engine. We're actually looking at what queries people are running and what tables or files they're accessing, and we'll recommend to the administrator what we call analytical views. These are caching mechanisms, aggregates, and physical models that we'll build back on disk in the distributed file system or in the cloud object store. We also take advantage of memory on the machines, so that the next time those queries come in, there's a cost-based optimization decision which will route each query to the fastest way to bring the results back. So in terms of modeling in advance, a human is still involved and can choose the physical modeling strategies, but it's using AI, if you will, or machine learning to recommend the best ways to speed up those queries in the future. Smart acceleration is what we call it within our system. That's, again, kind of flipping OLAP cubes on their head. There's nothing wrong with OLAP cubes, but they do introduce some form of latency into the process. So speaking of the process, I'm going pretty fast, but you'll get these slides afterwards.
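The smart-acceleration loop just described, watching the query log, recommending analytical views for frequent patterns, and then routing future queries to them, can be sketched very roughly. This is an illustrative simplification, not Arcadia's actual algorithm; the log format and threshold are invented:

```python
# A simplified sketch of query-log-driven view recommendation:
# count (table, group-by columns) patterns in the query log,
# recommend an "analytical view" (aggregate) for frequent patterns,
# and route matching queries to the view instead of the base table.
from collections import Counter

query_log = [
    ("web_logs", ("region",)),
    ("web_logs", ("region",)),
    ("web_logs", ("region", "day")),
    ("orders", ("customer",)),
    ("web_logs", ("region",)),
]

def recommend_views(log, threshold=3):
    """Recommend an aggregate for any pattern seen at least `threshold`
    times; a human administrator still approves what actually gets built."""
    patterns = Counter(log)
    return [pattern for pattern, n in patterns.items() if n >= threshold]

def route(query, built_views):
    """A toy cost-based decision: answer from the matching aggregate
    if one exists, otherwise scan the base table."""
    return "analytical_view" if query in built_views else "base_table"

views = recommend_views(query_log)
print(views)                                    # [('web_logs', ('region',))]
print(route(("web_logs", ("region",)), views))  # analytical_view
print(route(("orders", ("customer",)), views))  # base_table
```

The contrast with a classic OLAP cube is the direction of the workflow: the cube is designed up front from stated requirements, while this approach observes actual usage first and builds the physical optimizations after the fact.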
I think what we're seeing, if you look at the bottom in white, is that a lot of people are taking the data lake and treating it just like another data warehouse or storage machine, and they're trying to take their BI server and connect to it. There's nothing wrong with starting that way, but what we see is that the analytical process can really be delayed, because, as Wayne referred to, if you're following the Inmon model, with normalized data in that staging area, there's this process by which you land the data in the lake, transform it into some model, and create a schema, and then you connect your BI tool that's running on a separate server. Just the modeling part of that, the stuff in red, can take weeks. Before you can even start to connect the BI server, you've got this modeling to do, and then, by the way, you're going to create cubes on the BI server to speed up performance there once you've brought in the data from the lake. You get to step five, and then again you've got to secure it in two places, so before you get to step six it could be weeks or months before you're actually able to do any kind of analysis on data that started out in the data lake, and before you put it into production there may be some additional modeling. With the native approach, again, there's one system where the data is stored and the analytical processing is done. You land and secure it once. You can normalize and create schema if you want. There is a semantic layer that can also handle data like structs and arrays and those types of things, and your analytical discovery process is much faster because you're not moving data and you don't need to worry about optimizations in advance; it will run just fine for those discovery queries. And then that AI-driven performance modeling, the smart acceleration, can be done after the fact, when you decide you want to push something into production. So you're not moving the data, it's one security model, and you're
taking advantage of next-generation technologies to speed up that analytical process, and it does the performance modeling on the back end, so it greatly accelerates the time to insight from weeks or months to days. And I can tell you, again, having worked for big database companies, I've had customers who said that any time they need to add a new dimension to the schema in the data warehouse, it's literally 6 to 12 months of time and a million dollars of cost. So if you just want to bring in, I don't know, some external data into the warehouse for discovery, a lot of these systems and departments have been set up to be so highly governed that it's just a long process before you can get to any of that. I think that's why you saw this need for data scientists, and the creation of data lakes, where it's more exploratory in nature. So I know we're going to save some time for questions. I'm going to give you a quick demo flyby of what's possible with a data-native technology. We can come back to this, but I'm going to flip over here, and sorry, my email's up here, but this is Arcadia Data. In this instance we've created a couple of different demo environments. I've got one on the connected car; there's also a cybersecurity application that I won't show here. I can go ahead and launch this. This is a demo environment around connected vehicles, which is a very hot topic now, particularly around automated or autonomous automobiles. You can imagine, as a fleet manager for some service company like AT&T that's putting vehicles out in the world, you want to get notifications of things that are happening. So you can have real-time event streams coming in. This could be coming from something like Apache Kafka; it could be coming in from Spark Streaming; people use Solr indexes and things like that for more real-time updates and analytics. And what we're collecting is just information from
the vehicles, and again it's fictitious data, but we're looking at illegal lane departures, the things in orange, while collisions and hazardous conditions are in red. We're looking at a map, which you can zoom into to look within San Francisco for specific events that are happening, or I can click on an individual VIN for a car and do a more detailed analysis of the history of that car and what's been happening over time. This could be across different drivers, and for this VIN we see all these different events that happened. These are the results of Spark jobs that are looking at the acceleration, an aggression score, if you will, for this vehicle: how much has it been accelerating, how hard have these people been braking and steering. All of these come from sensors on the device, accelerometers and things like that. Then you can start to do some correlation analysis for those drivers or cars, and look at things like: is there a correlation between people who drive really aggressively and the number of collisions they're in? Again, demo data; obviously you would think there's a correlation there. But it also gets into things like predictive maintenance: what's the correlation between acceleration and needing to replace brakes or transmissions? So as the fleet manager, you've got the ability to monitor things in real time, but also to drill to detail and look for correlations, all within a simple UI. So that's a quick flyby of the types of things that are possible. I saw one question come in about what industries these are in, data lakes, that is, and I think we see it hugely in financial services, telecommunications, government, retail, CPG, all the traditional industries that have lots of products and lots of customers. Particularly with IoT and sensor devices there will be a lot more growth, but it really does span all different kinds of industries and different forms of use cases. So just real quick, I wanted to
show the tool itself and how easy it is to build stuff. This is an environment we've got running; it's connected to data that's sitting in a data lake on a Hadoop distribution, and all I want to do is show you how I build a dashboard. I'm going to connect to a data source. This is just TV data on viewership across different channels from a TV network. I called it Eckerson TV; maybe I'll get it onto DM Radio someday, just kidding, Wayne. I'm just going to take that data set that's already been connected to, I won't bore you with how we do the connections, but it connects to lots of different stuff, and now I'm going to build a dashboard. I just click the button that says create dashboard, and it pulls in the data that's been connected to, so this is looking at session ID, user ID, etc. I just want to simplify that down, so I'm going to click edit here. I'm going to look at the time, so I'll bring in the date string and the record count for all channels and all programs, and I'll just refresh that really quickly. So now I've got a nice simple date string, and I'm looking at the record count over time. But we're a visualization tool, so let's visualize something. We've got something like 30 different visualization types, and I'm lazy, I don't want to try and test them all myself, so I'm just going to click on this button called explore visuals. What this does is use some machine learning and best practices built into the product to recommend different visualization types based on the dimensions and measures that I selected. So here are some different things like bubble charts, scatter plots, and horizontal bar charts. There's a calendar heat map, which is kind of interesting, so I'll grab that. This, again, is just all records over time, but you can see the hotspots, the days in the month that were really heavy in terms of people watching TV. I'd like to explore that, so I'll close this out and save it. Let's add one other visual type in this demo, and I'm going to slice it
a little bit differently. I'm going to look at channels and programs, and the measure will stay as record count, but I'm going to limit that to the top 50 just to speed up and simplify what we're looking at, and refresh that. It's churning, but there we've got all the channels, the different programs, and the record counts, again in a nice tabular form, but I'd like to visualize it. So let's see what the system recommends to me, and this is the real data being visualized; these aren't just mock-up thumbnails, I can actually see the results here. I'll go ahead and click the horizontal bar chart, and that looks good: different channels, ranked, with the top 50. We'll save and close that, and one thing I want to do is add a couple of filters real quick, and then we'll open it up for questions. Again, I'm just showing how you can connect to the data and explore just as you would expect within a BI tool. So let's add some filters. I'm going to add a filter for channel and a filter for program, save it one more time, and view it. So here we have it, and I've got some filters, so I can see what's happening over all time for all channels. Let's pick a channel. I don't think Syfy is in here, and I'm kind of a sci-fi dork... there it is, Syfy HD. I've never actually looked at this one, so let's see: what are the top programs on Syfy? Face Off, Friday Night SmackDown, the Bourne Ultimatum, X-Men. So anyway, it's kind of interesting to see the hotspots. This could be used for advertising and things like that, if you're trying to tell someone when they might want to advertise based on the demographics for your different shows. So again, a simple demo, but it gives you a sense of how you can do this, and this is just Arcadia Data running directly in the data lake, giving you access to all the granular data. So with that, I will stop jib-jabbing and we can open it up for questions. We do have some good questions here, so let me just start throwing some over to you. One of the attendees is asking about data
quality: where is the quality of the data being curated inside the Arcadia architecture? We do not focus on data prep; that's something that our partners like Trifacta, Paxata, StreamSets, and folks like that will get into. We have a little bit of data prep within the product for the business analyst, but we really rely on those partners to provide, again, a native solution that runs within the data lake to do all the standard preparation steps you would want for more curated data. Okay, and one of the attendees is asking about S3 as a possible destination. Can you talk about your relationship with Amazon S3? Yeah, absolutely. We have a number of customers that are fully on the cloud. I'm trying to think which names I can mention; I think Neustar is one, Turner Broadcasting is another. A lot of people are starting to store data directly in S3. They still leverage the Hadoop ecosystem in many cases, so Arcadia would run in the elastic tier and connect directly to the data in S3 to visualize it. That's something we've had for a while, and we just announced support for Microsoft Azure Data Lake Store as well. Okay, you must have been reading my mind, because that was my next question: someone was asking, does it work in Microsoft Azure? And the answer is now yes. So let me throw another one over to you; it's an interesting one. We kind of talked about it already, but one of the attendees notes that IT people used to working on a data warehouse are likely going to require a bit of a mindset shift. What have you observed that can help them reorient themselves to focus on supporting a data lake versus a data warehouse? Well, I think what's good is that a lot of the skills that BI analysts or what have you possess are completely reusable. I think we're starting to see more and more analytical workloads also moving to the data lake for new applications as people want to build them, and as one of the callers asked about, data quality, cleansing, schema, all those things are still really
valuable and important. I think what's changed is just rethinking what's available in terms of BI tools. I think we were first to market in being able to connect to Apache Kafka natively, because we're just kind of in that space, and they've got a new KSQL interface that allows you to query streams of information, or things like Apache Solr or Apache Kudu and other types of data platforms that have some benefits to them. And being able to explore data and take advantage of nested data, things like JSON, structs, and arrays, where you've got the metadata in the data format itself, means you may not need to build a lot of schema in advance; you can just give the end users more access to it. But you still need things like role-based access control and security, and I think those concerns about security have all been solved by the community. The next wave is just providing tools that can take advantage of those for a broader set of users. I don't know if that totally answers the question; I think Wayne probably gets more involved with end-user clients in terms of training on that. Wayne?
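To make the schema-on-read point above concrete, here is a minimal sketch in plain Python. The record shape and field names are invented for illustration (this is not an actual Arcadia or Kafka API): nested JSON carries its own structure, so you can run an ad hoc query over structs and arrays without modeling a table in advance.

```python
import json

# Events as they might arrive off a stream: nested structs and arrays,
# with the "schema" carried in the data itself rather than defined upfront.
raw_events = [
    '{"vin": "VIN001", "speed": 62, "sensors": {"brake": 0.8}, "alerts": ["lane_departure"]}',
    '{"vin": "VIN002", "speed": 30, "sensors": {"brake": 0.1}, "alerts": []}',
    '{"vin": "VIN001", "speed": 75, "sensors": {"brake": 0.9}, "alerts": ["collision"]}',
]

def get_path(record, path, default=None):
    """Pull a value out of nested structs by a dotted path, schema-on-read."""
    for part in path.split("."):
        if not isinstance(record, dict) or part not in record:
            return default
        record = record[part]
    return record

events = [json.loads(e) for e in raw_events]

# Ad hoc query: count hard-braking events per VIN. No schema was built in
# advance; the nested field is resolved at read time.
hard_braking = {}
for e in events:
    if get_path(e, "sensors.brake", 0.0) > 0.5:
        hard_braking[e["vin"]] = hard_braking.get(e["vin"], 0) + 1
```

The point of the sketch is only that discovery queries can start immediately against the raw nested format; governance concerns like role-based access control, mentioned above, still sit on top of this.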
I was going to say that you've got to pay the piper at some point, and you have to create a schema for this data. The value of the data lake for power users was schema-on-read: you didn't have to wait for IT to model it. But at some point, especially when you're trying to get strong query performance for large numbers of concurrent users, you probably do want to model the data. And that raises the question I had for Steve. When you talked about your smart acceleration, you kind of insinuated that you really didn't need to model the data, that using machine learning your tool would be able to essentially create caches and aggregates automatically, so that you could get up and running pretty quickly, in a matter of days, without having to do any modeling at all. The tool would essentially create structures on the fly based on the queries you feed it, maybe priming the pump, to deliver the kind of performance that users would want. I'm wondering if that's an accurate reflection of what your tool does. Yes, it can certainly do it that way, but it's not pixie dust, right? I mean, you still need metadata definitions and data stewards, data catalogs, business terms on the data and tables that people want to access. I think you still need that at some level, particularly once you've done some initial exploration, if you want to provide a broader view to a broader set of people. I think having those definitions and semantic layers and things like that in place is also important. So anything that someone's built in the Hive metastore, we can read from that and make it available as well. But yes, even for queries that may have been defined, or tables that have been set up, there are going to be acceleration strategies based on actual usage that an administrator may not think about in advance. We can monitor that, and the system will recommend other ways to speed up those queries in the future. So yes, it can be used on raw data as it's come in, without
any setup, but it's also beneficial for the more curated data that will live in and support some of these end-user applications, where you're talking about hundreds of thousands of users on the system as well. But you don't necessarily require it. It certainly wouldn't hurt for users to create a schema inside Hadoop using Hive or whatever, right, to support the queries? It doesn't require it to give value, and we can also read that schema and take advantage of it. Yes, it's not required. Right, that seems to be the trend these days with a lot of these new technologies and tools: the processing power is so great that they can deal with the source schema, the sloppy schema that comes from the source, do something with it, and give value pretty quickly. And you can only enhance that value by doing more design up front, and in your tool you actually help do that as well with the smart acceleration capabilities. Correct. We've got a couple more questions here, or several actually, so let me throw them in. You just alluded to this moments ago, but there's a specific question about catalogs and semantic layers, and what you were saying is that Arcadia is set up to leverage those. How does that happen, and where does it happen in the process? You cut out a little bit there; I think you were asking where data catalogs play within all this, and you said that you can leverage existing data catalogs, so how does that actually work? Well, I'm pausing just because, so, there's this consortium, for what it's worth, that we're part of, called Make Big Data Work. It includes vendors like Trifacta, StreamSets, and Waterline Data. Waterline is a data catalog that was built specifically for data lakes, and Hadoop in particular, but there's also Alation and others out there. I'm not an expert on those things, but I'm seeing more and more people who have systems like the data warehouse and the data lake together. They need common definitions of customer and things like that, and of where the data is stored and what data is available
where. So we can connect to any of those views and provide back to the business user the definitions that have been defined, access that data, bring it in, and things like that. And then within our own tool we have a semantic layer where people can create their own data definitions for tables or data they're looking at that hasn't been defined yet. User A in sales might name the data one thing that makes sense to them, and user B sitting in, I don't know, engineering might name it something else. So you can also do that at the BI tool level, though there are obviously some concerns with that if you're a data governance purist about having a single definition for data. But all those possibilities exist, and I would encourage people to go check out Make Big Data Work; we've done a webinar education series in and around data catalogs and things like that in this world. Okay, good. And here's a good question from an attendee. I think I know the answer, but if you would, share with the audience, from your perspective: what's the main difference or differentiating feature between what Arcadia is doing and what Tableau is doing? Yeah, the key differentiating feature is the fact that we're a massively parallel system that runs directly with the data. Tableau can cluster environments, but our perspective has been that there's a lot of knowledge about how data is stored on the individual nodes, and since our software is sitting there next to the data, we can take advantage of that local knowledge. We're not just passing SQL back and forth through an ODBC driver or something like that; we're running natively where the data sits. That gives us tremendous scale and performance, and it's a lower-TCO solution overall. I think it's the architecture that really makes the difference, but then also, as I talked about with that process, it really speeds up the time to insight, because you don't have any data latency over the wire. You're not needing to move data
from one system to another; it's already where we inherit it directly from the data platform, and you don't have to re-administer it in a separate BI solution. That's just the philosophy of a native BI solution, which I think is becoming a thing. Okay, good. And you run both on premises and in the cloud, right? Can you talk to that real quick? Yeah, absolutely. I mean, you can get into a long debate on the differences, but to a lot of customers, I should say, it's just a deployment preference. A lot of people who go to the cloud often start out because they just don't want to manage their own data centers, so you can install our software just as you would anything else in that environment. There are also some advantages that we have in a cloud-based environment that I won't get into here, but there are some, with virtual machine instances and things like that, and some different thinking around how you architect software to run in those environments so that it scales precisely with the workloads. I'll just leave it at that, but there are a lot of things that we do in the cloud that are very interesting. I think our breakdown of people on cloud versus on premises is pretty similar to what you saw in the survey results from Wayne: roughly 20% cloud and a large majority still on premises, but certainly a lot of people interested in hybrid and cloud environments. But yes, we can run there. Yep. And there's another question, and this will be the last one I'll throw to you: what about data virtualization? Are you doing something like that? I think the answer there is no; someone is asking if it's similar to what Denodo does. Your key is giving direct access to the data through this highly parallelized environment, right? Literally taking the processing to the data in a highly parallel way, rather than doing virtualization. Is that right? Correct. And you know, there's a need for data virtualization, or a value to it. I think for us, where you want the physical copies in one place,
that's where you're going to get the huge performance gain. Obviously we live in a world where data sits all over the place, so there are needs for federation and virtualization and those types of things. But I think for production applications, where you want to deploy to thousands of users, again, that's why you would look at something like a native architecture, in addition to the benefits of exploration and everything we just talked about. Okay, good. Well, folks, we've burned through a whole hour, so let me hand it back to Shannon Kemp. Thanks for your time and attention. Thank you, Steve, and thanks, Wayne, great stuff. We'll talk to you next time. Thank you all. Thank you, Steve, thank you, Eric, and thank you, Wayne. What a great presentation, and thanks to our attendees for being so engaged in everything we do and for all the great questions that came in. Just a reminder: I will send a follow-up email by end of day Friday with links to the slides, links to the recording, and links to the assessment for you, and we'll see if we can get you a link to the additional demos and such from Arcadia Data. So thanks, everybody, and thanks, Arcadia Data, for sponsoring today's webinar. I hope you all have a great day. Thanks, all.