Hello and welcome. My name is Shannon Kemp and I'm the Chief Digital Manager of DATAVERSITY. We'd like to thank you for joining this DATAVERSITY webinar today, which is a case study: how JB Hunt is driving efficiency with AI and real-time automated data pipelines, sponsored today by Qlik. Just a couple of points to get us started. Due to the large number of people that attend these sessions, you will be muted during the webinar. For questions, we will be collecting them via the Q&A in the bottom right-hand corner of your screen. However, if you'd like to tweet, we encourage you to share highlights or questions via Twitter using the hashtag #DATAVERSITY. And if you'd like to chat with us or with each other, we certainly encourage you to do so. Just click the chat icon in the bottom right-hand corner of your screen for that feature. As always, we will send a follow-up email within two business days containing links to the slides, the recording of the session, and additional information requested throughout the webinar. Now, let me introduce our three speakers for today: Newman Fahar, Joe Spinelli, and Ritu James. Joe is the director of engineering and technology at JB Hunt. He leads the data empowerment organization, consisting of the business intelligence, data engineering, and Google analytics teams. Joe's teams are focused on enabling the entire organization to get the maximum value out of their data assets. Newman leads the ISV solutions architect team at Databricks. With his deep knowledge of data engineering, data science, machine learning, and analytics, Newman helps companies obtain quantifiable business value from their technology investments, including Spark, Hadoop, and other open source technologies. He works closely with Databricks partners to help architect joint solutions and take them to market. Ritu is the director of product marketing at Qlik. In this role, she is responsible for the go-to-market strategy for Qlik's data integration portfolio, with a specific focus on data lake creation solutions and associated products. Ritu has over 20 years of experience in data management and analytics, and led global marketing strategy for multiple industry verticals and solutions at FAST, Alturx, and Tradia prior to joining Qlik. And with that, I will turn the webinar over to Ritu to get us started. Hello and welcome. Oh, you're still muted. There we go. I got you. Are you there? Hi, Ritu. Did you mute on your end as well? Can you hear me? I can't hear you. Yeah, now we can hear you. We got it. Fantastic. Good morning, good afternoon, everyone. Thank you, Shannon, for introducing us. Welcome to our webinar today, where we will be talking with JB Hunt about how they are improving their efficiency and customer experience with Qlik and Databricks technology. In my section, I will be talking about some of the challenges that companies like JB Hunt face in terms of getting maximum value out of their data lakes, and how Qlik and Databricks solve for those. But before I do that, let's talk a little bit about who Qlik is and what we do, especially for those of you who are not very familiar with Qlik. So Qlik is a data and analytics company with over 50,000 customers worldwide and a presence in over 100 countries. We're known as a leader in both the data integration and data analytics spaces. In fact, Gartner, for the 10th year in a row, has awarded us a leadership position in their BI and analytics Magic Quadrant again this year.
We're also seeing huge momentum in the data integration space, with double-digit growth in that market. Gartner very recently named us the fastest growing vendor in the data integration market, with over 70% growth year over year. We work with over 1,700 partners globally, both GSIs as well as technology partners like Microsoft, AWS, Databricks obviously, Snowflake, Google, and more. In terms of what we do, we work with organizations such as JB Hunt to help them turn their data into business value by providing solutions for the end-to-end value chain, all the way from data to analytics to insights. Our solutions fall into three major categories: data integration, which focuses on turning your raw business data into an analytics-ready state; data analytics, which focuses on converting that data into actionable insight; and data literacy as a service, which focuses on providing organizations education and consulting support so that they can deploy those insights to drive business value. I don't think there is any debate anymore on data driving business value. There are enough studies and enough research on this topic to show that companies that are using data to drive business decision-making tend to outperform their less informed counterparts. But the reality is that most companies are still not using their data to the extent that they should be able to. We recently did research work with IDC, and some of the findings were quite surprising. One of the things that we found when we were doing this research is that even today, less than 10 percent of the new relevant data is being used by some of these companies for analysis. And of the business executives that were surveyed during the study, only 32 percent were confident that they could create value from their data. The most surprising to me was the third finding, which was that a number of business decision-makers said they still don't have the confidence in and understanding of their data needed to deploy it for business decision-making. And that's the reality. Most organizations, even today, are struggling to convert the data that they collect into an analytics-ready state and make it available to their data consumers, let alone derive any business value from it. And if you look at it, it makes sense, because when you look at the data-to-value continuum, there are multiple steps, all the way from accessing and capturing data and storing it in your data lake to consuming that data for business decisions so that the business impact can be there. And really, the maximum value of that data comes when you deploy it to drive those decisions. But most companies, if you talk to them, are still focused on capturing that data and maybe storing it in the data lake, and not so much on processing that data and getting it into an analytics-ready state so they can feed their AI models or their machine learning models, or on managing and governing that data so that the right people have access to it and their data consumers have it available when they need it in order to make those decisions. We did a webinar in September with Eckerson Group, a group which very much focuses on the data management space.
And one of the things that we were highlighting during that session was that there are too many companies that got caught up in the front end of the data-to-value continuum, where they are just capturing and storing data in the data lake. That is an important step, but it is only the first step. Making that data usable and accessible is even more important, because that's when your data consumers are able to take that data and convert it into business value. So let's talk about some of the challenges that companies like JB Hunt and other large enterprises are facing when it comes to gaining value out of their data. When you talk about the capture and ingest side, there are challenges around multiple data sources. Most of these companies are pulling data from multiple sources, sources that don't talk to each other, and in some of these companies, those sources can run into the hundreds. In order to ingest this data, they need to connect to these data sources and set up data pipelines. A lot of the time, these data pipelines are being set up manually by their data engineers, and therefore there are errors in the process. There is manual coding happening, so there can be coding errors. There are challenges around the time it takes to set up those pipelines when you're talking about a large number of data sources. And then there is a challenge around data latency. When you look at your AI models, your machine learning models, and even your real-time operational analytics, you want real-time, up-to-date data feeding those models. But most of the time, when you're talking about large volumes of data, batch mode does not work anymore. You want the ability to load that data in an incremental fashion as the changes are occurring. So that's a big challenge in getting all your data in. The second challenge is around storing and processing that data. From the storing and processing perspective, there are obviously challenges around the cost and flexibility of storage, right? Because you're storing large volumes of data and different types of data. But then there is a challenge around the reliability of the data. Is there consistency between what's happening in the source systems and your target data lake? Is the data consistent? Is transactional consistency being maintained? Are changes in the data definitions or data structures being propagated to your data lake? There is also a challenge around the timeliness of the analytic readiness of the data. As you're capturing this data, you need to standardize it, format it, and merge those changed data files in order to get the real-time, most current view of the data. A lot of the time that work is being done by data engineers or data scientists, and most companies don't have an army of data engineers and data scientists. In fact, in the Eckerson survey that we did in September, one of the things we found was that data engineering positions are four times less likely to be filled than data science positions. So there is a real scarcity of data engineering resources.
And it is just not possible for any company, of whatever size and however much investment they may have made in the data engineering side of the business, to process all this data manually by hand coding their ETL scripts. The third challenge is around managing and governing the data. Most companies are in a hurry to dump data into their data lake because they want the speed of getting that data stored in the lake. But what happens in the process is that they usually are not generating metadata around that data. There are no common data definitions. So even though data is coming from multiple sources, there are no common semantics for searching that data. Data consumers are struggling to find the right, relevant data in the data lake, combine those data sets, and be able to use them. They also don't have the ability to control who has access and who does not have access to this data. So there is this whole question about confidence in that data, which makes the usability of the data questionable. And then lastly, from the consumption and analysis perspective, your data consumers want the ability to access all that data themselves and to self-provision that data, because they need data when they need it. But in most companies, you're dependent on IT, data scientists, or data engineers to provide you access to the right data set, which means that by the time you get the data it's already too late, or the need for the analysis has already passed. And that's where Qlik and Databricks are coming together to provide this end-to-end, real-time data pipeline, and we'll talk about it from JB Hunt's perspective. But if you look at the overall architecture of the joint solution, it really helps you automate that entire end-to-end data pipeline. The Qlik solution for Databricks comes with a market-leading change data capture capability, which provides you that universal connectivity: the ability to access and ingest data from virtually any enterprise source, databases, data warehouses, your legacy mainframe systems, or your enterprise applications like SAP or Salesforce. It can ingest that data and load it directly into Delta Lake. And the solution provides for both batch loading of data as well as capturing incremental changes as they occur, so that you can convert your slow data into fast data and get that near-real-time latency. The solution also supports source and target schema sync. What it does is propagate any changes happening in the data definitions and data structures of your source system into the target Delta Lake, so that you have the most reliable, most consistent data at any point. And once the data is loaded into Delta Lake, it also auto-generates SQL code for merging that changed data, so that you can generate the operational view of the data, the most current, up-to-date view, by executing that merge code within your Delta engine. And once that operational view of the data is created, you can either feed that operational data directly into your analytical models for real-time insight, or you can feed it into your AI and machine learning models developed in Databricks to train and score those models. And we talked about the resource constraints around data engineers to set up those data pipelines.
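To make the merge step concrete, here is a minimal sketch in PySpark of the kind of change-data merge that such generated code typically performs. This is an illustration, not Qlik's actual generated SQL; the table names (customers_current, customers_changes) and the op_code column are hypothetical.

```python
# A minimal CDC-merge sketch against Delta Lake, assuming a landed table of
# change records (customers_changes) and an operational view (customers_current).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    MERGE INTO customers_current AS t
    USING customers_changes AS c
    ON t.customer_id = c.customer_id
    WHEN MATCHED AND c.op_code = 'D' THEN DELETE         -- row deleted at the source
    WHEN MATCHED THEN UPDATE SET *                       -- row updated at the source
    WHEN NOT MATCHED AND c.op_code <> 'D' THEN INSERT *  -- new row at the source
""")
```

Running a merge like this after each batch of changes is what keeps the Delta table transactionally consistent with the source system.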
The Qlik solution also comes with an interactive interface so that your data engineers can set up those data pipelines in a point-and-click manner. They don't need to write any code. They can just set up those pipelines interactively, and then they can monitor and manage them in an exception-based manner without having to deal with the challenges of managing a lot of pipelines from multiple sources. The solution also comes with a cataloging capability. We talked about data consumers having the challenge of being able to search for and find data. Well, the integrated catalog in the solution allows you to generate very rich metadata on all the data, not only the data that is in the data lake, but across all your source systems, so that your users, your data consumers, can very quickly search for, find, and understand that data. It comes with a data marketplace, so that in a very online-shopping-type experience they can shop for and self-provision that data, getting the data subsets ready in the format they need in order to analyze that data. And the catalog has governance capabilities integrated within it, so you can set up access controls, and you can set up security features like encryption and authentication to protect your more sensitive PII data, et cetera. And then the whole solution is platform-agnostic, so once you have your data in the format that you need, you can analyze it in Qlik, you can feed it into Databricks, or you can use any analytical tool you choose. So let's do a quick recap here in terms of the capabilities we were talking about against the challenges that most organizations are facing. In terms of ingest, the universal connectivity and the change data capture capability take care of the challenges associated with heterogeneous data and with data latency. In terms of storing and processing, because the solution is platform-agnostic and supports multiple clouds, you can choose the cloud storage that best fits your needs. It also fully automates the entire data pipeline, from real-time data ingestion all the way to creating that analytics-ready data, so that you can work with real-time data without having to deploy an army of data scientists and data engineers to convert that data. In terms of change propagation, it provides the reliability that gives you a transactionally consistent view of the data, supported through Databricks, so that you can use that data with confidence. From the management and governance perspective, the catalog provides the metadata, persists history for end-to-end lineage, and provides the governance features. And from the data consumption perspective, the data marketplace gives data consumers the ability to search, evaluate, and then self-provision that data. With that, I'm going to pass it on to Newman so that he can talk about some of the capabilities that Databricks brings to the table. Great, thank you. Thank you everyone. Thank you for joining us today. I'm Newman Fahar. I'm the partner solutions architect who's been working with the Qlik teams to help integrate the two products together. With that, if you can go to the next slide, please, I'll talk a little bit about what Databricks does as a company and where our roots are, and then I'll speak to what problems we solve.
Our aim is really to make it easy for customers to become data-driven companies and data-driven organizations. And we do that by bringing together the personas of data engineers, data scientists, and analysts on a single unified platform that's cloud-native, that's cloud-agnostic, that's easy to use, and that can scale up to virtually unlimited scale as the needs of the business grow. We have over 5,000 customers and hundreds of partners, including Qlik, which is a highly valued partner for the purposes of data ingest and BI as well. We are also the company behind the popular Apache Spark open source project, which has become the de facto data engineering framework in the industry, and for machine learning as well. I'll speak a little bit more about Delta Lake in more detail. We are also the creators of Delta, which allows you to build an open, cloud-native data lake on your cloud of choice. We are also behind the MLflow project, which allows customers to manage the machine learning lifecycle, as Ritu was explaining. More recently, we've brought Redash into our purview, which is essentially a SQL workbench with basic dashboarding capability, to make it easier for SQL-oriented personas to do SQL workloads very easily on Databricks against the data lake. Next slide, please. What problems do we solve? As we talked to customers who are trying to do large-scale machine learning and analytical projects, we found that they really struggle with four key problems. One is the problem that Ritu was describing: data tends to be siloed and messy. It tends to reside in all kinds of places in the org, and bringing it out of those silos into a reliable central place to do analytics has been a challenge. Number two, everybody wants to do machine learning, because companies have realized it gives them an edge in the market, helps them build new revenue and new products, and allows them to become more efficient, but it's hard. It's historically been hard, and I'll talk about how we make it easy. On the BI front, what we're increasingly seeing is that customers who build data lakes want to do SQL-oriented and BI workloads on the entire data lake, as opposed to being limited to a fraction or subset of the lake. I'll speak a little bit about how we're solving that problem. And of course productionizing it all, which is number four, has always been hard, and I'll speak a little bit about how we help out there as well. Next slide, please. Coming to the problem of breaking down the silos, this is where the Qlik integrations are absolutely key and critical for us, because what we see is that customers have data in various systems, right? Be it relational databases, be it mainframes, be it ERP systems, be it cloud CRM systems. And to really derive any value in a sort of 360-oriented way from those data sets, customers have been asking: hey, I want to bring it into a reliable lake where I can do my machine learning and analytical workloads, and where I can do it with confidence, right? If you start doing these things at large scale, right?
If you imagine that you're ingesting 100 million records and something goes wrong at the 50 millionth record, then how do you, number one, find out that something went wrong in your ingestion pipeline, and secondly, how do you actually roll it back to ensure that you don't end up with a half-ingested or corrupted data set in your lake, and end up reporting on something or doing machine learning on something that's technically not reliable? That's where Delta helps, and that's where the integration between Qlik and Delta brings that value together: customers can break down those silos, bring those data sets in and replicate them very, very easily via a product that allows them to do so visually, and then, more importantly, trust the data that's in the lake. Because with machine learning it's garbage in, garbage out, right? If the model takes in data that is not trusted or that is not high quality, then the model will essentially make predictions that are wrong or not reliable. So it's absolutely critical to have reliable data in the lake, and that's where the Qlik integration with Delta is extremely important. And then the second aspect is scaling: when you start doing this at multi-terabyte, tens-of-terabytes, or petabyte scale, your lake has to perform from an engine perspective, and that's the second aspect of Delta. Delta is not only the open format that allows you to build a transactional lake on cloud storage natively; it's also the engine that allows you to run queries and do analytical and machine learning workloads on that data at scale. Next slide, please. Now, once you have the data inside the lake, and you've used Qlik to get it into the lake, the question is: how do I actually do machine learning, right? Historically, we've seen customers complain that, hey, my data engineers who munge the data and my machine learning engineers who actually build the model tend to be siloed, right? There tends to be a wall between them. How do you bring these two personas together? And that's one of the key value props that Databricks brings to the table. Once you have the data in the lake, the engineers and scientists can work off of the same data sets and iterate together without being siloed. We have really great sharing capabilities built into the platform. And equally importantly, we ship all of the major machine learning frameworks that you find in the market. The machine learning market, from a framework perspective, is fairly fragmented. Everybody has their favorite tool. Some people like Spark MLlib, some like scikit-learn, Keras, PyTorch, and so on. We've packaged all of these frameworks as essentially managed clusters on Databricks, so that you can use your favorite framework, tightly integrated with Delta, and build your model. Not only can you build it, you can track the lifecycle and the lineage of it, and then also deploy it out at scale, and that entire lifecycle is managed by MLflow, which is tightly integrated with the platform and available as a managed service. That's how we make machine learning easy. Next slide, please. Coming to the BI aspect, what we've increasingly seen is that when customers have built these data lakes, they want to do not only machine learning on them, but also SQL-oriented workloads. Analysts want to come in and be able to do exploratory analytics and BI on that data.
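The "roll it back" question above is exactly what Delta's transaction log answers. Here is a minimal sketch, assuming a Databricks-style environment where `spark` is already defined; the events table and its path are hypothetical, while DESCRIBE HISTORY, versionAsOf, and RESTORE are standard Delta Lake features.

```python
# A minimal sketch of Delta's transactional guarantees, using a hypothetical
# table path. An append either commits fully or not at all, so a failed
# ingestion job cannot leave half-written data behind.
df = spark.range(1_000).withColumnRenamed("id", "event_id")
df.write.format("delta").mode("append").save("/mnt/lake/events")

# Every commit is recorded, and earlier versions stay queryable ("time travel"):
spark.sql("DESCRIBE HISTORY delta.`/mnt/lake/events`").show()
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/mnt/lake/events")

# And if a bad batch does get committed, the table can be rolled back:
spark.sql("RESTORE TABLE delta.`/mnt/lake/events` TO VERSION AS OF 0")
```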
Instead of always having to move curated pieces of that data from the lake into a warehouse, customers have been saying: why don't you allow me to have a first-class SQL experience on the lake itself? Part of that strategy is what we are calling our lakehouse pattern, where we are investing in cluster types and engines that will make SQL, and BI more broadly, an absolute first-class citizen in the lake. We are also working very closely with our BI partners like Qlik to ensure that the SQL- or analyst-oriented persona can have a great BI experience on the lake itself, and hence not be limited to a fraction of the data, but be able to do BI on the entire lake. You will see announcements come out this month specifically targeted towards this persona. Moving on to the next slide: of course, productionizing it all is always a challenge, right? The cloud gives immense power, immense compute, storage, and network power. It is virtually unlimited, regardless of which cloud you are on, AWS or Azure. And the question is, how do you productionize these workloads on the underlying cloud? How do you ensure that they are secure? How do you take advantage of the elasticity of the cloud without having to become a DevOps guru? That is a key part of what Databricks brings to the table, because we abstract out all of those cloud concepts, the notion of deriving compute capacity from the cloud, or storage capacity, or network capacity. We just abstract all of that out and run it for you, so you can focus on your analytical and machine learning needs while we focus on scaling out the underlying system for you, and we won't force you to become a big data DevOps guru just for the sake of doing machine learning or analytical workloads on the underlying cloud. So next slide, please. Kind of bringing it all together, the key thing to keep in mind, starting at the bottom, is that the value of Delta is that you remain in full ownership of your data. Delta as a format is 100% open, and your Delta Lake datasets reside in your cloud storage, your S3 account or your ADLS or blob account on Azure. And it is transactional, so it allows you to do data engineering in a transactional manner at scale, and hence the data can be trusted at very large scale. It also comes with a high-performance engine which acts on those datasets natively on cloud storage, so your analytical, data engineering, and machine learning workloads can scale out to virtually unlimited levels. And that engine then serves the personas at the top: whether you're an analyst, a data engineer, or a data scientist, you can derive value in an abstracted way from all of the power that's underneath the covers and focus on your use cases in a reliable way. So with that, I'll hand it off to Joe, who will speak about how JB Hunt is using the two products together. Thank you very much. Is my screen okay? Okay. So my name is Joe Spinelli, director of data empowerment at JB Hunt. I came to JB Hunt last year, and I lead the BI, data engineering, and Google analytics organizations. Before that I spent 10 years at Tempur-Sealy in ERP implementations, business intelligence, and data management. Before that I was 14 years at Toyota Motor Manufacturing, where I worked primarily in supply chain systems. So, a long history of supply chain and IT systems here. Just a disclaimer: I'm not a professional spokesperson, and I don't represent the company. These are just my opinions, so don't go buy or sell stock.
I'm not qualified to make any forward-looking statements. I'm only really qualified to tell you about the great experience I've had here at JB Hunt. I'm also qualified to tell you about my amazing chocolate cake recipe, but I don't think you tuned in for that, so we'll stick to the tech. Thank you all for coming today. Really, I'm here to help those of you who are trying to see if these technologies can potentially help you with the challenges you're facing. I've likely been in your position looking for solutions, and so I just want to give back a little bit and show how this has helped me and how it could potentially help you, or at least feed your thoughts as you're working through these problems. For those of you who aren't familiar with JB Hunt, we are a transportation logistics company, and our mission is to create the most efficient transportation network in North America. We have service offerings including transportation of full-truckload, containerizable freight. We also have arrangements with most of the major North American rail carriers to transport truckload freight in containers and trailers. And really, our ability to offer multiple services, including all four of our business segments and a full complement of logistics services through third parties, is what gives us our competitive advantage. But as you can imagine, having that large and diverse a business comes with a lot of data needs, so let me dive into some of the things that we needed here at JB Hunt. Some of these will likely be very similar to what you're seeing at your organizations, and we can talk about how we used Qlik and Databricks to help bridge these needs and these gaps. As you look through this slide, you'll see several of the things that were mentioned before, but you'll also see some other things. Data science, of course, is a key department at JB Hunt, where we have data scientists working to create machine learning and AI models to help drive business value, and they swear up and down by Databricks. I think if I tried to take Databricks away from them, that would cause a revolt. So for that reason alone, I would say you've got a lot of very happy data scientists who use Databricks. We've found that not only does it really help speed their ability to create and train models, but Databricks also helps solve a number of other use cases that we have as well. So as you look across the top, you'll see EDI, asset telemetry, and analytics. EDI, most of you are familiar with that kind of format, where the data is coming in, usually in a streaming fashion. For most people who are using EDI, it's just coming in, going into a system, and being processed by that system. But the other thing we found is that we have such a volume of EDI that we really needed visibility into the problems that were happening. So if we expected to receive EDI and didn't receive it, or if there were problems with the transmission, we needed to be able to look into that quickly, identify those things quickly, and respond. And that's where Databricks comes in: it allows us to pull that data in, draw structured information out of that unstructured data, and provide it on dashboards to folks so that they can see in near real time what's going on. And I point that out because a lot of the solutions out there are purely for structured data, and you will have needs for unstructured data.
EDI is a good example of that. Asset telemetry is another good example. For our business, we get telemetry data off of our assets, our trailers and things like that. This is raw data coming out of an event hub that lands in Databricks, and we're able to basically immediately plot it and use it either in reports or in visuals, and we're able to do that thanks to Databricks. So when you have needs that include both structured data and unstructured data, that's been one of the key things we've found with Databricks: it can handle both, and not all the solutions out there on the market can. So when you're picking a solution, keep that in mind. Another thing, of course, is analytics: we have a real need for real-time analytics. I'll talk about that in detail in some of the upcoming slides, as well as the ability for applications to get information on what's happening inside the database without necessarily putting additional load on the production database. That particular use case is where Qlik came to the rescue, and we'll talk about how that architecture works and how we solved that use case as well. And of course there's much, much more. These are probably the biggest cases. I'm sure you're thinking: if I had real-time or near-real-time data available, in all the structured formats that I've got and all the unstructured formats that I've got, what could I do with it? As you can imagine, that's just incredibly powerful. So let me go to the next slide here. Talking about data science in particular, this speaks to why we created a cloud data lake to begin with. When you're working on data science and you're trying to create ML code, the ML code itself is not really the issue. When it comes to creating that system, getting the code stood up is not the hardest part. It's everything around it: collecting the data, verifying the data, managing the machines, the infrastructure, the tools, things like that. Just as Newman said in his segment, they abstract a lot of that out for you. That's exactly what happens with Databricks. And that is what led us to a cloud data lake: we wanted to get to value quickly. We wanted to be able to get our data scientists into a system creating value quickly, and we really didn't want to build all the infrastructure that was needed around this. That's again where Databricks came in and really provided what was needed. The other part of that equation, of course, is Qlik, and I'll talk to that now. So, we know that Databricks can handle structured and unstructured data, but how do we get it in there? The first step is having ingestion flexibility. It's good to have a cloud data lake for things like data science or solving multiple use cases, but you have to get the data in there to begin with. In our case, we had data in IBM DB2 on the mainframe. We had it in SQL Server. Of course we had the IoT data like the telemetry, and the EDI data; there were lots of different numbers and types of sources that had to be handled. And Qlik, we found, had one of the most flexible and diverse sets of ingestion capabilities, as well as the ability to get data into Databricks in very good time, right? So when we wanted to get that data ingested, one of the keys is it couldn't take forever to come in. It had to come in right away.
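For illustration, here is a minimal sketch of landing raw event-hub telemetry in Delta with Spark Structured Streaming. It assumes the hub is read through its Kafka-compatible endpoint; the namespace, topic, and paths are hypothetical, and authentication options are omitted.

```python
# A minimal sketch: stream raw telemetry into a Delta "raw" table.
# Namespace, topic, and paths are hypothetical; SASL authentication
# options for the Kafka-compatible endpoint are omitted for brevity.
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "my-eventhub-ns.servicebus.windows.net:9093")
    .option("subscribe", "asset-telemetry")
    .load()
)

# Land the payload as-is; parsing into structured columns happens downstream,
# so reports and visuals can read the table within moments of arrival.
(
    raw.selectExpr("CAST(value AS STRING) AS payload", "timestamp")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/lake/_checkpoints/asset_telemetry")
    .start("/mnt/lake/raw/asset_telemetry")
)
```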
And so that's where Qlik came in to help us out: we were able to get the lion's share of our data into that lake with near-real-time results. That's why the combination of the two has been really fantastic for us. In terms of the repository, again, you have to have that flexibility as well. You have to be able to, in our case, store blob data, event hub data, really all types of unstructured data, as well as all your traditional structured data. That has to be supported. So when you're looking at the type of big data or cloud data lake solution that you need, flexibility is key. If you're limited to one or the other, you'll find yourself potentially cobbling together solutions from a number of sources, and we really wanted to keep that to a minimum. We wanted to keep the number of technologies we're using to a minimum and use best in breed and best in class, and that's where Databricks came in. From there, we needed automation. As we talked about earlier, we didn't have the ability to just infinitely augment our staff to take on this project, and I suspect that most of you do not either. So it became vitally important for us to do this in a way that was very light and very easy for our staff, and Qlik Replicate is the tool that allowed us to ingest and replicate all those sorts of data into Databricks in a way that was very efficient for our team. The other way to do it, which you're probably familiar with, is to try to build all that ETL yourself. I wish these tools had existed throughout my career; it would have saved me a ton of time. Unfortunately they did not, but we've got them now, so we're using them, and it's driving our time to value. Monitoring was another key component. Again, going back to that issue of staffing, we really didn't have the ability to build up a huge monitoring staff around these processes. We needed to be able to go in and easily see what was happening with our ingestion and our repository, and these tools allow us to do that. It's very quick and very easy, using the dashboards that are available, to see what's going on, to find and flag problems, and to get them addressed. Moving down to here: this chart is probably painfully familiar to a lot of you, so I won't spend a lot of time on it. This is the simplified version of the old way of doing things, the legacy way, which is very similar to what I've seen at many corporations and companies: you've got all of your operational apps and things like that operating happily in the production space, you've got ETL jobs that bring them into a staging area, and from there the data is loaded into a reporting warehouse, where you can then visualize it for the folks who need data insights. This is your traditional bolt-on reporting mechanism. As you can see in the left-center portion, a lot of time is taken in the process of getting from when a transaction happens to when it can be reported on. This is probably painfully familiar to many of you, as it has been to me. It's very time-intensive, extremely manual, takes a lot of work to do, and frankly just didn't work the way we really needed it to. So this is a simplified diagram of how things are happening in our future state, our current state. Here we have a number of types of data sources on the left that I've mentioned before, from your ping data and your asset telemetry to your SQL Server instances that are in a PaaS setup, like your Dynamics and things like that, as well as
your on-prem operational DBs, which are all structured. We use Qlik Replicate for the vast lion's share of everything that we do, and that puts things into both Databricks here as well as our hyperscale data store. What that allows us to do, if you read left to right here on the Databricks section, is take that raw data and make it available for both the data scientists up here on top as well as the data insights users down here on the bottom. So when you talk about the major use cases for Databricks, by and large we're using it for data science, and because of the real-time nature of Qlik Replicate and the open-source structure of Databricks, we're able to make that data available extremely quickly. The ingestion process probably takes one to two minutes, so there's about a one-to-two-minute lag between when things happen in production and when we're able to visualize them in the data lake. That's fast. From there, it can be used in any data models or in any reporting that you have. That, of course, speaks to the efficiency of the whole thing. If you can imagine, we're not adding tons of staff; we're putting this in place, we're putting that value out, and we're doing it without having to build a ton of infrastructure. So that's really our time to value. The results continuously update; there's really only a minute or two of lag between when the transaction happens and when we see it here in the lake, ready to be used. It's highly automated; it really doesn't take tons of people doing tons of ETL. And it's secured. From there, I'll go ahead and pass it back to Ritu for her comments. Thank you, Joe, this was very helpful. One thing that you saw in the slide Joe was sharing. Sorry, the share came off of mine. Just go to the next slide, please. Thank you. So one thing that Joe was showing in that previous slide is that they have a data lake with Databricks, but they also have a cloud data warehouse, which is where they are sending data for more traditional BI-type reporting, right? And that's the reality in most organizations at this point, especially the larger ones. Like JB Hunt, more and more companies have multiple architectures: they have cloud data warehouses, they have data lakes, and they are also streaming data to their applications so that they can do real-time analysis, reach out to their customers, and make operational decisions based on that insight very quickly. Today I focused mostly on the managed data lake creation aspect with Databricks, but that said, we do support those different architectures. If you have a cloud data warehouse, we automate the data ingestion and the creation of that data warehouse, so that you can have that low-latency, real-time view of your data within your data warehouse. We also stream data to whichever application or service you subscribe to, so that we can feed those applications with that real-time data. Joe, can you move to the next slide? Sure.
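As a small illustration of what "available a minute or two later" looks like on the consumption side, here is a hedged sketch of a query against a replicated table; the table and column names (lake.orders_raw, updated_at) are hypothetical.

```python
# A minimal sketch of consuming freshly replicated data in the lake.
# `lake.orders_raw` and its columns are hypothetical names for a table
# kept current by the replication pipeline.
recent = spark.sql("""
    SELECT order_id, status, updated_at
    FROM lake.orders_raw
    WHERE updated_at >= current_timestamp() - INTERVAL 15 MINUTES
""")
recent.show()  # rows reflect source transactions from a minute or two ago
```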
So the last point I wanted to make in this presentation: there are five key principles that I consider when it comes to getting performant data lakes, or getting the most value out of your data lake data, and those are listed here. I'll go over them quickly. You need to accelerate your data delivery into your AI models, your machine learning, or your real-time analytics. The need for information and responsiveness is so great that you can't afford to do batch uploads; you need that change data capture capability to convert your data into fast data. So definitely focus on accelerating data delivery; make sure that is there. Automation goes hand in hand with that: you cannot accelerate the delivery of your data to your analytics models unless you automate your data pipeline all the way from ingestion to the creation of that analytics-ready data. So make sure that you invest in automation, because without it, no number of data science or data engineering resources can take care of those needs and allow you to leverage all the data within your data lake. A catalog is critical, because your data consumers need to understand the data that is in the data lake; they need to find that data, and they need to be able to trust that data, and a catalog provides that capability. You also need to make sure that your data consumers have the ability to curate that data themselves, and to prepare and self-provision it themselves. Having a data marketplace capability is extremely critical, because then not only can they use the metadata to search for and find data, they can also use the data marketplace to shop for that data, prepare it, and get it into the shape that they need. And then, future-proof. Joe was talking about flexibility, right? You are more than likely going to add more data sources, and there are going to be new types of data that you want to ingest into your data lake. You want a solution that can accommodate those kinds of changes in the future, and you want a solution that is platform-independent, that can allow you to deal with changes that may come from mergers and acquisitions or just a change in your technical direction. So when you are looking for a solution, look for something that has that platform-independent, universal-connectivity type of capability, so that it can adapt to changing data architecture requirements. Those are my five key takeaways. We can go to the next slide, please. On this slide I don't have anything to say; the only point I would make is that we have a happy customer, and if you have any questions, reach out to us, we're happy to help. That said, I'm going to open it up for questions. If you have any questions, feel free to ask. Shannon, can you make sure that people are unmuted? We don't unmute, but thank you so much for this great presentation. Just to answer the most commonly asked question: a reminder that I will send a follow-up email to all registrants by Thursday with links to the slides and the recording of this session, and if you have questions for our panelists, feel free to send them in via the Q&A section in the bottom right-hand corner. So, diving in here: does Databricks support connecting to standalone Kafka streaming clusters? Newman, I'll turn that over to you. Oh, you're muted, sorry. Yes, I was muted. So I think the question was about Kafka, right? Do we support connecting to Kafka clusters? Yes, absolutely we do. We ship with a Kafka
connector for Spark inside the Azure Databricks platform, and that can connect to any Kafka endpoint as long as there's network connectivity between the Databricks cluster and that Kafka endpoint. I love it. So, a question about the architecture in slide 22, and I know you have control of the slides currently if you can share: why not have the Databricks stream processors that write to bronze, silver, and gold be arranged serially between them, and write in parallel to the data stores? Sorry, can you repeat the question? Why are the processors writing to bronze, silver, and gold arranged serially between them and writing in parallel to the data stores? Why do you have Databricks processing there? So in this case, an example of this type of thing is EDI. We bring the raw JSON data into the raw, the bronze layer, but we're not consuming every bit of that EDI, necessarily. A lot of the time EDI will have irrelevant information, so the processing that happens inside of Databricks in that case is really just pulling the data that we need for our company and leaving the rest out. Newman, anything you want to add to that? Yeah, it's one of the reference architectures we have for Delta, the whole raw-bronze-silver-gold pipeline concept, because, as Joe was saying, there's a good chance that the data as it lands for the first time in your lake will need to be massaged, aggregated, and cleaned up for the ultimate downstream use cases, such as machine learning or BI analytics. Hence it's very, very common for us to see customers take a pipeline that reads from the raw landing zone and then refine the data into further downstream zones for downstream use cases. Yeah, and sorry, it's not something you necessarily have to do, right? You've got the flexibility to land the data in the zone that you need to land it in and then go from there, so just take this as a representation of the types of things you can do. And I'm going to slip in one more quick question here: any specific reason to use Qlik replication instead of open source connectors, which can stream CDC data via Kafka? Yeah, from my perspective, because we wanted one tool that, in an automated way, did the lion's share of our work for us. We could of course try to cobble together open source tools to make the same thing happen, but again, it comes down to how many people you have to do this, and how much time you have to do this. If you have the staff and the time, I suppose you could do that. For us, that really wasn't where the value was. The value for us was getting a proven solution that could get this into our lake in near real time, do the lion's share of our work, and have that monitoring in place. So for us, that's where the value was. I'm sorry, Ritu, go ahead. No, you answered it perfectly. I love it. Well, again, thank you for this fabulous case study; it's been very insightful and helpful, and thanks to all of our attendees for being so engaged in everything we do. But I'm afraid that is all the time we have for this webinar. Again, just a reminder, I will send a follow-up email to all registrants by end of day Thursday with links to the slides and the recording, and I hope you all have a great day. Thanks, everybody. Thanks to Qlik for sponsoring today's webinar,
and enjoy. Stay safe out there. Thanks, guys.
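To make the bronze-to-silver processing discussed in the Q&A concrete, here is a minimal sketch of pulling only the needed fields out of raw EDI JSON, as Joe describes, and leaving the rest behind. The paths and field names (shipmentId, status, eta) are hypothetical.

```python
# A minimal sketch of the bronze-to-silver step from the Q&A: extract only
# the fields of interest from raw EDI JSON and leave the rest behind.
# Paths and field names are hypothetical.
from pyspark.sql.functions import col, get_json_object

bronze = spark.read.format("delta").load("/mnt/lake/bronze/edi_raw")

silver = bronze.select(
    get_json_object(col("payload"), "$.shipmentId").alias("shipment_id"),
    get_json_object(col("payload"), "$.status").alias("status"),
    get_json_object(col("payload"), "$.eta").alias("eta"),
)

silver.write.format("delta").mode("append").save("/mnt/lake/silver/edi_shipments")
```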