And here we go. Hello and welcome, my name is Shannon Kemp and I'm the Chief Digital Manager of DataVersity. We'd like to thank you for joining the DataVersity webinar, Achieving Top Efficiency in Cloud Big Data Operations, sponsored today by Unravel. Just a couple of points to get us started. Due to the large number of people attending these sessions, you will be muted during the webinar. For questions, we will be collecting them via the Q&A in the bottom right-hand corner of your screen, or if you'd like to tweet, we encourage you to share highlights or questions via Twitter using hashtag #DataVersity. And if you'd like to chat with us or with each other, we certainly encourage you to do so; just click the chat icon in the bottom middle of your screen. If you'd like to continue the conversation after the webinar, you can keep networking at community.dataversity.net. And as always, we will send a follow-up email within two business days containing links to the slides, the recording of the session, and additional information requested throughout the webinar.

Now let me introduce our speaker for today, Sandeep Uttamchandani. Sandeep is the Chief Data Officer and VP of Engineering at Unravel Data. He brings together two decades of experience across Intuit, IBM, and VMware, both in building enterprise software and in running petabyte-scale data platforms for analytics and artificial intelligence. He was the Chief Data Architect at Intuit, where he led the transformation of the petabyte-scale data and AI platform used by the $3 billion QuickBooks product line for data and machine learning products and business process analytics. Sandeep has 42 issued patents, 25 conference publications, and is the author of an upcoming book on building self-serve data platforms. And with that, I will give the floor to Sandeep to get today's webinar started.

Hello and welcome. Awesome, thank you so much, Shannon. Can you guys hear me? Yep, you are there. Fantastic. So thanks for the introduction. Folks, I'm really excited to talk to you about the webinar topic, which is achieving top efficiency in cloud data operations. One of the things I'll be doing in this talk is discussing big data operations in general as well, with an emphasis on cloud. Every enterprise today is data rich; we are sitting on petabytes of data, and the goal really is: how do you transform this data into insights? To do that, you build pipelines that transform the data in various shapes and forms to create those insights. Now, when you think of insights, there's a large variety of insights out there, and I'll be sharing some of my experiences from running large platforms for analytics and ML. Insights fall into four buckets. The first is retrospective: what's happening in my business? Most of the time is spent there, just understanding what's happening with the business. The next level up is why it's happening: sales are down, is the marketing campaign doing well, what exactly is the correlation? This involves a lot of looking at data, but also creating hypotheses and validating them. And most organizations are now also moving toward what's referred to as predictive and prescriptive analytics and insights. This is where the modeling aspects come in: what's going to happen, and how exactly should I be dealing with those scenarios?
And just going back to the specific times we are in, with the pandemic, some of these aspects, shifts in the business, what the data should look like, how we should better align the product positioning, and so on, become very alive and very real, especially given the uncertainty. Now, on this whole area of creating insights, there are various studies out there showing that this is becoming the new normal, the new differentiator: studies on retaining customers, profitability, acquiring customers. Becoming data driven is the key in pretty much every industry out there.

So with this perspective, let's look into what it means to extract and really make sense of the data. When you think about the process of taking raw data, data coming from your transactional sources, clickstream, sales, marketing, and so on, and converting it into tangible insights in the form of dashboards, models, and processed data that then serve applications, you're looking at a pipeline. In a logical view, think of a pipeline very simplistically as a series of transformations. When I was at Intuit running pipelines, we had thousands of pipelines running every day, doing transformations of this form and creating models that shipped in the product, as well as business-critical dashboards that pretty much ran the business. And these pipelines can be humongous: on average, our pipelines had anywhere from 12 to 20 transformations between the raw data and the final insight.

Now, that's the logical view. If you look at the same thing from a physical standpoint, a physical view of a pipeline looks like this: the data streams, the different sources of data, go into what is referred to as a fabric. You need a fabric to collect and store. Once you have the data collected, you're going to analyze it. So the data could be sitting in, say, S3 or your object storage, and you'll be running query engines on top: something like Hive, Spark, Hive on Tez, and so on. Or the store-and-analyze piece can be a NoSQL store: the data is sitting in Cassandra or a similar system. And eventually there is a serving layer of some sort which serves the information out. In the case of machine learning, the fabric will look similar; when it comes to the serving part, you would have things like SageMaker or TensorFlow Extended to deploy and host the models you're trying to serve. So this is a rough skeleton that pretty much applies to any data pipeline out there. And in a pipeline, these store-and-analyze stages can be multiple; that's the recursive part.

Now, the key point I wanted to bring up here is that we are living in a phase of the big data landscape where there is no one size fits all. What I mean by that is: just look at the different choices that need to be made. At the storage level, is it an object store or a file system? What data formats? Should I be running on bare metal or containers? From a processing standpoint, is it a Lambda or Kappa architecture, or pure streaming? Which schedulers? Which metastores? Should I be focusing on SQL-based processing or big data programming, and with what libraries and packages? Relational versus NoSQL? This is just a very small sample.
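To ground that "series of transformations" skeleton, here is a minimal sketch, assuming PySpark and data landing in S3; the bucket, paths, and column names are invented for illustration, and each stage is exactly the kind of place where those technology choices show up:

```python
# Minimal sketch of the logical pipeline view: a chain of
# transformations from raw data to a servable insight.
# PySpark is assumed; all paths and columns are illustrative.
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("raw-to-insight").getOrCreate()

def ingest(path: str) -> DataFrame:
    # Collect: raw events landed in object storage
    return spark.read.parquet(path)

def clean(df: DataFrame) -> DataFrame:
    # Prep: drop obvious junk and deduplicate
    return df.dropna(subset=["user_id"]).dropDuplicates(["event_id"])

def aggregate(df: DataFrame) -> DataFrame:
    # One of the 12-20 transformations a real pipeline would chain
    return df.groupBy("region").agg(F.countDistinct("user_id").alias("users"))

# Operationalize: chain the stages; production runs this on a schedule
raw = ingest("s3://my-bucket/clickstream/")          # hypothetical path
insight = aggregate(clean(raw))
insight.write.mode("overwrite").parquet("s3://my-bucket/serving/daily_users/")
```

A real pipeline could swap any of these stages for a different engine (Hive on Tez, a NoSQL store, a streaming job) without changing the skeleton, which is the point of the logical view.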
And this complexity is reflected in pretty much every architecture there is. If you think about the Azure world, you can again see the plethora of choices in how you would build your pipelines. The same with AWS, in terms of batch, stream, the number of serving layers, and the number of different data stores available for building out the pipelines. GCP, no different. Over the years, the evolution of the big data platform has been such that the fundamental building blocks are more or less similar, whether on-prem or in the cloud, and for each building block there are technology choices that need to be made.

What this really leads to is today's ground reality: the process of generating insights is fairly complex. On average, an organization will be running hundreds or thousands of daily pipelines, and when these pipelines run, there are all kinds of issues you will encounter. Just to name a few: missed SLAs are extremely common; you get alerts and are constantly paged to keep the SLAs for model refreshes, for dashboards, for data becoming available to the folks supporting customers, customer success, and so on. Incorrect data is the other aspect: if something goes wrong, if some application has a bug, what comes out on the other side is incorrect data, and debugging that is extremely complex. There are also scenarios like runaway and hung processing, and resource contention. And cost is becoming another huge concern, especially in the cloud. We'll double-click on some of these aspects later in the presentation, but the key point is: this is difficult. It is not trivial to run production-scale pipelines, which is critical for the journey from data to insights. In fact, there are various studies that highlight this, with a big percentage, 85%, of projects failing. And speaking from experience, it is very easy to get the initial hypothesis in place, but by the time you productionize it, you really have to ask: do we have the right mechanisms in place to run this actively in production?

So I'm going to change gears now and start with this fundamental question: are your data operations efficient? This is the number one question each of us has to answer for our own deployment. There is no single silver bullet, and there is no single requirement. You could be running real-time processing or, at the other extreme, pure batch processing. You may have data that is extremely sensitive to quality and errors, or you may be looking at broader trends, in which case individual errors may just cancel each other out. Every situation is different. Now, when you think about this question, the way I look at the problem is as a journey-map view: how much time does it take to go from raw data to generated insights? There are various ways you can measure it; while running big data operations we struggled with complex metrics, but I finally boiled it down to one, which is: what is the time to insights?
How much time does it take for me to go from data to really building out the insight? When you think of this journey from a data user's standpoint, it's four fundamental steps. How do I discover the data? How do I prep it, in terms of wrangling and so on? How do I build the processing: my queries, my logic? And how do I operationalize it?

Now, there's more to this. Today, this process is a tango between the data users, the analysts and scientists, on one side, and data engineering and platform IT on the other. For every aspect of the journey map, there is a corresponding aspect that is heavily driven by data engineering. When you think of discover, discovery is obviously based on collecting: how am I ingesting data, how am I ingesting from new sources, how am I keeping it up to date, and so on. When you think about prep, prep means wrangling the data, but there are fundamental questions underneath: how do you store it? How do you comply? What are the namespaces in the lake, and how is the lake organized? What is the authentication and authorization model? So you get the picture; this is why I refer to it as a tango, a dance between the data users, meaning the analysts and scientists, and the data engineering and platform IT teams.

Now, the way things are evolving, the need to become data driven is getting deeper and deeper, so there are a lot more data users in the organization. Every organization has a mantra of becoming data driven. What that means is you may be a product manager, a marketer, an exec; you have to make decisions using data, which means you'll be extracting insights. So what is becoming super critical is for these pipelines to also become self-service. And this is an interesting representation of the ground realities. Think of what data scientists and data analysts do today: they have a lot of competency in looking at data. The data engineers have very strong competency in distributed systems, data pipelines, and advanced programming. And in between there's what I refer to as the black hole. The reason it's a black hole is that there are issues that are not really solvable by either side. A data engineer does not understand the business logic; they can look at an ETL and say, yes, this ETL seems to be running fine, it finished processing, but they cannot really comment on whether the ETL logic is correct. Whereas a data scientist can tell you whether the logic makes sense, but may not have the skill set to judge whether it is scaling correctly or written to handle large volumes of data. So that's what gets us into the middle bucket, and most organizations are realizing this. In the past, the model was that the data users create an early model or ETL prototype and then throw it over the wall to the data engineering side, and data engineering would spend months making sense of it and eventually productionizing it. What we discovered is that's not the right model: not scalable, introduces errors, not flexible. So the key push today is really toward creating that self-service view.
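As a concrete way to hold on to this journey-map idea (and the scorecard he builds on it shortly), here is a toy sketch; the stage names follow the talk, but every number is invented:

```python
# Toy sketch: "time to insights" as a summation over the journey map.
# Stage names follow the talk; the days and targets are invented.
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    days: float          # observed median time spent in this stage
    target_days: float   # what "green" would look like for your org

journey = [
    Stage("discover", 20, 5),
    Stage("prep", 10, 4),
    Stage("build", 15, 10),
    Stage("operationalize", 30, 7),
]

print(f"time to insights: {sum(s.days for s in journey)} days")
for s in journey:
    status = "green" if s.days <= s.target_days else "red"
    print(f"{s.name:15s} {s.days:5.1f}d (target {s.target_days}d) -> {status}")
```

The point of the exercise is exactly what the talk describes: the reds tell you which part of the tango to automate next.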
Now, how do you create full-stack data scientists, full-stack data users? Another term out there is citizen data users, which is really focused on operationalization, not just of ML but of any analytics. That is a very important point when thinking about operational efficiency: not just how quickly we can go through the journey map from raw data to insights, but how effective it is for end users to accomplish the same journey. Are they getting stuck? Are they filing Jiras like, "I need this data set," or is that something they can go get themselves through automation? Or does every step require the tango between data engineering, the platform team, and the data users, which gets more and more difficult?

So that's the key. Now, one of the points I want to bring out here is that when I talk to folks leading these initiatives, they often think of self-service as zero or one: either you are self-serve or you're not. The way to think about self-serve, actually, the analogy here is self-driving cars. There are different levels of automation in the kinds of cars we drive, and level five is full automation, where the car just drives by itself, but we have graduated through the levels. In a very similar way, and this is a broader topic, when you think about self-serve data there are different levels. Can the pipeline be monitored; can I get complete visibility into my pipeline? Can that information be analyzed? Can pipelines optimize themselves? And are there automatic actions happening, be it data collection, data interpretation, data wrangling, and so on? That's an important viewpoint, because if you assume self-serve means going from zero to one in a single step, that will probably not happen. You have to think in terms of phases.

So, going back to how you measure time to insights, which is going back to the question: are your data operations efficient? How would you answer it? This is where the thing we found very useful, and something I actually deployed in my past roles, is to create a scorecard. Time to insights is basically a summation of the pieces, the tasks that need to happen in the journey map. For instance, the very first step is discovery. In discovery, how do I search data sets? Typically you have a business problem in mind, so you need to be able to search data sets and search for attributes. Once you're able to determine, okay, this is generally where the clickstream data sets are, then how do I interpret them, what do the different fields mean? So, time to interpret. The very next thing after that: do we have features created? Because if those attributes are available as features, I can plug them directly into a model rather than going through the whole process of curating them, taking care of outliers, the whole nine yards. Discovery also involves time to move data. Data today, in pretty much every organization, sits in silos. Your sales team has all the customer interactions, your marketing team is running different campaigns,
and the product team has different usage metrics it's collecting. Moving and combining all of that to get a 360-degree view is another part of the overall time to insights, and so on. I probably won't have time to walk through each and every metric here, but the key takeaway at this point is understanding where you stand with respect to the scorecard. One exercise we have done in the past, and it is very useful, is to treat this as a living dashboard: look at where most of the time is spent, where you are red, where you are orange or yellow, and where you are green. And the interesting thing here, folks, is that every enterprise is different. It really varies based on use cases, the data sets, the processes. To give you an example, I was talking to an enterprise that has a lot of data, but for them the data is a large number of small tables, whereas another organization has a small number of very large tables. Two very different scenarios altogether. The moment you think about a small number of large tables, interpreting and featurizing both become easier problems; it becomes more about how to wrangle those large data sets, because there will be more outliers. Whereas if you have a very large number of small tables, which is again very common, especially with acquisitions and with companies growing, you end up with these unruly data sets, and a lot of the emphasis moves to even how do I search the data. I may have everything under the sun, but guess what, nobody knows, and tribal knowledge does not scale beyond a certain point. So it's really about stepping back and looking at where your slowdown points are.

And just to put this further into context: we used to take each block quarter by quarter, and instead of trying to focus on everything, you focus on, okay, which is the next red, what's slowing me down, and then really double-click on it. What's the level of automation there, going back to the self-driving analogy? Where is the time being spent? What can we be doing? Can we even monitor it? Can we even provide basic analysis? Think that way and double down, because if you try to solve everything at the same time, I can guarantee you it's like boiling the ocean, and typically it gets difficult. So hopefully this gives you a grounding point for how to think about data operations. This is very holistic: you can think of it for the cloud, for on-prem, or for hybrid. These are the ground realities of how the business will measure data operations efficiency.

I want to briefly give you one or two more concrete examples here. Take time to interpret, for instance. What does that mean? It translates to a series of questions data users run into. What does the data represent logically? What is the meaning of the attributes? Who owns them?
What are the typical query engines used to access this data? Where is the data located, what format is it in, when was it last updated, and so on. These are all questions of time to interpret. And guess what: if you don't have mechanisms in place, this takes a humongous amount of time. I have seen projects take four to six months just to figure out where the data is and what it means, forget the fancy modeling part or the dashboard creation part; this is the basic building block. And within the context of time to interpret, one of the biggest questions people have is: what's the lineage? Where is this coming from? Who created it? Who's writing to it? Who's using it?

So say you pick one of these parameters; say time to interpret is a pain point within your organization. There are tools out there; this is where you leverage the ecosystem to your advantage. There are lots of open-source tools, from Netflix, from Lyft, and there's Apache Atlas. In fact, in my previous work at Intuit, I built SuperGlue, which is an open-source offering out there. And just to give an example, the whole idea is to automate lineage: how do you create a view of the processing? If I look at any specific dashboard, say my final daily global report, what are the tables it depends on? What are the dependencies? How are those processing jobs executing: have they failed, are they waiting, and so on? This was basically the view of our world. Having this notion of lineage, you can very quickly figure out, okay, this is where the job is failing. So I really encourage you, depending on your current state, to pick a parameter, really understand what is slowing you down, and then look for what's available in the ecosystem.

Another example: compliance. Compliance is front and center, especially with CCPA in California and GDPR, and I think these compliance rules and data regulations will come up pretty much across the world. You have clear tasks there: deleting from backups, managing user preferences, ensuring you're collecting the right data, discovering violations with PII, access within your use cases, levels of access restrictions. If you handle these manually, then again, time to compliance will be significantly high; you won't be able to launch your insights in a timely fashion or even keep up, because customer preferences change. A lot of places have adopted a very coarse-grained approach: the moment you give us your data, we will use it for a whole bunch of use cases. But over time, customers will expect more fine-grained scenarios: use it for giving me recommendations, but don't use it for marketing, or some permutation of that form. So again, the pattern is doubling down on your current state, understanding the tasks that apply, and then really looking for the right tools and frameworks.
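Going back to lineage for a moment, the walk he describes, from a late dashboard back to the failing upstream job, can be caricatured in a few lines. This is a toy sketch, not how SuperGlue or Atlas actually model lineage; real tools derive the graph automatically from query logs and schedulers, and all the names below are invented:

```python
# Toy lineage walk: the dashboard depends on tables, tables are
# produced by jobs; walk upstream to find what actually failed.
upstream = {
    "daily_global_report": ["sales_agg", "marketing_agg"],
    "sales_agg": ["raw_sales"],
    "marketing_agg": ["raw_campaigns"],
}
producing_job = {"sales_agg": "etl_sales", "marketing_agg": "etl_marketing"}
job_status = {"etl_sales": "succeeded", "etl_marketing": "failed"}

def find_failures(asset: str) -> list[str]:
    failures = []
    for table in upstream.get(asset, []):
        job = producing_job.get(table)
        if job and job_status.get(job) == "failed":
            failures.append(f"{table} (job {job})")
        failures += find_failures(table)   # recurse up the lineage
    return failures

print(find_failures("daily_global_report"))
# -> ['marketing_agg (job etl_marketing)']
```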
What I'd like to do in the rest of the talk is change gears and focus more on the operationalize piece. I'd love to talk about every bit of the journey, but in a one-hour session that is not possible; feel free to post questions and we'll be happy to share more. The reason I want to double-click on operationalization is that it is the most repetitive part of the pipeline. Where I see teams spending the majority of the time to insights is operationalization, especially if you're running thousands of pipelines where things are running out of SLAs, there are quality issues, there are hung-job issues; you spend the majority of your time here. So this is the number one area where I look for all greens: when I build out that dashboard, my approach is to optimize for the most common case, which is operationalization.

Now, to motivate this slide, I'll give you a real-world example. I've taken away all the details, but the example is very real. Imagine what you're seeing is a dashboard, and you all have dashboards of this form, or imagine it's the output of a model, which is either telling you what your gross new subscribers are or predicting gross new subscribers, or predicting campaign results. And in this graph you see a spike. The number one question that comes up is: is this real, or is this a pipeline issue? The reason it's very important to distinguish is that before the business gets excited, before the business says, wow, we are truly nailing it and our subscribers are increasing, you have to make sure this is not some kind of pipeline issue where some data was delayed, or the job ran twice and the aggregation happened the wrong way. Lots of things can happen. And that's the happy case, where the spike is there and you're trying to justify it. Then think about the more common case, where you see valleys: suddenly the campaign is completely drowning, it's gone to a very low number. What happened? Did we lose the customers? Did we suddenly become unpopular? Or, guess what, it could just be a data issue upstream, the job not running, or the job having a bug.

Now, the key, folks, is that there are lots and lots of things that can go wrong; I could talk all day about the different battle scars you run into. All the problems: bugs, bad joins and config settings, machine degradation, schedulers, data layouts, lots of things happen. There's also another aspect of this, which is data quality, and data quality has its own issues and its own causes. You may have source tables that got modified, some hard delete that happened, scenarios where time zone inconsistencies creep in, or duplicate records. In fact, duplicate records have been very notorious: the records coming into the tables have duplicates, which then wrongly inflate the calculated values on the other side. Referential integrity is another big domain of quality issues, where one database got updated and another did not. Take my credit card billing example: I got the credit card number, but my billing database never got updated, so the number is valid but I'm not charging the customer; things like that. Net-net, as you can imagine, operationalization gets extremely complex: looking at the logs, looking at the bugs, figuring out why the app failed.
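The two quality failure modes he just named, duplicates and referential integrity, are also the easiest to check mechanically. A minimal sketch in PySpark, assuming hypothetical credit_cards and billing tables registered in the catalog:

```python
# Minimal sketch of two of the quality checks described above.
# Table and column names are illustrative, not a real schema.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dq-checks").getOrCreate()
cards = spark.table("credit_cards")    # hypothetical tables
billing = spark.table("billing")

# 1. Duplicate records: the same card ingested twice silently
#    inflates every downstream aggregate.
n_dupes = cards.groupBy("card_id").count().filter("count > 1").count()
assert n_dupes == 0, f"{n_dupes} duplicated card_ids"

# 2. Referential integrity: a card with no billing row means the
#    number is valid but the customer is never charged.
n_orphans = cards.join(billing, "card_id", "left_anti").count()
assert n_orphans == 0, f"{n_orphans} cards with no billing record"
```

Running checks like these as a pipeline stage turns "is this spike real?" from a forensic exercise into an alert.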
So this is definitely one of the areas of time to insights that really impacts your data operations efficiency. The other aspect that is becoming very, very real for everyone is cloud cost management, and the reason I'd love to share some examples here is that when you're running these things in the cloud, the cost is linear. There have been instances where somebody spun up some very costly GPU processing over the weekend and we incurred a bill of like 100K in just two days. Or runaway queries: somebody issues a complex join, and guess what, it's joining this huge table with billions of rows, the processing is happening, and when the person issuing the query sees the bill, they're like, I didn't know it would be that expensive. Especially if you're leveraging things like BigQuery, or Athena on AWS, where it's all about data scanned: you're paying by the amount of data the query scans.

So, to wrap up here: all the way from SLAs, to making sure the results are correct, to cost, this is definitely a space that is extremely complex. Keeping quality out of the mix for now and focusing purely on time to root-cause analysis, time to tune queries to run correctly, both for SLAs and for resource consumption, and time for cost governance, ensuring we can validate the cost of each of these queries and the associated applications: these are complex computer science problems. There are different ways to ask, how do I make this self-serve? How do I truly automate, and make even end users able to tune their own queries, or able to debug what exactly is going on? But guess what, this is a difficult problem. And just to talk about Unravel in this context: Unravel has been on this mission, the company has been around for seven-plus years, focusing purely on how to really solve these aspects of the problem. In fact, one of our co-founders, Shivnath Babu, was originally a professor; he got his PhD from Stanford, and this is something he has done for pretty much his entire career, looking at problems in this space.

What I want to do for a few minutes, before opening up for questions, is give you a sense of what's possible in this space. There are solutions out there, there are open-source options out there; you often hear about things like Sparklens and Dr. Elephant. But truly solving the problem is what Unravel has focused on, and my goal here is purely to get across what's possible, especially as you consider things like tuning. Tuning is often trial and error; in fact, it's one of the most complex problems out there. Look at the number of parameters Spark or Hive or any big data query engine has; you really have to make the choices. If you run Spark on defaults, you really don't get the results. You truly have to look at mappers, reducers, container sizes, partitions, and so on. I'm pretty sure most of you on the call have seen this in some shape or form. It is not as trivial as just writing the query; it is the tuning aspects that go with it.
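As a flavor of what those knobs look like in practice, here is a hedged sketch of a PySpark session with a few of the parameters he names. The values are placeholders, not recommendations; the right settings depend on data volume, skew, and cluster shape, which is exactly why tuning is hard:

```python
# Illustrative only: the kinds of Spark knobs the talk refers to.
# Every value below is a placeholder, not a recommendation.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("tuned-job")
    .config("spark.executor.memory", "8g")          # container size
    .config("spark.executor.cores", "4")            # parallelism per executor
    .config("spark.executor.instances", "20")       # cluster footprint (and cost)
    .config("spark.sql.shuffle.partitions", "400")  # partitioning of shuffles
    .config("spark.memory.fraction", "0.6")         # execution vs. storage memory
    .getOrCreate()
)
```

Multiply a handful of knobs like these by every engine in the stack and by thousands of daily apps, and the case for automated recommendations makes itself.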
Now, within this context, what Unravel does essentially is look at the platform in a very holistic fashion. The interesting thing you'll see here is that it looks across all the layers of your deployment, all the way from the hardware: in the cloud, for example, whether you're running on Kubernetes or on VMs, that forms the underlying layer. On top of that you have your cluster deployments, which could be a combination of query engines, NoSQL engines, and so on. On top of that you have the jobs: how the jobs are profiled, and how those jobs are then used within the pipeline. So the first part is to uncover, taking a holistic look at all the layers. The second part is to understand: I can have a lot of information available, but how do I correlate it and make sense of it? And the last part is concrete recommendations. One can analyze data all day long, but it doesn't help if you cannot act on it, and a lot of the open solutions have stopped at an intermediate point: they may collect some information, but all you really see is analysis. At the end of the day, from a self-service standpoint, how do you leverage the expert in the box? Especially with so many technologies changing, we have teams of people specializing in each technology and building the recommendation logic into the product.

To give you one example: what you're seeing here is a Spark application running with auto-tuning. Going back to the analogy I used, think of level-five automation, where the car is self-driving. In this mode, and this is just an example, you don't have to run it in self-driving mode, the initial app run failed; it died with an out-of-memory error. In self-driving mode, Unravel automatically finds the configurations, and in successive runs of the same application you can actually see the duration reducing. So this is an example where proper tuning happens without getting mired in the details, because end users may not really be deep into Spark versus Presto versus Athena versus Hive, or Hive itself on MapReduce versus Spark versus Tez; the system absorbs that sheer complexity. Another example, again an actual real-world one: a Spark application, and I say Spark but this is not just for Spark, I just wanted to pick a couple, originally running for three, almost four hours, and through the recommendations provided you can see the duration going down. At the end of the day, there are definitely aspects of how do I reduce my root-cause-analysis time, and also how do I tune, how do I optimize for performance, cost, and SLAs together. And this is what you're eventually looking for: to be able to run more for less.
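The "self-driving" loop he sketches can be caricatured in a few lines. To be clear, this is a toy, not Unravel's actual algorithm, which reasons over profiles across the whole stack; the toy just retries a hypothetical spark-submit job with more executor memory whenever it dies with an out-of-memory error:

```python
# Toy caricature of retry-and-retune; NOT how Unravel derives
# recommendations. Bump executor memory on OOM and resubmit.
import subprocess

def run_job(executor_mem_gb: int) -> bool:
    result = subprocess.run(
        ["spark-submit",
         "--conf", f"spark.executor.memory={executor_mem_gb}g",
         "my_app.py"],                  # hypothetical application
        capture_output=True, text=True,
    )
    return result.returncode == 0 and "OutOfMemoryError" not in result.stderr

mem_gb = 2
while mem_gb <= 32:                     # give up past an upper bound
    if run_job(mem_gb):
        print(f"succeeded with {mem_gb}g executors")
        break
    mem_gb *= 2                         # retune: double memory, retry
else:
    print("never succeeded; needs a human (or a smarter optimizer)")
```

A real optimizer searches many parameters at once and learns from the run history instead of blindly doubling, which is why successive runs get faster rather than just bigger.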
And there's no real magic here. If you run things in default mode, without tight configuration and tight tuning, you are potentially using a lot more CPUs than you need. In fact, in a lot of jobs you'll see the driver with heavy container sizes, core allocations, and memory, but really just sitting and doing nothing. At the same time, given the scale and the sheer plethora of technologies, users may not have the ability to do this themselves. So this is where that automation comes in, in the end-to-end time-to-insights perspective.

There are more examples along these lines: it's not just about looking at apps individually, but also about where to focus. Working with customers, they have thousands of apps, thousands of scripts running in their pipelines, and typically what we have seen is a strong skew in the distribution, more of a dichotomy: for instance, what you're seeing here is that just the top 17 apps account for a large share of the memory. And this is consistent; we have applied it from a duration standpoint and a cores standpoint. You always see huge opportunities, and just double-clicking on those gets you immediate benefit. And on top of all of this, how do you accurately forecast? That is one of the biggest issues, in the cloud especially; forecasting becomes an extremely complex problem. I have run data pipelines and data platforms with sizable budgets, millions of dollars, and constantly the issue was that we were not able to stay under budget. The reason is that there was no real way to predict, whereas if you start looking at your workloads in this holistic fashion, you can get very good trend predictions and, more importantly, align the spend with the business use cases. One of the things I found is that most of the time we were running applications that were bottom of the barrel, not critical from a business standpoint, and we were spending about 60 to 70% of our budget just on those. So right off the bat, you can get your budget under control simply by being able to map spend back to the business.

I wanted to wrap up by summarizing the overall data pipeline, the time to insights, and all the different metrics you saw. We have a book coming up, an O'Reilly publication, so that is somewhere you can get more information, and you can connect with us at hello@unraveldata.com. It breaks this down into what the different metrics are and, for each metric, the key automation aspects to care about. Our goal in this presentation was really to help you with your pipeline from an end-to-end standpoint. Unravel definitely has an important place, especially from an operationalization standpoint, but it's really about making you successful, so even beyond the operational space, if you want to connect with us and discuss the broader data pipeline, we are there for you, and we have folks actively watching that email address. With that, I would like to open up for questions. Thanks, everyone, for your time; I hope you found this useful, and we are happy to engage further, even beyond this webinar.

Sandeep, thank you so much for a great presentation.
And if you have questions for Sandeep, feel free to submit them in the Q&A section in the bottom right-hand corner of your screen. Just a reminder, I will send a follow-up email to all registrants by the end of day Thursday with links to the slides and the recording of the session. So, diving in here: Sandeep, what is the best way to seamlessly integrate the activities of the analysts and scientists with the data engineers?

That's an interesting question, and actually a really broad one. The answer, folks, essentially is: it depends on the skill sets of the teams. There is no single answer here, especially if the data scientists and data analysts have a business focus. I've seen all varieties, from purely business-focused folks with very little engineering talent, to the other extreme, where there is a lot of engineering talent and they want to be hands-on, do things end to end, and are really looking to data engineering for help to accomplish that. So, depending on where you are, I would say: break the problem down into understanding where the data users have the biggest pain point. When we started doing this exercise from a data engineering standpoint, we got laundry lists of issues, and it's good to at least first create a list of everything slowing them down, and then prioritize. For example, the number one pain point we were seeing was moving data, time to move; we had lots of silos of data. So we picked that up and said, okay, we are going to work to simplify data movement. The first thing we found was that, internally, there were homegrown tools the analysts had built and the engineers had built, 30 homegrown tools. We rounded them up and said, okay, we are going to create a single tool; this is what it will look like, this is how it will be used, does it meet your requirements? So I would say: break the problem down, take one piece at a time, and then really engage in matching the needs to what you're trying to automate and build from a data engineering standpoint.

What's the time to apply governance, and governance of what? So, governance... sorry, go ahead, Shannon. Oh, is this another broad question? It is a broad question. And governance is actually used in a few different forms. The two elements that are my personal favorites are data governance, which is about making sure we comply with the regulations, and cost governance, which is making sure we comply with the budgets. Data governance has become a hot topic for lots of companies, because it's all about customers' data rights. They can come in and say: delete my data; don't use my data for these use cases; I don't want my data included in the ML model; or I don't want my data collected for marketing campaigns. How do you enforce that, and enforce it reliably? Oftentimes what you find is that when these pipelines were written, they were not built to be tractable; even having a view of, okay, these are all the pipelines I'm running, these are the sources they touch, these are the attributes they touch, is a fairly complex problem. So that's a big data governance issue. Again, it depends on the industry.
Anyone who manages user data really has those issues. Traditional requirements like PCI, SOX, and so on remain, though things have gotten better there. Cost is the other dimension. In fact, I don't even remember the number of times we had to go back and look at our budget purely because users often forget that while cloud resources may feel infinite, budgets are not. They just remember the tagline: the cloud is infinite. What I got to tell them is that there is a cost to using these cloud resources. So hopefully that covers the two big buckets I care most about.

You mentioned that 85% of big data projects fail. How can we prepare for that not to happen? Excellent question. I think part of that preparation is really what this webinar was hoping to accomplish. First, think about the entire journey. The biggest reason projects fail is that we end up focusing on one or two pieces of the puzzle and not the whole journey. What I mean is: when you start a project, even before you build things up, how quickly can you identify the data sets, and the quality of those data sets? How will you wrangle some of that data? How will you build the first prototype, do the integration, connect with the different data sources, which is what I refer to as data virtualization? Oftentimes these things are discovered reactively: oh, that's what I need; oh, now I have to figure out how to deploy; now I have to figure out how to monitor. And by the time you get through that phase of ironing out the issues and finding the answers, the business relevance is gone; that project may no longer be relevant, or the priority may have shifted. So what I'd say is: really think of your end-to-end data operations, think about what your time to insights is, and prepare for time to insights irrespective of the project. Nailing it doesn't mean you automate every aspect; look at your hotspots and make sure you have the right level of automation, and again, automation comes in flavors. That is the number one preparation you can do, because once projects come, they are on a short fuse and need to be delivered very quickly.

What is the ROI model for data efficiency, especially with cloud-based models becoming more prevalent? So, ROI can be looked at from a few standpoints. One is business value: measuring ROI from the standpoint of business agility. If I were to launch a new service and grab the market, I need to move fast. My differentiator as a company could be: look, I have a lot of customer data and I can enter an adjacency. I'll give an example: in one of my previous workplaces we were looking at, okay, how do we do loans? We manage a lot of the information and accounts for the customer; now, how do we get into the loans business? This is where you have data and you want to enter that space quickly, launching the service, having a model, having the right predictions in place. So business agility is one aspect of the ROI, and it often goes unnoticed.
The second aspect of ROI is in the bucket of productivity: how efficiently were we able to use the time of our data users? When you think about the whole tango between data users and engineers, short-circuiting it matters. We have had the experience of hiring some of the best data scientists, and data scientists, as all of you know, are definitely expensive, fairly pricey. The worst thing you would want is for a data scientist to be waiting six months, stuck finding the right data, or building the right model, or doing the whole RCA. The expectation, increasingly, is also for data scientists to go beyond the initial prototyping and do more of the operational work themselves. So that's the other aspect: how do you improve the productivity of the teams here, data scientists and data engineers? And the last piece of the puzzle, which is an ROI aspect but, I would say, the least important of the ones I mentioned, is the actual cost, meaning savings. Yes, savings are very important, but look at the whole picture together; savings by themselves may not be the only factor, you have to factor in the productivity and what it means for your business.

Are there any open-source cloud resources available for usage, and are they safe? Well, this is a very interesting question, especially the last piece: is something safe? Safe is really a function of what you're trying to accomplish: what kind of data you have, what vertical you're in, what regulations you're accountable to. For instance, the finance and healthcare industries are in a whole different ballpark compared to retail and some of the other sectors. The broader statement I'd make is that clouds have come a long way, and the early inhibition of "is the cloud safe?" is getting less and less prevalent. There are various reasons to see it that way, purely in terms of the amount of investment being made in security, in detecting issues and intrusions, and so on. But again, whether the cloud is safe enough really depends on your scenario. With respect to cloud resources, the cloud really comes down to three major providers today, and what we are seeing in the market is parity, with each of the providers, Google, Microsoft, and AWS, building out the entire portfolio, making it a one-stop shop when it comes to simplifying the end-to-end pipeline. And if you go back to the time-to-insights metric, there are building blocks that help you accomplish some of that. I'll give an example, maybe a little tangent here: AWS has this notion of data crawlers as part of Glue, which is the catalog engine. A crawler goes out and starts crawling your data sources, the transactional data sources, looking for schemas, looking for new data sets, looking for changes. If you correlate that with the example I was talking about, time to interpret and time to discover, then clearly, for each of those building blocks, the cloud is starting to provide capabilities that help you through the journey, and so are companies, including Unravel, which is focusing a lot on the operational aspects of data operations.
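For concreteness, the crawler flow he describes maps to a couple of AWS Glue API calls. A hedged sketch with boto3; the role ARN, database, path, and schedule below are placeholders for your own environment:

```python
# Sketch of the Glue crawler flow described above: point a crawler
# at a data source and let it populate the catalog with schemas.
# Role, database, bucket, and schedule are all placeholders.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="clickstream-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # hypothetical role
    DatabaseName="analytics_catalog",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/clickstream/"}]},
    Schedule="cron(0 6 * * ? *)",  # re-crawl daily for new data sets and schema changes
)
glue.start_crawler(Name="clickstream-crawler")
```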
And I think we have time to slip in one more question here. Is there a place for a data model in the graph database world? And also, do you recommend a data model in all the phases of your data pipeline slide, the acquire stage, the organize stage, the deliver stage?

That's quite a question, and in fact "data model" by itself is slightly broader; let me take a crack at it, and in case I don't answer it, I'd love to follow up with the individual who posted it. Firstly, on the graph database, or the graph data model: it totally goes back to the statement that there is no one size fits all. In my two decades of application building in the data space, graphs have come a long way; in fact, a lot of models that used to be forced into relational form are much better suited to a graph. But is graph a one-stop shop for everything? No. The real answer here is what I refer to as polyglot systems. App owners really need to think polyglot, in terms of a combination of graph, time series, key-value stores, wide-column stores, in-memory stores; you pretty much need that combination for any application architecture. I would put it that way.

Now, with respect to the data model aspect, I think the key, and I'm extrapolating what the question means, is that in each phase of the data pipeline there is clearly metadata that is generated and calculated, and these metadata catalogs are becoming very, very critical. Going back to how you search for data sets, how you look up when a data set was last updated and by whom: that is now becoming part of the meta information. You see catalogs becoming more and more popular, Apache Atlas, for instance, and also commercial solutions like Waterline and so forth, that provide that metadata store. This extends the original Hadoop world of Hive metastores; it's like Hive metastores on steroids. So, long story short, the meta model is becoming critical, especially from the standpoint of performance, quality, and a lot of other things that can be done in addition to the actual data set.

Well, Sandeep, that's perfect timing; that brings us right to the top of the hour. I just want to thank everybody again for attending, for being so engaged in everything we do, and for all the great questions you submitted. Again, just a reminder, I will send a follow-up email by end of day Thursday with links to the slides and the recording. Sandeep, thank you so much, and thanks to Unravel for sponsoring today. Really appreciate it.

Absolutely, really enjoyed it, and folks, I'm really looking forward to hearing from you and engaging further. Thank you so much for your time, and stay safe. Thanks, all.