Hello and welcome. My name is Shannon Kempe and I'm the Chief Digital Manager of DATAVERSITY. We'd like to thank you for joining this DATAVERSITY webinar, which today is Unlocking the Value of Your Data Lake, sponsored by Ahana. Just a couple of points to get us started: due to the large number of people that attend these sessions, you will be muted during the webinar. For questions, we will be collecting them through the Q&A section, or if you'd like to tweet, we encourage you to share highlights or questions on Twitter using the hashtag #DATAVERSITY. And if you'd like to chat with us or with each other, we certainly encourage you to do so. Just note that the chat section defaults to sending only to the panelists, but you may absolutely change that to chat with everyone and network throughout the webinar. To find the Q&A and chat panels, you can click those icons in the bottom middle of your screen to activate those features. And as always, we will send a follow-up email within two business days containing links to the slides, the recording of the session, and additional information requested throughout the webinar.

Now let me introduce our speaker for today, Dipti Borkar. Dipti is a co-founder and CPO of Ahana with over 15 years of experience in distributed data and database technology, including relational, NoSQL, and federated systems. She is also the Presto Foundation Outreach Chairperson. Prior to Ahana, Dipti held VP roles at Alluxio, Kinetica, and Couchbase. Dipti holds an MS in computer science from UC San Diego and an MBA from the Haas School of Business at UC Berkeley. And just to let everybody know, I've known Dipti for some time now, since she was at Couchbase. Always glad to have her speak with us; she's a great speaker, and I'm very excited to see her channel her passions into her own company. Congratulations, and hello and welcome.

Thank you, Shannon. Great to be here again. We're big fans of DATAVERSITY, and it has such a great audience; I always look forward to a lot of questions, interaction, and good conversation about all things data. Should I go ahead and get started? Absolutely.

All right. So thanks, everyone, for joining today's webinar on unlocking the value of your data lake. Shannon gave a very generous background of me, but let me tell you a little bit more about my experience on the data side. I actually started off in distributed DB2, on the structured-database side; I was a core kernel developer in storage and indexing there. And I've transitioned in many of the ways data platform teams have transitioned themselves: from structured to semi-structured data with Couchbase, building SQL on JSON, and then back to the analytical side with these big disaggregated systems we're seeing now, which are the next wave of analytics I'll talk about today and which are driving data lake adoption. I founded Ahana just as the pandemic was getting started last year to simplify SQL on data lakes, SQL on S3, which is where we're seeing a lot of data moving. So today I will talk a little bit about data warehouses, how data lakes can sit next to the data warehouse, how you can unify them, and what some of the disaggregated query engines like Presto and Spark look like. And then we'll follow up with a quick overview of Ahana, which brings open data lake analytics to data platform teams. So let's get going.
So as most of you know, traditional data warehouses are all about structured data, RDBMSs. These typically have a star schema. There are some specialized versions of this which have column stores, and you can do quite advanced analytics on them: multi-way joins across your fact and dimension tables. But it's mostly structured data. It's very, very clean data, highly normalized and modeled. And typically you have an ETL approach, which means from your operational data store you extract that information, you extract rows, you transform it in an ETL tool, traditionally with an Informatica or Talend or others, and then you load it into your data warehouse. That's the typical flow. Of course, you have SQL access on the top.

But over time, what has happened is we've seen challenges. It gets extremely expensive because it's a tightly coupled system, with storage and compute together, and so it's quite expensive. Over time it might become difficult to manage; it's costly to maintain some of these data warehouses, and there's only so much data you can store. That's how Hadoop got started: the warehouses got quite expensive and people were looking for alternatives to store data for a longer period of time. You also have limited access in terms of the kinds of processing you can do. It's obviously structured access with SQL, but other workloads, like general-purpose computational workloads, are harder to run on the warehouse.

And so there's this big modernization we've been seeing over the last two years, really accelerated over the past 18 months. Digital transformation: everyone's obviously talking about it. That means a lot more data. There are a lot more real-time events and information coming in that need to be streamed in, and that means fast data: the system needs to be able to handle a lot of data arriving as streams as opposed to batches. And there are a lot more modern techniques to process data: AI, ML, general-purpose computation with Spark, for example, and of course SQL, with engines like Hive and even Presto that have emerged. And all of that is about making your data more valuable for you, making it smarter for you.

And so what we've seen is that the data warehouse, the database essentially, has split apart, right? I call it the big deconstructed database, because now you have a storage engine or storage layer that's separate: it ends up being the lake, which is S3 or HDFS. You have a query engine on the top. You have other components: the catalog is a separate system; even the transaction manager, the log manager, is a separate system. So it is now a kind of disaggregated system, and users and platform teams are figuring out how to put this together and get value out of the data lake once the data has landed in there.

So let's take a look at what's happening there. Traditionally, you obviously have the data warehouse, with your reporting dashboards running on top: Tableau, Looker, anything that connects with JDBC or ODBC. But because a lot more data is being generated and available, people are looking at the data lake approach, right? We're seeing thousands of times more data. How do you keep that warehouse going? It's funny, as someone was saying the other day: it's called Teradata, it's terabytes, right?
And then you had Exadata. But what about petabytes, or hundreds of terabytes, right? Where does it go? Where does it live? Because data is extremely valuable, and you want to have access to it, to be able to query it. In addition to that, because of the devices that are coming up, third-party data, telemetry data, event data, there are a lot of different types of data, which in some cases might be JSON, CSV, or optimized formats like ORC and Parquet. And so the modern analytic system needs to be able to handle this new kind of data as well.

And then lastly, platform teams are looking for a lot more flexibility. There was a saying that software is eating the world. What's happening now is that open source software is eating the world, where you have a lot more open source technologies coming out of internet giants like Facebook and Uber and others that give platform teams the flexibility of not being locked in, of using open formats, of using open clouds. All of these things together are driving open data lake adoption, with a lot more workloads that can be run on top. So you obviously have your reporting and dashboarding, bring your own BI tool, but you also have data science; you have SQL notebooks, perhaps Python notebooks, that might need machine learning workloads like TensorFlow. You might have in-data-lake transformation with engines like Spark and others. And so this is what open data lake analytics means: open source, open formats, open cloud, and open interfaces like SQL, which is your traditional JDBC, ODBC.

So you have your more traditional data lake, which used to be Hadoop. Hadoop was extremely complicated. It was very much file-system oriented; it was built for HDFS, and you had to ingest into HDFS. There was discovery that needed to be done. It was less expensive than the data warehouse, but still a lot more governance was needed on top of it. And so both data warehouses and data lakes started to move to the cloud, right? What we're seeing a lot more of is that data is moving into the cloud data lake. In many cases you still have a data warehouse, and then you have a data lake which augments your warehouse, sits next to it, and contains all the data. So you might have 10 percent, or five to 15 percent, of your data in the warehouse; this is for your near-term or really low-SLA use cases. And then the rest of it, for broader use cases, historical data analysis, ad hoc analysis, is in the lake.

And the query engine is now a separated query engine, where you have Presto, which came out of Facebook and which I'll talk more about, becoming the de facto query engine for the reporting and dashboarding use cases, as well as some of the SQL data science use cases. Then you also have engines like TensorFlow, PyTorch and others that help with the data science use cases when machine learning is involved, for feature development, for example, and so on. And then your ETL, where you transformed before you loaded, is now increasingly changing to ELT, which is load first and then transform. And so we're seeing a lot of in-data-lake transformation use cases, with orchestration in Python or Airflow, and with engines like Spark.
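To make that ELT pattern concrete, here is a minimal sketch of a load-first, transform-later step run through Presto's Python client. This is only an illustration: the coordinator endpoint, schemas, and table names are hypothetical, and it assumes the presto-python-client package and a writable Hive catalog.

    import prestodb  # pip install presto-python-client

    # Connect to a (hypothetical) Presto coordinator.
    conn = prestodb.dbapi.connect(
        host="presto.example.internal", port=8080,
        user="etl", catalog="hive", schema="raw",
    )
    cur = conn.cursor()

    # Transform *after* loading: derive a curated table directly from raw
    # events already landed in the lake (CTAS = CREATE TABLE AS SELECT).
    cur.execute("""
        CREATE TABLE hive.curated.daily_orders AS
        SELECT order_date, region, count(*) AS orders, sum(amount) AS revenue
        FROM hive.raw.order_events
        GROUP BY order_date, region
    """)
    cur.fetchall()  # drive the statement to completion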
And so this is the new modern stack that we're starting to see quite a bit of. This is what runs at Facebook, at Uber, at Twitter. In fact, Facebook calls Presto its open source data warehouse, and there are thousands of servers running Presto for all of the workloads that run on top of it.

So now let's step back up one level and see how this fits into the bigger picture, because you have your operational systems, your streaming systems, your ETL systems and so on. Typically you have your operational data sources, your MySQL, Postgres, including NoSQL, Mongo, Couchbase and others. This is where the business is running, right? You typically use data engineering pipelines, which could be streaming or your more traditional ETL, to store it into the warehouse, and then you had workloads on top: reporting, dashboarding. This chart is flipped horizontally a little bit, but what's happening is that the storage is now going to the data lake, increasingly S3. If you think about it, S3 has been around for 15 years. There are trillions of objects in S3, and just from a revenue perspective, AWS makes billions of dollars on storage alone. And so the amount of compute that's needed on top of this is pretty significant. That's where some of this new data comes in: streaming and IoT data, third-party data. And you have these workloads on top of it. You obviously have your SQL query; that's where Presto is, that's where Ahana sits. You have machine learning and transformation; that's where Databricks and Cloudera are. And then you also have a smaller workload coming up, which is data virtualization or federation, for situations where data hasn't landed in the data lake yet, or where you have ephemeral data in another system and you want to correlate it with data in the lake. And so these are the popular workloads that run on your lake. Obviously you have your consumption on the top, where you have your reporting, your dashboarding, and other workloads, as I mentioned, running on top of it. So this kind of completes the picture in terms of what the modern data warehouse would look like.

Now, I talk to a lot of users, and I broadly classify them into two categories. You might be a platform team that is born in the cloud, right, but pre-warehouse. That means you might still be running Tableau or Looker and others on an Aurora or MySQL or a Postgres, and you're thinking about moving to a data warehouse or a data lake: which one should you choose? And then there's the other category of folks who are already on a warehouse, but the warehouse is starting to get very expensive, and they're looking at an augmentation strategy, right? For the former, I would say it's very interesting that over the last couple of years quite a bit has changed, and users finally have an option, for the first time in 20, 30, 40 years, to try and skip the warehouse. If the use case is simple, if you have simple joins, maybe a simple star schema or even a denormalized schema where you have a single table that's very sparse, you could just run analytics on a data lake, and it could be extremely affordable, very flexible, and that could work for you. But in some cases, you may not have the skills, right?
And so in that case, starting off with a warehouse and then moving to a lake, or augmenting it with a lake, might be the better approach.

So I've talked a lot about the different workloads, why open data lakes, and the skip-the-warehouse strategy versus the augmentation strategy. For the rest of this presentation I'll focus more on cloud data lakes. Now, cloud adoption, S3 adoption, GCS adoption, all of these object stores, right, that's where the data is landing first, even before hitting a warehouse, and that's driving the adoption of these open source SQL query engines. Here's a chart from DB-Engines. DB-Engines is a pretty well-known ranking for a variety of systems, databases, relational databases. Now, these engines are actually not databases, because they don't actually manage data; they're just the query engine. But even then, they're included in the ranking. You can see the growth of Presto, which Facebook created around 2013, 2014, and especially over the last two years; I think it will kind of overtake Spark. Spark is a more general-purpose engine; you can do a lot of things with it, but Presto is really catching up there and growing very fast. You also see Apache Drill, which was a precursor to Presto in some ways and was based on the Dremel paper that Google published. It is also kind of a federated engine, which means you can query data lakes as well as databases, but it has stayed flat, and Presto has really risen past it. I should probably have included Hive as well. Hive came out of the Hadoop ecosystem; it's largely used on-prem, but not as much in the cloud. In the cloud, we see Presto, which is about 15 times faster than Hive and built for interactive querying, and it ended up being the winner from a cloud data lake and query engine perspective.

So these are some of the considerations to think about as you look at data lakes. Now, there are a lot of similarities between a modern data warehouse, which is a cloud data warehouse, Snowflake, Redshift and others, and modern data lakes. In general, they tend to be cloud-first managed services. They are largely in-memory, and both support complex data types. Data warehouses tend to be better with columnar data sources, columnar formats, but CSV, JSON and others are supported as well. Data lakes tend to be better at separated storage and compute; that is what they were built for. The warehouse can support it, but it's not what it was built for: it was built as a tightly coupled system where you have to ingest into the system, versus the in-place approach of the data lake. And so there are a lot of similarities, but as you look at these workloads, you might want to combine them, to merge them into one with a distributed query engine like the ones we've talked about.

So there are six considerations that I would suggest. SQL access: what kind of SQL access are you looking for? What kind of clauses, sub-clauses, features are you looking for? Does the query engine support that?
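As a concrete illustration, this is the kind of ANSI SQL shape, a join feeding a window function, that's worth validating early when you evaluate an engine. A hedged sketch using the presto-python-client package; the endpoint and table names are hypothetical.

    import prestodb  # pip install presto-python-client

    conn = prestodb.dbapi.connect(
        host="presto.example.internal", port=8080,
        user="analyst", catalog="hive", schema="sales",
    )
    cur = conn.cursor()
    # A join plus a window function: a reasonable probe of ANSI SQL support.
    cur.execute("""
        SELECT r.region_name, m.order_month, m.revenue,
               sum(m.revenue) OVER (PARTITION BY r.region_name
                                    ORDER BY m.order_month) AS running_total
        FROM monthly_revenue m
        JOIN regions r ON m.region_id = r.region_id
    """)
    print(cur.fetchall())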
Now, we're working on Presto, and the innovation on it continues, where not only are we supporting data warehouse use cases, but we're going even further beyond that. And so over time, the data lake is where a lot more of the innovation will happen. Some of the state-of-the-art database capabilities will be merged in, and over the next three to five years it will handle very, very complex data warehouse workloads. Today it supports simple- to medium-complexity workloads, and over time it will get even wider. So you can unify both of these, using S3 or GCS as storage and running the SQL engine on top. You have distributed query engines like Presto that allow you to query across different systems: if you have data in the lake, you can query that; if you have data in Redshift, you can query that with Presto; and you can actually perform a join across the two as well. You get limitless scale with the data lake, and the cost profiles are much, much lower than a data warehouse, because your storage is very, very cheap: S3 is ubiquitous and inexpensive, and you can build on top of that. And you can support many different types of data. So let's take a look at some of the use cases we see on the data lake. Any questions, Shannon? Let me pause there and see if there are any thoughts, comments, questions.

There was a comment earlier: traditional warehouses store history, like SCD Type 2. So using the lake, would you recommend storing historical data? Won't that make processing more difficult?

So if I understand correctly, I think the question is about versioned data or time travel. Traditional warehouses have many different features, right? They've been around for a long time, and some of them, not all of them, support versioning of schemas and versioning of the data itself, so you can go back in time, business time and system time. Now, data lakes up until about two years ago couldn't support that. But there is a new layer that's emerging, and I'll talk about the lakehouse, where you have the transaction manager, which traditionally is the log manager within a database. There are two or three popular ones out there. They're called table formats, incorrectly named again; naming is hard, and it's confusing when something isn't named right. But you can think of them as transaction managers. With the innovation of these transaction managers for the lake, there's a new layer that sits on top of S3, and it allows for versioning of both the schema and the data. And so with this, you can actually go back in time. You can do time travel; you can support versioned schemas and so on. So I think that was the question. If not, feel free to correct me, and I can try to answer the question that was intended.
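As a quick illustration of that time-travel capability, here is a minimal PySpark sketch using Delta Lake, one of the table formats just mentioned. The lake path is hypothetical, and it assumes the delta-spark package is installed.

    from pyspark.sql import SparkSession

    # Standard Delta Lake session setup (assumes delta-spark is installed).
    spark = (SparkSession.builder
             .appName("time-travel-demo")
             .config("spark.sql.extensions",
                     "io.delta.sql.DeltaSparkSessionExtension")
             .config("spark.sql.catalog.spark_catalog",
                     "org.apache.spark.sql.delta.catalog.DeltaCatalog")
             .getOrCreate())

    path = "s3://my-lake/curated/customers"  # hypothetical Delta table

    # Read the table as of an earlier version of the data...
    df_v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

    # ...or as of a point in time.
    df_jan = (spark.read.format("delta")
              .option("timestampAsOf", "2021-01-01").load(path))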
So let's take a look at the use cases. Presto, as an example of a distributed query engine, was built for interactive query use cases, so think reporting and dashboarding. This is your Tableau, Looker, QuickSight; Superset is an open source one that we really like as well. That's what the first use case was. So reporting and dashboarding and SQL data science are the bread-and-butter use cases, if you will, for Presto.

Federation is another great use case, where you can query across databases and data lakes. And in some cases, depending on the customer-facing app, if you need to do large-scale, massive-scale ad hoc querying, it can also be a very good back-end engine for customer-facing applications. A good example here is cybersecurity, with threat hunting: you might have needle-in-a-haystack kinds of queries, where you have to query what has happened for this IP address across all the millions of events that have been tracked. In those cases, a simple warehouse or a simple relational operational database is not going to be able to solve that problem for the customer-facing application. And so we're starting to see some of these come up quite a bit as well.

The emerging use cases are the lakehouse that I mentioned, which I'll talk about in the next slide, as well as a lot more transformation using SQL. Facebook recently built support for Presto on top of Spark, and so now you can have one single layer of SQL on the top, which is ANSI SQL. Presto is a really great ANSI SQL engine; Spark SQL has some peculiarities in its SQL. And so now you can use Presto to run on Spark, and we're starting to see a lot more of this: Uber just recently transitioned its transformation workloads to Presto on Spark, and there are others as well. So these are the emerging use cases I'm seeing as we talk with users, the community, and customers.

So let's take a little bit of a deeper dive on one of them, the data lakehouse use case. This is what the lakehouse stack looks like, right? You have your BI tools and notebooks on the top; many different ones are supported. You have your JDBC, ODBC driver. These then connect to the Presto cluster. The Presto cluster needs a catalog. The catalog is not part of the query engine, and there are two popular catalogs today. One came out of the Hadoop ecosystem, which is the Hive Metastore. The Hive Metastore is not the Hive query engine; note that there's confusion out there because they're both called Hive, but they're very different. One is a query engine and one is an actual catalog, an operational catalog. What does an operational catalog mean? It means that it stores the schema for the lake, and it is the mapping from the query engine to the objects stored in the lake: what schemas exist, what tables exist in these objects and files, what the columns in those tables are. All of this information is stored in the metastore, the operational catalog, and the Hive Metastore, HMS, is a very popular one. If you're in AWS land, Glue is the catalog; by default everybody uses Glue. It is compatible with HMS from a wire protocol perspective, and it connects seamlessly with Presto. For Ahana, we actually have a managed Hive Metastore, so with every cluster you can just check a button and it will be created and managed for you.

Underneath this are the transaction managers I was talking about. A few popular ones have emerged. Apache Hudi is one; it came out of Uber, and Presto plus Hudi on an object store is the stack that's running at Uber. Delta Lake is another popular one that was created by Databricks for the Spark ecosystem, and that works with Presto and Spark as well.
Hudi tends to be a lot more engine-agnostic, so it works with Spark, Flink, Presto. Delta is a very good format that Databricks is investing in; it works very well with Spark and has good integration with Presto as well. And then AWS is coming up with its own: it also supports Hudi, but it is looking at building its own transaction layer as well.

And then the layer at the bottom is the storage engine. The storage engine is the lake, which is where your objects sit. These could be Parquet files or ORC files and others. The metastore I was talking about maps down to these objects, and so the query engine knows where the tables live, what the columns look like, and what files and folders they point to in the object store. Now, a reminder that the catalogs I'm talking about are quite different from the catalogs you use for governance purposes; popular ones there are Collibra or Alation, and Amundsen and DataHub are a couple of open source ones coming up as well. Those are quite different. I call them the human catalogs, because they're for human consumption, while HMS and Glue are for system consumption; they're operational catalogs.

I see a couple of questions coming in, so let's take a look. Does Presto support XML? We have a SaaS vendor that only supports web services. So, from what I know, XML is not supported in Presto, and you would need a transformation from XML to another format. Most times what we see is that Spark is used for the transformation: it'll do an XML to Parquet or ORC conversion, and then you run Presto on top. And here's why: when you want better performance, the columnar data formats like Apache Parquet and Apache ORC are going to give you the best performance. The reason is that they store some metadata for the data within the file itself, so the engine can skip parts of the file as it scans, and that becomes very, very helpful when you're doing large table scans across the lake. So the recommendation from a performance perspective typically ends up being: store your XML, transform it to Parquet or ORC, and run Presto on top of it. Now, JSON is supported; there's a lot of good support for nested structures and so on, but you would see a performance hit, because it is not going to be as performant as the columnar formats. Depending on what your use case is, you can decide: keep it in JSON or CSV, or move to a more performant format.
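Here's a minimal sketch of that XML-to-Parquet hop in PySpark. It assumes the spark-xml package is on the classpath; the row tag and S3 paths are hypothetical.

    from pyspark.sql import SparkSession

    # Launch with the spark-xml reader on the classpath, e.g.:
    #   spark-submit --packages com.databricks:spark-xml_2.12:0.13.0 job.py
    spark = SparkSession.builder.appName("xml-to-parquet").getOrCreate()

    # "rowTag" names the XML element that becomes one row (hypothetical here).
    df = (spark.read.format("xml")
          .option("rowTag", "record")
          .load("s3://my-lake/raw/vendor_feed.xml"))

    # Write columnar Parquet; once registered in the catalog, Presto can query it.
    df.write.mode("overwrite").parquet("s3://my-lake/curated/vendor_feed/")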
Very interesting; love more information. We'll send out some more information at the end. Let's take one more question and then we'll keep going. Go ahead, Shannon. Can you access and query an on-premise database, a cloud database, and S3 at the same time currently? I think Athena cannot query certain types of databases, like SQL Server.

Yes, good question. So with Presto, yes, you can query different databases. There is a general-purpose JDBC driver that can be used for any database, and there are other connectors that exist, for example for Redshift, MySQL, Postgres, Kafka; Presto has around 30 connectors. Ahana specifically is built for the cloud, so it is built for accessing S3 and a range of other cloud systems, and it provides much more federated access than Athena does: it is better with Redshift, with RDS, with Elasticsearch and others, because it provides native access, versus Athena's Lambda-based access, which can be pretty complicated. All right, Shannon, should we keep going and take the big questions at the end? What would you suggest? Yeah, that'd be great. Lots of questions coming in, but the end will be perfect.

All right, so let's keep going. Next, I want to talk about considerations for open analytics. There are eight areas. I'll walk through them really quickly today; feel free to reach out, because each one of these could be an hour-long session.

Data: there are a lot of different kinds of data. What we're seeing is that data lakes are really good for structured and semi-structured data, complex data. It can be text, which Presto can query as well, and streaming. So you can have Kafka streams that drop data in, and you can actually use Presto to query Kafka topics. This is something that Twitter does. Twitter has its literal Twitter firehose that goes into Kafka streams and then lands in their cloud stores; they're on GCS as well as on-prem. The data lands in GCS, and Presto can query GCS as well as the Kafka streams. And so with a data lake, you get a lot more flexibility than with a warehouse. The other thing is that you obviously need a SQL workload, which will tend to be Presto, but you can also run other workloads on that same data without moving it around, without ingesting it into yet another place: machine learning workloads, for example, with PyTorch or Spark, or general-purpose Spark workloads for in-data-lake transformation. And so you have the ability to handle a lot more kinds of data with a lake.

Analytics: what are the different kinds of analytics you could run? Obviously we've talked about SQL; Presto is ANSI SQL compliant, you get the best of all of that, and you get extensions for the semi-structured data as well. You can run your Python workloads. We quite often see straight-up Python code connecting via JDBC to do transformation: sometimes inserts, sometimes CREATE TABLE AS statements, where you create a copy of a table or a derived table in Python and then run queries on top of that. Notebooks: Jupyter tends to be very popular, and Zeppelin comes up every now and then as well. And then you have search. SQL obviously has LIKE clauses and others, and with Presto you can also query across Elasticsearch, for example, and query S3 and search together.
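To make that concrete, here is a hedged sketch of the straight-up-Python pattern, in this case pointed at a Kafka catalog as in the Twitter example above. The endpoint and topic name are hypothetical, and it assumes the Presto Kafka connector is configured with that topic; the connector exposes each topic as a table with internal columns such as _message and _partition_id.

    import prestodb  # pip install presto-python-client

    conn = prestodb.dbapi.connect(
        host="presto.example.internal", port=8080,
        user="analyst", catalog="kafka", schema="default",
    )
    cur = conn.cursor()
    # The topic name is hypothetical; the connector maps each topic to a table.
    cur.execute('SELECT _partition_id, count(*) FROM "clickstream-events" GROUP BY 1')
    for partition, events in cur.fetchall():
        print(partition, events)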
Who are the end users? Who is actually using the system? Data platform teams, or data platform engineers, tend to actually run the platform, right? And then on top of the platform you may have data analysts, data scientists, data engineers, and the business. Data engineers will typically run the transformation workloads or data pipelines with Presto or Spark on the lake. Data analysts will typically use Tableau, Looker, Superset and others on top of the lake. Data scientists will typically use notebooks that run on top of the lake with Presto. And then business users will use the dashboards that an analyst creates and get a view of the business. So really what we're seeing is that lake consumption is across the board. As an example, Uber runs Presto with Hudi on-prem, and half of the company hits Presto every month, right? That is how data-driven the company is. And these are product managers, marketing, every department, even beyond the ones I've listed here. And so with a vision of being truly data-driven, a unified and open data lake can really get you to that point.

The next one is the platform itself. Think about where you want to run it. Increasingly, and I'm biased toward the cloud here, that's where we're focused; Ahana is only doing cloud, because it's very hard to build a product for both on-prem and the cloud. When you have a cloud product like a Snowflake, it's really native to the cloud: built using best-of-breed technologies, running on Kubernetes, highly containerized, highly available, flexible. So think about where you want to run it. If you are looking for an on-prem option, then an open source, do-it-yourself approach might be a good fit if you have a strong data platform team, or you can work with other vendors and get support for those open source products; the communities will help as well. Ahana is the Presto company, and I engage very closely with a lot of community users and guide them on their on-prem usage and how to architect their systems on-prem.

Cost can obviously be a very big factor. Data warehouses can get very expensive, very fast: you're storing your data in two different places, in the warehouse as well as, typically, in S3. With a lake, you usually have one copy of it. You might have some derived data, or some temporary data that data scientists create, but that is part of the workload itself. So think about cost as you think about your platform as well.

Cloud, and we've kind of talked about it already, gives you tremendous flexibility, elasticity, mobility, and global reach; you don't have to manage data centers everywhere. Of course, there's GDPR, and you need to keep your data in certain places, but with Ahana, the way we've architected our product is that we bring compute and Presto to you, in your VPC. So there is a separate Ahana console, but anything that touches data, which is Presto and other components, stays in your environment. That is the new modern approach; it's called in-VPC deployment. Look for that kind of architecture: it allows you to retain control over your data while granting the external entity just enough privileges to run and manage the system.

Security, privacy, governance: all of these are evolving. For Presto, Ahana created a plugin for Ranger. Ranger is one of the very popular authorization tools; it allows for RBAC like you would do in a database, and it can connect to many different systems. So what we're seeing with governance and security is more unified access control across the board. Lake Formation is another one there; if you're running on AWS, it provides RBAC. Presto, and Ahana specifically, integrates with some of these systems so that you can manage governance outside of the database, or outside of the lake, and have more control over it.

The business: what does the business need? The insights from the data, the value of the decisions made from the data, that is what is important.
And with Hadoop, with some of its complexities, it took nine to 18 months to actually see real value from these systems. Presto is much simpler than Hadoop, and Ahana simplifies it even further: in 30 minutes you can be up and running with Presto, querying your own lake in your own environment. That's how easy it is. You need to get value out of your data in weeks, not years, right? And so we need to move to a model where these systems are actually adding value, versus data platform teams spending their time just operationally managing them. The operational complexities have now been removed, and so you have the ability to truly see the value of your data.

And then finally, cost. Think about cost: AWS can get very expensive, very fast. It's flexible, but it can also get expensive. We just rolled out a feature in Ahana called idle state management: if the cluster isn't doing anything, it shrinks down to a single node. So look for these kinds of capabilities as you pick your platforms, so that cost management is part of the platform and part of your decisioning.

Now, building a lake on your own with all of these open source tools can be quite challenging. For the do-it-yourself approach, you need a very, very skilled data platform team; at Uber and Facebook, they have PhDs who handle this for them and actually write code for the engine, fix bugs, because Presto can be quite complicated. In addition, the alternatives have their own limits. Athena is a serverless approach. It's very simple, really easy: it's essentially a Lambda function for a SQL query, right? It doesn't get any simpler than that. However, building a Lambda function on top of a database is non-trivial, and so there are a lot of limitations that come with it: limits on the number of queries that can run concurrently, queuing of queries, we see that a lot. And it's expensive: you're charged $5 per terabyte scanned per query, and that can add up pretty fast. And so that's where Presto and Ahana come in.

I'll talk a little bit about the community; I would love for everyone to participate in it, and perhaps try out Ahana as well. Presto is a distributed engine with a coordinator-worker architecture. The clients are on the top: you can see a data analyst using a BI tool that talks to the coordinator, the work gets spread across all the workers, and then you can connect to all the data sources underneath. It was created at Facebook and is hosted under the Linux Foundation, so the Presto Foundation is a sister foundation to CNCF and Kubernetes. This is very important: there are a lot of open source projects out there that are company-driven, neither Apache nor Linux Foundation. With a community-driven project, you get the benefits of all the innovation going in. The Presto Foundation has Uber, Twitter, Alibaba, Ahana; I was a very early member of the foundation, driving the community; and Intel and HPE have also joined. There are lots of big users of Presto: Athena users, Ahana users, open source Presto users. A lot of great adoption of Presto itself.

And so with Ahana, it's very easy to get started. It's a managed service, so you sign up for the cloud, and we create a compute plane in your account.
We use the AWS best practice of cross-account access with a trust relationship. This is a one-time, 20-to-30-minute process which is fully automated. It creates a VPC, creates EKS clusters; everything from endpoint management on the network to the operating system underneath is handled. And then you have a single pane of glass that you can use to create any number of Presto clusters for different use cases, workloads, et cetera. As I was mentioning earlier, we've split the responsibilities; there's a very clear separation. The Ahana console is responsible for the orchestration: auto-scaling, resizing my Presto cluster from five nodes to 10 nodes, logging, security and access, and billing. And then everything that touches data is actually in your environment; this is the in-VPC deployment. We include Superset as an admin SQL editor, which is also part of the compute plane. And then you can query across a range of different data sources.

So this is what the stack looks like. We provide a managed Hive Metastore that you can create with just the click of a button. In addition, caching is built in, another click of a button, and a worker-level cache gets added to every cluster, which gives you the benefit of not re-reading data from S3 every time you do a table scan and can improve performance depending on the workload. It also has a very native integration with Glue, and through that you can VPC-peer to any of your systems and access data in place, without any data movement. And so that's a really quick overview of Ahana.

And a quick case study. I talked about threat hunting earlier. Securonix is a large SIEM SaaS company; they are a Gartner Magic Quadrant leader, and they are using Ahana for threat hunting. In their case, they have these needle-in-a-haystack queries; they're storing billions of events every day, every week. And they saw much better performance moving from Presto on AWS to Ahana. The stack is S3 and Glue with Presto. I see a lot of questions and I want to save some time for them, but this is one of the use cases where Presto is very strong: interactive or ad hoc querying across large amounts of data at quite good latencies.

So in summary, Ahana Cloud for Presto brings ease of use: you get a nice console, and obviously it's a fully managed system. Better price performance: over 200 parameters come tuned out of the box, so you don't have to do it yourself or understand what task concurrency needs to be set to. And it's open and flexible: we are open source first, we are community first, and you don't get locked in. There is no proprietary format; we use Parquet or ORC or JSON, whatever format your data is in. And so it brings flexibility to you on your lake. We use the in-VPC approach, as I mentioned. It's fully managed across the life cycle: you can stop and restart clusters, attach and delete data sources, and query other databases as well. And it's all cloud native: built to be highly elastic and available, running on Kubernetes, and it lets you bring your own BI tool, bring your own metadata catalog, the operational kind, which is HMS or Glue, and bring your own transaction manager. And so this is how we have simplified, and it's my very sincere attempt at simplifying, SQL on S3, as I call it, or SQL on data lakes. So with that, Shannon, back to you. Let's take some questions.
Thank you so much for another great presentation. Just to answer the most commonly asked question: I will send a follow-up email by end of day Thursday with links to the slides and the recording, so indeed you will get copies of those. And if you have questions for Dipti, feel free to put them in the Q&A portion of your screen, although I've got some questions mixed in here already. So, Dipti: does Presto support XML? We have a SaaS vendor that only supports web services, which we use to pull their XML into our data lake. What types of tools or workflows have you seen for this type of situation?

Yeah, good question. As I was mentioning earlier, Presto supports JSON, Parquet, text, CSV, Avro, some of the more modern formats; it does not support XML, and for good reason. I've actually worked on XML databases, two of them: I added XML to DB2, and I've worked on MarkLogic as well. However, in the modern data lake, the engines are built for columnar formats, and XML is not a great columnar format. So what we've seen is that customers transform XML using Spark to Parquet or ORC and then run Presto on top of that.

So how does Presto querying multiple data sources differ from other virtualization software? Yeah, that's a great question. Presto has a pluggable architecture. Typically I do an hour-long session just on Presto itself, but at a high level, there is a very clean interface between the top of the stack of the database, which is the parser, compiler, optimizer, and execution engine, and the connectors. Because of this highly pluggable architecture, Presto has a lot of connectors. The primary connector, the workhorse of Presto, is the Hive connector, again badly named; it should have been called the data lake connector. It's the connector that connects to S3, HDFS, GCS and others, and 80 to 90 percent of workloads just go through the Hive connector. However, there are 30 other connectors, for example the MySQL connector. Let's say you have some data in MySQL that hasn't landed in S3 yet, because you're running a batch process and it takes 24 hours for it to land, and you need to find some correlation within those 24 hours between data that's in the lake and data that's in MySQL. Presto allows you to do that: you can run a query across these two systems, pull most of the data from S3, pull some data from MySQL, join across them, correlate, and return the results. You can do a join across Redshift and RDS and S3. You can do a join across Kafka and S3. It doesn't even have to involve S3; we have customers who don't have S3 and are federating across other systems. And so that's how it works; it's called federation.

The challenge with federation is that when you have a complex query, not every system is going to be able to handle it, because databases are all built for a purpose. If you run a five-way join on a MySQL in production, it's going to fall over, so please don't do that; run it on your replica at least. But think about whether you really need federation. There are certain use cases that are good use cases, but for others you would likely want to use just the lake itself, for best performance as well as improved stability.
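Here is a hedged sketch of what such a federated query can look like from Python. The catalog, schema, and table names are hypothetical, and it assumes both a MySQL catalog and a Hive catalog are configured on the cluster.

    import prestodb  # pip install presto-python-client

    conn = prestodb.dbapi.connect(
        host="presto.example.internal", port=8080,
        user="analyst", catalog="hive", schema="curated",
    )
    cur = conn.cursor()
    # Correlate rows still sitting in MySQL with history already in the lake.
    cur.execute("""
        SELECT o.order_id, o.status, h.first_seen
        FROM mysql.shop.orders AS o
        JOIN hive.curated.order_history AS h
          ON o.order_id = h.order_id
        WHERE o.updated_at > current_date - interval '1' day
    """)
    rows = cur.fetchall()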
You mentioned S3; does this also apply to Microsoft Azure? Yes, so Blob Store, right, any object store. There are multiple systems that essentially support the S3 protocol, so they're S3 compatible. On-prem you might have MinIO, as an example. Presto will connect to anything that's S3 compatible, and that includes GCS, for example, Azure Blob Store, and other on-prem options as well.

Perfect. So, a very important use case for my business users: does Ahana, or other Presto tools you recommend, have IntelliSense, and any support for user-defined functions? So, I'm not sure what IntelliSense is; maybe you could post a note about whether it's a BI tool, because this is the first time I'm hearing of it. In terms of the second part of the question, user-defined functions: yes, Presto has support for what are called function namespaces. It's starting to become a very popular feature, one that we are adding into Ahana as well; we're exploring it and productizing it. What it allows you to do is have a MySQL at the back to store these UDFs and run them on the coordinator. There is going to be support for both local UDFs and remote UDFs; with remote UDFs, you might have Hive UDFs sitting somewhere else and you would be able to run those as well. There's a great presentation on this from Facebook; happy to share that with you.

And they've clarified: it's auto-naming of columns and tables, like giving suggestions on column names and table names. I see, got it. So schema on the lake is a little bit different from a database, and the reason is that the catalogs manage the schema. Think of it this way: S3 just has objects. These objects are immutable files, right? So you have parquet-1, parquet-2, parquet-3, parquet-4. The catalog, which is either Glue or the Hive Metastore, actually maps these and says: these files are a table, customer, with columns name, phone number, state, et cetera. Now, you can obviously do ALTER TABLE; ALTER TABLE is supported. Suggestions on column names, though, are a layer that sits on top of the schema. You could have tools for that; there are other catalogs coming up, like Amundsen and DataHub and others, that gather schemas across all these different databases as well as the lake, and they might give you better suggestions on these things. So it is a concern, if you will, that now lives outside the database, because it's the deconstructed database. There's no direct support today in Presto or a catalog, but maybe AWS Glue will add it, and then Presto can pick it up as well.

So, do you see the industry using RLS along with RBAC? We're trying to get away from RLS but getting some pushback on that. In terms of access control on the lake, let's take a step back. Typically in any database you would have in-database capabilities: in-database authentication, in-database authorization, RBAC, and all of that. With the lake, what's happened is that the security concerns have been taken out of the query engine; they live outside. Now, Presto obviously does have support for LDAP integration for authentication, for example, and Kerberos, and in Presto there is authorization support such as file-based authorization, multi-user authorization, things like that. But that's not the norm.
The norm is ending up being RBAC outside of the system, where you have roles, and those roles get defined in systems like Ranger or, on AWS, Lake Formation, and that is where they are handled. So Presto, for example, has a Ranger plugin that we open-sourced; it asks: does this user have access to this table? And it goes and asks Ranger whether they have access. There's caching and so on involved, but the policies are passed from Ranger, the system outside, to the query engine, to determine the authorization level for this person and whether he or she has access or not. So that's, at a high level, the norm today that we've seen. Very few of these systems have mapped directly to the cloud; some work well with a lift-and-shift approach, but not all of them do. Ranger has now been lifted and shifted, if you will, to the cloud, and it's increasingly being used. And then Lake Formation is the other one: if you're an AWS user, you might tend to just stick with that approach.

Well, Dipti, thank you so much, as always. This brings us right to the top of the hour. It's been another fantastic presentation, and thanks to our attendees for being so engaged in everything we do. I just love all the questions and the engagement in the chat; I always brag about y'all, and you never let us down. Thank you so much, Shannon. Pleasure as always. Love the questions and the interaction. Feel free to reach out to me: dipti@ahana.io, D-I-P-T-I at ahana.io. Always ready for a coffee chat on cloud data. Thanks, Shannon. Thanks, everyone. Thanks, Dipti. Thanks, everybody. And again, I'll send a follow-up email, and I'll include Dipti's contact info in it; that will go out by end of day Thursday with links to the slides and the recording as well. Thanks, everyone. Hope you all have a great day. Thanks, Dipti. Take care. Cheers. Bye-bye.