Welcome, everyone. Looks like we've got folks from all over the world. In the Slack channel, I've pinned a link to the lab guide we'll be going through; it's also the content for this session. The nice thing about that is you'll have all the takeaways, both the content and the instructions for the hands-on portion. I'm going to give folks a few seconds to click on it. When you get there, it will look something like this. I'm going to make this full screen so it's easier to see, and I can make it bigger as well. For the folks in the audience here, you should all be able to get to this site. Again, my primary way of communicating with everyone is via Slack. For the folks here in person, the site is onehouse.readme.io/docs/intro. I'll give people some time.

How are folks on the Slack doing? I don't see too much activity, so I'm assuming folks are OK. I also have my colleague Rohan on the Slack, so he'll be able to help as well.

OK, so while folks are making their way to the guide, I'm going to start with an overview. The goal of today's mini summit is to introduce you to the Open Data Lakehouse and some of the core technologies you can use to play around with it. If you ever get lost, you can always go back to the introduction; that's where I keep some key assets. In particular, we will be using AWS EC2 machines to query Presto. This just makes it a lot easier: you don't have to install a lot of things on your local machine or deal with dependency conflicts.

Before we dive into the Open Data Lakehouse, it's worth setting the context. I'm not really sure where everyone is coming from, so I want to set the higher-level context: where does the Open Data Lakehouse fit, and why would you use it?

The first thing to recognize is that we have operational data and we have analytical data. This picture, and I should have included the credit, is from the data mesh paper. Essentially all systems today follow this paradigm. On the left-hand side of my figure, you have the data stores that power your business or your application. These are operational systems. They're designed for transactions, in the sense that applications and point-of-service systems are interacting with them. But they're not designed to do analytics. Say you have a bunch of customers or transactions you want to analyze or aggregate; these operational systems on the left, in orange, are not really designed for that.

So the current paradigm is that you move the data from all these operational sources on the left into some analytical system, which is the right-hand side. And today, arguably, the two main destinations for long-term analytical workloads are the data lake, which is the foundation of the lakehouse, and the data warehouse. We'll talk about both. So that's the context we're dealing with, and where the data warehouse and the data lakehouse fit in. Now let's double-click a little bit. There are many ways to visualize and talk about the lifecycle of data.
One particular way that I really like is from a post originally published in 2020 and later updated by Andreessen Horowitz. I like it for its accuracy as well as its simplicity. I know it's got a lot of boxes, but the space is very heterogeneous. Across the top, you'll see the various stages that data goes through. The sources are primarily operational sources; again, this is where all your data is getting collected and used to run your business. Ingestion and transport is essentially what has been known as ETL, although obviously there are newer paradigms than ETL. Then you get to the core piece for analytics in the middle, which they label storage and query and processing. This is your analytical engine, where you store your data for long-term analytics. And on the right-hand side, where you have transformation, analysis, and output, this is where you take the data from that analytical store and produce whatever output you want. If you're building BI dashboards, they're connecting into your analytics data system. If you're building models, you're pulling data from your analytics systems. That's what the output is here.

I created a simplified version of this. You can think of it this way: you have your operational sources, you bring the data into some analytics data platform designed to do analytics on it, and then you have any number of outputs. It might be an application you're trying to build, a dashboard, models, whatever. And today, the two primary analytics data platforms are the data warehouse, which you can see on the top here, and the more emerging paradigm called the lakehouse, which we'll be focusing on today.

Let me go to the Slack real quick. We have more folks joining; that's great. Welcome, everyone.

OK. So for decades, the enterprise data warehouse was, and for many organizations still is, the workhorse of analytical workloads. Modern data warehouses include Snowflake and Redshift; there are obviously still many on-prem, but many are also moving to the cloud. They do have their limitations. They typically work primarily with structured data. The capabilities for machine learning on top of the data warehouse are continuing to advance, but the data lake was purpose-built to do advanced analytics.

So let's talk about the data lake, because the data lake is really the underpinning of the lakehouse. If you look at this figure, on the left-hand side you have the data warehouse. A data warehouse is kind of like a box. In that box, you have the storage, which typically stores data in a proprietary format. Then you have some kind of compute engine that knows how to compute on top of that data. And from a user-facing point of view, you have an interface you use to ask the system for what you want, the primary one being SQL.

Now imagine you could pull out the storage and store things in a scalable, low-cost, open way. That's really where the data lake comes in. Think of the data lake as the storage layer, but very cost-efficient and open. The advantage of this is that you can allow different types of compute engines to operate on that data, and you can scale out the data.
The very first version of the data lake was probably in the mid-2000s with Hadoop, so about 15 years ago now. Hadoop was really a framework of a lot of technologies, but core Hadoop consisted of two parts. The storage part was the Hadoop Distributed File System, or HDFS. It's a file system, like the one on your computer where you dump a bunch of files. The advantage of HDFS was that it's scalable: it's a cluster of computers that you can store data on. It also turned out that this cluster could run on commodity hardware. By commodity hardware, I mean not specialized, purpose-built hardware; you could use just a bunch of hard drives, put them all together, and then put your files into this file system. This was really the first version of the data lake, and a lot of organizations today still run Hadoop clusters.

More recently, the primary storage substrate for data lakes is the object store. An object store stores a bunch of files just like a file system, but it's a lot simpler. It doesn't have the same semantics as a file system; you're not going to be renaming things. You have simple APIs: you can put an object, get an object, delete an object. In the cloud, all the major providers have their object stores. We'll be using AWS's S3; Azure has theirs, Google has theirs. There are also variants you can run on-prem, like MinIO. So today, by and large, any new data lake initiative will be built on an object store, and probably not on the Hadoop file system.

So that's the storage layer. It's good that you have a storage layer, but you need to do something with it. And even with data lakes, it was very clear that a large portion of the world interacts with data via SQL. So there's a tremendous use case for running SQL on top of the data stored in the data lake, and that is what a SQL query engine does. It's an engine that knows how to interact with the data lake: you express yourself in SQL, it interprets the files in the data lake, and it returns the results to you in a very familiar format, which is tables, essentially. Any questions so far, either in person or virtually?

OK, so let's get hands-on. I structured this lab so that I wouldn't be lecturing for too long without us getting hands-on. The first part is getting set up: let's familiarize ourselves with our environment and do a little review of SQL. Prior to this lab, we provisioned a bunch of analytics engines. We'll talk about what's running under the hood, but these analytics engines are what you can submit SQL to. And to make it easy, so that you don't have to install a bunch of things on your laptop, I've provisioned an analytics client that you can SSH into. Once you do that, you can start running queries against any of these engines.

So I'll walk you through this; I already have a terminal set up. From the prerequisites we sent out, you should all be familiar with a terminal, whether on a Mac or a Windows machine. Let's go step by step. The first thing you'll need to do is download the SSH key. Go ahead and download it; if you click on it, you should be able to download it.
The key is called presto.pem. I'll walk you through the first steps; I'm on my local machine now, and I have the key. You're going to need to change permissions on it: do a chmod 400 presto.pem, and then do an ls -l, and you should see that it's read-only. You can always validate things. The way this lab works is you'll have the commands here, and you can always do an easy copy: click copy on the right, it copies to your clipboard, and then you can paste it in your terminal and not worry about typos. Under example results, I show what you should be seeing if you execute the command correctly. In this case, I've changed the permissions of the key so that it's read-only, and if you do an ls -l, you should see the permissions reflect that accordingly.

So now I'm going to get into the box, and I'll do it live. Once you've changed permissions on the key, you should be able to run ssh with the key and log into the client machine. You'll see a screen that looks like this. If you do ls on that client machine, you'll see a bunch of files there. Don't touch any of them.

Question? The link has expired? Interesting. Let me run what you just said, so I can update that in real time. Let me download it and try it myself; I'll replace my key with the key you just downloaded, so I should have the same key you do. Oh, I see. That's weird; I just downloaded it, and I'm using the same link you are. I also just refreshed the URL. Yeah, let's try that. Let me do this real quick. I'll use Safari: if I click this, I go here. Let me delete my key on the desktop and click this again. Interesting. OK, I'll update it right now. Thank you, folks. I've done this many times, so I already have the screen ready to go.

I'm going to create a new pre-signed URL and make it last for 12 hours. Network error. Interesting. I've got to log in, so I'm going to temporarily unplug; you don't get to see all the fun of me logging into my AWS account. MFA, and I'm back in. Let me recreate the pre-signed URL, and it should just work. Worse comes to worst, I'll just Slack it; everyone should be in the Slack. So I create this, go here, go to the lab, and update the key. Let me do this in raw mode, put it here, and save.

I'm also going to put the link in the Slack. All right, try the link in the Slack for those of you there. And since I updated the pre-signed URL, you have to refresh the browser. I'm going to refresh, and I'll do it in Safari to make sure it's all good. Let me click this now. Why is it expired? Let me try it directly from the Slack. No, that shouldn't be the case. Here's what I'm going to do: I'm also going to put the key in the Slack directly. So the key is also in the Slack; I'm obviously going to destroy the key afterward, so it doesn't matter.
Let's see how folks are doing virtually. Are the folks in the room good with the key? Are you able to get the SSH key? OK. Yeah, let me just do this; that's really weird, I am seeing that as well. This is important; this is the core part you need to actually get set up. No, it's not just you; I want to make sure everyone else in the audience is good. That works, but I also want to make sure the folks here can get it. I just recreated this key. Yeah, so the key works. Let me try that real quick; I'm just going to do it the hacky way and then refresh it. OK, that's good. Why is it not showing up there? When I click this, it should all work. So try Google Chrome. I don't know why it's not working in Safari, but it's there. OK, well, if you have the key, you should be able to SSH in. Can folks SSH in? Awesome, great. At least that's a lot easier than having to worry about installing stuff.

All right, now the first command you'll want to run is source /opt/set_env.sh. This basically takes care of all the environment variables so you can access everything. Because we have a number of people registered, we've provisioned three engines. They're all identical, and you can log into any of them. Just to do a little manual load balancing: whatever day of the month you were born on, if it's the 1st to the 10th, pick server one; the 11th to the 20th, server two; the 21st to the 31st, server three.

Then we're going to connect to one of the analytics engines. There's a nuance here: you have to replace "server" with whatever ID you're going to use. In my case, I'll just put one. Then, no surprise, you'll get a Presto prompt, and you can start running some commands. The first command I'm going to run is use tpch.sf1. We'll describe what this all means later. What I'm basically trying to show you is the idea of running an analytic workload: you can see tables, and you can run a query. This also gets some of the prerequisites of interacting with the lab out of the way.

This is obviously a contrived, very toy example, but this is what we mean when we say analytic workload: somehow the data got into the system, and once it's there, I'm able to ask it questions for analytical purposes. In this case, I'm looking at the revenue from line items, with some conditions; I want to see if the ship date is past a certain date. This is actually a query from the TPC-H benchmark. It's a contrived one, but normal folks in an organization will ask whatever they want to ask: who are my top spenders, what proportion of revenue comes from different categories, et cetera.

Any questions? We'll peel the onion. I know I went straight to "hey, let's run a query," but I'll peel the onion in terms of how this is different. If you think about it, on the interface side, if you didn't know what was underneath the hood, it looks like just any other database: you're writing SQL, you're getting your data. That's where we're starting from, and then we're going to peel the onion: how is this different? How does Presto interact here? How do we build this lakehouse?
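To recap the setup steps in one place, here's a minimal sketch of the terminal session. The key name matches the lab, but the SSH username, client hostname, and environment-script path are my approximations of what's in the lab guide, so use the exact commands from the guide itself. The query is TPC-H Q6, which is the shape of the revenue query we just ran, written with the column names Presto's tpch connector uses.

    # On your local machine: lock down the key, then SSH to the lab client.
    chmod 400 presto.pem          # the key must be read-only for ssh to accept it
    ls -l presto.pem              # should show -r--------
    ssh -i presto.pem ec2-user@<client-host>   # username/host are placeholders; see the guide

    # On the client: load the environment, then connect to your engine (1, 2, or 3)
    # using the Presto CLI command from the guide, substituting your server number.
    source /opt/set_env.sh        # path approximate; copy the command from the guide

    -- At the presto> prompt:
    use tpch.sf1;
    show tables;
    -- TPC-H Q6, the revenue query:
    select sum(extendedprice * discount) as revenue
    from lineitem
    where shipdate >= date '1994-01-01'
      and shipdate < date '1995-01-01'
      and discount between 0.05 and 0.07
      and quantity < 24;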
Great, thanks Pratt for confirming that the key worked. Awesome. OK, so that was good. That was really just setting up the lab: getting the key, familiarizing yourself with the environment, SSHing in, and running SQL.

All right, so now let's get to the good stuff: let's talk about Presto. To understand Presto, let's talk a little history and pick up where we left off. I talked about the data lake, and about the first generation of the data lake being the Hadoop Distributed File System. And we talked about the fact that, hey, you have this great storage that's low-cost and accessible, and I want to run SQL on it. The first generation of processing on top of HDFS-based data lakes was Hadoop MapReduce. MapReduce is a programming paradigm where you take files of data and do things in parallel, which is the map, and then you have some kind of aggregation or combining step, which is the reduce. This was written in Java, and most analysts won't be writing things in Java. So you needed a way to express yourself in SQL, even though underneath the hood it would run Hadoop MapReduce on top of HDFS. The thing that was built for that was called Hive.

Hive was developed at Facebook, now Meta, and it provided a SQL-like interface on top of these first-generation data lakes. It was open-sourced in 2008, and it allowed you to type SQL. Now, this was SQL-like; it wasn't actually the ANSI SQL standard. It was called HiveQL. But you had the basic constructs we're all used to: select statements, from clauses, where clauses, things like that.

So Hive was a decent solution, but it had its limitations. It was actually not that fast, so it wasn't conducive to what we call human-scale interactivity. If you think about it, when you don't really know the data and you're trying to get your arms around it, you want to interact with the system at human scale: you type something, you get your answer back at low latency. Because if you don't, you're going to forget what you wanted to do. That low-latency ability was very important, and it led to the birth of Presto.

Essentially, Presto was like a next-generation Hive. It did a couple of things to improve on Hive. One, to achieve low latency, it did everything in memory; classic MapReduce would read and write to disk, which I think was great for resilience but not so good for latency. Two, since they were building a whole new engine, they asked: why do we need a specialized dialect of SQL? Why not build an interface that's ANSI SQL? So that's another really good property of Presto: it exposes a standard ANSI SQL interface. And three: why just talk to HDFS? Why not talk to a whole bunch of other data sources while we're at it? So Presto is built on a connector-based architecture, which we'll talk more about later. With Presto, you can query not only HDFS and data lakes, although that's the focus of today's talk; you can query any number of sources, and you can even query the sources together.

There's a paper written about Presto, and Presto is open source. Today, Presto is heavily used at some of the most successful internet-scale companies, and it's probably the largest lakehouse deployment in production today. Meta has 300 petabytes,
100,000 daily active users, and 300,000 queries per day. Uber has a data lake of 50 petabytes, 7,000 weekly active users, and over 100 million queries per day. And there's ByteDance as well. The other thing about Presto, and you probably know this because you're here at a Linux Foundation event, is that governance of the project is not tied to any particular vendor. It's truly open governance: Presto is under the Presto Foundation, part of the Linux Foundation.

So the reasons people use Presto: one, it's scalable; it works on big data. Two, it's low latency, and what that really means for the end user is an interactive experience. It's really good for ad hoc use cases, by which I mean you don't know what you don't know, and you're trying to get your arms around the data; it's great for that. Three, it has a pluggable architecture that supports multiple data sources. Four, its interface is ANSI SQL, so the millions, maybe billions, of people who already know SQL can start using it; it's not a new programming language. And five, it's open source and open governance.

So that's the high level of why Presto, how it came to be, and why people use it. Presto completes the second part of my original picture: where you had the data warehouse, we know the storage part is the data lake, and for the SQL part, you can use an engine like Presto. Effectively, with the data lake and Presto, you have data-warehouse-like capabilities.

Great. All right, so let's get our hands dirty again, and now we peel back the onion. To no surprise, underneath these clients and these engines, we're just running Presto. The client itself is the Presto CLI; I just made it easy. And in each of these environments, I have a really small, single-node Presto cluster for us to play with.

So first, let's look at the actual files in the data lake. If you're still SSH'd into the machine and you're in Presto, exit out of Presto; just type exit. Then, if you run the AWS S3 CLI command from the guide, you should get a bunch of files back, which probably don't mean anything to you. All this is really showing is that you have a bunch of files in S3, which is our data lake today, just for ease of use. It's an object store with a bunch of files; they're actually Parquet files, and they're in a data lake. And what I'm going to show you is that we can talk to these files and expose that data as a table.

You can follow along. Once you start up a Presto server, you get a UI, the Presto console. You should be able to click on these links, and they should just come up for whatever server you're on. So I can see this cluster, and I can see folks running a few things; this is probably me or some other folks. You can click on a query ID and see what the query did; you can see the plan and how it broke down the work. Again, none of the things we're doing today are necessarily novel or that real, but it's to give you a flavor of Presto, the Presto UI, and how it operates. We'll learn more about the architecture as we make our way through this summit. I have the UIs for all the clusters up here, and I'll give you a few seconds to open the UI and poke around. You'll be able to tell which machine you're on.
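For those following along, the listing step looks something like the sketch below. The bucket and prefix are placeholders, since the real ones are in the lab guide.

    # List the Parquet files backing the lab's transactions table
    # (bucket name and prefix are illustrative; use the command from the guide).
    aws s3 ls s3://<lab-bucket>/pq/transactions/ --recursive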
So prod one is analytics engine one, prod two is the second one, and prod three is the third. Now that you've looked at the files, you can reconnect to the Presto server of your choice; I'm going to go back to one. And here, instead of connecting to tpch, we can connect to the Parquet files I showed earlier. The schema is called pq, which stands for Parquet, so just use glue.pq, and then we can run a simple query on that data.

This data is just transactions. Imagine you're running a retailer, and the data from your operational systems is the sales transactions. On this table, I have the transaction date and time, some ID which represents the customer, the credit card number, the category of the purchase, and the amount of the purchase. So I'm querying this data, but this data is actually files sitting on Amazon S3. That's essentially what's happening here.

And we can see what we did, and we'll get into more detail, if you run show create table. The statement it shows is what actually created the table. When I created this table, I told Presto how it should interpret the data: what the types should be, and what the underlying format of the data is. In this case, it's Parquet. If you're not familiar with Parquet, we'll talk about it a little later; it's arguably the most popular file format that data lakes use for structured data. And the statement contains another clause that points to the specific location in S3.

So I can do something a little fancier. Instead of saying, hey, show me a bunch of records from transactions, I can sum up all the amounts in every category, call that the total, group by the category, and order by the total. Basically, I'm seeing which categories spent the most or the least (a sketch of this query appears at the end of this section). If you're not sure what you should see, you can go to the example results, although I guess I didn't include them in this case.

And again, I can verify. To prove this is running on a cluster, I can go to the UI and see that the query was submitted to the server, and if I look at the finished queries, I can see the query I just ran. And it looks like a bunch of people ran stuff, great; I see at least three queries run by folks. And, I know we used the AWS CLI, but let me also show you my buckets in the console. If I go to my bucket for this lab and go into glue, these are the files under transactions that store all that data. It's a very toy example, but that's where the data lives.

All right, still good? Question from the room: can you add indices? Of course you can. You can add indices with more advanced technologies, and that's part of the lakehouse. Hudi, for example, can implement some of those indices. What I'm showing you right now is what we call very basic SQL on the data lake. And the question you're asking is really why you need a lakehouse: if you want to achieve parity with what you're familiar with in a data warehouse, like indices, you're going to need more than just a data lake and a query engine. You're going to need these other technologies that provide some of that.
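Here's roughly what that aggregation looks like. The table, schema, and column names match what we saw in show create table; the descending sort is my addition, just so the biggest category comes first.

    -- Total spend per category, biggest first (schema glue.pq from the lab).
    use glue.pq;
    select category, sum(amount) as total
    from transactions
    group by category
    order by total desc;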
Now, that doesn't mean Presto itself doesn't have any form of optimization. One, you can obviously write your query in a better way; folks know that. Two, like a query engine you would find inside a data warehouse, Presto has an optimizer, so it does look at your query and tries to optimize it. In addition, the way you lay out your data in the data lake can provide hints to the query engine on how to optimize the query.

A very common one is what we call partitioning. Yes, you could have one fat directory where you put all your data and just grow it. But you could instead partition your data across a key dimension, and a very common one is a timestamp. So now if I write a statement that says, hey, give me all the transactions in January of 2023, and I've partitioned by the date and Presto knows about it, it's not going to try to scan all the data in your data lake; it's just going to go to that directory and scan those files (there's a small sketch of this at the end of this section).

So performance comes in many forms: how do you make the data access more efficient, and how do you lay out your data more efficiently? You have other considerations too, like how you should combine your files; it turns out that for analytical workloads, larger files are better, and you don't want many small little files. I'm getting a little ahead of myself, but I'm giving you a peek: these are very simple examples, but what you're asking about is a very real concern that folks have been thinking about for the last ten years.

A question came from the room; I'll recap for the recording. When we think about a plain data lake, it's completely immutable; there's nothing you can do with the data, and you cannot manage the data. This is what a lakehouse brings: the ability to do the things you can do in a warehouse on a data lake. The ability to have ACID transactions, and to ingest and manage the data. Going back to your point about cleaning the data, or managing the small-file problem that someone was hinting at, this is what you can do with something like Hudi. We can go over the different services that help bring those data warehouse capabilities onto the data lake, like indexing that optimizes for write performance as well as read performance, and things like that. There was also a question about whether it keeps statistics: yes, like metrics and things like that; there are some.

Great question. For the folks on virtual, I apologize, I should repeat the questions. The question was: does Presto take into account any optimizations for query performance? And the answer is certainly yes, it comes in many forms, and the lakehouse builds on top of that.

So up to this point, we've gotten a little hands-on with Presto. You're able to interact with it, and you can see that it's ANSI SQL: it walks like a duck and talks like a duck, and the duck here is a database. But let's again peel the onion and see what happens under the hood.
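Before we do, here's the partition-pruning sketch I mentioned. This is hypothetical: it assumes a version of the transactions table partitioned by a date-string column ds, which our lab table doesn't actually have.

    -- Hypothetical layout: s3://<lab-bucket>/transactions/ds=2023-01-15/... and so on.
    -- Because the filter is on the partition column, Presto reads only the
    -- January 2023 directories instead of scanning the whole table.
    select category, sum(amount) as total
    from transactions
    where ds between '2023-01-01' and '2023-01-31'
    group by category;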
This is also where we're going to see some of the challenges, and why the lakehouse adds more functionality on top of basic SQL on the data lake.

So first, let's talk a little about the architecture of Presto. Presto is a distributed system; that's how you get the scale. And like many distributed systems, there's usually a brain, and there are the executors of the work. In Presto, the terminology for that is a coordinator and workers. A single cluster will have a single coordinator, which is the brains: it's the service that takes the query in, processes it, and handles planning and optimization. But the actual heavy lifting of accessing the data, processing the data, and shuffling the data happens on the workers. So a typical cluster will have a single coordinator and many, many workers.

Now, a very astute person would ask: is the coordinator a single point of failure? And today the answer is yes; it's the brains of it. But Meta has released the ability to use multiple coordinators, or disaggregated coordinators. So there is an ability coming where you can have more than a single coordinator for a Presto cluster.

Question from the room: are these separate machines? These are processes, yeah. Underneath the hood, you have a bunch of configuration files, and when you start the Presto server, those configurations tell the process: am I a coordinator, am I a worker, or am I both? Obviously this gets orchestrated. Let's say I have a hundred machines; I'm going to have to spin them all up, start all these processes, and make sure they find each other. There are obviously solutions that help make that easier, but that's how it works underneath. So let me just show folks here: if you go to the docs and look at deploying Presto, you'll see samples of these configuration files. This is where you say how much memory to allocate to the process, and whether it's a coordinator (coordinator equals true), a worker, or both a worker and a coordinator. And as the picture I showed earlier lays out, you'll have a single coordinator and a bunch of workers.

The other important thing to know about the Presto architecture is the concept of a connector. A connector handles all the work of making your data source look like a table to Presto. Now, this is pretty straightforward if the underlying data source is already relational in nature, because it's just mapping a table to a table. But, for example, how did we make all the data in a data lake look like a table, when that's all files? The connector did that for us.

So this is a nice picture that gives another double-click on the architecture. The gray box in the upper left-hand side is the coordinator, and you can see all the different submodules within it. It takes in the query, there's a queue, it does some planning and some optimization, it's responsible for scheduling work on the workers, it knows how to interact with metadata, yada, yada, yada. And then the workers interact with each other; they have APIs that fetch the data from the underlying data source and then obviously return the data back to the end user.
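For reference, the deployment docs show configuration files along these lines. This is a sketch: the property names come from Presto's deployment documentation, but the port, memory size, and discovery URI are illustrative values, not the ones from our lab clusters.

    # etc/config.properties on the coordinator (values illustrative)
    coordinator=true
    node-scheduler.include-coordinator=false
    http-server.http.port=8080
    query.max-memory=4GB
    discovery-server.enabled=true
    discovery.uri=http://coordinator-host:8080

    # etc/config.properties on a worker (values illustrative)
    coordinator=false
    http-server.http.port=8080
    query.max-memory=4GB
    discovery.uri=http://coordinator-host:8080

A single-node setup like our lab's would instead set coordinator=true and node-scheduler.include-coordinator=true, so one process plays both roles.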
So that's the architecture. How is this presented to the user? There's SQL, you know that. But there's also a data source model, a hierarchy of how these sources are presented to the user, and it's a three-layer hierarchy. When you go inside Presto and interact with it, you navigate three layers: the top level is a catalog, then you have a schema, then you have a table. We'll see this in real time. It's very similar to database instances, where you usually have some concept of a global space and sub-collections of tables.

Question from the room: the Hive Metastore primarily has catalogs and databases rather than schemas, doesn't it? Yeah, so the Hive Metastore, which we'll get to later, is another piece of the magic: if you have files in a data lake and you have Presto, the query engine, how does it know what files map to what tables? That data is stored in a metadata catalog, like a Hive Metastore. Follow-up: so presumably you have to map what a database is to something like a schema, since Hive has a concept of schema as well? Yes, you have to register it with the metadata catalog, but once you've registered it, the configuration in Presto takes care of all that mapping for you. We'll get into it; let's go through this, and then the lab will make it clear, hopefully.

So the top level is called a catalog, and a catalog is really just a configured connector. We can try this; you can follow along if you want to, but this is not a mandatory part of the lab. I'm still connected to Presto here. If I type show catalogs, these are the catalogs I have. In my case, my Hive-compatible metastore is actually Glue, so I have a glue catalog. I have a mysql catalog, which is a connector to a MySQL instance. And tpcds and tpch are just built-in connectors for benchmarking data, essentially.

Now, a schema is just a collection of tables. I can inspect what's inside a particular catalog: if I want to see all the schemas inside tpch, I run show schemas in tpch. For those of us familiar with TPC-H and TPC-DS, we have these things called scale factors, and that's what these schemas represent. Scale factor 1, scale factor 100: basically the size of the data. Each schema is another logical unit, a collection of tables. So, for example, I can say show tables in tpch.sf1, and it'll show me all the tables. And the tables are identically named across schemas: even if I picked another schema, they're different tables, but they're named the same. That's how you get namespacing of tables.

And then you have a table, which is just like a database table: an unordered set of rows with typed columns. So I can say select * from tpch.sf1.nation limit 10, and there's the data. I can obviously go back and run show catalogs. Once you get to know Presto, if you're new to a particular system, one of the first things to do is familiarize yourself with what catalogs are available to you and what schemas are there. So I can, for example, run show schemas in glue and see a bunch of things, and show tables in glue.pq to see the tables in there. Now, it does get annoying that, in this case for example, in order to specify a particular table, you use a fully qualified name.
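Putting the three layers together, the quick tour from this section looks like this (the example output in the comments assumes the lab's catalogs):

    show catalogs;                            -- e.g. glue, mysql, tpcds, tpch
    show schemas in tpch;                     -- sf1, sf100, ... (the scale factors)
    show tables in tpch.sf1;                  -- customer, lineitem, nation, ...
    select * from tpch.sf1.nation limit 10;   -- fully qualified: catalog.schema.table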
That is, catalog.schema.table. It gets annoying, right? So what you can always do is use the use keyword and specify the catalog and the schema. Once you do that, it implies that that's what you're talking about. So now if I just do show tables, I don't have to specify everything. And if I run the same query without qualifying the table with the catalog and the schema, it works, because I've already told Presto that for this connection, namespace everything to this catalog and schema.

So that's how you think about data within Presto. It might seem kind of obvious, and it kind of is for data that's already relational; the magic is in how you map non-relational data, like NoSQL or lakehouse data, into a table-like format. So now we'll peel the onion even a little bit more, and after this we'll take a break, because we'll be at about an hour. We're doing very well; you're a very quiet bunch, so we're actually getting through this pretty quickly. Hopefully this is useful information.

Oh, I did get a question on Slack: what are scale factors? A scale factor is just how big the data is; it's a benchmark term. Scale factor 1 is some number of records, I don't remember how many, and then scale factor 10 is just more records of the same data. The larger the scale factor, the more data you're processing. It's a standardized way of sizing the benchmark; if you Google TPC-H or TPC-DS and look at the results and the spec, you'll see the different scale factors there. That's what a scale factor is in this context. For those who didn't hear the question, it's on my screen from the Slack.

OK, so I think, as someone alluded to earlier: we have all this technology already built in databases and data warehouses, so why are we reinventing it on the lakehouse? The reason is that the lakehouse can be scalable, and it's also open. And why is it important that the data is open? While we're focused on Presto in this particular session, any other engine can technically plug into that same data, because it's in an open format. That's part of the beauty of the data lake: you can have other engines, maybe non-SQL engines like machine learning workloads, interact with the same data.

So now that comes down to: what are these open formats I'm talking about? If we're talking about structured data, I mean data that's inherently tabular, data you can represent as columns and rows. This is very different from, say, data that is an image, a piece of text, or audio; you don't need this concept of tables there, and usually the engines that work on those types of data know how to handle them. The open file formats I'm talking about are tabular formats.

A very common one most people are familiar with is comma-separated values, or CSV. More broadly, these are really just text-delimited formats: each line of the text file represents a row, and you have some delimiter that specifies the separation of columns. In a CSV file, this separator is a comma, but it can be a tab, whatever you want. Another very popular text-based one is JSON, or JavaScript Object Notation.
JSON is basically a file format composed of key-value pairs. It can be nested, and this same data can also be translated to a tabular format. The reason JSON is somewhat popular is that, remember, if you go back, a lot of the data comes from operational systems, and a very common way you'll get that data is via an API. When you get data from an API, the response is usually in JSON format, so you're writing JSON. But the end user probably doesn't want to query it in a JSON way; they want to query it like a table.

And I've actually already set this up. If you look at the schemas in glue (pq, remember, stands for Parquet), I have the identical table in JSON. So if I use glue.json and show the tables, there's the same transactions table. It's actually the same table, but underneath, it's in JSON format, and I'm interacting with that JSON data like it's a table. You don't have to trust me on that; we could actually download it. Since I have time and you're an easy audience, I can literally go back to glue in the console, go to json, go to transactions, and download a file to my desktop. Let me unzip it and take a look: it's JSON, single-record JSON. So it's actually JSON in the lake, but I didn't interact with it in a JSON way; I interacted with it in a SQL way, and Presto and the connectors did all that magic.

It's nice that I can show you JSON; if I downloaded the Parquet, it wouldn't mean anything to you, because it's binary. But Parquet is actually the better format, and these columnar file formats are arguably the most popular. The two popular ones are ORC and Parquet; they're both open source, and Parquet is probably the de facto standard today. The advantage comes back to performance and storage. They're columnar, and they carry metadata within the files, so an intelligent query engine can look at that metadata and decide whether it even needs to read a given file, based on the data in it. The other thing is that, because they're columnar, the compression ratios are a little better, so they're cheaper to store on disk.

Now, as the gentleman alluded to earlier, we have this other concept, the metadata catalog. When I typed that query into Presto, select category, sum(amount) as total from transactions group by category, how did Presto know where those files actually live on my data lake? It asked the metadata catalog. That's essentially what the metadata catalog did: it says, oh, that's where it is, and Presto knows how to go fetch it. This was actually a part of Hive that was carried over. If we go back to the introduction of Presto, you'll see this thing on the side called the metastore database. This is what did the mapping between the files in HDFS and Hive, the engine. So the Hive Metastore has persisted, and it's a very popular metadata catalog. AWS Glue provides a Hive-compatible metastore, and that's what we're using in this lab today.

So let me go to Glue real quick; I'm going a little off script, but that's fine. I look at my tables, and I have a bunch of tables here. Let me look at transactions. So for this table called transactions, guess what?
Glue records where it's located in S3. That's how Presto knows: it talks to Glue, which is essentially a Hive-compatible metastore, and finds out where the data is. Obviously, there are other things you need to provide. Just because you know where the data is, can you actually access it? Do you have the privileges to access it? So as part of the configuration, you need to provide the right credentials, keys, or whatever else is required. And that's what this picture shows: there's a mapping between the data lake, in this case S3, and the catalog and database, and that's done with Glue in this case.

So how would you actually create these? If I were to create them, I would do something like this; don't actually run these commands, I've already done it for you. If I want to create a schema called glue.json, I say create schema and specify a location. Then when I create the table, I define it: when you read data from these files, you should map the columns and the types this way, and you should interpret the format of the data as JSON. I show examples of it in JSON and in Parquet. This is also where you can specify, for example, a partition key: you can say, hey, the data is actually partitioned this way, so when you query this table, please be aware of the partitioning.

All right. I think this is a good stopping point. For those of you who wanted a quick one-hour whirlwind tour of Presto, you got it. The next part will be a little more advanced, and we'll delve more into the lakehouse. So let's take a good, maybe ten-minute break, and we'll come back in ten minutes.

There was a question on Slack: does Azure have a service equivalent to AWS Glue? I am not sure; I'm not as familiar with that side, so I'll have to rely on other folks on the Slack, or I can Google it during the break and let you know. I'll note that AWS Glue is actually quite loaded; there are a lot of components to it, like the crawler and things like that. I'm really specifically focusing on the metadata catalog. It looks like Rohan can figure this out. Yeah, thanks Rohan; if you can search for that, that'd be great.

All right, I'm going to unplug for a while, turn off my mic, and take a break. See you back in ten minutes, roughly at 10:17 a.m. Pacific time. Feel free to ask questions asynchronously as well.
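While we're on break, if you want to see roughly what that DDL looks like, here's a sketch. Again, don't run these: the bucket name, column names, and types are illustrative, and the table properties (format, external_location, partitioned_by) are from Presto's Hive connector.

    -- Create a schema whose default location is in S3 (bucket name illustrative).
    create schema glue.json with (location = 's3://<lab-bucket>/json/');

    -- Expose JSON files as a table (column names and types are illustrative).
    create table glue.json.transactions (
        trans_ts   timestamp,
        cust_id    bigint,
        cc_num     varchar,
        category   varchar,
        amount     double
    )
    with (
        format = 'JSON',
        external_location = 's3://<lab-bucket>/json/transactions/'
    );

    -- Same idea for Parquet, this time declaring a partition column.
    -- Partition columns must come last in the column list.
    create table glue.pq.transactions_partitioned (
        cust_id    bigint,
        category   varchar,
        amount     double,
        ds         varchar
    )
    with (
        format = 'PARQUET',
        external_location = 's3://<lab-bucket>/pq/transactions_partitioned/',
        partitioned_by = array['ds']
    );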