So this is the last lecture. We'll spend the first 10 minutes talking about all the logistics for wrapping up the end of the semester, and then we have Marcel Kornacker from Cloudera here today, who's going to give us a guest lecture on Impala and Kudu. Impala is the OLAP system that he has led development on at Cloudera for seven years now, and Kudu is their storage layer that can do more OLTP-ish things. So put together, you get an HTAP system, which is the main thing we've been talking about this entire semester. So we'll finish off talking about logistics, then we'll do a quick review for the final exam, and then Marcel will spend the rest of the time talking.

All right, so just to wrap up everything that's going on, this is the same schedule that I showed you last time. We'll have the final exam in class this Thursday, May 4th at 12 p.m. Again, it's going to be a combination of multiple choice questions and short answer questions. And it'll be an hour and a half long, sorry, an hour and 20 minutes long, closed notes, and I'll cover at the end what topics are fair game for you guys. The second round of code reviews will also be due on May 4th at 12 p.m. Some of you have already posted new PRs on GitHub. Other people have mentioned that there are issues when you pull in the latest version of master, because that brings in the new LLVM engine, and that causes problems. So we can discuss that as well. And then the final presentations will be not in this room, but in Wean Hall, or somewhere else, I don't know where. That will be at 5:30 p.m., and I think that's a Tuesday. We'll have 12 minutes per group, and then we'll have a contest where everyone votes for who has the best project. You can't vote for yourself; you have to vote for somebody else. Whoever wins will get the grand prize, but everyone will get a database t-shirt in the end, donated from various companies.

So then for the final code drop, I actually misread the schedule again. Grades aren't due for Pittsburgh students until May 17th; it's grades for the Qatar students that are due on the 11th. That's why this was originally the 11th. So the final code drop for you guys will be due May 15th. This is where you're going to send us a PR on GitHub where your code can cleanly merge into the master branch, you pass all the tests, and you include documentation that explains what your code is doing. And you have additional tests that show that you can correctly verify that your implementation, for whatever your project is, is actually working correctly. And again, as we've talked about before, it has to be a real test that programmatically checks whether the database system is actually performing the function you want it to, and not just printing out output that says things are OK. That's not a real test. And I realized also, because the LLVM engine is a major change we put in the system that may have affected some of you, what we'll do is we'll have a way to disable it so you can just go through the regular interpreted engine. For some of you, this won't be an issue at all. I know that in the case of the compression team, the LLVM engine doesn't actually make calls into the data table anymore. It goes directly to memory, which is what you want an LLVM engine to do. But that fouls up the way they do compression.
So for them, again, we'll provide a flag that can disable that so that everything still works. And as far as I know, nobody else should have the same issues. I think for the constraints team, everything still goes through the data table. For the other team that's working on LLVM stuff, we can talk about that separately; that's more complicated. But as far as I know, for everyone else, this should not be a major change that causes problems. Is anybody else having similar issues like the compression team? Has everybody else merged in the latest version of master, or no? All right, yes, Patrick. Our team has a minor problem: when we merge that in, we have some features that support the LLVM engine. OK, so let's discuss this tomorrow. All right, again, I'm super flexible. I realize this is a complex project, there's a lot of moving parts, and a lot of people are working on the system. So we'll do the best we can to accommodate everyone. But the end thing is that the final PR has to be done by May 15th now, and the website has been updated to reflect this.

For the extra credit, unfortunately, we broke the website this weekend. If you started working on your article on the website, all the data is still there. It's in MySQL; it's not in our database system yet. So everything's safe. What will happen is we're going to try to fix the website today and restore everything. If we can't do that, then what we'll do is provide you with a JSON file, you just fill in the information, and you can use that to submit your article. Then when the website is fixed, we just upload it into the system. That's how we did it last year. And same thing, the new due date for this will be May 15th at midnight. That gives me enough time to grade it over two days and put your grades into S3.

OK. All right, so I brought this up last time and I'll bring it up again on Thursday: I need everyone to fill out the course evaluation. Don't blow me off like you blow off other classes. I actually do look at your feedback, and it does help me change how we do things in the course. Like, last year we had people implement the BW tree, and in the course evaluation everyone said that was too hard. That's why we did the skip list this year. So you can put whatever you want; just be mindful that I already know some of these things, so whether you want to put them in or not is up to you. I already know that Autolab sucks. I already know that it's a pain, that it wasn't running all the time. I already know that you guys want me to make the system compile and work on the Mac. All right, we tried that; that was hard. I also recognize that all the changes we're making to the source code cause conflicts and are annoying. Hopefully in the next year the code will be more stable, so future students won't have this problem. And I recognize, again, that my personal hygiene has been declared offensive. I've been seeing a doctor about this; it's a work in progress, so we'll see how it goes next year. OK. They gave me this special shampoo to mask the odor, but it's not working well.

So, any questions about logistics, the final project, the extra credit, or course evaluations? OK, so for the final exam, again, it's going to be one hour and 20 minutes. It'll be three to four questions.
It'll roughly take you, depending on how fast you are, maybe less than 10 minutes per question. It's closed notes. And again, the reason why it's closed notes is because it's not meant to be, did you read this paper, can you tell me what they said in this one paragraph? We've read a lot of papers and covered a lot of topics, and that's not really the point of this course. The point of this course is for you guys to understand how you would build a modern database system by combining all these different ideas from the different topics we talked about. So that's why it's closed notes.

The topics we developed the final exam around are going to focus mostly on OCC and MVCC concurrency control. You have to know the different kinds of storage models we talked about, NSM versus DSM, how you would actually do them in an in-memory database system, and what the pros and cons of each of them are for different types of workloads. You should know about the different query processing models we talked about and what their trade-offs are: the tuple-at-a-time iterator model, or Volcano model; the materialization model that's in H-Store or MonetDB; and the vectorized model that was in VectorWise. You should know about the different join algorithms. Again, I don't care about the complexity of sort-merge versus hash join. I care more about how you would implement a modern variant of these things: what are the trade-offs, what are the different design decisions you have to make to do a parallel hash join algorithm? You should know about logging schemes. I'm not really going to focus on checkpointing schemes, but for logging schemes you should understand the difference between logical logging versus physical logging or command logging. When would you want to use one versus another? What are the pros and cons of each? You should know about the different types of indexes we talked about. We talked about a bunch of different OLTP indexes: the BW tree, the skip list you guys implemented, B+trees at a high level, and T-trees. And then we talked about these OLAP indexes from the SQL Server paper, these bitmap things and the columnar indexes. So again, you should understand what they are, what their design implications are, and how they might fit in with the different storage models or query processing models. When would these indexes be a good idea versus a bad one?

Then the last three topics we're going to cover are optimizer implementations and cost models. We spent three lectures talking about the different types of optimizers you can have: top-down versus bottom-up, stratified versus unified, Cascades, Volcano, System R, all these different things. You should understand what their design trade-offs are and how well they would work in different scenarios. And then cost models, at a high level: what are the different things you can include in a cost model? For in-memory databases, we mostly focused on the number of tuples that each physical operator would emit. There are other things you have to consider in a logical operator versus a physical operator, so you should be aware of those. Compression schemes, we spent a lecture on this: dictionary encoding, delta encoding, bitmap encoding, Mostly encoding from Redshift. And again, same thing, what are the pros and cons of each of these things?
And then the last topic would be the last quarter of the semester, where we spent time talking about execution optimizations: query compilation, code gen, JIT stuff, LLVM, or C++. And then we spent two lectures talking about vectorization. We talked about BitWeaving, and we talked about just doing parallelized sequential scans and predicate evaluation. The final exam is not going to cover anything from the last two lectures on larger-than-memory databases or non-volatile memory databases, because that was last week. Any questions? Any concerns? Again, I passed out the practice final exam last class, and that's roughly what the questions on the final exam will look like. So if you're comfortable with those kinds of high-minded questions, if you will, then that's what the final exam on Thursday will look like. Okay? Okay, that's it.

All right, so like I said, it's been an awesome semester, and I always like to finish off by having somebody from the real world come in and tell you how it is, or how academia is completely wrong, or in some cases where it was completely right. And Marcel, as I said, is the lead architect of Cloudera Impala. He's been there for seven years. Prior to that, he was the lead architect of Google F1, would that be correct? The query engine. The query engine at Google F1. And the thing that helps him pick up women, or men, whatever he wants, when he goes out to bars, is he tells everyone he worked on TimesTen before it was acquired by Oracle. So when you think Marcel, just think TimesTen, okay? And he has a PhD, was that? Definitely selling the book. Definitely selling the book. He has a PhD from Berkeley, and he was Mike Stonebraker's student in the 1990s. And I sent out that article from Wired about how awesome he is at baking bread, so if you have any bread questions, you should ask Marcel after his talk, okay? All right, so let's give it up for Marcel. Thank you.

I'll come to you in a minute. Oh, I see. Does that work? Yeah. One, two, three, two. See, this is on, this is also on. Fantastic. All right. So just for background, this talk is about the combination of Impala and Kudu. Kudu is a new storage manager for the Hadoop environment that gives you online updates, and I'll talk more about it in a few minutes. This was a talk created for a trade show called Strata, mostly for a not-too-technical, non-database audience, so I'm going to splice in sections from the tech talks for Impala and Kudu themselves. But also please feel free to interrupt with questions during the talk; don't hold back and wait until the very end.

So here, a brief overview of what Hadoop is. Actually, quick question for the audience: who has a rough idea of what Hadoop is? Okay, some minority. Okay, so I'll talk more. Basically, Hadoop is a collection of components that are meant, vaguely, for data management. That sounds very vague because it is; there is no really good definition. Of what you find in Hadoop, let's focus on this part and that part today. There are a number of storage managers in the Hadoop environment. HDFS is the best-known one: the file system, patterned after Google's GFS. Then we're also going to be talking about the thing in the middle there where it says relational, Kudu. This is a new online storage manager that, in contrast to a file system, gives you the ability to insert, update and delete records, meaning it has a record-oriented interface as opposed to a file interface.
And then for the purpose of this talk, I'll be mainly focused on relational data management. For that, we're gonna use the Impala SQL engine and talk about how it interacts with Kudu, basically. So here, like I said, the interesting part is relational data management, so it's gonna be Impala, and I'll be contrasting that with HDFS and a little bit of HBase. Very little HBase. Here again, just talking about the storage managers.

So taking a step back: what do relational data management slash analytic applications look like, quote unquote, in a traditional Hadoop setup? Traditional, again, because Hadoop is not that traditional, but HDFS has been around for a while. You would basically use HDFS to store your data files. Your data files are prepared through some ingest pipeline. You typically will use a columnar format such as Parquet. And once you have your data sitting in Parquet in HDFS, you can then use Impala to point at that physical data and run your SQL queries. Impala is fast enough that you can actually run interactive business intelligence applications such as Tableau or MicroStrategy on it, which is what a lot of large companies do.

So what does that give you? First of all, it gives you the flexibility of the Hadoop environment, and in particular the use of multiple data formats, although for performance reasons you typically want to use something like Parquet, meaning a columnar format. It also gives you the proven scalability of HDFS. HDFS has been around for quite a while. It scales up to, you know, hundreds of nodes, I wanna say, easily. It scales up to very large amounts of data, and the availability story also works reasonably well. That's all great. HDFS also has performance optimizations: when you are doing large data scans and you're doing that locally, meaning you are initiating the scan on data that happens to sit on that same node in HDFS, then HDFS will actually bypass the DataNode protocol and you can talk to the underlying file system directly, meaning you will get the speed that you would normally get from the Linux file system. So, you know, pretty much what the hardware gives you. HDFS also has a caching component that allows you to map file data into memory, and then instead of reading it through the DataNode, which is still a process-to-process copy and basically would entail checksumming and stuff like that, you simply map it into your address space and you get copy-free access. So, all great things.

However, doing something like updates in HDFS is impossible, or very painful if not impossible. HDFS is an append-only file system, and obviously if you have a columnar format you don't wanna go in there and try to change individual records. Then there is the other thing called HBase. HBase is also an online storage manager. HBase has been around for a while, and it is used by a number of our customers to do stuff like point lookups for serving web apps. HBase does allow you to do online updates, so you can do insert, update and delete against individual records. However, HBase is patterned after Bigtable, and as such it inherited Bigtable's, I wanna say, very flexible, overly flexible, schema and physical schema characteristics. So in Bigtable, basically, the columns are really part of the data and every row can look different.
And the way HBase materializes data to disk and makes it persistent means that when you read data back, you actually have to do a lot of merging operations, and those are expensive, because HBase basically writes data out into SSTables and then has to merge them back based on the primary key, which is a string. So these are all very expensive operations, which means that in effect you are getting a small fraction, a single-digit percentage, of the actual hardware capability when you're doing sequential scans. So for the purpose of updates it works; for the purpose of doing analytic applications it doesn't work at all, I wanna say.

This is where Kudu comes into play. Kudu is an online storage manager like HBase, but unlike HBase it is also optimized for read throughput, meaning raw scan performance. It has made some architectural choices that avoid expensive merges and things like this. Kudu accumulates data in a row format in memory, which means you can do relatively inexpensive updates in memory, and then when it writes data to disk it transposes the data into a columnar format very similar to what Parquet gives you. I'll report some numbers later, but the goal here was to create a storage manager that can handle insert, update and delete and do point lookups very efficiently, but also give you very efficient scans. And the goal is that, in combination with Impala, you get real-time analytics and also update capabilities that come close to what you can get today with the best systems in Hadoop, meaning HDFS and HBase, but you're getting it in a single system and you don't have to move data between those systems. Does Kudu support transactions? Kudu right now only supports single-row transactions, like HBase; the goal is to add multi-row transactions in the future.

So the real vision behind this is to replicate the user experience of a single-node RDBMS. From an application programmer's perspective, an RDBMS is great. It gives you comprehensible semantics. It gives you a lot of flexibility. You can change your data, you can query your data. Of course, the single-node RDBMS has no scaling, and you can't use it for very large applications, obviously.

Let's talk about Impala for a second. Impala is the query engine component of the stack that I'm talking about. What does Impala look like? Impala is a SQL engine, basically a parallel SQL engine. It is specifically not a monolithic database system, and I'll talk about the differences in a little bit. This is something to keep in mind when comparing database functionality in Hadoop with a traditional database system like Oracle: there are differences, operational differences, that also impact the user experience and how you would interact with the system. But here, Impala is basically a parallel database system, MPP. It was designed for performance. So unlike a lot of other systems, it didn't use Postgres as an execution engine; it was actually written from the ground up, and it was particularly designed for the Hadoop environment. That means it utilizes components of the Hadoop environment that allow it to interact with the other frameworks that are also running in Hadoop. So typically our customers do data preparation and ingest with things like Hive or Spark or MapReduce. They might do batch transformations also with something like Spark.
So there are a number of frameworks that are interesting, and the goal is to retain the flexibility and the data-sharing ability that you typically find in a Hadoop environment. So Impala supports the Hive metastore as a metadata repository, meaning that when you create a table in Impala, it is visible in Hive and vice versa, and in Spark as well. Impala obviously uses the storage managers that are prevalent in Hadoop; we talked about HBase and HDFS, and now Kudu. And if you're talking about HDFS-resident data, Impala also supports the file formats that are standard in a Hadoop environment. Parquet is one of them. There's also Avro. We don't really want to talk about RCFile, but our customers still use text, which is convenient for ingest, et cetera. So Impala supports those standards too. Now, because the target application is in particular business intelligence environments, for interactive querying, Impala also supports the standards that have evolved in that space. That means you need ODBC and JDBC for connectivity, you want Kerberos and LDAP for authentication, and you want SQL-style role-based authorization. These are all standards in the corporate environment, and if you don't support them, then people don't want to use your stuff.

A little bit about the history. Impala was started around six years ago, and the first commit was made in May of 2011. It was released in beta in October of 2012 and then went GA in May of 2013. So it has been GA now for four years, pretty much exactly four years. Impala is open source; it has always had an ASF license and is now transitioning to ASF project management, and it has some external contributors. We got something from Intel, we had some small things from Google, and there's also Arcadia Data, a BI vendor we're collaborating with. So if you want to hack on Impala, that is pending approval by your master here, obviously.

Oh, right. So what does Impala look like from the user's perspective? Like I said, the usage scenario is business intelligence applications. So what you're doing in Impala is, when you create a table, you're really creating a virtual view over the physical data stored in HDFS, for instance. Here's an example of a customer table: you're basically telling Impala that the file format is Parquet, here is a directory location, and all of the files underneath that directory are now part of that table. Impala and Hive also give you physical schema design capabilities such as partitioning. The partitions are then created as subdirectories. So the logical partitioning is directly translated into physical partitioning in the file system: for each partition there is a single partition directory, and again, all of the files in that directory are part of the data. That also means that when you copy a file into a partition directory, it automatically becomes part of the table. So keep in mind that this is an ingest path that is very different from that of a monolithic database system, where all data gets inserted through a funnel that starts at the top of that database system. How does this work, does the Parquet API enforce, like, a non-null field? No, I mean, does it have NOT NULL? I'm not sure, I think it does, but every Parquet file comes with its own schema.
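To make that concrete, here is a rough sketch of the kind of DDL being described, with hypothetical table, column, and path names:

    CREATE EXTERNAL TABLE customer (
      id    BIGINT,
      name  STRING,
      state STRING
    )
    PARTITIONED BY (year INT, month INT)
    STORED AS PARQUET
    LOCATION '/user/warehouse/customer';

    -- Register a partition; any Parquet files copied into the
    -- corresponding subdirectory then become part of the table.
    ALTER TABLE customer ADD PARTITION (year=2017, month=5);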
So what you now need to make sure, when you're copying a data file in, is that it matches that schema. If it doesn't match that schema, you're gonna get a runtime error. You will also have schema evolution: you're gonna add fields. You typically don't wanna drop fields; you might deprecate fields. And you need to do that in a particular manner in order for you to be able to then read the data. But that's outside of Impala, that's like... That is outside of Impala, yeah. Impala will help you with that in that you can add columns through Impala and it will then do the right thing in Parquet. That's basically what it is.

So, Impala implements standard SQL, basically ANSI SQL-92 with the analytic function extensions, and supports standard data types including DECIMAL. DECIMAL is very important for financial applications, right? Nobody wants to deal with floating-point imprecision. Is that via UDFs, or what? Not PL/pgSQL? Oh yeah, exactly, UDFs. But I mean, obviously we could spend time on making that better, and we could even try to support something like Oracle's PL/SQL, but that would obviously be extremely time consuming and have limited applicability, so we're not gonna do it. UDFs today are in Java and C++, and I'll talk more about LLVM runtime code generation, but if you're doing C++ you have the option of getting it into the system either via a shared object or even as cross-compiled IR, so we can load your IR and then do cross-function-call optimization.

What does Impala look like? If you've ever seen a parallel database system, it looks very similar. You basically have a daemon process running on every node in the cluster that has relevant data. The goal here is to do data-partitioned parallelism. I'll talk more about the query optimization framework, but in essence it means that you're gonna do as much processing as you can on the nodes that have the data, and what comes after that is basically a bunch of process-to-process data exchange and pipelined query execution. In addition to those execution daemons, you also have two single-node systems here at the top: the statestore and the catalog service. Impala has a symmetric architecture: every one of the Impala daemons can act as a coordinator, meaning handle user-facing requests, and can also act as an execution engine for the backend execution. That means they're all caching metadata. If you had to get metadata from the suppliers of metadata in the system, which are the Hive metastore and the HDFS NameNode, that would make query execution extremely slow and also be very much not scalable. So there is a catalog service in the system, a single-node service that manages the interaction with the Hive metastore and also gets physical metadata updates from the HDFS NameNode. This catalog service then compiles updates that it sends out to all the Impala daemons, and it uses the statestore, which is a small in-memory pub-sub system, for that purpose. And that was the question. When you say you cache the metadata, do you mean the metadata in the catalog, and you cache it in each node? Yes, that's right. We cache the metadata in all of the nodes. We actually just implemented a separation of the coordination functionality, the frontend functionality, from the execution, but each node can still play both roles, right?
So you can have, typically, people would have, let's say, maybe a few dozen coordinators in the system, and they're all caching metadata. Again, the reason is that when you want to run a query and you want to do that in under a second, you don't have time to go to the Hive metastore and get all the logical metadata, and then after that go to the NameNode and get all the physical metadata, right? That alone would take you more than a second. And then of course you would end up with a very much not scalable system, because every query would basically have to go to the same metastore and NameNode.

The statestore, as I said, is a pub-sub system. It's a single-node system, but it doesn't store anything, meaning it has soft state: you can take it down and bring it back up, and it can reconstruct its state. And you can still run queries in the cluster even when the statestore is down. So the statestore is not part of the execution control path of a running query; it is only part of the metadata propagation protocol. Likewise, the catalog service, like I said, interfaces with the HMS, the Hive metastore, and gets metadata from the HDFS NameNode. It also has soft state, meaning that if it goes down, it can reconstruct its state from the Hive metastore and the NameNode. And again, it is also not part of the control path of a query. But if you want to create a new table, then you would be blocked on the catalog service not being available.

A few words about the execution daemon. It is broken up into a frontend part, which is actually written in Java. The reason was that with Java it is easier to interact with the rest of the Hadoop ecosystem, in particular the HDFS NameNode and the HMS, the Hive metastore; the clients are all written in Java, so that made it easier. However, the execution system, the backend, is written entirely in C++, and that is what is on the hot path for query execution. Did you benchmark, like, Thrift versus protocol buffers when you made this decision? No, it was easier at the time because Thrift also had an RPC system baked in. That's basically what it is. We're actually transitioning the RPC system: Kudu built a new RPC stack, and it's much more scalable, in particular for large data exchanges. So that's what we're doing.

Query execution runs in the typical phases. You have an application here; this is a three-node cluster, and the orange boxes comprise the Impala daemons. A daemon is internally not divided into different components, at least not in terms of threads. You don't have a thread that does the planning and another one that does execution; all threads can do everything. But here you have an application sending a request via ODBC. It arrives at the planner; the planner turns around and produces a sequence of plan fragments that it then hands off to the coordinator. The coordinator in turn distributes them to all of the execution backends that contain relevant data. Because it caches all of the physical metadata, it knows about the location of all of the files and the block replicas and all that stuff, so it basically picks where things should run, sends out the plan fragments, and then the execution engines begin running and producing intermediate results. Nothing is written to disk. Well, you might have to spill: if you're doing a large sort, you might have to spill. But outside of that, nothing is written to disk, and they all do process-to-process data exchange, and it's basically pipelined query execution.
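If you want to see that fragmentation for a particular query, Impala's EXPLAIN statement prints the plan, and at higher detail settings it shows the distributed, per-fragment form. A small sketch, with hypothetical tables:

    -- EXPLAIN_LEVEL is an Impala query option; the tables here are made up.
    SET EXPLAIN_LEVEL=2;
    EXPLAIN
    SELECT c.state, COUNT(*) AS cnt
    FROM orders o JOIN customer c ON o.cust_id = c.id
    GROUP BY c.state
    ORDER BY cnt DESC
    LIMIT 10;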
This is roughly what that looks like: here you're doing a bunch of scans and joins, and that's a rough outline of how execution works. Now, talking about planning. Planning in Impala is done in two phases. The first phase produces a single-node plan. This gives you a tree of plan operators: things like scan, join, aggregation, sorting, analytic function computation. Then you take that single-node plan and, in order to distribute it, you break it down into plan fragments. The goal here is the parallelization of all of the operators in the plan. And as part of that parallelization, you wanna do as much local processing as you can, meaning scans run locally, and after the scan you're also doing as much pre-processing as you can. So you might do pre-aggregation, you might do joins via broadcast, et cetera. Impala uses a cost-based decision model to create the join order, and also uses a cost-based decision model to do join distribution optimization, meaning the decision between a broadcast join, where you have, let's say, a large fact table and a small dimension table and you decide to broadcast that small dimension table to all of the places that are executing the join, basically the places that are local to the large fact table, or, when you are joining two large tables, a repartitioning join, in which case you have to redistribute both tables on the join expressions.

So here's a very simple example of a single-node plan. You are joining three tables, followed by a grouping aggregation, followed by an ORDER BY with a LIMIT, which is typically implemented as a top-N operation. Here you're seeing the two joins in sequence, followed by the aggregation, followed by the top-N. And Impala, if it knows that you are accessing very little data, would actually run that on a single node. It would simply say, you're only scanning whatever, a few hundred nodes, sorry, a few hundred rows, or this many kilobytes of data, and I'm not going to distribute that, I'm just going to run this locally right here. In the distributed case, the second step will then do the plan fragmentation, and as I said before, you want to maximize scan locality and minimize data movement. Parallelizing joins means deciding between a broadcast and a repartitioning join. Parallelizing aggregation typically means you're doing a pre-aggregation on the grouping expressions, which you can do locally and which often results in a reduction of the intermediate results, and then you're doing a merge aggregation after a hash repartitioning step on the grouping expressions. When you're doing a parallel top-N, you're again doing a local top-N on each node, followed by a merge on a single node, obviously, because otherwise you cannot do top-N correctly, right, the semantics don't allow that.
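The broadcast-versus-repartition decision is normally left to the cost model, but Impala also accepts per-join hints, which is a handy way to illustrate both strategies. A sketch, with hypothetical tables:

    -- Small dimension table: replicate it to every node joining the fact table.
    SELECT s.id, d.name
    FROM sales s JOIN /* +BROADCAST */ dim_store d ON s.store_id = d.id;

    -- Two large tables: hash-repartition both sides on the join key instead.
    SELECT s.id, b.val
    FROM sales s JOIN /* +SHUFFLE */ big_dim b ON s.store_id = b.id;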
So your cost model for your query planner is based on network I/O? No, it's actually simply based on total bytes touched. The model is basically: having to repartition something means you have to touch all the bytes and put them on the network, so we don't really differentiate between the effort of network I/O and all these other things. And I think in practice it has been shown that a cost model does not need to be precise, precise in the sense that it does not need to model the exact cost. You use cost to decide between two alternatives, and the cost model only has to be accurate enough that the alternative that is more expensive during execution also shows up as more expensive in the cost model. Right, so if one is more expensive than the other, the cost of the one should be maybe a thousand and the other 500, but it doesn't matter if it's a thousand and 500 or if it's 10,000 and 5,000; the only thing you're interested in is that one is more expensive than the other. So that's why we felt that bytes touched was a fair approximation of total effort involved.

Here's the illustration of a distributed plan for that simple query we saw: the two joins followed by an aggregation. Here you are initially joining two large tables. The two large tables are executed as scans on their respective nodes; because they're large, you're doing a repartitioning join, which means that those scans run in separate plan fragments and the output of those is repartitioned on the join key, here the IDs. You're then doing the join in a third plan fragment, followed by the second join. The second join is a small-table-to-large-table join, which is done as a broadcast. This second join is followed by a pre-aggregation, whose output goes into a fifth plan fragment after a repartitioning step on the aggregation key, which is in this case the customer ID. Then you're doing the merge aggregation, then a local top-N, followed by a global top-N back at the coordinator. So this results in a total of six plan fragments that are all run concurrently on all of the execution nodes, except for the last one, which is only run at the coordinator.

A few words on the execution engine itself. It is, as I mentioned, written in C++. It uses data-partitioned parallelism, and it also has some performance optimizations, the major one being LLVM-based runtime code generation. And the way we do that, I know you're working on Peloton, which takes a slightly different tack, which is basically based on a DSL, as I understand it, and whole-query compilation. We do it slightly differently. Each of these operators that you see here, the scan, join, aggregation, is run as a loop over an input batch. This is batched execution: when you're joining, you are presented with a batch, you do the hash table lookups for all of the rows in that batch, and you then compute the join and augment the batch with the join results. This loop is our target of optimization, and what we're doing with LLVM is generating the code for the entire loop, including hash value computation, lookup in the hash table, et cetera, and then using LLVM to produce one giant function for this loop with expressions inlined and all that stuff. So there are no virtual function calls, et cetera. Whoops, I need to talk you through this. And obviously, because we know the types at runtime, you're gonna get rid of dead branches, you can unroll some loops, you propagate constants, all that stuff.
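One easy way to see what that buys you on a given query is Impala's DISABLE_CODEGEN query option, which forces the interpreted path for comparison. A sketch, with a hypothetical table:

    SET DISABLE_CODEGEN=true;   -- interpreted execution
    SELECT COUNT(*), MIN(price), MAX(price) FROM lineitem;

    SET DISABLE_CODEGEN=false;  -- JIT-compiled loops
    SELECT COUNT(*), MIN(price), MAX(price) FROM lineitem;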
So we're seeing obviously a massive performance improvement with that kind of execution. We can leave this out. Well, I mean, this is one example; the actual savings depend very much on the query. Here's a bunch of single-table aggregates, the first one being COUNT(*), the absolute simplest one, and the last one TPC-H Q1, which has like three or four aggregates that require some arithmetic expressions, and here you can see a factor-16 speedup. This is on basically memory-resident data, so the cache is warmed up, it doesn't come from disk, and obviously a massive speedup. Let's get through this, we did that already. This basically concludes the Impala part.

Now onto Kudu. Kudu is the storage manager sitting underneath it all, and what is it exactly, or what were the targets? One thing was scalability: this is meant to scale to tens of petabytes of data and thousands of nodes. Targets were both the throughput of read and write operations, meaning you should be able to do very fast streaming inserts and also very fast lookups, but also analytic performance, which is measured usually in gigabytes per second. So here the goal is to be able to scan large amounts of data per node in order to facilitate analytic queries.

What does Kudu look like? Under the covers, unlike HBase and Bigtable, Kudu has a schema that looks more like a relational schema, meaning it's a fixed schema: you have pre-declared columns, you can of course add columns, but the columns themselves are typed and the column name is not part of the data. Columns being typed also means that when Kudu stores data on disk, it stores it in a columnar format and can use typical columnar compression tricks, which rely on the data all having the same type. Aside from being integrated with Impala, Kudu, being part of the Hadoop environment, also has NoSQL-style record-level APIs in C++ and Java and Python, and Kudu is also usable via SQL from Spark and, I think, maybe even Hive. How do you do a fast ALTER TABLE? Fast what? ALTER TABLE, how do you do that? Oh, it just alters metadata; it doesn't update the data. That's it: ALTER TABLE ADD COLUMN means that you're simply creating metadata for the new column, but the existing data doesn't have that column, and when you read it, you get NULLs back. Okay, so do you have a notion of schema versions in some way? Yeah, I guess so. You need to evolve your schema in a way that doesn't break existing stuff. It's the same in Parquet as when you use protobufs: there's a certain way you do schema evolution in protobufs, meaning you will never rename or reuse a field tag, because that would totally break everything, and you shouldn't remove fields, you should only deprecate them, and if you add new ones, you add them at the end with a new tag, right? And if you do that, then you're fine: you can always read your old data, and all of the new stuff needs to be optional, right? So that's basically what it is. Yes? Instead of deleting them, you deprecate them? What is that? Oh, what does that mean? Well, I mean, you have a column of data, and not deleting it means you keep the old data around and you simply deprecate it. Like, sometimes applications, let's say you change application semantics, you want to encode some information in a different way: you will create a new column, but you don't have it populated yet.
So if you want your applications to continue working, you basically now need to transfer the data from the old column into the new column. And every time you do that, in every large database application, you can never simply stop the world, reload all your data, change your application, and start again, right? It always has to be kind of a piecemeal operation: you're doing piecemeal schema evolution and you're doing piecemeal data transformation, if that makes sense. Yep? Do you always do the transformation eagerly, up front, or do you do it lazily, once you access the data, and update it to the newest schema then? The new schema is populated outside of analytic queries. You need to figure out: if you have SLAs, you cannot do heavy lifting and data transformations as part of your analytic queries, where sometimes you see data that needs to be transformed and sometimes you don't. That would be completely non-deterministic behavior, basically, right? You would never be able to stay within your service level agreements, meaning, for this web app, every query needs to come back within 1.5 seconds or something, right? This is typically what corporate environments have. Let's say you have a web app, you have a screen in your application, and the screen needs to run 30 queries in order to populate all of the UI components, right? Each of those queries needs to come back within a certain period of time, otherwise you are missing your SLAs.

So, a few words on physical schema design. This is very important for analytic applications because it is important for performance. Physical schema design means that beyond just the logical schema, you now also decide how you physically partition your data over the cluster. Typical things are time series; you find time series everywhere. When you talk about data warehouse environments, most fact tables are really time series, right? Data is collected chronologically. You might have click logs, you might have impression logs. Here we're gonna go through an example of collecting machine metrics: for every host, you collect a number of metrics every second or something like this. So these are all time series. And what you wanna be able to do is use your physical schema design mechanisms, which are partitioning, to support your application characteristics, meaning make it so that you can do fast lookups or fast aggregate queries, and also make it so you can use effective compression algorithms and stuff like that.

So here's an example of a time series that is both hash partitioned and range partitioned. The y-axis is the range partitioning on the timestamps themselves, and the x-axis is a hash partitioning on another field; let's say it's on the metric itself. Why do you not just wanna range partition your time series data? Because in time series data, you typically append at the end, right? You generate new data all the time, and the new data is for now. And if now falls into exactly one partition, you have a non-scalable system, because that partition would have to take the entire update stream, insert stream, right? So that wouldn't be a good idea. This is why in this case you also hash partition, in addition to the time range partitioning. So in this particular example, you have 12 partitions, and each time range is serviced by three hash partitions, meaning your insert stream goes into those three partitions.
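As a rough sketch, this is how such a layout can be declared through Impala's Kudu integration (hypothetical names and boundary values; the talk's actual metrics example comes up in a moment):

    CREATE TABLE metrics (
      host   STRING,
      metric STRING,
      ts     BIGINT,
      value  DOUBLE,
      PRIMARY KEY (host, metric, ts)
    )
    PARTITION BY HASH (metric) PARTITIONS 3,
                 RANGE (ts) (
                   PARTITION VALUES < 1493000000,
                   PARTITION 1493000000 <= VALUES < 1496000000
                 )
    STORED AS KUDU;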
So here's an example of the inserts going to three partitions. And here's an example of querying on one of the non-time fields, which matches exactly one of the hash partitions. So again, you're getting a reduction of the total scan volume you need for that analytic query.

So now, talking about Impala and Kudu combined. I mentioned before that there are some operational differences in the database functionality offered through the Hadoop environment. Let's take a quick look at what a monolithic RDBMS looks like, on the left-hand side. You typically have, often in a single process but sometimes not, multiple components that are layered. You have the query execution component, which takes a SELECT statement and turns it into a plan, which itself is a sequence of execution operators, scan, join, aggregate, which are then handed off to the execution environment, which executes these operators as a sequence of base table accesses. The base tables are exposed through a record interface: you can do lookups into base tables and you can do range scans on base tables. This record layer in turn implements the record abstraction on top of the storage layer, which gives you blocks and byte sequences, basically. Alongside all of that, you have a catalog that records both the logical metadata, you know, the tables, the columns, the column types, as well as the physical metadata, the locations, the extents of the tables, which are then used by the storage API in order to facilitate the scans.

You have that same functionality in the Hadoop environment, but now it is broken up into components. Impala is a query engine, meaning it takes SQL and basically executes the query operators, but it utilizes the underlying storage managers to supply the record-layer functionality. Kudu is a storage manager with a record-layer API. Parquet and HDFS together give you storage management plus a record-level API; the record part is Parquet, right, Parquet has a record abstraction internally. You also have the Hive metastore, which functions as the global metadata repository, and the HDFS NameNode, which in the HDFS case records the physical metadata.

There are advantages to such a decoupled architecture. One is that, although I'm only showing Impala here, against the same data that is resident in Kudu and HDFS you can also run MapReduce jobs, Spark jobs, Hive jobs. Typically, people use multiple processing frameworks to access the data sitting in a Hadoop environment. You can also, in this case, run a query that joins data from HDFS and Kudu in the same query; these are both accessible at the same time. This also means that, because the SQL capability is separated out from storage management, et cetera, if you introduce a new storage manager, like we did with Kudu, you very quickly get SQL capabilities on top of it.

So here I'm gonna run through an example of a metrics table, a time series table, to illustrate how the physical schema design and the performance features of Kudu are exposed through SQL. So here's the metrics table that we saw before. Every table in Kudu has a primary key, and when you create the table you can decide how you wanna partition the data based on those primary key columns. In this case, we are doing hash partitioning into three buckets on the metric column of the primary key.
And then we're also doing range partitioning on the timestamp column itself, which is the example we had before. Now, when you're running a SELECT statement, Impala will utilize the predicates that are present in the query to first of all do partition pruning, meaning it will figure out, given the predicates, which actual partitions, which tablets in Kudu, need to be accessed. You can see this on the right-hand side; the right-hand side is basically a printout of a query plan. And you can see that, out of the total number of partitions, the predicate on the metric here maps to exactly one hash partition, so we are only accessing three of the tablets. Impala will also push down predicates that Kudu can evaluate directly: that's the predicate on the timestamp and also the comparison on the metric column. Things it cannot push down, such as evaluating a LIKE predicate, meaning regular expression matching, are then done directly in Impala. So it basically tries to offload whatever operations it can onto Kudu and then does the rest of the processing, such as the joins and aggregation, directly in Impala. Impala and Kudu also share a common in-memory representation, meaning that when we get data back from Kudu, all we have to do is a bit of pointer transformation and we can then use the binary data directly, which is another performance advantage.

Here's a single-row insert, not terribly exciting. The only takeaway is that this is mapped directly onto a single Kudu API call, meaning you can write your applications entirely in SQL, basically driving the Kudu API as if you programmed against it directly. More interesting, and a novelty for the Hadoop environment, is an UPDATE statement. Previously, like I said, with HDFS you were not able to update data in place. With Kudu that is possible, and through Impala you can run UPDATE statements of arbitrary complexity, meaning you can do joins, et cetera. They are basically transformed into a query plan that gathers the row IDs of the updated rows, gathers the update data itself, and then executes the update, again driving the Kudu API. DELETE works in a similar manner; it's basically exactly like an update statement, except you're doing a delete at the end.

And there is a question. In these examples, you have size metrics for all the tables; do you maintain statistics for all tables? Kudu does statistics maintenance. For HDFS, we collect statistics outside of HDFS, but Kudu maintains its own statistics, so we get some row counts and stuff like this through Kudu. What about column statistics and other cardinality information, like NDVs? Not done in Kudu right now, and we're thinking of ways of making that better. Yes? In the CREATE statement for this example, you specified the partitions yourself; can they be automatically determined? It doesn't do that right now. Obviously, for the hash partitioning I have to declare the number of buckets, and for range partitioning, right here you're pre-partitioning it, which is actually good for bulk data loads. And it doesn't do repartitioning right now, meaning it doesn't automatically adjust the partition boundaries, but that's a goal for the future.
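For reference, a sketch of what those statements look like against the hypothetical metrics table from above (Impala translates both into Kudu API calls):

    -- Arbitrary predicates are allowed; Kudu-evaluable ones are pushed down.
    UPDATE metrics
    SET value = value / 1000
    WHERE metric = 'cpu_user' AND ts >= 1493000000;

    DELETE FROM metrics
    WHERE metric = 'obsolete_metric';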
Yes? No, we create statistics, but we do that outside of HDFS, meaning you run a command that computes statistics, which we then store in the Hive metastore and then use at runtime for things like cardinality estimation, to figure out what you need for join order optimization, right? You can't do anything without selectivity estimates. So yeah. But someone could be dropping new files into HDFS. That's right. That you don't know of, and then you start scanning and, like, oh, all of a sudden there's stuff there. Yes, that's possible. Typically, I mean, we tell our customers to, you know, recompute stats when the data has deviated by a certain fraction, right? You don't wanna do that every time you drop new files in; it won't change the plans, right? But if you, let's say, load an additional 50% of data, then yes, it might. Let's leave that aside. How much time do we have, roughly? Yeah, 20 minutes. 20 minutes? Okay, all right. Then I can talk more about Kudu; let me pop out of that quickly and switch to this here.

Talking more about tables and tablets in Kudu: I mentioned the partitioning. Kudu does consensus-based replication, meaning in practice you would use three or five replicas, and it uses Raft, not Paxos. You have a table broken up into tablets. Tablets are usually on the order of, let's say, four gigs, and those are spread across tablet servers. So each tablet has three or five replicas, and those replicas form a replica group, with a leader and followers, and the replication protocol then determines what happens when you're doing an update, which needs to be done at the leader. You can do read-only transactions against a follower. You can also have read-only replicas, non-voting replicas, meaning you can scale up your number of replicas to a very large degree without impacting update performance. When you're doing a consensus protocol, obviously the update effort is proportional to the number of replicas that receive the updates, so in a practical system you would have three or five replicas that participate in the updates, and you could have a much larger number of read-only replicas, and you could use those to satisfy your read traffic. One goal of this setup is that you can run update workloads, meaning OLTP, quote unquote, and analytic workloads in the same system, right? And in most systems, you will always have many more reads, many more analytic queries in an analytic workload, than you will have updates. So being able to scale up the number of read replicas is very important.

The system has a single master; this master is replicated. The master contains both the physical and the logical metadata. In particular, the physical metadata is the tablet directory: if you wanna look up the row with key 15 in a particular table, your client will, here in this example, talk to the master to figure out exactly what tablet that sits on, right? The client also caches this metadata, so you won't have to do that over and over again for every single scan that you're running.
I mentioned Raft. Raft obviously deals with failures, so it basically regulates what you do when you have transient failures, what you do when you have permanent failures, leader re-election, et cetera. When you have transient failures, followers can basically drop off; as long as you maintain a quorum, you're in good shape. When the node recovers, it can then resume; there might need to be some data transfer. If the node drops off for too long, it basically falls away; the consensus protocol then figures out a new node that can become part of that replica set, and which is then repopulated with the data from the leader. Columnar storage, I need to tell you about that. So, switching back to this here.

What can you now actually do with this system in terms of application architectures? Just to revisit, what does a typical analytic application look like in the Hadoop environment? You typically have something that generates, let's say, log data. Log data is generated in a row-oriented format, which is then dumped into HDFS. Then, because for performance reasons you need to convert that into a columnar format, you typically do that through, let's say, Impala or Hive or Spark, but the end result is that you generate a bunch of Parquet files that you need to move into the directories that contain the table data, and only then will the data become visible, right? And that's just adding data; when you wanna update data, you're kind of on your own. It's very complicated. We have some customers that try to do very fine-grained partitioning, and then when they have updates in a partition, they drop the whole partition and then reload it. So it's very painful, and it's usually a compromise between your update granularity and what you would like for reading, right? Obviously, updates you wanna do at a small granularity, reading you wanna do at a larger granularity. So, not really working terribly well. With Kudu in the picture, the goal is that you do direct updates against the system. You can either do streaming updates through something like Kafka, or you do data loading via SQL, either bulk loads or streaming insertions, and then the data is present right away and you can query it. You don't have to go through different systems, and you don't have to trade off data freshness versus update performance.

So here, just to summarize, what are you getting? This is obviously targeted at an enterprise, analytic data warehouse type audience, and those people are interested in getting SQL functionality, but also the ability to insert, update, and delete data. They need the connectivity, all that stuff. That is what you get from Impala and Kudu together. And because of the scalability of Kudu itself, you will be able to get very fast streaming inserts and general scalability, right, scalability by adding nodes. Again, I wanna point out that traditional systems such as ParAccel or Redshift often do things like physical hash partitioning. That means you decide the location of a piece of data based on the number of nodes the data was created on. Let's say you have an eight-node system, which is actually not atypical, relatively small scale, and all your data is physically assigned to these nodes. If you wanna add a node, you will have to reload your data, because all of the physical addresses change, right?
So here, just to summarize, what are you getting? This is obviously targeted at an enterprise, analytic data warehouse type of audience, and those people are interested in getting SQL functionality, but also the ability to insert, update, and delete data. They need the connectivity, all that stuff. That's what you get from Impala and Kudu together. Because of the scalability of Kudu itself, you get very fast streaming inserts and general scalability, and scalability here means by adding nodes.

Again, I want to point out that traditional systems such as ParAccel or Redshift often do physical hash partitioning. That means you decide the location of a piece of data based on the number of nodes it was created on. Let's say you have an eight-node system, which is actually not atypical, relatively small scale, and all your data is physically assigned to those nodes. If you want to add a node, you have to reload your data, because you hash-partitioned it, so all of the hash addresses change. That is a very inflexible and not very scalable design, because every time you grow the cluster, you reload your data. And that is not the case with Kudu. Kudu maintains a tablet directory and is able to relocate tablets based on load patterns in the system. That is what you need for a scalable system that lets you change the total node count at runtime, as the sketch below shows.
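Here is a hypothetical Python sketch of the difference; it is not either system's real placement code. Hash partitioning ties placement directly to the node count, while a tablet directory adds a level of indirection that survives adding nodes:

```python
import bisect

def physical_location(key, num_nodes):
    # Hash partitioning bakes the node count into data placement:
    # changing num_nodes from 8 to 9 relocates nearly every key,
    # which is why adding a node forces a full data reload.
    return hash(key) % num_nodes

class TabletDirectory:
    # Kudu-style indirection: a key maps to a tablet (stable), and a
    # directory maps tablets to servers (mutable). Adding a node just
    # means moving some tablets and updating the directory.
    def __init__(self, upper_bounds, tablet_to_server):
        self.upper_bounds = upper_bounds          # sorted per-tablet key bounds
        self.tablet_to_server = tablet_to_server  # placement, freely mutable

    def locate(self, key):
        tablet = bisect.bisect_left(self.upper_bounds, key)
        return self.tablet_to_server[tablet]

    def rebalance(self, tablet, new_server):
        # Load-based relocation: repoint one directory entry; the
        # key-to-tablet mapping (and all other data) is untouched.
        self.tablet_to_server[tablet] = new_server

directory = TabletDirectory([100, 200, 300], ["s1", "s2", "s3", "s4"])
assert directory.locate(150) == "s2"
directory.rebalance(1, "s5")          # move one tablet to a new node
assert directory.locate(150) == "s5"
```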
Things that Kudu doesn't do today: multi-row transactions. This is an important piece of functionality for traditional RDBMS applications, and it's something we want to add in the future as well. And right now, batched inserts can be a little slower than you might be used to from your traditional enterprise data warehouse. Why? It's just internal optimizations; there are no architectural reasons.

So this basically concludes the talk. Just to summarize: you have a decoupled architecture in the Hadoop environment where the individual pieces of functionality are broken out into separate components. You can compose those components, and the components in essence work together to preserve the scalability and flexibility that Hadoop was designed for. And that's it. Awesome, thank you.

Any questions from our staff? Yes.

I actually have a non-technical question. So Andy just told me yesterday his dog was sick and needs some surgery, and asked me for help.

He's, what, sorry?

My dog, sorry, he needs surgery.

Oh, okay. Yeah.

You're gonna do the surgery? Didn't Cloudera just IPO last week, right?

What?

You could just lend him some money.

My dog needs $5,000 in surgery.

Okay. I know you do good work.

Hey, absolutely, yeah. You should just, you know, hand a tin can around. Yes, any non-fake questions?

I put him up to that, sorry. So how does the IPO affect the Hadoop stuff?

I mean, it doesn't affect Impala technically. In the product portfolio of Cloudera, the analytic data warehousing workloads are very important, and Impala is a central component of that. And obviously our investment in Kudu has been going on for quite a while, and it is quite an investment: creating a new storage manager is very expensive. Again, we see the added capabilities of insert, update, and delete, and the consistency semantics that you get from Kudu, beyond things like the eventual consistency you get in traditional key-value stores, as very important and critical to the success and, I would say, broader appeal of data management solutions in the Hadoop environment.

If a client wants to geo-replicate his data, then isn't strong consistency a bad choice?

I personally disagree; I think it is a good choice. This is what F1 and Spanner did. They do geo-replication; they do five-way replication, actually across multiple data centers. The reasoning is that you want geo-replication for business-critical operations, but you also want consistency, because without consistency you can't really write applications. The degree of complexity inherent in typical applications is extremely high, and trying to handle that while also anticipating eventual-consistency weirdness is simply too much for application developers. They are experts in their domain; they are not distributed-systems experts. So that's why you need a separation of those concerns, and you basically present a simple interface. Like I said, in my mind the single-node RDBMS is sort of the gold standard in terms of comprehensibility and functionality.

Do you foresee the end of, I mean, there will always be a single-node system for, you know, people who run a website. But do you see the end of these monolithic parallel data warehouses, the Verticas and systems like that?

I mean, I don't see the end of anything. That's what working in industry has taught me: existing systems never go away; they're simply augmented. I don't foresee the Oracles of this world going away. But you will certainly see new applications being done on these systems that are not necessarily monolithic, partly because of scalability concerns, slash, capabilities, which is very important, but also because if you do a greenfield application, it's easier to target it against a new system. And these new systems will handle much larger amounts of data. You see that even today: users like eBay, which have been traditionally heavy Teradata users, have ten times as much data sitting in Hadoop. Obviously their quote-unquote high-value data sits in Teradata, but this is the pattern you see: new applications, in particular those with larger data requirements, are done in systems like Hadoop. Yep.

So when you were talking about eventual consistency not really working out: Google actually did that with Spanner, but they used real-time clocks and had to synchronize them, which cost them a lot of money, and they could afford it because they're Google. But can all your other customers really afford this? They can't really get data centers with synchronized clocks, right? So do you use, like, a weaker consistency model or something?

No, it uses a software protocol. I forgot the name of it; there was some paper that basically did TrueTime in software.

The hybrid clocks, the physical-plus-logical clocks?

Yeah, something like that; I forgot the name. And I feel like there was always too much focus on the atomic clocks in Spanner. In practice they're actually not that expensive, as far as I know, and they're just one aspect of the system. The same techniques you can basically implement in software, and it is usable. A sketch of such a hybrid clock follows.
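For illustration, here is a minimal hybrid physical/logical clock in Python, in the spirit of the "TrueTime in software" idea mentioned above (e.g., the hybrid logical clock papers); this is a sketch of the general technique, not Kudu's actual implementation:

```python
import time

class HybridClock:
    """Minimal hybrid physical/logical clock sketch."""
    def __init__(self):
        self.physical = 0   # highest physical time seen, in microseconds
        self.logical = 0    # tie-breaker when the wall clock hasn't advanced

    def _wall(self):
        return int(time.time() * 1_000_000)

    def now(self):
        wall = self._wall()
        if wall > self.physical:
            self.physical, self.logical = wall, 0
        else:
            self.logical += 1          # wall clock stalled or went backwards
        return (self.physical, self.logical)

    def observe(self, remote_phys, remote_log):
        # Merge a timestamp from an incoming message so our next timestamp
        # is causally after it, even if our local wall clock lags the sender.
        wall = self._wall()
        old = self.physical
        self.physical = max(old, remote_phys, wall)
        if self.physical == old and self.physical == remote_phys:
            self.logical = max(self.logical, remote_log) + 1
        elif self.physical == old:
            self.logical += 1
        elif self.physical == remote_phys:
            self.logical = remote_log + 1
        else:
            self.logical = 0           # our wall clock is ahead of both
        return (self.physical, self.logical)

clock = HybridClock()
t1 = clock.now()
t2 = clock.observe(t1[0] + 1_000_000, 3)  # message from a node a second ahead
assert t2 > t1                            # timestamps respect causality
```

The point is that ordinary NTP-synchronized machines suffice: the logical counter papers over clock skew, so you get consistent, causally ordered timestamps without specialized hardware.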
Do you also replicate metadata? Like the physical and logical metadata?

Can you repeat the question so it comes through on the video?

Oh, sorry. The question is: does Kudu replicate physical and logical metadata? And the answer is yes, to some extent, and Kudu also caches it in the client. The physical metadata is basically the tablet directory that tells it where to find the tablet for a particular key, and that is cached in the client.

But what if the primary storage is broken, like the machine is down?

The primary storage being broken? Which primary storage? What do you mean? You mean the leader, or...

The leader.

Then there's a leader-election process, and you figure out a new leader.

So it's always on other machines, and the caching is irrelevant?

I'm not that familiar with Kudu, because I don't work on it, but it might actually... probably doesn't. The tablet servers are involved in the leader election, and they can always re-fetch the metadata from the master; there's a master that has basically the authoritative copy of all of the metadata.

So is the cached metadata eventually consistent, or does the master always stay consistent?

I don't want to talk too much about aspects of Kudu that I don't really understand, so I can't really answer that. Yes?

So, we talked about non-volatile memory the other day. How do you see Impala and Kudu adapting to that, if it comes out and it's actually different?

Right. The question was about non-volatile memory and what we would do in order to adapt to systems that have lots of NVRAM. The answer is you would certainly need to evolve a storage manager like Kudu: all of a sudden, different access structures become much more palatable, feasible. You would probably also need to evolve Impala. For instance, with NVRAM, if you have data access that behaves much like memory, you can all of a sudden use join algorithms such as index nested-loop joins much more broadly than you would today. So certainly cost models will change, and with that you might even want to implement new query operators.

So the last question I'll ask: if you had a magic wand, what's one thing you would fix in a one-year span, and one in a five-year span? What's something you would fix right away, and what's something more long-term?

You mean fix as in semantics, or...?

From a performance or software-engineering standpoint, adding a feature, fixing a bug: what's something you want to take care of in the next year, and what's something you want to take care of as a five-year goal?

Okay, sure. I think the five-year goal, on the Kudu side, is the multi-row transactions, and I'm not sure it takes five years, but it'll take some time. It's not easy: you need to layer basically distributed transactions on top of the consensus protocol, which is what Spanner does, for instance. Greatly beneficial, but also somewhat complicated.

The immediate thing that I'm actually working on: Impala right now uses what I would call a non-deterministic multi-threaded execution model, and we want to transition to a deterministic multi-threaded execution model, along with reservations for memory and CPU. What does that mean, a non-deterministic threading model? It means that when you have a query plan, certain operators are themselves multi-threaded. For instance, we do the build side of a join asynchronously, so you can have multiple builds running at the same time. That is non-deterministic in the sense that the degree of parallelism depends on the query shape. Scans themselves have multiple scanner threads that run in parallel, and that is also non-deterministic, because the number of scanner threads depends on the total availability of threads in the system, which depends on the other queries you're running. That is where non-determinism comes into play. What you really want for resource management and for deterministic workload management is knobs to specify exactly how much memory a query can use, or how many active cores a query can use, and then to be able to execute exactly within those constraints. That makes it possible to do effective workload management, which means you can have SLAs with tight bounds, and users like that, because that is predictability.
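As a toy sketch of that deterministic model (the class and the numbers here are hypothetical, not Impala's actual admission controller): a query states its full memory and core reservation up front and is admitted only if the reservation fits, so its degree of parallelism never depends on concurrent load:

```python
# Toy admission-control sketch of deterministic resource reservations;
# all names and numbers are made up for illustration.

class ResourcePool:
    def __init__(self, total_cores, total_mem_gb):
        self.free_cores = total_cores
        self.free_mem_gb = total_mem_gb

    def admit(self, cores, mem_gb):
        # Admit a query only if its full reservation fits. The grant never
        # changes mid-query, so its degree of parallelism is deterministic
        # and independent of whatever else is running.
        if cores <= self.free_cores and mem_gb <= self.free_mem_gb:
            self.free_cores -= cores
            self.free_mem_gb -= mem_gb
            return True
        return False   # in a real system: queue or reject

    def release(self, cores, mem_gb):
        self.free_cores += cores
        self.free_mem_gb += mem_gb

pool = ResourcePool(total_cores=32, total_mem_gb=256)
assert pool.admit(cores=8, mem_gb=64)       # runs with exactly 8 cores
assert not pool.admit(cores=32, mem_gb=64)  # would oversubscribe; must wait
```

Contrast this with the opportunistic model, where a scan grabs however many scanner threads happen to be free: under reservations, the same query gets the same resources every time, which is what makes tight SLAs feasible.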
So that's something that I'm actually actively working on.

All right, let's thank Marcel for coming.