So in the first half I will talk about the Hadoop compatible file system architecture and some of the recent enhancements we have made in the S3A file system connector, and in the second half I would like to talk more about the Hive access patterns and the performance benchmarks we have carried out with Hive TestBench, which is a subset of the TPC-DS queries, for benchmarking on the cloud.

So the first question is: why do we even need Hadoop in the cloud? If you look at it, the cloud does not have any upfront hardware cost associated with it. What that means is that within a matter of minutes you can bring up a lot of nodes and set up your Hadoop cluster with no delay. Also, you do not need an administrator looking after the Hadoop cluster 24x7, so there is no monitoring cost involved either. The cloud supports elasticity, by which I mean you can add or delete nodes in a much easier fashion. And the cloud also brings up a lot of interesting deployment models. Just to give an example, let's say you have a use case where you want to bring up and use a cluster only for a couple of days or a couple of hours. You can do that easily in a cloud environment and bring the cluster down pretty quickly, so you pay only for the amount of time you actually use the cluster. In other words, you can bring up an ephemeral cluster and tear it down quickly, or, depending on the use case, run a very long-running cluster as well. And that brings up an interesting point: business units can now move at their own speed as well as their own budget.

This slide talks about the evolution of the different integration patterns between Hadoop and cloud storage. There are a couple of diagrams here. The first one is an integration pattern where the entire data set is available in HDFS, and the applications read the data from HDFS directly, process it, and store it back onto HDFS directly. At some point the management or the administrator might say, "I have too much data, I don't need the cold data to be available here, and cloud storage gives me much cheaper options," in which case they might choose to transfer some of the data from HDFS onto cloud storage. They can do that quite easily with tools like Hadoop DistCp: as most of us know, with DistCp you can point the source URL at HDFS and the target URL at cloud storage to transfer the data, and if you have to restore the data, you can use the same tool in the other direction.
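Since both stores are exposed through the same Hadoop FileSystem API, even a plain programmatic copy works the same way in both directions. Here is a minimal sketch (DistCp remains the right tool for bulk transfers; the cluster, bucket, and paths are hypothetical):

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class ColdDataOffload {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Source: a cold partition sitting in HDFS (hypothetical path).
        FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);
        Path src = new Path("hdfs://namenode:8020/warehouse/sales/year=2014");

        // Target: an S3 bucket through the S3A connector (hypothetical bucket).
        FileSystem s3 = FileSystem.get(URI.create("s3a://my-archive-bucket/"), conf);
        Path dst = new Path("s3a://my-archive-bucket/warehouse/sales/year=2014");

        // Copy without deleting the source; swap src and dst to restore.
        FileUtil.copy(hdfs, src, s3, dst, /* deleteSource */ false, conf);
    }
}
```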
The second integration pattern moves a step closer to the cloud storage itself: the application reads the data directly from cloud storage and processes it, but the writes go through a two-hop approach, where the first hop writes onto HDFS and, at a later point in time, a tool like DistCp transfers the data from HDFS to the cloud storage. The reason for doing this is that certain cloud vendors may not offer the same semantics that HDFS provides. Just to give an example, Amazon S3 provides eventual consistency, whereas HDFS gives much stronger consistency, so if you want to eliminate those factors, you write the data onto HDFS first and then copy it onto the cloud storage.

The last diagram talks about the end goal we would like to head towards, where the data is read directly from cloud storage by the applications, processed, and written directly back onto cloud storage, but we still retain HDFS as the storage medium for intermediate data. Just to give an example, let's say you have a very large ETL pipeline that spans some ten intermediate stages; you don't want to store the intermediate data onto cloud storage again and again, because the performance characteristics of transferring that data can be quite different. In those situations you store the intermediate data on HDFS itself and only the final data onto cloud storage.

So what kinds of problems do we have in moving towards that goal? If you look at it, cloud object stores are mainly designed for scale, cost, geographic distribution, availability, and so on. However, there are certain challenges associated with that. For instance, how do we make sure that the Hadoop applications being developed work seamlessly on both HDFS and the different cloud object stores? Take consistency as an example: HDFS offers a strong consistency model, whereas certain cloud vendors like Amazon provide eventual consistency. And in the older model, HDFS and MapReduce sat on the same nodes, and the compute was pushed to where the data was located; in the cloud it is totally different, because the storage is completely separated out and the compute runs in EC2 or its equivalent at other cloud vendors, so the performance characteristics can vary a lot. On top of that, cloud stores are not designed for file-system-like APIs and usage, and there are certain limitations in the APIs themselves. Just to give an example, in HDFS a rename is an atomic operation, but with cloud vendors it may not be; again taking Amazon as the example, a rename is a copy-and-delete operation, so it is not atomic there. So there are real semantic differences between HDFS and what the cloud vendors end up offering.
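To make that rename caveat concrete, here is a small sketch of the classic commit-by-rename pattern; on HDFS the final rename is a single atomic metadata operation, while on S3A it is internally a copy of every object followed by a delete, so it is neither atomic nor constant-time (the paths are hypothetical):

```java
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CommitByRename {
    public static void commit(FileSystem fs) throws Exception {
        Path tmp = new Path("/output/_temporary/attempt_0001");
        Path done = new Path("/output/part-00000");

        // On HDFS this is one atomic metadata operation.
        // On S3A the same call becomes copy + delete per object:
        // readers can observe partial output, and the cost grows
        // with the amount of data being "renamed".
        if (!fs.rename(tmp, done)) {
            throw new RuntimeException("commit failed");
        }
    }
}
```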
So what is the goal? The primary goal is to integrate the unique functionalities provided by the different cloud vendors into Hadoop itself, and to optimize each of the storage connectors we have for the different cloud vendors. And not only at the connector level: we would also like to optimize at the higher levels, in Hive, Pig, Spark, MapReduce, Cascading, and so on, so that they can make better use of the connectors. We'll talk about the performance optimizations in later sections.

This slide talks about the Hadoop compatible file system architecture. If you look at the different applications, MapReduce, Hive, Spark, and so on, they do not talk directly to the file-system-level implementations. They go through an abstraction layer, which is a set of interfaces and abstract classes like FileSystem, FileContext, and AbstractFileSystem, and beneath that you have the concrete implementations like HDFS. I'll cover the HBase part a little later, but if you look at the underlying layer, we used to have just HDFS; now we have a set of different providers: WASB, the blob object store connector from Microsoft; ADL, which is Azure Data Lake, again from Microsoft, built for big data analytics workloads; S3 from Amazon; and GCS, Google Cloud Storage. These are some of the cloud storage connectors already available in the market.

As mentioned, WASB is the Microsoft blob storage connector; it provides a really strong consistency model, and it offers good enough performance that you can even run HBase on top of it. The reason I said I'd cover HBase later is that HBase has strict semantics that it expects from HDFS, and WASB satisfies them: it is strongly consistent and performs well. Similarly, ADL, Azure Data Lake from Microsoft, offers strong consistency and is tuned for big data analytics workloads. S3A is for Amazon S3 and offers an eventual consistency model; there is work open in the community to make it optionally consistent, a lot of performance improvements are in place, and it is very actively developed in the Apache world. EMRFS is a proprietary connector from Amazon for their EMR clusters, and it offers an optional strong consistency model at an additional cost. Google offers GCS, which provides multiple configurable consistency policies and good performance, and they are planning to open-source it, so it should be in Apache pretty soon.

I'd like to take S3A as a case study and talk more about its functionality as well as its performance characteristics. As mentioned earlier, one of the main objectives was to integrate the unique functionalities provided by the different cloud vendors. A classic example in S3A is that it offers multiple authentication models, almost five or six of them. It offers a basic authentication model where you provide the access key and the secret key in the Hadoop configuration files themselves, and at runtime Hadoop picks up those details and connects to the S3 buckets. The disadvantage of this approach is that the moment the configuration files are exposed to anyone, they can read the keys directly, and there is the possibility of the buckets being accessed at a later point in time. Or you can use the EC2 instance metadata, where AWS makes the credential information available on the EC2 instances themselves, and through APIs you can read those credentials and then process your data in S3; that is a much more secure option, because you are not exposing any credentials in the configuration files. It also supports environment variables for providing the credential details, a less secure option, but it might be useful for some applications. And it offers session credentials, where you create temporary credentials from Amazon's token service that are valid only for a limited amount of time; that is a much more secure option if you have to reduce the impact of credential leaks. Or, if you have public data, you can even use anonymous login. So these are the different authentication models supported in S3A now.
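As a rough sketch, these models map onto S3A configuration keys roughly as below; the key names are from the Hadoop 2.8-era S3A connector, so verify them against your version, and the values shown are placeholders:

```java
import org.apache.hadoop.conf.Configuration;

public class S3aAuthExamples {
    public static Configuration basicAuth() {
        Configuration conf = new Configuration();
        // Basic model: keys sit in the configuration, so anyone who can
        // read the config files can read the credentials.
        conf.set("fs.s3a.access.key", "AKIA...");        // placeholder
        conf.set("fs.s3a.secret.key", "not-a-real-key"); // placeholder
        return conf;
    }

    public static Configuration sessionAuth() {
        Configuration conf = new Configuration();
        // Session model: short-lived credentials from the AWS token service.
        conf.set("fs.s3a.aws.credentials.provider",
                 "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider");
        conf.set("fs.s3a.access.key", "ASIA...");        // placeholder
        conf.set("fs.s3a.secret.key", "not-a-real-key"); // placeholder
        conf.set("fs.s3a.session.token", "placeholder-token");
        return conf;
    }

    public static Configuration anonymousAuth() {
        Configuration conf = new Configuration();
        // Anonymous model: for public buckets only.
        conf.set("fs.s3a.aws.credentials.provider",
                 "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider");
        return conf;
    }
}
```

With no provider configured, the connector falls back to its default credential chain, which is typically also how the environment variable and EC2 instance metadata models get picked up.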
S3A also offers server-side encryption: you can transparently encrypt the data stored on S3 with the AES-256 cipher. So those options are supported now as well.

In terms of performance improvements, seek was one of the most expensive calls in the earlier implementation of the S3 file system. Whenever you did a seek, what used to happen is that it would break the existing connection to S3, re-establish the HTTPS connection, reopen the file, and land on the specific location. That is a very expensive call because, as you know, re-establishing an HTTPS connection is costly; just to put it in perspective, it is about as expensive as reading some 300 to 400 KB of data from S3. The problem gets more pronounced the moment you have something like a positional read, where the read is not supposed to change the current offset within the file. What I mean by that is that a positional read is internally implemented as a seek, followed by a read, followed by a seek back: you seek to a specific location, read a set of bytes, and then seek back to the original position. So there are two seeks involved, and the earlier implementation would abort and re-establish the connection both times. Any input format that makes heavy use of this particular API was going to suffer a lot on S3 compared to HDFS. In the recent implementations, we have fixed this problem by making seek a lazy, no-op call and reopening the file only on a need basis; that reduced the number of connection aborts by almost 50%.

The second problem we had was with backward seeks. Some file input formats, like ORC or Parquet, first read the data located at the end of the file: they read the footer information first and then head back towards the beginning of the file to read the actual stripe or block details. That became very expensive because, the moment you do a backward seek, the connection was getting terminated. In recent versions we fixed this problem by introducing an fadvise-style input policy variable that can be set to random mode or sequential mode. The moment you set it to random mode, we request from S3 exactly the range of bytes that we are going to read, thereby reducing the number of connection aborts.
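Here is a minimal sketch of the random-read policy together with a positional read; the fs.s3a.experimental.input.fadvise key is the Hadoop 2.8-era name and, being experimental, may differ in your version (the bucket and path are hypothetical):

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RandomReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "random" asks the connector to fetch only the requested byte
        // ranges instead of streaming ahead -- a good fit for ORC/Parquet,
        // whose readers jump to the footer and then back into the file.
        conf.set("fs.s3a.experimental.input.fadvise", "random");

        FileSystem fs = FileSystem.get(URI.create("s3a://my-bucket/"), conf);
        Path file = new Path("s3a://my-bucket/data/file.orc");
        long fileLen = fs.getFileStatus(file).getLen();

        try (FSDataInputStream in = fs.open(file)) {
            // Positional read: does not move the stream's current offset.
            // With the old connector this cost two connection aborts;
            // with lazy seek plus random fadvise it is a single ranged GET.
            byte[] footer = new byte[16 * 1024];
            in.readFully(Math.max(0, fileLen - footer.length), footer);
        }
    }
}
```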
Let's now look at an application like Hive, which is built on top of this layer, and try to understand how all this impacts performance. At a very high level, the access patterns you see in Hive are, first, the ETL or admin-related activities: you bring in the data from external sources, create tables, analyze the tables, or do some transformations so the data can be exposed to end users; that can also mean computing column statistics or, if you have lots of partitions, running commands like MSCK. Basically the admin-related activities. Once the data is exposed to the end users, they run a set of queries to mine the data, and they might write results out to some specific location; the ETL side writes data out as well. So basically read and write access patterns.

If you already have the data in S3, you would create an external table in Hive to expose that data to the end users. And if the data set is already partitioned, the partition-related information is not populated into the metastore automatically, so you have to run a command like MSCK, which is provided in Hive itself; it scans the directories within the table and populates the metastore, one partition at a time. The reason this can be expensive is that if you have, say, a thousand partitions, it scans each one in a sequential fashion and then updates the metastore; that is a very expensive call. In recent versions we fixed this problem by reducing the number of calls to the metastore and parallelizing this activity, and got almost a 3x improvement in response time.

For a Hive query to execute efficiently, it is extremely important to have accurate table-level as well as column-level statistics. You can gather the table statistics in an automated way by setting hive.stats.autogather to true, but to compute the column statistics you have to run an explicit command like ANALYZE TABLE ... COMPUTE STATISTICS FOR COLUMNS whenever you update the data set. That turned out to be a very expensive call because of the number of calls made to the metastore; in recent versions we fixed this as well and saw almost a 3x improvement in response time.

The reason these things matter is that, as I mentioned earlier, if you spin up an ephemeral cluster, you want to bring in the data from the external sources as quickly as possible, expose it to the end users, and let them complete their queries as fast as possible; only then do you make the best utilization of the cluster. That is one of the reasons we concentrated on all these areas.
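As a concrete illustration of the two admin steps just discussed, MSCK partition discovery and column statistics, here is a minimal sketch that runs them over the Hive JDBC driver; the host, database, and table names are hypothetical:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class PartitionAndStatsSetup {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver:10000/default");
             Statement stmt = conn.createStatement()) {

            // Discover partitions already laid out under the table's
            // S3 location and register them in the metastore.
            stmt.execute("MSCK REPAIR TABLE sales");

            // Column statistics still need an explicit command;
            // table-level stats can come from hive.stats.autogather.
            stmt.execute("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS");
        }
    }
}
```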
In terms of other performance considerations, if you look at the way Hive queries work, the moment you submit a query, the first thing that happens is split generation, and if you have lots of data to process, the split generation itself can take a lot of time. Certain file formats like ORC and Parquet support a thread pool that can be used for computing the splits: at a high level, there is a set of files to be scanned, and these threads communicate with S3 in parallel to get the information. So you might want to tune the size of the ORC thread pool based on your use case.

ORC also has different split strategies: it supports an ETL strategy, a BI strategy, a hybrid mode, and so on. The moment you choose the ETL strategy, it reads the footer information up front to do split pruning, and that can turn out to be quite expensive, because it reads the footers from S3 at split generation time itself. The good news is that it caches all those details in the local JVM, so when you run a similar query, or the same query, again, it is much faster: it does not read the data from S3 but looks it up in the local cache. This is especially useful in scenarios where, say, you start by analyzing one year's worth of data and gradually narrow it down to one month; the first pass over the year is a little expensive because it has to read the footers from S3, but once they are in the cache, the rest of the queries are fast.

Once the splits are computed, the next thing that happens is setting the number of tasks and letting the tasks proceed. The split information is handed to the task side, and each task reads the files to perform its part of the query. What used to happen on the task side is that the moment it received the split, it read the ORC footer again, because it needs those details to locate the data blocks. But we had just read the footer details on the ApplicationMaster side, so the footer can instead be passed along with the split payload to the task side. That optimization existed earlier, but it used to occupy a lot of memory on the AM side; those problems have been fixed now, and it helps reduce the number of S3 reads on the task side.

Hive uses Tez as its default execution engine, and the moment the split information is computed on the Hive side, it is handed to Tez for further optimization. Tez does a lot of grouping based on the min and max split group sizes you have configured, the location details, and so on. The interesting observation is that S3 always returns localhost as the location information. So if you have a very small table, Tez would aggressively group everything down to a single task. Take the example of the item table in the TPC-DS data set, which has some 52 files: if the total is below the minimum threshold, Tez would optimize it aggressively into one task, and that task would open all 52 files in a sequential fashion and decode the data in them, which is a very expensive operation. In recent versions this problem has been fixed: Tez does not group that aggressively when it does not have real location details in advance.
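Pulling together the knobs mentioned here and in the best-practices wrap-up below, here is a sketch of the relevant configuration; the key names match HDP 2.x / Hadoop 2.8-era releases, and the values are illustrative starting points rather than recommendations:

```java
import org.apache.hadoop.conf.Configuration;

public class CloudTuningKnobs {
    public static Configuration tuned() {
        Configuration conf = new Configuration();

        // ORC split generation: strategy, plus the thread pool that
        // fetches footers from S3 in parallel.
        conf.set("hive.exec.orc.split.strategy", "HYBRID");
        conf.setInt("hive.orc.compute.splits.num.threads", 32);

        // Tez split grouping bounds (bytes); with S3 reporting localhost
        // for every block, these control how aggressively tasks merge.
        conf.setLong("tez.grouping.min-size", 16L * 1024 * 1024);
        conf.setLong("tez.grouping.max-size", 1024L * 1024 * 1024);

        // Capacity scheduler: S3 gives no rack locality, so don't wait for it.
        conf.setInt("yarn.scheduler.capacity.node-locality-delay", 0);

        // S3A uploads: chunk large writes locally, upload parts in parallel.
        conf.setLong("fs.s3a.multipart.size", 128L * 1024 * 1024);
        conf.setBoolean("fs.s3a.fast.upload", true);

        return conf;
    }
}
```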
We also observed that in AWS the container launches were slow when we used the capacity scheduler. That problem was addressed the moment we realized that AWS does not give you rack locality and so on: you can turn the node locality delay down to zero, and that helps speed up the container launch rate.

With all these things in place, we ran Hive TestBench, which is a subset of the TPC-DS queries. We used the 200 GB scale of TPC-DS, and the data was stored in S3. We used m4.4xlarge nodes, which are general-purpose nodes, five nodes in all, and we compared a previous version of HDP with the latest one on the cloud. An interesting observation is that a lot of queries, query 15, query 17, and so on, did not even run on the earlier versions, because AWS connection timeouts were happening. With the recent versions this has been fixed: all the queries run, and we get an average speedup of 2.5x. If you want the queries to run even faster, you can make use of Hive with LLAP, which is in tech preview; it gives a lot more performance by reducing the amount of data that needs to be read from S3, and we have seen almost a 4x improvement.

In terms of best practices: if you have a large volume of data, you might want to tune the multipart upload settings, which chunk the data locally and upload the parts in parallel. If you are using the capacity scheduler, you might want to turn off the node locality delay. If you are using Hive, you might want to turn off the storage-based authorization provider. And if you have a really large data set, you might want to tune the number of ORC split-generation threads. The configuration sketch shown earlier collects these knobs in one place.

So with that, I'm pretty much done with the talk, and open for Q&A. Actually, we've run out of time for Q&A, I'm afraid, so maybe we can take it offline. Sure. Thank you.