Welcome to a CNCF webinar on improving data locality for analytics jobs on Kubernetes using Alluxio. I'm a community program manager at Microsoft and a Cloud Native Ambassador, and I'll be moderating today's webinar. We'd like to welcome our presenters today, Gene Pang, PMC maintainer at Alluxio, and Adit Madan, software engineer at Alluxio. Just a few housekeeping items before we get started. During the webinar, you will not be able to talk as an attendee. There is a Q&A box at the bottom of your screen; please feel free to drop your questions in there and we'll get through as many as we can at the end. This is an official webinar of the CNCF and is subject to the CNCF code of conduct. Please do not add anything to the chat or questions that would be in violation of that code of conduct. Basically, please be respectful of all your fellow participants and presenters. And with that, I will hand it over to Gene to kick off today's presentation.

Hi everyone, this is Adit. I'm going to co-present the talk with my co-worker and friend Gene. Thanks everyone for joining. In the talk today, I'll begin with a quick introduction of Alluxio for those who are not familiar with what Alluxio is. Then I'll move on to describing how you would achieve data locality with Spark and Alluxio without Kubernetes. Then I'll give a quick recap of some Kubernetes basics that I'll use in the rest of the talk, and after that I'll talk about how we achieve locality with Spark on top of Alluxio in Kubernetes, what the challenges are, and the solutions we have come up with. Towards the end, I'll hand it over to Gene, who will talk about some recent innovations for structured data with Alluxio. With that outline, let's begin.

Before I give you a brief introduction of what Alluxio is, I'll start by giving you some context on the evolution of the big data ecosystem. If you look at the big data ecosystem as it started, there used to be only one compute engine, which was Apache Hadoop, and only one storage system, which was HDFS. HDFS and Hadoop were co-located on the same cluster, so all of your big data, all of the processing and the storage, lived on the same cluster. But if you look at the big data ecosystem today, there are a lot more compute frameworks and a lot more storage systems as well. You have things like Presto, Spark, and Flink, each compute framework catering to a specific kind of workload. And you also have different kinds of storage systems which are cheaper and more efficient in different scenarios. These storage systems could be on-premise, such as HDFS, or in the cloud, such as Amazon S3, Google Cloud Storage, or Azure Storage.

Like I was mentioning before, initially we only had co-located compute and storage: one big cluster with all of your data and all of your compute. What we observed during this time was that these clusters were typically compute bound. You have a large number of compute jobs trying to crunch data from a co-located HDFS, but the cluster is running out of CPU resources. And since the clusters were tied together, you had to scale both compute and storage together. So the model that people moved to after this was to disaggregate compute and storage: one set of nodes would run your compute and occupy the CPU resources, and a different set of nodes would be storage heavy and host your HDFS cluster.
But all of the storage was still on HDFS, and HDFS is more expensive than some of the cheaper forms of storage available today, such as object stores. Now, because these clusters were typically compute bound, one way of getting cheaper processing was to burst the compute out into the cloud. You would move the compute into the cloud, but you would still access data which was present on premise. We also moved away from just one compute framework and started using more compute frameworks such as Presto and Spark. And, like I was mentioning, HDFS stopped being the only data storage: HDFS was combined with, or even replaced by, storage completely in the cloud, or by cheaper forms of object storage on premise. But even with all of these new compute frameworks, all of them wanted to access data through a familiar API, and one popular API was the HDFS-compatible interface.

And this is where Alluxio comes in. When you have segregated compute and storage clusters, where the data and the storage cluster are not co-located with the compute resources, Alluxio sits as a layer between the compute cluster and the storage cluster. Typically Alluxio is co-located with the compute cluster, and Alluxio enables the compute framework to access data from your storage clusters. Alluxio is also responsible for a couple of different things which I have on this slide.

The first thing Alluxio did when it started as a project about four years back was to provide a global namespace for all of your data. What this means is that if your compute application, such as Spark, is accessing data from different storage systems such as HDFS and S3, Alluxio enables you to access both of these storage systems as part of the same file system namespace. You could have a directory within Alluxio pointing to HDFS and another directory within Alluxio pointing to an object store. Now, by locating Alluxio close to the compute cluster, Alluxio also acts as a cache for your data which might be remote. Alluxio acts as a read cache when accessing data from a remote storage for the first time, and also as a write cache when new data is written back by the compute framework, and it can persist that data to the storage systems asynchronously in the background.

After the first version of Alluxio, the next step in its evolution was a set of data management features. A couple of slides back I talked about how we saw people migrating from big on-premise clusters, where HDFS and Hadoop were co-located, to the cloud, where they could have a mix of HDFS and S3 or other object storages as the storage system. The data management features baked into Alluxio allow you to set policies to ease migration of data from HDFS to your object store. You could set policies which migrate the data based on access or based on a time interval. Some of the recent innovations in Alluxio include what we call the structured data catalog and the data transformation platform. These allow Alluxio to be more than just a file system and to work well with data analytics engines. I won't speak about that too much, as Gene is going to talk about it towards the end of the talk.
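To make the global namespace idea concrete, here is a minimal sketch of mounting two storage systems under one Alluxio namespace; the namenode address, bucket name, and paths are placeholders, and credentials are left out.

```bash
# Mount an HDFS path and an S3 bucket as directories of the same Alluxio namespace
# (addresses, bucket names, and paths below are placeholders).
alluxio fs mount /mnt/hdfs hdfs://<namenode>:8020/warehouse
alluxio fs mount /mnt/s3   s3://<bucket>/warehouse   # S3 credentials would be passed via --option flags
```

A compute framework can then address both stores through a single file system, for example as alluxio://<master>:19998/mnt/hdfs/... and alluxio://<master>:19998/mnt/s3/....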
Okay, so now that we have a quick recap of what Alluxio is, I'll talk about why we need an access layer like Alluxio in Kubernetes. In the picture I have on the right side, we have Spark and Alluxio co-located on a set of Kubernetes nodes, which is outlined as the blue Kubernetes cluster. That's the way Alluxio is typically deployed, close to the compute. And in this picture, you can see that Alluxio can access data from one or more storage systems. Let's say that in this case Spark and Alluxio want to access data from an object store like Amazon S3.

One of the key features Alluxio provides is that it brings back data locality to your compute frameworks in a world in which compute and storage are segregated. Object storage, typically located on hardware custom designed for storage, is always segregated from your compute cluster. In an environment like this, Alluxio will fetch data from your object storage onto the compute cluster on the first access, and frequent accesses will then have data locality, similar to the data locality we had in the first version of the big data ecosystem, in which Hadoop scheduled compute jobs on the nodes where the data was present. In this case, Spark or any other compute framework will schedule compute jobs on the Alluxio nodes where the data is cached for subsequent runs. The second thing Alluxio provides is a layer for sharing data across different jobs. If you run one Spark job and then another Spark job which reuses the same data, the data in the middle doesn't have to be persisted back to a slow object storage; it stays in Alluxio.

Okay, so let's talk about how data locality is achieved with Spark on top of Alluxio without Kubernetes. A typical workflow for Spark looks like what I have in this picture. You have a Spark job, and the Spark driver talks to a resource manager such as YARN or Mesos to launch executors and tasks on those executors. The executors themselves then access the data from an object storage or any other form of storage which is remote. In this picture, you can see that since there is no Alluxio and the storage is remote, there is no locality.

With Alluxio in the picture, the workflow for Spark works like what I have in this picture. You submit a job, and the Spark context talks to the Alluxio client to identify where the blocks are located. In this case the Spark job wants to access block one, block one is located on host A, and with this information Spark can schedule tasks on host A. Now, once we have a task scheduled on the host which contains the data, the next step is for the Alluxio client, which is part of the Spark executor JVM, to detect that the Alluxio worker it is accessing the data from is local, and then to access that data efficiently without going over the network stack. To avoid the network stack, Alluxio has two mechanisms: either the Alluxio client and the Alluxio worker share the local file system, or they go over a domain socket, which is more efficient than going over the network stack.

Just a quick recap, there are two steps to achieving data locality with Spark on Alluxio. The first step is for Spark to schedule tasks on executors co-located with the Alluxio workers which have the data. And then, once a task is scheduled on that node, the Alluxio client detects that the worker is local and accesses the Alluxio worker using a mechanism that doesn't go through the network stack. That's what we call short-circuit access.
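As a rough illustration of the short-circuit path just described, the worker-side settings involved look roughly like the following; the socket directory is a placeholder, and this is a sketch of the relevant properties rather than a complete configuration.

```bash
# Append the short-circuit related settings to conf/alluxio-site.properties
# (/opt/domain is a placeholder directory that client and worker both need to see).
cat >> conf/alluxio-site.properties <<'EOF'
alluxio.user.short.circuit.enabled=true
alluxio.worker.data.server.domain.socket.address=/opt/domain
alluxio.worker.data.server.domain.socket.as.uuid=true
EOF
```

Naming the socket file after the worker's UUID is what later lets a containerized client recognize a local worker, which comes up again in the Kubernetes part of the talk.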
A quick overview of Kubernetes, and a couple of concepts that we'll use in this talk. As you all know, Kubernetes is a container orchestration system which makes it easy to deploy and orchestrate different applications such as Alluxio and Spark. Among other features, it provides an abstraction over the physical hardware, and it provides storage orchestration, which Alluxio internally uses to manage local storage; Alluxio in turn provides orchestration for distributed storage on Kubernetes. A couple of key terms: a node is the physical host itself; a container is, let's say, a running Docker image; a pod is a schedulable unit within Kubernetes; and a controller is something that controls the desired number of replicas or the locations on which pods should be launched. A DaemonSet is the kind of controller that Alluxio uses in this talk, in which Alluxio workers are deployed on each and every node in the Kubernetes cluster.

Okay, with that, I'll move on to describing the challenges we faced in achieving data locality with Spark on top of Alluxio in a Kubernetes environment. Spark added Kubernetes support in 2.3. The architecture for Spark on Kubernetes looks like what we have in the picture: when you submit a Spark job, the Spark client talks to the API server and launches the Spark driver on one of the nodes in the cluster. The Spark driver in turn talks to the API server to schedule executors, and once the executors end up on the Kubernetes cluster, tasks are scheduled on those executors. Now, when you are deploying Spark and Alluxio in Kubernetes, the deployment model looks like this: Spark executors are ephemeral, so they last only for the lifetime of the Spark job, while Alluxio workers persist as a DaemonSet in the cluster across your Spark jobs. So even if there are no Spark jobs running, there would still be an Alluxio worker running on each node of the Kubernetes cluster.

Now, if you remember from a couple of slides back, the first thing that happens when you launch a Spark job on Alluxio to achieve locality is that the Spark driver needs to figure out which executor is co-located with the Alluxio worker that has the data. In a Kubernetes environment, when you're using container networking, the Alluxio worker pods would have an IP which is different from the physical host. The solution to this problem is that Alluxio workers use host networking, so when the Alluxio workers advertise the location of the data, they advertise the physical host as the location. The second part of the solution is that the Spark scheduler for Kubernetes has the privileges to map an executor to its physical host. Using the combination of Alluxio advertising the physical host address as the location of the data, and the Spark driver being able to map the executor to the physical host, Spark is able to schedule executors onto the Alluxio worker nodes which have the data.
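A minimal sketch of what the Alluxio worker DaemonSet described here could look like, with host networking and a hostPath directory for the domain socket; the image tag, paths, and labels are placeholders (in practice the Alluxio Helm chart or published specs would be the starting point).

```bash
# Deploy an Alluxio worker on every node with host networking and a shared
# hostPath directory for the domain socket (a hand-written sketch, not the official chart).
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: alluxio-worker
spec:
  selector:
    matchLabels:
      app: alluxio-worker
  template:
    metadata:
      labels:
        app: alluxio-worker
    spec:
      hostNetwork: true                  # workers advertise the physical host address
      dnsPolicy: ClusterFirstWithHostNet
      containers:
      - name: alluxio-worker
        image: alluxio/alluxio:2.1.0     # placeholder image tag
        args: ["worker"]
        volumeMounts:
        - name: alluxio-domain
          mountPath: /opt/domain         # domain socket directory inside the container
      volumes:
      - name: alluxio-domain
        hostPath:
          path: /tmp/alluxio-domain      # the same host directory is mounted into Spark executors
          type: DirectoryOrCreate
EOF
```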
The second challenge: once we have tasks that end up on the Kubernetes nodes with the Alluxio workers containing the data, the job at hand is for the Alluxio client, part of the Spark executor, to access the Alluxio block efficiently using a domain socket, an access path which doesn't go over the networking stack. The way an Alluxio client detects whether an Alluxio worker is local in a non-Kubernetes environment is by comparing its hostname with the Alluxio worker's hostname. But since we're using virtualized networking for the Spark executors, the IP of the Alluxio client doesn't match the IP of the Alluxio worker. The other challenge is that the Alluxio client does not share a directory with the Alluxio worker. For both of these challenges, the solution we have is to mount a hostPath volume which is shared between the Alluxio client, which lives in the Spark executor, and the Alluxio worker. So there is a directory shared between the Alluxio workers and the Alluxio client. Each Alluxio worker identifies itself with a unique ID, and the Alluxio client accesses the Alluxio worker over the domain socket path. The client detects whether the worker is local simply by looking at the file system: if it finds the unique ID of the Alluxio worker on the file system that's mounted into the Spark executor, it knows the Alluxio worker is local and can access it over a domain socket. This was enabled by Spark 2.4, which allows mounting hostPath volumes into Spark executors.

So, a quick recap of what the solution looks like. The first step was Spark being able to schedule tasks onto the executors co-located with the data, which was achieved by using host networking. The second step was accessing data efficiently, which was done by using a hostPath volume on the Spark executors to detect whether a worker is local, by checking whether the worker's UUID matches what is on the Spark executor's volume, and then accessing it using domain sockets. With that, I'll just mention a couple of limitations of the solution we have in place right now: host networking and hostPath volumes may not be available in enterprise environments because of security concerns. For that, we have plans to use local persistent volumes instead of hostPath volumes, and also to implement a translation between host IPs and pod IPs in the Alluxio client itself.
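On the Spark side, here is a rough sketch of the submit options that mount the shared domain socket directory into executors, assuming Spark 2.4+ on Kubernetes; the master URL, image, volume name, and paths are placeholders.

```bash
# Submit a Spark job with the Alluxio domain socket directory mounted into executors
# as a hostPath volume, so the client can detect a local worker by its UUID.
spark-submit \
  --master k8s://https://<api-server>:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=<spark-image> \
  --conf spark.kubernetes.executor.volumes.hostPath.alluxio-domain.mount.path=/opt/domain \
  --conf spark.kubernetes.executor.volumes.hostPath.alluxio-domain.mount.readOnly=true \
  --conf spark.kubernetes.executor.volumes.hostPath.alluxio-domain.options.path=/tmp/alluxio-domain \
  --class <main-class> \
  local:///path/to/app.jar
```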
So with that, we're going to switch gears from Spark and talk about Presto and SQL workloads on top of Alluxio. The first deployment model I talked about was one where Alluxio is pre-deployed: Alluxio exists on the cluster, and then we launch Spark executors which are ephemeral and live only for the lifetime of a job. The other model of deploying Alluxio in Kubernetes is what I have in the picture on the right with Presto and Alluxio: a Presto worker and an Alluxio worker are in the same pod, so when you schedule a Presto worker, you schedule an Alluxio worker co-located with Presto. Okay, with that, I'll hand it over to Gene, who's going to talk about structured data management.

Okay, thanks Adit. So yeah, Adit talked a lot about the work we have done with Alluxio, Spark, and Kubernetes, and I'll talk a little bit about some of the new innovations in Alluxio that try to address and optimize for structured data, with engines such as Spark and Presto. Here's a short motivation. When we looked at a lot of the use cases for Alluxio, as Adit mentioned before, Alluxio is a layer that sits between many applications and different storage systems, and by doing this Alluxio can provide a unified interface and namespace, as well as locality and caching for the applications.

And when we examined those applications, we noticed that a lot of them are actually dealing with structured data, with SQL engines such as Presto, Spark, and Hive. So we noticed that a lot of the usage of Alluxio is with structured data and SQL engines, and we decided to take a closer look at these types of workloads. Ultimately there are two types of systems involved in this ecosystem: the storage systems and the SQL frameworks or SQL engines. The storage systems are primarily involved with the storage and serving of files and objects, directories and raw bytes, and this is heavily skewed towards being storage optimized. On the other hand, the applications and SQL frameworks are primarily concerned with data in tables and schemas, rows and columns, which is slightly different from what the storage systems provide. These SQL frameworks are really looking for compute-optimized data. So ultimately there is a mismatch between what storage systems typically deal with and provide, and how SQL frameworks and applications want to consume the data. And with this mismatch, we see additional opportunities to further expand the benefits that Alluxio can provide in these ecosystems.

Looking at the two parts of the ecosystem, the storage systems and the SQL frameworks, we see Alluxio as a bridge between them. As Adit and I have mentioned earlier, Alluxio already provides the caching benefits and the unified interface and namespace between these two sides, which helps the SQL applications. In addition to that, we think that with structured data management we can have schema-aware optimizations, by understanding how the data is structured and how the data is being computed on. Additionally, we think Alluxio can provide compute-optimized formats, presenting the data in a way that is optimized for the applications and not necessarily optimized for storage. This ultimately leads to a powerful concept, which is physical data independence: the way that storage systems store the data can be independent from the way that applications consume and compute on the data.

This leads to a high-level philosophy of what structured data management looks like in Alluxio. First, provide structured data APIs, which really means focusing on how these SQL frameworks want to interact with the data. Second, cache the logical data access, as opposed to the physical data access that storage systems provide; this is really about caching what frameworks and SQL engines want and how they want to compute on the data.

Here's a very high-level overview of the architecture of Alluxio structured data management. We have the storage systems on the left and the SQL engines on the right, and in the middle we have several different components. First, Alluxio has a transformation service, which is important for providing physical data independence: it converts and transforms data that's in a storage-optimized format and representation into a compute-optimized representation. Next, we have structured data and metadata.
This is where we store the transformed data, maintain the metadata necessary for that transformed data, and also manage the metadata for the tables and schemas for the clients. And lastly, the last main component is the logical data access layer, which is essentially the client into all of these other components in Alluxio. This client is what the SQL applications use in order to access the data in the Alluxio structured data management system.

In the latest release, Alluxio 2.1, we have a developer preview of these new features and components, and this is the target environment for the developer preview, which is Presto. Presto is a SQL engine that runs in various different environments. In this specific environment, the typical setup is Presto talking to storage and to the Hive Metastore, and it does that via the Hive connector within Presto. With the new Alluxio structured data management, we have a new Alluxio connector in Presto, which allows Presto to communicate with the various Alluxio services. One service is the Alluxio caching service; this is already part of Alluxio today, which we have mentioned from the beginning, where Alluxio can cache a lot of the data that is stored in the storage systems. The Alluxio connector also communicates with the Alluxio catalog service, which maintains the metadata of the various tables, schemas, and databases, and this catalog service is what communicates with the Hive Metastore. And we also have the Alluxio transformation service, which is primarily responsible for transforming data from storage-optimized formats into compute-optimized formats.

The Alluxio catalog service, as I mentioned, primarily manages the metadata for structured data, such as tables and schemas. It has the concept of an under database (UDB), which is an abstraction of other database catalogs that lets Alluxio connect to them and understand their schema information. The main way to use it is to attach an existing database or metastore into the catalog service to get that information. The benefits it provides are schema-aware optimizations as well as a simplified deployment. There's also a new Alluxio Presto connector which provides tighter integration with Presto; it is heavily based on the existing Hive connector in Presto and is available today in Alluxio 2.1. Right now, as we speak, we are in the process of merging this connector back into the Presto code base.

And lastly, there is the Alluxio transformation service. Its primary goal is to transform data to be compute optimized, and this compute-optimized format is independent from the storage-optimized format, which ultimately provides the physical data independence. Two types of transformations exist today in the developer preview. The first is the coalesce feature, which takes many small files and transforms them into fewer files; this is important because when there are too many files in storage, it can be inefficient to query them through Presto. The second is a format conversion: in the developer preview, we can convert CSV files into Parquet files, because Parquet is a more compute-optimized format than CSV, which is just plain text and more difficult to parse. So these are the two types of transformations available today to make data more compute optimized.
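Before the demo, a hedged sketch of how Presto can be pointed at the Alluxio catalog service instead of the Hive Metastore directly; the catalog file name, master address, and the exact property names depend on the Presto and Alluxio versions in use, so treat these as assumptions rather than the definitive configuration.

```bash
# Create a Presto catalog that resolves table metadata through the Alluxio catalog
# service (property names are assumptions; the address and file name are placeholders).
cat > etc/catalog/catalog_alluxio.properties <<'EOF'
connector.name=hive-hadoop2
hive.metastore=alluxio
hive.metastore.alluxio.master.address=<alluxio-master>:19998
EOF
```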
So now I'll show a short demo of the new Alluxio features I just talked about. First, a quick description of the setup: in the demo, I have two clusters on AWS. One has Presto, the Hive Metastore, and data on S3, and the other one is the same setup but it also has Alluxio. We're using a simple dataset on S3, and some of the tables actually have 10,000 CSV files each, which is very inefficient and not at all compute optimized. I will show you later how the transformations can help make that data more compute optimized.

So we have two clusters here. On the left, we have the cluster with Presto and Alluxio, and on the right we have a cluster with only Presto, talking directly to the data. The first thing we're going to do is attach an existing Hive Metastore to the Alluxio catalog service. Here's the attach command I'm running right now. What this does is simply communicate with the Hive Metastore and extract the information it needs from the metadata. And if you take a look at the tables stored there, it returns all the tables in the TPC-DS dataset. Next, in that same cluster, let's take a look at how Presto is configured. Here you can see it's a very simple way of connecting Presto to Alluxio with the catalog service: you just point it to the appropriate location, and it will then communicate with the Alluxio catalog service. Once we start the Presto CLI, we can simply run show tables, and that shows the tables that are already in the Alluxio catalog service. We can also run a simple query which reads a few rows from the item table, and that's reading from the Alluxio structured data service.

As I mentioned before, some of these tables are very unoptimized for computing on, and store_sales is one of those tables. So what we're going to do is transform store_sales to be more compute optimized. We kick off the transformation with this transform table command. If you take a look at the output directory, right now it doesn't exist, because the transformation is just starting; in about 30 seconds or so it'll finish.

While that is running, let's take a look at the other cluster on the right, which only has Presto and Hive. Here we start up another Presto CLI and show the tables. This Presto is talking directly to Hive and directly to the storage, and we see the same set of tables. Now we're going to run one of the TPC-DS queries on this data, and it does access the store_sales table, so here we see it reading those CSV files directly, and we can see how long it takes with just Presto talking to Hive and to S3 directly. It took about 18 seconds to finish here. Now let's go back to the cluster that has Alluxio in it and check on the transformation. Okay, if we take a look at the data, we see that about one and a half gigabytes of data have been transformed.
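Pulling together the commands used in the demo so far, a rough sketch assuming the Alluxio 2.1 structured-data CLI; the metastore host, database, and table names are placeholders.

```bash
# Attach an existing Hive Metastore database to the Alluxio catalog, list its tables,
# and kick off a coalesce plus CSV-to-Parquet transformation of one table.
alluxio table attachdb hive thrift://<hive-metastore-host>:9083 tpcds
alluxio table ls tpcds
alluxio table transform tpcds store_sales
```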
If we now run that same TPC-DS query on this cluster, the Alluxio cluster, we see that it is reading the transformed data from Alluxio. Here it was faster: it took only seven seconds to read this data. And if we take a look at the status of the files in Alluxio, we now see that the data is 100% in Alluxio. What that means is that for the first read it was not cached yet, which is why it was at 0% in Alluxio before, but after the first query of the data, Alluxio transparently caches it, so now it's fully cached in Alluxio. And when we run the query one more time, now reading the cached data, it's even faster than before, about three seconds. So we essentially went from almost 20 seconds, to about six or seven seconds for the first read of the transformed data, and then to three seconds for the cached read.

Just to summarize what I did in the demo: I attached an existing Hive database into the Alluxio catalog with the attachdb command, and the Alluxio catalog was able to serve all the necessary information and metadata to Presto. Then we transformed the store_sales table, by coalescing and by converting CSV to Parquet, to make the data compute optimized. To summarize the numbers: with no Alluxio involved, the query took around 18 to 20 seconds running directly on the CSV data; after we transformed it into fewer Parquet files, the query went down to about seven seconds; and after the data was cached, the resulting query took about three seconds. This shows the benefit of being able to provide compute-optimized data and compute-optimized metadata for these SQL applications.

So yeah, we have a developer preview of this in Alluxio 2.1, and we definitely want to work with the community to get feedback and collaborate on implementing new features in this area. We have several future projects planned, such as new UDB implementations and conversion formats, and later on support for DDL and DML commands and new client APIs. Finally, it is available today in Alluxio 2.1, which is the latest release, as a developer preview, and we would love for people to try it out and provide feedback. So yeah, thank you very much. That is the end of the talk, and I guess now we can have some time for questions.

Yes, thank you for the presentation. Like we mentioned, we'll do questions right now. If you do have questions, please drop them into the Q&A tab at the bottom of your screen and we'll go through as many as we can. It looks like there are a few questions.

First question: is the co-located Presto and Alluxio Dockerfile publicly available? So the Dockerfile for Alluxio and Presto is not publicly available right now. It's in the works, so it should be available pretty soon for you to be able to deploy Presto and Alluxio in Kubernetes.

Perfect. Next question: is the ORC format with ZSTD compression available? For ORC: today, in the structured data part of Alluxio, if you don't run transformations, then all the existing formats that Hive makes available are available to Presto. If you do want to take advantage of some of the transformations that we have in Alluxio, then ORC is not supported today, but that's something we plan to support in the near future.

Great. Okay, let's see.
When bursting compute into the cloud, how does Alluxio manage cached data? Okay, thanks for the question, Garin. So let's say we begin with a compute cluster which is entirely on-premise, and then we start to use compute in the cloud, using, say, federated Kubernetes or any other way in which you might have joined clusters both in the cloud and on-premise. The cache policy in Alluxio caches data on access. So initially, when you have a bunch of jobs running on-premise, Alluxio will cache all of the data on the nodes which are on-premise. Then, once you start running jobs in the cloud, Alluxio will start caching data there as well: data gets cached on the nodes on which the compute job itself is running.

Cool, let's see, what else? How are the transformations performed? Yes, for structured data management, Alluxio uses something we call the job service within Alluxio to do the transformations. It's a simple distributed computation framework that can read data, do some operations on that data, and then write data out. The Alluxio transformations reuse the Alluxio job service in order to perform these specific transformations.

Okay: with co-located Presto and Alluxio, what percentage of RAM should be allocated to Presto and to Alluxio? The short answer is, it depends on the kind of queries you're running with Presto. I would say that if your Presto queries are highly compute intensive, then Alluxio can be configured to use what we call tiered storage: Alluxio can be configured to use SSDs or HDDs instead of RAM, so that your Presto queries can benefit from the additional RAM available to them. But if your Presto queries are more IO intensive, then it makes sense to allocate more RAM to Alluxio so that your jobs can benefit from IO acceleration. A good starting point would be about one third of your memory for Alluxio, and depending on your jobs, you can tune the amount of memory available to Alluxio up or down.

Okay, kind of a follow-up question to that: since both are memory-heavy tools, is it good to have Alluxio and Presto together, or is it best for these two to be installed separately? Like I mentioned before, Alluxio can be configured to not use memory heavily, and it does typically make sense to deploy Alluxio and Presto together.

Okay, great, let's see what else. Does Presto have access to both the Alluxio catalog service and the Hive Metastore? I see. With the new Alluxio structured data services in the newest Alluxio, if you use the Alluxio connector in Presto, then the connector will only communicate with the Alluxio catalog service, and the Alluxio catalog service will communicate with the Hive Metastore. So if the Alluxio connector is used in Presto, then Presto will only communicate with the Alluxio catalog service and not directly with the Hive Metastore.
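Going back to the memory question for a moment, a hedged sketch of what shifting Alluxio's cache off RAM and onto SSD could look like with tiered storage; the path and quota are placeholders.

```bash
# Configure the Alluxio worker to cache on a single SSD tier instead of RAM
# (appended to conf/alluxio-site.properties; path and quota are placeholders).
cat >> conf/alluxio-site.properties <<'EOF'
alluxio.worker.tieredstore.levels=1
alluxio.worker.tieredstore.level0.alias=SSD
alluxio.worker.tieredstore.level0.dirs.path=/mnt/ssd/alluxio
alluxio.worker.tieredstore.level0.dirs.quota=500GB
EOF
```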
Okay, there's another question: what about HA in Alluxio, how do you handle it? There are different modes in Alluxio for HA; I'll restrict my answer specifically to Kubernetes environments. In Kubernetes, for HA, we have something called the embedded journal, which is our mechanism for high availability. What you would do is launch multiple Alluxio masters, and the Alluxio masters run a consensus algorithm amongst themselves and elect one master as the primary. So there's no dependency on an external system like ZooKeeper for leader election; the logic is baked into the Alluxio masters themselves and they can choose a leader. For HA, the Alluxio masters use a local persistent volume for the journal, so that the journal is preserved across restarts of an Alluxio master, and a secondary master can become primary and quickly serve the latest state using the journal. Great. Let's see.
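As a hedged sketch of the embedded journal setup just described, the master-side settings look roughly like this; the hostnames, port, and journal path are placeholders, and in Kubernetes the journal folder would sit on a persistent volume.

```bash
# Run three Alluxio masters with the embedded journal so they elect a leader among
# themselves, with the journal on a persistent path (all values are placeholders).
cat >> conf/alluxio-site.properties <<'EOF'
alluxio.master.journal.type=EMBEDDED
alluxio.master.embedded.journal.addresses=alluxio-master-0:19200,alluxio-master-1:19200,alluxio-master-2:19200
alluxio.master.journal.folder=/journal
EOF
```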
Okay, next question: if I'm using ORC files, can I use the Alluxio structured data service? Yes. If you use the Alluxio structured data service, it can still present all the data that Hive presents, so if Hive has a bunch of ORC files, the Alluxio structured data service and the Alluxio catalog will still work with that data. Now, if you want to transform that data, the transformations have some limitations, such as only converting from CSV to Parquet and things like that, but in the future we are going to support additional formats for the transformations. So if you don't use transformations, then all the formats that work today will continue to work.

Great. How does Alluxio manage data security or DLP? Oh, I'm not sure what DLP means, but Alluxio has ACLs, access control lists, so it has simple ways of doing authorization for the data. DLP meaning data loss prevention. Oh, I see. For data loss prevention, an important characteristic of Alluxio is that data persistence is handled by the under file system layer, which means whatever storage system sits below Alluxio is responsible for the persistence of the data. Alluxio is essentially not responsible for persistence; you can view the storage system as the source of truth, and Alluxio is either caching the data or providing some transformed version of it, but ultimately the data is persisted in the storage system. So the storage systems themselves have to provide that data loss prevention.

Great. How is the cache lifecycle managed? Is there support for external cache providers? For the lifecycle of the cache, like I mentioned earlier in the talk, Alluxio provides data management features which allow you to set policies for how to migrate data between one storage system and another and the Alluxio cache. So if you're accessing data from, let's say, HDFS, and it's being stored in Alluxio-managed storage such as memory or SSDs, and you eventually want to migrate it to an object storage system like Amazon S3, you can set policies which determine the lifecycle of the data based on access.

Okay, a few more questions. Is there any plan to support AWS Glue? Yes, there are plans. It's not there yet right now, but supporting Glue as another under database connection for the catalog service is coming in the near future.

Okay. In addition to Hive tables, can I import tables from databases like MySQL? Today the only supported UDB is the Hive Metastore, but I think it would be pretty reasonable to implement a UDB for MySQL, which would mean that Alluxio could provide the same kinds of services, such as caching and transformations, for tables in MySQL as well. So it's not there yet, but that is something we may implement in the future.

Okay. How long does the cache live? That is, do the workers die at some point when there are no more computations? Typically, the way Alluxio is deployed, the Alluxio workers are long running, so the workers do not die when there is no more computation. But for the storage that is managed by the Alluxio workers, you can set policies: you could say cache data only for one hour, or cache data only for a day.

Great. Okay, I think those are all the questions. Thank you, Gene and Adit, for a great presentation. I think that's all the time we have for today. Thank you everyone for joining us. The webinar recording and slides will be online later today, and we are looking forward to seeing you at a future CNCF webinar. Have a great day. Thanks everyone. Thank you. Thanks, bye.