All right, let's get started. We've got Adit Madan from Alluxio. Adit is a software engineer there and has been working on the Alluxio integration for DC/OS, and Mesosphere and Alluxio worked together on this partnership to make it happen. That's what Adit is going to talk about, and then he'll demo the integration as well.

Thanks, Ravi, for the introduction. Today I'm going to talk about how Alluxio can be used to accelerate Spark workloads on a Mesos cluster. I'll start with a brief overview of Alluxio for those who are not familiar with it, move to two use cases of this solution being used in production, then talk about the architecture of the solution in some detail, followed by the deployment of Alluxio on DC/OS and a demo with performance numbers.

To begin with the overview: if we look at the big data ecosystem of yesterday, there was only one compute framework, Hadoop MapReduce, and only one storage system, the Hadoop Distributed File System. The problem was that compute and storage were always co-located: if you had to scale out the storage, you would also scale out the compute resources, and vice versa. If you look at the ecosystem today, there are many more compute frameworks for both streaming and batch workloads, and an equal number or even more storage systems, each with its own pros and cons. In most cases, these compute frameworks and storage systems are not co-located. Not being co-located gives you the flexibility of expanding storage and compute independently. The issue is that each application manages connections to the storage systems individually, optimizing for storage access requires application-level changes, and there is no sharing between applications. If two applications share the same data, each caches it individually, duplicating the data in memory.

This is where Alluxio comes in. Alluxio provides a storage abstraction over all of the different storage systems you have, be it on-premise or in the cloud, across file systems as well as object stores, and you access the data through a file system API. This can be done without any code changes at the application level: you can continue to use your Spark, Presto, or Flink applications, you name it, without any application changes. High performance comes from managing the data in memory and providing shared access across the different applications you have. Alluxio has been used for a variety of applications ranging from big data analytics to deep learning, as I'll talk about in the next few slides.

To summarize, Alluxio unifies all of the data in your cluster, it provides high performance through in-memory data management, it provides cost savings, and there is no vendor lock-in, since from an end-user perspective migrating data across storage systems is transparent. I would also like to mention that Alluxio is one of the fastest-growing big data open-source projects. The graph we're looking at shows the number of contributors for different popular frameworks in their early months — the y-axis is the number of contributors — and as you can see, Alluxio is doing pretty well.

Moving on to use cases of this solution that are in production today. The first one I'm going to talk about is Qunar, China's biggest travel search portal.
They were using Spark and Flink in combination to handle an incoming stream of click data from their website, with HDFS and Ceph as the storage systems. They used Alluxio as a storage abstraction layer and also as a mechanism to share data in a processing pipeline flowing between Spark and Flink. On average, for their workloads, they saw about 15x performance improvements, and on peak workloads they see up to 300x. The link on the screen gives you more information about the solution.

The second use case I would like to talk about is Guardant Health. Guardant Health does genomics data processing — data analysis on genomics data for cancer patients. They use Spark to process data across the different storage systems they have, both on-premise and in the cloud, and they scale up to exabytes of data. They moved to a solution using Alluxio and Mesos and an object store called MinIO, where MinIO was used for cold data with backups to Amazon S3. They also saw orders-of-magnitude performance gains when they used Alluxio in combination with Mesos. The link on the screen has more information. In the words of our friends at Guardant Health, the performance benefits they saw from Alluxio could literally be a lifesaver for their patients.

Okay, so let's talk about the architecture of the solution and some of the details. Say we have two Spark applications running on a Mesos cluster. Each Spark application has its own context and its own cache of any data it accesses. In this picture, the two applications are accessing data from HDFS and Amazon S3, though it could be any other storage system. Whenever an application accesses the data, it maintains its own copy: in this diagram, the two applications each have their own copy of blocks one and three, and there is no sharing of data between them. The other thing we see in this picture is that the lifetime of the cached data is tied to the lifetime of the application itself.

Now, if you look at the same picture with Alluxio, both applications can talk to Alluxio without any code changes. Any data accessed from slow remote storage, such as HDFS or Amazon S3, is cached in Alluxio. Typically Alluxio sits close to the compute cluster — the Spark cluster in this case — while the HDFS or other storage system can be remote or co-located; Alluxio gives you the flexibility of maintaining performance regardless of the location of the storage you're accessing. In this case, if the same two applications access blocks one and three, there is only one copy of that data in Alluxio. This means that memory, which is expensive in your compute cluster, is well utilized: there is no duplication of data, which in turn leads to performance gains when memory would otherwise be over-committed.
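As a rough illustration of that sharing, here is a minimal sketch of two independent Spark applications reading the same path through Alluxio. The path, master hostname, and port are placeholders of mine (19998 is Alluxio's default master RPC port), and this assumes the Alluxio client is on the Spark classpath.

```scala
// Application A: the first access pulls the data from the under storage
// (HDFS or S3) and caches the blocks in Alluxio.
val clicksA = sc.textFile("alluxio://<alluxio-master>:19998/data/clicks")
clicksA.count()

// Application B: a completely separate Spark context (or another framework)
// reads the same Alluxio path. The blocks are already in Alluxio memory,
// so nothing is re-fetched from the remote store and nothing is duplicated.
val clicksB = sc.textFile("alluxio://<alluxio-master>:19998/data/clicks")
clicksB.count()
```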
As I mentioned before, with vanilla Spark, storage and compute are tightly coupled: the cached data lives in the same JVM as the application. So let's look at what happens when the application goes away — in particular, when the Spark context running the application is no longer accessible, either because of a crash or because you simply decide to close it. Once the application goes away, the storage associated with that application is also inaccessible, which means that when the same blocks are accessed again, they have to be fetched from the remote storage cluster, and that access is typically bound by the network or by slow disk I/O.

Now the same picture, but with Spark accessing data through Alluxio. In this case, if you have a crash, the data is still accessible in Alluxio, and Alluxio manages that data in memory, or across different tiers of storage, on the compute cluster. So when the same data is accessed again — by the application that died or by some other application — you still have it in memory. To summarize, the lifetime of the data is disassociated from the lifetime of the end application using it. This gives you performance both when data is shared across different applications and in scenarios where you have failures.

Alluxio has been integrated with DC/OS and is available as one of the packages in the DC/OS Universe, so you can use DC/OS to deploy and manage Alluxio on a DC/OS cluster. On a DC/OS cluster, Alluxio brings a unified view of all of the data you have, whether it lives on the DC/OS cluster or outside it, and it brings high performance and predictable SLAs for your workloads. DC/OS makes provisioning Alluxio easy and manages elasticity, both scaling up and scaling out. Together, the solution enables faster analytics with Spark and the other frameworks running on the DC/OS cluster, and it lets you access data from disparate storage systems, for example HDFS and S3. In the demo I'm going to show you next, we'll see HDFS mounted as the root under storage system in Alluxio — by under storage system I just mean one of the storage systems backing Alluxio — and you'll be able to access Amazon S3 in the same namespace, which is a file-system-like namespace.

The demo is based on an Amazon EC2 cluster, with Spark and Alluxio running on Mesos under DC/OS. We'll access data from Amazon S3 by running a simple Spark count application, and we'll look at some performance numbers for the solution. For versions, we used the latest release of Alluxio, Alluxio 1.5, with DC/OS 1.9.4 and Spark 2.0.2, running on Amazon m3.xlarge instances.

For the demo, we have a pre-deployed DC/OS cluster, and it will access data from HDFS. In HDFS we have a single file called license.txt, which will appear in the Alluxio namespace once we mount this HDFS location into Alluxio. We will also access data from an S3 bucket, which contains two files: a readme and a sample 1 GB file. The 1 GB file is what we'll use for the performance numbers. For Alluxio on DC/OS, we have a Docker registry set up. The registry holds the Alluxio client Docker image, which is built as part of the deployment process; the client image has the Alluxio CLI in it, which we'll use to access the Alluxio file system. In addition, during the deployment process we build a Docker image for running Spark on top of Alluxio; that image has the Spark shell and is also the image used for the Spark executors.

To install Alluxio on DC/OS, locate the package in the Universe. The first thing you need to do is obtain a license from Alluxio and base64-encode it. (Don't try to use the license shown here — it no longer works.)
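As a rough sketch, that step looks something like the following. The package name, file names, and option keys are assumptions on my part — check the package's entry in the DC/OS Universe for the exact configuration schema.

```sh
# base64-encode the license obtained from Alluxio; the encoded string goes
# into the package options (the file name here is a placeholder)
base64 alluxio-license.json

# install the Alluxio package from the DC/OS Universe with an options file
# carrying the encoded license and the root under storage setting
# (package name and options schema are placeholders)
dcos package install alluxio-enterprise --options=alluxio-options.json
```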
The next thing we do is set HDFS as the root under storage for Alluxio. As I mentioned before, anything you write to Alluxio, and anything you read from HDFS, will be available at the root of the Alluxio file system namespace. At this point we have specified just the under storage system for Alluxio and the license, and that's all that is needed to deploy Alluxio with the default configuration.

You can monitor the progress of the installation in the Services tab in DC/OS. After waiting a few minutes, you'll see all of the Alluxio processes come up, which includes the Alluxio master process and the Alluxio workers for the Alluxio distributed file system, as well as other auxiliary processes. Once the installation finishes, you have access to the Alluxio client image I mentioned before. So we log into the master node in DC/OS to access the CLI, and once we're on the master node, we pull the client image. The client image built as part of the deployment already has all of the configuration required to connect a client to Alluxio, so once you have the image, you can run the Alluxio CLI — it's a little hidden at the top of the screen — which here is alluxio fs ls. The file we see there, license.txt, is coming from HDFS, which we configured as the storage system backing Alluxio. The not-in-memory annotation means that only the metadata from HDFS has been fetched into Alluxio; the data itself will be fetched into Alluxio once we access the file.

The next thing we do is mount an S3 bucket into Alluxio, specifying the credentials required to access the bucket. In this case, we mount S3 at the location /s3a in the Alluxio file system, and dcos-demo is the name of the bucket in S3. With this, you can access both S3 and HDFS in a unified namespace, and you can easily migrate your data between the two storage systems. The end application only talks to Alluxio and is unaware of where the data is actually coming from; you just specify an Alluxio path, and that's all your application needs to talk to one or more under storage systems. As you can see, S3 was mounted at that location, and once we list the contents of /s3a in Alluxio, we see the same two files we had in our Amazon bucket: the readme and the sample 1 GB file. The sample 1 GB file, as I mentioned before, is the file we'll use for the performance numbers in the Spark count job. And again, the not-in-memory annotation only means that the metadata has been fetched from the storage systems backing Alluxio, not the data yet.
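Roughly, the CLI steps from this part of the demo look like the following when run from the client container. The registry address, image name, and credential values are placeholders; the /s3a mount point and the dcos-demo bucket match the demo.

```sh
# on the DC/OS master node: pull the Alluxio client image built during
# deployment (registry and image name are placeholders)
docker pull <registry>/alluxio-client

# inside the client container: list the root of the Alluxio namespace --
# license.txt shows up here because HDFS is the root under storage
alluxio fs ls /

# mount the S3 bucket at /s3a in the Alluxio namespace, passing the AWS
# credentials as mount options
alluxio fs mount \
  --option aws.accessKeyId=<ACCESS_KEY> \
  --option aws.secretKey=<SECRET_KEY> \
  /s3a s3a://dcos-demo/

# the bucket contents now appear alongside the HDFS data in one namespace
alluxio fs ls /s3a
```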
In the terminal I have open below, I'll run the Spark application, which is the count job. I log into the DC/OS master node and pull the Spark Docker image that was built as part of the deployment. Now that we have the image, we start a Spark shell running on Mesos on the DC/OS cluster. Once all of the executors have come up, the first thing we do is set the Spark log level to INFO so we can monitor the timing information. For the Spark application to talk to Alluxio, you specify the alluxio:// scheme followed by the Alluxio master address, and then the path of the file you want to access.

As I mentioned before, the sample 1 GB file is not in Alluxio memory yet; when we run the Spark application and the data is accessed through Alluxio, that is when the data gets pulled from S3 into Alluxio, and all repeated accesses of that data benefit from the acceleration of managing it in memory. So we run the count job, which counts the number of lines in the 1 GB file.
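The spark-shell session might look roughly like this; the master hostname and file name are placeholders, and it assumes the Alluxio client is already on the Spark classpath, as in the purpose-built Spark image from the demo.

```scala
// surface timing information from the executors
sc.setLogLevel("INFO")

// read the sample file through Alluxio; the first access pulls the data
// from S3 into Alluxio memory (19998 is the default Alluxio master RPC port)
val lines = sc.textFile("alluxio://<alluxio-master>:19998/s3a/sample-1g")

// count the lines; re-running this after restarting the shell is served
// straight from Alluxio memory
lines.count()
```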
We notice the locality level reported for this job: Alluxio does not yet have the data in memory, so there is no data locality from Spark's point of view. As you can see, this run took about 31 seconds to finish, and this is the run in which the data was actually pulled into Alluxio. Now, let's see what happens if we exit the Spark shell and repeat the same job. In the top shell, you can see that the annotation for the 1 GB file changed from not-in-memory to in-memory, which means Alluxio now has the blocks for this file; since the default block size is 512 MB, there are two blocks for the file in Alluxio. We restart the Spark shell and redo the same count job: set the log level to INFO again, re-read the file, and issue the count. This time we can see that the locality level was node-local, which means the tasks had complete locality, and that is why the performance improved from 31 seconds to 3.6 seconds.

Here are the results of the performance experiment from the demo. The lighter blue bar is the initial count, and the darker blue bar is the count after exiting the Spark shell and re-running the same experiment. With Alluxio, we got an 8x improvement for repeated accesses, independent of the Spark context used for the data access. If you access the data directly from S3, however, the initial count and any repeated counts after the Spark context has been restarted have the same performance. The gist is that when you have repeated accesses, or data shared across different applications, you will see tremendous performance gains from Alluxio.

To conclude: Alluxio is easy to use in a Mesos environment, and DC/OS brings ease of deployment and management of the application. With Alluxio, you get predictable performance from a controlled caching layer, and improved performance as well. Alluxio easily connects to the different storage systems you may have — HDFS, S3, Ceph, you name it; there are a number of connectors from Alluxio to different storage systems. So go use Alluxio, it's awesome. That's it from my side. Thank you for listening to the talk. You can reach me at adit@alluxio.com with any questions you may have after the conference, and I'm happy to take questions now.

The question was: how long does the cached data stay available? You do have control over how long the data stays in Alluxio; by default, there is no eviction until the memory is full.

The next question was how we handle write workloads. For writes, we have different policies. The default policy is called CACHE_THROUGH, which means writes are cached in Alluxio memory as well as propagated to the storage system backing the path. There are other options, such as a feature called fast durable writes, which means we make sure the writes land in Alluxio, replicate the data within Alluxio, and asynchronously propagate the update down to the storage systems — so when you write to Alluxio, you're guaranteed that your data will not be lost, and we eventually persist it to the backing storage system. Did that answer your question?

Unfortunately, I'm not in a good position to answer that next question since I'm not sure I completely understand it, but we can talk afterwards for a longer conversation. Any more questions? Thank you.