I think we can get started. So the next talk is by... is it Emine? I hope I'm pronouncing your name correctly. I actually go by my middle name, Ugur, but that's fine. Okay, so we'll have Ugur talk about hybrid cloud storage. Thank you very much.

Hi everyone. My name is Ugur. I am a Ph.D. student at Boston University, and I am also doing an internship on the Ceph team in Red Hat's CTO office. Today I will talk about our hybrid cloud storage caching project. We designed and implemented a cache architecture to improve the performance of big data analytics workloads.

In our own data center we don't have full bisection bandwidth, which is the case for many other data centers today, and we have a tremendous amount of data reuse. So we built a cache called D3N for a single data center. The main idea behind D3N is that it caches data on the access side of network bottlenecks. It is a multi-layer cache, as you see in the figure: we store data on the local racks and forward requests to the upper layers using consistent hashing. Unlike other solutions out there today, like Alluxio, our solution is not a single-cluster cache. It is an extension to the existing data lake, and it is designed to be shared among multiple clusters. We put everything into the RGW code, and the code has now been upstreamed by Red Hat engineers.

We ran a number of experiments, and we have a paper about this work. Overall, the implementation imposes minimal overhead and significantly improves the performance of big data analytics workloads. We ran experiments against one of Facebook's MapReduce cluster traces and saw a reduction in back-end traffic and improvements in performance.

Everything so far was for a single data center, but now we want to take this work and extend it to a hybrid cloud use case. The value offered by public cloud services is clear for many workloads. However, many organizations today want to keep their data sets in their private data centers for different reasons; it could be cost or security. For instance, Two Sigma, a financial hedge fund and one of our collaborators, uses spot instances to run its computations and creates compute clusters in multiple regions. However, due to security concerns, they want to keep their data sets in their private data center. And because of today's asymmetric network prices, companies like Two Sigma have a strong incentive to keep storage on premises and cache data locally in the different cloud regions.

So here is our new proposed architecture for the hybrid cloud scenario, and I'm going to go over all of its components. Our previous design cached only read-only and intermediate data sets. However, if we want to cache data on the other side of a wide area network, a write cache is important, so we now deploy one as well. Rather than implementing a replicated durable write cache from scratch, we use the existing Ceph code and stand up a local OSD cluster as a cache layer in each public cloud region, and, as you see, we have read caches collocated with it. The read cache in our design stores data at block granularity, while the write cache stores data at object granularity. Any data coming from a client is first written into the write cache, and once the data has aged for some time, the RADOS Gateway flushes it to the back end.
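To make the consistent-hashing lookup mentioned above concrete, here is a minimal sketch of the idea, assuming a simple MD5-based hash ring. The node names, virtual-node count, and block-naming scheme are all illustrative, not D3N's actual implementation.

```python
import hashlib
from bisect import bisect

class HashRing:
    """A minimal consistent-hashing ring, in the spirit of mapping
    object blocks to cache servers without a central directory."""

    def __init__(self, nodes, vnodes=100):
        self.ring = []                       # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):          # virtual nodes smooth the distribution
                self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def lookup(self, key):
        h = self._hash(key)
        i = bisect(self.ring, (h,)) % len(self.ring)   # first node clockwise
        return self.ring[i][1]

ring = HashRing(["cache-rack1", "cache-rack2", "cache-rack3"])
# Each block of an object hashes independently to a cache server.
print(ring.lookup("bucket/dataset.parquet:block-0"))
```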
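And here is a toy sketch of the write-back behavior just described: writes land in the local cache at object granularity and are flushed to the back end once they have aged. The `AGE_SECONDS` threshold and the `backend.put` interface are made-up placeholders, not the RGW API.

```python
import time

AGE_SECONDS = 300  # illustrative aging threshold

class WriteCache:
    """Toy write-back cache: absorb writes locally, flush aged objects."""

    def __init__(self, backend):
        self.backend = backend
        self.dirty = {}   # object name -> (data, time written)

    def put(self, name, data):
        self.dirty[name] = (data, time.monotonic())   # absorb the write locally

    def flush_aged(self):
        now = time.monotonic()
        for name, (data, written) in list(self.dirty.items()):
            if now - written >= AGE_SECONDS:
                self.backend.put(name, data)          # write back over the WAN
                del self.dirty[name]                  # object is now clean
```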
We also have an inclusive cache model here, where the same object can be present in both the read cache and the write cache. In the write cache we also use erasure coding; we are still exploring its performance and the right level of redundancy in the write cache.

The second change from our previous design: we used to locate objects in the caches with consistent hashing, because it is simple, requires no protocol changes, and was easy to upstream. Now that we have a write cache, however, we need a directory to know where data is stored, and to prevent any data loss that directory must be durable and reliable. In the implementation we use Redis as the directory: the RADOS Gateways talk to Redis to get the location of each object and forward their requests to retrieve those objects. The directory stores information for each block and object (this is just an example); it acts as a database. It also holds other information about the system, such as the cache servers themselves: what their capacity is, what their hit counts are, what their bandwidth is. Another point is that the RGWs are not aware of how the data is indexed in the directory; the directory really acts like a database and simply serves queries. In the future, we want not just our cache servers but other components of the analytics clusters to talk to the directory as well: for example, cluster schedulers like YARN could allocate jobs based on where data is cached, or your Kubernetes or DNS servers could forward requests based on, say, cache server load or bandwidth information. So in the future the directory will provide a lot of information that not just the cache but other components of the entire ecosystem can benefit from.

Next, in a single data center you always have a fixed network topology, and our previous design located the cache servers with a lookup service based on that fixed topology. In the public cloud we cannot take advantage of the topology in the same way, because there is no fixed topology. We will therefore use Kubernetes DNS servers to forward a client's requests to the nearest cache in its region.

Finally, most object stores today speak the S3 protocol, and we want to generalize our cache implementation to support multiple data lakes, not just Ceph. We therefore use S3 for this purpose. This way we can deploy these caches not only in front of Ceph but in front of any other data lake that supports S3, such as Amazon S3 or MinIO.

There has been over half a century of research into caching techniques for multiprocessors, file systems, web caches, and so on. Now, however, we have a huge opportunity with this shared, global directory. We are currently looking at how we can use the directory to do better cache management, because, as I mentioned on an earlier slide, the directory stores a huge amount of information about the data: who is accessing it, what the access sizes are, how frequently it is accessed, which applications are accessing it. We essentially have information about what is going on in the entire system.
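As a concrete illustration of the directory role described above, here is a minimal sketch using the redis-py client. The key layout, field names, and host name are hypothetical; the real schema is whatever the implementation defines.

```python
import redis

# Directory sketch: gateways record where a block lives and look it up,
# while the directory also accumulates access statistics (e.g. hit counts).
r = redis.Redis(host="directory.local", port=6379, decode_responses=True)

def record_block(obj, block, location):
    key = f"block:{obj}:{block}"
    r.hset(key, mapping={"location": location, "size": 4 * 1024 * 1024})

def lookup_block(obj, block):
    key = f"block:{obj}:{block}"
    r.hincrby(key, "hit_count", 1)        # directory tracks hit counts too
    return r.hget(key, "location")        # which cache holds this block?

record_block("bucket/table.parquet", 0, "cache-us-east-1a")
print(lookup_block("bucket/table.parquet", 0))
```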
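To show how little a client needs to change when the cache speaks S3, here is a sketch of pointing a standard S3 client (boto3) at a cache endpoint instead of the data lake itself. The endpoint URL, credentials, bucket, and key are placeholders.

```python
import boto3

# The same client code works against Ceph RGW, MinIO, or AWS S3; only the
# endpoint differs. Here it targets a nearby cache, e.g. resolved via DNS.
s3 = boto3.client(
    "s3",
    endpoint_url="http://cache.region-1.example:8000",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

s3.download_file("datasets", "table.parquet", "/tmp/table.parquet")
```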
The second thing is that object stores are different from, for example, CPU caches. Here we are dealing with large-granularity object accesses: in Ceph, each object is 4 MB, so every read or write moves 4 MB chunks, and these objects are also immutable. These unique features give us the opportunity to explore caching techniques that differ from prior work on CPUs or web caches. We can now run simple heuristics or even machine learning techniques to find common access patterns. For example, we know there are objects that are written many times but never read, and there are hot objects and cold objects; can we detect these patterns? The key idea in the cache management scheme we are working on is to make sure objects spend enough time in the cache that we can learn about each object and each access, and then find the right candidates for eviction. We also want to use the information stored in the directory, taking this history-based approach to predict future accesses. A rough sketch of this idea follows below.

So where are we going now? This design and implementation let us do a lot of interesting research. As I mentioned, the directory now provides a global cache view, and we want to build a platform for other researchers to use the directory and explore different cache management algorithms and policies. We want to build a cache not for a single data lake but for multiple, geo-distributed data lakes. For instance, the Open Storage Network deploys one-petabyte data lakes all around the country, and one of the problems they face today is that each user has to decide which data lake to store their data in. Can we use caching to do this more automatically and place the data on behalf of the user, without the user telling us? Can we place, say, user A's data in data lake A, user B's data in data lake B, and so on? We also want to explore how erasure coding works in the cache layer, which has not been studied much in the literature, so that we do not store data redundantly in the cache. We are also interested in where to run the computation relative to the data. For example, if the data is cached in region 1, can we spin up the cluster in region 1? Or the other way around: if you spin up your clusters, say spot instances, in region N, can we prefetch the data before the computation starts? We are looking at these techniques as well.

Finally, to realistically design all the things I mentioned, we need real system traces. These traces and logs will help us better understand each application, each user, and each job, and with that we can develop better algorithms and improve cache management efficiency. So if anyone listening to this talk has such traces, or would like to talk more about this, I would be happy to chat.

Also, as I mentioned, the first version of the cache, D3N, is upstream, and you can find the details at our link. The hybrid cloud cache implementation is also on GitHub; you can download it and play with it, and we are still updating the repo whenever we have new functionality. We are also currently working on a simulator for these cache management algorithms.
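As a rough illustration of the key idea referenced above, the sketch below protects objects that have not been resident long enough to be judged, then scores eviction candidates from directory-style access history. The `MIN_RESIDENCY` threshold and the scoring formula are invented for illustration; they are not the actual algorithm under development.

```python
import time

MIN_RESIDENCY = 600   # seconds an object must stay before it can be evicted

def eviction_candidate(objects, now=None):
    """objects: dict name -> {'inserted': ts, 'hits': n, 'last_access': ts},
    i.e. the kind of per-object history a directory could provide."""
    now = now if now is not None else time.monotonic()
    candidates = {
        name: meta for name, meta in objects.items()
        if now - meta["inserted"] >= MIN_RESIDENCY     # observed long enough
    }
    if not candidates:
        return None
    # Low hit count and long idle time both make an object a good victim.
    def score(meta):
        idle = now - meta["last_access"]
        return meta["hits"] / (1.0 + idle)
    return min(candidates, key=lambda n: score(candidates[n]))
```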
That code is not available right now, but hopefully it will be on our GitHub repo soon as well. If you want to learn more about our projects, please visit our web pages. I believe that's all; I'm happy to answer any questions you have.

I'm just going through the chats and I don't see any questions, so maybe we can give folks a few minutes. Okay, sure. If any questions come through, we'll just move over to the breakout room. You can stop sharing now. Okay, great. All right, I guess we don't have questions; everything was very clear, that's good. We appreciate your time so much. Thank you, no problem. Great, thanks for listening. Okay, thank you. Bye.