So hello everyone, thank you very much for coming to the last session on the last day. Today I'm going to give you a talk about D3N, a project that I have been working on for a long time. This is a joint project between two universities, Boston University and Northeastern. It actually started as a research project, but later on, thanks to Red Hat, who were really interested in the idea, we are now collaborating, and our code has become part of the upstream.

D3N is a caching infrastructure for data centers, and its goal is to speed up the performance of big data analytics. So here we are looking at a typical private data center, an enterprise data center. Data is very important today for many organizations: they want to collect data, analyze it, and have the data add value to their business. Because of that, a lot of organizations build storage repositories called data lakes to store huge amounts of data. Data lakes are typically deployed as object storage — immutable object storage — running on cheap commodity hardware. At the same time, these organizations create a lot of compute clusters, or analytics clusters, to analyze the data sets. It could be one big cluster, or multiple clusters used by different groups within the organization: your performance team has a cluster, your marketing team has a cluster. By analytics clusters I mean big data analytics frameworks like Hadoop and Spark, which run in a distributed manner, and the data sets for these analytics clusters are stored in the data lake — their input data sets, output data sets, sometimes even the intermediate data sets. As you see here, to access these data sets, the compute cluster has to go over the network.

Today we talk about full-bisection-bandwidth, maybe flat, data centers, but in reality most data centers have high oversubscription and organic growth. What I mean by organic growth is that you upgrade some of your switches, or part of the data center, but not the other parts. So in summary, what I'm trying to say is that because of oversubscription and organic growth we see network imbalances, we observe a lot of network congestion, and as a result accessing the data sets stored in the data lake suffers a lot of high latency. That's the assumption we have here for this project.

To tackle this, we first looked at how these applications run: what are the typical characteristics of these running applications? There is a lot of literature, and we also have some data sets of our own from our industrial partner Two Sigma, plus publicly available traces from Facebook for their Hadoop clusters. Among these data sets, what we observe is high data input reuse: the data is repeatedly accessed. The other thing we observe is uneven data popularity: certain data sets are much more frequently accessed than others. We also observe that file popularity changes over time: a file is popular today, maybe for a week, but after that it's not hot anymore.
For these three observations, caching seemed to us the obvious solution. The other thing we observe is that the data sets are usually accessed sequentially and in their entirety — users usually process whole data sets — and prefetching definitely helps to improve performance for that type of access. Based on these observations we came up with the main idea of D3N, and the main idea here is that we cache the data on the access side of the bottlenecks.

So here I'm going to give an overview of the D3N architecture and how it works. In the figure you see a Ceph cluster, which usually runs on hard disks, and we have a hierarchical network topology like a fat tree: a bunch of switches, then racks with top-of-rack switches, and in each rack you see compute nodes — these could be bare metal, VMs, or containers (they're the gray ones at the bottom, I don't know if you can see them).

The first thing we did in our design is strategically place rack-local cache servers. These cache servers are equipped with high-speed SSDs, and we place one per rack. Then we run cache services on top of each cache server; we call these cache services layer one, L1, and they act as a local cache for the entire rack. Any request going to the back-end Ceph cluster from any node has to go through the layer-one cache, so you first check whether the data exists in layer one or not. This way we try to prevent traffic from leaving the top-of-rack switch.

If you have more sharing happening among your racks, then what we did is create more cache services, called layer two: basically we group those cache services into a pool and form a layer, and as you see, layer two becomes one big distributed cache for the whole cluster. Any miss in your local layer-one cache now goes to check the layer-two cache, and this way we try to keep traffic from leaving your cluster. And in your organization's data center, if you have more sharing happening among your clusters, you can create an even higher level — layer three in this case — which is shared among multiple clusters and again prevents accesses from going to the back end. The main design goal of D3N here is to minimize accesses to the back end and maximize storage throughput.

Later on we implemented this in Ceph. Here I'm showing you how the Ceph object store works. We have a Ceph cluster running, and Ceph provides a gateway called the RADOS Gateway (RGW) that lets users access their object store. The clients here are things like your MapReduce, Hive, and Spark jobs. Typically the clients send requests to a load balancer, and the load balancer distributes the requests across RADOS Gateways. RGW provides Swift and S3 APIs, and today most applications support these APIs. What we did is implement the D3N logic inside the RADOS Gateway. RGW actually has three layers: the front end, which provides the RESTful API; the RGW layer itself; and underneath, the librados layer, which speaks the RADOS protocol to the Ceph cluster. We made our modifications to the middle layer, the RGW layer.
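To make that read path concrete, here is a minimal sketch of the lookup flow. All names are hypothetical, a plain modulo hash stands in for the consistent-hash placement I'll describe next, and the back-end fetch is collapsed into one function for brevity — the real RGW code is structured very differently:

```cpp
// Sketch of the D3N multi-level read path: L1 (rack-local), then L2
// (distributed across cache servers), then the back-end object store.
#include <functional>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

using Block = std::string; // stand-in for a cached data block

struct CacheTier {
    std::unordered_map<std::string, Block> store;
    std::optional<Block> get(const std::string& key) {
        auto it = store.find(key);
        if (it == store.end()) return std::nullopt;
        return it->second;
    }
    void put(const std::string& key, Block b) { store[key] = std::move(b); }
};

// What an L1 cache service would do for a GET. In D3N the L2 node itself
// fetches from the back end on a miss; here that is folded into one function.
Block read_block(CacheTier& l1, std::vector<CacheTier>& l2_pool,
                 const std::string& key,
                 const std::function<Block(const std::string&)>& backend_fetch) {
    if (auto hit = l1.get(key)) return *hit;        // L1 hit: stays in-rack
    // L1 miss: hashing picks which L2 server owns this block (toy placement).
    size_t owner = std::hash<std::string>{}(key) % l2_pool.size();
    CacheTier& l2 = l2_pool[owner];
    if (auto hit = l2.get(key)) { l1.put(key, *hit); return *hit; } // L2 hit
    Block b = backend_fetch(key);                   // double miss: back end
    l2.put(key, b);                                 // fill L2, then L1
    l1.put(key, b);
    return b;
}

int main() {
    CacheTier l1;
    std::vector<CacheTier> l2_pool(4);
    auto fetch = [](const std::string& k) { return Block("data:" + k); };
    read_block(l1, l2_pool, "obj-1/block-0", fetch); // miss everywhere
    read_block(l1, l2_pool, "obj-1/block-0", fetch); // now an L1 hit
}
```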
Now I'm going to show you how this architecture changes things. Instead of clients going to a load balancer that then distributes requests to the RADOS Gateways, in our design clients go to the nearest RADOS Gateway, which we call the D3N RGW. How do they do that? They use an Anycast DNS solution: with Anycast they reach a nearby DNS server that we provide, and that DNS server points the client to the closest RADOS Gateway.

In a little more detail: clients forward their requests to the nearby RADOS Gateway, which runs the local layer-one cache. If the data is not there, the layer-one cache runs a consistent-hashing algorithm to locate the object in the second layer and requests it from there. If the object is not in the second layer either, the second layer goes to the back end, brings the object in, and it is returned to the client.

So we made our modifications to the Ceph RADOS Gateway, and we also upstreamed the code. Since RGW provides an S3- and Swift-compatible object interface, D3N automatically provides them as well, so any application that talks S3 or Swift can use the cache. In our current implementation we only built the first two layers, layer one and layer two, because that was what we needed for our initial deployment. As I mentioned previously, we use a consistent-hashing algorithm, and we cache the data across NVMe SSDs. We use NVMe because we are talking about big data sets here: NVMe is performant and at the same time cost-efficient compared to faster options like RAM. Layer one and layer two are logically separate, but they share the same physical hardware: they run on the same cache server and share the same SSD space.

The next thing we looked at is: if layer one and layer two are running on the same machine and using the same cache space, how should we split that space? One easy way — and this is what we implemented — is to split the cache 50/50. However, later on we realized this might not be the best solution, so we came up with an algorithm that observes the access pattern and the network congestion and decides the allocated space per layer. This is important because in your cluster you may have very high rack locality, or you may have a very congested network to the back-end storage — in that case you want to store more data in layer two. It could also be the opposite: you may have a lot of clients on the same rack accessing the same data, or high congestion within the cluster network among your racks — then you want to store more data in layer one. So we propose an adaptive cache size management algorithm: our algorithm measures the reuse-distance histogram and the mean miss latency, and by using these two metrics it finds the optimal cache sizes. The algorithms are in the paper.
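Just to give a flavor of the kind of decision the algorithm makes, here is a toy sketch I put together — the numbers, the histograms, and the cost model are all illustrative assumptions, not the paper's algorithm. A reuse-distance histogram per layer predicts the hit ratio an LRU cache of a given size would get, measured mean miss latencies weight how costly each layer's misses are, and we sweep the split to find the cheapest one:

```cpp
// Toy sketch of adaptive L1/L2 cache partitioning using a reuse-distance
// histogram and mean miss latencies (simplified; the real algorithm is in
// the D3N paper).
#include <cstdio>
#include <vector>

// Requests whose reuse distance fits within `size` blocks would be hits
// in an LRU cache of that size.
double hit_ratio(const std::vector<double>& hist, size_t size) {
    double hits = 0, total = 0;
    for (size_t d = 0; d < hist.size(); ++d) {
        total += hist[d];
        if (d < size) hits += hist[d];
    }
    return total > 0 ? hits / total : 0.0;
}

int main() {
    const size_t capacity = 8;  // SSD blocks shared between L1 and L2
    std::vector<double> l1_hist = {5, 4, 3, 2, 1, 1, 1, 1, 1}; // rack-local reuse
    std::vector<double> l2_hist = {2, 2, 2, 2, 2, 2, 2, 2, 2}; // cross-rack reuse
    const double l1_miss_lat = 2.0, l2_miss_lat = 10.0; // measured mean miss latencies

    size_t best_l1 = 0;
    double best_cost = 1e30;
    for (size_t l1 = 0; l1 <= capacity; ++l1) {
        size_t l2 = capacity - l1;
        // Expected cost: each layer's miss ratio weighted by its miss penalty.
        double cost = (1 - hit_ratio(l1_hist, l1)) * l1_miss_lat +
                      (1 - hit_ratio(l2_hist, l2)) * l2_miss_lat;
        if (cost < best_cost) { best_cost = cost; best_l1 = l1; }
    }
    std::printf("allocate %zu blocks to L1, %zu to L2\n",
                best_l1, capacity - best_l1);
}
```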
I'm not going to go into the details here, but I'm happy to discuss them offline later.

We then ran a lot of evaluations of the cache's performance. We ran microbenchmarks, and we also created a simple numerical model to show the value of a multi-level cache against only a local layer-one cache or only a single distributed layer-two cache. In the microbenchmark results, the D3N implementation is very fast: for read hits we can saturate the maximum speed of the SSDs and, at the same time, the NIC, so read throughput increases five times. With the write-back cache we can saturate the maximum write bandwidth and the performance improves again; the write-through policy has a small overhead, up to 10%. We also looked at the value of the multi-level cache using our model: we ran the model against the publicly available Facebook trace and the Two Sigma trace that we have, and our results show that multi-layer caching provides more throughput than any single-layer solution.

We also looked at the algorithm. The previous experiments — at least the microbenchmarks — were run in a real environment, but we haven't actually implemented the dynamic cache management algorithm yet, so we ran it in simulation. In this experiment we show that the algorithm adapts to changes in the access pattern. On the right-hand graph, the x-axis is the algorithm's runtime — basically time — and the y-axis shows the local layer-one capacity. We actually have ten racks in this experiment, but I'm just showing the results of two racks because otherwise it becomes very complicated. One rack, rack one, makes a lot of aggressive requests, continually requesting different files, and what we see is that until minute 36 it keeps increasing its local layer-one cache capacity. After the 36-minute mark, rack one stops making requests and rack four starts making requests, and we then observe that rack four increases its local layer-one capacity. With this graph we are trying to show that the algorithm adapts when your access pattern changes.

We also show the total runtime completion: we compared the dynamic algorithm with a static allocation of 50% L1 and 50% L2, and also with single-layer configurations — only layer one, and only a distributed layer-two cache — and we show that dynamic allocation reduces the runtime.

We also looked at adaptability to network load changes. In this experiment we put congestion on the network of one of the racks, which slowed down the other racks' requests as well: because that rack has a limited, congested network, the other racks get impacted too. During the congestion window, because accessing layer two becomes expensive for at least one rack, we observe that all the racks start increasing their layer-one capacity.
So that was the goal of this experiment. We also ran the Facebook trace with different locality levels. The Facebook trace doesn't actually have any rack locality, so we generated it synthetically: 100% here means all the requests are rack-local, versus 0% meaning requests are randomly distributed among the racks. For all the cases we observed that dynamic allocation outperforms static allocation.

This is one of the cool experiments we have: it actually ran in our production environment, our real environment. We again replayed the Facebook trace — here we are just running the reads — which has around 850 jobs with 75% reuse, and we are comparing the D3N implementation with the original vanilla RADOS Gateway, which has no caching. We ran it under two network scenarios. In one of them we have high connectivity between our cache servers and the back end, around 40 gigabits per second — and we have two racks here, so the aggregate bandwidth was 80 gigabits. The other scenario, at 12 gigabits, is more realistic. For both experiments we show that D3N improves the performance a lot; and on the other graph you see the traffic going to the back end — as you see here, we decrease the total amount of network traffic going to the back end.

So, to summarize: D3N right now supports a read cache. We also proposed a dynamic cache partitioning algorithm, but we haven't actually implemented that algorithm yet. We also recently implemented a prefetching mechanism — this was done by a group of students. The D3N implementation right now supports read-ahead prefetching, or the user can issue commands saying, basically, 'I want to prefetch this data,' and the data will then be available in the cache. D3N also supports a write cache, but we don't have any redundancy mechanism today; we support write-back and write-through.

On the status of the project: we upstreamed the code — the read-cache code is upstream — and the next step is to upstream the read-ahead prefetching mechanism. For future work, some of which is in progress: we are trying to make sure the write-back cache has redundancy, because we want to tolerate failure scenarios, and we want to look at the cache management policies. Right now the caches basically use LRU, so we want to do smarter caching using machine learning techniques: we want to predict future accesses, to predict access patterns so we can prefetch the data before the client even asks for it. We also want to support multiple back ends — the current design supports only a single back end — and we want to use cache-level events to hint the application. What I mean by that is that the applications are not aware of what is in the cache; if we provide hints to the application, then the application can rearrange its I/O or its jobs.

There are also some limitations of the current design that I want to talk about.
One of them is that right now all decisions are made locally. When a cache server wants to evict something, it has no global view; when it wants to make a request, it just runs the consistent hashing and forwards the request. So we are coupling policy and mechanism here. We are also caching everything at block granularity right now, so we don't have any mapping between the objects and the blocks. And we have a lot of redundancy, because multiple cache servers in layer one can store the same object, and the higher-level layer two can store that object as well — is that redundancy really necessary? We want to try to eliminate it. Also, if you want to do smarter caching, the current implementation is a little bit limiting in that sense, and we are not able to do wide-area caching.

Because of that, we are changing the current architecture a bit — that's what I have been working on over the summer. We are trying to see whether, instead of using consistent hashing, we can use a more flexible solution. Consistent hashing is great: it's easy, it's easy to implement, we upstreamed it, we don't have to change anything. But we want to do somewhat more complex cache management, so we are moving to a directory-based cache solution. What I mean by that is that each cache is going to store a directory entry for each object, with its location and maybe more information — file information, user information about who is accessing it. We also want to make sure the management policies can be tuned depending on whether you are using a wide-area network, a hybrid cloud, or just a single data center. That's one of the directions we are going in right now.

The second one is that we are also working on write-back caching with redundancy. The goal here is to implement a persistence layer in the RADOS Gateway that we can use, and we want to ensure the redundancy happens before we acknowledge the writes: if a failure happens, we want to recover, and we want to make sure we are implementing the right code for the storage service. We had some meetings with the Ceph folks and came up with this idea, which I'm going to share with you today. Instead of going and, you know, reinventing the wheel, we want to use the existing redundant storage system for the write-back cache, which already exists in Ceph. The basic idea here is that we are basically going to run another OSD cluster as the write-back cache. As you see in this figure, read requests are served by the RGW cache; write requests, if they are write-through, are written by the RGW cache straight through to the back-end OSDs. However, if your clients would like to use the write-back mechanism, then the RGW writes the data to the cache OSD cluster. We are planning to implement and run this architecture.
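As a rough sketch of that write path — this design is still being worked out, so everything here, names included, is an assumption rather than shipped code — the policy decision would look something like this: write-through acknowledges after the back-end OSDs have the object, while write-back acknowledges once the redundant cache OSD cluster has it and flushes asynchronously later:

```cpp
// Sketch of the proposed RGW write path with two OSD clusters: a cache
// cluster (redundant write-back staging) and the long-term back end.
#include <iostream>
#include <string>

struct OsdCluster {                 // stand-in for a Ceph OSD pool
    std::string name;
    bool write(const std::string& obj) {
        std::cout << "redundant write of " << obj << " to " << name << "\n";
        return true;                // Ceph replicates/erasure-codes internally
    }
};

enum class WritePolicy { WriteThrough, WriteBack };

// What the RGW cache would do for a PUT under each policy.
bool handle_put(WritePolicy p, OsdCluster& cache_osds,
                OsdCluster& backend_osds, const std::string& obj) {
    if (p == WritePolicy::WriteThrough)
        return backend_osds.write(obj); // ack only after the back end has it
    // Write-back: ack once the redundant cache cluster has it; flushing
    // to the back end would happen asynchronously later.
    return cache_osds.write(obj);
}

int main() {
    OsdCluster cache{"cache OSD cluster"}, backend{"back-end OSD cluster"};
    handle_put(WritePolicy::WriteBack, cache, backend, "obj-1");
    handle_put(WritePolicy::WriteThrough, cache, backend, "obj-2");
}
```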
Of course, current Ceph has a lot of perhaps-unnecessary machinery that we may not need in this cache design, so we are planning to tune that cache OSD cluster and make sure it's performant. You know, we are not sure whether it's going to work or not right now, but we are going to find out. So again, we will have two OSD clusters — one caching the data for write-back, and one for the back end, which stores the persistent data for long-term storage — and the RGW cache in between provides the mechanism. Once we have this write-back caching, I believe we can build on top of it: this is going to be the core, and then we're going to build our own research on top of it.

So that's all. We have web pages for the project — you can find them either at the Mass Open Cloud or on the Collaboratory page — and we have the GitHub repos for the code and for the simulator we have for dynamic cache management. If you have any questions, feel free to shoot me an email.

Q: I wanted to ask about the application hinting. You mentioned that the cache could send a hint to the application. Have you thought about it the other way around, where the application could provide hints about the data, making the caching more efficient?

A: Yes, we think about both — you're right. So the question is: the cache hints the application; did we think about the other way around? Yes, we did; that's why we implemented the prefetching mechanisms. In prefetching we specifically provide two ways. One of them is that the client can say, 'OK, I want to prefetch this data because I'm going to access it,' and that way we can bring the data in beforehand. The other — the way the read-ahead mechanism works right now — is that you access one block, and then the next blocks get prefetched. But what we also want is that if the applications tell us, 'OK, I'm going to read this one in the near future,' we can prefetch those as well. So we have thought about it; I think there will be a research direction going that way as well. And that's very important for applications too, you're right.

Q: I'm pretty new to these caching and prefetching terms — I sometimes use them interchangeably. Can you explain the difference?

A: Sure. Caching improves the performance, but not on the first access: when you request a file for the first time, it won't exist in memory or in the cache, so it's going to be a miss — you have to go back to the original place, read it, bring it back, and store it in the cache. The second access, however, is going to be a hit in the cache, which means you don't have to go to the back end; but the first one is going to be a miss. What does prefetching do? Before you even access the data — before you issue the I/O — the cache goes to the back end, brings that data in, and stores it, so your first request is also served by the cache. That's the difference.
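A tiny made-up example of that difference: with caching alone, the first access to a block is always a miss, while read-ahead prefetching pulls in block N+1 when block N is read, so the next sequential request hits on its very first access:

```cpp
// Caching vs. read-ahead prefetching in miniature.
#include <iostream>
#include <set>

struct Cache {
    std::set<int> blocks;
    bool read(int blk) {
        bool hit = blocks.count(blk) > 0;
        std::cout << "block " << blk
                  << (hit ? ": hit\n" : ": miss, fetch from back end\n");
        blocks.insert(blk);     // fill the cache on a miss
        blocks.insert(blk + 1); // read-ahead: prefetch the next block too
        return hit;
    }
};

int main() {
    Cache c;
    c.read(0); // first-ever access: a miss (caching alone can't help here)
    c.read(1); // hit on first access, thanks to the read-ahead prefetch
    c.read(0); // hit, thanks to plain caching of block 0
}
```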
Q: I think you mentioned that you were using an LRU cache algorithm. Can you talk about the eviction policy? Is there a limit on the cache?

A: Right now L1 and L2 share one big LRU, and in the implementation we didn't prioritize L1 requests over L2 requests — that's something we want to look at. We are using LRU because it was simple to implement first; we wanted to make sure the system works. But what we actually want to do is, for example, use some machine learning techniques instead of LRU, or FIFO, or LFU, whatever. Is this going to give us any benefit? I don't know if it's going to be performant enough, I don't know if it's going to give us a benefit, but we want to look at that.

As I said, right now each cache makes its own independent decision: one cache server has one LRU, a second cache server has its own LRU, and when we evict something we don't know anything about the other cache servers. We have one LRU queue, and whatever is least recently accessed, we just evict it — and again, as I said, there's no prioritization. I think this is also important: do we have to treat each object the same, or should we prioritize some objects? Also, for read and write requests we don't have any prioritization right now — should we maybe not evict from the write cache? Those are the kinds of things we have to look at, and this is where the research part of this project comes in, I think.
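To make the "independent decisions" point concrete, here is a sketch — my own names, not the D3N code — of two cache servers each running a private LRU: eviction is a purely local call, no peer is consulted, and nothing stops both servers from holding the same object, which is the redundancy mentioned earlier:

```cpp
// Two cache servers, each with its own LRU and no global view.
#include <iostream>
#include <list>
#include <string>
#include <unordered_map>

class LruCache {
    size_t cap_;
    std::list<std::string> order_;  // front = most recently used
    std::unordered_map<std::string, std::list<std::string>::iterator> pos_;
public:
    explicit LruCache(size_t cap) : cap_(cap) {}
    void access(const std::string& key) {
        if (auto it = pos_.find(key); it != pos_.end())
            order_.erase(it->second);   // refresh recency on a repeat access
        order_.push_front(key);
        pos_[key] = order_.begin();
        if (order_.size() > cap_) {     // evict the least-recently-used entry:
            std::cout << "evict " << order_.back()
                      << " (local decision, peers not consulted)\n";
            pos_.erase(order_.back());
            order_.pop_back();
        }
    }
};

int main() {
    LruCache server_a(2), server_b(2);  // two cache servers, two separate LRUs
    server_a.access("obj-1");
    server_a.access("obj-2");
    server_a.access("obj-3");           // evicts obj-1 from server_a only
    server_b.access("obj-1");           // may duplicate what a peer holds
}
```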