Hello, everyone. Welcome to our QCon talk. Today we'll be talking about how we've been working at CERN on speeding up analysis pipelines using remote container images. My name is Ricardo Rocha. I'm a computing engineer in the CERN Cloud team, focusing on containers.

I'll start by giving a very brief introduction to CERN and what we do here. CERN is the European Organization for Nuclear Research. We focus on fundamental research, trying to answer big questions about the origins of the universe and how matter is constituted. To answer these questions, we basically build very large machines. The largest machine we have today is the Large Hadron Collider, shown here on the map. It's a particle accelerator 27 kilometers in circumference, where we inject two proton beams, one clockwise and the other counterclockwise, and accelerate them to very close to the speed of light, to very high energies. We then make them collide at very specific points, where we have built massive detectors that give us a sneak peek into the collisions. To get an idea of the size, you have the Geneva Airport on the bottom right. CERN, or rather the accelerator, straddles the Swiss-French border, close to the Alps.

If we look a bit closer at what the actual accelerator looks like, it sits 100 meters underground in a tunnel. Here we see the magnets, with the beams circulating in the middle; to achieve these high energies we have to cool them down to very close to absolute zero. Then we have the big experiments I mentioned, in large caverns, also underground. This is the Compact Muon Solenoid (CMS). The cavern is 40 meters by 40 meters and is completely filled with detector. The detector weighs 14,000 tonnes, and the person in the picture gives you an idea of the scale. It acts like a gigantic camera taking 40 million pictures a second, and the result is a very large amount of data that we have to store and eventually analyze. This analysis has many steps, but the final step is the end-user analysis, where we generate plots like the one we see here, showing the peak that gave us the Higgs discovery back in 2012 and led to the Nobel Prize in 2013.

To process all this data we need a large amount of computing resources. We have our own on-premises data center, which gives us something like 300,000 cores, but we actually need more capacity. Over the last 20 years or so we built a very large grid computing infrastructure connecting more than 200 sites around the world, which acts like a giant supercomputer for our physicists. At any moment you will see something like 400,000 jobs running in this infrastructure, and in total we have more than double our on-premises capacity, very close to one million cores these days. This is a crucial part of our system for analyzing the data, and it shows how important it is to optimize software distribution, which is why we are giving this talk today: to describe how we used to do it and how we are now doing it with containerized infrastructures.

If we look at what we used to do, and actually still do, the main way of distributing software on the grid is a system called CernVM-FS (CVMFS).
It's a very scalable system for distributing software across the grid. It acts like a hierarchical read-only file system, where CERN is what we call the stratum zero, the top of the hierarchy, where all the software gets pushed: the experiments do their releases and push the software here. Then at each site we run caches that are exposed to the users as read-only POSIX file systems in user space, and we do very aggressive caching at these sites, both to optimize the network usage towards them and to speed up job start-up, making sure that only the data that is actually needed is pushed to the different sites, on request. This has been a very successful system and we use it intensively.
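To make the site-cache model concrete, a worker node is typically configured as a CVMFS client along the following lines. This is a minimal sketch: the repository name is a real ATLAS one, but the proxy URL and the cache quota are placeholders.

```bash
# Point the CVMFS client at a site HTTP proxy and give it a local
# cache quota (values here are illustrative).
sudo tee /etc/cvmfs/default.local <<'EOF'
CVMFS_REPOSITORIES=atlas.cern.ch
CVMFS_HTTP_PROXY="http://squid.example-site.org:3128"
CVMFS_QUOTA_LIMIT=20000   # local cache size, in MB
EOF

sudo cvmfs_config setup            # sets up the autofs mounts under /cvmfs
cvmfs_config probe atlas.cern.ch   # verifies the repository is reachable

# The release area now appears as a read-only user-space file system;
# only the files a job actually opens are fetched and cached locally.
ls /cvmfs/atlas.cern.ch
```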
When we started looking at containers, it was very important to make sure we achieved the same efficiency. This was the big question when people started containerizing their workloads: how could we rely on something similar to do the same for containers? A couple of more detailed questions come with this. With software packaged in container images, it's pretty important to speed up container creation: to start a container you need to pull the image, and considering that some of our users have images of several gigabytes, or even tens of gigabytes, this can take quite a while, especially on large clusters where you might need to pull these images onto many different instances at the same time. This not only slows down job start-up but also puts a lot of load on our systems. So how can we reduce and optimize network usage? We already knew how to do this with caching in CVMFS. And then there is cluster autoscaling, which is something we try to exploit as much as possible: if we are handling huge images and constantly dropping and creating new nodes, we end up in a cycle of pulling the same images over and over, because the nodes are always fresh. All of this is quite important, and it is what triggered the work we're describing today.

There is some history behind what we will present here. Back in 2016, at the FAST conference, a system called Slacker was presented that allowed fast distribution of Docker containers and introduced the idea of lazy loading of Docker container images. The CVMFS team took this idea and implemented the Docker CVMFS graph driver. This worked very well while we were using Docker, but when the Docker components started being split up we couldn't use the graph driver any longer, and we had to look at the new runtimes that were appearing. With this I will pass to Spiros, who will focus on describing how lazy pulling works.

Thank you, Ricardo. I think I managed to share my screen. I will be talking now about lazy pulling, building on the history that Ricardo already mentioned. There is some ongoing work: the CVMFS developers have already started implementing a containerd remote snapshotter based on CVMFS, and this is still work in progress. In this presentation we will focus on another implementation, started by the containerd authors, based on stargz, and we will also do a demo with a distributed hierarchy of container registries. So what is a remote snapshotter, and what is stargz?

Back in 2019, a group of the containerd authors held some brainstorming sessions on implementing remote snapshotting, and they came up with a gRPC-based API for pluggable remote snapshotters. The first implementation came from NTT, by Kohei Tokunaga and Akihiro Suda, and was based on stargz. Stargz stands for seekable tar.gz, and it extends the properties of the tarballs that make up container images. A very interesting property of tarballs is that if you concatenate many of them one after the other, the result is still a valid tarball. Based on this idea, some developers at Google, aiming to improve the performance of the Go build system, proposed CRFS and the stargz format.

What this snapshotter does is index all the files in all the layers, creating something similar to a manifest of the container image that records where each file lives in every layer. It then does a FUSE mount for every layer, from the container registry to the host, leveraging range queries. As I mentioned, it's a gRPC plugin configured on every node, and containerd communicates with it via a socket. This slide is a visual representation of it: on the left side we have the container registry, on the right side a node that pulls images. On the left we can see the hierarchy, where under the v2 registry path every layer is stored as a blob; on the right we see containerd communicating with the snapshotter, the snapshotter creating a FUSE mount for every layer, and then an overlay file system providing the rootfs where the container starts.

To demonstrate how this works, we ran some experiments with a very big image produced by the ATLAS experiment, called Athena. This Athena image is a full release, 17.2 gigabytes uncompressed and 5.4 gigabytes compressed, and below it you can see the image optimized for stargz, which wraps every file in its own tarball. As you can see, the size increases a bit, because of the additional header for every one of those tarballs.

For this experiment we ran a simple workload analyzing one LHC event. In the first row we have the native execution, with the standard containerd overlayfs implementation, and we can see the pulling time is around 3.5 minutes, while with stargz it is a flat 15 to 17 seconds. More importantly, looking at network ingress, with the native containerd implementation we pull almost 6 gigabytes of data, while with stargz we pull exactly the files we need, less than 1 gigabyte. The benefit of this snapshotter implementation is that while we lazy-pull the files at the time we need them, we don't lose a lot of performance: the execution time of the workload with stargz is just 1 minute slower than the native one. So if you add in the pulling time, executing with stargz is faster overall. To summarize: very fast start-up times and low network traffic; the memory consumption of the snapshotter is a little concerning, but something we can investigate; and the drawback is that building the optimized image takes a lot of time, in this case 45 minutes.

To demonstrate this more in practice, I will do a quick demo. Here I have one container that runs containerd, and then an auxiliary CentOS container.
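Under the hood, the containerd host in a setup like this is wired up roughly as follows. This is a sketch based on the stargz-snapshotter project's documented setup: the socket path and proxy-plugin stanza match its defaults, the systemd unit name assumes the service file shipped with the project, and the image reference is a placeholder.

```bash
# Register the stargz snapshotter as a containerd proxy plugin.
sudo tee -a /etc/containerd/config.toml <<'EOF'
[proxy_plugins]
  [proxy_plugins.stargz]
    type = "snapshot"
    address = "/run/containerd-stargz-grpc/containerd-stargz-grpc.sock"
EOF

# Start the snapshotter daemon and restart containerd.
sudo systemctl start stargz-snapshotter
sudo systemctl restart containerd

# "rpull" fetches only the manifest and the per-layer file index up
# front; file contents are retrieved lazily, via HTTP range requests,
# when the container opens them.
sudo ctr-remote image rpull registry.example.org/atlas/athena:esgz
sudo ctr-remote run --rm --snapshotter=stargz \
    registry.example.org/atlas/athena:esgz demo /bin/bash
```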
So we will look at an example image that I have. This is a Dockerfile based on Python 3.9 that adds a simple hello.py file which just prints hello. I could go and build it, but I have it already built, and then I can push it to the registry. What I will show you on the other screen is that I can optimize it and then try to lazy-pull it. In this case I have already optimized it, so now I will just pull the image, but not the full image, only the FUSE mounts that I described. Here you can see that it downloads the manifest of the image and the index of all the files in a flat time of 6 seconds, sometimes it's 5, but much faster than for a normal image. Then I will also go ahead and download the massive image that I described, and you will see that the pulling time is just a little longer than for a simple Python image. While it's downloading, I will also show you that I have a container monitor running to track containerd's traffic: here you can see that the I/O of the container running containerd was only 30 megabytes, and here you can see that in 16 seconds it manages to download the image. Finally, I would like to show you that this is the original image built by ATLAS, with 14 layers, and this is the optimized image that I created, again comprised of 14 layers, but now you can see that the digests of the layers are different.
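The demo image and the optimization step can be sketched as follows. The registry host and tags are placeholders, and the optimize invocation follows the ctr-remote CLI shipped with the stargz snapshotter around the time of the talk:

```bash
# A trivial image: python:3.9 plus a hello.py that prints hello.
cat > Dockerfile <<'EOF'
FROM python:3.9
COPY hello.py /hello.py
CMD ["python", "/hello.py"]
EOF
echo 'print("hello")' > hello.py

docker build -t registry.example.org/demo/hello:latest .
docker push registry.example.org/demo/hello:latest

# Convert to eStargz; by default the optimizer runs the container once
# and records which files are accessed, so they can be sorted to the
# front of each layer.
sudo ctr-remote image optimize --oci \
    registry.example.org/demo/hello:latest \
    registry.example.org/demo/hello:esgz
sudo ctr-remote image push registry.example.org/demo/hello:esgz
```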
Now back to Ricardo, to talk about the demo I mentioned with hierarchical registries.

Okay, thanks Spiros. I'll just hide my bar here. So Spiros explained, with a couple of images, all the reasons why we are doing this lazy pulling. Can you hear me? What we'll try to explain now is how we are deploying this in our infrastructure. Coming back to the initial slide I showed about how we do software distribution today, we rely on CernVM-FS. This is a system that we are very happy with; it works very well. So one option is to rely on the implementation of a remote snapshotter that builds on this system. This is something that is happening, as we mentioned earlier, and something that can work very well. What we will also try to demo today is something more generic, with distributed registries: instead of relying on the file system and its HTTP caches, relying on the container registries themselves. Which implementation to choose is up to you; in the demo today I will be using Harbor, which is also something we deploy here at CERN. The way it works is very similar to the model we saw earlier, with CERN at the top of the hierarchy, where we push the images; then at each site we can run another registry, which can be configured as a proxy cache, so that when people pull an image it gets cached on the first pull and the second pull is much faster, or configured with replication, using some pattern to decide which images and tags should be pushed along. It's very important that the registry has proper stargz support, for the performance reasons we showed. And one benefit of using this approach is that any OCI artifact can be pushed to an OCI registry: not only Docker images, but also Helm charts or ML artifacts containing model data or weights.

So in the picture on the right you see what we'll try to do in the demo. We have CERN with a registry, and then two regions deployed, in this case on Google Cloud: a cluster running in us-central1-c and a cluster running in the Netherlands, so two clusters on different continents. Each one has its own registry: this is the Harbor registry running in US Central, and this is the Harbor registry in the Netherlands. And each one has a cluster running on its own nodes that will run some user analysis, using its local registry, which in turn can point to the top.

I will now jump to the demo. Here is the deployment of my two Harbor instances: the one on the left is the one we have in the Netherlands, this one, and the one on the right is the one in the US Central region. I'll just browse through; you can see they have exactly the same configuration, and this is the way we deploy so that all the sites feel much the same. You can then have multiple projects. In this case I have a CERN cache, which is a proxy cache linked to the CERN registry, so whenever you pull from this prefix it pulls the image through the CERN cache, and then you have docker.io, just for convenience, as another cache. There is also a project configured as a replicated instance: we have configured it to replicate everything under this prefix at CERN, including the Athena image. The way this works for the proxy caches is that you define the two upstream registries and Harbor tracks their health; for the replicated one you just define replication rules, and you can see here that we had a successful replication, while a previous one was not successful. You can trigger this manually, or schedule it, say every hour or every couple of minutes. That's the way we structure it.

After this overview I will try to submit a workload. The workload is very similar to what our users do, except that in this case I will be using Argo Workflows, which is a fairly new tool that most users are not using yet. What this will do is submit the same workload to the two different clusters, but on the left one, in the Netherlands, I'm using eStargz images. You can see here the workload is already starting. I'm doing the same in US Central, but using a normal, non-eStargz image, and you can see that on the left we have already started the workflow, while on the right it is still on the first step, preparing it, because it has to download the image: every step downloads the same Athena image, and that takes quite a bit longer. We can see that in the Netherlands our workflow is already going pretty fast and executing steps. What we do here is parallelize the job into 20 parallel jobs, and each job has three steps: one for staging in, one for processing, and eventually staging out when the processing stops. While we see this eStargz job going really fast, with some of the jobs even already pushing data out, the one in US Central, using the normal images, is just starting to launch its jobs, so it's way behind. And we can see that in just a couple of seconds or so we'll have this workflow completed. This is kind of critical for us: you can see the benefit when you start parallelizing the job, and especially, as I mentioned, with cluster autoscaling enabled, where nodes might come and go, it's really important that you don't have to pull the full image every time if your job is just using a small fraction of it.
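A heavily simplified sketch of such a fan-out workflow is shown below. It is not the actual ATLAS pipeline: the image reference and the stage-in/process/stage-out commands are placeholders, the three stages are collapsed into a single container for brevity, and cluster selection is assumed to go through kubeconfig contexts:

```bash
# Minimal Argo workflow fanning out 20 parallel jobs, all from the
# same large image.
cat > analysis.yaml <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: athena-analysis-
spec:
  entrypoint: fanout
  templates:
  - name: fanout
    steps:
    - - name: job
        template: analyze
        withSequence:
          count: "20"            # 20 parallel branches
  - name: analyze
    container:
      image: registry.example.org/cern-cache/atlas/athena:esgz
      # stage-in, process, and stage-out collapsed into one step
      command: [sh, -c, "stage-in && process-event && stage-out"]
EOF

# Same manifest, two clusters: only the kubeconfig context differs.
argo submit --context netherlands analysis.yaml   # eStargz nodes
argo submit --context us-central analysis.yaml    # standard nodes
```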
That is exactly the case here: the image is 18 gigabytes, but each job pulls only a very small fraction of it. We can see that US Central, the non-eStargz one, is still behind but moving, while this one is almost finished, so if we leave it a couple more seconds we might even be able to see it finish. I'll just give it a couple of seconds. One thing I also want to show you is that in this case every node pulls the image once, so a lot of stress is put on the storage backing the registry. In our case the Harbor instance is backed by a GCS bucket, so we are actually putting load on the GCS bucket. Our optimized eStargz-based workflow has already finished, while this one is about halfway, so it's a significant advantage once you start scaling out. What I want to show you now is the traffic I mentioned. Again, on the left you have the Netherlands, the europe-west4 region. I'll just refresh the data, and hopefully we'll see some numbers about the load being put on the GCS bucket. There we go. In this case we see the bucket in US Central loading here, and we can see the network traffic sent, because it's serving the data, and here the equivalent for the Netherlands region. You can see that this one peaked at several tens of megabytes per second, something like 80 megabytes per second, while this one hardly has a peak, because we basically downloaded very little data: for this end-user analysis it was a very, very small fraction of the image. So that's it for the demo, and I will pass back to Spiros for the rest of the talk.

Thanks, Ricardo. So, on the current status: although this is in its first stages, it is very much functional. We can achieve super fast container start-up times, we can dramatically reduce the network usage, as Ricardo showed, and also the cost, in case we use public cloud. And as we saw in the Athena example, the CPU overhead is very small, so overall there is a big gain both CPU-wise and network-wise. One limitation we found, with the GitHub registry that we used before we started leveraging Harbor, is that stargz has a strong requirement that the registry support HTTP range requests, and in that case they were not supported (a quick way to test for this is sketched below). Improvements that we would like to see: speeding up the image optimization, which in the case of Athena took 45 minutes and for other images might take even longer. Additionally, it would be very important for us to be able to create optimized images based on already optimized images, so that only the additional layers get optimized into stargz. And finally, we would like to be able to optimize images together with some existing data: instead of cramming data inside the image, maybe we could just mount it, and if we have a sample workload we can optimize the image against that sample workload and end up with an image that is very well prepared for the actual workloads. Some small issues we found: containerd doesn't gracefully fall back to standard snapshotting if the remote snapshotter is down, whichever snapshotter that is. And we would like to do some further investigation of our Harbor configuration, because we hit some limitations with large layers where the snapshotter was not behaving the way we would like.
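That range-request requirement can be tested against a given registry by asking for a byte range of a layer blob and checking for a 206 Partial Content response. A minimal sketch, where the host, repository, token endpoint, and digest are all placeholders and real registries may differ in how they issue tokens:

```bash
# Placeholder values; substitute your registry, repository, and a
# real layer digest taken from the image manifest.
REG=registry.example.org
REPO=atlas/athena
DIGEST=sha256:0000000000000000000000000000000000000000000000000000000000000000

# Many registries require a bearer token even for anonymous pulls.
TOKEN=$(curl -s "https://$REG/service/token?service=harbor-registry&scope=repository:$REPO:pull" | jq -r .token)

# Ask for the first kilobyte of the blob. Lazy pulling needs the
# registry to answer "206" here; a "200" means it only serves whole
# blobs and stargz cannot seek into layers.
curl -s -o /dev/null -w '%{http_code}\n' \
  -H "Authorization: Bearer $TOKEN" \
  -H "Range: bytes=0-1023" \
  "https://$REG/v2/$REPO/blobs/$DIGEST"
```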
We would also like to thank the team from NTT, Akihiro and Kohei, the CVMFS team, and all the participants who made this possible. Ricardo, do you want to say some closing remarks as well?

No, just, again, thanks everyone for watching, and thanks to everyone who has been working on this. This is one of the key points that will help us make the best use of containers also on the grid, not just at the local sites. I would also highlight that this is the work of a lot of people: we had a workshop here at CERN in May last year, with people from different companies around the world, that kind of triggered a lot of this. We look forward to continuing to improve the system. Thank you very much.