Thank you for coming to my presentation. We're going to talk about distributed compilation tools and trying to implement them with Kubernetes. My name is Diogo Guerra. I'm a computing engineer at CERN on the Kubernetes team, where I'm mainly focused on integrating tools, ranging from monitoring and logging to driver configuration, so that users can launch a cluster with all these features, hooked into our central monitoring, on demand.

So once upon a time, around four years ago, I was a young computing engineer at CERN, and I was just discovering how awesome Kubernetes was, and how we could deploy and scale applications load-balanced by a Service. I used to live with a friend who was a software engineer. One Saturday he was trying to compile his project, complaining that the compilation was taking too much time, and I told him: man, can't you use a distributed compilation tool? Surely there's something available. But he didn't know of anything. And I had this idea, because I knew about these tools and I was fascinated with Kubernetes: wouldn't it be nice to use Kubernetes to scale these worker workloads up and down, and could we make a service out of this?

And so we come to our motivation. CERN has many applications that are managed and contributed to by many teams, small and big. These solutions are sometimes not small, and they really take time to compile. Here you have the two most common examples: ROOT, a tool that scientists use to analyze and visualize the large amounts of data produced by the LHC, and our own read-only file system for software distribution.

This has some problems, because it's problematic for a team to manage this on its own. It's expensive to set up and maintain in terms of manpower, so if you have different teams doing the same thing, you're wasting resources. It's also expensive to run, because if the resources just sit idle, you're losing money. And it's my experience, and I think yours too, that users opt for the easiest solution, and sometimes the easy path isn't good enough, so users don't do exactly what they should.

On the challenges side, small teams have lower budgets. You could say, okay, just buy a big machine and the problem is solved, but maybe that's not possible, and maybe you don't want to buy big machines for all your developers. It might be better to have one distributed system where everyone can compile and share, so everybody takes advantage of the same resources. On the organization side, this needs to integrate easily with what already exists, because people will not change on a whim: it has to plug in with minimal effort and do the same job to be accepted. And there's no standard way offered to do this, so clearly we have a problem here. One more thing: a user who wants to compile their workload will often just use a public machine that happens to have more resources than their own, but that public machine is shared by other users doing different things.
And when you try to run a compilation job there, you starve everybody else of CPU time, and that's not good for the community, because you will have angry users.

So how can we help solve this? We have some criteria we want to meet. We want this to be compatible with what already exists, with minimal effort. We want the user to be able to use it in an easy way, so there's no excuse: users default to whatever is easiest, and if you present them with an easier solution that works just as well, they will take it into account. And we really want to make this a Kubernetes service, because why not?

What follows is basically my experience, and my failures, trying to get these tools to run on Kubernetes.

The first one we tried was distcc, which is probably the best-known tool for distributed compilation of C and C++. The idea is that worker pods do the compilation jobs and we scale them on demand according to the amount of work; the workers would sit behind an Ingress and a Service, and we would be happy with it (there's a sketch of this setup below). But this will not work, and the reason is that distcc is static: before you run a compilation job, you have to define the full list of target nodes you're going to use. If you're scaling up and down, this does not go well, because the Kubernetes Service assigns connections to random workers, so the controller, which is the user's computer, doesn't know which worker has which job, and it becomes a mess. Another big con is that the client controls all the compile jobs, so there's constant back and forth between you and the cluster; but that part is common to these distributed tools.

Next we tried the spin-off of distcc called Icecream. Here we have the same worker deployment scaling up and down behind a Service, but there's also a scheduler, which does auto-discovery of the pods that run the compilation jobs. This works, but with caveats. We found that if we scale down very fast, we get into trouble: because of the way the scheduler discovers worker pods, pods get killed without the scheduler being notified, and when it then tries to schedule work on a dead worker, it fails. With enough failures, the compilation falls back onto the scheduler, and you're basically limited by the capacity of your scheduler. On top of that there's a very big problem: you need to build the worker container with all the dependencies your job is going to use. That goes against one of our criteria, which is that it has to be easy for the users, and we really don't want to manage these dependencies.

So, is there a better way? At this point there was no real stable solution, and our idea wasn't working, so we decided to drop it, until a couple of years later a new user came along and said: look, do you have something? We said, well, we tried, but we didn't find anything suitable. And he said: let's try sccache. So with sccache we went, and here the service works more or less the same way.
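Before going further, here is the sketch promised above: a minimal, hypothetical version of the worker Deployment and Service from the distcc attempt, together with the static client-side host list that makes distcc a poor fit for autoscaling. The image and object names are made up for illustration.

```sh
# Worker pods scaled by a Deployment, exposed through a single Service
# (hypothetical image and names; distccd's default port is 3632).
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: distcc-workers
spec:
  replicas: 4                          # scaled up and down on demand
  selector:
    matchLabels: {app: distcc}
  template:
    metadata:
      labels: {app: distcc}
    spec:
      containers:
      - name: distccd
        image: example/distccd:latest  # hypothetical worker image
        args: ["--daemon", "--no-detach", "--allow", "0.0.0.0/0"]
        ports:
        - containerPort: 3632
---
apiVersion: v1
kind: Service
metadata:
  name: distcc
spec:
  selector: {app: distcc}
  ports:
  - port: 3632
EOF

# The client side is where it breaks: distcc wants a static host list.
# Behind one Service name, every TCP connection may land on a different
# random pod, so the client no longer knows which worker holds which job.
export DISTCC_HOSTS="distcc/8"         # "8 slots" on what looks like one host
make -j8 CC="distcc gcc"
```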
So the difference is that our worker pods actually register with the scheduler, and there's a heartbeat going on: when a worker pod dies, the scheduler knows about it and reassigns the failed jobs to other worker pods. Here we can see a compilation running; somewhere along the way a pod died, two were created in its place, and the compilation job still finished.

There was a real advantage to this: we found out there's no need for dependency management, which removes a big hurdle for us. The container can just be a thin shell around the distributed compiler worker, and that's it; you don't need to bake in dependencies.

But there are also some big cons. The client needs access to the worker nodes: the scheduler tells you which worker to use, but the client always submits the job to the worker directly, and uploads the dependencies to it if they aren't there yet. Also, sccache uses Bubblewrap, a thin wrapper that sandboxes the job and isolates it from the host; this requires user namespaces, and thus a privileged pod. And job limits are not supported on the worker nodes: we found that if we submitted a lot of load, all the compilations would run at once, the workers would run out of RAM, and jobs would simply get killed. That's not good.
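A minimal sketch of what this setup can look like, with hypothetical names and image: a worker Deployment that has to run privileged because of the Bubblewrap sandbox, and the client-side configuration pointing at the scheduler. The config keys follow sccache's distributed-compilation documentation, but treat the details as approximate.

```sh
# Worker deployment: the sccache-dist build server needs user namespaces
# for Bubblewrap, which on Kubernetes means a privileged pod.
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sccache-workers
spec:
  replicas: 4
  selector:
    matchLabels: {app: sccache-worker}
  template:
    metadata:
      labels: {app: sccache-worker}
    spec:
      containers:
      - name: sccache-dist
        image: example/sccache-dist:latest   # hypothetical: a thin shell
                                             # around the worker binary, no
                                             # project dependencies baked in
        args: ["server", "--config", "/etc/sccache/server.conf"]
        securityContext:
          privileged: true                   # forced on us by Bubblewrap
EOF

# Client side (~/.config/sccache/config): the client asks the scheduler for
# a worker, then talks to that worker directly, so it needs network access
# to every worker pod, not just to the scheduler.
cat <<'EOF' > ~/.config/sccache/config
[dist]
scheduler_url = "http://sccache-scheduler.example.com:10600"  # hypothetical
toolchains = []
[dist.auth]
type = "token"
token = "supersecret"   # shared token; placeholder value
EOF
export CC="sccache gcc"   # route C compilations through sccache
```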
So we presented this to the user: look, this works. And the user was happy. But I actually was not happy, and I asked myself: can we go cloud native?

I remembered a talk I'd had with my friend a long time ago, where he explained that many compilation tools can produce what is called a compilation database, basically a registry of all the compilation commands run by the compiler. At this point I was explaining all these ideas to my team during lunch: this works, this doesn't really work because of these problems; and I explained what this compilation database was, and said, wouldn't it be nice if we could just go serverless and compile everything on demand? To which I was told that this was vaporware. I was really triggered by that, and that night I went home thinking there should be something that does what I want, and somewhere in a blog-post comment I saw someone pointing to a project called gg.

The idea here is to implement this on Kubernetes with a deployment of worker pods and a scheduler, where the worker pods register with the scheduler in the same way it works with sccache. The way gg works is that, from your compilation files, it creates a future (what gg calls a thunk) with the inputs and outputs of each compilation step; it bundles these up and sends them to storage, for which we use cloud storage like S3.

In the end, though, I didn't do this with Kubernetes. I wanted to use Knative, and for lack of time I went with something similar: AWS Lambda. So basically what happens is that the client, while compiling, infers all the intermediate representations, as they call it, bundles all these objects and sends them upstream, and then asks the scheduler: give me this binary. The scheduler builds the dependency tree for that object and resolves everything by calling functions on the Lambda service. In the end, what we receive back is just a binary. And there's a really good thing here: we never need to interact with the compilation workers directly; we just send the objects and we get the result.
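A sketch of that workflow, following the gg paper and README (https://github.com/StanfordSNR/gg); the exact flags and environment variables are from memory and may differ, so treat them as assumptions.

```sh
# Inside any make-based project. Storage configuration is an assumption:
# gg reads its object store location from the environment, something like
#   export GG_STORAGE_URI=s3://my-gg-bucket
# with AWS credentials already set up for the Lambda engine.

gg init                    # create the .gg state directory for this build

# "Infer" runs the build once through gg's compiler models and records every
# compilation step as a thunk: a future naming its inputs and outputs.
gg infer make -j$(nproc)

# Force the final target: the thunk dependency tree is resolved by invoking
# one short-lived Lambda function per step, and we only download the binary.
gg force --jobs 100 --engine lambda my-binary   # 'my-binary' is a placeholder
```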
So this fits very well into the cloud-native ideology. It's as fast as, or faster than, what the other tools like distcc and Icecream offer, and it's really flexible and scalable, because we're only limited by the number of pods, or functions, we can deploy. Also, in the same way sccache does, it determines the dependencies of the compilation job by itself, so we don't need to do any dependency management, which is really nice.

The problem, though, is that this is a recent project with little to no activity: there are no users and no development going on, and on top of that the supporting documentation is lacking. You can run their examples, but to do something like backport this to Knative, you have to feel your way around. Also, ROOT, the original project we wanted to accelerate to help our friend, does some validation during the compilation job; because gg gives us a future and not the result we expect, we need to take this into account and add it to the project workflow. So some projects may require extra integration effort.

There's probably no need to say it, but bigger nodes will always be preferable; that's the way the world works, I guess. Distributed compilation is useful when our friend is compiling his application on his four-core laptop, or when a user would otherwise abuse shared infrastructure that should not be used that way, and should be using proper distributed infrastructure to compile the job instead.

So we've seen that legacy distributed compilation tools don't really work with Kubernetes: each offers some things but falls short somewhere else, so there isn't a configuration we could simply use. But there is a paradigm in which distributed tools can take advantage of cloud-native dynamics, like the last example, which exploits them very well; and if we have a Knative cluster with some spare resources, we can use it to create a service for our users.

If you want to have a look, I have a link here to the very simple Helm chart I did for sccache, and I encourage you to check out the source of the gg project and its paper; if you're interested in contributing, hopefully we can do something together on this. Last but not least, I want to dedicate special thanks to all the Capybaras in the slides, there for your entertainment. Sorry. And to Diana, who helped review this presentation, for her tenacity in helping me solve some last-minute issues with it.

That's all I have. Thank you for your attention. I'm not a specialist in compiler internals, but we still have lots of time for questions, if you have some, and I will try to answer them as best I can. Can you use the mic?

Thank you for the talk, it was actually a very interesting talk. My question is: do you think there are cases where users want to do distributed compilation from a Dockerfile? For example, when you create a container image for an application that requires a large compilation. And if so, have you tried BuildKit's distributed build feature?

So, if I understand correctly, you have a Dockerfile and you do the compilation job inside the Docker build. Yes, this would work: you just need network access to wherever the service lives, and you source the environment variables, which is pretty much how it works; you point to the remote with the S3 credentials and it should work (there's a sketch of this below). Oh, thank you. No problem, thank you.

Hey, very nice talk, very interesting. I had one question about incremental builds: were you able to compare someone building locally, say on a weaker machine, and then building a second time, so it's incremental and there's ideally less to compile, against sending it to build in the cloud again? Do you know what I'm trying to ask?

Yeah, I think so. I didn't really test performance-wise, because that's a very lengthy process; this was more an architecture overview, whether it works or not. Regarding caching, if you're talking about gg: all your objects are transpiled into an intermediate representation in the same way, so if they're still in S3 it will reuse them, so you have that covered. And you also have a local cache, so there's caching in multiple places: there's a cache on the client, there's the S3 repository, and, if the build server doesn't die (in the case of an AWS function it's stateless, so it always dies), there's also a local cache on the build server.

Okay, that makes a lot of sense. One small question on top of that: a second build where you make one change, done right after, is it faster in the cloud than locally on a slow machine? I don't have the information to answer that. Hey, that's no problem, very cool, and thank you very much.

Thank you. So I guess there are no more questions; feel free to contact me, and have a good day.
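To make that first answer concrete, here is a minimal, hypothetical sketch of driving the gg-based service from inside a docker build; the image, bucket, and target names are placeholders, and it assumes the gg client is already present in the builder image and that the builder has network access to the object store.

```sh
cat <<'EOF' > Dockerfile
FROM example/builder-with-gg:latest    # hypothetical image with gg installed

# Placeholder credentials and store location, sourced as environment
# variables; in practice pass these as build secrets, not baked-in values.
ENV AWS_ACCESS_KEY_ID=changeme \
    AWS_SECRET_ACCESS_KEY=changeme \
    GG_STORAGE_URI=s3://my-gg-bucket

COPY . /src
WORKDIR /src

# The compilation inside the image build is what gets distributed; the
# builder only needs outbound access to the scheduler and the S3 store.
RUN gg init && gg infer make -j8 && gg force --jobs 100 --engine lambda my-binary
EOF

docker build -t my-app .
```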