We have over 1,800 executor slaves, each of which might run ten to fifty executors, spread across more than thirty Jenkins instances. We have over four and a half thousand configured jobs; some run frequently, some are rather dormant. To take another angle on how much we are doing: for those six hundred developers we spin up between 15,000 and 18,000 virtual machine instances every day. These are test targets or build instances, so we create an instance, compile stuff on it, and then trash the instance.

With those numbers we are facing more and more challenges with Jenkins, even though we love it and utilize it heavily. We are seeing issues with stability under the number of concurrent jobs and builds that we execute. Another part of this is that with 600 developers, everyone has their favorite plugins that they want running on the instance, and some of these plugins are not on the same level of quality as the rest of Jenkins. Unfortunately this causes stability issues: we get downtime, so we need to take instances into maintenance and reboot them. For us, basing so much on continuous integration, that equals waiting and lost effort: when our continuous integration is not up and running, no work can go into masters and trunks, because it cannot be verified.

Another aspect is that with Jenkins, everything is tracked by the master. So when we have to do maintenance on the continuous integration instances, we might lose state and progress for jobs that run for a longer time. We have, for example, some 12-hour performance testing jobs, and whenever maintenance comes up it's always: do I want to do it now, do I want to lose the progress of that job, and so forth.

So this is the context in which I started to look at continuous integration, and this is how this CI came to be. The first goal I set for this CI was that it needs to be always available. No matter what happens in the system, whether we lose nodes or take nodes down for maintenance, the system should always be available, always able to accept new work and make new progress. This is important so that our continuous integration efforts would never stop.

The second important goal is fault tolerance and isolation of failures. No matter what happens to the system, the impact should be minimized, kept as small as possible. In those terms, if for example a slave dies, we should only lose the effort done on that slave; if a master dies, the slaves should still hold their state and should be able to contribute it back to the build.

The next goal is fault recovery. When something happens, say we lose a slave that is executing a job, I would really love to recover from it automatically by restarting the entire task. Then, from a developer's perspective, I could always rely on the guarantee that when I start a build, it will eventually run to the end and return its results.

Last but not least, I set the goal that it should be really scalable. I said that we have about 30 instances of Jenkins; that is a lot of stuff to maintain, and I would love to bring it down to just a handful.
That means we need to support thousands of configured jobs, and perhaps hundreds or even thousands of concurrent builds running on the system.

There are two key concepts in this CI that make it work towards those goals. The first is a reversal of control flow compared to, say, Jenkins. I wanted to eliminate all active masters from the system, simply so that we don't have a master we could lose. Rather than that model, in this CI all progress and state transitions are made either by the calling clients or by the workers as they perform tasks. All work is done in a pull fashion rather than a push fashion: there is no active process that tracks the state; the responsibility for updating the state is left to the workers, the slaves that perform the actual tasks. And since there is no master scheduler in this CI, I distribute the work by posting tickets. A ticket is an item of work that remains posted until it is successfully completed by one of the workers. The "master", if you want to put it in those terms, merely provides views and hooks for manipulating the global state of the repository.

The other key concept came from looking at the issues with our continuous integration at F-Secure: the plug-in model is really important, but it is really difficult to get right. I looked at builds and figured out that a build can be chunked into a set of specific tasks, and I realized that that's the model I want to run with in this CI. So rather than having a pluggable architecture in terms of libraries and implementations, I define workers that do just one task and do it well; to get the rich functionality of different types of tasks, I chain these workers together. The benefit is that the workers become really independent: if there are faults in the implementation of a worker, the impact is limited to that worker alone. It's fairly easy to introduce new functionality by just introducing new workers and adding them to the chain of each build. And one good benefit of this model, going back to having no active scheduling, is that we don't need to pre-register these workers, so we can bring them up and down based on the real load on the system, based for example on the length of the queues or on CPU burn.

To give an example of what this worker concept means, a very typical build sequence is: I want to check something out from a git repository; that's one worker. I want to run a make script or build script that actually builds the software; that's another worker in this context. And then there's a last worker for digging out the artifacts that we want to publish and store for later use.
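To make that chain concrete, here is a sketch of what such a job configuration could look like when posted to the front end. The endpoint, field names, and task-type names are my illustrative guesses, not the project's actual schema; the talk only establishes that job configuration is JSON and that tasks map to worker types.

```python
import requests

# Hypothetical job configuration: one chain of single-purpose tasks.
# All field names, task types, and the endpoint are illustrative guesses.
job_config = {
    "name": "my-service",
    "tasks": [
        {"type": "git-checkout",        # worker 1: fetch the sources
         "params": {"url": "git://example.com/my-service.git"}},
        {"type": "execute-shell",       # worker 2: run the build script
         "params": {"command": "make all test"}},
        {"type": "collect-artifacts",   # worker 3: publish build outputs
         "params": {"paths": ["dist/*.tar.gz"]}},
    ],
}

resp = requests.post("http://frontend.example.com/jobs", json=job_config)
resp.raise_for_status()
print("configured:", resp.json())
```

Each entry in the chain becomes a ticket that some matching worker will eventually pick up and complete.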
Then, moving on to the implementation. My implementation of this CI was actually made really easy by two open source components. I use Ceph as the storage system for maintaining the persistent data: the build artifacts, the job configuration, and the build state passed between the different workers. Ceph comes out of the box with support for availability and fault tolerance, so it's distributed in that sense, and I don't have to reinvent any of that in this CI; I just need to make sure that, through my front end, the data is stored successfully on Ceph.

The other technology I use in this CI is ZooKeeper, a distributed coordination service from the Apache project, and again, availability and distribution are already built in. I use ZooKeeper to provide distributed locks among the individual nodes of the system whenever there's a possibility of resource contention. This could be changes to job configuration, or the assignment of build numbers (trigger a build and get a build number), things like that where I need to synchronize globally within the system. If you're not familiar with Ceph and ZooKeeper, I really recommend checking them out; they are absolutely fantastic projects.

Lastly, to bring all of this together, I have a relatively simple Python-based front end that provides REST/JSON API hooks for accessing the repository, that is, the Ceph cluster and the ZooKeeper data, and for manipulating and synchronizing it where needed. These front ends do not hold any state and they do not do any active processing; whatever manipulation is done to the repository and the global state happens during a single web request. This makes the front end easy to distribute and make highly available, and the failure of one node has no impact on the others.

Then there are the workers, and as I mentioned, workers are extremely simple. They do only one task at a time. They are just smart enough to call the front end, see if there are pending tasks of a certain type, start executing one, and retry if there are failures with the front end. So if the front end goes down, or if a connection to any of the storage dies, we simply don't progress to the next state; we keep retrying until we succeed.
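As a minimal sketch of that worker loop, assuming a hypothetical /tasks endpoint on the front end (the real API is not shown in the talk), the shape is: pull, execute, report, retry.

```python
import time
import requests

FRONTEND = "http://frontend.example.com"   # hypothetical front-end URL

def execute(task):
    # The task-type-specific work (checkout, shell, artifacts...) goes here.
    return {"status": "success"}

def run_worker(task_type, node_label):
    """Pull tasks from the front end; nothing is ever pushed to us."""
    while True:
        try:
            # Ask for (and claim) one pending task matching our capabilities.
            resp = requests.get(f"{FRONTEND}/tasks",
                                params={"type": task_type, "label": node_label})
            resp.raise_for_status()
            task = resp.json()
            if not task:
                time.sleep(5)              # nothing to do, poll again later
                continue
            result = execute(task)         # the one thing this worker does
            # The worker itself, not a master, records the state transition.
            requests.post(f"{FRONTEND}/tasks/{task['id']}/complete",
                          json=result).raise_for_status()
        except requests.RequestException:
            time.sleep(10)                 # front end or storage down: retry
```

Because the ticket stays posted until a completion is recorded, a crash anywhere in this loop loses nothing except the local effort.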
One thing about these simplified workers that I'm looking into very much for the future: we could spawn these workers with a lifetime of one request. For build tasks especially this would be really good: launch the worker once in a fresh VM instance or Linux container, which gives whoever is building full control over the dependencies while still guaranteeing that there are no side effects on the next build.

Okay, here's a simple picture of how it's all put together. We have the front ends, which are basically a gateway to the Ceph storage and to ZooKeeper for synchronization, and we have clients and workers that utilize exactly the same API for accessing the state of the repository.

Lastly, I'll talk a little bit about the results. I have used a test setup in Amazon Web Services to validate my goals. The setup is spread over three availability zones, with parts of the system in each zone, and with it I can check how this CI measures up against the goals I set. I'm pretty happy with where it is as of today.

Availability: I can shut down one full availability zone and the work still progresses; I can submit new tasks and they get processed just as if nothing had happened. Fault tolerance is at the same level: when I shut down one availability zone, only the progress on the nodes in that zone is lost; everything else works as if nothing happened. Fault recovery is still only on the roadmap; I haven't started on it yet, but with the model of tracking the state of the workers in ZooKeeper, I'm quite hopeful that I can do full restarts with ease: if I detect a stale node or a stale task, I can just shut it down, reassign the task to another worker, and complete the build even after the failure (there's a small sketch of this idea below). On the scalability front, this is very simple testing so far, and I still need to try it out with real compilation work, but I can already see that this CI can host over 10,000 projects without degradation in performance, and I have witnessed 500 concurrent build executions at the same time. So I'm quite happy on the scalability metric as well.

One last slide, on the future. The concept looks quite sound, so I'm quite happy with where I am at the moment. It is still very early for the project, I have to say; I couldn't recommend it for general use yet. It works, but it requires quite a bit of effort to maintain and keep operational. So the next focus for me will certainly be on documentation, and on polishing the tools so that it can be taken into more common use. In the next three months or so, I'm hopeful that I can launch a public service on this CI, similar to Travis CI, offering it as a service for projects that host their code on public GitHub. More to come on that front. And while I said it's still very early, I'm also looking for brave early adopters and contributors.
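On the fault-recovery idea above, tracking worker liveness in ZooKeeper to spot stale nodes: here is a minimal illustration using the kazoo Python client. The znode paths and payloads are my own invention, not the project's actual layout, and the talk does not say which client library it uses.

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

# Each worker registers an ephemeral znode; ZooKeeper removes it
# automatically when the worker's session dies, so stale workers
# become visible without any polling master.
zk.create("/ci/workers/worker-42", b"execute-shell:debian6",
          ephemeral=True, makepath=True)

# A recovery process could watch the registry and reschedule the
# tickets assigned to any worker whose znode has disappeared.
@zk.ChildrenWatch("/ci/workers")
def on_workers_changed(workers):
    print("live workers:", workers)
```

The same library also ships a ready-made distributed lock recipe (zk.Lock), the kind of primitive described earlier for globally synchronized operations such as build-number assignment.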
So if you're interested, come and talk to me and I can show you more. Those are my slides. Maybe we can take questions now; if there aren't any, I have a couple of demo commands prepared so I can show how it looks in practice. But if you have any questions, I'll be happy to take them now.

Q: You mentioned that you have very small workers, like a worker just to check something out. Can you talk about how the state transitions between them? A git checkout is obviously going to write files to a file system, and the build needs to use those. Is it using Ceph to transfer the data between those jobs? How does that work?

A: That's a very good question. The workers have their own workspace on local disk, but between workers, yes, I'm transferring the whole workspace back and forth through Ceph. That is quite heavy on the network at the moment. I'm looking at whether I could optimize it somehow, so that if the next worker is running on the same host we could just copy the workspace over; but yes, at the moment I'm pushing it back and forth.

Q: What do you use as a presentation front end, to look at the results and to notify developers about regressions? Do you still use Jenkins for that, do you have a custom system, or is that somehow built into ZooKeeper?

A: The front end provides an HTTP REST API where you can get the state of the builds. The project is at an early stage here; I was actually hoping to be quite a bit further along by the time of this presentation. Within F-Secure we have just one or two projects building on this CI so far, and they are quite far along with automation, so they use the REST APIs directly. But a UI is certainly on my list at some point, using those same APIs to provide somewhat more user-friendly views.

Q: Could you say something more about configuration changes? Say I'm making a change to a build command or to what the workflow is. Do I need to do any restarts, or do I just post the new configuration?

A: Okay, let me show you. All the job configuration is, again, JSON. I have a command line client here; let's see what we get. This is the job configuration for this CI itself, how I build and package this CI. We have a set of tasks that get assigned to workers: git checkout, execute shell (which runs the make script), and so forth, and finally I have one task for actually producing the build artifacts. Once I have these configurations, setting up a new configuration is done like this [runs command]. But it's a good question: the job configuration gets embedded into the build when the build is triggered, so the order of when and how you change the job configuration does not affect builds that are already in flight. Did that answer your question? Yes: at the time of triggering a job, we take a copy of the job configuration.

Q: So it has a bit of a hierarchy in how you store things, jobs and then builds underneath the jobs?

A: Yes, basically there is a bit of a hierarchy there, quite possibly. Although I have to say I'm tempted to move even further away from that: right now I'm using the CephFS cluster file system, but I would rather use just raw object storage, where I wouldn't have to deal with file permissions at all.
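Going back to the first question, about moving workspaces between workers: a rough sketch of that handoff could look like this. The endpoints, paths, and archive format are my assumptions; the talk only says that the whole workspace travels back and forth through the front end to Ceph.

```python
import tarfile
import requests

FRONTEND = "http://frontend.example.com"   # hypothetical front-end URL

def push_workspace(build_id, workspace_dir):
    """Archive the local workspace and hand it to the front end (to Ceph)."""
    with tarfile.open("/tmp/ws.tar.gz", "w:gz") as tar:
        tar.add(workspace_dir, arcname=".")
    with open("/tmp/ws.tar.gz", "rb") as f:
        requests.put(f"{FRONTEND}/builds/{build_id}/workspace",
                     data=f).raise_for_status()

def pull_workspace(build_id, workspace_dir):
    """Fetch the workspace left behind by the previous worker in the chain."""
    resp = requests.get(f"{FRONTEND}/builds/{build_id}/workspace", stream=True)
    resp.raise_for_status()
    with open("/tmp/ws.tar.gz", "wb") as f:
        for chunk in resp.iter_content(65536):
            f.write(chunk)
    with tarfile.open("/tmp/ws.tar.gz", "r:gz") as tar:
        tar.extractall(workspace_dir)
```

This is exactly the pattern the speaker calls heavy on the network, and why a same-host copy optimization is attractive.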
Q: [partially inaudible; about artifact storage and access rights]

A: It's a tough question. At our numbers, we are really telling our developers that Jenkins, for example, is just not the place to store artifacts: whatever is on the Jenkins instances is bound to be temporary. Another aspect is that we also instruct our Jenkins users not to store any job configuration in Jenkins itself: keep the jobs as thin hooks that call make, and keep the logic in the repository itself. So in that sense I haven't faced too many issues with user or access rights management in our use; but perhaps we can talk after this session and I'll see if I can fold your requirements into this CI.

Q: What about something like Android, where the source code is 20 gigabytes or so? How does that work on an IO-bound software project like that, where it's going to open a whole bunch of files? Do you notice any performance degradation?

A: I have not tested anything as big as Android. On the other hand, the worker configuration depends on your environment: the execute-shell script that actually does the building runs on whatever platform you provision, so if you can provision enough resources there, that resource use is local. We only touch the front end and the storage when it comes to transferring the workspace back and forth.

Q: How much of the remaining Jenkins functionality are you actually still using? Jenkins is the particular piece we are looking to get rid of, because it's so utterly complex and just falls over when several people touch it, and it seems like you have actually replaced most of it. One thing that I didn't see was polling for new commits, but I guess that is also rather easy to do with a simple worker?

A: Yes. This CI is still fairly simple, and the workers shown here are pretty basic. But if we wanted to, say, analyze unit test results, or plot graphs of some metrics, we could just create a worker that digs that information out of the workspace and perhaps creates the images as artifacts. This CI tries to be as minimal at its core as possible, but then offers the opportunity to extend the functionality through the easy creation of new worker types.

Q: With the way you've got this CI deployed in your use case right now, two quick questions. One: the job configurations, like the one you were just looking at; do you store those in version control, separately from this? And the other question is about the underlying distributed [storage; partially inaudible].

A: On the first one: I do not version the job configuration here. That is something I would encourage every developer to do themselves: whenever you publish a new job configuration, make sure it comes from the same repository as the rest of your code. On the second one: the ZooKeeper project itself does not recommend running over long or high-latency links.
I have not tried how that works. It is a good point, though: it would be a really nice capability to be able to run the system across regions, over long distances. At the moment, no.

Q: It seems like the architecture so far is designed around internal use at F-Secure, but you mentioned your desire to make it a public service. What are your thoughts about how to do that securely? Presumably these workers are going to be running untrusted code from random repositories on GitHub. Would they still have access to Ceph or any of the other shared things? Would they be able to poison artifacts, or the CI configuration, things like that?

A: That's a good point. The workers don't actually access the Ceph cluster and ZooKeeper directly; they all go through the front-end components, which are used by clients and workers alike, so there we can do authorization on who gets access. As for running untrusted code: I mentioned in the slides that we have the opportunity to spin up a new VM or Linux container for the execution task. In that sense, the worker that now runs the make file could be modified to create a new container for that make command, and once that's finished we collect the artifacts. That way we could limit, at the container level, the access to the internet and to the services.

Q: [partially inaudible; whether it is written in Python, and which version control systems are supported]

A: It is written in Python. For the jobs, it is currently git, but the git checkout is a worker type of its own, so it should be fairly easy to create similar workers for SVN or CVS. And there is one more good point about this worker model: we can also split things so that the git checkout happens on a different node, and in a different authorization context, than the one running the actual make script. If we set those up in two different contexts, there is no way the build steps can get access to the git credentials, for example. That's an added benefit of splitting the workers into isolated tasks.

I want to thank you all; if anyone wants to leave, feel free. I'm going to show just a couple of commands on how this system works in practice, just to give a bit of a feel for it, but if you have any more questions, keep them coming.

So, I have a very simple command line client. First, the configured jobs: this is running on my localhost, and there are two jobs configured. Here is a job configuration, which is JSON; there is a list of tasks. This particular job does nothing but sleep, but it has two different tasks. If I launch it and then get the state of the build, sleep build number three, here we are: this is the build state, still stored in Ceph, with the configuration embedded. Here we can see the configuration and the tasks; no tasks have been spawned yet for this one. Now we see the tasks: the first task is pending, looking for workers that can fulfill the promise. Its type is execute-shell, and then I have a node label, just an arbitrary name, "debian6", so you can have different workers capable of building in different environments or on different operating systems. Next I'll show a couple of the tasks themselves.
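Before the actual output: based on the narration, a pending task ticket might look roughly like the reconstruction below. Every field name here is a guess from what was said on stage (type, capabilities, node label, assignee, status), not the project's actual schema.

```python
# A reconstructed example of a pending task ticket (all fields guessed):
pending_task = {
    "id": "sleepy-build-3.task-1",
    "type": "execute-shell",      # which worker type can fulfill this
    "capabilities": ["debian6"],  # arbitrary node label: environment or OS
    "status": "pending",          # no worker has claimed it yet
    "assignee": None,             # set to the claiming worker's id later
    "params": {"command": "sleep 30 && date"},
}
```

A worker whose type and node label match the capabilities would claim the ticket, set itself as the assignee, and eventually mark the status complete.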
Here's how these look; this is now in ZooKeeper. There is one worker I didn't mention earlier, the build control worker, which manages spawning the different tasks altogether. This is how the tasks look, and the assignee part indicates the worker. In the future, when I get to restartable tasks, I hope that when an assignee goes stale I can restart the entire task, to make sure that every triggered build will eventually finish. Here's another task; this one is complete, already accomplished: collecting the artifacts from the sleep build (the output happens to be a date). The capabilities again indicate what type of worker it takes. And if we finally query the build again: the status is complete, we have run through all the tasks in order, somewhere in there is the result, success, and we have the list of the build artifacts that were created, which can then be referenced by other scripts.

Q: [inaudible; about console output from the workers]

A: That's a good question. Unfortunately I don't have a command to show it, but yes, each of the workers can, and actually will, submit a console log where they capture both standard output and standard error.

All right, thank you. If there are any more questions, just pull my sleeve at any time.