Okay, we're live. Good morning, everyone. Welcome to another edition of Wikimedia Tech Talks. Today we have Alexandros Kosiaris, who is a senior operations engineer on the SRE team at the Wikimedia Foundation. He's going to be telling us about the new deployment pipeline, which I believe is already being used by a few services. If you have any questions, please direct them to me on the IRC channel #wikimedia-office. With that: Alex, can you help me start this?

Hello, everyone. As Suburudu said, I'm Alex, and I'll be presenting today. Yes, doing it right now; let me put this in my other workspace, present, and get my presenter notes. The topic of this talk is the deployment pipeline, a 3,048-metre overview (3,000 metres and 48 metres), which is a pun on the usual "10,000 feet" overview that you hear about. As mentioned, questions go to the host on IRC. Note that the slides are numbered, so if you want to refer to a very specific slide, feel free to; it will make it easier at the end of the talk to skip straight to the slide you have a question about.

Let me start by giving you a problem statement. Some people might have seen some of these slides before, as this specific problem statement has been reiterated in the past. Let's start very, very generically, without anything to do with a pipeline. The question we can ask ourselves is: how does a user deploy and run code? In the very general case, they just need an execution environment, and here are four example execution environments. The top left is ENIAC; if any of you recognize the punch card, congrats. The others are MS-DOS; a POSIX/UNIX variant, which is Sun's Solaris 10, on the top right-hand side; and of course Windows, from Windows 95 to Windows 98, on the bottom right side.

So what do all of these environments have in common? They're all execution environments, and the end user decides what they want to execute, whether that means double-clicking, typing a command, or punching holes in a card. At the end of the day, they just define what they want to execute. And while that's common to all of those execution environments, there is something they don't have in common, aside from all the other stuff like ease of running or deploying: if we split them into the left-hand side and the right-hand side (and no, that's not a political statement, they're just arranged that way), the left-hand side is without a scheduler. On ENIAC there was no scheduler: the moment you put the card in, it would run your program. What actually implemented the scheduler was a person. From the early days, stories that have not been published tell us that people with connections to that specific scheduler person would pre-empt others and cheat their way up the queue; there was always a queue for the ENIAC machines. And of course an entire process grew around it, with people having to care about it and a lot of strict measures so that nobody would pre-empt anybody else. Quite a big mess, then. Microcomputers came along, also without a scheduler, which means there was only ever one single process running.
There were tricks to simulate multiprocessing; that's what viruses did back then. But the right-hand side actually has a scheduler, which means multi-process, multi-user, multi-everything. And in all of these environments, left-hand side or right-hand side, we can ask a question (and I ask that you keep the scheduler idea in mind, because we're going to come back to it later on): how many things does the user have to care about? The answer is that, in the general case, they only have to care about one thing: how to execute the software, whether that was double-clicking or feeding in a card. That's about it. They did not have to care about CPU placement, memory placement, network capacity, disk throughput, disk space: none of these. And yes, there are cases where you have to go and make sure you have an SSD and not a hard disk, not spinning rust; there are cases where you might care about pinning a specific piece of software to a very specific CPU. But this is not the generic case we are talking about here.

Now that we've painted this rosy picture, let's contrast it with how we currently deploy and execute code at WMF, at least in the majority of cases, because as you will see at the end there are some cases that are not so bad anymore. Note, by the way, that in this question we no longer say user; we say developer. We're talking about a person who is more technically inclined. And this is a very, very ugly slide. It's ugly on purpose, because the situation is ugly. The developer will write the code, then test it locally. If the local tests pass, they will push to Gerrit. Then they will wait for CI, and they will also wait for code review by other people, if they don't just try to merge it themselves. At some point the code gets merged. Up to here, it's all more or less fine. And now starts the ugly part. They have to go and build a deploy repo manually. A deploy repo is something that generally gets created by installing the various dependencies: you go around running composer install, pip install, npm install, whichever of these is actually required. Roughly speaking, the manual dance looks like the sketch below.
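To make that concrete, here is a purely illustrative sketch of building a deploy repo by hand; all repo names and paths are made up, and the exact steps differ per service:

```sh
# Hypothetical sketch of the manual deploy-repo build; names/paths are invented.
git clone https://gerrit.example.org/services/myservice
git clone https://gerrit.example.org/services/myservice-deploy
cd myservice
npm install --production     # or: composer install / pip install -r requirements.txt
cp -a src/ node_modules/ ../myservice-deploy/
cd ../myservice-deploy
git add -A
git commit -m "Update myservice deploy artifacts"
git push                     # the deployment tool then deploys from this repo
```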
After that, you try somehow to deploy that to production. At best, and I'm assuming here that you know what you're doing, the tooling does support it, but you have to go and define everything yourself. You're going to be doing it with a blue-green method, that is, a blue-green deployment where you upgrade parts of the infrastructure slowly, one by one, testing them, then killing the previous instance and putting the new instance of your code into production. And after you've done that, it might error out, and you'll probably have to figure out why some machine does not work. It might be CPU contention: in the current setup most services are co-located, so something else might be eating your CPU, or you might end up deploying something that eats CPU and starves the others. The same goes for memory. It might very well be that the machine has run out of disk space on one of its partitions; I don't know if you even care, when you deploy code, what the partitions are and how much disk space the machine has anyway. And, well, the machine might actually be dead. That's something that has happened.

And it's still happening at times with MediaWiki. One of the MediaWiki servers dies, the train coincides with us already trying to pull it out of rotation, and then Scap kind of stalls. It even poses a very interesting question when the machine comes back on: what version of the code is it running? Because if it was out and Scap skipped it, what's going on now? We do have tooling and efforts to combat those problems, but they still rear their ugly head every now and then. And, of course, the perennial question: is what I deployed actually what I was testing? Is the thing that I know I tested the same thing that I deployed? Because there is the possibility that other changes piled on top of it, and it's no longer what I thought I had.

All of those things I've painted in red are things we don't really want to expose people to. And beyond all of those red things, we also don't have elasticity; that is, we cannot react to increased demand. That's something that has happened quite a few times in the past, especially when some GLAM user, in our specific example, uploads tons of videos to Commons and then all the video transcoding and re-encoding takes place. The video scalers tend to be fighting with all of that load, and we do not have the capacity to add more workers, at least temporarily, to help with it. At the same time, we cannot run multiple versions of the code simultaneously in order to test something, so we cannot really do canaries nicely. We have co-located services that share dependencies; that's, for example, all of the Node.js services residing on the scb cluster. There we get the usual case where one service wants to upgrade to, I don't know, Node.js 10 and take advantage of all the cool new stuff, but some other straggler is still fighting to get upgraded to Node 10. While that does have the nice consequence that the entire group moves along as a herd, it slows some people down in order to speed others up. It has its pros and cons; overall, it causes friction.

There is also no hardware fault tolerance: a box dies, we lose capacity. As for the deploy repo, nobody really uses it for the development environment; it's just the deploy repo. But we could do that: we could take that deploy repo, that artifact we talked about, and use it to develop stuff locally. And let's not even talk about the first deployment, when you put the service into production for the very, very first time: you have to find your hardware, find out how to create the deploy repo, find out how to configure your deployment tool, which is Scap in our case for most services. There is, of course, a data cluster as well that you might have to go through, although there is a process to request it, and we're unsure about the future of that cluster currently. All of this means you have to talk to multiple teams: probably the Services team, the Release Engineering team for sure, and the SRE team for sure. That is going to take time.

So how about we abstract some of these problems away? Here are some ideas of how we went about abstracting them. As you remember, just a while ago I talked about the scheduler.
We could have a scheduler that is an automated program and not a human being, because that's where we currently are. Usually, I am your ENIAC scheduler: somebody comes to Alex, or Giuseppe, or somebody else in SRE, and we say, hey, you know what, service X is going to go to boxes Y and Z. But we could have an automated tool that does that; there's no reason to have humans be the scheduler there.

We could also decouple the service from the workloads, allowing us to run an arbitrary number of workloads and still have only a few of them addressed by the service, that is, only a few of them receiving actual traffic. We already have that via PyBal right now, via LVS, and we can do very nice things with LVS and PyBal. Things like saying: hey, you know what, workload X, instance X of your app, can receive just a small percentage of the traffic that all the others are receiving. Or we can fully depool one. But all of these tools are SRE-only. They were created with the idea of helping SREs operate everything, not with the idea of helping the developer, so developers don't really have access to them; they cannot really play with them.

And another thing that would help transform things: building the deploy artifact automatically. Yes, we will be calling it a Docker image from now on, because that's what we ended up implementing, and we use the Docker image for local development as well as for deployment; we have actually demonstrated that you can use the Docker image locally to develop stuff.

So let's take one of those bullet points, building the deploy artifact automatically, and talk about it, because this is the major work of the pipeline currently. The pipeline is powered by Jenkins and Blubber; I'll be referring to Blubber a lot in the coming slides, in case you haven't heard of it before. The entire idea of the pipeline is that you upload a change, tests are run, you get your reviews, you merge the change, you obtain the deployment artifact (the deploy repo, if you want to call it by its old name), you run integration tests, which the pipeline can do for you, and then you just deploy.

And we can actually see how that happens in practice, from day to day. So: lights, camera, action. I always wanted to say that, badly. This slide contains a snapshot that I took today from Gerrit. We have a piece of software there called Termbox, by the WMDE team. You can see that user Thomas Arrow uploaded a change on July 9 to the Termbox repo, and then the pipeline gives some input: a success, right after that. There is even Jakob from WMDE who also shows up, and there is the usual stuff that you all know about. At some point Jakob says +2, and then the pipeline merges. The entry that I expanded there, marked in red, is the new stuff. You can see that the wikimedia-pipeline project has said: hey, you know what, we have that docker-registry.wikimedia.org/wikimedia/termbox image. That's a Docker image that you can go and pull, and here are its current tags. You can see two tags. One is snapshot-based: 2019, July 9, and the time, plus the "production" suffix that we add there, which means it passed tests.
And there is also a git hash there, that string starting with 799, which, if you look at it carefully, you'll realize is the exact string of the git SHA of the commit that Thomas uploaded. That is viewable in the middle of the screen, right under the author, where it says "commit" and you can see the exact same string. So what the pipeline did there is build the image and then give it two different tags. It's the exact same image, the exact same version of the image, but you can refer to it by two different names: one is timestamp-based and the other is the git SHA. That allows you to implement two different deployment strategies, or release engineering strategies if you want, with it. One is fully timestamp-based: at every point in time you know that you've deployed a specific timestamped release to production. The other allows you to do git-based deployments, in the sense that you always know what git commit you have deployed, because there is a one-to-one correspondence between a git commit and an image. So when you test a specific image, you are pretty, pretty sure that the exact same thing is going to end up in production: the exact same thing you locally tested against and, of course, that CI tested against.

There is a third approach, not viewable there, that is used by the Analytics team: there is support for git tags, which is pretty cool if you want to do semantic versioning. If you push a git change that has a git tag, that tag will be used as-is and added to the resulting image. So you also have the option of doing semantic versioning via your git tag releases. Andrew from the Analytics team found it very useful; I'm pretty happy that he did, and I'm pretty happy with the functionality. It's just something we need to document a little bit better, to be honest.

Now, why do we actually call this a pipeline? If you click on that red link, the one on top of the "image build success" text, you get to something like this, where you can see why it's a pipeline: steps are executed in order. There's a checkout-patch step, a build-image step, then a run-image-tests step, and so on. I've chosen to expand just the run-image-tests step, as I could have expanded any other one, and you can see that at the end of the day it runs an npm test, running the tests for this specific piece of software, which is in TypeScript, a JavaScript variant. Those steps are going to become configurable in coming versions, not right now; that's something that is happening slowly in the pipeline, and the entire idea is that you will be able to configure the pipeline to your needs, which might differ per piece of software.
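To make the tagging scheme above concrete: the registry speaks the standard Docker Registry v2 API, so tags can be listed over HTTP, and the same image can be pulled by any of its names. The tag strings below are placeholders, not real tags:

```sh
# List the tags of the Termbox image via the standard Registry v2 API
curl -s https://docker-registry.wikimedia.org/v2/wikimedia/termbox/tags/list

# Pull the exact same image by either of its tags (placeholders shown)
docker pull docker-registry.wikimedia.org/wikimedia/termbox:<timestamp>-production
docker pull docker-registry.wikimedia.org/wikimedia/termbox:<git-sha>
```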
A lot of companies are doing various pipelines; at the end of the day, everybody is inventing a pipeline today. So why did we go and do our own? One reason is that when we started discussing this a couple of years ago, there was really nobody doing it seriously aside from the very big players and the very big clouds, to which we cannot go for reasons not to be discussed here. So we started doing it on our own, and one of the things we found out is that Dockerfiles are actually really hard. There are entire lists of the mistakes that anybody can make when they create Dockerfiles; it's really easy to mess up. The usual, simple mistake is using dependencies that are not pinned or frozen, and that can happen both for your internal dependencies, things like not freezing a specific Python package or a very specific module that you want to use, and for the platform, by not pinning things to specific versions of, for example, Node.js. We do have that covered currently: we have images pinning Node.js to version 10, or 6, and so on. Another mistake you can make is having changes to your source code invalidate build caches, or invalidate the various Docker optimizations, making everything take way longer and consume way more space. And the usual mistake you can see in most of the Dockerfiles you find on Stack Overflow is that, if you end up pasting them, the software will end up running as root. Running Docker images, Docker containers, any kind of containers actually, as root is not as safe as some people would want you to believe. There was, in fact, a couple of months ago, a vulnerability present in most container runtime engines where, if the image was running as root, it was trivial to escape the container and arrive at the host. That's one thing that could be avoided, and it's easy to forget; we forgot it, sorry, in one of the images that we maintain internally, so anybody can make that mistake. There are also issues like ending up with really bloated images, or using external services to do stuff in non-standard ways, which makes for a beautiful time when you want to debug and you end up finding code in places where you don't expect it.

So Blubber, an abstraction over Dockerfiles, was created. It's a declarative tool written in Go (a Dockerfile, by the way, has an imperative nature): you write a relatively declarative stanza in YAML, and all it does is generate a Dockerfile for you. It supports slimming down images using multi-stage Dockerfiles, which means that all of the dev dependencies you might have installed during the build process are no longer in the resulting image you're going to publish; the resulting image is going to be way smaller, which means faster deployments. It also has policies for configuring a variety of things, like not allowing the use of any Docker registry that you do not trust. It makes sure that source code files and directories, that is, our source code in the images, are not owned by root but by a user, and that the code is not executed as root either. I have a link there if you want to read more about Blubber; it's on Wikitech. But let me actually show you a Blubber Hello World, and it's as simple as this: just a version, what the base image is, and an entrypoint that says Hello World. Five lines, along the lines of the sketch below.
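A minimal Blubber config of that shape might look like this; the base image name is illustrative, so check the Blubber docs on Wikitech for the exact current syntax:

```yaml
# hello.yaml: a minimal Blubber config, close to the Hello World from the talk
version: v4
base: docker-registry.wikimedia.org/wikimedia-stretch   # illustrative base image
variants:
  hello:
    entrypoint: [echo, "Hello World"]
```

Blubber then expands this into a full Dockerfile; something like `blubber hello.yaml hello | docker build -t hello -f - .` would build it.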
When you pass those five lines to Blubber, what it generates is quite a bit more. It creates a Unix user that is going to own all of the code that gets installed, that is, your own code, the deployer's code; that user is actually named "somebody", by the way. And then there is a user called the runuser that is going to be running the app. One of the key points about securing services is that you do not run the code under the same user that deployed it, because if that held true and the application got compromised by an attacker, they would be able to modify and execute arbitrary code. By separating the two users you add one more layer of protection. And that separation is tolerated by most of the applications being deployed; there are some exceptions, like Joomla and WordPress-style software that wants to be able to install its own plugins, which is a security nightmare at times.

So we took the decision to create this tool that generates those Dockerfiles, but we also took other decisions during the pipeline process. One of them was that we wanted everyone to be able to use our registry and our Docker images. All of the images are available for anyone to use, under docker-registry.wikimedia.org. Don't go typing that URL into your browser; it's not going to give you back anything, because we don't yet have a nice web interface (we're working on it). But if you use the various APIs of the Docker registry, you can and will be able to pull images, test them, and use them. That registry is read-only, though: nothing but the pipeline and CI can actually push images to it. That's a conscious decision; we don't want anyone else to be able to push images there, so that there is no abuse of them.

Those images are always going to be based on base images built and provided internally, base images that the SRE team takes care of, maintaining them and making sure they don't have any known vulnerabilities. Base means they will be based on Debian: Stretch, Buster, and other Debian versions as time goes on and more of them are released. That also means: no Docker Hub images or any other registry's images; I'll expand more on that later. And we are in the process of discussing tooling to make upgrades of all of those images as simple as possible, because if you visualize a tree starting from those base images down to all the production images that will be running, you can understand it can be a difficult thing to know which exact component is vulnerable in a security incident, and to rebuild all of the dependent images in order to protect against it.

One of the most interesting decisions we have taken up to now (we're still discussing it every now and then) is that we don't tag Docker images as "latest". Anyone here who knows Docker will tell you that it's kind of standard to have a latest tag for any image. But truth be told, latest is kind of a mess. Latest doesn't actually mean latest; all it means is that somebody pushed an image without specifying a version, and the "latest" tag was added by default. That means that if you push a very specific version of an image and tag it correctly with a version, latest is not going to be pointing at it. So it's badly named, and somebody needs to tag and push latest on purpose in order to update it, or push incorrectly and update it by accident.
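A small illustration of that pitfall (registry and image names are made up):

```sh
# Pushing a properly versioned tag does NOT move "latest":
docker build -t registry.example.org/myapp:2.0.0 .
docker push registry.example.org/myapp:2.0.0

# "latest" still refers to whatever was last pushed *as* :latest, if anything,
# so this may fetch an arbitrarily old image, or fail outright:
docker pull registry.example.org/myapp:latest
```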
So it's kind of ambiguous and weird. And how do you do that in a pipeline, where you can have race conditions? How do you know that it's version X and not version Y that warrants being called latest? For these and other reasons, even Kubernetes very clearly spells it out and says: avoid using it. It's very ambiguous, very difficult to know what it refers to, and very difficult to know whether you have actually upgraded, because you no longer have something telling you, hey, you're running version X and not version Y.

As I already said, we will not be using Docker Hub or gcr.io or any other registry's images. I'll be talking about the main reason later, but I first want to give you an idea. There was this project that was released a couple of weeks ago, vulnerablecontainers.org, which scans the official images on Docker Hub, and you can see the numbers there. I can tell you that when I visited it today to update the slides, the number was 17; a couple of weeks ago there were 720 different vulnerabilities available through the Node.js image. You can see that Rails, Django and Java have not been updated yet. The point is that those images are maybe fine for local development, but they don't look like things you want to run in production, and we want to avoid having those kinds of issues. Add to that the fact that back in 2017 and 2018, Docker Hub images were found with crypto miners in them, and this starts to paint a rather untrustworthy picture.

But the main reason we do not want other registries around is that we want to be able to upgrade very easily and quickly, without any kind of blockers, whenever the next Heartbleed, Shellshock, GHOST, whatever you want to name it, shows up. Some of those issues end up meaning that pretty much any system can be compromised; something like a libc vulnerability means that everything needs to be rebuilt and repackaged on a common Linux system. If we have any kind of third party in that process, in that supply chain, we are blocked waiting on them (and, by the way, most of the images on Docker Hub are in reality maintained by volunteers) while we're kind of done for, waiting for a very important security vulnerability to be addressed. So the only way we can maintain trust in our pipeline is if the Wikimedia community controls the entire supply chain of the images.

So far we've talked about image creation decisions. But after you've got the images, the ones created by the pipeline, how on earth do you deploy these things? How do you orchestrate them? As I said, Docker on its own might be okay for local development, but for orchestrating workloads it can be painful. There have even been talks at previous KubeCons about how painful it can be if you don't have an orchestrator around Docker, because Docker is an excellent workload executor, but it's not really a good orchestrator. Networking is quite painful in Docker; it makes assumptions that you might not like, especially when you want to run many workloads in production. Understanding the global state of a workload, for example how many instances do I have and what version is running where, can be difficult to discern; Docker Swarm recently made some big improvements there, but it's still kind of clunky. Metrics and monitoring need to be implemented by you, and I mean really everything needs to be implemented: aside from tailing standard output, there is nothing there to help you; you have to go and implement everything yourself.
So there were voices years ago talking about Docker, but back then Technical Operations, which was the name of SRE back then, said no, this is not happening, because it would be a nightmare having this thing in production. And then Kubernetes came along. It was released in 2014, a couple of years after dotCloud released Docker, and it's currently the de facto standard for orchestrating containers: container images and container execution. WMF back then, thanks to the efforts of the WMCS team, became a member of the CNCF, of which Kubernetes is now a graduated project, and our team was lucky enough to obtain some knowledge around it. After briefly evaluating some other tools, like Mesos and Nomad, we decided to go with what looked like the standard, in order to be able to take advantage of the community that would be around it.

So what did it give us? I've already talked about the scheduler: yes, it has a scheduler. You size your workload (you might hear me in an IRC channel saying "size up your pod"), that is, you go and find out how much CPU and memory it requires under stress and under normal operating circumstances, and you decide how many instances of it you want in order to be able to serve the expected traffic for the specific service. You can also give the scheduler hints: hey, you know what, this specific workload needs GPUs; or it needs to be spread across multiple nodes, because we need high availability in case one of the nodes dies; or it needs to be on dedicated nodes. All of these things are supported, and you give all of them to the scheduler as declarative inputs. It will do the best it can to spread the workload nicely across multiple nodes, and it does a pretty good job at that, at least as far as we've witnessed up to now. There are rumors and stories about cases where things can run amok and cause issues, but we haven't met any of those, so we're doing okay so far.

And one of the best things, which we actually ran across thanks to yours truly, is that it will reschedule workloads if a node fails. If a node dies for whatever reason, the scheduler will realize it within a configurable time limit, currently five minutes if I remember correctly, and it will reschedule the workloads onto other nodes. That means no lost capacity for a service: if something dies, within five minutes, and assuming of course that the cluster is not under very heavy contention, the capacity of the service will be brought back to its normal operating level.

What else, though? It also provides us with the power to decouple the service from the workloads, and this is via a very flexible tagging scheme. You take a workload, a pod as it's called in Kubernetes, and give it a set of arbitrary tags; then you create a service and give it the same arbitrary tags, and they are bound together. Then you can create other workloads. For example, you can spawn X workloads and only have a number of them serving production requests, thanks to the tagging scheme. And that allows implementing canaries: you can have a number of workloads running version X and another number of workloads running the next version, Y, and then, by a careful calculation of the percentages between the two, only a small percentage of the traffic for the service actually goes through the new version. Something like the sketch below.
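As a hypothetical sketch of how that plays out in Kubernetes terms (names, labels and numbers are all made up; this is not our actual chart): a Service selects pods purely by label, so two Deployments sharing the app label but pinned to different image versions split the traffic roughly in proportion to their replica counts. Note also the resource requests, which is what "size up your pod" amounts to.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: myservice
spec:
  selector:
    app: myservice            # matches pods of both Deployments below
  ports:
  - port: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myservice-stable
spec:
  replicas: 19                # ~95% of traffic lands here
  selector:
    matchLabels: {app: myservice, track: stable}
  template:
    metadata:
      labels: {app: myservice, track: stable}
    spec:
      containers:
      - name: myservice
        image: docker-registry.wikimedia.org/wikimedia/myservice:<old-tag>
        resources:
          requests: {cpu: 500m, memory: 256Mi}   # the "size up your pod" part
          limits:   {cpu: "1",  memory: 512Mi}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myservice-canary
spec:
  replicas: 1                 # ~5% of traffic: the canary
  selector:
    matchLabels: {app: myservice, track: canary}
  template:
    metadata:
      labels: {app: myservice, track: canary}
    spec:
      containers:
      - name: myservice
        image: docker-registry.wikimedia.org/wikimedia/myservice:<new-tag>
        resources:
          requests: {cpu: 500m, memory: 256Mi}
          limits:   {cpu: "1",  memory: 512Mi}
```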
You leave it like that for, I don't know, three hours, three days, and after it's proven, you just upgrade everything to the new version. Which is nice.

The other thing it does is monitor workloads and depool failing ones. That's something PyBal does as well, but the difference is that here it is actually viewable by the developer via the API, whereas with PyBal the live monitoring and depooling is not viewable by anyone who doesn't have access to the logs, namely SREs. It means that if a specific workload crumbles under the load of too many people requesting access to it, it can be depooled until it's functional again, so it doesn't receive production traffic, and then it can be repooled. Of course, as you can imagine, this can create some very nice domino effects in some cases. We haven't met many of those yet, but it is something that can happen and that we can react to; we do have plans to be able to react further, for example by adding more capacity to the service.

And it implements blue-green deployments by default, let me point that out. That means you will always have workloads serving throughout a deployment, but you upgrade them in batches, say 25%, which by the way is the default. So if you have 10 workloads... well, not 10, because that won't make my numbers easy: if you have 20 workloads, 5 will be upgraded in each batch. 5 new ones will be spawned, they will be health-checked, traffic will start flowing into them, the 5 old ones will die, and then on to the next batch of 5. If problems arise, it stops automatically. It does allow you to roll back, although rolling back is manual, not automatic; stopping the upgrade, however, is automatic.

Another thing is that it's actually declarative; you might have heard that about Puppet before, the tool that SRE uses to manage configuration across the fleet. Like with Puppet, you declare the version of the image that you want, your number of instances, all of these configurations, and it will make it happen, provided you haven't messed up what you asked for, which is easy to do at times because it's YAML, not the most beautiful format in the world. But at the end of the day, it will make it happen. That means you no longer have to care about state transitions: what was my previous state, what is my next one, what tooling do I need for the transition. As long as you have written your configuration well and you declare your new version, it's just going to happen.
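Putting the batching numbers from a moment ago into declarative form, a hypothetical sketch (names made up; 25% happens to be the actual Kubernetes default for both knobs):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myservice
spec:
  replicas: 20
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%            # up to 5 new pods spawned per batch
      maxUnavailable: 25%      # up to 5 old pods taken down at a time
  selector:
    matchLabels: {app: myservice}
  template:
    metadata:
      labels: {app: myservice}
    spec:
      containers:
      - name: myservice
        image: docker-registry.wikimedia.org/wikimedia/myservice:<new-tag>
        readinessProbe:                          # health checks gate each batch
          httpGet: {path: /healthz, port: 8080}  # /healthz is an illustrative path
```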
Now someone might say, rightly perhaps: hey, haven't you just locked yourselves in? Kubernetes is a platform that was born inside Google and all. But the truth is that a lot of the parts of the platform are pluggable. For example, runtimes are pluggable; the things that run containers are pluggable. There is now a standard from the Open Container Initiative, which has existed for a couple of years, so we're not locked into Docker: any OCI-compliant container runtime engine will do. We do have Docker, that's why it's bolded there, but CRI-O and containerd exist and can be used as well. The Kubernetes networking paradigm is also pluggable: there is a standard called CNI, the Container Network Interface, that anyone can implement. I've listed a couple of implementations there, and there are many, many more in reality; it's a very open space for anyone who wants to create a public or private cloud and needs a selling point.

And that's where we actually had to do an evaluation. For the container runtime engine that I talked about earlier, truth is, Docker was the only viable solution when we took the decision; now CRI-O and containerd are viable solutions as well, and we need to evaluate them. But as far as networking goes, we did have a lot of possibilities back then, so we evaluated several of them, and we went with Calico, Project Calico. Calico is one of the CNI plugins; it's also used in OpenStack. We chose it because it was very compatible with our current networking setup, which meant we had to make zero changes, and it avoided the complexity of overlay networks that other solutions, like Flannel (which is used in Toolforge), come with. And it supports network policies, which is a feature our service owners, our developers, had asked for in the past. That is, they want to be able to know that a compromised service is not able to reach all of production: outgoing connections are blocked right at that level, and the service would not be allowed to run, I don't know, nmap and scan the whole of production. That's actually great, and we do have it working right now: we widely block applications from talking to specific points in production.

With all of those nice things and decisions there also come caveats. One of them is that when you abstract problems away, you create new ones, and you create new concepts, and there are many, many, many of those. I've listed some of them there, but the list goes on; in reality you have nine new concepts for workloads, four new concepts for services, six more for configuration, storage, and so on. It's not really all that great having to learn the entire Kubernetes API just to run a Node.js application that does wikitext manipulation. And all of those concepts are defined in YAML files, each many, many lines long, and it can be kind of messy to work with them.

So we scoured the ecosystem back then, and there was a tool that looked like it was going to become the de facto solution. It was called Helm, it's still called Helm, and it's the equivalent of apt-get or dnf in the Kubernetes world. It's a package management tool that allows you to deploy applications, and it abstracts away all of that complexity, all of those new concepts I talked about. It provides you with a simple CLI: you can say, hey, I want to install MediaWiki, and you don't have to care about Services and Pods and Deployments and ReplicaSets and all the things I talked about earlier. And it is the de facto standard; there are others, but it looks like it's the one that will hold the majority of the market for now. The idea is that you create all of those concepts, all of those Kubernetes resources, and you group them, make them a little more configurable, and deploy them as one unit. That's the concept of a Helm chart, which is called a chart because the Kubernetes ecosystem is obsessed with sailing, may I say. The chart there does not refer to graphs or diagrams; it refers to the actual charts that helmsmen on ships use: maps of where the various harbours and the shallow and deep waters are. You use charts to group and deploy the various Kubernetes resources, and the deploying part allows you to set various values, the most common one being the version of the app, which is the thing that actually gets changed a lot.
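Day to day, that looks roughly like the following (Helm 2 era CLI; the release name, chart path and value key are illustrative, not necessarily what our charts use):

```sh
# Install a release of a chart, then bump just the app version on upgrade
helm install --name myservice ./charts/myservice
helm upgrade myservice ./charts/myservice --set main_app.version=<new-image-tag>

# Rolling back is a manual step, to a previously recorded revision
helm rollback myservice 1
```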
You can share a chart with the rest of the world, with the entirety of the community, and allow them to easily deploy their own versions, their own instances, of your app; and you can of course use it for local development as well. We have our own charts, and there are two links there: one is the releases location, where you can go and see our charts (all of these are currently in use), and there is also the repo where various people can submit new charts or submit changes to existing charts. You'll see that the repo is under operations/, but that in reality is just historical; we're not the gatekeepers there. You can always add us as reviewers, so I can provide input and feedback on the various changes one proposes, as can the rest of the SRE team, but we're not gatekeeping; you can merge it yourself if you want. At the end of the day, the chart is up to you, the developer.

Then again, having to create a chart for the first time can be a little bit daunting, so we created some scaffolding for that. The use of the scaffolding is: just git clone that repo and run the script, create_new_service.sh. It's going to ask you a couple of questions (what's the name of your service, what is the image going to be named) and that's about it; then you have your chart. You can optionally edit it and change stuff, and I urge you to do so, because some of the stuff can't be made that simple. There are READMEs in there, and notes that are going to be displayed in the face of the deployer, and it's nice to provide something more than the generic stuff we are able to provide in the scaffolding. Then you can submit it for review, it can be merged, and the chart gets created and published; and that way you can deploy in our own deployment pipeline.

And this is in reality used by the pipeline. As I mentioned, aside from the various unit tests that are run by the pipeline, like the npm test we saw when I was showing the Termbox change earlier, there are also integration tests that you can run, and here is how integration tests are run by the pipeline: they are the Helm chart tests, and they are part of the Helm chart. You can go around in many of the charts; we have them all under test, so feel free to look at them. What we have currently standardized on is a tool called service-checker, which the Services team wrote internally. It relies on an OpenAPI/Swagger endpoint that defines the various endpoints, plus probes that you can run against those endpoints in order to know that the software is operating nominally. The exact same thing also runs in production, so at the end of the day our integration tests and our actual live production checks are converging; they are going to be more or less indistinguishable, implementation-wise. You can easily add more tests, customize existing ones, and use them; the sketch below gives a flavour.
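For flavour, a hedged sketch of what such a spec stanza can look like; the x-amples extension is the kind of thing service-checker-style probes read, though the exact field layout here is approximate:

```yaml
# Fragment of a Swagger/OpenAPI spec with an example probe definition
paths:
  /_info:
    get:
      summary: Service information
      x-amples:
      - title: retrieve service info
        request: {}
        response:
          status: 200
          headers:
            content-type: application/json
```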
With all of that, let me give you a very quick rundown of the current status and the roadmap. We currently have 9 services in the pipeline; I've named them there. Many thanks to all of the deployers who have gone through the process of the pipeline in order to have their services deployed by it. Thank you very much, and I'm hoping it is proving to be a better solution than the previous one. Some more are to be added soon: things like RESTRouter, a part split off from RESTBase, and CPJobQueue, along with a part split off from the mobile content service, are to be added in the next quarter.

On our roadmap, which is kind of uncertain timing-wise because the pipeline was not chosen in the annual plan, we want to document it better. We want to help create a local development environment; that already kind of exists, based on Minikube, but there is still some work to be done on it. We want to encourage a whole host of other service owners to migrate services to the pipeline; that includes MediaWiki, although we are not sure when we will be able to start that part of the work, but by the end of the next fiscal year it should at least partially have happened, I'm hoping. We need tooling to automatically respond to increased and decreased demand, that is, to automatically spawn more instances of whatever app is running in order to increase capacity, and do the inverse when the increased demand recedes, so as to get the capacity back from the service. We want to be able to create highly available endpoints for the apps, for the various developers, in order to route to their applications more easily and with less SRE intervention. We also want to add TLS to the various services: not just termination, which we already have via nginx, but also services reaching out to the world over TLS without having to implement TLS internally in any way. We want to abstract TLS away from the various services and service owners, because cryptography and TLS are very easy to get wrong. And we want to add telemetry, in order to have things like better latency metrics. With that, thank you very much; I hope you enjoyed the presentation, and we have time for questions.

Thanks. Go ahead, Andrew.

Sorry, I was going to ask: is MediaWiki on the roadmap?

As I said, the annual plan did not choose the pipeline, so it's not entirely clear; it's the last of the services that will be migrated, at the end of the day. It is on the roadmap, but I cannot tell you when it's going to happen. There is a local dev version of MediaWiki planned for Q3.

The TLS demarcation you're talking about: I'm not sure I fully understood that.

That means that when apps have to make remote connections to other services... let me give you an example. Let's assume that an application that runs in the cluster, let's say ChangeProp, needs for some reason to talk to EventGate. The idea would be that ChangeProp knows nothing about TLS: it talks locally to a proxy, an HTTP proxy, that knows how to initiate a TLS connection to a corresponding proxy that resides in front of EventGate and terminates the TLS there. The idea is that everything will be encrypted as long as it's on the wire, and the decryption will be happening right next to the application; and not just the encryption, sorry, the decryption as well. All of this means that nothing TLS-wise will be visible to the application; it won't have to import libssl, or tools that expose ciphers to the code, and things like that.

A follow-up question: it just seems like initiating connections is usually relatively easy, because it's built into, say, an HTTP library; you're just saying that the differences between all the different TLS libraries out there might have security implications, and it's better to abstract that stuff away so you don't have to worry about it?

Exactly. And I can give you a very nice example of that.
Back in, I think it was 2015, Node.js decided to embed CA certificates in their own code repo as part of their TLS implementation; call it an implementation detail of where all those TLS certificates come from and how you update them. At some point it turned out that they were shipping certificates that had expired and were no longer in use; the actual servers that clients wanted to talk to had switched to different CA certificates that were not present in Node.js. That implementation detail caused a lot of pain for the Language Engineering team back then: they were trying to figure out what on earth was going on and what those very cryptic encryption errors they were getting meant. And all of that was just because Node.js decided to do things their own way at the TLS level. That goes for all libraries: they take decisions at various levels about TLS and SSL, and it's difficult to keep up with TLS these days; it advances quickly.

So, for internal services talking to services, if they don't have it right now... I guess if we wanted to POST an HTTPS something to EventGate, we would go through the nginx termination, because the app itself doesn't have any of that in it. But this would abstract that, so that all Kubernetes pods would just expose something that would at some point also be TLS, is that right?

Yeah, that's more or less it. If you want to imagine the implementation, it's an nginx that does the termination and an nginx that does the initiation; that's one way to implement it. And each nginx server doing the termination is deployed alongside each app; it's called a sidecar, the terminology is a sidecar container. A rough sketch of the idea follows.
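A hypothetical sketch of that layout in pod terms (image names and ports are made up): the app container only ever speaks plain HTTP, and the proxy container in the same pod handles the TLS side.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: changeprop
spec:
  containers:
  - name: app
    image: docker-registry.wikimedia.org/wikimedia/changeprop:<tag>
    ports:
    - containerPort: 8080      # plain HTTP; the app itself never sees TLS
  - name: tls-proxy            # the "sidecar": terminates/initiates TLS next to the app
    image: docker-registry.wikimedia.org/nginx:<tag>   # illustrative proxy image
    ports:
    - containerPort: 4443      # TLS, what the rest of the cluster talks to
```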
Scott had a question in the chat; let me read it out for everyone, for the people who might be watching the stream as well: this may be a naive question, but how is that as opposed to code handling this process? For example, if we wanted to spin up a local dev instance with a representative subset of articles for testing?

The pipeline has no provision for that currently, and while some parts of it have been discussed, a staging cluster among them, the data that you need for testing, or whether you need something really recent, would I guess be up to the developer, wouldn't it?

Okay, we have two minutes left. Any last questions? Scott, are you talking? You're muted, Scott; we can't hear you if you're talking. I guess not. There are no other questions on IRC... there is one more: if a pod keeps failing, will the respawn cadence stop trying? Is there a respawn limit, in order to prevent the domino effect?

Yes, there is. If a pod keeps on failing, eventually the Kubernetes kubelet will put it in what is called a CrashLoopBackOff state. The back-off stands for backing off exponentially: it will not try to restart the pod for some time, and then for even more time, and every single subsequent failure adds even more time between the various retries. So yes, there is no domino effect in case something keeps failing very, very quickly.

Okay, we have one minute left. I might try to answer Scott's continuation of that question there. For development environments specifically, you might be able to do something with your Blubber files to build the development environment images separately, differently than you would the production ones, such that they would actually pull down whatever test data you might need to run your development environment; those would be different images from the ones used in production. Also, as a response to that: this is something that falls under the local dev project, which is still in its early days, so it's something that's being kicked around, but we don't have any real plans for it. It's not officially on the roadmap, but it is a nice-to-have. Some repos to check out: the local-charts repo has the current, very alpha, local development environment. So yeah, I'm sure we'd appreciate folks kicking that around a bit and giving suggestions and feedback.

All right, thanks, everyone. Thanks, Alex, for the presentation; that's a lot of information, I'm still trying to digest it, and I'm looking forward to using it in our new system. Thanks to everyone for the questions, and thanks, Brandon, for setting this up. Thank you. Thanks, everyone.