Thank you. Good morning. I hope you're all enjoying the conference so far. My name is Michiel Rook, I'm from the Netherlands, and for the next 45 to 50 minutes I'll be telling a little story. I'll be talking about a project that took almost a year, which means I'll have to compress it a bit. What I'll be talking about may or may not apply to whatever you're doing daily. If it does, great; if it doesn't, maybe we can think of ways to get it there. This worked on one particular project, but it may not work on yours. So that's the disclaimer I want to start with.

A little bit about me: I'm a Java, PHP and Scala contractor based in the Netherlands. I do some international consulting as well, but mostly in the Netherlands. I train and coach teams, and I do some speaking, such as at this conference. I'm part of a little company called make.io, where we coach, consult and deliver technical products around continuous delivery and continuous deployment. I'm also part of a group of Dutch and Belgian web freelancers called the Dutch Web Alliance. If, after this talk, you want to shout complaints or general nastiness at me, that's my Twitter handle. Compliments are also welcome, of course.

Let's get going. I'll talk a little bit about the background of this story, how it came to be and where it all took place; the approach we took; then a little bit about the process and standards we were setting with the team and as a team; a little bit about build pipelines; and to close off, the lessons learned, of which there were a few.

This all took place in 2014 and 2015 in Amsterdam, in the Netherlands. Specifically in the north of Amsterdam, which, as you can see, has a beautiful view over the river that runs right through Amsterdam, the IJ. The company is part of De Persgroep, a Dutch-Belgian conglomerate of newspapers and news sites. They also operate a few job boards, and one of the biggest is pictured there: the national job site for Dutch jobs, one of the most visited job sites in the Netherlands.

As I said, this all took place in 2014 and 2015. This was a traditional organization: lots of waterfall-ish development processes, and that had to turn around. They were dealing with a system called San Diego. The relation to the actual city is completely coincidental, by the way; they just needed a name, apparently. It's called San Diego, also known as the big ball of mud, the big ball of yarn, the big ball of nastiness: a large, legacy, monolithic application which happened to generate significant money for the company, read: millions of euros. It was also very slow, very complex, hard to maintain, with lots of technical debt. This came about through a history of takeovers and mergers within the company that led to other code being ingested into the system, plus quick fixes and hacks; you probably know what I mean by that. So it was a big, relatively messy code base where simple features took a lot of work to deliver. The team dealing with this application also had relatively limited confidence in it, so they became very careful when changing it.

Now, San Diego looks a little like this. At the top we have the three job sites this company offers, and they all go through the same set of load balancers, Varnishes, etcetera. From those load balancers you end up on a front-end server, which connects to a back-end server, or multiple back-end servers.
The problem is that they run the exact same code; the only way to distinguish between the two is a flag that says "you are front-end, you are back-end", but they run from the exact same repository. Sometimes they talk crosswise, or in depth, to each other: part of it goes through an API, part of it through direct database calls. So it was a little bit complicated, a little bit messy. And at the bottom there's a bunch of external services, like databases and Solr and other things, that this system connects to.

Now, the release process for this application was long and consisted of large amounts of downtime, so releases were scheduled over the weekend. The problem with a job site is that, in general, people who are looking for a job don't look during the working week; they look for jobs at night and in the weekend. And the people who actually put the jobs on the site, recruiters and companies, only do that from nine to five on working days. So taking the site offline during the weekend means that the paying customer, a.k.a. the recruiter, doesn't notice, but the people who actually use the site to improve their life, to find a new or better job, well, the site is offline, so they can't. And this happened regularly: every four or five weeks the site went down, for a few hours at best and one or two days at worst. And the days after would be firefighting, right? Firefighting the release that was just deployed.

So, as I mentioned: infrequent and manual releases. The tests this project had, if they existed at all, were extremely fragile, falling over for no apparent reason, and slow. Not really trustworthy. Development velocity of the team was low. Apart from firefighting new releases, there were frequent outages and frequent issues, performance problems all across the board, and a team that was becoming more and more frustrated with the system. And, like I said, they had low confidence modifying that existing code.

With all these problems put together, management and developers, everybody really, started thinking: how do we improve? Management set a few goals. One of the most important, of course: reduce the number of issues, the number of outages, the time the team spent firefighting. Also very important: reducing lead and cycle times. Lead time and cycle time are almost, though not quite, interchangeable, but basically they mean the time it takes for an idea to be put into production, to actually be usable by the end user, by the customer. In San Diego's world that could take three or four months: a long development process, a team that had low confidence in modifying the code, and then a release schedule of every four to five weeks. Combine all of that and you have a cycle time, or lead time, of easily two to three months for simple features. Which, in the current climate, at least for job sites, is unacceptable, right? Other companies are moving so much faster, so you have to keep up with them, purely from a competitive standpoint. Increasing productivity was also very important: getting the team to a higher velocity and getting them in control again, but also increasing the motivation of those developers, and not only the developers but everybody involved with that system.

Now, given these four goals, the question becomes: what do we do? We can either refactor the current system, or we can rebuild. By refactoring
I mean piecewise pulling pieces out of the code, making them better, and improving that way. They started doing that, and after three months of very enthusiastic refactoring they had a test suite with 2.5% code coverage. In terms of percentages this was a three- or four-fold increase over what they had, so that's pretty spectacular, right? But in absolute terms it's not really useful, 2.5%, and it also means that if you extrapolate that line, it will take years to get to some sensible number like 70 or 80 percent. So that was obviously not going to work.

Something like a commercial off-the-shelf product was also considered. The problem with that is that you can buy a job site somewhere, and you can probably modify a logo or a few colours here and there, but this company is built around the job site, so it needs to innovate on the job site. When you buy something off the shelf, the ability to innovate is greatly lessened, maybe even zero. So that was out of the question too.

Then a cut-over rewrite, or rebuild, was considered, where we essentially build the new system like the old system but with a better design, and at some point we flip the switch. The problem with that is that you rebuild all the bugs and all the problems of the old system, because over time they have become features, right? People expect it to work that way because it has always worked that way, even if it was wrong. And a lot of the why, the reasoning behind certain features and certain code, has been lost; it's stuck in a drawer somewhere, or the people who thought of it have long since left the company. So you don't really know why things are the way they are. So that was out of the question as well.

The approach we did take is something called strangulation. I don't mean the physical act of strangling, but there really is a resemblance, and I'll show you in a little bit what I mean by that.

API first: instead of doing all sorts of weird internal procedure calls, we consider everything an API, and we eat our own dog food. So if we open up APIs to other people, other parties, other companies, to inject jobs for example, we use the same APIs internally as well. Otherwise we will never get those APIs to the level where they're actually useful, right? So: consume your own APIs. We have services per domain object; in this case the domain objects would be a job, a job seeker and a few other things. The services would not necessarily be microservices, a little bit bigger than that, but small enough for our case. And we started migrating individual pages, page by page on the site. What do I mean by that?

The strangulation pattern: on the left we start with the monolith, which connects to a database and is exposed to the internet, so it can be accessed from the internet. Step two: we add a proxy between the internet and the monolith. Initially that proxy doesn't do anything; it just passes the traffic one-to-one to the monolith, right?
But then we start adding a new service, and that could be a front-end as well. It has some functionality: we develop some feature in that new service, and at some point it's ready. Then we put a switch in the proxy so that if you access that page, whatever that service implements, we route the traffic to the new service, and the traffic doesn't end up on the original monolith anymore. And then we build services, and more services, etcetera, etcetera, until at some point the monolith is not actually doing anything anymore. It's not receiving any traffic; in effect, it has been strangled.

The strangulation pattern is named after a fig that grows on another tree and basically constricts, strangles, the original tree, its host, until the host can no longer survive and withers and dies. All very positive again, but in this case that host would be our monolith. So at some point we have enough functionality for the monolith to be obsolete, and we can throw it in the bin.

The proxy could look something like this; in our case it was an Apache rule (a reconstruction of roughly what that snippet looked like follows after this architecture overview). What it says is: for people on our own internal network, let's say that IP range, we rewrite everything under /feature/ to the new service, and the rest goes to the monolith. At some point, if we're happy with how that is performing internally, we can remove that condition and let everybody see it. And we iterate and iterate like that.

We also said that every service we have needs to be scalable, so it needs to sit behind the load balancer, so we can easily scale the number of replicas per service up and down. Unfortunately we do need to be able to access some legacy databases and convert that data. We do continuous deployment from day one; what that means I'll explain in a little bit. Everything is a Docker container, so Docker containers all the way: every service is a container. And, as I mentioned before, the front-ends are services as well. They expose an API, they're accessible through HTTP, and in turn they use APIs to access other services. So the front-ends are at the exact same level as the other services, and we treat them as such.

That leads to this architectural overview, where we still have the three sites, but now, instead of a shared group of front-end servers, each site is served by its own front-end service. Those in turn connect to internal services using APIs, which have access to their own data storage, search engines, what have you. And San Diego is off to the side; it's actually in a different data centre. All the new stuff is on Amazon in this case, and we have a tunnel between the two networks, so if we need to access legacy data or legacy code, we can reach it through that tunnel.
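The Apache rule I mentioned isn't reproduced in this transcript, so here is a minimal sketch of what such a rule could look like, assuming mod_rewrite, mod_proxy and mod_proxy_http are enabled. The host names, the /feature/ path and the IP range are made up for illustration; they are not the actual configuration.

```apache
# Hypothetical reconstruction of the routing rule described above:
# visitors from the internal office range get the new service for one page,
# everyone else still hits the monolith.
<VirtualHost *:80>
    ServerName www.jobsite.example

    RewriteEngine On

    # Only requests coming from the internal network (example range)...
    RewriteCond %{REMOTE_ADDR} ^10\.1\.
    # ...and only the page the new service implements are proxied there.
    RewriteRule ^/feature/(.*)$ http://new-service.internal/feature/$1 [P,L]

    # Everything else is passed to the monolith unchanged.
    ProxyPass        / http://sandiego.internal/
    ProxyPassReverse / http://sandiego.internal/
</VirtualHost>
```

Removing the `RewriteCond` line would be the moment the new page goes live for every visitor, which is exactly the "remove the condition and let everybody see it" step.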
So now that we have an architecture in place, it's very important to set a process. The process in this particular project was kick-started by a few external consultants, of which I was one, but the team was involved from the very, very beginning, because team acceptance is key for something as dramatic as this. So they were involved from the start.

One of the process changes we agreed on is that everything is going to happen continuously, and by everything I literally mean everything. We go from a project-type development life cycle to a product-type development cycle. A product-type cycle only has a start; the only end you have is when you say "okay, my product is now obsolete and I no longer offer it", which in this case would mean the company no longer exists. So there isn't an end like you have with a project, where you run a six-month project and then another six-month project. No, this is a product, so it keeps evolving continuously. Everything we do is continuous; the name of this talk is, after all, "the road to continuous deployment".

Now, about that particular phrase, continuous deployment: there's some confusion about what it actually means, and I'll give my view today. There are differing views, and that's all fine, but I'm going to give mine.

CD starts with continuous integration. This is an older pattern. It basically says: we have a developer somewhere who checks some code in to GitHub, GitLab, whatever, and some process starts building that code. In PHP terms, building may not mean compiling like it does in Java or Scala or other languages, but building could still mean minifying your JS or your CSS or combining other assets, linting, syntax checks, all those things. Those all happen in the build step, and of course running tests happens there as well. For CI you can use Jenkins or Travis or Circle or any other tool; they basically all do the same thing. At the end you have an artifact. It may be a PHAR archive or a zip or a tar or something else that is deployable, but CI doesn't actually do anything with the artifact other than generate it.

Continuous delivery then takes the artifact and automatically deploys it to some acceptance environment. On the acceptance environment we can do manual checks, click through it, verify the product, and then at some point we deploy to production. Deploying to production should be automated, in the sense that we have scripts which can deploy to production automatically, but, as you see with the red arrow there, the actual invoking of the script, the triggering, is a manual action. So there's still a human involved. We deploy to acceptance automatically as soon as the artifact is completed in the build stage, and then at some point somebody says "okay, this is good enough", stomps on a button somewhere, and the deploy to production happens. That's continuous delivery. What continuous delivery also states is that code should always be deployable, and by that it means, of course, that you cannot break your master branch or your trunk; it should always be in a deployable state.

Continuous deployment, then, is what some consider the holy grail: all the arrows are green.
There's no human involvement anymore. Basically this means that from the build stage we deploy to acceptance, which at this point loses its meaning, the name at least, because there's nothing to accept other than automated processes doing the accepting. Depending on your automated checks it may be called staging, or pre-production, or "functionally equivalent to production", whatever. If that all goes well, everything comes up correctly in staging or acceptance and works, we then automatically deploy to production. No human involvement anymore.

The idea is that you automate this from start to finish, and my personal wish, my personal metric, is that it should take no more than 20 minutes from the first push to GitHub to being in production. If you make the process that short, it takes the excitement out of releases, out of deploying: if it happens that often, it becomes routine, right?

Now, why would we do this? One of the things about continuous delivery and deployment is that it allows us to take small steps. If we can release every single commit, and the commits are very small, we don't commit entire pages of code; every step we deploy is tiny. It also allows us to get early feedback on whatever we're working on. Remember I was talking about the product life cycle: we have an idea, let's put this feature in front of 50 percent of our users, and we don't want to wait for a release cycle to get us there in the next four, eight or sixteen weeks. Because that would mean we get feedback which is (a) not early, and (b) influenced by all the other things that are part of that release, right? It becomes impossible to untangle those variables. Then: reducing time to recover. If you have a pipeline which deploys to production in 20 minutes and you have a serious issue, a bug in the code, and it was, for example, the last commit that broke it: revert that last commit, and 20 minutes later everything is fine again. And last but not least, experimentation. The product life cycle and product ideas, combined with early feedback, lead to experimentation. A product manager or product owner thinks: if we try this, move this button there, or change this on the page, whatever you can think of, how do our consumers react? This is something Netflix and Amazon do constantly. If we change the background of the movie title to another colour, do people click on it more or less? If we do this, do people react one way or another? That's very important about continuous deployment: it allows you to experiment with product ideas.

Teams that do continuous deployment generally show these incredible statistics: they deploy 200 times more frequently than teams that don't, they recover from failures 24 times faster, their change failure rate is three times lower, and, the most impressive one, they have 2,500 times shorter lead times. That means that instead of three months, we end up with a week of lead time, tops.

Now, one of the things we said early on in this team is that we do TDD, as a rule. Who here regularly practices TDD? That's roughly what I expected; it depends a little bit on the audience how many hands I see, anywhere from 25 to 50 percent. I rarely get above 50 percent.
So that's par for the course. BDD, then? Yes, that's also what I usually see: 40 to 50 percent of an audience say they regularly do TDD, and BDD is 30 to 40 percent of that. I'll explain in a little bit why this is extremely necessary.

Now, I talked about pushing commits and deploying commits. One of the rules we set is that every commit goes to production. By every commit I mean it could be two or three commits, depending on when our Jenkins, in this case, starts; there can be a small window, a small delay, so two or three commits may be grouped, but it's not going to be more than that. And assuming all the tests are green, etcetera, those commits can end up in production within 10 to 20 minutes.

That requires a few things within the team, and what they discovered quickly was this: only commit to master. By only committing to master, or trunk if you will, I mean we don't use branches. Ever. Bring out the pitchforks: when I say "don't use branches, ever", I always get reactions ranging from "sure" to downright hostile, which is why I included the pitchfork image. But I'll explain why. There's one nice image of what branches do for you, but what branches do for you in real life is delay integration. By delaying integration I mean: depending on the lifetime of your branch, it takes a while for it to merge back to master, and only at that point do you run tests or run a deploy. And we try to take small steps, which is impossible if you have long-lived branches, because they end up as one single commit, a merge commit, which could be pages of code, right?

What I also think about branches, and this is going to sound a little bit harsh, is that they are an abuse of version control for functional separation. By that I mean: feature branches are used to separate functionality. The product owner says "okay, this feature is now ready, it can be merged to master", which is abusing version control for the act of separating different pieces of functionality, or releasing functionality in a certain sequence. I'll give an alternative in a little bit. No branches also means no pull requests, because those are branches too, no matter how short-lived; hopefully a pull request exists for only a few hours, depending on whether somebody is ready to review it, but it is still a branch. There is one exception, and one exception only, and that's work in progress, or prototyping. Prototypes are of course thrown away after we're done with them, but they can be put on a branch if other people on the team need to see them.

Now I hear you think: okay, no pull requests, how on earth do we do code reviews then? Pair programming. I would not advise doing it like this, that's not particularly effective, but still: pair programming. We don't enforce this, but we encourage it. The team learns very quickly to pair program, because it is a continuous code review. You put two people together, and hopefully it's one person who is a little more experienced combined with one who is a little less experienced, so you get knowledge transfer as well; you mix experience levels. But you get a continuous code review: instead of one developer hacking away on his or her laptop, checking it in, opening a pull request and another developer reviewing that pull request, you bring those two people together and improve the code in line, as it's written. This is a lot faster and leads to emergent design.
That's a lot leaner than it would be with branches or pull requests. It requires discipline, though, and not everybody is comfortable continuously sitting next to another person, but that's something you can get used to; at least, that's my experience. It also means that we pair on everything, and by everything I mean scripts for automation, server setups, all those things, because it all becomes a team responsibility. I'll come back to that in a little bit.

Focusing on value: I talked about the strangulation pattern and how we build up new services to strangle our monolith at some point. We want to deliver value for the business again, increase their confidence in us again, restore the cooperation. That means we focus on value creation and not just on moving parts out of the old system and copying them verbatim; we focus on new features, and new features are only developed in the new system. We don't touch the monolith unless we absolutely, positively have to, because it's still generating money, but in principle new code goes only into the new system, and we focus on value.

Feature toggles are an incredibly powerful tool, not only to separate functionality but also to do A/B testing. On separating functionality: I talked about feature branches, and people using feature branches to sequence functionality. Feature toggles allow you to do the same, but without interfering with the purely technical act of deployment. Deploying is a purely technical exercise; it has no relation to the business, because it's just releasing software. If you put new stuff that shouldn't be seen by a customer yet behind a feature toggle, and a feature toggle is essentially an if statement, they will not see it, but the code will actually be integrated, tested and deployed. And at some point, when the product owner says "okay, now it's time to go", we flip the feature toggle and customers start seeing that new feature. At which point we remove the feature toggle, of course. We only use this on specific things: systems or functionality that a customer would not see anyway don't have to be behind a feature toggle, and in some cases it's not a problem for new functionality to be seen early.

It also means we can do A/B testing. Say I have a new version of our search page, and I want to check whether the metrics for that page go up or down: the number of users on the page, the time they spend on it, the number of clicks, all those metrics. Initially I let, say, 10 percent of my visitors get the feature toggle enabled, and then we watch the metrics, and if the metrics are not worse than they currently are, we increase the percentage on that feature toggle and let 50 percent of our traffic see the new feature, and then 100 percent. And then we remove the feature toggle, and at that point the whole old implementation, in the monolith for example, is obsolete and can be thrown away, cleaned up.
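Since a feature toggle really is just an if statement, here is a small sketch in PHP of what such a toggle with a percentage-based rollout could look like. The class, the toggle name and the bucketing scheme are invented for illustration; the actual project had its own implementation.

```php
<?php
// Illustrative only: a toggle is an if statement plus, for A/B testing,
// a way to put a stable percentage of visitors in the "new" bucket.

class FeatureToggles
{
    /** @var array<string, int> toggle name => percentage of visitors (0-100) */
    private $percentages;

    public function __construct(array $percentages)
    {
        $this->percentages = $percentages;
    }

    public function isEnabled(string $feature, string $visitorId): bool
    {
        $percentage = $this->percentages[$feature] ?? 0;

        // Hash visitor id + feature name so the same visitor consistently
        // sees the same variant while the rollout percentage is unchanged.
        $bucket = abs(crc32($visitorId . ':' . $feature)) % 100;

        return $bucket < $percentage;
    }
}

// Start with 10% of traffic; later raise this to 50 and 100, and once the
// feature is fully live, delete the toggle and the old code path entirely.
$toggles   = new FeatureToggles(['new_search_page' => 10]);
$visitorId = $_COOKIE['visitor_id'] ?? uniqid('', true);

if ($toggles->isEnabled('new_search_page', $visitorId)) {
    // render the new search page (served by the new service)
} else {
    // render the old search page (still served by the monolith)
}
```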
Another process rule: the Boy Scout rule. Who here does not know the Boy Scout rule? The Boy Scout rule basically says: leave the campsite in a better state than you found it. Translated to code, that means that if you see something that can be refactored in a reasonable amount of time, you should do it right then, because if you don't, you get something called the broken windows syndrome. The broken windows syndrome: you have one broken window in an apartment complex, and okay, it's only one broken window, we can handle that. But if it doesn't get handled, a second broken window appears, and the complex starts to look shabby, and then a third and a fourth, and at that point people don't care anymore. The outside of the complex now looks shabby, and people consider the complex itself to be of that quality as well. So if you want to avoid the broken windows syndrome, every single time a window breaks, you fix it quickly.

Quality is a precondition for speed. If you don't have quality, you can't go fast. Yes, you can go fast for a limited amount of time, but you will hit a brick wall pretty quickly. If you are careful about quality, the effort you spend can keep growing roughly linearly with the complexity of the features you have, rather than exploding. In PHP, one of the quality gates we have is a simple syntax check, a simple linting check, because we don't want our customer to see a fatal error somewhere because you removed a method or forgot a brace or a semicolon, right? But it also means that we have tests in all shapes and sizes, and that we have code coverage at a level we can trust.

In this project, and this is actually something the team came up with themselves, we had a very hard definition of code coverage: 100%. Note the asterisk: because this was a PHP project. If it were Java or Scala, we would probably get away with 80 percent, something like that; that's a sensible number. The reason is that there you have a compiler which helps you. If I remove a method in PHP and no test catches it, I will find out at runtime, and that's too late. In a compiled language my compiler will hopefully warn me about that. But in this case we set a hard gate at 100 percent, which means that if the coverage drops to 99.9 percent, the build fails. This gives us a safety net. There are exceptions, of course, very, very small ones, which can be annotated with a code-coverage-ignore, but every single instance where that might be needed is a conscious decision. Whereas if you aim for 80 percent, as long as you're still around 80 percent you don't really know whether the code you're not covering is essential, or a stub, or something trivial. With 100 percent, if you drop to 99.9 and that 0.1 turns out to be trivial code, you can make a conscious decision to ignore that particular piece of code. But it becomes a conscious decision.
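As a sketch of that escape hatch: PHPUnit lets you annotate the rare, genuinely untestable spot, so the exclusion itself is explicit and visible in review rather than hiding inside a comfortable 80 percent margin. The class and URL below are made up for illustration.

```php
<?php
// With a hard 100% gate, every exclusion is a deliberate, visible choice.

class LegacyGateway
{
    /**
     * Thin wrapper around an external legacy system that cannot reasonably
     * be exercised from a unit test; excluding it is a conscious decision.
     *
     * @codeCoverageIgnore
     */
    public function fetchLegacyJobs(): string
    {
        return (string) file_get_contents('https://sandiego.internal/export/jobs');
    }
}
```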
We also embraced DevOps. DevOps is a very popular term these days; everybody wants DevOps. But what does DevOps actually mean? At this point it's starting to become a container phrase, because people keep adding to it: security, and then QA, and then biz, and then net, and then sys. What I mean by DevOps is dev-star-ops, and dev-star-ops basically solves this problem: it means we no longer have walls between us.

Instead of us generating an artifact and throwing it over the wall, the ops team taking it and deploying it somewhere on a production system that we as developers are not allowed to touch, and then it breaks, and okay, why did it break, well, because of x, y, z, and ops doesn't even know about that or doesn't have insight into the application, so they only have the logs, except we didn't actually put that in the logs, so can we get the logs, can we access the server, no you can't, because only we are allowed to access the servers... and this ping-pongs back and forth a few times, and the end result is that we're slower than we could be. If instead we integrate those people into the same team, we foster responsibility: responsibility for a product that resides with one team, not multiple teams, one team. And if it breaks, somebody on that team gets paged. It doesn't matter whether that's ops or dev or QA or somebody else; somebody on the team gets paged.

The funny thing about that is that developers don't like being paged at night. Go figure. Most people don't like being paged at night, but usually ops people have that in their contract: they do night shifts and standby and all those things. They expect stuff to break; sometimes they're even sad when things don't break. Go figure. So what happens when you integrate all those people is that suddenly a developer gets paged at night, and he's like, "why on earth is this paging me?", and the ops person says "yeah, this has happened every day for years", and the developer says "why? well, it wakes me up". So immediately we start getting rid of that. That's the effect DevOps will have: we get rid of this problem, because it hinders us. It also requires a culture change: a culture where people actually talk to each other to solve problems, rather than pointing fingers everywhere. Enough about DevOps.

I talked about continuous everything, and one of those everythings is monitoring. And with monitoring: if you can put it on a dashboard, then put it on a dashboard. Things like "how is my system performing", which are technical metrics: CPU, RAM, load, etcetera. Log files: centralized logging, an ELK stack, Elasticsearch, hopefully with some sort of correlation ID so you can tie together the requests within and across services and see the life cycle of where things start to go wrong. Stack traces, everything your app produces: make it searchable and centralized, so we don't have to log in to an individual server and dig through log files.

All these things lead to a build pipeline, and a build pipeline should be all about automation: automating whatever is repeatable. Why do we automate?
Because if we let humans do the work, mistakes start to slip in. Give 10 tasks to 10 humans and they will do them in some particular way; give the same 10 humans the same 10 tasks the next day and you will already see variations, because of lack of sleep, problems at home, a little bit too much to drink, etcetera, all reasons that an automated script or process doesn't have a problem with. It will do the same thing every single time. So we automate building, testing, deploying, orchestrating (getting services together), and initializing servers and configuration. All the things we used to do manually, like booting up some VM and doing a bunch of apt-get installs, we automate, so that we can throw things away and recreate them.

Continuous everything also means continuous testing, and testing in depth. By in depth I mean that we start with unit tests, which is where TDD comes back, and then move up to integration tests, acceptance tests and UI tests.

A unit test could look something like the PHPUnit code on this slide (a sketch of such a test follows below). What that code does is test a single unit, a single object, a single class, and all the dependencies that object has are mocked away, so that we can control the world around the object. Integration tests, then, are where we test components together: we actually test the life cycles of objects, objects talking to each other. We may connect to an actual database with test data fixtures, or we test using, say, SQLite instead of MySQL, with fixed, known test data that we can assert on. Acceptance tests are where BDD comes in. With BDD we create scenarios using a syntax called the Gherkin DSL, and the Gherkin DSL always follows the same flow: given, when, then. Given the world is in a certain, predefined state; when something happens; then a verifiable outcome should be detectable. These scenarios are the result of stories, of acceptance criteria; they are the result of examples, of edge cases. They are actual examples of functionality, and they are implemented by code in the background, but what we write down is our domain language, the ubiquitous language; we don't write code at this level anymore. It's an acceptance criterion, it's business language. This can be done using Behat or phpspec, for example.
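Here's a rough sketch of the kind of unit test I mean, in PHPUnit style: one class under test, with its collaborator replaced by a mock so the test fully controls its surroundings. The `JobService`, `JobRepository` and `Job` classes are invented domain classes for this example, not the project's actual code.

```php
<?php
// Assumes hypothetical domain classes: JobRepository (interface with a
// findAll() method), Job (title + published flag) and JobService (under test).

use PHPUnit\Framework\TestCase;

class JobServiceTest extends TestCase
{
    public function testReturnsOnlyPublishedJobs(): void
    {
        // The dependency is mocked away, so we control the "world".
        $repository = $this->createMock(JobRepository::class);
        $repository->method('findAll')->willReturn([
            new Job('PHP developer', true),
            new Job('Scala developer', false),
        ]);

        $service = new JobService($repository);

        $jobs = $service->publishedJobs();

        $this->assertCount(1, $jobs);
        $this->assertSame('PHP developer', $jobs[0]->title());
    }
}
```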
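And an acceptance scenario in the given/when/then flow of the Gherkin DSL could read something like this; the wording is illustrative, not taken from the project's actual feature files.

```gherkin
Feature: Job search

  Scenario: A job seeker filters search results by location
    Given 3 jobs in Amsterdam and 2 jobs in Rotterdam have been published
    When the job seeker searches for jobs in "Amsterdam"
    Then the search results contain exactly 3 jobs
    And every result shows "Amsterdam" as its location
```

Each of those steps is backed by a small piece of code in the background (a step definition, with Behat), but the scenario itself stays readable for the business.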
And then, last but not least, we do UI tests, using something like Selenium or PhantomJS or Protractor, what have you: you drive a browser, or a fake browser, so you can also detect JavaScript issues. The potential issue is that speed and stability can be a problem. Selenium is not very fast, and PhantomJS and Protractor have some stability issues here and there, so we don't want to overuse them.

Now, our manual tester, our QA person from the previous process, will see something like this and think: what is my role going to be? "I used to click through an acceptance environment on my own and verify whether things were ready for release." The funny thing is that the tester is actually going to be more important than ever, but don't tell them that. Because the tester becomes part of the three amigos, and the three amigos are basically a business representative, a development representative and a testing representative. These three people combined should have enough knowledge of the business, the system and the edge cases to come up with a reasonably correct formulation of a story. They won't design the actual implementation, that would be too soon, but they can think of possible ways of implementing it, possible problems, possible issues, edge cases, all the things we individually could not come up with; as three amigos, together, we can. So the tester becomes extremely important.

All these things together lead to what we call the testing pyramid. Recently there's been some literature arguing that the testing pyramid is wrong, and at some point it does become wrong, but for now it's good enough. The testing pyramid basically says that at the bottom we start with the cheapest and fastest tests, which are unit tests, by definition, unless you do very slow things in them; in general unit tests are the fastest and the cheapest, so we have the most of them. Then we have a few integration tests, which are slower and more expensive; acceptance tests, slower still; UI tests, almost the slowest; and smoke tests, which check whether your application was deployed successfully, are the slowest, because they require an actual running system, an actual deploy. On top of that sits the tester with exploratory testing: clicking through and verifying the critical paths of the system from time to time. And then there's monitoring, because with testing nothing is ever watertight or 100 percent bulletproof, so you need something to alert you if, after a deploy, things start going wrong: a performance issue we did not detect, a load problem somewhere, an increased error rate after an hour or so of running. That's hard, if not impossible, to detect with tests, so you need something that verifies behaviour after the deployment.

A pipeline can be written like this: pipeline as code. This is Jenkins code, by the way. The nice thing about pipeline as code is that you can co-locate it with your actual system, with your product, in your repository, and we can throw Jenkins away and re-initialize it from that code. We don't have to click things together anymore; it's all authored this way. This code runs a few stages: we run the tests first, then we build a Docker image and push it to a registry somewhere, and then we deploy to staging and lastly to production. If one of the stages fails, the entire build stops; it's a sequential pipeline in this case. The Dockerfile could look something like this, very simple; I'm not going to go into Docker, that's for another time. I'll include sketches of both below. And then we start deploying. This is only one way, one strategy, to deploy; there are many, many more: canary releasing, blue-green. Look them up when you have time.
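A rough sketch of such a pipeline-as-code file, with the four stages just described, might look like this in declarative Jenkins pipeline syntax. The stage names, image name, registry and deploy script are placeholders, not the project's actual pipeline.

```groovy
pipeline {
    agent any

    stages {
        stage('Test') {
            steps {
                sh 'composer install --no-interaction'
                sh 'vendor/bin/phpunit'
            }
        }
        stage('Build image') {
            steps {
                // Build and push a versioned Docker image of this service.
                sh 'docker build -t registry.example.com/job-service:${BUILD_NUMBER} .'
                sh 'docker push registry.example.com/job-service:${BUILD_NUMBER}'
            }
        }
        stage('Deploy to staging') {
            steps {
                sh './deploy.sh staging ${BUILD_NUMBER}'
            }
        }
        stage('Deploy to production') {
            steps {
                sh './deploy.sh production ${BUILD_NUMBER}'
            }
        }
    }
}
```

Because the stages run sequentially, a failure in any stage stops the build right there, which is exactly the "if one stage fails, the whole build stops" behaviour described above.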
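And a "very simple" Dockerfile for one of the PHP services could be as small as this; the base image and paths are illustrative.

```dockerfile
# Each service ships as its own container image.
FROM php:apache

# Copy the application (including its vendor/ directory built in CI).
COPY . /var/www/html/

# The container serves HTTP on port 80; the deployment step waits for this
# port to come up before adding the container to the load balancer.
EXPOSE 80
```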
This is what we call the rolling update. A rolling update starts with pulling the image, our Docker image, from a registry somewhere, and then we start a new container based on that new image. We wait for it to come up on a port, which generally all services do, or most of them anyway, and then we run our smoke tests, a.k.a. health checks: did the deploy work, did the service come up correctly? We can test that using some endpoint where we know that for this input we expect that output. If that all goes well, we add the container to the load balancer, HAProxy in this case, and it starts receiving traffic immediately. We then remove one of the old containers, from the previous version, from the load balancer, and we stop and remove it. And we repeat this until all the replicas of the old version, the previous build, are gone and we only have replicas of the new version.

It looks a little something like this in Jenkins: we have the four stages, and if one of the stages breaks, it goes red and the remaining stages for that build aren't executed. And if it does break, we want feedback: the siren of shame. The siren of shame is, well, literally that: an LED lamp with a siren on it, plus some audio effects; we used the fog horn and the train horn and all sorts of other sounds to attract attention. That attention is there to encourage developers to fix the pipeline immediately, because that pipeline is the only thing we have to put stuff into production. We don't do things manually anymore, so if the pipeline breaks, our development process breaks; that's an immediate P1, P0: fix it. Initially it's the pair that broke the build; we don't do naming and shaming, but they start working on it first, and if they can't figure it out, they pull in the rest of the team to get that pipeline fixed quickly.

Closing off with some results. This whole project led to a total build time per service of under 10 minutes, and that total time runs from the very push to GitLab or GitHub, it was GitHub in this case, to the last production replica being replaced: under 10 minutes from start to finish. Fifty-plus deployments per day across all the services combined. We significantly reduced the number of issues and outages, and the page load times went from five to six seconds on average down to half a second, sometimes even lower. We improved audience statistics all across the board: more time on page, more pages per session, more people on the site, a better ranking, all those things. We got to experiment, together with the team, with new tech; they dabbled a little bit in Angular, Java, event sourcing, things like that. But most importantly, they had a lot more confidence, a lot more motivation and a lot more fun; the extra velocity was almost a bonus.

Of course, there were some lessons to be learned. Team acceptance initially was not great. Remember, it's a bunch of external people saying "okay, what you've been doing is essentially wrong". They knew this, but there's always a little bit of pushback, which is fine. Pretty quickly, though, everybody got to see that this was a far better way of working, and that it got them out of the hole they'd been in for years. Change is hard in general, and humans are change-averse; usually they like their own patterns, but if you persist and keep at it, it will happen. New technology: not everybody was as experienced with the new technology as we would have liked, but with pair programming that turned around pretty quickly. And this was in 2014,
so Docker was still very unstable; 0.6, 0.7 was the release, I think, so we had some issues with that, but in general it was fine. Stability of the build pipelines: JavaScript testing, mostly. That's actually where we had to drop the 100% code coverage rule, because the JavaScript tests usually broke for unclear reasons. Business alignment: if you as a team start moving this fast and the business is still working the traditional way, then cracks start to appear again in other parts of the company. Feature toggles: I said that as soon as something is live, you need to remove the feature toggle. Well, we had a few feature toggles that were never enabled, so stuff that was never put live either, and if you have too many feature toggles you get a combinatorial explosion, which you don't want. And in the end, not enough focus on replacing the legacy app: unfortunately San Diego is still in production, only a small part of it, but it's still in production, and it could have been out of production by now.

Right, some literature you can read, hopefully in the next few weeks or months: Continuous Delivery, the bible, by Jez Humble and Dave Farley; Building Microservices by Sam Newman; and a book by Matt Skelton and Steve Smith, Build Quality In, which is basically a report on about 20 projects where they implemented continuous delivery and how that worked. And below that, if you want to read more about why branches are evil, trunkbaseddevelopment.com is the way to go.

A little bit of a sales slide here: make.io, we help you get faster, we help you implement continuous delivery and continuous deployment. If you're interested, come visit me, or send me a mail or a tweet, or go to my blog; I write about these things, but also about event sourcing and CQRS. And I would love it if you could leave some feedback on joind.in, on that particular link. Any questions? Yes, sir.

The question: the amount of feedback you're getting for a new web page, or anything else, have you actually got that automated as well? The metrics, you mean? Yes, so that, say, you've released it to 10 percent of your audience, and then you might up it to 50 before you actually release it entirely; is that all automated?

Unfortunately not. So the question was: does the feature toggle automatically increase if your metrics are at the right level? We would have loved to get there; we didn't. If you do, you end up with something called hypothesis-driven development, which is: we think that this change will lead to that outcome, and that's verified, or verifiable, by these metrics, and if so, you can automate the rollout. Unfortunately we didn't get there. Other questions?

Well, you would write down which metrics to check, and for example you could say: we consider the feature successful if the metrics increase by this percentage, or go above that threshold, and if they do, we increase the feature toggle by x percent. That could be a way, so you have some sort of feedback loop from the metrics, from the monitoring, back into your feature toggle setup. Yes, I would definitely consider something like Prometheus or other tools that have hooks for this. Yes, that could work too, even though Nagios is more about system monitoring.

Next question: I was just going to ask about the branching model, where you're always committing to master. Does that mean you're effectively only working on one feature at a time?
Or is everyone working on features at the same time, all committing together? So the question is: if we only have master, are we all working on one feature? The answer is no. In principle, if features are large enough and complicated enough, they can be put behind a feature toggle, and otherwise it's simply small commits towards a new feature. The only thing you sometimes get, if you're working on the exact same piece of code, is a small merge conflict for the individual developer. But in principle everybody is working on the entire product, not just on one single feature.

Do you ever have failing tests in master, or do you make sure they're passing? Hopefully not, because that would mean somebody forgot to run the tests on their own machine. We didn't actually have pre-commit hooks or anything that only lets you push if you've run the tests locally, so that's more a matter of discipline. We trust people to do that, and you learn to do it quickly if the siren starts going off again.

I presume that when you have multiple services breaking down the monolithic application, you're working with different repositories for each service, or the same one? In this case, different repositories for different services, so one service had one repository. What was the secret, then, of managing the dependencies between the services? If service A depends on a feature that's going to be deployed in service B at some point in time, how do you synchronize that? Yes, that's a good point. Vertical development solves most of that: where with BDD you start outside-in, with these things you start inside-out. You start with the lowest service and add a new endpoint there, which at that moment in time is still unused, and then you build the code in the service which uses that endpoint, etcetera, etcetera. Or you could put it behind a feature toggle, something like that. And also API versioning, very important. All right, thanks. Okay, your question.

So, you talked about continuous integration and, well, testing and everything, but you still have a monolith there with very complex algorithms, I suppose, and your new services will at some point have to communicate with the old features. How do you manage that type of situation? By not doing it. That sounds very easy, but we try to avoid communicating with the old system as much as possible; it's the only way. We actually communicate through the legacy database: the legacy database is read, and the data is then transformed into a new, better model. So that's the only point of communication, really. We don't call APIs in the old system; we only use data-level communication.

And how would you, afterwards, because it is also a legacy database, how would you migrate that to a new database, or a better schema? Well, one of the things they started doing after I left is that one part, the job seeker part, started to be implemented with event sourcing. Basically, everything that happens in the old system is written down as an event in the new system, and it's also written down in a new data model. And at some point, when you flip the switch, the new system starts generating those events rather than the old system, and then the old system is obsolete.
Anyway, the old system doesn't respond to those events, but you have the events and you have the new data models, essentially, and you can keep them up to date like that. Does that answer your question? Okay.

I was curious as to how this scales for engineering teams of 200 to 500 engineers. Do you batch commits when it's continuously deployed at all? There are different views on that. I know that Spotify, for example, has something called, I think, release trains, and they deploy, or was it Shopify, it doesn't matter, eight times per day, something like that, and you can attach your commit to a train, and then it gets deployed, and they have some sort of feedback like "these commits were part of that train". But other companies of significant size, Google for example, have a monolithic repository, one repository where all the code lives, and they continuously deploy. So I don't consider continuous deployment to be unscalable for large teams. No, definitely not.

One question here, and then I think we're slowly running out of time. So you said you don't always use pair programming; did you have rules for when it was safe to not pair program? Well, I often hear a comment about pair programming like "trivial features we don't have to pair on", right, but is it always obvious from the start whether something is trivial? A bug fix which looks trivial can turn out to be very complicated, and for regular development it's equally so. So I would say: pair programming, we didn't enforce it, but we encouraged it, also for the simple things. And sure, there were moments where people needed to take a break, like "I need to work solo for a little bit", but we tried to limit that. You find out quickly that this helps a lot, that it improves the quality and your own way of thinking as well. It guards against falling down the rabbit hole; it doesn't make that impossible, but it's less likely because you're pair programming. So people knew that even for trivial things it was valuable.

Okay, it's time to wrap this up. If you have any further questions, catch me at lunch or on Twitter. Thank you so much for your attention, and have a great day.