Hi, my name is Schlomo Schapiro, and I work at ImmobilienScout24, Germany's leading real estate listing portal. If you live in Germany, you have probably already found a new home through our website; if not, come and check it out, we have lots to offer. Here, though, I want to talk about DevOps, and especially about what happens once you have been doing DevOps for quite some time, as we have, and how to deal with risks that are probably different now from the risks you had before you were doing DevOps.

Let's start with a question: who here is doing DevOps? Okay, very interesting, not everybody. So I hope this talk gives those of you who don't do DevOps yet another argument for your bosses why it would be worth checking out the DevOps way of doing things.

This is probably common wisdom: if you take software from planning through development and testing into production, errors happen and need to be fixed, and the cost of fixing those errors changes along the way. Fixing an error in planning is much cheaper than fixing it in production. That's why it pays off to catch errors early, in design. Those of you who run old software, old meaning older than 12 months, have probably already thought about a redesign or been upset about the initial design. An older company like ours, ImmobilienScout24 is now more than 15 years old, runs code that is partially also 15 years old, so a lot of the design decisions we made early on are no longer valid, and we suffer from that. One of our learnings is to try to fix errors as early as possible, not as late as possible, and DevOps doesn't make this easier. Let's look at the development cycle of software in software development.
That's how it looks, at least in our company: a rather long phase of planning, designing, user experience, wireframes and whatever else; after that a shorter development phase; after that testing; and an even shorter rollout into production. This works quite well, and I think it helps the developers to reduce design errors.

In operations we also do software development, and everybody who does operations is actually doing software development, even if they don't call it that. The difference is that usually we have an idea over coffee and then we start hacking, right? Then we put it in production and call that testing, and then we run it, and then we're often afraid to touch it, because we know that if we touch it, it will probably break. I have a long history in operations, so I know what I'm talking about.

If you compare these two pictures, the first thing you notice is that operations actually seems to be more risky than development: in operations we spend less time on planning and designing, less time in the cheap fixing area, so to say, and we go into production much, much faster, where fixing errors, especially design errors, is really costly.

Why does that matter? Well, as a company you obviously don't earn money on broken stuff. If you look at typical outages, typical at least in our kind of business, which is running a website, you can ask yourself: who is to blame for those outages? Who made the initial error, who should have done something differently? We all know that the blame game doesn't help, but in the end it does help to understand what to do differently the next time. I'm not going to go into detail; I'm sure everybody can find themselves somewhere here, me included. I have done almost all of these already, so I'm not ashamed to show them. So what about DevOps?
DevOps in a nutshell is respect and learning, in my opinion, and it goes both ways: both sides, developers and admins, have to respect each other and learn from each other.

The devs can learn a lot from ops about operability, how to optimize software so that it is not only nice to develop but also nice to run. If you look at software development cycles, in some cases you develop something for, let's say, half a year and then you run it for 10 years. So why not optimize a bit more for how to run it for 10 years, and not only for how to program it?

The admins, of course, can also learn a lot from the developers. For example, incremental improvement: start small, do a minimum viable product, see how it develops, improve it further. Or coding instead of hacking, but let's not go into that. Test-driven development is a big thing that is already really established in development, and in operations we are slowly learning how to work test-driven; this talk is also about how to do test-driven infrastructure development. Another very nice thing is code quality. Who is a developer and cares about code quality, hands up? Okay. Who is an admin and cares about code quality? Okay, why the difference? Why is code quality in operations different from code quality in development? It's all about craftsmanship: writing code to be read later, writing for reading, not writing for "it works". Don't write comments, write readable code.
All of this is code quality, stuff that works really well in development, and it works even better in operations. Actually, I believe the reason is that what we develop in operations is more complex than what is developed in development, because we instrument systems and landscapes of systems, very complex things that need to play together and that are often very difficult to test in a sandbox. That's why I think the challenge of infrastructure development is at least as high as the challenge of pure software development.

And my favorite, of course: test automation. Yes, test automation is the only way to solve this problem, because there is a big truth: untested means broken. And another big truth: no tests means legacy. If there are no tests, you don't know how to touch the code, you have to be afraid of touching it, and the only way to fight that fear is with tests and test automation. In our world this is true: untested means broken.

There is a very nice example I can tell you about. We recently did a complete rewrite of our system authentication layer, how the Linux systems authenticate users when they log in, and of course we did that test-driven. And of course we forgot one use case. When the original servers, LDAP, were switched off and only the new servers, Active Directory, stayed available, nothing worked anymore, because we had forgotten about this one little use case that was needed there. We looked into the code and said: well, where's the test? There's no test, so obviously it won't work. Right? It's simple. It's what a developer would always say, and in operations we say it too: no test, no work. So we wrote a test, we fixed the code, and it worked again.
The problem was, of course, that our PAM patching code didn't expect the pam_ldap module to be missing, which just happens if you set up a server without pam_ldap. So our hook for patching the file was missing, so no patching happened, so no login was possible. Actually simple, but again: no test, no work.

So what is this thing about tests? There are a lot of books about tests, what you can do with them and how to write them, and I think for us guys in operations the simple version is enough. The simple version is that there are two types of tests.

Number one is the unit test. A unit test tests the smallest possible component in an artificial environment. Try to think about how to cut away everything that is not needed to test a single feature, a single aspect, a single function. In development this is much more complex and you do unit testing on all kinds of levels, but in operations it's okay to just think about how to strip the problem down. And if stripping it down still means setting up a server and running something, that can be a unit test. In development you would say a unit test must not have any external dependencies and so on, but in operations you have to see what fits the problem.

The other test you need is the opposite: the system test. The system test has the job of testing the entire application in a realistic environment, and also testing it together with other applications, because you need to test the cooperation, the interoperation between different applications, before you roll a change into production. That's all you need to get into test-driven development in infrastructure.

A little overview. Typical properties of unit tests: they run as part of the build process, early on, and they have quick feedback cycles. A unit test should give you an answer within mere seconds, so you can run it after every save of a file, after every code change. A single line changed, you run the test, it tells you yes or no.
That's a unit test. Also very important: syntax checks. Sounds stupid, sounds silly, but hey, how many failures have you had due to a missing semicolon or similar? It happens to everyone, and it is so easy to catch with a test.

On the other side, system tests usually start by installing something on a test server, because you want to test your code in a realistic environment, so you have to install it on a realistic test server that behaves like the real thing. Very important: you run tests from the outside, because you normally use servers from the outside, you use their services, so in the test case you have to do the same. You can of course also run tests from the inside, which is especially useful to simulate error conditions: you remove the network, what happens? Of course you remove the network from the inside, and rsh is a very useful tool for that, because in this scenario you don't need super-duper security, you need super-duper automation, and ssh and automation don't mix well.

And don't forget: a reboot is also a test. The last thing I did in my consulting years was reboot before leaving the customer, because I didn't want to go back to the customer the next morning after they rebooted the server. So yes, rebooting is a test. It doesn't cost much: you do it from the inside, sudo reboot, you wait a moment, you run the standard tests from the outside, and you know whether it's good. And if it doesn't work, you know you have to fix it. That will save you getting up at night one day, or save you buying your admin colleagues a crate of beer.

A few examples from the real world. Those who know us have already heard that we use RPMs for everything. Software, configuration, it doesn't matter: everything that goes onto our servers has to be packaged in an RPM package, and RPM packages have spec files as their master plan.
This is a typical spec file: some preparation, installation into a fake chroot environment, and the files that are then shipped as part of the package. The simplest thing you can do as a unit test is a syntax check as part of the build phase of your package, or of whatever other tool you use to ship stuff to servers. And if you use sudo and ship sudoers files, please syntax-check them like that, because if you don't, you'll saw off the branch you're sitting on: if you have a syntax error in a sudoers file, sudo will refuse cooperation. Even if the rule you would be using is in a different file, sudo stops working completely. Another typical example, which you can find in hundreds of packages in our source repository, is this: bash syntax check, Python syntax check, YAML check. It is very important to have configuration tested before deployment, because configuration is also code, and configuration can break your server just the same as code can. The more you test configuration, at least for obvious errors like syntax errors, the more robust your world and the more resilient your deployments will be, because it means: if it doesn't work, it won't build, and if it won't build, it can't harm my system. There are lots more examples on my home page.
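To make this concrete, here is a minimal sketch of such build-time syntax checks, the kind of thing you might run in the %check section of a spec file. The file names are illustrative, and the sudoers and YAML checks are shown only as comments because they need visudo and PyYAML on the build host:

```shell
#!/bin/bash
# Sketch of build-time syntax checks, as they might run in the %check
# section of an RPM spec file. For illustration we create tiny sample
# files first; in a real package these are the files being shipped.
set -e
workdir=$(mktemp -d)

printf 'echo "hello"\n'  > "$workdir/job.sh"
printf 'print("hello")\n' > "$workdir/tool.py"

# Bash: parse the script without executing it
bash -n "$workdir/job.sh"

# Python: byte-compile to catch syntax errors
python3 -m py_compile "$workdir/tool.py"

# In a real spec file you would also check shipped sudoers files, because
# one syntax error makes sudo stop working completely:
#   visudo -c -f files/etc/sudoers.d/deploy
# and YAML configuration, e.g.:
#   python3 -c 'import yaml,sys; yaml.safe_load(open(sys.argv[1]))' cfg.yaml

echo "all syntax checks passed"
```

If any check fails, the build fails, and a broken file never reaches a server.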
On my home page you'll also find another talk about this topic with a few more examples. The more interesting part, of course, is system tests. A system test, as in this example, tests the entire system in a realistic environment. It's the same as when you have your car inspected every two years: they don't take the wheels off to check them, they check the wheels on the car as it runs on the street, and they put a fake street under it, so that the now stationary car is easier to handle. That's the important thing about system tests: how to mock away the things that are irrelevant for the test.

This is a perfect example of mocking. The car feels like it is driving on the street, it behaves like it is driving on the street, but it is actually stationary in the garage where the test is run, on these rollers. Everything is real, and everything that is irrelevant for the test is mocked away by these two little rollers. That allows the test to run anywhere, anytime, under stable conditions; there are actually trailers with this setup driving around the countryside so that people can check their brakes. The same holds for system tests in IT: you want the system test to be set up exactly so that your code runs as expected, without depending too much on external environments that you cannot provide together with the test.

A little bit about build automation. Of course, nobody runs these tests manually, because then you would be busy testing instead of coding. In our world the build automation looks like this: we have a source repository like everybody else, and a central build automation tool, in our case TeamCity.
Many people use Jenkins, and there are others; even a sophisticated bash script could be enough for the purpose. When a change happens, it gets checked out on a build server, which runs the unit tests, creates an RPM package and uploads it into a dev yum repository. So far, so easy. The next step is deploying that package onto a test server and running the system tests. The unit tests take maybe 20 seconds; this can take several minutes. But if the unit tests fail, I don't need to run the system tests at all. That's the thing about quick feedback: small tests first, big tests later. If the system test was successful, the same RPM package is moved to a production yum repository, and from there it is deployed by the same build automation to our production servers. That way we instrument basically our entire platform, and every change goes this way, from source code to production.

That's actually how DevOps works for us, because DevOps in our case means that devs and ops can both commit into these source repositories, and it doesn't matter whether the source code turns into our billing application or into our OS provisioning. Everybody can contribute to both, if they know the code, if they ask their colleagues for a code review and so on, but they can. That's the big change in DevOps: they can, if they want to. Just go do it. And we have test automation, so if you break it, it won't build. Don't be afraid, just try it out.

And that's the important thing to learn here. It's not enough to allow people to change code; you have to help people overcome their fears, their kind of natural resistance to working in fields where they are not really proficient. Many improvements are small improvements. Oh, I don't like that provisioning takes five minutes, but look, I see a simple solution to fix it. Okay, one minute saved. But maybe nobody in production had time for that.
Maybe the developer who was testing the provisioning, including the setup of some software in this initial kind of border-case condition, was sitting there waiting for machines to boot and install and boot and install, and got annoyed, so he fixed it.

A few more examples from the system testing world, and yes, they are ops-related. Who uses persistent storage? Okay, everybody else, what do you do? I mean, you have to store stuff somewhere. In our world each virtual machine has one or two hard disks. One disk is the system disk, and we always wipe it, format it and install it fresh, so if you store stuff on the system disk, you know it will eventually be gone. If you need persistent storage in a virtual machine, you have to add a persistent storage disk; for those who use AWS, EBS is the keyword here, but the idea is the same: you have a system and you have persistent storage.

Now, how does the persistent storage get configured into the system: where to mount it, whether to format it, and so on? In our case we wrote a service for that, a disk-mount service, which uses certain algorithms to determine what to do, like: oh, I have one extra disk, or it has a file system label "persist"-something, let's mount that. Actually not difficult, about 200 lines of bash. But if that service fails, then across our whole platform the persistent storage will be gone. So how do we protect ourselves against this risk?
We write a test. In this case we write a test that runs through all possible permutations of actions that this service could take, including error scenarios, and we use mocking: instead of connecting real storage, we use a loop device setup to provide an image file as a persistent disk. This is very convenient, because I can also simulate different scenarios, I can add two or three disks and see what happens, and so on. And of course I test service start and stop, that the service mounts and unmounts my persistent storage as required. Now I have a delivery chain with the source code of the mount service, and as part of it a virtual machine gets provisioned and set up with a little bit of fake storage, and all the tests run. Now I can tell everybody: you don't like the persistent storage handling? Fix it. And they can fix it, and I can be sure that if the tests pass, it will work in production.

The important thing is always what to mock away and what not, and Linux provides you with a huge basket of little tools and tricks for mocking stuff. You can use routing or firewalling to fake network problems, and actually you should be doing both, because a null route behaves differently from dropped packets, and you might want to test your software against both scenarios.

Another example: who is using a proxy for their servers to access the internet? Okay, and who is allowing direct connections from web servers to the internet?
Okay, impressive. We don't trust our web servers: web servers can be hacked, and hacked web servers can download additional stuff. So we use a web proxy, squid in this case, as an application-layer firewall for outgoing HTTP traffic. Again, if a configuration change in the proxy service went wrong, our whole platform would be unable to talk to the internet, and a lot of value-added services on our platform would stop working. So we wanted to cover the entire proxy configuration with tests. Not the proxy code: the proxy code is squid, it comes from upstream, and we never touch it. But for us the configuration of the proxy service is also code that can break the platform, and we wanted to cover it with tests.

The way we do that is to run each configuration change through a big set of system tests. We set up a test squid server and load the configuration there, and then for each function group, which in our world is kind of a role, we do at least one test to make sure that the most important HTTP call for that function group goes through this configuration set. We use X-Forwarded-For headers to spoof the source address, so that on our build server we can say: let's pretend this request comes from function group 5. The rule set will think it is function group 5, and then we check for access-denied messages, because of course the test server doesn't have internet access.
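A sketch of such a check might look like the following. The proxy host, port and the function-group IP are illustrative assumptions; the point is the inverted status logic, since the test squid has no internet uplink:

```shell
#!/bin/bash
# Sketch of a squid rule-set system test run from the build server.
# The test squid has no internet uplink, so the status codes invert:
#   502 Bad Gateway -> squid ALLOWED the request (only the uplink is missing)
#   403 Forbidden   -> squid DENIED the request (rule missing or wrong)
# Proxy host, port and source IPs below are illustrative assumptions.
TEST_PROXY=squid-test.example.com:3128

verdict() {
  case "$1" in
    502) echo PASS ;;
    403) echo FAIL ;;
    *)   echo UNKNOWN ;;
  esac
}

check_allowed() {  # check_allowed <spoofed-source-ip> <url>
  local status
  status=$(curl -s -o /dev/null -w '%{http_code}' \
                -H "X-Forwarded-For: $1" -x "$TEST_PROXY" "$2")
  verdict "$status"
}

# Example call: function group 5 must be able to reach its partner API.
# check_allowed 10.1.5.23 https://partner.example.com/api   # expect PASS
```

The spoofed X-Forwarded-For header only works because the build agent is explicitly allowed to use it on the test squid.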
You don't want your test calls to go against the production servers of your partners: that might make them upset, and it might cost you money if it's a billable service. So obviously the test server doesn't have any internet access, which leads to a very funny result. If I have a request from a server that should be allowed to reach an external URL, then in the good case I get a Bad Gateway error, because the squid on the test server allowed the request but couldn't get to the internet. If the rule is wrong, or if there is an error, then I get a Forbidden from the squid rule set, an access-denied error, and in the test case that means: test failed. So this is completely upside down, Bad Gateway is good, Forbidden is bad, but the test works. This is again about mocking: you need to know what you mock, and when you know what you mock, you know how to write a test that reacts to the right triggers from your mocking environment.

A last example: VM provisioning. We also have servers in our data center that we need to provision. Every morning we have a test setup running for 15 minutes, setting up various virtual machines, some of them broken, some of them good. We check that the broken ones are not allowed to work and the good ones are, and that on the good ones the automated environment actually sets up a working Linux operating system. All that happens every night, so that we know: okay, we can still provision new systems. It's a very big system test, but it's also very valuable, because there are about, I don't know, 20 software packages that go into this automated provisioning setup, which we have for virtual machines and for hardware. This is actually open source, and you can go there and find all the code for the system tests and so on.

If you do that for your platform, you get different release cycles for different software packages, and each released software package goes its own way from development to production. In the end they all meet somewhere in production, and you know that they work together because of your tests. We call it continuous live deployment; that's our way of stably maintaining an always changing platform. The general rule is: we deploy applications when they are ready, and we automate the delivery chains from source to production. The end result is low risk and lots of fun. And that's the whole thing about DevOps risk mitigation: you have all the fun of DevOps, of doing stuff together, but at low risk, and you're not afraid to do stuff together.

You'll find the slides here, plus a few more links and other talks about this topic. I'm at the end of my talk; needless to say, we're hiring. So if you have a passion for automation and for keeping things simple, please talk to the people who have our logo on their backs. Thank you very much, and we have 15 minutes for questions. Any questions from anyone?

First, thanks for a really interesting talk. In one of your examples, actually the last one, testing the proxy, you showed how you deal with the fact that you don't want to rely on external services. I wonder how this compares to actually mocking the internet, for example by recording and replaying responses. Why did you not choose that, and what is your opinion on it?

Okay, that's a good question, and I think it's exactly a question about dev versus ops. As a dev person I would think: how can I mock away the internet? As an ops person I say: I don't need to mock the internet, I just need to deal with it. This code was written by an ops development team, and so the solution was: let's take a server and let it do what it usually does, because setting up a server with a proxy and the configuration is just standard.
You say: a new server, type proxy, go go go, done. So there was nothing special to do for that. The only thing we had to do for the mocking was to set up the test server in our dev environment, which in general doesn't have internet access. So in this case I would say the answer is: it was the easiest thing to do, the cheapest in terms of effort and in terms of changing the system. The only other change we made to the proxy configuration was to allow the build servers that run the tests to use the X-Forwarded-For header to simulate the actual originating IP. That's the only real change to the configuration. We use load balancers internally, so normally only the load balancers are allowed to use X-Forwarded-For and nobody else, and for the system test to work, the build agent that runs the test script of course also needs to be allowed to use X-Forwarded-For. But that's all; everything else is the original production configuration.

Maybe I didn't mention it: in this case the entire proxy configuration resides in a software RPM package. We have a source repository containing all the proxy configuration and the test cases, and each time somebody needs to change the proxy configuration, we just do a new release of the proxy configuration RPM. That RPM goes first onto the test server, where it runs through all the tests, and then the same RPM is installed on the production proxy servers, and then we know for sure that this set of configuration works. The end result is that now developers add proxy configuration to their function groups themselves: if you are a developer and you set up new software, you go into the software package that contains the proxy configuration, add your own proxy configuration for the calls you need to make, plus a few test cases, and you're done.
You just wait ten minutes, and then it's live in production. That's how we play DevOps, and that's how we bring dev and ops together in improving our platform and reducing turnaround cycles, development cycles and so on.

You had this chart about automation, and automation is great because we are all lazy, right? And you just repeated that if I commit something, it automatically goes live. But you also had the slide with the release cycle. So what are your politics there: does everything go live as soon as I commit it and it passes the tests, or is there anything else?

Well, it depends on the team and on the software product. More and more teams put more trust in their tests than in the ability of the product manager to push the release button. The ideal situation is that you trust your tests, because the tests are documented knowledge about your platform, while the manager pushing the release button is just belief: I believe this will be good, push.

In the setup you've described, how would you typically deal with replicating, say, your production database onto the test server? Would you fully replicate it, or try to do something partial? One of the things we find most dangerous when deploying is things like schema changes, which are very difficult to test fully.

Well, in our world everything has to be a package, and everything that acts has to be a service: any acting part in our platform has to be a service that can be started and stopped and that has a status. So database changes also have to be a service, and we have services that do database changes as needed when they detect a new database schema. That also happens here, for example: together with this package comes a new database schema, and the services say, oh, a new database schema, let's update the database.

The task of reducing data from production for test belongs to the developer who creates the data, or whose software creates the data. For each piece of software, one of the items on the production-readiness checklist is: did you write something that will create a test database from production? The reduction of the data, the anonymization, the removal of personal information, it's all their problem, because they created the database. That's the only way you can scale to hundreds of roles or function groups; otherwise you have one team that just runs behind all the others and always needs to adapt their changes into the conversion process. I wouldn't want to work in that team.

You described how you cover your ops code with tests. Do you also do it in a test-driven way in the narrow sense: writing the test first, doing baby steps, refactoring? Or doesn't that make sense in this scenario?

Yes, we do that. We have a lot of Python code in operations. The example I mentioned initially, the authentication code, is managed by a Python script that does all the patching on the Linux configuration file level, and it has full test coverage with Python unit tests. And yes, in this case we first wrote the tests and then the code that does the patching; that's why, in the end, the feature for which we didn't write a test also didn't have the code. Yes, test first. That doesn't mean test first is easy, by the way, especially in operations, where testing sometimes means setting up a lot of stuff. But yes, we do that.

You describe a build automation workflow that tests lots of RPMs. What about testing things like configuration management, Puppet, Chef, all those kinds of systems? How do you get those kinds of changes included in your workflow? Well, as I said, all configuration is in packages. We don't have Puppet.
You don't use it at all? We don't need Puppet, because Puppet solves problems which we don't have. Okay, fine, fair enough, sorry. But if you went to a Puppet conference, you would find out that making Puppet recipes testable is a really big problem, and the reason it's a big problem is that Puppet combines code and configuration in a very nasty way.

Well, okay, it's fine leaving Puppet aside, but if you try to test some kinds of things, like replication or similar, you need to set up a really complex networking environment, and that goes way beyond just getting a VM. It's more like getting two or three VMs with the networking cut between them and trying to check stuff like that. It's not so easy, not just running a single machine.

Actually, here I mentioned one package, but our automation can easily handle an arbitrary number of packages involved in a change, because the system tests running here trigger the propagation of the packages, and we have a hook that propagates exactly the packages that were installed on the test server. So in the test job I say: five packages are relevant for this feature, and then it will propagate all five RPMs, if they are part of the change and installed on the test server, to the production repository, where they are ready for deployment.

And what happens when, say, your MySQL is not responding and your slave is lagging way beyond where it should be? How do you test that the rest of your system copes with that? What's the framework you use to automate that kind of task?

We use the framework called "keep it simple". Keep it simple means that the average developer doesn't take too long to write his first piece of useful code. In many cases that means we have some part on the build server which runs a job that will rsh into the test server and do some nasty manipulation before running a test.
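Such a manipulation might be sketched like this. The host names, the injected iptables rule and the status URL are illustrative assumptions; the pattern is break from inside, check from outside, clean up:

```shell
#!/bin/bash
# Sketch of fault injection from the build server: rsh into the test
# server, break something, run the normal outside check, then clean up.
# Host names, the fault and the check URL are illustrative assumptions.

break_db_link() {    # drop all packets from the test server to the database
  rsh "$1" "iptables -A OUTPUT -d $2 -j DROP"
}

restore_db_link() {  # remove the rule again
  rsh "$1" "iptables -D OUTPUT -d $2 -j DROP"
}

outside_check() {    # the same standard check the system tests always use
  curl -fs "http://$1/internal/status"
}

run_outage_test() {  # expect the service to degrade gracefully, not to die
  local server="$1" db="$2"
  break_db_link "$server" "$db"
  if outside_check "$server"; then
    echo "service survived the database outage"
  else
    echo "service died during the database outage"
  fi
  restore_db_link "$server" "$db"
}

# run_outage_test testserver01 db.example.com
```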
And we use rsh because with rsh you can just say: this IP range is allowed, and you don't need to bother with faking away SSH keys.

So basically you have to write into the spec files the logic that is mostly included in Puppet? No, these spec files provide a simplified way of doing the same thing. We separate configuration files depending on how they change, and that's why we get away with a lot less patching than you usually need in a Puppet world. Puppet is good at patching stuff, but we lay out configuration so that it doesn't need to be patched. That's why I said Puppet solves a problem which we don't have.

All right, I just wanted to add that the Puppet situation regarding testing and testable code is much better than it used to be: there is Beaker, and there are some other tools to deal with that. I know the community has been active, because this was a big problem. But in the end, what I'm saying is that you need unit tests and system tests regardless of the tooling you use. Even if you use Puppet, Chef, whatever tooling, you still need unit tests and system tests, and the unit test will still test something small and the system test will test something big. The question is always: how do you express it, how do you abstract it, and how do you make sure that the stuff you tested goes unchanged into production? In our case it's simple, because the RPM is created once, and we never create the RPM again after a successful test; that would be evil.
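Promoting the identical artifact can be as simple as copying it between repositories and refreshing the metadata. A minimal sketch, where the repo paths and the helper name are illustrative assumptions rather than our actual tooling:

```shell
#!/bin/bash
# Sketch: promote the identical, already-tested RPM from the dev yum repo
# to the production yum repo. The RPM is never rebuilt, so the bits that
# were tested are exactly the bits that get deployed. Paths illustrative.
promote_rpm() {  # promote_rpm <rpm-file-name> <dev-repo-dir> <prod-repo-dir>
  local rpm="$1" dev="$2" prod="$3"
  cp "$dev/$rpm" "$prod/" || return 1
  # refresh the repo metadata so servers see the new package
  if command -v createrepo >/dev/null; then
    createrepo --update "$prod"
  fi
}

# Example with throwaway directories standing in for the real repos:
dev=$(mktemp -d); prod=$(mktemp -d)
touch "$dev/proxy-config-1.42-1.noarch.rpm"
promote_rpm proxy-config-1.42-1.noarch.rpm "$dev" "$prod"
ls "$prod"
```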
You deploy in production what you tested, and not something else. If you have a world where you test something and then create something else to deploy in production, you will always have this little gap that can go wrong, and you will for sure find a smart hacker who will be happy to use that gap for his own purposes.

Audience: Okay, I have to ask for a detail, because what you are saying just sounds too great. We were discussing migrations and continuous deployment. Do you manage to do migrations without service interruptions? Because when we do migrations, we have to at least partially shut down services, and we can't do that just because somebody pushed to the repository.

Well, there are several layers on which this question needs to be answered. The first layer is: can your application handle the situation where the old and the new version run alongside each other? Because if the application can't handle that, then the deployment won't help. So first make your application so that version 5 and version 6 can work together on the same database; make it so that the database upgrade from version 5 to 6 will not harm the version 5 code using that same database. That's the first step, on the application level.

Then you can go to the operations level and say: okay, I have 20 web servers and I want a rolling upgrade going through these web servers, so that there is no external impact. And yes, we are doing that in our world. It's very simple: any server is allowed to install the RPM packages presented to it through the various yum repositories attached to it, and we have a tooling called yadtshell which does the rolling upgrade, including load balancer off, monitoring off, services down, packages upgraded, services up, monitoring on, load balancer on, check, and on to the next server. Yes, we do that.
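The rolling-upgrade wave can be sketched as follows. Server names are invented, and each step only echoes the action the real tooling would perform against the load balancer, monitoring, services, and yum, so the sketch is runnable anywhere:

```shell
#!/bin/sh
# Stubbed sketch of a rolling upgrade wave. Each step echoes what the
# real deployment tooling would do. A failed health check stops the
# whole wave before the next server is touched.
step() { echo "$1: $2"; }

upgrade_server() {
    s="$1"
    step "loadbalancer" "remove $s"
    step "monitoring"   "mute $s"
    step "$s"           "service myapp stop"
    step "$s"           "yum -y upgrade"
    step "$s"           "service myapp start"
    step "monitoring"   "unmute $s"
    step "loadbalancer" "add $s"
    step "check"        "$s healthy"
}

for s in web01 web02 web03; do
    upgrade_server "$s" || break    # stop the wave on the first failure
done
```

Because every server is always allowed to install what its attached repositories present, the wave is just orchestration around an ordinary `yum upgrade`.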
We actually do it in an automated fashion: TeamCity, in our case, triggers these kinds of waves, and if there is any problem, the wave just stops and we go and check what happened.

Moderator: I think we've got time for just one more question.

Audience: A lot of teams, when they are deploying to a large set of servers, do so-called canarying: first you deploy to a server handling 1% of the traffic, or one tenth of a percent, let it work with this small portion of traffic for a while, see if there are memory leaks and so on, then go to 1%, then 10%, and finally the full fleet. Do you do such a thing?

We do that in a few cases; in most cases we don't do it so far, but we are in the process of getting there. Our YADT tooling can, for example, do exponential deployments: it can deploy one server, then five, and then the remaining ones. But the thing is, yum repositories represent the target state in our world, so the time of deployment is a kind of fuzzy gray zone. In our world it is always okay to deploy the latest updates; it is never wrong, and nobody can be punished for doing a yum upgrade.

Audience: So how do you, because if I understand correctly your commit-to-deploy cycle is around minutes, right? How do you deal with bugs that only show up after a few hours of work, or a few million requests?
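A minimal sketch of that exponential scheme (one server, then five, then the rest); the server list is invented and `do_upgrade` only records which servers it would touch:

```shell
#!/bin/sh
# Exponential deployment waves: batch sizes 1, 5, then all remaining.
# A failing batch stops the rollout; here do_upgrade just records.
SERVERS="web01 web02 web03 web04 web05 web06 web07 web08"
WAVES=""

do_upgrade() {
    echo "upgrading: $*"
    WAVES="$WAVES;$*"
}

set -- $SERVERS
for batch in 1 5 100000; do        # 100000 effectively means "the rest"
    [ $# -eq 0 ] && break
    chunk=""
    n=0
    while [ $# -gt 0 ] && [ $n -lt $batch ]; do
        chunk="$chunk $1"
        shift
        n=$((n + 1))
    done
    do_upgrade $chunk || break     # a failed batch stops the wave
done
```

The first wave is the canary; only when it survives do the next five servers follow, and then everything else.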
Audience: Like, I'm leaking four bytes of memory per request.

Well, as a developer it is your responsibility to think about the potential danger of your change, and if you have stuff that could go wrong in that way, then you need to deal with it already on the development side and not expect the deployment to solve your problems. And if you want to have a longer-lasting state of different versions, then in our world you create yum repositories for that. For example, our big core application creates a new yum repository for each build, and in that yum repository are a few hundred RPM packages with a few gigabytes of stuff. In that case we can always take a few servers, hook them up to the yum repository of the next version, upgrade them all, and wait a little bit. And even if you reinstall one of the canary servers, it will automatically get the N+1 version it is supposed to be running. So we do a lot of the state management you mentioned with the help of yum repositories, by just creating special yum repositories and putting packages there.

Moderator: Okay, I think that's time. So thank you very much.