tests, and why you should try to burn down your production environment without actually killing it. But first: my name is Sebastian, and I'm the chapter lead of the backend department at Karten Macherei. Probably none of you know what we do, so a few words about the company. Karten Macherei translates to English roughly as "card makery", and no, we don't do maps or anything like that — we quite literally make cards: wedding invitations and cards for all kinds of emotional events. We are seven years old, we are above 100 people, we are mainly based in Germany with three locations there, we have about 15 to 20 nationalities in the company — so our main language is English — and we currently have five stores, several million designs and cards, and lots of ways to get your cards exactly how you want them.

Before we go into the topic, a short disclaimer: everything I tell you today is based on our own experience. There are better ways and other ways to do some of this, so feel free to approach me afterwards or in the Q&A section and we can discuss them. But everything I want to show you is based on the fact that we had to find a solution for a specific problem at a specific time.

Next point: let's go briefly through the agenda. First I want to show you our project and its architecture, just for a better understanding of why we do something like smoke tests and why we actually thought we needed them. Then I would love to get you all on the same page about what kinds of tests exist. Then we go into the smoke tests themselves, and I show you how we got our server architecture running and how we deploy the code, so you can see when the best time is to run the smoke tests. And last but not least: questions and answers.

But first, the product and project architecture. When I started in this company, we
had a monolith, and a friend of mine ran some random metrics on one of our biggest files. This number is the complexity of one single file — the biggest file we have. For fun we tried to find a real name, a wording, for this number. We tried several websites, and some of the results don't sound quite real; that one was the closest, I think. Three or four of the websites just said: sorry, that number is too big for us.

So we said: okay, we have to change something, we can't leave it like that. We are senior developers, we want to get good code out there and into a nice shape, so that everyone understands the system just from reading it. So, what do we have to do? Our boss wanted to have this view — that's how it should look, with all these nice tiles. So about one and a half years ago we started to build and try a few things. We went for a couple of weeks, months, in the wrong direction, killed it, restarted, and after half a year we finally found a solution we were happy with: a project we called Fury. Fury is named after a US TV series from the 1950s, maybe some of you know it — it's about an orphan boy and his fastest and most reliable horse. And we said: we need something really fast and something really reliable, because we want the possibility to scale wide. That's what we came up with.

Our idea, in the end, was this: why should we actually render HTML somewhere on request? We said: okay, we won't need that. We put our
pre-rendered HTML into a key-value store, put a search engine next to it, and serve requests directly from the KV store — no problems on that side anymore. But somehow we have to get the values in there first: we still need to render the HTML, just not on request anymore. So in the backend we do pretty much the main trick, and that's what I would love to show you on this slide. We call it "collect and export": we start by collecting the information from our legacy database, then we render the HTML, we pre-process all the search requests we want to handle, and we publish all of that to these systems.

One more thing: we are not replacing all pages of our legacy system. We said at the beginning we only want to replace product detail pages and category pages. Everything which is cart, wish lists and so on, we don't want to worry about yet; we only want these pages fast in the first place, and worry later about everything based on sessions and databases — the real big issues every shop has. But we still want all the pages which remain in the legacy system to have the same header and footer, so that at least from a header and footer perspective they look the same as any other page. To have no dependency from the old system to the new system, we push the header and the footer into the KV storage for the legacy system to use. That's the only restriction we have there — that's why it's red on the slide — the only place where either system knows of the other one. Besides that, as long as it's not the backend process, if it's just the frontend process, the systems are completely independent.
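The collect-and-export idea, plus the request side that reads from the store, can be sketched in a few lines. All names here (`renderProductPage`, the URL scheme, the `$kvStore` array standing in for Redis) are my own illustration, not the actual Fury code:

```php
<?php
// Sketch of "collect and export": render pages ahead of time instead of on
// request, and serve straight out of a key-value store. The renderer is
// trivially simplified; the real system also pre-processes search data and
// pushes header/footer snippets for the legacy shop.

/** Render one product row into a full HTML page (stand-in for the real renderer). */
function renderProductPage(array $product): string
{
    $name = htmlspecialchars($product['name']);
    return "<html><head><title>{$name}</title></head><body><h1>{$name}</h1></body></html>";
}

/** Collect products and export them as a url => html map, ready for a KV store. */
function collectAndExport(array $products): array
{
    $pages = [];
    foreach ($products as $product) {
        $pages['/product/' . $product['sku']] = renderProductPage($product);
    }
    // In production this map would be written to Redis, e.g.:
    // foreach ($pages as $url => $html) { $redis->set('page:' . $url, $html); }
    return $pages;
}

/** Request side: returns [statusCode, body]; a 404 tells the proxy to fall back to the legacy shop. */
function handleRequest(array $kvStore, string $path): array
{
    if (isset($kvStore[$path])) {
        // The real system also applies session-based tweaks to the pre-rendered
        // HTML before shipping it.
        return [200, $kvStore[$path]];
    }
    return [404, ''];  // not ours -> the legacy system serves this page
}
```

So `collectAndExport` fills the store offline, and `handleRequest` never renders anything — it either finds the pre-rendered page or signals a miss.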
But how does the request work? Because now we have two different systems delivering pretty much similar code — or at least, in the end, the same product. So we said: if we get a request and we can handle it, we go to the key-value store first and check whether an entry for that request exists. If it exists, we do some extra searches, we might go to the session to tweak a few things on the frontend, and we just ship it. If, for example, a card isn't available in the new system, the new system ships a 404 and the whole request goes to the legacy system.

So much for our architecture. If you want to know more about it, that's a talk of its own, where we go into much more detail: why we actually did it, what the reasons were, and how we managed features like 100 percent code coverage and shipping the page within 100 milliseconds time to first byte. On the home page last week, for example, the time to first byte was 35 milliseconds. If you want to know more, have a look at that website — there is the link to the talk.

But now let's all get on the same page. We have the test pyramid; that's what we use in our company. As the first and ground layer we have unit tests. What is a unit test? Pretty much what the name says: it tests one unit. When we have these four units, we want to test only unit one, so we just mock the direct dependencies. In 99 percent of the cases — if the code is clean — you don't need to worry about class four, the dependency of your dependency; you can just ignore it. But everyone should know by now that unit tests are not the solution for everything. How about this one? Each unit is working — but only for one person at a time. And how about that one? Probably some of you know that picture already: how should that ever work? So we need something more.
And the animation is showing it already: we need integration tests. As the name says, we should test the integration of one dependency with another. In this case we want to test the dependency between class one and class two, so the easy way: we mock the rest away, but we still use the real units together.

We have a rule in our setup. When we started the project we said we want to have 100 percent code coverage, and nearly everyone we talked to said: don't dream of that. But here is how we managed it: we make sure with unit tests that each unit is properly tested and has 100 percent coverage of the unit itself. There are cases, though, where you just can't use unit tests — for example factories, where you really wire your dependencies together — and that's the small layer where we get into integration tests. So all of this 100 percent we achieve with unit tests and slightly with integration tests, and we really make sure this is not someone saying "I don't want to write a unit test, I'll just do an integration test and get the coverage." That's not allowed for us.

But there is still space left in the pyramid, and that space is for acceptance tests. A lot of you know them. You can do them both ways, and we do them both ways as well: the first way is doing it for real with code, so we take all four dependencies together and test them as a whole; and we also run Selenium tests and all that stuff to make sure that everything clickable, the JavaScript, all these features are working.

I have to say we still have some downsides; we are not 100 percent perfect there either, because in my opinion we don't have enough integration tests between backend and frontend. The backend has proper tests, the frontend is nearly there — they're trying to catch up with us at the moment — but we have no
real tests making sure that the objects we hand over to the frontend actually work with the JavaScript. And that's the one place where, when there is a fuck-up on the website, it usually happens — because that's the part which is not covered.

So that's pretty much testing everything — but all of this happens before we deploy. And we said: we need more. You saw our architecture: if a page is not shipped by the new system, it will be shipped by the old system, and we don't want the old system to ship pages which should be in the new system. So we need something more at the top of this pyramid. What could that be? The talk title says it already: smoke tests.

Here a small disclaimer again: the solution we have for smoke tests is the solution we found and used at the time we had the problem. I've learned in the meantime that there are tools which can easily handle this — where you can easily run smoke tests, or even Siege or whatever — but what I will show you now is the way we did it.

So what are smoke tests? Originally, smoke testing doesn't come from IT, as usual. It comes from testing drains, sewage systems and the like: the idea is to close all the normal endpoints, put smoke in, and see where the smoke comes out again. If smoke comes out somewhere, there is a problem. So we said: let's do something similar. During a deployment, before we actually tell our system that it is ready to go live, we run the smoke tests: we really try all our endpoints and see if something starts to smoke. If smoke comes up, we have to look at that point and decide whether it's a big problem or might just be a DNS hiccup or something like that. So that's pretty much our list of
requirements, our defined list of what we wanted to have. It should be simple. It should cover all URLs in the Google index, because something else we saw while writing the code: sometimes people in the product management department change some URLs but forget to set up the redirects, so there are broken URLs in the Google index — and it usually ends up on our plate that a product is missing which should not be missing. And what we found as well: cover the optional parameters. Please smoke test not just the plain URL, but at least some URLs — or, if you want, all URLs — with the necessary parameters on them, because only then will you know if a parameter causes trouble.

How do they work? In an easy way: you have a production server running your application, you have a CI server you want to run your smoke tests from, you send a request and get a response back, and then you can validate the response: whether the headers are okay, whether the time to first byte is okay, and whether there is a body. If you want to do this, make sure your CI server and your production server are actually in the same network, especially if you want to check the time to first byte. If they are in different data centers, it will mess up your time to first byte. What you can also test there: when a page is slow — above your limits — the test fails, so you can see that.

So, as I said: smoke tests should validate the status code, the time to first byte, and that a body is provided. What we test as well is whether it's the correct server. We want to make sure that the correct server is shipping the page, because we smoke test not just the new URLs but also existing URLs from the legacy system, and therefore we want to check that it's the server we expect there.
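To make those checks concrete, here is a minimal sketch of the assertions such a smoke test runs. The response-array shape and the function name are my own illustration — in the real setup these are PHPUnit assertions inside a test fed by a data provider of URLs:

```php
<?php
// Minimal smoke-test validation: status code, time to first byte, body present,
// and the expected application server header. In PHPUnit, each of these checks
// would be an assertion inside a test case that a data provider feeds one URL at a time.

/**
 * @param array $response ['status' => int, 'ttfb' => float (seconds),
 *                         'body' => string, 'headers' => array<string,string>]
 * @return string[] list of failure messages; an empty list means the URL passed
 */
function smokeCheck(array $response, float $maxTtfb = 0.1, string $expectedServer = 'fury'): array
{
    $failures = [];
    if ($response['status'] !== 200) {
        $failures[] = 'expected status 200, got ' . $response['status'];
    }
    if ($response['ttfb'] > $maxTtfb) {
        $failures[] = sprintf('time to first byte %.3fs above limit %.3fs', $response['ttfb'], $maxTtfb);
    }
    if ($response['body'] === '') {
        $failures[] = 'empty body';
    }
    if (($response['headers']['app-server'] ?? '') !== $expectedServer) {
        $failures[] = 'served by the wrong server';
    }
    return $failures;
}
```

Note the last check: it's what catches the case where a page silently falls back to the legacy system instead of being shipped by the new one.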
What you should not try to do with smoke tests, though, is acceptance testing — trying to find diffs and all that stuff — because there are better systems for that, like the Selenium setup mentioned before.

In the first place we wrote some code, and that's how it looked. It's based on PHPUnit: we have a data provider which provides the test with the list of URLs, and for each URL we just send a GET request and run the asserts against it. The code was written in five or ten minutes. Then we ran it — and it took 45 to 50 minutes, because, as I said, we smoke test nearly all URLs we have. We're probably not completely up to date on all URLs, but it's about 20,000 of them, so it takes 45 to 50 minutes to run.

So we went back to the whiteboard: what can we do, how can we speed that up? And we said: let's do concurrency, let's do concurrent smoke tests. Let's send not one request but three — it's a real server, so it must be able to handle more than one request at the same time. We send three requests; at some stage the first request comes back and we directly send a new one, because that pipe is free again. Then another request comes back, and we send the next one. And at the end — oh, there's one request which didn't come back properly, that's the red one, but it does come back, so we know about it. Then we tried to increase the number of threads — how many pipes we can run at the same time, how big the concurrency can be — and we got scared that something like that happens to our storage. We don't want that. So we tested it carefully, and we came up with a number which worked quite well for us.
Take the number of cores your system has, subtract one for the Linux on it, and fire with everything else. That's what your server should be able to handle — and if not, maybe look into your code, maybe there is something hidden in there.

So that's how it looks for us now. As you can see, we have continuous deployment on our system, and you can easily see here that something got slow: we see the throughput — how many requests we can put through within one minute — and each of these runs is a full set of 20,000 URLs. You see it got kind of slow already during this deployment, and even slower here, so okay, we had to do something; we fixed the problem and went back up. That's how it worked out. And with the rule of using one core less than the hardware has, we never got in trouble with our data center — we don't host ourselves, we don't have our own data center; we have a hoster taking care of our hardware, and they never complained that we ran over some limit. What usually gets us into more trouble is running this on stage environments where everything is on one server; that's usually trickier.

But you don't have to write the code I just showed you yourself. The good part: just for this conference I sat down for weeks — and thanks to Steffi, who made sure I had the time to write it — and there is a library for this now, really written down and made available. It even has four stars already; I try to keep the numbers as high as possible. So you can use it, and feel free to improve it, feel free to get more features into it. But I would love to explain now how it works. It works in an easy way: we have again PHPUnit, because that turned out to be the best system to run these tests. You might know this already: there is the test itself, the only
real test method, and a data provider that you can use. What got a bit in our way is that the data provider actually runs before all tests: if you use this library inside your normal test suite and you want to run the whole pyramid, the data provider runs first — even if you expect the unit tests to run first, data providers are resolved before any test runs. So PHPUnit calls the data provider as the very first step. The data provider then takes care of sending all the requests, with the concurrency you have set up; it gets the responses back and creates result objects for you, and then every single result is handed into its own test — which makes the output easy and clean to read.

That sounds quite complicated, but with the help of a consulting company, thePHP.cc, we looked into the code and made it as clean, as easy to read, and as easy to understand as possible. There is only one thing you need to do: include our trait. The trait just makes sure these features are available to you.

So let's go through the code. That's pretty much your entry point: you start with a list of URLs — it doesn't matter whether it's an array, a CSV file, or something attached to a database, whatever you want — and you put it into the URL collection. There you define the options: options like the request timeout. This is not the time limit the page should meet; it's the timeout curl gets — or whatever system is behind it. We use curl in this library for the moment, but it's built so you have the option to use wget or whichever client you prefer. Then, what's important for some requests: when you want, for example, to explicitly
test that a redirect is working, you can switch redirect-following on or off. The concurrency we set here to three again, as an example. And the body length — I have to explain that one. I said earlier: don't check what is in the body. This body length is just there to save your memory, because if we saved the whole body into the result object for 20,000 pages, you can do the maths yourself: at, say, one megabyte of HTML per page, that adds up to many gigabytes you don't want in memory. That's why we limit it to 500 characters. Those 500 characters serve two purposes: first, to test that there is a body at all; and second, if something fails, we show you those first 500 characters — so if there is a PHP error at the beginning of the body, you can see it, and you can see for example that you missed a parameter. And there's one parameter not in this example, but it exists: if the system you test sits behind basic auth, there is a basic auth class you can hand in as well, with your username and password, to test against the basic-auth-protected system.

Good. Then, as I said, the result object comes back, and with it you have ready-to-go features: you have the URL again, you have the time to first byte, you'll find the headers, the body, the status code, an as-string method which gives you the error message — that's what we use for the error output, as explained — and some more small features, like validating whether it is a valid result. And last but not least, we have in here already some prepared asserts.
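Pulling the options just discussed together — timeout, redirect handling, concurrency, body length, basic auth — a configuration object could look roughly like this. To be clear, this is purely illustrative: the actual library's class and option names may differ.

```php
<?php
// Illustrative only: one value object bundling the smoke-test options discussed
// above. Not the real library API — just a sketch of the configuration surface.

final class SmokeOptions
{
    public $requestTimeout;
    public $followRedirects;
    public $concurrency;
    public $bodyLength;
    public $basicAuth;

    public function __construct(
        float $requestTimeout = 10.0,   // curl-level timeout in seconds, not the page's budget
        bool $followRedirects = false,  // keep false to assert on the redirect itself
        int $concurrency = 3,           // parallel requests; rule of thumb: cores minus one
        int $bodyLength = 500,          // keep only the first N body chars to save memory
        ?array $basicAuth = null        // ['user', 'password'] when the target sits behind basic auth
    ) {
        $this->requestTimeout  = $requestTimeout;
        $this->followRedirects = $followRedirects;
        $this->concurrency     = $concurrency;
        $this->bodyLength      = $bodyLength;
        $this->basicAuth       = $basicAuth;
    }
}
```

The 500-character default mirrors the talk's trade-off: enough body to prove the page rendered and to surface a PHP error, without holding 20,000 full pages in memory.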
They check that the request was a success, meaning a 200 status code; the time to first byte; that the body is not empty; and that a specific header is there. To test that a page was shipped by Fury, we set up our Fury servers so that the service sends the header app-server: fury with every response, and then we can test against it and see whether it's there or not.

Oh, and I forgot one feature: there are two more methods in the trait, a success output and an error output. As you can imagine, while the data provider is collecting and running all these requests, you don't see anything. So we put these two methods into the trait, which you can override to create your own view. In our live environment we have a counter popping up — I think every 50 entries we show the number — to see whether the script is actually still working, or whether it ran into an endless loop, or just died, or whatever the state is. In this test case we ran 223 tests; one of them failed — a 404, on purpose, to show something here — and that's what the output looks like: in this case it tells us there was no success, and there are the headers and the body and all that information.

There are a few more features in the pipeline. You might have heard there is already PHPUnit 6. On that point: the library at the moment is based on PHP 5 and PHPUnit 5 on purpose, because we want to enable teams to smoke test their existing legacy system with it while they build the new one — and there might be teams out there which don't yet have a PHP 7 environment, so they can run this on somewhat more legacy systems. But there will soon be a second version, a complete second
branch, based on PHP 7 and PHPUnit 6, where we want to improve the usual quality: more assertions, redirects, all the stuff we want to do there.

Good — but now, how does this fit into our architecture? Because, as you can imagine, if we ran that against the live system during peak time, it might still cause some trouble. That's where our server architecture comes into the game, and it gives us a few more things. Our current setup is an A/B setup: we have two identical servers. At Karten Macherei we don't worry about hot standbys and failovers and all that stuff — the data center does that for us. From our view we just take care of server A and server B, and we have a web server in front of them which doesn't do much, except that it knows whether every request goes to server A right now or to server B. Only one server is live at a time; the other one is idling, doing nothing.

Now we want to deploy something. We have the deactivated server: server B is currently handling every request, so on server A we don't need to worry about users or anything else. The idea: we deploy the code, then, as mentioned before, we run the collect and export script — and before all of this we already ran the unit, integration, and acceptance tests. Now, after collect and export, after everything looks good to us, we run the smoke tests. But what we are not doing is running the smoke tests through the public web server, because that would cause trouble for us: if we access this server directly instead, who makes sure that that path works? So we copied our configuration one-to-one onto that server and gave it a second name to listen on. We can either access the whole system via our normal URL, which ends up on the active server, or we have in here a hidden configuration.
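As a rough illustration of that hidden-hostname trick — the talk doesn't name the web server used, so this assumes nginx, and all hostnames are invented:

```nginx
# Illustrative sketch (assuming nginx; hostnames are made up).
# The public name routes to whichever app server is live; each app server
# additionally listens on a hidden name so smoke tests can target it directly,
# through the same copied configuration, before it goes live.

upstream live_backend {
    server server-b.internal:8080;        # flipped to server-a on release, then reload
}

server {
    listen 80;
    server_name www.example-shop.com;     # public traffic
    location / { proxy_pass http://live_backend; }
}

server {
    listen 80;
    server_name smoke-a.internal;         # hidden name: smoke tests hit server A directly
    location / { proxy_pass http://server-a.internal:8080; }
}
```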
It is 99.9 percent the same as the other configuration, except the one place where the hostname is mentioned — just to have this tweak of going through the system directly to that server with the smoke tests. As I said, we reach all the endpoints and check that everything works. If not — you see there is something left here — we don't go further, we just stop. An email is sent to a lot of people saying something failed, and at least one of them has to look and double-check what is broken. Then we rerun the whole deployment if necessary; or sometimes the DNS lookup causes trouble — especially at the beginning, because we didn't have the DNS lookup cache active — and in those cases we just rerun the tests. Then we go to the last step: switching the system on. Server B is currently active; we put server A in place and tell the web server to reload. Every request which is already queued still goes to the old system; every request coming in from that moment on goes to the new system. And with that, we are pretty much sorted.

Just a conclusion: please do me a favor and write tests. And please do yourself the favor of trying to get to 100 percent. It's hard to get there, but you will love it when you're there, because 100 percent is a good number, and 100 percent really makes sure you have no loopholes. It's not saying there are no bugs in the system, but it gives you the freedom to refactor code, the freedom to write clean code — and getting to 100 percent forces you, in a lot of places, to write clean code. And, as I said before: smoke test your website, and only activate a server if everything is fine. With that I'm done. Thank you, and now I'm open for questions.

Question: Unless you've already got two servers that you're paying for — is there a reason not to do it on
staging, and then just deploy once those tests have passed, making staging bigger? — Yeah, I get your point. We are currently not running this on a stage at all, but that's due to a database issue: our databases on stage and live are not equal, so we can't run the same list of URLs against the live and the stage system, and in my opinion that's the biggest point. You can still run it against your stage system, but does that still show you that the hardware didn't fail, or that your Redis went down without anyone noticing, or got stuck? That's more the problem. That's why we said we want to run it against the live environment, because only then can we be 100 percent certain that the pages we have listed actually work and will actually be available for the customers. More questions?

Question: You were saying you would probably do things differently now — what sort of things might you do differently? — Yeah, as I showed in the slides with the library we just introduced: while I was writing that library for this conference, I realized we had not been doing proper concurrent smoke testing. To explain: the way we did it before, we did concurrency, but we sent a chunk of ten requests and waited until the last one came back. With that we ended up at something like eight minutes for 20,000 URLs, which felt kind of okay. Now, with the library — which does proper concurrent testing — I tested it, and I actually expected more from it: we only gained something like a minute on 20,000 URLs. I expected to get much more out of it, but if all pages are near 100 milliseconds, they all come back literally at the same time. More questions? Yeah, thank you.

Question: You gave the use case of passing smoke through and looking for where the
smoke comes out on the other side. Do you have a better use case, in terms of code, where we need to close all the endpoints and check where the smoke is coming from? — If I get your question correctly: with the smoke tests, when we see smoke, we have to look at each single case. Whether it's just a DNS lookup, as I mentioned before — that was the usual case in the last months. Now we more often have the problem that a redirect is missing which someone forgot to set up. And we never really had bigger issues there, like a page sending a 500 or a 404 — the whole pyramid, in total, makes sure the system is working. Since we have the whole pyramid in place: yes, we still have bugs; no, not everything is perfect; but with the pyramid in place we see failures much, much earlier — not at deployment time anymore.

At the very beginning, when we introduced the pyramid, we did have some trouble with the database not being consistent. For example: when we built the system, we had a quite strict rule for what a format name should look like from a technical perspective. For us, as an example, f040 is a small square card. We set up the rule for it, and we tested it against our development environment, and even against our stage environment — but as soon as we wanted to go live, there was more than just f040: there were more formats, and more kinds of formats, existing. And they don't all start with an f — for envelopes it's ef. We never thought about that, and that was the usual kind of fuck-up we had at the beginning. The tests helped us a lot there, because we might not have seen it if we hadn't run the whole pyramid. Question answered? Thank you very much. There's one in the back.

Question: You said you
tested about 20,000 URLs — how did you go about getting the entire list of URLs on your site? — Okay, thank you, I completely forgot about that topic. Our list of URLs is actually not dynamic. We thought about making it dynamic, but we ended up finding far too many reasons why a dynamic solution is not good: if you create that list dynamically from the stuff you just exported, then even if your export produced not 20,000 URLs but just two, the smoke tests would still pass. So I would really recommend: either have your own system to maintain the URLs, or do it as we did, with CSV files. We have two CSV files, one for the new system and one for the old system — pretty much one CSV file per smoke-test category, holding all the URLs.

To get those URLs there are two ways: either you look into the Google index and make sure you cover them, or — as we did, once we were sure we had covered all the pages manually — you take your Redis and export the list of available URLs into a file. Since then it's maintained manually: if there is a URL which really has to become a 404, which really has to be taken out, an actual deployment is necessary to remove this URL and make the suite pass again. But that's the only way we found which makes sure your system stays stable, because you have to double-check: if there is a 404 on a URL in the Google index, you don't want that — there should be a proper redirect to the correct page. We only removed some 404 pages just recently, because they weren't in the Google index at all and there was never content on them; someone had created category pages without putting products on them, so we took those out as well. Does that answer your question? Okay.
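A CSV-backed URL list like the one described in that answer can feed a PHPUnit data provider with a few lines of plain PHP. The file layout here — one URL per row, `#` for comments — is my own illustration:

```php
<?php
// Load a smoke-test URL list from a CSV file, one URL per row, skipping blank
// lines and comment rows, shaped so it can be returned from a PHPUnit data
// provider (each row = the arguments of one test case).

/** @return array<int, array{0: string}> */
function loadUrlList(string $csvPath): array
{
    $fh = fopen($csvPath, 'rb');
    if ($fh === false) {
        throw new RuntimeException('cannot open URL list: ' . $csvPath);
    }
    $rows = [];
    while (($line = fgetcsv($fh)) !== false) {
        $url = trim($line[0] ?? '');
        if ($url === '' || $url[0] === '#') {
            continue;            // blank line or comment row
        }
        $rows[] = [$url];        // one data-provider row per URL
    }
    fclose($fh);
    return $rows;
}
```

Because the list lives in version control, removing a genuinely dead URL is a deliberate change (a deployment), exactly as described above — the suite can never silently shrink the way a dynamically generated list could.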
Have you detected an actual issue with these smoke tests, and what kind of issues were they?

As I said before, the usual issue we detect with the smoke tests: once or twice Redis stopped on us. We push into Redis with as much power as we have; when the collected export runs, we clean Redis completely, strip it empty, and refill it as fast as we can. Once in a while the last push into Redis somehow killed it on the server, and we didn't see it during the deployment; we simply couldn't access Redis anymore. The smoke tests showed it to us: wherever we tried to access a page, the Fury system answered "sorry, I don't know that page" because of the Redis issue, the redirect to the old system kicked in, and the old system answered. That's why I said we want to test the headers: I test the headers of the response to make sure the correct system is answering, and that's how we saw, for example, that Redis had gone down.

We also see the usual issues I showed you before in the graph, like when a page slows down. Don't pin me down on the exact date, but maybe two months ago one of our new team members introduced a frontend feature, a nice touch where, when the customer is logged in, we put their name at the top of the page, and he used Twig for it. We don't have Twig rendering in the frontend, so with that change we would have had to fire up Twig on every request. We love Twig, and we have Twig in the backend, but we don't want it in the frontend, because loading it up there is simply not necessary for us. So we said: if you want to do that, don't do it with Twig, do it with XPath, because everything we still have to tweak in the frontend before actually serving what comes out of Redis, like the customer login status and all that stuff, is done with XPath and filtering.
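Two of the checks mentioned here, the header test that tells the systems apart and a latency budget, could be sketched like this. The X-Served-By header name, the fury value, and the 200 ms budget are all assumptions; the talk doesn't name the real header or thresholds.

```python
import time
import urllib.request

# Hypothetical names: the talk only says a response header is
# inspected to verify the new (Fury) system answered rather than
# the legacy fallback, and that pages normally respond in ~100 ms.
SYSTEM_HEADER = "X-Served-By"
EXPECTED_SYSTEM = "fury"
TTFB_BUDGET_SECONDS = 0.2

def served_by_expected_system(headers, expected=EXPECTED_SYSTEM):
    """Check a response's headers to see which system answered."""
    return headers.get(SYSTEM_HEADER, "").lower() == expected

def time_to_first_byte(url, timeout=5.0):
    """Measure seconds until the first byte of a GET response arrives."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read(1)  # stop after the first byte
    return time.monotonic() - start

def within_budget(ttfb_seconds, budget=TTFB_BUDGET_SECONDS):
    """True if the measured time to first byte is acceptable."""
    return ttfb_seconds <= budget
```

If the header check fails, the legacy system answered the request, which is how a dead Redis shows up in the suite even when the page itself still renders.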
That showed up in the smoke tests as well, because the time to first byte went up from roughly 100 milliseconds to something like 200. A lot of pages started to fail, and the smoke tests got really slow. You can see that easily, because we run the deployment on a pipeline and the whole pipeline usually takes something like 15 to 20 minutes; when it takes more than 25 minutes, it's usually the smoke tests, and then you have to look into it. Good, more questions?

Hi, if I understand correctly, this only does GET requests at the moment? Yes, it's only based on GET requests. So would you not recommend doing POST requests, for example? Our solution to that problem is simple: we don't have POST requests in the new system, so the suite isn't meant for POST requests yet. There are ideas to add them in the future, as soon as we get to that point, because POST requests belong to the legacy system, and we decided to introduce everything only for the new system to narrow our focus, so we don't have to think about everything left and right, only about the topics and tasks in front of us. There will be features in the future to cover POST requests as well. Thank you. Okay, good, more questions? Then thanks again, and have a good day.