So I guess I'll have to ask someone I know to pass around the mic, because as you may know or remember, this is an interactive talk. We prepared some stuff to follow along, but we are really hoping to have a discussion, thoughts, and shared experience. I guess it should almost be time to start. All right, I guess you all agree that we will now proceed.

Okay, so this talk, as I just said, is an interactive talk, so we're really hoping to share our experience. Why did we want to make this talk? This is Ramnes, I'm ultrabug, and we work at Numberly. That's where you can find us if you want to discuss things with us later. But to get back to the title of this talk: it's about what happens when shit happens. The main thing in our daily job is that we run some pretty heavy-throughput web services that gather data for our customers, and we can never be down. Downtime is not acceptable, and losing data, which is another story, is not acceptable either. So over the years we've developed some kind of practical reactions, and we have learned to develop and design our infrastructure a bit differently. And we are still learning; that's why it's an interactive talk, because we don't claim to have the answer for every use case. So we wanted to start with the basic stuff, which will lead, I hope, to the conversation we'll have. Let's take a simple example, which Guillaume will introduce you to.

This is a very basic application, like what you could have when you start a company or anything. You have nginx, which serves HTTP requests; behind that you have a Flask application, which handles all the logic; and you put all your data in MongoDB, the database in this example, but it could really be any database.

So the first example is: what happens when your database is down? In our case, we have multiple solutions. The server could be burning, for example, but if you have a replica set of databases, then if one burns there are still two or three other databases that can take over the lead, and you continue to serve requests.

Something else that could happen is that you run out of resources, for example you don't have RAM anymore. In that case, you can trigger some automatic kills, for example with uWSGI: you can just say in uWSGI, okay, if that process takes more than, I don't know, one gigabyte, kill it (there's a minimal config sketch of this below). You could also use cgroups, like with Docker or anything, just to say: this process only gets that amount of memory.

If you don't have any disk anymore, like if the disk burns or you have big failures, what helps is RAID 1, RAID 10, anything. Basically, never run a web application in production on something that doesn't have RAID 1. Another good thing you could have is a distributed file system, like NFS or anything; there are a lot of options there. It's a good idea for some use cases, but it can also add other risks, so that's a choice to make.

If you have a server overload, where the database can't handle any more requests because it's already at full capacity, well, there's not much you can do except monitor it, so you know it when it happens, and scale horizontally, so you just add more servers to handle more requests.

So if you have some other ideas or some remarks about that, don't hesitate to tell us. Like Alex said, it's really an interactive thing.
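Here is the kind of automatic-kill configuration mentioned above: a minimal uWSGI sketch, assuming a Flask app exposed as app in app.py; the numbers are just illustrative.

    [uwsgi]
    module = app:app
    processes = 4

    ; recycle any worker whose address space grows past ~1 GB
    reload-on-as = 1024

    ; kill a worker stuck on a single request for more than 30 seconds
    harakiri = 30

With this, a leaking or stuck worker gets recycled instead of starving the whole server of RAM.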
And while you get the microphone, I'd like to ask you to raise your hand if a database server or a backend has already crashed your web service application. Okay. All right, so I guess you all have experience in this field. Like I said, we prepared basic stuff like this, and we'll get deeper and deeper during the talk. So yeah, please.

Hi. So, yeah, I think we all share that experience. My question would be: why don't you use, or didn't you use, any of the standard tools or solutions for those kinds of problems? For example, for the last one, Mesos seems to be a good solution.

I can answer that: for complexity's sake. Who doesn't know Mesos? Okay, so we already lost a lot of the audience. Just to get back to what it is, and correct me if I'm saying it wrong: it's a clustering, service-oriented solution that does resource management, so that it can spawn a resource somewhere, and spawn it somewhere else if the given server it was running on accidentally dies. Right, but setting up Mesos and managing it is an overhead that you may or may not want to have. Kubernetes is also the same kind of thing, by Google; the Google platform runs on Kubernetes, and it's maybe also a good solution. It depends on the architecture. Here we took a basic example with no automation whatsoever, because we also believe that sometimes simplicity and the built-in features of the technologies we use are the best response, rather than building a bigger infrastructure and adding complexity again. Maybe you can save complexity by using the right technologies, technologies that handle failure in the right way. Also, we won't talk about Mesos or Kubernetes in this talk; this is really just the first example, and the next examples will move on to bigger architectures.

Yeah, so in my experience, I have heard a lot of similar responses from different teams, and the thing is, sooner or later they end up with a lot of moving parts. And sometimes, really, it's perhaps cheaper to just use something and invest a week or two, instead of having to answer the phone at 3 a.m.

Yeah, like I said, it really depends on your team and the size of your team or your company.

Yeah, I really agree with you. I just wanted to say: please don't call plain NFS a distributed file system. It's not like GlusterFS; it will burn you.

You're right. When we wrote "distributed file system", we had more in mind HDFS, which we use intensively. I misspoke.

Okay, in my experience it's not very hard to avoid hardware failures: we have replication, we have master-slave, we can clone and back up our data. But it's very hard to recover from logical failures, when we logically corrupt our database, corrupt our MongoDB database. How can we avoid this?

Yeah, we'll maybe cover a deeper example that relates to the problem you're talking about. I agree with you; so far this is only pure hardware failure. Any other hardware failure experience?

Hi. So, I forgot to mention that mostly, with these kinds of homebrew solutions, I've noticed that they end up with a much more complicated architecture.
For example, if you wanted to somehow make a failsafe architecture out of this technology stack; I mean, in my experience, teams have ended up with multi-master, highly complex MariaDB clusters and whatever, when, you know, the solution is simply: use Celery, use whatever framework, just do it.

Yeah, we'll get to some of those afterwards. You're right. Let's continue.

I don't know if I'll be contributing much, but just an anecdote about hardware failures, from a project I was only briefly on: we had this big data center, in Verizon or Amazon or something, but it was in one place in the world, and a tsunami hit it. We'll get to this later. Yeah, okay. No, no, keep going. Then we figured out, yeah, we have to have another one on the other coast of the US, stuff like that. Of course, we'll get to that as well. You just want to see me walking. Yeah. That's because you said you were tired earlier. Oh no, actually, I was a little bit late because I was stuck in the EPS meeting, sorry.

These are all server things, but yes, you can also have servers that are just not reachable, so the network is missing, and a missing network is a problem as well, thank you. Also, hardware can fail there. Yeah, very rarely, but that's another possibility. And unreachable backends, indeed: that's maybe what occurs more often than a server burning, an actual burning server. The first thing that comes to my mind with unreachable backends is the sysadmin guy who tripped over the cables. True story. I'm sorry. Not me, man. Anyway, the first thing is you have to make him remember; that's human behavior, so maybe find a forfeit for it, have him look after the keyboards for one week, whatever you want, but you have to make him remember.

On the hardware side, you can also handle switching and switch failure. The easy answer to this on Linux, for instance, but it also works on Windows, is to use network bonding. Nowadays, when you buy a server, it has at least one network card with two ports: use those two ports, bond them, and plug them into two different switches. It's really easy to do. When you have real network people, you can do LACP, which is harder but a more resilient and robust way to do the same thing: aggregating two ports, adding up their bandwidth while adding fault tolerance to your networking. That's the principle. Do you have any knowledge to share about switches or unreachable things?

Yeah, hi. Is anybody using hardware anymore? Isn't everyone running in the cloud or using virtual machines? And you're running it yourself? Yep. Okay, just asking.

So yeah, we do it ourselves. We buy everything, we host everything ourselves, and so we have to take care of these kinds of problems. And we use Gentoo in production; yeah, we use Gentoo Linux in production, which maybe a lot of you haven't heard about. We are some kind of crazy people, and when I say we're used to shit happening, maybe that's partly true. Any other thing to share about network resiliency? Okay.

Now let's get a bit deeper in the stack. Having a fail-proof stack can also help when it's not only about the hardware part. On nginx, there are two things I like to use. First, in nginx you can handle backend HTTP errors: your upstream comes back to you with a 500 error, so what do you do? Do you pass that 500 error back to your client, or do you try to handle it nicely? I'll show an example of this if you don't know about it: it's called a named location in nginx. We use this a lot. So when something bad happens, you can see the error_page directive at the bottom: whatever the error is, we change the error code to 200 to mask it from the user, while still serving some kind of pixel, because this is a pixel service. And if there was a redirect GET parameter in the URL, we can even still redirect the user to the correct page, even if our backend died or did something terrible. So that kind of little trick, named locations and error_page handling, can really save you from facing "hey, 500 error" calls from your clients. We use it quite a lot.
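A minimal sketch of the idea: the upstream address is hypothetical, and the real config also handled the redirect parameter.

    upstream backend {
        server 127.0.0.1:8000;          # hypothetical Flask/uWSGI upstream
    }

    server {
        listen 80;

        location / {
            proxy_pass http://backend;
            # handle upstream 5xx responses ourselves instead of relaying them
            proxy_intercept_errors on;
            error_page 500 502 503 504 = @fallback;
        }

        # named location: the client gets a 200 and a valid 1x1 GIF,
        # never our backend's error
        location @fallback {
            empty_gif;
        }
    }

The "=" in the error_page line means the response code comes from the fallback handler, and empty_gif answers 200 with a transparent pixel, so a dying backend stays invisible to the browser.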
You can also serve from cache: nginx has caching capabilities, so you can say, okay, if I get an error code from my backend, I will just serve a stale cached response. It's pretty handy as well. In your Flask application you can usually use stale caching too, which can be handy if your database is down: you can have some answers in cache and serve them from the stale cache. It's better to answer something than an error code.

Then you have multiple techniques to not lose data; this part is more focused on not losing data: spooling and task deferral. The basic case is that you get some data from an HTTP call, and this data is very important to you; you don't want to have to ask your client to send it twice. Even more so when, in our case, it's navigation data, a user browsing a website: we can't get this data back. Spooling means that whenever we receive the data, we are not forced to immediately insert it into the database: we can take it, write it somewhere on disk, and have another process fed with this data that inserts it in a safe way. If the backend is down, that process can just retry over and over, inserting this data long after you responded to the client. That's deferral (a small Python sketch of this follows below). There are also message-queuing technologies, maybe you've heard of them already: ZeroMQ, RabbitMQ, which are more resilient and can help you take data and turn it into a task. That's also the Celery philosophy, and Celery uses RabbitMQ as its message broker.

The important things here, to me and to us, are these: don't send error codes back to your clients unless you really have to; it depends on what you're doing, but you can often handle them even at higher levels of your infrastructure. And don't lose data: don't ask your clients to send that data again. You have ways and means to handle these kinds of failures without asking for it.
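A minimal sketch of the spool-and-defer idea, assuming a local spool directory and some insert function; real implementations (or uWSGI's built-in spooler) add locking, batching and back-off.

    import json
    import os
    import time
    import uuid

    SPOOL_DIR = "/var/spool/pixel"  # hypothetical spool directory

    def spool(payload):
        """Persist the payload to disk first; the HTTP response can go out now."""
        path = os.path.join(SPOOL_DIR, "%s.json" % uuid.uuid4())
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            f.write(json.dumps(payload))
            f.flush()
            os.fsync(f.fileno())  # make sure it really hit the disk
        os.rename(tmp, path)      # atomic: the consumer never sees partial files

    def consume(insert):
        """Separate process: retry inserts forever, long after we answered."""
        while True:
            for name in os.listdir(SPOOL_DIR):
                if name.endswith(".tmp"):
                    continue
                path = os.path.join(SPOOL_DIR, name)
                with open(path) as f:
                    payload = json.load(f)
                try:
                    insert(payload)   # e.g. a MongoDB insert
                except Exception:
                    continue          # database down: keep the file, retry later
                os.remove(path)       # delete only once safely inserted
            time.sleep(1)

The write-then-rename dance is the important part: once the client has been answered, the data survives on disk until the database is willing to take it.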
Do any of you use any of these techniques? Two, three, four... What techniques do you use?

Hi, I used to work for a WordPress hosting company, and a lot of what we did was basically rely on the reverse HTTP cache, because a lot of the content being served is actually just static content in a way. Think of a lot of people running websites that are glorified blogs: it's basically just static content after a while. The backends could fail all the time and customers would never notice. If you serve from cache, everyone's happy: the front page is up, the main articles are up, a lot of things are available. Especially when your website is basically a content-publishing platform, because that content doesn't actually change much; it's not very dynamic, so it works very well. I mean, you don't have to wake up every five minutes in the middle of the night; you can sleep through it and everything's fine. No one will notice, except the people trying to publish an article; if it's something really urgent, then yes, they will complain.

Any other users who want to share their experience, or what they're using this for? Yes?

To complete this: even on a website like an e-commerce site, you can use similar techniques, even if you actually need the database to insert the orders and stuff like that, because something like 95% of the content is static. So you can have something like Varnish serve the static content, then use some tiny JavaScript to fetch just the little part specific to the user: the user name, the basket, etc. I've seen it used to lighten the load on the backends a lot, and it's very effective. Even if you have one or two minutes of downtime on your backend, your users can still navigate the website and see all the products, and maybe by the time they add to the cart, the backend will be back up and you won't lose any money.

Yeah, I guess the conclusion here is that it's better to run even a degraded version of your website, or whatever service it is you run, than to have it fully down. It depends on the use case. Can it be argued? Do you want to argue it? Come on, we are here for that. I want to hear the counterpoint.

For example, if you charge money from a client, it's better to say "I cannot" than to take the money after several hours. I guess even that can be argued.

So, the next thing you can do is of course cluster your application: if one of your backends is down, or one of your databases is down, it still works. The bad thing is that even then, load balancers are still a single point of failure, so you can always go for more redundancy. And even if you have two load balancers and two UPSes, the whole data center can still go down, so you have to get another data center. It's kind of an infinite loop, but yeah, redundancy is good.

Okay, so now we get to the point where your data center burns. This photo looks pretty bad; I don't know if it was photoshopped or if it's an actual photo, but I was like, oh my god, I wouldn't want to be the sysops coming back after the fire in the data center room. On the upside, it's actually pretty simple: have multiple data centers. If you run them yourself, fine; if you use the cloud, as has been suggested, on Amazon you have this notion of availability zones that you should use. Make sure you do remote backups, whatever you do, and test them. In France we had a recent story where a big company lost its customers' data; they thought they had backups, because they were doing backups and remote backups, but when they tried to restore them, well, you can guess: it failed. There again, I wouldn't want to be the sysops over there, and neither would you, I guess.

On the IP routing and connectivity side, you have BGP anycast for having a single IP address accessible all over the world. Something I also appreciate is DNS health checking. For this we use Route 53 on AWS. Who knows about Route 53? Okay, not so many. It's the DNS service from AWS, where basically you can have geo-distribution-based DNS responses and add health checks to those DNS records, so if your data center, or one of the IPs to your web service, is down, it will not be answered in DNS queries anymore. It's pretty handy, and cheap as well.
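A minimal sketch of that Route 53 setup using boto3; the zone ID, addresses and names are all hypothetical.

    import boto3  # assumes AWS credentials are configured

    route53 = boto3.client("route53")

    # A health check that probes the primary data center every 30 seconds.
    health = route53.create_health_check(
        CallerReference="dc1-pixel-check",      # any unique string
        HealthCheckConfig={
            "IPAddress": "198.51.100.10",       # hypothetical DC1 address
            "Port": 443,
            "Type": "HTTPS",
            "ResourcePath": "/health",
            "RequestInterval": 30,
            "FailureThreshold": 3,              # failed probes before "unhealthy"
        },
    )

    # A failover record tied to that check: while DC1 fails its probes,
    # Route 53 stops answering with this record and serves the SECONDARY one.
    route53.change_resource_record_sets(
        HostedZoneId="ZEXAMPLE123",             # hypothetical zone id
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "pixel.example.com.",
                "Type": "A",
                "TTL": 60,
                "SetIdentifier": "dc1-primary",
                "Failover": "PRIMARY",
                "HealthCheckId": health["HealthCheck"]["Id"],
                "ResourceRecords": [{"Value": "198.51.100.10"}],
            },
        }]},
    )

A matching SECONDARY record pointing at the other data center completes the pair; clients then follow DNS to whichever site is actually alive.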
On the application design side, you have to think about geo-distributed applications. Who runs at least one geo-distributed service here? Okay, so I'm not talking to too many people. Still, it's a very interesting thing to do as a developer, and it's a real challenge as an ops, even when you just want one service available all around the world; and when I say service, it can be a database service. It's a nice thing to try and achieve. Has anyone had this kind of problem already, where they were relying on everything being in one place? Yeah? What happened to you, the whole data center?

Yes. Obviously I'm not an administrator of the network of some kind, but I've seen this: our main service was located in one data center, and it had a power failure. It ended up in about four hours of complete outage; no more-crucial infrastructure was located there, so we just called up our clients to say sorry. And afterwards, we apparently distributed it.

How much time did it take to distribute the whole thing?

I would have to ask my administrators, but I know that certain steps were carried out.

Just to add a little bit about how easily terrible stuff can happen to a data center, especially if it's not a big company but a small data center, or a service provider with a small hosting area, because I used to work in kind of the same environment: so many things can go wrong. We had a story, and I won't name the company, but basically it happened overnight. The night shift that was monitoring everything all fell asleep suddenly and missed all the alarms, and when the morning shift came in, the temperature in the server room, where a lot of our customers were hosting their services, was like 70 degrees. We opened all the windows and started trying to get somewhere. Basically, a lot of things can go horribly wrong, so choose your data centers carefully and try to really get more than one of them if possible.

Yeah, and contracts with your providers are usually not enough. Providers say 99.9999 percent, but not one hundred percent.

Luckily, this next one was a data center that was only used for development, but we had an air-conditioning unit that was running really hard, and it leaked water into the power outlet that was behind the UPS. So, no more uninterruptible power supply, and it proved that it was very much interruptible: it was down for two days. It was a major, major problem.

Yeah, you have to call your clients in the end, indeed. This must be very hard to explain; I wouldn't want to be in the sales department at that time.

The problem with geo-separated, distributed locations is not when things go down, but when things come up again. That's right. I've had a few times where services came back up and we had both of them active, because they couldn't see each other but the rest of the world could see one or the other. Then people start using both, and when the two sides see each other again, one of them has to decide to be the slave again, and weird things happen.

Yeah, that's called the split-brain situation.
Your cluster doesn't know who is right anymore, because you usually had two peers. That's why, in clustering and in general, what you should always do is have uneven numbers, and you probably already know about the voting strategy: okay, if I am in a disconnected situation, who is down, me or my peer? If you have only two peers, you have no way to know. You need at least three peers to be able to know: if you can't reach either of the two other peers, you're down. That's pretty solid; not always solid, but pretty solid. At least, always think uneven numbers, whatever you do.

Okay, so theory is great, but sometimes real problems are a bit more complicated, and it's not always ops stuff: it can really come from your code. That's what we are going to see.

So one day I was working like normal, doing my stuff, and one of our marketing guys came and told me: a client says they can't authenticate on the server, on the website, something's wrong. I was like, okay, I'm going to check the logs. This happens maybe ten times per day, so okay, let's see, maybe something's wrong. So I SSH to the machine, I look at the logs, and everything's okay. So, well, the client must be wrong. He goes away and I'm happy.

Something like one hour later, I'm still working, and the guy comes back and tells me that it's still not working for the client. So I sigh: all right, I'll check the code, maybe something's wrong. I look at my application and I see this. Does anyone see something wrong? Looking at it again, I noticed that this send-email function can't fail: if the email function fails, it still returns okay (the sketch below shows the pattern).

My conclusion to that story is that you have to know your code. Infrastructure is great, but code can fail too. Even if you don't like the guy who wrote the code, even if you don't understand the code: if you're the maintainer of something, you have to understand what you're running, and you have to refactor when needed. Errors should never pass silently; that's from the Zen of Python.
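A minimal reconstruction of that kind of bug; the function names are hypothetical, but the shape is the classic one.

    import logging
    import smtplib

    log = logging.getLogger(__name__)

    # The buggy pattern: the caller can never learn that sending failed.
    def send_email_bad(msg):
        try:
            smtplib.SMTP("localhost").send_message(msg)
        except Exception:
            pass        # "errors should never pass silently" -- and yet
        return True     # always reports success, even when nothing was sent

    # The fix: log it, re-raise it, and let the caller decide what failure means.
    def send_email(msg):
        try:
            smtplib.SMTP("localhost").send_message(msg)
        except smtplib.SMTPException:
            log.exception("sending email failed")
            raise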
Well, and yeah, don't always blame the other guy. Sometimes it's easy to say, okay, that's not my fault, it might be a server thing again. That's why the DevOps thing is great: you can really understand what's happening on your servers, even if you're just a developer originally. And the other way around is true, too.

Did any of you have a similar situation, where really weird things happened? Okay, now it's going to take brave developers to raise their hands.

I had a similar situation where they were saying, oh, this isn't working for the client, she's trying to do all these things; she had a really odd workflow. So I was thinking, how is this not working? All the tests are passing. I go onto the website and I'm looking, thinking this is all working fine, and I ran all the tests and they pass. What I didn't realize, and it took me like a week to realize while she kept coming back, is that I use NoScript, so I was happily using the HTML backend and everything worked for me. What I didn't realize was that if you enable JavaScript, the JavaScript uses a different API, and that was the thing causing the problem. So make sure to eat your own dog food and use your own API. It looked like it was working, but I didn't write that code... Yeah, but in the end you were responsible.

In Python we get used to the libraries we use raising exceptions. A really common one that doesn't is memcache: pretty much every memcache library will return zero instead of raising an exception. So wrap it, or do something like that. There are four or five places I can think of, in different projects we've been working on, where we traced back a "why isn't anything working?" to the fact that we thought memcache was working when it wasn't.

Yeah, I tend to like the memcache Python library, but because of this it can sometimes be a nightmare. It's like in Go: you always have to check the error that's returned.
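A thin wrapper is enough to make those failures loud. A minimal sketch, assuming the python-memcached client, whose set() returns 0 and get() returns None when the server is unreachable:

    import memcache  # python-memcached

    class CacheUnavailable(Exception):
        pass

    class LoudCache(object):
        """Turn the client's silent failure values into exceptions."""

        def __init__(self, servers):
            self._mc = memcache.Client(servers)

        def set(self, key, value, ttl=0):
            # python-memcached returns a falsy value when the set fails
            if not self._mc.set(key, value, time=ttl):
                raise CacheUnavailable("set(%r) failed" % key)

        def get(self, key):
            value = self._mc.get(key)
            if value is None:
                # a miss and a dead server look identical here, so treat
                # both explicitly instead of passing None around
                raise KeyError(key)
            return value

    # usage sketch:
    # cache = LoudCache(["127.0.0.1:11211"])
    # cache.set("answer", 42)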
Any other brave developer want to share about this? Oh yes, we have one here.

Yes, so my example is not really related to Python, but to PHP. And who thinks it happens more in PHP? Those are brave developers. In my defense, I did not write the code, but there is a very nasty thing when you try to autoload some class file and it has a syntax error: if you do not handle this properly, PHP dies and the web server returns a blank page with a 200 OK, and there's no other way to debug the issue.

That's a nice one. At the WordPress hosting company where I worked, we ended up writing some code in the reverse proxy that would detect exactly this sort of situation, the white pages, and alert us to it, just because it's such a stupid default. Why would you return a 200 when something's wrong? And yeah, it's horrible to monitor for.

I have to mention that Python breaks this rule itself: for example, hasattr hides the AttributeError exception, and sometimes that can make strange things happen when you switch to Python 3.

One thing that's not related to Python or any programming language, really: I had a server with a pretty large disk in it, and there were very, very many files on it. Suddenly a developer called in and said, hey, I think the disk is full. So I go look, do a df, and it's only 10% used. And he says, well, I can't write any files anymore. Okay, touch a file: it fails. Yeah, it had run out of inodes.

Yeah, that's a nasty one we often overlook. Absolutely right, and there are file systems that don't rely on inodes, so when you know that your application might spawn a lot of files, think about them indeed.

I have another story that I forgot to put in the presentation, so I'm going to tell it right now. Basically, at my old company, a small startup, before I worked at Numberly, we were trying to get things done fast. So basically our web server was running inside a tmux, and sometimes when we looked at the logs, we were just scrolling in tmux. And one day we were like, oh my god, the web server is not running anymore. Actually, it was just tmux: when you scroll, it sends a pause to your application. The application looked down just because we were trying to read the logs with tmux. Don't run things under tmux in production.

And about the DevOps philosophy, I don't know what kind of perspective we can add: who works as a DevOps, or in a DevOps-minded company? I see... I don't know if you're waving to say hello or... okay, just wait, I'll come over with the microphone, because we can't understand you in the back. Yeah, we should probably have thought about that a little bit earlier.

The DevOps question is hard, because your managers and everybody talk a lot about DevOps, but then they hire a guy as "a DevOps", in a DevOps position, and then it gets, you know, tricky: you're back to the silos. So we are developers, and there's "the DevOps"; so, back to developers and sysadmins. I was actually a developer who had to run to our admins to check why the fuck Docker was not working again. Oh, the Elasticsearch containers clustered with each other, how interesting. So in that sense, I was a DevOps, because I needed to worry about the code and about infrastructure mess-ups. So it's a tricky word. Yeah, it has a different depth depending on where you stand.

Which leads me to a question: who runs Docker in production? And can you share some experience with it? I'm interested, and I'm more interested in when it fails, obviously.

One thing we were actually doing, on a new project, more of a proof of concept, but we had already started to get it out to customers: we were working together with this consultancy who told us how to do Docker and Cloud Foundry, if someone knows Cloud Foundry. Our whole infrastructure would provide services for Cloud Foundry based on Docker, you know, spawn Redises, Elasticsearches, stuff like that. But the Docker cluster was actually one machine, with all the containers for all services. So, don't do that.

I can share a funny story where the Docker daemon crashed on the CI server. You can imagine having like fifteen highly paid developers who are just mixing the air. And it was also fun to debug, because, you know, who would have thought of it? But to get back to the previous point: how much effort would it take to implement something like supervisor, or whatever process, that would monitor the daemon? Probably not much (a minimal sketch follows below). Yeah, you're right; sometimes we are our own barriers.
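On that note, monitoring the daemon really can be that cheap. A minimal sketch, assuming dockerd runs under systemd; a supervisord program entry would achieve the same thing:

    # /etc/systemd/system/docker.service.d/restart.conf
    [Service]
    Restart=always
    RestartSec=5

With a drop-in like this, a crashed daemon comes back by itself after five seconds instead of idling a whole CI team.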
And yeah, I really agree with you about the DevOps thing. It's kind of a buzzword, especially for recruiters. But the way we see it, it's really not a single person being "a DevOps", but a team where you have people who develop and people who do ops, working together and understanding what the other is doing. That's very important; so is giving a developer the time to learn, and helping him understand what he's not used to doing.

And here's another real-world problem. We have our statistics in Grafana, a very nice board. One day I was looking at Grafana and I saw these statistics. It wasn't really important, because it was just the maximum processing time of one of our services; the average processing time was still very low, so we didn't really investigate. But it stayed like that for, I don't know, maybe two or three weeks or even more, I don't remember, and we never really understood what was happening, why the maximum processing time was so high. And then one day... My first idea, when I saw the graph drop so low, was: oh my god, the service is down. Actually no, it was still running. What had happened, and I searched for one or two hours to understand it, is this: I ended up talking with one of the most ops-minded guys in the team, and he told me, that's strange, because at that moment I deployed an Ansible playbook on one of those servers. So we looked at the playbook: what was the difference? And the only difference was in the /etc/hosts file. Basically, our DNS server was being queried at every single database insertion, so sometimes it was just overloaded. Just putting the IPs of the databases in the /etc/hosts file of each machine fixed the thing. Yeah, that was pretty weird.
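The fix itself is almost embarrassingly small. A sketch with hypothetical names and addresses, the kind of lines the Ansible playbook ended up deploying to every machine:

    # /etc/hosts
    # pin the database hosts locally so inserts stop hitting the DNS server
    10.0.1.11    mongo-primary.internal
    10.0.1.12    mongo-secondary-1.internal
    10.0.1.13    mongo-secondary-2.internal

A local caching resolver, as suggested in the discussion that follows, solves the same problem more generally.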
And sometimes it's other stuff: it's not just resolving the database server. You'd be surprised how much code is also doing reverse DNS, so you have to care not just about forward resolution but also about reverse; it happens quite a lot. That looks like this problem, to be honest. We felt pretty stupid with this one as well.

And this one is pretty interesting, because two days ago you gave a presentation about using Consul. Sounds very contradictory. Yeah. And the question is: have you tried to put in a local DNS cache, I don't know, BIND 9 or something like that, with a TTL around 30 seconds or a minute? Would that do the trick? Yes, absolutely. What's embarrassing with this is that we also lacked consistency in what we do: on another type of infrastructure we have a local DNS cache, but there we didn't have it. And the aim of the people working on the Ansible playbooks is also to start normalizing all of this. So maybe we think that we have something in production, and it's been running for so long that nothing can happen to it, and we sometimes tend to forget about its resiliency or performance, or to apply the latest of our knowledge, just because it's running: I don't care, I don't need to bother, unless something weird happens. In this case it was good news, you know: we were satisfied with this shitty processing time, but for another type of application it might not be so.

So I think one good trick is to always profile your applications, at least once. I recently used vmprof, from the PyPy guys, and it actually slowed down my service by only around 5%, so it's viable to just switch it on on one instance and check what your code is actually doing.

Actually, in that situation I had profiled the code, and I didn't get the same results as what Grafana showed; that's why I was like, what the fuck, it's not working as it should. And that's also why it wasn't really important, so like I said, we just let it go. Well, it was a good surprise when it got fixed. We came up only with embarrassing examples, so that you feel more comfortable sharing yours.

I have a comment regarding performance-monitoring tools: you really have to configure them properly. We had a situation where the average response time was between 30 and 60 seconds, and it was caused by file uploads: 5-gigabyte files were being uploaded by a few people, and that increased the average response time.

It's the same kind of problem when a metric suddenly goes down and we think, oh my god, my application is down; but no, it's still running.

Who is using, or rather, who is not using a metrics system? Who doesn't do metrics on their applications, who doesn't have this kind of graph? Nobody; you all have it. Okay. I see waving people again.

A question; I was precisely going to ask this kind of question. Have you managed, in the end, to put percentile graphing into Grafana, so you can see the 90th percentile, the 95th percentile, the 99th percentile of the response time, and catch that kind of problem? We have been trying, and we're having a hard time with this. We are deploying Prometheus plus Grafana, and using that with Elasticsearch and InfluxDB, and we are having a lot of problems really calculating the 99th and 95th percentiles. Have you managed to do this?

Basically, what we use in general is just a comparison between the current day and the same day seven days ago; that gives a good idea of whether things are normal or weird. And using Carbon and Grafana for the visualization, you also have the annotation feature, which is good: you can plug it into your deployment or continuous-delivery tooling, so you can have a bar on your graph saying, okay, from this point on, this is version 2.1, and then you can do metrics comparisons related to code deployments. In disaster recovery it's also a good thing to have, so you know when you broke something. And you can do the same with server provisioning: at this point in time I added a new server, and maybe it had some weird side effects.

For the percentiles: you mentioned you've already got Elasticsearch. If you don't have aggregates but have the actual requests logged there, you can just use Kibana, because it has a really nice visualization which, if I recall correctly, gives you the percentiles as well. I think you should get that for free from Kibana. The answer, for the remote audience, is that the problem is combining it with Grafana.

Did anyone else come with a question? I guess we have, yeah, four minutes left. So, open discussion now.

A question: I'd like to know if anybody has some experience with deploying a new version of their backend to only, let's say, five percent of the users, trying it out, seeing how it handles, and then going for 100, especially with nginx. So, progressive deployment. Anyone? Thank you.
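On the nginx side, the crude version of this can be as small as weighted upstreams. A minimal sketch with hypothetical addresses:

    # send roughly 5% of the traffic to the new release
    upstream app {
        server 10.0.0.10:8000 weight=19;   # current release, ~95%
        server 10.0.0.20:8000 weight=1;    # canary release, ~5%
    }

    server {
        listen 80;
        location / {
            proxy_pass http://app;
        }
    }

Watch the error rates and response times on the canary, then shift the weights as confidence grows.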
Yes, for the same company I was talking about a bit earlier, the e-commerce one: we were always ramping the traffic up, with HAProxy in front; I think it was HAProxy, then Varnish, then nginx, in that order. The new servers were ramped up with like 10 percent of the traffic on the new version; then we had software error monitoring and all the metrics on those servers, and we were checking that the response time was not doubling, etc. After a few hours, a few more servers were joining in, and at the end of the day all the traffic was rolled over. Does that answer the question?

Depending on your stack, I can relate: we do a lower-level version of this. We run our Python under uWSGI, and in uWSGI you have a feature called touch-chain-reload, where your workers are reloaded one by one, and uWSGI makes sure that the one being reloaded came back up correctly before reloading the others. So it's a good failsafe, low-level deployment trick.

And on a side note, if you are really committed to trying canary releases, named after putting the bird in the mine to see if it survives, try Kubernetes, which solves this problem in a very reliable way. It may be too complicated for your case, but it has exactly this kind of procedure, where you say: I have a rolling deploy strategy, I want to keep N pods, a pod being your deployed application, and it will gradually increase the number. Kubernetes has this feature and it's very nice. But Kubernetes still doesn't have health checks, right? ... Oh, okay, so yes, it has them; okay, thank you. That was one of the really bad things, I thought. Okay, great: readiness checks, that's right. Thank you, Paul.

Yes, I especially want to thank these guys, because this is an interactive format, and it's a little experiment, just so we don't have only one-sided talks and can try another direction. I want to thank you for taking a leap of faith in the first year we tried this. So please give these guys an extra hand.

Thank you very much. And now, lightning talks are up. Thank you, thank you.