Good afternoon, everybody. Everybody's still awake, more or less? Let's see if we can make this scary enough in terms of failures to keep everybody awake.

The topic is "Why is it always DNS, TLS, and bad configs?", which hopefully makes sense to everybody who is here. Otherwise, at some point you might say: ah, yeah, I remember that pattern. In my mind, how that looks is this: I'm not sure if you remember Harry Potter, but there was this scene where McGonagall asks, "Why is it always the three of you when something happens?" And that's pretty much where I see DNS, TLS, and bad configs sitting. Maybe they're sometimes just a side effect when something bad is happening, but that's the common trend you see: whenever there is an outage or something is broken, one of these three is not far away, or maybe all of them combined, just like in Harry Potter.

So I was looking for an example for DNS, and I took one from Akamai, where they said: well, we made a bad DNS configuration and everything disappeared. I have my own anecdote for that, and I'll ask at the end who recognizes themselves in it. A while ago, at one of my previous jobs, we were moving DNS servers, I think back then from Gandi to AWS. We had different domain names, and every domain name has its own set of DNS servers. But I made the stupid mistake of copying the DNS servers of the first entry to all of the domains. And I think the time to live for our DNS records was 24 hours. So, as you can imagine, at first everything worked perfectly. We did that, I think, on Friday evening or Saturday morning, when not much is happening. And then on Sunday we woke up and could basically watch our stuff disappearing from the internet as the TTL on the old entries ran out. And then you can only watch, because the TTL is 24 hours.
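A copy-paste mistake like this one can in principle be caught before the TTL runs out by comparing the name servers each domain is actually delegated to against the plan. The following is only a minimal sketch under stated assumptions: the domain and name server names are made up, and the observed values are assumed to have been collected separately (for example from `dig NS` output), since the Python standard library has no NS lookup.

```python
from typing import Dict, List, Set


def find_wrong_delegations(
    expected: Dict[str, Set[str]],
    observed: Dict[str, Set[str]],
) -> List[str]:
    """Return the domains whose observed name servers differ from the plan."""
    wrong = []
    for domain, planned in sorted(expected.items()):
        # Normalize case and the trailing dot that DNS tools usually print.
        want = {ns.lower().rstrip(".") for ns in planned}
        seen = {ns.lower().rstrip(".") for ns in observed.get(domain, set())}
        if seen != want:
            wrong.append(domain)
    return wrong


# Hypothetical plan: every domain has its own set of name servers.
expected = {
    "shop.example": {"ns-1.dns.example", "ns-2.dns.example"},
    "blog.example": {"ns-3.dns.example", "ns-4.dns.example"},
}
# The mistake from the story: the first domain's servers copied everywhere.
observed = {
    "shop.example": {"ns-1.dns.example.", "NS-2.dns.example."},
    "blog.example": {"ns-1.dns.example.", "ns-2.dns.example."},
}

print(find_wrong_delegations(expected, observed))
```

Run right after a migration, a check along these lines would flag `blog.example` while the old records are still cached, instead of a day later.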
So then we fixed it and waited for it to reappear. That was my story of how DNS can be tricky, especially with the time to live: sometimes you can just helplessly watch. And don't make stupid copy-paste errors. Does anybody have a similar story of DNS, where they did something similar? Okay, a couple of hands. Yeah, that's, I guess, how you learn and how stuff happens.

And then TLS is another one of these classics. Who has never had a TLS certificate expire on one of their production systems? Just very few hands going up, because it's a very recurring thing, and I have countless examples of where it happened. One of my favorite ones is that the Gradle plugins site had a TLS certificate that, well, expired, unsurprisingly. That was in November 2021, and there was this Discourse thread around it. What was funny is that exactly one year later, in November 2022, they had the same issue again. And we'll see what will happen in November of this year. So maybe there was some learning at some point, or maybe it's just user driven. I think there was this old saying about Microsoft, that Microsoft doesn't test software: they release it and wait for you to report bugs. And this is kind of the same thing with TLS: you can basically not monitor it and just wait for users to report problems, and maybe that's their approach here.

And then you have bad configurations. There was the one big Facebook outage where everything disappeared, which was a while ago. Looking at you, most of you look old enough to remember Facebook, because when you talk to students now, basically nobody uses or knows Facebook anymore. But for the old generation Facebook is still a thing, and it was a big thing when it disappeared. DNS was the thing that was seen disappearing first, but in the end it was a bad configuration, where I think they had a health check that was checking something, and that check was wrong. And then it stopped
advertising the BGP routes, and then everything just disappeared. And again, it took quite a while to fix that. Allegedly, because everything at Facebook is driven by APIs and the Facebook platform, legend has it that the access to the server room was also controlled by something that depended on that DNS lookup, and they had basically dropped the DNS entry. And allegedly they had to chainsaw their way into the server room to fix it. I'm not sure if that is an urban legend or really happened, but it sounds very believable that bad configurations can have those effects.

Cloudflare is always very good about writing these blog posts, and they had an explainer about the Facebook outage back then. The funny thing is that not so long after, they had their own outage, where they were trying to fix a long-running stability issue and that took down everything in their network. The unfortunate thing for them was that they made the change and everything looked good. It's almost like the TTL story. But they have a couple of core components, which they call the spine, which serve most of the traffic, and they triggered a bug that was only there in the spine. So they rolled it out in the smaller regions and everything looked like it was working, and once it hit that spine, the busiest locations, everything started dropping. Only then did they detect it, roll it back, figure it out, and fix it.

So, coming back to Harry Potter: there are the things that they call the Horcruxes. And I think DNS, TLS, and bad configurations are almost like the Horcruxes of IT. They are the things that are sitting there, and we can't live without them, but we also have to live with them to some degree. So how do you take out the Horcruxes, or what was the solution in Harry Potter? Yeah, you basically fight, I don't know, fire with fire, or you had something to fight them with. In the case of Harry Potter, you had this magic tooth from the snake.
I think that is what you could use. Unfortunately, we don't have the magic tooth of a snake, so my take is that the magic tooth that we have, and that we should maybe use more, is our health checks. I'll give you a quick overview of my mental model, and we can have a discussion afterwards if you agree or disagree or think it should be done differently. Health checks basically ask: is this thing up and is it running? And if you structure and combine them the right way, you can actually detect things a lot quicker than by just waiting on your users. I mean, waiting on your users to shout is also a way of alerting, but maybe it's not the way you want to be alerted.

So you could structure it in a way that you have some health checks outside of your system to check it from the outside, you have some on the network, and you potentially even have some on the host. And as you build those and see where stuff is failing, you can pinpoint the problem relatively quickly and easily.

So, from outside the network. It's kind of a made-up example, but let's say we're hosting on AWS and I put my health check on DigitalOcean. Then I can check from the outside: Is the DNS lookup working? Is the uplink of AWS there, or can I even reach that provider from DigitalOcean or some other region? Did I configure the firewall in front of my application correctly, or am I dropping the traffic there? Is my load balancer active and serving traffic correctly? Is the service itself running? And how is the latency? I can also compare that latency to the one on the inside of the network. With these signals I have a decent overview from the outside.
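As a rough illustration of such an outside check, the sketch below resolves a name, opens a TCP connection, and measures how long the connect took. It is not how any particular monitoring product works; the function and field names are made up for this example, and it only covers the DNS, reachability, and latency questions from the list above.

```python
import socket
import time
from typing import NamedTuple, Optional


class ProbeResult(NamedTuple):
    dns_ok: bool
    tcp_ok: bool
    latency_ms: Optional[float]


def probe(host: str, port: int, timeout: float = 3.0) -> ProbeResult:
    """Resolve the host, then time a TCP connect to the first address."""
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        # The name did not resolve: a DNS problem as seen from this location.
        return ProbeResult(dns_ok=False, tcp_ok=False, latency_ms=None)

    family, socktype, proto, _, address = infos[0]
    started = time.monotonic()
    try:
        with socket.socket(family, socktype, proto) as sock:
            sock.settimeout(timeout)
            sock.connect(address)
    except OSError:
        # Refused, filtered, or timed out: firewall, load balancer, or service.
        return ProbeResult(dns_ok=True, tcp_ok=False, latency_ms=None)

    latency_ms = (time.monotonic() - started) * 1000.0
    return ProbeResult(dns_ok=True, tcp_ok=True, latency_ms=latency_ms)
```

Calling something like `probe("www.example.org", 443)` from a machine at a different provider would give the user's-eye view; running the same function from inside the network gives the latency to compare against.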
It's more like the user's perspective, where I can say stuff is failing or not failing. It doesn't tell me so much yet about the inside.

Once I'm in the network, I could for example test across the availability zones that I have, running my health check from another availability zone. I then check that the network within the cloud provider is working as expected, because sometimes the network drops there, because whatever router or switch exploded on their side. Did I configure the firewall correctly on the inside? You could have TLS checks for valid certificates here, or even on the outside. I can check if the service is available here, and how the latency is. The last two points, in comparison to the previous slide, already give you a good indicator: if you can only reach the service on the inside of the provider, then it's probably something at the provider that is causing the issue. Or if the latency here is much lower than on the outside, then it's potentially an uplink problem. And if it's working within the provider's network, then at least you know whether the problem sits between the cloud provider and the outside or in my service.

You could also have the service check on the instance itself. So you can check on the instance: can I reach my service? For example, if you have a proxy in front of it, can I reach the service without the proxy? Maybe the proxy is down, so you can easily see: oh, my app is fine, but it's not reachable because of the proxy. Having these checks can pinpoint quite easily where the problem is. You can again look at the latency: is the application fast on localhost? Then it's a network issue. Or is it a service issue, where the service is slow in general? You could also check, for example: is my database up, and did I configure the firewall on the database server correctly?
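The reasoning across the three vantage points can be written down as a small decision function. This is just one way to encode the mental model described here, not a complete diagnosis; the names and the latency factor of 5 are arbitrary assumptions for illustration.

```python
from enum import Enum


class Suspect(Enum):
    NOTHING = "reachable from the outside, nothing to pinpoint"
    EDGE = "DNS, uplink, or outer firewall between provider and internet"
    INTERNAL = "proxy, load balancer, or network inside the provider"
    SERVICE = "the service itself"


def locate_failure(outside_ok: bool, inside_ok: bool, host_ok: bool) -> Suspect:
    """Narrow down a failure using checks from three vantage points."""
    if outside_ok:
        return Suspect.NOTHING
    if inside_ok:
        # Works inside the provider but not from outside.
        return Suspect.EDGE
    if host_ok:
        # The app answers on its own host but not across the internal network.
        return Suspect.INTERNAL
    return Suspect.SERVICE


def uplink_looks_slow(outside_ms: float, inside_ms: float, factor: float = 5.0) -> bool:
    """Flag the uplink when outside latency dwarfs latency inside the provider.

    The factor is an arbitrary threshold chosen for this sketch.
    """
    return outside_ms > inside_ms * factor


print(locate_failure(outside_ok=False, inside_ok=False, host_ok=True).value)
```

The order of the checks matters: each branch is only reached when every check further out has already failed, which is what lets a handful of cheap signals point at one layer.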
That way the database connection is covered by a health check as well, so you can check from different angles where stuff is failing and then hopefully figure out quite quickly why things are in the right state or in the wrong state.

One thing I didn't do so far is introduce myself, which I always keep for later, because I don't think I'm so important or so interesting. But why am I talking about this? I work for Elastic, the company behind Elasticsearch. We have health checks like that, and we also use them a lot internally, and that's where this entire idea for a talk came from.

So how do we look at health checks, or what can we do? We have a component that can do health checks, which we call Heartbeat, which is the worst name in the world because everything is called heartbeat. For example, for Linux services, if you call something heartbeat, you have five other things that are called heartbeat. So we should have picked a different name, but that's what we did. And here, this is showing very simply how you could implement these different checks. For example, I check with ICMP whether the host is available in general, and I just ping it every five seconds, for example, with a cron-like syntax. I run a TCP check against my database to check whether MySQL or Postgres is reachable. This one is only TCP because it doesn't speak HTTP. Or is the status page on my website available? I check that every five seconds to see if those are up. Signals like that are very cheap to run. Whether you do that every five seconds or every 30 minutes or whatever, you don't need to keep the data for very long, but they do give you a very quick overview of how things behave. I'll show you the thing live in a bit. And you could make your health checks a bit smarter and go deeper, for example here.
I have something where I can add something: I do a POST, where I actually post whatever data there, I'm expecting a 200 back, and the next page after that should say "saved". So you can actually simulate flows to some degree.

Just to show you that quickly live: do you want to see the code first or the outcome? Code? It's always code, right? Is that large enough for you to read? Yes, kind of. Okay.

So I kept it very simple. I'm taking devconf.cz every 10 seconds, and I want to test that page. I also take the HTTPS page to see whether it is reachable through HTTPS and whether the certificate is valid. And then it actually turns out that this is only a redirect at this point, and you're actually being taken to devconf.info. So I take that page, where my cursor is, and I check that every 10 seconds. I expect a 200 back here. And you can also, for example, say that the body should contain "Brno" and "community conference", but that I don't want to have the term "promotion" on there. Otherwise my check will fail. Sorry, I'm also adding a bit of metadata: at the very bottom I say where I'm running this from. So this is running on my laptop, and we could draw it on a map. If you have multiple locations, you could see stuff like that and compare latency and everything. So how does that
look like? Is that also large enough for everybody to see?

So we can switch it to the last 15 minutes, where this was running and everything worked. We could also switch it to the last, I don't know, two hours or so, because then I was trying out stuff and things were failing, so we can also see things failing. These are some of the monitors that I've configured. For example, we have the HTTP and HTTPS checks for devconf.cz, and they are okay. It also shows me when the TLS certificates are expiring. There is a dedicated page for that here, where you can see that from this check we basically picked up the TLS certificate. This one is three days old and this one is 29 days old, and they are valid for some more months. So we don't need to be too worried, but you could then just slap an alert on top of those and say: 15 days before they expire, you get an email, and you should probably start replacing those if that's not working automatically.

In the uptime view, we could look into one of those. For example, here it tells me that this is working so far, it's up, and it's running from my "Philip at DevConf.CZ" node. We can also see it over time. At some point I changed the name, because I think "Philip" was easier than "DevConf" again. In context, you can see how the response times developed over time. I'm not sure why it is spiking a bit here, but it's conference Wi-Fi after all, so I'm positively surprised that it's working so well.

The other thing that you see here is that by default a redirect is considered successful in my tests. So, for example, if I changed that to expect a 200, my test would fail. I'm not recording the body here, but you could also record the body. I mean, the few bytes of the redirect itself are not super interesting. The one with the body is, well, here
I had it misconfigured, because I had it without the www., which is another redirect that is set up, but I expected a 200 to come back. So you have to put it together the right way. And if we scroll down here, you can see that we got a 302 back, but we were expecting a 200, so it was initially failing. And, well, the HTML was basically "this is a redirect". So the check worked as expected in that it failed here, but then I fixed my test, and since then those have all been green and test as they should.

Okay. One other thing that is interesting in that context is synthetic monitoring. I guess everybody has seen health checks and is doing health checks one way or another. Is anybody doing synthetic monitoring, or synthetics? One, okay. So with synthetics, the basic idea is that you are simulating the browser and you do more actions. Sometimes I get the comment: oh, this is a new fancy name for something that we've been doing for decades. Or some people have been doing it and just never knew what to call it. Simulating the end user is not exactly a new thing. Synthetic monitoring as a term, I guess, is not as old, and the way most people, or many people, are implementing it is Playwright. Has anybody used Playwright before?
It's basically a way to write JavaScript code to test your pages. What it looks like is something like this: you import Playwright, and then you say go to whatever page, and then on that page click a button, or expect that there is a "new todo" element with a placeholder, and then expect, I don't know, "What needs to be done?". So you can really simulate the user.

The idea of synthetic monitoring is basically that you don't just want to get a 200 back, but you want to simulate the flow of your users. Let's say you're a web shop, and the most important flow in your web shop is that somebody puts something into the cart and then checks out. If you break that, you don't want your users to tell you that you broke the checkout process two days ago. You want something to continuously run through those main steps to be sure that, whatever changes, it is still working. So that is the idea of synthetics: rather than waiting on your users to shout that something is broken, you take the most important flows and constantly test those yourself, like every (bless you) every five minutes or so. So whatever database is down or whatever you deploy, you know whether that main flow is still working and operational. I'll show you that in action in a moment.

You can write the code to do that, and it looks something like this. Many people will say: we don't want to write "click that button, search whether there is an ID with that name, and extract that value". So there's also a way to basically record you clicking around in the browser, and it will just extract the rules. There are multiple implementations for doing that. I think Playwright is one of the more active or well-maintained ones right now. It's mostly driven by Microsoft, but Playwright is quite widely used in that browser simulation testing environment, and you can use that. So, without much code or anything, I have enabled two
monitors, so I'm just checking devconf.info from two different locations. I'm running this from Germany and the US, and I guess you can already see it: I assume the data center where this site is running is in the US, because the latency there is much lower than from Europe. We could add as many flows as we wanted to, but here you see the site tested from these two different locations, and you can see what the test looks like.

We can go to that monitor and see all the iterations that we ran on that page. I'm basically only going to the main page and displaying that right now, to keep it very simple. You could edit the monitor, and then you can either upload the script or write whatever steps you want to have in here. So we could also change that to www., which is the right one, and then, I don't know, we could go to .cz directly. Then I can run this test. It will take a moment until the test has run through. So you can simulate that right away, and once you're satisfied with the changes you have made, you can save and store it. Okay, this looks good. I'll update my monitor, and the next run will do that.

But I actually wanted to get back in here. You can basically see a screenshot from every time we ran this, and what is also nice is that you can see all the single steps. For example, if you have 20 steps in your workflow, it will take a screenshot and show you how long each one of these steps took. You can also see the weight of the page: for the DevConf page, what are the dependencies, and what are you loading? So how fat is the page? Where did you spend your time? And it also knows the main Google metrics like Largest Contentful Paint, etc.
So it can extract all of that from the flow that you have shown, and it will tell you the weight and the timing and how long all of that took. So it takes testing quite a bit further than just checking whether the server is returning a 200. But you can get to the ...

So, to wrap up. The two main concepts here, I think, are these. First, the health checks, which are cheap and fast and give you a good overview, and they don't need a lot of data. Everybody's talking about observability nowadays, and observability is normally quite expensive. I think a rule of thumb that many people have around observability is that 10% of your entire infrastructure cost should go into observability, because full instrumentation and monitoring of everything will take that much overhead in terms of storing the data, extracting it, and processing it. Health checks in comparison are very small and cheap to run and can give you decent value. They are normally not very good at telling you why something broke. They're very much on the classic monitoring side: they tell you something is broken and you should take a look at it. Then, once you know something is broken, you can look at logs, metrics, traces, whatever other information you have collected. But as a first step, the health checks are often a good indicator that something is wrong and you want to fix something in your system.

Second, once you want to go further and you have established the main user flows that you always want to test, which are more complex than just a 200 return, you could do something like synthetics, where you play through the same steps. Also, maybe compare it to a week ago.
Say it took one minute to run through this, and now it takes one and a half minutes: you probably deployed some code that is slower, or you have some calls that are taking longer. So you can use that for checking correctness, but also for checking speed and drift in development, and what has changed over time.

And that's it. Now it's time for everybody's questions, and your own horror stories if you want to share them. Any questions?

Do we support only Playwright? In our product, where we have embedded all of this, yes, we only support Playwright right now. Though, I mean, Elastic is also more of a platform. So if you have anything else that spits out logs, ideally JSON or something like that, you could just throw it in and then build your own visualizations. You won't have dashboards as fancy as the pre-built ones, because those are tightly packaged together. So yeah, we have picked Playwright. What were you thinking of in terms of alternatives to Playwright?

Selenium, okay. Yes, I think Selenium is the first or classic one that started a long time ago. Though, I mean, Playwright is client-side as well, or it can be. I ran that in a centralized way from a cloud instance, but you can have your own runners wherever you want with Playwright as well. So while there are some differences between Selenium and Playwright, I would say it's a very similar type of tool. It's just that the syntax and implementation are different. I think Selenium is how old, 10 or 15 years at this point? Which is, I mean, part of its success, that it has survived so long. But I think there are some things around Playwright that are, I don't know, more cutting edge or hipper. Which is not to detract from Selenium.
I always say it's the Jenkins of client-side testing, and then people are never sure whether to take that as praise or as an insult around Jenkins, but that's a different discussion. You also use Jenkins? Well, it's tried and tested, and I can also say from our side that getting off of Jenkins is a very big task once you are heavily invested in it. But that's also another discussion.

Any other comments or questions?

What is my favorite Harry Potter movie? That is slightly unexpected now. I think I personally enjoyed the later, darker ones more. I feel like there is a very strong progression for people who grew up with them: the first ones are more kid-like, and the later ones got more adult or darker. I was maybe a bit too old for them, so I enjoyed the later ones more.

Do we have any stories? Any uncommon stories, you mean, for testing? Yeah, so that's actually a good point. I think there are two components to that as well. The synthetics are more like: we simulate the user and we gather all that information. We do have another component to gather what the end user is actually doing. It's normally called RUM, I think. Not the drink, but real user monitoring, which is a JavaScript snippet that is injected into the page, and then you can basically see what users are doing. There you can follow their weird behavior and learn what they've been up to. Whereas this is much more the proactive monitoring: we build some scenario to run through it. Did that kind of make sense?

Sorry, it was about skewing. Okay. Yeah, so the thing about the skewing is that you should normally have a special header or something like that to filter out the data. Also, when you do the flow that puts something into the shopping cart, you pass a secret parameter that says this is a test order, so it will never hit the system.
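To make the idea of marking synthetic traffic concrete, here is a minimal sketch of filtering on a dedicated header. The header name `x-synthetic-test`, the accepted values, and the event layout are assumptions invented for this example, not a standard or any product's convention.

```python
from typing import Dict, Iterable, List, Mapping

# Hypothetical header name that the synthetic runner adds to every request.
SYNTHETIC_HEADER = "x-synthetic-test"


def is_synthetic(headers: Mapping[str, str]) -> bool:
    """True if the request carries the marker header with a truthy value."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == SYNTHETIC_HEADER:
            return value.strip().lower() in {"1", "true"}
    return False


def drop_synthetic(events: Iterable[Dict]) -> List[Dict]:
    """Keep only events from real users, e.g. before counting orders."""
    return [e for e in events if not is_synthetic(e.get("headers", {}))]


events = [
    {"order": 1, "headers": {"User-Agent": "Firefox"}},
    {"order": 2, "headers": {"X-Synthetic-Test": "true"}},
]
print([e["order"] for e in drop_synthetic(events)])
```

The same predicate could just as well route the marked events into a separate bucket instead of dropping them; the important part is that the marker is checked in one place.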
I don't think we had any occurrences where that went wrong, though I've also heard stories where testing was basically taking down the system or creating lots of fake orders, because somebody forgot, or somebody removed, the check that this is a test order. Which is kind of a classic, but I'm not sure there's a solution for that, other than writing a very big comment like: please, this is important, otherwise bad things will happen.

But yeah, normally you should have a dedicated header or something like that, so that your system can figure it out and you can just filter those requests out. What you might even want to do: when we set up things like that, for example with health checks like the ones here, we would often have a rule that whenever we ingest logs, we drop those logs, because they are just noise. So we normally try to have a flag, because you don't want to see the logs and traces and anything else of those test users anyway. It's just garbage that your system would otherwise incorporate. Or, depending on how you treat it, you could just collect it, put it in a separate bucket, and filter it out there. But yeah, you want to mark those. I think for tracing and OpenTelemetry, for example, that's also the recommendation: always have a dedicated header, and then you can just filter on the header to get rid of those. Sorry, I didn't get the question right at first.

Anything else, or is it beer o'clock? It's always beer o'clock. Yes, it is always beer o'clock. Thanks a lot for staying so long. I'll see you at the social event later on. Thank you.