I'm actually surprised so many of you showed up this early after yesterday's great party. My name is Leon, and today we'll be talking a little bit about testing, monitoring, and production. I'm really awful with titles, as you can see. So, the slide about me. This is me; I'm pretty sure that was right after I wrote my first Hello World. I've been doing this long enough that I hate pretty much every technology equally. And for many years I've been working at OmniTI, where I get to get my hands dirty with all the really big systems, which is really cool. I'll have the slides posted on SlideShare, maybe with a different background color, but they'll be there; I'll tweet it out. And like Jason mentioned, spend a few moments and say hello to the speakers here. We had a lot of great talks yesterday, right? There you go, a round of applause for some of the first-time speakers. It's always great, whether you're a veteran or a new person.

All right, let me start with a little bit of heresy: I hate testing. And before some of you rise out of your seats to rush the stage, let me explain. I think testing is absolutely important. If you don't have some sort of testing in your pipeline, then you're doing something really, really seriously wrong. But I also think that testing is completely not enough. Show of hands: who has at least some sort of testing in your development pipeline? Good; that means I'm less likely to be woken up in the morning because of you. The thing is, testing can give you a false sense of security: all my tests are passing, I'm pushing to production, and that's it, end of story, right? It rarely happens like that.

So there are a couple of problems with testing. First of all, of course, there's the data problem. No matter how much you try, and I've argued this with people many a time, you cannot replicate production. You're either going to have the one extra record that's going to be the straw that breaks the camel's back, or you're going to have some weird traffic pattern in production, or you're just not going to expect certain inputs that are going to come in production. By the way, who here knows the person with the longest last name in the world? How many of your applications have been tested to support a name like that?

Which, of course, brings me to the second problem, and it's probably the biggest problem: users. There's always going to be that next user who comes in and goes, oh, I wonder if I can make this even worse. So, how many people here have played World of Warcraft? Come on, there must be more of us here. For those who don't know, World of Warcraft is a massively multiplayer online game. It used to be the biggest, most popular one, and it's probably still up there. You get to play orcs, dwarves, elves, whatever you want to be. And they're also known for a lot of really, really interesting bugs they've introduced; those who play actually know. So, anybody know what the Corrupted Blood bug is? Yep. To keep the game interesting, the developers keep coming up with new, inventive content with different mechanics. Can you hear me now? Oh, I'll try to speak a little louder. Is it better now? I'll take what I can, thank you. So anyway, they keep trying to come up with more and more inventive ways and different game mechanics to keep things interesting.
So with one patch, they came up with a boss who would put a curse on a player in the group. It would do a lot of damage, but the caveat was that if that player stood next to another player for a couple of seconds, the curse would jump to the other player. So if you weren't careful enough, you'd basically wipe the whole party. The problem is, the developers did not account for users. And what does the first user do? Try to exploit anything they can. I believe it was day two of the boss being in production when one user got the curse and teleported to town. The mechanics of the curse were that it would jump to any player, human or NPC. This is what a major town looked like. The bigger problem was that the curse would also jump to more people when somebody died. And since the town had low-level characters, the curse basically spread across the whole realm. They had to do rolling restarts of the servers just to get rid of the bug, because the realms were empty; you could not survive anywhere unless you were hidden deep in a forest somewhere. It was actually studied as an in-game plague, because it really had the same effect.

Aside from that, of course, and going back to our original topic, there are other factors in why testing is insufficient. There's always a lack of foresight; Y2K is probably the best example of people not thinking things through. There are always too many use cases that you can't, or don't, check for. And I'll give you one last World of Warcraft reference. When they released the first big dungeon for high-level players, about six months after the release they started receiving bug reports that some players could not enter it. To enter the dungeon, you had to jump through a window, and after that you appeared in the dungeon. And some players were reporting that they didn't fit through the window. But that's not all. After some debugging and looking through it, it turned out that the only players who couldn't get in were female Tauren. Tauren are a race of minotaurs, bulls, who are supposed to be wise, one with nature. And female Tauren were their counterparts, which were effectively walking, talking cows. The reason that bug was not noticed for six months after it went to production is that nobody wanted to play a walking, talking cow. Out of the millions and millions of people who played, nobody got that specific combination of character to a high enough level to try it out for six months.

And of course, there are changing assumptions, right? How many of you had to develop something for one use case, and then, when it launched to production, the business said, nope, we're doing something completely different? So, to summarize a little bit: testing is great for known knowns. When you know about something, you can test for that particular case, and you're great; that's what testing is for. Testing is okay for known unknowns; to a certain degree of certainty, you can plan for problems. But testing is really bad for unknown unknowns. You can't test for what you have no idea about. I mean, you can't test for user stupidity.

Ha-ha, finally we get to the monitoring. So why do we monitor? In case you haven't been listening for the past ten minutes: because testing isn't enough. But seriously, we monitor for a whole lot of different reasons. First of all, software is never perfect.
Anybody who tells you they write bug-free code, you can hit them with a stick, because they're lying either to you or to themselves. Systems become more and more complex, and you want to make sure you keep track of every moving part. There are always external dependencies to worry about, which people tend to skip. How many of you have a dependency on a third-party service, whether it's a major one, like a full Salesforce integration, or just using Facebook Connect? And of course, there are all sorts of other reasons. But in reality, it can be summarized in one thing: we monitor because things change. And when things change, they usually change in production.

So what do we monitor? What do we monitor to try to solve our problem of things breaking in production? Everything, right, would be the short answer. But more specifically, we monitor systems. We monitor databases. We monitor applications. We monitor integration points, yadda, yadda, yadda. I mean, how many of you monitor all of the things listed here? Oh, wow. We've got to have a talk. All right, so, question: is it enough? Is what you monitor enough to let you sleep at night? Or is it too much, and you get way too much noise and way too many alerts, and they don't let you sleep? So the question is: what is important? What do we want to monitor, really, that will help us keep the systems up and running and react to problems quickly, without being overloaded with information? What do we alert on?

Twitter is a perfect example. Well, first of all, they're notorious for breaking things. But a little while back they had an interesting bug where you could go to the website, or go through the client, and submit a tweet. It would say: great, your tweet is submitted. Everything is great, the site is up and running. And your tweet goes into /dev/null. So from an operations perspective, servers are up, APIs are up, they're returning 200, everything is working. But in reality, it's not. All the system checks are fine, but the business is failing. By the way, the same goes for unit tests. If you rely too much on unit tests, you have the same problem: if all your unit tests are passing, it doesn't mean your stuff is working.

But anyway, this is probably my favorite quote ever. It was told to me years back by the CEO of a large company when we were talking about some of the technical stuff we needed to fix. And it's actually very true. From a business owner's perspective, they really don't care what kind of technologies you're using, and they don't care what broke. All they care about is whether the business is successful and effectively making money. So: we monitor because things change, right? We talked about that. But changes affect the business. And a lot of people, especially in tech groups, either don't understand that or are not privy to the larger picture. I'm a huge fan of the top-down approach to monitoring: you monitor the business, and everything else is just there to support those particular metrics.

In order to do that, you need to understand the business. You need to understand what you are building; your software is only there to support some business objective, not for the sake of using technology. You've got to define a baseline. You've got to understand what constitutes good versus bad. And it's not necessarily binary, not necessarily just up or down. It's more of a threshold, right? How many registrations do we get in an hour, or a minute? How much revenue is coming through the system? What are the traffic patterns?
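A rough sketch of what that kind of baseline-threshold check might look like in code; the `get_metric` lookup, the metric names, and the floor values here are all invented for illustration, standing in for whatever your monitoring system actually exposes:

```python
# Minimal sketch of a business-metric baseline check.
# Hypothetical: get_metric() stands in for whatever query API
# your metrics store actually provides.

def get_metric(name: str, window_minutes: int) -> float:
    """Placeholder: fetch a metric's value over a recent window."""
    raise NotImplementedError("wire this to your metrics store")

# Baselines are thresholds, not binary up/down checks.
BASELINES = {
    # metric name           -> minimum acceptable value per hour
    "registrations.hourly": 50,
    "revenue.hourly_usd":   1000.0,
}

def check_baselines() -> list[str]:
    """Return human-readable warnings for metrics below their baseline."""
    warnings = []
    for metric, minimum in BASELINES.items():
        value = get_metric(metric, window_minutes=60)
        if value < minimum:
            warnings.append(f"{metric} is {value}, below baseline {minimum}")
    return warnings
```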
And of course, you should be able to correlate the data, because once you identify a problem with the business metrics, you still need to look at your system metrics to figure out what's wrong.

So here's another example, and I was actually closely involved with this one. One of the companies, and they're actually pretty big: they have about 100 million users and send about a billion emails a month. They have about 5,600 metrics; I think it's even more now. They monitor everything, from low-level checks on their servers all the way up to how many registrations they had in the past two minutes. As with everything, it all starts with a call. The client picks up the phone and says: something is wrong with the website. I mean, how many of you have had that call from a client or business owner? That's generally how it starts. So, something's wrong with the website. And I'm like, great, would you like to elaborate? And the guy goes, well, we looked at the numbers, and our revenue is down. So I'm like, all right, let me see what he's seeing. Luckily, we were actually monitoring the revenue. And as you can see, there's clearly a dip. (By the way, we did fix it eventually, if you look at the second dip.) You could see a dip in revenue, but it's not like it's zero; it's just lower than average. And at that time, we didn't see the other spike.

So I'm like, okay, maybe you're just not doing your job and not selling enough stuff; what are you calling me for? But let me do my due diligence. Let's look at the actual traffic. We look at the traffic, and it's lower than expected. So it's in line with the revenue they're seeing: fewer people coming to your website, spending less money. That seems legit. All right, but let's dig a little deeper. Let's look at load times. Maybe performance tanked, something on a page, a third-party dependency, maybe something went wrong. If performance plummeted, people can't check out. I mean, a whole bunch of stuff can happen. Performance looked just fine. So that's the first place where I could have just said: screw it, it's your problem, not mine. But let's dig a little deeper. Let's look at the database, because it could be the Twitter problem again, right? Something gets submitted, but it never comes back. Everything was fine with the database, everything was fine with the systems, everything looked normal. Fewer people coming, less revenue, sounds legit.

Luckily, we kept digging deeper and deeper, and finally we looked at email deliverability. And apparently one of the major email providers, I forget whether it was Yahoo or AOL, whoever it was, had accidentally put them on a blacklist. So all the marketing emails that were supposed to go to people on that domain, and that's a big chunk of them, bounced. Fewer emails got to their customers, fewer people got to the website, and they made less money. So here's the question I get often: great, you were monitoring all sorts of cool stuff, and you were able to troubleshoot it. But what if we hadn't had email monitored? We would have probably still figured it out at some point, but you can bet it would be monitored after that.
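To make that "correlate the data" step a bit more concrete, here's a minimal sketch of lining up two metric series and computing a plain Pearson correlation. The series and numbers are made up for the example; in practice you'd pull these from your metrics store:

```python
import math

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    assert n == len(ys) and n > 1, "series must be equal length, n > 1"
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical hourly series: the revenue dip lines up with a dip in
# delivered marketing email, not with page load times.
revenue     = [100, 98, 97, 60, 58, 62, 99, 101]
emails_sent = [500, 490, 495, 200, 190, 210, 505, 500]
load_times  = [1.2, 1.1, 1.3, 1.2, 1.1, 1.2, 1.3, 1.2]

print(pearson(revenue, emails_sent))  # ~1.0: strongly correlated
print(pearson(revenue, load_times))   # much weaker: not the culprit
```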
Instrumentation is never done; that's another thing. A lot of people think that you launch an application, put a whole bunch of monitors on top of it, and you're done. But instrumentation is an evolving process. When you discover things like that, you sure as hell put a monitor on them, if not an alert.

We had an example with another client, with very similar systems, except all of a sudden they had higher decline rates. Again, it starts with a phone call: something's wrong with our website. Okay, what's wrong with your website? Well, the credit card decline rates are higher. Okay. And we must have spent hours and hours going through logs, going through the database, going through rows, trying to figure out what was wrong. And nothing was wrong; everything seemed to be normal. The next day we come in in the morning, we continue troubleshooting, and the same client calls and goes: oh, by the way, could you create a ticket to take the American Express logo off our website? I'm like, why would you want to do that? He's like, oh, we stopped accepting Amex. And, in the words of The Wedding Singer: once again, things that could have been brought to my attention yesterday. But realistically, nobody would monitor decline rates from the beginning. I wouldn't even think about it. Why would you monitor decline rates for different types of credit cards? That seems overboard, right? It just adds to the white noise. But apparently that's an actual case, so we put another monitor on it.

So, to summarize. It is testing and monitoring; in no way am I preaching to get rid of testing. Testing gets rid of the obvious issues, the whole "oh shit, I pressed the button and everything blew up in my face." You absolutely must have testing. But you also need monitoring, because, as I just showed you, you never know what you're going to get in production. Understand the business. This is actually one of the challenges that hopefully the whole DevOps movement is trying to solve: get all the groups involved. And for technology people, understanding the business they're supporting is actually going to help them support it. Much like performance and security, monitoring is not a feature, and it cannot be an afterthought. You cannot build your whole application, run it through testing, and then, as an afterthought, say: hey, let me put some monitors on top of it. You will fail in that case. Make sure monitoring and instrumentation are part of your development process: you develop a feature, you put a metric in it. It doesn't cost that much, and it will save your ass at some point.
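A sketch of what "develop a feature, put a metric in it" can look like in practice. The statsd wire format is just a UDP datagram like `name:1|c`, so a hand-rolled emitter is a few lines; the host, port, metric names, and `register_user` function here are all hypothetical:

```python
import socket

# Minimal statsd-style counter over UDP. The statsd text protocol is
# "metric.name:value|c" for counters; the address is an assumption.
STATSD_ADDR = ("localhost", 8125)
_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def incr(metric: str, value: int = 1) -> None:
    """Fire-and-forget counter increment."""
    try:
        _sock.sendto(f"{metric}:{value}|c".encode(), STATSD_ADDR)
    except OSError:
        pass  # monitoring must never take the application down

def register_user(email: str) -> None:
    # ... the actual feature code would go here ...
    incr("business.registrations")      # the business metric
    incr("app.register_user.success")   # the system-level metric
```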
It's all about continuous improvement. You can never cover 100% of the cases; same with testing, same with monitoring. So when you discover things like decline rates on a specific credit card type, put a check in for it. That's it. Monitor everything, but alert very selectively. You don't want to alert on every single metric, because, honestly, anybody who's on call is going to start ignoring things, and things are going to start falling through the cracks. I'm also really bad at conclusions, so that's it for me. I think we have a few minutes for questions. Any questions?

The question is whether I have a specific set of tools for collecting and monitoring metrics. I have my preferences. For the graphs that you've seen, we use Circonus; we spun that company out, so we kind of eat our own dog food. Realistically, there are a lot of monitoring solutions out there. Any solution that lets you collect arbitrary metrics, whether they're text or numeric, will work, as long as you can correlate them later. Some do it better, some do it worse. There's also Nagios, which I wasn't going to mention at all. But yeah, realistically, you want anything that lets you collect non-typical metrics, because system metrics you can get from anywhere. For the business metrics, the arbitrary metrics, the registrations, the revenue, all the stuff you wouldn't think about, you want something that supports that. And the second part: you want it to support correlation. You want to be able to graph multiple things on one graph, or at least side by side.

The question is: have I seen effective ways to pipe alerts into a ticket tracker or something similar? Yes and no. You have to be very careful with that. Again, because of the volume, you get a lot of false positives. You need to hone your thresholds enough for it to be effective. For example, let's use the same example: you monitor revenue, and you set a threshold. You assume that at any given minute, the revenue over the past rolling five minutes should not fall below X. Well, is that really true? Does it account for two a.m., when most people are asleep? Even if you set a really low bar, sometimes it can still fall below it. And when it does, is there a problem? Or is it an anomaly where somebody wakes up, looks at it, says, nah, everything is good, it's just slightly below, and goes back to sleep? So it heavily depends on the business requirements. If you can identify a metric where you know for a fact that once it goes outside the threshold, or once it falls to zero, say, it's a problem somebody needs to look at, then yeah, you can pipe it easily into a system. You do it the same way as you do anything else, with VictorOps or PagerDuty: just pipe the alert into your pipeline.
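To illustrate why a naive rolling threshold fires at the wrong times, here's a small sketch of a five-minute rolling revenue check where the floor varies by hour of day. The per-hour floor values are invented for the example:

```python
from collections import deque
from datetime import datetime

# Hypothetical per-hour revenue floors: a flat threshold that works at
# noon will false-alarm at 2 a.m., so the floor varies by time of day.
HOURLY_FLOOR = {h: 500.0 for h in range(24)}
HOURLY_FLOOR.update({0: 50.0, 1: 30.0, 2: 20.0, 3: 20.0, 4: 30.0, 5: 80.0})

class RollingRevenueCheck:
    """Track revenue events and flag when the 5-minute rolling sum
    dips below the floor for the current hour."""

    def __init__(self, window_seconds: int = 300):
        self.window = window_seconds
        self.events = deque()  # (timestamp, amount)

    def record(self, ts: datetime, amount: float) -> None:
        self.events.append((ts, amount))

    def is_anomalous(self, now: datetime) -> bool:
        # Drop events older than the rolling window.
        while self.events and (now - self.events[0][0]).total_seconds() > self.window:
            self.events.popleft()
        total = sum(amount for _, amount in self.events)
        return total < HOURLY_FLOOR[now.hour]
```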
The question is: can I talk a little bit about who in an organization is responsible for monitoring? The answer, for collecting the metrics, is that everybody should be doing it. Developers, ops people, database people, even QA people: anybody who is responsible for architecture, for code, for adding new features should be adding new metrics. That said, I'm also a big supporter of anybody being able to put an alert in. However, no alert should go into production undocumented. At my company, all the developers have full access to the monitoring system. They can put in as many alerts as they want, but if somebody gets woken up in the middle of the night and the only way for them to solve the problem is to call that developer, they're going to be really pissed. So if you go down the model that any alert that goes into production has to have full documentation, full troubleshooting steps, what to do, what to try, what other metrics to look at to see if anything else is wrong, and who to call if everything else fails, then you're fine. But to answer your question: anybody, really. It depends on the business. Whoever is in control of the application, whoever is on call, should be able to put metrics in. I think that question is more of a culture thing; it depends on the company. I'm not a fan of making somebody single-handedly responsible for monitoring. It's the same thing as with on-call: if you put responsibility on the people who are responsible for fixing things, they become much more diligent about it. Generally, if you had to put responsibility on somebody, I'd say it's the people who get woken up in the middle of the night. Because when they do, and they realize there's no monitor on something, they can go back, if it's developers, to ops, and if it's ops, to developers, and say: hey, I got this alert and I didn't have any information about this stuff. We need to instrument this; here's the ticket, go instrument these things. I don't know if that answers your question; I hope it did. Okay. Maybe one more question, if there is one?

Dashboards actually have very little to do with alerts. With dashboards, you want to show information in a consumable format. Alerts, when I say alerts, I mean something that wakes people up in the middle of the night. So, the graph I was showing you: if you want to show twenty of those graphs, with ten things on each, and they're readable and actually show valuable information, then why not? As long as you don't alert on every one of those things individually, because any one of them may not necessarily indicate a problem. When you put an alert in, it's all about actionable alerts, which is a whole separate conversation altogether. But you've got to ask yourself three questions when you put one in: do I care if it's broken at two in the morning? Can I fix it at two in the morning? And can it wait until tomorrow? If the answer to any of those is, yeah, I can fix it tomorrow, then why would you alert on it? You probably just want to send an email.

Sure, all right. Yeah, the short answer is yes, because it is iterative. You want a continuous feedback loop where you discover something new, pull it back, and determine whether you want to pull it all the way back into your test suite, because some of those things can be discovered during testing, or just pull it back into monitoring and instrument the metrics. And since Jason is trying to kick me off the stage, we can talk about it offline. But yeah, there are a couple of things you can do with this; it depends on the model, what kind of organization, and what kind of data you're collecting. Yeah, thank you. It could be something we do as an open space, possibly, right? Yeah, absolutely, let's bring that up. Okay, everybody, put your hands together for Leon. Thank you, appreciate it.