Hello everybody. Welcome to "Fixed the Website: a DevOps Success Story." Or a rant; it kind of depends on exactly how into it I get as I go. One or two fewer people; excellent. Let's get going. So: "Help, the website is slow!" screams the PM. But first, a little bit of who I am and why I can even be giving this talk. I normally hate these slides, but here I feel it's appropriate.

I'm a developer, sometimes ops, and I've been following the whole DevOps thing since it was just a hashtag. I came from an actual engineering background, so I bring a bit of rigor and a lot of first-principles understanding. I was acqui-hired from a small startup into a larger company, the company we're talking about; its name rhymes with Demand Media. I'm a community organizer, and I run Los Angeles Perl Mongers, so if you want to come learn some Perl and meet some people, we occasionally have meetings and we'd love for you to join us. And I'm now two companies away from the company we're talking about here, as I've just started at ZipRecruiter. It's lovely, and they're hiring, and they were kind enough to sponsor me to come, and at least I'm using their slide deck, because it's pretty.

So yeah, after I was acquired by Demand, they took our startup, cancelled our product, and put us on a new one. And it was awesome, and we were totally into it, for about a year. And then it got surprise-cancelled, at which point we were left to wander the company and figure out what to do. So I went kung-fu style, just wandering around looking for problems to fix and things I could improve. I had worked on pieces of the data stream that led to the website we're going to look at, so I had a pretty solid understanding of what their data model looked like. Which was useful, because apparently they did not.

A little about the website. I decided I wasn't going to say the name eHow, so I pretty much removed it from my slides. It was a ComScore Top 30 website, and it originated its site type: the massive collection of somewhat-interesting information that has derogatorily been called a content farm. We liked to think that term was for our competitors; we were the originators, out there trying to give the people the media they demanded, hence the name. Most of the traffic is search traffic. People search for stuff, they come, hopefully they like what they read, maybe they click through to another page; but really, people come, they look, they leave. The traffic comes in from Google, and we monetize through Google AdWords, which really means the biggest client and the biggest customer is Google. We're really happy to be providing information to people, but as with any free service, you are what is being bought and sold.

Also, it's ginormous. When I was there it was about 3 million pages across about 20 page types. So it's not like you could just test your changes against all the pages, or have QA eyeball the whole site, which made pushes somewhat fragile. It was also the huge money maker for the company. The company went public, and then the site started to get hammered. It may or may not have been affected by Google Panda, which you can read about on Wikipedia; that was Google tweaking their search results. We never officially announced whether it affected us, so I will continue in that spirit. So: a site somewhat in decline that had made lots of money, where the main customer is Google, and Google doesn't tell you whether or not they like things.
They just randomly change how much they're paying you for their ads, and then you have to wonder: hey, that thing I did two weeks ago, is there a correlation? So the team over on that project was very superstitious. They've got all these metrics for how the site is doing, but their biggest metric is Google, and there's a big time lag, and Google is very secretive about what they do, so that people can't game them. And it's a very long-tail site: there are some pages that get hundreds of thousands of hits a day, and there are tens of thousands of pages that get one or fewer impressions a day. Very long tail.

And it's just a simple LAMP stack, right? Well, it had a rather complicated data model, wherein data was coming from two divisions away in the studio, getting thrown into a content management system, and then this site was pulling from the content management system. And that's fine as long as those are all just caches along the way, but they weren't. Data was being inserted and tweaked along the way, so instead of an actual master-slave setup, it's more like slaving off the rendered output you get through the HTTP interface and then re-rendering it again. Figuring out what this model looked like was one of the first things I did when I was working on this project. In fact, when I was digging through my notes, I found my previous diagram of this, which was wrong, and I had shown it to everyone and no one knew it was wrong; I only noticed it just now, going back over it. I was like, that's not how that worked. There are one, two, three, four, five teams involved in getting data from the left side to the right side, in two different states, under three different lead bosses. So they didn't really communicate well. And we were several developer-generations in: the original coders have left, the people they trained have left, and the people they trained have left. So now all you have are the myths: oh, we can't change that, because we think it breaks this.

But here we had real, hard data. "This metric is bad. Look at this graph," my PM says. There were three important things I saw when looking at this graph. One, he has it on this big time scale. Two, he has these notes about what he thinks things meant, which are 90% political: "I think it's this other team doing something, breaking my site." And three, it's been getting worse since then; it's going up and to the right, and this bottom one is not a graph that you want going up and to the right. This bottom graph is something called deficit time. He's like, this is not good, it went up. I was like, okay, what is that number? What does it mean? He said, well, it's the deficit time. It's going up. You need to fix it. Look, it's a metric: fix it. And that's about all I could get out of him.

So I went to his boss, our CTO: hey, what does this metric mean? And he's like, well, it's the DT, the deficit time, so that must be the database, so we should replace the database. And this was a popular opinion, because we all hated the database. It's this big Mongo; you saw what it looked like: two separate huge Mongo setups with slaves and all these workers trying to synchronize things in an ad hoc manner. The problem with this is that deficit time has, in fact, nothing to do with the database.
They had actually forgotten what DT meant, because the other important thing about that graph is that it's an average from a roll-up average table, so all specificity has been lost; there are two levels of averaging. The PM only has access to this roll-up table, so he doesn't even know where the data is coming from. So I have to dig in, and I find that deficit time is just a label for a calculation we're doing in here. We were doing Splunk; if I were doing this today we'd be in Logstash, same sort of thing: throw all the stats data in, and then you can do queries and aggregations on it. And we found that, oh, DT is the delta between the amount of time the PHP is running and the amount of time it takes to deliver the packets. So this is really our delta time.

We really wanted to kill Mongo. I hate the Mongo. Unfortunately, this is exactly where Mongo is supposed to be used: as a key-value store holding one big JSON blob per page in RAM, so you do one lookup per page and you get it. So we couldn't just get rid of the database.

So yeah, I dug into the Splunk and found where the data was getting sent. I had to go two teams over to find where the data was getting pushed into Splunk, and to a different team to ask, hey, how is the Splunk getting aggregated? Eventually I found that, oh, DT is defined as us minus pp, and I went back to the Apache logs and found out what those meant. I also eventually noticed that it was us and not usfb, which matters later. The us is microseconds: the amount of time from when the packet with the GET request comes in until the last response packet is sent out. That's what Apache logs.

So: hey, I have this number, it's bad, and I have a graphing system. Let's go look and do some analysis. First I had to go learn the Splunk query language. We were paying, I don't even know how much, tens of millions of dollars a year for this Splunk instance so we could throw our data in and run queries against it, and no one knew how to do complex queries. The only thing it was used for was one-off reporting graphs that would be saved and reported on, and if they looked bad, emails would go out. So I dug in and figured out how to slice and dice in the Splunk; no one had done any analysis this way.

So what were the correlations we didn't find? Day over day, day of week, week over week, page subtype: none of these showed any correlation with the delta time or the microsecond time being high. We did find that large data files being served by something other than PHP were throwing off the numbers in the graph: they had no pp time, so their delta was whatever their us time was. So we learned to filter those out and started looking at some queries. What was the next thing to try? Size. I did a bucketing and found that there's a big chunk of small responses, less than 15K; a chunk from 15 to 25K; and a big chunk over 25K. And if we bucketed them, the results within each bucket were self-similar and different from the other buckets.
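To give you a flavor, the whole investigation boils down to queries shaped like this. This is a sketch, not the real saved search, and the field names us, pp, and bytes are my guesses at the schema described above:

    sourcetype=access_log pp=*
    | eval dt = us - pp
    | eval bucket = case(bytes < 15000, "small",
                         bytes < 25000, "medium",
                         true(),        "large")
    | stats avg(dt), median(dt), perc90(dt), perc98(dt), max(dt) by bucket

The pp=* at the front is the filtering trick from above: static files served outside PHP have no pp field, so they drop out of the query.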
So here we're looking at the small bucket, less than 15K. These looked great. Let me pop this chart out so I can actually read the labels. The one line that is bad, at the top, is the max time, and max just means maybe one response out of a thousand was bad. You don't learn a lot from max, but it does give you a top-level threshold. Here we're still using averages, which hide the fact that extreme outliers get too much weight; whereas if we look at the 90th percentile, the 95th, even in this case the 98th, it still looked good. But if some of the responses take 100 milliseconds and some take 40 seconds, those throw off the graphs, and the averages, and the averages of averages, hide what's really going on. Here, even the worst-case time was one second, and most of the numbers are wonderfully low. However, it is always a warning sign when you have to put your data on a log graph: there's some skew in there.

When we look at the medium bucket, it gets much worse. Now the max is peaking at one to four seconds, and the average has gone way up. So why is it taking more time? Here we find: oh, the big max values are throwing off the averages. What's interesting is that the large bucket is also bad, but not actually as bad as the medium. That's more of a sampling issue: fewer people are requesting fewer pages, because there are not actually that many pages bigger than 25K, and most of the ones being queried are not being queried by super-remote people.

Then we also had another category to look at: the bots, since we were categorizing them in our system; hey, this is the Googlebot, this is the Bingbot, this is our internal bot. And the thing we noticed is that our internal bot is always really good; it never has this problem. Which is what eventually pointed towards: oh, this is a network issue. Now, once I said "network issue," they said, oh my god, the network is bad, it's terrible, we have to fix it. And I was like, no, this is a networking issue in the sense that you have distance, and you have light, and it takes light time to travel distance, and that is just a fact of life. But it was very useful to see that, yes, our co-located crawler is very fast and doesn't have these problems. So now it looks like: oh well, it's something network-related, it's something size-related. What is it? Although, I'm like, it shouldn't be that bad; let's go figure it out.

But then, while digging into this, I asked the previous person in this position: hey, are we gzipping our traffic to Akamai? "No, no, no, we can't do that." The other thing about this site is that it's very big, and it has a CDN layer in front of it to, in theory, cache the traffic. And when we asked it, it said, hey, I'm caching 99% of your traffic, it's all good; I'll get to why that came up. But there was a discussion as to whether or not we were gzipping to them, which would be very important if this was size-related, and they said no, no, no, you can't do that, it's not supported. The Akamai guys were in two days later for unrelated things, and we asked them the question: hey, do you guys support gzip? "Well, of course we do. It would be ridiculous not to." So we wasted some time trying to figure out how we could get gzip enabled. The long story there is that it was already enabled, but it still took a week to convince people that we would be able to turn it on if it had been off. At which point I realized I was going to need a lot of data to convince people to make any changes, because they didn't want to. So I went and built my own server, and I started hammering it.
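The hammering looked roughly like this. A sketch: the hostname, connection count, and rate are made up, and urls.txt is in httperf's --wlog format, URI paths separated by NUL bytes:

    # Replay a list of real page URLs against a dev box, looping
    # the list (y = wrap around), and report latency stats at the end.
    httperf --server dev-web01 --port 80 \
        --wlog=y,urls.txt --num-conns 10000 --rate 100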
And I was testing gzip versus non-gzip, because we still thought that was a thing. The only thing I found was that if I super-overloaded the server, by something like 40x what it would see in production, I could get the us time to go up relative to the pp time. It took a while, and then I realized: oh, in this case there's buffering happening in Apache. Apache starts its timer as soon as it receives the packet with the GET request; but if you have a lot of traffic coming in, that request may sit in the accept queue before it gets handled by one of the workers, and meanwhile the clock is ticking. So I was able to produce a problem similar to what was going on on the website, but this was clearly not what was going on on the website, because I went and looked, and no, we did not have huge wait queues.

We did have a lot of sockets in TIME_WAIT, and I spent some time looking at that, and that's mostly a red herring. TIME_WAIT is when your connection is done and the socket hangs around waiting for a final ACK, making sure no lost packets are still in flight. By default they sit there for two minutes, which is way too long; but until you run out of sockets it's not actually a problem, and we had plenty of sockets.

But this was data. Real data, reproducible. I had my customized httperf runs going; I'd had to compile my own httperf, because there's a bug in it that had been filed, with a patch, I don't know, four years before I started this. But it was perfect for what I wanted it to do, which was: give it a list of URLs, have it hit all of them over and over again, and then give me some stats. And since the dev system was also tied into the graphing system, I was able to make pretty graphs. What we learned from that: it's really useful to instrument your stuff. Put your data, even the dev data, into the graphs so you can play with it.

And while I was in there, I found exactly what us meant, because I went and looked at the Apache code for where it starts and stops recording. I also found that there is a module to record usfb, the time taken to the first byte sent to the client, which is actually what you want to optimize on, not us. Because us takes into account how long it takes to send the last byte to the client, and to send the last byte, the client has to acknowledge your previous bytes, so you are stuck waiting on slow clients. If your clients are slow, and we were a long-tail site serving people all over the globe, many of them very slow, it might take them three seconds to get all the data back and forth, and now your us time goes up, and there's zero you can do about that on your side. Now, we were actually insulated a little from that, because with the caching layer in front, our us time was mostly the time to send to the cache, not to the end client.
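For reference, in a modern Apache (2.4.13 and later) that first-byte time is built into mod_logio; at the time it was a separate module, but the shape is the same. A minimal sketch, assuming mod_logio is available:

    # %D   = total request time in microseconds (the "us" above).
    # %^FB = microseconds until the first response byte (the "usfb").
    LoadModule logio_module modules/mod_logio.so
    LogIOTrackTTFB On
    LogFormat "%h %l %u %t \"%r\" %>s %b %D %^FB" timing
    CustomLog logs/access_log timing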
So we found from the graphs, digging in, that there was something about 16.5K. I didn't really have time to figure out why it mattered, but I could tell: below that number, great; above it, worse. So I came back with a plan of action: the pages are too damn big. I went back to the PM and told him, your pages are too big; they will serve faster if they're smaller. Now, coincidentally, this is what I had told him two months before on general principle, because his pages were too big, they took forever to render, and there was old cruft. But now they had data and a picture, and so they were able to get buy-in from above that, yeah, we should shrink these pages down. There was some embedded CSS; we pulled that out into its own files, which is better anyway, because those can actually get cached properly. Got the page size down, and, yay, the number went down.

And then I spent some time thinking about: well, why was this? So, our HTTP requests run over TCP, which itself runs over IP, which runs over the network-stack layers down below that I don't pay attention to. TCP has a congestion window on the sending side, which by default on our Linux boxes was 3 segments, meaning you can send about 4K of data before you have to wait for an acknowledgment that says, hey, I got that. Then it builds up: the more data you deliver successfully, the bigger the window gets and the more you can send at once. But on a website, especially one like this where you're not getting many repeat users, every connection is a new connection, so they all start slow. Some clients, Macs and some other things, advertise a much bigger receive window, like 65,000 bytes, but the sender's window still starts small. There was a push a few years back to raise that default for a high-bandwidth, Ethernet-based world; in a late 2.6 kernel it became a tunable parameter, and around the 3.0 kernels it defaulted to 10 instead of 3. And from my experimentation, Akamai sets it to 12.

So we could send about 12 packets of about 1,430 bytes each; that's the MSS, which is slightly smaller than the MTU, the actual segment size of an Ethernet packet most of the time. And that works out to about 16.5K. The advantage is that if your data fits in that window, when Apache goes to send it, the write gets handed down to the TCP layer, and the TCP layer can send all 12 packets without waiting on congestion control, and then report back up the stack: hey, we sent that, we're good. Apache doesn't wait for the final ACKs from the remote client; it's able to say, hey, I'm done, as soon as it hands the data off. Actually, I have this slightly simplified. In the standard case, where you can send three data packets, if you needed to send five, you would have to send three, wait for the ACK, and then send the last two, and that's a whole extra round trip. I think you can see that between t = 0 and usfb there would be render time, but I borrowed this diagram, so it's not in there. And the us time, if you could send it all at once, would be up here; if you had to send it in two windows, it's way down here. With the Akamai settings, you can send 12 before you have to wait for an ACK. CDN Planet has a lovely initcwnd post on this, where they argue that you should bump this number up, especially if you're a CDN; so it's not surprising that Akamai did.
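If you want to play with this yourself: on Linux (2.6.39 and later) the initial congestion window is a per-route setting. A sketch, with a made-up gateway and device:

    # Bump the initial congestion window on the default route.
    ip route change default via 10.0.0.1 dev eth0 initcwnd 12
    # 12 segments x ~1430-byte MSS ~= 16.5K in the first flight,
    # before the sender has to stop and wait for ACKs.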
So, was this a success? Well, we fixed the glitch: the number on the chart improved. The customers' experience was somewhat improved, because we got the pages smaller and they were loading faster; though this ignores the four seconds of render time from all the widgets on the page, which rather swamped the extra 100 milliseconds we saved sending the packets. And we also found, from all of that, that Google, the main customer, well, they have a good network, and they're co-located close to us, so their time was already pretty good. And they care more about time-to-first-byte than about how long it takes to get all of it, because they realize that last part is network-dependent. So it made me sad that I was able to fix the number and show them the number, and the response was what I had expected it would be a month before: we can fix the number, but you won't see a Google change. So he was sad. But I was happy, because our customers were happier, and I think of the customers as the people actually getting the data.

And while digging into all these things, I learned a bunch of other things, and I was able to share a bunch of other things. I didn't go over how we had a broken Akamai config such that it was reporting 99% cached when in reality it was caching about 10%. So every time they did a push, they would wipe out all of the cache, and since most pages were not hit in between pushes, it was just a big, very wasteful exercise. But they thought it was all good, because they had a metric that told them it was good; their metric was bad. And then I also found the next step, if you actually want to make this work: since it's not caching, build a local cache that we rebuild after every build, and then when you flush the Akamai caches, they can hit our cache as origin and keep our render times down. Because the assumption was, hey, most everything's cached, so the origin doesn't have to be fast. Which is odd, because they were simultaneously saying, hey, this thing is slow, go look at it.

And yes, there were a bunch of assumptions that were incorrect, and by showing them to people repeatedly, I was able to get some buy-in on the actual facts. And I learned that this was an environment that was hostile to facts versus opinions, which definitely helped teach me a lesson, in terms of leaving. But guesses become assumptions, become institutional knowledge and lore, so you need to remember to check your assumptions. Just like with the Mongo: I wanted to kill the Mongo, and it wasn't this problem. Were there other big problems in that stack? Yes, but it wasn't this problem.

So there were wins, because there was learning. I learned stuff, and then I documented it, and then I improved it, and then I explained it and shared it, and when I left, I gave a big packet of information to my team, and they're still using it. And I think that is the important DevOps takeaway: whatever the state is, you can make changes, you can dig into what's broken and what's not broken, and you can improve it. And make sure that your truth is not lost and replaced with myths, because much had been forgotten.

So that is most of it, without too many of the roundabouts. Do I have any questions from the audience? ... Yes, this is Simba. Sorry, he was not able to make it; he's busy being a therapy dog this weekend. The slides are up on my blog, Low-Level Manager, and I'm providing them to SCALE, so they will be up here soon.

[Audience question.] Let me see if I can repeat that: how to effectively aggregate outliers, for monitoring purposes and exploration. I do recommend, instead of just the average on the charts, some percentiles, so that you can see the magnitude of how bad your bad ones are. I did not come up with a good way to tag, hey, these are the really bad outliers, what's wrong with them?
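The closest thing I had was counting how many requests land over fixed thresholds, which comes up in the next answer. In Splunk terms it's something like this sketch, with the same assumed field names as before (dt is in microseconds):

    sourcetype=access_log pp=*
    | eval dt = us - pp
    | stats count(eval(dt > 500000))   as over_500ms,
            count(eval(dt > 1000000))  as over_1s,
            count(eval(dt > 10000000)) as over_10s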
[Audience question, partly inaudible, about whether it might be the database, maybe Redis.] Yes, and I definitely did look at that, and I found some other bugs that way, like: hey, the time it's taking to render the 404 page has gone up. And that one was a super-case of this: there were some 40-second 404s from really slow, bad clients that were taking forever. So: counting, hey, how many are over half a second? How many are over a second? How many are over 10 seconds? Okay, these are the flaw. And actually, half of those were a default value that Apache would log when the connection went away in the middle of the response and all the acknowledgments were lost; it would just hang out for 40 seconds and make your logs look really bad. But if you looked upstream, you saw: oh no, that was actually responded to.

I do have one little piece of bonus material, since you asked. This was a really fun one I found while I was working on all this and wanted a day of not looking at it, because it was hurting my head, pounding against the wall. There was a different report that counted the number of 404s, which was actually counting error lines seen in the logs. So there were these errors showing up in the PHP logs, and all anyone had was a count per day, and it was going up. So I looked at them over time, I found one specific error, I counted it, and I found a perfect correlation. Any guesses what it correlated with? ... Exactly. Well, see, that's why it's important to look. It correlated perfectly with the releases.

I was able to dig that down to the release script, which walked through all of the hosts and, on each one, pulled the new website, swapped the symlink, and went on to the next one, all the way through; and then it walked through them all again and restarted them. This meant there was a window of time, between swapping the symlink and restarting, when most of the pages would work right, but certain compiled pages would not find their compiled objects on disk. They expected them to be there, and due to some weird PHP caching thing that I did not understand, they wouldn't rebuild them; they would just complain that they weren't there. So for the three or four minutes it took to go through all of them and download, and it was taking longer and longer as the builds got bigger and bigger, there was just this window where the website was broken, every time you pushed. The PM's belief had been that this was a code issue on the Ops side and that the network was bad; he did not want to believe this was the reality. So yeah, I was able to point at the specific PHP library error that correlated, and I pushed a change for the release script, and they sat on it for a couple of months, but they did eventually, I think, switch. Because, oh well, we're going to change how we do that anyway. Then let's not change it at all. Very risk-averse. Yeah.

Well, thank you all very much for listening. And so yeah, I'm now at ZipRecruiter, and I'm very happy; I do not have a grumpy talk like this about Zip. And we are hiring, looking for Perl people, Python people, other people, mostly just nice people, because we're an adult-based organization, which is awesome. Thank you again for your time, and the speech was free, so the beer is free. The puppies are not free. Thank you very much.