I am Lisa Van Gelder. I am VP of Engineering at Stride, which is a consultancy here in New York. Before I worked for Stride, I worked at the Guardian newspaper, and this talk is about some great caching disasters that we had while I was there.

So I joined the Guardian in about 2008, and really soon after I joined, the website started going down every lunchtime. It was really weird. Lunchtime was peak traffic, but it wasn't higher than normal peak traffic; it was pretty much standard. There was no reason why the site should be going down. This went on for about a week before someone finally thought to look at the caching stats. What we found was that right before the website went down, the cache was being cleared. And this made sense. Traffic at peak time back then was about 2,000 requests per second, and we were serving that on 12 servers. The only way we could serve that quickly was to rely very heavily on caching. So if you clear the caches at peak traffic, by the time the caches fill up again, the website is down. The servers can't cope with the load.

So now we understood it was caching causing the problem, but why? We started going through the logs trying to see what was going on, and we noticed that every time, right before the website went down, someone had submitted a poll. This is an example of a poll. It answers such burning questions as: who has the better hair, Jennifer Aniston or Justin Bieber? At this point the poll showed actual numbers, not percentages. So 29 people had voted for Justin Bieber's hair. And someone submitted a bug report: 29 people voted, I voted, it didn't go to 30. What happened? The answer was caching. A developer thought, great, I can fix this bug: every time someone submits a poll, we'll clear the caches, and the numbers will be correct. So at lunchtime, someone submitted a poll, and the website went down.

We learned two very important lessons from this. The first one: if caching is important to you, then monitor it. The second one: never build a clear-cache system. Why? Partly because a clear-cache system is a big red button that will take your website down, and you know that if there is a big red button, someone is going to push it. The other reason is that you build a clear-cache system when your caching is complicated, and you don't want a complicated caching system. Caching is really hard to test, and it's vitally important, and the more complicated it is, the more likely it is that something is going to go very badly wrong. The trick is to keep it as simple as you possibly can.

The other good trick with caching is to cache for a really short amount of time. Often when people build a caching system, they think: the longer I cache for, the more I protect my servers from load, so a long cache time is a really great idea. But you have to think not only about protecting servers from load, but also about how long it's acceptable to show users stale data. A long cache time means stale data.
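To make that idea concrete, here is a minimal sketch, in Java, of the kind of short-TTL fragment cache the talk keeps coming back to. The names (`FragmentCache`, the poll fragment in `main`) are hypothetical, not the Guardian's actual code; the point is only that entries expire on their own, so stale data ages out without any clear-cache button.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// A minimal sketch of a short-TTL fragment cache (hypothetical names,
// not the Guardian's actual code). Entries expire after a fixed TTL,
// so stale data corrects itself -- no clear-cache button needed.
public class FragmentCache {
    private record Entry(String html, long expiresAtMillis) {}

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;

    public FragmentCache(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    public String get(String key, Supplier<String> render) {
        long now = System.currentTimeMillis();
        Entry entry = entries.get(key);
        if (entry != null && now < entry.expiresAtMillis) {
            return entry.html;                // still fresh: serve from cache
        }
        String html = render.get();           // expired or missing: re-render
        entries.put(key, new Entry(html, now + ttlMillis));
        return html;
    }

    public static void main(String[] args) {
        FragmentCache cache = new FragmentCache(60_000); // one-minute TTL
        String poll = cache.get("poll:hair", () -> "<div>29 votes</div>");
        System.out.println(poll); // at most one minute stale, never cleared by hand
    }
}
```

This deliberately ignores details like dog-piling (many requests re-rendering at once when a hot entry expires), which is exactly the failure mode the next story runs into at scale.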
So we built a three-day cache at the Guardian, and this three-day cache served to cover up a multitude of really bad SQL queries. We had some awful, really badly performing SQL that did things like build the lists of articles for the site. Some of those articles were pretty old and didn't change very much, so to protect ourselves against those queries, we cached them. We cached them for three days.

Every now and then, one of those articles would get deleted for legal reasons, and when that happened, we would send the cache system a notification to remove that article, and just that article, from the cache. So there came a day when that clear-cache message got missed. An article got deleted from the database, but not from the cache. The three-day cache was a Hibernate query cache, and if you hope that Hibernate would deal gracefully with something being deleted from the database, you would be wrong. Hibernate threw an exception. Hibernate refused to load anything. Hibernate powered our website. So at this point, half the website is showing sorry pages; we're kind of half down. Effectively, we have cache poisoning: the cache is holding bad data, but we're still serving from cache, so it doesn't stop. This would have been bad enough if the cache time had been, say, a minute. We would still have served sorry pages across the website, but the cache would have naturally reset as the expiry time was reached. But this was the three-day cache, so we would have been showing sorry pages for three days. We were lucky that this happened at four in the afternoon, so we could clear all the caches, and the website didn't go down immediately. But all those horrible queries that we had tried to protect ourselves against by caching for three days? We were now running them all, all at the same time. So performance that afternoon was absolutely dreadful.

The lesson from this is: don't try to fix awful SQL with caching, because at some point it's going to stop working and you'll be running those queries anyway. The other lesson: cache for the shortest amount of time you can get away with. We did a bunch of testing at the Guardian to find out what a good cache time was, and we discovered that for us, one minute was actually the optimum. It's long enough that it protects our servers from most of the load, but short enough that if something goes wrong, or users see stale data, we notice. Work out what the optimum cache time is for you; it's probably shorter than you think.

Does everybody recognize this guy? This is Julian Assange, the creator of WikiLeaks. In 2010, the Guardian did a live Q&A with Julian Assange. This was really the first test of our comment system and its caching, the first test of it under load. And we were feeling pretty good about the comment system at this point: we had a really simple one-minute HTML fragment cache in front of the system. So the live Q&A kicks off, and pretty soon the response time from the comment system is dreadful, absolutely awful. So we look at our cache stats, which we had learned to do from last time, and actually the cache hit rate is really good, but it's not helping us. In fact, the better the cache hit rate gets, the worse the performance is. And weirdly, we're making a lot of database calls, which makes no sense.

Ordinarily, the comment system is one of the least important parts of the website, so if it was running really slowly, we would just turn it off. But the call came down from editorial that this was the most important thing on the site right now, so we had to keep it going. We turned off pretty much everything else. We had the main site serving stale content; we turned off article updates, just to keep this thing running. And we didn't go down. We clung on by our fingernails, but we didn't go down.

At the end of it, we went to investigate: what happened? Why didn't caching save us? What we discovered was that it wasn't just an HTML fragment cache; it was a cache with processing instructions in it. These are basically server-side includes, and what they were doing was building links. Guardian services didn't know where other services were deployed, so at runtime this thing was doing a database query to find where the other service was deployed and building a link to it. This is a terrible idea. Caching works really well because you can serve cached content really quickly. If you add processing instructions that run while you're serving cached content, they're going to slow you down. If you add a database call for every bit of cached content you serve, you're done; your cache is never going to save you. Never do this. Only cache flat content.
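Here is a small sketch of that difference, in Java, with hypothetical names (`lookupServiceUrl`, the `<!--#link ...-->` include syntax, and the URL are all invented for illustration, not the Guardian's actual code). The slow version leaves a processing instruction in the cached HTML and resolves it with a lookup on every serve; the fast version resolves the link once at render time and caches flat HTML that can be served as-is.

```java
// Sketch of the anti-pattern vs. flat content (hypothetical names).
public class FlatContentExample {
    // Stand-in for the runtime "where is this service deployed?" lookup.
    static String lookupServiceUrl(String service) {
        // Imagine a database query here -- one per request
        // if you resolve it at serve time.
        return "https://comments.example.com";
    }

    // Anti-pattern: cached HTML still contains an include,
    // so every serve pays for the lookup.
    static String serveWithProcessingInstruction(String cachedHtml) {
        return cachedHtml.replace("<!--#link comments-->", lookupServiceUrl("comments"));
    }

    // Better: resolve everything up front and cache the flat result.
    static String renderFlat() {
        return "<a href=\"" + lookupServiceUrl("comments") + "\">Join the discussion</a>";
    }

    public static void main(String[] args) {
        String cached = "<a href=\"<!--#link comments-->\">Join the discussion</a>";
        System.out.println(serveWithProcessingInstruction(cached)); // lookup on EVERY request
        System.out.println(renderFlat()); // lookup once per cache fill, then served as-is
    }
}
```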
I'm going to finish with a woolly rat. Pretty much every company has a woolly rat: the one piece of content that suddenly goes crazily viral. It could be days, months, sometimes even years after the thing got published, that one story that suddenly gives you crazy traffic. This is ours. It's a new species of woolly rat. It went viral because it's huge; no one had ever seen this thing before, and it's a massive woolly rat. It came from Papua New Guinea and was discovered in 2009. We got on the front page of Reddit, and then suddenly this thing went absolutely crazy.

But the interesting thing about this kind of story is: yes, it's crazy traffic, traffic like you've never seen before in your life, but it's crazy traffic to one story. Or in other words, it's a brilliant test of your caching system. Your cache should eat this for breakfast. In fact, if you get hit by a woolly rat and your system struggles, there's something wrong with your caching system; go investigate it. In our case, the thing that was wrong with our caching system was that our extra caching layer was manual. It relied on a human being there going, oh, traffic looks a bit dangerous, and turning on the extra caching layer. We got hit by the woolly rat at 8 p.m. There were no humans around to notice that the traffic was high and turn the extra layer on. So after the woolly rat, we made it automatic: we had our servers notice when traffic was getting a bit dangerous and turn on the extra caching themselves, so it no longer relied on a human.

You may have noticed that all the dates in these slides are pretty far in the past. That's partly because I don't work at the Guardian newspaper anymore. It's also partly because we finally learned our lesson about caching. Caching became much simpler and much shorter, just one-minute HTML fragment caches, and since then, as far as I know, there have been no great caching disasters.

So these are my lessons. I added an extra one: I did a preview of this talk with some other people, and one of the people in the audience worked at Fastly, which is a CDN. So I've adapted the last slide to say: never build a clear-cache system, unless you actually are a CDN. If the entire purpose of your company is to do caching, then you probably do want to build a clear-cache system. Otherwise, never do it; leave it to the experts. Keep it as simple as you possibly can. Thank you.