Hi, my name is Abe. I'm a data engineer at Etsy, and today I'm going to go out on a bit of a limb and talk about data visualization from the perspective of an engineer in the trenches, someone tasked with something that's almost life or death: keeping the site up is about as close as I get to life and death on a daily basis. So for the duration of this talk, I want you all to imagine that you are, whether you are or you aren't, one of these rank-and-file engineers who, at least at Etsy, is capable of bringing everything down with the push of a button. I'm going to explain how that process works and what we do to avoid disaster.

So, Etsy. We're a company based in Brooklyn, and we are the world's largest marketplace for vintage and handmade goods. We had 1.5 billion page views, 117 million dollars' worth of goods sold, and 950,000 users in December of last year alone. So we're not small, we're growing, and we're heavily trafficked at all times of day.

We practice something called continuous deployment. This is the process we use to continually deploy to the site, and you might think, well, obviously you're continually deploying; you need to add new features and make the site better in all kinds of ways. But continuous deployment is actually a fairly new phenomenon in the software engineering industry. In the past, deploying was a ceremony: everyone would dress up in their deploying robes and march off in procession, trumpets playing, and the high priests would deploy to the site. With continuous deployment, it's commonplace, every engineer can do it, and it's a lot easier. By "deploy" (it means different things to different people) I mean updating the site and releasing your code, hopefully without taking the site down.

Now, this is relevant because at Etsy, everyone deploys. It's not just a handful of committers who have access. We have 250 committers, and not all of them are even engineers: we've got a lot of designers and some product managers who can deploy. Which is a very scary thought, right? You're going to unleash this army of people who could potentially destroy everything on your site. How do you trust them? How do they trust themselves to do this safely? (Excuse my voice, I'm just coming down from a pretty bad case of bronchitis.)

So we make this really easy for everybody. We make it so easy, in fact, that on your first day at Etsy, you're expected to deploy to the site, knowing nothing about the stack, knowing nothing about our architecture, knowing nothing at all. And, as far as we know, not really even knowing whether you can code; you got past the interview, but that doesn't mean much. This is tense for everyone involved, but we make it easy because we put everything behind the big green button, where everything's abstracted away. You commit your code and push to master (we don't use branches, we push right to master), and when you're ready, you push this button and it does a whole bunch of crazy stuff: it'll rsync out via a distributed shell, which will pull from Git, copy files over, and do a lot of different things that you don't even know about.
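To make that concrete, here's a minimal sketch of what the big green button conceptually does: pull the latest master, then fan the tree out to each web server. This is Python with hypothetical hostnames and paths; the real tooling does far more (locking, logging, rollbacks), so treat it as an illustration rather than Etsy's actual deploy script.

```python
# Conceptual sketch of a "big green button" deploy: update the local
# checkout from master, then rsync the tree to every web server, the
# way a distributed shell would. Hostnames and paths are hypothetical.
import subprocess

WEB_SERVERS = ["web01.example.com", "web02.example.com"]  # hypothetical
DOCROOT = "/var/www/site/"

def deploy():
    # We push straight to master, no branches, so just pull the latest.
    subprocess.run(["git", "pull", "origin", "master"], check=True)
    # Fan the files out to each web server.
    for host in WEB_SERVERS:
        subprocess.run(
            ["rsync", "-az", "--delete", DOCROOT, f"{host}:{DOCROOT}"],
            check=True,
        )

if __name__ == "__main__":
    deploy()
```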
But all you need to know is that once you push that button and wait a couple of minutes, your site's going to be live, serving the entire world. So you'd better hope it works. With this process, we get about 30 deploys a day, which is an absurd number. That's 30 chances a day for someone to accidentally introduce a bug or do something that could break the site. And this is bleeding edge even among sites that practice continuous deployment. Facebook practices continuous deployment, and they'll deploy maybe twice a day, and it's still a pretty heavily orchestrated process. I don't know anyone else who's deploying this much, this liberally.

Now, you're probably thinking: isn't that insane? Things can break. The reason we're able to do it this way is that we have a whole suite of tooling and processes for engineers, so they feel confident in their code, in the state of the system, and in their knowledge of it. It boils down to having your finger on the pulse of the system.

So what exactly does that mean? We have so many things going on at etsy.com that all come together to produce this website where you can browse and buy your hand-knit tea cozies or whatever you want. We've got hundreds of web servers. We've got Gearman for task management, we've got a Hadoop stack. We have sharded MySQL databases, and a master database that works in some situations and not in others. It's a very complex, interconnected group of computers and systems. How is any one engineer supposed to manage all of this? But they have to, if they want to successfully navigate the waters of deployment. Because if they can't, they won't know when something is going wrong and they won't be able to fix it. So you have to give them that touch, and it has to be accessible to your very newest engineer on their very first day, someone who has never written any part of the stack and who still needs to easily grasp what the system is doing and how it's behaving.

So the question reduces to this: how do you make an entire stack, full of metrics and full of opaque processes, consumable to a handful of engineers who didn't even write the thing in the first place? How do you do that? That's the real question. Because once it's consumed, you can imagine a God's-eye view of the system, where some entity is fully aware of everything going on at any given time, consuming all the information the system holds. If that entity existed, it would be very good at diagnosing problems and preventing them. Unfortunately, because we're just humans, we don't have that kind of access into the stack. We can't be in every single CPU execution block that runs. So how do we handle that?

I'm going to introduce you to this little chart, which I'll keep referencing throughout the rest of the talk. It shows the trade-off between abstraction, information, and consumability. At the bottom corner are things that are easily consumable, but the trade-off is that they don't actually contain much information; the information isn't very dense. As you go up and get more abstract, things become less consumable, but you gain a lot more information that you can consume more efficiently. The catch is that it's harder to comprehend.
There's cognitive overload. As an engineer, you want both of these things at once. You want to maximize both: all the information, and fully consumable. I think this applies to data visualization in the broader sense, too. There are a lot of data visualizations out there that have lots of information but aren't consumable.

To illustrate, there's this one, which you might look at and think, what an excellent, good-looking visualization. And I apologize if the author's in the crowd right now, which is a possibility. But I don't really know what's going on here. That's because it's trying to make a very large amount of information, the three-trillion-dollar cost of a war, consumable to me. That's a massive amount of information, so this chart is very, very dense. It contains a lot of stuff, but there's a lot of cognitive overload in trying to decode it, internalize it, and then make it something actionable. And there are so many parts of this system that I didn't build and don't know about, and so you have to visualize them.

And then there's, of course, the original data visualization, which is very easy to consume. The flip side is that it doesn't contain much information. It's an avocado. That's it. And so the ideal visualization, the platonic ideal, would be something as easy to consume as an avocado, but that contains all the information about the three-trillion-dollar war, so that I as an engineer can go into my whole web stack, manipulate it, know when it's wrong, and all will be well.

So let's get into some of the tools we use for this. Our first line of defense is a tool called SuperGrep. It's a real-time error logging tool, and it's a very simple one. All it's really doing is tailing log files. You've got a bunch of log files in different places, and there's a Node process that literally just does tail -f, pipe, grep error. Whenever an error pops up in your logs, it bubbles up to this web app via WebSocket, and you can watch it scroll by. It's just a composite view of all our logs; there's nothing special going on. You might be thinking, this is not a data visualization, what's going on? Well, it is. It's just very low on the abstraction spectrum. There's no abstraction, and as a result it's very easily consumable, because you get all the information right there: the date, a hash of the error, which part of the stack it came from, the actual file and line of code that threw it, and what the error message said. You get all this information, and you can just read it and know what's going on.

What people do is push and then have SuperGrep sitting there, and if it blows up (that's the Etsy parlance, SuperGrep "blowing up"), it floods with errors. Because the good thing about being heavily trafficked is that once you introduce an error, everyone will hit it. So you'll know when it's angry, and you'll be able to fix it, because you just read one of the errors.
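The core of SuperGrep fits in a few lines. Here's a minimal sketch of the idea in Python; the log path is hypothetical, and the real tool is a Node process that pushes matches to a web page over WebSockets instead of printing them:

```python
# Minimal sketch of the SuperGrep idea: follow a log file and surface
# only the lines that mention an error, as they happen. Essentially
# `tail -f | grep error`.
import time

def follow(path):
    """Yield new lines as they are appended to the file, like `tail -f`."""
    with open(path) as f:
        f.seek(0, 2)  # jump to the end; we only care about new lines
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.1)  # nothing new yet; wait a beat
                continue
            yield line

if __name__ == "__main__":
    for line in follow("/var/log/web/error.log"):  # hypothetical path
        if "error" in line.lower():
            print(line, end="")  # the real tool broadcasts via WebSocket
```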
Of course, the problem with SuperGrep is that since it's very low on the abstraction spectrum, the information is not very dense. As you can see, you can only fit maybe 30 or 40 discrete pieces of information on the screen, and you have to read through them; I don't really know exactly what's going on with the entire stack from this. So the information is fluffy. The flip side is that it's easy to consume. If we want more information, though, we have to go a level deeper. We have to bump up the layer of abstraction.

The other problem with SuperGrep is that it only tells you about errors. Systems are more complex than that; they don't always tell you when they're mad. If you load up the site and it's not loading because of something you just pushed, and nothing's showing up in SuperGrep, what do you do? You've got a problem. You need denser information.

What you could do is look at the actual logs themselves, because there are probably logs about it somewhere. Most likely. Hopefully. Maybe they're not throwing an error, but if you look at them, they'll clue you in as to what's going on. And if you wrote the system, it's almost intimate: you know your logs, you know what they look like normally, and you can check them and easily diagnose the problem. So you could do this. You could SSH into your server, poke around for the log file, try to remember where it is, /var/log, whatever. There are standards for this, to make these locations easier to guess if you've never seen the log before; that's why we have them, so people who don't necessarily know the system can come in and do this. Then you can hopefully work out what the log is supposed to look like, and maybe scroll back a while until you see a spike or some other kind of anomaly.

Or you could take all this information and turn it into... well, hold on, one thing first. This raw log is, again, easy to consume in the sense that there's no abstraction about it, but it's really hard to get at. There's a lot of friction: SSHing into the server, doing all this stuff. When you take that into account, it's actually not that easy to consume after all.

So you bump up a layer of abstraction and turn it into a graph. This graph is exactly that log file; there's a one-to-one relationship between what's on this graph and what's in the log. The difference is that this is a little higher on the abstraction-consumption continuum. What we get out of it is much denser information: I can look at 24 hours' worth of data on a single screen without even interacting with it, which I can't do in the terminal. However, the trade-off is that I don't necessarily know what's going on. I have a couple of clues: I have the name of the metric, which is queue size, hopefully a descriptive name, and I have data points and times. And if we go back to the log, the file name there luckily also says queue size, so this is a pretty good example. But there are cases where the name of the metric is not at all apparent, very complicated names with lots of levels of namespacing, and you don't necessarily know what you're looking at. But if you made the graph, you know what it's for, you internalize that, and then you're able to overcome that little bit of cognitive overload and have an effective tool, which is the graph.
And so, to easily make these graphs, we have a tool called StatsD, which we wrote. It's a daemon that collects metrics sent from your code and forwards them to a round-robin database. It's as easy as putting one line anywhere in your code where you want to graph something: you call the StatsD increment function with whatever your metric name is, or you can record a value or a timer on it. You do that, and then you magically have a graph. To get the graph I showed before, I took the code that produced the log file, and instead of writing to a log file, I just make this call and I get that graph. So we make it very easy for engineers to do this.

The philosophy we subscribe to is: if it moves, graph it. Anything that moves, graph it; we want the entire network to be potentially viewable. And if it doesn't move, graph it anyway, because one day it might move and become relevant. You're laughing, but if you have a graph that's flat, that tells you flat is normal, and if it stops doing that, you've got a problem. You want that backlog of data to look back at.

After that, StatsD forwards everything to a tool called Graphite, which is great; it's a round-robin database. It's not as simple, but you get access to all these graphs that you can look at and scroll through, find the graph you want, and then apply Holt-Winters forecasting or moving averages or whatever you want to manipulate the graph. And that's very cool. Of course, the problem with Graphite is that, again, it's kind of hard to get to the graphs. If you want any individual graph, you have to click through name after name after name, and maybe you don't even know the names; you have to find it. We're back at the same problem we had before. It's dense, we have all the graphs, there's abstraction around them, but it's harder to get at on a larger scale.

So how did we solve this before? We bumped up the level of abstraction. That's what we're going to do again, and create dashboards. A dashboard is a collection of graphs: instead of interacting with and navigating to ten different graphs, you can view them all at once on the same screen. This is one more layer of abstraction, and correspondingly, it's a little harder to figure out why a certain dashboard is the way it is, because someone had to create it. Presumably all these different metrics have something in common with each other, but you don't necessarily know what that is. The flip side is that once you figure that out, once you get past that cognitive burden, you have access to all the relevant (hopefully) information right in front of you. So it's a lot denser, and it's a lot more actionable. And so the process is: you push, you hang out with the dashboards, and you make sure they're not misbehaving in any way.

Now, you're not out of the woods just yet. You might think you are: I have my dashboards, I'm looking at everything, what could possibly go wrong? Well, because it's so easy to litter your code base with StatsD calls everywhere, we end up with an avalanche of metrics. We have hundreds of thousands of metrics; we have a quarter million metrics in use, which is a lot.
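To make the "if it moves, graph it" workflow concrete, here's roughly what that instrumentation looks like using the open-source Python statsd client (Etsy's calls are PHP one-liners; every metric name below is made up):

```python
# Sketch of instrumenting code with StatsD, using the open-source
# Python client (pip install statsd). StatsD listens on UDP port 8125
# and forwards aggregates to Graphite. All metric names are invented.
import time
import statsd

stats = statsd.StatsClient("localhost", 8125)

def handle_request():
    stats.incr("requests.served")                # count something that moves
    with stats.timer("requests.response_time"):  # time it while we're at it
        time.sleep(0.05)                         # stand-in for real work
    stats.gauge("queue.size", 42)                # record a point-in-time value

if __name__ == "__main__":
    handle_request()
```

Every one-liner like this mints another metric that Graphite can render (and wrap in functions like movingAverage() or holtWintersForecast()), which is exactly how you end up with a quarter million of them.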
That's far too many for any one of us to monitor, and dashboards don't really work at that scale. We have so much information that we can't consume it. With a quarter million metrics and 250 committers, that's a thousand metrics per engineer, so it's not feasible for engineers to be watching dashboards all the time to get this sense of touch and intimacy with the stack. But it's important that you know what's going on, all the time. So once again we have the problem of too much information that's too hard to consume in time. How did we solve it before? We bumped up the abstraction. And that's what we're going to do again.

The problem here is that if one of these graphs spikes and no one's watching, you can't see it, so you need to have it put in front of you. There's also the fact that there may not even be a dashboard around it. There may just be a lone metric that an engineer didn't have the time to build a dashboard around. Because remember, there's more friction around a dashboard. There's no friction around adding a StatsD call, but a dashboard is higher up the abstraction ladder: denser information takes more energy to create and more energy to consume. So people don't always do it. But that doesn't make the metric any less important, and it doesn't mean that metric gives you any less insight into its corner of the system. It doesn't mean you shouldn't know about it when it's misbehaving.

So with dashboards we have denser information than a single graph, but that's negated by all the different charts and graphs we have. To consume all this, we're going to bump up a layer of abstraction once again, with a tool called Skyline.

Now, with Skyline you might be thinking, okay, how do you get more abstract than dashboards? And with all these tools, nothing I've described really seems that abstract; it's not like a data visualization where there are crazy things going on. But these are actionable abstractions, because as engineers, we don't have time to consume. I would say, if you want to test how good your visualization is, figure out whether it's something you could use to save the world in five seconds. If you can't consume it that fast, there's too much cognitive overload in your abstraction. So everything we use is as simple as possible, because the less we're thinking about our data, the more time we have to think about the thing that matters: how do I fix this problem?

Anyway. Skyline is a real-time anomaly detection system. I'm not going to get into the gory details, although I'd love to afterwards, so definitely feel free to talk to me about how it works. But the basic gist is that it takes all your metrics, all quarter million of them, and analyzes them for anomalies in real time. If it finds an anomaly, it surfaces it to you, SuperGrep-style, on this dashboard. So what are you looking at here? You've got the names of your metrics, which presumably, hopefully, provide a little bit of context for what you're looking at. You have the particular data point that triggered the algorithm to say, hey, this is bad, you should know about it. And then you have a one-hour view of the graph, and a 24-hour view for context.
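Skyline's real detector is an ensemble of statistical tests that vote, but one classic member of that family is easy to sketch: flag the latest point if it sits more than three standard deviations from the mean of the recent window. A minimal, self-contained version (window size and threshold are illustrative):

```python
# Sketch of one simple detector in the Skyline spirit: the "three-sigma"
# rule. Skyline itself runs an ensemble of tests like this over every
# metric in real time and surfaces whatever they agree is anomalous.
import statistics

def is_anomalous(series, window=60, threshold=3.0):
    """True if the last point deviates more than `threshold` standard
    deviations from the mean of the preceding `window` points."""
    if len(series) < window + 1:
        return False  # not enough history to judge
    history = series[-(window + 1):-1]
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return series[-1] != mean  # a flat line that moves is anomalous
    return abs(series[-1] - mean) / stdev > threshold

if __name__ == "__main__":
    flat = [10.0] * 100
    print(is_anomalous(flat + [10.2]))  # True: the flat line finally moved
    print(is_anomalous(flat + [10.0]))  # False: still flat, still normal
```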
So you can see here there's an anomaly detected. We know what the metric normally looks like, and we know what it looked like in the past hour, and we can figure out: okay, thank you, computer, for telling me about this. This is something I wouldn't necessarily have been looking at.

The next step is to look at the name. It's something like timers, page, activity list, logged in, sum 90. So I think that means: on the logged-in activity list page, the sum of the 90th-percentile timings is, well, 40,000 milliseconds. But that's hard to figure out, because this is just a random thing that popped up. You have no context from a dashboard or anything. You have no explanatory text. You don't know who wrote it. So it's really high up on the level of abstraction.

What you gain out of it, though, is a massive effective information density, because you are effectively consuming everything. Even though you're not looking at it all, you're outsourcing your consumption of all the metrics to the machine. You're partnered with the machine: it looks at all the graphs for you and says, okay, I think you should look at this one, and you're effectively skimming the cream off the top. What you've actually done there is play God for a little bit; you've looked at everything and known everything. So you get this incredible density. But the trade-off you make is a really big cognitive overload, because this is hard to penetrate. You don't know if anything's important. It's noisy. It's kind of a funky tool.

And there's also the trust issue: how do you know the computer is telling you the truth? Either it's giving you things that aren't anomalous, which is why you have to manually check the graph yourself, or it's failing to tell you about things that are anomalous. Anomaly detection is not a solved problem by any means. There have been papers upon papers written about these algorithms, and I'm no statistician; I'm an engineer, and I wrote this tool for myself and my colleagues. But no one has solved the anomaly detection problem, and the reason is that with a quarter million metrics, it's not feasible to fit a model to all of them by hand. You have to figure out a way to do it automatically, and it has to be smart enough to handle any kind of metric you throw at it. There's a whole slew of problems here, but again, I'm not going to get into that just yet. You can find me after, and we can talk about anomalies all day long.

So after this, you've found an anomaly. What do you do about it? You could go fix it, or you could get suspicious, like all good engineers do, that other things are messing up in similar ways. How do you check that? At this point you'd still need to consume every other metric, searching all of them for ones that look like the one that's messing up. But that's infeasible, so again we outsource it to the machine, with a tool called Oculus. Oculus is a massive-scale metrics correlation system that works in near real time. It's not as fast, because it's very expensive to run, but the basic way it works is you give it a metric, and it finds metrics that look like it.
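Oculus actually indexes a shape fingerprint of every series and compares candidates with FastDTW, but the core intuition can be sketched much more simply: z-normalize each series so only shape matters, then rank candidates by distance to the target. A toy version, with made-up metric names:

```python
# Toy sketch of the Oculus idea: given a target metric, rank other
# metrics by shape similarity. The real system uses fingerprinting plus
# FastDTW at massive scale; plain Euclidean distance on z-normalized
# series captures the same intuition.
import math
import statistics

def znorm(series):
    """Normalize so shape, not scale, is what gets compared."""
    mean = statistics.fmean(series)
    stdev = statistics.stdev(series) or 1.0  # avoid dividing by zero
    return [(x - mean) / stdev for x in series]

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(znorm(a), znorm(b))))

def most_similar(target, candidates):
    """Candidate metric names, best shape match first."""
    return sorted(candidates, key=lambda name: distance(target, candidates[name]))

if __name__ == "__main__":
    spike = [1, 1, 1, 9, 1, 1]
    metrics = {  # hypothetical metric names
        "timers.page.activity.sum_90": [2, 2, 2, 20, 2, 2],  # same shape
        "counters.logins": [5, 6, 5, 6, 5, 6],               # unrelated
    }
    print(most_similar(spike, metrics))
```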
So here's the metric we had before, and not surprisingly, the 90th-percentile sum looks similar to the plain sum. But there are other related metrics in here too, so we can go back and infer: okay, something's wrong with the activity list, because different metrics around it are messing up together. Now we have more context, and now we can maybe go look at the activity list and figure it out. There's an enormous amount of cognitive overload here, for the same reason as before: you don't really know what's going on, there's just this magical thing. There's an option up here, FastDTW, which is the comparison algorithm; you can change that to tweak it, but it's hard to grasp what you need if you don't have a degree in statistics. Still, it's an extremely powerful tool: you get to compare all the metrics in our stack against one another to find similar ones. So it's very high up on the level of abstraction.

The moral of all these different tools is that to write a good tool, an effective one, a communicative one, which I think should be the goal of any tool, you have to add touch to the system. Touch is a hard concept to define. Touch is knowing something. Touch is like running your fingers through your hair: you know what it normally feels like, and if you find a lump, you just know. That's touch. Anything you use all the time, like your computer, you have touch for, because you know it and you can be absentmindedly aware of it. A good tool adds that. With SuperGrep, you leave it open off to the side and keep a finger on it; you're aware of what it normally feels like, and you let your own subconscious tell you when it messes up. More touch means more intimacy with the stack, which means you know what's going on with it and you can fix it.

The tension with these tools, though, is that something that's easier to consume is less dense. With more awareness of exactly what's going on, you sacrifice awareness on the larger scale; and with awareness on the larger scale, you sacrifice knowing exactly what's messing up at ground level. It's a balance. And I think the answer tends toward the less abstract: the better, more useful tools tend to be less abstract. SuperGrep is our most popular tool, SuperGrep and the dashboards. The other ones, Skyline and Oculus, are less widely used. They're newer, and we're still working out the kinks, but they're also less widely used because they're harder to use: you have to focus on them, as opposed to letting them do their thing off to the side. When you have more abstraction, you have less intimacy but greater information density. So when you're building these tools, keep that in mind, and make an engineer happy. Thank you.