With that introduction, I'm almost obligated to tell you that by the end of this talk, I'm going to have each and every one of you pulling out a checkbook, because I'm going to be selling an amazing product that's going to change your lives. So, I'm Danyel Fisher. I'm from Microsoft Research. I'd like to thank Domenicus for allowing me to borrow his laptop, and I'd like to demonstrate that even when I touch it, neither the laptop nor I catch on fire. It's really cool. I'd be delighted to talk with any of you about Microsoft Research's role in the world and what we do, but in general I work something like an academic researcher: my day-to-day job is to understand problems out in the field and to think about solutions. My first encounter with big data actually happened sometime around 2007. And the story begins, as all the best stories do, in a server room on Saturday night at midnight. I had been sitting... oh, was that again? Okay, so they do catch on fire when I touch them. There we go. My story begins, as the best stories do, in a server room on Saturday night at midnight. I had been talking to a colleague of mine who worked on what was then called Microsoft Virtual Earth, which would eventually be called Live Search Maps and eventually Bing Maps. We were having a conversation about how users use the system. He sort of believed that people would go check out the whole diversity of the world: there's a whole world full of imagery, and gosh, they'd be checking out mountains and forests and oceans and plains. I, of course, thought this was a really interesting question; I'd love to find out. So I spoke with him and his team, and he introduced me to a program manager, who got me in touch with DevOps, who put in a change request. And, you know, it seemed like sometime in six to eight weeks we'd be able to put in a spec for the way that the data would be transferred. You've all been through this sort of story. And so that's why, at Saturday night at midnight, I was sitting in front of my computer when suddenly my instant messenger window popped up, and a guy from the server room said: I was running the backups and I realized I can take care of this for you. Go check this share over here. I went to the file share, and there were 20 gigabytes of zipped, compressed files: service logs showing me precisely which tiles had been downloaded, when, and by whom. So I have, for each request, a date and time, I have a URL, and I have some other stuff. Because it's a tile-based system, you can look at the URL to figure out precisely where the tile is. The address of this tile, that 02123 and so on, is a quadkey that tells you precisely where this location is. Those couple dozen gigs unzipped to about 100 gigs, which was kind of overwhelming for my hard disk in 2007, anyway. And then, a couple of days of Perl scripts later and a week of SQL Server churning away, I finally had something that I could actually present, which looked something like this. This is a map of where people look when they look at Virtual Earth. It's a log scale. The brightest spots are very, very bright, with hundreds of thousands of hits; the dim spots are very, very dim, with approximately zero. The spot in the corner shows you the size of the tiles we're looking at in this particular image: these are people who are looking at roughly, I don't know, the 100-mile range.
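(A rough sketch, not from the talk: decoding a quadkey like that into tile coordinates, assuming the standard Virtual Earth / Bing Maps convention where each digit 0-3 picks one quadrant at the next zoom level.)

```python
def quadkey_to_tile(quadkey: str):
    """Return (tile_x, tile_y, zoom_level) for a quadkey string like "02123"."""
    tile_x = tile_y = 0
    for digit in quadkey:
        d = int(digit)
        tile_x = (tile_x << 1) | (d & 1)         # low bit selects the eastern half
        tile_y = (tile_y << 1) | ((d >> 1) & 1)  # high bit selects the southern half
    return tile_x, tile_y, len(quadkey)

print(quadkey_to_tile("02123"))  # -> (5, 11, 5)
```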
And as you can see, people tend to look at, well, English-speaking countries, because that's where the product had been rolled out. They tend to like to look at cities, and they like to look at coastlines, and they like to look at islands. And they like to look at this bright spot in the Atlantic Ocean, right off the coast of Africa. Anyone? Zero zero. Zero zero, precisely. It turns out that you have just found a bug in the JavaScript control for Virtual Earth version one; it was corrected in version two. If you typed in a location that the system couldn't find, it would fail and drop you at zero, zero. That bright spot is users winding up there, and the tails that you see extending in all directions are people frantically scrolling around, trying to figure out where they are. I know there's a bit of a Seattle contingent here. The bright spot over there on the left side is downtown Seattle, and you can see that people like to look at downtown, and they scroll around to the Space Needle and the university on the other side of the water. You can see Redmond and Kirkland and Bellevue, and a bright square. Anyone know what that bright square is? That's Bill Gates' house. And the reason that it's a square is the Slashdot effect. Slashdot had put out an article saying: now you can spy on Bill Gates using his own tools, click here to see Bill Gates' house. And so people would click over there, their screen would fill up, and they'd look at Bill Gates' house. And as you can see, they didn't scroll around. They did not go load the tiles nearby and see who Bill Gates' neighbors are. They said, okay, I've seen Bill Gates' house, and hit the back button. That's the visible evidence of it. Southern California got the same sort of effect. For those of you who are considering vacations out to Southern California, I'd like you to notice, up at the top, all sorts of beautiful bright spots next to each other: Hollywood and Santa Monica and Los Angeles and downtown. Locals can, I'm sure, tell me much more about that. And then there's a very, very, very long stretch before you see one other bright spot. It is the brightest spot on the Virtual Earth map. The single most looked-at place on Earth is also, of course, the happiest place on Earth: that's Disneyland. But when you bring your family there, please be aware that you're in for a one-hour, very, very boring drive. You can see it all right on this map. So we were getting real feedback and real information about, as I said, how people were behaving in the system. Here's a zoom-in on part of the central United States. You can see Chicago on the right side; on the left there's Salt Lake City and Denver, Colorado. And that very bright cross in the middle is the geographic center of the 48 contiguous United States. When you started up Virtual Earth in the United States, you would start right there, and if you were to pan or search or zoom, you'd go look at the various places that you wanted to. But if you were to just grab the zoom knob right over there and shove it all the way to the end, you'd find yourself right in the middle of, I think, the Snake River Indian Reservation, which has nothing much in particular to look at; the imagery was terrible. And you can see the cross as people, again, try to reorient themselves.
And I think that little arc on the side was actually people playing calibration games with precisely where they'd place the screen based on resolution. So I brought this to the team, and they started making revisions, and they started learning about how users were using their system. And everyone was very happy, until one of them said to me: by the way, can you distinguish American users from international ones? I thought about that. That's actually fairly straightforward: all you have to do is look at the IP address, which had been in my original logs, because anonymization had clearly failed badly. Anyway, I could have gotten at that, I guess, but I realized that the way I'd architected this thing was, of course, that I started with the raw data, I dropped all the columns that I didn't need, and I filtered the rows. Then I had this process of aggregating it on the server side and building indices, and from there I was generating shapes, which I was finally able to serve up as tiles to the client. The problem is that first step, which I was doing pretty much by hand; I had a Perl script that was running it all. To distinguish between American and international users, I'd have had to go back to the very source files and those very earliest Perl scripts, do a reprocessing, and expose more data. And of course, once I did that, they'd ask me about the next dimension, and I certainly didn't have that one available either. So I was going to be making a very difficult world for myself. I think this story illustrates in a microcosm some of the challenges I want to talk about in big data: where you have to make decisions, and how this looks a little bit different from some of the exploratory visualization that we otherwise so like to talk about. Now, the definition of big varies. In 1975, I was being born, but this guy was giving a talk at the SIGMOD conference, sorry, at the VLDB conference. This was the very first VLDB conference, and databases were not nearly as big as they are today, but still, this is a substantial size: 200,000 magnetic tape reels for the US Census, representing 900 billion characters of data. That's 900 gigabytes. That's not bad. But what's most interesting is those 200,000 tape reels. Consider what that means. If you've got an algorithm that, you know, just takes order-n passes through the data, how many passes can you afford? Four passes? Six passes? Really, every pass above one is a lot, and even one pass, looking at every single tape reel, is a horrifying and terrifying idea when you've got 200,000 physical tape reels that have to be picked up and moved, the robot arm going completely nuts on this stuff. Just a couple of years ago, Ben Shneiderman gave a talk, this one at SIGMOD, thinking about what he called extreme visualization: squeezing a billion records into a million pixels. He talked about a number of techniques. Certainly by then we knew that a billion records wasn't actually a huge deal, but it gives you some idea of what the visualization community was considering to be big. I want to step away from any particular size, because I'm sure someone right here is going, ooh, but we've got this exciting new technology that covers 10 billion, fantastic. Add a zero and it falls down; add another zero and it really becomes a problem.
Instead, I think we want to talk about cases where the size of the data set is actually part of the problem that has to be solved in visualization. We just had a fantastic conversation about performance, but that was all rendering-side performance, and on the rendering side we have certain limitations; for example, unless you're sitting inside this room, downloading 20 megabytes to your screen isn't actually a huge problem. So I want to talk about two different parts of what makes big data challenging. One of them is the representation problem itself: what do we put on screen? How do we represent it? What visualizations do we choose in the cases where we've got billions and trillions of data points? The other is interaction: what do we need to do to make visualizations useful for interaction when we're dealing with these very large scales? Of course, as we're talking about this, we're not only talking about users' time and attention and programming effort, we're also beginning to talk about money. A lot of these things are beginning to happen on distributed systems now, and what that ends up meaning is that if you run a query on a couple hundred cores for a couple of hours, you've spent an awful lot of money just keeping the system spun up. And that's an interesting world. At least, I certainly had never before encountered a time when I was actually spending money with my computer, but I now definitely am. Speaking of which, I'm not talking about any particular technologies today. There's going to be no code on any of these slides. I'm talking about concepts and thoughts that I've been working on. That said, I just came back from the O'Reilly Strata conference a month or so ago, and everyone there is talking about Hadoop and the Spark stack and various competitors and permutations of them. There are a lot of really exciting things going on in big data and distributed data systems that are really worth paying attention to. And I know a lot of people in this room are very much focused on the front of the system, but there's a lot to be said for what's going on in the back, too. Now, when I'm talking about size and some of the issues going on: I've got on the order of a million or so pixels on my screen. Maybe you've got a retina display and it's 10 million pixels. Maybe you've got five retina displays together and you've gotten yourself to, like, you know, 50 million pixels. We're not adding too many more zeroes after that without overwhelming both the human eye and physical capabilities. And that means that if you've got enough data points to fill up memory, 10 to the 9th or so bytes, you're going to have to make some decisions and some sacrifices. And if you're filling up your hard disk, 10 to the 12th or so, you're going to have to make even more. So I think it's safe to say that no matter what we're doing, we're going to have to have some conversation about more sophisticated visualization techniques than putting every data point on screen and calling it a day. I mean, we can try. Here I've provided for you a big data scatter plot. And a big data network diagram. And over there, I tried it in parallel coordinates. So there are some limits. We're going to have to aggregate. Fortunately, people have been giving a lot of thought to aggregation techniques. There's a lovely paper from a couple of years ago by Niklas Elmqvist and Jean-Daniel Fekete thinking about what they call hierarchical aggregation.
But in general, it does a fairly good job of outlining what we're getting at when we say that we want a scatter plot, and therefore what the aggregate form of a scatter plot is, or what the aggregated form of a line chart or a bar graph is. Some things are pretty obvious. This happens to represent a couple hundred thousand points underneath, but you'd have no idea: we're showing some averages and some confidence intervals and some values, and it's a sum of stuff. It works fine. Sometimes aggregation should take some thought. So, for example, we often see stock tickers that look something like this. This is the stock price for a large company which I'm fairly familiar with. You can see some changes in fortune, and you can also see some stock splits and some highs and lows and that kind of thing. The critical bit that I really want to bring out is that when I pulled this stock trend, I got some sort of numbers; I think that's probably an average per pixel, or per month, or something. So I went in manually and instead pulled the actual minimum and the actual maximum value per month, and placed one pixel per month here, so I can say precisely what the values are. And you see that this is a noticeably different chart from the one above. The highs are higher. The lows are lower. In the shape of that curve there are times when you can see a lot of variation, and there are times when there's not much at all. So choosing your aggregation, as opposed to picking some hidden default, can be a very useful way of starting to think about what your visualization is actually trying to show you. I actually pulled this next image from Jeff Heer's imMens paper, and thanks, Jeff. Again, he's got one of those big data scatter plots, and he's also shown what it looks like to bin and bucket that data into a hex grid, and you can see much more texture in where the data actually is when you go ahead and bucket it. Lots of techniques, however, are open to aggregation. Martin Wattenberg's PivotGraph mechanism was meant as a summarization of network visualizations, and I think it's a criminally underused system for a lot of the network visualizations I've been seeing. We've got two different attributes, one of them lined up vertically, one of them lined up horizontally, and the thickness of the edges represents the number of connections that go between things with those attributes, so it's a way of exploring the forms of heterogeneous connection inside a network. We're all fairly familiar with stream graphs; stream graphs can be fantastic for our fundamental form of aggregation, and even some of the more exotic forms. Tree maps mostly turn out to be fairly good for aggregates, and I say mostly because we need to have a conversation about precisely how you store your data to represent a tree map, and when I started thinking too hard about this, all sorts of interesting things started coming up. But in general a tree map is simply a pile of pie charts stacked on top of each other, and so you can in fact build that as an aggregation. Lots of people have independently reinvented this concept of what turns out to be basically a generalized histogram. Hadley Wickham has a lovely paper called Bin, Summarize and Smooth where he walks through essentially this technique: you choose some sort of bucketing across your data, some range that makes sense, you take the points, put them into the buckets, and then you create some sort of shapes on screen based on those buckets.
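(A minimal sketch of that bin-and-summarize idea, with made-up monthly price rows; the point is just that the summary function, a mean or a min/max envelope, is a choice you make, not a hidden default.)

```python
from collections import defaultdict

# Hypothetical (month, price) rows standing in for the stock data on the slide.
rows = [("2001-03", 52.1), ("2001-03", 61.8), ("2001-03", 47.5),
        ("2001-04", 50.2), ("2001-04", 50.9)]

def bin_and_summarize(rows, summarize):
    """Bucket the rows by month, then apply whatever summary you've chosen."""
    buckets = defaultdict(list)
    for month, price in rows:
        buckets[month].append(price)
    return {month: summarize(values) for month, values in sorted(buckets.items())}

mean_chart = bin_and_summarize(rows, lambda v: sum(v) / len(v))       # the hidden default
envelope_chart = bin_and_summarize(rows, lambda v: (min(v), max(v)))  # highs and lows survive
```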
Every chart that I just showed you is, in that critical sense, sort of a histogram-y kind of thing. But here I need to take a moment and distinguish between exploratory and presentation visualization, because everything that I've shown you so far applies to both exploration and presentation. I don't care whether you're, you know, pre-cooking things for a month before you come up with the one image that you're going to show your users, or whether you're trying to allow them to explore; you're still going to have to work with these aggregations. But things get a little bit harder when you're trying to be the exploratory, wonderful, interactive people that I know everyone in this room really wants to be. And to do that, I'm going to tell you the story of Walt: Walt, the hypothetical histogram. Walt wished to exist, not in a simple world, but in a world with lots of data. He was large, he contained multitudes. He sounded his barbaric yawp, which is why he was named Walt. Walt's world, though, is a difficult one. Walt lives in a world where data sets are very, very big, where queries against his data set take, let's simplify the story, precisely one minute. You do one query against the Walt server; it goes across the distributed system, highly optimized, distributed across Hadoop, spins up the right cores, gets everything done, and one minute later you get back a result. You've got another question? You send it off to the cluster, it goes away, and it comes back another minute later. And this is actually kind of how a lot of Hadoop and Spark clusters are working today. So when we wish to first bring Walt into existence, that first moment, what do we have to do? Well, we want to do some sort of bucketing. And to do that (wow, things are coming in out of order) first we need to find out what the minimum and the maximum are. We need, after all, to be able to figure out where those buckets go. It wouldn't do to be placing points into buckets based on the assumption that Walt ranges from 1 to 10, and suddenly see 100 million somewhere in there. So we're going to have to do one pass just to figure out the minimum and the maximum, and that's all that pass is going to do. I assume that we're all clever enough to at least do that in one pass, and not do one for min and one for max; that would be a pain. Then a second pass, based on that minimum and maximum, allows us to bucket the points. So our total time to bring Walt to birth, the first time, is two minutes. But of course, we like interaction. We like to interact with Walt. Maybe you want to, for example, change the bucket count. We already know Walt's minimum and maximum, which is good news, but pretty much we're still going to have to go back through and re-bucket every single point. We have to look at each of them and say, ah, this point now goes into this new bucket, and this point now goes into this new bucket, because you've got new buckets. And so we've got a new Walt. Unless maybe we were clever. Maybe in advance what we did is we said: you know what, the user might want 20 buckets, 30 buckets, 50 buckets, 80 buckets, 100 buckets, 10 buckets. Go find ourselves a nice fine-grained common multiple, you know, say 3,000, and we'll say: we're actually going to pre-bucket everything into 3,000.
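(A minimal sketch of Walt's construction, assuming the data is just an iterable of numbers and that each full scan is the expensive part; the fine pre-bucketing at the end is the "be clever in advance" trick.)

```python
def min_max_pass(data):
    """Pass one: a full scan just to learn the range."""
    lo, hi = float("inf"), float("-inf")
    for x in data:
        lo, hi = min(lo, x), max(hi, x)
    return lo, hi

def bucket_pass(data, lo, hi, n_buckets):
    """Pass two: a second full scan to fill the buckets."""
    counts = [0] * n_buckets
    width = (hi - lo) / n_buckets or 1.0
    for x in data:
        i = min(int((x - lo) / width), n_buckets - 1)
        counts[i] += 1
    return counts

def rebin(fine_counts, n_coarse):
    """Collapse a fine pre-bucketing (say 3,000 buckets) into whatever the slider asks for.
    Assumes n_coarse divides the fine count evenly; no new pass over the raw data is needed."""
    step = len(fine_counts) // n_coarse
    return [sum(fine_counts[i:i + step]) for i in range(0, len(fine_counts), step)]

# Two expensive passes to bring Walt to life, then instant re-binning afterwards:
data = [x * 0.1 for x in range(100_000)]   # stand-in for Walt's data set
lo, hi = min_max_pass(data)
fine = bucket_pass(data, lo, hi, 3000)
walt_20 = rebin(fine, 20)                  # the slider set to 20 buckets
```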
Now when the user sits there with their slider, dragging around the number of buckets, we don't have to do a re-analysis; we can just grab these new values. So that cleverness pays off. The user now has this incredibly rich experience: on the front end, they see, you know, however many buckets they want to; on the back end, we've stored it as a large number, a couple thousand buckets; and everyone's reasonably happy. And then we realize that we want to add a new sort of interaction, too. We want to cross-filter with another histogram. This is going to be a little bit of a challenge, too, because with this other histogram sitting off to the side, well, you know, when you cross-filter over here, you then want to light up the buckets over here, the parts of Walt that apply to that histogram, and not light up the ones that don't. So you can, again, do another pass. Or, again, we could have been very clever. We could have predicted that users might want to have these two histograms interacting, and so as we were bucketing things in advance, we actually did this multiplicative combination of all possible ways that dimension one and dimension two could interact. If we did that, if we were very clever in advance, again we've got this interactivity: we've got the data pre-cached, and we can interact with it directly. But we're getting to a lot of cleverness. If we do both of these techniques, the thousands of buckets to make sure that we're ready for different settings of the slider and the multiplicative combinations, we're beginning to talk about pre-caching a couple million items. And there are a lot more interactions we might want; here's a quick list. Some of the things that we often want to do with histograms include changing the number of buckets, zooming in on a single bar and breaking it down into more bars, filtering out some data, cross-filtering into other visualizations, cross-filtering from other visualizations; you often want to look at a couple of sample rows from the histogram. Grabbing all of these in advance and making sure that they're all ready to go means we're getting pretty close to just having kept the original data set around, and remember, our premise is that we can't possibly do that. So we're going to have to make some decisions and some trade-offs. Now, there is a technology out there, I should say, that is actually designed for precisely this scenario. Nothing I'm saying is new; there is nothing new under the sun. OLAP is online analytical processing, and in the open-source world, as far as I can tell, the major vendor is the Pentaho line of products. OLAP essentially builds a cubed version of the data in advance, pre-calculating a set of dimensions that you're interested in and putting them together into a form where you can do a series of high-speed queries against those subsets. And so OLAP can pre-contain a lot of this cleverness and can get you much of the way there. But you still have to have made a number of decisions about which dimensions you're going to cube, and again, if you over-cube, you've broken yourself into so many small pieces you're not actually winning anything. So the moral of Walt's story, if you will, is that we have to decide which operations we'll support rapidly and which ones we're willing to allow to be slow.
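(A toy sketch of that kind of pre-aggregation: the multiplicative combination of two dimensions, which is essentially a tiny OLAP-style cube. The delay and day-of-week columns here are made up.)

```python
import numpy as np

rng = np.random.default_rng(0)
delay = rng.exponential(10, size=1_000_000)   # made-up dimension one: arrival delay in minutes
day = rng.integers(0, 7, size=1_000_000)      # made-up dimension two: day of week, 0..6

# Pre-aggregate the joint distribution once (the expensive pass over the raw data)...
cube, delay_edges, day_edges = np.histogram2d(delay, day, bins=(50, np.arange(8)))

# ...and then cross-filtering is just slicing and summing the cube, with no new pass:
one_day = cube[:, 5]                # dimension-one histogram, filtered to a single day
several_days = cube[:, 5:7].sum(axis=1)   # or filtered to a range of days
unfiltered = cube.sum(axis=1)       # back to the full histogram of delays
```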
We're beginning to outline, as we talk about this, a bit of a solution space. When we're talking about these big data problems, things are really nice if we can pull a chunk of the data to work with offline, because anything that's offline, anything we've already got in the browser, anything that's, you know, close to us on the server in a cached chunk, we can be very quick with. If we can't do that, we can at least build ourselves really good indices that have the data pre-stored in some useful form, whether that's OLAP, or whether that's Jeff's imMens (sorry about the typo there) or the Nanocubes project. We can decide that we're only going to deal with a subset of the dimensions, my solution from the Hotmap project, giving up on certain types of the data. That gets us somewhere. There's another set of solutions, too: the divide-and-conquer world, as exemplified by Hadoop, is trying for this, although, again, we run into timing and speed problems. And there's one other genre of solutions that I want to talk about, which is sampling and streaming. A couple of years ago, I was getting annoyed that the big data queries I was doing were beginning to turn into the sort of thing where you drop a bunch of data into a system and come back the next day to find out what your results look like, which actually meant that you were at the point where doing a query might mean asking yourself, hmm, I have ten days until my deadline, so I have seven queries' worth of time between now and then. That's a bad world to be living in. Especially because I knew that a small subset of my data would actually have most of what I wanted to know. So I started to think about trading accuracy for latency. This image is adapted from Hellerstein, who did this first and better, but I independently and accidentally reinvented it until one of my database friends said: you should look over there, Joe's doing this stuff. It was great. It really was great; I love finding people who are doing the right thing. So standard databases in some sense have this notion that time passes for a while and then suddenly you have all the data you ever wanted. Your happiness chart can be thought of as going along flat and then, very suddenly, there's a spike. And yet we could be living in a world that looked like this, because very often looking at a small chunk of your data can make you almost as happy as looking at all of it could. So if we could trade accuracy for latency, we'd be getting somewhere. And, well, you know, I'm an impatient kind of person. What we're talking about is basically taking samples, the sorts of samples that we're all familiar with, and using those to compute confidence bounds. And to give you a demo, I'm going to actually show a project that we put together a couple of years ago called Sample Action that tried to explore this idea. We're looking at the FAA flight database, and we've, let's see, chosen day of week versus arrival delay. And we can rapidly see, after looking at, like, what, 0.002% of the data, that flights on Monday are 1.7 million minutes late. That doesn't sound quite right. There may have been a lot of delays, but 1.7 million minutes seems high. Wait, I did a sum there. Let's fix that to average and rerun. So, having looked at virtually none of the data, I realized I was doing the wrong query and I fixed it. Again, looking at virtually none of the data, I can quickly see we're getting, oh, single-digit answers. Okay, that's the sort of delay that I expect to be seeing. That seems roughly right. Let's let this tick forward for a few moments. And now, having looked at 0.2% of the data, I can already tell you that flights on Friday are going to be much more delayed than flights on Monday or Saturday or Sunday. So, you know, it's worth spending that extra night in town rather than spending it sitting on a tarmac instead.
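(This is not Sample Action's actual implementation, just a minimal sketch of the underlying move: a running estimate with a confidence bound that tightens as a randomly ordered sample streams past.)

```python
import math

def incremental_mean(stream, report_every=10_000):
    """Yield (rows_seen, estimate, plus_or_minus) as a randomly ordered stream goes by."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in stream:
        n += 1
        total += x
        total_sq += x * x
        if n % report_every == 0:
            mean = total / n
            variance = max(total_sq / n - mean * mean, 0.0)
            bound = 1.96 * math.sqrt(variance / n)   # rough 95% normal-approximation interval
            yield n, mean, bound

# For, say, the delays of Friday flights, this might report
# "after 120,000 rows: 9.8 +/- 0.3 minutes", and you stop as soon as that's good enough.
```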
We in fact brought this to real people at Microsoft working with real data: a guy from the Xbox team, someone who was analyzing the Twitter firehose, a guy who was working with backend server logs, trying to understand how dealing with this incremental data would actually change their lives. Was this helpful? Maybe they needed four digits of precision every time they did things. Here's what we learned. We learned that most queries that most people did, most of the time, were the wrong question. They were asking the wrong thing. They dropped in the wrong data. They looked at a dimension that wasn't meaningful. They forgot about a significant factor in their data. They forgot to filter something out. But that's okay: they carried out lots of queries and they cut them off early. Our users became fearless, which is exactly what we want to see in exploratory data analysis. And they were fairly happy with getting approximate results. It turns out that most of the numbers they wanted to see were rough. As one person explained it to me: if I see one thing all the way up here and one thing all the way down there, I know they're very different. If they're at about the same place, it doesn't much matter whether I cut one off a little early or the other; I'm pretty much in the right place. There was one interesting insight for us on the back end, which is that randomness in databases turns out to be a real pain. Our database architectures today really desperately want everything to be built in terms of indexes and optimized orders, and we just said we want to be the opposite of that: we want the order to be as random as possible, so you can get as good a sample as possible. I thought I was alone in that, but it turns out that the nice folks at AMPLab and MIT CSAIL have built BlinkDB, which is beginning to play precisely with this idea of approximate summarization of databases. And Spark Streaming is now beginning to allow some of these same sorts of incremental computations that we've been playing with. It turns out, of course, that you need to start thinking about new technologies to support streams. What we ended up doing for a number of our visualizations was basically digging through old 1980s literature on how you dealt with systems back when data was big and memory was small, and trying to remember what sorts of techniques we knew about, like the reservoir sample, where you keep a sample of k elements from the data such that each element has a k-over-n chance of being kept around. A reservoir sample turns out to be a fantastic way of keeping a fair sample of a data stream as it comes past you. There are, in fact, one-pass approximation algorithms for histograms. Unfortunately, they generate equi-depth histograms, not equi-width histograms, and those are really a pain to visualize. We spent some time fighting with that and got ourselves stuck; but if any of you have ideas on how to visualize an equi-depth histogram, there's a lot of opportunity there.
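(The reservoir sample mentioned above is small enough to write down; this is the classic "Algorithm R" sketch, not the production code.)

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Keep a uniform random sample of k items from a stream of unknown length:
    after n items have gone by, each one has been kept with probability k/n."""
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)        # fill the reservoir with the first k items
        else:
            j = rng.randrange(n)          # uniform in [0, n)
            if j < k:
                reservoir[j] = item       # replace a random slot with probability k/n
    return reservoir
```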
Incremental visualization really changes the rules on things. With numerical data, we're talking about changing bounds on everything you can see: any color map, any scale, any axis might suddenly change on you as you discover that there's new data outside of what you used to think was the domain. I'm a little short on time right now, so I'm actually going to skip over this conversation about my tentative framework, where I really just go back to my conversation from earlier and say: hey, look, as we're doing all this bucketing and aggregating, somewhere in there you're having to make a decision about where you sample, where you cache, and where you hit the network. All of these trade-offs become an interesting conversation about precisely what degree of flexibility you're going to allow your users, and what sorts of interactions you can get. I had just gone through a couple of standard examples pointing out that when you do server-side rendering you're doing all your caching up front, and when you're doing D3 you're trying to send all your data down to the client and then have the client deal with it. What I want to bring out, though, is that there's cross-disciplinarity here. This isn't the way that SQL, or Hadoop, or visualization packages, or anything, really works today. As this data gets bigger, InfoVis, front-end, and back-end concerns are coming together. This is calling for all of us to learn new skills and to teach new ideas to the people around us, to try to explore how these ideas can be integrated. There's going to be a lot of close collaboration across fields. It's not going to be someone shows up with a CSV file, or a server, or even a SQL database, and says: go play with this. And on that note, I think this is a call to go build cool stuff with big data as well as small. You can tweet me, you can go find my webpage, and I'm around for the rest of the conference. Thank you all so much.