Our last speaker is a quantitative engineer at Facebook, where he creates visualization applications to yield insights from petabytes of data. I knew I really wanted at least one talk dealing with big data — real big data, not the buzzword people attach to everything they want to sell. So I was really, really excited to see Jason's submission, because if anyone has really big data, it's Facebook. Before he was there, he was a senior data scientist at PayPal, where he analyzed and visualized geodata. So please give your warmest welcome to Jason Sundram.

Hey, everyone. It's great to be here. I'm sorry to be the last thing, I guess, between you and beer. But I'm here today to talk to you about some of the things I learned while developing data visualizations at Facebook. We're in the home stretch — you've been sitting for about four hours and fifteen minutes at this point, so stretch if you need to.

I like to start by telling a story, and like every good story, this one starts with — you guessed it — recruiter spam. In the spring of 2012 I was living in Somerville, just a few miles northwest of here in Davis Square, and I got this email from a recruiter at Facebook. You may have seen where that was going. (And apparently I don't know how to scroll on a computer, so they hired the right guy.) I replied, mostly because I was intrigued by the idea of visualization at Facebook: what does that mean, and can I get at some of that really rich texture of data without needing a security clearance and a job at the NSA? That sounded great. The fact that the weather is significantly nicer in San Francisco played a little bit of a part in the decision too. But I love Boston. After some soul searching, I drove across the country and, like everyone else, could not resist taking the selfie in front of the Golden Gate Bridge. It's really beautiful — if you're in the Bay Area, I highly recommend the Marin Headlands as a great selfie location.

So a few months later I found myself on the way to Facebook, on a shuttle bus — this was before it was really cool to protest private quasi-public transportation — headed from San Francisco to Menlo Park, or MPK as we call it, wondering: what the hell am I doing, actually? And why was I wondering that? I was clearly at Facebook to visualize data. When you start at Facebook they give you this great boot camp: you're an engineer at Facebook, you're going to learn how to use all of our things, and we have so many things. And I didn't know very much about any of those things. Facebook has published some ridiculous number — 300-plus petabytes of data — and I didn't even know where all that was stored. In Hive, it turns out, a lot of it, which is super powerful but also really slow. And my SQL skills were dodgy at best; seriously, I knew SELECT and then some other stuff. When you get a query wrong on a small database, you go, oh, that didn't work, let me try something else. I would get it wrong, wait twenty minutes, and then find out. And then there was PHP, which I neither knew nor particularly liked, without really knowing it. I found myself completely unable to reach for my familiar tools — D3, Python, the happy places where I used to live. They were so far away, Boston included. So I was like, why am I here?
So is this talk going to be a real big downer? Do I hope to leave you all in tears? No — I think there's going to be good news. I'm going to tell you how I found a place for myself at Facebook and how I answered some of these questions: what does it mean to visualize data that's really, truly huge? How do you do it? And why aren't more people talking about it? I actually don't know why more people aren't talking about it; people love talking about big data. I'm going to tell you about three revelations — revelations to me, at least. I don't mean to suggest you don't already know all of this; maybe you do. But it was news to me, and here they are. First, all big data has to become small data to be visualized: even a Retina display only has about five megapixels, and five million is sort of small. Second, what I'm calling the fresh data revolution — you heard it here first — or why real-time systems are really good for big data. And finally, if life hands you more data, consider using more pixels. I'll explain what that means later. I want to leave you feeling inspired and empowered to use some of these revelations in your own work, and along the way I'll talk about the stack I made, or at least invented for myself, at Facebook.

Quick survey question. How many people here are designers? Sweet. Full-stack developers? Nice. Front-end people? Okay, it's about equal parts so far. Back-end developers? Yeah, not too many. Did I miss anyone? Raise your hand if you didn't raise your hand yet. Okay, cool — about as many as back-end developers. Welcome. You are loved.

My team is a pretty diverse group of people. My team, Quantitative Engineering, builds data-driven applications for insight. We're small and mighty: a statistician (not me), a machine-learning PhD (also not me), a data engineer (not me), a visualization person (I guess that's me), and — this is really important — someone to help us figure out what we should be working on and what we absolutely should not be working on. We also borrow time from designers when we can get it. We try to make things we think are pretty, and designers help us realize that sometimes we are wrong. If you heard Lisa's talk earlier: we're very much on the exploration side of the explanation–exploration continuum. We want to make tools people use to get insights, not reports that people read once, that go out of date, and then who cares. Keeping it fresh is really important. And we use Python extensively wherever we can, so I was really happy to hear Jake's talk earlier — which was about so much more than Python, I realize.

A lot of people talk about big data, and I'm sure everyone in this room is sick of hearing about it, so I'm going to talk about it. People talk about small data too, which is also coming up, and thick data, which I think I read about in the Atlantic recently — and they mean all kinds of things by these terms. One of my friends on Twitter said big data is the same thing as small RAM, and I was like, yes, that's exactly right. Try shipping a gigabyte of data to your web browser and you run into some problems. Try displaying more than five million points on the screen and you run into some different problems.
What I was most excited about when I joined Facebook was figuring out what it means to visualize big data. There are a lot of companies trying to provide solutions for big data problems, and at Facebook there are several systems for charting and exploring data. When I first started there, there were so many questions I just wanted to know the answers to — I was this hungry-for-data kind of guy. How many photos are uploaded per minute? How does that vary over the course of the day? How does it vary by country, or by mobile device, or by gender — pick a dimension. So I found out, kind of in a hurry, that there are over 350 million photos uploaded to Facebook per day. I don't know what that is in Flickrs, but it's maybe an entire Flickr every week or so.

So I went to start making my first dashboard, or query — just to start playing with this data — and this is what happened to me. The instant I saw that thing, I was turned from an energetic, curious person into a zombie who just wanted to browse the internet like it's a drug. I don't know if it has that effect on you, but I'm weak. So I had this question, born of total frustration — I don't know if it's big enough to read. I had seen Mike's Crossfilter demo and how amazingly fast and responsive it was for selecting and filtering data, and I thought, that is what I need. I don't want any of this waiting; I want fast and responsive. But I didn't know how. So I searched for "crossfilter plus hive" and got total crickets from my friendly database community — wisecracking crickets, which, to be fair, are the best kind.

So I thought, well, maybe I can sample. I could take these 350 million data points and turn them into 350,000, and that's better. But sampling is hard to do correctly — there was a statistician telling me all kinds of things about whether my numbers meant anything, something about p-values — and it really was important to get actually correct numbers. To be somewhat serious about it: there are all these different dimensions, and all kinds of possible skews in your data, and if you're looking at one tiny corner of the data, it could be under-sampled or over-sampled, and you can make real errors. I understood that even without a statistics PhD.

So then I thought: okay, aggregation. Honestly, this took me a really long time to figure out how to do. It turns out people in business intelligence — BI people — have been saying things like OLAP and talking about cubes since, I think, the 1970s. I had been ignoring them — not all the way since the 1970s, but only because I wasn't old enough — given the profound unsexiness of BI and my own personal distaste for acronyms. And like all those who forget or ignore history, I was doomed to repeat it a little bit. So I did, and I'm going to tell you about it, so you get to repeat history too. On the slide you can see data.csv, this little file snippet, which could theoretically have billions of rows in it — obviously it doesn't, because they wouldn't fit. It gives you a sense of the grain of the data I was starting to think about visualizing.
As I was looking at data.csv and wondering why I couldn't somehow just put it in a browser, the insight I rediscovered was this: despite the fact that there are potentially billions of records, there's actually a really limited number of values that can appear in each column. It's the cardinality of each column — the number of distinct values — that really matters. There's some number of genders; I don't want to prescribe how many there could be, but let's say it's less than ten. Some number of ages under 200, and so forth for states, countries, and times. You can collapse those columns so the numbers stay manageable, and then you're basically just multiplying the cardinalities together to get some number that is hopefully a lot less than billions. That's the trick. For example, you could take 100 rows of data and condense them into one row: 110-year-old males from California on Tuesday at 10 a.m., count 100. So: product of cardinalities. Maybe we're onto the right track — thirty or forty years too late.

At the top of the slide you can see how that aggregation is expressed in SQL — so I did learn a little SQL, you can see. You can imagine doing some bucketing on the ages, because maybe you care about age ranges instead of exactly how old somebody is in seconds or minutes or days or years. And maybe you only care about the top 100 countries, not all 200-plus countries on Earth. That kind of works.

So you have this query you'd like to run to make things small, and you have a bunch of data, but you still have to figure out how to run the query in a time that doesn't make you want to start browsing the internet again. Here's where it's really handy to have an in-memory database. My friend said big data is small RAM, and I thought: aha, big RAM — this will solve everything. And it's sort of true. There's a database developed at Facebook called Scuba. It's public information that it exists, and there's a white paper about it somewhere you can Google for. It's this crazy in-memory database: you basically get the RAM from hundreds of machines pooled into one glorious pool that gives you fast and efficient querying. There's a screenshot from the white paper on the screen, and you can see it has a leaf–aggregator architecture, which is pretty exciting if you're into databases — which I think maybe ten of you are, so that's good. Queries that would take tens of minutes in Hive, or even in Presto — that's PrestoDB, another open-source Facebook technology — take only seconds in Scuba. Unfortunately Scuba is not currently open source, so let's talk about what you might do instead. If you don't have super tons of data, you can probably just get a machine with lots of RAM and use SQLite; if your data is bigger than that, you might have to do some research.

So we've got all this data, it's smaller now, and we want to push it to a web browser, so you need to make some kind of API for that. I used Tornado because, as I said earlier, I love Python. And Python speaks Thrift, which means it can talk to pretty much any other system at Facebook.
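If you want to try that shrink-it-down step yourself, here's a minimal sketch using an in-memory SQLite database, per the lots-of-RAM suggestion above. The table, columns, and bucket size are hypothetical — just something with the same grain as data.csv:

```python
# A toy version of the cardinality-reducing aggregation described above,
# run against an in-memory SQLite database. Schema and rows are made up.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE uploads "
             "(gender TEXT, age INT, state TEXT, country TEXT, hour INT)")

# Pretend these rows came from something like data.csv (billions of rows
# in real life; a handful here).
rows = [
    ("male", 110, "CA", "US", 10),
    ("male", 110, "CA", "US", 10),
    ("female", 34, "MA", "US", 9),
    ("female", 36, "MA", "US", 9),
]
conn.executemany("INSERT INTO uploads VALUES (?, ?, ?, ?, ?)", rows)

# Bucket ages into 5-year ranges, then group on every dimension we care
# about. The output size is bounded by the product of the columns'
# cardinalities, not by the number of raw records.
query = """
SELECT gender,
       (age / 5) * 5 AS age_bucket,
       state, country, hour,
       COUNT(*) AS n
  FROM uploads
 GROUP BY gender, age_bucket, state, country, hour
"""
for row in conn.execute(query):
    print(row)
```

As I understand it, the real version was this same query shape pointed at Scuba instead of SQLite, but the idea is identical: bucket, group, count.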
Those other Facebook systems can be written in Java or C++ or, you know, some other unnamed languages — PHP, even, maybe. I used gzip on CSV pretty intentionally. The default choice for APIs these days is often JSON, but CSV is a lot more compact. I also looked into MessagePack, which is another way of compressing data; but with gzip, which is supported by browsers if you just set the right header, you don't have to worry about extra dependencies to compress your data and then decompress it on the client side. It's super fast — browsers do a really good job of inflating your data, and you don't have to think about it. And you get a trade-off that I think is the right one, at least for me, between compactness — the transfer time for your data set — and overall speed, which is encoding time plus transfer time plus decoding time plus figuring out which compression libraries to use. The folks at the New York Times launched a project, I think just this week, called Tamper, which is another serialization protocol. I haven't used it yet, partly because there's no Python encoder available, but I'm definitely going to check it out and see if it changes my recommendation.

So finally, after lots of talking, we're at Crossfilter, and we have data, and it's in the browser, and we're really super happy. I use dc.js to wrap Crossfilter, which gives you really easy charting, so I can make simple charts relatively easily. We get fast slicing and dicing over millions or billions of rows, and easy updating of coordinated views without a lot of custom code. I highly recommend dc.js for just playing around; if you're developing a custom visualization, you might want to go a little further. Here's the small snippet of code that takes that insight about aggregation and helps you use it with Crossfilter. It's in the API docs, so you don't need to furiously scribble it down — it's basically saying: we have these mappers and these reducers, and here's how we use them together when we're grouping.

So here's the whole picture of where we've gotten so far. We have data coming in via ptail, which is basically a distributed way of tailing log files. We have this in-memory database, Scuba. We have Tornado supplying gzipped CSV — and JSON, for some stuff that's not too big — to a pretty traditional web front end built on Crossfilter and dc.js. So let's take a look at it. Well — I ran into some problems with our legal department earlier this week, so I'd like to recommend that you check out the Crossfilter demo page and just pretend you're playing with fresh Facebook data instead. Although I can't show you the UI or demo it for you, what I can show you is the impact of the UI that I built. Not as exciting, maybe. During the State of the Union address this year — you can see Barack Obama here, and it says "trending now on Facebook": men are talking about the economy, women are talking about inequality. There was actually an analyst at Facebook using this tool, watching the State of the Union, typing in words Obama was saying as he was saying them, and looking at who was talking about those words as they were being spoken.
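As a quick aside before the story continues, here's roughly what that gzipped-CSV endpoint can look like in Tornado. The handler, route, and columns are hypothetical; the point is that Tornado's compress_response setting (gzip=True in older versions) gzips the response for any client that sends Accept-Encoding: gzip, so the browser inflates it for free:

```python
# A minimal sketch of a Tornado endpoint serving aggregates as gzipped CSV.
# The rows would really come from the in-memory store; they're inlined here.
import csv
import io

import tornado.ioloop
import tornado.web

ROWS = [
    ("male", 110, "CA", "US", 10, 100),
    ("female", 30, "MA", "US", 9, 42),
]

class UploadsCsvHandler(tornado.web.RequestHandler):
    def get(self):
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(["gender", "age_bucket", "state", "country", "hour", "n"])
        writer.writerows(ROWS)
        self.set_header("Content-Type", "text/csv")
        self.write(buf.getvalue())

if __name__ == "__main__":
    # compress_response tells Tornado to gzip responses whenever the
    # client advertises gzip support, so no client-side library is needed.
    app = tornado.web.Application([(r"/uploads.csv", UploadsCsvHandler)],
                                  compress_response=True)
    app.listen(8888)
    tornado.ioloop.IOLoop.current().start()
```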
So, back to the State of the Union: the analyst was using this tool I built, hitting the Tornado API, querying the in-memory database, getting stuff back — and then, to keep the thing really super high-tech, they were on the phone with an analyst at ABC News, going, hey, this is interesting. So the last piece of the communication needs a little work; it's still just humans talking to humans.

So what did I do next? I had made a new stack, at least new for Facebook, for working in a pretty agile way on data products, and it was something I was a lot more comfortable with. I'd like to extend that stack by building reusable, Crossfilter-friendly charting components using React, which Facebook also developed and open-sourced. It wasn't covered in Sam's talk earlier, where he talked a lot about Angular; it has a lot of similarities to Angular, and some differences. With any luck, you should see some of that work open-sourced later this year. If you've done any work with React and D3, please talk to me — I'd love to hear about it, and maybe learn something and avoid making some mistakes.

Okay, so we've talked about taking a big dataset and making it manageable. Let's talk about the fresh data revolution. By fresh data, of course, I mean real-time or timely data — real-timely data. Facebook takes in something over 500 terabytes of data per day. But if you just look at what's coming in during an individual second, that's only around five gigabytes, so already you're down in the gigabytes range — it could fit on a flash drive. And if you only care about certain kinds of data, like text posts or status updates written in a particular language or set of languages, you can be in the megabytes-per-second range before you know it. Fresh data is automatically relevant because it's recent, and it's exciting. I think today we should all commit to adding fresh data to our big-data buzzwords: we've got big data, thick data, and now fresh data.

I have a secret: I really got interested in this because I wanted to make a fun screensaver. That's how this got started, so don't tell anyone. It began as a hackathon project, like so many great things do. Facebook runs a hackathon every six to eight weeks or so; they last 24 hours at least, sometimes a couple of days. Participation is totally optional, but if you participate, you come up with a project or join somebody else's, and at the end everyone presents their work to everyone else — the traditional hackathon model. The twist, I suppose, is that the top hacks get presented to Mark Zuckerberg, and many features on Facebook.com actually originated as hackathon projects: video uploading, chat, and the Like button, all pretty successful features now, started as something some people were doing overnight. As a data and visualization person, I'm a lot less interested in features for Facebook.com than in just finding cool data and visualizing it, which is the whole reason I joined Facebook in the first place. So my first hackathon project was to build a real-time map of check-ins. I gave a talk at Strata a couple of years ago — Irene mentioned it — about visualizing geodata, where I made a non-real-time visualization of check-ins using TileMill and Processing, and I was like, yeah, that's the way to go.
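For the record, here's the back-of-envelope behind those fresh-data numbers. The daily total is the talk's rough figure; the filtered fraction is purely hypothetical:

```python
# Rough arithmetic for the "fresh data" point: the firehose is enormous
# per day, manageable per second, and tiny once you filter it.
PER_DAY_BYTES = 500e12        # ~500 TB/day, the figure quoted above
SECONDS_PER_DAY = 86400

per_second = PER_DAY_BYTES / SECONDS_PER_DAY
print(per_second / 1e9)       # ~5.8 GB/s for everything

interesting = 1e-3            # hypothetical share you actually care about
print(per_second * interesting / 1e6)  # ~5.8 MB/s after filtering
```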
But then I wanted to build that check-in map in real time, and I wanted to give it to everyone on the web, and I had some thinking to do. So I came up with this architecture, which is a bunch of boxes, right? I had heard a lot about Leaflet and thought, oh cool, I should use Leaflet for the map. It works with tile sources, so I used CloudMade tiles, and that was really easy to get going. Then all you need to do is put some data in there — how hard could that be? I used ptail from before, which is basically a source of new incoming data points, to look at check-ins. And instead of putting the incoming stream into a database, I just decided to process the individual records: is this a check-in that I care about? Yes — send it on; if not, ignore it. One of the really important points here is that throwing away data really efficiently is a bit of an art, and it was important to making this actually work properly.

ZeroMQ is the trick that connects the web server to this processor. The ZeroMQ project bills itself as an intelligent transport layer; IPython Notebook uses it too. I read the docs — probably one of the best documentation experiences I've ever had, really readable and friendly — and decided a push-pull architecture was going to work: the event processor in the background pushes stuff out, and the web server just goes, give me more. So I've got data coming in to this web server, and I want it to keep flowing, so I used WebSockets, which I'd also never used before. It turns out those are really not too hard at all, and it wasn't even time for the sun to rise yet, so I was like, okay, great.

[At this point I fumbled with the demo machine for a minute — sorry, guys. I'll stop waving the mouse around so wildly, and hopefully everything will go better.]

So: ptail, ZeroMQ, WebSockets — not so bad. The WebSockets send the data to my front end, where I put the points in a queue as they arrive and set a timer to repaint the screen at roughly 30 frames per second, which seems smooth-ish. I'm using D3 to make the dots look like they're arriving and shrinking — kind of getting dropped onto the map — and that's basically it. Except: what color should the dots be? These are the important questions in life. I realized I didn't have any particular reason to make the dots a given color, but I did have text along with these lat-longs. So let's just calculate sentiment for the text, and then I can say these are happy check-ins and these are sad check-ins. So I did that. Here's the first screen capture I have of the project working, and at that point I was like, okay, super, time to sleep. I was going to try to show an interactive version of this — maybe I'll wait a little while so I don't screw things up. Or maybe I'll be bold. No, I'm not going to be bold. Since this was a database hack, I didn't really have any hope of getting Mark's attention, because it wasn't a Facebook-facing thing.
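Before moving on, here's a condensed sketch of that plumbing: the filter, the ZeroMQ push-pull pair, and the WebSocket fan-out. The port, the field names, and the is-this-a-check-in test are all made up, and the real producer read from ptail — but the shape is the same:

```python
# Producer: read events, keep only check-ins, PUSH them over ZeroMQ.
# Throwing data away early and cheaply is the whole game here.
import zmq

def producer(event_stream):
    ctx = zmq.Context.instance()
    push = ctx.socket(zmq.PUSH)
    push.bind("tcp://*:5556")
    for event in event_stream:                 # stand-in for the ptail stream
        if event.get("type") == "checkin":     # everything else gets dropped
            push.send_json({"lat": event["lat"],
                            "lng": event["lng"],
                            "text": event.get("text", "")})

# Web server: PULL events and forward each one to every open WebSocket.
import tornado.ioloop
import tornado.web
import tornado.websocket
from zmq.eventloop import zmqstream

clients = set()

class CheckinSocket(tornado.websocket.WebSocketHandler):
    def open(self):
        clients.add(self)

    def on_close(self):
        clients.discard(self)

def forward(frames):
    msg = frames[0].decode("utf-8")
    for ws in clients:
        ws.write_message(msg)

def main():
    ctx = zmq.Context.instance()
    pull = ctx.socket(zmq.PULL)
    pull.connect("tcp://localhost:5556")
    zmqstream.ZMQStream(pull).on_recv(forward)  # pyzmq's Tornado bridge

    app = tornado.web.Application([(r"/checkins", CheckinSocket)])
    app.listen(8888)
    tornado.ioloop.IOLoop.current().start()
```

On the browser side, the onmessage handler just pushes each point onto a queue, and a timer drains it at roughly 30 frames per second, as described above.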
So I just set that map as my screensaver using WebSaver — a project I totally recommend checking out if you're interested in making screensavers for yourself; it basically lets you set a URL as your screensaver — and I went on with my life. A couple of days later I was like, hey, I think I'm noticing patterns. And then: hey, maybe I should see if those patterns are real. So I used hexbin — I think this entire talk is probably just a love story about Mike Bostock — I made a hexbin map, and so thanks, Mike, it was really easy to do. I think it's more informative, but it's a lot less exciting because it doesn't blink. Some of the other people in my group took to showing the map off at the end of their presentations, going, hey, we do things with real-time data and text algorithms, and it started some really good conversations. I got some pretty nice feedback from people I don't ordinarily hang out with as a result.

So now I have this approach: ptail, plus a bunch of NLP, plus ZeroMQ, plus Tornado, plus WebSockets. How many people have I lost? Awesome. I decided to take the same approach and use it on something else: Game of Thrones. That's exciting, right? A lot of people are talking about TV shows on Facebook. As evidence — well, this was actually an ad in my feed, and I was like, oh no, my friends are going to spoil Game of Thrones. I can't let this happen. There is so much conversation about television happening on Facebook. Take House of Cards from Netflix: within a few days of its release, there were seven million people talking about it. Obviously you can't read all of that to figure out what they're saying — but wouldn't it be cool if you could put it on a screen? That's what I thought: maybe I can see what people are saying about TV while the TV is happening. Which I did. I had a projector set up in my house just showing some visualization of TV — which I'll get to — while I was watching TV. It turns out that's super distracting. I should have predicted that.

There are companies, like Trendrr, that basically make dashboards for big television studios: here's what our audience looks like, here are our numbers. I wanted to do that kind of thing, but in real time, and without having to type stuff while I was watching TV — I just wanted to find things out whenever I looked up at the screen. So we have the same kind of architecture as before, with more little black NLP boxes: one box to tag the text content with topics, the same sentiment box as before, and then some relevance work to figure out which terms actually matter, because there's a lot of conversation happening and not all of the words are meaningful. There are simple things like stop-wording, which gets rid of "and" and "the."
There are more advanced algorithms, like TF-IDF — which is basically a multiplication plus a division, maybe with a log thrown in there — that help you get relevant terms, and then you can use pointwise mutual information, which helps you figure out which of your terms are actually the most important, and you're pretty much good to go. (I'll show a toy sketch of this scoring in a moment.) So you've got this giant counting API with various different things going on on the NLP side. That's about the only change — except for the UI, of course. We don't have a map anymore; we have a new UI.

So here's roughly what I had going on: a screen capture of the tool running on a TV show, American Idol, which maybe people have heard of. Every 10 seconds I wanted to see what was going on and what the top terms were, so you can see some of the terms and hashtags people were using. You can maybe barely make out "Majesty is the cat's meow" — that was one of the contestants; I thought that was a pretty awesome hashtag. I ran this same project later on the Discovery Channel show Naked and Afraid. If you're not familiar with it, two contestants try to survive for 21 days after being dropped into a strange place with nothing but the clothes on their backs — and the first thing they do when they get there is take their clothes off. And of course, of course, the contestants are strangers of opposite genders. Titillating.

So here's the impact of this project. The clip is very, very quiet, which is not going to help: "Facebook has been [buzzing] all night long. We also have the hashtag #lotsofbutts and the hashtag #clothesandafraid trending and creating lots of buzz on Facebook." So I'm responsible for people talking about lots of butts on TV.

Next, I think I should actually use Cubism, because it deals with some of these problems: there would be spikes in traffic, and suddenly my bars were way bigger than my web browser, or I had to rescale them — and horizon plots deal with that very nicely. I'd have to do some back-end work to support it, since I'm not using Cube or Graphite, but I'm super interested in checking it out. It lets me continue taking advantage of all the great work Mike has done.

So now I'm at the silliest part of this talk, where I'm just taking big data and putting it on a big screen — but it's going to be really fun, because the screen is really big. I'm going to talk about the process of building a 20-foot-long display and getting live data onto it. It's part of what Facebook calls an executive briefing center, which is basically — right? Executives need their big data to be big. (I didn't realize I was going to get that response. I can wait.) It's basically for premier clients and brands to understand what kinds of insights we have into their fans. The work was a collaboration with the design studio Pitch Interactive, and in the bottom right over there you can see a machine called a Vista Spyder X20, which I'd never heard of until we started this project, but it's basically this magic box that drives all of the monitors. It's always good when a tech talk involves the word "magic," right? So here are the rough specs: 41 megapixels — that's more than a Retina display — 36 touch points, and 20 individual displays.
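Here's the toy sketch of that relevance scoring I promised — stop-wording, TF-IDF, and pointwise mutual information, on entirely made-up data. The stop list and the count bookkeeping are hypothetical simplifications of whatever the real pipeline did:

```python
# Toy relevance scoring for streaming TV chatter: drop stop words, score
# terms with TF-IDF, and rank term/show association with PMI.
import math

STOP_WORDS = {"and", "the", "a", "of", "to"}   # hypothetical tiny stop list

def tokens(text):
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tf_idf(term, doc, corpus):
    """doc is one token list (say, a ten-second window); corpus is many."""
    tf = doc.count(term) / len(doc)
    df = sum(term in d for d in corpus)
    return tf * math.log(len(corpus) / (1 + df))   # the "log thrown in"

def pmi(pair_count, term_count, show_count, total):
    """How much more often a term co-occurs with a show than chance."""
    if pair_count == 0:
        return float("-inf")
    p_xy = pair_count / total
    p_x = term_count / total
    p_y = show_count / total
    return math.log(p_xy / (p_x * p_y))

# Two hypothetical posts from one window:
corpus = [tokens("majesty is the cat's meow"),
          tokens("the economy and the debate")]
terms = {t for d in corpus for t in d}
print(sorted(terms, key=lambda t: -max(tf_idf(t, d, corpus) for d in corpus)))
print(pmi(pair_count=30, term_count=50, show_count=400, total=1000))  # ~0.41
```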
Back to the wall. One bullet point you don't see on that spec slide is "Jason's desk," because I really wanted to plant myself in front of this thing — I even volunteered to move my desk, but they were like, look, dude, this is for the world, not just for you. So why do this? As some of you have probably figured out, this was a marketing-driven project, and here's what marketing had to say about it — I'll just let you read it while I take a drink of water. For me, this was a way of making data more real: taking data visualization and turning it into data visceralization, presenting clients with data that really touches them and speaks to them, so that in turn they can touch and speak to it if they want.

How did we do it? It actually took two to three months to get this working. It took a pretty good hardware guy, Pitch Interactive — a very talented visualization studio — and one person to figure out how to put it all together. And it took all of us saying no, constantly, to the marketing people. Saying no is a really important job skill, I learned. We knew we needed a lot of screens. Here's the initial picture of the install — well, we're going to need some mounts on a wall — and we started just calling the project "the wall," because for a while that's all there was. Wes Grubbs at Pitch Interactive had the bright idea: hey, maybe we can just run one giant Chrome window in presentation mode. I was like, dude, that's never going to work. He was right. It worked. One thing that didn't work well was animation, so he ended up doing almost all of the UI in WebGL — we had some crazy graphics cards in those machines, and that's what made it possible. We did use D3 for a lot of the static stuff.

There were some challenges beyond just working with a large amount of data. It had to be really customer-specific, because that was the whole point: we want to tell you, Coca-Cola — we want to tell you, Walmart — here's what's going on on Facebook with your people. And we didn't know who was coming in on any given day.
One day, really early in the project, I seriously got a phone call going, "The NBA commissioner is here." I'm not into sports, so my response was, "The NBA has a commissioner?" So I basically created a bunch of pipelines in Hive — which is slow — that run every day to gather some of the data, and then used PrestoDB on top. If any of you are working with Hive now, I'm a little bit sorry — it's awesome, but you should totally use Presto wherever you can. It's a lot faster, it's open source, it's great, and people will love you if you use it. Even the Presto query takes a couple of minutes, though. But there are a lot of parts of the experience that don't require a huge amount of data: we visualize trending hashtags and let people look at broader-picture stuff, which totally keeps them entertained while the rest of the data loads — I've watched it happen. And the queries get cached, so if somebody super important is visiting, someone can just run the query beforehand and it's all ready to go for the day.

Thirty-six touch points means testing is really interesting and important, and it also means a lot of API calls can happen — so many fingers, all touching — so having an asynchronous back end was really important. Fortunately Tornado supports that out of the box, which was pretty nice. (There's a tiny sketch of that pattern below.) I want to show you this amazing video of how we tested: this is Wes, just rolling his body over the screen. That is dedication. He's finding bugs. Fortunately this was early on, but it literally took us a couple of weeks to hammer out all the bugs with the infrared touch screen. There was an issue where, if you opened the wrong door, a tiny beam of sunlight would come in and trigger phantom touches on the screen — and of course we didn't know it was the door, because it only happened at a certain time of day. Some pretty serious debugging was involved.

We also had to deal with real-time profanity, which was not something I was ever prepared to deal with. We were looking at trending hashtags from the internet, and I was like, oh, well, we should totally be able to share that — and then I got another really panicked call, which was a little more awkward. I did some Googling, and apparently the offending hashtag was part of a viral campaign to raise awareness about testicular cancer. So I'm now very aware of testicular cancer. Do a Google image search if you're curious; I think the results are mostly safe for work. For life, maybe — I'm not so sure.

So this is basically the overall design. The Node.js stuff and the front end were all done by Pitch Interactive; the back-end Tornado stuff is what I did. I won't say too much more about it, except that when you click on the hashtags we saw before, you can see the overall audience for that hashtag — you can see here it was a mostly female audience, mostly young, and mostly in the United States. And here's the finished, polished experience, with suitably important-looking people looking on while somebody talks about data. If you'd like to check out this experience for yourself next time you're in the Bay Area, I'd be really happy to give you a personal demo.
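Here's the tiny sketch of that async pattern mentioned above. The run_query function and its latency are hypothetical stand-ins for the cached Presto calls; the pattern is just Tornado's native async handlers plus a thread pool for blocking work, so 36 fingers' worth of requests don't stall each other:

```python
# A minimal sketch of an async Tornado back end: slow, blocking queries
# run on a thread pool so the event loop keeps serving other touches.
import time
from concurrent.futures import ThreadPoolExecutor

import tornado.ioloop
import tornado.web

executor = ThreadPoolExecutor(max_workers=8)

def run_query(hashtag):
    # Stand-in for a (hopefully pre-cached) Presto query; pretend it's slow.
    time.sleep(2)
    return {"hashtag": hashtag,
            "audience": {"female": 0.62, "under_25": 0.48, "us": 0.71}}

class HashtagHandler(tornado.web.RequestHandler):
    async def get(self, hashtag):
        loop = tornado.ioloop.IOLoop.current()
        # Await the blocking call without tying up the event loop.
        result = await loop.run_in_executor(executor, run_query, hashtag)
        self.write(result)  # Tornado serializes dicts as JSON

if __name__ == "__main__":
    app = tornado.web.Application([(r"/hashtag/(.*)", HashtagHandler)])
    app.listen(8888)
    tornado.ioloop.IOLoop.current().start()
```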
So what have we learned? Hopefully something. In order to visualize big data, we have two choices: one, make the data small; and two, make our screens really big. What I learned is that being successful in a new place really requires combining the familiar — the things you know really well — with the new. If you can find a way to play to your existing strengths while developing new ones, you'll seldom go wrong. If any of this is at all interesting to you, you can find me afterwards — I'm always looking for great people to join my team. And in closing, I'd like to say a big thank-you to Irene and Adam and to the committee for putting OpenVisConf together. It's truly spectacular, and I'm really happy to be part of it.