So, hello. I wanted to start by saying thank you to Martin for inviting me along to introduce myself to you guys today. Martin's asked me to talk about the practice of data science. He's asked me to talk about this big data thing and about the kind of visualizations that I get involved with. Now, to be honest, I get a bit confused when I talk about data science in the abstract. Data science, and especially big data, is getting burdened by buzzwords, and it's getting difficult to draw connections between all of the different types of work that a data scientist does. So my favorite definition of data science, I'm afraid, is a bit of a cop-out. It's due to Harlan Harris, who's a data scientist in DC, and it's "that work which data scientists do". This lets us move the definitional burden onto the question of who data scientists are.
So, this is the eponymous data scientist Venn diagram by Drew Conway. We are a rather rag-tag bunch of hackers, of engineers, of mathematicians, and computer scientists. We are a rather diverse bunch. We're deployed in all sorts of different areas. You might find us working on telephone call logs, on internet server logs, on transport network usage, on the occasional pile of diplomatic cables. We are all over the place.
We all share, though, one specific starting point, and that's that we start with data. We're typically given data that's been collected as part of some other service, and we begin from there. We ask what kind of questions we can answer from that data set. This is known as abductive reasoning, and it's at right angles to the traditional sciences, where we would only ever collect data to falsify, or not, a hypothesis. Instead, we start with data and we wonder what kind of hypotheses or what kind of stories we can pull out of it. So, it's a broad topic. It's many things to many people.
And so, today, I thought I would give a somewhat idiosyncratic view about what I think is important to the subject and what we should be thinking about in the future. So, like any good idiosyncratic story, I'm going to start by talking about my mum.
When I was young, my mum passed on to me a great love of British Ordnance Survey maps. If you've not met these maps, they're incredibly detailed maps of all of Britain. They show paths and roads and property boundaries. They show pubs and post offices and the contours of mountains. They're an aspect of Britain that I miss loads now that I live in the US. My mum, when I was young, taught me to read these maps. She showed me how to compare what I could see in the landscape around me with what I could see on the map in front of me, to try and figure out where I was and where I was going to. I'd think of these maps now as an incredibly detailed, intricate visualization of a huge structured data set. But at the time, it was my first and only experience of using data to be able to look down at something from above, to be able to use data as a kind of extra sense to see all of this stuff around me.
What makes these maps powerful is this thick layer of data that's overlaid on the topographical view. Public rights of way and property boundaries are clearly marked, empowering those hikers that are walking the paths to cross those paths that maybe landowners have tried to obscure. And not for nothing, it gives the Brits that hold these maps an incredible view into the land that surrounds them. It gives us a sense of space and location that I never knew that I had until I left.
So I moved to New York City a few years ago and fell in with a rather odd bunch, a bunch of people that were obsessed by data. I think that they were intoxicated by the ability that data gives us to see things from above, to see these huge behavioral phenomena that we're interested in. So we were gathering lots of different types of data.
So we were looking at Twitter streams, Wikipedia edits, crime statistics, biological data, and we were starting to get quite good at pulling these little stories out of these data sets. And just as we were getting good at this, WikiLeaks released the Afghan war diary. This was six years' worth of behavioral data from the soldiers stationed in Afghanistan. This is one of the visualizations of a model that we built from this data set. It's the probability of seeing explosive hazard events in all of Afghanistan for 2004. And then this is the same thing again, but in 2009.
What I want to talk about today is the power that came along with this visualization. I gave a really small talk to a really large audience in New York about this visualization and some of the other visualizations my colleagues and I had built from this data set. And after the talk, an American soldier came up to me and started talking to me. He wanted to tell me about his colleagues that were stationed in Afghanistan, about his time over there. He wanted to tell me about the opinions that he had on the politicians that had deployed him. We spoke for quite a while, and at the time I was a bit nervous and a bit confused about why an American soldier would care at all about some visualization some nerd had made in New York. But looking back, I'm starting to think that it's because he had a sense that this visualization, and the visualizations that went along with it, had some kind of power, that we might be able to start affecting how people were thinking about the conflict from this view.
So at the time our focus was on the process and the tools of data science. We thought maybe everyone should be building things like this. But our audience was not all that bothered, really, that we'd used the R programming language or built a set of kernel density estimates. Instead they were interested in Afghanistan.
They were interested to see what we could see from our data-driven vantage point. They wanted to know about the activity on the Pakistan and Iran borders. They wanted to know about the relationship between road building and conflict. They wanted to know about IEDs and UAVs, and they wanted to know about the difference between the conflict before and after Obama had got into power. So my argument for today is that data gives us this ability to see these huge behavioral phenomena from above, and that data science, then, is the practice of wielding this power. In this context I want to make sure that we think more about the framing of the messages that we put out when we're building models and visualizations like this.
So this was the first visualization that we put out on the internet. It did quite well. It got some tens of thousands of views, which is not bad for a kind of amateur visualization. I should make this clear: this visualization has tens of thousands of authors. This is a visualization of situation reports. We take all of that data and, to build a visualization like this, we throw out data from every single data point. This act of zooming out involves throwing out so much stuff so that we can get a view onto this huge system. And so this throwing out of data forms the basis for how we frame our messages. It forms the mechanism by which we think about framing.
So for this visualization, what exactly was our message? In the talks and stuff that we did after putting this visualization out on the web, we backed right off from this kind of dark red stain spreading across Afghanistan and moved instead to the neutral blues and whites that you saw in the previous two visualizations. I think that we were nervous that we didn't really have an opinion, or the right to have an opinion, given the size of the audience that we discovered and the amount of thinking that we'd done ahead of time.
So I've got two more sets of examples, and I want to use them to talk about this idea of framing a little. So my job now is data scientist. That's what it says on my card. I first had this job at Bitly a few years ago, and now I hold that same position at the New York Times.
Bitly, if you've not met it before, is a link shortening service. They take long URLs and make them shorter. It started out life as a small affordance for Twitter, so that you didn't waste characters, and has grown up into this rather large distributed sensor. So the typical use case goes like this: you find a picture of a cat. You want to share that picture of a cat with two sets of people, everyone that you know and everyone that you don't know. You take the URL of the picture of the cat and you give it to Bitly. Bitly then gives you back a short URL. Now you put that URL everywhere that you can. So on Twitter, on Facebook, on YouTube; you put it in instant messages and emails; for reasons that I never quite understood, you put it on LinkedIn. And it becomes a kind of sensor. For the cat sharer, it becomes a sensor for their audience, so they can start to sense something about their audience. And for Bitly, it becomes a sensor for how popular cats are.
So during the Arab Spring, we built this graph. Okay, I didn't mean to do a dramatic pause. So we built this graph. On the top are all of the clicks from Egypt during the beginning of the Arab Spring, and on the bottom are all of the clicks from Tunisia that Bitly sensed. On the horizontal axis, we've got time, so October 2010 on the left, all the way up to May 2011 on the right-hand side. And then along the vertical axis, we've got clicks per hour. So to build this visualization, we again had to zoom out to be able to show a few months' worth of click behavior from two whole countries.
We had to take all of the interesting data that Bitly collects about what kind of languages these people speak, what websites they're looking at, all of this detailed behavior, and just throw it all out. Instead, we just keep the time point of when the clicks happened and plot that. So when we made this, we were getting a little bit more cognizant of the fact that we were trying to get a message out. We were aiming to tell two stories with this graph.
The first aim was to tell a couple of little journalistic stories. So the first was that in Egypt, after they tried to censor the internet for a few days, we saw this big spike in social media traffic as measured by Bitly. And then in Tunisia, we could see this long, slow buildup of social media usage up until they ousted their president, and then it all calmed down again. The second message that we were interested in getting out was the commercial one, that Bitly was capable of seeing the patterns in this data. And because we'd taken such a spare approach to the modeling and the visualization, it worked out quite well. So we made a blog post with this in it and it got lots of clicks, which takes care of the commercial message. And we got to lend our weight to the conversation around social media and this kind of conflict, and ended up working with John Sides at George Washington University when they wrote a really interesting paper using this and other data.
So the last two sets of visualizations that I want to show you, and they do take a bit of time to load, are from the New York Times. Both of these visualizations are made by Nick Hanselman, who is a creative technologist in the R&D lab. And I love these visualizations because I think they really get at this point that I'm trying to make today, which is that the ability to see behavioral phenomena from above is powerful, but we need to properly frame the messages that we're trying to get out when we build these models and visualizations.
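The "keep only the time point" step behind the clicks-per-hour graph can be sketched roughly as below. This is a minimal illustration and not Bitly's actual pipeline: the record layout, the field names (`timestamp`, `country`, `language`, `url`) and the sample values are all assumptions made up for the example.

```python
from collections import Counter
from datetime import datetime, timezone


def clicks_per_hour(click_records):
    """Count clicks in each hour, using only the timestamp of each record."""
    counts = Counter()
    for record in click_records:
        # Every other field (country, language, url) is deliberately ignored.
        when = datetime.fromtimestamp(record["timestamp"], tz=timezone.utc)
        hour = when.replace(minute=0, second=0, microsecond=0)
        counts[hour] += 1
    # Return the hourly buckets in time order, ready to plot.
    return dict(sorted(counts.items()))


# Hypothetical click records; timestamps are Unix seconds.
clicks = [
    {"timestamp": 1296000000, "country": "EG", "language": "ar", "url": "http://example.com/a"},
    {"timestamp": 1296000100, "country": "EG", "language": "en", "url": "http://example.com/b"},
    {"timestamp": 1296003600, "country": "EG", "language": "ar", "url": "http://example.com/a"},
]

hourly = clicks_per_hour(clicks)
for hour, count in hourly.items():
    print(hour.isoformat(), count)
```

Everything that makes an individual click interesting is discarded here, which is exactly the zooming out described for the Egypt and Tunisia graph: what remains is one number per hour.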
So what you can see here is a recording of a live visualization of all of the page views to the New York Times. I was way too scared to try and put a real live visualization in, but this is recorded live. You can see each page view rise off the surface of the globe, and we've color coded the page views. So you can see clicks on the homepage are in blue; clicks on section fronts, which are like the mini homepages for politics or world or sports or whatever, are in red; and all the rest is in dark gray, so views on articles, interactives, and so on.
What you get from this visualization is a live, complete view of all of the readership of the New York Times. You can't help but get curious when you're playing with this visualization. I always want to zoom in and see if anyone's awake and reading in my hometown of Sheffield in the UK, or if anyone's awake and reading on the other side of the planet. You get a sense of just how global the New York Times readership is and how the dynamics of its reading play out. And before we'd made this, no one had seen this at all. No one had been able to see how all of this behavior plays out all at once. I think of this as a hypothesis generating machine, and the important thing here is that the questions that come out of this visualization are now grounded in this complete, kind of low-orbit view of the readership.
So one of the questions that we were asked is about audience segmentation. Traditional audience segmentation is done by trying to boil down our users into a taste profile, which I'm not a massive fan of. I don't really like the idea that the New York Times would think of me as reduced into a sports and politics kind of persona, or into gardening and cooking or something. I don't like that the Times might have that view of me. So instead, what we thought we'd have a go at is segmenting sessions.
So what you can see here is the same data as on the previous slide, except we've arranged all the clicks into sessions. So a session is: you come from outside the site, you read a bunch of pages, and then you leave. We've organized these sessions into 14 columns. Every row in every column is a reading session, and every pixel in every column is a particular page view. The aim for a visualization like this is to try and generate some sense of trust that we've captured how people are using the site. So we try and show something that our audience has seen, knows exists, and so expects to see in the visualization, and then show a bunch of things that they don't know.
So if you can see this horizontal, pardon me, this vertical straight line, these are the one-and-dones. These are people that come from outside the Times, read a single thing, and then leave, probably from social media. This is by far and away the most popular use case on here. You can't see it whizzing by because it's so popular. Another one is the homepage squatters. These people sit on the homepage of the Times all day and only occasionally dip into the site when they find something they'd like to see. So both of these behaviors are well known in the Times. All of the others, though, are not very well characterized, and so I get to name them, which, it turns out, is always a mistake. So these are the homepage bouncers. These people come and read an article, then from there go and see the homepage, and from the homepage they disappear into the site.
What I'm hoping with visualizations like this is that they generate some trust in the modeling and the clustering and all of the data science that we do, so that people trust this data, trust the work that we've done. I'm hoping that when they see all of this behavior on one little visualization, they ask more interesting questions, grounded in this view of our readership from above, and so make more interesting user experiences as a result.
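The session definition given above (arrive from outside the site, read some pages, leave) can be sketched roughly as follows. This is an illustrative guess and not the code used at the Times: the page-view fields (`user`, `time`, `page`, `referrer`), the `SITE_HOST` value and the rule that an external or missing referrer starts a new session are all assumptions.

```python
from collections import defaultdict
from urllib.parse import urlparse

SITE_HOST = "www.nytimes.com"  # assumed host for telling internal from external referrers


def is_external(referrer):
    """A missing referrer or one from another host counts as coming from outside."""
    return referrer is None or urlparse(referrer).netloc != SITE_HOST


def sessionize(page_views):
    """Group each user's page views into sessions, in time order."""
    by_user = defaultdict(list)
    for view in sorted(page_views, key=lambda v: v["time"]):
        by_user[view["user"]].append(view)

    sessions = []
    for views in by_user.values():
        current = []
        for view in views:
            # Arriving from outside the site closes the previous session.
            if current and is_external(view["referrer"]):
                sessions.append(current)
                current = []
            current.append(view)
        if current:
            sessions.append(current)
    return sessions


def is_one_and_done(session):
    """A one-and-done is a session with exactly one page view."""
    return len(session) == 1


# Hypothetical page views.
views = [
    {"user": "a", "time": 1, "page": "/article/1", "referrer": "https://twitter.com/"},
    {"user": "b", "time": 2, "page": "/", "referrer": None},
    {"user": "b", "time": 3, "page": "/article/2", "referrer": "https://www.nytimes.com/"},
    {"user": "a", "time": 50, "page": "/article/3", "referrer": "https://www.facebook.com/"},
]

sessions = sessionize(views)
one_and_dones = [s for s in sessions if is_one_and_done(s)]
print(len(sessions), "sessions,", len(one_and_dones), "one-and-dones")
```

Labels like the homepage squatters or the homepage bouncers would need a further step that looks at the sequence of page types inside each session; that clustering is not attempted here.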
So I think data science is this ability to see things from above that we wouldn't have without data, and I think we've got a long way to go in terms of framing the messages that we try to get out. I look to people like Georgia, who's up next, for inspiration along these lines. So thanks very much.