Thanks, Irene. So I'm going to, sort of fortuitously, bookend this conference a little bit, because I think my talk dovetails nicely with this idea: why are we creating visualizations? We're not just creating visualizations to give the pixels on our screen a workout. We're creating visualizations for some purpose, and the purpose I typically create visualizations for is data analysis. So I'm going to talk a little bit about the idea of exploratory data analysis and the other pieces you need to do it, apart from visualization. And this is a little bit of my journey, because I really started off in visualization; the first R package I created was for visualization. But it very quickly became apparent to me that being able to create a good visualization is the result of a lot of successful prior steps. You've got to be able to get your data into R. You've got to be able to tidy it and arrange it. And unless you can do that, you cannot create a good visualization.

So this is my little mental model of data analysis tools, or data science tools. The first thing you have to do is import your data from whatever crazy format it currently lives in into your data analysis environment of choice. Now, for me that is obviously R, but I think the model applies regardless of what you're using. Then I think it's a really good idea to store your data in one consistent format. I'm not going to talk about this too much, but if you do ever use R, learn about this idea of tidy data. It's an incredibly powerful idea, because you put your data in this one format, you leave it in that format, and you're not constantly jamming the output of one function into the input of another.

Then there are three main tools for actually understanding a dataset. You're going to do some sort of transformation: you might be doing very simple things, like counting, changing the order, or creating new variables that are functions of existing variables. Simple stuff, but often very important. And then there are two big toolkits for generating understanding. The first is visualization, which is obviously the focus of this conference, so I'm not going to talk too much about it, but visualization is fundamentally a human activity: it is making the most of your brain. And this is both the strength and the weakness of visualization. You can see something in a visualization that you did not expect and that no computer program could have told you about. But because a human is in the loop, visualization fundamentally does not scale.

So to me, the complementary tool to visualization is modeling. I think of modeling very broadly: this is data mining, this is machine learning, this is statistical modeling. Basically, whenever you've made a question sufficiently precise that you can answer it with some numerical summary or some algorithm, I think of that as a model. And models are fundamentally computational tools, which means they can scale much, much better. As your data gets bigger and bigger, you can keep up by using more sophisticated computation, or simply just more computation. But every model makes assumptions about the world, and a model by its very nature cannot question those assumptions. So that means, at some fundamental level, a model cannot surprise you.
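To make that import, tidy, transform, visualize, model pipeline concrete before we get to the real demo, here's roughly what one pass through it might look like in R. Everything here — the file name, the variables — is a placeholder of my own, not data from this talk:

```r
library(tidyverse)

# Import: get the data out of whatever format it lives in.
flights <- read_csv("flights.csv")  # hypothetical file

# Transform: counting, reordering, creating new variables from old ones.
delays <- flights %>%
  mutate(gain = dep_delay - arr_delay) %>%
  group_by(carrier) %>%
  summarise(n = n(), mean_gain = mean(gain, na.rm = TRUE))

# Visualize: a human looks for the things no algorithm would flag.
ggplot(delays, aes(n, mean_gain)) +
  geom_point()

# Model: once the question is precise, answer it computationally.
fit <- lm(mean_gain ~ log(n), data = delays)
summary(fit)
```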
So that's why I think modeling and visualization are so complementary. Visualization can surprise you, but it doesn't scale. Models don't surprise you, but they do scale. So any real data analysis, any real data science project, is gonna involve doing these things again and again and again. And then at some point, which I'm not even gonna really touch on today, except that I'm gonna be doing it to you, you have to communicate the results of your analysis. Visualization has a really powerful role to play here as well; it's a great way of getting those insights that are in your head into the heads of other people. But I'm not gonna talk about that today.

So today I wanna focus on the explore part of this: the transforming and the visualizing. I'm not really gonna do any modeling today. But I don't wanna talk about this abstractly, I wanna talk about it concretely. So I'm gonna show you two examples. And I thought about what data would be the most interesting data to show you, and personally, I think the most interesting data is data about me. So I'm gonna show you two little data explorations that I've done recently for fun. First, we're gonna look at my patterns of GitHub commits, and then we're gonna look a little bit at places that I've traveled to. And finally it's all gonna culminate in a diagram where we look at my GitHub commits by the time of day, the time zone I'm currently in, and whether or not I'm traveling.

I'm gonna do this with a live demo. Not only is it a live demo — hopefully I have enough battery — I am also using the development version of a number of packages and the development version of RStudio. So wish me luck. First, I'm gonna load a bunch of R packages. I'll talk a little bit later about some of these packages so you can at least look them up if you're interested. And I've written a couple of helpers that scrape the data from GitHub using GitHub's API. I'm not gonna live that dangerously, so I have actually cached all this data locally; I am not going to download it live for you. To get this data, I've gotta do it in two steps. First, I have to ask GitHub what are all of the repos I've contributed to — I've asked it for what I am reasonably certain is the last 100 repos that I have touched. Then I go and get all of the commits to all of those repos. That takes a little bit longer, but I've cached it locally, so we can just take a look at it.

So this is a JSON file, which is basically a hierarchical structure. You can see I've got a whole bunch of repos. If I drill down into a repo, inside that repo I've got a whole bunch of commits, and then some information about each commit: the SHA, which is its unique identifier; some information about the commit itself, like who committed it — well, this one is not me, this is Scott — and what it was about; and somewhere in here... oh, in here, when that commit occurred. So what I'm gonna do is a little bit of scripting. That would be helpful, but it is the wrong plug — I think we'll manage. This is gonna be like one of those horrifying moments: any time anyone posts a screenshot of their phone on Twitter, there's always like 2% battery left. But I like to live dangerously.
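I didn't walk through those helper functions, but the two-step fetch might look roughly like this. The API routes are GitHub's real ones; the shape of the helpers is my reconstruction, not the actual demo code:

```r
library(purrr)
library(jsonlite)

# Step 1: the repos I've touched most recently (up to 100).
repos <- fromJSON(
  "https://api.github.com/users/hadley/repos?sort=pushed&per_page=100",
  simplifyVector = FALSE
)

# Step 2: the commits for each of those repos.
commits <- map(repos, function(repo) {
  url <- paste0("https://api.github.com/repos/", repo$full_name, "/commits")
  fromJSON(url, simplifyVector = FALSE)
})
names(commits) <- map_chr(repos, "name")

# Cache locally so the talk doesn't depend on the conference wifi.
write_json(commits, "commits.json")
```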
Jenny Bryan has a fantastic name for what I'm gonna do next: rectangling. I'm gonna turn this crazy hierarchical list into a rectangle — a tidy rectangle, which means each of my columns is a variable and each of my rows is an observation. So I'm gonna end up — oops, I just accidentally pressed the wrong keyboard shortcut, so I am running the entire script — I'm gonna end up with this data frame of about 23,000 commits, with the repo, who authored it, the date, the date-time, and so on.

Now, I've done a couple of little tricky things here. This data from GitHub starts out with a date-time: the exact instant that the commit occurred. And I'm gonna partition that into two things: a date, and a time. I'm gonna do this partitioning in a kind of tricky, or perhaps very hacky, way: I'm basically gonna convert all of the years, months, and days to exactly the same date, so I just have times on January 1st, 2016, which basically works. But this idea — I've got this one variable and I wanna partition it into two pieces — is a really, really useful idea.

Having done that: the first thing you should do whenever you've got some data from somewhere is take a look at what's in there. So I just counted the number of commits in each repo. There are some surprises here. For example, the sparklyr repo — I'm like, well, I haven't actually written anything in that. I have worked on dplyr and a few of these other things. So that kinda looks okay. But when we look at the count of authors, you'll see there's 22,000 commits here and only 4,000 of them are from me. With this GitHub API, there's no way to get just the data about me; I've had to pull in all this data about other people I couldn't care less about. So I'm gonna get rid of those. I'm gonna do a little bit of filtering to basically look at the last year's worth of data, then arrange it in chronological order and take a quick look at that again. And again, somehow I've managed to press the wrong keyboard shortcut. Let's just take a look at that. So, 3,000 commits now, and you can see the name of the repo, who authored it — that's always gonna be me — the SHA, and all of my commit messages, which are things like "don't need explicit slashes", I guess. Fortunately, there's no cussing on this page.

The other thing I'm gonna do here, which I think is a really great practice when you're doing a data analysis, is dump this out as a CSV. Because I am using Git here, I can easily take a look at the diff. Wow, that's interesting — you can see there is some massive diff there, which is suspicious: the data has changed a lot since the last time I ran this, which was only a little while ago. If you were doing a real analysis, you'd be like, hey, what the heck is going on there? But I'm just gonna press on and hope everything is okay.

One of the reasons I started looking at this data originally was some self-reflection: what is it that I've actually been working on recently? So the first plot I'm gonna do is just a plot.
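Put together, the rectangling and cleaning steps I just walked through might look something like this. The field paths follow the JSON structure we saw a moment ago, but the exact code is my reconstruction:

```r
library(tidyverse)
library(lubridate)

# Rectangle: one row per commit, one column per variable.
commits_df <- imap_dfr(commits, function(repo_commits, repo) {
  tibble(
    repo     = repo,
    author   = map_chr(repo_commits, c("commit", "author", "name")),
    datetime = map_chr(repo_commits, c("commit", "author", "date"))
  )
})

hadley <- commits_df %>%
  mutate(
    datetime = ymd_hms(datetime),
    date = as.Date(datetime),
    # The hacky partition: force every timestamp onto January 1st, 2016,
    # so only the time-of-day component varies.
    time = update(datetime, year = 2016, month = 1, mday = 1)
  ) %>%
  filter(author == "Hadley Wickham", date > today() - years(1)) %>%
  arrange(datetime)

# Dump to CSV and check into git; unexpected diffs are a warning sign.
write_csv(hadley, "commits.csv")
```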
Even if you've never seen R code before — actually, how many of you do use R? — hopefully you'll be able to make out kind of what's going on here, because the structure of this code is designed to communicate not only to the computer but to other people. So here I'm taking this data set, the hadley data set, and this thing is called a pipe. It basically feeds this data set into the next step, and the easiest way to pronounce it is "then". So: take the hadley data, then pipe it into ggplot. Here I'm saying on the x-axis I want the date, on the y-axis I want the repo, and I'm gonna display that using a geom — that's short for geometric object — specifically a quasirandom geom.

Actually, let's do a scatterplot first of all, just so you can see why a scatterplot is not very good. Run this code and... well, you can't really see anything, which is part of the point. With a scatterplot, because I'm just drawing points, those points end up plotted on top of each other and it's hard to see the relative density. So what I'm gonna do instead is use this quasirandom geom — I'll show you on a slide a little later where to learn more about it — which basically adds a little bit of random noise, spreading the points out so they don't overplot so much.

So that's better, but there are still way too many repos. This is showing every single repo I've contributed to in the last year, and that's a huge amount of data. The screen is not very high resolution, and it keeps blinking on and off, but I can only fix one of those problems — I'm gonna swap ports in the hope that that will fix it, fingers crossed. So what I'm gonna do is lump together a whole bunch of those repos into an "other" category. I'm gonna use this function called fct_lump, which basically takes this categorical variable and lumps it together so I've only got 15 categories, dumping everything else into the "other" category. It's a very, very quick way of simplifying the data. And since the screen is so small, I'll just bump it down to 10.

So now you can see the 10 repos that I've contributed most to, with the rest all lumped into this "other" category. This is a step forward, because we've got rid of all the small-scale stuff so I can see the big picture. But how is the y-axis ordered? Currently it's just ordered alphabetically, which is okay, I guess, but I think it would be more useful to order it so that the things I've been working on most recently are at the top. So I wanna improve this visualization, right? But to improve the visualization, I have to modify the underlying data. What I'm gonna do is reorder the repo by the average date. Then, for various not-that-interesting reasons, that ends up with the most recent stuff on the bottom — and certainly we learned yesterday from Matthew that maybe that makes sense for your culture, but to me that looks weird, so I am then going to reverse it. So you can read this pipeline as a series of imperative statements: take the repo, reorder it according to the average date, then reverse that ordering, oh, and then don't forget to lump it together so you've just got the 10 most common repos.
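That whole pipeline, as best I can reconstruct it (the fct_ functions are from the forcats package, geom_quasirandom from ggbeeswarm):

```r
library(tidyverse)
library(forcats)     # fct_reorder(), fct_rev(), fct_lump()
library(ggbeeswarm)  # geom_quasirandom()

hadley %>%
  mutate(
    repo = repo %>%
      fct_reorder(date, mean) %>%  # order repos by their average commit date
      fct_rev() %>%                # flip so recent work sits at the bottom
      fct_lump(n = 10)             # keep the top 10, lump the rest as "Other"
  ) %>%
  ggplot(aes(date, repo)) +
  geom_quasirandom(groupOnX = FALSE)  # jitter within rows to avoid overplotting
```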
So now you can see it roughly chronologically, going from the oldest repos at the top down to the things I've been working on most recently at the bottom. And one thing you can see here, I think, is that I'm pretty bursty: for a lot of these packages there's a whole bunch of commits and then nothing happens for a long time. You might also notice something's going on here: we've got two packages where it seems like I've been working on them at exactly the same time. Well, it turns out that if you want to artificially inflate your GitHub commits, it's very easy to do: these are actually the same commits. This was a project that got split into two pieces, and those historical commits ended up in both places. We could weed those out by looking at that unique identifier, the SHA, but I couldn't figure out a simple algorithm for deciding which repo each commit should belong to, so I just left them in there.

The other thing I'm sort of interested in is introspecting on when I work. What is my working process like? So what I'm gonna do is create a new variable — oops, no, I'm not — I am just going to plot those two date-time variables: date on the x-axis as the first argument, time of day on the y-axis, and again display that with the quasirandom geom. And this is okay, I think, but I'm gonna tweak it a little bit. There are so many days on the x-axis that it's hard to see the fine-scale detail, so I'm just gonna round each date down to a week. So now we're seeing what the times I work on code look like across the weeks. And — I've made sure this is in the right time zone — you can see that normally I don't do much work before six a.m. or much work after six p.m. Although there are a few exceptions, right? And this certainly was not me getting up at three in the morning to program. My speculation at this point, without looking at any more detail, is that this is when I was in some other time zone. It wasn't that my working times had changed; it's that I'd moved to a place where the times are different. We'll come back to that a little bit later.

Then we could also ask: what happens if we look at this by day of week, rather than by date? As you can see, I'm not really a hobbyist programmer anymore, I'm a professional programmer: I work when I get paid to work. So most of the time I'm working on the regular days of the week. Now, to create this plot I've had to do some more data manipulation: I've had to create a new variable from that date-time that gives me the day of the week. And for reasons unknown to me, R has decided that Sunday is the first day of the week, and we're gonna go from the bottom up. So again, to create a more natural visualization, I have to do a little data manipulation here — which I did have in here previously, but have accidentally deleted. What I'm gonna do is take this weekday and use this function called fct_shift, which shifts the bottom category up to the top, and then reverse the order, like the sketch below.
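A sketch of both tweaks — the rounding to weeks, and the weekday shifting and reversing (floor_date and wday are from lubridate; the rest is my reconstruction, as before):

```r
library(tidyverse)
library(lubridate)   # floor_date(), wday()
library(ggbeeswarm)

# Round each date down to its week, so the fine-scale detail is visible.
hadley %>%
  mutate(week = floor_date(date, "week")) %>%
  ggplot(aes(week, time)) +
  geom_quasirandom()

# Day of week instead, shifted and reversed so the two weekend days sit
# together and Monday ends up at the top of the plot.
hadley %>%
  mutate(
    weekday = wday(datetime, label = TRUE) %>%
      fct_shift(1) %>%  # move Sunday from the first level to the last
      fct_rev()         # the first level draws at the bottom, so flip it
  ) %>%
  ggplot(aes(time, weekday)) +
  geom_quasirandom(groupOnX = FALSE)
```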
That gives a more natural — to me, more natural — order, where the two days of the weekend are next to each other, Monday is at the top of the plot, and time proceeds downwards. So to create this visualization — and this is a really simple visualization, right, it's a scatterplot — most of the challenges are not in the visualization. They're in getting the right data, aggregated in the right way: taking this date-time thing and breaking it down into variables that are more informative, like the day of the week and the time of day.

Now, what I wanted to do next: when I look at this, we've identified that most of the time I work between six a.m. and six p.m., and if you knew me as well as I do, you would imagine that maybe I'm traveling during those exceptions. So the thing I wanted to look at next was my travel data. And to be honest, the reason I originally looked at this travel data was laziness, because I've recently decided to become a U.S. citizen, and one of the things you have to do when you become a U.S. citizen is list all the times you've left the country for more than three days in the last five years. And I was like, oh my God, that is gonna be awful. But fortunately I use Tripit to track all of my trips, and I knew Tripit had an API. So I was like, well, why don't I just use the API to get that data? And that led to this part of the process, which I evocatively call "endless screaming", which was basically trying to figure out how to correctly auth against Tripit's API, which uses some crazy non-standard auth thing. But then I carefully read the webpage and discovered that if I emailed them and asked nicely, they could turn on basic auth, so I could do that. And I'm gonna enter my password in a way that you can't see it, so you don't hack my Tripit.

Then, again, I just have some code for this. Like so many modern web APIs, it uses JSON, and it's so easy to slurp JSON into R and turn it into a nested tree; I just write a couple of helper functions around that, which aren't very important. And then I bring down all of that JSON, and we can take a look at it. So I've got this JSON file with 177 trips. And if I open one of these trips, you can get some basic idea: when was this? This is from the 4th of April to the 6th of April, and the primary location was Detroit. And if I drill down on that, I can even find the latitude and longitude of the trip. So I'm just gonna go through and rectangle that, flattening the list so I have columns like the identifier, the start date, the end date, the latitude, the longitude, the city, and the country. I've just got to run this line, and then we can look at it. I didn't rerun this recently enough for you to see this trip, but you can see my last trip was to Detroit, and that I travel a lot, obviously. Again, I wanna track what's going on with this data, so I save it as a CSV, and since I'm using Git I can check that into Git. Then if my input data changes, I'm gonna see it very easily in the diffs. This is just a great way of making sure data is not changing when I do not expect it to change. Okay, then I thought, well, let's take a look at that data.
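The fetching-and-rectangling step, sketched. The URL is roughly Tripit's list-trips endpoint, but the field names below are guesses from the structure on screen, not the actual helpers:

```r
library(tidyverse)
library(httr)

# Fetch the trips as JSON, using the basic auth Tripit kindly enabled.
resp <- GET(
  "https://api.tripit.com/v1/list/trip/past/true/format/json",
  authenticate(tripit_user, tripit_password)
)
trips_json <- content(resp, as = "parsed")

# Rectangle: one row per trip. Field names here are assumptions.
trips <- map_dfr(trips_json$Trip, function(trip) {
  tibble(
    id      = trip$id,
    start   = as.Date(trip$start_date),
    end     = as.Date(trip$end_date),
    lat     = as.numeric(trip$PrimaryLocationAddress$latitude),
    lng     = as.numeric(trip$PrimaryLocationAddress$longitude),
    city    = trip$primary_location,
    country = trip$PrimaryLocationAddress$country %||% NA_character_
  )
})

write_csv(trips, "trips.csv")  # into git; diffs catch unexpected changes
```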
What I thought would be a great way to visualize it was to put each country I've visited on the y-axis and then draw little lines on the x-axis showing how long I was in that country. And like most initial visualization ideas, this turned out to be basically useless, because most of the time I'm somewhere in the US, and beyond that I mostly just visit one other country at a time; the only other country I visit really frequently is New Zealand. The other thing that perturbed me when I looked at this plot: what country is NA? I don't remember — is that North Africa? But it turns out this is a missing value; NA is how R records missing values. And I'm pretty sure this is just some quirk of Tripit. Actually, I should check that before I go and tell you a compelling story about the data. So I'm just gonna say: give me all the trips where the country is missing. Yeah — you can see here are four trips where, for whatever reason, Tripit could not tell me where the trip went. Again, if you were doing a real data analysis, you'd dive into this and figure out why, but I'm just gonna ignore it and continue on.

Okay, so then I thought: maybe it's too hard to show countries on the y-axis; let's use the y-axis for something more useful. Why don't I put each year on its own line and then show the days within the year? And I'm gonna use the same trick as before to align the days: I take the day of the year — the 1st of January, the 5th of February, whatever — and just set the year to 2010, so they all have the same year. This time I'm drawing segments. A segment has four parameters: a starting x-position, which is the day my trip started; an ending x-position, the day my trip ended; and since I'm gonna make these flat lines, both the starting and the ending y-positions will be the year. When I do this, I get a visualization that's a little better, but there are still some problems. You'll notice there's a couple of years where it looks like I spent the whole year traveling, which is a little odd. It turns out the problem is that this approach doesn't handle trips that cross a year boundary: if I left on December 1st and came back on February 1st, the start and end get switched around and the line gets drawn in the wrong direction. Again, if I was doing a real analysis, or something with higher stakes, I'd go back and figure out how to do that correctly. But for now, I'm just gonna throw those trips away. And — I mean, you're laughing a little bit, and it's not a great thing in general — but being able to say "this might matter, but I'm not gonna worry about it now" is a great technique: just ignore it and move on.

We'll skip a few steps, since I'm going a little slower than I expected: I'm just gonna make those segments a little fatter, and then I'm gonna color them by country. Again, this is not very good. First of all, I cannot fit 20 countries in my legend, and second, you can't perceive the differences between 20 country colors anyway.
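A sketch of that one-line-per-year plot, with the cross-year trips simply filtered out as described (the dummy year and the drop rule are my reconstruction):

```r
library(tidyverse)
library(lubridate)

# Put every trip on a dummy year so days line up, one row of segments per
# year. Trips that wrap around a year boundary are dropped, as in the talk.
trips %>%
  filter(!is.na(country), yday(start) <= yday(end)) %>%
  mutate(
    start_day = update(start, year = 2010),
    end_day   = update(end, year = 2010),
    year      = year(start)
  ) %>%
  ggplot() +
  geom_segment(
    aes(x = start_day, xend = end_day, y = year, yend = year, colour = country),
    size = 3  # the fatter segments
  )
```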
But it's a starting place. And that is a particularly terrible visualization, because the legend is taking up most of the space. But again, so many of these challenges are not about the visualization, but about the data. Sometimes the visualization makes me think, well, I need to change something in the data, and sometimes the data suggests things I want to visualize.

I only have a couple of minutes left, so I'm gonna skip to the conclusion, and I will not bore you with the even more endless screaming this entailed. But I found an API that takes a latitude and longitude and tells you the time zone at that location. And then figuring out how to compute the local time, given a date-time and a location, is a little tricky. So I wanted to show one other visualization quickly here, because the other thing visualizations are great for — and I know Mike is a big believer in this — is checking that you've actually programmed something correctly. The fact that each of these colors, which represent time zones, falls on a straight diagonal line indicates that I have coded this correctly, which is awesome.

Then I can skip to this plot, where the color is now a little redundant, so I'm gonna get rid of it. And this is now more of an explanatory graphic, more of a communication graphic, so I've added some labels and so on. And this is basically my life. When I am not traveling, I am an incredibly consistent Git committer: I basically commit between 6 a.m. and 6 p.m. every day. You can see there are two big gaps, on Tuesday and Thursday: I take very long lunches on those days — that's when I do yoga. And you can see I hardly ever code on the weekend. But when I'm traveling... well, I still tend not to code before 5 a.m., but I do code quite late at night. There's much, much more variation, because all of my daily rituals are gone.

Two cool packages I wanted to point out if you use R. One, which I didn't get a chance to show you, is called ggrepel; it implements some of the force-based layout stuff from D3, so if you want to label a plot, you can make sure your labels are not on top of each other. And the ggbeeswarm package provides these beeswarm and quasirandom geoms, a great way of doing dot plots where the dots don't overlap.

Just to finish up: I've given you a very, very quick overview, and many of you had never seen R code before — hopefully I haven't persuaded you to never look at R code again. If you would like to learn more about anything I talked about: most of the packages I showed today are in what I call the tidyverse. It's got a website, and I have a book about it too, with Garrett Grolemund, called R for Data Science, from O'Reilly. If you want to buy it from O'Reilly — or indeed any book from O'Reilly — you can use this discount code, ORTH-D, which will give you 50% off the electronic book and 40% off the physical book. Thank you.
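One footnote on that local-time computation, since it's the tricky bit: here's a sketch using the lutz package as a stand-in for the lat/long-to-time-zone API (the talk doesn't name the one it used). with_tz() from lubridate keeps the instant fixed and only changes the zone it's displayed in:

```r
library(lutz)       # tz_lookup_coords(): time zone name from coordinates
library(lubridate)

# A commit made at 21:30 UTC while I was in Detroit (coordinates from the
# trip data) is really a 5:30 p.m. local-time commit.
commit_utc <- ymd_hms("2016-04-05 21:30:00", tz = "UTC")
tz <- tz_lookup_coords(42.33, -83.05, method = "fast")  # "America/Detroit"
with_tz(commit_utc, tz)
#> [1] "2016-04-05 17:30:00 EDT"
```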