OK, so I'm interested in data analysis, which, to me, is the process where raw data comes in one end and understanding, insight, and knowledge come out the other. And to me, there are really three main sets of tools for doing data analysis. Before you can do data analysis, first of all, you have to get your data, and you need to get it into what I call a tidy form: a form that works well with the software, with the tools that you're using. Now, I have a little arrow here, because in any real data analysis that arrow often comes all the way back around. Often the most challenging part of a data analysis is just getting the data into a form that's easy to analyze. Once you've done that, there are three main sets of tools for doing data analysis: data transformation, where you're computing new variables as functions of existing variables, doing basic summaries, filtering, all that kind of stuff; and then two main engines for gaining insight, visualization and modeling. Visualizations are great because a visualization can surprise you. A visualization is also really helpful for refining your questions of the data, and often the first challenge of a data analysis is taking vague, nebulous questions and making them sufficiently precise that you can answer them in a quantitative way. The problem with visualizations, however, is that they're fundamentally a human activity: human eyes have to look at every single visualization, and that means visualizations fundamentally don't scale up. So to me, the complementary tool to visualization is modeling. Think of this as statistical models, machine learning, data mining, whatever. Whenever you can make a question sufficiently precise that you can answer it with a handful of summary statistics, or a simple algorithm, or even a complicated algorithm, I think of that as a model. Models are great because they don't need a human; they use a computer, so they inherently scale much, much better than visualizations, because it is much easier and cheaper to buy computers than it is to buy human brains. The problem with models, however, is that every model makes fundamental assumptions, and a model by its very nature cannot question its own assumptions. That means that in some fundamental sense a model cannot surprise you: a model can only tell you things that, in some sense, you already expect. So in any real analysis, you're going to use all of these tools multiple times. You might start with a visualization that suggests a model; you fit that model, you transform the data to look at the residuals, and then you visualize again. Now that you've removed the strongest trend, you can see the subtler trends that remain. In any real data analysis you'll iterate through these tools many, many times. And a lot of my work developing R packages has been building tools to make these things easier. The first round of my work was reshape2 for tidying data, plyr for transforming data, and ggplot2 for visualizing data. What I want to talk about today is the next generation of those tools: tidyr for tidying data, dplyr for transforming data, and ggvis for visualizing data. And there are really three things that have had a profound impact on the way that I develop R packages. The first one is Rcpp. Rcpp is a tool that makes it very, very easy to write C++ code and call it from R. And this is really important because R is a language that is fundamentally designed to make humans effective, not to make computers efficient. R is very expressive: you can do whatever you can imagine in R, but it is generally not the fastest of programming languages. C++ is fairly complementary to that. C++ is designed for very, very high performance, and no abstraction added to C++ has ever come at the expense of performance. It tends to take much, much more time to express an idea in C++ than it would in R, but the end result is code that runs much, much more quickly. So Rcpp makes it possible for my packages to use C++ and expand to much larger data sets.
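To give a flavor of that, here's a minimal sketch of the kind of thing Rcpp makes easy; the toy function is mine, not from the talk:

```r
# Define a C++ function inline and call it from R. The loop-based sum
# is a classic toy example; the real use cases are hot loops that are
# awkward or slow to express in vectorized R.
library(Rcpp)

cppFunction('
  double sumC(NumericVector x) {
    double total = 0;
    for (int i = 0; i < x.size(); ++i) {
      total += x[i];
    }
    return total;
  }
')

sumC(c(1, 2, 3))  # 6, computed in compiled C++
```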
As Kartik mentioned, another thing I've been working on is the Advanced R book. That book has been great because, before I wrote it, I thought I knew what advanced R was, but I didn't. The process of writing it forced me to confront many of my misconceptions about the R language, and to learn the areas of the language I didn't really understand. One of those areas is computing on the language, or metaprogramming, where you use code to inspect and modify other code. It turns out that's really important in R for creating fluent interfaces: interfaces for interactive data analysis where the tools just recede into the background and let you focus on the challenges of the specific data that you're working on. The third thing that's had a profound impact, and the thing I want to focus on today, is this strange-looking thing called the pipe. It uses R's standard way of denoting an infix function, which is to surround it with percent signs, and it's implemented in the magrittr package. The idea of a pipe is very, very common in other places. If you've ever used the shell, pipes are really important for taking advantage of the Linux shell; F# has a very similar idea; Clojure has macros that do similar things; Haskell has something similar in the form of monads and their arrow operator; and Smalltalk, JavaScript, and C++ have the idea of method chaining. It's a very, very old idea, but now you can access it very easily in R, and hopefully I'm going to persuade you that it allows you to make your code much more readable. And from my perspective, as well as making things easier for the user of the code, it also makes things easier for the author of the code. So today I'm going to talk about pipelines. Let's start by defining what this operator does. The rules are very simple: it takes the thing on the left-hand side and inserts it as the first argument of the call on the right-hand side. The way you pronounce it is "then", so x, then f(y), is just the same as saying f(x, y). Or, if you don't want the left-hand side to go into the first argument, you can use the dot as a placeholder: the thing on the left-hand side is inserted wherever the dot appears on the right-hand side. And we can join these things together: x, then f(y), then g(z), is equivalent to the nested function composition g(f(x, y), z). This is really important because it allows us to retain function composition, which is a really solid, well-understood means of combining things, while turning it into a sequence of operations, a sequence of steps that looks imperative and is easier to understand.
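Pinned down as a sketch, using base R functions so that everything runs as written:

```r
library(magrittr)

# x %>% f(y) is the same as f(x, y):
c(3, 1, 2) %>% sort()            # sort(c(3, 1, 2))
c(3, 1, 2) %>% head(2)           # head(c(3, 1, 2), 2)

# The dot placeholder puts the left-hand side wherever you want it:
4 %>% seq(1, .)                  # seq(1, 4)

# And pipes chain, standing in for nested composition:
c(3, 1, 2) %>% sort() %>% rev()  # rev(sort(c(3, 1, 2)))
```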
Now I want to motivate this with an example which, I am told, should have cultural relevance to many of you, although it has none to me. We're going to start with Foo Foo, who is a little bunny. If you write what Foo Foo is doing as a sequence of function calls, you write it like this: the thing that happens first is the innermost function call. So if you want to read the sequence of events, hop through, scoop up, bop on, you need to read from the inside out, from right to left. And that's challenging. The other thing that's challenging is: what are the two arguments to bop on? Well, one is this whole nested sequence of operations, and the other is this object over here. Now, if we rewrite that with a pipe, it is much, much easier to see what's happening. We've turned a sequence of function compositions into something that reads imperatively: we hop through the forest, we scoop up the field mice, and we bop them on the head. This has had a surprisingly profound impact on the way that I, and many other people, write data analysis code, because it allows you to write code that's much easier to read after the fact. And I think writing code that is easy to read is really, really important, because every single project that you do involves collaboration between at least two people, and those two people are always present you and future you. Future you will be very, very grateful to past you if you've written code that future you can understand. I find that the older I get, the more future me and present me remember totally different things and understand totally different things, so it's really important to write code that I can actually understand in the future. In the past I've had these moments where I write some code and think, wow, I am so smart and awesome, this is an incredibly awesome bit of code. Then I come back to it in three months' time, read it, have absolutely no idea what I was thinking, and conclude what a moron past me must have been. The thing that's neat about this pipe operator is that you can apply it whenever you would normally do function composition. That means you can use it with functions, and with packages, that were never written with this idea in mind. But it does work best when you can think about a pipeline of operations, which means that the first argument to a function is typically the same type of thing as its output: you're taking an object and transforming it repeatedly. The goal of a tidyr pipeline is to take a messy dataset and transform it, through a sequence of steps, into a tidy dataset. Once you've got that tidy dataset, you can use dplyr to manipulate it and ggvis to visualize it. A couple of other packages I've been working on recently follow the same pattern. rvest is a package that makes it very easy to express operations on HTML: if you're doing web scraping, for example, rvest makes it easy to express a sequence of transformations, a sequence of selections, that gets you to exactly the piece of a web page you're interested in. And similarly, lowliner is a kind of pipeline for lists. But today I want to talk about three packages: tidyr, dplyr, and ggvis.
The goal of tidyr is to make that first step easier: the initial step where you take the data in whatever crazy format your collaborator saved it in. And again, that collaborator might have been past you, and you have no idea what they were thinking at the time. So the goal of tidyr is to take your messy data and put it into a tidy format, and the structure of a tidy format is actually really simple: you just need to make sure that your variables are in the columns and your observations are in the rows. How many of you have heard of Codd's third normal form? At some point I took a database class, and I think at the time I understood what it meant, but when I look at the definition now I have absolutely no idea. But tidy data is basically Codd's third normal form. That idea is really important for designing databases, because it helps ensure that you represent each fact in only one place. You never want to represent one fact in multiple places, because as soon as you do, those places can record different values, and what do you do when you have inconsistencies? So the goal of tidy data is to frame this really powerful and important idea from databases in a way that makes sense to people working with data. And the idea is very, very simple. I'm not going to formally define what an observation is or what a variable is. It turns out that's pretty hard to do in practice, but you already know what those things are, and your internal definitions are good. So I'm going to illustrate this with a little example. This data is some data on tuberculosis cases collected by the World Health Organization. Instead of telling you about this data, I am going to give you a little challenge. Turn to your neighbor, brainstorm for a minute, and see if you can figure out what the variables in this dataset are. I'll give you a little hint: f is female, u is unknown, and 1524 is shorthand for 15-24. You've got one minute, starting now. What are the variables in this dataset? OK, time's up. Does anyone want to suggest one variable that's present in this dataset? Country code, that's good spotting. You might have recognized that this is the ISO two-letter country code, and AD, I believe, stands for Andorra. What other variables do we have here? There's another easy one: year, which is also already in a column. Is m04 a variable? No, right? m04 represents the values of two variables: one is the sex, male or female, and the other is the age. So we have the year, we have the country, we have the sex, and we have the age. Are there any other variables in this dataset? What do you think that number one inside the cells represents? Probably the number of cases, right? So we have another variable, the number of cases. So in this dataset we have five variables: two of them are represented in columns, two of them are tangled up in the column names, and one of them is inside the cells. Data in this form is hard to work with, because you don't have a consistent way of referring to a variable; they're all muddled up in these different ways. So I'm going to show you a little demo of how you might go about fixing that with the tidyr package.
One thing you'll notice once I've read the data in: if you've seen dplyr before, you'll recognize that I put the data in a class called tbl_df. All that does is change the printing behavior, because the default behavior of R when you print a large dataset is to print the first 10,000 rows, which is, I'm going to go out on a limb and say, never useful. This instead prints the first 10 rows and whatever columns fit on screen, and tells you how much data you have in total. Now, to convert messy data to tidy data, it turns out that for pretty much any possible form of messiness there are only two operations we need. The first one is a gather. We need to gather up all of the columns that are not currently variables and put them into two new columns, which I'm going to call demo, short for demographic, and n, the number of cases. If I do that, I now have four columns. iso2 and year are variables; demo is sort of a variable, except it's not actually one variable, it's two, because we've still got age and sex muddled up in there; and n is another variable. So we've taken all of those columns that weren't variables, turned them into key-value pairs, and stacked them down the dataset. To split demo apart, there's a handy function called separate, which is not very exciting: all it's going to do is take the first character and make that the sex variable, and take all of the remaining characters and make that age. And now we have what I would call a tidy dataset, because each column is a variable. Again, this is important because when you model things, you're interested in the relationships between variables, and when you plot things, you want to map the variables in the data to things that you can perceive. Now you have a standard way of doing that, because all of the variables are represented in the same way. You might also do a few more things to make this data a little easier to work with as a person: you might rename iso2 to country, and you might arrange the rows. We'll talk about those operations a little later. Or we could do this all in one pipeline: start with the name of the file, then load it, then turn it into a tbl_df, then gather it, then separate it, then arrange it, then rename it. We can do that whole sequence of operations in one pipeline should we want to, although generally you don't want to write very long pipelines, because you want to check at each step that you've done the right thing.
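As a hedged sketch of that full pipeline, assuming a file called tb.csv whose raw columns are iso2, year, and demographic codes like m04:

```r
library(dplyr)
library(tidyr)

tidy_tb <- "tb.csv" %>%                    # hypothetical file name
  read.csv(stringsAsFactors = FALSE) %>%
  tbl_df() %>%                             # nicer printing for big data
  gather(demo, n, -iso2, -year,            # non-variable columns become
         na.rm = TRUE) %>%                 # key-value pairs
  separate(demo, into = c("sex", "age"),
           sep = 1) %>%                    # split "m04" into "m" and "04"
  rename(country = iso2) %>%
  arrange(country, year)
```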
tidyr provides a few other verbs that I'm not going to talk about, but for the vast majority of messy and strange datasets it should let you get your data into the right form very quickly. And the reason I think it is so useful is that, first of all, it gives you these programming tools, but more importantly, it gives you cognitive tools to think about how you should arrange your data. My advice there is very, very simple: you always put the variables in the columns, and then the rows form observations. If you can follow that advice, your data is almost always going to be easier to work with, in R or in any other programming language for that matter. If you'd like to learn more about this, you can Google for the package name. There's a package on CRAN, there are vignettes, and there's a paper called Tidy Data, which expands on this idea in much more depth and talks about some of the other common kinds of messiness. There's also now a pretty cool cheat sheet, which summarizes the most important things. So tidyr helps you get your messy data into a tidy form. Often the first thing you'll do with that is some basic data manipulation: you'll create new variables, do summaries, do filtering, pull out the variables you're most interested in. The goal of the dplyr package is to make that as easy as possible, and it does that in three ways. For any operation in a data analysis, first you have to think about what you want to do: what data manipulation will solve the problem that you have? Then you have to describe it precisely, in such a way that a computer can understand it; in other words, you have to program it. And then, finally, the computer has to go away and do it. So when I'm thinking about how to make data manipulation easier, I'm thinking about all three of these things. How can I make it easier for you to think about data manipulation? How can I constrain the scope of the problem to give you some useful tools that solve the majority of problems? How can I make it easier for you to go from what's in your head to code the computer can run? And finally, how can we make the computer do it as quickly as possible? Those are the goals of dplyr. The way that dplyr makes it easier to think about data analysis is by saying there are five important verbs for working with a single table of data. Select allows you to pick out variables by their names. Filter allows you to pick observations based on some criteria on their values. Mutate allows you to add new variables that are functions of existing variables. Summarize reduces multiple values down to a single value. And arrange reorders the rows. My claim is that these five verbs, plus the group by operator so you can do all of this by group, allow you to solve the vast majority of data manipulation problems. And that's actually really important, because now when you go to do a data manipulation, instead of asking which of the thousands of available functions you should use, you can ask which of these five you should use. That is, assuming you believe me that these are the right five. So I'm going to show those five verbs off. One of the other goals here is to map functions to verbs: you want to be able to program the data manipulation in a way that's very similar to the way you think about it. I'm going to work with a slightly different dataset here, an R package called babynames, which contains essentially the complete record of how many babies were given each name in each year from 1880 on. Again, this is a tbl_df, so when you print it you don't get all 1.7 million observations, just the first 10, which gives you some sense of what's going on. We have the year, the sex, the name, how many children were given that name, and the proportion of the total number of births.
So this is telling us that in 1880 there were 7,065 girls called Mary, about 7% of all births. Well, this is actually based not on birth data but on social security data, so it's a little more complicated than that, but you can think of it as births pretty easily. Now, the select verb allows you to pick out variables by their names. For example, we can select all of the variables apart from prop. Here I'm treating a variable name almost like a number: I'm saying give me everything minus this variable. Or we could pick out all of the variables between year and n. In this dataset, which has five variables, you don't really care about these operations. But it's not uncommon, particularly if you're working with government survey data, to have a dataset with 800 variables in it, and the first part of the analysis is just figuring out which of those 800 variables you need to answer the question at hand, and how to pull them out as easily as possible. We can use filter to pick out rows that match some conditions. For example, I'm going to filter to find all of the records about babies called Hadley; you can see there are 155 records, from 1896 on. Or we can apply multiple conditions: give me all of the records for 2013 for males. Every single time I do this, I'm creating a new data frame. When I select, if I just take a subset of the variables, then because of the way data frames work in R, I don't need to copy those columns; the new data frame just points to them where they were originally stored. When I filter, I do have to create a new data frame containing just those rows. Similarly, when I create a new variable with mutate, for example a simple transformation that figures out the first letter and the last letter of each name, that also creates a new data frame. Again, it doesn't take up much memory: the only memory that needs to be allocated is for the two new columns, because the others already exist in the old data frame. So mutate allows us to add new variables that are functions of existing variables. Then we can use summarize, which is not that useful on its own here; we can use summarize to count, which gives us a one-row data frame showing how many babies we have in total, about 333 million in this database. And finally, arrange changes the order of the rows. So those are just the basic verbs, and the goal is that I've now told you almost everything you need to know about them. Each of them is really simple to understand in isolation, and when you want to deal with a more complex problem, you deal with it by combining simple operations together. For example, you can combine any of these with a group by operation. I can say: take babynames, group it by name, and then summarize. I'm using the same summarize code as before, but because the data is grouped by name, I now get the count of babies for each name. You can see there are 56 Aabans, and that not very many names start with Aa.
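In code, the demo so far looks something like this; a reconstruction using the babynames package, not a transcript of the slides:

```r
library(dplyr)
library(babynames)

babynames %>% select(-prop)                     # everything except prop
babynames %>% select(year:n)                    # the variables between year and n
babynames %>% filter(name == "Hadley")          # all the Hadley records
babynames %>% filter(year == 2013, sex == "M")  # multiple conditions
babynames %>% mutate(
  first = substr(name, 1, 1),                        # first letter of the name
  last  = substr(name, nchar(name), nchar(name)))    # last letter of the name
babynames %>% summarise(total = sum(n))         # one-row count of all babies
babynames %>% arrange(desc(n))                  # reorder the rows

# The same summarise, once the data is grouped, runs per name:
babynames %>%
  group_by(name) %>%
  summarise(total = sum(n))
```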
Or we could take babynames, group it by year and sex, and add a new variable called rank, which is just the min rank of n in descending order. Basically, we're creating a new variable that is one for the most popular name, two for the second most popular, three for the third, and so on, computed within each combination of year and sex. Just to show you what that looks like, I'll look at the tail of the data frame, the last six rows. You can see, for example, that in 2013 Zairie was an unpopular boys' name, with only five babies, tied with many other names at a rank around 12,032. Now, this dataset, for privacy reasons, only contains names with at least five occurrences, which is why n is never smaller than five. Yep. Oh, so when you group a data frame, it looks exactly the same. All that's happened is that an index has been built internally which says that any operation you apply from now on is done by group. If we want to answer more complicated questions, we just chain these simple pieces together. If we want to see the most popular names of all time, we start with babynames and group it by name. Grouping is a fundamentally statistical operation: we're saying this is our unit of interest, our unit of interest is the name. For each name we find the total number of babies with that name, and then finally we arrange in descending order so we see the most popular ones first. You can see that, for example, the most popular name of all time is James, with over five million Jameses, five million Johns, a bit under five million Roberts, then Michaels, Marys, and so on. Or we could ask how many Hadleys there are, which I'm interested in because of a very annoying trend. We start with babynames, pick out all the Hadleys, group by sex, and count. You can see there are about 15,000 female Hadleys and 7,000 male Hadleys, and we're going to explore in a bit more detail how that's changing over time. So I do the same thing, filtering to Hadley, but now I break it down by year and sex, because I want to see the time course, and then I summarize again to count. Then I use a little trick from tidyr: I'm actually going to make this data slightly untidy, because I want one column for the number of males and one column for the number of females, so I use a function from tidyr, and then I view the result in this HTML table. You can see that early on, well, Hadley was never exactly a popular name, but it was only boys and no girls. As we scroll forward in time, in the 1960s and 70s we start to see it rise in popularity as a girl's name, and recently it's been skyrocketing in popularity as a girl's name, which is very, very sad. Or we could do something like this: again break it down by year and sex, add the ranking, and now find all of the rows where the rank is one, that is, the most popular name for each sex in each year. And we could just as easily do the opposite: for each name, in which year was it most popular? To do that, we just group by name instead, and we can see the year in which each name peaked.
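And a hedged sketch of those grouped pipelines, again reconstructed rather than copied from the slides; spread() comes from tidyr:

```r
library(dplyr)
library(tidyr)
library(babynames)

# Rank names within each year/sex combination:
ranked <- babynames %>%
  group_by(year, sex) %>%
  mutate(rank = min_rank(desc(n)))

ranked %>% filter(rank == 1)   # the most popular name per sex per year

# Most popular names of all time:
babynames %>%
  group_by(name) %>%
  summarise(total = sum(n)) %>%
  arrange(desc(total))

# Hadleys over time, deliberately untidied so the male and female
# counts sit side by side in one row per year:
babynames %>%
  filter(name == "Hadley") %>%
  group_by(year, sex) %>%
  summarise(n = sum(n)) %>%
  spread(sex, n)
```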
Because we can deal with complexity by combining these simple operations together, this is much easier to learn: you understand how each piece works by itself, and to solve a more complicated problem you just think about the sequence of operations you need to apply. Now, of course, when you work with data you're typically not working with just a single data frame or a single table; you're often working with multiple tables of data. So dplyr also comes with three families of verbs for working with multiple tables. The first are what I call the mutating joins. These are like standard SQL joins, where the result is a new data frame that primarily has more columns. If you want the intersection, only the rows that match in both x and y, you do an inner join. If you want all of the rows in x and the matching ones from y, you do a left join; the complement of that is the right join, and the full join keeps everything. Then there are what I call the filtering joins: the semi join and the anti join. These are related to the joins you're probably familiar with already, but they do not add any columns; all they do is filter. The semi join says: give me all of the rows in this table that have a match in this other table. That turns out to be a really, really useful operation. The anti join is the opposite: give me all of the rows in this table that do not have a match in the other table. That's very useful when you're diagnosing what's gone wrong with your joins, why half of your data has gone missing, and you want to find out where there isn't a match. Yep. So, the way dplyr does these joins by default is what's called a natural join: it assumes the variables you want to join on have the same name in both tables, which is pretty common, and you can override that if it's not the case for your data. And finally, there's a set of row-based set operations. Where the joins think about tables side by side, the set operations think about tables on top of each other. Do you want the rows in common between the two tables? The rows that are in one but not the other? Or all of the rows, regardless of which table they're in? A really useful set of operations, cribbed straight from SQL. Now, one of the other things that's neat about dplyr is that it is built to abstract away the details of how the data is stored. That means you can work with local data frames or data tables, and there's some experimental support for data cubes, but more interestingly, you can also use remote data sources. One of the big bottlenecks when you're working with large data is that large data is expensive to move around: you don't want to move the data to the computation, you want to send the computation to the data. So dplyr allows you to work with tables in remote databases as if they were in memory on your current computer, and it does that by translating your R expressions into SQL and sending that query to the database. There are a number of pluggable backends for that.
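Here's a minimal sketch of the join families and the database idea; the two toy tables and the SQLite file name are made up for illustration:

```r
library(dplyr)

x <- data_frame(key = c("a", "b", "c"), x_val = 1:3)
y <- data_frame(key = c("a", "b", "d"), y_val = 4:6)

inner_join(x, y, by = "key")  # rows matching in both, columns from both
left_join(x, y, by = "key")   # all rows of x, plus matching columns of y
semi_join(x, y, by = "key")   # rows of x with a match in y; no new columns
anti_join(x, y, by = "key")   # rows of x with no match in y

# The same verbs against a remote table: dplyr translates them to SQL
# and the database does the work.
db <- src_sqlite("babynames.sqlite3", create = TRUE)  # hypothetical file
names_db <- copy_to(db, babynames::babynames, temporary = FALSE)
names_db %>%
  filter(name == "Hadley") %>%
  group_by(sex) %>%
  summarise(total = sum(n))   # executed by SQLite, not by R
```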
I think that's important because it abstracts away the storage: a table of data is a table, no matter where it lives, and you should be able to work with it in the same way without having to worry about the specific details of how it's stored. As for data cubes: when you think about a data frame or a table, you normally think about a sequence of columns, where each column is of one type. With a data cube, you're thinking about an array of data in memory, where all the values are the same type, packed very densely. This tends to happen a lot: imagine you have a grid of points measured at a set of time points, so now you have a three-dimensional data structure, and then some other covariate, so maybe a four-dimensional structure. If every combination of your variables is present, you can store the data more efficiently, and sometimes compute on it more efficiently, in a cube rather than a table. If you'd like to learn more about dplyr, you can Google for it: there's a mailing list, there's a GitHub page, there's a CRAN package, there are lots of vignettes explaining in more detail all of the things I've skimmed over here, and there's a cheat sheet as well. Now, I wanted to finish off by talking about ggvis, which is the final piece of the puzzle, the visualization piece. ggvis, much like ggplot2, is a grammar of graphics. It's not a typology of graphics, where you get a list of named chart types and choose between them; instead, it gives you a set of small components that you can recombine to create whatever type of visualization you need. Compared to ggplot2, ggvis is reactive. That means it is not creating a static plot, but a plot that changes over time, either as the data changes or as you interact with it in some way. ggvis is also a pipeline: the way you create a complex visualization is by joining together many simple pieces. And finally, ggvis is fundamentally web graphics. It's of the web, not just something you put on the web: it's built with HTML, CSS, and JavaScript, and uses a JavaScript library called Vega to do the actual rendering. So I'm going to show you a little demonstration of what you can do with ggvis. Again, another dataset, a little smaller this time: data from the DEA on cocaine seizures. I'm going to do a little scatterplot here, putting weight on the x-axis and price on the y-axis. When I first looked at this data, it took me a while to figure out what the units of weight were, because there was one measurement with a weight of about 100 million. I thought it was implausible that this was measured in grams, but it turns out that when I converted that many grams into pounds and googled for that many pounds plus cocaine, there actually was such a seizure: an entire container ship full of cocaine. Here I'm just looking at a subset of the data, where the biggest seizure is about half a kilo. The way I create a visualization is this: I take a dataset, I pipe it into this visualization function, and I say how I want to map my data to things I can perceive. What determines horizontal position? What determines vertical position?
And here I haven't said how to draw it, so ggvis is intelligent enough to guess that if I have a continuous x and a continuous y, points are probably a good way to do it. I could be more explicit about that, as below: take the data, pipe it into a visualization, then add on a layer of points. We can do a similar thing with just a single continuous variable: ggvis guesses that we might want a histogram, and whenever you use a histogram, you always want to play with the bin width. Here it's broken into five-gram bins, and we can see that there's hardly any data above 250 grams, so for the rest of this exploration let's just focus below that. So I'm going to filter the data down. This is a very, very common pattern, right? You do some visualization, and now you need to do some data manipulation. So I plot that filtered data, and one thing I thought I maybe saw when looking at this is that it's not quite a linear relationship: maybe it curves off a little, maybe there's some sort of bulk discount, so that as the weight of the cocaine increases it gets slightly cheaper relative to weight, because you're buying in bulk. We can make that easier to see, first by layering on a smooth: I'm just going to put a smooth curve through this data. Then I can layer on some model predictions: I'll fit a linear model to the dataset so I have a nice reference, a straight-line fit through the data versus the curved one. It looks like maybe there's a little bit of a difference out here, but of course there's not a lot of data out here, so I'd be a little suspicious of it. We could look at another pair of variables instead: weight versus potency. Potency is basically the percentage of cocaine: zero means no cocaine, all filler; 100% means all cocaine, no filler. We've got a bit of a problem here in that there's a lot of overplotting, so it's kind of hard to see the trend. So again, I layer a smooth curve onto the plot. And you see something here that I personally thought was rather interesting: a sharp uptick in potency at the low end. It seems like smaller weights have higher potency, and that's counterintuitive to me, because I'd think of this end as the wholesale end and this end as the retail end, and you'd expect wholesale to be relatively purer and retail to be relatively more cut down. So I'm a little suspicious of this. Maybe it's just some peculiarity of the smoother I've used, or maybe there's something else hiding in all this overplotting. One thing you can do in ggvis that you definitely cannot do in ggplot2 is, instead of adjusting the opacity by editing a value, or editing the span of the smoother, you can map the opacity to a slider, and map the span of the smoother to a slider. So I create something that now has a little bit of interactivity: I get a slider, and as I drag that slider, the plot updates. This is what I mean by ggvis plots being fundamentally reactive: as the parameters change, so does the plot.
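Here's a hedged sketch of that ggvis pipeline; the cocaine dataset ships with ggvis, but the slider ranges are plausible guesses rather than the talk's exact values:

```r
library(dplyr)
library(ggvis)

# Scatterplot with a smooth and a straight-line reference fit:
cocaine %>%
  filter(weight <= 250) %>%
  ggvis(~weight, ~price) %>%
  layer_points() %>%
  layer_smooths() %>%
  layer_model_predictions(model = "lm")

# Map opacity and the smoother's span to sliders; the plot is
# reactive, so it updates as you drag them.
cocaine %>%
  filter(weight <= 250) %>%
  ggvis(~weight, ~potency) %>%
  layer_points(opacity := input_slider(0.05, 1, label = "opacity")) %>%
  layer_smooths(span = input_slider(0.2, 1, label = "span"))
```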
And similarly, as I change the span of that smoother, I can maybe convince myself that this is a real pattern, not just an idiosyncrasy of one parameter value that happened to show this signal. So that's a fairly simple form of interactivity. I want to show two other examples of somewhat more sophisticated reactivity. The first one is a grand tour. I'm just going to pop out here. The intuition behind the grand tour is very obvious: when you have a three-dimensional object and you want to see what shape it is, you look at it from different angles. So why not do the same thing for a six-dimensional object? It turns out the mathematics for doing that is basically the same in three dimensions as in six; it's just a little bit of matrix multiplication. So here I have a grand tour, and what's happening is happening in R: I'm generating a random projection, projecting the data, and displaying it, and I can control it as it's being piped into ggvis. I'm not going to explain it in a huge amount of detail, but just skimming the code, you can see I'm using this package called tourr, which allows us to create grand tour objects, and I'm setting up a few parameters. This is all of the code I need to create a custom interactive graphic. What I'm really doing here is creating a reactive object: a data frame that is not constant but varies, implicitly, as a function of time. I feed that into ggvis, and ggvis knows what to do with a data frame that changes over time: it updates itself as new data comes in. I'm also illustrating here how you can embed these things in R Markdown documents. R Markdown is a nice way of combining code and narrative: I write this in R Markdown, I can include some text, which is nicely formatted, I can include some R code, and when I render it, I get an interactive document. This one is showing linked brushing: when I select points in one plot, I can see them in another. Now, people have been able to do linked brushing like this basically since the 1960s; this is not a huge advance in terms of the technology. Where it is an advance is in terms of availability, because anyone with a little understanding of R code can create things like this. And because all this does is run R code in a special way, you can also use it to do arbitrary things. For example, I can show this little table, which I'll make a little smaller: a table of correlations of the data points under the brush. So anything you can do in R, you can now combine with an interactive graphic and do interactively. And finally, I've shown you all of these things on my computer, but it's also trivial to publish them. If you want to share with someone who does not have R, you can publish to a web server, and then it's accessible to anyone with a web browser, which is basically everyone these days. Now, hopefully this is going to load and not prove me wrong. We'll give it a few seconds, and I'm just going to blame Berkeley's internet service for any problems. Well, maybe we'll come back to this, but you can publish these things to a live server, and then people can look at them interactively without running R. Only the person who creates them needs to use R; the people viewing them do not.
If you'd like to learn more about this, Google for ggvis: there's a web page, it's available on CRAN, there's a mailing list, et cetera, and there are lots of vignettes explaining how to use it. I wanted to finish by talking a little about what I still see as the challenges of data analysis, particularly in R, and what my goals are for future work. That arrow, getting your data in, is still a huge problem; it's often one of the biggest bottlenecks, and it's still a huge challenge. The other thing I think is particularly interesting is modeling, and not just modeling, but the interaction between modeling and visualization. Given that you've looked at a visualization, how do you update your model? Given that you have a model, what insight does that give you into the next visualization you should look at? I think that's a really interesting and challenging area of research. There's been some work on this by David Robinson, who has written the broom package. What broom allows you to do is take any model object in R and turn it into a tidy data frame, so that it's very easy to visualize with any of the tools you're already familiar with. Modeling in R, I think, is absolutely one of our strong points, and it is fairly consistent thanks to the formula interface, but there are still a lot of inconsistencies when you switch between modeling families: you need to understand a lot of details and a lot of special cases, which makes modeling harder than it needs to be. There are some interesting steps in the direction of making it easier, like the Zelig package, or caret. So my goal is to provide, or help provide, in R a fluent interface which allows you to go from having no data in R to having a final report, through a sequence of tools that just disappear. They fade into the background. You're not fighting with them; they're there supporting you, and they are effectively invisible because they just work. You're not constantly thinking: what's the name of this function, does that function take its arguments in the other order, how do I get the output of this function into the input of that other function, all of this kind of crap that is not really related to the challenge you're interested in, which is making sense of your data. So, personally, this year I am really heavily invested in making it easier to get data into R: making it faster to read files off disk, making it easier and more robust to connect to databases, to scrape data from web pages, and to connect to web APIs. Thank you. OK. Questions? Yep. The first question was: is my code available on GitHub? Yes, it probably is; if you look at my GitHub repos you'll probably find it, and if you don't, feel free to shoot me an email and I can send it to you. The next question was: are there packages available for zooming in to graphics? There are some: there's some stuff based on SVG, and there are some JavaScript ones. I know they exist, but I don't know their names off the top of my head. There was an older one called playwith. I think you should be able to find them. One thing, I guess, can I show my screen? One package, or website, that's worth looking at, maybe I'll just try. OK, I'll try that. So, there's a new package called htmlwidgets.
The goal of htmlwidgets is to make it very, very easy for R to talk to JavaScript libraries, often visualization libraries, and it's worth looking at this website: there are examples there, and people are extending it. If I were going to look for an interactive web visualization that uses JavaScript, I would start by looking at the packages that use htmlwidgets. So, the next question is whether dplyr supports that as a backend. It does not, but you can add new backends to dplyr, and there is a vignette that shows you how to do that, so it's possible: as long as there's a way to talk to SciDB from R, you can do it. The next question is: have I done any kind of formal evaluation of the usability of piped versus non-piped code? I have not done a formal evaluation of basically anything I've ever done, ever. But the anecdotal evidence is strong. When I first created that pipe idea, it seemed like a really cool idea to me, but I have lots and lots of cool ideas that no one else can make heads or tails of, so I used it pretty cautiously at first. But people really picked it up, really liked it, and used it a lot. So the best metric I have is the perceived speed with which people have picked it up, and they are using it and seem to really like it. And just being able to put the verb at the start of each line is so nice, because you can very, very easily scan a block of code and get the gist of what's going on. OK, good, thanks. So this htmlwidgets site is worth taking a look at; it's a way of creating interactive web graphics from R using a lot of the best JavaScript libraries. If you want to put stuff on a map, there's leaflet; you can see this is the R code, and it generates this interactive web graphic. If you have time series data, there's dygraphs, which allows you to create interactive time series on the web. I think one of the reasons ggplot2 succeeded is that it took the most common 80% of plotting tasks and provided a consistent interface to them, but that's only possible when there's a whole ecosystem of other tools for the other 20% of more specialized needs. So I see these JavaScript plotting libraries as really, really important: they're designed for special cases, they have a lot of existing work put into them, they're excellent, and they're a great complement to what ggvis will eventually become, a generic tool for making new types of interactive graphics. The next question I'll reframe as: how does stuff move into base R? And the answer is basically that stuff does not move into base R, and it probably never will. And I think that's actually OK. Rather than trying to put more stuff into base R, it would actually be better to put less stuff in base R and really embrace the diverse ecosystem of R packages. At the end of the day, having stuff in base R doesn't make that much difference, because installing packages is so easy anyway; I don't see it as a big concern. The next question, and I really like this because I can reframe it into a question I want to answer, is about this 80-20 split: how do you deal with the fact that you've now got 80% elegant code and 20% icky code? One thing that's neat is that the pipe operator can be used with any function, right now.
So you can always pipe into functions that come from other packages. And I think there's a really interesting analogy between the pipe operator and method chaining. If you've ever used the pandas data manipulation library in Python, it has something very similar to the pipe idea, where you call a sequence of methods: you build a pipeline of transformations by calling method after method after method. The advantage of the pipe operator is that anyone can add a function into that chain; the disadvantage of method chaining is that the only person who can add a new method is the owner of that object. There are downsides to the piping as well: because the functions that go in a pipe are ordinary functions that can be called anywhere, they generally need to have longer names in order to be explicit, whereas methods can be shorter. But I think that's something really powerful about this idea of pipelines: you're not just limited to the things I thought would be good in a pipeline; you can create your own functions that go in the pipeline and work really well there too. Yes, needless inconsistency. So the question is about my use of non-standard evaluation in R, and partly also about why the syntax of dplyr and ggvis is inconsistent. Non-standard evaluation, being able to evaluate an expression in R in an unusual way, is really, really powerful, because it allows you to create things that kind of break all the rules. An interesting journey for me has been how well I've understood how that works. I've had at least three epiphanies where I think, yes, I finally understand how this works, and then two years later: oh, I didn't really understand it at all, but now I finally do. I'm reasonably certain I've now had my final epiphany, and I actually do understand how it all works, and I understand all the failure modes. There was about a year where I understood it and thought, cool, now I can write code that does non-standard evaluation reliably, and that led to the creation of the lazyeval package. Now, unfortunately, after finally understanding how it works and how to do it correctly in the vast majority of situations, I'm coming to the conclusion that you shouldn't do it in the first place. Instead, you should use formulas, which turn out to be one of the most powerful and useful components of R for computing on the language, for metaprogramming. They're not just useful for defining model formulas; they're very, very useful in general. Unfortunately, dplyr started on one side of that transformation and ggvis on the other, and how we're going to reconcile them to make them both work the same way is a difficult question, because there's no way to change dplyr to work the same way as ggvis without fundamentally breaking it. So at some point there may be a dplyr 2 which requires you to explicitly use formulas everywhere. I'm still struggling with what the right thing to do is there, but for now it's just a small inconsistency between the two.
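To make that concrete, here's a minimal base-R sketch of both styles; my_filter is a hypothetical toy, not dplyr's actual implementation:

```r
# Capture an expression unevaluated, then evaluate it with a data
# frame's columns in scope -- the basic trick behind fluent verbs.
my_filter <- function(df, cond) {
  expr <- substitute(cond)                # capture, don't evaluate
  keep <- eval(expr, df, parent.frame())  # evaluate against df's columns
  df[keep, , drop = FALSE]
}
my_filter(mtcars, cyl == 4 & mpg > 30)

# The formula alternative: a one-sided formula captures both the
# expression and the environment in which it was written.
f <- ~ cyl == 4 & mpg > 30
keep <- eval(f[[2]], mtcars, environment(f))
mtcars[keep, , drop = FALSE]
```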
The next question is basically: what can you put in the columns of a data frame? All the examples I showed you had very simple things, like numbers or strings or logicals. It turns out that in regular base-R data frames you can actually put more complex objects in them: it's totally valid to have a column of a data frame that contains, for example, a linear model. I think that idea of having more complex objects inside data frame columns is really, really powerful. There are circumstances where having a column of a data frame in which each value is another data frame can be really useful, and putting models inside data frames is also a good idea. Now, the internet seems to have stopped working again, but I've been exploring that idea in this package called lowliner, which is thinking about how ideas from functional programming can help us with the modeling part of the piece. I think a good example is cross-validation: you randomly partition your data into, say, 80% training and 20% test, and you do that a number of times. So now you have two sets of data frames: a whole lot of training data frames, and a whole lot of test data frames. To each training data frame you fit a model, so now you have a list of models. Then you take that list of models and the list of test datasets and predict, so now you have a list of predictions, and finally you compute the root mean squared error or something from each, and you end up with a vector of those. I think you can do all of that inside a data frame, and that's a really useful way of working. The goal is to get dplyr and lowliner together to the point where that just feels natural and easy, and each version of dplyr steps a little closer to that ideal. By and large, the main reason you can't do this with data frames currently is that there are a few bugs in how data frames containing complex objects get printed, which basically makes them useless, or at least extremely annoying.
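Here's a hedged base-R sketch of that workflow, using plain lists of data frames, models, and predictions; lowliner's goal is to make steps like these more uniform:

```r
set.seed(1)

# Ten random 80/20 train/test partitions of a toy dataset:
partitions <- lapply(1:10, function(i) {
  idx <- sample(nrow(mtcars), floor(0.8 * nrow(mtcars)))
  list(train = mtcars[idx, ], test = mtcars[-idx, ])
})

# Fit one model per training set:
models <- lapply(partitions, function(p) lm(mpg ~ wt, data = p$train))

# Predict on each matching test set and compute the RMSE:
rmse <- mapply(function(m, p) {
  pred <- predict(m, newdata = p$test)
  sqrt(mean((p$test$mpg - pred)^2))
}, models, partitions)

rmse   # one root-mean-squared error per partition
```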
One more question? OK. So, the question was about some more advanced forms of piping, of which I actually happen to have a little picture here. There are a few restrictions on pipes that I'm not going to go into: these are pipelines that are useful and valid, and there are some pipelines you just cannot construct, for various reasons. It is possible to create a pipeline that has a sort of cycle in it, where you do a sequence of operations and the final result replaces the initial input. That's pretty common, right? You do a sequence of operations and use the result to replace the thing you started with, and there is a compound piping assignment operator in magrittr that gives you a little shortcut for that. I am fundamentally conflicted about whether that's a good idea or not. It is really, really important, when you are writing code, to write code that's readable, and that generally implies using a fairly standard vocabulary; you don't want to use really esoteric words. At the moment this operator is fairly esoteric, and I'm not sure whether the amount of code it saves you is worth the cost of that additional knowledge. Now, on the other hand, there's a thing that I think is really cool: often you apply the same pipeline to different input datasets, and magrittr has this neat capability where, if the first thing fed into a pipeline is just the dot, then rather than executing the pipeline, it creates a function that will execute the pipeline. I like this. I'm not sure, again, whether it's super useful, but it seems really elegant to me that you have a way of saying: either create a specific pipeline that takes a specific dataset and produces a specific result, or make a generic pipeline, which is just a function that takes some input and gives you some output. Thank you.