Kia ora tātou. My name is John Hosking, and I'm the Dean of Science here at the University of Auckland. It's my very great pleasure to welcome you to the inaugural lecture of the Ihaka Lecture Series. This series is named in honour of Ross Ihaka, sitting here, who of course is one of the co-founders of the R statistical programming environment. He and Robert Gentleman were both here in the Department of Statistics when they decided to build some experimental software back in 1991-92, so a wee while ago now. One of the key decisions they made was that it was to be free software: free to download and freely modifiable. And this has been one of the reasons for its extraordinary success.

And that success is extraordinary. By standard academic measures, the paper that Ross and Robert wrote has attracted nearly 9,000 citations; I wish I had a paper like that. But Ross's real impact is the number of users worldwide who make use of R, estimated in 2014 to be over two million, and I'm sure it's grown considerably since then. R is now ranked fifth in the top programming languages, behind C, Java, Python and C++, but ahead of languages like C#, academic programming environments such as MATLAB, and domain-specific languages in the stats area such as SAS. This is almost a textbook example of curiosity, inventiveness, great ideas, and sharing and openness leading to good science. Ross, through R, has had a profound impact on the practice of modern statistics. In 2008 Ross was awarded the Pickering Medal from the Royal Society of New Zealand in recognition of his achievement. He has rock-star status amongst statisticians, but as anyone who knows Ross knows, he remains extraordinarily modest about his achievements. Asked about his work, he said simply, "I have fun." He finds the accolades he's received kind of embarrassing. So Ross, embarrassment aside, this is a great occasion to celebrate the impact that you've had as an academic, and I think this is a very fitting series to commemorate that. It's a very great pleasure to open this lecture series, to be held annually in Ross's honour. Thank you.

My name is James Curran; I'm here in the Statistics Department. Let me say at the outset that it's just absolutely fantastic to see so many people prepared to give up their Wednesday night to come along and listen to a statistician; who would have believed it? We have a fantastic lineup for you over the next four weeks, and tonight we're going to start with one of the best. In some sense Dr Hadley Wickham needs no introduction, but I did promise him that I would embarrass him, so we'll start with a brief one. Hadley is actually a former graduate of our department. After graduating with a Master's degree from here, he went away to Iowa State and got a PhD. Some people might say that he rewrote the field of statistical graphics. He then went on to a position at Rice University in Texas, and then was snapped up by RStudio, which is a product I'm sure most of you are familiar with. John used the words "rock star" to describe Ross earlier, and perhaps the same can be used for Hadley. I remember seeing a talk where the academic Peter Donnelly told a joke about how you can tell the difference between an introverted and an extroverted statistician: the extrovert is the one that looks at somebody else's shoes. But I think we can take the rock-star thing just one step further for Hadley.
So I don't know if you can all see this, but when you meet somebody and you don't have a piece of paper, you ask for something else to be signed; I'll never wash my laptop again. Jokes aside, it's hard to overstate the impact that Hadley has had, especially in the R world. Just as evidence of that, this is from an article called "The Hitchhiker's Guide to the Hadleyverse". Hadley is far too modest a person to use the word "Hadleyverse" himself; I believe at useR! last year he suggested that we should use the word "tidyverse" instead. There are, I believe, in the order of 55 packages, as of August 2015, that Hadley has authored. They are packages that are shaping our field, and shaping the field of data science, and they are really important to us. So without much further ado, would you please join me in bidding Hadley a warm welcome, and I'll hand the floor over to him.

Thanks, James. What I wanted to do today is to talk about some of the things I've been spending a lot of time thinking about lately, around the theme of how you can most easily and naturally express yourself in R. To this end, my goal as a tool builder is to engineer something I like to think of as a pit of success. This is not a pinnacle of success, because to reach a pinnacle you have to strive, you have to work really hard; the goal of a pit of success is that you almost fall into it without even trying. Certainly I think we're a long way from just falling into success in R, but today I wanted to talk about some of the things that I think make coding in R lead you to fall into this pit of success. And really, in some respects, the thing that helps you fall into this pit of success is theory, some kind of consistent principles. These are things that I don't think you need to learn early on in your journey as a data scientist in R, but they're things that you'll discover naturally, and because there is this rich underlying theory, success to me is when you try something that you didn't know would work or not, and it just works.

So today I'm going to talk about a general strategy for solving problems that I think is really useful, and how we can see that reflected in R. Basically, I'm going to argue that a really good way of solving complex problems is to break them up into simple pieces and combine them together, and hopefully those simple pieces will also have a consistent structure. I think there are two main reasons that solving complex problems with simple pieces is really important. The first one is iteration. Whenever you do a data analysis, whenever you do a visualisation, whenever you do some data manipulation, almost always the first thing that you do is wrong. And often the second thing that you do is wrong as well, and the third thing, and the fourth thing might be wrong too. I'm someone who's been programming in R for like 15 years now, which is something scary, but this still happens to me all the time; it's not like I start typing R code and it works right away. Instead, what I've tried to engineer is a workflow where I can fail as quickly as possible. And that's why this idea of simple pieces is so important: you can take a little step and check your work, and if you haven't gone in the right direction, you can just take a step back.
If you take a long, wandering journey and two days later you discover, oh, I'm not actually where I want to be, you've got a long way to backtrack and you've wasted a lot of work. So if you can break up a complex problem into simple steps, and check your work after each step, that's a really good way of solving complex problems. The other thing that's useful about these simple pieces is that not only do they allow you to solve this problem, but maybe you can rearrange those pieces in a new way to solve a new problem, one that you hadn't thought about before. So what I'm going to do today is break those components down; we'll talk about them one by one, how they're reflected in R and why they matter.

And I'm going to start off by giving you a challenge. Here is some R code; hopefully you're all familiar with R code. I want you to read this code, figure out what the point of it is, and see if you can spot the mistakes. Talk this over with your neighbour; you have one minute, starting now.

I'm going to take a stab at telling you what the intent of this code is. This is a pretty common scenario: you've read some data in some crazy format, missing values have been recorded as negative 99, and I want to convert these to R's missing values, so that when I compute summaries I don't get some crazy number. So what's wrong with this code? Well, I've created it by copying and pasting, and as always when I copy and paste, I've made a few errors. You might have spotted this -98 here, and you might have spotted this i and j mix-up. Anything else? Well, columns c and d are missing, absent here, but that's not actually a mistake: they're character variables. You couldn't know that, so here I've tweaked the code a little to highlight it; I've added a comment to explain why they're not there, and I've highlighted my mistakes. Now, I am so bad at copying and pasting that typically when I create a slide like this, where I've deliberately introduced two mistakes, there's normally a third mistake as well; I seem to have avoided that this time. So the first problem with duplicating your code by copy and paste is that it hides the intent.
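The slide itself isn't reproduced in this transcript, but based on the description above, a minimal sketch of the copy-and-paste code probably looked something like this (the data frame df and its column names are assumptions, reconstructed from the discussion):

```r
df$a[df$a == -99] <- NA
df$b[df$b == -98] <- NA   # mistake: -98 should be -99
# c and d are character variables, so they don't need this transformation
df$e[df$e == -99] <- NA
df$f[df$f == -99] <- NA
df$i[df$j == -99] <- NA   # mistake: the subscript uses column j, not i
```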
As experienced R users you can kind of read between the lines here and understand what's going on, but you certainly couldn't guess that c and d were character variables and so didn't need this transformation. And because I have all this duplication, I've also introduced these possibilities for error. So to me, the tool that you use in R to remove duplication is a function, and I think a great rule of thumb is: it's fine to copy and paste three times, but as soon as you've copied and pasted more than three times, it's time to write a function. And so here I have written a function, and one of the great things about functions is that you can give them a name, so immediately this code is easier to read, because I can now divine the intent a little more easily: fix these missing values. So the unit of problem solving in R is the function. When you just type out some script, you solve one specific problem; a function allows you to solve a class of problems.

Now, one thing I'm not going to talk about much today, but which I think is really great about R as a programming language, is that R is a functional programming language, and you can write stuff like this. Here I'm using a little function from the purrr package: it's going to take a data frame and say, for every column where this predicate is true, apply this transformation. If you haven't ever heard of this function before, that is fine, because I only wrote it literally three days ago. But once you get the idea, I'm going to take an object, say which parts of the object I want to change, and apply that transformation (there's a sketch of both pieces below). This is a really, really powerful aspect of the language, and I'm not going to talk about it much tonight, but I encourage you to learn a little about functional programming if you want to increase your skill as an R programmer.

What I wanted to talk about today is: if a function is a piece, what makes a function simple? And I'd argue that generally you want functions that are like Lego, not functions that are like Playmobil. The argument is not that Lego is more fun than Playmobil, or Playmobil more fun than Lego; the neat thing about Lego is that you can recombine it in lots of ways. The uses of Lego are not constrained by the inventors of Lego, unlike Playmobil, where the uses to which you can put it are much, much more constrained by its construction.

So to me there are two key ideas behind writing simple functions. The first is that you want your function to do one thing well, and the second is that you want to be able to understand that function in isolation; you want to minimise the amount of context you need. To make these concrete, I'm going to talk about one aspect of each. One way a function can do more than one thing is that it can compute a value and also have a side effect, so I'm going to talk a little bit about how, ideally, functions should either give you a value back or do something with a side effect that affects the world in some way, but not both. And I'm going to talk about this idea of type stability: the idea that if you don't know what type of thing a function returns, it's much harder to predict how it's going to behave when you're just reading the code, rather than when R is running it.
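Again the slides aren't shown in the transcript, so this is a hedged sketch of what the two pieces probably looked like. The helper function is a reconstruction from the description; the purrr function mentioned ("I only wrote it three days ago") may have been an early version of what is now modify_if(), so treat the exact name as an assumption:

```r
library(purrr)

# Replace a sentinel value with NA; does one thing, and returns a value
fix_missing <- function(x, na_value = -99) {
  x[x == na_value] <- NA
  x
}

# For every column where is.numeric() is TRUE, apply fix_missing()
df <- modify_if(df, is.numeric, fix_missing)
```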
So again I'm going to give you a challenge. Here are seven functions, from base R and some of my packages. What I want you to do, again discussing with your neighbour, is decide which of these functions is called primarily to compute a value, and which is called primarily for its side effect. Again, you've got one minute, starting now.

Let's do a quick show of hands. I realise I started with a tricky one, but who thinks mutate is called primarily for its side effect? And how many think it's for computing a value? Some of you clearly don't know; this is a little bit tricky if you haven't used dplyr before, so I'm going to say your homework is to go and read about it, so you can at least deliberately choose not to use it. The job of mutate is to take a data frame and add a new column that's a function of existing columns, but it doesn't literally modify the data frame that you give it: it gives you a new data frame with a new column. So this is a function that is called primarily because it returns a value.

What about write_csv: is this called primarily for a side effect, or for its value? Right; who knows what it returns? Well, I know, obviously, but it's called primarily because you want to save a csv to disk. What about print: value or side effect? The job of print is to display something on the screen. What about summarize, another function from dplyr: does it primarily compute a value or have a side effect? Summarize collapses multiple values down into one. We've got geom_line: is adding on a geom_line a side effect or computing a value? That's a little tricky, because when you look at ggplot2 it kind of seems like the job of this is to make a plot appear, but what's actually going on behind the scenes is that it is building up a plot object for you. ggplot2 follows this philosophy in a similar way: the way you create a complex plot in ggplot2 is by combining simple pieces. We'll come back to ggplot2 a little later, and why in some ways it's worse than ggplot1. What about the assignment arrow: compute a value, or have a side effect? And what about runif: compute a value, or have a side effect? Thomas, do you want to weigh in? Compute a value; yes.

So here's what we just went through; hopefully I did that correctly. And runif is a little bit interesting, because if runif did not change the state of the world in some way, every time you called it you would get the same random numbers back. So there are some functions that by their very nature have to do both of these things: runif has to update some kind of global state, which is what allows it to give you new random numbers every single time you call it. But generally you want your functions to do one or the other.

I'm going to show you two examples of functions in base R that do both. By and large the functions in base R are really good, they either do one thing or the other, and it took a little bit of searching to find these two. The first: if you compute the summary of a linear model, and you print that summary, you get a p-value for the model. But that p-value is not actually stored anywhere. If you want to extract that p-value, you cannot; it is computed when the object is printed, but it's not stored, so you cannot get that value and use it in other ways.
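As a concrete illustration of the summary.lm() point: the model p-value shown in the printout isn't stored in the summary object, though it can be recomputed from the F-statistic that is stored. A minimal sketch (the mtcars model is just an illustrative assumption):

```r
fit <- lm(mpg ~ wt, data = mtcars)
s <- summary(fit)

s$p.value   # NULL: the printed model p-value is not stored anywhere

# It is computed on the fly at print time; you can reproduce it yourself
# from the stored F-statistic:
f <- s$fstatistic
pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
```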
The counter-example, the opposite side of this, is hist. You call hist primarily to get a histogram, but it does also return a value, which you can use if you wanted to draw the histogram yourself some other way. And you can explicitly say to hist, don't plot this, I am calling you just for the return value, not for your side effect. But this thing here is a little icky: I think having an argument to a function that changes its behaviour in such a fundamental way makes it harder to reason about your code.

And this is related to the idea of type stability. To paraphrase Forrest Gump: with a type stable function you always know what's going to come out of it; with a type unstable function you don't know what's going to come out, you need some additional context. With a type stable function, no matter what the input is, you always get the same type of output. With a type unstable function, well, maybe sometimes it gives you a list, other times a vector, and other times a matrix. It's harder to reason about, because you have to know not only what the function is, but what the arguments to that function are.

So I'm going to give you another little challenge. Here I've written a function called findVars. The idea is that you give it a data frame and a predicate function (a predicate function is just a function that returns TRUE or FALSE); it's going to use sapply, and it's going to extract those variables. The goal of this function is to say: give me all the numeric variables in this data frame, or give me all of the factor variables in this data frame. What I want you to do is read this code and execute it in your head; you may get compiler errors in your head, that's okay. I want you to try to predict what each line of the code is going to do (the code is reconstructed below), and since this is the University of Auckland, I've included a line for experts only: if you can predict what that returns, I'll give you a gold star. Again, one minute, starting now. [Audience question about whether iris is the built-in data set.] It's kind of not important, though in some ways it kind of is.

Now that you've run this code in your head, let's run it in R and see what we get. I didn't tell you anything about this iris data set; maybe some of you are so intimately familiar with it that you already know, but let's just run it and see what happens. The intent here, hopefully, is clear: give me all of the variables in the iris data set that are numeric. And what do I get? Well, it's a little hard to see here, so let's use str. What do I get? I get a data frame back. Now, what if I say give me all the variables in iris that are a factor; what would you expect to get back? A data frame, I'd hope, right? But I don't get a data frame; I just get a single factor. Do any experts want to guess what this next one is going to do? This is a little crazy: here I'm indexing a data frame so that it has zero columns, so I get a data frame with zero columns and 150 rows. And you might say, I would never create a data frame like that, and if so you are very lucky, because I always end up accidentally creating these crazy data frames. But any idea what's going to happen? How many numeric columns are in this data frame? Zero, right?
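The code on the slide isn't included in the transcript, so here is a hedged reconstruction of what findVars and the challenge calls plausibly looked like, based on the description (sapply plus single-bracket, comma-style indexing):

```r
findVars <- function(df, fun) {
  df[, sapply(df, fun)]
}

str(findVars(iris, is.numeric))  # a data frame with the four numeric columns
str(findVars(iris, is.factor))   # surprise: a lone factor, not a data frame

# The "experts only" line: a data frame with zero columns and 150 rows
df0 <- iris[, 0]
findVars(df0, is.numeric)        # what happens here?
```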
By definition it's still a data frame, with zero columns in it. So what happens? Hmm; I get this rather uninformative error message: invalid subscript type 'list'. So what's going wrong here? Well, there are two functions here that are not type stable. The first one is sapply; the s in sapply is for simplify. It takes each element of df, applies this function to it, and then simplifies the results into the simplest possible vector. In this case that's normally going to be a logical vector. But what happens if the data frame is empty? If there are no columns, then there are no logicals; in fact sapply never sees any logicals. So what sapply returns in this scenario is a list. Sometimes sapply returns a list, sometimes it returns a vector, and in other cases it returns a matrix. This doesn't make sapply bad; in fact this makes sapply very convenient for interactive programming. But it makes it harder to predict what it's going to do when you just read some code, without knowing exactly what the values of the arguments are. Then we also have the square bracket: if you retrieve multiple columns, it returns a data frame; if you retrieve a single column, it gives you just that single column. Again, this isn't bad; it's very convenient when you're doing an interactive data analysis, because often you just want to pull out that one column, you don't want a data frame with one thing in it. But as soon as you start putting these into functions, and reading your code later on, it gets increasingly difficult to predict what's going to happen.

So I think this idea of a type stable function is really powerful, because it makes the results of the code more predictable without understanding all the details. And I can rewrite this in two ways, as sketched below. First, I'm going to use another function from purrr called map_lgl, map underscore logical: it returns a logical vector, or dies trying; it will never return anything other than a logical vector. And then in base R we can say drop = FALSE, which ensures that the subsetting operator always returns a data frame, rather than simplifying to a vector when it can.

So the atom of problem solving, of reproducible problem solving, in R is the function, and you want to try to make your functions simple in two ways. Ideally you want each function to do one thing well; one example of that is that your functions should either compute a value or have a side effect, but not both. And they should be understandable in isolation: the less you need to know about the arguments to a function to predict its output, the easier it is to understand your code.
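A hedged sketch of the type-stable rewrite just described, again reconstructing the slide from the description (purrr::map_lgl() plus drop = FALSE):

```r
library(purrr)

findVars <- function(df, fun) {
  df[, map_lgl(df, fun), drop = FALSE]
}

str(findVars(iris, is.factor))        # always a data frame, even with 1 column
str(findVars(iris[, 0], is.numeric))  # a 0-column data frame, not an error
```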
So once we have these simple pieces, how do we put them together? Well, in base R there are two ways you can use multiple functions to solve a problem. The first way is to assign the result of each function call to a variable. Here I'm doing a little analysis using dplyr: I take this flights data set, I group it by the destination, then I summarize it, computing the average delay and the number of flights. The estimate of the mean is not going to be very reliable when there are hardly any flights, so I only look at the destinations with more than 100 flights, and then I sort in descending order of delay. So I'm solving a fairly complex problem, give me the airports with the greatest average delay, by combining these simple pieces. But what I've done here is that I've been forced to name every single one of these intermediate objects, and naming things is hard.

So what you might be tempted to do is just say, well, I'm going to call this foo. You've solved the naming problem; it's easy to come up with a silly name like that. But now you've got a new problem: if you make a mistake down here, you've overwritten the previous value of foo, so you have to go all the way back to the top and run down here again. So maybe what you'll do is call it foo1, and then foo2, and if you look at this code you might notice that I've done what I always do whenever I write code like this: I've forgotten to increment the number in one of my lines. Just like when I copy and paste code, even knowing that I will make this mistake, I still make the mistake.

So maybe we should just try to get rid of these intermediate names altogether, and one way to do that is function composition. Again, exactly the same code, but done in one giant function call, and now I have to read it from the inside out: I start with flights, I group it, then I summarize, then I filter, and then I arrange. This has got rid of the intermediate-variable problem, but it's introduced a new one: I have to read from the inside out, right to left, which is unnatural, and the arguments to each function end up spread quite far apart from the function name.

So a third option is provided by the magrittr package, in the form of this crazy looking operator, the pipe. What the pipe does is let you write a sequence of function calls from left to right, without any intermediate variables. The pipe is really natural when you are transforming a single object in multiple steps, like here: take the flights data, then group it by destination, then summarize it, then filter it, then arrange it. It's really important to think about the readability of your code, because every project you work on is fundamentally collaborative. Even if you're not working with any other person, you are always working with future you, and you really don't want to be in a situation where future you has no idea what past you was thinking, because past you will not respond to any email. So I think it's really important to think about how you can structure your code so it's readable, because you're going to be the chief beneficiary of that in the future.

So to summarize the options (all three styles are sketched below): we've got assignment, which is great because you can read it from left to right, but it doesn't let you omit intermediate variables; it forces you to name everything, and naming things is hard. Then I showed you function composition, which is great because it allows you to omit intermediate variables, but you have to read it from the inside out, which is unnatural. And finally we have the pipe, which allows you to read from left to right and allows you to omit intermediate variables.
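The slides aren't in the transcript, so here is a hedged sketch of the three styles, using the nycflights13 flights data that the description suggests (the exact variable names, like arr_delay, are assumptions):

```r
library(dplyr)
library(nycflights13)

# 1. Assignment: reads left to right, but everything needs a name
by_dest <- group_by(flights, dest)
delays  <- summarise(by_dest, delay = mean(arr_delay, na.rm = TRUE), n = n())
big     <- filter(delays, n > 100)
result  <- arrange(big, desc(delay))

# 2. Function composition: no intermediate names, but reads inside out
result <- arrange(
  filter(
    summarise(
      group_by(flights, dest),
      delay = mean(arr_delay, na.rm = TRUE), n = n()
    ),
    n > 100
  ),
  desc(delay)
)

# 3. The pipe: left to right, and no intermediate names
result <- flights %>%
  group_by(dest) %>%
  summarise(delay = mean(arr_delay, na.rm = TRUE), n = n()) %>%
  filter(n > 100) %>%
  arrange(desc(delay))
```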
But of course there has to be a downside to the pipe, and basically there are two places I think you don't want to use it. The first: the pipe allows you to omit intermediate variables, but often you want those variables. You might want to give them an informative name, which helps convey the intent of your analysis, or those intermediate results might take a very long time to compute, and you don't want to re-compute them all the time. The other disadvantage of the pipe is that it's fundamentally a sequence; it's at its best when you've got a linear sequence of transformations. There are all sorts of other crazy graph structures you can create in code with the other forms, particularly assignment, that you cannot create with the pipe. But I think the pipe is a really good fit for many common data analysis or data science problems, because you take something and you do something to it, multiple times.

I want to finish off this part about combining things by showing you somewhere where combining things does not work very well, because unfortunately in ggplot2 the simple pieces aren't combined with functions; they're combined with plus. When you create a ggplot2 plot, you have to add things together, and this is a little unnatural, because now if I'm using dplyr, I take flights, I group it, I summarize it, and then I plot it, and I have to remember that I can't use the pipe anymore, I have to use plus. And what happens if I want to save this plot? Well, I have to wrap that whole thing in a call to something like ggsave(). Unfortunately ggplot2 was written before I discovered the pipe, and so we've got this awkwardness of switching between these different modes of composition. I think that is one of the nicest things about the pipe: it's not something fundamentally new, it's just a new way of expressing something old, whereas ggplot2 tried to come up with something new, creating plots by adding things together, and it has all these awkward boundaries.

Now, interestingly, ggplot1, the original ggplot, did not have this problem, and this is the equivalent code in ggplot1. Originally, of course, it was not called ggplot1, because I did not know I would be completely recreating it as ggplot2, but I've brought it back to life as ggplot1. It's basically ggplot 1.0, brought back to life as an R package that will work currently, and it works with the pipe, because the idea of ggplot1 was to use function composition, and at the time I correctly realised that the assignment and wrapping styles of function composition were both kind of awkward, so I decided to create the plus. But you can do it all with a pipe, which gives you this very nice sequence of transformations: you start with a data frame, you transform it to a plot, you do something with that plot, and then finally you save it.

Another interesting connection with the pipe is to this rather weird feature of R, which is that you can assign in either direction. As well as the usual assignment arrow, which assigns the value on the right to the name on the left, you can also use the backwards arrow, which assigns the thing on the left to the name on the right. Now, personally I believe this is terrible and awful and you should never do it, but at the end of a pipe it does seem like a very interesting idea. The reason I believe you shouldn't do it is that when you come back to the code later on, it's great to have the name of the thing up front, because that's going to give you a hint as to the intent: the name of an object acts like a title or a heading does in a paper, alerting you to what's coming up. And with ordinary assignment the pipeline sits indented after the name, which makes it kind of subsidiary; the name states what this pipeline ends up producing. So you can do this, and it's kind of cool that it looks so natural, but I still think you shouldn't.
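Two hedged sketches of what the slides likely showed here: the awkward switch from %>% to + when a dplyr pipeline feeds into ggplot2, and the backwards assignment arrow at the end of a pipe (the specific variables are assumptions):

```r
library(dplyr)
library(ggplot2)

# Mid-pipeline you must switch from %>% to +
flights %>%
  group_by(dest) %>%
  summarise(delay = mean(arr_delay, na.rm = TRUE)) %>%
  ggplot(aes(delay)) +     # the composition style changes here
  geom_histogram()

# The backwards arrow: legal, and oddly natural at the end of a pipe,
# but it hides the name of the result at the far end of the pipeline
flights %>%
  group_by(dest) %>%
  summarise(delay = mean(arr_delay, na.rm = TRUE)) -> delays
```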
Okay. So we've talked about simple pieces: functions that do one thing well and are easy to understand in isolation. We've talked about how to combine those function calls, in basically three styles: assignment, wrapping, and the pipe. The last thing is: how can you make it easy, when you combine all these little function calls, to get something that's easy to work with? Because sometimes programming in R feels a little bit like this: the output from one function doesn't fit into the input of another. So you're like, oh, maybe I could use this function to transform that one; that didn't work; oh, that didn't work either; well, maybe I can use half of this function, and oh, there's this other function over here that does half the problem; oh no, that's so close, but it didn't work; ah, finally I find the last function that combines them together. Each of these pieces is simple, right, you can easily understand them in isolation, but you still have this nightmare solution to what seems like a simple problem.

So when you look at Lego (I guess since I'm in New Zealand I should be saying Lego, not Legos), Lego is cool not only because the pieces are simple, you can look at a Lego brick and understand it, but because they're also consistent: you can take most Lego bricks and stack them, attach them in some way to another Lego brick. That's what allows you to go from these individual bricks, which are very simple, to creating art. Lego bricks are simple and uniform, but that does not constrain your creativity; in some ways it almost enhances it.

So what does that mean for R functions? It means you want a consistent data structure: ideally your functions take the same type of data in and return the same type of data out, across a broad range of problems. And I think the data structure that's really good for that in R is the data frame. There's just one extra thing you need on top of the data frame, and that is some convention for what you put in the columns and what you put in the rows, and that's basically the idea of tidy data. If your data is tidy, you put each data set in a data frame, and you make sure each variable is in a column; if you do that, you'll find that naturally the rows become the cases. This hopefully seems like a really simple and obvious idea, but it legitimately took me like five years of thinking to figure it out. What's great about this is that it gives you a consistent data structure: if all of your functions work with this data structure, you're not constantly trying to jam the output of one function into the input of another. To paraphrase Tolstoy: tidy data sets are all alike, but every messy data set is messy in its own way.

To illustrate this idea, I'm going to give you another challenge. This is some data about tuberculosis collected by the World Health Organization. What I want you to do is look at this data set and figure out what the variables are. I'll give you a couple of hints: F stands for female, U stands for unknown, and 1524 stands for 15 to 24. Again you have one minute, starting now. What are the variables?

Okay, time's up. Who wants to tell me one variable they've located? We've got gender; you've picked one of the harder variables to spot, but you'll notice we've got males and females spread across the column names. What else is up there in the column names? We've got the age range. So we've got sex and age range stored, somehow, in the column names.
What other variables do we have? Yes; Thomas is either a very smart and insightful gentleman or he has heard this talk before, but I'm going to go with the former, and yes, this is indeed a two-letter ISO abbreviation for the country. I didn't give you much context here, but I told you it's the World Health Organization, so you might guess that they're going to store the country in some way; here we have AD, which I believe stands for Andorra. Any other variables that are really easy to spot? None of you have said the easiest variable to spot: year, right. So we have country and year in the columns, and we have gender and age spread across the column names. What else do we have? The cells are a measurement, and you might just guess, since they're all integers, that this is a count: the number of cases of TB for that age group, for that gender, in that country, in that year.

So we've got five variables, but they're stored in three different ways: in columns, in the column names, and in the cells themselves. You can imagine that working with this data set is going to be an exercise in frustration, because you're going to need three different sets of techniques depending on which variable you're working with. So what we do is turn this into tidy data. As I said, tidy data has one variable in each column, so we have country, year, sex, age, and the number of cases, and this is going to be much easier to work with, because each variable is stored in a consistent way. (A sketch of this reshaping is below.)

Now, you might wonder about stuff that doesn't naturally fit into this format, and there is some tension here. There's always a tension between a special-purpose data structure that is absolutely the best data structure ever for this specific problem, and a general-purpose data structure with general-purpose tools: it might not be the best tool for the job, but you're already familiar with it, and you already have lots of tools for working with it.

I just wanted to show one little extension of this idea, called tidytext. This is a package by Julia Silge and David Robinson, and if you're interested you can read more about it here. The basic question this package asks is: how do you represent text data in this tidy format? One easy way is to say, well, I'm going to make each word an observation. Each word is going to be a row, and then I'm going to have variables about that word: which book is it in, which line is it in, which chapter is it in, and what is the word itself. The thing that's neat about this is not that it's the best possible data structure you could come up with for text; it's certainly not. The thing that's great about it is that you can now deploy all of the tools you already have for working with data frames. So it's very easy to go from data like that to plots like this, where I have joined in some information that gives me the sentiment, positive or negative, for each word; I have aggregated it by line within each book to find the average sentiment; and then I have plotted that across all of the books. So you can very easily do things like sentiment analysis, which might seem complicated and strange, because they've been put into a framework that you are already familiar with.
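The reshaping step isn't shown in the transcript. Here is a minimal sketch of the kind of tidying described, using tidyr's current pivot_longer(), which postdates this talk (at the time it would have been gather() plus separate()); the tb data frame and column names like f1524 follow the hints in the talk but are assumptions:

```r
library(tidyr)
library(dplyr)

# Columns like f1524 or m2534 encode sex and age range in their names
tidy_tb <- tb %>%
  pivot_longer(
    cols = -c(country, year),
    names_to = c("sex", "age"),
    names_pattern = "(.)(.+)",   # first letter = sex, the rest = age range
    values_to = "cases"
  )
```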
Now, you might wonder: not everything I work with is a simple thing like a logical or a number or a character; what about complicated things like polygons or linear models? Well, it turns out you can handle these in data frames as well, using something I call a list column. The basic idea: a data frame in R is a list of vectors, where each vector is the same length; that is the contract that a data frame provides. And it turns out that one of the vectors that can live in a data frame is a list, and in R anything can live in a list, which means anything can live inside a list column in a data frame.

Now, working with these in plain data frames is a little tricky; there's some slightly annoying behaviour in data frames that makes this frustrating. So instead of using data frames, this code uses tibbles. Tibbles are basically a modern reimagining of the data frame: they keep the stuff that time has proven to be successful, and they get rid of some of the things that time has proven not to be. In brief, tibbles are data frames that are lazy, they do as little work as possible, and surly, they complain a lot. This seems a little counter-intuitive, that you would want to use something that is lazy and surly, but it turns out to be quite handy. For example, when you try to extract a column that does not exist, tibbles are lazy because they do not do partial name matching, and they are surly because they complain about it. And if you do succeed in extracting a variable, you'll know that it is a character vector, not a factor, because tibbles are lazy and do little work for you: they do not automatically convert character vectors to factors. That conversion was probably absolutely the right decision at the time, but these days factors tend to be more frustrating to work with than character vectors. More importantly, tibbles also allow you to easily put lists inside them, and they have a nicer print method. One of the really nice things tibbles do for you is print the type of every column, so you can instantly tell whether you're working with integers or doubles, characters or factors. And the thing that's important about tibbles and list columns is that they allow you to keep related things together, no matter how complicated they are.

I wanted to show you two little examples of that. The first is the sf package, by Edzer Pebesma and others. sf is a successor to the sp package, designed for working with spatial data, or simple features, data. This is what an sf data frame looks like: you'll see it has a column called geometry, which contains these complicated polygon objects. A polygon is basically just a list of points; well, a polygon is just a list of points, but sometimes when you have a country, the country might have a lake inside it, that lake might have an island inside it, and the island might have a pond, so there are multi-polygons; and you might have multiple islands, like Hawaii, or New Zealand would be a more obvious example. So you can store these complicated objects in a data frame, and you can work with those data frames in the same way you work with any other data frame; here, for example, I've added a few bits and pieces to it. And you can easily plot them using a special geom which automatically draws polygons when you have polygons, points for your points, lines for your lines, and so on.
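A small hedged sketch of the tibble behaviours just described: no partial name matching, characters stay characters, and list columns are easy to make (the column names here are illustrative assumptions):

```r
library(tibble)

df <- tibble(
  x    = c("a", "b"),        # stays a character vector, never a factor
  data = list(1:3, letters)  # a list column: anything can live in here
)

df$xyz   # NULL, with a warning: unknown or uninitialised column `xyz`
df$x     # no partial matching games; you get exactly the column you named
df
#> # A tibble: 2 x 2
#>   x     data
#>   <chr> <list>
#> 1 a     <int [3]>
#> 2 b     <chr [26]>
```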
Another area where I think list columns are a really beautiful fit is cross-validation, because when you do cross-validation you've got a whole bunch of fairly complicated objects floating around. First of all you've got your data; you fit the model to the training data; then you make predictions based on that model and the test data; and then you summarise some measure of model quality, like the root mean square error. If you have a tibble, you can keep all of those things together in one data frame, and that's really great, because often when you do a cross-validation you discover that maybe five of the models had a really, really bad fit, and you'd like to dive in and figure out what went wrong: why is that data different, what happened with those re-samples? Because you have everything in the tibble, you can easily say: filter this, give me all the models that have a bad root mean square error, a very high one, so I can investigate.
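The slide code isn't in the transcript; here is a hedged sketch of cross-validation with list columns, in the style of the modelr and purrr packages of that era (the mtcars model is an illustrative assumption):

```r
library(modelr)
library(purrr)
library(dplyr)

cv <- crossv_mc(mtcars, n = 100) %>%            # 100 random train/test splits
  mutate(
    model = map(train, ~ lm(mpg ~ wt, data = .)),  # fit to each training set
    rmse  = map2_dbl(model, test, rmse)            # assess on each test set
  )

# Everything lives together in one tibble, so investigating the bad fits
# is just a filter
cv %>% filter(rmse > quantile(rmse, 0.95))
```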
So, what I have talked about today is a strategy for solving complex problems by combining simple pieces that have uniform interfaces. The simple pieces are functions; functions are the atom of problem solving in R, and you make them simple by striving to have each function do one thing well (for example, you never want a function that both computes a value and has a side effect), and by making functions that can be understood with as little context as possible (for example, you want your functions to be type stable, so you can easily predict the type of thing they return). You combine those pieces in three ways: assignment, composition, or the pipe. The pipe is not built into R, and it's not a new idea; a bunch of other tools have pipes, particularly the Linux command line, and every other functional programming language has something similar in flavour. Pipes are great because they allow you to omit intermediate variables when you don't care about them, but still read your code left to right, as a series of imperative statements: take this data, then do that, then do something else. And finally, combining these simple pieces is going to be easiest if they have a consistent structure, if they are like Lego and all plug together naturally. I think the most common and useful data structure for that, when you're doing data analysis or data science, is the tidy data frame, where you have variables in the columns and observations, or cases, in the rows. And if you need to store richer data structures, all is not lost: you can use a list column inside a data frame and jam whatever the heck you like inside that.

To finish off: I've talked about these principles, but I haven't really given you any useful tools for doing data analysis or data science, so I'm going to finish with this slide, which displays my vision of the components of the data science process, and some of the packages that myself and others have written to solve those problems. This is called the tidyverse, rather than the other word, which I cannot say myself without sounding colossally entitled. If you'd like to learn more, the best place, if you want to improve your data science skills in R, my take on that, is this book, R for Data Science, with Garrett Grolemund. You can read it online for free, or if you want to buy it, possibly the cheapest way, although I don't know whether O'Reilly ships to New Zealand, is a discount code, ALT-D, which you can use on any O'Reilly order for 40% off. Thank you.

[Audience question:] Thank you. This is a public lecture, so I'll try not to make my question too technical. The way we learned things like R was that you could see what the existing code did, read it, and do likewise, and it seems that pipes, while they make data analysis easier, make that harder. Do you have any views on the trade-off between having more complicated machinery, and machinery with a better user interface?

That is an interesting question. Hmm. I don't know; I think there's almost some kind of inevitable transformation there. Like, 20 years ago you could buy a car and understand enough about the internals that if you broke down, you could fix it; maybe you couldn't build your own car from scratch, but you could at least pull the pieces apart and see how it works. And now if you buy a car, it's like 90% computers, and there's no way to fix it yourself. The tidyverse is definitely a little bit like that. I guess the way I have tried to solve that problem is, first of all, by trying to elucidate these key principles that will eventually unify and underlie much of the tidyverse. A lot of the things I've done are experiments; you can't figure out the best way to do something without trying 15 different ways and finding that 14 of them aren't very good. So I think over time the goal is to keep narrowing down on a small set of underlying principles, and then to explain those principles in a way that is easy for people other than me to understand. One example of that is the tool that powers non-standard evaluation in dplyr and other packages; I think we've finally found the right way to express that. You may have heard talk about lazyeval; it's now called tidy eval. I've now had, like, five epiphanies where I'm like, I finally understand non-standard evaluation, and I am now confident that I really do understand it and will not have any further epiphanies, we'll see. So yes, I agree it is hard to understand; I want to try to figure out what the simplest building blocks are, and then help you understand what those are and how you can combine them yourself.

[James Curran:] It remains for me to thank Hadley. I didn't mention before that Hadley is an honorary associate professor in our department; we're very pleased with this relationship, and we hope that it lasts a very long time. Just a small token of our appreciation. Thank you.