Tēnā koutou katoa, and welcome to the final lecture of the inaugural Ross Ihaka lecture series. My name is Julie Middleton, and I'm Communications Adviser to the Department of Statistics. In that job, I've written a lot about the man sitting over there, who helped turn a single letter into a global sensation. Before he speaks, there are a few things I'd like to share with you.

Ross's heritage is Ngāti Kahungunu ki Wairarapa, Rangitāne, and Pākehā. For Māori, he's an important and rare role model in a field where we have too few Māori academics and students. His determination that R remains open source reflects a core Māori principle of working as a community for the greater good.

To trace the roots of Ross's success, we need to go back in time to his paternal grandfather, Pītipi Ihaka. Pītipi was born in 1861 in Papawai, Wairarapa, and married an English woman, Mildred Baker. They raised their family near Masterton. Pītipi believed in education, and his son, Ross's father, was one of the first from his area to go to high school. Ross discussed this in a 2010 Mana Magazine article, and I quote: "My father described going to get the old man's blessing when you went away to high school. He said, you learn everything those Pākehā have got to teach you, or I'll kick your ass." End quote. Ross will tell you, and some of you may know, that it's a sentiment he likes to pass on to others.

So Ross's father, Jack, was one of the first from his area to go to high school, and he became a teacher. Ross's mum, Edna, was also a teacher, and Ross earned a PhD. As Ross once told me, these are really big steps in just two generations. At this stage, I'd like to acknowledge Ross's ancestors, both Māori and Pākehā, on whose shoulders he stands, and Ross himself. Moe, e tukutotauonga korehere ki te aau, ko te whātemāramatanga te take. Ko tau e takatinana hia peneiti ana. Ehara taku toa i te toa takitahi, engari he toa takitini. Anei aku mihi.
That translates as: those who have gone before, you encouraged your child's pursuit of education. Ross, you gifted your work to the world; your goal was to share knowledge. The legacy you inherited, and what you have passed on, is perhaps best encapsulated in this well-known saying in te reo Māori: success is not the work of one, but the work of many. Thank you. I'll pass it on to Paul Murrell now.

Sorry, I'm a lecturer, so I have to have slides; I don't know what I'm doing otherwise. Right. Ross is so special he gets two introductions. First of all, I want to clarify something, in case there's any confusion and you're thinking this is some sort of crazy coincidence: this is the Ross Ihaka after whom this series was named. So Ross is talking in his own series this evening. This is the big one. Ross is an associate professor in the Department of Statistics here, and he's probably best known for being the father of R. Or actually, one of the fathers of R; R has two dads, the other being Robert Gentleman. We can safely call Ross a father of R.

I want to briefly mention a few of Ross's qualities through the prism of parenthood. The first point is: if you've got kids, you need to put some effort in early on, because when they grow up, they stop listening to you and do whatever they like. And this relates to Ross because this is his parenting manual. When Ross and Robert began work on R, they were looking to build a statistics program for teaching statistics students. And so, of course, the first step they took was to read a book about how to write compilers, which is what this thing essentially is. The significance of that is that when Ross tackles a problem, he tackles it fundamentally and thoroughly. He doesn't just whip out something that sort of solves the first thing; he approaches the general problem.
And in terms of R, its success has relied on the fact that when they started this thing, and it's got rather out of hand since then, they got a lot of things right, because they were doing a thorough job. That's a hallmark of Ross's work, and something we can admire.

Second point. When you have a child, you must let it go. You must give it a chance to be independent, do its own thing, and surprise you with where it goes. I suspect it was a bit of a surprise how R turned out. But again, part of why R has become so big is the way Ross approaches research: when he does a piece of work, he involves other people in that work, he invites them in, and he shares that work with others and licenses it in such a way that everybody can use it however they like, so the impact is as wide as possible. In terms of R, that was a fundamental decision that helped things grow so big.

Right, the third one. Don't brag about it, because that's really obnoxious, so don't do it. And Ross has this completely nailed. You will hear many more criticisms of R from Ross than praise of it. Again, that reflects his approach to work. He's not doing things like R so that people think he's amazing; he's doing it so that people can do a good job, and he's concerned about getting it right.

However, there is one really important difference between R and human children, and that is that when R succeeded, it did reflect on R's parents. You don't tend to see Ed Sheeran's dad going up and getting Grammys, but R has meant that Ross has received a couple of significant awards that I know of: the American Statistical Association awarded him the Statistical Computing and Graphics Award in 2010, and the Royal Society of New Zealand awarded him the Pickering Medal in 2008. So, Ross didn't start R to become rich and famous, which is quite lucky, really.
He started R because he wanted to do a good job, and I hope that he's proud of what it's become. And now, in order to take him down a few notches, I think we should hear from the man himself. So please welcome Ross.

Can everyone hear me at the moment? I don't have much volume to my voice. Can people up the back hear? Yeah. I have the uncanny feeling of finding out what it's like to be at your own funeral while alive: people standing up and saying nice things about you. I'm also going to use some slides.

This is joint work with Brendan McArdle, and I'll have more to say about that later in this talk. As Paul observed, I always like to work in collaboration with people. Working on software, it's actually very important to do that, because it provides an exchange: you can have the roles of somebody who's doing the work and somebody else who acts as a critic, and it just provides good feedback on what you're doing.

I'm a statistician. It's been two weeks since my last analysis. But that's all right, because it was just using R to do my tax return.

In order to explain where I come from, I need to say a little bit about what statistics is, and that means talking about what it's not. Statistics is not about analyzing data. That might seem slightly surprising, and it's something I try to explain to my students: data analysis is to statistics as driving a car is to automotive engineering. Statistics itself is a deeper study than just analyzing a set of numbers; it's all about the methodology. Another way of saying that is that it's all about the tools. Statisticians are tool makers rather than tool users. This changes the emphasis of what I do. As a tool maker, I may produce a better hammer, and from time to time I may encounter a nail, and the thing I do is whack that nail as hard as I can with the hammer. But my instinct, having done that, is not to look at what happened to the nail; it's to take a close look at the hammer.
So I'm interested in studying the tools rather than the uses to which the tools are put, and in this talk I'll be looking at tools much more than what can be done with them. If you're expecting to see nice examples, lots of data, and that sort of thing, you're going to be disappointed, because I'm going to talk about tools for doing things.

The component parts of statistics are, really: statistical theory, which is where statistics came from originally, the use of mathematical methods to study particular processes for analysing information; computational statistics, which is about the algorithms that put into practical play the techniques that the theorists devise; and then, at the end, statistical computing, which provides a kind of container for the methodology that others have developed, so it can be conveyed to people to actually use in practice. I work in statistical computing, and the goal in statistical computing research is to produce better tools. Well, I guess that's true of all of statistics. The primary things that we deal in are computing environments, the actual software that gets things done; algorithms and interfaces, tools we take from other places so that we can use them; and graphics, which is really the third part of this. I've worked on all of these things, but more specifically on computing environments.

If we go far enough back, say the 1960s, computing environments for doing statistics really consisted of forms. You were presented a form, you filled it in, you ticked the boxes; the computer took that, carried out some analysis for you, and printed out huge amounts of paper with the results of that analysis on it. That's all very well, and that sort of analysis has translated into GUI-style software, where instead of filling out the form on the keyboard you're now ticking boxes with the mouse, but it really amounts to the same sort of thing.
In the 1970s and into the 1980s, the focus of computing systems for doing statistics changed to computer languages, and there's a good reason for that: you can think of menus, pointing, and filling in forms as being the communication equivalent of doing this. It's very hard to say very much doing that. But when you have language, you're able to express yourself in a much richer way, and so the reason that languages came along was that they provided a much better way of doing analyses.

Over the years I've stumbled across a number of these languages. When I was a graduate student at Berkeley I came across ISP, which was something that came out of Princeton, developed by David Donoho and a few friends. S is a language developed at Bell Labs by researchers there, John Chambers and his collaborators. Lisa was one of my own that I put together, and it barely saw the light of day. XploRe is a system out of Germany, a commercial system, so it didn't reach perhaps as widely as it could have. Lisp-Stat was a system developed by Luke Tierney, originally at the University of Minnesota. And then there were a few others. MATLAB is one that's less useful for statistics and more for applied mathematics, which I stumbled across at some point; and more recently there's a thing called Julia, which is something in the MATLAB vein but which is looking to improve the performance of these things. Of course there are a lot of other ones; these are just the ones I happened to stumble across.

Now, in order to talk much about computing languages, I'm going to have to take you down the rabbit hole with me. I've got my own particular little interest down there, and it's got a big vocabulary, so we have to learn a few words. I'm going to talk about R here, because it provides a kind of leading brand in statistical computing and it's one that's fairly familiar to a lot of people, so that's the one I'll use. I could equally be talking about other ones as well.
R is an expression-based language. What's an expression? Well, if you look down around here, you'll see that it's like a mathematical expression, something written as a piece of algebra, and that's the way you write commands in R. But that's not what R sees. Internally, R rearranges that input into something that looks like this: a tree. The way this works is that the computer comes along, you've handed it this thing, and it looks at the top element here and says: I know what this is; this is an assignment; something is being set equal to something else. If I go to the left, I find the name being given to something, a, and if I go to the right, I find the thing that's going to be the value associated with a. You work your way down through this tree, evaluating bits and pieces along the way; here I would see that this is an addition, and I'm going to add b to c. So this form here is immediately translated into this, and the software actually traverses that tree and works out what its value is. That makes R, at least conceptually, what's called a tree-walking interpreter. The significance of that is that the meaning of things can be derived by interpreting in this way; to work out the answers, of course, you're free to do that in any equivalent way you like, but we'll think of R as a tree-walking interpreter working on an expression-based language.

So let's look at some examples. Big, complex examples, as you can see: we're going to add one and one, and then we're going to add one plus a more complex expression, which is one plus one in parentheses, and then another one as well. The answers you get here are as you'd expect: the first works out to be two, and the next works out to be four. You evaluate things left to right, although the order can be overridden by parentheses. If you've done some arithmetic, you'll know about this sort of stuff.
Here's a more interesting expression; this is the kind of thing I like to play with. The little quotes here are because these names are special: plus expects to see something on one side and something on the other, indicating that you're adding those two things together. But here we are assigning a new value to plus, and in fact the value that we're giving to plus is minus. That changes the results we can expect: now if we add one and one we get zero, and if we add one plus one plus one plus one, we also get zero. There are a few of us who like to play these games and see what's going on with these things.

Now let's look at a more complicated example. This one's a bit trickier. It's basically like the one plus, one plus one in parentheses, plus one example, but at the beginning, on the far left there, I've changed the value of plus to minus. So, in the spirit of Hadley Wickham's one-minute quizzes that we had a few weeks ago, I'd like you to take a minute to guess what the answer to that might be. I'll give you a minute to think about it. Time's up; let me show you the answer. To understand why that is, we need to look at the tree. The way this works is we start at the top and we ask: what is this thing at the top? It's plus. What does plus mean? It means addition. So we've figured that part out: that's regular addition. Now we go and look at the arguments, to figure out what the things are that we're adding together. We go to the left, and we find that there's another plus there. What's the meaning of plus?
It's addition, and we recursively go down. Now we go to the left and we work out this particular expression, which sets plus to minus and then returns the value one. So we've determined two pluses as being addition, then we've changed the value of plus to minus, and then we work out this part. These ones here are actually involved in a subtraction, one minus one. So the one plus one comes from this one and that one, giving us two, and these ones cancel. These are fun games you can play with computer languages.

I'm going to use this to introduce the idea of dynamic languages. Computer languages come in a variety of forms. One particular form is called dynamic, and in dynamic languages things are allowed to change. In particular, you can change the values associated with names, the way we changed plus to minus in that last example. In other situations you can change the types associated with things, so a variable can change: sometimes it's a real number, sometimes it's a character string, sometimes something else. In contrast to that, there are static languages, and in static languages the changes you can make are much more restricted. The values you associate with a name may be limited in type, so you might say this variable x can only contain real numbers, and if you try to assign something else to it, you'll get an error message. At the more extreme end of the scale there are very static languages where, once you've assigned a value to a name, you find you can't change it any more; these are called static single assignment languages.

R is a very dynamic language; it's at the extreme dynamic end of things. You can change pretty much anything at any time, and we saw that in the example: halfway through evaluating the expression, plus changed into minus. Here we're just looking at the things that can be assigned to a particular variable. It can be a number, it can be a character string, it can be a function, it can be an evaluation environment, which is like a little dictionary in which you can look up the values attached to names.
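The tree-walking-plus-mutable-environment behaviour described above can be sketched in a few lines of Python. This is a toy model, not R's actual evaluator: expressions are nested tuples, and each operator's meaning is looked up by name, at evaluation time, in a mutable table.

```python
# A toy tree-walking interpreter. Expressions are nested tuples, and
# operators are looked up by name in a mutable environment, the way a
# dynamic language looks up `+` every time it is needed.
env = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
}

def evaluate(expr):
    """Walk the expression tree, looking each operator up as we reach it."""
    if isinstance(expr, (int, float)):
        return expr                  # a leaf: just a number
    op, left, right = expr           # an interior node: (operator, lhs, rhs)
    f = env[op]                      # look the meaning up *now*, not earlier
    return f(evaluate(left), evaluate(right))

print(evaluate(("+", 1, ("+", ("+", 1, 1), 1))))   # 1 + ((1 + 1) + 1) -> 4

# Because the environment is mutable, rebinding "+" changes what later
# evaluations mean: the dynamic-language effect in the quiz example.
env["+"] = env["-"]
print(evaluate(("+", 1, 1)))                        # "addition" now gives 0
```

Because the lookup happens at every node visit, rebinding plus partway through a session changes the meaning of everything evaluated afterwards, which is exactly the trick in the quiz.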
In this case, a is an entry in its own dictionary, which makes things even more interesting. And you can change things by making them go away completely: you can remove variables. Sometimes that's useful, sometimes it's not. The result of all this is that you can make these changes, but the effect of the changes may not be that obvious; we saw that in that little addition example, halfway through evaluating something. The consequence is that every time you want to know the value of something, you have to look it up. You can't assume it's the same as it was a minute ago; it could have changed. And R can do very extreme things to you: if you call a function, that function can reach up and change any of the variables in what's going on around it. You have to watch out for these functions, because they can do very nasty things to you.

One of the things about dynamic languages is that they tend to be slow, and they're slow because they're always looking up the values of things. If I had to give an analogy: imagine you were reading a book and you had a special kind of dictionary, and the dictionary told you the meanings of words, but these were constantly changing, so every time you encountered a word you had to look its meaning up in the dictionary. You can imagine how slow that process would be.

So computer languages exist on a spectrum, going from the dynamic at one end to the very static at the other. The R language sits at the extreme left of this picture; it's a very dynamic language. That means it's very flexible, because you can change things in pretty much any way you like; on the other hand, it's very slow, because you are constantly looking things up. There are other reasons why R is slow as well, and we'll touch on those in a bit. The work I'm going to talk about tonight is about moving where the language sits, from the extreme dynamic end of the spectrum slightly more towards the static. We're just going to make things a little bit more predictable, and hopefully a bit faster as well.
Part of the reason that R programs run slowly is that R is very dynamic, but that's not the only reason. Another reason is that R is implemented as a tree-walking interpreter, so it is actually holding a tree in memory and walking around the tree, working out what the various pieces of it evaluate to. Both of those things introduce minor amounts of slowness. What's more of a problem is the fact that R is call by value. This says that if you pass an aggregate thing into a function to operate on, say a function that does some sort of statistical analysis, that function is not allowed to change the values of the things passed in. The only way to ensure that, in general, is to make copies, and that's the problem with R: you're generally not operating on the originals, you're making copies and operating on the copies. Now, that's an accidental choice, I think. It goes back to where the S language came from. It was developed when computers were much smaller; all the data sets sat out on disk, and when you needed to do anything with them, you copied them in and worked on them there. That was built into the language from that point on, and R inherited it.

Another thing that makes R slow is that it doesn't have genuine scalars. When you type something like 1, that doesn't mean the number 1; it's a container, a vector, containing that number 1. So when you do calculations, you are never operating on the numbers; you're always operating on boxes containing the numbers, and you always have to reach in, take the numbers out, do whatever you want, make another box, and put the result into that box. There's an awful lot of this boxing and unboxing of values going on, and if you want to do something that is purely a scalar computation on numbers, that makes things very slow.

So what we're going to do is look at a new language which is kind of like R.
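The cost of call by value can be illustrated outside R. Here is a sketch in Python, whose lists are normally passed by reference: `scale_copy` imitates R's semantics by deep-copying its argument before touching it, while `scale_inplace` shows the reference-semantics alternative. The function names are made up for this illustration.

```python
import copy

def scale_copy(v, a):
    """R-style call by value: work on a copy, the caller's data is untouched."""
    v = copy.deepcopy(v)        # the copy R would make behind the scenes
    for i in range(len(v)):
        v[i] *= a
    return v

def scale_inplace(v, a):
    """Reference semantics: mutate the original, no copy is ever made."""
    for i in range(len(v)):
        v[i] *= a
    return v

x = [1.0, 2.0, 3.0]
y = scale_copy(x, 10)     # x is still [1.0, 2.0, 3.0]; y holds the scaled copy
scale_inplace(x, 10)      # x itself is now [10.0, 20.0, 30.0]
```

For a three-element vector the copy is trivial; for a design matrix copied six or seven times, as described below, it dominates the cost of the computation.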
It's in the same space; it does the same kinds of things. As you might notice, this language is called "B", in quotes. The quotes are important, because there was already a language called B, which followed on from BCPL and then became C. The quotes here stand for Brendan, because this is a computer language that Brendan has been developing for his PhD. The idea is that we want to make something which is like R, so you can do the same kinds of things, but which is a compiled language. Compiled means that instead of walking the tree, you can reduce it down to simple instructions that can be executed very quickly. We want to keep it flexible enough for interactive use, so you can use it the same way as R and you're not constrained in what you can do with it, but we want to move from the extreme end of things, where you can change anything at any time, slightly more towards the static: to bring a little bit of order to the chaotic world you have in R.

So, for example, you can declare the types of things; you can give the computer information about what is contained in a particular variable. I can say this is going to be a real number, and the computer will be able to assume that that is always the case, that it's not going to get changed from under it. When you give this information, things will be faster, because the computer can make assumptions about what things are and operate on them directly, rather than having to carry out checks. Another feature is to add scalars, individual numbers as opposed to collections of numbers in a vector, and this will help make calculations fast. And finally, we're going to make aggregate objects passed by reference, not by value. That means that when I pass something into a function that's going to operate on it, the function gets the original and works on the original; there's no copying involved. That will eliminate an awful lot of work which is often unnecessary in R.
Here's the full list of things that were changed in making this new language. We're going to eliminate first-class environments, those little dictionaries that you look up values in; that will simplify a lot of things. We'll eliminate a feature of R called lazy evaluation, where expressions are not actually evaluated until the value is needed, at least for function arguments. We'll eliminate the ability to evaluate things on the fly, non-standard evaluation. We'll use pass by reference, as I mentioned; scalar values; more static semantics; and explicit typing, so we can declare the types of things. And the language is compiled.

A few years ago I handed Brendan this project and said, in the parlance of Star Trek, make it so, and that's what he's been working on. I might say at this point that you're going to see me gliding serenely over the surface of the language like a swan; what you're not going to see is Brendan's feet underneath the surface, paddling furiously to make it work. From my point of view this is easy stuff; from his point of view, not so much.

So let's have a look at a function in this language. Well, first of all, let's take the R version. This is how you would type this particular function in R. What it does is not that important: it computes factorials, and it's a recursive implementation of factorials. You would type it this way, with one proviso: a lot of silly people think assignment is indicated by less-than minus. You're wrong. John Chambers put the equals operator into R a long time ago, and I'd rather type one symbol instead of two; and there are other technical reasons why you should use equals instead. So that's an R function. Now let's look at the same function in the new language. Spot the difference: it's the same. We're not aiming to make huge changes to what people see, although this is a very carefully chosen example; it doesn't have any local variables, so you don't see any declarations. But we can give information to the computer about what's going on here.
If I want to, I can specify the types of things. I can say that the argument to the function, n, has to be an integer; and not only that, the function itself returns an integer, because we know that factorials take integer values. So this is the type of the function, and this is the type of the argument n coming in, just to indicate that there is potentially some difference here; it's not identical.

In order to talk about the next thing, which is declarations, you need to know a little about how R works. Variables in R are created by assignment. I can say a equals 10; before I said that, there was no a, it just wasn't there. When I assigned a value to it, it came into existence as a variable. Now, that doesn't happen in many computer languages. Usually the act of creating a variable, somewhere to store things or a name for things, is separated from the process of giving it a value; those are two separate things. The fact that these two things are combined in R produces some very strange results.

Here's another little piece of R code. I assign the value 10 to x, and then I have a function, and inside the function, if a random number is bigger than a half, so half the time, x gets set equal to 20, and then the function returns the value of x. The question is: what is x? Is it a global variable or is it a local variable? The answer is: it's random. Half the time it's global and half the time it's local, and you don't know which. That's not a nice feature to have in a computer language; it makes it unpredictable. I can't assume that I always go to the same place to get the value of x: sometimes it's up here and sometimes it's down there, and I have to figure out which of the two it is. This is particularly problematic for compilation, because there you need much more regularity in things. So, to see how we can declare variables, let's go back to R.
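For contrast, here is how a language that resolves scope statically treats the same shape of code. Python decides at compile time that a variable assigned anywhere in a function is local everywhere in that function, so the ambiguity shows up as a loud error rather than a silent coin flip. This is a hypothetical sketch mirroring the R example, not anything from the talk's slides.

```python
import random

x = 10   # a global, as in the R example

def f():
    # In R, whether the `x` below is local or global is decided at run time:
    # half the time the assignment creates a local x, half the time the
    # lookup finds the global. Python decides statically: because x is
    # assigned *somewhere* in this function, it is local *everywhere* in it,
    # and reading it before assignment raises UnboundLocalError.
    if random.random() > 0.5:
        x = 20
    return x

results = set()
for _ in range(1000):
    try:
        results.add(f())
    except UnboundLocalError:
        results.add("unbound")

# Only 20 or "unbound" can occur; the global 10 is never silently picked up.
print(results)
```

Neither behaviour is ideal, but the static choice is at least predictable, which is what a compiler needs.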
Here is another function declared in R; again, what it does doesn't really matter that much. What it's doing is evaluating the exponential function by summing up the terms of its power series, so it's just a straight mathematical computation. In R, this is how I would do it. I've got variables n, term, oldsum, and newsum, and I know that they're variables because they're assigned values here; they are local variables within this function. In the new language we actually have to declare the fact that these are local, so in the new language it would look like this. All we've really done here is change a few characters: semicolons have become commas, and we've got a var at the front of this list of assignments. What I'm trying to convince you of here is that this is fairly lightweight; we're making fairly small, apparently superficial changes. We can do more, though. We can give more information to the computer; we can say what these things are. Some of these things, well, all of them in this case, are actually double-precision floating-point numbers, numbers that we can compute on. Now that the computer knows that, it can make assumptions about how they are to be manipulated. We can also figure out that things like plus can't change, because they are constant; we're not allowed to reassign those things. So lots of assumptions can be made, and things can hopefully be sped up. And I guess the big news today is that it actually works: Brendan tells me that he's getting performance increases of about a factor of 4 to 10 over compiled R code, I think that's right, so that's something like 16 to 100 times faster than ordinary R code. So we're getting somewhere with this.

Another thing to worry about in using R is that there's an enormous amount of copying that goes on. Again, you pass an array into something and it's copied before it's operated on. Most people don't actually see this, but when they do find out what's going on, it's fairly terrifying. Here's a very simple example, one that would come up in fitting a regression model to some data: straight lines, analysis of variance, that sort of thing.
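The kind of function being described can be transcribed into Python as a sketch. The variable names n, term, oldsum, and newsum follow the R version as described; `my_exp` is a made-up name. The idea: keep adding power-series terms until the sum stops changing in floating point.

```python
import math

def my_exp(x):
    """Sum the power series exp(x) = sum of x^n / n! until it converges."""
    n = 0.0
    term = 1.0          # the n = 0 term, x^0 / 0!
    oldsum = 0.0
    newsum = 1.0
    while newsum != oldsum:      # stop when adding a term changes nothing
        oldsum = newsum
        n += 1.0
        term = term * x / n      # get x^n / n! from the previous term
        newsum = oldsum + term
    return newsum

print(my_exp(1.0))               # approximately e = 2.71828...
```

Every variable here is a double-precision scalar, which is exactly the information the declarations in the new language let you hand to the compiler.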
What we're doing here is taking the columns of a matrix. I take the jth column out of a matrix, a big thing that looks like this, and I subtract off a multiple of another column. More is going on here than you would suspect. First of all, I can't operate directly on X; I'm having to pull out a piece of it. This pulls a column out of X, and this pulls a column out of X. Now we do some arithmetic: multiplying this column by a constant produces a new column, and then subtracting these things produces another new column. So we've now got four copies of this one column from the matrix, and you'd be doing this sort of thing all the time: pulling out pieces of things, using them for just a second, and then dropping them on the floor. In this case you would be getting four copies of the column when you actually don't need any of them. Now, that's not strictly true in this case, because you can recognise that once you've used this particular column you don't need it any more, so you can recycle it: you could write a times the values back into it, and you could do the same thing for the subtraction. But there's no way of eliminating the initial copies of the first two columns.

One thing we'd like to be able to do in the new language is to get rid of these array computations and replace them with scalar computations. Rather than pulling out big chunks of things and operating on big chunks, which produces lots of these garbage arrays, we're actually going to use a loop and do it element by element. With all the correct declarations in place, this shouldn't produce any garbage; there isn't any copying going on, and we can use machine registers and things like that to keep things very fast. Now, whether or not this is going to be a good thing to do, we're not sure, because operating on big chunks of information, like whole columns, is a very fast thing to do.
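The two styles can be sketched in Python, with a matrix stored as a list of rows (the helper names are made up for this illustration). The "vectorised" version materialises a fresh temporary list at every step, the way column arithmetic does in R, while the scalar loop updates each element in place and allocates nothing.

```python
def update_column_vectorised(X, j, k, a):
    """Mimics  X[, j] <- X[, j] - a * X[, k]  done with whole columns:
    each step materialises a fresh temporary list (the 'garbage')."""
    col_j = [row[j] for row in X]        # copy 1: extract column j
    col_k = [row[k] for row in X]        # copy 2: extract column k
    scaled = [a * v for v in col_k]      # copy 3: a * X[, k]
    result = [u - v for u, v in zip(col_j, scaled)]   # copy 4: the difference
    for row, v in zip(X, result):
        row[j] = v

def update_column_scalar(X, j, k, a):
    """The element-by-element loop: same answer, no temporaries at all."""
    for row in X:
        row[j] = row[j] - a * row[k]

X = [[1.0, 2.0],
     [3.0, 4.0]]
update_column_scalar(X, 0, 1, 0.5)       # column 0 becomes [0.0, 1.0]
```

Which version wins in practice depends on exactly the trade-off described: bulk column operations are fast per element, but pay for all that allocation.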
Allocating stuff and throwing it away all the time has a cost, though, so there's a balance to be struck here, and that's one of the interesting questions: what are the right idioms of the language to use?

For people who know a little bit about R, I'll go back to something that Hadley was talking about a few weeks ago: the question of whether or not to use drop when accessing array elements or chunks of arrays. There's an ambiguity in R. If you say, let's get the element at a particular position in an array, in this case a matrix, does that mean you want the number there, or does it mean you want a little matrix containing that number? Now, in R you can't get the numbers out; everything always exists in a container. So you have to say which of these two things you want, and the way you say it is by saying drop the array attributes, or keep them. The first one gives you the actual element in there, and the second one gives you a square thing which is one by one, containing that number. Now, this is actually relatively expensive, because optional arguments have to be matched: when you do something like this, even if the drop isn't there, you still have to go through the matching process, and that makes this kind of operation fairly expensive. It would be nice if you didn't have to worry about that. So in the B language this doesn't happen: if I use scalar subscripts like this, that says give me the scalar element; if I use vector subscripts like these ones, that says give me the array containing it. And it's actually a lot more flexible than that.

Oh, I have to mention one thing that drives me nuts when students come to see me, which is when they write c(1). Because what is that? It's a vector containing the number 1. But what is 1? 1 is a vector containing the number 1. So all you've done is unnecessary work in writing it, though you can understand exactly what they mean; those things are different in intent.
Now we have a much more flexible way of extracting pieces of information out of these matrices: I can get out sub-arrays, or plain vectors, or scalars. There's no additional argument here, so there's no speed penalty, and that's really nice. It's conceptually elegant, it's very flexible, I can do all kinds of stuff, and it's fast as well.

Other language things we have to consider: let's talk about call by reference. This is again about the copying thing, in this case more about function calls. The costs of this are enormous. When you fit a regression model, you'll find that the design matrix, which you don't actually get to see but which is a hugely expanded version of the data, gets copied something like six or seven times. So you've got this huge amount of data being copied an enormous number of times, and it's really unnecessary. The only reason it happens is that R insists on copying things even when it's really not necessary.

The biggest place where this is a problem is data frames. A data frame is just the statistical model of what data looks like: a rectangular array with a list of variables across the top and the individual cases as the rows. Another name for this is a spreadsheet. So it's a fairly fundamental data structure, but these things get copied like you wouldn't believe, in even the simplest computations. If I set a single element of a data frame to a value, that duplicates the entire data frame twice. So to indicate that something's missing, you have to copy everything twice in the process. The problem is that the code that manipulates these things is written in R, and R copies its brains out doing it. If you could somehow go down to a deeper level, the way it's done for matrices and vectors and arrays, you wouldn't incur this overhead. If somebody were willing to write the whole of the data frame functionality in C, everything would be fine, but nobody wants to do that because it's kind of ugly.
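The copy-on-modify behaviour is ordinary R semantics and can be seen directly. A small sketch (the function name `blank_first` is made up for illustration):

```r
d <- data.frame(x = 1:3, y = 4:6)

# Value semantics: the function conceptually gets a copy, so the
# caller's data frame is untouched, at the cost of all that duplication.
blank_first <- function(df) {
  df[1, 1] <- NA   # marking one cell missing duplicates the whole frame
  df
}
d2 <- blank_first(d)

d$x[1]    # still 1: the original survives
d2$x[1]   # NA in the returned copy

# tracemem(d); d[1, 1] <- NA   # uncomment in a standard R build to
#                              # watch the duplications being reported
```

Under reference semantics, blank_first would modify d itself and no copies would be made.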
By contrast, B uses reference semantics, which means you never pass copies; you always pass the original thing and make your changes in it, and you get the results of the computations with a lot less effort. In the new language that copying isn't going to happen, because everything is passed by reference. But there are some problems to do with the dynamic versus static aspects of languages. Again assuming a little familiarity with R: here is a computation on a data frame. Remember, this is a spreadsheet you're doing the calculation on, and in that spreadsheet you're going to operate on a couple of columns to produce a new column. Now, where do these variables come from? Do they come from the data frame, or are they the ones that I defined up here? The problem in the new language is that this has to be compiled, which means it has to be understood ahead of time, before you actually apply it to any data, before you've even read the data frame, so you don't know whether it's got an x or a y in it. There are some tricks you have to play to make this work. We're starting to run out of time, so I won't bore you with the details too much, but essentially you turn the expressions into a function and then make the values you need arguments of that function. These are the kinds of fun games we enjoy playing.

Okay. Given that we're celebrating my imminent death here, I thought I might make a few general comments about statistical computing research, basically just to get them off my chest. The first is that we need new statistical computing software. Nothing is ever perfect, and boy, is that true of R. It has had an awful lot of success, but people are beginning to notice that it's slow and that it doesn't handle vast data sets very well. To some extent we've been bailed out by Moore's law: computers have been getting so much faster that people haven't noticed how slow it is. This year's machine is four times faster than last year's, so everything appears to be going faster.
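In R the scoping question shows up with functions like transform() or with(): names resolve to the data frame's columns first, then to the surrounding environment. The expression-to-function trick can be mimicked by hand in R itself; the variable names below are illustrative, not from the talk.

```r
x <- 100                          # a variable "defined up here"
d <- data.frame(x = 1:3, y = 4:6)

# In transform(), x and y mean the columns, not the outer x = 100.
transform(d, z = x + y)           # z is 5, 7, 9

# The static-compilation trick, done by hand: turn the expression
# x + y into a function whose arguments are the names it needs...
f <- function(x, y) x + y

# ...and supply the columns only when the data actually arrives.
d$z <- f(d$x, d$y)
```

A compiler can build f without ever having seen the data frame, which is exactly what a statically compiled language needs.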
But if you actually handle big data sets, you're starting to notice it. Of course you can enhance the thing, and that's been happening, but at the risk of offending lots of people, let me say: it doesn't matter how much lipstick you put on it, it's still a pig. I guess this could be interpreted as modesty, but it's a little more critical than that.

The second thing, which may not be so obvious, is that I think we need a multiplicity of these things. We don't just need one computer system that everybody uses; as an academic, I think we need lots of them. We need the competition, we need the cooperation, and we need new ideas coming on stream. At the moment R has pretty much won the battle, and some very good computer systems for doing statistics have perished as a result, which I think is a very, very sad thing. The ones I'd mention in particular are the S system, which has essentially disappeared because R has eaten its cake; Lisp-Stat, produced by Luke Tierney, which had some very, very interesting things in it; and another system, Omegahat, by Duncan Temple Lang, which again perished because R took over the space. And that's unfortunate.

So why don't we just hand it over to the computer scientists? They know about this stuff; we're just rank amateurs. Well, the problem is that they mess it up, because they don't do statistics and they don't know what these things are used for. We are the domain experts, and we can offer advice on that. Now, I often try to do this: I'll accost a computer scientist and explain the problem to them, and they'll run screaming from the room, because, like many academics, what they want is little bite-sized problems they can solve, write a paper on, get out there, and move on to the next thing. These problems, the problems of developing these systems, make up a big, hard task to take on.
Development of these kinds of systems is hard. R itself took, I think, nearly 20 years before it suddenly became an overnight success, and there were a lot of people working very hard during that time. So there's an awful lot of work to do. At the beginning you don't need all those hands, but you need at least a couple of people working on these things, because on your own the problem is just too big: you're too focused on what you're doing, and you don't see the big picture because you're concentrating so hard on the development of little things.

Despite this, working on these new systems is fun. Or at least, for somebody like me it's fun; other people's mileage may differ. But if you're the sort of person who gets involved in this stuff, it's actually fun and it's fulfilling. If I had to describe it, it's like having your own playground. You have a playground called R, and you walk around it for a decade and get to know it pretty well: there's some broken glass in this pile over here, there are some big spiky trees over there, and you know to avoid those things; there's a deep hole over here, and you know you have to step over it. It all gets very comfortable and routine. But then one day you find there's actually a little gate at the back of this playground, and you can walk through it, and there's a new playground, and suddenly the fun is back. There are new things to try and new things to do. You cast around: there are precipices over there, and you should probably avoid those big unbalanced rocks over here that will fall on you, but you get used to that kind of thing, and it's fun.

And finally, the thing we really need is for people to get involved in this. As I said, I've tried to involve computer scientists from time to time, and it's an unfamiliar problem for them; they don't really know how it's going to get used, so they feel a little insecure about it, even though they have all the skills to take this kind of work on.
It would be nice to involve them, but we also need our own people: we need people in statistics to be involved in this kind of thing. One of the unfortunate things is that a lot of us got interested in this field at about the same time, back in the 70s and 80s I guess, and we're fast approaching our use-by dates. There are far fewer young people coming through, and fewer people who have the experience of building these systems from scratch. There are people who will come in and work on a project like R and contribute, but they don't have the experience of building, and we need that for these new systems. There are one or two exceptions, and I'd like to point out my collaborator on this, my PhD student Brendan McCartle, who will be finishing soon and will be in the market for a job. I can see his father applauding, so let me just hang that out there. The skills are available if you're interested. I'm not expecting anyone in this room to put up their hand, but maybe there's a wider audience we're streaming to who may have something to contribute.

Anyway, this has been a brief look at research in statistical computing: the kinds of things that I do, that Brendan does, and that a few of the people we work with are doing at the moment. Hopefully it's been enlightening, and perhaps a little challenging for some people, but there it is.

Hey, you took up a ton at the start. Sorry, it's my fault.