Yeah, people in the house. Teaching and learning from machines has been incredibly fascinating, and I'm guessing everybody in the room shares that fascination. It's not exactly the same as teaching human beings, so I'm learning quite a bit about how to teach machines. It's just fascinating, especially as a geek, because you speak to machines as if they're supposed to be humans, but I'm not sure that's true. Anyway, enough of me rambling on. We're going to hear a little bit about open source R, which in Singapore we pronounce as "R lah", a Singlish way of saying R. A single letter that stands on its own is a little hard to pronounce, but yeah, it communicates. We have Dr. Graham Williams from Microsoft. He's a data scientist, I think that's what we seem to say, but in addition he's got three decades' worth of experience in this area, and he's written a number of books dealing with the data science and statistical things that relate to R, if I'm not mistaken. So let's hear from Dr. Williams.

Thank you very much. How was that for you? It sounds rather loud; hopefully it's okay. So thank you and welcome, and thank you for joining us this afternoon and for sticking around. I want to do three things this afternoon, and we've got a pretty short time in which to do it. I do want to encourage you, if you've got your computer in front of you, feel free to load up R if you don't already have it, and even load up Rattle. Whether you've got a Windows machine, a Linux machine or a Mac, the suggestion is to go to the appropriate website, togaware.com, if I have spelled that right without getting caps lock stuck on. You'll find there some instructions on installing R and eventually Rattle. I'm not always sure we're getting a network connection here, but anyhow: togaware.com, feel free to have a look at that.

Three things I want to do. I want to give a bit of a view on AI and machine learning, in particular decision trees and ensembles; a bit of a theme is that everything we do in machine learning these days is around ensembles. Then I want to quickly look at open source R and the suite of tools available in open source R that support machine learning. Most of the machine learning algorithms we have are available in R. Even if they're implemented in other frameworks such as Weka, Weka being a popular Java-based machine learning framework from the computer science side of the world, all of those tools are also available within R, as well as TensorFlow, as we've seen, and MXNet, which holds a whole suite of algorithms. I'll give a quick demo of using those algorithms in R. And I'll finish with the issue that we have with R around, essentially, elastic data science. As we know, data science is about analysing very large collections of data, and R, like many machine learning tools, is memory based. How do we get over that major and significant hurdle? I'll introduce a couple of things around that.

So I want to start off with machine learning. Now, I wasn't quite sure who the audience was going to be here, and we're at an open source conference where the major theme is machine learning and AI. Who knows what a decision tree induction algorithm is? Okay, in a way I'm pleased to see it's only a small number.
So I hope I don't bore that small number too much, but I wanted to convey that we often look at machine learning, and AI in general, and think of it as a bit of a mystery. It works magic; we don't really understand what's happening. And I was curious yesterday, I'm not sure if Andrew is in the room, but Andrew from Google made a similar kind of comment in the panel yesterday: when he learned compiler construction and implemented his first compiler, suddenly the magic of taking a program and converting it in some way into something that the computer would run turned into reality, and he understood what was going on. Often we hear people dismiss, or not delve into, getting an understanding of what is actually happening in machine learning and AI. In fact we often imagine that there's quite significant depth in the algorithms we're running, and yet they can be explained quite simply.

Neural networks, the basis of deep learning, have been around since the 1950s. We characterise the current era in machine learning as one where the computing power available to us and the sheer amount of data we have today make it just the right time for neural network technology to really advance significantly, to do the things we've seen today with CNTK earlier this morning, or with TensorFlow. The types of analysis we can now do, translating Skype calls from one language to another live as you are communicating with somebody, the live demonstrations that we saw, or the videos identifying objects in images and tracking those objects, are all now possible because of the massive amount of computing power and data we have available.

But machine learning algorithms have been around for a very long time. I've been in the machine learning space since the 1980s, when I was doing my PhD. The primary algorithms for machine learning that we used at that time are called decision trees. And whilst today we talk a lot about deep learning, the whole suite of other machine learning algorithms that we have available is still very, very important in business. Deep learning algorithms are fantastic where you've got a massive amount of numeric-oriented data; they work exceptionally well with images, audio and so on, but there's still so much opportunity for the whole suite of machine learning algorithms.

I wanted to give a very quick introduction to what a decision tree induction algorithm is. The most widely used machine learning algorithm in the world today is still the decision tree induction algorithm, and one of the key things about this algorithm is that it gives you knowledge. Being an AI researcher from the 1980s, our focus has often been around extracting knowledge from the analysis of data. It's very difficult to extract knowledge from very deep neural networks, but decision trees and other algorithms like them give us insight into knowledge. Like all machine learning algorithms, there are three particular characteristics from an AI perspective: how do we represent knowledge; how do we search through the space of all possible models that we might build; and how do we measure the goodness of a model once we have built it, using heuristics to work through that search space to find the model that best matches the data we have available? Every machine learning algorithm is essentially doing that: searching through a massive space to build a sentence in a particular language that represents the data in some form. That sentence is a model.
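To make the "measure" part concrete, here is a minimal sketch in R of the Gini impurity that decision tree induction commonly uses to score candidate splits. The talk doesn't name a particular measure, so treat this as illustrative; the function names are invented for the example:

```r
# Gini impurity of a set of class labels: 0 = pure, higher = more mixed.
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Score a candidate split as the size-weighted impurity of the two halves;
# the induction algorithm greedily picks the split with the lowest score.
split_score <- function(labels, condition) {
  left  <- labels[condition]
  right <- labels[!condition]
  (length(left) * gini(left) + length(right) * gini(right)) / length(labels)
}
```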
Decision tree induction is quite a simple example of building a model from data, and it can be described really easily. Let's have a look at the room in front of us here, and I'm going to make some assumptions; it's not quite true, but let's say half the people in the room are wearing glasses. Actually, that's probably about right, about half the people are wearing glasses. What are the characteristics, looking at the room here, of the people wearing glasses versus the people not wearing glasses? The characteristics I'm going to invent don't necessarily reflect reality, but let's imagine they did. Now, if half the people in the room were wearing glasses and half weren't, when the next person walks in through the door, what would be our guess as to whether or not they would be wearing glasses, based on the model I have from the people in front of me here? 50-50. I don't really know; it's almost a random guess, because based on this population it's 50-50. If two people walk in the door, maybe one will and one won't.

Let's now split the room into two, and just for simplicity let's say I split it with all the females here and all the males over here. I look at the percentage of people wearing glasses amongst the males, and, making up a number, let's say 20% of the males were wearing glasses, and maybe 80% of the females are wearing glasses. The next person walks in the door; what do you think I might do to try and predict whether or not they are wearing glasses? Okay, I'll test gender. Are they male? Then I'll guess that they won't be wearing glasses, and with these figures I'd be right about 80% of the time; if they're female, I'd guess the other way around. An interesting point: no model we build will ever be perfectly accurate, unless we had perfect data, and who has perfect data? It's an approximation of the real world. Now maybe it wasn't 80%; maybe 70 or 60% of women wore glasses. I might then look for another characteristic to partition the females into two groups, maybe tall and short. Tall females, again just randomly picking these numbers, are 80% likely to be wearing glasses; short females are 20% likely. Hence, when the next person walks in the door, if they're female we'll also check whether they are tall or short, and then make our decision based on that.

With that model that I've built, you can kind of get a sense that it's a decision tree; it's a tree in structure. That's the language in which we're expressing the model, and it's a model I can now use to make decisions for me. It's never going to be perfectly accurate, but it's going to be pretty accurate; we can often get usable accuracy from these models. That's one decision tree, and these are used widely. When you apply for a loan at the bank, they are using these kinds of models to decide whether to give you that loan or not, predicting whether you're going to repay or default on your loan; or they're looking at transactions, using models like this to predict whether they're going to be fraudulent transactions or not. Quite simple. Now, underneath there is some more complex mathematics underpinning what's happening, but at a high level that's as simple as it gets in terms of using machine learning to build models of the world.
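As a minimal sketch of that glasses example in R, using the rpart package on an invented data frame (the numbers below simply bake in the made-up 80/20 pattern from the talk):

```r
library(rpart)

# Invent a toy room: gender and height are random, and glasses follow
# the made-up pattern from the talk (tall females mostly wear glasses).
set.seed(42)
n <- 200
room <- data.frame(
  gender = factor(sample(c("male", "female"), n, replace = TRUE)),
  height = factor(sample(c("tall", "short"), n, replace = TRUE))
)
p <- ifelse(room$gender == "female" & room$height == "tall", 0.8, 0.2)
room$glasses <- factor(ifelse(runif(n) < p, "yes", "no"))

# Induce a decision tree: glasses as a function of gender and height.
fit <- rpart(glasses ~ gender + height, data = room, method = "class")
print(fit)

# Predict for the next person who walks in the door.
newcomer <- data.frame(
  gender = factor("female", levels = levels(room$gender)),
  height = factor("tall",   levels = levels(room$height))
)
predict(fit, newcomer, type = "class")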
In the early days we only built one model; we aimed to build the very, very best model. Hmm, I think it's not gender after all; boy, that model's not working well. Obviously the males here are much more likely to be the ones wearing glasses; I'd have got about 80% accuracy just watching people walk in. But it's all based on what I'm observing here, and maybe this room is not representative of people out there.

Building the one best model, we soon found, wasn't the best way to go. When you have a group of people together, each of them has their different views on the sorts of things to look at. Maybe gender wasn't the best thing to choose; maybe it was age, maybe there's some physiological aspect to that as well, so we should look at age early on. So maybe we have another model that looks at age first, and another model that looks at shirt colour, shoe size, whatever. We might have an ensemble of models, and we discovered that this way we were getting better models. So one thing to remember, I think, about everything we're doing in machine learning these days is that it's all about ensembles; it's all about a collection of entities working together in some form or other (there's a small sketch of building such an ensemble in R below). I see some smiles; gee, that's almost 100% accurate, males wearing glasses coming into the room, wow. Ensembles are the basis of many of the models today; in fact some of the best modelling algorithms in use are ensembles. Yes, pretty good accuracy, and if we put it the other way, no exceptions to the rule yet. So, ensembles: XGBoost, the most popular algorithm on the Kaggle competition site at the moment, is available in R, along with older boosting approaches such as AdaBoost, random forests and so on. Ensembles are the key.

So we've got a whole suite of algorithms out there available to us, and I'm a computer scientist coming out of AI and machine learning research rather than a statistician. Statisticians invented a language called S back in the 1970s. Interestingly, the people creating S were just down the corridor from the team creating Unix, the forefather of Linux if you like, at AT&T Bell Labs in New Jersey, and they also plugged into some of the AI community at that time; you can actually see some of the elements of AI and Unix in the design of R. But it was designed by statisticians, and as a computer scientist I can see many deficiencies in the language itself. We love Python for doing a lot of our work, but the power of R is that almost any algorithm I can imagine wanting to use for AI and machine learning is available in R, along with the whole suite of statistical packages as well. There are over 10,000 packages, a mark recently reached, contributed to the R ecosystem and available to download. How many users do we have? Maybe 3 million. R is the open source re-implementation of that language called S, by a couple of New Zealanders in the 1990s. It grew from a small community of maybe a dozen users in the 1990s; being an Australian, close to New Zealand, and knowing something big was being worked on there, we started using it quite early on. But it's grown to 3 million users, we guess, today. And it hasn't grown because a vendor is sitting there trying to sell it; it's grown because data scientists analysing data have adopted it and taken it to their companies and said, we want to use this language to do our analysis of data. It's grown organically because it's been so useful.
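Continuing the toy example from before, an ensemble version with the randomForest package (again purely illustrative, reusing the room and newcomer objects defined above):

```r
library(randomForest)

# An ensemble over the same toy data: each of 500 trees sees a bootstrap
# sample of the room and a random subset of the variables, and the
# forest votes on the prediction.
forest <- randomForest(glasses ~ gender + height, data = room, ntree = 500)
print(forest)   # includes an out-of-bag error estimate for the ensemble

# The ensemble's vote for the next person through the door.
predict(forest, newcomer)
```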
I introduced R and open source into the Australian Government when I joined the Australian Taxation Office 12 years ago to set up a data mining and analytics capability. It took me three years to set up the infrastructure for our data science team; three years of a lot of hard argument against the commercial interests of vendors who were fighting tooth and nail against open source software. They saw it as a threat. How times have changed, and so dramatically: today the vendors are embracing open source software enthusiastically, and really see that their future has to depend on, or at least include, an embrace of open source software. As a computer scientist, I really find this picture fascinating.

R is a specialised language; it's not a general purpose language as such, like C#, Java, Python and so on. Yet according to IEEE Spectrum's language popularity ranking, which uses something like 15 different metrics to decide on the popularity of languages, R is the 5th most popular programming language out there. That's massive, given that it is a specialised language rather than a general purpose one. It's not as popular, of course, as Python, but Python is a language of the internet and a general purpose language; R is more popular than C#, JavaScript and so on.

When I was setting up the data science team, one of the problems I had was how to get people who might know databases to start using some of these algorithms in this language called R, and I wanted to build more of a community of users. I had 150 data analysts across the organisation, and part of my role was to bring them up to speed with data science. Many of them had no interest in coding, in writing programs in R, and R did not have a graphical user interface, so I had a challenge. We brought together a package called Rattle to help us do that. Rattle, you can get a bit of a glimpse of it there, is a package providing a very simple, not very well polished, rough and ready graphical user interface for building your models.

If you want to install R on your system, on my Debian-style machine it is as easy as wajig install r-recommended, if I've spelled that right. That will go through and install all of the packages you need to run R, and you can then install the Rattle package with wajig install r-cran-rattle. I've probably already got it installed, so I won't do that just here. You then start up R. I'm a bit of an ancient user of the command line and Emacs, so I'll just use that; but if you're starting with R today then RStudio is the recommended IDE, really, really nicely done, borrowing a lot from the Emacs ecosystem but doing so much more as well. So I'm going to load the Rattle library and start up Rattle.
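If you're not on a Debian-style system, the same setup works from inside R itself; a minimal sketch, assuming the rattle package from CRAN:

```r
# Install Rattle and its dependencies from CRAN (one-off).
install.packages("rattle", dependencies = TRUE)

# Load the package and launch the GUI.
library(rattle)
rattle()
```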
Now, that's very small, apologies for that; it's even smaller than it usually is, but never mind, hopefully you can see the key elements there. A claim to fame is that within four clicks of the mouse button you can build your first machine learning model. To do that, I just click on execute. There's a sample data set in there; it was appropriate that we were talking about weather in the previous presentation, getting today's weather. It's a sample data set of weather observations that have been collected over the past 10 or 12 years; this sample is just a very small subset of that, just one year of observations. We load in the sample data set: there's the second click. I go to the model tab and I click on execute: four clicks, and I've got my first decision tree model.

Now, you can see that blurry bit of text there; that's the actual model. I'll click on draw and we get a better presentation of the model. You can see the model is taking historic data, just like we were talking about here with predicting whether you're wearing glasses or not. It's looked at the 3pm pressure as the first variable: if it's greater than or equal to 1012 hectopascals, we then look at the amount of cloud cover at 3pm. If that's less than 7.5, we predict that tomorrow it's not going to rain, with 95% accuracy; on the right hand side, we predict with 74% accuracy that it is going to rain tomorrow. A very simple model, using exactly that algorithm we got a feel for earlier and that is used extensively through industry today: decision tree induction. That's building one tree. Random forest: here, I click on forest, and that's just built 500 of those trees, in 0.33 seconds; a very efficient algorithm, it does it quickly. 500 trees, all using different variables for partitioning and predicting whether or not it's going to rain tomorrow. Those models turn out to be much more accurate than just a single tree model, and there are many reasons why that might be the case. So that's the decision tree algorithm, and a bit of a sense of what machine learning does.
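Rattle records the R commands behind each click in its Log tab; the models above correspond roughly to the following sketch, using the weather dataset that ships with the rattle package (the exact variables and split points will depend on the data):

```r
library(rattle)          # provides the sample weather dataset and fancyRpartPlot
library(rpart)
library(randomForest)

data(weather, package = "rattle")

# Drop identifiers and the risk variable RISK_MM, which leaks the outcome.
vars <- subset(weather, select = -c(Date, Location, RISK_MM))

# The four-click model: a decision tree predicting RainTomorrow.
tree <- rpart(RainTomorrow ~ ., data = vars, method = "class")
fancyRpartPlot(tree)     # the nicer drawing behind Rattle's Draw button

# The Forest button: an ensemble of 500 trees over the same data.
# randomForest cannot handle missing values, so omit incomplete rows.
forest <- randomForest(RainTomorrow ~ ., data = na.omit(vars), ntree = 500)
print(forest)
```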
R is an open source toolkit for accessing algorithms written in a variety of languages. Many of these are implemented in C. This random forest algorithm, whose author Leo Breiman has passed away, was originally written in Fortran, and that original version is still in the package today; it works well. C++, C# and other languages can all be integrated here, and I use the one interface to access all of those algorithms.

One issue with R is that everything is done in memory, and that's a major problem. If we go back to here: if it fits in memory, great; if it doesn't, we've got some issues. That's what the work at Microsoft has been looking at: how do we extend R to work with out-of-memory entities and remove that limit? R, and indeed Rattle, is being extended, and if you go to Bitbucket, rather than GitHub, you'll find there the latest version of Rattle, which handles out-of-memory datasets of any size and runs those algorithms in parallel.

One minute, just to finish. The other issue is how we scale up to big data. I do most of my work on my laptop here in R; my laptop's got four or six gig of RAM, I think, and maybe two or four cores. Today it's the era of the cloud, and that is so powerful. I mentioned that it took me three years to bring an open source stack into one government department in Australia, and I repeated that for a number of government departments: three years just to get that set up, working with the IT department. Today I can push a button, and within five minutes I can have a new data science virtual machine. If you were here for Ben's presentation an hour ago, he mentioned the Windows data science virtual machine; I'd highlight that we also have a Linux data science virtual machine, a completely open source stack from the base up. It runs on CentOS and installs a collection of the best machine learning algorithms available today, and our goal is to make it the very best available to anyone who starts up this machine, irrespective of where those algorithms come from. We have CNTK, which is Microsoft's deep learning toolkit; we have TensorFlow, which is Google's; we have XGBoost, obviously; and MXNet, which is, in a sense, I think Amazon's chosen deep learning framework. So we have a suite of the algorithms out of the box, five minutes to set it up, and it's elastic.

Now, from my laptop here I can run a bit of code in R which will, not essentially, it will, fire up a virtual machine for me. I'm just going through this very quickly without really explaining it: I'm creating a random name, a resource group, and a location, south east Asia, which is here in Singapore. I'm connecting to that data centre at the moment, and once that's finished I'll create what's called a resource group. I've now sent off a request to create that resource group; it's being created, and I'll now fire this off, which starts up a data science virtual machine. It's creating a new instance for me in the cloud, and it takes a bit of time, about five minutes. You'll see it is running down at the bottom there. I think we're on the conference network, but it's communicating with that data centre and getting responses back, checking the progress of the standing up of that data science virtual machine. Once that's there, again from R, with some packages that my colleague here and I are developing, we can use R to control and run jobs on those servers, and we can resize those servers as we need. So I can start off with a cheap $20 a month server and go up to a $1,000 a month server when I need it, and only when I need it, then turn it off and scale it back down when I don't need that amount of compute power. It's becoming very cost effective, very efficient, and a great way for us to do our data science these days.

As I say, that's going to take another few minutes to connect. Once it's fired up you can use X2Go, for example, to connect across to it, and at some stage you then have a GUI interface. I could then interact with Rattle in exactly the same way as on my desktop, run my processes there, and have a powerful machine when I need it. A couple of resources to refer to, but with that, thank you very much.

Thank you, Dr. Williams. I'm sure we don't have time for other questions, but I understand you'll be around later on.