What I'd like to do is some live coding, and actually demonstrate what you can do with machine learning in Clojure. Now, the demo was working nicely for a week, then this lunchtime I had a dreaded out-of-memory exception. After a little debugging session I think it's okay now, but fingers crossed, let's see how it goes.

Okay, so I think it's always good to start with a definition, so what I naturally did was go to Wikipedia and pull off the first thing I could find: machine learning is the field of study that gives computers the ability to learn without being explicitly programmed. Now, that quote is over 50 years old, but I think it's still a pretty good definition, and it captures the most important point. What we're interested in is getting our computers to learn to do useful things without having to explicitly tell them every single task, because with large, complex data that rapidly becomes impossible. There is actually one big problem with this definition though. Can anyone see what it is? No? It's a circular definition: it uses the word "learn" to define machine learning. What kind of definition is that?

And this learning thing is very interesting. What is learning? Learning is something so intuitive to us that we do it all the time without even thinking. Maybe you meet someone at a tech conference and you remember their name. Voilà, you've just learnt something. It's helpful to think about what that actually means, and I find it helpful to think about learning in a very specific way: learning means building functions from experience. Every machine learning problem can ultimately be conceptualised as a function mapping some kind of input to some kind of output. It can be a simple mathematical function. You can do spam filtering if you interpret the output as the probability of an email being spam. If you fancy making some money, you can do some stock market prediction, and good luck with that.
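The idea of learning as building a function from experience can be made concrete with a tiny sketch. The talk's code is Clojure, but since none of it appears in the text, here is the idea in plain Python with illustrative names: take observed input/output pairs, fit a straight line to them, and get back an ordinary callable function.

```python
# Learning as "building a function from experience": fit y = a*x + b
# to observed (input, output) pairs, then use the fitted function on
# new inputs. All names here are illustrative, not any library's API.

def fit_line(points):
    """Least-squares fit of y = a*x + b to a list of (x, y) pairs."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    # The "learned function" is returned as an ordinary callable.
    return lambda x: a * x + b

# "Experience": samples drawn from the function y = 2x + 1.
experience = [(0, 1), (1, 3), (2, 5), (3, 7)]
f = fit_line(experience)
```

The experience here is the training data, and the returned callable is the learned function, which can then be applied to inputs it has never seen.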
And even human learning can be thought of this way, as long as you think of the input and the output as being states of mind. So the machine learning problem basically boils down to taking experience, in the form of empirical data, and running it through an algorithm that automatically generates a function embodying that knowledge.

So a natural question is: are we any good at this? I think a good analogy is to compare machine learning to the early years of flight. We're really only just getting started. There are lots of crazy ideas, lots of things being tried out, and we've had some successes. We've managed to get some of these things to work, but it's still very early days, and a lot of the solutions are very much handmade. You see things like IBM's Watson, which is really, really good at the specific task of playing Jeopardy!, but no good at all at driving your car or making you coffee or anything like that. But it's still a very exciting field, and there's a huge amount of value here if you can get machines to learn how to do something useful in real applications.

So naturally I thought this was an exciting time, and a good time to set up a new start-up in this space. At Nuroko we're building a new approach to machine learning, a new machine learning toolkit. The design goal is that it's going to be general purpose: it's going to work on any kind of data. Images, sounds, numbers, text, you name it. The power is in the algorithms, algorithms that can learn to recognise deep patterns and draw useful inferences from that data. And of course we're doing all the usual big data stuff, making it scalable, real-time, et cetera. As I'm here, obviously we're using Clojure to do this, and I thought it was worth reflecting very quickly on why Clojure has turned out to be a good choice.
First of all, it's actually a pleasure to use, and I think this shouldn't be underestimated when you're doing a start-up and you're going to be working long nights with the technology. I know some people like to bash Java occasionally, but the JVM is a huge advantage: it's got excellent engineering, access to the Java libraries is very important in this space, and the ability to deploy and integrate into real-world applications is very useful if you actually want to get things done and build systems using machine learning. Interactivity is particularly important in machine learning, because there's a lot of trial and error. It's really helpful to have a REPL-driven toolkit so you can try things out, see how they're working, and iteratively refine your models until you get good results. Functional programming I think is useful anyway when you're doing a lot of data manipulation, and Clojure has been great from the perspective of building a DSL for machine learning with composable abstractions. I think that last point is very important: we're trying to build a generic toolkit here, and if that's going to be useful and productive, it has to be possible to quickly plug together the machine learning components you need to solve a specific problem.

So with that said, I'd like to introduce a few of these abstractions. First of all is the humble vector. The vector is basically an array of double values, and this is what we're going to use to represent inputs and outputs to our algorithms. Now, in the real world your data is not going to come nicely pre-formatted as vectors, so you need a coder: something which will convert data to a vector and back again as needed. You also need to describe the problem that you're trying to solve. This is the task, and it encapsulates all of the training data that you need to use: what is the function that you're trying to get the machine to learn, using some set of training data?
The module is what represents the function being learnt. Typically that would be a neural network, although you have the flexibility to plug in other kinds of modules as well. And finally there's the algorithm. The algorithm is what actually makes the learning work; it's what actually builds the function that solves that particular problem. In the demonstration I'll be showing all of these different abstractions, but before I do that, I thought it would be worth very quickly covering a little bit about neural networks for those who haven't seen them before.

A neural network is a structure that computes a set of outputs from a set of inputs. It's constructed from a number of nodes and weighted connections between those nodes. When the calculation happens, it performs a weighted sum to calculate the value for each of these nodes, and those values then flow through the network. It's typically arranged in a number of layers: you have an input layer at the bottom, an output layer at the top, and anything in between we would call a hidden layer. And it's very important to note that it's the weights which do the learning. The structure doesn't necessarily change, but you adjust the weights so that the network performs the function you're trying to learn. Neural networks are often a good choice for machine learning for two reasons. One, we actually know some pretty good algorithms for training them, and the algorithms are getting better all the time. And two, there's a useful fact that if you make a neural network large enough, it's capable of approximating any function. They work as universal function approximators, which is a useful property to have.

So how do we train these things? You start off by initialising the network with some random weights. You then choose a random training example as input from your training data.
You run that through the network and compute the output, to see what it produces. You then determine the error: you compare what the network produced with what you would have liked it to produce, the expected output you want to see. And then you adjust the weights, very slightly, in whatever direction reduces the error. Then you go back and do the same again with a different training example. If you do this lots and lots of times, each time reducing the error very slightly, what you end up with is a network with a low overall error: it's producing the expected output of the function as closely as possible after some period of training. So that's the basic algorithm for training neural networks.

So let's show this in action. I'll start with a very simple example. This is Scrabble. It's a great game; I'm very fond of playing it with my family. It's a word game, and one of its distinctive features is that each character is associated with a numerical score. So what we're going to do is really simple: we're going to teach a neural network the right score for each character. Let me just switch over into demo mode. We're going to start off by defining the actual data that we're going to use for training. This is the scores for each character, and it's just a simple sorted map. You can see A is worth 1, Z is worth 10, and all the other characters have their own scores defined there. Now we want a way to encode the scores into double vectors. There are lots of ways of doing this, but a fairly obvious one is a simple binary encoding. So I'm going to define a score coder as an integer coder with 4 bits, so a 4-bit binary number.
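A sketch of the two kinds of coder used in this demo, in Python for illustration (the talk's actual Clojure API isn't shown in the transcript, so these function names are hypothetical): a 4-bit binary coder for scores, and a one-hot coder for letters.

```python
# Two illustrative coders: encode turns data into a vector of doubles,
# decode turns a vector back into data.

def int_encode(n, bits=4):
    """Encode an integer as a vector of 0.0/1.0, most significant bit first."""
    return [float((n >> i) & 1) for i in range(bits - 1, -1, -1)]

def int_decode(vec):
    """Decode by thresholding each element at 0.5 and reading it as binary."""
    n = 0
    for v in vec:
        n = (n << 1) | (1 if v > 0.5 else 0)
    return n

LETTERS = "abcdefghijklmnopqrstuvwxyz"

def letter_encode(ch):
    """One-hot: a 26-element vector with a single 1.0 at the letter's index."""
    return [1.0 if c == ch else 0.0 for c in LETTERS]

def letter_decode(vec):
    """Inverse of the one-hot encoding: index of the largest element."""
    return LETTERS[max(range(len(vec)), key=lambda i: vec[i])]
```

Decoding thresholds at 0.5 (and takes the largest element for the one-hot case) because a trained network produces outputs near, but not exactly, 0 and 1.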
And if I then try that out with the number 3, I get a result which is a vector, 0 0 1 1, the binary encoding for 3. I can then decode with the same coder and get back to 3 again. So that's the function of the coder. I'm going to do the same thing for letters. Here we've got 26 possible classifications, so we've got 26 values, and I'm just going to use the keys from our scores map to define them. If I encode the letter C, for example, I get a vector 26 elements long, and you can see that only the third element is set to 1, which represents C as the third letter of the alphabet. The task that we're trying to learn here is just a straight mapping task: mapping characters to scores. So I'm simply going to use the scores map and tell it that I want to use the two coders we've just defined.

I'm also going to define a neural network. We're going to have 26 inputs for the 26 characters and four outputs for the binary score, and we're going to have one hidden layer with six units in it. Notice the quite extensive use of Clojure keyword arguments here to configure the different units. It's quite nice to be able to see what's going on, so I've created a small visualisation routine which will show a neural network. Here it is: 26 inputs at the bottom, four outputs at the top, and six units in the middle in the hidden layer. The lines between them are the weights. The green lines represent positive weights and the red lines represent negative weights; they're all random at the moment. So the function we're actually trying to produce here is going to take a letter, encode it with the letter coder, run it through the neural network, then decode the result with our score coder to get the answer. And if I try running that function with the letter A at the moment, I get 12. Now, 12 is completely wrong. It shouldn't be 12; it should be 1.
But that's because we haven't trained the neural network yet; it's just coming out with a random answer. So let's define what success looks like. To evaluate this network, we're just going to count the number of times the output of the network, the Scrabble score it produces, is equal to the actual score from the map. That's our evaluation function. And again, let's do some visualisation; let's put that on a time chart. What I have here is a continuously updating Incanter chart (Incanter is a great library, by the way) which is just showing how well this network is doing. Currently it seems to be getting two of the scores right, and that's completely by chance: the network happens to be producing the right answer for two letters.

So let's do some training. We're going to use a standard backpropagation algorithm, and we're just going to run that on the network for a short time and see what it does. Watch what happens to the evaluation, and also watch what happens to the network. Nice. Okay. So what you saw there was the score going up from two to 26: it's now getting every single letter right. It's solved this problem. And the colours changed on the neural network, because the weights got adjusted to learn the problem. Now, I actually had to slow that down. If I'd run it at normal speed it would have finished instantly; I just added a sleep in between each iteration so you could see the improvement happen. But that's how learning works. And if I want to test it out, let me just run the letter Q through it. You can see here that that's the letter Q as an input, and it's produced 1, 0, 1, 0 as an output, which is 10 in binary.

Now, people often criticise neural networks because you can't really see what they're doing. But in fact, sometimes you can. You can see that the Q here has a positive link to this node.
And then that node has a positive link on to this output node. So to some extent, the node in the centre is probably acting a bit like a feature detector: it's detecting the Q and saying that the high bit should be set. So you can do some interpretation of what the network is doing that way. Okay, so that's the first simple example of neural network learning.

That was, however, a pretty easy example, so let's try something a little bit harder. This is handwritten digit recognition, and this one is a pretty badly written two. This is a much, much harder problem, with the type of issues you see in real-world data. First of all, it's larger: this is a 28 by 28 image, so there are 784 pixels, which means 784 input dimensions. And it's not just discrete values; we have some intermediate grayscale values as well. We also have noise: things like random pixels and distortions in the data, which make it much harder to actually learn the patterns.

So how are we going to approach this? Well, one thing we can deal with is the number of input dimensions. To deal with the fact that we have 784 inputs, let's do some compression, and this is a really nice trick with neural networks. What we're going to do is build a network and train it on the identity function: we're going to train it to produce exactly the same output that it was given as input. Now, this may sound a little bit stupid. We can obviously write the identity function very easily. But the cleverness is in how we've constructed this network. If we successfully train this network, with its 784 inputs, 150 units in the middle, and 784 outputs, and it's learnt the identity function, then all the information that was required to produce the output must have gone through that central layer. So what we've done is we've encoded the image in those 150 hidden feature units.
We've actually captured all of the information that exists within that image. So if you then take the bottom half of this network, what you've got is a compressor: it takes a large number of dimensions down to a small number. And then, of course, we can take that compressor and build the rest of our network on top of it; we're going to add some more layers on top to do the actual recognition. Because we've compressed down from 784 to 150 in the first layer, this makes the whole network smaller, easier to manage, faster to train, et cetera. And of course, this is just function composition. We've composed together two functions; the functions happen to be neural networks, but this is the function composition we know and love.

So let's give this a quick try. Okay. We'll start off by getting some of our data. We actually have 60,000 training examples here, which we're going to use. And again, it's quite useful to be able to visualise these things, so I'm just going to define an image-creation function and map it over the first 100 data items. This is the advantage of having a dynamic REPL environment: you can just do these quick visualisations. So those are the first 100 digits. You can see they're all handwritten, with quite a bit of noise in them. That's what we're going to train this thing to recognise. We also have the labels, and the labels are the correct digits that we're expecting to recognise. Again, we've got 60,000 of them. Take the first 10; that should be the same as the first line of those digits up there.

So let's do our little compression trick. The compression task is just an identity function. I'll now define the compressor, which is going to have 784 inputs and 150 outputs. Let's also have a decompressor, which is going to be 150 inputs going to 784 outputs. The reconstructor is just a combination of those two. So again, this is function composition.
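The "connecting networks is just composing functions" point can be sketched in miniature, in Python for illustration (these names are hypothetical, not the talk's library): a toy compressor that halves the dimension by averaging adjacent values, a decompressor that expands back out, and a connect that chains them exactly like function composition.

```python
def connect(f, g):
    """Chain two "networks": run f, then feed its output into g.
    This is just function composition, read left to right."""
    return lambda x: g(f(x))

def compress(v):
    """Toy compressor: halve the dimension by averaging adjacent pairs.
    Assumes an even-length input vector."""
    return [(v[i] + v[i + 1]) / 2.0 for i in range(0, len(v), 2)]

def decompress(v):
    """Toy decompressor: expand back by duplicating each value."""
    return [x for x in v for _ in (0, 1)]

# Composed, the pair approximates the identity on inputs whose
# neighbouring values are similar: lossy compression in miniature.
reconstructor = connect(compress, decompress)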
The connect function is analogous to compose, but for neural networks. Let's see what happens with this. I'll define a function which is going to show our reconstructions and try it out. We get a lot of random noise; again, that's what we expect. We haven't actually trained this network yet, so it's just producing whatever the random weights produce. So let's do some training, with the backpropagation algorithm again. I'm just going to run this for a short while. I've set it up so that as it runs, it's going to update the reconstructions now and again, so we can see how it's doing. Let's see how it does. So something's happening here. That's interesting: it's starting to look a little bit like some numbers. It's starting to get reasonable. You can now probably make out that those are actually the same input data that was at the top. It's not going to be an exact replica; we're actually learning lossy compression here. But that doesn't really matter for the machine learning problem, as long as we've captured enough information about the features in the data to help us recognise the image. Okay, that's probably good enough; I'll stop that there.

One thing we can do that's quite nice is have a look at those 150 feature detectors in the middle; we can actually see what they've learned to detect. This is quite a pretty trick. What I'm going to do is show some images of what they've become sensitive to. This picture shows, for each of the 150 units in the hidden feature layer, what that unit has become sensitive to. Again, green is a positive relationship and red is a negative relationship. You can probably see that some of these have picked out features. This one here, for example, looks like it's become a one-detector to some extent. Others you can see are detecting a combination of different features mixed together.
Those are the kind of little strokes and features you'd expect to see in digits. That's a good sign, because you want these feature detectors to be detecting different characteristics of the input data. That looks pretty good.

Now let's actually try to do some recognition with this. Again, we're going to need a coder, this time for the numeric values that we're trying to predict. We've got 10 different possible values, so I'm just using (range 10). Let me try that out. If I encode the number 3, I get a vector 10 elements long with a 1 at index 3, which represents that value. That looks good. Our recognition task here is again a mapping task: we're mapping an image through to a single classification, the output value. For the recognizer itself, we're going to take the 150 features we've learnt to detect through our compression step and map them to the 10 outputs. The overall recognition network is then just our compressor connected to our recognizer. And again, we can just use the backpropagation algorithm for training.

The final thing we need, and this is very important in machine learning when you have real problems, is some test data. The reason you have this is that you want to test whether you've actually learnt to generalise, so that your learned function actually works on previously unseen data. I've got 10,000 test cases so we can check how this is working. The testing task is basically the same as the task we're using for training, just with unseen data. Again, let's see how this is performing. Here we're going to plot the error rate, so we're going to see what percentage of errors it's making. At the moment it's making nearly 100% errors. It's really not getting anything right, which is actually even worse than you'd expect by chance. Let's see how this works.
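The error-rate evaluation being plotted here can be sketched like this, in Python for illustration (names are hypothetical): run the learned function over labelled examples and report the fraction it gets wrong. Doing this on held-out test data, never touched during training, is what measures generalisation.

```python
def error_rate(predict, examples):
    """Fraction of (input, expected) pairs the learned function gets wrong."""
    wrong = sum(1 for x, expected in examples if predict(x) != expected)
    return wrong / len(examples)

# Toy illustration: a "learned" doubling function scored on three
# labelled examples, one of which it gets wrong.
rate = error_rate(lambda x: x * 2, [(1, 2), (2, 4), (3, 7)])
```

An untrained network scores close to 100% error, as in the demo; training drives this number down on both the training set and, if the network is generalising, the test set too.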
Let's see if we can actually learn to recognise these digits. The red line is the training data and the blue line is the test data, so I'm plotting two charts at once. That's looking good. They're going down, and the other nice thing is that they're going down mostly together, which is a good sign. It tells us that we're actually generalising: the blue line shows this neural network working on digits that it's never seen before, digits that aren't part of the training data set. It's getting better and better, and we're now at about 10% error. If you ran this for long enough it would probably get down to about 3 to 5% error. I haven't got time to do that today, but that's looking pretty good, so I'll just stop the training there and look at the outputs we get.

We'll define a recognise function; this is just going to take the image data, run it through our recognition network, and put the result through the number coder to decode the output. The first data item, I think, was a 5 in the top left. Let's see what it gets. It gets a 3, so that's one of the ones it's getting wrong, but it's not a very well written 5; it does look a bit like a 3. Let's map that over the first 100 digits. That's the actual outputs from the network compared to the inputs. You can probably see it's getting most of them right: about a 90% success rate on image recognition, which isn't state of the art, but it's not bad for a quick five-minute exercise.

That's actually the end of my demonstration material. I hope you found it interesting, and I hope it's been a good demonstration of what you can do with Clojure as a toolkit for machine learning. I wanted to make sure I left a bit of time at the end for any questions, discussions or ideas, so thank you very much.