Hello and good morning, everyone. Our first speaker of the day will be Heiko Strathmann, and he will be speaking about the Shogun machine learning toolbox.

Yeah, hi, good morning. Quite surprised that it's so full here given the early time of day, but that's good, I guess. So yeah, this is me, Heiko Strathmann, and this is what I'll be talking about. Since it's a quite short talk, I can't really go into details; I'll just give a high-level overview of the project. First I'll tell you a bit about what it's about and its size, then I'm going to tell you a bit about what machine learning is, and then I'll talk about the machine learning features we have in Shogun and what you can do with them. Since this is a quite geeky place, I can also talk a bit about the technical details of what's going on under the hood, and I'll close with some remarks on the nice community that has evolved around our project recently.

First, since we're an open-source project and everything people say about these things has quite a local perspective, here's some background about me. Before I started doing this kind of thing I got a bit distracted and pretended to be a musician, then finally started studying computer science and later machine learning in London. Currently I'm doing a PhD at UCL in neuroscience and machine learning, and for the people who know these words, these are my research topics. I'm particularly interested in open source, so I joined Shogun in 2010 to bring together my open-source and machine learning interests, and now I'm kind of guiding the project along.

So, what is machine learning about? How many of you are familiar with this term, or what it means? Okay, quite a few. So here are some very high-level examples of applications, things that I've come across so far.
So machine learning is a science, and it's really a science of patterns and information. What does that mean? It's a kind of abstract thing and it involves lots of mathematics, but what can you do with it? It's quite useful for automating things, for example for recognizing things.

One project I worked on last year was about detecting flaws in airplane wings. There was a company, and what they did was inject ultrasonic sound waves into the airplane wing; the sound waves travel through the material, get reflected, come back, and you record them. If there's a little crack or something in the wing, you can see it in the reflection, and you want to do this in an automated way, because it's actually quite hard for humans to do. So we developed some tools to do this automatically.

Another one: I like going out in the sun, and every year I go to the skin doctor, and they tell me off for getting sunburns. They always do this thing where they take little photographs of my skin, and then there's a system that scans through them, and what it actually does is look for characteristic patterns in those photographs that might indicate something dangerous going on.

Another nice example: three years ago I was in India and I was using my credit card, and it got blocked immediately. I called the bank, like, "sorry, I need money", and they said, "we thought this was fraud". Their computer system had told them this was likely to be fraud.

There are more examples. For example, you might want to predict things, not just recognize or detect them. Another project I worked on was about predicting how the HIV of a bunch of patients reacts to a certain treatment, whether it's resistant or not. What we did was take the DNA of the HIV viruses of the individual patients, put it into our pattern-recognizing machine learning algorithm, and it told us, don't give this patient this particular treatment. And this is all learned from data. There are more things: in my PhD in neuroscience we look a lot at brain scans and things like that. But there are also commercial interests, companies like Google, Amazon, Netflix that want to recommend things you might like.

Some people are sometimes confused: machine learning is also very related to computational statistics, and there's lots of exchange between them; for me it's really the same thing. But you could maybe say machine learning wants to automate things, while statistics is more about understanding a certain process. And all the buzzwords floating around, like big data, or deep learning, which is a nice one currently, are all kind of related, and obviously you can use all this stuff to build robots.

Okay, so here's a bit about Shogun. This is our latest version. We're an open-source project, public since 2004, which means we've been public for 10 years now; it's quite old. We currently have eight core developers who spend time every day developing, and we have about 20 regular contributors, so we're quite a big project actually, given that we just come from the community. The original background is academia: people at universities have been developing this, and I work at a university, so there's demand there, but we're getting more and more into applied regions. What really boosted the project is that four years ago we started doing Google Summer of Code, and so far we have done 29 projects; that's 29 times three months of full-time work, so we get quite some impact from this. And, I'll say this now and a couple more times: we're doing a workshop this weekend, Sunday and Monday, which is free, so feel free to drop in; I'll give you some details later.

Here's a bit more about the size of the project. Who's familiar with the website Ohloh? A few, okay; last time I gave this talk at a university, nobody knew the website. It's a site that crawls repositories like GitHub to get statistics. So we have quite a few commits and quite a few contributors. I really like these comments they generate: we have a very low number of source code comments, it's mostly written in C++, we have a "well-established, mature codebase", whatever that means, and an estimated 162 years of effort. So it's quite nice. Here's the number of lines of code; I could talk about exponential growth here and such, but I won't. We're about to hit a million lines of code, which is nice. Here's the number of commits per month: you can see the summers boost things, but even in winter we still average about two or three commits per day, so it's quite an active project. Just to set you up for what we're talking about: okay, machine learning. You can't really see this one too well.
Well, okay. So the most classic textbook machine learning can be categorized into supervised learning, unsupervised learning, and some other categories, and all these textbook algorithms we have implemented.

This one is supervised learning, and, okay, I have to hurry. This is learning from data that somebody labeled for you: somebody gives you some information that they know about the data, and then you try to come up with a characterization of the data that works for data you haven't seen yet. So you take all the scans of the DNA that you had so far, where you know whether the HIV treatment was effective, and then you try to predict this for a new patient. If you open a textbook, you come across all of these methods: support vector machines, which were a buzzword a couple of years ago; Gaussian processes; logistic regression, which big companies are currently very interested in because you can parallelize it. All these things are implemented within Shogun. I think I'll leave this for now.

The other class of algorithms we have quite a bit on is unsupervised learning, and there it's a bit different: you just get a bunch of data with no information attached, and you try to come up with a characterization of the process that generated the data, which we'd really write like this. Again, if you open a textbook, there are all kinds of algorithms, for example clustering algorithms like k-means: you have a bunch of points, you assume there are, say, three clusters, and you ask, what are the clusters, where are they, how can I characterize them, can I use them for labeling things? We also have quite a few latent models, if you know what that means, which are basically trying to find a lower-dimensional representation of your information, in order to describe it more efficiently for communication or to understand it.

And since I'm a bit in a rush again: these are all textbook methods, but we also have quite a few researchers implementing their own work in the toolbox. That's for example what I do, so that stuff is actually not available anywhere else; these things were hot topics in machine learning recently, or still are, like this one here. They're all in there. To get a feeling for what's in there, have a look at our collection of IPython notebooks on our website. They're quite nice, kind of tutorials about the methods and what you can do with them. I'll skip this one here.

When you do machine learning in practice, you have all sorts of problems: you want to import your data, it's not preprocessed, and so on. You can do all this with the toolbox. You can have different representations of data, like dense matrices, sparse data, strings, collections of documents, data streams; that's quite nice, and it's kind of a unique feature of our toolbox that we can handle all these things under a unified framework. There are different data types, preprocessing tools, methods to evaluate your algorithms, to tune the parameters, and all this, so it all comes included to make your life easier.

Okay, that's already what I'm going to say about machine learning. So here are some technical features that might be interesting for you guys. We're written in C++, so you might ask, why are you at a Python conference? Well, we provide automatic interfaces to a bunch of languages; I'm going to talk about this in a minute. The reason we're written in C++ is that we can then expose one framework to a lot of languages, and because we're doing quite low-level things anyway.
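The k-means clustering mentioned a moment ago can be sketched in a few lines of plain Python. This is just the textbook algorithm, not Shogun's implementation, and the toy points are made up for illustration:

```python
def kmeans(points, k, iters=20):
    """Textbook k-means: alternately assign each point to its nearest
    centre, then move each centre to the mean of its assigned points."""
    # initialise with the first k points (good enough for a sketch)
    centres = list(points[:k])
    for _ in range(iters):
        # assignment step
        clusters = [[] for _ in range(k)]
        for x, y in points:
            dists = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centres]
            clusters[dists.index(min(dists))].append((x, y))
        # update step
        for i, c in enumerate(clusters):
            if c:
                centres[i] = (sum(p[0] for p in c) / len(c),
                              sum(p[1] for p in c) / len(c))
    return centres

# two obvious blobs, one around (0, 0) and one around (10, 10)
pts = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
print(sorted(kmeans(pts, 2)))  # two centres, near (0.33, 0.33) and (10.33, 10.33)
```

The answers to "where are the clusters, how can I characterize them" are exactly the returned centres; labeling a new point is then just picking its nearest centre.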
We can write efficient code, handle the memory manually, and do those kinds of things. We use quite a few cutting-edge libraries for linear algebra and numerical computation, like Eigen, and recently we started using ViennaCL and the like to do computations on GPUs. If you want to get a grasp of the interface, have a look at our Doxygen-generated class list.

Okay, and now here's one of the nicest things. We don't believe it's good to tell users what programming language to use, but obviously, since we all love Python and I use it a lot for my research, we want an interface to it. So we use SWIG. Does anybody not know what SWIG is? Okay. It's a magic tool: we write our C++ classes, we define a bunch of typemaps that convert C types to native Python types, we have a list of classes that we want to expose, we press a button, and at compile time SWIG generates interfaces to all these languages. So whenever I implement a new algorithm, I press a button and then I can use it from Python; I'm going to show you an example in a minute. This is quite neat, because we have interfaces to Python, Octave, Matlab, Java, R, Ruby, Lua, and C#, and it's all the same interface, really, with certain syntactic changes.

So in C++ this looks like this: if you know C++ code, you have your pointer and your type and your template, you create a new instance of a class and a bunch of other instances, and then you call methods on these objects. (Five minutes? Thanks.)

Now we go to Python, and I do the same thing here, but rather than plugging in a pointer to a 2D matrix, I plug in a NumPy array. It's really the same interface: I define a bunch of instances of a bunch of classes, and I call methods on them. And if I do a prediction here, train a support vector machine, I can get the results here in Python, where the first index is, sorry, zero. But if I go to Octave, things don't really change, except that the first index is one; it's the same code running under the hood. So it's quite neat, I think. If you go to Java, it's a bit more messy.

Okay, then finally, another thing that we love Python for is IPython notebooks. As I said, we use these for documentation quite a bit, and another thing we've set up is a web service where you can try Shogun without installing it. We're running an IPython notebook server in the cloud; you can connect it with your GitHub account and run our example notebooks. Unfortunately it's currently broken; we broke it this week, which is bad, but we'll fix it soon. We also have a bunch of interactive web demos, like OCR recognition, written in Django; that's also quite nice, under this link here. Okay, I'm not going to talk about this one. We do thorough testing and use buildbot, so we have quite a few builds, which I also quite like: Fedora, FreeBSD, Windows, Mac. They're offline in the screenshot, but they actually do work. Last two minutes.
I'm going to talk a bit about our community. This is really the nicest thing about the project, for me at least: meeting all these people. We have quite an active mailing list and an active IRC channel. It turns out that Daniel, the guy who introduced me, actually got back to us through it, and we already knew each other, so that's nice. We have all sorts of people, like this guy from Spain who's quite active, and a few people sitting in the middle of nowhere in Russia just writing awesome code. I met a few of them last year; it's kind of hard to talk to them because they don't speak English well, but it's nice. This guy here, lambday, lives in Mumbai; he works 26 hours a day and writes really good stuff. It's nice talking to all these people; have a look at our GitHub page or our contact page.

Summer of Code, as I mentioned before, really boosted things; I assume everyone here knows about it. We currently have eight projects running, and I'm mentoring three, so I don't really sleep these days, but it's quite cool. If you're interested in machine learning, either mentoring a project or joining as a student, get back to us.

A few future ideas. We just founded a nonprofit association, because we want to be able to take donations, as many open-source projects do these days. We're considering transferring the license from GPL to BSD, if people know what that means. We're aiming to use Shogun for educational purposes, but also in industry, and we also organize workshops; here's some YouTube footage of our last year's workshop. The next one is on Sunday and Monday. We have a hands-on session on Sunday where you can learn how to use Shogun from a practical perspective, and a bunch of talks, a bit more of the science stuff, at c-base. The hands-on session is at ResearchGate; check our website if you're interested. It's free, and you can just come along and grab a beer or a coffee with us.

So, last slide. As I said, this is quite intense stuff, so we always appreciate any kind of help. You can just use the toolbox if you're interested in machine learning and give us feedback. You can fix bugs; we have hundreds of bugs on GitHub. If you're a really good C++ software engineer, you can help us with design problems that we have within the framework. You can write Python examples and notebooks; this is actually quite cool, it's super fun writing these notebooks. You can help us with the documentation. We have a website in Django, and I don't know Django, I don't know how to use it, so we need people to help us there. If you have the super next-generation machine learning algorithm, come and implement it; just get back to us. And yeah, also come to the workshop. Thank you.

Yes, please? So the question is, what's the difference between Shogun and other toolkits like scikit-learn, Weka, or Orange. Taking scikit-learn, which is the most similar one, we're actually quite similar projects. The thing is, if you want to use Shogun, you're not bound to Python; that's a big difference between the projects. Also, since we're written in C++, we can do things with memory that the Python people have more trouble doing, so we can build huge data structures in memory and treat them really efficiently. That means we can run some really large-scale examples, with millions of examples, on a single machine. But otherwise there's quite a bit of overlap. We take a lot of inspiration from the scikit-learn website, for example, which I think is brilliant, and the whole way they document things. I know a few of the guys and I quite like the project, and I think it's good to have a bit of diversity in projects.

More questions? Yes. So the question is whether we have used machine learning to improve Shogun. That's a good one. Machine learning unfortunately can't write memory-bug-free code for us, but what it can do, well, these Ohloh plots are quite nice, so sometimes I do a bit of data mining on the number of classes and how they evolve and that kind of thing, but it's really more for visualization and for marketing.

Yes? So the question is what I mean by large-scale. I skipped this example earlier, but these ones here are quite neat; this was done on a laptop. Here's an example from bioinformatics.
It's about splice site recognition. A splice site is a place in your DNA where, you know, your genes, your DNA, are transcribed to RNA, which is then cut into pieces before it's translated into protein, and you want to predict where this happens. There's a data set here where we're talking about 50 million examples, and the dimension of the feature space, of the representation, is 200 million. So that's quite big, and this runs on a laptop in a couple of hours. It works by the magic of defining data streams, streaming files from the network and feeding them into the algorithms. But Shogun is meant to run on a single computer, so it's not a distributed toolbox.

Okay, more questions? Okay. Thanks, guys.
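The streaming setup described in that last answer can be sketched in plain Python: examples flow one at a time out of a generator into an online learner, so the full data set never has to sit in memory. The perceptron below is just a stand-in for illustration, not Shogun's streaming machinery, and the toy data is made up:

```python
def example_stream():
    """Stand-in for streaming input: yields (features, label) pairs one
    at a time, e.g. parsed from a file or a network socket, so the
    whole data set never sits in memory."""
    toy = [([1.0, 0.0], 1), ([0.9, 0.1], 1),
           ([0.0, 1.0], -1), ([0.1, 0.9], -1)] * 100
    for x, y in toy:
        yield x, y

def online_perceptron(stream, dim):
    """Classic perceptron: look at one example, update the weights only
    if it is misclassified, then forget it and move on."""
    w = [0.0] * dim
    for x, y in stream:
        if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
            w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

w = online_perceptron(example_stream(), 2)
print(w)  # → [1.0, -1.0], separating the two toy classes
```

Because each example is touched once and discarded, memory use depends on the dimension of the weight vector, not on the number of examples, which is what makes the 50-million-example run on a laptop plausible.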