So the talk today is about scikit-learn, or in other words, why I think scikit-learn is so cool. First of all, I would like to ask you three questions. Not "what's your favorite color", actually. The first: how many of you already know what machine learning is? Oh, great, okay, perfect. The second one is: have you ever used scikit-learn? Okay. And the third one is: how many of you also attended the great training on scikit-learn yesterday? Okay. Just three brief questions.

So what does machine learning actually mean? There are many definitions of machine learning. One of them is: machine learning teaches machines how to carry out tasks by themselves. It's a very trivial, very simple definition, and it is that simple; the complexity comes with the details. It's a very general definition, but just to give you the intuition behind machine learning: at a glance, machine learning is about algorithms that are able to analyze, to crunch the data, and in particular to learn from the data. They basically exploit statistical approaches; that's why "statistics" is a very big word in this cloud. Machine learning is closely related to data analysis techniques.

There are many buzzwords around machine learning. You may have heard about data analysis, data mining, big data, and data science. Data science, actually, is the study of the generalizable extraction of knowledge from data, and machine learning is related to data science. I like to come back to this Venn diagram: machine learning is in the middle, and machine learning is a part of data science, because data science exploits machine learning; machine learning is a fundamental part of data science systems. But what actually is the relation of data mining, and data analysis in general, with machine learning? Machine learning is about making predictions.
So instead of only analyzing the data we have, machine learning is also able to generalize from this data. The idea is: we have a bunch of data, and we may want to crunch this data, to run statistical analysis on this data, and that's it; this is also called data mining, for instance. Machine learning is a bit different, because machine learning performs this analysis, but the goal is slightly different: the goal is to analyze this data and generalize, to try to learn from this data a general model for future data, for data that is unseen at this time. So the idea is: a pattern exists in the data; we cannot pin this pattern down manually, but we have data on it, so we may learn it from the data. In other words, this kind of learning is also known as learning by examples.

Machine learning comes in two different settings. There is the supervised setting. This is the general pipeline of a machine learning algorithm: you have all the data in the upper left corner; you translate the data into feature vectors, which is a common preprocessing step; then you feed those feature vectors to your machine learning algorithm. The supervised learning setting also supplies the labels, which are the set of expected results on this data. Then we generate the model from feature vectors and labels, and we use the model to predict for the future data in the bottom left corner of the figure.

A classical example of supervised learning is classification. You have two different groups of data, in this case, and you want to find a general rule to separate this data. So you find, in this case, a function that separates the data, and for future data you will be able to know which is the class. In this case it's a binary classification.
So you have two classes, and in the future, when you get new data, you will be able to predict which class is associated with this data. Another example is clustering. In this case the setting is called unsupervised learning. The pipeline is this one: you have the same preprocessing, but what you miss is the label part. That's why this is called unsupervised: you have no supervision on the data, you have no labels to predict. As for clustering, the problem is: get a bunch of data and try to clusterize it, in other words, to separate the data into different groups. So you have a bunch of data and you want to identify the groups inside this data. That was just a brief introduction.

So what about Python? Python and data science are very related nowadays. Python is getting more and more packages for computational science, and according to this graph, Python is a cutting-edge technology for this kind of computation: it's almost in the upper right corner, and it's actually replacing other technologies such as R or Matlab. One of the advantages of Python is that it provides a single programming language across different applications, and it has a very large set of libraries to exploit. This is the reason why Python is, almost, the language of choice nowadays for data science, and why it is displacing R or Matlab. By the way, there will also be a PyData conference at the end of the week; it starts on Friday, so if you can, please come.

For data science in Python: Matlab could be easily substituted by Python with NumPy, SciPy, and matplotlib for plotting, though there are many other possibilities nowadays, especially for plotting. R could be easily substituted with pandas, which is a great package in the Python ecosystem.
We also have efficient Python distributions that have been compiled for this kind of computation, such as Anaconda or Enthought Canopy, and we have projects like Cython, a very great project that allows you to boost the computation of your Python code.

The packages for machine learning in Python are manifold. I'm going to describe the set of well-known packages for machine learning code, and I would like to make some considerations on why scikit-learn is a very great one. We have Spark MLlib, PyML, the Natural Language Toolkit (NLTK), the Shogun machine learning toolbox (this morning there was a talk about it), scikit-learn, and of course PyBrain and MLPy. And there's a guy who set up a list of these on GitHub, where everybody can contribute to the list, in order to spread the knowledge about the packages available in different languages, and the Python section is very full.

So, Spark MLlib is actually implemented in Scala, not Python; there is a Python wrapper, which is called PySpark, but the machine learning library is at a very early stage. Shogun is written in C++ and offers a lot of interfaces, one of which is in Python. The other packages are Python-powered, so let's talk about those. The Natural Language Toolkit is implemented in pure Python, so no NumPy or SciPy allowed, while the other packages are implemented on top of NumPy and SciPy, so their code is quite a bit more efficient for large-scale computations.

NLTK supports Python 2, and its Python 3 support is at an alpha stage. PyML supports Python 2; whether it supports Python 3 is not so clear. PyBrain supports only Python 2, and the last two, scikit-learn and MLPy, support both Python 2 and Python 3.

What about the purpose of these packages? NLTK is for natural language processing. It embeds some algorithms for machine learning, but it is not meant to be used as a complete machine learning environment; it is mostly about text analysis and natural language processing in general. PyML focuses mostly on supervised learning, in particular on the SVM technique, the support vector machine; it doesn't have many algorithms beyond that. PyBrain is for neural networks, which is another family of techniques in the machine learning ecosystem. The other two, scikit-learn and MLPy, are somewhat general purpose: they contain algorithms for supervised and unsupervised learning and for some other, slightly different, machine learning settings. So from here on we will not consider PyML and PyBrain anymore, and we end up with these three libraries written in Python for our machine learning code.

So why choose scikit-learn? Ben Lorica, a big data guy, recommends scikit-learn for six reasons. The first one is the commitment to documentation and usability: scikit-learn has brilliant documentation, which is very useful for newcomers and for people without any background in machine learning. The second reason is that models are chosen and implemented by a dedicated team of experts, and the third is that the set of models supported by the library covers most machine learning tasks.
Fourth, Python and PyData improve the support for data science tools and data science problems. I don't know if you know Kaggle: Kaggle is a site where you may enter data science competitions, and scikit-learn is one of the most used packages for this kind of competition. The fifth reason is the focus: scikit-learn is a machine learning library, and its goal is to provide a set of common algorithms to Python users through a consistent interface. These are two of the features that I like the most, and I will be more precise about this in a few slides. And finally, last but by no means least, scikit-learn scales to most data problems. So scalability is another feature that scikit-learn supports out of the box.

If you want to install scikit-learn, you have to type very few commands. You need to install NumPy, SciPy, matplotlib, and IPython (IPython actually is not needed, it's just for convenience), and then you install scikit-learn. The other packages, NumPy and SciPy in particular, are required because scikit-learn is based on NumPy and SciPy. But anyway, if you install another distribution of the Python interpreter, such as Anaconda, it's all already provided out of the box.
It's already provided out of the box the design philosophy of sci-kit it's One of the greatest feature of this package, I guess In my opinion it includes all the batteries necessary for general purpose Martial learning code it has as it's it supports features for and functionalities for data and data sets feature selection extraction feature extraction algorithms martial learning algorithms in general in different settings so classification regression clustering and stuff like that and Finally evaluation functions for cross validation confusion metrics We will see some examples in the next slides the algorithm selection philosophy for this package is Try to keep the core as light as possible and try to include only the well-known and largely used Martial learning algorithms. Okay, so the focus here is to be as much general purpose as possible Okay, so in order to include a broad audience of users At a glance, this is a great Sorry, there's a great picture depicting all the the features are provided by sci-kit learn And this figure here is has been gathered by the documentation. This is a sort of map. You may follow to That allows you to choose the particular in martial learning techniques You want to to use in your martial learning code. There are some clusters in this picture There is regression over there classification clustering and dimensionality reduction and you may follow this kind of Path over there to decide which kind of which is the setting most suited for your problem. Okay the API of sci-kit is very intuitive and Mostly consistent to every martial learning technique There are four different objects. There is the estimator the predictor transformer and the model. 
These interfaces are implemented by almost all the machine learning algorithms included in the library. For instance, let's make an example. The API of the estimator is the method fit. The estimator is an object that fits a model based on some training data and is capable of inferring some properties on new data. For example, if we create a KNeighborsClassifier, the KNN algorithm, which is a classifier (so it's for classification problems, and thus supervised learning), it has the fit method. But the same holds for unsupervised learning algorithms such as k-means: the k-means algorithm is an estimator as well, and it implements the fit method too. For feature selection it is almost the same.

Then the predictor: the predictor provides the predict and the predict_proba methods. The transformer is about the transform method, and sometimes there is also the fit_transform method, which applies the fit and then the transformation of the data. The transformer is used to transform the data into a form that can be processed by the algorithms. Finally, the last one is the model: the model is the general model you create in your machine learning algorithm, and it exists for supervised and for unsupervised algorithms.

Another great feature of scikit-learn is the pipelines, because scikit-learn provides a great way to create pipeline processing. In this case you may create a pipeline of different processing steps, just out of the box. You may apply SelectKBest, which is a feature selection step; then, after the feature selection, you may apply PCA, which is an algorithm for dimensionality reduction; and then you may apply logistic regression, which is a classifier. So you may instantiate a processing pipeline very easily.
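A sketch of the pipeline just described; the concrete parameter values (k=3, n_components=2, max_iter=200) are my own illustrative choices, not from the slides:

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ("select", SelectKBest(k=3)),               # keep the 3 most informative features
    ("pca", PCA(n_components=2)),               # project them down to 2 dimensions
    ("clf", LogisticRegression(max_iter=200)),  # final step: a predictor
])

pipe.fit(X, y)              # fit runs every step in order
print(pipe.predict(X[:3]))  # predict flows new data through the same steps
```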
Then you call the fit method on the pipeline, and then the predict. The only constraint here is that the last step of the pipeline should be a class that implements the predict method, so a predictor. So far so good? Okay, great.

So let's see some examples of scikit-learn in action; this is a very introductory example. The first thing to consider is the data representation. scikit-learn is based on NumPy and SciPy, as you know, so all the data is usually represented as matrices and vectors. In general, in machine learning, by definition, we have the X matrix over there, which is usually identified by a capital letter because it is a matrix: a matrix of n different rows and d different columns. In this case, n is the number of samples we have in our dataset, and d is the number of features, the number of pieces of relevant information about each data point. So the training data comes in this flavor, and under the hood it can be implemented by scipy.sparse matrices; usually, if I'm not mistaken, it should be the CSR implementation, compressed sparse row. Finally, we have the labels, because we know the value for each of these data points.

The problem we are going to consider is about the iris dataset, and we want to design an algorithm that is able to automatically recognize iris species. So we have three different species of iris: Iris versicolor on the left, Iris virginica here, and Iris setosa here. The features we're going to consider are four: the length and the width of the sepal, and the length and the width of the petal. So every sample in this dataset comes as a vector of four different features.
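As a miniature of that X/y convention, with toy numbers typed in by hand (three iris-like rows, just for illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix

# X: n samples as rows, d features as columns (here n=3, d=4)
X = np.array([[5.1, 3.5, 1.4, 0.2],
              [4.9, 3.0, 1.4, 0.2],
              [6.3, 3.3, 6.0, 2.5]])
y = np.array([0, 0, 2])   # the labels: one value per sample

# The same data can be stored as a compressed-sparse-row matrix
X_sparse = csr_matrix(X)
print(X.shape, y.shape, X_sparse.format)   # (3, 4) (3,) csr
```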
scikit-learn has a great package to handle datasets. This particular dataset is very well known in many fields and is already embedded in the scikit-learn library, so you only need to import the datasets package and call the load_iris function. The iris object you get back is a Bunch object that contains different keys: it has the target names, the data, the target, a description of the dataset, and the feature names. The description is the verbose description of the dataset; the feature names are the four different features I already mentioned in the previous slides; the target names are the targets we expect on this dataset, in particular setosa, versicolor, and virginica, the three different iris species we want to predict.

Then we have the data. iris.data comes as a NumPy ndarray; the shape of this matrix is 150 rows times 4, the four different columns. The targets are 150, because we have a value of the target for each sample in the dataset. So the number of samples n in this case is 150, and the number of features d is 4. The target is a value that ranges from zero to two, corresponding to the three different classes we want to predict.

We may try to apply a classification algorithm to this data; we want to exploit the KNN algorithm. The idea of the KNN classifier is pretty simple. For example, if we consider a k equal to six, and this red dot is a new data point: we train our model with the training data, and we predict the class of this new point based on the classes of its six nearest neighbors. In this case it should be virginica, the red dot. And it's very simple in scikit-learn: just a few lines of code.
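In code, the loading step described above:

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()               # a Bunch object with several keys

print(iris.data.shape)           # (150, 4): n=150 samples, d=4 features
print(iris.target.shape)         # (150,): one label per sample
print(iris.target_names)         # ['setosa' 'versicolor' 'virginica']
print(iris.feature_names)        # sepal/petal lengths and widths, in cm
print(np.unique(iris.target))    # [0 1 2]: labels range from zero to two
```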
We import the dataset, we call the KNeighborsClassifier algorithm, in this case we select n_neighbors equal to one, then we call the fit method and we train our model. This is what we get: if you plot the data, these are called the decision boundaries of the classifier. And if you want to know, for new data, which species of iris has a 3 by 5 centimeter sepal and a 4 by 2 centimeter petal, you just check iris.target_names of knn.predict, because knn is a classifier, so it may fit the data and also predict after the training. And it says: okay, it's a virginica. So far so good, right?

Then, instead of facing this problem as classification, we may also face it in an unsupervised setting, so as a clustering problem. In this case we are going to use the k-means algorithm. The idea of k-means is pretty simple: we want to create clusters of objects where each object is assigned to the nearest cluster center, and that's it. In scikit-learn it's very simple: we create the KMeans, we specify the number of clusters we want, in this case three clusters, because we're going to predict three different species of iris. Then this is the ground truth, the values we expected, and this is what we got after calling k-means. As you may have already noticed, the interface for the two algorithms is exactly the same, even though the machine learning settings are completely different: in the former case it was supervised, in this latter case it is unsupervised.
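Putting the two examples side by side; the query flower is the one from the talk (3 by 5 centimeter sepal, 4 by 2 centimeter petal):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

iris = load_iris()
X, y = iris.data, iris.target

# Supervised: 1-nearest-neighbour classification
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y)
# sepal length/width = 3, 5; petal length/width = 4, 2
print(iris.target_names[knn.predict([[3, 5, 4, 2]])])   # the talk reports 'virginica'

# Unsupervised: k-means with three clusters, same fit() interface
km = KMeans(n_clusters=3, n_init=10, random_state=0)
km.fit(X)
print(km.labels_[:10])   # cluster ids, not species labels
```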
So, classification versus clustering. Finally, very few slides to conclude. Another great battery included in scikit-learn (and I don't know how many other machine learning libraries in Python are so complete in terms of batteries) is model evaluation. Model evaluation is necessary because we need to know whether our predictor, our prediction model, is any good, so we apply model validation techniques.

We may simply try to verify that every prediction corresponds to the actual target, but this is almost meaningless, because we are verifying on the very data we trained on. This kind of evaluation is very poor, because it is based only on the training set: we are just checking whether we are able to fit the data, but we are not testing whether the final model is able to generalize. And a key feature of these techniques is generalization. So do not conform too much to the training data, because you will end up with a problem which is called overfitting; you need to generalize, to be robust to noise and to be able to predict even new data that is not identical to the training data.

One usual technique in machine learning is the so-called confusion matrix. scikit-learn, in the metrics package, provides different kinds of metrics to evaluate your performance; in this case we're going to use the confusion matrix. The confusion matrix is very simple: it is a square matrix where the rows and the columns correspond to the classes you want to predict, so you have all the possible matchings of the classes you expect against the classes you predict. If all the data is there on the diagonal itself, then you predicted all the classes perfectly. Is that clear? Okay, great.
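For instance, deliberately evaluating on the training data, which is exactly the weak kind of check just warned about:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

X, y = load_iris(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)

# Rows: true classes; columns: predicted classes.
# Predicting the training set itself puts everything on the diagonal --
# a "perfect" fit, which says nothing about generalization.
cm = confusion_matrix(y, clf.predict(X))
print(cm)
```

With 1-NN each training point is its own nearest neighbor, so the diagonal sums to all 150 samples; that apparent perfection is the overfitting trap.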
Thank you. Another technique, very well known to those of you who are already aware of machine learning, is cross-validation. Cross-validation is a model validation technique for assessing how the results of a statistical analysis of the data generalize to independent datasets, not only to the dataset we used for training. scikit-learn already provides all the features to handle this kind of stuff, so it asks us to write very little code: just the few lines necessary to import the functions already provided in the library. Otherwise we would be required to implement this kind of function over and over, every time, in our Python code. So this is very useful, even for lazy programmers like me.

In this case we exploit train_test_split. The idea here is to split the training data into two different sets, the training set and the test set: we fit on the training set and we predict on the test set. In this case we see that there are some errors coming from this prediction; this is a more robust way to evaluate our prediction model.

So, the last couple of things: large scale, out of the box. Another great battery included in scikit-learn is the support for large-scale computation. Out of the box you may combine scikit-learn code with any library you want to use for multiprocessing or parallel or distributed computation, but if you want to exploit the features already provided for this kind of stuff: many techniques in the library accept a parameter called n_jobs. If you set this parameter to a value different from one, which is the default value, it performs the computation on the different CPUs you have in your machine, and if you put the value minus one, it exploits all of them.
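A sketch of the hold-out split together with a parallel cross-validation; note that in current scikit-learn these helpers live in sklearn.model_selection (the talk predates that module):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold-out validation: fit on one part, score on the held-out part
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print(clf.score(X_test, y_test))   # accuracy on data the model never saw

# 5-fold cross-validation, with the folds computed in parallel
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=5, n_jobs=-1)
print(scores.mean())
```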
So it is going to exploit all the CPUs you have on your single machine. This works in different settings, for different kinds of application in machine learning: you may apply multiprocessing to clustering (the k-means example we made a few slides ago), to cross-validation, or, for instance, to grid search. Grid search is another great feature included in scikit-learn: it identifies the parameters of a predictor that maximize the cross-validation score. So we want to get the best parameters for our model, the ones that maximize the cross-validation, so that it generalizes the best. That's just to give you the intuition.

This is possible thanks to the joblib library, which works in the background: under the hood, the n_jobs parameter corresponds to a call to joblib. joblib is well documented as well, so you may read its documentation for any additional details.
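Returning to the grid search just described, a minimal sketch (the candidate values for n_neighbors are my own toy grid):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try every candidate value and keep the one with the best
# cross-validated score; n_jobs=-1 uses all available CPUs via joblib
search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={"n_neighbors": [1, 3, 5, 7]},
                      cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```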
And last, but by no means least, scikit-learn meets other libraries: it can be integrated with NLTK, the Natural Language Toolkit, and with scikit-image, just to make a couple of examples. In detail, scikit-learn meets the Natural Language Toolkit by design: NLTK includes an additional module, nltk.classify.scikitlearn, which is actually a wrapper in the NLTK library that translates the API of scikit-learn into the API used by NLTK. So if you have NLTK code and you want to apply a classifier from the scikit-learn library, you import the classifier from scikit-learn, then you use the SklearnClassifier class from the NLTK package over there to wrap its interface; in this case it is LinearSVC, which stands for linear support vector classifier. And then you may also include this kind of stuff in a scikit-learn processing pipeline.

In conclusion: scikit-learn is not the only machine learning library available in Python, but it is powerful and, in my opinion, easy to use, with very efficient implementations provided. It's based on NumPy, SciPy, and Cython under the hood, and it is highly integrated, for example with NLTK or scikit-image, just to make an example. So I really hope that you're looking forward to using it, and thanks a lot for your kind attention.

[Host] Thank you very much, Valerio. We have six minutes left for your questions. Please raise your hand and I'll come by with the microphone.

[Audience] Well, thanks for the talk. I have two short questions. Does scikit-learn provide any online learning methods?

[Valerio] Yes. Actually, this is a point I wasn't able to include in the slides. Online learning is already provided: there are many classifiers, many techniques, that allow for a method which is called partial_fit, and you use this method to provide the model a bunch of data one batch at a time. So the interface has been extended with a partial_fit method.
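A minimal sketch of that partial_fit interface, using SGDClassifier (one of the estimators that supports it); the batch size of 30 is my own arbitrary choice:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier

X, y = load_iris(return_X_y=True)
classes = np.unique(y)   # partial_fit must be told all classes up front

clf = SGDClassifier(random_state=0)
# Feed the data one batch at a time, as if it did not fit in memory
for start in range(0, len(X), 30):
    clf.partial_fit(X[start:start + 30], y[start:start + 30], classes=classes)

print(clf.predict(X[:3]))
```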
So some techniques allow for online learning. Another very great usage of this partial_fit is the so-called out-of-core learning. In the out-of-core learning setting, your data is too big to fit in memory, so you provide the data one batch at a time, precisely because it is too big to fit in memory: you call the partial_fit method to fit your model, in the case of a classifier, one batch at a time.

[Audience] Thanks. Second quick question: is there any support for missing values or missing labels, apart from just deleting them?

[Valerio] In the case of online learning, right?

[Audience] No, just in general, for any machine learning.

[Valerio] For missing labels or missing data, what do you mean?

[Audience] So, if you have a feature vector that just misses a value at, say, the third component.

[Valerio] Actually, I don't know.

[Host] Thank you, I'll just come by.

[Audience member, a scikit-learn developer] So, we have a very simple imputer that's going to impute by median or mean in the different directions. If you have very few missing data, that's going to work well; if you have a lot, then you might want to look at matrix completion methods, which we do not have. We had a Google Summer of Code project on this last year, but it didn't finish. We welcome contributions, of course.

[Audience] Thank you. Hello. I have some experience with scikit-learn already, and I'm actually a mathematician: I had no idea about all the stuff under the hood, and I didn't want to dive too deep into the algorithms and the mathematics and such. The biggest problem for me was to realize what I was doing wrong. So if you've got some big dataset with features and labels, supervised learning, what would you advise to someone who doesn't know how it works inside? Which steps, which smaller, easier solutions, should I consider to improve the results of the classification?

[Valerio] Thanks.
[Valerio] Yeah. Actually, machine learning is about finding the right model with the right parameters. There are many steps you may want to apply when training the different algorithms. In general, you apply data normalization steps, so the first step I suggest is preprocessing of the data: you analyze the data, you run some statistical tests, some preprocessing, some visualization of your data, in order to know what kind of data you're dealing with. The second step is to try the simplest model you may want to apply, and then improve it one step at a time. Once you find the right model you want to use, then you are required to find the best settings for that model; in that case you might end up using the grid search method, for instance, which is provided out of the box, to find the combination of parameters that maximizes the cross-validation score. And of course it's training on the job, right? You may find the right model for your predictions, or you may find the worst model, and then you start over again and look for different models. I'll be around; hoping this helps.

[Audience] Yes, thanks.

[Host] Thanks again, Valerio. I think he's going to give a talk at PyData as well, on Saturday, isn't it?

[Valerio] Yeah, on Saturday.

[Host] So if you attend PyData, don't miss that talk as well. And yeah, thanks again.

[Valerio] Thank you very much.