So, I prepared these slides for a webcast with Andreas Müller, who is also one of the maintainers of the scikit-learn library. But because we only have 20 minutes here, I cut most of the slides. All right.

So I just want to quickly introduce the scikit-learn project. It's a machine learning library written in Python, for Python developers. How many of you do not know scikit-learn? Oh, more of you than I expected. Okay. So basically, the goal of scikit-learn is to make it easy to do predictive modeling, as was introduced previously, with in-memory data structures. So it's basically single-machine, non-distributed predictive modeling, or machine learning.

The project started in 2010, or at least we did the first release in 2010, and it has been growing in terms of contributors ever since. It's an open source project, now hosted on GitHub. For the new release (we try to release every one to six months), we had 160 unique contributors. And on the website, according to the Google Analytics statistics, we have approximately 150,000 unique users per month.

So, quickly, machine learning in scikit-learn. You start from the training data. If you are doing supervised machine learning, you want to predict some interesting value, for instance whether or not a user will click on an ad. So you need to collect both the input data (the description of the user, the ad, the environment) and whether or not past users clicked on that ad. In scikit-learn, you do that by importing a model class and giving it some parameters, with possible default values. Those are called hyperparameters in machine learning, because we like fancy words; the term is used in opposition to the internal parameters of the model that are learned from the data. The hyperparameters are the parameters that the user selects.

Then we pass in the training data. We are dealing with Python, so these are NumPy data structures. X is a two-dimensional input array: the columns are descriptors (features) and the rows are samples, individual cases. And y is a sequence of labels or target values, one for each row of the X table. The model then updates its internal parameters from the data, and it is able to make predictions on new test data. In general, you want to know whether you are predicting something better than chance, so you compute the accuracy of your predictions by comparing them to ground-truth test labels that you annotated manually, for instance.

So this is the scikit-learn API, basically. There are many different models, as model classes, but they tend to follow a very similar API, so you can switch between them and compare their accuracy. There are many applications; I'll skip that, you probably know about them already.
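To make that pattern concrete, here is a minimal sketch; the model class, the toy data, and the hyperparameter value are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hyperparameters are the parameters the user selects in the constructor.
model = LogisticRegression(C=1.0)

# X is 2D: rows are samples, columns are descriptors (features).
# y holds one target label per row of X.
X_train = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
y_train = np.array([1, 0, 1, 0])

# fit() learns the model's internal parameters from the training data.
model.fit(X_train, y_train)

# predict() applies the trained model to new, unseen test data.
X_test = np.array([[0.1, 0.9], [0.9, 0.1]])
predictions = model.predict(X_test)

# Compare predictions against manually annotated ground-truth labels.
y_test = np.array([1, 0])
accuracy = np.mean(predictions == y_test)
print(accuracy)
```

Because every estimator exposes the same fit/predict pair, swapping LogisticRegression for another model class leaves the rest of the snippet unchanged.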
So in the latest release, we introduced a bunch of interesting new features. One of the first, which is kind of unique among machine learning libraries, is what we call probability calibration. When you train a classifier, sometimes the classifier can give you an estimate of its confidence level. But most of the time, it lies to you: the raw confidence level doesn't mean anything. A good confidence level would mean that, of all the classifications made at a 0.8 confidence level, 80% of them should be correct, should be positive, for instance. That would be a calibrated confidence level. Most scikit-learn classifiers have a predict_proba method, and the raw predicted value is cheap to compute, but it's not necessarily good. By using an external calibration tool, we can fix this.

To diagnose whether or not your model is well calibrated, you can draw a calibration plot such as this one, and scikit-learn provides tools to make that easy with matplotlib. On the x-axis, you have the confidence level predicted by the model for a bin of the data set, and on the y-axis, you see the fraction of positive classifications in each of those bins. Here, this is a support vector machine: a linear model whose decision function tries to focus on samples that are hard to classify. You can see that most of the predictions it makes are close to 0.5, so it's not very confident. Because it's trained to focus on the hard cases, it's not very confident in its own classifications. In reality, if you look at all the samples predicted at a 0.6 confidence level, most of them are actually correct. So it's very pessimistic, and you cannot really use that value as a probability of correct classification, because it's making very good classifications most of the time, even when it says it's not very confident.

On the other hand, if you take logistic regression, it's naturally a probabilistic model that directly tries to estimate this probability, so you see that by default it's well calibrated. The accuracy is not necessarily better (it can be exactly the same as the previous model), but it's calibrated by default, and you can see that most of the examples here and here are close to the 100% or 0% confidence level. So here I compare logistic regression, the blue line, which is well calibrated; the support vector machine, the red line, which is miscalibrated in a pessimistic way; and naive Bayes, which also makes a mistake, but the opposite one from the supportector machine: you can see that it's overly confident, even when it's making mistakes, so you shouldn't really trust it. It's called naive for that reason, among other things. (We generated the data to make it fail, too; it may just be a hard problem in this case.)

So once you have identified that your model is not well calibrated, it's possible to use different calibration methods. One is called sigmoid calibration, the Platt method. It's a parametric method, so it's good when you don't have a lot of data, and it's specifically designed to address the issue of the support vector machine, so the pessimistic models. Isotonic calibration is a second method that doesn't make that assumption, but on the other hand, it needs more data to calibrate the model correctly. So if you take the support vector machine, the original model still shows the S-shaped, sigmoid-shaped calibration curve, the blue curve, and you can see that both the isotonic calibration (red) and the sigmoid calibration (green) move the model towards the diagonal, that is, a well-calibrated model. You can see that there is still an off-diagonal element here, but we discovered that it comes from a bug in the release; it's fixed in master but not released yet. And this one, too.
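In code, recalibrating a pessimistic model and checking the result looks roughly like this; the data set, the choice of LinearSVC, and the parameter values are arbitrary for illustration, and the module paths follow current scikit-learn conventions:

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Wrap a pessimistic, non-calibrated model in a calibrator.
# method='sigmoid' is the parametric Platt method (good with little data);
# method='isotonic' is non-parametric but needs more data.
calibrated = CalibratedClassifierCV(LinearSVC(), method='sigmoid', cv=3)
calibrated.fit(X_train, y_train)
proba = calibrated.predict_proba(X_test)[:, 1]

# Data for a calibration plot: per-bin fraction of positives versus mean
# predicted confidence. A well-calibrated model hugs the diagonal.
frac_positive, mean_predicted = calibration_curve(y_test, proba, n_bins=10)
print(list(zip(mean_predicted, frac_positive)))
```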
So for naive Bayes, you see that the sigmoid parametric method fails to fully calibrate the model, because it makes an assumption that is not true for the naive Bayes model, whereas the isotonic calibration method does not make that assumption and kind of works (except for the bug); it's going to work better.

So why would you want to calibrate? In most situations, you don't really care about the confidence level; you really care about the accuracy, the percentage of the time that the model makes a correct prediction. But for some specific situations, it's very important to calibrate, especially if your target prediction directly relates to your business metric. For instance, if you're doing ad auctions on a real-time bidding platform, you want to estimate the bid price. To do so, you know the cost per click, and what you want to estimate, for a specific user and a specific ad, is the click-through rate, the probability of a click. This is the real probability that you want calibrated, because it's directly related to the price that you want to bid in the auction. In most situations, though, if you're really interested in that, you can also decide to just use a calibrated model like logistic regression, instead of training a non-calibrated model and recalibrating it afterwards, because recalibration is expensive and it's cheaper to directly train a calibrated model in the first place. So think about that before calibrating.

Then, in scikit-learn 0.16, there are also a lot of improvements to make clustering algorithms more scalable. For instance, we have an implementation of DBSCAN, which is a very nice algorithm that scans the whole data set and, for each sample, builds a neighborhood of samples that are close to it within a fixed radius. It identifies points that have more than a minimum number of neighbors, say five, and labels them as core points. Those core points are interesting, because they are likely an important part of a cluster, and the algorithm expands clusters around them. If you have a good way to index the distances, and we have some in scikit-learn, the complexity of this algorithm is O(n log n), which is reasonably scalable.

So this is the outline of the algorithm. Say you start from point B. This is not a core point because, if the minimum number of points is 2 in this example, there is only a single point in that region. So you move on to the next one, and you identify that this is a core point, so you put it in a cluster. B is its neighbor, so you include it in the cluster as well, and you propagate, including all of those points in the same cluster. But N is outside the radius of any core point, so you label it as noise and leave it as noise. That's an interesting feature of DBSCAN: it can identify outliers and leave them out, rather than creating clusters for noise points. Another very interesting feature of DBSCAN is that it's able to find separation boundaries between clusters that are non-linear, non-convex, which can be useful, especially on low-dimensional data like geospatial data.

So in 0.16, we improved the implementation of that algorithm, and you can now label hundreds of thousands of samples without any problem, on the order of a second. That's a 30-times speedup compared to the previous implementation, which was really naive in several respects. I mentioned the pros already: the ability to discard noise, the ability to isolate non-convex clusters, and it's quite fast in general.
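A minimal usage sketch follows; the two-moons toy data set and the eps and min_samples values are arbitrary illustrations of the radius and core-point threshold described above:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: non-convex clusters that centroid-based
# methods like k-means cannot separate, but DBSCAN can.
X, _ = make_moons(n_samples=1000, noise=0.05, random_state=0)

# eps is the fixed neighborhood radius; min_samples is the number of
# neighbors required for a point to be labeled a core point.
db = DBSCAN(eps=0.1, min_samples=5).fit(X)

# Noise points (outliers) get the special label -1 instead of a cluster.
labels = db.labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(n_clusters, n_noise)
```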
So with that algorithm, you can tackle a couple of million points in a couple of minutes, with no problem. The con is that it's not fully scalable, because you still need an index of the whole data set, and for this index to be fast, it needs to be in memory; at least in scikit-learn, we only have in-memory implementations of those indexes.

So if your data does not fit in memory, we have implemented another algorithm, called Birch. Birch is an algorithm that scans the data and incrementally maintains a summary of the whole data set. It preserves enough statistics on the centroid locations, the cluster locations, that you can do a final clustering at the end without keeping the whole data set in memory. In practice, in scikit-learn, it looks like this. You create the model, Birch, as usual, but instead of calling fit on the full data set directly, you iterate over a data source that gives you chunks of data one at a time, and you call partial_fit on each chunk on the same model, iterating over the full data set. This full data set could be millions or billions of samples; it would just take longer, but it will use a constant amount of memory. At the end, you can see that Birch has subcluster centers, which are a compressed representation of the full data set. Finally, we have a slightly weird API to trigger the final clustering: you set the number of clusters that you want to extract out of that summary and call partial_fit once more without any data, and that triggers the global clustering. Then you can use the model to compute predictions on new data. If you want to change the number of clusters, you can do that at the end without re-scanning the full data set, so re-clustering the summary is very quick. (See the sketch below.)

So here is an example of the behavior of Birch on a uniform data set in 2D, where a perfect clustering would look like a checkerboard. You can compare this to mini-batch k-means on the right, which is another out-of-core clustering algorithm in scikit-learn. But mini-batch k-means does not maintain this intermediate data structure, so if it makes mistakes at initialization time, at the beginning, it tends to retain those mistakes. Whereas with Birch, since the final clustering is done at the end on the summary, it can fix them and get a better clustering than mini-batch k-means.
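Here is a rough sketch of that partial_fit pattern; the random chunks stand in for a real out-of-core data source, and the threshold and cluster counts are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.cluster import Birch

# n_clusters=None: during the scan, only build the incremental summary
# (the tree of subcluster centroids), with no global clustering yet.
# threshold controls how coarse the summary is.
model = Birch(threshold=0.1, n_clusters=None)

# Iterate over the data source one chunk at a time; memory usage stays
# constant no matter how many chunks the stream produces.
rng = np.random.RandomState(0)
for _ in range(100):
    chunk = rng.rand(1000, 2)  # one chunk of the (possibly huge) data set
    model.partial_fit(chunk)

# The compressed representation of the full data set:
print(model.subcluster_centers_.shape)

# Trigger the final global clustering on the summary alone: set the
# desired number of clusters, then call partial_fit with no data.
model.set_params(n_clusters=8)
model.partial_fit()

# Predict cluster membership for new data without re-scanning the stream.
print(model.predict(rng.rand(5, 2)))
```

Because the global clustering runs only on the summary, changing n_clusters and repeating the last two steps is cheap.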
So what's next? There is an ongoing pull request in scikit-learn to implement some neural nets; we didn't have neural nets until now. We have a constraint in this project that we don't want to introduce a dependency on GPUs or CUDA, which is often necessary for very large neural nets nowadays; we want to keep scikit-learn an easy-to-install library. If you really want to do deep learning and do research on that in computer vision, I would advise you to buy a GPU, or several, and use Torch, Theano, or Caffe, which are libraries dedicated to that. But if you just want to benchmark a baseline neural net on your data and compare it to a linear model or a random forest, just to know whether or not neural nets could be promising on your data, then this implementation might be enough. The hyperparameters are simply set in the constructor, and it follows the traditional fit/predict API of scikit-learn, so it's much easier to use than Theano, Torch, or Caffe, for instance. We plan two optimizers. One is good for small data: L-BFGS, which has no hyperparameters for the optimizer itself, so it's easier to use. And if your data doesn't fit in memory, we also implement SGD with Nesterov momentum, with the partial_fit API that we've seen with Birch, the same way to scan over data even if it doesn't fit in memory. And we try to provide a reasonable initialization of the weights, which is very tricky to get right, so we try to follow the state of the art.

There is another, final interesting feature in scikit-learn, which is useful for visualizing high-dimensional data in 2D, called t-SNE. I will skip the math, but what is interesting is that to compute the previous formula, you have to compute gradients that depend on all the possible pairwise distances between the elements of your data set. Basically, what the Barnes-Hut approximation that we are implementing right now does is ignore pairs of points that are too far away from one another when computing that gradient, using a quadtree data structure that makes it possible to find which points are far away from one another very efficiently. That reduces the complexity of the algorithm from quadratic to O(n log n), which is much, much better, especially when you go above 10,000 samples. So it makes it possible to run it on MNIST quite easily.

This is the scalability curve of the new implementation, in red, compared to the standard implementation that we already had in the last release; the y-axis is a logarithmic scale in seconds. You can see that the new implementation stays below a couple of minutes, whereas the other one quickly moves into hours and days. And this is the kind of representation you get on MNIST: the different colors are points from different digits in a digit recognition data set. I don't have time to explain or show more, but to get such nice clusters identifying the regions of your data set is really hard on that data set if you use something called PCA, which is the traditional way to do it; you would see a lot of overlapping points. t-SNE, however, is able to focus on the local structure of the data set and identify those regions this way.
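A minimal sketch of running the Barnes-Hut variant, using the small digits data set bundled with scikit-learn as a stand-in for MNIST (the parameter values are arbitrary, and the module paths follow current scikit-learn conventions):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# The small 8x8 digits data set, standing in for MNIST.
X, y = load_digits(return_X_y=True)

# method='barnes_hut' enables the quadtree approximation that skips
# pairwise interactions between far-apart points, bringing the cost
# down from O(n^2) to O(n log n).
tsne = TSNE(n_components=2, method='barnes_hut', random_state=0)
X_2d = tsne.fit_transform(X)

# X_2d has shape (n_samples, 2); coloring the points by y reveals one
# compact region per digit class.
print(X_2d.shape)
```

So it's a very nice way to explore a data set, especially if you integrate it with Bokeh, and I think you might see a demo of that later. Thank you very much. Thank you.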