Okay. Thank you. Thank you all for being here. It is a pleasure to be at Big Data Spain, and I am going to talk to you about some of the work we are doing now at my research group; this is an old picture of some of us.

I would like to start by saying that when we first entered the big-data area, a decade or more ago, the toolbox was almost empty and we were focused on the new platforms we could use. The platforms we have now all come with their own machine-learning library, so we have MLlib, or FlinkML, or scikit-learn, which we all use, and most of these libraries provide the classical supervised algorithms for classification and regression and the unsupervised ones for clustering. But there are more algorithms that are of interest, like feature-selection and ranking algorithms or anomaly-detection algorithms, which most of the platforms do not offer.

If we take a look at a picture of the data-industry landscape, up to now we have devoted a lot of effort to infrastructure development, and, as you can see, regulation and ethics — is there a pointer here or something? No — regulation and ethics was barely showing its head there, and now it is becoming more and more important, at the same time that data science is also gaining more importance than before. In the future, science will remain important, because we need more powerful and faster algorithms, but regulation and ethics will grow in importance even further.

As you probably know, in May the European Union started to apply the General Data Protection Regulation, and the European Commission set up a High-Level Expert Group on Artificial Intelligence whose mission is to elaborate the AI strategy for Europe. In September we had the first workshop on this; part of the workshop was devoted to regulation and ethics, while the other part was devoted to which kinds of applications Europe should concentrate on developing, since we do not have as much money to invest as China or the US. The words that were named in the regulation-and-ethics part were the ones you see down there: transparency of our algorithms, accountability, explainability, privacy of the data, and security, all of them also related to the GDPR.

So what does this mean for machine learning? Well, it means that we may no longer have black-box machine-learning algorithms with just an input and an output. For our algorithms to be valid we need a more multimodal approach to evaluate them. Of course we will still have to measure accuracy, because accuracy is important, but we will need more things. We will need to measure how scalable the algorithms are, since we are in a big-data context; we will need human-understandable explanations of our algorithms; our algorithms need to be interpretable, so a human expert can take advantage of the knowledge and the information derived from them; and other things like robustness, trust and confidence are important, because AI is a mainstream technology and we need society to trust it. Privacy of the data is also important.

So I am going to tell you very briefly about our work on privacy preserving in a distributed-learning context. As you already know, distributed learning is a paradigm we can use when we have data that is too big to be processed in one node.
So you can distribute it, or perhaps your data is already distributed at origin. And what happens, for example, if you have different companies that are willing to share the information or the knowledge derived from the data they have, but do not want to share their data, for several reasons? In that case privacy by design is interesting, and this is the context of the algorithms I am going to show you, in which we share parameter values over the network but we do not share data.

I will make this very brief. This is our LANN-SVD algorithm — not a very marketing-friendly name, as you can see. It is a privacy-preserving algorithm for classical classification or regression, and what we do is place a local learner in each of the nodes of your distributed context. The learner is a one-layer neural network, a little different from the standard ones; we had done this before in our research group. Instead of measuring the error at the classical measurement point, after the activation function, we use an approach with a measurement point before the activation function. Given certain restrictions on that function, we arrive at an equation which allows us to have a non-iterative method instead of the typical gradient descent. And if we substitute part of it — sorry, I do not have the pointer — if we use a singular value decomposition, we still have a non-iterative method, with another advantage: the computational complexity now depends on the minimum between the number of samples and the number of features. Before, we could only treat data that is big in samples but not in features.

And if we use, as you can see here on the right, a ring scheme for learning, what happens is that we learn in local node 1, which might be a company, and we transmit over the network the parameters of the equation down there; we send them to node 2, we learn there and update the parameters, and so on, until we finally end up having learned with all the data while only transmitting parameters over the network.

Just very briefly, this is an experiment comparing our results with a deep neural network working on the same ring scheme. The important thing here is that, as you can see, the area under the curve is very similar, perhaps a little better for our proposal, which is also a bit more robust because the standard deviation is lower. The nice thing is that the CPU time for our algorithm is half the time needed for the DNN. Another thing I think is remarkable is the batch time, which reflects the network overload, because the DNN is transmitting data as well, not only parameters; the difference between incremental mode and batch mode in our proposal is almost non-existent.

So we have a distributed machine-learning algorithm which is privacy preserving — which was our beautiful word — but we also obtain an algorithm that is scalable, is more accurate, uses less training time, provides incremental learning, which is an added plus, and is hyperparameter-free, which is also nice for big data and for treating streaming data, for example.
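To make the ring idea concrete, here is a minimal, hypothetical Python sketch of privacy-preserving learning over a ring: each node folds its private data into aggregated parameter matrices, and only those matrices travel around the ring, never the raw samples. The closed-form update shown (the normal equations of a linear model) is an illustrative stand-in for the actual LANN-SVD equations, and the function names are assumptions, not the published implementation.

```python
# Illustrative sketch of ring-based, privacy-preserving learning:
# each node only ever transmits aggregated model parameters, never raw data.
# NOTE: the closed-form update below (normal equations of a linear model) is a
# simplified stand-in for the actual LANN-SVD equations; names are hypothetical.
import numpy as np

def local_update(X, y, A, b):
    """Fold this node's private data into the running parameter matrices."""
    A = A + X.T @ X          # d x d aggregate -- reveals no individual sample
    b = b + X.T @ y          # d-dimensional aggregate
    return A, b

def ring_training(nodes, d):
    """`nodes` is a list of (X, y) pairs, one per participant in the ring."""
    A, b = np.zeros((d, d)), np.zeros(d)
    for X, y in nodes:                      # parameters travel around the ring
        A, b = local_update(X, y, A, b)     # only A and b leave each node
    return np.linalg.pinv(A) @ b            # non-iterative global solution

# toy usage: three "companies", none of which shares its raw data
rng = np.random.default_rng(0)
nodes = [(rng.normal(size=(50, 5)), rng.normal(size=50)) for _ in range(3)]
w = ring_training(nodes, d=5)
print(w.shape)   # (5,)
```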
All right. So, moving on from the classical supervised algorithms, I will now turn to an anomaly-detection algorithm. What is that? An anomaly-detection algorithm is an algorithm in which you want to detect class B but you only have data from class A. Normally that means we have a normal situation and we want to detect a fault in a machine, for example, or a fraud — something for which we have no data, and the data is normally not labelled.

The approach I am going to explain — well, we could use several — is a geometric one based on the convex hull. A convex hull is the smallest convex polytope that contains the full set of points, and the problem with it is that it is too costly for high-dimensional spaces and is prone to overfitting, because if we have an outlier out there we might end up with a shape like this, which is not what we want. So in our approach we approximate the decision using an ensemble of local decisions made on 2-D spaces that are random projections of the original space, like the example you see on the right. Our decision rule is: if I have a new point and I want to know whether it belongs to the normal shape that should contain the normal points, I apply the projections, and if the point lies outside the projected shape in at least one of those projections, then I have an anomaly.

Okay, I will go quickly over this. We have a learning algorithm, and what I want you to see is that it can be employed in a distributed or a non-distributed way. We fix the number of projections, calculate the expansion parameter and use the vertices and several centres, so we can treat situations in which, for example, the density of points is high in one part of the figure while other parts are sparsely populated. We also have a testing algorithm, in which we project the new point and obtain the local decisions, and finally we apply a global decision rule, which might be a sum or a majority vote.

What I want you to see is that we ended up with a distributed scaled convex hull algorithm which is privacy preserving, because in the training process we have one head training node, which might be any node on the network, transmitting the set of projection matrices P, and in the testing process we have the decision or testing node, which again may be any node, sending out the projections of the new test points; the testing nodes then send back their local decisions and the decision node applies the final global decision rule. So again, we are only transmitting parameters of the algorithm, not data.

Just very briefly, we tested the algorithm over several data sets; in bold are the best results, and in red those best results that are statistically different. As you can see, our algorithm, which is the SCH, behaves better than the other two, which are a state-of-the-art one-class support vector machine and an ensemble algorithm, also for one class, and this is the final count of the number of times we win or lose against both algorithms.
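As an illustration of the projection-based test just described, here is a minimal sketch, assuming SciPy is available: the training points are projected onto several random 2-D subspaces, and a new point is flagged as an anomaly if it falls outside the projected convex hull in at least one projection. The expansion parameter and the multiple centres of the actual algorithm are not modelled here.

```python
# Sketch of the projection-based one-class test: a point is flagged as an
# anomaly if it falls outside the 2-D convex hull of the training data in at
# least one random projection. The expansion parameter and multiple centres
# used in the actual algorithm are omitted for brevity.
import numpy as np
from scipy.spatial import Delaunay

def fit_projected_hulls(X_normal, n_projections=10, seed=0):
    rng = np.random.default_rng(seed)
    d = X_normal.shape[1]
    models = []
    for _ in range(n_projections):
        P = rng.normal(size=(d, 2))          # random projection to 2-D
        hull = Delaunay(X_normal @ P)        # triangulation of the projected hull
        models.append((P, hull))
    return models

def is_anomaly(x, models):
    # outside the projected hull in at least one projection => anomaly
    return any(hull.find_simplex(x @ P) < 0 for P, hull in models)

# toy usage with Gaussian "normal" data
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 10))
models = fit_projected_hulls(X)
print(is_anomaly(np.zeros(10), models))      # typically False (inside)
print(is_anomaly(np.full(10, 8.0), models))  # typically True (far outside)
```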
If we check the distributed performance, you can see on the right the area under the curve from one node up to 20 nodes; the results are similar in accuracy, a little better sometimes, but the reduction in time is important. So the algorithm behaves very well in a distributed way, and again, if we make the comparison, as you can see, on three data sets our results are also very good for this algorithm. So we ended up with something that is privacy preserving but is also more accurate than other algorithms — so we are not losing anything there — requires less training time and is scalable.

All right, so this is more or less all about privacy preserving; another topic is explainability. As you know, explainable AI, XAI, is one of the hot topics nowadays; there have been several workshops on it in the last two years, and it aims to obtain artificial-intelligence algorithms that give enough explanation to a human expert, so the results can be understood and are actionable — I mean, a human expert can do something with those results. What happens, unfortunately, is that there appears to be a kind of trade-off between explainability and accuracy, so the algorithms that are good in accuracy, like deep learning, are low in explainability, while some that are more explainable, like decision trees, are not so good in accuracy.

So this is our contribution, in an example of data segmentation. A data-segmentation algorithm aims at producing homogeneous groups of the actors that intervene in a market. For example, we have a matrix in which we have the users and the items they use, and we assign each pair a utility value of minus one or plus one, so we are able to devise homogeneous groups from these values — for example, item 1 and item M are in the same cluster for user 1 — and then I can run, I don't know, directed marketing campaigns for these homogeneous groups, or do something else with the results I obtain.
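As a toy illustration (with made-up numbers, not the real data discussed later), this is the kind of user-by-item utility matrix the segmentation starts from:

```python
# Toy illustration (hypothetical data) of the user x item utility matrix the
# segmentation works on: each entry is +1 or -1 for a (user, item) pair, and
# the goal is to find groups of pairs that behave homogeneously.
import numpy as np

users = ["user_1", "user_2", "user_3"]
items = ["item_1", "item_2", "item_M"]
#                    item_1  item_2  item_M
utility = np.array([[  +1,     -1,     +1],    # user_1: item_1 and item_M alike
                    [  +1,     -1,     +1],    # user_2: same pattern
                    [  -1,     +1,     -1]])   # user_3: opposite preferences
# Here item_1 and item_M fall in the same "+1" cluster for user_1 and user_2,
# so a single directed campaign could target that homogeneous group.
print(utility)
```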
What is the problem? Well, the problem is that the evaluation metric used is accuracy only, and so the results are very good: we have very accurate segmentation algorithms. The problem is that sometimes we cannot interpret the results. So, also very briefly, what we have done is devise a new quality metric with two quantities. On the one hand we use the weighted entropy, which is our measure of homogeneity: we cluster the utility function and, as you see in the equations, we take into account the sizes of the clusters. On the other hand, we also aim to achieve this with the smallest possible number of variables, so we keep the algorithm as explainable as possible, and we have a hyperparameter that balances both terms. How does it work? We construct a decision tree whose leaves are the clusters of the utility function, and we explore that tree using heuristics for greedy search; as we have implemented this in Spark, we can take advantage of the parallelism and explore the best subtrees, five or so at a time for example, and we also prune the tree.

Just to show the result, this is the tree we obtain with the data of one day of the Outbrain data set. I don't know if you know the Outbrain data set: it is a record of the advertisements offered to users on web pages that contain news. One day has more than one billion samples and more than six hundred attributes, and this is the tree we derive. We show the percentage of data over there, in green the weighted entropy, and the variable that makes the split together with the value it uses. The important thing here is that with our algorithm, with our measure, we ended up having 18 groups — you can count them over there — that use two, three or five variables each; as a whole we have used 12 out of the 647 attributes. If we compare the results with k-means, k-means obtains 100 groups and uses all 647 attributes. Of course, the weighted entropy in our case — well, I don't know if "of course" is the word to use here — our weighted entropy is a little worse, but it is still quite competitive, and the results can be interpreted by a human expert, because the five variables at most that we use per group are enough for a human expert to devise a strategy for that homogeneous group, for example. And as you can see, it is scalable. So we have a segmentation algorithm that is scalable, is accurate, uses less training time and is better at explainability than the other approaches.
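The talk does not spell out the exact formula, so the following is only a minimal sketch of a two-term quality score of the kind described: size-weighted entropy of the utility values in each tree leaf, plus a penalty on the number of variables used, balanced by a hyperparameter. The function names and the exact form of the penalty are assumptions.

```python
# Minimal sketch of a two-term quality score of the kind described in the talk:
# cluster homogeneity measured by size-weighted entropy, plus a penalty on the
# number of variables used by the tree, balanced by a hyperparameter `alpha`.
# The exact formula of the published metric may differ; this is illustrative.
import numpy as np

def weighted_entropy(clusters):
    """clusters: list of arrays of utility values (+1/-1), one per tree leaf."""
    n_total = sum(len(c) for c in clusters)
    score = 0.0
    for c in clusters:
        p = np.mean(np.asarray(c) == 1)              # fraction of +1 in the leaf
        ent = 0.0
        for q in (p, 1 - p):
            if q > 0:
                ent -= q * np.log2(q)
        score += (len(c) / n_total) * ent            # weight by cluster size
    return score

def segmentation_quality(clusters, n_vars_used, n_vars_total, alpha=0.5):
    homogeneity = weighted_entropy(clusters)          # lower is better
    simplicity = n_vars_used / n_vars_total           # lower is more explainable
    return alpha * homogeneity + (1 - alpha) * simplicity

# toy usage: two fairly pure leaves built from 2 of 10 available variables
leaves = [np.array([1, 1, 1, -1]), np.array([-1, -1, -1, -1, 1])]
print(segmentation_quality(leaves, n_vars_used=2, n_vars_total=10))
```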
There are a lot of things and perhaps some of them are not not of interest For certain situations So what happens with feature selection in all these portable? Devices well typically feature selection is performed or machines that use high numerical representation. That's We need 64 bits for make the calculation and these requirements are not met by embedded systems So we need to optimize the hardware resources What we have done is derive a new mutual information measure or just translate the mutual information measure to a reduced precision environment We have used the measure of dependence that you see here That's a ratio between probabilities and in a real situation like the one I have put there We need to measure the number of occurrences of events in certain situations so The probability that we calculate is that one that you can see here and in most real applications We don't need to store all decimal digits. Of course in other situations. We do need it and And thus this is a good We think a good context to or a good measure to use for reduced precision feature selection method I'm going to show you just some results For example here, we have two scenarios One with ten features as you can see here That's the real ranking and a more difficult one with 20 features the one that down there and we tested our approach Using a spearman's rock coefficient, which is plus one when the rankings are identical so the results are you can see there for Varying the number of samples and for 10 and 20 features as you can see the correlation is not very good for four bits But from eight bits and a head is quite good. We May say that the rankings are identical to the real ones using our reduced precision approach and They are even better with the number of samples when the number of samples increase the results are even better Okay, so the measure behaves as we wanted but How about Implementing this in a feature selection algorithm. So we did this with well-known feature selection method, which is MRMR perhaps, you know it and Over several synthetic and real datasets. We evaluate the true positive rate That's the proportion of features correctly identified using a gold standard the 64 bit Results and as some datasets have like five Thousand features we focus on the first K ones Well, this where our worst results. I'm not showing the best ones and as you can see in the Graph on the right 16 bits is enough to obtain the same ranking as the 64 bits in all situations while in the Left we can see that for the five top features is the same case for any But then some others are not so good Anyhow in average 60 bits are more than enough to select the same features of the full version Okay, so we select the and the same features and how about the classification and This is the classification accuracy. 
Okay, so we select the same features, but how about the classification accuracy? Using a classifier tends to obscure a little the results of the feature-selection process, so we have used a simple one, a nearest-neighbour algorithm with k = 3, so we do not need to tune parameters or do anything else with the algorithm. We estimate the error rate using a classical five-fold cross-validation, and we use a Friedman test with a Nemenyi post-hoc test. Here, in the lower part of the screen, we can see the results: for the 5 and 20 top features all situations are the same, and 16 bits is in the same group as the best algorithms, so to speak, according to the classification results; in fact the results are not statistically different even for four or eight bits. So the reduced precision does not affect the classification accuracy, and we think this is also a beautiful result, because we can implement this reduced precision in portable systems. So we ended up with a feature-selection algorithm that selects quality features; we have seen that with 16 bits we obtain the same rankings and the classification accuracy is just the same, so we get computational runtime and memory benefits with this reduced-precision approach.

And this is all, thank you. So I am ready for your questions, if you have some. Do we have a microphone or something for the questions?

Okay, test. Yeah, it works. I have a question about the privacy method you explained. One of my daily dilemmas is that, on the one hand, I really benefit from machine-learning algorithms, but on the other hand I don't want to share my data. For example, I really benefit from Google Maps traffic-jam prediction, but I don't want to give them all of my location data. In your graph, all the data remained on node one — so could this node one actually be my mobile phone, for example? Could I keep all my data on my phone while still benefiting the larger group?

Yeah, sure, that is the idea of the algorithm. You have your data in your different nodes — for example, I don't know, a mobile telephone, or a company, or a bank site, or whatever — and you learn in one node, and when you have learned, you transmit only the parameters to another node. This was a ring scheme, the one I showed; you can use it in another kind of scheme, I just thought it was easy to show how it works. But yes, that is the idea.

And the computational hard part is actually transferred to the high-end nodes, right? Or what is the...?

Yes, but regarding the transfer I showed, we do not overload the network, because you only transmit parameters. That was the difference I tried to show between the DNN strategy and the LANN strategy: the difference in time in our approach was negligible, because we are not sending anything heavy over the network, and if communications are interrupted for some reason, you are only transmitting parameters, so there is no knowledge about the data that was used before.

Okay, thank you. — Sure.

Yes. The idea is that you can have all your samples in one node and use distribution as a way of coping with the big-data problem, so you might distribute the data yourself.
There is no problem with that. And if the data is already distributed, then of course there are things that I have not covered, like what happens if you have, I don't know, samples that are imbalanced, for example — which is something we have also worked on in our research lab, but not in this algorithm — or what happens if you have features that are present only in one node, while the rest of the nodes are not homogeneous in that respect, so you have different features. All of these are also things that you can find in your real problem and that we have worked on, but that I have not presented here. But yes, that is the question: if you need to distribute a sample, what do you do — do you distribute it by samples or by features? Because if you have a lot of features, perhaps it is more interesting to distribute vertically rather than horizontally, and how to join everything together also needs its own strategies. I don't know if that answers your question. Yeah, okay. Any other? Yes, there is a question there.

Hi, thanks for the talk. In terms of interpretability, since we humans have a kind of limited comprehension of the probabilistic functions and all the latent mathematical structures that, in the end, the models are learning, don't you think there is a kind of trade-off between understanding the models and the bias in them, given that limited comprehension? If you think that is going to happen, how can we try to solve that problem? Because if we understand the models better, then it is probable that we are going to reach solutions that are not optimal — we can interpret them better, but they are less accurate.

Yeah, I understand. It depends on the application, I imagine, because of course you cannot avoid the need for accurate models. However, for example, let's imagine a health situation in which you have a DNN approach and another one, and the DNN is more accurate; depending on what you want to do with it, perhaps you use the less explainable one, because it works. But in some cases the problem is that you cannot make use of the knowledge involved in that algorithm, because all you know is that there is an input and there is an output, and the output is as accurate as possible, but you have learned nothing. The approach I have proposed is: okay, there are situations in which a slightly less accurate algorithm, like the one I have presented here — which is preliminary work at the moment — can help us interpret the results we obtain. For example, with k-means — and we also limited the depth of our tree, while in k-means there is no limit — k-means uses all the variables and is more accurate, but then what happens if I have a human expert who, for example, needs to work out how to design a campaign? He or she cannot make use of the results, because it is impossible to interpret them. So, depending on the situation, at the moment explainable AI may be of use.
Of course, perhaps there are situations that we cannot handle yet — I don't know about the future — but I think that explainable AI is a challenge, because to obtain trust in our algorithms we need to be able to explain what we do, and of course because of the law: as you know, the GDPR says that if an algorithm takes a decision that affects you, you can ask for an explanation, and with certain algorithms you simply do not have one. I don't know if that answers you completely.

Yeah, thanks. I think that the idea of combining these distributed systems with more explainability kind of solves the problem, because the main issue people used to give as a premise is that we do not have enough computational power to do that in a more explainable, simpler way. So I think it is very interesting work. Thanks, and congrats.

Thank you. So, more questions? Okay, thank you very much.