The title of our next talk is quite descriptive without me giving too much of an introduction, I think. Basically, we live in a world in which algorithms make more and more decisions about our daily lives. Many people are working on improving these algorithms; not as many are actually thinking about the implications. "Say hi to your new boss: how algorithms might soon control our lives" is the title of our next talk. Please give a warm welcome to Andreas Dewes.

So, hello everyone. I have to say I'm really quite excited to be here again. I'm a bit terrified as well, but mostly excited. First and foremost, I want to thank the organizers for inviting me again and for letting me speak at this really great conference. As was said before, the title of my talk is "Say hi to your new boss", and I'm going to talk about algorithms and about shifting decision power from humans to machines.

In case you were asking yourself why this is important: well, let's just ask a friend. I usually like to do that with Google autocomplete, and normally this gives me some controversial statements, like "algorithms are stupid" or "algorithms will never work". But in this case it seems pretty unambiguous that algorithms will play a very large role in this world. As I said, this is a big chance, because algorithms can improve our lives a lot, but it's also a problem, because we're shifting a lot of the decisions that are now made by people into the hands of machines, and in many cases we don't understand much about how these machines work and how exactly they make their decisions.

I would say my qualification for giving this talk is that I have shot myself in the foot with data analysis a lot of times, and I became interested in how and why algorithms do so many things that we don't anticipate, and why they sometimes behave in ways that seem strange and contradictory to what we actually want to achieve with them. So that's what I want to talk about, and we're going to do it like this: first, I will give you some theory about what an algorithm actually is, how machine learning helps algorithms to make decisions, and how this whole big data thing and the new data-driven society play into this whole affair. Then I will show you some of the use cases for algorithms in our daily lives today. After that, we will be equipped with everything we need in order to start with some experiments. I'm coming from physics, and when I try to understand something, I usually do an experiment and try to break the thing, or make it explode, or whatever. So we're going to do the same thing here with our algorithms, and I have picked out two case studies that I'm going to present: one about discrimination through algorithms, and another one about de-anonymization. Finally, I want to end with some proposals and some ideas on how we can actually make the most of algorithms, and also how we can control and better understand what algorithms are doing.

Okay, so first, as I said, I want to talk a bit about algorithms. I just want to give you a very basic overview of machine learning and decision-making by algorithms, so please excuse me if there are any experts in the audience; I'm probably making a lot of simplifications here. What is an algorithm?
Here I give you an example. Basically, an algorithm is just a recipe that can be followed by a computer or a human being, and it gives that human being or the computer step-by-step instructions to achieve a certain goal. In this case, we want to activate a trap door, and we want to do that only if I'm standing on the trap door. So the algorithm has to decide if it's me, and if it's me, it can open the trap door; otherwise it has to wait. Now, this is a pretty fancy algorithm, because it needs some information about me, and it needs a kind of intelligent way to decide whether it's the right person standing on the trap door or not.

So how does the algorithm get that information? Well, it uses machine learning. Machine learning is a way to automatically generate a model that we can check against some training data, which we can then use to explain that data and, in addition, to also predict some unknown data. As you might know from school, just memorizing data and reproducing what you already know can get you through tests, but normally it won't make you pass with flying colors. So ideally we want to have something that can, in addition to memorizing data, also make predictions about data that we have never seen before, and this is what machine learning helps us to do.

In a bit more formalized way, we can look at it as a model and some data. Here on the right I show you several possible models that we can choose from. Normally we can write them as explaining some variable y as a function m (for "model"), which takes some attributes or variables x and some parameters p and returns a value for the quantity that we want to predict: y = m(x; p). Now we can use data to train our models: we select the models that are compatible with our training data and eliminate the ones that are not. You can see here on the right that we have eliminated all the models shown in red, whereas the models in green are compatible with our data. And now we can use those models to make predictions about unknown data points as well, which is shown here.
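As a rough illustration of this train-and-eliminate idea, here is a minimal sketch in Python; the model family, the data, and the compatibility tolerance are all invented for illustration:

```python
import numpy as np

# Minimal sketch: keep the candidate models ("green") that are compatible
# with the training data, discard the rest ("red"), then use the survivors
# to predict an unseen point. All numbers here are made up.
rng = np.random.default_rng(42)
x = np.linspace(0, 1, 20)
y = 1.2 * x + 0.4 + rng.normal(0, 0.1, size=x.shape)  # hidden truth + noise

# Candidate models y = m(x; p) -- simple lines with random parameters p.
candidates = [(rng.uniform(0, 2), rng.uniform(0, 1)) for _ in range(5000)]

def model(x, p):
    slope, intercept = p
    return slope * x + intercept

# Keep only candidates whose mean error on the training data is small.
compatible = [p for p in candidates
              if np.mean(np.abs(model(x, p) - y)) < 0.2]

# Use the surviving models to predict an unseen data point.
x_new = 1.5
predictions = [model(x_new, p) for p in compatible]
print(f"{len(compatible)} compatible models, prediction for x={x_new}: "
      f"{np.mean(predictions):.2f} +/- {np.std(predictions):.2f}")
```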
Usually you see that there is some error, some discrepancy, between the model and the data that we try to explain, and this error is usually called epsilon. Now, epsilon can be decomposed further into several parts. There is a systematic error, which is mostly due to miscalibration of the measurements that we make each time we try to measure a given variable. We can think of this as, for example, the speedometer in your car, which intentionally gives you a reading that is too low, in order to make sure that you don't overstep the speed limit. In addition to the systematic errors, we also have some noise in our data, which is due either to the internal process that has generated the data, or to the measurement apparatus that we use to capture the data. Finally, we have some hidden-variable errors, which are not random noise, but errors due to variables that have an impact on the outcome of the model, which we do not know and which we therefore cannot use to model the data.

So that's the basics of model generation. Now, you have probably all heard about big data and the data-driven society, and the effect that this has on model generation is threefold. For one, you can see here, for example, more or less the data volume in 2000 compared to the data volume in 2015: today we have a lot more data on our hands to make predictions and train models. We also have data of a much greater variety than before.

To understand this effect, we can have a look at this graph here, which shows some random data that we measure with a pretty large noise, as you can see. This data also contains some information, and I don't know who of you can tell me whether the green points or the red points have a higher value. I guess not. But what we can do is just take that data and average it, and by doing that we can reduce the amount of noise in our data. When we have enough samples to look at, we can make the noise so small that we can really detect some signal in the data; in this case, the signal is just 0.01 high. So having more data on our hands allows us to train models which can take smaller effects into account.
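A minimal simulation of this averaging argument, with made-up numbers matching the 0.01 signal from the slide:

```python
import numpy as np

# Two groups whose means differ by only 0.01, buried in noise that is
# 100x larger. All numbers here are invented for illustration.
rng = np.random.default_rng(0)
n = 1_000_000                        # "big data": lots of samples per group
green = rng.normal(0.00, 1.0, n)     # group mean 0.00, noise sigma = 1.0
red = rng.normal(0.01, 1.0, n)       # group mean 0.01, same noise

# A single sample tells us nothing -- the noise dominates completely.
print("one sample each:", green[0], red[0])

# Averaging n samples shrinks the noise by sqrt(n), so the 0.01 signal
# (standard error ~0.001 here) becomes clearly visible.
print("averages:", green.mean(), red.mean())
print("difference:", red.mean() - green.mean())  # ~0.01
```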
Also, as I said, big data does not only give us more of the same data; it gives us different kinds of data. Think, for example, about all the smart devices that you have in your home, like your smart fridge, your door, maybe an automated smoke detector, which all collect data about you and your interactions. We can incorporate that data into our models to make better predictions, and so this moves some of the noise that was in the hidden variables into the model, where we can use it to make predictions.

Now, interpreting models can be hard or very simple, depending on the model. There are some machine learning algorithms, like decision tree classifiers, which are pretty easy to interpret, because we can just follow this graph here and see exactly how the algorithm makes its decision about a given data point. Other models, like this neural network here on the right side, are really hard to interpret, so we can't get an intuitive feeling for how the model actually makes its decisions. In fact, you have maybe seen these pictures here; they show, basically, a neural network working in reverse. They give us an idea of how the neural network understands a picture, in this case. You can see, for example, that there are several structures that emerge at different places in the image, generated by the neural network while it's recognizing the features of the image. This method was actually developed because it's really difficult to understand what a neural network is doing otherwise; the only way we have is to look at what kind of input data produces a certain output of the network.

So, what can you do with algorithms? Here I have tried to classify the uses of algorithms in our daily life into three different risk groups. You could say there is a low-risk group, which affects our lives only superficially; if the algorithms that make the decisions there go wrong or misbehave, it would only be mildly annoying. Then we have the medium-risk area, where failure or misbehavior of an algorithm would be a bit more severe for our lives, but wouldn't be fatal. That happens only in the high-risk area, where algorithms really can make decisions that affect human lives and that can really be life-changing in that sense.

A few examples for the first group would be personalization services. Whenever you go to a website like Facebook or Amazon or Netflix, the website shows you some content, and it tries to show you content which is very interesting to you. It uses an algorithm to do that, trying to predict, from the articles that you have viewed before, which articles you will find interesting. This is a so-called recommendation engine, and it's in wide use in all kinds of services today.

We also have, of course, individualized ad targeting. You might notice that if you go to some website, view a product, and afterwards surf around on the web, ads for this kind of product seem to haunt you everywhere you go. This is also due to machine learning algorithms that try to predict which kinds of ads you will find interesting and show these ads to you on all kinds of different websites.

And of course there are algorithms that do customer ratings. For example, if you want to order a product online, they could estimate how likely it is that you will pay the invoice for that article, and if that likelihood is not very high, the system would only send you the article if you pay in advance. And there are things like customer demand prediction, where the holy grail would be to actually know what you want to buy before you know it yourself, and then send it to your door; after reading a patent, I think this is also what Amazon is trying to do in some cases. So these things affect our lives only superficially, and if something goes wrong there, it doesn't affect us in a very deep way.

There are other uses of algorithms in our lives. For example, a big topic that is coming up now, with big data and more data that we can collect about individuals, is personalized health: making decisions about possible treatments and lifestyle based on data that we collect about you, for example your heart rate, your pulse, how much you move around, how many stairs you climb each day. This has a large potential for improving areas such as medicine, for example, but also other ones, and
So this is a large potential for improving for example areas such as medicine, but also other ones and We use the same or similar classification algorithms as and the applications that I showed you before So another thing is person classification. So Here we want to predict for example how likely it is that a person will commit a crime or Will be a terrorist and these are kind of algorithms are already in use today by for example governments to like issue restricted travel permits and to like Mark some people that have a high risk profile due to the algorithm for screening I think there are many talks here also that deal with this problem a problem especially and Of course, there are autonomous cars planes and machines Which are currently being developed or already in service and which will take over like driving from people in a few years or a few decades maybe and Finally there's automated trading which is mostly invisible to us But which has also a huge impact because 95 or even 99% of all trades today are actually performed by algorithms and not by machines anymore, so Finally there's the high risk area Where we have such things like military intelligence and in order intervention We also have already some governments that already use algorithms to like predict Targets for for example drone strikes and we also can have of course governments that use machine learning and algorithms for political oppression So for example to train firewall systems using heuristic algorithms to detect the traffic that should be filtered out and There's also critical infrastructure services like the electricity grid or other things that are Like critical to us and which are also sometimes governed or like controlled by algorithms already So as you can see already today We have many areas in our life where algorithms and not humans make the decisions And if you would like plot this again on this graph You would see that that most things where algorithms decide today are actually in the green on the yellow area And we have some things that there might be touching the critical part of our lives And now what big data and advanced in machine learning will do in the coming years is probably to make To both widen the applicability of algorithms so we can use them for domains where we couldn't use them before like speak speech recognition Customer service and many other things and we will also like penetrate deeper into our lives So making decisions which really can affect us on a more personal more intimate and more critical level Good, so this is all I wanted to show you in theory and now I want to use the remaining time to Show you two experiments, which I did So there are lots of things that can go wrong when you use algorithms But I picked two topics here that I find especially important and the first topic that we are looking at Is discrimination to algorithms so here the question is can an algorithm that is trained by human or by an Early and manual decision process actually also discriminate against certain groups of people You know like discrimination still is a very big problem in our society and we have like fought for many many years to like push it back And the question is of course now as we shift so much of the decision power from humans to machines Can we actually? 
eliminate the discrimination that we still have in the system, or are we going to carry it over into this automated decision-making?

This is the definition of discrimination again: a treatment or consideration of a certain person that is made based on his or her group, class, or category, and not based on his or her individual merit. That means we prefer, or put at a disadvantage, certain kinds of people according to their group or some protected attribute, which can be, for example, the ethnicity, the gender, or the sexual orientation of that person.

Now, of course, we need a way to measure this discrimination. The measure that I chose here (there are several, of course) was developed in the US and is called disparate impact. It's quite nice because it uses a very clear and simple mathematical model to describe discrimination. Basically, this model says that we have a process C which acts on people that either have a given attribute X or don't have it, for example men and women. We measure the outcome of this process, and we are interested in the probability of the decision being "yes" for a member of the group X versus the probability of it being "yes" for a member of the other group. So we can just look at the conditional probability of making it through this process as a member of group X = 0, divided by the probability of making it through the process as a member of the other group:

τ = P(C = yes | X = 0) / P(C = yes | X = 1)

When we divide these two quantities, we get the parameter τ, which describes the amount of discrimination that we have in the system. For normal purposes we can choose a given threshold for τ, for example 80%, and if we see that τ is smaller than that, we can say: this process contains discrimination. This is nice because it measures discrimination not only when it's done intentionally, but also when it's happening inadvertently, without anyone wanting it. It doesn't really matter whether the people in this process want to discriminate; if they do it nevertheless, maybe unconsciously, this measure can give us an idea about it. Of course, in practice we can't deal with probabilities directly; we have to count the number of people in each category, and then we can estimate this parameter by dividing this number here by these two numbers, and dividing that again by this number divided by the other two numbers. So it's pretty easy, very straightforward.
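As a small sketch, this is what that count-based estimate of τ looks like in code; the counts are invented for illustration:

```python
# Minimal sketch of the disparate-impact estimate described in the talk.

def disparate_impact(yes_x0, no_x0, yes_x1, no_x1):
    """tau = P(C=yes | X=0) / P(C=yes | X=1), estimated from counts."""
    p_x0 = yes_x0 / (yes_x0 + no_x0)   # success rate, protected group
    p_x1 = yes_x1 / (yes_x1 + no_x1)   # success rate, other group
    return p_x0 / p_x1

# Example: 30 of 100 protected-group candidates invited vs 60 of 100 others.
tau = disparate_impact(yes_x0=30, no_x0=70, yes_x1=60, no_x1=40)
print(f"tau = {tau:.2f}")  # 0.50 -- below the 80% rule, so this process
                           # would count as discriminating under the measure.
```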
Now I want to show how we can use this to test a given process where we take decision power from people and give it to an algorithm. The example that I chose here is an HR process, or hiring process, where we want to select candidates based on their data, for example the CV and other data about themselves that they submit to a potential employer. The benefits of this are, of course, saving time in the screening process and also improving the choice of candidates. I chose this example because it's something that is actually already widely done, so chances are that if you have applied for a job recently, you have probably been subjected to this kind of process, and there are also several startups in the US, but also in Europe, that try to implement these kinds of data-driven hiring processes. So it's something that's really already happening.

Okay, again, the setup is pretty simple. We have some information about the candidate that we submit to a human reviewer, and that human reviewer makes a decision whether to invite the candidate or not. The reviewer also gives that information to an algorithm as training data, and the algorithm then tries to replicate the decision of the human whether to hire the candidate or not. So, the setup, as I said: we use the CV, any work samples, and other publicly available information about the candidate that we can get as input. We then use a human to make the decision about a given candidate, either yes or no, and we train the algorithm on this data. The approach that we have here is a so-called big-data approach: we basically try to get as much data about the candidate as we can, put it all into the algorithm, and let the algorithm figure out what to do with it.

The decision model for this is rather simple; I show it here. In order to decide whether we hire a given candidate, we can define a score function s, which has several parts. One part is the merit score of the candidate, which is based on his or her abilities. Another part is a discrimination malus or bonus; you can see this as increasing or decreasing the total score of the candidate based on his or her membership in a given group. And then, of course, we also have some element of luck, which we have set to 20% here, for example. We add these components together, and if the total is larger than a given bar, we invite the candidate; if not, we don't invite him or her. And as you can see here, the bar has a different height depending on the group of the candidate if there is discrimination in the system.

Okay, now we can train a predictor for that model, to which we give the information about the candidate and also a lot of other information, which we call Z here: everything else that we can find, for example, in public records or in other information that we can get our hands on. Then we train this predictor to predict the outcome of the hiring process, and we can see what the results are. Since it's pretty hard to get our hands on real-world data, what I did instead was to simulate 10,000 samples of an agent-based model, where we just choose a function C and some disparate impact τ and generate training data with that. Then we can use a standard machine learning algorithm, in this case a support vector machine, to train on that data and measure the discrimination that the algorithm produces.
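A rough sketch of how such an agent-based simulation might look, using scikit-learn's support vector machine; all concrete numbers (weights, thresholds, the shape of the extra information Z) are my own assumptions, not the parameters actually used in the talk:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
n = 10_000

x = rng.integers(0, 2, n)                # protected attribute, e.g. gender
merit = rng.normal(0.5, 0.15, n)         # merit score of each candidate
luck = 0.2 * rng.random(n)               # 20% element of luck
malus = np.where(x == 0, -0.1, 0.0)      # discrimination against group X=0

# Biased human decision: invite if merit + malus + luck clears the bar.
invited = merit + malus + luck > 0.7

# Training features: the merit (from the CV) plus "everything else" Z,
# which correlates with the protected attribute and lets it leak in.
z = rng.normal(0, 1, (n, 5)) + x[:, None]
features = np.column_stack([merit, z])

pred = SVC().fit(features, invited).predict(features)

# Disparate impact of the trained predictor's own decisions.
tau = pred[x == 0].mean() / pred[x == 1].mean()
print(f"tau of the algorithm's decisions: {tau:.2f}")
```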
Okay, this is shown in this graph here. It's a bit complicated, so let's go through it one by one. What we see on the x-axis is the amount of information that our algorithm has about the attribute X of the candidate, the protected attribute whose information we don't want to give away. If we are at zero, it means that the algorithm doesn't have any information at all about this attribute of the candidate; if we're at one, it means the algorithm has full information about the protected attribute; and if we are at 0.5, it means that it has the correct information in about 50% of the cases. Then we have our parameter τ, the disparate impact, which we have set here to 0.5; this means that the chance of making it through the process is twice as high for members of one group as for members of the other group. Above, we see the prediction fidelity of our algorithm, which is between 86 and about 90%, and which also increases as we increase the rate γ at which the information leaks into the system. And finally, we have here the τ of the algorithm, so the amount of discrimination done by the algorithm, measured again as a function of the information leakage γ.

What this means is: the more information about the protected attribute we provide to the algorithm, the better it is able to discriminate against people in that group. If we are here, and the algorithm doesn't have any information at all about the protected group, it can't discriminate against those people, so the ratio of success between the individual groups is one. This is actually great, because it means that if we can build an algorithm that doesn't have any idea about these protected attributes, we can eliminate all the discrimination that's in the system. On the other hand, if by some accident the algorithm gets full information about these attributes, it can discriminate just as well as a human against people in either group. That means if we give too much information to our algorithm, we will have the same problem in the hiring process as before; we will again have discrimination against people, not by humans this time, but by machines.

Now you will probably say: okay, this is stupid, why would we give information about this protected attribute to the algorithm? And of course the answer is: normally, we don't. But the problem with big data, with having a lot of different data types and data sources at hand, is that even if we don't give that information to the algorithm explicitly, some amount of information about the attribute X leaks through with all the other information that we provide. This is basically the essence of the dilemma of having too much data on our hands: it is always very hard to keep information about sensitive things from leaking into our data set.
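As a hedged sketch of the sweep behind this graph, one can hand the classifier a noisy copy of the protected attribute and watch how the disparate impact of its decisions changes with the leakage rate; the decision model below reuses the invented parameters from the previous sketch, and the exact meaning of γ here is my own simplification:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
n = 5_000
x = rng.integers(0, 2, n)
merit = rng.normal(0.5, 0.15, n)
invited = merit + np.where(x == 0, -0.1, 0.0) + 0.2 * rng.random(n) > 0.7

for gamma in (0.0, 0.5, 1.0):
    # With probability gamma the feature is the true attribute,
    # otherwise a coin flip -- partial information leakage.
    leaked = np.where(rng.random(n) < gamma, x, rng.integers(0, 2, n))
    features = np.column_stack([merit, leaked])
    pred = SVC().fit(features, invited).predict(features)
    tau = pred[x == 0].mean() / pred[x == 1].mean()
    print(f"gamma = {gamma:.1f}: tau = {tau:.2f}")
# Expected trend: tau is near 1 with no leakage and drops toward the
# human decision's disparate impact as gamma approaches 1.
```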
Of course, this is a purely theoretical formulation so far, but I actually tried to validate it using publicly available data. What I did was to get GitHub user data, which we can obtain through an API and which gives us information about all the people on GitHub.

First, we need, of course, information about the protected attribute; in this case we choose the gender, so either man or woman, and to do that we have to manually classify the people that we put into this study. What I did was just to look at profile pictures on GitHub, about five thousand of them, and classify the people into men and women. This gave us the training data for this kind of simulation. What I did then was to retrieve additional information about each user, for example the number of projects on the website, the number of stargazers, the number of followers, etc., so everything I could get my hands on. Then I used that data to make a prediction about the gender of the user, just based on the information that I put into the system. And I want to say again: this is only a proof of concept. I used a very small data set, and I didn't do any optimization; I just wanted to see how easy it actually is to get this kind of contamination into our algorithm.

At first, when I used only very basic features, like the number of stargazers or followers of each user, I couldn't get any prediction about the gender of the person. And I mean, this is already great, because if your colleague says, "oh, you know, women are not good programmers", you can now show him this data and basically disprove him, because it's not possible to predict the gender from this publicly available GitHub data. But for me it was, of course, a bit disappointing, because I wanted to prove that we can discriminate against these people, so I needed to get more data.

Luckily, GitHub helps us out there by providing an events API, which contains a full event stream of any action, or almost any action, that a given user has performed on the site. Every time you open a pull request, or make a commit, or do something else on GitHub, an event is created for that, and you can download all the public events on the site through this API, process them, and use them for data analysis. This is what I did: for all the users that I had in my sample, I downloaded this event data and tried to get some more information that I could use to infer the gender of the people.

For example, here we see the event frequency, averaged over all events, as a function of the hour of the day, and you can see that there seem to be some significant differences between men and women in our data set. So that's something the algorithm could use to make a prediction about the gender. Likewise, in the types of events that we have in our data set, there are also differences in the frequency of individual event types, so that's also something the algorithm can use to make a decision about the gender.

Now, for the last thing, I went a bit crazy and did something that you normally only see in spam detection: taking the commit messages of individual contributors and just putting them into a support vector classifier that basically looks at the frequencies of individual words in each commit message and tries to find differences in the texts between men and women. This already gave me quite a good fidelity for predicting the gender, and combining it with the other information that I had, I could in fact achieve a 15% better chance of predicting the gender than by just guessing.
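A minimal sketch of this spam-detection-style approach: word frequencies fed into a linear support vector classifier. The toy commit messages and labels below are entirely invented; the real study used thousands of manually labeled users:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Bag-of-words + linear SVM, exactly as one would build a spam filter,
# but with (placeholder) gender labels instead of spam/ham.
commits = [
    "fix typo in readme",
    "refactor authentication module",
    "add unit tests for parser",
    "bump dependency versions",
]
labels = [0, 1, 0, 1]  # manually assigned labels (placeholder values)

clf = make_pipeline(CountVectorizer(), LinearSVC())
clf.fit(commits, labels)
print(clf.predict(["fix failing tests in parser"]))
```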
So this is not very impressive, and we can probably do much better, but again, this was only a proof of concept to see how easy it actually is to get this kind of information leaking into the system. And this basically means that if we can build a predictor for the gender of a person on GitHub, then the algorithm that we use to make the decision in the hiring process can also generate this kind of information internally, if we give it this data, and use it against the people.

So the takeaway from this is that the algorithm will readily learn discrimination from us if we provide it with the right training data. Also, information leakage, so getting information about protected attributes into our data sets that we don't want to have there, is actually pretty easy and can happen if we are not careful.

How can we fix this? Well, it's actually harder than you might think, because often we don't even have the information about the protected attributes in our data sets, since we don't want to collect that data from the user in the first place. I mean, imagine if you applied for a job and your potential employer asked you for information about your sexual preferences, your gender, your ethnicity, and plenty of other things; that probably wouldn't go down so well. But this is exactly the kind of information that we would need in order to see whether there is disparate impact in our data, because if we don't have that attribute information, we cannot calculate any measure of the discrimination that is in the process. And this is what is so dangerous about it: our algorithm can discriminate against people without us even noticing.

Okay, this was already the first case study that I wanted to show, and we have seen that getting information into our data set that we shouldn't have is pretty bad. The worst kind of information leakage that you can imagine is when you can identify someone from data that you have obtained about them earlier. And again, if we ask Google for its opinion on privacy, the picture is rather bleak; it seems that many people have already gotten used to the idea that we are in the post-privacy era now. So with the second experiment here, I want to show how easy it actually is to de-anonymize given user data, even without wanting to.

What actually is de-anonymization? Well, de-anonymization means that we have some information recorded about an individual or a person, and we use that information to predict the identity of that individual in another data set. So it's kind of like your data is following you around: even if you, for example, change the devices you are working on, or change your user accounts, the system is still able to identify you, just by using the data that you have put into the system earlier, or that was measured about you earlier. And de-anonymization becomes an increasing risk as the data sets that we have about individual users get bigger and bigger.

So, the clicker, I hope it's working. Okay. Now let's have a look at the math here. Though de-anonymization is a pretty bleak subject, the math is rather fun,
I assure you. You have maybe played this game with some of your friends, where you think of some famous person and your friend has to guess who that is by asking you a series of yes-or-no questions. This actually works pretty efficiently, so that after maybe 10 or 20 questions, your friend can know exactly which person you were thinking of. This works so well because if we have several buckets that are either true or false for a given user, we can create a unique fingerprint for that user in our system. And if you look at the probability of having a collision, so having two users with exactly the same true/false values, this gets increasingly unlikely the more buckets, or the more different types of information, we can put into our system.

The exact probability of finding a collision between users depends on the actual distribution of the information in the buckets. For a uniform distribution we can calculate that number, and as you can see, it decreases exponentially, which is why the game I talked about earlier works so well. For example, if you assume that there are about one million famous people you could think of, then it would probably be sufficient to have something like 32 bits of information to uniquely identify them all. And you can imagine that with big data we have many more buckets that we can use, so we can identify not only a few million people, but easily a few billion different people using this technique.
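As a small sketch of this uniform-bucket calculation: with k independent yes/no buckets there are 2^k possible fingerprints, so the chance that two specific users collide is 2^-k, and the expected number of colliding pairs among n users follows the standard birthday bound. The code below just evaluates that bound:

```python
def expected_collisions(n_users: int, k_bits: int) -> float:
    """Expected colliding pairs: n*(n-1)/2 pairs, each colliding w.p. 2^-k."""
    pairs = n_users * (n_users - 1) / 2
    return pairs * 2.0 ** -k_bits

for k in (20, 32, 40):
    print(f"{k} bits, 1M users: "
          f"~{expected_collisions(1_000_000, k):,.0f} expected collisions")
# 20 bits -> ~477,000 collisions; 32 bits -> ~116; 40 bits -> ~0.5.
# This is why a few dozen bits are enough to separate a million people.
```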
Most real-world data sets are, of course, not uniformly distributed; there it is more often the case that many users are in the same bucket. For example, there are many people that like the same kind of music, so they all have the same attribute, and using that attribute to de-anonymize the users wouldn't do us much good, because it wouldn't help us narrow down the number of users in our system. But there are also many attributes that are pretty unique to each one of us, for example the place where we live, or the combination of that place with the place where we work. Having a few of those quite unique data points for each user is usually already enough to de-anonymize us with a very high fidelity.

And again, I wanted to see whether this actually works in practice. So what I did was to get a data set, in this case from Microsoft Research Asia, which contains GPS data of about 200 people who tracked their whole activity, sometimes for several years, sometimes for several months, and I used the data to create a movement profile, so to say. I also have an animated version of that; here you can see the different trajectories of individual users. I don't know if anyone recognizes the city. No? It's Beijing, actually. And if you're wondering what this square here is: I looked on Google Maps, and it seems to be the university. So I guess it's like in other fields of study: whenever you need some guinea pigs to take data for you, you go and ask students.

Okay, so this is a pretty rich data set; we have, in some cases, hundreds of thousands of data points per individual, and I wanted to see how easy it would be with this data set to actually de-anonymize users. What I did was to first look at the individual trajectories, so here we have the GPS traces of the individuals, color-coded, and then to apply a very simple grid, in this case a 4×4 grid, and just measure the frequency with which a given individual has data points in a given square.

Doing this for the 200 people gives me something like this. This is for the 4×4 grid, and the colors represent the number of times a given person has been in a given square: white means that the person has been there very often, black means the person has never been in that square. And you can already see, with the 60 examples that I show you here, that many of them seem to be quite unique, for example this one and this one, so it should be possible to make a kind of fingerprint for a given user using that data. And if we need more resolution, for example for these ambiguous users here, where we have more or less the same data and can't decide which user we're looking at, we can just increase the resolution, for example to 8×8 or to 16×16, as shown here.
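A minimal sketch of such a grid fingerprint, counting how often a trace falls into each cell of an n×n grid; the bounding box and the toy coordinates are invented placeholders:

```python
import numpy as np

def fingerprint(lats, lons, n=4, bbox=(39.7, 40.1, 116.1, 116.7)):
    """Return an n x n matrix of visit counts for one user's GPS trace."""
    lat_min, lat_max, lon_min, lon_max = bbox
    rows = np.clip(((lats - lat_min) / (lat_max - lat_min) * n).astype(int),
                   0, n - 1)
    cols = np.clip(((lons - lon_min) / (lon_max - lon_min) * n).astype(int),
                   0, n - 1)
    grid = np.zeros((n, n))
    np.add.at(grid, (rows, cols), 1)   # accumulate visit counts per cell
    return grid

# Toy trace: a commuter moving between two spots in the city.
lats = np.array([39.90, 39.91, 39.90, 40.00, 40.00])
lons = np.array([116.30, 116.31, 116.30, 116.50, 116.50])
print(fingerprint(lats, lons))
```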
Now, coming back to our buckets: if we measure the distribution of the attributes that we have here, we can get an idea of how good our choice actually is, and you can see that the choice we have made is actually pretty bad, because in the first bucket, the bucket with the most data points, we have about 10^6, or one million, points. But the interesting part of this curve, which is by the way logarithmic, is here, in the long tail of the distribution, where we have only one, or sometimes a couple of, individual persons in a given bucket. If we can get some information into these buckets, it's easy to use that to de-anonymize our users.

Okay, and how do we do that? Again, we use a very simple measure. We just take the fingerprint of one user, or one trace, and multiply it with the fingerprint of another trace, pixel by pixel, which gives us the values here on the right; then we take these individual values and sum them up, and this gives us a score of how similar two users, or two trajectories, are.

Doing this, we can take 75% of our data as a training set, so we teach our algorithm to recognize the individual users, and then we can use the remaining 25% to test how good our algorithm is at recognizing them. Then we look at the average probability of identification, and also at the rank that the user has in this prediction. This is shown here: what I show is the probability of finding the right user within, for example, the first two, the first four, or the first six users that have the highest scores for a given trace. You can see that even for 16 squares, so the 4×4 grid that I showed you in the beginning, the identification rate is already 20%. So we can uniquely identify one fifth of our users by just using 16 data points, and the more data points we use, the better we can identify the users in our data set. With 1,024 individual data points, which would be quite easy to get in a real-world setting, we can uniquely identify almost 30% of the users. And again, I want to stress that this is just a proof of concept, so there has been no optimization, no fine-tuning of parameters or anything.
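As a hedged sketch of this matching step: the similarity score is just a dot product of the flattened grid fingerprints. The users and traces below are random placeholders rather than the real GeoLife-style data:

```python
import numpy as np

rng = np.random.default_rng(3)
n_users, n_cells = 200, 16                # 200 people, flattened 4x4 grid

# "Training" fingerprints (75% of each user's data) and held-out "test"
# fingerprints (the other 25%), simulated as noisy copies of a profile.
profiles = rng.random((n_users, n_cells))
train = profiles + 0.1 * rng.random((n_users, n_cells))
test = profiles + 0.1 * rng.random((n_users, n_cells))

scores = test @ train.T                   # score[i, j]: test trace i vs user j
ranks = (-scores).argsort(axis=1)         # best-matching users first

top1 = (ranks[:, 0] == np.arange(n_users)).mean()
top5 = np.mean([i in ranks[i, :5] for i in range(n_users)])
print(f"top-1 identification rate: {top1:.0%}, top-5: {top5:.0%}")
```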
We can also use this technique not only to identify single users, but also to find similarities between users. This could be interesting, for example, to see who is related to whom, who you are visiting, who your friends are, maybe. This is what I did here: I used the same metric as before and just told the system to give me the users that are most similar to each other. You can see here in green the trajectories of one user, and in red the trajectories of the other user, and the areas that are yellow are where the two of them coincide. You can see there are some hits of the system which don't seem too good, but there are also some hits where you can see a really big agreement between the two data sets. I mean, I don't know who was taking this data, because it's anonymized, but I would guess in this case that it's either a taxi driver or maybe a bus driver, because you can see that we cover almost the whole Beijing area with these two traces. So this technique makes it really easy to identify users, and also to find out who they are related to and which other users are similar to them.

We can, of course, improve the identification rate of the system, for example by taking into account not only the spatial information but also the temporal information, for example the day/night cycle, which you see here in the background. Here the green curves have been taken at night and the red curves during the day, so habits like going to work in the morning and coming back in the evening can then be used to increase the prediction fidelity for identifying a given user. And of course we could also change the choice of our buckets, and change the way we do the fingerprinting, in order to increase the fidelity of the algorithm. So there's plenty of room for optimization, and as I said, this is only a proof of principle, but there are other, similar works in the literature which show that even with very simple methods you can achieve quite good identification rates in such data sets.

To summarize: this means that the more data we have about a given entity or person, the more difficult it actually is to keep algorithms from directly learning and using the identity of that object for prediction, instead of an attribute. That means, as I said before, that the data which a given user or person generates follows him or her around their whole life. Even if you changed all of your smartphones, all of your devices, some parts of your behavior would probably stay the same, and these could be used to identify you later on, again with a pretty high fidelity. So that's one of the biggest risks of big data for me, because if we don't avoid it, it is very easy to destroy the privacy of our users. Okay. Yeah, thanks.

So, what can we do about this? I don't have all the answers, of course, but I have a few ideas, and there are lots of people working on political, societal, and technological solutions for this. Here I just want to give a brief overview of things that can be important in order to avoid these two scenarios that I have shown.

The group of people that we probably have to educate most urgently about this is, of course, data scientists, the people that actually work with the data and create these algorithms. In Germany, for example, you need a three-year apprenticeship in order to sell cheesecake, but there's nothing comparable in order to become an algorithm designer, to develop these kinds of algorithms that have a large influence on our daily lives. So there should probably be a better curriculum in universities, and maybe even in schools, to educate people not only about the possibilities of data analysis and about squeezing even the last few percent of fidelity out of a given algorithm, but also about the risks and the dangers of using these kinds of technologies, especially when other people are involved.

Another thing that we should be careful with is collecting data without actually needing it.
Today, one of the most popular approaches in big data is just to take everything you can get, all the data you can get your hands on, give it to the algorithm, and let it decide how to use it. Normally this is good, because it increases the fidelity of our predictions, but as I explained earlier, it can also be very dangerous, because the algorithm might learn things which it isn't supposed to learn at all. So we should really be more careful with the data that we feed into these systems.

Of course, there are other things we can do. We can try to remove discrimination and disparate impact, and there is also a lot of academic work providing techniques and methods that we can use for doing this. But here the problem, again, is that most people who actually work in the fields where these algorithms are put into practice either don't know about these things or are not interested in them. So I think here we have a big potential for improving the education of data scientists and data analysts.

As citizens, we can also do something, of course. The first thing is to not blindly trust the decisions made by algorithms. Most people have a kind of bias to think that a decision made by a computer, by an algorithm, is maybe more fair than a decision made by a human, and I think this is something we have to get rid of, because algorithms, as I showed, can be just as discriminating against people as humans can. And if we can't question their decisions, we can at least test them and see whether there is actually discrimination in the system. Now, this sounds pretty easy, but it's actually very hard, because the algorithms are mostly in the hands of big organizations or corporations, and are, of course, closely guarded trade secrets most of the time. This means that we have to use techniques such as reverse engineering in order to find out how the internals of an algorithm might work. And I have to say I'm a bit pessimistic about this, because whereas the companies or organizations have deep pockets and huge amounts of data to train these algorithms, the amount of data that we can use for reverse engineering them is minuscule in comparison. So it's really not very likely that we will be able to make a good assessment based on these kinds of techniques.

One other thing we can do is to fight back with data: by collecting data about decisions that are made about us by algorithms, and by centralizing it, we can create a lot of opportunities for researchers and other people to analyze these data sets and to find discrimination and other things in them. So I would encourage you: if you are reluctant to give away your data, I can of course understand that, but in some cases it's really the only way to make sure that someone can actually work with the data and detect the injustices that are caused by it. So we really have to think differently about giving away our data, and about creating data, and about machine learning against machine learning.

As a society, we can of course create better regulations for algorithms.
And this is actually something that has been done: at the beginning of the year, our Minister of Justice demanded that Facebook open up its algorithm. This was much ridiculed at the time, but I think it actually has some merit, because if we can't understand how corporations or companies are using algorithms, we can't know whether they're discriminating against certain people or whether they're treating us fairly. So having an auditing system in place that allows at least a group of people to have a look at these algorithms and see how they're working would be a first step in the direction of making these things more transparent. And of course, making access to the data easier, in a safe way, is also important, to be able to detect any problems that we have with it.

Finally, and this is maybe already too late, we should do our best to impede the creation of so-called data monopolies, because if one organization or one actor has all the data in its hands, we have already lost. Even if we have the same algorithms and the same technologies at our hands, most of the value in data analysis is in the amount of data that we have. So if there's an adversary, or an organization, that has orders of magnitude more data to work with than we do, it's really unlikely that we will be able to compete with that adversary on the same scale.

As a final word, I would say that algorithms are probably a lot like children: they're very smart and they're really eager to learn things, and we, as the data analysts or the programmers, have to teach them to behave in the right way. We should try to raise them to be responsible adults. Okay, so thanks.

We do have a few minutes left for Q&A. I would like to ask you to queue up at the microphones in the aisles. If you're watching at home, we also have a human-computer interface to relay questions to us. I'd say we begin with that. Do you have a question for us?

Yes. Rudy is asking: what discrimination number would you guess for discrimination from politicians over people's choice, in one or several countries?

Politicians about people's choice, you mean? Can you be a bit more precise on that? I think it's difficult to...

We'll get back to that question. Okay. Yeah, we have one question at microphone number two, please.

Thank you for your talk. Does it make any sense, or is there any hope, that I as an individual can fake my data patterns? Can I disturb the pattern recognition in a sensible way?

Yes, I think you surely can. The question is only whether this will be effective, for example to protect you against de-anonymization, because, as I said, faking 90% of your data can be useless if 10% of your data points are in buckets, or attributes, that are unique or almost unique to your person. So if you want this method to be effective, I think you would have to be really convincing. And I mean, I haven't had a look at the very big data sets, so I really can't give a quantitative answer, but I'm rather pessimistic about this approach,
I have to say.

We do have a few more questions. I would ask the people in the room: if you have to change rooms right now, please do so in a quiet manner, so we can do the Q&A without yelling. We have another question from the IRC, and after that it's microphone number four. IRC, please.

Atomic NGR is asking whether a human is generally able to create an algorithm which is not discriminating. He's making an analogy to random numbers, where a human cannot really create truly random numbers, because he or she would always have a preference.

Yeah, that's a very interesting question. It really comes down to the algorithm having the information about a protected class or not having it. If it doesn't have the information, it can't be discriminating, by definition, because it can only randomly guess whether a person belongs to a given group or not. So in that sense, algorithms can be perfectly unbiased, but only if they don't have any information that gives away the protected status of an object or a person that they're making a decision about. So it's definitely possible, yeah.

Okay, the next question at microphone number four, please.

Thank you for your talk. You say that algorithms discriminate in the same way that humans can, but I wonder if the real challenge is that algorithms discriminate in a slightly different way than humans do. For example, you gave the example that we can identify gender or other markers from the data set. But what if these attributes that correlate with gender, class, race, etc. also correlate with other, positive attributes? Such as the study that says you're a more efficient worker when you live closer to the site of your employer. If you have a very segregated society, that means that those who are richer are also then classified as more efficient workers, and win in the scoring of potential employees. If such a thing occurs, it's not just that discrimination can be an unintended outcome; it's also that if the company wants to discriminate, you cannot prove it, because they can say "we just hired the most qualified candidate", when in fact they just hired certain kinds of people.

Yes, yes. I mean, that's exactly the argument about discrimination, because if you don't have the information about how many people of a given class, of a given protected status, applied, for example, for a given job, you can't figure out whether there is any discrimination in the process. So that means you somehow have to get that information into the system in order to make an audit and actually see whether there's some unfair bias in there. And the other question, if I understood it correctly, is whether you can infer information about the gender from other things, and this is certainly the case, because, as I said in the talk, many things, like for example the neighborhood that you live in, as you said, will give away information about the protected attributes as well.

All right, we have a few more questions. I would ask you to please keep them short. Microphone number five in the back, please.

An often-heard statement is that the more data you actually collect, the less you can actually do with it, because it's just too much. Is there any scenario where this statement makes any sense?

Yeah, there definitely is. I mean, giving an algorithm more data to train with is not always a good thing.
It's pretty easy to overtrain algorithms, that is, to produce a model that fits the data you give it perfectly but has very little predictive power for new data. In general, though, increasing the number of data points always improves the quality of the model, provided the new data comes from the same model as well. It could also happen that the data you have is not homogeneous, so that one part of the data fits well with one model, but the other part fits well with another one. In that case it might be difficult to train a single model on a large amount of data. But it depends on the individual case, I would say, so it's really not easy to answer in that sense.

Thank you. We have time for two more short questions. One question from the IRC again.

Yes, from the IRC, Luke is asking: isn't the black-box nature of machine learning algorithms one of the biggest problems? Can this be solved by better visualization, or by understanding what the algorithm is really doing?

Yeah, for me, having algorithms that are not open to scrutiny and that we can't understand is one of the biggest problems, of course. And visualizing data can help, of course. But as I said briefly in the talk, the space of possible parameters and the space of possible data points is so enormous, even for very small machine learning problems, that it's really difficult to produce a visualization that would give you, with high confidence, good information about, for example, discrimination in the data set. So it can certainly help, but I think it's not a perfect answer either.

Okay, we have time for one more question, at microphone number one, please.

Thank you. In the beginning, you displayed the areas green, yellow, and red, going from low risk to more damaging. The example you made about the green area was about some kind of algorithm that gives information to you. Don't you think that the time of exposure influences how damaging it is? Because if I am influenced for two years, that is worse than just two days.

Can you say that again?

The time of exposure to an algorithm that influences your behavior has to be considered as a factor, to understand whether it's...

Oh yeah, that's a very important point. I also did an experiment where I looked at the interaction of an algorithm with a person that it is, for example, showing articles to, and this is a topic of its own, I would say. There is definitely a very rich interaction there that is also not captured by most models: the algorithm influencing the behavior of the person, that again influencing the actions of the person, and that in turn influencing the machine learning on the further data. So there's definitely some feedback in the system, absolutely, yeah.

Okay, that's all the time we have. Thanks again to Andreas for the great talk.