Hello. I'm Harsh Gupta. I work at Nilenso. Nilenso is India's first software cooperative. I'm from an engineering background, and many of my friends are also engineers. Many of them believe in this god of data science: they believe that given enough data, any problem is solvable. Many of them would happily replace lawyers, judges, doctors, pretty much everyone, with machine learning. You might be aware that there are already companies using machine learning to determine who should be hired, or even who should be put in jail. In the US, there are courts using machine learning for criminal risk assessment. I'm quite skeptical of these claims, and I think these things can backfire badly.

So let's think about the process of machine learning. The way you do it is: you have some data, you put it into a machine learning algorithm, and you come up with some predictions or output. But the data comes from somewhere, and the predictions are used for something, and not taking into account how the data was generated, or how your predictions are being used, can be fatal. It can have very harsh consequences for society and for innocent people.

Let's talk about data first. Here's an example from World War II. You see a picture of a fighter aircraft from World War II. These planes would go to the battlefield, fight with the enemy, and come back damaged; some wouldn't come back at all. The task of Abraham Wald and his team was to determine where to put more armor on the aircraft. They could not put armor everywhere, because that might make the aircraft too heavy, and there were also cost constraints. So they had data on where the returning aircraft were more damaged, and the team's first opinion was that the places with more damage are the places more likely to be hit by the enemy.
And I guess a naive machine learning algorithm would reach the same conclusion. But Abraham Wald was smarter. He realized that not all planes come back, and the planes he was looking at had been able to come back even after being damaged. So the places where we see no damage are the places where, when an aircraft was hit, it didn't return. Abraham Wald therefore decided to put more armor on the places which were less damaged. What do we learn from this? That the context of the data is really important. Taking the context into account, the same data can lead you to different conclusions, even opposite conclusions.

Let's take another example, from the US presidential election of 1936. The candidates were Franklin D. Roosevelt and Alf Landon. The Literary Digest was in the business of conducting pre-election polls, that is, predicting the winner before the election happened. They had a sample size of 2.4 million. There was another person, George Gallup, with a sample size of just 50,000. The Literary Digest predicted that Landon would win; George Gallup predicted that Roosevelt would win, and Gallup also predicted that the Literary Digest's prediction would be wrong. Who do you think was right? Gallup, even with a roughly fifty times smaller sample. And why was that? Because the Literary Digest ignored how their data was generated. They were a very popular magazine, and they sent out poll forms with their magazine; people filled out the forms and sent them back. At that time, Republicans were more likely to subscribe to their magazine than Democrats, so the sample over-represented Republicans. Even with a far bigger sample, they produced the wrong result. Another thing to note here: a wrong bigger sample can be worse, because it leads you to make wrong predictions with more confidence.
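The Literary Digest effect is easy to reproduce in a few lines. Here is a minimal sketch, not from the talk, with assumed numbers: a hypothetical electorate where 55% support candidate A, and a biased sampling frame (think magazine subscribers) in which B supporters are twice as likely to respond. The huge biased poll is confidently wrong; the small representative one is right.

```python
# Sketch of the Literary Digest failure: a huge biased sample versus a
# small fair one. All numbers here are made up for illustration.
import random

random.seed(0)

TRUE_SUPPORT_A = 0.55  # assumed: 55% of the population supports A

def population_vote():
    return "A" if random.random() < TRUE_SUPPORT_A else "B"

def biased_vote():
    # The sampling frame over-represents B supporters (e.g. magazine
    # subscribers): a B supporter is twice as likely to respond.
    while True:
        vote = population_vote()
        respond_prob = 0.3 if vote == "A" else 0.6
        if random.random() < respond_prob:
            return vote

def poll(sampler, n):
    votes = [sampler() for _ in range(n)]
    return votes.count("A") / n

big_biased = poll(biased_vote, 2400)      # "huge" poll, scaled down for speed
small_fair = poll(population_vote, 5000)  # small but representative poll

print(f"huge biased poll: {big_biased:.1%} for A")  # well below 50%: predicts B
print(f"small fair poll:  {small_fair:.1%} for A")  # close to the true 55%
```

Note that making the biased poll bigger only narrows its error bars around the wrong answer, which is exactly the "more confidence in a wrong prediction" point.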
So getting the right data can be really hard, and if you talk to any statistician, they will tell you how hard it is. But let's assume you have somehow got the right data. You were very careful with it; you accounted for all the complexities of the process and everything that can affect your data quality. Even after you've done that, you are using the data to model some underlying system, and that system can be really, really complex.

Okay, here are some data points. Do you see a pattern? I see two lines diverging. Here are some more data points from the same underlying function. Some more. Some more. Even more. Do you see a pattern now? I see a sine curve, and indeed the data points were from a sine curve all along. But from the amount of data you had at first, you could not have predicted that it was a sine curve. The amount of data you need corresponds to the complexity of the problem you are trying to deal with.

What does that mean in real life? Well, there are problems for which you will never have enough data. Why do you think cryptographers aren't worried by developments in deep learning? Because you'll never have enough data. To break AES with a key size of 256 bits, you would probably need data on the order of 2^128 samples, and I guess the whole of humanity doesn't have the processing power to process that amount of data. So you'll never have enough data for that kind of problem.

What happens when you don't have enough data for the problem you are dealing with? The complexity of the problem increases with the number of variables in the process, and with the interactions between them. When you don't have enough data for that complexity, you'll find patterns which don't exist. Here is a graph, for a fixed amount of data, of the number of false patterns you can find versus the number of variables.
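That graph is easy to reproduce. Here is a minimal sketch of my own, not from the talk: generate completely independent random variables, so any correlation between them is spurious by construction, and count how many variable pairs still look "strongly correlated" at a fixed, small sample size. The count of false patterns grows as we add variables.

```python
# With a fixed, small amount of data, the number of spurious "patterns"
# (strongly correlated pairs of independent random variables) grows with
# the number of variables. The threshold 0.4 is an arbitrary choice.
import random
import statistics

random.seed(42)
N_SAMPLES = 30  # fixed, small amount of data per variable

def corr(xs, ys):
    # Pearson correlation, using population statistics.
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) * sx * sy)

def false_patterns(n_vars, threshold=0.4):
    # n_vars independent Gaussian variables: every correlation is noise.
    data = [[random.gauss(0, 1) for _ in range(N_SAMPLES)] for _ in range(n_vars)]
    return sum(
        1
        for i in range(n_vars)
        for j in range(i + 1, n_vars)
        if abs(corr(data[i], data[j])) > threshold
    )

for n_vars in (5, 20, 80):
    print(n_vars, "variables ->", false_patterns(n_vars), "spurious 'patterns'")
```

The mechanism is just combinatorics: the number of variable pairs grows quadratically, so even a small per-pair chance of a fluke correlation yields many flukes overall.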
And you see that the number of false patterns you can find increases with the number of variables you have. How does this translate into real life? I like food, and I guess a lot of you do too. My friend Jyoti will tell me that not eating non-veg is unhealthy, but my mom will tell me otherwise. Keto appears to be the fad diet at this time, but five years from now, ten years from now, there will be some other diet which is the most popular, and they will say it is scientifically proven to be the best diet. I'm sure five years later there will be yet another diet claiming that the current one is not the right diet. Why does this happen? Because there are so many factors that go into how your body reacts to a diet that it's very hard to get the right amount of data. You need to take the right things, in the right quantity, in the right form: taking vitamin pills, for example, is not the same as taking vitamins through food sources. It also depends on what geography you live in, what your body type is, what your genetics are. All of these diets essentially take only certain variables of the equation and claim they have got the right answer. But when you take only certain variables in isolation in a complex system, you might think you have got it right, but you probably haven't. And diet is not the only complex system. Social systems are complex, finance is complex, genetics is complex, and in complex systems you cannot take certain variables in isolation. Complex systems can fool you.

So let's talk about the output. You are using machine learning to predict some output, and in doing so you are essentially reducing something complex into a simplistic metric. For example, health is a very complex phenomenon, and you often reduce it to BMI.
You reduce the health of an economy to GDP, or someone's knowledge to their grades. And when you reduce very complex goals to simplistic metrics, those metrics can backfire. Why? Because people are not rocks: they respond to the ways you try to judge them. There is a law called Goodhart's law, which says that when a measure becomes a target, it ceases to be a good measure. And history has lots and lots of examples of that. In 18th-century France and England, the government wanted to tax people on their wealth. But it's very hard to determine how much wealth someone has. So some smart person in the government realized that someone's wealth is probably proportional to the size of their house, and the size of their house is probably proportional to the number of doors and windows it has. So they started counting the doors and windows of people's houses and taxing them on that. What happened as a result? People started building these weird houses with the doors and windows removed or bricked up. The government eventually repealed the stupid tax, but the buildings remain, and they had real repercussions in terms of the amount of light and ventilation people got, and therefore their health. There are lots of examples like that. When we believe we will be judged by silly metrics, we adapt in silly ways.

So, to recap: you need to take into account how your data was generated, because your data is at the core of the whole thing, and if your data is flawed and you are not thinking it through, everything which follows may fall apart. Even if your data is right, the underlying system can be very complex, and complex systems can fool you. You also need to be very careful about the metrics you use, because they can backfire.

Now, some questions which I think you might ask. Do I mean that machine learning does not work at all?
Well, I do not mean that. But I think there are so many things which can go wrong that it probably doesn't work unless you are super careful. And a good fit on a test set does not prove anything. Maybe rigorous field testing does, but you cannot just take a static data set, come up with some predictions, and expect them to make sense.

Why doesn't a good fit on a test set prove anything? The way you usually do testing in machine learning is that you have a static, fixed data set, and you divide it into a training set and a test set. The test set is supposed to be used only once. But what if your algorithm doesn't perform well on it? You change some hyperparameter, train the algorithm again, and test it on the test set again. If you do that enough times, you can end up overfitting on the test set itself. Also, if your data is bad, a good fit on bad data does not mean anything.

What about machine learning at Google, Facebook, Microsoft, and Netflix? Well, the same thing: if it works for them, it doesn't mean it will work for you. Also, I guess there is a survivorship bias in the machine learning stories we see. There must be a lot of companies which used machine learning and failed, and we won't hear their stories, because the media has no incentive to popularize them, and neither do the companies themselves. Machine learning can work, and it has worked in certain scenarios, but you should think about your own case.

But doesn't it hurt to try? Well, it depends on who gets hurt, right? If you are some e-commerce company with a fancy machine-learning-based pricing model and you blow yourself up, well, you blow yourself up; no one else is hurt. But if you are putting people in jail based on machine learning, you should be very scared, because you are putting innocent people's lives in danger.
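The test-set-reuse problem can be demonstrated without any real learning at all. Here is a minimal sketch, with assumed numbers, not from the talk: the labels are pure coin flips, so no model can genuinely beat 50% accuracy, yet repeatedly picking whichever "model" scores best on the test set makes one look good on that same test set, while fresh data exposes it.

```python
# Repeatedly selecting models on the test set overfits the test set,
# even when the labels are pure noise and nothing real is being learned.
import random

random.seed(1)
N_TEST = 100
test_labels = [random.randint(0, 1) for _ in range(N_TEST)]  # coin flips

def random_model():
    # Stand-in for "retrain with another hyperparameter setting":
    # each candidate model is just a random predictor.
    return [random.randint(0, 1) for _ in range(N_TEST)]

def accuracy(preds, labels):
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

# "Tune" 500 times, keeping whichever model scores best on the test set.
best = max((random_model() for _ in range(500)),
           key=lambda m: accuracy(m, test_labels))

fresh_labels = [random.randint(0, 1) for _ in range(N_TEST)]

print("accuracy on the reused test set:", accuracy(best, test_labels))   # well above 0.5
print("accuracy on fresh data:        ", accuracy(best, fresh_labels))   # about 0.5
```

The selected model's score on the reused test set is an artifact of selection, not of skill; field testing on genuinely fresh data is what reveals that.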
All models are wrong, but some are harmful. For further reading: I really love Nassim Nicholas Taleb, and I would recommend you read everything by him. There is also a book called Weapons of Math Destruction by Cathy O'Neil, where she talks about a lot of cases where people have tried to use mathematical models, including machine learning, and have done a lot of damage to people. And there are some other things on the slide which you can read. Okay, thank you. Questions? Raise your hands where I can see them.

What I understand is that you're telling us the data should be generated in the right way, and also that complicated systems may sometimes not work with machine learning.

No, I'm saying that you won't know whether it is working or not. You'll think it is working, when it probably isn't.

Okay, but in the context of this talk, about what cannot be done through machine learning, you're telling us that it's risky, that machine learning may not work for many things. But will these not pose challenges that we have to resolve? I hope my question is clear.

So, to paraphrase your question: shouldn't we see these things as challenges to resolve, instead of things which we cannot do? Well, consider the cryptography example: for that system, you don't have enough data at all, and you will probably never have enough data. For very complex systems, there is a theorem called the no free lunch theorem, which says that if you assume anything is possible in the underlying system you are trying to model, you cannot have any predictive power at all. I guess that also applies to complex systems, because they are so complex, and the underlying functions can be so varied, that you might never have enough data.
So I think of it this way: the problem is that you might think things are working, but you have no proof that they are, at least not from the data itself. You can do field testing, actually go out and see whether things are working; that's different. But from the data alone, you can always find patterns which look real and are not. Does that answer your question?