they have solved the problem or that they haven't been able to solve the problem. So it's heartening to see that the ecosystem is opening up and that there's more possibility for collaboration. I mentioned this yesterday at Anthill Inside: clearly there is more collaboration happening, or at least the spirit is starting to come alive. And I think that's what we need to do to grow both The Fifth Elephant and the community itself: be more open, share more openly, and recognize that the vulnerabilities are shared by all of us, whether in terms of organizational positioning or in terms of running teams, and that these vulnerabilities can be accepted and sorted out if we talk openly about where the challenges really lie, whether it's how we leverage the data, how we build our teams, or how we do data engineering.

On that note, I won't take too much time, but just a quick walkthrough. In Auditorium 1 there are talks today and tomorrow, all through the day. Auditorium 2 has a mix of talks and BoF sessions. There are also BoF sessions happening upstairs; for those of you who were at Anthill Inside yesterday, there's an open space upstairs where the BoFs are happening. So there are BoFs and talks in Auditorium 2, and BoFs upstairs as well, between today and tomorrow. Auditorium 3 has a workshop today and tomorrow: a SageMaker workshop today and a deep learning workshop tomorrow. So feel free to circulate as per your needs and what interests you; we've tried to put together a two-day schedule that hopefully caters to a lot of needs. Do feel free to participate in the BoF sessions. We've taken quite a bit of effort to make sure there are user sessions, so if you're a user of Airflow, Solr, or Spark, those sessions are really for you, in terms of being able to discuss the challenges and limitations of working with these tools. There are also a bunch of sessions on data engineering, on DataOps, and a whole lot of other topics, so feel free to participate in them.

I suppose the last point remaining is: please turn off your phones or put them on silent. It is extremely insulting to speakers if a phone rings in the middle of the conference, and of course very disturbing to the people sitting around you. On that note, I'm going to hand over to Shakti, who will be emceeing and shepherding you through the next two days. Is shepherding the right word for elephants? We'll find out.

Very good morning once again. Before we begin, just a few more announcements. You can connect to the Hasgeek Wi-Fi; the password is geeksrus: golf echo echo kilo sierra romeo uniform sierra. Please don't run any torrents or downloads, and please don't create any Wi-Fi hotspots either. There will be an ML with Amazon SageMaker workshop in Auditorium 3 at 10 AM. And refreshments are not allowed inside this auditorium. Right, so with that, let's get started. What better way to start the morning session than with an understanding of a mathematical function? So you think you know about linear regression. Chris Stucchio is here to answer your question. Chris Stucchio.

Does this work? Okay, so I'm going to be talking about linear regression, which is a topic I think a lot of us feel we've already seen.
We all know the standard graph: a bunch of dots that move roughly in a line, and you can fit a line to them. Something I've discovered in this community is that a lot of people seem to treat basically everything in scikit-learn as a black box. What I want to do in this talk is open up the black box in the simplest case possible and show that if you understand how it works, then when it doesn't quite work for your application, there's a very good chance you can alter it slightly to get good results.

So I'm going to illustrate a few examples where I've used linear regression, or someone I know has, to make money, typically by gambling. Around this time last year, I was making not a lot of money, but at least some money, gambling. I wasn't actually doing cricket; I was doing baseball, because I was in the US at the time, but it's a similar idea. I had data along these lines: we have a batsman, we have a bowler, and in a certain game this pairing of a batsman against a bowler resulted in this many points. Major League Baseball is actually very nice because they have an XML API where you can get all the data you want. So what I did is set up a linear model using one-hot encoding, where a batsman has a skill b_i, the bowler (or pitcher) has a skill p_j, and for that pairing we would expect to get this many points. The notation is worth noting: s_ijk is the score of batsman i against bowler j in game k, and g is a probability distribution which I'm not going to specify right now. You could think of it as a Gaussian centered at a constant term plus batsman skill plus bowler skill, but what I'll get to later is that it doesn't have to be a Gaussian. You make some important assumptions: that game performance actually behaves linearly, and that the score in one game doesn't affect another. These are mostly reasonable assumptions, and nevertheless I was making money in this case and beating the average gambler, even though most of the standard assumptions of ordinary least squares, which we're all familiar with, didn't apply.

Here's another example that a friend of mine is working on. He wants to price heavy equipment. He's got a backhoe or some other piece of heavy machinery; he knows how many hours it's been working, how old it is, whether it has an air conditioner, what its horsepower is, whether it has four-wheel drive, and he wants to know what this thing will sell for. He's building a pricing model to help lenders lend money accurately, with the machine as collateral for the loan, so an accurate price is valuable. In this model we set up the same framework: the value of a machine is some constant, minus some amount of money based on how many hours it's spent working, plus whatever the value of four-wheel drive is, and so on, and this generates a reasonably accurate prediction of what a machine might be worth.

Some years back, I was running a model in the stock market. I had a little Python script that ran a couple of times a day and submitted trades on my behalf. In that case, I was essentially using the prices of fast-moving stocks, such as Google or Goldman Sachs, to predict much slower-moving stocks, such as JNUG a little bit later. I wasn't actually trading JNUG; that's a Wall Street joke.
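To make the one-hot setup concrete, here is a minimal sketch of the batsman/bowler model; the player names, scores, and the choice of plain LinearRegression are all made up for illustration. The point is just that one-hot encoding the two identifiers and fitting a linear model gives each batsman and each bowler their own additive skill coefficient.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical game-level data: one row per batsman/bowler pairing per game.
games = pd.DataFrame({
    "batsman": ["kohli", "kohli", "dhoni", "dhoni"],
    "bowler":  ["starc", "rabada", "starc", "rabada"],
    "score":   [42, 31, 18, 25],
})

# One-hot encode the identifiers; each batsman and bowler gets its own column.
X = pd.get_dummies(games[["batsman", "bowler"]])
y = games["score"]

# Fit score ~ constant + b_i + p_j.
model = LinearRegression().fit(X, y)

# The intercept is the constant term; coefficients are per-player skills.
skills = dict(zip(X.columns, model.coef_))
print(model.intercept_, skills)
```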
But I was trading similar shady stocks on the basis of the movements of faster stocks. So again, the model was, in some sense, linear regression. I predicted that the price of the slow stock, let's say 10 milliseconds later (I wasn't actually working on the millisecond scale with a Python script), would be drawn from some probability distribution. The delta means the change in stock price: if Google went up, JNUG would also go up, and that change would again be drawn from some probability distribution which is linearly derived from the input variables, namely the fast-moving stocks. This strategy is now tapped out, so if you want more details I'll be happy to tell you, but it's not going to make money anymore; it stopped making money a couple of years ago.

So we're making an assumption whenever we do linear regression, or more or less any parametric statistical method: we're assuming that the world we live in is, in some sense, a simulation, and moreover that we know specifically what is being simulated. We have an idea of the code. In the linear regression case, the code is typically: we compute a variable z, which is a dot product between some unknown vector a and x, the observable input data. Then we take g, a probability distribution—choose any of your favorite probability distribution classes from scipy.stats: normal, exponential, whatever—set its mean and scale parameters equal to some linear function, and draw a random variable from it (that's the rvs call). And we assume our data is generated in this way. Just for the sake of notation, I'm going to always call x the input and y the output; if you ever see x and y on the slides, I should be consistent with them. Our goal is: assuming we knew the probability distribution the simulation chooses, find a and then use a to predict y. This, by the way, is called a generalized linear model. It's a different thing from a general linear model; you can look on Wikipedia for the distinction—it's very confusing terminology.

I want to distinguish between data science and statistics here; this is an important distinction. The statistician says that we live in a simulation, a matrix of some sort, and that the machines are running this program—not just any program, but this particular one—and we know the code. The only thing we don't know is the config file, which is the value of the true a. So what a statistician does is look at the output data and try to figure out what the config parameters are, what the true a is. The data scientist takes a very different, more computational perspective. He doesn't assume he knows the specific code the machines are running, and instead chooses from a much larger space of computer programs. He searches through that space to find some totally different program that bears no relationship to the code our simulation overlords wrote; he tries to write a clean-room implementation that looks nothing like the original but generates the same output. I'm going to be talking about statistics in this talk; I'm mostly going to leave machine learning aside.
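As a minimal sketch of the "simulation" being described—assuming a Gaussian g and a made-up true a—the data-generating code might look like this:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

true_a = np.array([2.0, -1.0])        # the "config file" we never get to see
x = rng.normal(size=(100, 2))          # observable inputs

# z = a . x, then draw y from a distribution g centered at z.
z = x @ true_a
y = stats.norm(loc=z, scale=0.5).rvs(random_state=rng)

# The statistician's job: given only x, y, and the form of g, recover true_a.
```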
I'm not sure there's consensus on this distinction, but it's the one that makes the most sense to me, so don't take it as something everyone will agree with. I'm going to discuss this talk from the perspective of Bayesian statistics. In Bayesian statistics we try to form opinions about the world, and we represent our opinions as math. Specifically, an opinion is a probability distribution. Just to remind everyone what that means here: it's a function whose input is your true parameters—how much four-wheel drive is worth if you're pricing backhoes, how much JNUG will move if Google moves if you're looking at stock prices—the parameters you're having difficulty finding. Our function P(a) has to be positive and has to integrate to one, and where this function is larger, we have a stronger opinion that that value of a is likely to be the true one. So in Bayesian statistics we're strictly saying that probabilities are opinions. Bayesian statistics is about changing your opinion, but it can't give you one; you can't create an opinion out of nothing. So you start with an opinion—anything you believe before you see any data. This is called pulling a prior out of your posterior. Then, once you have data, you use Bayes' rule to change your opinion, and it changes, in a way that is mathematically optimal, towards something more accurate. Essentially, the function P(data | a) means: assuming a were known and had this value, what is the probability of seeing the data we just saw? We multiply by that, which means that if the data is highly inconsistent with the parameter, that factor will be small. If the data and your theory disagree, it's small; if they agree completely, it's large. So it's making your opinion smaller where it's inconsistent with reality and larger where it's consistent.

So let's discuss linear regression from this perspective. y is the output, x is the input, and a is the unknown parameter—the slope of the line and the constant. Given our assumption of independence—the price of one backhoe doesn't affect the price of a different one, the score in one game doesn't affect the score in a different game—we can write the likelihood, the probability of the data given our parameter, as a product of the density functions of the distributions we're assuming govern the world. The rest of this talk is about different ways you can take this formula, plug it into Bayes' rule, and get better results even when all the assumptions of least squares don't apply. The simplest example we'll start with assumes g is a normal distribution—just a Gaussian, which we all know and are familiar with; we're normally told that linear regression requires Gaussian errors. The other important fact—and I'll flip back to the previous slide—is the term at the bottom, the probability of the data: that term doesn't vary with a.
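Written out, the pieces being described are the standard formulas below (a sketch; g is the assumed noise density, and the product runs over independent observations):

```latex
\underbrace{P(a \mid \text{data})}_{\text{posterior}}
  \;=\; \frac{P(\text{data} \mid a)\,P(a)}{P(\text{data})}
  \;\propto\; P(\text{data} \mid a)\,P(a),
\qquad
P(\text{data} \mid a) \;=\; \prod_{k} g\!\left(y_k \;\middle|\; a \cdot x_k\right).
```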
It's lucky it turns out this way, because I get to ignore that term; it's very complicated to compute, but I can ignore it for the rest of this talk, and you'll see why shortly. So here is the heart of Bayesian linear regression. We write down our prior, which is something we choose based on our opinion about the world and what we think is reasonable. Then we write down the distribution P(a | data), which is a constant times our likelihood—as we said on the previous slide, just a product of the individual probability densities—times our prior. Then, step three: there are different ways to set this up, but it's basically "let the computer do a bunch of work", and when I say a bunch of work, it can be quadratic amounts of work in your desired accuracy. So it's not a specific algorithm but a framework for coming up with algorithms, and you'll see that from this framework we recover a lot of the things you've probably seen before.

The first way to let the computer do a massive amount of work is Markov chain Monte Carlo. MCMC is an algorithm that takes as input a function. This function should be proportional to a probability distribution, but it doesn't have to be the actual normalized probability distribution. I'm not going to say how it does it, but what it does, very slowly, is output samples from that probability distribution. You can get this from the PyMC library. The error converges like O(1/sqrt(n))—that's the quadratic work, and it just comes from the central limit theorem—so if you can avoid doing this your computer will thank you, but in the absolute worst case you can always just run it and eventually get a result if you're willing to throw compute at it. Just to really emphasize: it can be hours versus seconds if you can do something more clever, but it's still a good place to start. And as I said, you don't need to know that denominator, which is hard to compute anyway. In 2008 a guy working on Wall Street told me that about 70 or 80 percent—I forget the exact number, but a very large fraction—of the computing power in the world was computing MCMC. Even though it's slow, it's a brutal workhorse that will get you the answer, and if you're trading bonds you really do have enough money to throw at MCMC if it gets you the right result and lets you make a correct trade. Nowadays it might be SGD, given that deep learning is now a big thing.

Here's the implementation in PyMC. We set up our deterministic function, linear_regression, and apply a Python decorator to it, and before that we specify the distributions our parameters come from. A and C are the slope and constant terms in the linear regression, and the variable x is our data. As you can see, we say value=x_data: whatever the input to the regression is, whatever data we're trying to fit, it's specified there, and similarly for y—the value=y_data parameter is where we say "this is known, we ran an experiment, this is our data", whereas A and C are parameters we don't know and are trying to find. Then we run the model, and essentially just by setting it up this way we can get results. The output is not a single answer but a sequence of answers: the trace function outputs a sequence of A's and a sequence of C's that all represent values that are plausible given the data.
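A minimal sketch of the kind of model being described, using the older PyMC2 decorator-style API the speaker is referring to; the data arrays, priors, and noise precision are assumptions for illustration, not his actual code:

```python
import numpy as np
import pymc  # PyMC2-style API

# Hypothetical observed data.
x_data = np.random.randn(100)
y_data = 3.0 * x_data + 1.0 + 0.5 * np.random.randn(100)

# Priors on the unknown slope A and intercept C (opinions before seeing data).
A = pymc.Normal('A', mu=0.0, tau=1e-2)
C = pymc.Normal('C', mu=0.0, tau=1e-2)

# Deterministic node: the regression line as a function of the unknowns.
@pymc.deterministic
def linear_regression(x=x_data, A=A, C=C):
    return A * x + C

# Observed node: value=y_data and observed=True mark this as known data.
y = pymc.Normal('y', mu=linear_regression, tau=1.0 / 0.25,
                value=y_data, observed=True)

# Let the computer do a bunch of work.
mcmc = pymc.MCMC([A, C, y])
mcmc.sample(iter=20000, burn=5000)

# A sequence of plausible slopes and intercepts, not one best fit.
a_samples = mcmc.trace('A')[:]
c_samples = mcmc.trace('C')[:]
```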
In this graph the dots are the actual true data, and the red lines—can everyone see that it's a large number of red lines, or does it just look like a blur from the back?—come from plotting one red line for every sampled value of A and C. So what we get out of this is a sequence of plausible regression lines, together with what is essentially an estimate of the uncertainty in our beliefs. We're forming an opinion, and just because we have one best-fit line doesn't mean that other nearby lines are impossible; quite a few of them are plausible. One very good reason for doing this, even though it's much more computationally intensive, is what happens when we have fewer data points. There are more lines that can plausibly fit 10 data points than can fit 100. If we compare: with 100 points we have a pretty narrow band, especially in the middle—the red lines are sharp, there's not much width—and when we go down to 10 data points our estimates are a lot fatter; we have a lot more uncertainty. This is what you expect: less data, more uncertainty. If you use MCMC and the full probability distribution, you get a good estimate of this uncertainty, which doesn't come out if you use a method that only gives you a single point estimate.

Another way to brute-force this, which is a lot faster than MCMC but gives up that uncertainty, is maximum a posteriori (MAP). What this says is: take the prior and the likelihood—essentially the probability distribution that comes from Bayes' rule—and find the value of A which makes it as large as possible. This is much faster; you can usually do gradient descent or something similar, and it gives you a single best estimate, but it won't necessarily give you an estimate of the uncertainty. As a computational trick we usually take the log, because the actual functions will be very sharp whereas the log is much smoother, so all your optimization algorithms will actually work. In code we're essentially doing this: we define our prior, we define a log likelihood (the k there just indexes training versus test data), and then we compute a result using the scipy minimize function. Ignoring some tricks that are in linear regression, this is basically what sklearn.linear_model is actually doing: some kind of optimization. They use tricks to make it faster—it's a carefully optimized module—but the big idea is that they've defined a likelihood, they've defined a prior, and they're optimizing A. At the end you get an optimization result; hopefully the message will be "Optimization terminated successfully". That didn't happen every single time when I was preparing this talk, so I had to tweak a few things to make it actually work, and then you can read the regression result off the x attribute. So let's talk about ordinary least squares, which is the standard thing I think everyone in this room knows about.
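A minimal sketch of that MAP computation, assuming a Gaussian likelihood and a weak Gaussian prior on the parameters; the data and prior widths are made up for illustration:

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)
x_data = rng.normal(size=100)
y_data = 3.0 * x_data + 1.0 + 0.5 * rng.normal(size=100)

def neg_log_posterior(params, x, y):
    a, c = params
    # Log likelihood: Gaussian errors around the regression line.
    log_lik = stats.norm(loc=a * x + c, scale=0.5).logpdf(y).sum()
    # Log prior: a weak Gaussian opinion that a and c are not huge.
    log_prior = stats.norm(loc=0.0, scale=10.0).logpdf(np.array([a, c])).sum()
    # Optimizers minimize, so return the negative log posterior.
    return -(log_lik + log_prior)

result = optimize.minimize(neg_log_posterior, x0=np.zeros(2),
                           args=(x_data, y_data))
print(result.message)  # hopefully "Optimization terminated successfully."
print(result.x)        # the MAP estimates of a and c
```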
What ordinary least squares does is take Bayes' rule and ignore the P(a) term—it ignores the prior. So it's not technically correct from a Bayesian perspective, but the frequentists say that's okay: we want an objective, unbiased estimate. Here's a simple version. We let our function g be a normal distribution, so we're assuming the data was drawn the way I described on that previous slide: the function g is just scipy.stats.norm with these parameters, and this is a data set generated that way. Now, what I've done here is the stupidest optimization possible, namely a grid search—I'm not even doing gradient descent. I'm computing the likelihood as a function of the data, and as you can see, this exponential is literally just the likelihood. Actually, this code is wrong—that should be a multiplication, a times-equals—but if it were a times-equals, this plot would be about right. What you can see in this data, which on the previous slide I generated with the true a being three, is that the likelihood has a peak in the general vicinity of three. It's not exact, but it's close. You could find it with grid search, you could use gradient descent, but in one dimension, who cares—you can just look at the graph and read off the result. Obviously that doesn't work in practice, but in examples it works. The other thing that comes out is that when you change the sample size, this distribution becomes either fatter or narrower, which is again what you expect. But if you're only looking at the top—taking that one best estimate and not looking at the width—you're essentially throwing this information away.

So this is ordinary least squares; I've done nothing Bayesian here. Now let's look at it from that perspective anyway. If we take the probability of the data given a—we've ignored the P(a) term—and do a little bit of algebra, we get the exponential of the sum of the squares; take the log of both sides and we just get the sum of the squared errors. Essentially, what least squares is doing is finding a to minimize this. This is literally where the "least squares" come from. It's called maximum likelihood, rather than maximum a posteriori, because we're ignoring the prior and only looking at the likelihood.
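A minimal sketch of that grid-search demonstration—one-dimensional data generated with a true slope of 3, Gaussian errors, and the (corrected, log-summed) likelihood evaluated on a grid of candidate slopes:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 3.0 * x + stats.norm(scale=1.0).rvs(size=200, random_state=rng)

# Grid of candidate slopes; evaluate the log likelihood at each one.
a_grid = np.linspace(0.0, 6.0, 601)
log_lik = np.array([stats.norm(loc=a * x, scale=1.0).logpdf(y).sum()
                    for a in a_grid])

# The peak should sit in the general vicinity of the true value 3,
# and it gets narrower as the sample size grows.
print(a_grid[np.argmax(log_lik)])
```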
Now I'm going to explain—I know there are some talks later on about intuition in data science, or maybe that was at Anthill Inside, I can't keep track of exactly where everything is—why you actually need that prior opinion if you want to get good results. Multicollinearity is a word a lot of people know, and what it basically means is that your input data is highly correlated with itself. If we're looking at machines and we want to know the price of a machine, there are two pieces of input data that move together a lot. One is how many hours the machine has spent working—literally the time the machine spent digging a hole or moving concrete or something of that sort. The other is the age of the machine: how many years ago it was purchased. Except in rare cases where a machine sat in a barn for two years not doing any work, the number of hours and the age are basically the same—up to a constant, since hours are measured in hours and age is measured in years—but there's a clear linear relationship between them. This is a data set where least squares is not going to work well.

If there were no relationship between age and hours and we looked at the likelihood, it would look like a perfect little spot at some location; the x-axis is hours, the y-axis is age. What happens when your data is correlated is that the likelihood gets smeared out along a line, and that line is essentially the kernel, the null space, of the data. Since hours and age are correlated, one choice of parameters that's plausible given the data is "age is the only thing that matters and hours don't matter at all". Another plausible one is "hours matter, age doesn't". A third plausible one is "age has a large negative effect, hours have a slightly less large positive effect, and they magically cancel out"—if you look at this, some of these parameter sets have different signs, so in some cases we're saying an older machine is better and a newer machine is worse, but the hours worked cancel that out. This is what comes out if you just look at the data.

Now, when you do ordinary least squares you'll get a result, and if you use those same coefficients on data drawn from the exact same probability distribution, you'll get really good results—it turns out the multicollinearity doesn't matter. But if you have slightly out-of-sample data—if your model changes just a little bit between training and production—all of a sudden everything goes wrong. What I've done here is generate similar data: in the in-sample data, x0 and x1 are 50-50 matched; in the out-of-sample data they become slightly mismatched, 40-60 instead of 50-50, and the value of y has also changed a little, from 2·x0 + 3·x1 to 1.96·x0 + 3.03·x1. Really small model changes that you would hope your model is stable against. With multicollinearity, on the out-of-sample data your predictions become horrible; they just don't work at all.

Using Bayesian reasoning we can essentially minimize this issue. At the company I'm advising, I asked their heavy-equipment researcher what matters more, age or hours, and he said hours—that's where you get wear and tear; who cares if it's a three-year-old machine as long as it's still in working order. Hours are where you actually do damage to the machine. So what I did is take a prior—I took his opinion and turned it into math—and the math says age doesn't matter a lot, but hours matter quite a bit. In other words, the coefficient on age should be close to zero, and the coefficient on hours can be large—not infinite, but large. Then I compute a Bayesian posterior, which is the likelihood times the prior. Remember we had that diagonal line; we multiply it by this horizontal band and we essentially get the intersection. Now we have a pretty clear result: hours matter, age matters but only a little bit. What we do then is maximum a posteriori again, and we get results that are not quite as accurate on our test data as ordinary least squares, but on the out-of-sample data things work really well. Here's what's happening: ordinary linear regression says "I'm going to fit the data as perfectly as possible; I don't care how plausible my results are." So if you have a bunch of numbers between, say, zero and five, ordinary least squares is happy to tell you: okay, we get the number one by saying 101 minus 100.
It does add up to one, but we're essentially saying there's this magic, perfect cancellation happening out there in the real world, with real, messy data. The Bayesian says that's just crazy—there's no way the real world is generated by 101 cancelling 100—so we'll do something a little less accurate but more plausible. It's essentially putting your opinions slightly above the data. But then, if you look at out-of-sample results, a small tweak to the model gets the frequentist crazy results from ordinary least squares, whereas the Bayesian model is still plausible. It may not be perfect, but it's in the realm of plausibility; it hasn't completely blown up.

If we do the same trick we used to derive least squares, the prior contributes an e^(-gamma·a²) term; we do some arithmetic, take the log of both sides, and we get a penalized version of linear regression—essentially ordinary least squares with an extra term that penalizes large coefficients. It tries to fit the data without making the coefficients too big. This is actually what ridge regression is; this is where ridge regression comes from: it's basically putting a prior on your parameters. Okay, I'm running a little slower than I'd like, so I'll just mention there are quite a few other choices depending on what you think the data looks like. A popular choice is L1 instead of L2, which biases you towards sets of coefficients with a lot of zeros: you'll still try to fit the data, but if you get a 0.01 you'll round it down to zero because you think that's more likely. Most of the problems you run into with least squares are caused by this frequentist lack of an opinion.
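A minimal sketch of what that Gaussian prior buys you in practice: synthetic, collinear hours/age data with made-up coefficients, comparing plain least squares against ridge regression (the L2-penalized form just derived) on slightly shifted out-of-sample data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)

# Hours and age are almost the same variable (multicollinearity).
hours = rng.uniform(0, 10000, size=200)
age = hours / 2000 + rng.normal(scale=0.05, size=200)
X_train = np.column_stack([hours, age])
y_train = 0.003 * hours + rng.normal(scale=1.0, size=200)

# Out of sample: the hours/age relationship shifts slightly.
hours2 = rng.uniform(0, 10000, size=200)
age2 = hours2 / 1800 + rng.normal(scale=0.05, size=200)
X_test = np.column_stack([hours2, age2])
y_test = 0.003 * hours2 + rng.normal(scale=1.0, size=200)

# Compare coefficients and out-of-sample error for the two fits.
for model in (LinearRegression(), Ridge(alpha=10.0)):
    model.fit(X_train, y_train)
    mse = np.mean((model.predict(X_test) - y_test) ** 2)
    print(type(model).__name__, model.coef_, "out-of-sample MSE:", mse)
```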
I've got another example here that I'm going to gloss over a little. The big idea is that when I was gambling on baseball, a lot of players had been in 50 games and I had a good estimate of how good they were, but some had only been in two games. I still need to place bets based on what's going to happen tonight: there's a new player, some hotshot, he's in the game, I can't just ignore him. Let's say Chris is the new player and he's been in one game. If you just do ordinary least squares, the result of that one game—I might have done really well, but it's probably a fluke; if I did well in one game it was probably just luck the first time—least squares will say that's not luck, that's truth. Another thing you can do is shrink everyone towards the average. This is again a prior: most people are average until I have evidence otherwise. I'm going to gloss over this one in the interest of time, but basically, as you get more games, you move further away from average towards whatever the true value is. In cases where you have a lot of new players with very few data points, forming the opinion that most people are average until proven otherwise is a great way to mitigate that issue and avoid the instability that otherwise shows up.

Outliers. One of my favorite interview questions is to ask about outliers, and here's the question: what the heck is an outlier? It always starts with answers like "it's a thing that doesn't seem to fit with the data" or "it's caused by errors"; no one quite knows how to define it. The problem with outliers is that they're a miracle. For those unfamiliar, this is a painting of Jesus walking on water. I think the odds of seeing a person walking on water, as opposed to just falling in, are very low. Now, if I actually saw this in real life rather than in a painting, I would become a diehard Christian right there: I've seen something wildly unlikely, but I saw it with my own eyes, so I have to update my opinion on that basis. Outliers are essentially the same idea.

So here's the problem with outliers. I have a nice clean data set, and in the bottom right corner there's that one extra data point that just doesn't fit. Ordinary least squares assumes your errors follow a normal distribution, and as a result it has to fit that data point; the net result, as you can see in this test data, is that the prediction just doesn't match reality at all. The problem is that the normal distribution is a really strong opinion that says the point way out in the lower right corner is impossible. So when something "impossible" happens, you have to really change your view of the world: it's a four-sigma outlier, the probability of it occurring is about 10^-5, so you radically change your world view and skew the line very far just to fit that one data point.

Here's another possibility: maybe my errors are not normally distributed. When you have a data set with a bunch of points that are just not that close to the line, maybe the Gaussian assumption is unreasonable. Here's a graph of a Gaussian versus an exponential decay and a polynomial decay; these are different opinions about how rare I believe one of these distant events to be. If you start seeing four-sigma and five-sigma events all the time, you know it's just not a Gaussian. On the other hand, if we compute the probability of that distant four-sigma event with an exponential distribution, it's not common, but it happens about one percent of the time; with a Cauchy distribution it happens about seven percent of the time. So if we have a different opinion about how our errors are distributed, we might update our inferences. That was a data set with about 100 points and one of them was kind of far out, which roughly agrees with an exponential distribution. This is a very hand-wavy way to do it—there are more rigorous methods—but it's a guesstimate that is hopefully plausible.

So all I had to do to use the fatter tails was replace my Gaussian with an exponential distribution like this—a two-sided exponential (Laplace) distribution, not the usual one-sided exponential. I literally just altered my log likelihood, which in one dimension happens to be just a vector norm, and then we minimize it. Again, a few tweaks are necessary: Nelder-Mead is a method that copes with the fact that the absolute value has a kink in it—minor computational optimizations. As you can see, the green line is the true line, and the yellow line, which matches it (I hope everyone can see the yellow line), basically says: okay, one data point is far off, but all the other data points are on this line, and that one far-off point fits our assumptions—it's perfectly plausible, so I'm not going to change everything because of it. It's not a miracle; it's just what you expect once in a while. So essentially, by tweaking the error distribution, you don't have to do anything special for outliers.
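A minimal sketch of that swap—replacing the Gaussian log likelihood with a two-sided exponential (Laplace) one and minimizing with Nelder-Mead—using made-up data with one far-out point:

```python
import numpy as np
from scipy import optimize

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=100)
y[-1] += 40.0  # one point that "doesn't fit"

def neg_log_lik_laplace(params, x, y):
    a, c = params
    # Laplace (two-sided exponential) errors: log density ~ -|residual|,
    # so the objective is just the L1 norm of the residuals.
    return np.abs(y - (a * x + c)).sum()

# Nelder-Mead copes with the kink in the absolute value.
result = optimize.minimize(neg_log_lik_laplace, x0=[0.0, 0.0],
                           args=(x, y), method="Nelder-Mead")
print(result.x)  # should stay near (2, 1) despite the outlier
```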
Outliers are a real part of your data, and if you understand where they come from and what they look like, you can just build a model that fits them. This is Nassim Taleb, a fairly famous figure; he is constantly pointing out to everyone in the world that you should not assume a normal distribution. He says that no one knows how to use fat tails—he exaggerates a little, because that is his nature—but he raises a very good point: everyone assumes a Gaussian, everybody knows the Gaussian is false, and they keep doing it anyway and getting bad results. You should be aware that it's wrong and try to take it into account.

I'm running a little short on time, so I'm going to skip ahead slightly and talk about non-Gaussian data a bit more. In the real world a lot of data isn't even centered anywhere. Here's one example. I have a bad cricket data set that I don't fully trust and a good baseball data set that I trust tremendously, so this data comes from baseball. This is the distribution of scores for the player Omar Infante, and this is the distribution for Chris Stewart. If you notice, the scores are approximately exponentially distributed for each of these players. Mike Trout is really good, which means he's more likely to get 30 or 40 points—these are fantasy points, by the way, not real points; this is for fantasy sports—whereas Omar Infante can also get a 30, but it's less likely, and Chris Stewart probably not; he's not as good. What's varying here is the decay rate of the exponential distribution. The data is not centered anywhere: if you grab the data, most people score zero points in any individual game. So this is not a Gaussian; it's not any kind of centered distribution.

Here's another issue you run into a lot, particularly in stock market data. One of the assumptions of least squares is called homoscedasticity, which means the variance stays the same regardless of the value of x. If we're looking at stock prices, we have some predictive factor and we're hoping to see a line: as that predictive factor goes up, the delta-S goes up, so where the factor is high you buy and where it's low you sell. In the real world a lot of data looks like this: when the predictive factor is low, things seem to go up a bit, and when the predictive factor gets very high, everything kind of blows up—sometimes it goes very high, sometimes very low—and moreover the variance of your distribution changes with the input: the variation in y changes with x. This is real-world behavior. One thing to really note, if you want to be gambling on the stock market: this is an area where you go all in, it's a pretty safe bet over here, but out here you don't want to gamble all your money, because there's a good chance you're going to make money and also a really good chance you're going to lose everything. It's really important to understand how this variance changes, because that's where your risk lives: even if it's a bet with a high expectation, the variance is so large that you should be afraid of it. So again we do the same trick: we look at our data, we formulate a hypothesis about it, and the hypothesis I'm going to form is that not only does the mean of my normal distribution change with x, so does the variance, and they change at different rates. I don't know what those rates are, but I'll try to find out.
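A minimal sketch of that hypothesis, assuming both the mean and the scale of a Gaussian are linear in x and fitting all four coefficients by maximum likelihood; the data and rates are synthetic and made up for illustration:

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(3)
x = np.linspace(0.1, 10, 300)
# Synthetic heteroscedastic data: the noise grows with x.
y = 1.5 * x + rng.normal(scale=0.2 + 0.5 * x)

def neg_log_lik(params, x, y):
    a1, c1, a2, c2 = params
    mean = a1 * x + c1
    scale = np.abs(a2 * x + c2) + 1e-6   # keep the scale strictly positive
    return -stats.norm(loc=mean, scale=scale).logpdf(y).sum()

result = optimize.minimize(neg_log_lik, x0=[1.0, 0.0, 0.1, 0.1],
                           args=(x, y), method="Nelder-Mead")
a1, c1, a2, c2 = result.x
# a2*x + c2 is the estimate of how the spread (the risk) changes with x.
print("mean ~", a1, "* x +", c1, "; scale ~", a2, "* x +", c2)
```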
If you plug this into the same formula and do all the same arithmetic, you basically get similar Python code—just a slightly more complicated log likelihood—and the same minimization, plus a couple of computational tricks I'm going to gloss over because I'm running low on time. Where plain least squares just doesn't fit very well, the maximum likelihood fit with this modified model, which takes into account where your data is coming from, fits the data very nicely. Another really great thing about this method: you see the light gray lines at the top and the bottom? That's what comes from the a2·x term we found, which represents how the variance changes with x. So by doing this, not only do we get an estimate of the average, we also get an estimate of the variance, which actually helps us avoid that danger zone: it helps us make bets when it's nice and safe and avoid the really dangerous cases, or at least reduce our exposure there. So basically, just by tweaking these parameters based on some understanding of the data and careful modeling, we can extend linear regression to a wide variety of cases where it seems like it shouldn't apply—and again, these are three examples where either I or someone I know has really made money off it.

I should also mention that none of this is limited to linear regression. If you have some other differentiable model—if you're a neural network person—this is also a great way to figure out what your loss function should be, or what parameters you should be predicting. Essentially, if f(x) were something other than linear—your favorite neural network, your favorite LSTM, whatever—you just wrap it in the appropriate loss function, which you derive by looking at your data and figuring out what you're trying to optimize. The same tricks—gradient descent, SGD, all that stuff—will work for the most part, assuming you don't choose g too crazy, and you can do all the same things but more accurately predict what you want, just by understanding the problem.

So I'm going to conclude here. Richard Feynman was a famous physicist, and he had what he called the Feynman problem-solving algorithm: write down the problem, think really hard, then write down the solution. It was very tongue-in-cheek—he was much smarter than most people, and most people couldn't actually do this—but the key point I want everyone to take away from this talk is: don't skip step one. Everything I've done in this talk that solves real-world problems is literally just step one—write down the problem carefully—and then everything else follows. I'm hoping some of you leave here and do the same thing. Do I have time for questions? We have five minutes for questions.

Hello, hi. I have two questions, actually. The first is: when you have a non-conjugate prior, how do you estimate the parameter? Do you still do MCMC, or some other trick? Sorry, I'm not hearing you very well—ah, so when you have a non-conjugate prior, where the prior and posterior are not in the same family of distributions, can you still apply the MCMC algorithm? You can use MCMC. If you want to understand the full variance of your estimates, you would need to use MCMC or something like it; I mean, MCMC is just a brutal first choice—it works well, but if you can optimize it, it'll be better.
Or you can use maximum a posteriori if you just want one point estimate; it really depends on what your use case is. Okay, and one more question: once you do this Bayesian linear regression—for example, in simple OLS we verify a lot of assumptions—sorry, I'm having a little difficulty... so once we finish this Bayesian linear regression—can anyone—yeah, so once you do this Bayesian regression, what are actually the—for example, in simple OLS linear regression we check a lot of—I can't hear anything back here, sorry, can you translate? Sorry, I'm just having difficulty hearing; I think the acoustics are less than great.

Hi, my name is Amit. Thanks for the talk. At the start of the talk you said that the stock prediction you were doing no longer works. Correct. So what actually changed? Did you analyze why it didn't work anymore? Essentially, when you're running a strategy like the one I was running, you want volatility, which means the market goes crazy, and what happened—I guess about four or five years ago now—is that the markets became boring and politics became interesting. All the volatility in the stock markets kind of vanished, and things only moved in single big, clean movements—like what happened yesterday: Facebook's earnings weren't great, and instead of the price wobbling around, it just dropped—and that wasn't really something the model was built to take advantage of. Also, there were a lot of guys just like me in the market running algorithms like this, and the markets became more efficient.

Any more questions? We have time for at least two more questions. Any other questions from the audience? We also have Slido, so if you want to ask questions anonymously you can do that; you can also post your questions there. There's one from Slido here which I'd like to ask: is it okay to use betas from penalized regression for attribution purposes? Sorry—is it okay to use betas from penalized regression for attribution purposes? What do you mean by attribution? The second part is: can I multiply beta times the x value and say that is the impact of x on y? So that's more a question about causality versus correlation. Linear regression technically only finds you correlations. If you want to determine true causality, it would be better if you could inject some controlled randomness into the situation—you would need to change one of these, say x1, at random, in a way that you're certain does not change x2—or, on the flip side, you would need some kind of clear understanding of why x1 and beta1 must be the causal factor and x2 is just correlated with it somehow. So causality is a much trickier thing to measure. Okay, thank you, Chris.

For people who are standing at the back, we have a balcony upstairs, so if you don't have a seat you can move upstairs. You can use the Hasgeek Wi-Fi—the password is geeksrus—to connect. We have the ML with Amazon SageMaker workshop starting at 10 in Auditorium 3, and we have Akash Khandelwal's "Improving product discovery via relevance and ranking optimization" starting in Auditorium 2 at 10:20. We also have feedback forms, so we'd appreciate it if you give feedback to improve future editions of the conference.
What better topic to discuss after regression than a study in classification—in particular, the problem of Harmonized System classification—by none other than Ramanan Balakrishnan.

Okay, great, so I can hear myself now. Okay, great, so let's start with the second talk for today. I hope everyone's settled in. I promise there won't be as much math as before, but hopefully there are still some takeaways for you to make use of. My name is Ramanan Balakrishnan; I work at a company called Semantics3. If you haven't heard of it, I'll probably talk to you later, but today this is going to be a talk about classification in general. For those of you who get the reference: "A Study in Classification"—yeah, it's related to something else. Before I jump in, here's a small blurb. The company I work at is a small startup called Semantics3; we help e-commerce businesses all over the world and we build a number of data and AI solutions for them. Some of the common tasks, which most of you are probably also familiar with, are classification and entity recognition—there's a lot of NLP involved where you want to extract things out. We also have expertise in unsupervised extraction, product matching, search, and of course wide, large-scale distributed crawling. All of these things fit together, and today we're talking about the first part, classification in general, and how you go about approaching some of these problems—because sometimes it seems like a simple enough problem, but only with some effort do you start seeing the complications it involves.

With that out of the way, let's get started. I'm quite sure we have a lot of fans in the house—looks like the party just ended, a lot of upsets, a lot of surprise victories—and I'm guessing all of you had your own favorite team you were rooting for; hopefully they won, or maybe not. But let's pick a team and say we want to monetize the whole affair: we want to sell jerseys for this team across the world and make money out of it. Ninety dollars—that seems like a lot, but okay. The interesting part is that you make these jerseys and they come with their own set of features; for this use case it's mostly the more abstract part: it says 51% polyester, 49% recycled polyester, double knit. This is probably going to be relevant later on in the talk. The idea is that you go to China, where apparently everything is made these days, you decide to buy a lot of these jerseys, and you want to sell them across the world—in all the countries that took part, and hopefully more. One country that didn't make the cut this year was the United States, but you decide you want to sell there anyway.

So what is the process? How do you go about it? This is just to give you the domain idea. You have a t-shirt, you go to the customs officer, and he gives you a book—imagine 6,000 pages, so that's like 500 pages and then 10 more volumes of it. You get this book and start flipping through the pages—page one, page two, page five, page five hundred, six thousand—and at the end of the day you want to find what to call this jersey. It's a jersey, right, but how do you classify it? To cut a long story short, it's actually called a Harmonized System code, or HS code.
This is probably familiar to those in the logistics space, but the idea is that this classification system standardizes global trade, and for this specific use case the jersey falls into chapter 61, which is apparel, knitted or crocheted. The idea is that you shouldn't make a mistake right off the bat, because chapter 62 is something very, very similar—everything is the same except a few riders. And of course it would be simple if it were only "61", but then you go further and say it's for jerseys, pullovers, cardigans, and further still, that it's made of man-made fibers. The whole task is to start with a specific product and, at the end of the day, come up with a classification label—and this is going to impact a lot of things downstream, which is what we want to focus on in this talk. It's not just about getting labels correct; it's about analyzing the impact as well.

Just to make the case for HS codes: they are set up by the UN, or the World Customs Organization within it; the signatories are over 180 countries with agreements all over the world, covering roughly trillions of dollars of global trade in goods. So it's a pretty important code to think about; there's a lot riding on it. Or, to take an example closer to home—there are fans here, and there are people at the other extreme who don't like the GST—take this bottle of jam; it says Kissan jam. How do you classify it? Do you think of it as a preparation of a fruit, in which case the government is going to ask you for 12%? Or do you think of it as something which just has sugar, a confectionery that you spread on your cakes or your bread, which attracts 18% GST? That's going to be roughly a 50% increase in the tax you end up paying. So these sorts of classification systems are present all over the world; the impact is just slightly different depending on the use case you're talking about.

So the first step—you heard Chris: write it down. Let's write it down. We want to formulate a problem statement, otherwise this isn't going to be of much use, and the very simplistic way of writing it down is just to say: solve the HS code assignment problem. At this point you're thinking it's just a classification task, right? You just end up classifying things. For that I had this Cueball here from xkcd. Cueball is thinking: okay, I'm good at understanding numbers—that's step one—and the stock market is made of numbers, so therefore... and then boom, suddenly, where did all my money go? The idea is that you don't want to preempt yourself by starting with such a simplistic problem statement; you want to make sure you have the right ideas in mind. For that we went back and wanted to see what the current state of the art is—you're trying to replace a process that has very likely been there for decades. So we went to the companies we work with. For example, they have this huge warehouse where they ship products in; this is where the inventories are managed, products are tagged, prices are assigned, and boxes are filled.
And then they literally have people who open up the boxes, scan the codes, look inside, inspect the products, write it all up, and then look things up in these huge books and start annotating. If someone has put this much effort into doing things like this, the point is to ask whether it's really worth replacing. This process probably takes on the order of minutes—maybe three minutes, five minutes—and if you're shipping multiple boxes, it becomes a very time-consuming task. So let's not oversimplify the work of the people doing it; let's come back and formulate a better version of the problem statement, or at least formulate it in a simpler, more understandable format.

The idea is that life imitates art—or let's see whether our systems can imitate the state of the art here. You have an input product—this is also very similar to the talk Shailesh gave yesterday on stimulus and response, if you think about stimulus and response as how your inputs and outputs are set up—you pass it into a black box (we'll come back to that later), and you get an output label. In most cases, if you start with just simple data, it's probably going to be a barcode, a UPC code, and that's not going to give you enough information about the product. If you only use things like that, you might as well say one thick line translates to this code and a thin line translates to some other code—that's just rubbish data. You're going to need slightly more complicated information, a much better set of features: the name, the description, something about how the product was made, the country of origin, things like that. And then of course you have the output label; following the same example from a few slides back, you end up with 6110.30, which in real-life speak is one of those HS code labels. So the idea is that we want to build this classifier in the middle, this sort of black box, which people like to think of as "I just throw numbers or data in and it hopefully gives me the correct label". For that I usually like to use this picture from scikit-learn—we'll look at it in a bigger slide later—but the idea is that you want to consider multiple models, different options at different parts of the system.

Before we get started on that, though, what do we do? We need to collect data. It's better to start collecting first, rather than realizing you need data after you've started modeling. So the first part is collection: where do we start the whole process? It's better to start at the beginning and, hopefully, be able to get labelled data; that's probably where the crux of the problem is. Whenever someone asks how to get started with any of these, the blocker is not the model—TensorFlow is open source, you have all these libraries, the cloud is not free but it's open to everyone—the blocker is the labelled data. Hopefully, for this use case, we can leverage international shipping records: most countries require these to be publicly disclosed, so whenever you bring something in or out in a container, it's available to the government. But the problem is that the data is going to be really poor: people are just going to write "shoe" and leave it at that.
There are also third-party aggregators—someone is probably already monetizing this data: if it's being given away, they've probably collected and stored it together, and hopefully they can make much more sense of it. And then, of course, the worst-case scenario: you sit down, open your Excel sheet, and start writing records yourself. Hopefully it's not too expensive, but this is probably the highest-ROI thing you can do, especially if you're starting a new project from scratch: collecting labelled data is probably the best use of your time, even though it's not glamorous, even though it's not the data science role you were hoping to do. The idea is that you start with data, and from there you go to the modeling part. Hopefully, after many weeks or days—and hopefully not months—you end up with some huge sheet like this; this is just an illustration with columns of information and the HS code as the target output. Hopefully you collect something like this and it's of good enough quality for you to use.

Once you have this sort of data set, what do you do? Now we model. This was the picture from earlier; it's just a nice way of thinking about processes. The idea is that you solve problems by pattern matching: over time you build up experience—there are talks about intuition later—of what tends to work and what doesn't in different paradigms. Once you have your own version of this type of chart, you start thinking: is it a clustering problem, a regression problem, something to do unsupervised extraction on top of, or a classification problem? You probably have different versions of this in your mental model, and you start separating out the options that are completely incompatible with your idea and thinking about the ones you will actually spend effort on. That effort is something I like to call gradient descent—just not the gradient descent your computer calculates. This is your own mental gradient descent, where you have your own hyperparameters to tune to get the right results. Think of it like a hill-climbing problem: you consider multiple options, you tweak your parameters, and this sort of human-powered gradient descent is what we end up doing when we select different models, or different parameters and settings within each model. Again, these are things that come with experience—there are probably much more qualified people who can talk about the intricacies of the models and the details—but for me the idea was: you get something done here, you get a model out, but then you want to keep improving on it and see what the impact and outputs are. The classification model works, but how do you start measuring around it?
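As one entirely hypothetical example of a first-pass model in that loop—assuming the labelled sheet has been loaded into a DataFrame with free-text columns like name, description, and material and an hs_code target (these column names and example codes are made up for illustration)—a scikit-learn text-classification pipeline might look like this:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labelled shipping records.
df = pd.DataFrame({
    "name":        ["football jersey", "strawberry jam",
                    "leather shoe", "cotton t-shirt"],
    "description": ["51% polyester 49% recycled polyester double knit",
                    "fruit preparation with added sugar",
                    "men's leather footwear",
                    "knitted cotton tee"],
    "material":    ["polyester", "fruit", "leather", "cotton"],
    "hs_code":     ["611030", "200799", "640359", "610910"],
})

# Collapse the text features into one field and learn code from text.
text = df["name"] + " " + df["description"] + " " + df["material"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(text, df["hs_code"])

# Classify a new (hypothetical) product.
print(clf.predict(["team jersey 100% polyester knitted"]))
```

This is only the shape of the simplest baseline; as described next, several model families (deep learning, fastText, and others) were tried before one was chosen.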
We tried the scikit-learn classifiers, deep learning models, and fastText, which is really popular these days — a whole bunch of different models — and then we chose one. The next step is some sort of validation: how do I test my model, how do I verify whether it's actually good enough? Because at the end of the day this is the bottom line. If you go and tell your client or your CEO "it works great", they are not interested in whether you used a million nodes or a hundred billion nodes; they are interested in the bottom-line accuracy. They'll bring back the same product and ask you: what is the output answer? The bottom line was that at this point we were at about 40 percent. (These numbers aren't a hundred percent faithful to our efforts at the time, but the orders of magnitude are roughly right.) At that point you give up, lie back, and wonder where you went wrong. You start reconsidering your life choices, like "should I have done economics in school?", and then you go back to the drawing board — hopefully not all the way back to your childhood, but to the time you started the model — and reconsider. One of the assumptions you make early on is that the data you collected has the necessary information; I like to ask "does the input have that?" as the first question. Feature engineering is something people really like and spend a lot of time on; it's a huge place to get lost, tweaking and attributing importance to different features. But for our use case, with name, description, material, and usage for the product, you want to make sure the underlying assumptions still hold. There are three aspects to that. One is availability: given a shipping record, is the feature actually there? Someone might simply not say what the product will be used for or what it's made of, so the complete lack of product information is a factor. There's also accuracy: if it's made of wool but someone says it's made of cotton, that's completely wrong. Those are factors you consider when buying, building, or training on a dataset, and later when evaluating the accuracy numbers. And finally there's specificity, which you might ignore initially, because when we build idealized training systems we often assume a lot more detail than we can expect in the real world; we should really be prepared for much sparser input data, what I call "lesser data". Specificity is a problem where, for example, the record might say 50 percent polyester and that might be the right answer, but someone else just writes "polyester", while the regulators are so strict that a one percent difference in composition can double the tax you pay.
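Before blaming the model, those three aspects can be audited directly on the data. Here is a small sketch of checking availability and (very roughly) specificity per field; the file and field names are assumptions:

```python
# Quick data audit: how often is each feature actually available, and how detailed is it?
import pandas as pd

df = pd.read_csv("shipping_records.csv")   # hypothetical file and columns
fields = ["name", "description", "material", "usage", "country_of_origin"]

for col in fields:
    present = df[col].notna() & (df[col].astype(str).str.strip() != "")
    # Token count is a crude proxy for specificity ("polyester" vs "50% polyester, 50% wool").
    avg_tokens = df.loc[present, col].astype(str).str.split().str.len().mean()
    print(f"{col:>18}: available {present.mean():6.1%}, avg length {avg_tokens:.1f} tokens")
```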
So if the record is just a single-line entry with the item name and the price, that's data you have to supplement by other means: you go back and look up the product information from some other source, which is exactly what people do in the manual task. They take the product, Google it, find the relevant information, and pull it into the system before feeding it to the model. Things like that often become essential before you start optimizing any one aspect of your system. The other aspect here: one was assuming the data was there, the other is using the structure of the output you expect. One thing I glossed over initially is that the output label, a four-level code, is not really a flat one-to-N system. If you think of random monkeys on typewriters with a 6,000-class output, a random guess lands far below one percent accuracy; you don't want to jump straight into such a fine-grained discrete decision. There is also structure in the data, which seems obvious in hindsight: people usually only report the first number, but they're really interested in the impact as it cascades down. If you can isolate the levels of categorization, you can treat it as an inherently nested, tree-like structure, and as you make subsequent decisions the search space for your output range gets far smaller. You end up making decisions where the relative impact of being wrong becomes less severe the deeper you go into the decision tree, and at the end of the day some countries have the same tax bracket once you're two or four levels down, so it probably won't affect the bottom line for the company. And of course, predictions imitate data: what you feed into the system is what comes out. If you have outliers in your input data and your model is very sensitive to them, you end up skewing your own answers towards much worse results. Just for context, the code of regulations (this is the US one, I think) recommends a standard of "reasonable care": importers must exercise reasonable care. That's legalese, so I have no idea what it means, but when we look at the data, what it translates to is that one in three products were misclassified even in our own training dataset. That was a shock to us, because this is data that has cleared customs; it's not something we built manually. The officer presumably said "okay, I'll take the product, this seems right to me", and then we see the same product go through again and get a different code, and a different person who perhaps came to work late or hadn't had coffee said "okay, I'll still take this code" even though it differs from what was previously approved. These are the training datasets you build on. If your data is going to look like this, your predictions need to account for it.
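To make the nested-tree idea from a moment ago concrete, here is a minimal two-level sketch: predict the 2-digit HS chapter first, then the 4-digit heading within that chapter. This is one illustrative way to exploit the structure (with assumed column names), not necessarily what the speaker's team built:

```python
# Hierarchical sketch: predict HS chapter (first 2 digits), then heading within it.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

df = pd.read_csv("shipping_records.csv")        # hypothetical columns
X = (df["name"] + " " + df["description"]).fillna("")
df["chapter"] = df["hs_code"].astype(str).str[:2]
df["heading"] = df["hs_code"].astype(str).str[:4]

# Level 1: one model over ~100 chapters instead of ~6000 leaf codes.
chapter_clf = make_pipeline(TfidfVectorizer(min_df=2), LogisticRegression(max_iter=1000))
chapter_clf.fit(X, df["chapter"])

# Level 2: one model per chapter, trained only on that chapter's records.
heading_clfs, single_heading = {}, {}
for chap, grp in df.groupby("chapter"):
    if grp["heading"].nunique() > 1:
        clf = make_pipeline(TfidfVectorizer(min_df=1), LogisticRegression(max_iter=1000))
        clf.fit(X.loc[grp.index], grp["heading"])
        heading_clfs[chap] = clf
    else:
        single_heading[chap] = grp["heading"].iloc[0]

def predict_heading(text):
    chap = chapter_clf.predict([text])[0]
    if chap in heading_clfs:
        return heading_clfs[chap].predict([text])[0]
    return single_heading[chap]   # only one heading ever seen under this chapter
```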
Now the other question: what if repeatability is not guaranteed? When the same product clears with different codes, what do you do? The classification is built into the very process you're looking to replace. Do you approach it as a consensus problem: send it to two different people, see if they agree, and if they don't, send it to a third? These sorts of consensus or proportional labelling schemes will influence how you treat your data as well. So again we went back, tweaked all of these things, made all the changes necessary, and then the boss comes in and asks the same question again (he usually asks a thousand more), and the bottom-line accuracy, just for scale, was much higher, closer to 80 percent. At this point you're thinking, "what's the point, I can get 97 percent on ImageNet or something", but it's not just about the accuracy numbers; it's also about the process you're looking to build and the process you're looking to replace. So step two is: think harder (it's a nice surprise that it lines up with the problem statement), but really think hard this time about what it would cost. At the end of the day you ask Thanos "what did it cost?", and the answer is: there are compliance issues. If you get it wrong, there are statutory audits, there are mandated requirements to disclose everything that went wrong, and there are requirements that bring a lot more legal trouble for your company. There are also monetary problems: what if you declare a code and it ends up in a seven percent, or twenty-seven percent, tax bracket? And finally, if someone is not happy with your system, they might just push the terminal to one side and go back to their old way, and then you're back to processing shipments in minutes instead of seconds. These are cost functions that are not baked into your system, but which you probably need to consider for the business use case you're trying to solve. And of course, it's all in the data; whatever you want is probably in there. One thing is that we work with e-commerce companies, so we probably don't care about shipping cows across borders, or about grain stock moving from one place to another; we're interested in smaller sections of the code book. So we check whether we have enough samples in those sections: if most of the shipments are electronics and not grain, does it matter if you get grain wrong, or should you collect more data just for electronics instead of shipment records where most of the data is different? And then there's one approach we really like, which is to train people to train machines. There's a quote from Richard Socher, formerly at MetaMind and now, I think, at Salesforce: instead of spending months trying to solve a problem, just label some data for a week and hopefully you can solve it much better.
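A tiny sketch of the "do we have enough samples where it matters" check mentioned above: compare how your labelled data is distributed across code sections with how your actual shipment volume is distributed. File names and columns are assumptions:

```python
# Compare labelled-data coverage against the business's actual shipment mix.
import pandas as pd

labelled = pd.read_csv("shipping_records.csv")   # hypothetical: labelled records with hs_code
shipments = pd.read_csv("our_shipments.csv")     # hypothetical: hs_code of what we actually ship

# Work at the 2-digit chapter level (roughly electronics vs. grain vs. apparel).
label_share = labelled["hs_code"].astype(str).str[:2].value_counts(normalize=True)
volume_share = shipments["hs_code"].astype(str).str[:2].value_counts(normalize=True)

gap = (volume_share - label_share.reindex(volume_share.index).fillna(0)).sort_values(ascending=False)
print("Chapters where we ship a lot but have few labelled examples:")
print(gap.head(10))
```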
And there's a tool I really like for this; you could probably build a version yourself, but this one comes from one of the creators of the NLP package spaCy and is called Prodigy. The idea is that you have these interfaces, whether you're a data scientist, a QA person, or just someone checking that the data is all kosher: you look at an example, the system makes a prediction (it guesses that the Nintendo Switch is what you're interested in), and you simply annotate yes, no, or reject. Two things come out of this. One, you help build a higher-quality dataset with minimal trouble; you could probably get anyone who's bored for a few minutes to play through this game. Two, it helps your system understand which areas it tends to get wrong, so you can build an online, real-time learning loop that shows the next example based on the ones it got wrong. These human-in-the-loop systems, which are becoming really popular these days, are something you should seriously consider for improving your model's performance, beyond just measuring the bottom line or buying datasets. The other thing is that it's really important not to be afraid to peek under the hood of your whole process. Say this is a typical workflow: it goes from a seller to a warehouse (that's where your classifier and processing systems sit), you have a QA team that inspects maybe one in a hundred packages, then it goes to customs, and finally to the buyer. Usually you only monitor the QA team and treat whatever they say as your feedback. But you should build feedback into all the other parts of the system: ask the seller themselves to validate the output, or if the customs official catches a mistake, feed that back into your classifier. This is harder in companies distributed across multiple continents, where these steps happen in different countries, but these things need to be considered; instead of raw accuracy numbers at the beginning and the end, you can get much more valuable feedback, because the domain experts often sit in a different team from the one you finally talk to. So what happens in the long run? This is probably my last slide. The idea is to consider how your system is going to evolve. There will be skew over time: countries will change how they classify products, tariff wars will be announced over Twitter, suddenly there's a hundred and twenty billion dollars on the line, and if you look up the actual press release behind one of those tweets it's literally a list of codes saying what percentage each code will be affected by. Then you have businesses that specifically avoid those codes: they're selling the same shoe but they call it something else, a "sports accessory" instead of a piece of apparel. Those kinds of hacks, where people classify their goods differently just to work around tariffs, cause the same data to start looking different over time.
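Since the long-run concern here is exactly this kind of skew, a simple first line of defence is to track how the mix of predicted codes shifts between a reference window and recent traffic. This is a generic sketch with assumed file names, not the speaker's actual monitoring setup:

```python
# Crude drift check: compare the mix of predicted HS chapters between two time windows.
import pandas as pd
from scipy.spatial.distance import jensenshannon

reference = pd.read_csv("predictions_reference.csv")   # hypothetical: predicted hs_code per shipment
current = pd.read_csv("predictions_recent.csv")

def chapter_mix(df):
    return df["hs_code"].astype(str).str[:2].value_counts(normalize=True)

ref_mix, cur_mix = chapter_mix(reference), chapter_mix(current)
all_chapters = ref_mix.index.union(cur_mix.index)
ref_vec = ref_mix.reindex(all_chapters, fill_value=0)
cur_vec = cur_mix.reindex(all_chapters, fill_value=0)

drift = jensenshannon(ref_vec, cur_vec)  # 0 = identical mix, higher = bigger shift
print(f"Jensen-Shannon distance between windows: {drift:.3f}")
```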
And of course there are new product categories: someone brings out the iPad, or say a VR headset comes out soon. How do you classify these? Do they go into entertainment, into computation, or into eyewear? There's an interesting talk by Titus Winters on the difference between programming and software engineering. He describes programming as getting it to work the first time: you mash out your hello world or print the Fibonacci sequence and it works. Software engineering is that, integrated over time: the processes that need to evolve around the system so that your programming stays relevant. For me this was interesting because I'm not sure what the parallel is: if programming is to software engineering, then what is the equivalent for data science? I think that's also the idea here, where you want to talk to the people who are actually implementing systems, and hopefully we'll have a lot more fruitful discussions over the next few days. So that was my talk; you can get the slides at this link. That was a study in classification. I'm Ramanan Balakrishnan, thank you very much. We have about five minutes for questions; any questions from the audience? Hello, can you hear me? Yes, I can hear you. My question was about how we generate data for training, because that's a very important factor: if you really can't do anything with the algorithm, something is wrong with the data. But in a lot of cases that turns out to be a one-time effort, and ideally you want the process you follow for whatever model you're creating to be scalable to incoming data over time. Is there some strategy you follow to make sure the manual effort is structured in a way that scales as new, fresh data keeps coming in? One of the major issues is consistency in the dataset, and there are multiple practices around this. There's also cause and effect: when your system is online and new data is coming in, how do you reconcile that data with what you already have? Some past practices involve keeping the old process alive even after your system has replaced it, but at a much smaller factor: when someone sends ten products, maybe nine still go through the original system but you keep a separate version, or you process all of them and then still do one more manually, and those help you see where the deviations occur. This is much more common in ranking systems, where you show five products in order and the mere fact that you showed one first influences your output. Having cleanly separated data is not a solved problem, but there are approaches to it. Hi Ramanan, this is Nikesh, thank you for the talk. Two questions. One, I'm curious what the accuracy of the humans was. Like I said, if one in three products were misclassified and you're training against that, you're probably looking at about a 70 percent baseline on repeatability. You can't compute a single human accuracy, so what we do is ask multiple people to classify the same item, treat them as multiple inputs, and measure their internal consistency; it comes to roughly 70 percent.
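The "internal consistency" figure mentioned in that answer could be computed along these lines: several people label the same products, and we measure how often each label matches the per-product majority. The table layout here is an assumption:

```python
# Rough internal-consistency check across multiple human labellers (assumed layout).
import pandas as pd

labels = pd.read_csv("multi_annotator_labels.csv")  # hypothetical: product_id, annotator, hs_code

def majority_agreement(group):
    # Share of labels matching the most common label for this product.
    return (group["hs_code"] == group["hs_code"].mode().iloc[0]).mean()

consistency = labels.groupby("product_id").apply(majority_agreement).mean()
print(f"Internal consistency across annotators: {consistency:.0%}")  # ~70% in the talk's data
```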
And in scenarios where you could have huge costs in compliance and so on, does it make sense to have a machine-assisted system for humans, where the machine figures out the first three levels and the human does the finer bit, which still saves time? Right, so the question is about augmented learning. Yes, that's exactly how we have it rolled out for many customers as well. The idea is not to remove people from the equation; the idea is to simplify a task that took hours into minutes. That's also more effective for us when making a sale: we don't want to tell people "we'll protect you", we don't want to take on liability or risk. We help them make a faster decision, hopefully with better accuracy as well. In a lot of modern systems where it's possible to do so, you should have humans in the loop; this isn't real-time ad-serving technology, it's a processing line, so in those places we do recommend that people still monitor the results. Thank you. Okay, so we'll break now for the morning beverage break and we'll have our keynote at 11 o'clock. See you in 30 minutes, thank you. Hello, welcome back. We are on to the second session of day one at The Fifth Elephant. A couple of announcements I'd like all of you to follow carefully. As you can see by the aisles, we have kept feedback forms; the forms list the speakers and talks for all the tracks happening in Audi 1 and Audi 2, as well as for the Birds of a Feather sessions. This feedback is very critical for us to grow and to review the talks, and it's also valuable feedback for the speakers, so we request you to take the time to fill them in and drop them at the entrance gate whenever you leave. Also take note of this website, sli.do. We may not have time for Q&A sessions for all our talks, so if you post your queries there, the speakers can get back to you at any point, and if we have additional time we can pick them up at the end of each talk and pass them to the speaker so they can answer you personally. Thank you. Please note that refreshments, glasses, tea, and coffee are not allowed inside the auditorium; please help us keep the space clean. We're also running a poll on sli.do, so if you wish to participate you can do that. So: with mathematics and data science to embrace, and algorithms to crunch the data with grace, do we need intuition in the first place? Avi Patchava and Chandni Jain are here to respond to this case, with the keynote of the conference: the power of intuition in data science and why it will always have a role. Avi Patchava. Great, good morning everyone. So this is not going to be a very technical talk, but it will be a talk rich with data, some science, a lot of theory, and some conjecture: the power of intuition in data science and why, at least I believe, it will always have a role. But let's debate that. A quick introduction to myself and Chandni, who has been working on this piece with me: we've come to data science from quite a few different disciplines.
Chandni was an engineer who spent many years in trading and then developed her toolkit in data science; she's currently the founder of a startup called Auquan. My own background is in economics and the philosophy of economics, before I spent many years in consulting in the field of advanced analytics, working across eight sectors of the economy. Today I work with InMobi, a major ad-tech player in India, and I lead the data sciences and machine learning teams in Bangalore. So what started us on this inquiry? It was a spark of inspiration from a friend of mine, a professor of economics at NYU. I met him three or four months ago and asked him: how were you so successful, achieving tenure so young, at just 30 years of age? And this is what he said: my research, I've realized, has been well received not because of my abilities with mathematics (in fact he's the kind of guy who didn't even know what a normal distribution was when he started his PhD programme) but because of my ability to spot interesting questions and investigate them further. And that's what got us thinking: what is this ability, this spotting of interesting questions? So what we're going to take you through today is an interpretation of what intuition is; a lot of us have very different ideas, and we're going to put three definitions on the table. For each definition we'll argue what it means, why it's relevant to data science, and how it applies, and we'll give you a few tips on how it should change your behaviour as a data scientist if you believe in this concept of intuition. Lastly, we want to put out a theory of how we believe an understanding of intuition will inform future work in ML and AI. Some people might say: look, forget intuition, let's just build better models and be done with it. But that's missing the point. We have to think of intuition as a tool in our toolkit to be better data scientists, something that enhances our ability with data. And here's the big thesis we want to put out there. As you try to understand the world, to understand different processes, you have access to data and to experiments: A/B testing, randomized controlled trials. What we believe is that intuition, together with both of these elements, is how you develop system views of what you're trying to understand, the full system of what's going on. And only when you understand the system, the causes, and the consequences are you able to think about alternative possibilities, what are called counterfactuals: imagining different triggers and interventions to understand how the system will behave. Alternative possibilities lead to something else. But here's the thing with AI: if you look at all the progress we've made, the things we do well today are that we can anticipate and predict (we have very sophisticated supervised learning and deep learning algorithms), we can recognize and associate (we can label, do speech recognition, image recognition), we can optimize and evolve (we have good reinforcement learning systems emerging, and the field of evolutionary algorithms), and increasingly we talk about being able to generalize (the work of DeepMind, artificial general intelligence). But something, for us, is missing.
How do you get machines to imagine, to think about new possibilities, to actually develop hypotheses themselves? It's only when you have all five of these elements, we believe, that you will truly have what we can call artificial intelligence. So our thesis, what we want to argue today, is that it's the system view that ultimately enables imagination, and that we have to teach machines to be able to do just that. Now, looking at how people have historically treated intuition, I really like this quote from Nikola Tesla, who said: instinct is something which transcends knowledge; we have undoubtedly certain finer fibres that enable us to perceive truth when logical deduction, or any other wilful effort of the brain, is futile. And not just Tesla; there are plenty of quotes from other great thinkers, especially Einstein, with this same instinct that there was something deeper in how the human mind worked that went beyond the formalization of the mathematics. But at the same time there's been a more recent trend, and some of you will be squirming in your seats saying: but what about Daniel Kahneman, Thinking, Fast and Slow, four or five years ago, who told us that expert intuition works less often than we think, and that someone's confidence in their intuition is not a good indicator of whether their hypothesis or their view of the world is valid? And increasingly, with the explosion of big data and analytics, we see views like the one I picked up from an Intel executive who says the methods and practices of intuition are redundant: now we have data, so let's just use data to tell us the answer, the way the world works, the way the systems work. So how do you reconcile these two very different viewpoints? Well, let's look at what we all thought. We sent out a survey a couple of days ago, "Is there a sixth elephant in the room?", and I'm hoping a hundred of you in this room had a chance to respond. The first thing we asked was: can you remember a time when you used a hunch, a feeling about the underlying problem, to better understand what you were doing with your model? I was really happy to see that the vast majority of us agree we've done this at least once, that we've been guided by our instinct, our hunch. But when we asked for three words to describe what you think intuition means, we saw a smattering of very different responses. This is the word cloud: a hundred of you responded, at least three words each, and there was such diversity that only a handful of words were frequent. We still see words like "data" coming up when thinking about intuition, we see "common sense", we see "bias"; so not everyone is aligned on what intuition means to us and how to use it. And when we asked the provocative question, "when your results clash with your intuition, what do you do?", the vast majority, two-thirds of us, said "I'll scrutinize my results, but if the results hold, the results hold". Frankly, I personally would have been in the green bucket: I would have said I'd have a deep discomfort for a long time until I was able to reconcile what the data was telling me with what my intuition was telling me. But only a third of us say we're in that bucket. So we have different views.
And if I was to ask you, what are the characteristics of intuition, we might say it's instantaneous; that it's spontaneous and you can't explain where it comes from; that it's beyond logic, neither logical nor non-logical; that it's about having an immediate, holistic hunch. But the problem with all these views is that they don't really explain what it is for us and what its impact can be. So we want to propose a three-level hierarchy for understanding intuition. First: intuition is having judgment. Second: intuition is being able to see the causes and consequences in what you're trying to understand. And lastly, the kicker: intuition is seeing the system view as a whole, appreciating the full system. One caveat before I get started: there's another view of intuition that a colleague alerted me to and that I heard a lot in the talks over the last few days, the notion of intuition about how an algorithm works, the mathematical intuition of what's happening. That's not what we're talking about. To be clear, we're talking about intuition when you're trying to understand what's happening in the world, and your algorithm is just a tool for that: a business problem, a social problem, whatever it may be. Let's take each of these levels and explore it further. I really like this quote from Benoit Mandelbrot, the famous polymath and mathematician, who said that intuition is not something god-given to him, not the whisperings of a goddess (more a phrase for my friend Farhad); it's something he trained, trained to understand obvious shapes which were initially rejected as absurd. What does this training mean? About a year ago I was having a conversation with Prithvi, the chief risk officer at a growing Indian startup called InCred, someone who actually spent a long time training me in advanced analytics a few years ago. Prithvi spent years in the world of credit assessment: five or six years at Capital One, and five or six years at McKinsey in the financial practice. And he told me that as he builds algorithms for credit assessment, to decide whether someone should receive a loan, he finds that no matter what he does he still can't quite match the performance of the people in the cities and towns of India who play the role of credit assessors, people who sit face to face with someone, do the assessment, and use their internal instinct to decide whether that person deserves credit. He found it a genuine challenge to build models that could truly replicate that; from years of experience, what these assessors developed in their minds was just not something he could easily replicate. This thread of thinking about intuition goes a long way back, as far as the 1940s, when Herbert Simon, the Nobel Prize-winning economist, started looking at intuition and described it as a way of codifying simple analyses that get frozen into habit in a person. So let's try to sketch a definition. If I had to describe it, these are the words I would use: in this case, it's deciding between different alternatives, an evaluation of decisions based on alternatives.
It's something that, with talent and experience, lets even a beginner become an expert who sees without needing to rely on rules or historical cases. It's also something we can't always articulate; it's a feeling in the gut, and a literal feeling in the gut, which I'll explain in just a second. I think of it as when people have become a trained black-box algorithm themselves for some kind of decision-making, typically from years of experience. There's good work by Gary Klein in his book from the late nineties, Sources of Power, building on this: research in different professions shows evidence that hedge fund managers rely on this notion of intuition, as do nurses and midwives; even the US Marine Corps hires for people who have intuition, and a strong research company like Bell Atlantic similarly looks explicitly for people it believes have good intuition. So it's not something we're making up. What would the science of this be? For me the most influential work I've come across is by John Coates, and I love John Coates because he's very multi-disciplinary: he spent years at a Wall Street trading desk, has a PhD in economics and finance, and then went and did a PhD in psychology before he wrote this book. What he explained is that, in addition to our conscious brain and conscious processing, we all have a pre-attentive processing system. This system is not driven by the conscious brain, because the conscious brain has only limited capacity for what it can process, despite all the rich sensory information we receive daily. He said there is effectively a brain of the gut, a brain driven by the endocrine system, by the flushes of hormones we receive, and learning is also encoded from those hormonal flushes and from the experiences we have; it's not just conscious learning. The intuitive knowledge we feel we get has genuine physiological phenomena behind it: somatic charges (somatic meaning something experienced by the body, that literal feeling in the gut) and affective charges, emotions. Those are genuine guides, compasses telling us what to do and how to behave, based on real historical experience we have had. How does this apply to data scientists? Three tips. One: when you're trying to solve a new problem or process, get into it and experience it yourself. If you're doing credit assessment, go see what it's like to be a credit assessor; study and learn from the people who are the trained human algorithms. What are they trying, what are they thinking about? It's a way for you to generate ideas on features or new data you can collect, if you can push them to articulate what they're thinking about. And when someone comes to you and says "no, it's my intuition, that's why I'm deciding this", maybe a very adamant product manager, maybe someone from business, I only trust them when I'm convinced they've genuinely had time to develop such intuition, meaning I know they've seen and experienced a lot of data relevant to the problem over their career. And to share one example, from Arpin on our team, who started working on creatives about a quarter ago, a few months back.
Creatives are those advertising images we all see when advertising is put to us. The first thing Arpin did, before reading any papers, was spend time with the creative artists who make creatives. He spent a good solid two weeks just sitting next to them, shadowing them, understanding what they were doing, interviewing them (he must have interviewed about ten of them), and built an idea of what actually goes on when they make creatives, how they see and understand them. He even put out a report that he took our whole team through for a couple of hours, explaining what was going on, before he even started building algorithms to address the problem. There's an image of Jurassic Park here because we spent a lot of time looking at Jurassic Park creatives; it was that time of year when the movie was coming out, which is how I remember this example best. It can break down; it's not foolproof, because you do need a high level of expertise, there's no doubt. Kahneman says to look for disciplined intuition: when someone has had immediacy of feedback, the quality of feedback has been good, and we know there has been regularity in the process that supported their human intuition. And there are issues like cognitive biases that do happen among people; if you've read Thinking, Fast and Slow you'll be familiar with all of this. Some people will also say that a lot of the data and research I've shared potentially suffers from selection bias, which is very hard to prove one way or the other, but I would stay a healthy sceptic for that reason. Let's look at a more challenging interpretation: seeing causes and consequences. This quote by the computer scientist Judea Pearl strikes me; he says he now looks at causal relationships as the fundamental building blocks, not probabilistic relationships, which he thinks of as surface phenomena. What really matters to us is the causal machinery that underlies and propels our understanding of the world. What would this look like? It's seeing causal connections based on your understanding of the concepts at work. If intuitive judgment is the black box, we can think of this as the open box in our minds, where a causal understanding is driving our decisions. It enables us to do things like a rapid consistency check: is everything we're seeing and hearing actually connected in a consistent way, as we would expect? And often, if you push someone, you can find the theoretical causal flow they're thinking about. To share an example from the world of trading, from my co-author Chandni: she has one example, and since she can't be here, she recorded this a couple of nights ago. A really good example of this is the very famous 1906 short of Union Pacific by Jesse Livermore. Jesse Livermore was a legendary short trader. He was on vacation in Atlantic City in 1906, happened to casually visit a brokerage house, saw the stock Union Pacific on the ticker tape, and just shorted it on a hunch. Union Pacific was one of the most bullish stocks at the time; it was such a random trade that even the broker was surprised: are you sure you want to sell, did you mean buy? But he just felt it was the right thing to do, so he sold the stock. Surprisingly enough, the next day the great earthquake of San Francisco hit, causing a lot of destruction and millions in losses to Union Pacific and to property. He cut his vacation short and came back to New York, and the stock still held strong at first.
But after a few days, when the full news of the destruction reached New York, the stock finally started falling. He held on to his position and finally exited almost at the low, making 300,000 dollars in a week; that's a lot of money now, and back then it was certainly a lot of money. He obviously could not have known that the earthquake was going to hit the city, but when he re-analyzed this trade later, he felt that what had formed his intuition was this feeling at the back of his mind: in a bull market, what happens to an overbought stock when there's unexpected bad news overnight and there are no buyers? Some of you might say: intuition of an earthquake, is that what we're talking about? And the answer is no. The point is not that he saw the earthquake coming; the point is that he understood the system implicitly, the causal process that would play out if a bad shock happened. There was a bull market, such a big bull market that if a bad shock happened, he knew things would switch all of a sudden. It's that causal understanding that drove the decision, not that he saw the earthquake coming; thinking otherwise is missing the point. An example from our own work at InMobi, from just the last month: Kunal from our team, who works on e-commerce and predicting whether people will transact. A proposal was put out for a new model intended to learn from certain events in people's historical behaviour in order to predict whether they would do a certain transaction, and the proposal took the view that we could junk a lot of that historical data, that it wasn't worth logging or worth the storage cost, because it would not affect subsequent transactions. I can't go into all the details, but just to paint the picture: Kunal had a very strong knee-jerk reaction to it. He genuinely believed some of the signals in that historical data, in those events, were going to affect subsequent transactions. He could have let it be, accepted the proposal, and just got on and built his model, but he was fundamentally uncomfortable with it. He took two weeks out, working with one of our interns, Rachna, and developed a dataset to test his hypothesis. It wasn't easy work, it genuinely took two weeks of effort, and it proved that his intuition about that causal process was valid: without telling you what the signals were, 40 percent of the transactions we saw for particular e-commerce players were driven by the signals Kunal insisted had to be in the dataset. He was driven by intuition in making that call. How do you apply this? You need to look at your data with great curiosity, to be honest, and you have to get your hands dirty looking at how the data is connected; don't sit two or three stages removed from your data and just focus on the algorithms. When you have an intuition, commit to it and test it, the way Kunal actively tested it. I have a test I do quite often, which I've been doing with some interns recently; I call it the beer test. When I have an intuition, I'll tell you my intuition and bet you a beer on it, and if you prove to me that it's not right, I'll buy you that beer. And there are lots of beers involved, because I get it wrong a lot of the time, I have to confess, but I allow myself to test my intuitions, which only strengthens them over time.
If you find a surprise, something you weren't expecting (you have to expect something in order to be surprised), that's a signal for you to go deeper and probe what's going on, and we don't do this enough as data scientists. We need to keep asking: what are the causes, what are the causes, what are the causes for the results I'm seeing? If you ask that of every result you see, you will start to build a much better intuition for every problem you work on. There are objections: the famous correlation versus causality, which we've all heard a hundred times; I'll let it be, and Kahneman can have his day on that. Or you could say, why do I need this, I'll just feature-engineer, that's what deep learning is for, give it the data. It's not always that simple. We saw in that example with Kunal that we had to decide whether we were even going to collect and connect the data, and we had to make that decision before we could do any feature engineering; that decision has to be guided by intuition. Is it worth collecting the data, storing the data, logging the data? Those kinds of decisions often have to be guided by intuition. There's also the risk of people forming global impressions very quickly, which Kahneman talks about, and he says the way to deal with this is sometimes to delay your intuition: don't jump to conclusions, delay your judgment until you make a claim, until you put a hypothesis out there. The last piece of this puzzle aligns with my previous work in development economics. This is a chart, a very complex chart I know, which I made about eight or nine years ago, trying to summarize the methodologies applied in development economics. I know you can't read it, but I believed there were three paradigms, the three corners of the triangle. One is randomized controlled trials, which in our world we might call A/B testing. On the bottom right is econometric modelling, which in our world is essentially data science: taking historical data that exists and finding statistical regularities in it. But the third piece of the puzzle for me, of every way of learning about the world, is what's in the bottom left: theory, a view of how the system holds together, not informed by data, not informed by testing. And that leads me to this point about intuition as the system view. What does this mean? A quote by Ronald Coase. We all probably know Coase well from a quote often cited in data science, about torturing the data until it confesses; that's also from Coase. But in this one he says that, faced with a choice between results and a theory, he would always prefer to have a theory, because the theory explains what's going on and how it works. And we asked ourselves, a hundred of us answered this question: do we think theories and hypotheses are as important as data? I'm very happy to see that the vast majority of us, nearly 90 percent, do think so. So what does that mean for us? One great example, and I love this example: "we have discovered the secret of life". Watson and Crick, 1953, walk into a pub in Cambridge, the Eagle pub (if you ever visit Cambridge, you have to visit that pub), and announce this loudly: we've discovered the secret of life. What did they discover? DNA. And you might say, hey, that was great empirical work, they discovered DNA through empirical methods.
Not entirely true. Ten years earlier, in 1943, Erwin Schrödinger gave a set of lectures in Dublin, and what Schrödinger did was explain the concept of DNA from theory. He said the laws of thermodynamics mean we must have some kind of consistent molecule that stays stable in spite of those laws, and he described that molecule as an aperiodic crystal. He really fleshed out what he meant by an aperiodic crystal, and it's this concept that stuck in the minds of Watson and Crick, so they knew what they were looking for, so much so that in 1953, after they discovered DNA, they sent their paper to Schrödinger and said "we were inspired by you". So even in this example they were led by a concept that came from theory. To sketch a definition, how would I think about this? It's having a mental model, having representations of the world, of the system, of the processes you're trying to solve; ideally a self-contained, exhaustive model of what you're trying to explain. I almost think of it as an internal simulator you've built in your mind for what's going on, and the more you experience a particular system, a set of processes, the stronger that internal simulator becomes. What's the science of this? Vastly, hugely underdeveloped, because if I ask how intuition drives imagination, which in turn drives learning, we know very little about that. I love this quote from Gary Marcus, a psychologist at NYU, who says that in all the ways we've taught computers to learn, we still can't replicate the behaviour of his three- or four-year-old child, who, given a chair, will find a hundred different ways to use that chair which we've not taught her, a hundred different uses just from understanding the concept of a chair. The crux of this is that even our strongest learning algorithms today, deep learning or reinforcement learning, still do not even approximate the way a child learns; they don't get anywhere close. So what would a science of this field seek to achieve? I really like this framing by Sridhar Mahadevan, published this year: he says we need to develop an imagination science, a way of generating data samples that are novel, different from the distribution we might have in our training data, and this is what will address the issue of understanding causal processes and enable us to do causal reasoning. Back to the view we suggested at the beginning, and hopefully it's starting to make sense. What we're saying is that if you want to develop that system view, you can't do it directly from the data or from your experiments alone; it needs a better understanding of the system, which is driven by your intuition. With that understanding of the system you're able to do counterfactuals, to imagine alternative possibilities, and being able to imagine alternative possibilities is actually a very rational, very systematic process that gives rise to imagination and to a way of encoding how imagination can happen. As we said, imagination is a big missing piece for us in artificial intelligence. One step I haven't had time to go through in detail is why alternative possibilities, counterfactuals, actually drive imagination, and how that's a very rational process; if you're interested, there's an excellent piece of work by Ruth Byrne, The Rational Imagination, written many years ago now.
She made this argument very strongly, and I'd refer you to the book if you need convincing. This view also aligns with what Judea Pearl, the Israeli computer scientist, has started to describe as a causal hierarchy. There is association, the probability of y given x, finding correlation; for us that's intuition as judgment. There is intervention, the probability of y given that we made x happen, that we definitely saw x happen; for us that's seeing the causes and consequences. And there is the probability of y given some x′ or z′ that we have never seen happen, for which we have no direct evidence; for us that is what imagination gives us, the ability to think through these counterfactuals, and it comes from what we called the system view. Judea Pearl has even gone as far as articulating what a learning architecture would look like. I don't have time to go into the details (I can refer you to the paper, which was written just a few months ago), but you'll see that in this learning architecture what's important is that he has a role for assumptions, and "assumptions" is just a simpler word for the concept of theory. In his architecture he explains that the theory has to be captured in graphical models; that's how we'll be able to codify it. But what I really want you to pay attention to is that he's saying that with an understanding of a theory alongside the data, we can develop the concept of fit indices, which is really asking: how well is our theory reflected in the data, how well are theory and data combining to tell the same story? He talks about how to develop such methods, and I really believe an imagination architecture will include a system like this: we still have to do the imagining, but you'll be able to test that the imagining makes sense within a learning system like this.
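For reference, the three rungs of Pearl's causal hierarchy that are being mapped onto the talk's three interpretations of intuition can be written compactly. The do-notation is the standard one from Pearl's work; the mapping to the three levels is the speakers' framing:

```latex
\begin{align*}
\text{1. Association (intuition as judgment):} &\quad P(y \mid x) \\
\text{2. Intervention (causes and consequences):} &\quad P\bigl(y \mid \mathrm{do}(x)\bigr) \\
\text{3. Counterfactual (system view, imagination):} &\quad P(y_{x'} \mid x, y)
\end{align*}
```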
And when we asked everyone: do you think intuition, or the role it plays in your lives as data scientists and data engineers, is going to get replaced, will the machine ever be able to do it? I think we were very divided; it's a very nice normal distribution, which tells me none of us has any clue whatsoever. There are people who strongly agree, but there's a chunk of us saying we partly agree, maybe, but there's still something that I believe only I as a human can do. So my advice to you is: keep your job as a data scientist for as long as possible by using intuition. And our three tips. Rajeev, our senior mentor at InMobi, pushes us on our Thursday calls with this phrase: examine the system artifacts long before you start building your model. What does he mean? Get into the data, look at the data, try to work out what's going on from the data long before you slap on an algorithm that gives you results immediately. Every time we jump into a problem and start getting into the algorithm (and all our team knows this), he stops us and asks: why haven't you examined the system artifacts, why haven't you explored the data and shown me what the data is telling you before you applied a model? Second, consciously determine what your own mental model is, how you think about the problem, and have proactive theories; I love it when someone comes forward and says "look, this is what I believe is going on" and puts across a very proactive theory. And lastly, be open to your intuitions. You have to be open to this possibility and not believe that you're all mechanical robots; you have to realize there's something else happening in our brains which can guide you and which you can learn from, so allow that to flow, allow it to come out, capture ideas, articulate your ideas, be open to this. Finally, no presentation is complete without a thought from Steve Jobs, of course; it's not finished until you've got a call from Steve Jobs, because that's when everyone believes it. I like this particular quote from him, where he said that from all his time in India his biggest takeaway was that in India we haven't developed that same system of excruciating rational thought that was developed in the West; he believed that in India there was almost a special power of people using their intuitions, particularly in the Indian countryside, and he said that's what's needed to drive progress, this understanding of intuition, which, as I said, he thought was strongest in India. Something that has really wound me up recently is all this talk about China's AI strategy. There's not a day or week that goes by without someone writing about China's big mega AI strategy; I picked up this report written in March at Oxford, Deciphering China's AI Dream. It's so complex, so well developed, that people in the West have to decipher China's AI dream. And it winds me up because I think: where's India's AI strategy, why don't we have lots of documents and a clear strategy? And I wonder whether maybe our India strategy, our counter to China, can be: okay, we're going to build AI, but we're just going to use our intuition to do it; that's how we're going to get there. So all I'll say is that I do think it's a superpower we have, and hopefully something we have strongly in India, so keep calm and keep intuiting. Okay, apparently I've finished earlier than I was expecting, so we have a few minutes for questions. I was kind of avoiding questions because I know they're going to be controversial. Hi everybody, thanks for the talk, my name is Nikesh. I was wondering: are there processes around collecting intuitions? Maybe I have an intuition, but other people have intuitions too; have you seen structured ways of thinking through that, and any examples? I think it's a fair question. Just to repeat for those who didn't hear, Nikesh's question was whether we have any structured processes for actually collecting intuitions. To be honest, all we've suggested is tips, heuristics you can apply, and that's because this area, as a science, is not well developed; we don't really understand what's going on in the brain, so it's hard to say there's a structured process. In all honesty, the best I can offer is applying these tips: being open to it, trying to push out proactive theories, always asking what the causes are and what's going on, and building a view. Great scientists who push boundaries, Nobel Prize-winning scientists, are exactly the people who do this best, and I take inspiration from them. Any other questions? Yes, over here. Hi, my question relates to the scenario you talked about, the credit scoring example from your friend Prithvi.
When you say they're not able to achieve the same level of performance as the humans in that system, what do you think is the main missing element? Is it human bias, as Daniel Kahneman also mentioned? And if that's what's missing, how can we incorporate that limited amount of required bias into our models or data science solutions? And does this bias also relate to the fifth element you're talking about? Interesting question. Just to recap for those who didn't hear: the question was, what's going on when a human credit assessor is making decisions, is there a way of incorporating that into the algorithms, or is it just bias that should really be ignored? The first thing is that it's not bias, because the results actually play out: when you look at these decisions, they tend to be high quality, and you can directly compare them to the decisions the algorithms make, and often enough the people are making the higher-quality decisions. What I think is going on, and this is my view, is that people are picking up on a lot of nonverbal cues in someone's behaviour, similarly to how a lot of us have an instinct about whether someone is telling the truth, even if we can't say why we know it. There are all these nonverbal cues that we give away when interacting with another person, and that is implicitly what a credit assessor is picking up when a story is being told or data is being provided by the person being assessed. Can an algorithm do that? Absolutely; we just have to be able to collect that data to the same extent a human can collect it through sensory information. So for "intuition as having judgment", there's no doubt algorithms will get there, but our intuition will help us decide what the features are and how we have to learn, for algorithms to be able to mimic that same level of performance. Cool. I think there was another question here. Interesting point you mentioned about the assessors picking up different cues from the person in front of them. There are a lot of instincts at play which have developed over millions of years of evolution, survival instincts, fear and so on, so maybe Indians will get there by intuition, or actually by their habit of judging others, I don't know. That's one part. The second part is that there's a divide in how people see intuition: some people support it, some are against it, and this goes towards personality types themselves, how a person reacts in their typical life. If I refer to the MBTI types, there's a division between intuitive types and sensing types: a sensing type would be more inclined to take the data as it is and treat that as the truth, whereas an intuitive person would rely more on intuition. So when we take intuition, instead of treating it as a box of inputs, the same for every person, we should also look at the variation in intuition itself; if we could model intuition itself, rather than a system per se, that could act as an input to several other problems we're trying to solve. So I think you had about four interesting points there; let me try to recall them and take them in order.
So one is that the MBTI distinction is not the same thing for me, because the intuition we're talking about is actually built from sensing. MBTI sets up a dichotomy between people who think intuitively, which is what they mean by thinking at a high level or seeing the concepts at play, versus people who think with granular data, what they call sensing. I don't think that's a dichotomy: I think someone can absorb and learn from granular data and still do the intuitive part and understand the concepts, which is the type of intuition we're talking about here, learning from sensory data. The second point you made, remind me... the first point, actually, you said a lot of things... it was born of a comment about the survival and fear kinds of behaviours that drive judgment. Yes, you're absolutely right, and I pointed this out: there are all sorts of cognitive biases that people have, and they will affect different interpretations of intuition as we've laid it out here, no doubt about that. Which is why I think you also have to be sceptical when you're learning from someone. If you think someone has not seen enough data, or there's too much risk of one of Kahneman's twenty cognitive biases at work, and you're aware of them, sensing them, then use your better judgment and don't learn from those cases. So I do think we have to be aware of these biases, but I can't think of anything beyond saying you have to understand them, look out for them, and filter for them if it's happening. Just remind me of your third point, or hopefully I've answered enough.

The third point was that intuition itself is not just one entity; it's not the same for ten experts in the same domain, there is variance within it as well. So how about trying to model what intuition itself is, and then, whether it's a problem about credit assessments or some other problem, taking that variance as an input to the model, instead of looking at it as a model in isolation?

I totally agree, and that was the third interpretation we had of intuition, the system view. The experts on a given system are going to be different; different systems will have different experts who understand them better. If we can codify their understanding of the system and get an algorithm to tease that out, or even put it in directly as assumptions, as in Pearl's case, I think that's the answer. We start to learn from those domain experts, and every system we build will build from different intuition, there's no doubt about that for me. Anything else, we'll take it offline, but you had a lot of interesting ideas.

Hi, thanks for the talk. Most of us here are in the corporate world, and intuition presumably needs a lot of thought; even though it's a gut feeling, it needs a lot of thought. In the corporate world, where we run after one deadline after another, how do you handle an employee saying, I need time to think about it? I mean, we could think never-endingly, right?

Yeah, a fair point. I think there are lots of challenges with trying to do this in the corporate world. Just as an aside, one challenge I didn't bring out today is what's called the HiPPO problem: the highest-paid person's opinion.
So often when it comes down to these decisions, where the data only shows you part of the system and someone has to make a call on what's really happening, it will be the highest-paid person's view in the room that prevails. So there are risks like that. And on your point about how you cultivate an environment where people are able to do this: I think that's what data science teams have to take ownership of. The reason it's a data science team, with a bent towards research and thinking, as opposed to a data analyst team or a pure-play engineering team, is to create that space for people to come up with novel ideas about what's happening that they can then test via models. We, as in our team at InMobi, constantly have that balance to strike, and we're constantly trying to remind ourselves: okay, we try to do a lot of the here-and-now work, and that's important for us, but we also want to keep time aside for more speculative stuff, for new ideas, where we won't talk to the product managers or the engineers or the business, we'll just talk among ourselves and create speculative ideas that might lead to the next use case. We try to get that balance, but it's tough. Any other questions? Was there a question here? No? Oh, one. Okay, one last question over here; someone will have to do the run.

Hi. My question is, intuition varies from person to person, so my intuition might not be valid in the situation I'm dealing with. What's the trade-off you see in terms of the effort and time I should put in, so that my intuition also gets tested and I don't miss the new things the data might tell me?

Yeah, absolutely. I think a lot of this is what we tried to get across in our tips. One is, if you're coming new to a problem, accept that you don't have intuition and you're not going to be the person who sees through the data and finds relationships immediately. That's why, when you're coming new to a problem, you have to over-invest, as Arpin was doing in that example, in building the intuition yourself: getting more experience of what's happening, looking at the data from different angles, looking at the people making decisions who have better judgment in a particular case. You have to over-invest up front, and there will come a time, and I don't know if it's one month, maybe two or three months, in some cases it's only after someone has looked at the data for three or four years that they truly have this intuition, frankly it's problem-dependent, but there will come a time when you start to see things you haven't seen directly in the data, and you'll have a strong gut instinct about certain relationships. You'll see a significant coefficient in a regression analysis and you'll understand why that coefficient is significant; it won't just be a single data point you're reading, there will be a deeper understanding you'll be able to articulate. That point comes, but when you don't have it and you're entering a new problem, you have to over-invest in developing it. Cool, I think we pause there.

Right, thanks everyone. A few announcements. You can visit the Sapient booth on your way to the food court; they are leveraging technology to enhance the shopping experience, and they have a Twitter contest going on where you can win some Amazon coupons. You can also win prizes at the Salesforce booth: fill in a survey and collect your gifts. You can also visit the MATLAB booth, where you can learn about integrated deep learning algorithm development and embedded and enterprise deployment.
There are exclusive boAt headsets to win at the Walmart booth, and you can also win exciting prizes at the Uber and Expedia booths. How many of you bicycle to work? Please raise your hands. Quite a few, okay, me too. In order to use bike sharing for urban commuting, where accessibility and bike availability are everything, Ashish Kabra can find the nearest bicycle to your liking using analysis and the principles of structural modelling. Ashish Kabra, on using structural estimation methods from economics to model user behaviour in bike sharing systems.

Thank you for the great introduction, and thanks to the earlier speakers as well for those fascinating talks. Actually, first off, I want to thank the organizers for perfectly scheduling my talk, because I plan to build on some of the ideas we just heard, particularly around using theory to guide your models, and also quite a bit on Chris's talk this morning about opening up the black box and actually seeing what's driving your inputs and how they translate to your output. So, I'm Ashish Kabra, I'm an assistant professor at the University of Maryland, in a department called Decisions, Operations and Information Technology, which is a nice overlap of people from operations research, industrial engineering, computer science, economics, and statistics. We're housed in the business school, and the main idea is to work closely with business applications rather than just building theory on its own, which can sometimes happen in academic departments.

What I want to talk about today is using structural estimation methods, which have their roots in economics. More than a concrete method, they are, so to say, a guiding philosophy or paradigm, and the idea is to follow the data generating process more closely, to think deeper about what's really going on. It's not too different from the idea now popularized by Elon Musk, which is to think from first principles, and that is what I'm going to do today. These models don't have one concrete formulation; they take various forms depending on the exact problem you're studying. I do have a very concrete application in mind, which is bike sharing systems. In general, you would want to apply these methods when you don't just care about making predictions from certain inputs to certain outputs, but you also want to use the discovered relationships to make some sort of decision on top of that.

So with that, let me start. When I'm thinking about solving a particular problem, especially in these kinds of contexts, I want my model to have two main features. The first is interpretability: I want to understand how my inputs are driving my outputs, and I'm not really comfortable using a black-box model. The advantages of that are, one, it's very easy to communicate to managers and stakeholders what's really going on, so that they are more comfortable using the output of the model to make decisions, setting prices, making acquisition decisions, and things like that. In addition, catching errors or inconsistencies in your data becomes much easier if your model is transparent; otherwise things would just flow through without you ever noticing that something else is there.
Many biases could exist in your data, especially racial or gender-based biases, and if your model is not transparent enough you might just let them pass through, whereas if you have a good understanding of what's driving input to output, you'll be much more cognizant of what kinds of biases exist in your data. And finally, once you have a transparent view of your model, you're more aware of its limitations, and therefore you probably won't apply it to situations where you shouldn't, which can be a good thing. So that's the first thing I'm looking for: my model should be interpretable, it should make sense to me.

The second thing, which again builds on the earlier talk, is that I would like my model to capture causal relationships. We're all very familiar with the idea that correlation is not causation; there are many funny examples running around the internet, like the one showing a huge correlation between people drowning after falling off a fishing boat and the marriage rate in a certain state, and it would be foolish to say one drives the other and somehow try to influence one to make an impact on the other. That's the main idea: you want to establish a causal relationship if you want to use your model for any kind of decision making. This idea is well understood, except that what really happens in practice is that we get so hung up on throwing thousands of models at the data, chasing some measure of accuracy and seeing what works, that this idea takes a back seat and we never really make sure whether our model is a causal model or just an association. Things can go really wrong if we are willing to ignore this, so that will be the focus of the model building exercise I'll do next.

Let me get to the application I have in mind, which is bike share systems. Just to give an overview of what these systems look like: by bikes I mean bicycles here, which is a little different from how we use the term bike in India, but the idea is that a bike share system is used by the public as a shared mode of transportation. You have these bikes, which are sturdy enough to be used by many people, and you have stations with empty docks where the bikes sit. The way people use them is: you sign up for a daily, monthly, or yearly subscription, go to a particular station, swipe out a bike, and ride it to any destination of your choice; it doesn't have to be the same place you started from, and that's it, you finish your journey. People typically use them for short trips from one place to another, and one popular use is last-mile connectivity: you're planning a long journey, you do the first mile on a bike share system, go to a metro station, take the metro, and then take another short bike trip to your work destination. That's typically how these systems are used. Now, what I'll talk about today is in the context of bike share systems, but a lot of it applies much more broadly to systems similar to these.
In particular, I'll be focusing on two features of these systems. The first is the location aspect: the location of a user and the location of the supply, say the nearest bike, is quite an important factor in determining whether a match happens. That's the location, or hyper-local, aspect. The second is the on-demand nature of these systems: you don't typically reserve them well in advance, you just have a need for a bike trip right now, you open an app, go to the nearest station, and check out a bike then and there. This temporal, on-demand nature is quite important. And if you think about other smart mobility systems, be it the dockless bike share systems that are coming up now, Zoomcar launching its PEDL system, Ola has something similar, all of them have both the location and on-demand features. If you think of Uber or Ola, again it's very similar: the driver has a particular location, I have a particular location, and for the match to happen they have to be close enough at the same point in time. More broadly, delivery platforms, be it Swiggy, BigBasket and so on, have very similar features. So a lot of what I'll talk about today applies much more broadly to systems like this.

The very concrete system I'll be looking at is the Vélib' system I've worked with, which is a huge bike share system in Paris. It was launched in 2007, it's one of the most popular bike share systems in the world, and it was one of the first modern systems of its kind; a lot of other systems have followed it in how they are designed. It has more than 1,200 stations, about 20,000 bikes, and hundreds of thousands of subscribers, and it has done over 173 million trips in the last six years. It has had a huge impact in terms of carbon, with car journeys being replaced by bike share trips, and in terms of health benefits, where people bike instead of sitting in a cab. That is one of the fascinating angles for me in working with these kinds of systems: there is the potential to make a climate or health impact on the community, in addition to just doing the data science.

When I think about these systems, a lot of the concrete problems depend on understanding two main features of how users behave. The first is how users think about the distances they have to walk to access a nearby station: if I want to see how a user behaves when the walking distance is 100 metres versus 500 metres, how much less likely are they to make that journey and use the system? That plays an important role. The second aspect, which underlies a lot of the problems we can study here, is how people think about the availability of bikes. Say I have a preferred station somewhere, I usually go and check out a bike there and make my journey, but at this point in time the station does not have any bikes. So what do I do next?
Do I substitute to a nearby station, in which case I still make the journey using the system, or do I abandon the system and do something else, say order an Ola or an Uber, or just not make the trip? How do users behave with respect to that availability feature? The reason these two things are important is that, for a lot of questions around the system, say how I should think about where to locate stations, it matters how people think about walking distances. If people are very sensitive to walking even a few hundred metres, I need a lot of stations in the city, whereas if people are okay walking 500 metres or more, I don't need as many. That guides how dense I want my system to be. Or, if I want to understand how I should manage the inventory of the system, how many bikes I need, how I should think about rebalancing, bringing bikes from full areas to empty areas, then what guides those algorithms or strategies is again how people think about availability. Are people very sensitive to not finding a bike at a station, in which case I have to be very careful to make sure supply is there at every point in time, or are people okay not finding a bike once in a while and happy to substitute to nearby stations, in which case I don't need to be so stringent about my inventory management policies?

So that's the agenda of my talk today: I'll try to uncover how people think about walking distances and how people think about bike availability, and then use those user-level primitives to guide a lot of system design questions. It will turn out that these are somewhat non-trivial problems; I'll first illustrate how traditional approaches struggle to capture a lot of the realities when modelling these features of user behaviour, and how structural estimation methods can guide us to a better answer.

Okay, so the data set I'll be using primarily comes from the Vélib' system in Paris. The primary data set looks very simple; I'll complement it with a few more data sets from other sources, but basically I observe each and every station at a regular interval, every two minutes in this case. What I have is a station ID, say station 666, and every two minutes I observe how many bikes that station has. That's it; that's the basic structure of the very simple data set I'm starting with. From that, I can uncover the demand at each station in every two-minute interval: one bike was taken out, three bikes were taken out, and so on; that gives the number of trips originating at each station. I can also construct whether a station has any bikes or not, a binary measure of whether the station is available for users, which will be an important factor in understanding the availability aspect. That's the basic structure of the data set.
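A rough illustration of that construction, a minimal pandas sketch with hypothetical column names rather than the actual Vélib' feed:

```python
import pandas as pd

# Minimal sketch (column names are assumptions, not the actual Vélib' schema):
# one row per station per two-minute snapshot.
panel = pd.DataFrame({
    "station_id": [666, 666, 666, 666],
    "timestamp": pd.to_datetime(
        ["2017-06-01 08:00", "2017-06-01 08:02",
         "2017-06-01 08:04", "2017-06-01 08:06"]),
    "bikes": [5, 4, 4, 1],
})

panel = panel.sort_values(["station_id", "timestamp"])

# Net change in bikes between consecutive snapshots of the same station.
panel["delta"] = panel.groupby("station_id")["bikes"].diff()

# Crude proxy for demand: count a checkout whenever the bike count drops.
# (This understates demand when a checkout and a return happen in the same
# two-minute window, which is one reason the raw counts need care.)
panel["trips_out"] = (-panel["delta"]).clip(lower=0).fillna(0)

# Binary availability indicator: does the station have any bikes at all?
panel["available"] = (panel["bikes"] > 0).astype(int)

print(panel[["station_id", "timestamp", "bikes", "trips_out", "available"]])
```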
Now let's come back to the question: we want to understand how people think about walking distances. Another way to put it: say this is my existing design, I have four stations in my city, a blue station, a yellow station, a pink station, and a green station, and I'm considering moving the blue station by 100 metres. Based on the data I have collected so far on how people are using these stations and what the demand at each station is, I want to say something about what the new demand will be once I move the blue station by 100 metres. That's another way to put this accessibility question.

Let's say I start in a very traditional way. The dependent variable I'm thinking of is the demand at station f at time t; that's what I care about, and I want to build a model of what drives that demand. So I start building in different features that could impact it. Say I start with some neighbourhood characteristics: is there a metro nearby, is it a commercial location, are there cafes nearby, and so on; all of that should have some impact on whether a station is popular. I can include some weather characteristics: whether it's raining, whether it's a hot day, what the humidity is; that could also have an impact on the demand at the station. But remember, I also want some notion of where exactly the station is located, the distance aspect, and how that drives users to use the system or not. One way to think about accessibility here is to include a measure of the distance to the nearest station: if the nearest station is very close, then users coming to this station typically don't have to walk very much, so they might use it more often. That could be one proxy for the distance effect in my model. I can go a little crazy and include the distance to the second nearest station, the third nearest station, and so on; all of those are proxy measures for including the distance effect in my model.

But then I take a step back and ask: have I really captured how demand is driven by the location of these stations? It turns out, not exactly. With these distance measures, say I'm focusing on this blue station and my measures say the nearest station is 100 metres away and the second nearest is 200 metres away, the design could still look either like this or like this; in either case my features are exactly the same, whereas we know just by looking at it that in one scenario the blue station would get more users, simply because there is no other station in that particular area. That spatial positioning of stations is not captured by these distance measures. I can then go ahead and try to construct features that capture not just distances but some two-dimensional measure of how stations are positioned around this station, but essentially what I'm doing is coming up with a lot of proxy measures to capture the single notion of how people think about walking distances, and it's getting a bit less clean. If I think of other things, say I also want to model that there is a metro station 300 metres away, that's the distance part, but it also has a lot of people coming through per hour, say more than a thousand people using that station regularly, then to include that kind of feature in this model I have to come up with a lot of interactions between the distance aspect and the number-of-people aspect. What's really going on is that I'm including a lot of proxy features and interactions, my accessibility aspect is getting scattered all over the place, and I'm not really getting hold of the thing I'm actually after.
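For contrast, the traditional setup being described is roughly the following, a hedged sketch on synthetic data with illustrative feature names; the point is that two very different spatial layouts around a station can map to identical rows of such a feature matrix:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500  # station-time observations (synthetic, for illustration only)

# Proxy features of the kind described: neighbourhood, weather, and a growing
# pile of distance-based proxies that only partially encode accessibility.
X = np.column_stack([
    rng.integers(0, 2, n),          # metro_nearby (0/1)
    rng.integers(0, 2, n),          # commercial_area (0/1)
    rng.integers(0, 2, n),          # raining (0/1)
    rng.uniform(50, 800, n),        # dist_to_nearest_station (metres)
    rng.uniform(100, 1200, n),      # dist_to_second_nearest_station (metres)
])
y = rng.poisson(3, n)               # demand at station f in interval t

model = LinearRegression().fit(X, y)
print(dict(zip(
    ["metro_nearby", "commercial_area", "raining",
     "dist_nearest_1", "dist_nearest_2"],
    model.coef_.round(3))))
```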
So let me start from a slightly different angle. Instead of the approach we are most familiar with, taking an output and putting in some features, let me start thinking about what really drives a unit of demand at a station at a given time: what has to happen for a user to use that particular station at that moment? Think of one particular use case: a user gets out of a metro station and is now thinking about how to get to work. He has a few choices: either order an Ola and not think about the bike share system at all, or, knowing there are a few stations a certain distance away, go and use a bike from one of those stations. That is what generates demand at a station. If we think about the demand generation process like this, we are still capturing the accessibility effect. Say I move this blue station away by 100 metres and think about the process again: the user who got out of the metro station and was considering the blue station now finds it further away, so he'll probably be a little less willing to use it, which drives the demand at the blue station a little lower; whereas other users somewhere over here may now want to use the blue station precisely because it has become very close to them. So by starting from this user-level paradigm, I have captured the accessibility effect in a much more concise and neat way. I'll dig deeper into whether I've really captured all those problems around the spatial positioning of stations and whether they fit into this way of thinking about demand generation, but that's the main idea: I am following a user, instead of just taking a demand d_ft and plugging in features to explain it.

Before I formalize this, one thing I do want to mention is that I don't actually have user location data. I don't know where users show up and start thinking, okay, there's a nearby station and I want to go there. That will be a bit of a challenge, but I'll get around it. One more thing to note is that a lot of applications these days are based on apps which can track user location, so in some cases you might have user location data; I'll talk a little later about what changes the model would need to incorporate that data, and whether it really solves the challenges I'm talking about or whether some remain. So let's start with the model building process.
It's a fairly simple design; I'm just formalizing that intuition of following the user. In the first step, I allow my users to originate at different points in the city: for every location l in the city there is a quantity that captures the rate at which users originate there, and I can make that a function of a lot of city characteristics. I know that a lot of people will originate near metro locations with a lot of traffic, near tourist locations, grocery stores, cafes, museums, and things like that. I can also include census data on how many people live there, how many people work there, and what the demographics are; all of that guides the number of people originating at different locations in the city. These are not people thinking about the bike share system yet; they are simply people originating, who could be potential users of my system. There is also a second aspect, neighbourhood bike availability: if people are more confident they'll be able to find a bike, more people will, in effect, originate at those locations, so that can play a small role in the origination model.

So at this point these are the people who are originating, and once they've originated at a location, they start thinking about whether to use the bike share system or do something else. To formalize that process I use a choice model, which is not too complicated; let me give a brief primer on what that means. There is a user, and the way he thinks about the system is: he has a few options, and for each option he writes down the value of using it. That is what I call a utility, and it can be a function of the distance he has to walk and some other station-level characteristics. Once he has a utility for every option, including the utility of an outside option, say using an Ola, he simply chooses the option that gives him the highest value. If that turns out to be this blue station, then you get a demand at this station; if it's something else, you don't. For the user this is a deterministic process, choosing the option with the highest utility, but there are components of it that are known to the user and not to us, which are the error terms, and we typically assume a standard distribution for them. So from this choice model we can assign the probability that a user originating at a particular location l_i uses a station f at location l_f at a given time, and all the distance effects are embedded in there.

So I've done two things: one, I've created a model of where users originate across the city, and two, a model of how, once they originate, they choose between my system at different locations and the outside option. My demand model is now very easy: it's just the composite of these two pieces. If I integrate over all the locations in the city, where people are originating and how they make a choice, I get a function that describes the demand at each particular station at a given time.
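To make the two pieces concrete: if the unobserved error terms are assumed i.i.d. extreme value, which the talk does not specify but is the most common choice, the choice probabilities take the familiar logit form, and the demand model becomes a sum of those probabilities over simulated origin points. A minimal sketch with invented coordinates and coefficients:

```python
import numpy as np

# Minimal sketch of the two-stage demand model, assuming (not stated in the
# talk) i.i.d. extreme-value errors, which gives multinomial-logit choice
# probabilities. All numbers and parameter values are made up for illustration.

stations_xy = np.array([[0.0, 0.0], [0.3, 0.4], [0.8, 0.1]])  # km coordinates
beta_dist = -4.0                                 # disutility per km walked
station_quality = np.array([0.5, 0.2, 0.4])      # other station-level terms
u_outside = 0.0                                  # utility of the outside option

def choice_probs(origin_xy, available):
    """P(user at origin chooses each station | availability), logit form."""
    dists = np.linalg.norm(stations_xy - origin_xy, axis=1)
    u = station_quality + beta_dist * dists
    u[~available] = -np.inf                  # empty stations are not an option
    expu = np.exp(np.append(u, u_outside))   # last entry = outside option
    return expu / expu.sum()

# Expected demand per station: integrate choices over where users originate.
origins = np.random.default_rng(1).uniform(0, 1, size=(10_000, 2))  # proxy for the origination rates
avail = np.array([True, True, True])
demand = np.zeros(len(stations_xy))
for o in origins:
    demand += choice_probs(o, avail)[:-1]    # drop the outside-option share
print("expected demand share per station:", (demand / len(origins)).round(3))
```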
Let's dig a bit deeper into what this model really means: has it really captured the distance effect we are after, or the bike availability aspect we are after? Well, if a station is simply further away from where all my users are, will it have a smaller amount of demand? My model says yes, because there is the utility part of the model which depends on the distance to the station, and stations that are further away are less attractive to users, so users won't show up. Is the effect of station density captured in my model? If there are a lot of stations in the city, users typically have to walk smaller distances to access them, so the utility model makes stations more attractive simply because people walk less, and a higher density of stations makes users use the system more; again that's embedded directly in there, without me having to do anything explicit. The third important thing is the spatial positioning of stations: you have a station here, a station nearby here, and another nearby there; does that have an impact on the demand of this particular station? Again I didn't have to do anything explicit, it's naturally in there, because once there is a station close to a user over there, he finds that station much more attractive than coming to this particular one. So the utility model drives all of these dynamics very naturally, without us having to sit down, think of each corner case, and design proxy feature variables to capture it. That's the beauty of the structural models I'm talking about: once you follow the data generation process, a lot of things are included very naturally. The same goes for availability: if a station is out of bikes, it simply doesn't show up in the user's choice set, so users naturally substitute to nearby stations or, say, an Ola cab, and the availability aspects are captured too.

Let me come back to the question of what happens if I do have location data: how would that play into this model? It's great if you have location data, but you have to be careful about what that location actually means, because the location we are after is where the user is when he is considering whether to walk a particular distance to use a station. Say you have a user whose journey looks like this: he's at home right now, there is a first metro journey from metro station one to metro station two, and after that he's considering going to this blue bike share station. He can open his phone and check the app at two different points: after finishing the metro trip, or, if he's more proactive, he can check the bike share station status even before taking the metro journey. If you're not sure whether the location you're capturing is the one or the other, it could actually be bad data to supply to your model, so this is something you have to be careful about. Also, not all users check the app often, so you probably don't have location data on all users, and that can be a little problematic: do you ignore those users, and what do you do about that missing data?
That is something you'll have to think about. The most important aspect is that you might have location data on your current users, but what about the users who are currently not using the system, but who would start using it if you added more stations or improved the inventory policies? You know nothing about them because they are not current users, but to do any counterfactual analysis with this data, that is, if I were to add more stations, who would come to me, you want some understanding of the locations of these non-users, and that is very hard to get. So even with location data you haven't completely solved the problem, and a few more problems come along with it. This is something I'm currently working on, including how exactly to use some of the location data, but it's not a straightforward process. So what I'll do is go back to assuming there is no location data, just because that's more concrete to work with.

What I've shown you is a very bare-bones model, but you can easily enhance it any way you'd like. If you want to include, say, the type of commuter: some people are commuters who know exactly where they're going and may behave differently from, say, tourists, who might have very different preferences for what they want to do with the system; you can tweak your origination model or the utility model to have different parameters based on user type. You can have different user-level parameters based on intention, be it a food run, an exercise trip, or going to the nearest grocery store. You can include pricing data, or heterogeneous parameters based on the kind of neighbourhood people come from or the time of day. All of that can be included very naturally, because you now have a good understanding of what each parameter really means: okay, this is about this user, they're originating here, and so whatever you want to tweak has a very clear place in the model. That's the main idea.

All right, so before I get to estimating this model and asking policy or design questions, I want to think about my estimates: am I going to get the correct estimates I'm after, or might they be misleading in some way? This is what is called endogeneity in economics, and it depends on what your data is like; it's generally a good exercise to understand what kind of variation you are using in your data. The first thing to note is that this is an existing system, and you are thinking of how to change it. When the system was built, the stations were not located randomly in the city; there was some logic to placing them, and more often than not there are more stations in the more popular areas, say the downtown area where there are many amenities around. The way this becomes a challenge for the data is that these stations have high demand simply because they are in more popular areas, and the users who access these stations typically walk smaller distances.
So if I just naively use this data and fit my model, what I might conclude is that it is these smaller walking distances that are causing these stations to have high demand, but that is actually not the case: the high demand is because these areas are more popular to start with, and that would bias my estimates if I'm not cognizant of what my data really is. Similarly for bike availability: if you have stations that are really popular, a lot of bikes get checked out very fast, which lowers the fraction of time bikes are available at that station. So what you'd see in the raw data is that stations with very low availability have high demand, and it would be wrong to conclude that it is the low availability driving the high demand. You need to correct for this reverse causality from the dependent variable to the independent variable.

So we need to correct for these biases, and there are more or less two ways to go about it. The first, ideal way is A/B testing: instead of relying on raw data, you do a controlled manipulation of the system features. In this particular case that is a very expensive exercise: if I'm thinking about station location, then to do any randomized experiment I'd have to put some new stations in genuinely random places, allow time for users to figure out that there's a station and incorporate it into their commute, and change all my inventory rebalancing policies based on the change I've made, and at least as a first step that can be a really expensive thing to do. What I have instead are modelling solutions for the kind of variation you want to use in the data. The first is simply to control for all the features that might bias your estimates: in the station location example I gave before, there are some popular areas, so you want to include what makes them popular, the metro stations and so on, so that your model attributes the high demand to the presence of those amenities rather than to the smaller distances. On top of that, you can use the idea of instrumental variables. The main idea is this: find some variable z, outside your model, that affects your independent variable x, but such that once x is in the model there is no place for z, because z does not directly affect y. The way this is useful is that if z takes two values, z-low and z-high, it can push x to take values x-low and x-high, and you can use that variation as a sort of pseudo-experiment to figure out how x impacts y. That's the basic idea, and you can come up with instrumental variables like the one in this example: bikes coming in to a station serve as a shock to the availability at that station, and that can be used as an instrument to figure out the effect of changing availability on demand. This is the kind of optimization I use to figure out my estimates; in the interest of time I won't go too much into it, but basically I'm using the generalized method of moments, which is a generalization of the kind of objective function you have for linear regression or maximum likelihood estimation, where the idea is that I impose the condition that the instrumental variables are orthogonal to my error terms.
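The talk uses the generalized method of moments; its simplest special case, two-stage least squares with a single instrument, already shows why that orthogonality condition helps. A hedged sketch on synthetic data (variable names and the data generating process are invented):

```python
import numpy as np

# Minimal illustration of the instrumental-variable idea via two-stage least
# squares, the simplest special case of GMM. The talk's actual estimator is richer.
rng = np.random.default_rng(42)
n = 5_000

popularity = rng.normal(size=n)                  # unobserved confounder
z = rng.normal(size=n)                           # instrument: bike inflow shock
availability = 0.8 * z - 0.5 * popularity + rng.normal(size=n)  # endogenous x
demand = 1.5 * availability + 2.0 * popularity + rng.normal(size=n)

# Naive OLS is biased because popularity drives both availability and demand.
ols = np.polyfit(availability, demand, 1)[0]

# Stage 1: project the endogenous regressor on the instrument.
stage1 = np.polyfit(z, availability, 1)
avail_hat = np.polyval(stage1, z)
# Stage 2: regress demand on the fitted (exogenous) part of availability.
iv = np.polyfit(avail_hat, demand, 1)[0]

print(f"true effect 1.5 | OLS {ols:.2f} | IV/2SLS {iv:.2f}")
```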
Let me skip ahead a bit. One more thing you have to be cognizant of when using a model like this is that, even though it's great because it's very granular and follows the user and what's really driving the demand, there can be a lot of computation involved in actually making it run. Here is a toy example with 11 stations, where the blue and red dots denote, at every two-minute level, whether a station has bikes or not. Call that a system state: a binary representation of zeros and ones for that particular time. Given that these are fast-moving systems, the system state changes very rapidly over time, so no two two-minute intervals look exactly the same and you have to treat them all differently. On top of that, you have a lot of users originating at different locations in your model, so the combinations can run into trillions of computations or more, and if you're then optimizing over that system to figure out your optimal parameters, that's a whole lot of computation.

What I do is exploit the fact that the location aspect is what matters in these systems. If I'm thinking of station f6 here, I know that if station f1 is, say, two kilometres away, then whether f1 has a bike available or not should not impact the demand at f6. The way I exploit that is that, as far as station f6 is concerned, I can focus only on the state of the stations that are nearby, and once I do that, a lot of data points look exactly the same in terms of which nearby stations have bikes available. I build on this idea to collapse the data at each station's level, and that gives me a huge advantage, roughly a thousand-fold, in the computations needed to estimate the model. I'm not going into too many details; they're in the paper associated with this talk, but that's the main idea, using the location part.
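A toy sketch of that collapsing idea; the station coordinates, the 300-metre neighbourhood radius, and the state encoding are all invented for illustration, and the paper's implementation is more involved:

```python
import numpy as np

# Toy sketch of collapsing the full system state into per-station local states.
rng = np.random.default_rng(7)
n_stations, n_snapshots = 11, 10_000
coords = rng.uniform(0, 2.0, size=(n_stations, 2))           # km
states = rng.integers(0, 2, size=(n_snapshots, n_stations))  # 1 = bikes available

# For each focal station, keep only the availability of stations within 300 m.
dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
neighbours = [np.where((dists[f] < 0.3) & (np.arange(n_stations) != f))[0]
              for f in range(n_stations)]

full_states = len({tuple(row) for row in states})
local_states = sum(
    len({tuple(row) for row in states[:, nb]}) for nb in neighbours)

print(f"distinct full system states: {full_states}")
print(f"total distinct local states across stations: {local_states}")
```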
Right, so now I am ready to estimate my model. What I get is a bunch of estimates for my parameters. I also do validation, keeping some test data and some training data, and what I see is that not only is this model more intuitive and does it capture the process well, it actually performs much better than the other models I was initially trying to construct, be it regression models, choice models, or variations of those; I'm looking at R-squared or mean-squared-error type measures here. That gives me a bit more confidence that this is a good approach for this setting. The main power now is that, since I've understood the process by which demand is generated, I can take my existing design and come up with a lot of counterfactual designs: move stations around in any crazy way I want, or change my inventory management policies so that stations have different bike availabilities, and then just run my demand model, letting my users arrive at different places and make choices for this new, perturbed system. Since I know everything about how users make those choices, I've figured out all those parameters, I can predict what the demand will be for the newly configured system, and I'm more confident because I focused on the causality part: I know it is not just association, but that if I really change these locations, this is how the system should behave, and I can exploit that quite a bit.

The first thing I'm showing you is a very first-order understanding of what's going on in the system. If I focus on a single user and how they think about walking distances, what I find is that of course they hate walking, no one likes walking, but the function behaves in a somewhat convex manner: walking a small distance is kind of okay, so up to about 300 metres the demand does not drop much, but beyond 300 metres or so users really hate walking, at least in this system, and the demand decays quite rapidly. That gives some guidance that you typically want to be within roughly 300 metres of users if you want to capture their demand. That's a first insight I wouldn't have got so neatly from some of the other models I was considering. I can also plot how much distance my users are walking, and note that I did not have this data to start with, I had no idea where my users were, but with this model I have effectively backtraced the kinds of distances my users walk, and again I see that not a lot of them walk more than 300 metres. I can also trace out where most of my demand is coming from: a lot of it comes from residential areas, from where the supermarkets, cafes and metro stations are, and that can again guide where I should be locating more stations. So that's just the first-order insight; I've done nothing fancy yet, just exploited this particular model.
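Just to make that shape concrete, a hypothetical parametric form with the "flat up to roughly 300 metres, then steep" behaviour could look like this; the functional form and numbers are invented for illustration, not the estimated curve:

```python
import numpy as np

# Hypothetical convex walking-distance effect: demand stays fairly flat up to
# a knee (~300 m here), then decays quickly. Purely illustrative parameters.
def relative_demand(distance_m, knee=300.0, decay=0.006):
    penalty = np.maximum(distance_m - knee, 0.0)
    return np.exp(-0.0005 * distance_m - decay * penalty)

for d in (100, 300, 500, 800):
    print(f"{d:>4} m -> relative demand {relative_demand(d):.2f}")
```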
Let me illustrate what else you can do. I have a use case here where I start with the hypothetical scenario that I have a certain budget, and in this system that budget means I can purchase some bikes, which are a huge capital cost in these systems, and some docks for those bikes to sit in; bikes and docks typically come in a fairly fixed ratio, say 1.5 docks per bike. Say I have money for a thousand bikes. I can distribute those thousand bikes across the city in, say, two different ways. One is to have 20 stations in my city, each with 50 bikes: large stations, but very few of them; that's the design on the left. Or I can have a lot of stations, say 100, each smaller on average, with only 10 bikes; that's the design on the right. The advantage of the system on the left is that each station pools a lot of demand, since it's bigger, and we know pooling has economies of scale that cancel out a lot of variability; so with the design on the left each station runs out of bikes much less often, which is good for bike availability. The system on the right, we know, has lower availability, just because each station is smaller, but since there are more stations, it's very good in terms of smaller walking distances for users. Essentially, the system on the right is very good on accessibility, and the system on the left is very good on availability. Which is the better design? In general you'd like your system to have both higher availability and higher accessibility, but in this case it's really a trade-off: you can have more of one or more of the other. To figure out which design is good, it depends on how people think about these two aspects, and by now I have actually figured out how people think about both. So I generate a lot of these scenarios and use my demand model, where I know how people show up in the system and how they make their choices, and that's exactly what I do: I generate a lot of system designs and simulate my demand model (a toy sketch of this exercise appears below). What I see is that this is where the status quo is, and if I keep adding a few more stations my demand actually goes up: the system becomes more accessible, a little less available in terms of bikes, but overall it is better for generating higher demand, and after a while it stops paying off as much. So this simple exercise helps me reconfigure my system design by telling me I should probably be placing more stations in my city, at least a few more than the status quo. These are the kinds of counterfactual designs you can run right away, just because you've understood what is driving demand at the stations. That's one illustration.

Okay, let me conclude here, and I'll probably have a few minutes for questions. To wrap up, what I've shown you is a somewhat alternative paradigm to the typical machine learning model, where I focus much more on having a model that is interpretable and causal. I've shown that it is flexible in terms of including the variations you'd like in your model, and that it is good when you don't just want to make predictions but also want to make prescriptions or decisions on top of the relationships you've uncovered; in this particular case it turns out there are big improvement opportunities which this model can guide you towards. And there is one more thing I want to showcase: since you have now understood user primitives, how users think about walking distances, you can take that estimate and use it elsewhere. For example, if the company running this bike share system wants to launch, say, a motorbike sharing system, there is no data whatsoever, but people are not going to think about walking differently just because it's a motorbike sharing station, right? So you can take this estimate and plug it into that model. There will still be a few more things to figure out, such as how much people are willing to pay, but at least these basic primitives can flow from one model to another, whereas that would be a hard stretch with a random forest model, because there is no notion there of how people think about walking distances. That's the kind of beauty these models offer.
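The counterfactual exercise described above could be sketched, very roughly, like this, with an invented city, invented demand, and a 300-metre walking tolerance standing in for the estimated primitives:

```python
import numpy as np

# Toy counterfactual of the accessibility/availability trade-off: a fixed bike
# budget split across few large stations vs many small ones. All numbers
# (city size, demand, walking tolerance) are invented for illustration.
rng = np.random.default_rng(3)

def simulate(n_stations, bikes_per_station, n_users=1_500,
             city_km=5.0, max_walk_km=0.3):
    stations = rng.uniform(0, city_km, size=(n_stations, 2))
    stock = np.full(n_stations, float(bikes_per_station))
    served = stocked_out = 0
    for _ in range(n_users):
        user = rng.uniform(0, city_km, size=2)
        dists = np.linalg.norm(stations - user, axis=1)
        reachable = np.where(dists <= max_walk_km)[0]
        if reachable.size == 0:
            continue                      # no station close enough: lost to cabs
        reachable = reachable[np.argsort(dists[reachable])]
        for s in reachable:
            if stock[s] >= 1:
                stock[s] -= 1             # trip served from nearest stocked station
                served += 1
                break
        else:
            stocked_out += 1              # all nearby stations were empty
    return served, stocked_out

few_large = simulate(n_stations=20, bikes_per_station=50)
many_small = simulate(n_stations=100, bikes_per_station=10)
print(f"20 x 50 bikes -> served, stockouts: {few_large}")
print(f"100 x 10 bikes -> served, stockouts: {many_small}")
```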
So let me stop here, and if there is time I'd be happy to take some questions. Seems like there is time, okay, perfect. Yes, please.

Hi, yeah, so you're focusing your model on docked bikes, bikes that have docking stations, right? How much would the model change if you had a dockless system, and what would be the advantages and disadvantages of that?

So you mean advantages and disadvantages of the dockless system in general? Okay, so first, how would the model apply to those systems: there would be some differences and some similarities. You would still think similarly about users starting from a particular location and then having some distance to walk, so those features stay the same. The availability part becomes a little different, because there is no longer the idea of stations that have bikes or don't have bikes; there are just bikes scattered around and you have to walk some distance to one.

So it would be more continuous rather than discrete?

Exactly, yeah. Instead of having these hubs, the supply is more dispersed, so that's the first thing. The second thing, how those dockless systems differ from these dock-based systems, goes a little beyond this model. Just for everyone's information: the system I'm looking at has these stations that bikes have to go back to, and what's happening right now is that a lot of companies, like Ofo, Mobike, and Ola, are coming up with designs where a lot of the technology is in the bike itself, so you don't need docks to park the bikes; you can pick up a bike from anywhere by unlocking it through your phone and drop it off literally anywhere, on a footpath, wherever, and you don't need a dock for your trip to end. I'm looking into those designs quite a bit, working with a few companies, but basically the idea is that the ending part of the trip gets more convenient, because you can leave your bike anywhere, but the fetching part becomes a little more difficult, because instead of having the near-certainty that if you go to the station you will find a bike there, you now almost always have to open an app and look around for where a bike might be near you. That creates some differences in how convenient the system is for a user, but that remains a bit of an open question; we have to look at more data more closely. Thanks.

Thank you. Yes, we have a mic here, I'll repeat the question, that's okay. We have a question here as well, so can we have time for just one question? Please take the other one offline. Okay, I'll do that, yes.

Yeah, just a couple of observations. You don't have user location data, but obviously you can superimpose the population density of the city to use in your model. And second, shouldn't your model be weighted towards availability, sorry, accessibility, because you can always increase availability by adding more bikes? So don't you think accessibility is more important, or is that how your model is configured?

Okay, so two questions. The first is, you're saying I can always model where people originate just by looking at the population data. That is one aspect: if people live somewhere, there is more reason for people to start their journeys from there.
And that is part of my model, except that it only accounts for less than 50% of the trips, because people often start their trips not from where they live: they take a metro trip and start a journey from there, or they go to a nearby cafe in the evening and then start a journey from there. All of those aspects play a substantial role in driving demand for this system, so it has to be more than just the population data, and that's what's being built into this model. The second question was about the availability part, that you should always emphasize accessibility because availability is something you can always tweak. Well, it costs money. These bikes, especially in dock-based systems, don't come cheap; each bike in the Vélib' system was about a thousand dollars or so. So increasing the number of bikes, even though it seems like we should always be able to do it and never allow any station to run out, doesn't actually happen, because it's a very expensive investment. Does that make sense? All right, well, thank you so much, and we'll look forward to the next talk.

So we have BoF sessions post lunch: the Women in Data Science BoF at 2 p.m. and the Data Science in Production BoF at 4:30; these BoFs are on the first floor. The next speaker is Nitin Hardeniya, who has 30 minutes to go over recommendations and food discovery at Swiggy, the right talk before lunch to make you all hungry. Nitin.

I'm Nitin, I'm currently a data scientist with Swiggy, and I've been there for the last one and a half years. Today I'm going to talk about some of the experiments we have done with food recommendations and food discovery, and in terms of those experiments I'll talk about some of the interesting insights as well as the challenges and lessons learned in the process. I hope it will be an interesting talk, though just before lunch is kind of bad timing, I would say. A quick show of hands: how many of you have actually ordered from Swiggy? Wow, good marketing, I guess. So people are aware of Swiggy and do use it regularly, but just for completeness, this is what Swiggy is. It's mostly these three major players that drive the entire ecosystem: you have millions of customers, thousands of restaurant partners, and thousands of delivery partners, and any problem you think about from the viewpoint of just one player is going to impact the other two. So if you're thinking about discovery and relevance problems and you're not thinking about the restaurant as well as the delivery executive, the entire ecosystem, you're not going to solve the problem. At the same time, there is a lot of work on the delivery side, like predicting estimated delivery times, batching orders, or making the best possible assignment for a given order. All of these problems are pretty tricky, and what I'm going to focus on today is the first part of the story, the discovery and recommendation side of things. So let's take a step back and formally define what a recommendation system is.
The idea of a recommendation system is to come up with a utility function, even in a constrained environment of serviceability and availability: given a list of items, can you predict the likelihood that a given customer will like, or buy, each of the serviceable options? When I say items, it could be anything depending on context: for Netflix it is movies, for Spotify it is songs. In the context of Swiggy, there is first a restaurant-first view. We have the home feed with a lot of recommendations running through it; we have inline collections that are cut by specific filtering criteria; and we have discovery widgets, where you can click on, say, pure vegetarian restaurants, and those are also personalized to your taste. We are also venturing into a more item-first world. In Bangalore, at least, we have released a new property called dish discovery, where the idea is that if you can predict from historic behaviour or implicit feedback the main anchor dishes a customer likes, and you know they generally prefer pizza over, say, biryani, then showing a well-curated collection of pizzas is a much better experience than making them drive through multiple menus to decide which pizza is best. So we are moving from restaurant-first toward item-first, and I think the end state is generating the entire app rather than personalizing a specific list or collection. We also have recommendations on the menu itself: best sellers and typically ordered items in a recommended section, plus a traditional cross-sell based on an association-rule-mining style algorithm. It is still smart enough to know that if you have already added a main course, it should not suggest another main course; it will suggest drinks or desserts instead.
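A minimal sketch of that cross-sell idea, assuming a simple co-occurrence count over past orders plus the "don't suggest a second main course" constraint. The item names, categories, and orders here are purely illustrative, not Swiggy's actual data or code.

```python
# Count which items are ordered together, then suggest frequent companions that
# come from a different category than what is already in the cart.
from collections import Counter
from itertools import combinations

orders = [
    {"chicken biryani", "coke", "gulab jamun"},
    {"chicken biryani", "raita", "coke"},
    {"margherita pizza", "coke", "brownie"},
    {"margherita pizza", "garlic bread", "brownie"},
]
category = {
    "chicken biryani": "main course", "margherita pizza": "main course",
    "raita": "side", "garlic bread": "side",
    "coke": "beverage", "brownie": "dessert", "gulab jamun": "dessert",
}

# Co-occurrence counts over all unordered item pairs seen in the same order.
pair_counts = Counter()
for order in orders:
    for a, b in combinations(sorted(order), 2):
        pair_counts[(a, b)] += 1

def cross_sell(cart, k=3):
    """Suggest up to k frequent companions of the cart items, skipping any
    category that is already in the cart (e.g. a second main course)."""
    cart_categories = {category[i] for i in cart}
    scores = Counter()
    for (a, b), count in pair_counts.items():
        for item, other in ((a, b), (b, a)):
            if item in cart and other not in cart and category[other] not in cart_categories:
                scores[other] += count
    return [item for item, _ in scores.most_common(k)]

print(cross_sell({"chicken biryani"}))  # e.g. ['coke', 'gulab jamun', 'raita']
```

A production system would use association-rule measures such as support and confidence over millions of orders, but the shape of the logic is the same.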
If you take the abstract idea of a recommendation system, it is matchmaking between the customer and what you are trying to sell. In the context of Swiggy the item is either a restaurant, an exact food item, or in some cases an entire meal. The secret ingredient is understanding your customer and understanding your catalogue of food items; if you can solve those two, you can do the matchmaking much better. On the customer side you want the taste profile: vegetarian or non-vegetarian, which time slots they usually order in, whether they order for a family or for a working lunch, whether they prefer healthy options or care mostly about speed. Similarly, on the restaurant side you want to know what defines that restaurant: its taste profile, the cuisines or dishes it is known for, its price point or cost-for-two. If you understand these two sides, they become additional features for the conventional models.

Now, the big question: what is so unique about recommendations here? Swiggy is not the first company to work on recommendation systems; they have been solved across the industry, and there are examples where the experience is excellent. The factors that differentiate Swiggy are mostly these. A hyperlocal business adds a lot of complexity: we are always constrained by serviceability and availability, there is restaurant stress and stress on the whole ecosystem, and people care about things like speed that do not appear in traditional recommendation systems. Think about it this way: I know you like pizza and you have a specific joint you love, but for some reason it is not available right now, which happens a lot on Swiggy. What is the most relevant substitute? That is where serviceability constraints matter; we have to think about the larger serviceability and availability picture, not just a single customer. Restaurant stress is another angle. Suppose a restaurant like Truffles is getting a lot of orders and is already under stress, and people keep ordering from it while we push even more demand toward it. Eventually that restaurant is going to fail to deliver on the promises we have made, which is bad for the entire ecosystem. We have real-time models that give us restaurant factors like stress and prep times, and if the recommendation system can take those inputs and do some demand shaping, that helps the end consumer. And people genuinely care about speed in online food delivery, so faster options should get some degree of prevalence.

The other factors are not specific to Swiggy; they are just the hard choices you face when designing any recommendation system. Accuracy versus diversity: if you have ordered biryani in 99 percent of your past orders, is showing you every biryani place in the world a good recommendation? It may be an accurate one, but there are proven results that adding some diversity helps, so I would still mix in some Chinese or Italian places in the listing or the home feed. Discovery versus repeat: content platforms like news sites, Netflix, or Amazon push fresh content, because once you have finished a Netflix original you are not going to watch it again and again. In food delivery, repeat behaviour is much more important; you genuinely prefer restaurants you have already ordered from. So is showing your top ten repeat restaurants a good recommendation? Probably yes, but there still has to be a balance between explore and exploit, and that balance is one of the real differentiators across the industry. Freshness versus stability: if the recommendations you see change rapidly from day to day, you lose any sense of why they are there, so explainability and stability matter; at the same time, freshness matters too, because if you love pizza and an exciting new pizza place never appears on your home feed, that hurts the recommendation. Serendipity is, I would say, the real definition of discovery. People talk about this being the end of the era of search, where you type an explicit query for pizza and get pizzas; real discovery is when a rare item finds you and surprises you, the Netflix original you did not expect, the long-tail song that gets you. In our context, if a rare Indian dessert you were not even sure existed on Swiggy finds you, that is something really magical.

For any recommendation system you need a lot of data, and these are the varieties we have. We capture a lot of interaction data on the app and the website; we have orders, the typical transactional data; and we have a very rich image dataset that is specific to Indian food.
The images are labelled for specific categories, and on top of that we have metadata enrichment, such as dish families and cuisines, which we have generated ourselves. We also capture a lot of unstructured data: the special instructions people attach to orders, and the ratings and free-text feedback we collect. All of this is used to generate the different recommendation systems.

There are two major philosophies for recommendations. One is: if I can figure out which customers are like you, I can recommend what they liked; that is the philosophy of collaborative filtering. The other is: people generally like similar items, so if you can define similarity well, you can recommend well; that is the content-based side. For collaborative filtering you eventually want to build a customer-by-item matrix, where the item could be a restaurant, a cuisine, a category, or a dish. Some of the classic methods were defined for explicit data such as ratings, coming out of the Netflix challenge, where the idea is to predict the blank cells and then sort recommendations by predicted rating. People like Steffen Rendle and Koren then defined similar matrix factorization methods for implicit data, and if you can use implicit signals such as orders, menu visits, and searches, it is much more fulfilling: there are papers arguing that collecting explicit data is costly and carries its own biases, while implicit data has in many cases actually outperformed it. We tried traditional matrix factorization methods such as SVD and ALS, and we also ran experiments with learning to rank, adding customer features and content features into the model, which improved its performance. The problems with collaborative filtering are the standard ones: the cold-start problem, and in some cases a bias toward popular items.
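A minimal NumPy sketch of the implicit-feedback matrix factorization idea mentioned above, in the confidence-weighted ALS spirit of Hu, Koren, and Volinsky. The tiny interaction matrix and hyperparameters are illustrative assumptions, not the speaker's actual setup; at scale you would use sparse matrices and a library built for this.

```python
import numpy as np

rng = np.random.default_rng(0)
R = np.array([            # rows = customers, cols = restaurants; values = order counts
    [3, 0, 1, 0],
    [0, 2, 0, 1],
    [4, 0, 0, 0],
    [0, 1, 2, 0],
], dtype=float)

n_users, n_items = R.shape
k, alpha, reg, iters = 2, 40.0, 0.1, 15
P = (R > 0).astype(float)          # binary preference: did they ever order?
C = 1.0 + alpha * R                # confidence: repeat orders mean a stronger signal
X = rng.normal(scale=0.1, size=(n_users, k))   # user factors
Y = rng.normal(scale=0.1, size=(n_items, k))   # item factors

def solve(factors_fixed, Cmat, Pmat):
    """Regularized weighted least-squares update for one side of the model."""
    out = np.zeros((Cmat.shape[0], k))
    for i in range(Cmat.shape[0]):
        Ci = np.diag(Cmat[i])
        A = factors_fixed.T @ Ci @ factors_fixed + reg * np.eye(k)
        b = factors_fixed.T @ Ci @ Pmat[i]
        out[i] = np.linalg.solve(A, b)
    return out

for _ in range(iters):
    X = solve(Y, C, P)          # update user factors with item factors fixed
    Y = solve(X, C.T, P.T)      # update item factors with user factors fixed

scores = X @ Y.T                # predicted affinity, including unseen restaurants
print(np.round(scores, 2))
```

The point of the confidence matrix is exactly the implicit-data argument above: a customer who ordered from a restaurant four times gives a much stronger positive signal than one order, while a zero is treated as a weak negative rather than a true rating.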
The traditional answer to those problems is content-based methods, where you define a notion of similarity between items. In our case, how do you define it? Content platforms such as news sites have a lot of metadata, article titles and genres; for movies you can build a nice profile and compute similarity from it. At Swiggy we could have gone two ways: rely on the metadata from restaurant partners and the items we have, or build something of our own. What we did was take the order data at the restaurant level, treat each restaurant's orders as a dummy document, and build a topic model and visualization on top of it. The live demo is not cooperating, but clicking on a specific topic captures a slice of the entire catalogue we have at Swiggy, and if you zoom into one of the topics it tells you whether it is a biryani topic or, say, a South Indian topic. The idea is to project each restaurant into this topic space so you get a decent vector for it, and from those vectors you can compute similarity. That is again a restaurant-first kind of recommendation: if you have already ordered from, say, Truffles, there is a good chance a similar cafe will show up in your listing.

You can build a restaurant-first view with that method, but if you want to recommend at the item level, you need a standard taxonomy for individual items: you have to reconcile one restaurant calling a pizza a main course with another calling it Italian. So we defined a standard taxonomy, a set of standard enum classes, and built supervised models on the metadata we have for items, which is their names, descriptions, and recipes, and in some cases images, where we did image classification. Once you have that in place, it empowers you to do what I described in the earlier slides: filter the data on cuts like dish and cuisine, build the same kind of customer-by-item matrix, and predict the most relevant cuisines or dishes for a customer.
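A minimal sketch of the kind of supervised item-taxonomy classifier described above, assuming TF-IDF features over item names and descriptions. The labels and training examples are illustrative only; the real system also uses recipes and image classification.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_text = [
    "hyderabadi chicken dum biryani with raita",
    "paneer butter masala with butter naan",
    "margherita pizza with extra cheese",
    "farmhouse pizza loaded with veggies",
    "gulab jamun (2 pcs)",
    "choco lava cake",
]
train_label = ["biryani", "north indian", "pizza", "pizza", "dessert", "dessert"]

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),    # menu text is short, so word bigrams help
    LogisticRegression(max_iter=1000),
)
clf.fit(train_text, train_label)

print(clf.predict(["veg dum biryani family pack", "chocolate lava cake"]))
# e.g. ['biryani' 'dessert']
```

Mapping every free-text menu item onto a small set of standard classes is what makes the later dish-level and cuisine-level matrices possible.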
I will skip the word-embedding part. This is the journey so far: a lot of work on combining the collaborative and content-based signals, using an ensemble of the two to generate predictions, so the current state is a hybrid plus learning to rank. We are also looking into DNN-based methods, where generating item, product, and restaurant embeddings is an open area, along with deep collaborative filtering and RNN-based methods that recommend at the session level, so we can capture the nuances of a specific session instead of relying only on history. The final slide shows the journey of Swiggy as a discovery platform: today we rely heavily on the home feed and some initial personalized collections; we are moving toward many more collections catered to anchor themes, and then toward generating the entire app, where every collection is personalized to your taste, your time slot, and your context, with some real-time adjustment on top to make it even better. That is the team, and I am happy to take questions.

Q: When you are recommending restaurants or dishes, what objective function are you maximizing? Is it just relevance for the user, revenue for Swiggy, or a mix?

A: We are definitely constrained by other factors such as serviceability. We have a way of identifying the stress on Swiggy's system, and if overall stress levels are acceptable, then we chase relevance as the main proxy. If that is not the case, say it is raining or we are not at our best on supply, then we degrade and optimize for general availability, that is, the efficiency of the whole system rather than pure relevance. In the normal case, with no stress and no specific constraint, the focus is relevance, but with learning-to-rank methods you can fold some of these other constraints into the objective function. We have tried demand shaping toward things like faster options, and making sure we do not shape demand onto an already stressed restaurant; you have to consider the restaurant's view and Swiggy as a platform, not only the customer.

Q: What is your implicit signal, and are you still using ALS for the factorization?

A: ALS was the starting point, but the approach I was describing comes from Steffen Rendle's line of work, a well-known publication, using a WARP-style loss; effectively it is a learning-to-rank method with a pairwise loss.

Q: Within learning to rank, which model forms are you using? There are many, LambdaMART, RankNet, and so on.

A: Currently the models are matrix-factorization based with a pairwise loss. If you look at the literature, pairwise methods have been around a long time and are proven; listwise methods are where people are heading, but I have not come across a standard listwise method that is widely used in production yet.

Q: On the data side: when you estimate these models, what data do you use? People who just browse but never order behave very differently from people who order; browsers click a ton of things and leave, so the relevance signal is different. Have you looked into this and accounted for it in the modelling?

A: The implicit data we use is weighted somewhat along those lines. There is a long list of signals, and one of the talks yesterday touched on this as well: when you have tons of implicit data, how you smartly turn it into a meaningful supervised label, a meaningful implicit feedback signal, is exactly what you should think about. Someone who visited a menu but never ordered from that restaurant should ideally be penalized; someone who spends a lot of time on a menu should get a boost; and someone who added to cart clearly liked the restaurant and the recommendation, and may only have dropped off over offers or something similar, so that deserves weight too. That is where the real value lies, rather than throwing everything in at one go.
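A minimal sketch of that kind of implicit-feedback weighting. The event names and weights are illustrative assumptions, not Swiggy's actual scheme; the point is only that events are graded rather than counted equally.

```python
EVENT_WEIGHT = {
    "order": 5.0,          # strongest positive signal
    "add_to_cart": 3.0,    # liked it, maybe dropped off on price or offers
    "long_menu_dwell": 2.0,
    "menu_visit": 0.5,     # browsed but did not commit: weak signal
}

def implicit_label(events):
    """Turn the events for one (customer, restaurant) pair into one graded label."""
    if not events:
        return 0.0
    if "order" not in events and "add_to_cart" not in events:
        # Visited the menu but never committed: down-weight rather than count raw visits.
        return 0.5 * max(EVENT_WEIGHT[e] for e in events)
    return max(EVENT_WEIGHT[e] for e in events)

print(implicit_label(["menu_visit", "long_menu_dwell"]))        # 1.0
print(implicit_label(["menu_visit", "add_to_cart", "order"]))   # 5.0
```

These graded labels are then what feeds the pairwise ranking loss, instead of treating every click as an equally strong positive.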
Q: My question is rather simple: how do you evaluate whether your recommendation system is working or not?

A: At the modelling step, standard metrics like NDCG. If you look at the history of these problems, they started as rating prediction, evaluated with something like root mean squared error; then people moved toward treating it as a classification problem, so on held-out data you can compute an AUC. When you run this in production, in an A/B system, the typical metrics are either NDCG or more conventional business measures: how many people actually order from the first fifteen results they see, and how many orders come from search versus the feed. If, say, 60 percent of orders come from the feed and 30 percent from search, then good recommendations should reduce the need to search; there is a behaviour-change component, but if you can bring the search share down, that is a good proxy that what you are recommending is right. Online you can also look at click-through rates along the funnel from listing to cart. Those are the business metrics, but you can always define it more mathematically in terms of NDCG or mean reciprocal rank.
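A minimal sketch of the NDCG metric just mentioned, for a single customer: the relevances are graded labels (for example 1 for the restaurant they ordered from, 0 otherwise) in the order we showed them, compared against the ideal ordering. The numbers are purely illustrative.

```python
import math

def dcg(relevances):
    # Discounted cumulative gain: later positions are discounted by log2(position + 2).
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))

def ndcg_at_k(relevances, k):
    ideal = sorted(relevances, reverse=True)
    denom = dcg(ideal[:k])
    return dcg(relevances[:k]) / denom if denom > 0 else 0.0

# The feed showed 5 restaurants; the customer ordered from the one in position 3.
print(round(ndcg_at_k([0, 0, 1, 0, 0], k=5), 3))   # 0.5
```

Averaging this over customers gives the offline number; the online A/B metrics (feed order share, funnel click-through) sit alongside it.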
MC: In the interest of time we will stop there. Thank you so much; we will break for lunch now and see you in an hour.

MC: Check, check. Welcome back. Quick announcements: please do not bring any food inside the auditorium. We have a couple of BoFs starting at two o'clock: the Women in Data Science BoF on the first floor, and at 4:30 the Data Science in Production BoF, also on the first floor. In the evening, at around 5:30, we have an open discussion on The Fifth Elephant's agenda and community needs, again on the first floor. Most public discussions are centred on the AI and ML saga, and two underrated components are workflow and data. Explaining data collection and quality is Paul Meinshausen, on designing for data.

Paul: Thank you. Good afternoon, everybody. I hope everyone had a good lunch, and certainly a good morning; there were a lot of thoughtful and provocative talks, so thank you all for continuing to be here. My name is Paul Meinshausen, and currently I am a data scientist in residence at Montane Ventures, an early-stage VC fund here in India. I have spent most of my professional life as a data scientist, but in the past year as an investor I have had the time and space to step back a little from the day-to-day work of data science and think about things at a slightly higher level. I have also gotten to see and talk to a lot of startups doing really fascinating and promising work in AI, ML, and analytics. The thing that characterizes the best data science projects I have seen is the way a team approaches the foundation of its data. Good quality data and a well-defined source of data are, I think, the critical factors to success, and that is what I want to talk to you about today.

The thesis I am going to present is that quality has less to do with the data itself. It is not a list of metrics or characteristics you can rank, and it is not an intrinsic feature of the data; it has more to do with how we look at it, how we work to understand the data we have and how it came into existence. Quality comes from the symbiosis between the appropriateness of your methods and the data you have. I am borrowing from the statistician George Box here, who pointed out that all models are wrong but some are useful; something very similar applies to data, and whether it is good data ends up being about how you approach it.

I will start with a story that motivates my message. During World War II, the U.S. government set up a small organization of statisticians and mathematicians in Manhattan called the Statistical Research Group. The SRG worked on a variety of problems related to the war effort, among them a problem the Air Force was facing. Taking off from England, bombers were flying deep into northern Europe to bomb strategic targets in Germany, and they were getting shot down; the Air Force was losing a lot of planes. Leadership decided they needed to up-armour the planes, but they also realized, as with most problems we face, that there was a trade-off: the more armour on a plane, the heavier it gets, which decreases the distance it can fly, shrinks the set of reachable targets, and puts it closer to running out of fuel. So they decided they had an optimization problem and turned to the SRG. They said: we have been tracking all the planes that come back and land from these routes, and we have built a dataset of where they have been shot, where they are taking damage. Take it, please, and model it, so we can understand where we are getting hit and focus our armour there. A statistician on the team named Abraham Wald took on the problem, looked at it, thought about it for a bit, and came back and told the Air Force: here, on the engines, where we are not getting hit, where we do not seem to be taking damage, that is where we need to apply armour. Because he realized that the data he was seeing was a sample of the world, and there was key data missing: the planes that took off, got hit, crashed, and never made it back to England.
Those planes are not part of our dataset, so we can make the inference that missingness corresponds to fatal damage, and instead of focusing our armour where we take the most fire, we focus it where the fire does the greatest damage. For me that story characterizes what I want to talk about today. I am going to walk through four stories from my own experience, and I hope to persuade you that in data science work you should care very deeply about what we will call the data generating process: care enough to do the work that does not always get the glory and does not always feel like it is on the cutting edge, the work of understanding where your data comes from, and also the work of communicating that, because both are necessary for real success. I will spend most of my time telling you stories about my encounters with interesting data generating processes. These are not nicely wrapped stories; they are mostly open-ended, with no easy, obviously right answer. This is not an SAT or JEE exam, because that is how life mostly is. I would like you to walk away not thinking "in scenario A I apply method X and in scenario B I apply method Y", but rather thinking, huh, I see how I am facing, or might be facing, something similar in my own work that deserves further thought and attention; or, even just as intellectual simulation, spotting something in my examples that I seem to have overlooked, or a line of approach that might take you the distance toward a better solution.

So what is the data generating process? One of my statistics professors in graduate school used to describe it as all the things that take data from the world into your dataset. I like to define it a little more specifically, as three general things. The first, which we often overlook, and which Chris's talk earlier this morning (which I thought was quite good) touched on, is the material we learned in statistics class about sampling strategies, how marbles are selected from a jar; it is actually really important and often overlooked. The second is the statistical model of the data generating process, which is often what we are trying to model when we set up a regression: we are explaining some effect or phenomenon, explaining a y with an X matrix. And finally there is the data collection process, which is often not what statisticians are talking about, and is more what we should be thinking about as software engineers, developers, and builders: the routes and technical procedures by which data reach a database. That is data engineering.

My first story picks up the thread from the Statistical Research Group, except we fast-forward to India in 2015. I think we often overlook issues with our data generating process not because they are incredibly subtle or hard to spot, not because we would have to be brilliant.
Instead, we overlook them because we are so excited about building solutions and finding a direct answer to our primary business problem. In 2015 I was vice president of data science at Housing.com in Mumbai, and we were excited about building an excellent real estate experience: instead of wasting your weekend crossing the city, stuck in traffic just to see a flat that turns out to be obviously not right for you, you should be able to find your next home from the comfort of your home. A core part of that effort was bringing as much real estate inventory online as possible, and in the summer of 2015 our ops team reached a million properties registered online, a huge success for us. Up to the time I joined, the data science lab had very little insight, really no insight, into that ops process, and very little attention for it, given everything else we were thinking about. For example, we were really excited about overhauling our recommender system, and genuinely intrigued by all the ways that conventional recommender systems for e-commerce or media, like Netflix, were not right for real estate. That is what had our mindshare and where our data scientists were keen to start. Except we remembered that we needed to start with thorough data exploration, and our exploratory work led us to look at our data in a particular way: for each city, a plot with two curves, in blue the kinds of properties our users were searching for, and in green the properties we had in our inventory for that city. Looking at these, we realized we had a problem. In cities like Mumbai and Delhi there was considerable area under the blue demand curve that did not fall within our supply curve, which means we had users looking for kinds of flats that were not in our inventory, not on our site. It did not really matter how sophisticated our recommender system got if we did not have the right inventory to recommend. These plots raised a question we needed to answer about our data: is it representative of the market? Because if it is, then our users are looking for flats that simply do not exist, and we need to educate them: you are looking for a particular kind of flat in Bandra West in Mumbai and it is just not there, and it is good for you to know that so you do not waste time. On the other hand, if our data is not representative, then there is something wrong with our data collection process, and that deserves attention as well.

So what was our data collection process? In each city and locality we hired data collector teams to go out, find properties, and register them on the site, and those collectors were incentivized on quantity: get as many properties onto the site as possible. Demographically our data collectors tended to be young men, often bachelors, and they were great at the job we had given them, getting inventory onto the site. But then we realized we had not been precise enough in our direction. In their rush to get inventory onto the site, they went to brokers they were familiar with, the path of least resistance, and those tended to be brokers handling flats like the ones on the left: large, lower-end, more affordable buildings where brokers rent out in volume and are happy to put twelve flats on the site at once.
Except those twelve flats are all basically the same, with the same layout, so from a user's perspective there is very little value in looking through twelve identical listings. And our collectors were going far less often to flats like the ones on the right, higher-end and more expensive, because the brokers there were not as comfortable for them: they do not deal in volume, and the two worlds just do not overlap. When we realized this, we saw that we could actually change our data generating process. Our collectors already had an app they used to register properties, so instead of incentivizing them purely on raw numbers, we could build a simple feed, an API that says "here is what we are looking for this month", push it into the app, and incentivize the collectors to do the extra work of breaking out of their comfort zone and getting us a more diverse set of inventory. So your data generating process can be influenced by social and demographic factors, things we would not normally think about as computer scientists, but which end up being incredibly important.

We also had data-generating-process problems on the demand side at Housing. We were ensuring we collected as much data as possible about each flat, and we tried to make the search and filtering process as clean and delightful a user experience as possible. To do that, we needed to track and learn how our users interacted with each flat, which basically meant using clickstream data across the user's product journey, all the things they do. What does that look like? Users get on the site, take a few actions, and look sequentially through different properties; we learn something about the kinds of properties they are looking at, and we make inferences from the sequence in which they do it. Users' preferences are conditional: they trade things off, closer to my job and I will pay more for less space, further away and I want to pay less and get more, so there is an event stream you can imagine. Most of you are probably familiar with clickstream data, but in case not, the standard stuff that pretty much every website tracks is the page, a timestamp, some kind of user identifier (an IP address, or a user ID if you are logged in), the URL, and so on. Services like Google Analytics that set this up out of the box will only take you so far. If you want to really understand behaviour and see more granular data about sequences, where people are, which filters they are selecting, and so on, you end up doing considerably more complex event data modelling. Our product analytics team, part of the broader data group, spent a lot of time on this, because they realized this is our data generating process, it is really important, and it is worth the time. They went to a lot of work mapping out the flows the site facilitated so we could track them and learn from them over time.
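A minimal sketch of what that more granular event modelling can look like compared with the out-of-the-box page, timestamp, and URL fields: each tracked action carries an explicit event name plus the properties you will later want to analyse. The field names here are illustrative, not Housing.com's actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProductEvent:
    user_id: str
    event: str                         # e.g. "filter_applied", "listing_viewed"
    page: str
    timestamp: datetime
    properties: dict = field(default_factory=dict)

event = ProductEvent(
    user_id="u_1029",
    event="filter_applied",
    page="/mumbai/rent/bandra-west",
    timestamp=datetime.now(timezone.utc),
    properties={"bhk": 2, "max_rent": 45000, "sort": "distance_to_work"},
)
print(event.event, event.properties)
```

The value of a schema like this is longitudinal: you can only compare filter usage across months if the events and their properties are defined and maintained deliberately.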
This would have been fine if we had had a very stable product, except we did not, because while Housing was known at the time for good data science (I like to think), it was also known for really great design, and what the design team cared about most was how the product felt. That meant they were willing to do all kinds of work to improve how it felt, how it flowed, how it was used. Our front-end team, which had to implement all of that, cared about performance: making it fast, making sure information flowed smoothly. What we realized was that the tracking we had set up kept breaking when workflows and the site changed constantly, because the front-end and design teams were not really thinking about the data they were generating. They were not always taking the effort to go back and remap things, or to work with the data team, so if an event was no longer even possible on the site, a drop in that event did not mean people were not doing it anymore; it meant they could not do it anymore. The lesson is that when your data generating process is changing very quickly, it is very hard to make inferences from it; you lose the longitudinal value of your data. This is a technological process by which data gets into your database, and if you are not paying attention to how it is set up, you are going to get data that is not appropriate for the questions you are asking of it.

The two stories I have told from Housing both illustrate how people, whether business partners, ops teams, or front-end developers, affect and influence the data generating process and the data we work with. In my next story I want to disrupt the tidy dichotomy between the data and the world on one hand and the data science team on the other. Many of you have heard the phrase that data is the new oil. At Housing we liked to say it is not really like oil, it is more like soil: oil is a non-renewable resource, once you use it, it is done, but data is renewable, you can use it for multiple purposes, and it can even increase in value over time. The issue is that the things we do to it affect it and generate new data that we are also affecting; over time new people join the team and data gets used in new ways, and we still need to understand the things we did in the past that shaped how our data came into being.

So here is the story. After I left Housing I was chief data officer at a fintech startup in Mumbai called PaySense. PaySense does mobile loans through an app, and we were working on a problem that is probably familiar to everyone in this room as a standard data science problem: how to decide whether to give a person a loan. A lot of the time we jump straight to building a model, maybe a linear regression, which performs pretty well, as Chris was saying this morning. But what I want to talk about is the features that come before the model, and where the data behind them comes from. Take one feature we decided was important: average monthly bank balance. It represents a deeper process. Most people applying for a loan have cyclical financial lives: every month they make some money and every month they spend some money, and what we are interested in knowing is the top, the bottom, and the shape of that monthly curve.
Really, we are interested in something like this: normalized across the day of the month, here is where the balance tends to be. And then we reduce that to a single number, the monthly average balance. You might object that this does not capture enough about the curve; you might also want the monthly range, or the variance, or the max, or the min, and yes, you can add lots of features, but first we need to understand the underlying process. The idealized picture suggests we have plenty of data and a nicely drawn curve, but a lot of the time the data actually looks messier. Say we have three months of observations of a person's balance, no longer normalized: in June and April we see a similar incline that looks representative, and in May we see a similar slope, but lifted and shifted rightward for some reason. We could speculate about why, but the point is that it is not a neat curve, and we have to figure out how to interpolate and handle it.

So we also need to ask where this data is coming from, rather than jumping straight to fitting the right curve or computing the right number. Where we got it from was SMSes, financial SMSes: when you make a debit or credit transaction you get an SMS, and it contains a reference to your bank balance. But not everyone receives SMSes at the same rate, some people delete them, and there are lots of different processes in between. So look at a histogram of how many SMSes we have for a given user: there is a curve, and we expect most users to have around 20 to 50 SMSes. Then describe it a different way, because a thousand messages that all arrived in one week do not convey much; you care about coverage over time. Take the earliest and latest observation dates and look at the span of time between them; you can draw a curve for that too, and you see these peaks. Those peaks, it turned out, had to do with the fact that early on we said "let's take 30 days of messages", then decided one month was not going to be enough and moved to 60 days. So a pattern that shows up in the data has nothing to do with people's finances or how they transact; it has everything to do with how we set up our SDK. And you can keep asking how you are modelling this: why do we expect people to have all their bank balance transactions recorded on their phone in the first place? Then take the count of dates covered. Just as a thousand messages in one week are not sufficient, if you have one observation, another two months later, and maybe only two observations in between, the span of dates looks large, but the count of unique dates is quite low; you have insufficient data.
If you plot these together, very simply, you can see that ideally users would follow along a line: with 40 days of messages you should have around 40, or at least high 30s, unique dates. Of course that is not how all people are; some have large spans of time without much coverage, and you can also look at the density of messages within those spans. All of this work needs to happen before we jump in with a raw algorithm or an equation. And when we compute that average balance, we are not just averaging the observations: if we have five balances across a month, you do not divide by five, because there are all the dates for which you have no observation. You have to interpolate, or impute the missing data. So what do you do: forward fill or backfill? Say I have an observation on the fifth of the month of a thousand rupees, and another on the tenth of two thousand rupees. For all the days in between, do I use a thousand, or two thousand, or split the difference? It makes a difference, because time spent at a given balance level affects the credit people will be able to get and your evaluation of whether they can repay. And that average bank balance is not used only in the decision to give them a loan or not; it is used by a different model deciding how much to lend, a different model deciding when to market to them, another model estimating how likely they are to be engaging in fraud, and it is used by other data scientists. So we generated that data, and it just becomes data in our database, sitting in the monthly-average-balance table or whatever table it is in, but we affected its data generating process: the decisions we made affected how that data ended up in our database.
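A minimal pandas sketch of that imputation question: a handful of balance observations parsed from SMSes, resampled to a daily series, then filled forward or backward before averaging. The dates and amounts are illustrative; the point is simply that the fill strategy changes the "monthly average balance" that downstream models consume.

```python
import pandas as pd

observations = pd.Series(
    [1000.0, 2000.0, 800.0],
    index=pd.to_datetime(["2018-06-05", "2018-06-10", "2018-06-24"]),
)

daily = observations.resample("D").last()   # one row per calendar day in the span
ffilled = daily.ffill()                     # carry the last seen balance forward
bfilled = daily.bfill()                     # or carry the next seen balance backward

print(round(ffilled.mean(), 1))   # 1690.0 with forward fill
print(round(bfilled.mean(), 1))   # 1110.0 with backward fill: a very different answer
```

Neither choice is "the" right one; which is more appropriate depends on what you believe about how balances move between observations, which is again a question about the data generating process.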
So let's continue with the story of deciding whether to give a person a loan, and take another feature. The stories so far have involved a single data generating process covering all our users; this last story shows how figuring out data quality sometimes means recognizing that you have multiple data generating processes and doing some work to tell them apart. The next feature is monthly salary. When a person applies for a loan, their salary is obviously a core input into how much you decide to lend them. In the application process we rely on them to report their salary, but obviously we cannot take that at face value. There is another process you can follow, logistically complex, of sending a courier to collect a paper salary slip or offer letter and then converting that into data, which is complex and expensive too. So we wanted to know: can we validate salary more efficiently and more cleverly using data we already have? We turned again to the SMSes, because some of them seem to tell us when a person receives a salary credit into their account, which gives us something fairly trustworthy, or at least additional, alongside what they have reported. Sometimes we are really confident that an SMS is a true salary credit; sometimes it does not literally say "salary" but we feel like it is. The one highlighted in red, for example, does not say salary, but we are pretty sure this is an employee of eBay who has received their salary this month. So we can use this and map it against what people are reporting. If we do that, log-transforming so the distribution looks a little cleaner, we see that there is indeed some mapping between the two, which is what we were hoping. But our intuition also tells us there are a lot more people telling us they make more than their SMSes say than the other way around, and we need to understand what else is going on and which numbers we can trust.

So think about how this data is generated. Go all the way back to the beginning: how is a salary itself generated? It is actually a process of negotiation, and you can get way off track thinking about qualifications and how a person comes to earn a salary, but it is important to at least think it through. That process produces the salary you agree to with your company, which is usually a humanly meaningful number; you never agree to a salary of 37,237, because we like round numbers and simple things. But what lands in your account is not exactly what you agreed to, because the finance department gets their hands on your paycheck and applies all kinds of deductions; that is the data generating process for what actually gets deposited. Then there is the user's report: how is that generated? A user is not looking at what the finance department deposits; usually they do not even look at their payslip. They are applying for the loan in a mall, or while buying something, so they report the humanly meaningful number they carry around: they know their CTC is, say, 40,000 a month, and they do not subtract this tax and that deduction. Some users will also exaggerate, bumping the number up a little because, honestly, they are getting a bonus next month, or an increment within three months, so it feels safe. And some are simply dishonest: they make 25 but say 60 because they think they will get a better loan. Meanwhile, on the other side, the SMS number is generated by a financial process, then a notification process, then whether the person deleted the message, all the things we discussed with balances, and on top of that the process by which we decide whether something is a salary SMS at all, which we are not always sure about.
So we have two sources of information, what a person tells us and what their SMSes tell us, and how do we look at the difference? We can take their reported income, subtract what we have identified from the salary SMSes, and draw a distribution of that difference: how big does it tend to be? It sits fairly close to zero, so a good number of people seem to be honest; we have a match. You might try running a regression to estimate what the difference, the bias, should be, so you can be a little smarter about the salary. But you can also ask: do we actually have multiple data generating processes here? Should we be estimating a user's bias as if it were a deterministic effect, just a coefficient in a regression? I do not think so, because each person has their own bias, and it is not uniform across the population, so maybe we should model it that way. We can take a mixed-distributions approach and say we really have two populations: an error distribution, narrow, where people are just off by a little, and a bias distribution, which covers the people who exaggerate. It turns out we can expect around 61 percent of our users to fall into the error distribution, and we can assign each person to a component based on something like a likelihood ratio test. There are good estimation methods for choosing the parameters of the component distributions themselves; you can model those as distributions too. The point is that we can now say: for some people we take their reported salary more or less at face value, and for others we adjust it based on their SMSes. We do not have to apply one uniform probability model to all our data, if we think about what is actually generating the information we are looking at.
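A minimal sketch of that mixed-distributions idea, modelling the gap between reported and SMS-observed salary (on a log scale) as a two-component mixture: a narrow "honest error" component around zero and a wider "exaggeration" component. All numbers are simulated for illustration, not PaySense data, and the speaker's actual estimation method may differ.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
honest = rng.normal(loc=0.0, scale=0.05, size=600)        # report roughly equals observed
exaggerators = rng.normal(loc=0.4, scale=0.2, size=400)   # report exceeds observed
gap = np.concatenate([honest, exaggerators]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(gap)
labels = gmm.predict(gap)

# The tighter, near-zero component is the "error" distribution.
error_comp = int(np.argmin(gmm.covariances_.ravel()))
share = (labels == error_comp).mean()
print(f"estimated share in the error distribution: {share:.2f}")   # ~0.6 on this data
print("component means:", gmm.means_.ravel().round(2))
```

Each applicant then gets handled according to the component they most likely belong to: take the reported salary for the error group, adjust toward the SMS-derived figure for the bias group.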
There is a tweet from many years ago that I think represents a lot of data science work well: it is really a question of counting, of figuring out what our true population is, what the relevant population is for the problem we are looking at, and whether we can reproduce it. I will come back to that in a moment.

I also want to talk briefly about the second part of this, because it is not just about knowing the data generating process or working through it; it is about why it is so important to talk to people about it, especially your business partners and your product managers. For me it comes back to my background: most of my research work was in cognitive science, applying computation to how people make decisions. There is a concept called theory of mind, which is about how we understand what is in another person's head, and a related one called the curse of knowledge, which is the difficulty all humans have in separating what we know from what someone else knows, in not projecting our own knowledge onto the other person. It is hard to remember that we know things other people do not know; that sounds strange, but it is usually demonstrated experimentally, and you can see that humans are not born with the capability: it develops over time.

There is an experiment that is often done with young children, who are shown a series of events. There are two children, Sally and Anne. Sally has a basket and Anne has a box. Sally has a marble and puts it in her basket, then goes out for a walk and leaves the room. While she is out, Anne takes the marble out of the basket and puts it into the box. Sally comes back and wants to play with her marble: where will she look? We all know she will look in the basket, because that is where she put it. But children below a particular age will say she will look in the box, because they know it is in the box and they simply assume Sally knows it too. We mostly grow out of this as children, but a lot of it stays with us; it remains very hard.

Another way this shows up is in our language. I ran a quick experiment with the audience: if I say "highly likely", what chance do I mean? Almost nobody in the room put it at 50 percent; hands went up gradually through 65, 75, and 85, and the room settled at around 85 or 90. For "probably", the room centred around 55. For "probably not", the answers skewed low, around 25, which oddly made it sound stronger than "probably". And for "highly unlikely", the room settled at around 5. There is published work on this, with a nice visualization, and the point is that if I say "highly likely" or "probably not", trying to sound more statistical, more data-based, there is still quite a range in what people believe I am saying. This is exactly what happens with our data. Our business people get excited: data science and AI, we have collected all this data, what is the answer? I know his race and his sex and his income, I know everything. It is not really true, and we have to do the work of helping them understand the limitations, and the appropriate questions to ask of our data. Turning the world into data is a process that is genuinely wonderful, letting us be more concise, more precise, and more accurate, but it also loses all the things Abhi was talking about this morning: everything we pack into our knowledge and forget to explain. Sometimes that is the best we can do, or we can use the data for a completely different purpose it is better suited to.

The final question is: how do we know we are using data well, and going through this process of communicating the data generating process well? Coming from computational social science, I have watched a lot of work over the last five to ten years go into
And this is what happens with our data. Our business people get really excited: data science and AI, and now we've got data, we've collected all this data, so what's the answer? And "I know his race and his sex and his income, I know everything." It's not really true, and we have to do the work to help them understand the limitations and the appropriate questions they should be asking of our data. Turning the world into data is a process that's both really wonderful, letting us be more concise, more precise, and more accurate, and also one that loses all the things Abhi was talking about this morning: all the things that we pack into our knowledge and then forget to explain. That's the best we can do, or we can use the data for a completely different purpose that it's more appropriate for. And the final thing is: how do we know we're using it in a good way, and how do we know we're going through this process of communicating our data generating process in a good way?

Coming from computational social science, in the last five to ten years there's been a lot of work done on understanding the importance of reproducible research: the fact that we shouldn't just believe a paper because it was published. We need to understand what was happening behind it, what methods were used, what analysis was done, whether it was appropriate, and so on. But reproducible is actually just the first step. Here's a sort of two-by-two that I like to think about. For our data, in one column we have the same data and in the other different data; and then for our code, or infrastructure, or algorithm, we can have the same or different. When we have the same data and the same code, what that means is: as a data scientist I do a problem, I use some data, I use some code, and I come to a result. Now, I might have messed it up somehow or overlooked something, so I need someone else to be able to go in, do it again, and come to the same result before we can really believe it. But that's not as robust as it could be. If we use different data, that's when it becomes replicable. If we use different code, that's when it becomes robust: it means, yes, you used this linear regression, but if I use logistic regression I shouldn't get completely different outcomes; there should be some resonance, some similarity, some consistency. And finally, when we can use different data and a different approach, then we think, okay, now our model is generalizable; we're really learning something about the world that we can believe. This is how I like to think about measuring ourselves in terms of the appropriateness of what we're looking at. And that's it, thank you. I don't know if we have time for questions. Sorry, you can sit down. Questions? We have time for a couple of questions? Good, because I think I ran over, but I will be around. Also, there's a BoF on data engineering tomorrow where we're going to talk more about actual methods, techniques, and technologies you can use to do these things, all the work of data engineering. I'm pretty excited about that, so if you're interested in this topic and want to hear more details, please do come by. And anyway, I'll be around if you have questions or thoughts or just want to chat. Cool, thank you.

For the people who are standing at the back: the balcony is open upstairs, so you can occupy the seats there. We've found a car key; it's at the help desk. The original owner is requested to contact the help desk and identify yourself and the car before you collect the keys. A Solr BoF is starting in Audi 2 at 2:45. Now Anand Venkatanarayanan is here to tell us the entire tale of compromising a six-billion-dollar data acquisition project, the Aadhaar case study: at scale, disabling security features and defeating the backend analytics is a hashtag fail. Hopefully our Aadhaar numbers are not for sale. Anand Venkatanarayanan.

So I would be talking about the Aadhaar project, particularly from the technology angle and about the real quality of it, but not about anything else, right? Fair enough. As part of this presentation, you can see the blue links; I've linked all the articles and public research we've done as part of this, so if you have any questions you can go look at them. Thank you. It is on the talk funnel, right. About myself: I exist, so I haven't found a need to have an Aadhaar yet. I'm a deponent for the petitioners in the Supreme Court of India in the ongoing cases.
Most of the technical arguments on the project have been drafted by me. I'm a security researcher, I do a lot of financial modeling, and I also program; that's basically my background. So what is the big point I want to drive home in today's talk? If you're talking about a data acquisition system which is countrywide, eventual consistency is a big weakness. That's the point I want to keep driving towards, and it's important because eventual consistency is a weakness that is going to be exploited.

It's much easier to think about Aadhaar as an identity platform, and one platform you are very aware of is what we call a currency note. If you have used a currency note, you already know what such a platform is; an identity document is something like a permanent currency note which does not have the double-spend problem. If you look historically at bank scams and fake-note scams, you can think of currency notes as having read access and write access. Most of the common ATM frauds and SIM card frauds are about getting read access to your money: the guy basically picks your pocket, takes some money, and moves on. Fake currency scams, where someone is actually printing notes, are a lot harder because of the entire process involved. That's basically how bank notes work. The way to think about Aadhaar is: what if you had write access to the entire database? I'm not going to talk about the leaks, all the various data leaks, it leaked here, it leaked there; I'm going to talk only about write access. What it really means to have write access is that if you are able to compromise the write process, you basically have a permanent currency note that does not have the double-spend problem. That's the impact of it. To reiterate my point: if you're able to control what is written into the database, you're basically a very rich person.

To illustrate this problem of write access, we have four existing case studies from the field, which I call the four horsemen. If you've seen X-Men: Apocalypse, he usually recruits four horsemen before the apocalypse; these are the four horsemen case studies. All of them are real, not made up; we have extensive public documentation that these things happened, and we also know the impact. The first one is what I call the Explorer, because this happened in 2012, and it was probably the first template I've seen where people had write access to the database to control what is being written into it. The second one is a 2016 lineage: it started in 2016, it's still ongoing, it has not been stopped. This one I call the Patchmaker, who specialized in trading patches to get into the database. The third one is the Fingerprint Forger; the fingerprint forger is primarily used not for creating entries but for updating existing entries using fake fingerprints. And the last one is from Delhi, which I call the Mixer, who basically mixes all these techniques together and creates what we call super write access. And finally, I will also tell you who the apocalypse is at the end of the talk.
Before we go further into the four horsemen, I want you to understand how the enrollment process itself works and the various checks and balances in the system. There's a very specific reason I call this data acquisition: every enrollment that has happened in Aadhaar so far is a data acquisition. As per the official documentation by the Aadhaar authority, there have been 1.2 billion data acquisitions; however, they do not actually tell you the number of rejects, and if you count the rejects it's much, much more. Almost all enrollment is done using what is called offline mode, and the reason it is called offline mode is that they wanted enrollments to be possible even in places where there is no internet connectivity. So what they have is software through which a packet is created, and the packet is uploaded later; that's why we call it an offline generation process, and that's why it takes a long time for you to get an Aadhaar, maybe 16 days, 90 days. So what does an enrollment really contain? It contains three different sets of documents: one is what we call the XML, which is basically your demographic data in XML format; then there is what is called media, which is your fingerprints and iris; and the third is what is called the derived documentation, your original PoI and PoA scans. These three things are what get captured as part of an enrollment. Fair enough.
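Since the talk describes the enrollment packet as essentially a zip file with these three parts, here is a small illustrative sketch of what a structural check on such a packet might look like. The entry names are invented placeholders that mirror the three parts just described; this is not the actual packet format:

```python
# Illustrative only: entry names are hypothetical stand-ins for the three parts
# described in the talk (demographic XML, biometrics, scanned PoI/PoA documents).
import zipfile

EXPECTED_PARTS = ("demographics.xml", "biometrics/", "documents/")

def packet_looks_complete(path: str) -> bool:
    """Return True if the zip contains at least one entry for each expected part."""
    with zipfile.ZipFile(path) as z:
        names = z.namelist()
    return all(
        any(n == part or n.startswith(part) for n in names)
        for part in EXPECTED_PARTS
    )

# Usage (with a hypothetical file name):
# print(packet_looks_complete("enrollment_packet.zip"))
```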
So if you look at this end-to-end process, I want you to pay special attention to the step "data transfer to CIDR with a pen drive". That is the part you should pay special attention to; it's very important for everything you're going to see later today. The process diagram just explains how the whole thing works, but for us the most important things are that there are devices, hardware, and software; there are verification procedures; and there is data transfer.

As part of any data generation or data capture process, you have to ensure that there are human checks performed on it. Some of the checks they have: an introducer, so in case you are one of those rare people who never had an existing ID document before, there has to be an introducer who introduces you into the system. There is also a document verifier and a supervisor, and the reason these two are very important is that you have to think about Aadhaar as a derived identifier: the only new thing it actually adds to your existing documentation is the biometrics. You are supposed to submit your PoI and PoA, proof of identity and proof of address, and your date of birth, and that's basically what goes into your eKYC and so on; the only new thing being added is the biometrics. So in order to verify that the derived IDs are actually real, you need to have a document checker at enrollment. There are also a lot of incentives for successful generation; this is for the operators. The structure works by saying that if you capture data properly and an Aadhaar number is generated successfully, you get 50 bucks, and if you make a mistake, and there are a whole bunch of process errors and so on, there is a penalty for it. That is the structure that has been set up.

Now, the software checks are important because they have rolled out an enrollment software, and there are checks in the software on data capture and quality. If you look at the software, you can actually see that the basic minimum biometric quality required for fingerprint analysis is 52 percent; just don't ask me why it's not 80 percent. Then there is onboarding for supervisors and operators using their own fingerprints and passwords, and there is GPS and parameter synchronization. I'll come back and tell you why you need operator and supervisor onboarding using fingerprints: they figured out a long time ago that, if I were an operator enrolling people, how do you know it is really me who's enrolling and not some random person? So you need usernames and passwords, but since it is done offline, because you basically just take the kit around in a mobile van and move about, they also wanted the operator's fingerprint. So the prerequisite for being an operator is that you need to have an Aadhaar number, and when you actually enroll people you are given a username and password and you have to put your fingerprint on it; that is what we call onboarding. The GPS and parameter synchronization are important because, unless and until you have GPS, how do you know the person doing the enrolling is not in Pakistan? Because it's offline. That's really the reason GPS was added. Parameter synchronization is fundamentally a method to update some parameters of the enrollment client from the backend; it's usually about how many data packets are pending, how many are not uploaded, how many fail checks, and so on. Once you generate this enrollment, before it even hits deduplication, there is a series of checks done on an enrollment packet: structure validation, demographic checks, operator checks (is the operator blacklisted, is he allowed to enroll people), then whether the supervisor is proper (if it is a biometric exception you need supervisor approval, so is he really there), then an introducer check, and finally resident deduplication. At that point is when you get an Aadhaar number.

I will now touch upon what the Explorer found, which is case study number one. The first version of the Explorer found a very interesting bug in the enrollment software. Remember I told you that before an operator enrolls anyone, he has to give his fingerprint. What the Explorer figured out is that you can give any random person's fingerprint: it's going to fail the first three times, but the fourth time it's going to accept it. That's basically what it is. This is 2012.
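The bug as described is a classic fail-open pattern: the retry limit ends up authorizing instead of blocking. The real client was a Java application; the sketch below is only a hypothetical Python reconstruction of that kind of logic error, not the actual enrollment-client code:

```python
# Hypothetical reconstruction of the fail-open behavior described in the talk.
def buggy_operator_check(match_results: list[bool]) -> bool:
    """Fail-open bug: after three failed fingerprint matches, the next attempt
    is accepted regardless of whether it actually matched."""
    failures = 0
    for matched in match_results:
        if matched:
            return True
        failures += 1
        if failures > 3:          # fourth failure: lets the operator in anyway
            return True
    return False

def fixed_operator_check(match_results: list[bool]) -> bool:
    """Fail-closed: only a genuine fingerprint match is ever accepted."""
    return any(match_results)

print(buggy_operator_check([False, False, False, False]))  # True: anyone gets in
print(fixed_operator_check([False, False, False, False]))  # False: correctly rejected
```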
Now let's go further. The other interesting bug the Explorer figured out: remember, when all this started in 2012, they were using what is called a document management agency, which was heavily paper-based. You basically have a guy to whom you go and give your documents; he puts them on the printer and takes a scan, he also has the original documents, and all of them are piled up together and sent to the CIDR for digitization. So in 2012, or even until last year, you would see a big pile of documents in any Aadhaar center; just don't ask me what happened to those documents, they have their own history. The key thing to remember is that if you look at the process workflow, from step one to the end it's about 12 steps, and it takes three months, and three months is the best number I've seen, for the documents to actually get digitized and reach the CIDR. So what happens during that time? It's like your passport: you can apply for a tatkal passport and they will give you the passport now, and the police verification comes later. So when you make an enrollment and give all these documents, if your biometric checks and structure validations pass, you will get an Aadhaar number, probably in about two to four weeks, but the documents may take a long time to reach the CIDR. Ideally they're supposed to audit, but they don't; that's a different story.

Now here's the interesting part, the other eventual-consistency thing I want to tell you about. This is an email that was sent to all the operators; just look at what they're saying: any non-synced packets, please take a backup and upload to Google Drive. This is 2017. What do you mean by a non-synced packet? The operator did enrollments and has pending packets, and sometimes it so happens that they lose hard drives, they miss things, and whatever is left over, they say, please put it on Google Drive. Okay, fair enough.

The interesting thing about the Explorer is that he used these two weaknesses, the fourth-attempt fingerprint behavior and the long tail of documents, and made 870 biometric-exception enrollments with missing documents, out of 30,000. What happened was that UIDAI runs a backend, and they figured out that a single operator called Muhammad Ali had made 30,000 enrollments in 20 days, which they thought was improbable. They asked how the guy could have done it, did the computation, and said, oh, he must be enrolling 50 per hour, that's not possible; okay, let's go figure out what he did. It turned out that the operator Muhammad Ali had been deactivated a long time ago and his fingerprint invalidated, but someone was using the fourth-fingerprint trick and pushing all this stuff through. And because of the long tail, they said, let's go audit everything this guy has done, and they found that all of the 30,000 enrollments he made were fraudulent: they were supposed to have a ration card as PoI and PoA, but those ration cards did not exist; the document management agency did not find the ration cards. What it really means is that he got all those Aadhaar numbers. Finally, if you look at last week and what I reported: there was a whistleblower who sent a set of documents to the Supreme Court justices after the right to privacy judgment, in which he basically said that out of the 115 crore Aadhaar numbers generated, 46 crore Aadhaar numbers do not have any documents. That's basically what it is. As usual, against what the whistleblower says, we need an official source.
UIDAI admits on their own website that they don't have documents for 7.82 crore enrollments. So between the 7.82 crore the government admits and the 46 crore the whistleblower alleges, that's basically the range.

Now, the Patchmaker. We're going to talk about a very interesting aspect of eventually consistent software: the Patchmaker operated by patching the enrollment software. Let's go back to what the enrollment client is. The enrollment client is basically a Java SE application, and it is optimized for no internet connectivity, which is another way of saying it's eventually consistent with the CIDR. And if you think about it, it's brilliant: what is an enrollment, really speaking? An enrollment is basically a zip file. What if you could create that zip file by modifying the enrollment client? That's what the Patchmaker is all about. So what did the Patchmaker do? He looked at the Java SE client, he looked at all the JARs; if you understand Java SE, almost all the functionality resides in the JARs in the lib directory, so he just replaced the JARs in the lib directory with his own versions with matching interfaces. So what checks were removed? The checks removed were the local logins. The earlier one, the Explorer, at least had to put a wrong fingerprint four times; this guy beat that by just removing the biometrics itself. So basically all you had was a login and password, which of course they used to publish on csc.gov.in; I still have those logins and passwords. At the end of it you have a zip file, you upload it to the CIDR, and you get an Aadhaar. Which of them are real?

So were these guys successful? This is an interview we did with the IPS officer who cracked the scam in September 2017; his name is Triveni Singh, and this is an official document. We asked: why would they collect all this data, how are they benefiting from it? And he says one part of it is profit maximization, because the guy is able to distribute it 25 more times and make more money, and of course the other thing is misuse, because he had figured out how to generate two Aadhaar numbers by defeating biometrics. The deduplication is also gone, because your data capture is gone. Fair enough. So since when has this been active? The first known incident I have tracked on the web is in Telangana, in 2016; that was the first time I had ever seen a public version of the cracked ECMP software. And for how long has it continued? It has continued until April, sorry, June 30, 2018, up to the one in Delhi. If you draw a map of India, I would say I've found close to about 25 officially recognized incidents, all over north India as well as Karnataka. So this is the cracked ECMP software, and the best part about it is that it comes with a YouTube support channel. If you look at the first one, it is "Digi Seva Center"; Digi Seva Center is the official government-run CSC brand, you can even Google it, and the guy is fantastic: he actually put up his phone number and a Paytm number, so you can call him and say, oh my god, it doesn't work, and he responds. The next one is "bina mobile bina operator ka ECMP print Aadhaar", that is, an ECMP-printed Aadhaar without a mobile and without an operator.
Whatever that is, okay, fair enough. Now I will show you an official video of the cracked software that is on YouTube, so this is not something I made up. [video] "Version 3.3 ka..." Okay, how do I get this back? So that completes the Patchmaker.

The fingerprint forgers are a slightly lower-tech version, and remember, all of this is still not read access; it's write access. What these guys did: in 2017, when all this stuff came out, UIDAI said, I'm going to remove all the enrollment operators, everyone is out of the system, and only bank officials are allowed to operate Aadhaar enrollment. So what these guys did is they tracked a bank chairman in Surat, Gujarat, who had write access for Aadhaar edits; they got his fingerprint without him knowing about it, and they gave it to a lot of other guys. What they figured out is that the thumb impression of the authorized officer could be used to make edits: if you have a problem changing a name from Anand to Dayanand, you just go to these guys, pay them 500 bucks, and they will make the change. That's basically what it is. And when they were caught, this is a very interesting thing which is not reported in the national media but is reported extensively in the Gujarat media: you could basically call these guys and say, bank chairman? And the guy says, 25,000 bucks and I'll give you a bank chairman. You can call and ask for an MP: 50,000 bucks. An MLA: 75,000 bucks. It's basically an on-demand, on-call service for fingerprints; that's the Gujarat biometric leaks. The next interesting thing about these guys: when you call them and pay, you get the person's Aadhaar number, username, passwords, and also their fingerprint in raw format, and if you're one of those guys who doesn't know how to convert a raw-format image into the gel thing, it's 10 percent extra. And this has been going on, this is an official RTI response we got, from 2013 or 2014. That's the level at which these guys have been operating.

Our final horseman is the Mixer. Again, when you go back over it, this is basically a three-ring crime network. The first ring, which we have caught so far, or at least know of so far, is the operators themselves. But these are not the guys who made all this stuff; the operators are getting all of it from distributors, which is ring two, and the distributors are as far as we have gotten. We really do not know who kingpin number one is, who is actually making all this; all we know is that they got it from Bihar, and that's where the trail ends. So the assembly factory, as we call it, is basically a self-organized unit which, if you place an order, will give you all the information you need to run a fake Aadhaar enrollment center that actually works: it contains fake fingerprints of bank officers, it gives you cracked software, and it gives you techniques to create fingerprint gels and whatnot. Fair enough. So what is really the implication of this? UIDAI says that all these people are in the database; that's basically their claim. In reality, I don't know what the database is; there is just a database.
So the question we have to go on to is: what is the reason people are doing all this? We have to ask the question, because in every crime there is a motive. We have established the process, we have established how people do it, but not the why; what is really the why? The first and foremost thing about Aadhaar enrollment itself is that it is actually unviable. The reason I call it unviable is that, just like many other things the government has said, it was sold to the operators as something you can make a lot of money on. So what is this "lot of money"? An Aadhaar enrollment kit costs you about two lakhs to put together, and the fee for a successful enrollment is about 25 to 50 rupees. So if you do a back-of-the-envelope calculation, as an operator you break even at around 8,000 enrollments. The rejection rate of enrollments, however, is 25 percent on average. What I mean by rejection rate is that for you to get 8,000 successful enrollments you probably have to make 10,000 enrollments, because you only get paid for generated ones, not for rejected ones. The disincentives are where it really starts hurting, because it's possible that an enrollment got rejected because you didn't capture someone's fingerprint properly, or there was some data entry error, or a quality error, or a zip error, whatever. For every such capture error or data entry error you are fined from 300 bucks to 10,000 bucks. If you look at the permutations and combinations, the penalty is six times more for a rejection, and 25 percent rejection on average basically makes you unviable. That's what these operators figured out.
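Those numbers can be put into a quick back-of-the-envelope calculation. The figures below are just the ones quoted in the talk (kit cost, per-enrollment fee, rejection rate, fine range); the rest is simple arithmetic:

```python
# Back-of-the-envelope operator economics, using only the figures quoted in the talk.
kit_cost = 200_000           # enrollment kit, in rupees (two lakhs)
fee_per_success = 25         # rupees per generated Aadhaar, lower end of 25-50
rejection_rate = 0.25        # 25 percent of attempts are rejected and earn nothing
fine_range = (300, 10_000)   # rupees, per capture or data-entry error

break_even = kit_cost / fee_per_success                 # 8,000 successful enrollments
attempts_needed = break_even / (1 - rejection_rate)     # roughly 10,700 attempts

print(f"Break-even: {break_even:,.0f} successful enrollments")
print(f"Attempts needed at {rejection_rate:.0%} rejection: {attempts_needed:,.0f}")
print(f"One fine wipes out the fees from {fine_range[0] // fee_per_success} "
      f"to {fine_range[1] // fee_per_success} successful enrollments")
```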
So what had really happened was that these disincentives were created because data quality issues kept climbing, and (this is the important part) they were introduced around 2014, and most of these operators were saying, we are on the edge of bankruptcy because we can't actually sustain this business. This is the fundamental reason the patched software came into the picture: there was a large demand and it had to be satisfied under any circumstances, because Aadhaar had been made compulsory for so many things. The interesting part about the fake Aadhaar network, as we call it, is that it had a trained workforce of 60,000 operators who had been laid off; they know the ins and outs of the system, which is why you see the YouTube channel. The capital expense is already covered, the cost of the cracked software is 5,000 bucks, and the cost of one fake Aadhaar is 5,000 bucks, so you can see the incentive structure is pretty much aligned for everyone; the disincentive, of course, is please do not get caught.

So then we come to the basic question I promised: if you have the four horsemen, who is the apocalypse? The answer, unfortunately, is us. The reason I use the word "us" is that this is a technical audience, and we don't really understand the purpose and value of technology. We believe that when we create a technology, it is going to be used only for good; we don't understand that we have blindnesses, we have biases, and those biases hurt us very badly when we roll technology out at a countrywide level. And why I'm saying that is because scale comprehension, even of fraud, is actually very hard for us. If you look at our evolutionary path, we lived in tribes of not more than 10 to 100 people for a very, very long time. Not long ago, my father knew almost every single person in his village; I come to a city, and I am still not completely used to this anonymized, large-mass kind of living. What it really means is that if you are a software engineer and you design a system, you don't really grasp that your mind has probably not grown to match the technical capability of the software you can create. So while Mr Ajay Bhushan Pandey goes around saying that he is waiting for a big attack, that's not how it's going to work: you're going to see a million small attacks, like what these guys have done on the ground. And remember, evolutionarily speaking, the bacteria always wins and the elephant always dies.

So what are the data acquisition lessons we've learned from this six-billion-dollar episode? Computer systems add a complexity vector called control. When we think about software, we think about possession and ownership, but we don't actually understand control. What do I mean by control? In this particular case: who controls the endpoint of data acquisition? You may believe that you gave them the software, you may believe that it's generating a zip file, but who is actually controlling the software? And also remember that in asynchronous data acquisition systems, drops, delays, and frauds are indistinguishable; that's basically our Explorer. One way they tried to fix the problem was to make GPS mandatory in the enrollment software, but that has been patched out, and patched out in such a wonderful form (the guys are actually cheeky) that every single enrollment made by the fake software points to the UIDAI office in Pragati Maidan in Delhi. So people look at it and ask, why are there so many enrollments in Delhi, why is it exceeding the population? We have known the answer for a very long time. The other interesting thing about control: this is probably the day-before-yesterday's packet upload status at UIDAI. You can see that packets uploaded today is four lakh and packets rejected today is 29 lakh. What do you think is happening? Give a guess. It's probably the fake enrollment software DDoSing it, or simply that much in data quality issues. It's uploads.uidai.gov.in; please click on it before they bring it down, because they usually do when these things get reported.

So what are the final lessons? If you're running a countrywide enrollment system on which you spent six billion dollars, please do not outsource it; please pay for the maintenance and do not undercut the people doing it, quite apart from the technical lessons. I'm pretty much done. Questions? We have time for questions if you have any.

Hi, I have a question. In your opinion, was the problem the second part you mentioned, that we as software developers most of the time do not go beyond our scope and understand all the different nuances that are there, or was the basic problem that the developers themselves were not that good?
See, how do you distinguish between the business and the developers? If you think of it like a business, let's say you actually ran a big-data business capturing stuff from the entire country, what I was trying to say is that when we design it, we look for scale, but we are not very good at figuring out that such a system has side effects. When these guys built it, they designed it for the use case of optimizing enrollment speed, not for enrollment fraud detection.

Yeah, but that's like a basic flaw. Obviously there can always be ways of tricking systems, like really writing hardcore viruses, but there are certain very basic security mistakes in this whole thing.

So the basic security mistake, I would say, is the offline part, and the reason the offline part was done is the enrollment speed problem. If it had been online, you could have detected things a lot faster. We do this all the time: if, let's say, your account at your company is compromised, how do people detect it? They know that you can't be in Singapore when you are at home; two logins which are geographically distant can be easily detected. But not here, because it's fundamentally offline. How do you detect it? And without the offline part, the system would never have worked.

They could have at least done two-factor authentication. Like what, two-factor authentication on the operator fingerprint? No, you use a fingerprint, some code comes to your mobile, and you have to give that code as well. It doesn't work like that; I'll tell you why. Even with the mobile-Aadhaar linking, if you look at the guy, he actually has a mobile number, but you can't trace him. So what is two-factor authentication when you actually control the source?

Hello, yeah, sir, here. Why can't the owner of the document update their own details? It's not possible, is it? Let's say my details are wrong; why does the system not allow me to correct my own data? (Excuse me, may I ask the audience to be a little quiet; the session is still on. Please take your discussions outside.) Is there a way the owner of the document can update their own details if there is a correction? See, ideally you should be able to, because you keep moving around; we're all migrants, right? But the problem is, who's going to check it? No, it doesn't allow it: you need someone to attest that you moved to a different place, there has to be a document that says you moved to a different place. That's how it is. People are actually very unhappy about it, and that is why the cracked software is of great use: you can go to the guy, give him 500 bucks, and change your address.

So if this is so blatant, on what basis does UIDAI say our database is perfectly secure and all that? On what basis? I think denial is a far easier response than competence. Okay, thanks.

So, one question this way. Now that we are in a mess already, what do we do from here on? Because we are in a mess: the data is out there, it's leaked, it's with everyone. So we need to clean this up; tomorrow the solution can be to abolish Aadhaar or do whatever, but how do we do the cleanup? It's up to the tech industry again, right? Any views on how we clean this up? How do you clean this up? A full audit.
Yeah, a full audit, a full audit, right. So is someone coming up with a plan for how the audit will be done? Because that again gets outsourced, and then we are in a mess again. The way you have to think about Aadhaar as it exists today is that it is a self-certified ID: you are whoever you say you are. So there is no cleanup. Like, they can do a cleanup if they want to? Exactly; it's going to cost a lot. Yeah, so effectively it's not happening then.

Yeah, so as you mentioned, outsourcing the data acquisition is the fundamental flaw. What do you think would have happened if they had done the data acquisition in-house instead of outsourcing it to the last mile, to private players? It would have worked; it would have worked much more slowly. Slowly is one thing, but do you think the data would have been secure even then? To a large extent, if it had been done in-house they would have had some control. If you look at how we used to get IDs certified in the past, there used to be a gazetted officer who would put a signature and say, you are who you say you are. If you want to get a tatkal passport, for instance, one of the commonly used methods is to go to an IAS officer or an IPS officer who certifies, saying, I really know this person. So in the past there was a kind of status order, which was that government servants are typically more trustworthy, and hence if they say you are who you say you are, then that is taken to be so. What has now happened, I would say, is that they have democratized trust and created distrust.

Hello, I have a question here. In hindsight it obviously looks like we kind of messed up, but again, to the point: we being technical people who handle decently big projects, can we come up with some sort of project plan that says, okay, this is how you go about cleaning this up? It's very easy to clean it up; it's going to cost you a lot, that's all there is to it. But we always make compromises somewhere and say the compromises got us to this place. No, but you can't say that this is so uncleanable that we live with it forever, right? What is it? I didn't catch the last part. It sounds very unrealistic to say that this is the state, and the fix is going to cost a lot, so we don't do it at all. That's basically the government's call; that's basically what they're saying now. If you ask me personally, I'm not against an ID system per se; I'm for an ID system where things are done properly. It is just not that, and that's the problem. They did it purely for speed; there's really no overarching value beyond making it too big to fail. If you fundamentally understand the dynamics of this project, which I don't want to go into much but will just touch upon, they were really worried it was going to be struck down, so their basic model was to spread it so far and so wide that it can't be undone. That is why this is the way it is, because when you operate at that speed, I don't think you account for fraud; you don't actually care about fraud.

My question is: the government did a reasonably good job on the voter ID collection, and we are running probably the largest democracy in the world. Why were the lessons learned from the voter ID collection not used in Aadhaar, and why did we end up in this mess when we already have a very...
So if you look at the original way in which they looked at the whole thing, it was not supposed to come this far; it was supposed to stay with PDS. The original argument has always been that there are a lot of problems in PDS, and historically, if you look at PDS in India, even in Karnataka a long time ago people had experimented with biometrics for deduplication; that's basically how the whole thing started. It started there and just went far beyond that scope, and that's where the problems are. You can increase the scope slowly over a period of time by putting in sufficient safeguards; we have been doing this for a while, we know how to do this kind of thing, except that here control has never been with the technical team; it has been with something else.

Done. One last question, which came in on Slido: why do you think Aadhaar was the only document which was targeted? We had data collection drives which happened earlier. Those documents were not mandatory for going to the toilet, which apparently happens a lot. If you look at some of the stuff that happens: there's a scheme under Swachh Bharat, and for you to get some kind of aid under Swachh Bharat you need to have an Aadhaar card, so literally people were making fun of it, saying, this is what it has come to. None of those earlier cards were being mandated for each and every single thing in your life; you were not told that if you don't do this, your money is gone. That's where the problem is. So while people had made documents before, those did not cause big problems, because they were not mandatory for many things. Okay, I'm done.

We have the Data Science in Production BoF starting at 4:30 on the first floor, and if you're interested in giving a five-minute flash talk or lightning talk, which starts at 4:20 in Auditorium 2, you can register at the help desk. So: using data analytics and a bag of tricks, with micro-targeting and advertising in politics, Shivam Shankar Singh says you can influence elections, leading to conflicts, and the question of using such data on ethical grounds still exists. Shivam Shankar Singh, on weaponizing data for politics.

So this is actually my laptop, and that's my Aadhaar card on there, so the government also knows who I am. Could I have the clicker, please? I'm going to give a really quick introduction of who I am. My name is Shivam Shankar Singh. I graduated from the University of Michigan, came back, and worked in the Indian Parliament as a legislative assistant to a Member of Parliament for one year; then I worked with Prashant Kishor, who runs a company called I-PAC. After that I moved to running election campaigns in the Northeast, where I worked with Ram Madhav ji and the BJP and handled the campaigns in Manipur and Tripura. Manipur is a really tiny state, and that's where we started experimenting with a lot of data stuff, because each constituency is just about 50,000 people, so you can do a lot of things there. This is a report that was published just as I landed here in Bangalore; it's basically an Oxford study that says that different parties across some 41 countries have invested half a billion dollars to experiment with what can be done with data.
The study says that since 2010, many political parties and governments have spent over half a billion dollars on the research, development, and implementation of psychological operations and public opinion manipulation over social media. How did people hear about politics and data together for the first time? For a lot of people in India, the first thing they heard about was Cambridge Analytica, a UK-based company that basically took people's data from Facebook and then used it for targeted advertising on Facebook itself. Our major parties made a huge hue and cry about this; they said, okay, Congress is using Cambridge Analytica. As far as I know, no political party in India has used Cambridge Analytica, and for just Facebook data it would not be a valuable tool. On the other hand, there are a lot of things being done with data and politics in India. This is one of the presentations, a campaign pitch that we pitched to a political party, and when the Cambridge Analytica story came out and I was contacted by a reporter, the first thing I thought was that we should really change the names of our slides: "weaponizing data" is not a good name to have after the Cambridge Analytica story.

The technology that's used in politics is decently basic. It's Python to scrape data off the web; it's put into a database; for visualization we use things like D3.js, and we use QGIS and Tableau to make it look pretty, to put it over a map so that party leaders like the way it looks. We've developed mobile apps so that leaders can just look up constituency profiles. There's actually a really interesting story on this: the first time we went to Tripura with a major party leader, he had a tablet in his hand, and in a meeting full of party workers he just made random people stand up and asked them, how many Muslims do you have in your constituency, how many people from the Jamatia tribe do you have at your booth? And that was something that got the karyakartas really interested, because they realized that party leaders have this level of data. It incentivized them to work harder; it incentivized them to actually go out to their booths and start collecting data themselves.

A big disclaimer before we start the next part: I do not recommend that anyone go out and actually do any of this, and neither am I saying that we've done any of this. These are all things that are possible in politics; I basically don't want to be the next Cambridge Analytica headline here. Most of the data that's actually used is absolutely publicly available, and it's all legal. The first piece of data that political data analytics starts with is actually from the website of the Election Commission. Every state has a Chief Electoral Officer, and from that website you can download something called the electoral roll. An electoral roll is basically a sheet of paper that lists the people who vote at a booth. The information that sheet has: the constituency name at the very top, then the booth name and booth number, your voter ID, your name, your father's name, the household number, your age, and your gender. This is data that the Election Commission provides to the general public; you can log on to the website right now and download the PDFs. If you zoom into this, you see the first 30 names from a random constituency in Bihar, the constituency I belong to. So let's convert this into an Excel sheet.
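Just to make the "convert it into an Excel sheet" step concrete, here is a minimal sketch of the kind of record you end up with. The field names and the sample row are invented placeholders that mirror the columns he lists; this is not real voter data:

```python
# Illustrative only: the fields mirror the electoral-roll columns described above,
# and the sample row is entirely made up.
from dataclasses import dataclass

@dataclass
class ElectoralRollEntry:
    constituency: str
    booth_number: int
    booth_name: str
    voter_id: str
    name: str
    fathers_name: str
    house_number: str
    age: int
    gender: str

row = ElectoralRollEntry(
    constituency="Sample Constituency", booth_number=42, booth_name="Govt. Primary School",
    voter_id="ABC0000000", name="Voter One", fathers_name="Voter Zero",
    house_number="12", age=34, gender="F",
)
print(row)
```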
With this, there is some information that becomes very obvious as soon as you see it. What do you think you can tell with just this much data? Any guesses? Yep, caste is definitely one of them. Age is definitely there, there's a column for age, so you know the demographic profile of that constituency along two dimensions: one is caste, the other is age. Exactly. For the caste, we've actually written algorithms now, and it happens at the constituency level: people fill in that this surname equals this caste, and if you do this for a big enough data set, it automatically assigns a caste to people. For some people, like the third one in this row, you don't know what caste the person is, but you can just Google it, and as soon as you Google it, Google tells you: someone asked a question on Quora and someone answered it. So that's how you get the caste for people. The interesting part is that there is a lot of ambiguity when it comes to last names and caste: for one, Chaudhary in Bihar actually corresponds to two different castes, one is Bhumihar and the other is Passi, but the algorithm knew what to assign the name to just because of the constituency, since the allocation happens at the constituency level. There are actual people sitting and doing this for every constituency; an MLA constituency has about 2.5 to 3 lakh voters, but there is someone sitting and filling it in for 10,000 voters manually, and it happens for every constituency. Some surnames stay the same across the state, but some change constituency-wise, so you still need to do this exercise. As someone mentioned, you can also tell family. We don't really know what to do with that data yet, because whoever is working on the ground for a political party already knows who is part of which family, so how do you micro-target based on that? The accuracy of this exercise, for the states it does work in, for the big north Indian states like UP and Bihar, is surprisingly good. For some states it will not work at all, like Punjab, where everyone's last name is Singh. Punjab has the largest Scheduled Caste proportion in the country, but you cannot stratify people on this basis, because it's "Singh" and the name of the village, and that's all you get. For a place like that you might have to do it manually, but I haven't done it yet, so I'm not really sure.

So this is the information we have, courtesy of the Election Commission: we have the name, the age, the location of the person, because that is the booth they vote at, and the gender. What can be derived out of this is caste and religion. Religion is actually surprisingly easy to derive; caste, for the major ones like Yadav or Rajput, is decently easy to derive too, and for the smaller castes, which use multiple surnames, it's slightly complicated. But then this is what political parties have that no one else has: a huge amount of free labor. You have so many party karyakartas who would be so excited to be a part of "strategy"; you can get thousands of people who come in, and within two hours they're just filling in, okay, this last name equals this caste, this last name equals this caste. We basically gamified the entire process, and people are really excited to do this for hours on end.

This is some more data that the Election Commission itself provides. A booth has less than 1,000 voters on average; this is some constituency from the Northeast which had like 600 to 700 voters per booth. You have the voting profile for every booth, and now you also have the caste profile for every booth. So if you have data for the past three Lok Sabha elections and the past three assembly elections, which we don't, we have it for like two assembly elections and one or two Lok Sabha elections, because delimitation happened.
Delimitation is when they changed the boundaries of all the constituencies, and you just don't know which region became part of which new constituency. But with just this data and basic statistics, you can get to a pretty good understanding of which community is voting for which party, or whether a community is voting for candidates of its own party, because you have different booths that you have stratified into different castes and different religions, and you have the voting record from those booths; just by taking the intersection you can see how a community is voting. In my experience, and this relates to the intuition talk we just had, it correlates pretty well with whatever the local intelligence and the local party candidate tell you, so you might not even need this data at the end of it.

So far we've seen caste, religion, age, family relations, and booth-level voting trends. What comes next is the gray area, and I know the slide is black; it's intentionally black, on purpose. This is a tweet by the Internet Freedom Foundation: data protection is not some distant, intangible, elite demand. Many Indians today use Paytm, Ola, JustDial, Flipkart; as per a recent unsolicited email we got, many such databases are up for sale. They open people up to not only spam but also identity theft. What it also does is provide a lot of data to politicians to actually compromise democracy. At the end of it, there are student databases up for sale, so you can just get who is a student in which stream across the entire state; you can get people's smartphone data; you can get matrimonial databases, and I'm sure we could think of a use for that; you can get online shopping data; you can get the list of government employees; and all of this is available through unsolicited emails from random companies. The X factor when it comes to political data analytics is phone numbers, and it's not the CEO of Airtel or the CEO of Jio who's selling you these phone numbers; it's random people in random constituencies. There are people who have just mined the list of every SIM card issued in a state, and they will give the data to anyone who wants it. If the Aadhaar database's enrollment process is so compromised, you can just imagine what the data security is like on something like a phone number.

Another really important factor that you need before you can turn anything into actionable intelligence in politics is people's socio-economic status; that influences voting behavior to a large degree. What do you think would be a good proxy for socio-economic status? Any guesses? Just shout it out. Address could be one, yes, but it's not very specific; in urban clusters everyone lives everywhere. Twitter is too limited. Mobile phone brand could be it. But we've actually found a much better proxy that has a one-to-one correlation with socio-economic status, and that is people's electricity bill. You can use land records, census data, NSSO surveys, BPL lists, and all of these things are used, but they're complicated. The electricity bill, on the other hand: everyone has an electricity bill, and it's a one-to-one correlation; the more air conditioners you have, the higher the electricity bill, the higher the socio-economic status. And honestly, this data is surprisingly easy to find. You don't need to contact the power minister; there are people sitting inside discoms who are willing to sell it to you. And there was a discom in Delhi where a friend of mine was paying the electricity bill, and he realized that if you entered the customer ID, it showed you the billing details:
the name, the address, and the amounts for the last three months. So what he did was write a script that just stepped through all the customer IDs, incrementing the number until it ran out, and he had the bills for his entire locality.

So, micro-targeting. Facebook advertising is a major part of it, and Twitter and so on are a major part, but what has allowed true micro-targeting in politics in India is an app called WhatsApp: much more prevalent than Facebook, much more prevalent than Twitter. And with this you can actually target an individual instead of trying to target a group by just age. When it started out in 2014, one of India's major political parties, I will not name it, had 9,000 to 10,000 groups in the entire country during the 2014 elections; in Karnataka alone, in the last election, they had over 20,000 WhatsApp groups operating. These groups that people are added to, and some of you might have been added to them, are not groups of random numbers. When it started out in 2014 it was actually just random numbers: they had a list of numbers and they kept adding 256 numbers to each group. Now it's not random, and you honestly probably were not added to a group, because you would be assessed as high-SES urban youth, which is a demographic that no one really wants to target right now. What this allows you to do is form groups of specific communities: groups along certain caste lines, groups of a certain socio-economic segment, groups of particular constituencies. Think of a group of youth between 18 and 25, Hindu, low to middle socio-economic status, non-Yadav OBC, in booths number 5 to 150 in the 12th assembly constituency of Uttar Pradesh. This is a group with a specific age range, a specific income status, a specific caste. What can you do with something like this? A party that's trying to win an election in this constituency knows that 22 percent of the constituency is upper caste, 16 percent is Yadav, and the Yadav population supports the opposition party; that's the core vote bank of the opposition. But the region also has 18 percent non-Yadav OBC. What they would think of is getting the upper-caste vote plus the non-Yadav OBC vote. For this they run focus group surveys, internal A/B testing, and, at the end of it, political acumen, and then they test messages. The general message here is: Yadavs have cornered all the reservation benefits intended for OBCs; party X only supports Yadavs to the detriment of all other OBCs; we must stand up and fight this injustice; we must teach them a lesson this election. Now, this is a very explicit statement of what a political party wants to do, and this is not how it would actually be delivered. This is where propaganda comes in, fake news comes in, random facts come in, because you want to convince voters of this message. What you do is start sending random made-up statistics, like 90 percent of the OBC reservation in the state's educational institutes is taken by Yadavs. No one knows if that message is true or not, but it is something that resonates with people; they think, yes, that's why I'm not getting my share. And here is where polarization starts, fake news starts, distorted facts start, caste conflicts start, people start linking nationalism to one party. It doesn't happen instantaneously: if I sent you a message today, you'd read it and not think about it anymore. But if it is a concerted campaign over a two- to three-month period, where I send you facts, where I send you jokes, and all of them are pushing you in a very specific direction, to think in a very specific way, playing on your sense of victimhood,
then even when that fact is corrected, you will not believe the correction; you will continue to believe your originally held belief, because it plays in line with your bias. Some of you might have heard of the Stanford prison experiment that was conducted in 1971 in the US. They classified people into two categories, prisoners and guards, and they basically showed that people's behavior depends on the position they are put into: the guards started torturing the prisoners, they beat up the prisoners, they mentally harassed the prisoners. Around seven or eight years after the study was published, it was found that the study was absolutely fraudulent: no such thing happened, it's completely made up. But if you tested college students today, and there's a lot of data on this, 70 to 75 percent of the people in intro classes still believe that the study is real. And this is what is happening here.

There's a lot more happening with data today; this was just a snapshot of what can be done with just phone numbers, electricity bills, and publicly available data from the Election Commission. Right now political parties are collecting data on the beneficiaries of government schemes, like the Ujjwala Yojana, the Pradhan Mantri Awas Yojana, the number of toilets that were constructed. What's going to happen is that political parties are going to target these specific people with specific campaigns. Eventually someone might get access to loan data; eventually someone might get access to IRCTC or payment wallet data. There's just a whole world of possibility. Think about (I don't know if anyone's doing this, I'll just clarify that at the outset) what someone could do if they had your call records. These are not taped phone conversations; these are just the numbers that you've called and the durations that you've talked for. Do you think that would be something that's illegal? Any guesses? It should be illegal, right? Right now, in our books, we have no law that would explicitly say that a data compromise of this kind is illegal. What it's covered under is Section 43A of the IT Act: a body corporate which is possessing, dealing with, or handling any sensitive personal data or information, and is negligent in implementing and maintaining reasonable security practices, resulting in wrongful loss or wrongful gain to any person, may be held liable to pay damages to the person so affected. A really important component of this is the wrongful loss clause: just because a political party has accessed your data, it doesn't mean that some wrongful loss has occurred to you, and how do you prove that it's a wrongful loss? The other thing that actually does a reasonable job of protecting data in India is something called the Information Technology (Reasonable Security Practices and Procedures and Sensitive Personal Data or Information) Rules, 2011, but this only covers passwords, financial data, physical, physiological and mental health conditions, sexual orientation, medical records, and biometric information. For everything else, if the data is compromised, there is no guarantee that anyone will be punished or that anything will happen at the end of it. You might register an FIR that your data was leaked through a certain medium, but till now I don't know of any action that has been taken in such a case; for the many Aadhaar data leaks that have happened in the country, no one has taken responsibility.
There is urgent action required to stop all of this so that the essence of democracy in India is maintained. We need data privacy laws: someone is selling your phone numbers and electricity bills, and it is not even clear whether that is illegal. The other part is that we need data storage laws. Right now there simply are no requirements: if it is not passwords or financial information, the data can sit in an unencrypted database or in an Excel sheet, and you can scrape it just by incrementing a number. So you need data protection laws, and data storage laws that govern how data is stored, in encrypted databases. You need spam restrictions on platforms like WhatsApp. You need Facebook, Twitter and WhatsApp to come together and start flagging fake news. All of this needs to happen together.

WhatsApp actually made some promises to the Government of India. Surprisingly, they responded after the Ministry of Electronics and Information Technology wrote to them following a series of lynching incidents, after fake news about child abductors spread across the country. Their response essentially said they would prevent fake news from spreading, and this is how they planned to do it. First, new protection to prevent people from adding others back into a group they have left. Fine, but I don't think this matters; we never add people back to a group, it would be counterproductive, they would talk against you. Second, administrators can decide who gets to send messages within individual groups; in practice it will always be set to all participants, so it won't matter. Third, a new label, under testing in India now, that highlights when a message has been forwarded rather than composed by the sender. Have any of you ever received a message that says "forwarded as received"? Do you think it makes any difference to the people who read it? Mostly it doesn't. The people being targeted with these messages are already primed to accept the information; they already subconsciously believe it, they just need a document saying they are correct. Fourth, a new project to work with leading academic experts in India to learn more about the spread of misinformation, which will help inform additional improvements going forward; I have no idea what that means, let's see. Fifth, literacy workshops and advertising campaigns on how to stop fake news. The odd part is that WhatsApp probably has the best platform for a literacy campaign, which is WhatsApp itself, but they are not using it; they are giving out newspaper advertisements for some reason. And finally, fact-checking accounts on WhatsApp, like Boom Live and the Hyderabad police, where you forward a piece of news and they tell you whether it is fake. But that requires people to actually question their beliefs and forward the message themselves, which again is something that will not happen if they already believe the news.
All of these measures are unlikely to address the problem. What can actually work will have to be a combination of technology and new laws in the country. I will talk about the technology, because everyone here has something to do with data. Artificial intelligence that can categorize messages as fake news would probably be the next logical step, and it is probably going to be a lot more effective, because education is a very slow process; if we start educating people right now on what is fake and what is not, things are probably not going to end well for the next ten elections. The real concern with this is privacy: WhatsApp says it is end-to-end encrypted. What could happen is that groups in which, say, 25 percent or more of the numbers are unknown, or where 75 percent of the members do not have the person who added them in their address book, perhaps do not need to be encrypted, because these are not people you know; someone is just adding random numbers to a group. Immediate action is required. What I am here to do is tell you how your data can be misused and how weak the data protection laws in this country are. The first part of that is being informed, so that you can push for tougher legislation and talk about data security in a way that actually translates into real action. Thank you. Questions?

Please restrict your questions to one each, because we have a lot of hands, and volunteers, please take back the mic once the question has been asked.

You mentioned that electricity bills let you target people really well, especially in states like UP and Bihar where power theft is a very common problem. Does that result in false positives, say where somebody is paying only 100 or 200 rupees for electricity but is actually very rich? — That usually doesn't happen. There will be outliers, but the outliers don't matter. If you have 70 or 80 percent accuracy in a field like politics, it is more than enough.

You said you make cuts along caste lines, but every election is different; in some cases people vote against caste. You work with politicians, so can you actually predict that in a new election, and how do you deal with it when it happens? — Some castes will never shift their loyalties; it is known which side they will vote for. Some castes do shift loyalty, and those are the ones targeted by something like this. Strategy is made fresh for every election; it is never recycled. And whether it can be predicted for a new election: all of this gets combined with survey data, focus-group discussions and so on, so it is not just tech, there is also a lot of on-ground activity, and a party cadre working continuously on the ground gives you feedback on which direction the voting sentiment is moving.
I didn't have a question, but I was thinking that one way to counter this might be to gamify identifying fake news: if a person can identify something as fake news and send it to an authority, he or she gets some reward. That could actually reverse this trend, because ultimately people respond to some kind of reward. — That is actually a great suggestion, but there is one small problem with it. Alt News, which some of you might have heard of, is a website that corrects fake news and releases corrections for a lot of it, but its reach is insignificant compared to the fake news itself. How do you ensure the correction reaches the same people? A party has the machinery to run 20,000 WhatsApp groups in the state of Karnataka. You could start rewarding people, but whose incentive is it to reward them?

We are talking about elections, but we are not considering the corporates that make huge profits using data segmentation. Every day we use many apps, WhatsApp, Facebook, and all these companies are using the data. Can we also talk about corporate laws on this? — Definitely, a holistic data protection law is required. This country needs to think about what companies, and anyone who has access to data, can do with it; we just don't have laws for anyone. Politics is one space in which we see the immediate results of this. What WhatsApp or Facebook will do with our data and how they will monetize it, I don't know yet; all they have done so far is push advertisements at us, which isn't so bad.

You said we need laws, and I think everybody agrees, but we also saw that it is against the interest of politicians to make these laws. How do you think this can go forward? — The way I see it, there is eventually going to be a PIL, and the courts will have to order something. The incidents of mob lynching and of fake news propelling people to act violently in random parts of India are going to become so frequent that eventually the courts will have to act. The other body that will have to do something is the Election Commission, because its job is to keep elections sane in India. With this technology they haven't really caught up; they don't know what to do with it, and they have no means of monitoring expenses on social media, so they will have to get into this field.

You mentioned that you have been working in the Northeast for some time. Could you give your opinion on how the use of data, or fake news, is different in the Northeast versus the rest of India? — How it is used is different for every state and every election. In a place like the Northeast you have a lot of tribes, and tribes are actually much easier to identify than castes, because in many parts of the Northeast people use the name of the tribe as their last name; that is what happened in Tripura. So it is very easy to segment people into tribes. What then happens is that you can identify influencers within those tribes and target them specifically, because vote transference is a lot more prominent in the Northeast: you essentially go to a village headman and ask him to transfer the village's votes. That is something not so prevalent in the rest of India anymore, but this data allows you to set the right price for the right village headman. Thank you.
Hi again, I just wanted to know: do you believe the invisible hand of the market can help? If there is information symmetry among the political parties, if everybody is using similar data, can this effectively cancel out and become an insignificant factor? — The data is all public; none of this is proprietary data that only I have access to. Most of it is on the Election Commission website, and the mobile numbers you can just buy off the street; any party can buy them. The point is that some have started using it and others just haven't figured out how to go about it, and some parties simply have more money than others.

You said access to call logs can be an influencing factor in promoting campaigns. Is monetized data like that already available illegally in this space; are people already doing it? — I have no idea. I haven't met anyone who has people's call records; no one has tried to sell it to me yet. Someday it might happen, and I'll let you know. Maybe after this talk goes online someone will say, "boss, I have this data, buy it." Thank you.

We'll have a break now and be back in 30 minutes. We have feedback forms as well, so kindly fill them up and leave them in the drop box at the front entrance. Thank you. There are BoFs happening: one is on the first floor, on data science in production, and in Audi 2 another BoF starts at 4:40 on math for data science. This is the last talk of the day in Audi 1: understanding the scalability limits of Spark applications, whose debugging by trial and error can lead to frustration. Qubole Sparklens has been designed as a solution to better estimate time and identify structural constraints. So we have Rohit to talk on Qubole Sparklens.

My name is Rohit, and today I'll talk about Sparklens. This is an open-source tool we have developed at Qubole, designed to answer two main questions: given a Spark application, how many executors do you really need, and if you add more executors, will your Spark job run faster, or are you just wasting compute and not getting any value out of it? The agenda for the talk is that I'll cover performance tuning principles in general, some applicable to Spark and some very specific to Spark, then spend a lot of time on the theory behind Sparklens, mostly around how scheduling works and what constraints scheduling places on the scalability of Spark applications, and then I'll go through an example showing how we can use Sparklens to identify where the problems are and how we end up solving them.

All right, performance tuning principles. If you look at performance tuning, there are a few simple things we typically end up doing. One is to make some part of your computation faster. This is very typical: we profile the application, find the area where most of the time is spent, and see if we can make it faster; maybe you're using an O(n²) algorithm and can replace it with an O(n log n) one, and that makes life easy. You can also use a faster CPU; if you're on the cloud, you can use a better instance type with more compute. Most of these things give you some improvement: if you make some part of the computation faster, the total time obviously goes down.
The second principle is: don't do what you don't need to do. A couple of examples. If you're using Spark and storing all your data in files, and you need to scan them, say you're computing what your orders were in the last year, then if your data is not partitioned you'll end up scanning all of it. If instead you partition your table, say day-wise or month-wise, you can choose specifically what to read and reduce the amount of work required. The same idea applies at a deeper level: say you store your data in CSV files with 100 columns and your query only needs 10 of them; the 90 columns you read for every record are a complete waste, and if you use file formats like Parquet or ORC you don't have to read those columns at all, saving time and computation. The third principle is: don't do again what you have already done, which basically refers to caching. In SQL it's a little hard to do this, but when you write programs in Scala it's easy to put something in a for loop and not notice that some computation is happening again and again; it's worth stopping, looking at it, and seeing whether you can cache the result and reuse it for the rest of the computation. The last principle is: use more resources, parallelize and distribute, and that is where Spark comes into the picture, because Spark is a platform where you can submit code and it will just get distributed over a set of executors, get executed and give you results. Spark makes it very easy to parallelize and distribute, but depending on how the data is partitioned, what the constraints are and how stages interact with each other, all of these impact the scalability of your application. So even though parallelism is easy, ensuring that the work is actually distributed is a difficult task, and that is what I will be discussing today.
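Before moving on, here is a minimal PySpark sketch, not from the talk, of the second and third principles: partition pruning, column pruning via a columnar format, and caching a reused intermediate result. The paths, column names and filter values are invented for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pruning-demo").getOrCreate()

# Partition pruning: if the table is partitioned by order_date, this filter
# lets Spark skip whole directories instead of scanning everything.
orders = (spark.read.parquet("s3://bucket/orders")          # hypothetical path
          .where(F.col("order_date") >= "2017-07-01"))

# Column pruning: Parquet/ORC are columnar, so selecting 3 of 100 columns
# means the other 97 are never read off disk.
slim = orders.select("order_id", "region", "amount")

# Don't do again what you have already done: cache a result reused in a loop.
slim.cache()
for region in ["north", "south"]:                            # hypothetical reuse
    slim.where(F.col("region") == region).count()
```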
So let's think about a Spark application. A Spark application has two distinct parts: the work that is done in the driver and the work that is done on the executors. The difference is that when the driver is doing work, and by that I mean work restricted to the driver, nothing is happening on the executors; they are completely free. And once execution reaches a point where it goes to the executors, only the executors are working and the driver is not doing anything. This structure makes it easy to understand what's going on: if you imagine the underlying hardware and ask "where is my program counter right now", the answer is either that it is in the driver, or that it is executing in parallel somewhere across all the executors. That is what I mean by the application being split into two parts. There are other constraints as well: some stages must finish before other stages can begin, and all tasks of a stage must finish before the next stage can start, and all of these have a role to play in determining the scalability and performance of your application. To really tune your application you need to understand its structure.

When I said Sparklens is a profiler, that was a little bit wrong; it's actually the inverse of a profiler. Typically, when you profile an application, the profiler tells you where the CPU is being used and asks you to do something about it. Sparklens works the other way around: it tells you where something is not being done. You have a resource you're doing nothing with; how can you make use of this CPU that you have allocated and paid for but aren't using? So, sticking with the profiler analogy, one way to improve the efficiency of any Spark application is to look at where your executors are not doing anything, then work backwards and ask how you can make them do something. If you can make sure all of your executors are doing some work all the time, you get a very efficient Spark application. In some sense, optimizing Spark is a manager's job: just be a good manager.

Okay, this is I believe one of the most important slides, and I'll keep referring to it over and over in the presentation, because "doing nothing" is what we are going to focus on to understand how Sparklens, and Spark itself, works. On the y-axis we have the driver and cores one, two, three and four; on the x-axis we have time. If you have looked at the Spark UI this is probably very familiar; if not, just take a look. These are the resources, and the green boxes are the tasks scheduled on the different cores. For the purpose of this discussion I will not distinguish between cores and executors: ten executors with four cores each is the same as four executors with ten cores each. It does have implications, but we can ignore them for now. You can also see stage one, stage two and stage three, which is where these tasks are being scheduled. All the gray area you see is compute time that is getting wasted, and we'll try to analyze it, split it up, understand what is going on, and see what strategies can be used to minimize this gray area. That is when you get an excellent, fully tuned Spark application.
All right, the first area is driver-side computation. As I said earlier, the structure of a Spark application is such that when the driver is running, no stages are actually running, so all this orange area is there because the driver is doing some work. For example, say you have 100 executors running, your job takes 10 minutes, and for five of those minutes some computation ran on the driver. Those five minutes effectively get multiplied by all 100 executors, and 500 minutes out of your thousand compute-minutes are simply wasted, because there is no work for those executors to do. So the first principle here is that if you can minimize your driver-side computation, things will become much faster.

What do we do in the driver? One thing is file listing. Especially with large tables that are typically partitioned by date, we have seen thousands, tens of thousands, even a hundred thousand files being listed on S3. This is not much of a problem on Hadoop, because NameNode operations are fast, but on S3 the listing can take a lot of time, and at Qubole we have invested heavily in making file listing better for S3. The second place where I've seen a lot of driver time is loading Hive tables. If you're writing to Hive tables from Spark, Spark tends to write to a temporary directory and then, once the whole computation is over, copy the files from this temporary location to the final location where the table lives. The problem is that on Hadoop this movement happens through a file system API called rename, and rename on Hadoop (HDFS) is a metadata operation, but rename on S3 is a physical operation: it copies the file to the new location and deletes the original. So instead of being a constant-time metadata operation that just updates something in memory, it becomes an operation that depends on the size of the data. We have done some work at Qubole to do these writes in parallel, doing the copies with multiple threads so some latency is hidden, and we have also looked at writing directly to the table instead of going through the temporary location, while ensuring things are cleaned up properly in case of failures. The third pattern, which many people use quite innocently, is taking a DataFrame, calling collect on it, and starting a for-each loop. What this does is pull all the data from all the executors into the driver. Usually you'll see an out-of-memory failure here, but even if you don't, this causes all the computation to happen on the driver and leaves the executors completely idle, to the extent that I have seen people calling a REST API from the driver, one call per record, to update some external system. One more pattern I've seen, typically with PySpark, is converting the DataFrame to pandas. A Spark DataFrame is naturally a distributed abstraction, but the moment you convert it to pandas you get a data frame that lives only on the driver, and again the computation happens only on the driver and you're not using the resources available to you.
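Here is a hedged sketch of that last anti-pattern and one distributed alternative. The endpoint, schema and input path are invented; the point is only the shape of the change, moving per-record work from the driver to the executors.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("driver-antipattern").getOrCreate()
df = spark.read.parquet("s3://bucket/events")               # hypothetical input

# Anti-pattern: collect() drags every row to the driver, then a Python loop
# calls a REST API one record at a time while the executors sit idle.
# for row in df.collect():
#     requests.post("https://api.example.com/update", json=row.asDict())

# Better: push the work to the executors; each partition reuses one session.
def post_partition(rows):
    import requests                       # assumes requests is on the executors
    session = requests.Session()
    for row in rows:
        session.post("https://api.example.com/update", json=row.asDict())

df.foreachPartition(post_partition)

# Likewise, df.toPandas() collapses a distributed DataFrame onto the driver;
# keep the computation in Spark (or sample first) unless the result is small.
```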
The second reason we see this wastage is not having enough tasks. If you look at stage two, core four doesn't have any work: if you have four cores available and you only give Spark three tasks, one of the cores gets nothing. In stage three, similarly, cores one and four have no tasks. If there are no tasks, there is no way that compute can be used, so be very sensitive to whether there are enough tasks available for Spark to execute; otherwise that compute is not working for you. How do we control the number of tasks? Several things. If you're running on HDFS, the HDFS block size is one parameter: the smaller the blocks, the more tasks. Similarly, on S3 you can use the min and max split sizes, which define the granularity of each task. Then spark.default.parallelism is another parameter; it's a property, and at runtime, if your code refers to it, you get the number of cores currently available to the Spark application. This property varies over the duration of the application, so if whatever you're doing requires it to be high, set it to a high value. The fourth parameter, which tasks are very sensitive to, is spark.sql.shuffle.partitions: any time you do a shuffle, this parameter is used, and the default value is 200. If you're using a lot more than 200 cores, you may want to revisit it and increase it, at least to the point where it matches the number of cores you have given your application. The last one is repartition, a function available on DataFrames, which lets you convert any DataFrame into the number of partitions you need. Again, maybe you developed the application on a staging cluster and picked a value that was right there; when you run it on a larger cluster, that value can bite you later. So understand the context in which your application is running and tune it for that.
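As a quick illustration of those knobs, here is a minimal sketch; the values are illustrative, the path is invented, and the right numbers depend entirely on your cluster and data. For DataFrame reads, `spark.sql.files.maxPartitionBytes` is used here as the analogue of the min/max split size the talk mentions; that substitution is my assumption.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("task-count-demo")
         # Shuffle stages get this many partitions; the default of 200 is too
         # low if you have, say, 800 cores available.
         .config("spark.sql.shuffle.partitions", "800")
         # Task granularity for splittable files (assumed analogue of split size).
         .config("spark.sql.files.maxPartitionBytes", str(128 * 1024 * 1024))
         .getOrCreate())

df = spark.read.parquet("s3://bucket/large-table")           # hypothetical path

# spark.default.parallelism reflects the cores currently available; code that
# hard-codes partition counts should usually key off it instead.
print(spark.sparkContext.defaultParallelism)

# repartition() controls the task count for one DataFrame; a value picked on a
# small staging cluster (e.g. 10) can cripple a large production cluster.
df = df.repartition(800)
```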
The third reason for wastage on the executor side is skew. Skew basically means some tasks take a lot longer than others. If you look at stage one, core one has a task that takes, say, two units of time, whereas cores two, three and four work for only one unit. The way Spark is structured, until the parent stage finishes, the child stage will not get scheduled, so it is not the average task time in a stage that matters but the worst-case time that impacts the total runtime of your application. If you can reduce skew, you get something even better. Skew exists in the data: it happens because some keys or some partitions have a lot more data. For example, say you have sales data and you're computing sales by city; sales in Delhi or Bangalore are obviously going to be much larger than in Kanyakumari. As the data gets lopsided and much more of it lands in one partition, it is the runtime of the one executor processing the records of that partition that determines how long your application takes, so handling skew becomes very important. One way to handle this particular case could be that instead of joining on city, you join on pin code, which is far more uniform, and once you have data at the pin-code level you do a second-level aggregation up to the city level. There are ways to deal with skew, but they depend on the nature of the data you're working with.

That brings me to the notion of the critical path. If you follow the arrows, that is what defines the critical path: all the time spent in the driver, plus the time spent on the largest task in each of the stages. This is slightly wrong in the sense that some stages can run in parallel, so the actual computation requires taking the max across parallel stages, but the idea is the same. What the critical path tells you is the least amount of time your application can take to finish, irrespective of the number of executors: even with infinite executors, there is no way the application finishes in less than the critical time. The good part is that from a single run of an application you can compute this number, and once you know it, if you're close to it you need to go back and work on your application; if you're far from it, you still have scope and can add more executors and see improvement, but if you're close, adding executors will not help at all. The logic is as follows. Adding more executors will not change the time spent in the driver; if anything, the driver might have slightly more work, but it cannot have less, so driver time is part of the critical path. Second, unless you change the distribution of the tasks, the largest task of a stage cannot be made smaller by adding executors; adding executors only helps if you have more tasks than cores, so that waiting tasks can also run in parallel instead of queuing for their turn. Using this, we can say that a Spark application can never run faster than its critical path.

So what have we learned so far? A Spark application cannot run faster than its critical path, no matter how many executors you give it, and the way to make a Spark application efficient is to look at three areas: reduce driver-side computation, have enough tasks for all the cores, and reduce task skew. If you cannot do any of those, one way to reduce the wastage is to reduce the number of executors; with fewer executors you get much tighter packing of tasks, and you'll probably get much more bang for the buck.
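To make the two numbers concrete, here is a toy computation of the critical path and the "ideal application" time, under the simplifying assumption that stages run strictly one after another (the talk notes that parallel stages need a max over branches instead). The task durations are invented; Sparklens derives the real ones from Spark's event listener.

```python
driver_seconds = 300
stage_task_seconds = {
    "stage1": [120, 60, 60, 60],   # one straggler -> skew
    "stage2": [90, 90, 90],        # only 3 tasks for 4 cores
    "stage3": [40, 40],
}
total_cores = 4

# Critical path: driver time + the largest task of every stage. No number of
# executors can beat this.
critical_path = driver_seconds + sum(max(t) for t in stage_task_seconds.values())

# Ideal time: driver time + total task time spread perfectly over all cores,
# i.e. no skew and always enough tasks.
ideal_time = driver_seconds + sum(sum(t) / total_cores
                                  for t in stage_task_seconds.values())

print(f"critical path: {critical_path}s")   # 550s for this toy data
print(f"ideal time:    {ideal_time:.0f}s")  # ~462s for this toy data
```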
That concludes the theory behind Sparklens; in the next few slides I'll talk about Sparklens itself and how it can be used. What is Sparklens? Sparklens is an open-source Spark profiling tool written in Scala, and it can be used with any Spark application. By "any" I mean it doesn't matter whether you're on Cloudera, Hortonworks or EMR, or even developing on your laptop; you can just use Sparklens to understand how your application is performing. It helps you tune your applications by making it easy to spot opportunities for optimization, and those opportunities are the ones we discussed: driver-side computation, lack of parallelism, and skew. Apart from being a profiler, it also has prediction capabilities: it has a built-in scheduler simulator that lets you simulate how your application runtime and cluster utilization would change if you increased or decreased the number of cores. That is quite useful, because if your job takes one or two hours and runs on 100 executors, it is very hard to experiment by running it on 500 nodes and then 20 nodes; it just takes time and money. Being able to predict from a single run of the application is typically very useful.

Now the example. This happened sometime last year: there was a customer POC, we got 603 lines of Scala code, and somebody said, "this is taking a lot of time, can you optimize it?" The point to note is that we knew nothing about this code, nothing about the schema or what the person was doing; there was too much context in that code for us to understand all the details. What we'll do here is walk through how the tuning actually happened, and see how we can tune without knowing too much about the application. This is the first pass: we ran Sparklens on the application, and this is what the Sparklens report shows. The application takes 158 minutes to run; 41 minutes are spent on the driver and 117 minutes on the executor side. It also shows the critical path, which is 127 minutes here, and what that tells me is that if I add more executors, the latency will go down, because we are not close to the critical path yet. The last thing it reports is the ideal application time: the ideal application is defined as the time the application would take if there were no skew, all tasks were uniform, and there were enough tasks for every executor; essentially, if this were the best possible application in the world, scaling linearly on the executor side, how long would it take? That number is 43 minutes. What this tells us is that skew, or at least a lack of tasks, is making this application slow, and we should figure that out and make the appropriate changes.
So we started looking at the results, and one of the things we found was that this application had too many stages, almost 700. Typically, when we profile applications, 30, 40 or 50 stages is a usual number; 700 is very large. One thought that came to mind was that there could be some kind of loop where this time was being spent. We started looking at the code and found that there was a write to a Hive table, and instead of writing in parallel, the code was filtering by each partition and doing a write one partition at a time. This seemed simply wrong: Spark is designed to write to all partitions in parallel, so why was this code doing this? We changed a couple of lines of code so that a normal Spark write would work and write to all partitions in parallel.
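The original code was Scala and is not reproduced in the talk; the following PySpark sketch is only a hedged reconstruction of the shape of the change, with invented table names.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("hive-write-demo").getOrCreate()
df = spark.table("staging.events")                           # hypothetical source

# Anti-pattern: one write per partition value, each launching its own stages,
# which is how a job ends up with hundreds of tiny stages.
# for day in [r["dt"] for r in df.select("dt").distinct().collect()]:
#     (df.where(F.col("dt") == day)
#        .write.mode("overwrite")
#        .insertInto("warehouse.events"))

# Fix: let Spark write all partitions of the target table in one parallel job.
(df.write
   .mode("overwrite")
   .partitionBy("dt")
   .saveAsTable("warehouse.events"))
```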
What we saw in the second pass was that instead of 158 minutes, the application took only 26 minutes. That was a good improvement, and the driver time also came down drastically, from about 40 minutes to about two minutes; the total time was just 26 minutes. So from 158 minutes we came down to 26. What is interesting to note is that the critical path is now 25 minutes, which means that if I add more executors at this point, I really cannot expect the application to do better than 26 minutes. But looking at the ideal application time, which is about five minutes, four minutes and 48 seconds to be exact, there is still scope for improvement: there is some skew, probably some lack of tasks, that we need to look at, and if we can fix those, instead of 25 minutes we could probably bring it down to five or ten minutes, some lower number. Before debugging further, one thing we noticed is that Sparklens also reports how much wastage is happening on the driver side versus the executor side, and here 91 percent of the total executor time is being wasted. We didn't know why yet, but with wastage that large we should look at it and try to minimize it. As I mentioned, Sparklens also has a simulation component. The yellow row, 100 executors and 26 minutes, is the real number; everything else is simulated, and you can easily see that adding executors, going from 100 to 200 to 500, does not bring the computation below about 25 minutes in the simulation. Going the other way and reducing executors, even down to 50, the total time is still only 28 minutes, just two minutes more than the current time. So you get a nice trade-off between how much compute you want to use and how much latency you are willing to tolerate. On the other side you also have a utilization metric, which tells you how much of the cluster is actually being used; here we are only using eight percent of the cluster with 100 executors, and we should probably target much more. One caution: by utilization I don't mean CPU utilization per se; all it means is that a task was scheduled on that particular core or machine. It is possible the task is IO bound rather than CPU bound, so CPU utilization is not what this measures; it only means something was scheduled.

There are a lot of other metrics Sparklens provides beyond the top-level driver and executor utilization we discussed; it also provides metrics per stage. For every stage we have, for example, the wall-clock percentage, core compute hours, task count, PRatio, and so on; I'll explain these numbers. What is important to notice is that if you look at stage 33, 85 percent of the time is being spent in that one stage. So instead of looking at all the stages, we can very easily narrow down to the few stages where most of the time is being spent and focus on what is going wrong there. In this case we see that the number of tasks running in this stage is only 10, whereas we have 800 cores, which is huge wastage; if we could figure out why there are only 10 tasks, we could probably increase our utilization. In terms of metrics, the key ones printed are: the wall-clock percentage, the time spent in the stage relative to the time spent in all stages, which helps you narrow down to the few stages worth investigating; the PRatio, which is about parallelism, essentially the number of tasks in the stage divided by the total number of cores available to it, which you typically want to be around one or two, and if it is lower than one, that should ring a bell that you're not using all the cores; the task skew, which tells you the ratio between the largest task and the median task, giving you a sense of how much skew there is in the data and whether it's worth going back to investigate; and the OI ratio, which is output bytes to input bytes for each stage. Typically you expect that as you move from one stage to the next, the amount of data reaching the next stage reduces, so if that isn't happening, maybe something is wrong with the logic and you should look at it. Coming back to where we started: during this job we found that 85 percent of the time was spent in a single stage with a very low number of tasks, and when we looked into the code we found that repartition(10) was called somewhere, and that was what was resulting in 10 tasks. Most likely it was picked at some point on a staging environment and the same code made it to production.
So we changed that, and we also changed spark.sql.shuffle.partitions, as I mentioned earlier, from the default 200 to 800. When we ran it after these changes, the application finished in about 10 minutes. We started with 158 minutes, made a couple of changes, and brought it down to about 10 minutes. Two things are interesting to note: the critical path at this point is about seven minutes, and the ideal application time is also about seven minutes. What that tells us is that adding more executors is not going to help; we are performing in a regime where there is hardly any skew. This is essentially the definition of a highly optimized application: if your critical path and ideal application time are the same, you have achieved nirvana in the Spark world. Even if the prediction isn't exact, this is the point where you know your application is about as scalable as it is going to get.

With that, I want to come to the limitations. Sparklens is a model built by looking at one run of a job, and it predicts the behaviour of the same job when you run it with a different number of cores. But it is still a model, and there are second-order effects it does not take into account. A few of those: executor or driver GC, because if GC kicks in on the driver or some executor during the profiled run, it can throw the model off, and when you run with a different number of cores the behaviour you see can be a little different. Shuffle service performance varies with the size of the cluster: a 10-node cluster has different characteristics from a 100-node cluster, so that can become a bottleneck or limiting factor. Again, when working with S3, throttling, network bandwidth and CPU contention are other factors, beyond just the availability of tasks, that can limit the scalability of your application. Those still have a role to play, but in general, if you stay in a reasonable range, not going from 10 to 10,000 executors, you will find the numbers Sparklens predicts are pretty useful. One good thing is that when you run Sparklens it also reports its own error, saying "this is the actual time and this is what I predicted it should be"; if that gap looks fairly large, take the results with a pinch of salt. One more thing: if you use larger executors than the ones used for the profiled run, the plan itself can change, in the sense that broadcast joins become feasible, so Spark might change the plan to use broadcast joins, in which case the structure is not comparable and the performance Sparklens predicts will probably not match. The last one is spark.default.parallelism: since that number depends on the actual number of cores, if your application uses it, it will change as you move to a different number of cores, and hence the prediction will also change. These are some of the reasons the output of Sparklens could be wrong, but that's fine; I generally don't think of the output of Sparklens in terms of right and wrong, but in terms of whether it is useful: does it guide you, are you doing trial and error, or do you have some idea or direction about what is important, what to look at, what to focus on, and how to navigate your way around tuning your Spark application?

In summary: a Spark application cannot run faster than its critical path; a Spark application can be made efficient by reducing driver-side computation, having enough tasks for the cores, and reducing task skew; and if completion time is not an issue, by reducing the number of executors. This is open source, and there are only two parameters: when you run your spark-submit, if you add --packages and the spark extra-listeners configuration, then when the job completes you will see all this output from Sparklens. The code is available if you want to try it out, contribute or change it; everything is there.
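For orientation, here is a hedged sketch of what that wiring might look like from PySpark. The Maven coordinates and the listener class name are recalled from the Sparklens README and may differ for your Spark and Scala versions, so check the project page; the same two settings can also be passed to spark-submit as `--packages` and `--conf spark.extraListeners=...`, which is what the talk describes.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("sparklens-demo")
         # Coordinates below are assumptions; verify against the Sparklens README.
         .config("spark.jars.packages", "qubole:sparklens:0.3.2-s_2.11")
         .config("spark.extraListeners",
                 "com.qubole.sparklens.QuboleJobListener")
         .getOrCreate())

# ... run the application as usual; the Sparklens report (driver vs executor
# time, critical path, ideal time, per-stage metrics, simulations) is printed
# when the job completes.
```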
That's all, thank you. We have time for questions.

This is useful for a normal Spark application, but how do we do it for streaming applications? — I think someone from Intuit asked for this and they are working on a PR, so hopefully there will be something; it's not there right now. One way to do it is to manually stop the application after one run and see what kind of behaviour you get. Even with a stopped application it should be able to produce a report, and assuming the application and data characteristics are not changing, the prediction you get about how many extra cores to add should still be useful information, even from one run on one data source.

Those who are leaving the auditorium, please make sure you drop your feedback forms at the help desk; there is a box especially kept for feedback forms.

Can you please share some best practices for optimizing a Spark app that does IO? I have to connect to a REST API, and there are a limited number of outgoing connections in the connection pool; can you share best practices to reduce this queue in those scenarios? — If you look at S3, it is a REST API, so there is no problem in principle with connecting to a REST API. The question is whether you are doing it from the driver or from the executors. Executors are scalable, and if your service is scalable, technically you should be able to scale to the point you want. So the limitation is not on the Spark side; as long as your REST API is being called from the executor side, it is essentially the limitation of the API itself.
How do you calculate the critical path number and the ideal application number? — Sure. The critical path is essentially the time spent in the driver, because adding more executors will not magically make the driver run faster, so that time does not change. The second thing we add is, for every stage, the largest task, because that task is not going to run faster if there are more executors. There is a little more of a caveat, in that some stages run in parallel, so you have to actually know the DAG and compute it that way, but that is the idea. The ideal application time is essentially all the time spent across all the tasks in a stage divided by the total number of executors, which gives you a uniform, average number; ideally, that average is what happens, every executor gets enough tasks and they run all the time.

In a Spark job, the speed sometimes also depends on file sizes, right, if there are too many small files or very big files. Does Sparklens also tell you that your files are too small and should be made bigger, so there is less overhead, or something like that? — It depends on whether the file format is splittable. If it is, then even with a big file it is not the case that one executor has to work through the whole file. And the second part, yes: if you are spending a lot of time in one task on one executor and there is skew, Sparklens will not report it at the file level, but it will report that some tasks are very large. If you know the stage and look at the code, hopefully you can tell that the stage is about file IO; that correlation you have to make yourself. For Sparklens it is just the tasks and their durations that it looks at.

Cool, thank you, Rohit. We have an open discussion at 5:30 on the first floor on driving the next agenda for the 5th Elephant 2019, and also community discussions, so if you're interested, please join us and take it offline. This completes day one in Audi 1; we'll see you tomorrow at 9 o'clock.