Thank you. Thank you. I'm excited to be here, and thank you to Aiden for the invitation, and to the data group here as well. Thank you for coming. I want to talk a little bit about data, which I think is preaching to the choir here; I'm sure everyone in this room is quite interested in data. Actually, let me take a quick survey: how many of you are currently employed as data scientists of some form or another? And the rest of you are interested in data science or want to get into it? Is that right? Great. Then maybe you could also consider our NUS master's programs. Just a quick pitch, that's all.

What I want to do is talk a little bit about the big picture of analytics and what we're doing, but then talk about how we should actually be more responsible in the way we do analytics. If you're familiar with analytics, maybe you'll get a new perspective on this. If you're not, just keep it in the back of your mind as you learn more about analytics: what we can do, what we interpret, and how we can use the data. That's the motivation for this.

And just so you know, I love data. After this talk it might seem like I don't, but that's not the case. This is one of my favorite visualizations, from a scientific article looking at the flavor network of food: how recipes and their ingredients relate to each other. They analyzed several hundred thousand recipes worldwide, looking at the ingredients and the flavor networks. The ties show when ingredients are connected to each other: you see olive oil is connected to a lot of items, garlic and onion are commonly used and there's a strong tie between them. It's a really nice use of data; you can immediately see these trends and patterns in recipes. You can look up the article for more detail.

But since I've alluded to it, let me explain my journey to where I am now doing data, because it says something about where the industry is going, and we're seeing a lot of people following a similar journey. As I mentioned, I actually started in the tech sector. I got my first computer very early on, on the Atari systems way back then, and I thought computers were great. So I decided to go to MIT, do computer science, do engineering. This is actually not my first autonomous robot; it's already the second generation. If you look carefully there's a Compaq iPAQ on it, really the old days, and this was the latest thing. This is the only picture that exists, because back then we didn't really have digital cameras. That tells you something.

So I got interested in very technical, computer-related things, and from that experience I formed a startup with some friends doing computer graphics for mobile devices, the so-called smart devices back then, those Nokia bricks running Symbian. Symbian 6 was really cool, but who even remembers Symbian 6 anymore? That dates me, as does Nokia. During that experience, one of the key takeaways was that I was actually applying a lot of my technical engineering skills to looking at business, looking at business data, because we had what we thought was a really great product.
I was a young engineer, probably a bit more confident than I should have been, thinking: we'll build a great product and they'll come. We built a great product and they didn't come. So I said, okay, let's analyze the data: where's the market, what's the estimated demand, and all these things. I applied all those mathematical models to business models, and I realized, wow, this is actually really hard. Yes, we can do some simple aggregation, some pivot tables in Excel, but anything beyond that is actually quite complicated. And the reason is that social systems are much more complex than engineering systems. They are more amorphous; the boundaries are not as clear. That's going to be an ongoing theme today, and we'll talk about how we can deal with it. It's a little naive to look at a social system as if it were an engineering system; we'll miss things and sometimes get them disastrously wrong when we do.

As mentioned, I'm now in the Department of Information Systems, and I also teach at the NUS Business Analytics Center, which is a cross-disciplinary center. We have a master's in business analytics and an undergraduate degree as well. We've been having a good time teaching students, and I see some of our current and former students here tonight. We've had a lot of fun looking at data from various companies in Singapore and around the world. So that's my NUS life. After working for a few years I decided to go back to school, to Harvard, just a couple of streets down from MIT. My research now tends to focus on social networks and the social sciences more broadly: media, retail, consumer behavior. And one of the big things I push is causality. I tell my PhD students that if they can address causality, they can graduate. I think one of my PhD students is here; how far are you from causality, or from graduating? It's a hard problem, and it's definitely not easy, but the value is quite big, so we keep pushing down that route.

I now teach a lot of CRM analytics and healthcare analytics, and I do a bit of analytics consulting and talking to companies. The companies we've talked to include media companies; Viki is one of the partners we've been working with quite a bit recently, and Facebook. We do a lot of retail: the Wing Tai Group, which does a lot of fashion retail, Fox, G2000, all of those brands. IT and tech: Singtel, HP, Lenovo, Autodesk. And we've been working with a lot of the newer platforms that do microfinance or crowdfunding and so forth. It's really broad; anything that deals with consumer data is something I'm interested in. This is also an invitation: if you're working with companies where you think there could be an interesting collaboration, we can talk about it and see what we can do.

One of the reasons analytics is becoming so popular right now is simply that there's so much data out there, and people ask: what can we do with all of it? Traditionally, in the 80s and 90s, it came from POS systems. Then e-commerce took off, then we had GPS data, then a lot of forex and trading data. And now we have even more data from macroeconomics, like Singapore inflation rates.
And then the big ones came in: Facebook, Instagram, Twitter, Friendster, MySpace, and so forth. These brought us so much data, and again people ask, what can we do with it? The next big jump, I would say, was mobile devices. A telco like SingTel pretty much knows your location almost 24-7: it knows you're here, knows who else is here, knows where you're going, how long you spend at work, which canteen or restaurant you're visiting, and so forth. So much data. IoT is bringing that to a whole new level. And then of course there's a lot of Smart Nation data, the ERP, the government launching the new satellite-based ERP, which will also generate a lot of data, EZ-Link, and so forth.

The examples I'll give today are on social media analytics. I'm going to talk about causality, but I'll apply it to social media because there's so much interest in it, and I'll talk a little about our research and how it can bring some causality into social media. Some of my former students looked at things like Twitter: how happy are Singaporeans? You might say, that looks kind of weird, why is the CBD so happy? People there are usually at work. This is actually a video, though it's not playing here, showing different times of the day and different levels of happiness. In the CBD it's usually in the evening that people are happy: work is over, they're out for dinner, they're drunk, they're tweeting drunk photos of themselves, everyone's good. And in the suburbs, in the residential areas, people are happier during the day; at night they have to manage their kids and maybe they're less happy. So you can do a lot of nice visualization.

But one of the poster children for analytics is crime forecasting. There's a product called PredPol, predictive policing, and they're selling it to many police departments throughout the United States and the UK and so forth, and they're trying to get into Singapore as well. The idea is that the police have these computers in their cars that say: this is where the next crimes are likely to happen. And you think, wow, this is fantastic, there's a social benefit to this, it makes sense. But a recent article published just a week or two ago said something like: police are trusting big data to stop crime before it happens, but is predictive policing biased, and does it even work? And it gets worse. Many civil liberties groups and racial justice organizations argue that the algorithms are perpetuating prejudice, and ask whether they're even worth the privacy issues involved. Basically, if the recorded data in the US shows, say, more crime attributed to black communities, the algorithms target those communities even more, reinforcing racial discrimination and pushing things we don't want in society even further.

And what's even worse, they find that these systems are not actually predicting the future; what they're actually predicting is where the next recorded police observations will be. Really, where the police will be, not the criminals, because the police are the ones recording the data; that's the data source. So if I were a criminal, I'd say: let's use this software. I want to know where the police are, and commit my crimes somewhere else.
You can imagine that criminals, even without the software, have some heuristic that does this. If I was selling narcotics in Holland Village and some of my friends got arrested, I'd stop selling there and go somewhere else. They're using these simple heuristics, right? And the software says: no, no, keep staking out Holland Village, and look, our crime rate declined. It's not that crime declined; it's just that the criminals got smarter than our algorithms and our predictions. So that's going to be an issue.

One of the most catastrophic cases, I would say, was Google Flu Trends. When it first came out in 2008, people said, my God, this is an excellent example of using big data to help society. Google Flu Trends said: using search terms, Google can predict the amount of flu in certain regions of the United States, and that helps the CDC and government officials run vaccinations, run treatment, and preempt outbreaks. Fantastic, right? Google Flu Trends can detect outbreaks days in advance, and they went further, even months before an outbreak, even a whole season before. That's what they claimed. The same newspaper, three years later: a study found Google Flu Trends consistently overestimated flu cases in the US, a failure that highlights the dangers of relying on big data technologies. And they go on to say it cost the government millions of dollars, sending vaccines to the wrong locations and failing to prevent outbreaks. What happened is that Google's intent was good, but they were trying to predict a social system, it went badly, and it was quite an expensive mistake. In their defense, they're getting better and they're aware of the situation.

So what I'm going to say is that one of the problems with big data is that it's an endless pit of correlations; a lot of these methods are based on correlations. I like this site that collects spurious correlations. What correlations can we find? Internet Explorer usage is highly correlated with the murder rate in the United States, about 0.99. You're so frustrated with the browser that you go shoot someone, right? Is that the conclusion we're going to draw? The number of people who drowned by falling into a swimming pool correlates with the number of films Nicolas Cage appeared in; maybe people drown themselves after watching Nicolas Cage films. US spending on science and technology correlates with suicides by hanging, strangulation, and suffocation. It goes on. The divorce rate in Maine is highly correlated with per capita consumption of margarine: keep feeding me margarine and I'm going to divorce you. The age of Miss America correlates with murders by steam, hot vapours, and hot objects. It goes on and on. This blogger just collects these really random pairs that happen to have very high correlation, 0.99 and above or so. If you put this kind of thing into your model, you might reach some really erroneous conclusions.

So the takeaway I want to give you is that when we look at data on social systems, it's really what I call a data iceberg. What we see in our data is only what's above the water. And if I see an iceberg above the water, there must be a lot more iceberg underneath.
Yes, that's true. So what do we see above the water? It's large; big data is still quite large even above the water, and it's cumbersome to work with. And at a minimum, if we see an iceberg floating above the water, we know to steer around it. It's sometimes sufficient for many of our purposes, and that's a good start. But many times, I would say, it's not; it can be disastrous. What's underneath the iceberg is much larger, and it's what we're not capturing about social systems. I don't know my iceberg science too well, but I think about 90% of an iceberg is actually under the water. These are the things our IT systems, or whatever platforms, are not capturing about how humans and our social systems actually behave. This is what we're missing.

So below the water there are a lot of unknowns. We don't see the data, and it's much larger. But the important part is that what's underneath is actually what helps us: it's what supports everything above it. If the underneath weren't there, the top wouldn't exist, and it has a certain structure so it can physically stand up; otherwise the iceberg would topple over. By trying to understand the process of how the ice forms, we can understand what's underneath the iceberg. And if we understand the process and the mechanism, we can map what's underneath and say: if this is what it looks like underneath, this is what we should see above the water. Then we check: do we see that above the water or not? If we do, we can conclude that's what's happening underneath. That's the analogy with data: the data is really just what's above the water. It's good, but we need a more holistic view of the iceberg to extract more value out of it.

For business analytics at NUS, we view it as pulling from three different pillars. Traditionally, of course, there's computer science: software development, algorithms, databases, data warehouses, and so forth. But we also need to borrow from statistics: statistical inference, can you believe the data, how reliable is the data, p-values and so forth, regressions, a lot of interesting tools. And what tends to be missing, what I observe a lot in organizations, is the social science side. The social science side tells you the process of the iceberg. Without it you don't know how people behave; you treat people as robots. To a computer, the number five could be five million dollars or five cats; it crunches the number the same way. Only humans know the difference: five cats you feed milk, five million dollars you keep in your pocket. That's the understanding we need to capture. So in the NUS program we try to give students all three pillars, to give a better understanding and to extract as much value as we can out of the data. If you have any questions, please just jump right in; we can discuss, or if it gets extensive we can take it offline after the talk.
So what I want to get to is: that's nice and all, but in practice, what can we really do? What tools can we use to get at causality? One of my views is that, unfortunately, as far as I know there's no automatic tool for causality. There's no black box; you cannot call up Microsoft and say, hey, can you give me the latest Azure causality box to plug into my data. It doesn't work that way. Azure makes great products, but it's just not technically feasible to do this. And so causality is going to be very valuable, and as data scientists, I would say that's what's going to keep us employed. If your boss can replace you with a Microsoft product, you're not so useful, right? But understanding the context, the process, the psychology, is what keeps us employed and gives us value that cannot be replaced. So given that there's no black box for it, causality is hard, but we should aim for it. It's not always feasible, but we try our best.

So what is causation? Before we get there, I want to say that a lot of machine learning systems are really effective at non-social systems. You want to do automatic detection of cancer from MRI images? Great, they can probably do pretty well. You want to extract features from an image? Great. But when you talk about human behavior, that's where the part underneath the water comes in. And we're assuming, of course, that if we're smart, the people we're analyzing are also smart. If you assume the people you're analyzing are not smart, that maybe says something about yourself. So machine learning is good for finding patterns, but I would say its usage in business contexts is limited, and the much more interesting questions come from trying to get causality, trying to work through these conceptual questions.

A common question is: are predictions causal? Many times, when I talk to companies and in my research, people confuse this; they think predictions are causal. If I assign a prediction to someone here, say the probability you come back to school for your PhD is 50 percent, how do I know that 50 percent is accurate without really testing it? I can assign any number; I could say 0.9999 for everyone, and that looks fine too. It's just as good as a random number. So predictions are limited, but sometimes they're the best we can do; I understand that.

Ideally, in causation, we're trying to draw the inference that X drives Y: X, and only X, and nothing else but X. Not because of various correlations, not because of some other confound we're missing. And this is really important, because then you can tell your manager: let me change X, and you can guarantee, with a certain level of certainty, that Y will also change. That's directly actionable. Think about Google Flu Trends: if this is what's driving flu, then by changing X I can actually reduce flu incidence in that region. It's not just a prediction that there will be ten cases of flu in this region; it's saying I want to reduce it.
But the problem with data is that a lot of the time we can only see correlation, or some form of correlation. Maybe it's not linear correlation; maybe it's nonlinear models, SVMs, whatever it is. It's more or less just the co-occurrence of two variables. And why do we see it? There are really three reasons for correlation in our data. One, of course, is the one we'd love to see: causality, X causes Y. But we can also get reverse causality: Y actually drives X, the opposite of what we think. This is really disastrous when you're advising your boss what to do, because you're saying that if I change X then Y will follow, but it's actually the other way around: you change X, nothing happens, or it's even worse, the opposite happens. And then there may be unobservables: something else outside the system driving both. These confounds, these unobservables, are below the water; you have no data on them. How can we capture that structure in our models? This gives us the broad outline of where we're going. If you take our classes we go into more detail on the methods and approaches; today I'm just giving you an introduction.

So what do we mean by causality? A lot of people say the gold standard for causality is A/B testing on systems, or clinical trials, and that's the idea. In clinical trials they usually do double-blind trials: if I'm testing a new drug, I don't tell the nurse administering it which one is the placebo, the sugar pill, and which one has the active ingredient. If we do this with two similar enough groups, one getting the drug and one the placebo, then we can say X cures Y, ceteris paribus, meaning everything else is controlled for. Two similar groups of people, and it's not the nurses, because a nurse might signal "this is the treatment arm," which changes your psychology and your reaction to the drug. So double-blind trials are very useful. But the most important part is: without X, Y does not change. We're trying to say X, and only X, and nothing but X.

The problem without intervention is that a lot of our data tends to be historical. If we can do A/B testing, and many tech sites can, fantastic. The way we do A/B testing also needs care, but that's a separate issue we won't cover today. What I'm going to focus on is what happens when we cannot intervene, cannot change things and try different treatments. What can we do to still get causality? And the majority of the time this is the case: your boss says we want to try a new product, a new scheme, some new venture; you haven't run the test yet, but you have to produce some number. Can we still get some understanding? So that's the focus: without intervention, using only historical observational data, can we get causality?

Is that going too fast? Everyone's dazed, amazed, lost, or sleeping? And if you yawn: there's a new study saying people with bigger brains yawn longer, so if you're going to yawn, make it a long one. Okay. So let's consider this issue, what I'll call the paradox of prediction and causality. Prediction, a lot of the time, is about minimizing error.
You can do cross-validation, you can minimize the error, and you get a good fit. But let's imagine we have this system, this theory. Say X is educational attainment, how many years of education you have, and it has a causal effect on Y, your earnings. The university would love for me to tell you this, because it says: come to our school, and two additional years of a master's program will increase your salary by, I don't know, 10K per month for the rest of your life. That's the conclusion we might draw. But the problem is: what if there's some other variable W, beneath the water, that we don't see? A confound that actually drives both. Maybe you're a really smart person. Because you're smart, you get admitted to these top programs, PhD programs, and so forth, and your smartness also drives your earnings: even without the education, you go to work, your boss sees you're really smart, and gives you a raise. So we want to say X drives Y, but there's this W beneath the water that we cannot measure and don't know about. What do we do?

Let's do a simple simulation. This is where the Python kicks in; it is a Python talk after all. We're going to simulate this network of causality, using pandas, SciPy, and NumPy. We generate 100,000 people, just 100,000 rows. We draw a random number W, so everyone is normally distributed on how intelligent they are, mean zero, though that matters less. Then we say X is 0.5 times W, plus some random noise; random noise happens. Then Y, your income, is a function of how smart you are, with a weight of 0.3, and of your education X, with a weight of 0.4, so education still helps you more than being smart, plus some random noise. We put it into a data frame, and we're happy.

Now I'm going to fit a linear model to predict income. Sorry, the screenshots are a little weird. Here's the linear regression. Of course, in reality I don't know what the correct model is, and in our data all we have are X and Y. So we run a naive model: a linear model of Y as a function of X. It spits out a lot of output, some fancy numbers, great. Then we run a similar model that includes W, so Y as a function of X and W (I had some transparency turned on, which is why you see a little background here). What I want is to back out the correct weights; let's focus on X for now. What should the coefficient on X be? In the correct model it should be 0.4 on X and 0.3 on W. So what do I get?
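Here is a minimal sketch of the kind of simulation being described, assuming pandas, NumPy, and statsmodels; the seed and the noise scales are illustrative, not taken from the talk's slides.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(0)
n = 100_000

# W: unobserved confounder ("smartness"), standard normal
w = np.random.normal(0, 1, n)
# X: education, partly driven by W, plus noise
x = 0.5 * w + np.random.normal(0, 1, n)
# Y: income, driven by W (weight 0.3) and X (weight 0.4), plus noise
y = 0.3 * w + 0.4 * x + np.random.normal(0, 1, n)

df = pd.DataFrame({"w": w, "x": x, "y": y})

# Naive model: in practice we only observe X and Y
naive = smf.ols("y ~ x", data=df).fit()
# Model that also includes the confounder W
full = smf.ols("y ~ x + w", data=df).fit()

print(naive.params["x"])          # biased upward, roughly 0.5 instead of 0.4
print(full.params[["x", "w"]])    # close to the true 0.4 and 0.3
```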
In the incorrect model, I get about 0.5 on X. Remember, in reality I don't have W. If I do have W, I get the correct values: 0.4 and 0.3. And this is very worrisome, because R, or pandas, or Python, or whatever tool you're using, will tell you something: the p-value is significant, very significant. So you tell your boss, you need to send me back to school, because it will increase my salary by, you know, 50%, and the boss says: no, you're just a smart guy, so go away, though maybe not anymore after you told me this. The key takeaway is that the correct model is the one that includes W; with it you can back out the correct weights, but if you don't have W, you get very biased estimators, very wrong values.

Now consider a slightly different variation. We still have education and salary, but now W is something that both X and Y cause. W could be the amount of money you spend on art. If I'm really educated, I appreciate the arts and I'm willing to buy really expensive art; if I don't understand Picasso, I think a Picasso looks like my child's watercolour work. But also, if I have more money, of course I can afford the Picasso. So both of them drive it. Similar setup: I generate X as a random variable; Y is a function of X with a weight of 0.7; and W is a function of X and Y, 1.2 times X and 0.6 times Y, and so forth. And again I'm trying to estimate Y.

So I run a simple model of just Y on X, and compare it to a model of Y on X and W. First, take X: the coefficient should be 0.7. Without W, which here is actually the correct model, we back out 0.7. With W, which is not the correct model, we get the wrong estimate. That part is fine; we don't expect the correct value from an incorrect model, it's gibberish. But then look at the R squared. R squared tells you how much of the variation in Y is explained by your X's; higher R squared means a better fit, and the best possible fit is an R squared of 1, a straight line through the data. What we see is that the wrong model has an R squared of about 0.5, and the correct model's R squared is actually lower. This is really problematic: if I'm blindly fitting, then R, Python, your favourite MATLAB program, Fortran, whatever it is, will tell me the wrong model is the better model. Your estimate is incorrect, and the model itself is incorrect, and then you tell your boss something ridiculous and he'll probably fire you. This one is known as a collider; the previous one was a confounder. Hopefully this piques your interest a little and helps you get that raise from your boss; a sketch of this collider simulation follows below. So now let's talk about some examples with my favourite thing, user-generated content on social media.
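Before the social media examples, here is a minimal sketch of the collider simulation just described, again with illustrative noise scales; the R-squared values in the comments are approximate.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(0)
n = 100_000

# X: education
x = np.random.normal(0, 1, n)
# Y: income, truly driven by X with weight 0.7
y = 0.7 * x + np.random.normal(0, 1, n)
# W: art spending, a collider driven by BOTH X and Y
w = 1.2 * x + 0.6 * y + np.random.normal(0, 1, n)

df = pd.DataFrame({"x": x, "y": y, "w": w})

correct = smf.ols("y ~ x", data=df).fit()    # correct model, collider excluded
wrong = smf.ols("y ~ x + w", data=df).fit()  # conditions on the collider

print(correct.params["x"], correct.rsquared)  # ~0.7, R^2 around 0.33
print(wrong.params["x"], wrong.rsquared)      # X badly biased toward 0, R^2 around 0.5
```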
The idea is that you can become rich by analyzing Twitter data. Mostly you're just making the founders of Twitter rich, though I guess Twitter's struggling now, so maybe you'll make the founders of Microsoft rich, since I think they may get sold to Microsoft or something. Anyway, the classic example is that we want to say UGC, user-generated content, predicts some outcome, and really we want to say the stronger thing: it drives it, it causes the outcome. Whether it's product sales, elections, whatever it is. You analyze your Twitter data and you say, people love the Samsung Note 7: I love the exploding battery, I can use it as a grenade, it's fantastic, so it's going to drive up sales.

A lot of people have looked at, say, the value of a Facebook follower. You have all these fan pages; Nike is putting in something like $900 million per year to maintain their social presence, their Facebook fan page. So they want to know: what is the ROI of each follower? And there are a few studies on this. Some say 360, some say 2293, some say 136, some say 294, some say zero. If this were a Kaggle competition, you'd just average all of these and get the right answer, right? Run a gradient boost, average millions of models together with all your competitors' models, and get the best answer; that's how Kaggle works. The important thing to get is that people find issues with a lot of these models because they're not able to draw any causality from them. How can you draw causality? Can you even do it? Otherwise you might as well draw a random number.

Another question was: can we use social media to predict election results? Trump followers are really vocal, and there were a lot of tweets about Trump; this was 2016, with women's rights and all of that, and the volume of tweets mentioning Trump was huge. Huge, right? I can't do the impression. But one academic did a survey of all the different studies and basically said: as far as we can tell, there's no way. The paper's title was something like, "I wanted to predict elections with Twitter and all I got was this lousy paper," which is to say, I did not find anything. That should be rather worrying for us.

Other things: the desired interpretation with, say, book reviews is that users read positive reviews on Amazon and then choose to buy the book. That's the assumption we want to make. But the problem is that there may be a reverse explanation, reverse causality: users who are happy with the product write positive reviews. How many people here have written a review on any website? A few hands, that's about normal. And how many people have analyzed review data? So the number of people who analyze the data surpasses the number who write reviews. That should give us some pause: if you yourself don't write reviews, why would you assume other people do, and use that as your analysis? It's always good to be reflective; if I don't do something, most likely a lot of other people don't do it either.
So there's some reverse causality going on here. And then there's also a confound: maybe the firm is making other efforts to promote the book, below-the-water data, like book signings, book tours and other things, which drive both reviews and product sales, but we cannot observe them. What else is happening? Other factors: we don't know how good the book really is. Maybe it's a really good book, maybe it's a really crappy book, but we can't tell from the reviews. That's another implicit assumption we wanted to make: that reviews and ratings imply some sort of product quality. That's not necessarily true. It may be publicity; reviews may not be the true opinion of people. In my class I show a couple of funny examples of book reviews on Amazon. Some people write very long, elaborate reviews, practically short novels, and there's a whole community of Amazon reviewers writing them; really they're authors, unemployed English majors trying to get publicity, and they usually get voted the most useful reviews. So there's a sort of sarcastic thing going on there. So reviews may not be the true opinion, and may not reflect the true quality of the product. Maybe the early adopters say, yes, the Samsung Note 7 is fantastic, and then the early adopters get blown up and say, okay, maybe not so fantastic. I think they said Samsung is losing more than 40% of its Note 7 customers, who will mostly switch to other phones, mostly iPhones, after this. And there's also selection bias: maybe only the people who are very happy or very unhappy with the product write reviews. I'm assuming most of you have bought products on e-commerce sites at some point, and the fact that you don't write reviews says something.

My time's running a little short, so I'll quickly give you a brief overview of different identification strategies. An identification strategy is basically asking: what methods can we use to get at causality? These are borrowed from the econometrics literature, so they'll look different from CS machine learning. First and most important: we must understand the context. Are we looking at social media? At cats? At election results? We must understand the process of this iceberg. Then there are five major techniques I'll briefly go over; I'm just introducing them, and we can discuss offline, you can Google them, or you can come take our classes, pay us money, and we will guarantee you can increase your salary. Not really. But honestly, outcomes for participants in the program have been very good.

In any case, let's first start with the data. A lot of analysis is done on what's known as cross-sectional data: each row is a user, and we have just one snapshot of them. That's not going to tell us too much, and you'll see why in a moment. Better data is repeated observations of the same user, so we see them over time, and over multiple users: we have n users and we observe them for t periods of time.
So then we have our matrix; it's an n-by-t matrix, large. And we can do this with our big data; we should do this with our big data. That's the cool thing about it: you're not running surveys where you ask people once a year, you can observe what they're doing every millisecond. So that's the data context.

I'm going to focus on linear regression models just for simplicity, to illustrate the point, but you can do this with more than just linear models. The first thing is that we assign a dummy variable for each entity, for each user. Why do we want to do this? Because there are a lot of time-invariant factors about that person, or about that book, that we cannot observe; there are many unobservables and we cannot separate them. For example, suppose you want to say women are smarter than men. Most people don't change their gender in their lifetime, but gender is also highly correlated with many other things, like height; maybe because women are shorter they conserve energy and more energy goes to the brain, we don't know. Many other things. In our data we cannot observe all of this, so we want to eliminate all the time-invariant factors, as much as we can. In CS terminology people sometimes call this something like autoregression, but the point for causality purposes is that we eliminate what we cannot observe.

What's left is the time variation. Any fluctuation over time in our data can help us explain the causal effect. If, say, the price of a book changes over time, we can ask what the effect of price is, across all the different books. We don't know the quality of each book, it might be good or terrible, but we can still estimate the effect of price on book sales. So that gives us an explanation from the time variation, and it also reduces reverse causality, especially when you use something like a lagged term.

As an example (my animation didn't quite work out, so I've simplified the model a little): sales of book i at time t is a function of a dummy for book i. If we have a thousand books, we have binary dummies: is it book one, no; is it book two, yes; is it book three, no; only one dummy is on for each book, and with n books you have n minus 1 dummies. That captures all the time-invariant factors. Then we focus on something time-varying, say ratings, with its beta in the linear model, and we use the rating at t minus 1, the previous time period, where a period might be a day or a microsecond, plus some idiosyncratic error term for i and t. So something like Sales_{i,t} = alpha_i + beta * Rating_{i,t-1} + epsilon_{i,t}, where the alpha_i are the book dummies. The intuition to take away is that when we do this, the rating comes before the sales, so looking forward in time the rating has an effect on sales; at the very least it reduces reverse causality. That's an easy takeaway, and most statistical packages support panel models; a small sketch follows below. The next step, which some of you may know as the difference-in-differences approach, is to compare before and after for a control group and a treatment group.
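Before moving on to difference-in-differences, here is a minimal sketch of that panel regression on synthetic data, assuming statsmodels; the column names, effect sizes, and panel dimensions are illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic panel: n books observed over t periods
rng = np.random.default_rng(0)
n_books, n_periods = 200, 20
quality = rng.normal(0, 1, n_books)                   # unobserved, time-invariant

rows = []
for i in range(n_books):
    rating = rng.normal(quality[i], 0.5, n_periods)   # ratings track quality
    for t in range(1, n_periods):
        # sales depend on book quality and on LAST period's rating (true beta = 0.5)
        sales = 1.0 * quality[i] + 0.5 * rating[t - 1] + rng.normal(0, 1)
        rows.append({"book_id": i, "period": t,
                     "sales": sales, "rating_lag": rating[t - 1]})
df = pd.DataFrame(rows)

# Book dummies C(book_id) absorb everything time-invariant, including quality
fe = smf.ols("sales ~ C(book_id) + rating_lag", data=df).fit()
print(fe.params["rating_lag"])       # close to the true 0.5

# Pooled model without the dummies also picks up quality -> biased estimate
pooled = smf.ols("sales ~ rating_lag", data=df).fit()
print(pooled.params["rating_lag"])   # noticeably larger than 0.5
```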
Remember, this is over time. We have two groups: one gets whatever the treatment is, say a price change, and the other does not, and we observe both before and after. What this does is control for time-invariant factors and also for common time-varying factors, like seasonality. Look at book sales around the holidays: people buy a lot more on e-commerce, and in general everywhere. If you don't have the control group, you don't know whether the lift you see is because of the holidays, or because of the price change, or because of whatever else you're trying to test. So that's the idea: the control handles time-varying factors such as holiday seasons.

For example, some studies compare Amazon versus Barnes & Noble for the same book. If you look at the same book on both sites, it controls for book quality; it's the same book, it doesn't matter where you buy it from. Is Barnes & Noble still around? Borders is gone; Barnes & Noble is still struggling, right? Major booksellers, anyway. So I look at sales of book i at time t as a function of whether it's on Amazon or Barnes & Noble, which is just a dummy variable, and of the ratings on that site. There may be fluctuations and differences between Amazon and Barnes & Noble in ratings, maybe people are more sarcastic on Amazon and more pleasant on Barnes & Noble, but that's captured by the site fixed-effect dummy. Then we multiply them together as an interaction term, which says that being on Amazon and having this rating gives an extra boost. Basically it gives a stronger causal estimate of the effect of the rating, compared to the rating on Barnes & Noble, where Barnes & Noble already has a baseline captured by the dummy. So that's one way of doing it.

My own research on this: I've worked with Facebook for quite some time now, and we look at how people make friends on social media. There's a big question there; everyone adds friends on Facebook, someone you know, someone you don't know, but we want to know how people actually make friends. If you analyze this naively, people argue: are Facebook friends real friends, or just pseudo-friends? Like my ex-students, right? So we looked at 1.3 million college students in the US. The treatment group is defined by a hurricane: a disaster hits five of the universities and really demolishes them, and we want to see what happens, how friendships change. That's our treatment group, and the control group is similar university students in non-affected areas. Visually, the big takeaway is that the number of friends doesn't change, but who you make friends with does. The red is the hurricane group and the blue is the non-hurricane group, and the vertical line is when the hurricane hits. Before, they're very similar; afterwards, they diverge. And that's a classic difference-in-differences approach.
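A minimal sketch of the site-comparison regression with the interaction term described above, on simulated data; the column names and effect sizes are illustrative, not taken from any real study.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical setup: the same books listed on two sites
# (amazon = 1, the other site = 0), each with its own rating.
rng = np.random.default_rng(1)
n_books = 500
quality = rng.normal(0, 1, n_books)

rows = []
for i in range(n_books):
    for amazon in (0, 1):
        rating = quality[i] + rng.normal(0, 0.5)   # noisy signal of quality
        # true model: site baseline 0.3, rating effect 0.4, extra 0.2 on Amazon
        sales = (quality[i] + 0.3 * amazon + 0.4 * rating
                 + 0.2 * amazon * rating + rng.normal(0, 1))
        rows.append({"book_id": i, "amazon": amazon,
                     "rating": rating, "sales": sales})
df = pd.DataFrame(rows)

# Book dummies control for quality, the amazon dummy is the site fixed effect,
# and amazon:rating contrasts the rating effect across the two sites
model = smf.ols("sales ~ C(book_id) + amazon + rating + amazon:rating",
                data=df).fit()
print(model.params[["amazon", "rating", "amazon:rating"]])  # ~0.3, 0.4, 0.2
```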
If we hadn't done this, there would also be seasonality effects, because every fall in the US, in September, everyone arrives and suddenly everyone makes friends with everyone. Then after the first exam you're only friends with the smart kids, and after the holidays you go back to being friends with the ones who gave you presents, and so forth. There's huge seasonality; that's why we do the diff-in-diff. It's a very powerful tool if it's feasible in your context. How am I doing on time? Okay.

A probably more common technique is known as an instrumental variable. This is getting a bit more technical. The idea is that since many things can be correlated with both X and Y, can we find a Z in our data that is correlated with X but not with the error term, not directly with Y? What Z does is isolate the part of the variance of X that is driven only by Z, so that what's left lets us say X affects Y, and not some reverse causality. Z captures that variation and isolates X. See what I write, not what I say; actually, see what I mean, not what I write or say. As an example, back to books: people sometimes use state taxes as an instrument for the effect of price. In the United States they started rolling out laws on sales tax for cross-border online purchases. On Amazon itself, before tax, everyone faces the same base price for the same book, but the tax adds extra variation to the final price. Does that affect sales of the book? You'd probably assume yes.

One of my favorite methods is regression discontinuity; my two favorites are DID and regression discontinuity. The idea is that it takes advantage of things like rounding mechanisms in ratings. Say it's the star rating on Amazon, and you believe ratings affect purchases: I read the rating, if it's five stars I buy, if it's four stars I don't. If that's the treatment, then any rounding should actually show up in sales, and it can't be explained by reverse causality, like people who liked the book, or people paid to review it, giving it a higher rating. Because of rounding, there should be a discontinuous jump near the boundaries of the rounding regions. Consider what it looks like: say these are the ratings between two and three stars, and this is the potential sales. Because of rounding, when you go from 2.49 to 2.5, the stars displayed on the website jump from two stars to three stars. A book rated 2.5 may have the quality of a 2.5, but to the potential buyer it looks like a three-star book. So if I look near these rounding boundaries and I actually see a jump in sales in the data, that tells me the ratings and stars do have an effect on buyers, helping them choose whether to buy the book. That's a very clever use of the rounding mechanism. And the regression model here, again, can be very simple.
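A minimal sketch of what that simple model can look like, on simulated data with illustrative numbers: the displayed stars round the underlying rating, and the jump at the rounding boundary is the effect we are after.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 20_000

# Underlying average rating, between 2 and 3 stars
rating = rng.uniform(2.0, 3.0, n)
# What the buyer sees: rounded to the nearest whole star (up at 2.5)
rounded_up = (rating >= 2.5).astype(int)

# True process: sales rise smoothly with the underlying rating,
# plus a jump of 0.3 when the displayed stars round up (the causal effect)
sales = 1.0 * rating + 0.3 * rounded_up + rng.normal(0, 1, n)

df = pd.DataFrame({"rating": rating, "rounded_up": rounded_up, "sales": sales})

# Regression-discontinuity-style model: a smooth trend in the rating plus the
# discontinuous jump captured by the rounding dummy
rdd = smf.ols("sales ~ rating + rounded_up", data=df).fit()
print(rdd.params["rounded_up"])   # close to 0.3 -> buyers respond to the stars
```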
We just have a binary variable R that says whether the rating is rounded up or not: one if it's rounded up, zero otherwise. If you ignore R at first, sales as a function of ratings is straightforward; it gives you the straight line. But if the rating is rounded up near a boundary, you should see a sudden shift. We fit that in the model, and if the jump is significant, it gives you confidence that people do care about the ratings. I think a lot of you working as data scientists probably have some sort of ratings variable in your context, so you can probably use this technique quite readily.

The final method we like to use is known as propensity score matching. This is about constructing a control sample. We have some treatment, say a price change; can we construct a group of people that is similar on every other dimension except the thing we're trying to test? That becomes our control group, and then we put it into our DID models or whatever other models we use. The score should include every other variable you believe affects the dependent variable, the outcome. Then you can add the treatment as a dummy variable in your main model, or do a simple t-test, or use various other ways to compare the treatment group with the control group. That's the idea of PSM. For the hurricane paper I mentioned, we used PSM to find the control group. In CS terms it's like a nearest-neighbour method: you define a distance and find the closest people on every other dimension except the one you're focusing on. And of course we can combine methods: as I said, we can construct the control sample using PSM and then conduct a diff-in-diff, or a linear model, or whatever you want to do.

So let me conclude. With icebergs, a lot of you probably thought I was going to go here; yes, I went there. We want to avoid a terrible disaster; that's why I'm saying we need to be responsible. Prediction has its use cases and its limitations, and we should interpret predictions appropriately. A prediction is not a causal model: we should not say "change X to drive Y." We cannot even say that what we estimate from a purely predictive linear model is the weight of X on Y; we can only say our prediction model says 50 percent, and that's it. We can't say much more than that. We only see the data above the water. And predictions are generally, I would say, fairly weak on social systems. On engineering systems they're fantastic; driverless cars, not bad. But for analytics, when you deal with people, with businesses, with anything that involves someone whose brain is smarter than a computer, they usually don't do so well. That's why I'm preaching responsible analytics. There's a place for every method; we just need to understand the limitations of each. And the most value really comes from adding the context. Again, this is why I think we'll all stay employed for quite a while.
Deep Blue, or whatever the latest big-name technology is, Microsoft is not going to be able to replace this part, because it's really missing the context. So we're good for a while. Thank you. You can try to find me on LinkedIn. Also, as mentioned, this is an invitation: if you're interested and have an interesting context, we can come in, take a look, and do some sort of research collaboration. We do quite a bit of this; it's how we build new insights into how people in industry are using analytics, as well as into fundamental human behavior. And if you're interested in our master's programs or our undergraduate programs, I see some of our undergrad students at the back; you can ask them how they feel about it. Maybe not right before exam time, when they might feel less positive about it, but that's selection bias in the reviews, right? Alright, thank you. I'll stay around afterwards for some discussion. Fantastic, thank you. I think someone mentioned the LinkedIn link isn't working; if you search with my photo, it's some version of this. So this worked? Without the QR code? Oh, that's strange. Okay. Thank you. Thanks.

Any questions for the first part before we go for a break and then move on to the next speaker? Questions, please?

The first question was about predictions: how bad are they, and how good can they be? So, to repeat the question in case I heard it correctly: what are some examples of good outcomes of prediction, and what are some bad ones? There are many cases, and I'll note that both of the cases I illustrated were actually considered very positive when they first launched. My general view is that prediction works pretty well in the short term, or rather, in the best-case scenario it works pretty well in the short term: you can make an immediate impact, maybe tomorrow. What's missing a lot of the time are the time-varying factors, the way the world around it changes. In the best case it works, but the worst case is quite damaging, and with all the big data hype we tend to take an over-optimistic view. There are good cases: PredPol, when it first came out, was very effective; Google Flu Trends saved the government millions of dollars at the time. It worked really well then; only in hindsight do we see that it didn't hold up. So yes, even in your job it will probably work; I'm not saying prediction doesn't work. But looking forward, will it continue to? Or will you run into some huge iceberg you don't even see? I'm not saying prediction is bad; I'm saying be responsible with analytics. That's the main point. Actually, there was a question over there. Yes?

The question: I'm interested in the regression discontinuity. First, does it depend on the kind of regression used, for example whether the regression is linear? And second, could you share some of the contexts where this model applies best? Yeah, so two questions. The first is
whether it only makes sense with a linear model, or whether you can use it in other ways: does it apply to models other than linear models? Yes. Actually, most of these strategies can apply to nonlinear models as well; you have to do a bit of math to modify them, but the intuition is the same. Say you have a nonlinear model with the discontinuity, some curvature, an inverted-U shape: maybe you believe stars have an inverted-U effect, where the maximum number of stars is not the most effective. You should still see some sort of shift because of the rounding; that's the theory. In the long run it may converge to something like a steady state; at first you'll basically see the effect if you run it, but at the end of the day it may not hold. Then you can do more analysis, more subgroup analysis perhaps. For example, one more dollar for someone with zero dollars is really valuable, but one more dollar for Donald Trump matters less, so you can analyze those subgroups separately and still use the regression discontinuity to help. Another example with stars: you have one star, two stars, three stars, four stars; maybe going from one to two is more effective, but four to five is not. So you can definitely do different tricks to segment it further.

The other question was about contexts. One I mentioned before is book sales on Amazon compared with the number of stars on Amazon. You can even do this right now, though Amazon now rounds to the half star, 3.5 stars, 4 stars, 4.5 stars, so the boundaries are a little finer; you can still do the same thing, it's just that before they rounded to the nearest whole star. Another example is government policies. I think there are some studies in India on the college entrance exam. Some people are near the threshold, say 89 versus 90, and anyone at 90 or above gets a scholarship while someone at 89 does not. You can argue that the difference between 89 and 90 is mostly random noise; maybe they're equally good and one of them just had a bad day and flunked one question. So they analyze whether getting 90, and therefore the scholarship, increased your chances of later-life success, compared with someone at 89 who did not get the scholarship but should be a similar person. There are a lot of really good examples; you can Google this, and there are good examples online using this method as well.

Let me try to recall others. Some of my previous students looked at crime rates in the US across state borders, because of guns. At a US state border there's no fence or anything, just an imaginary line running through, and you can move between cities freely. But when the gun laws suddenly differ from one state to the other, someone in the same neighbourhood, on the same street, can buy guns on one side of the street and not on the other.
Some of my previous students also looked at crime rates in the U.S. across state borders, because of gun laws. In the U.S., a state border has no physical barrier, no fence or anything; an imaginary line runs through, and you can move between cities freely. But when the gun laws differ between one state and the other, someone in the same neighborhood, on the same street, can buy guns on one side of the street and not on the other. The question is whether that increases the crime rate in otherwise similar towns, and also whether there is spillover: people buy guns in one state, cross into the other state, and commit crimes in a city far from the border. So there is an imaginary discontinuity that, in principle, should not matter.

There is another study in China that looks at the effects of air pollution on towns. There are a couple of cities divided by a river; on one side of the river live the richer people, on the other side the poorer people, but they get the same dirty pollution coming downstream. So when you see health problems, is it because of worse healthcare access, or diet, or education, or is it because of air pollution? The effect of the air pollution should be the same on both sides, even though the socioeconomic factors are different. That's another way of using a discontinuity. So there are a lot of examples; it's a very, very good strategy.

A bit of a related question: we had a talk about deep learning last session. Where does your talk leave a model like that, which is a big black box sitting in between, with input going in and output coming out? I don't see how your identification strategies fit into a deep learning kind of model.

Yeah, that's a good point. Deep learning is now being touted as the mother of all models: throw everything at it and just worship the output. Across the different methods and algorithms there is a whole continuum, from fairly automated, unsupervised approaches where you don't specify much, to very structured models. I'm saying that, partly to keep us employed but mostly because it gives a lot more value to the analysis, we need to go toward the more structured end. Something like deep learning is great: it can fit with 99% accuracy. But it doesn't know whether the number 5 it predicts means cats or millions of dollars. So even though the methods I'm illustrating use linear models, the intuition really carries across to any algorithm. Sometimes it doesn't even matter which algorithm you use; you can adapt it with the same intuition built in. That's where we're trying to go. Linear models are simply the most common, and if you have a good concept and a clear hypothesis, you can probably extract the effect with very simple methods, even a t-test if possible.

So should the methodology be to start with linear models or other simple models to find the important variables, and then throw it at deep learning?

Yeah, I would say yes. Usually you do an exploratory pass with simple models first. Even basic statistics: slice and dice your data, aggregate it, whatever it is, right? If there's an effect there, then you can dive deeper. You usually don't want to dive deep right away, because there's so much overhead in setting up your GPU cards for deep learning, massaging your data, and all that. So do the simple thing first. Linear models are fairly simple and let you think about the context a lot more; then you go beyond that to squeeze out the last percent.
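As a concrete illustration of the "do simple things first" advice, here is a minimal exploratory sketch: aggregate, run a plain t-test, then fit a simple linear model before reaching for anything heavier. The column names treated and y are hypothetical, and this is just one possible first pass, not the speaker's actual workflow.

```python
# Minimal "start simple" exploratory pass.
# Assumes a DataFrame with hypothetical columns:
#   treated - 0/1 group flag
#   y       - the outcome metric of interest
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf

def quick_look(df: pd.DataFrame) -> None:
    # 1. Slice and aggregate: does the raw difference even look interesting?
    print(df.groupby("treated")["y"].agg(["mean", "std", "count"]))

    # 2. A plain t-test on the group difference.
    a = df.loc[df["treated"] == 1, "y"]
    b = df.loc[df["treated"] == 0, "y"]
    print(stats.ttest_ind(a, b, equal_var=False))

    # 3. A simple linear model, which makes it easy to add controls later
    #    (e.g. "y ~ treated + segment + month") before anything heavier.
    print(smf.ols("y ~ treated", data=df).fit().summary())
```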
How are you going to show that what you find is actually causal, versus just correlated?

Yeah, that's a great question: how do we verify whether something is really causal or not? If you can close the loop, then you can test it. Let's say I build a model of customer churn on a website. When we were working with Viki we built customer churn models for them; we had many competing models and tried to build causality into them. Then we can run A/B tests on those factors: because we built X into the model and we believe X is causal, we can test on X and see how well we do using an A/B testing framework. With numbers that large you're not really sampling anymore, so you definitely can do that, yes. There is a separate issue that these linear models were generally developed for smaller samples, so if you throw really big data at them, everything comes out statistically significant. You can resample to deal with that, but the point is that it's not so much the machinery; building in the intuition is the big part of it.

I have a question. I believe you mentioned that these models work well on social systems. I'm just wondering whether a social system means human beings, or also animals and so on, because I'm thinking about, say, a pet food company where I want to analyze data on what cats like, for example.

Yes, absolutely, because when you ask what cats like, your data comes from sales of cat food, so you're assuming that cats cannot buy the food themselves; it's actually the pet owner buying, and the cats don't generate the revenue. If that's the assumption, then yes. But if you're saying that cats are actually tasting the food and deciding whether they like it or not, that's a different problem; and the last time I checked, animals actually have some foresight, they're a little more cognizant than we assume. So you're right, there are two streams: when you program computer software you pretty much know what determines what's going to happen, as opposed to something much more open-ended, and there's a whole spectrum in between.

Any more questions? I have just a couple of questions: how do you deal with social media data, with its volatility? What do you mean by volatility: like Snapchat going away? Sometimes when I look at something like social media or voting data, it's highly volatile and seasonal. Yeah, exactly. We can use all these methods. You can use a panel model to capture seasonality; in the panel model you can include dummies, for example, which capture quite a bit of the seasonality. Ideally you would have multiple years of data, which captures seasonality better, but even if you only have, say, three months, you can still include monthly dummies. So there are methods to deal with that. But I agree that with social media there's so much noise, so many spammers and other things, that it's really hard to measure, which makes these methods matter even more. We can take it offline; my PhD students also do a lot of work on social media using these techniques. Thank you.

I think that's all. Let's hear it one more time for our speaker, take a five-minute break, and be back together at 8:25 in this room.
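For completeness, here is a minimal sketch of the panel-with-monthly-dummies idea mentioned in the last answer. The long-format column names unit, month, x, and y are assumptions for illustration only, and the sketch assumes there are no missing rows.

```python
# Minimal sketch of a panel model with monthly dummies to absorb seasonality.
# Assumes a long-format DataFrame with hypothetical columns:
#   unit  - e.g. account or city id
#   month - e.g. "2016-01", "2016-02", ...
#   x     - the driver of interest
#   y     - the outcome
import pandas as pd
import statsmodels.formula.api as smf

def panel_with_dummies(df: pd.DataFrame):
    # C(unit) gives unit fixed effects; C(month) gives month dummies that
    # soak up common seasonal swings, so the coefficient on x is estimated
    # from within-unit, within-month variation. Standard errors are
    # clustered by unit.
    model = smf.ols("y ~ x + C(unit) + C(month)", data=df).fit(
        cov_type="cluster", cov_kwds={"groups": df["unit"]}
    )
    return model.params["x"], model
```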