All right, thank you. Good afternoon, everybody. I hope everyone had a good lunch, and certainly a good morning; there were a lot of thoughtful and provocative talks. So thank you all for continuing to be here. My name is Paul Mainz-Helsen, and currently I'm a data scientist in residence at Montaigne Ventures, an early-stage VC fund here in India. I've spent most of my professional life as a data scientist, but in the past year as an investor I've had the time and space to step back a little from the day-to-day work of data science and think about things at a higher level. I've also gotten to see and talk to a lot of startups doing really fascinating and promising work in AI, ML, and analytics. And I think the thing that characterizes the best data science projects I've seen is the way a team approaches the foundation of its data. Good quality data and a well-defined source of data are, I think, the critical factors for success. That's what I want to talk to you about today.

The thesis I'm going to present is that quality has less to do with the data itself. It's not a list of metrics or characteristics that you can rank, and it's not an intrinsic feature of the data. It has more to do with how we look at it, how we work to understand the data we have, and, especially, how it came into existence. Quality comes from the symbiosis between the methods you use and the data you have, from their appropriateness to each other. I'm borrowing from the statistician George Box here, who famously pointed out that all models are wrong, but some are useful. I'm saying that something very similar applies to data: whether it's good data has a lot to do with how you approach it.

I'll start with a story that motivates my message. During World War II, the US government set up a small organization of statisticians and mathematicians in Manhattan called the Statistical Research Group. The SRG worked on a variety of problems related to the war effort, and among these was a problem the Air Force was facing. Taking off from England, bombers were flying deep into Europe to bomb strategic targets in Germany. They were also getting shot down, and the Air Force was losing a lot of planes. So Air Force leadership decided that they needed to up-armor their planes. They also realized, as with most problems we face, that there was a trade-off: the more armor they put on a plane, the heavier it would get, which would decrease the distance it could fly. It could reach fewer targets, and it would be closer to the line of running out of fuel. So they decided they had an optimization problem, and they turned to the SRG for help. They said: look, we've been tracking all the planes that fly these routes and come back and land, and we've created a data set of where they've been shot, where they've been hit, where they're taking damage. Take it, please, and model it so we can understand where we're getting hit, and we'll focus our armor there. There was a statistician on the team named Abraham Wald. He took on the problem, looked at it, thought about it for a bit, and then came back and told the Air Force: here, on the engines. This is where we're not seeing hits. This is where we don't seem to be taking damage, and that's where we need to apply armor.
Because, you see, he realized that the data he was seeing was a sample of the world, and there was key data missing: namely, the planes that were taking off, getting hit, and crashing, never making it back to England. Those planes we don't see; they're not part of the data set. So we can make the inference that missingness corresponds to fatal damage, and instead of focusing our armor on where returning planes took the most fire, we should focus it on where fire does the greatest damage. That story characterizes, for me, what I want to talk about.

To convey my message today, I want to walk through four stories from my own experience, and I hope to persuade you that in data science work you should care very deeply about what we'll call the data generating process. Care enough about it to do the work that doesn't always get the glory and doesn't always feel like it's on the cutting edge: the work of understanding where your data comes from. And also do the work of communicating this, because I think both are necessary for real success. I'm going to spend most of my time telling you stories about my encounters with interesting data generating processes. These are not nicely wrapped stories; they're mostly open-ended, with no easy, obvious, clearly right answer. This is not an SAT or JEE exam, because I think that's how life is. Mostly I'll be walking through a sample of experiences of discovery as I experienced them. And I'd like you to walk away thinking "huh," not thinking, "okay, in scenario A I apply method X and in scenario B I apply method Y," but rather thinking, "huh, I see how I'm facing, or might be facing, something analogous in my own work that deserves further thought and attention." Or maybe, simply as intellectual stimulation, you'll see something in the examples I give that I seem to have overlooked, or a line of approach that might take you the distance toward a better solution.

So what is the data generating process? One of my professors in statistics in graduate school used to describe it as all the things that happen in the world to take data from the world into your data set. There are three general aspects I like to think about that define it a little more specifically. The first is the sampling strategy, which we often overlook; Chris's talk earlier this morning, which I thought was quite good, touched on it. These are the things we learn in statistics class about sampling strategies and how marbles are selected from a jar. They're actually really important but often overlooked. The second is the statistical data generating process, which is often what we're trying to model when we set up a regression, for example: we're explaining some effect or phenomenon using some other set of variables, explaining a Y with an X matrix (a minimal sketch of this idea follows below). And finally there's the data collection process, which is often not what statisticians are talking about, but is what we should be thinking about as software engineers, developers, and builders: the routes and technical procedures by which data reach a database. That's data engineering.
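To make that second aspect concrete, here is a minimal Python sketch of what I mean by the statistical data generating process. This is not from the talk's examples; the coefficients and variables are invented purely for illustration. We posit a process that generates y from X, and fitting a regression is an attempt to recover that process from the data it produced.

```python
import numpy as np

rng = np.random.default_rng(0)

# Posited data generating process: y = X @ beta + noise.
# In real work we never observe beta; we only observe X and y.
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([2.0, 0.5, -1.3])   # hypothetical coefficients
y = X @ beta_true + rng.normal(scale=0.7, size=n)

# The regression tries to recover the process from its output.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("true coefficients:     ", beta_true)
print("estimated coefficients:", np.round(beta_hat, 2))
```

The fit only tells you about the process you posited; everything the sampling strategy and the collection pipeline did to the data sits outside this equation, which is exactly why the other two aspects matter.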
So my first story picks up the thread from the Statistical Research Group, except we fast-forward to India in 2015. I think we often overlook issues with our data generating process not because they're incredibly subtle or hard to spot, as if we had to be brilliant. We overlook them because we're so excited about building solutions and finding a direct answer to our primary business problem.

In 2015 I was vice president of data science at housing.com in Mumbai, and we were excited about building an excellent real estate experience: instead of wasting your weekend stuck in traffic crossing the city just to see a flat that obviously turns out not to be right for you, you should be able to find your next home from the comfort of your home. A core part of that effort was bringing as much real estate inventory online as possible. In the summer of 2015, our ops team reached a million properties registered online, and that was a huge success for us. Up to the time I joined, the data science team had very little insight, really no insight, into that ops process, and paid very little attention to it, given all the other things we were thinking about. For example, we were thinking a lot about overhauling our recommender system, and we were really intrigued by all the ways that conventional recommender systems for e-commerce or media, say Netflix, weren't right for real estate. That's what had our mindshare, and that's where our data scientists were keen to start. Except we remembered that we needed to start with thorough data exploration.

Our exploratory work led us to looking at our data in this way. Each of these plots shows two curves for a city: in blue, the kinds of properties our users were searching for, and in green, the properties we had in our inventory for that city. Looking at these, we realized we had a problem. In cities like Mumbai and Delhi, you can see there's considerable area under the blue demand curve that doesn't fall within our green supply curve, which means we had users looking for kinds of flats that weren't in our inventory, that weren't on our site. And so it didn't really matter how sophisticated our recommender system got if we didn't have the right inventory to recommend. (A rough sketch of this demand-versus-supply comparison follows below.)

These plots raised a question we needed to answer about our data: is it representative of the market? Because if it is, then our users are looking for flats that simply don't exist, and we need to educate them about that. If you're looking for a particular kind of flat in Bandra West in Mumbai and it's just not there, it's good for you to know that, so you don't waste time. On the other hand, if our data isn't representative, then there's something wrong with our data collection process, and that deserves our attention as well.

So what was our data collection process? In each city and locality we hired data collection teams to go out, find properties, and register them on the site. Those collectors had been incentivized on quantity: get as many properties on the site as possible. Demographically, our data collectors tended to be young, male, often bachelors. And they were great at the job we'd given them, getting inventory on the site. But then we realized that we potentially hadn't been precise enough in our direction.
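Here is roughly what that demand-versus-supply comparison could look like in code. This is a hypothetical sketch, not the actual housing.com analysis: the lognormal price distributions and all the numbers are invented, and the "unmet demand" measure is simply the area where the demand density sits above the supply density.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data for one city: prices users searched for (demand, the
# blue curve) vs. prices of listed inventory (supply, the green curve).
searched = rng.lognormal(mean=10.5, sigma=0.5, size=20_000)
listed = rng.lognormal(mean=10.2, sigma=0.4, size=5_000)

# Put both on a common grid and normalize each to a density.
bins = np.histogram_bin_edges(np.concatenate([searched, listed]), bins=50)
demand, _ = np.histogram(searched, bins=bins, density=True)
supply, _ = np.histogram(listed, bins=bins, density=True)

# Share of demand density not covered by supply: the area where the
# demand curve sits above the supply curve.
widths = np.diff(bins)
unmet = np.sum(np.clip(demand - supply, 0, None) * widths)
print(f"approx. share of demand unmet by inventory: {unmet:.1%}")
```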
See, in their rush to get inventory on the site, they were going to brokers they were familiar with, the path of least resistance. That meant they tended to go to flats like this, on the left: large, lower-end, more affordable buildings, with brokers who were renting out in volume. Those brokers were very happy to put 12 flats on the site, except those 12 flats are all basically the same, so there's very little value added from a user's perspective in looking through 12 flats with exactly the same layout. And our collectors weren't going nearly as much to flats like this: higher-end, more expensive, lower-volume. The brokers there weren't so comfortable with them; they don't deal in large volume, and the two worlds just don't really overlap.

When we realized this, we saw that we could actually change our data generating process. Our data collectors already had an app that they used for registration. So instead of incentivizing them on raw numbers, we could build a simple feed, an API that says "here's what we're looking for this month," serve it to the app, and incentivize the collectors to do the extra work of breaking out of their comfort zone, the things that would get us a more diverse set of inventory. The point is that a data generating process can be influenced by social and demographic factors, things we wouldn't normally think about as computer scientists, but that end up being incredibly important.

We also had data generating process problems on the demand side at housing. We were ensuring that we collected as much data as possible about each flat, and then we tried to make the search and filtering process as clean and delightful a user experience as possible. To do that, we needed to track and learn how our users interacted with flats on the site. Basically, that meant using clickstream data across the user's product journey, all the things they're doing. What does that look like? Users get on the site, take a few actions, and sequentially look through different properties. We're learning something about the kinds of properties they're looking at, but we're also making inferences based on the sequence in which they do it. Users' preferences are conditional, so they'll trade things off: closer to my job, I'll pay more for less space; further away, I want to pay less and get more. So there's this event stream you can imagine. Most of you are probably familiar with clickstream data, but in case not: the standard stuff that pretty much every website tracks is things like the page, a timestamp, some kind of user identifier (an IP address, or a user ID if you're logged in), the URL, and so on. But services like Google Analytics and others that set this up out of the box will only take you so far. If you want to see more granular data about sequences, where people are, which filters they're selecting, et cetera, you end up getting into much more complex event data modeling (a sketch of what such an event might look like follows below). Our product analytics team, as part of the broader data group, spent a lot of time working on this, because they realized: this is our data-generating process, it's really important, so it's worth the time. And they did a lot of work mapping out the flows the site facilitated, so we could track them and learn about them over time.
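As an illustration, here is one hypothetical shape such an event might take. This is not housing.com's actual schema; the class and field names are my assumptions. The point is that beyond page, timestamp, and user, you quickly need structured context (which filters were applied, which position in the results) to reconstruct a journey.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class SearchEvent:
    event_name: str           # e.g. "filter_applied", "listing_viewed"
    user_id: str              # logged-in ID, or an anonymous cookie / IP
    timestamp: datetime
    url: str
    schema_version: int = 1   # bump this when the product flow changes
    properties: dict = field(default_factory=dict)  # filters, list position, ...

event = SearchEvent(
    event_name="filter_applied",
    user_id="u_12345",
    timestamp=datetime.now(timezone.utc),
    url="/mumbai/rent/bandra-west",
    properties={"bhk": 2, "max_rent": 45_000, "results_shown": 37},
)
```

The schema_version field hints at the problem in the next part of the story: the product underneath your tracking will not stand still.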
And this would be fine if we had a very stable product, except we didn't. While housing was known for good data science at the time (I like to think we were), it was also known for really great design, and what the design team cared about most was how the product felt. That meant they were willing to do all kinds of work to improve how it felt, how it flowed, how it was used. And our front-end team, which had to implement all of that, cared about performance: making it fast, making sure information was compressed and flowing smoothly. What we realized was that when workflows are changing, when the site is changing constantly, the tracking we had set up breaks down. Because the front-end and design teams weren't really thinking about the data they were generating, they weren't always taking the effort to go back and remap things, or to work with the data team to understand whether a given event was even still possible on the site. A drop in such an event doesn't mean people aren't doing it anymore; it means they can't do it anymore. The lesson here is that when you have a data-generating process that's changing very quickly, it's very hard to make inferences from it: you lose the longitudinal value of your data. This is a technological process by which data gets into your database, and if you're not paying attention to how it's set up, you're going to get data that isn't appropriate for the questions you're asking of it. (One simple guardrail is sketched below.)
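A sketch of the kind of guardrail I mean, assuming a pandas table of raw events with event_name and timestamp columns; this is not something we had at housing at the time, just an illustration. It flags events whose volume collapses after a release, so a human can ask whether users stopped doing something or the new flow stopped letting them do it.

```python
import pandas as pd

def flag_dead_events(events: pd.DataFrame, release_date: str,
                     drop_threshold: float = 0.8) -> pd.DataFrame:
    """Flag tracked events whose volume collapsed after a release.

    `events` needs `event_name` and `timestamp` columns. A flagged event
    is a question, not an answer: did users stop doing this, or did the
    redesigned flow make it impossible? (A real check would also
    normalize by the number of days on each side of the release.)
    """
    after = events["timestamp"] >= pd.Timestamp(release_date)
    counts = (events.assign(period=after)
                    .pivot_table(index="event_name", columns="period",
                                 aggfunc="size", fill_value=0)
                    .rename(columns={False: "before", True: "after"}))
    counts["drop"] = 1 - counts["after"] / counts["before"].clip(lower=1)
    flagged = counts[counts["drop"] >= drop_threshold]
    return flagged.sort_values("drop", ascending=False)
```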
Now, the two stories I've told from housing so far both illustrate how people (business partners, ops teams, front-end developers) affect and influence a data-generating process and the data we work with. In my next story, I want to disrupt the tidy dichotomy between the data and the world on one hand and the data science team on the other. Many of you have heard the phrase that data is the new oil, right? At housing we liked to say: no, it's not really like oil, it's more like soil. Oil is a non-renewable resource; once you use it, it's done. But data is renewable: you can use it for multiple purposes, and it can even increase in value over time. The issue is that the things we do to data affect it; our work generates new data. And over time, new people come onto the team and the data gets used in new ways, and we still need to understand the things we did in the past that affected how our data came into being.

So, after I left housing, I was Chief Data Officer at a fintech startup in Mumbai called PaySense. PaySense does mobile loans through an app, and we were working on a problem that's probably familiar to everyone in this room as a standard data science problem: how to decide whether to give a person a loan. A lot of the time we jump straight into building a model; we might use a linear regression, which performs pretty well, as Chris was saying. But what I want to talk about is the features that go into the model and, before we even get to the model, where the data comes from. I'll talk about, for instance, a feature we decided was important: average monthly bank balance. And average monthly bank balance is a number that represents a deeper process.

Most people who apply for a loan have cyclical financial lives: every month they make some money, and every month they spend some money. What we're interested in knowing is the top, the bottom, and the shape of that curve. So really we're interested in something like this: normalized across the day of the month, where their balance tends to be. Then we reduce that to just one number, the monthly average balance. You might object that one number doesn't capture enough information about the curve; you might also want the monthly range, or the variance, or the max or min. And yes, you can add lots of features. But first we need to understand the underlying process.

Then we realize that while this plot looks pretty, as if we have enough data, have drawn a nice curve, and understand what's happening, a lot of the time the data actually looks more like this. We're not normalizing anymore; now we have three months of observations of one person's bank balance. In June and April we see this incline, and it looks like it might be representative. We see a similar slope in May, but it's lifted and shifted rightward for some reason. We can speculate about why, but the point is that it's not a neat curve, and we need to figure out how to interpolate and do other things.

We also need to ask ourselves where this data is coming from, rather than jumping straight into fitting the right curve or computing the right number. Where we got it from was SMSs, financial SMSs. When you make a debit or credit transaction, you get an SMS, and it contains a reference to your bank balance. But not everyone receives the same rate of SMSs, some people delete them, and there are lots of different processes at work. So let's look at a histogram of how many SMSs we have per user. We see a bit of a curve: we can expect most of our users to have around 20 to 50 SMSs. Then we might describe the data in a different way and look at how much time it covers, because 1,000 messages that all arrived in one week don't convey enough information; you care about coverage over time. So we take the earliest observation date and the latest, look at the span between them, and draw a curve for that too. And we see these peaks. Those peaks, by the way, came from decisions we made early in the process: first we said, let's take 30 days of messages; then we decided one month wouldn't be enough anyway, so let's take 60 days. We see that show up in the data. It has nothing to do with people's finances or how they transact; it has everything to do with how we set up our SDK. And you can keep interrogating the underlying process: why do we expect people to have all their bank balance transactions recorded on their phone in the first place? So next we can look at the count of unique dates covered (a sketch of these coverage metrics follows below).
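A pandas sketch of those per-user coverage metrics, with hypothetical column names, assuming each row is one balance-bearing SMS whose date has already been truncated to the day:

```python
import pandas as pd

def sms_coverage(sms: pd.DataFrame) -> pd.DataFrame:
    """Summarize per-user coverage of balance SMSs.

    `sms` needs `user_id` and `date` columns (one row per SMS, `date`
    truncated to the day). Raw counts alone mislead: 1,000 messages in
    one week tell you less than 40 messages spread over 40 days.
    """
    g = sms.groupby("user_id")["date"]
    out = pd.DataFrame({
        "n_sms": g.size(),                             # raw message count
        "span_days": (g.max() - g.min()).dt.days + 1,  # first to last observation
        "unique_days": g.nunique(),                    # days actually covered
    })
    out["density"] = out["unique_days"] / out["span_days"]
    return out
```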
But just as 1,000 messages in one week isn't sufficient, a long span isn't either. If we have an observation, and then two months later another observation, with maybe only two observations in between, the span between the two looks like a lot of dates, but the count of unique dates is quite low: you have insufficient data. If you plot these together very simply, you can see it. Ideally, our users would fall along this line: if you have 40 days of messages, you should have around 40, or at least high 30s, unique dates. But of course that's not how all people are. Some have large spans of time but not a lot of coverage. And we can also look at the density of messages within those time spans. All of this is just to say that this work needs to be done before we jump in with a raw algorithm or an equation.

And when we come up with that average balance, we're not just averaging the observations. If we have five balances across a month, you don't average over five, because there are all the dates for which we have no observations. You have to interpolate; you have to impute missing data. So what do you do? Do you forward-fill? Do you back-fill? Say I have an observation on the fifth of the month at 1,000 rupees, and then another balance on the tenth at 2,000 rupees. For the days in between, do I use 1,000 rupees? Or 2,000 rupees? Or do I split the difference? It makes a difference, because time spent at a given level of bank balance affects the credit people will be able to get and your evaluation of whether they can repay. (A minimal sketch of one such choice follows below.) And that average bank balance is not just used in the decision to give someone a loan or not. It's used by a different model deciding how large a loan to give them, by another model deciding when to market to them, by another model asking how likely they are to be engaging in fraud. Other data scientists are using it. So we generated that data, and then it just becomes data in our database, sitting in the monthly average balance table or wherever it lives. But we affected its data generating process. The decisions we made affected how the data ends up in our database.
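To make the imputation question concrete, here is a minimal sketch of a forward-fill version of monthly average balance. Forward fill is one defensible choice among several, not the "right" answer, and this is not necessarily how we computed it at PaySense.

```python
import pandas as pd

def monthly_avg_balance(obs: pd.Series, month: str) -> float:
    """Average daily balance for one month from sparse observations.

    `obs` holds balance observations indexed by date, e.g. the balance
    mentioned in each SMS. Forward fill asserts that between the 5th
    (1,000 rupees) and the 10th (2,000 rupees) the balance stayed at
    1,000 until we saw otherwise. Days before the first observation
    stay missing and are excluded from the mean.
    """
    start = pd.Timestamp(f"{month}-01")
    days = pd.date_range(start, start + pd.offsets.MonthEnd(0), freq="D")
    daily = (obs.reindex(obs.index.union(days))
                .sort_index()
                .ffill()
                .reindex(days))
    return float(daily.mean())

obs = pd.Series({pd.Timestamp("2016-05-05"): 1_000.0,
                 pd.Timestamp("2016-05-10"): 2_000.0})
print(monthly_avg_balance(obs, "2016-05"))  # ~1,815: 5 days at 1,000, 22 at 2,000
```

Back-filling or linearly interpolating the same observations would give a different average, and therefore a different credit decision, which is exactly why this choice is part of the data generating process and deserves to be documented.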
Okay, so let's continue with the story of deciding whether to give a person a loan, and take another feature. In the stories so far, we've mostly thought about a single data generating process that covers all of our users. But this last story shows how figuring out data quality sometimes involves recognizing that we have multiple data generating processes, and that we have some work to do to tell them apart. The next feature we might think about is monthly salary. When a person applies for a loan, their salary is obviously a core input to how much you decide to lend them. In the application process we rely on them to report their own salary, and obviously we can't just take that data at face value; there's a problem there. There's another process you can follow, which is logistically complex: sending a courier to collect a paper salary slip or offer letter, and then converting that into data. That's complex and expensive as well. So what we want to know is: can we validate salary more efficiently and more cleverly using the data we already have?

So again we turn to the SMSs, because we observed that some of them seem to tell us when a person receives a credit into their account associated with their salary. That's a fairly trustworthy signal, a little better than, or at least additional to, the self-report they've given us. Sometimes, by the way, we're really confident it's a true salary SMS. And sometimes, here in red, it doesn't say "salary" and we're not sure, but we feel like it is: we're pretty sure this is an employee of eBay, and that they've received their salary this month. So we can take these and map them against what people are reporting. If we do that, log-transforming so the distribution looks a little cleaner, we see that, yes, this roughly matches what we were hoping for: there is some mapping between the two. But we can also see what our intuition would suggest: there are a lot more people telling us they make more than their SMSs say than vice versa. What we need to understand is what else is going on here, which reports we can trust and which we can't.

So let's think about how this data is generated. Where is it coming from? We can go all the way back to the beginning and ask: how is a salary itself generated? What's the data generating process there? Well, it's a process of negotiation. You can get way off track thinking about qualifications and how a person arrives at a salary, but it's important to at least think it through. That negotiation produces the salary you agree to with your company, and it's usually a humanly meaningful number: you never agree to a salary of 37,237. We like round numbers; we like simple things. But what lands in our account isn't exactly what we agreed to, because the finance department gets their hands on our paycheck and applies all kinds of deductions. So that's the data generating process for what actually gets deposited.

We can follow the same logic through the user's report. What's the generating process there? A user isn't looking at what their finance department pays them; usually they don't even look at their payslip. They're applying for the loan in a mall, say, as they're buying something, so they report their humanly meaningful number. They know their CTC is, say, 40,000 a month; they don't track this tax and that deduction. So they report the round number. Then we know some users will exaggerate: they bump the number up a little because, honestly, they're getting a bonus next month, or when you calculate it out they'll get an increase within three months, so it feels safe. And then some are dishonest: they make 25 but say 60, because they think they'll get a better loan. Meanwhile, on the other side, we have our SMSs: the deposited salary is generated by a financial process, and then we get a notification.
That notification is subject to all the things we talked about before with balance: whether the person deleted the SMS, and so on. And on top of that there's the process by which we label something a salary SMS in the first place, where we're not always sure it actually is one.

Thinking about this, we still have two sources of information, what a person tells us and what their SMSs tell us, and we can look at the difference. So we take the reported income, subtract what we've identified in the salary SMSs, and draw a distribution of what that difference looks like and how big it tends to be. It's nice that it sits fairly close to zero, so there seems to be a good number of people who are being honest; we have a match. You might try to run a regression to estimate what the difference should be, what the bias is, because then you can be a little smarter about the salary. But you can also ask yourself: do we actually have multiple data generating processes here? Should we be estimating a user's bias as if it were a deterministic effect, as if it's just a coefficient in a regression? I don't think so. Each person has their own bias, and it's not uniform across the population, so maybe we should model it that way. We can take what's called a mixed-distributions approach and say: we actually have two populations here. One is an error distribution, where people are just off a little bit; it's narrow. The other is a bias distribution, which covers the people who exaggerate. It turns out we can expect around 61% of our users to fall into the error distribution, and we can assign each user to a component based on, say, a likelihood ratio test. There are good estimation methods where you can go further and model the parameters of the distributions themselves as distributions. But the point is that now we can say: for some people we'll take their reported salary as true, and for some we'll adjust it based on the SMSs. The point is that we shouldn't apply one uniform probability model to all our data; we should think about what is actually generating the information we're looking at (a minimal sketch of the mixture idea follows below). There's a tweet from many years ago that I think represents a lot of data science work really well. It's this question of counting: what's our true population, what's the relevant population for the problem we're looking at, and can we reproduce it? I'll come back to reproducibility later.
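A minimal sketch of that mixed-distributions idea, fitting a two-component Gaussian mixture to the log-scale difference between reported and SMS-observed salary. The data here is simulated to echo the roughly 61/39 split from the talk, and thresholding the component responsibilities is a simple stand-in for the likelihood-ratio assignment mentioned above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Simulated log-differences log(reported) - log(observed): a narrow
# "error" component near zero (honest, small mismatch) and a wider,
# positive "exaggeration" component. All parameters are invented.
diff = np.concatenate([
    rng.normal(0.0, 0.05, size=610),   # error component
    rng.normal(0.6, 0.30, size=390),   # exaggeration component
]).reshape(-1, 1)

gm = GaussianMixture(n_components=2, random_state=0).fit(diff)
honest = int(np.argmin(np.abs(gm.means_.ravel())))  # component nearest zero
print(f"estimated share in error component: {gm.weights_[honest]:.0%}")

# Per-user responsibilities; trust the reported salary where the error
# component is more likely, otherwise adjust toward the SMS figure.
resp = gm.predict_proba(diff)
trust_report = resp[:, honest] >= 0.5
```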
I also want to quickly talk about the second part of this, because it's not just about knowing this or thinking it through. Why is it so important to actually talk to people about data generating processes, especially your business partners and your product managers? I think it comes down to something from my background: most of my research work was in cognitive science, applied to how people make decisions. There's this thing called theory of mind, which has to do with how we understand what's in another person's head. And there's a related phenomenon called the curse of knowledge, which is the difficulty all humans have in separating what we know from what someone else knows; we project our own knowledge onto the other person. It's hard to remember that we know things other people don't. That sounds weird, but you can see that humans aren't born with this capability; it develops over time.

There's an experiment that's often done with young children. They're shown a series of events with two children: Sally has a basket, and Anne has a box. Sally has a marble, and she puts the marble in her basket. Then she goes out for a walk; she leaves the room. While Sally is out, Anne takes the marble out of the basket and puts it into the box. Then Sally comes back and wants to play with her marble. Where will she look? We all know she'll look in the basket, because that's where she put it. But children below a certain age will say she'll look in the box, because they know it's in the box, and they simply assume Sally knows it too. We mostly stop doing this as we grow up, but a lot of it stays with us; it's still very hard to track what other people do and don't know.

Another way this shows up is in our language, so let me do a quick experiment; can someone help me by memorizing the numbers we land on? If I say "highly likely," raise your hand if you think I mean about a 50% chance. Nobody, right? Maybe two or three people. 65? 75? 85? 99? Okay, so for "highly likely" we settle right around 85 or 90; remember that. What about "probably"? At 50 there are a lot more hands, and it drops off above that, so say 55. "Probably not"? 50, 35, 25, 15... that's interesting: we're skewed on "probably not," somehow it reads stronger than "probably." Let's say 25. And finally "highly unlikely": we won't start at 50. 30? 25? 15? 10? 5? Right around 5. There's actually been published work looking at exactly this, with a really nice visualization. The point is, if I say "highly likely" or "probably not," you don't really know what I'm saying. I'm trying to be more statistical, more data-driven, but there's quite a range in what people believe I mean.

And this is what happens with our data. Our business people get really excited: data science and AI, and we've collected all this data, so what's the answer? "I know it's race and it's sex and it's income, I know everything." It's not really true, and we have to do the work of helping them understand the limitations and the appropriate questions to ask of our data. Turning the world into data is a process that's both really wonderful, letting us be more concise, more precise, more accurate, and also lossy in all the ways Avi was talking about this morning: all the things we pack into our knowledge and forget to explain. So, on the left we have people, the world as we know it.
When we look at a representation of data like this, we fill in with our minds all kinds of knowledge, all kinds of context that helps us understand why these people like these sports, or what have you. As we move across, we carry less baggage, less bias, fewer things we're not explicitly incorporating into the model; we also get more abstract, and we lose information. My point is that data is not just data. It carries a lot with it, and communicating what's right to ask of it, and what its quality is for a given purpose or question, is something we have to work on with our business and product partners, not something we can assume they know as well as we do.

So that's my point: quality is not an intrinsic attribute of data. It's a description of the appropriateness of the question we're asking of the data. Data is always going to be incomplete; it's always going to be problematic and flawed. We don't just reach a binary verdict that it's good data or not. It's about saying: we can ask this kind of question, we can use this kind of method, and that's the best we can do; or the data is more appropriate for a completely different purpose.

And the final thing: how do we know we're using data in a good way, and going through this process of communicating our data-generating process in a good way? Coming from computational social science, in the last five to ten years there's been a lot of work on the importance of reproducible research: the idea that we shouldn't just believe a paper because it was published; we need to understand what was happening behind it and what methods were used. But reproducible is actually just the first step. Here's a two-by-two that I like to think about. In one dimension we have same data versus different data; in the other, our code, infrastructure, or algorithm, same versus different. Same data and same code is reproducibility: as a data scientist, I work a problem with some data and some code and come to a result, and I might have messed something up or overlooked something, so someone else needs to be able to do it again and reach the same result before we can really believe it. But that's not as robust as it could be. If we use different data and get consistent results, the work is replicable. If we use different code, it's robust: yes, you used a linear regression, but if I use a logistic regression I shouldn't get completely different outcomes; there should be some resonance, some consistency. And finally, when we can use different data and a different approach and still learn the same thing, our result is generalizable: we're really learning something about the world that we can believe. That's how I like to think about measuring ourselves on the appropriateness of what we're looking at.

And that's it. Thank you. I think we have time for a couple of questions, and I'll be around afterward as well. Also, there's a BoF session on data engineering tomorrow.
We're going to talk there about actual methods, techniques, and technologies you can use to do all of this, the real work of data engineering. I'm pretty excited about that, so if you're interested in this topic, please do come by. And anyway, I'll be around if you have questions or thoughts or just want to chat. Cool. Thank you.