joined by Sovik Dutta, who is the Data Science and Technical Analytics Lead at Facebook's Product and Integrity Operations. Today, we'll be chatting through how he's worked on attributing metric variations with statistical models. Data Science SG is a volunteer-run meetup that's been around for several years in Singapore. And we're very grateful that you joined today. Always feel free to let us know what more we can do to improve, what kinds of topics you're interested in, and if you have suggestions for what should come next. We're also very grateful to engineers.sg, who are recording this session; the talk will be up on YouTube later. So I'll hand over to you, Sovik, now. And thanks so much again for your time today.

Thanks, Sonima. Hello, everyone. I cannot see anyone, so I'll try my best to make this as interactive as possible. Very excited to be here. I have attended a few talks in the past, but never been on the other side, so really excited. Hopefully what I'll share today is of help to everyone. Give me a moment to share my screen. Awesome. Can you see my screen? Yeah. OK.

So this talk is going to be about attributing metric variations with statistical models. The way I'm going to approach this is: I'm going to give you all a very relevant introduction to what I have been doing, then talk about some of the general applications in the industry, then move on to the premise of the problem, and then finally the analytical solution that is going to be the focus of this talk.

So yeah, statistical models versus machine learning. This was something that was new to me as well when I learned that there are a few differences between the terms. They are mostly used interchangeably, but statistical models probably have more mathematical background to them and have largely been built by statisticians. Machine learning, on the other hand, is built on the same principles but has mostly been developed by computer scientists, so it follows an iterative or algorithmic approach in many cases, while statistical models are built on a bunch of mathematical assumptions. So that's the subtle difference, but at a very high level they are probably interchangeable. And the reason I'm talking about this is because in this talk I'm going to focus a bit on the mathematical interpretation of the model that I've used, and how that can be used to create impact for the business.

So, applications of analytics in the industry. I'm pretty sure most of the folks over here are aware of predictive applications, that is, how unknown or unseen data gets some quantitative meaning. This probably occupies the majority of analytical modeling applications in the industry; everyone wants to predict something, right? We also have descriptive and prescriptive. Descriptive means what has happened in the past, and prescriptive is: given some data points, what is the best course of action. The reason those two are bolded is because the solution I'm going to discuss falls somewhere between descriptive and prescriptive and takes pointers from both of these areas. There are a few other terms as well, like diagnostic, but at a very high level these are the major categories of analytics in the industry.

Okay. So the premise of this problem is that we want to protect users on Facebook, and hence we want to keep the ecosystem clean, right?
We want to remove bad content, not only in an effective manner, but also as soon as possible. So effectively and efficiently; those are the terms of interest over here. What I'm going to focus on is the efficiency part, that is, how fast or how soon we can clear bad content from the platform. We hire human reviewers to enforce on content alongside machine learning systems, and this talk is going to be focused on the human review portion.

We quantify human review performance via multiple metrics. One such metric is efficiency, and this is a fairly broad metric that is used with different connotations in the industry. In our use case, we can pretty much assume that this is jobs decisioned per hour. A very high-level overview of what jobs mean over here: content comes in the form of jobs or tickets, and reviewers essentially take a decision on them, whether they should stay live on the ecosystem or not. Of course, the unit of time can be anything; it could be jobs decisioned per second, minute, hour, or day. Just for the sake of simplicity, I'm going to keep it jobs decisioned per hour. And the way this ties back to the original problem statement is that higher efficiency means we take down bad content sooner, right? If we have figured out that something is bad, we want to clean it up as soon as possible.

Now, we track this metric periodically. The lowest grain at which we track it is the daily level, and it varies day over day, sometimes outside of the normal range of variance. A non-trivial dip in efficiency could potentially indicate lower productivity. So for example, if I'm tracking a metric and there are spikes or dips, that could mean it is outside of the normal range, and in the case of efficiency, if it falls below a certain range, it could indicate lower productivity. Of course, it could mean something else as well, and we will have to do the root cause analysis, but lower productivity is something that is of interest to us. Because at no point in time do we want productivity to go below a certain threshold, because that essentially means that potential bad content we have identified is staying live on the ecosystem, and that is something we don't want. We want to clean it up as soon as possible.

This is a snapshot of how the data might look. Folks in operations research might be able to relate to it, because this is pretty much what control charts look like: you have the within-control and out-of-control margins, and you can assume that we also allow an acceptable threshold of efficiency. As you can see in the graph, a dip is something of interest to us, and we want to investigate it with the data. So the open-ended question over here is: how do we understand what caused efficiency to dip? There was a dip and we want to understand why that happened. Now, there can be multiple ways to approach this question, because all of us gathered over here are interested in data; I'm going to focus on how we get a quantitative answer to it. So what I did, of course there are multiple possibilities, but what I did was convert this open-ended question into this quantitative question: what factors can we attribute our metric movements to? And this can tell us what areas need more attention.
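To make the monitoring piece concrete, here is a minimal sketch of how a daily efficiency series could be flagged against a simple control band, in the spirit of the control chart just described. The column names, baseline window, and sigma threshold are illustrative assumptions, not the actual production logic.

import pandas as pd

def flag_dips(daily: pd.DataFrame, window: int = 28, n_sigma: float = 2.0) -> pd.DataFrame:
    # 'daily' is assumed to have columns ['date', 'efficiency'] (jobs decisioned per hour).
    out = daily.sort_values("date").copy()
    baseline = out["efficiency"].rolling(window, min_periods=window)
    # Control band built from the trailing window, shifted so today's value
    # is not part of its own baseline.
    out["lower_band"] = baseline.mean().shift(1) - n_sigma * baseline.std().shift(1)
    out["is_dip"] = out["efficiency"] < out["lower_band"]
    return out

# Example usage: dips = flag_dips(efficiency_df); dips[dips["is_dip"]] are the days worth investigating.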
There are multiple ways in which this open-ended question can be converted into a quantitative question, but this is just one of them. And I'd like to talk about some of the potential solutions that were at the top of my mind. We can always answer this question via qualitative methods, by doing research and hypothesizing, working with the teams that are responsible for these metrics, and building a narrative around that. But this is a qualitative solution, and qualitative solutions work to a certain extent. Of course, if we can tie any quantification to it, that just makes it a bit better; it is more impactful and it is more understandable why a certain regression happened.

One of the processes that we already had in place was to identify causal factors individually and look for correlations with efficiency. So for example, say there are two hypothetical factors, one directly proportional to efficiency and one inversely proportional to it. Maybe what we can do is just track the movements of these individual causal factors and see how they move during the period of the dip, and if there is anything of interest, we can just report that. So we can say that, yeah, we found that causal factor one was moving, and hence that might have been the reason behind the dip in efficiency. But one of the cons of this approach is that it looks at individual causal factors independently, and we are never able to get a cohesive narrative.

So this is where the model-based, or statistical-model-based, solution comes into play. The best way to go about this would be to combine whatever causal factors we have into an explanatory model and see what narrative comes out of it. One way to think about this is that one causal factor will probably never impact the metric just on its own; it has to be a combination of multiple factors. So can we maybe model all of them together to see if efficiency has been moving in accordance with them, right? The statistical model that I used is a multivariate regression model where the dependent variable is efficiency and the independent variables are the causal factors. I'll go a bit into the details of what the independent variables look like in a few slides. But this is what I did at a very high level, right?

So we have built the model; what now? This is where the prescriptive part comes into play. Attribution can essentially be calculated based on the principles of how we interpret the coefficients. Whenever you read any introductory chapter on multivariate regression, it will talk about how to interpret coefficients: the impact of a particular variable on the dependent variable if only that variable moves and everything else remains constant. This is exactly the same thing: how much will efficiency change if only one of the factors changes and nothing else does, and so on and so forth. At a very high level, this is a pretty fundamental concept in regression, and I'll also go ahead and say that it is a pretty simple concept as well. But at the same time, it can be used to create major impact for the business if the problem statement can be tied back to it.
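For illustration, here is a minimal sketch of what that explanatory model could look like in Python, assuming a prepared table with one row per day. The column names in FACTORS and the efficiency column are hypothetical placeholders, not the actual factors used.

import pandas as pd
import statsmodels.api as sm

FACTORS = ["time_spent_per_job", "factor_2", "factor_3"]  # hypothetical factor columns

def fit_efficiency_model(df: pd.DataFrame):
    # Multivariate regression: efficiency (jobs decisioned per hour) on the causal factors.
    X = sm.add_constant(df[FACTORS])
    y = df["efficiency"]
    return sm.OLS(y, X).fit()

# Example usage:
# model = fit_efficiency_model(training_df)
# print(model.summary())  # coefficients, p-values, adjusted R-squared

The summary output is where the coefficient interpretation, statistical significance, and adjusted R-squared discussed later come from.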
Just quickly, I wanted to give a snapshot of how the table might look; I'm pretty sure folks over here who have built machine learning models are familiar with such tables. You have all the individual factors as columns, and efficiency, which is the dependent variable, features as another column. Of course, no data comes in this form; we have to do a bunch of wrangling and manipulation. So for the final version of the table that was to be used for the model-building process, what I did was build it in the form of a pipeline, so that it can be accessed easily on an ongoing basis.

Okay. So, a snapshot of what the statistical solution looks like. The final solution looks something like this. We want to explain the regression in efficiency; actually, I'll probably not use the term regression because that's the model as well, but we want to understand a dip in efficiency, right? Say we want to understand a Y percent dip in efficiency. Then can we present a solution where we are able to say that, to explain that Y percent dip, X1 percent can be attributed to factor one, X2 percent to factor two, and so on and so forth, such that the percentages sum to 100%? The factor with the highest contribution can then be focused on by the responsible team. So for example, if we are able to say that, hey, efficiency moved 10%, 70% of that is attributable to a change in causal factor one, and the rest is attributable to the other factors, then can we talk to the team that is responsible for factor one and see what's going on and fix it? This gives us a pretty good way of narrowing down what to focus on, and it ties back to the open-ended question, which is how we understand the dip in efficiency, right?

What I also did over here was package this in the form of a dynamic tool, which can be used by a non-technical audience. I'm pretty sure folks over here will be able to relate to this: technical solutions need to be used by many people, and one of the areas of impact analysts can create is by communicating them in the simplest form possible, or by providing the tools and resources for anyone to be able to use them. So what I did was implement this model in the form of a dashboard, where you are able to select the dates, the timelines, and the individual factors, and just push a button; a model runs in the background and gives you a visualization that you can actually use. The final visualization looks somewhat like this. You'll see that the Y percent dip in efficiency can be attributed 50% to factor one, 20% to factor two, 15% to factor three, and so on and so forth, and the sum of all of this is equal to 100%. The factors over here are essentially the causal factors.

I'm going to talk a bit about the steps in the model-building process. The first step is the research on causal factors. This is both qualitative and quantitative; I'll go a little deeper on this particular bullet in a few slides, so I'll leave it for the time being. The data collection and data cleaning process is super important. The majority of the data lives in different places, and we need to create pipelines, join them on particular keys, and basically wrangle and/or manipulate the data so that we are able to use it in the form we need. The final version of the data looks something like this, so it is definitely a non-trivial portion of the model-building process. Outlier treatment is a very important step, because we are dealing with human review.
There is a lot of variance in the data, and it can come from a multitude of reasons, like bugs in the system or just errors in the data logging. And it becomes pretty important for us to consider whether spurious data points should be part of the model, because we need to take a call on whether these data points matter: if it's a real effect, then we should be accounting for it, right? If it's not a real effect, then we should be discounting it, or at least considering what to do with it. So this involved a lot of qualitative research. For example, if there is a data point that looks like a pretty major outlier and it happened three months back, it was a hard task for me to figure out why that was the case. So I had to do a lot of qualitative research over here, and the major chunk of it was me talking to the respective teams to get a sense of what happened back then. We did figure out that some of the data points were indeed a function of logging errors, or something went off unexpectedly, and they would not be classified as legitimate. It just happens that in this case the majority of the outliers were erroneous points, so I removed them, and that is usually the easier approach. But of course, if that is not the case, then you need to figure out how to treat the outliers, maybe handle them separately. That was not the case for this particular problem, right?

Feature scaling is pretty important, especially for regression-based models. Although I did scale the features, I just wanted to point out that because we are using a descriptive version of the model and we are not predicting over here, I did not foresee a lot of problems creeping in: we are not dealing with unknown data, so everything should be within expectations. The data won't be outside of the observed range, because we are building on past historical data. But just wanted to highlight that feature scaling is pretty important; it becomes especially important for predictive tasks, maybe not so much for descriptive and prescriptive tasks.

Exploratory data analysis is pretty important, especially correlations, when you're building a regression model. You need to check for correlations between the independent variables and take a call on whether they should act as features in the final model or not; a small sketch of this check follows below. Because a lot of the research comes in the form of causal factors, you would expect some sort of correlation, but in general it's always a good idea to do the exploratory analysis step.

Model diagnostics is also something really important for regression-based models, and this is something I'm going to go a bit deeper into over the next few slides. I'd like to stress that because statistical models have a lot of mathematical interpretability attached to them and are essentially built on certain assumptions, model diagnostics become pretty, pretty important. In machine learning, where you're just building black-box models, you probably don't need to take into account the assumptions the model is built on. So one way to look at this is that you probably have more flexibility when you're building black-box models, but that is not the case for statistical models. You have to look at the assumptions, because that is the basis on which you will be able to get some interpretability out of it.
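As mentioned above, a minimal sketch of that correlation check might look like this; the factor names and the 0.8 cut-off are illustrative assumptions.

import pandas as pd

def correlated_pairs(df: pd.DataFrame, factors, threshold: float = 0.8):
    # Absolute pairwise correlations between candidate factors.
    corr = df[factors].corr().abs()
    pairs = []
    for i, f1 in enumerate(factors):
        for f2 in factors[i + 1:]:
            if corr.loc[f1, f2] >= threshold:
                pairs.append((f1, f2, round(corr.loc[f1, f2], 2)))
    return pairs  # e.g. [("factor_1", "factor_4", 0.91)] -> consider keeping only one of the two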
So model diagnostics is something really important. For model evaluation, I used adjusted R-squared over here, because the model needs to fit my data pretty well in order for it to explain variations that are affecting the business. A higher value of adjusted R-squared means my linear model accounts for the majority of the variance in efficiency. Adjusted R-squared and R-squared can almost be used interchangeably; the only drawback is that R-squared automatically increases with a higher number of features, and adjusted R-squared takes that into consideration. My model did not have a lot of features, because these are essentially causal factors, so the adjusted R-squared values are not very different from R-squared.

Statistical significance of coefficients is pretty important, and this is where I'd like to stress again that we are building a model to get a lot of interpretability out of it. Maybe in predictive tasks we can have a few features that are not statistically significant, but over here it is pretty important, because we want to be able to attribute movements back to the business and take decisions from that. So if there is a statistically insignificant feature, then we should at least be thinking about why that is the case. And the final step is the attribution calculation, which is really where the meat of the problem is.

So, causal factors, right? This is the bread and butter of the model, because what we are trying to answer here is: can we attribute efficiency movements to causal factors? So identifying causal factors becomes really, really important. I'll give you one example of a type of causal factor that I used in the model, and that is time spent per job. If you come to think of it, efficiency and time spent per job are pretty much inversely proportional: if you spend more time per job, then at the end of an hour your efficiency will be pretty low, because you'll get through fewer jobs, right? So since this is inversely proportional to efficiency, we know that there is causality in there, and it is a very good candidate for a causal factor, and ultimately a good candidate to be used as a feature in the model.

I used a similar methodology to identify other factors, but I would like to highlight that the cleanest way of inferring causality is to run an experiment. In many practical situations it's hard to run individual experiments, just because of logistical constraints; maybe time is a constraint. So what I did was use a middle ground over here: I used correlations. Of course, we all know that correlation does not imply causality, so it was a combination of using correlations as well as doing a lot of on-the-ground research with the relevant teams and getting that intelligence on whether there is some causality or not. So basically, brainstorm with the relevant teams on other potential causal factors and get some confidence that, okay, these factors might have a causal implication on efficiency. This is how I figured out all the different causal factors that were relevant to the problem. There were some good candidates for causal factors that I had to completely discount, because we did not have enough data points on them.
And at the end of the day, it's probably better for the business to provide a picture that makes sense for the problem statement rather than trying to include as many features as possible, because, again, we are pivoting on the interpretability of the model and we are not really focused on whether this model can give us good predictions.

So this is the slide about model diagnostics, and what that means is we want to test the assumptions of the model; especially with regression, there's a bunch of assumptions that we need to take into consideration. The graphs that you see come from a synthetic dataset; essentially these are the graphs that you get for free if you're using R. Even though I used Python in my case, I just wanted to highlight what the model diagnostics process looks like. There are a few assumptions in multivariate regression. For example, we don't want multicollinearity between features, so I had to check for independence, and it just happens in my case that certain causal factors were really highly correlated and I had to discount one of the two. Of course, you can use other methods if you still want to keep everything; for example, you can use principal component analysis and trim down the dimensionality. I took a call on eliminating a few features, again based on the qualitative inputs from the team, because we don't need to spend our energy on all the different factors out there; maybe only the important ones that matter are good enough. So multicollinearity is very important. We also need to check for linearity and homoscedasticity, which is constant variance: we want the residuals to follow a normal distribution with mean zero and constant variance, and the Q-Q plot and the histogram of residuals can give us that. You can see the normal Q-Q plot, which is also called the normal probability plot, and Cook's distance for influential points and/or outliers.

So model diagnostics was a pretty important step. There are certain methods that you can use if your model diagnostics are not up to the mark; for example, if there is a divergence from linearity or homoscedasticity, there are certain techniques, such as a power transformation, that you can use to address that. It just happened that in my case the data fit the model assumptions pretty nicely, so I didn't have to worry about that a lot, but there are techniques we can use for the assumptions of linear regression.
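For concreteness, a minimal sketch of these diagnostics on a fitted statsmodels result could look like the following; the variable names are assumptions, and in practice the plots would be inspected visually.

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def run_diagnostics(model, X: pd.DataFrame):
    # Multicollinearity: variance inflation factor per column of the design matrix X
    # (rule of thumb: values above roughly 5-10 are a concern).
    vif = pd.Series(
        [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
        index=X.columns,
    )
    # Normality of residuals: Q-Q plot against the 45-degree line.
    sm.qqplot(model.resid, line="45")
    # Influential points / outliers: Cook's distance for each observation.
    cooks_d = model.get_influence().cooks_distance[0]
    return vif, cooks_d

# Example usage: vif, cooks_d = run_diagnostics(model, sm.add_constant(training_df[FACTORS]))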
Now let's talk about the attribution calculation. This is probably the most important step, and it just happens that it is probably the easiest step as well; we just needed to think about it at a much earlier phase. This is where the meat is, and it is really built on the fundamentals you find whenever you read any introductory chapter on regression. The one thing that is needed for the attribution to work is that the model needs to fit the data pretty well, otherwise we cannot get any interpretability out of it; and if we find ourselves in a situation where the model is not fitting the data well, then we need to take analytical steps to fix that. If you remember the graph that I was showing a while back, you usually have a pre-period, and you have a post-period during which the dip happens. So you have a pre-efficiency and a post-efficiency, and whenever you see pre and post over here, it essentially means data from the pre-period and data from the post-period.

So if I were to attribute the efficiency delta to just factor one, the way I would think about this is: okay, how much change do I see in efficiency if only factor one changes from the pre-period to the post-period and everything else remains constant, right? This is really how we interpret the coefficients as well. Similarly, if I want to attribute something to factor two: how much will efficiency change if only factor two changes from the pre-period to the post-period and nothing else changes? So you'll see that in the formula for factor one, I only use the post-period value for that factor, assuming that only this one has changed and everything else has remained unchanged; for factor two and factor three, I'm using the pre-period values. Similarly, for factor two's attribution, I'm using the pre-period data for factor one, the post-period data for factor two, and the pre-period data for everything else, and so on and so forth. And the percentage attribution is basically the efficiency delta from factor one divided by the sum of the efficiency deltas due to all of the factors. That's how we can attribute a certain percentage to factor one, a certain percentage to factor two, and so on and so forth.
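For a linear model, swapping one factor from its pre-period value to its post-period value while holding the others fixed reduces to that factor's coefficient times its change, so a minimal sketch of the calculation (with hypothetical column names, reusing the fitted model from earlier) could look like this:

import pandas as pd

def attribute_dip(model, pre: pd.DataFrame, post: pd.DataFrame, factors):
    # Predicted efficiency change if only this factor moves from its pre-period
    # mean to its post-period mean, everything else held at pre-period levels.
    deltas = pd.Series({
        f: model.params[f] * (post[f].mean() - pre[f].mean()) for f in factors
    })
    # Percentage attribution: each factor's delta over the total, summing to 1.0.
    return deltas / deltas.sum()

# Example usage: shares = attribute_dip(model, pre_df, post_df, FACTORS)
# shares * 100 gives the percentages shown in the waterfall-style chart.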
At this point I'll again go back to that graph. From the calculation I just described, I am able to visualize the result in this pseudo-waterfall sort of chart, where I say that, okay, the dip in efficiency can be attributed X1 percent to factor one, X2 percent to factor two, and so on and so forth. This gives all the ground-level teams a good insight into what was the problematic cause behind the dip. Well, I don't want to call it problematic, but what was the most important factor, because of which the dip happened, that we should be focusing on, right? And then we can reach out to the ground-level teams, figure out why it happened, and hence get to a fix so that we prevent similar dips in the future.

The last step that I took in the entire process was to make this a one-button push that provides all the results. So I packaged it as a tool that anyone in my team and in my extended team can just use. I conducted tutorials, brown-bag sessions, and office hours, because the majority of the code is written in Python, and if people wanted to learn what was going on in the background, I ran sessions for that, held office hours, and also implemented feedback. This is something that I feel is pretty important, and those of us who build technical solutions sometimes discount the value that comes from feedback. Feedback is essentially a gift, and for the folks who are using these solutions, the feedback loop is really important, so I also iterated on the model based on the feedback. Model retraining is pretty important over here: how frequently do you retrain the model? For this, I had to do a separate analysis as well, where I checked how much variance there is in the data over long periods of time, and based on that figured out a frequency that really works. We don't need to retrain the model on a daily basis, for example; it depends on how the data behaves over time. So that is something that I had to take into consideration. Model interpretability is pretty important.

Of course, in a lot of the super complicated machine learning algorithms, like deep learning and whatnot, it is still a black box for many of us, and we don't get a lot of understanding of why a certain decision was made by the model. That's where I guess statistical models are a bit clearer, at least in what they are trying to do. But it is pretty important, because just being able to understand and model the data, and using that interpretation to tie it back to the real world, can create a lot of impact for the business. So this is where I'll end. Very quickly, I just remembered that I forgot to give an introduction of myself, so I'll quickly do that. I have been at Facebook for a little over six years now. Before that, I used to work at an ad network, and before that I graduated from college with a degree in mechanical engineering and a master's in mathematics. At Facebook, I've had multiple roles: I started off as an analyst, then became a solutions engineer, then moved back to being an analyst, and I pretty much work at the intersection of product, operations, and risk mitigation. And yeah, very passionate about solving problems with data. So hopefully the session was a bit helpful. I think we can end now, and I'm more than happy to take questions.

Yeah, thanks so much, Sovik, for the talk. We've got a lot of questions in the chat, which is excellent, but I'm conscious also of time. So I'll pick a few and then we'll go from there. Maybe let's start with a question from Johan. He asks: the dips are themselves outliers, so in the outlier treatment process that you outlined, how did you make sure that the dips themselves were not removed?

Yeah, that's a pretty good question. So there is a mechanism for us to identify which data points are outliers and which are not, and this pretty much happens at a fairly high level even before we dive into any analytical solution. For example, if there is a dip that happened on a particular date, we have to get confirmation from the ground-level teams and the teams that are working directly with human reviewers that there was indeed a difference in productivity, and that it was seen in multiple places or by multiple people and was verified, right? So the points that we are investigating are a part of the model. The points that were removed also come from a lot of investigative work, where we get, again, the same level of confirmation from the teams working on the ground that those were anomalous points. But it is a pretty hard problem, because for all the individual outliers that were there, I had to research each of them individually. That was a really time-consuming process, and then I had to filter out the ones that I didn't want to use. So in a nutshell, to answer the question, it is basically qualitative research that gives us that level of information.

Got it. And from Weeming: let's say we discover two factors that both relate to efficiency, but they are also correlated with each other, say 0.5. How can we decide which one to put into the model, or can we put in both?

Yeah, so I think when there is multicollinearity in the model, the model diagnostics take a hit and we lose a lot of the interpretation. So they shouldn't both be part of the model; they can be part of the model if we are somehow able to remove the correlation.
So maybe we use techniques like principal component analysis, but once you do that, you also lose a lot of interpretability, because you need to convert to a lower-dimensional space, convert back, and then get that level of interpretation, and that is something that I didn't want to do. The answer to this question, again, is by talking to the ground-level teams. First of all, I want to clarify that I did check whether there was any multicollinearity or not. So if individual factors are collinear, we need to take a call, based on what I mentioned, on whether we should keep one or the other. Now the question is which one of the two we should choose, and yes, the answer to that is also qualitative research: talking to the ground-level teams and using certain insights, like, do we understand which factor is more within our control? If you come to think of it, if factor one and factor two are highly correlated, that basically means both of them will have a similar impact on efficiency. But maybe factor one is something that we can't control as a team, whereas maybe we can control factor two pretty well by making certain changes in the process, either the operational process or the tooling process; we are able to get that level of control, and hence that factor is more relevant to us. So that level of research was also done. And there are certain factors that we don't really care about, so that's also one of the reasons. But yeah, at the end of the day, because we are holding a ground-level team accountable for the movements, we need to take into consideration which factors we are able to control. That was the basic principle behind this solution.

Got it. And I think we have time for just one more question, so I'll try to combine three. Given what you've been talking about, it's really clear that how you determine the causal factors is critical to solving this. So did you use other kinds of factors beyond the example you already gave of something inversely related to efficiency? For example, Abhinav asks: did you use a factor capturing the difficulty of content review, because some pieces might be difficult to flag due to unclear policies and things like that? And just to add on to that, how do you respond if a causal factor is something that the team can't act upon?

Yeah, I think the answer to this is similar to the last question. The final factors that went into the model-building process basically came after multiple rounds of alignment with the ground-level teams, and a consensus that these are the factors we should be focusing on because they are in our control, and they are important to us as well because they might have other implications. So that is the most fundamental answer over here about considering other factors. It was a fine trade-off between how much time we should spend getting the best model out there versus what can get the job done, right? Another data point over here is that the basic principle of this approach is interpretability. We want to get as much interpretability as possible without making the model too complex. Say we have 50 different factors; then it just becomes pretty hard for us to move anything, because we don't really know what to focus on.
So that was a fine trade-off that I had to find: how many factors are important to us and can realistically be moved, and then not worry about using or researching other factors beyond that. Got it. And I think we're right on time. So thank you, everyone, for your time today. Thank you so much, Sovik, for the session. Really appreciate it.