All right, so thank you, Paul. Thank you, German, for allowing me to host this exciting industry panel with six outstanding leaders in the causal science space across various industries. I'm Victor Chen. I'm currently directing Experimental Design and Causal Inference at Fidelity Investments in the US. In case you don't know, Paul and I go way back, to when I was a faculty member at the Copenhagen Business School before I joined industry. Today it's my great honor to be joined by six outstanding panelists. In no particular order on the screen, we have Sathya Anand, Director of Data Science and Engineering at Netflix; Benjamin Skrainka, Data Science Manager in Experimentation at eBay; Somit Gupta, Principal Data Scientist on the Experimentation Platform at Microsoft; Eric Weber, Senior Director of Data Science, Experimentation, Causal Inference and Platform at Stitch Fix; Michael Conaghan, Software Engineering Manager on the Experimentation Platform at Meta (somehow I still want to call it Facebook); and Yinyin Yu, Applied Research Manager in Experimentation and Causal Inference at LinkedIn. So welcome to this panel.

Just a disclaimer I want to announce before we proceed: today our panelists will share their personal opinions and ideas, which may not represent the views of their employers or colleagues. So we're free to share our own experiences and ideas as our own opinions. The panel is structured in the following way. We will spend about an hour with the panelists covering three broad themes concerning causal science in the industry, especially as it touches business decisions: culture, methodology, and infrastructure. After that, we will hopefully have at least 20 minutes for open Q&A with the attendees. But if you have questions or comments during the panel discussion, don't wait until the end. Feel free to use the Q&A button at the bottom of your Zoom screen and type your questions and comments, and I will sporadically pick them up and ask the panelists to respond.

Now, in case you are wondering, as you can tell from my background, I'm sitting in front of Doctor Strange's Sanctum Sanctorum in a parallel universe. If you are a Marvel fan, you know what I'm talking about. This is to remind me how I should think of causality. To me, causality is the difference in what may happen to the same person at the same time across two parallel possibilities, each the counterfactual of the other. That is my perception of causality, and a lot of the work and research I do is built around that concept. So let me kick off our panel by asking our panelists to define their version of causality. When you are working in your job, or in the industry generally, how would you describe causality to your friends? There's no particular order, but on my screen, the first on my magic list, top left, is Sathya somehow. Sathya, would you like to start?

Thank you, Victor, and thanks for having me on the panel as well. When I think of causality, it's really the concept of cause and effect. It has to do with: did your action directly lead to some impact or some change, and in the absence of that action, would the outcome not have been observed? So I think it's very similar to your conception of counterfactuals and alternate possibilities. In this possibility, you made a change and you're observing an outcome.
Is that outcome directly related to the change that you made?

Thank you, Sathya. The second on my magic list is Yinyin. What do you think?

Sure. Well, to me, causality as used in the industry really has to do with the effect of a feature change or an algorithm change on metrics of interest. There could be many things going on in a company, there could be things going on in the macro economy, but if you want to isolate the direct relationship between A and B, that's where causality comes in.

Very good. Eric?

You put me in third position, so I'm just close enough to say: yeah, what they said. I think there's also understanding both why something happened, did it affect the outcome, and also, what can we do about it? That last part comes up all the time when you operate in a company and an industry. There are a lot of interesting questions, but there's also the question of which of these outcomes we can actually change, influence, or manipulate. And that is where really tricky nuance comes into defining some of this. So I'm going to say what they said, but also add that piece: what can we do about it in a company?

Totally agree with you, Eric. I sometimes call causal analysis prescriptive analysis, because it has prescriptive value for what to do. Ben, what do you think?

Yeah, so anyone who comes from an economics background is probably familiar with the Angrist and Pischke book, Mostly Harmless Econometrics. They have a really nice idea about fundamentally answerable questions, FAQs, and fundamentally unanswerable questions, FUQs. We're trying to make something better in the real world, whether it's a policy to prevent teenage pregnancy or better decisions in the corporate world, and often we have questions that are very poorly posed. So I think there's a real art to posing your question so you can answer meaningful questions that get you to a good decision that's going to bring about a welfare improvement in the world. At this point in the list, it's kind of hard to say something new. I think a lot of us tend to operate on a daily basis in the potential outcomes framework, because we're lucky enough to be able to run experiments, and that's an easy and tractable way to work. We just heard the keynote speaker, who would chide us for using potential outcomes and say we should be using a structural causal model view of the world, which is also very good if you can't run an experiment. And it can be combined with experiments to deepen your understanding of causality, particularly if you have to deal with things in a messy world.

Thank you, Ben. Well, speaking of experiments, we have the last two names, who are actually responsible for building experimentation capabilities at their companies. So Somit and Michael, what do you think? How would you describe or define causality in your world? Somit, you want to go first?

There are very few things left to say after all the panelists have already spoken, so I'll say I agree with all of them on what causality is. I guess the simplest way I think of it, especially in Seattle: it's raining out there, and you have umbrellas. So did the umbrella cause the rain, or did the rain cause the umbrella to come out? You just see data, and both of them are co-occurring, so there's no way to figure out which way the cause and effect goes.
And this is what causality, or causal inference, helps us tease apart: hey, because of the rain, the umbrellas came out. And this gives you a sense of accountability and ownership of the results of the decisions you make, in terms of their impact. So you can really use causal inference to say: we are making this decision because we know it will cause X, Y, and Z.

OK, that's a very interesting analogy. So, Michael, I don't know if the rain and umbrellas resonate with you and folks in California, but what do you think?

Yeah, that example is also one I always come back to when I'm thinking about this. But the way I think of it is actually a lot more like your example. I like to imagine an elementary school science experiment, where the volume of the water and the length and the temperature are exactly the same in the two different setups, and you make a small change and see what happens. That's how you know that that small change is the cause of the effect you're seeing. There are a lot of things happening in the world and in the business, and ideally you could clone the world and have an alternative version where everything is exactly the same except for this one small change. We can't have that, obviously, but we try to do the best we can with the methods and the experimentation systems that we've built.

Very nice. Maybe one day we could have a meta, I mean, a metaverse, where we can apply our assumed changes to the alternative universe and see if they really cause any changes, right? I mean, you're the right person to talk to. Perfect.

So thank you very much, everyone, for the excellent scene-setting for causality. Now I want to start with culture. I recently made the transition from academia to industry, and the most obvious change I've been encountering is culture. The same goes for causality: we've seen the emergence of data science, big data, tooling, machine learning, and then causality. Again, we don't have to represent our own companies; these are just personal opinions and experiences. So why should people care about causality from a cultural perspective? Why is it important? I'm going to go backward this time, so let's start with Michael. Why should we care?

Yeah, it's similar to how we defined causality. You're investing a lot of resources and time in making these changes to your system or to your business, and you want to know if the change you made is actually what's causing the metrics you care about to go up, whether that's revenue, number of users, and whatnot. If you don't know, then you're just taking shots in the dark. And one of the best ways to know is experimentation and causal inference.

OK, so in other words, my reading of your answer is that we get a sense of responsibility: we know who or what is responsible for which effect. That is important in industry; we need that accountability. Perfect. Somit, what do you think? Again, I'm going backward this time.

Yeah, actually, just recently I've thought about this a lot. We published an article in Harvard Data Science Review with Iavor Bojinov, an assistant professor at Harvard.

And for the record, I've read it.

OK, so I'm going to quote that article and build on what Michael just said. I think of it as two levels of why you should care about causality.
One is at the level of your individual experiment or individual decision: you can be much more confident that your decision will cause X, Y, and Z. And in that same frame, you can also limit risk. You expose only a very small percentage of your population to the new treatment, not because you're unsure of your implementation, but because you have the humility to accept that even after our best efforts, sometimes we don't know how our customers will react to certain things. So you start with a very small exposure and make sure there is no negative impact. That's at the experiment level, or the single-decision level. But when you use it at scale, it has a cultural advantage. One, it allows you to make consistent decisions regardless of who's in the room when those decisions are made. That happens through defining really good goal metrics, or overall evaluation criteria metrics, that align with the product strategy. Two, it makes the whole decision-making process more scrutinizable, because now you have these key metrics and everybody is making decisions in a standardized manner, so you can go back and study whether the decision strategy worked. And lastly, it opens doors for innovation. It enables a growth mindset where we don't start with "we know exactly what our users want, let me spec it out to the last detail," but rather "we want to engage our users," or whatever goal we have, "and we're going to put our best foot forward, knowing we may not succeed the first time." So we try a lot of ideas. And that percolates down into the organization: people try many more ideas and have a way to know which ones succeed, and they build on those.

Okay, great. By the way, it is a great article; I shared it with many of my friends on LinkedIn. Ben, you are next on my list. I know you have been doing a lot of public education on this topic, so let me ask the question a different way: have you encountered cases where people say, no, we don't care about causality?

Yeah, actually, that's a really great question. At first you think, well, this is obvious: I'm a scientist, why wouldn't anyone care about causality? But then you meet a lot of people who really don't run experiments well. And if you step back and think about their incentives, they have different incentives than you do, which aren't necessarily about doing what's best for the company. They may be interested in keeping their job. They may be interested in having an easy life. A good example of this: a colleague at Walmart ran a great survey to understand why people weren't running more experiments there. He asked, why don't we run more experiments? And everything at the top of the list was "I've got too much stuff to do," "it doesn't really matter." All of the reasons were behavioral, cultural reasons for not doing it. None of them were technical, like "the platform could be better" or "we can't measure heterogeneous treatment effects." So putting an emphasis on causality, to Somit's point, is really important, because you need to set up the incentives so that managers are compensated for paying attention to causality. We know that on well-optimized sites, roughly 10% of new ideas are good; that's the order of magnitude. So if we don't pay attention to causality, we're going to ship a lot of bad features and actually take a big step backwards.

Interesting.
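To put rough numbers on Ben's ten percent figure, here is a back-of-the-envelope sketch in Python. The good/neutral/bad split and the per-idea effect sizes are purely illustrative assumptions, not figures from any panelist:

```python
# Back-of-the-envelope: why shipping ideas untested can be a net negative.
# Assumed split of idea quality, loosely keyed to Ben's ~10% figure.
n_ideas = 100
p_good, p_neutral, p_bad = 0.10, 0.50, 0.40   # illustrative assumption
lift_good, lift_bad = +0.02, -0.01            # illustrative per-idea impact

# Ship everything blindly: good and bad ideas both go out.
blind = n_ideas * (p_good * lift_good + p_bad * lift_bad)

# Test first and ship only the winners (ignoring statistical error here).
tested = n_ideas * p_good * lift_good

print(f"ship blindly: {blind:+.2f}, test first: {tested:+.2f}")
# ship blindly: -0.20, test first: +0.20
```

Under these assumed numbers, shipping everything untested moves the metric down, exactly the "big step backwards" Ben warns about.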
We need a strong incentive to justify the cost of doing causality, right? Thank you, thank you. Eric?

Oh, go ahead, Ben.

No, I'll let someone else talk; I've talked too much.

Okay, Eric then; we can come back to you later, Ben.

One thing I've thought about recently is that causality can take a back seat when everything is going up and to the right, when things are going really well, the business is growing, and everything's great. You think, okay, we could maybe have made that a bit better, but in general we're all making money. Now we're in a situation where we're having contraction, or flat year-over-year metrics, whatever it might be. And, echoing what others have said, a focus on experimentation and causality enables things to be a bit more scrutinizable. It democratizes evidence, in a way that can counterbalance a really powerful person in the room driving a specific narrative to push a product decision. So I think it's especially important in the environment we're in across tech right now, being able to look more closely at these decisions, because it may be the difference between growth and contraction, where in the past it may have been, okay, we grew a little bit less than we would have otherwise.

Okay, I agree. Thank you. Yinyin? Would you like to add something?

Sure. In general, I think the question of causality comes up when we have more than one option and we're looking for a data-driven way to help us decide which path to take. A culture of causality can help a company understand the ramifications of each decision before making it and potentially causing widespread harm. So causality is basically a way for organizations to make data-informed decisions.

Okay, awesome. Sathya?

Yeah, what all the other panelists said resonates fairly well, especially Ben's point about aligning incentives with what's best for the organization overall. I'd say there are many factors that go into decision-making, including strategy, regulatory constraints, competition, and so on and so forth. And if, as an organization, you believe that data should be one of the inputs to decision-making, then I'd argue that causality is probably the most important factor in how data affects decision-making across your organization, for all the reasons the other panelists mentioned: adding certainty to the process, keeping people accountable for the decisions they make, and having justifications for why you make a certain change. Now, I'll add that this does not mean, or at least I don't subscribe to the view, that you always need to do what the results of the A/B test or the causal analysis tell you to do. There may be other very good reasons, strategic, regulatory, competitive threats, to do something that goes against what the numbers are telling you. But at least having the causal analysis framework, the decision framework, in place puts the onus on the decision-maker to justify why they're going against what the numbers are saying. So it's a very powerful accountability mechanism.

Yeah, yeah. So now, Ben, let's go back to the point you made, which several other panelists also echoed: aligning the incentives. Can there be policies or procedures in the decision process that implement this alignment? For now I won't call on names, so if you have thoughts, feel free to jump in and feel free to interrupt. But Ben, would you like to start?
To follow up on that point: I'm an economist, for those of you who don't know, so I think a lot about incentives. In fact, my wife even now says, oh, is that incentive compatible? We would like to design good behavior into the mechanism, so that people willingly participate in doing the right thing for the company or the government or society. One of the things we did at eBay that was very effective in building more maturity was a maturity model. We benchmarked where we were against the industry throughout the experimentation lifecycle on a variety of dimensions, and we did it for all the key verticals, like customer marketing, performance marketing, on-site marketing, and so on. This allows you to put a number on something that is otherwise fuzzy and intangible. And when we publish it, we track it quarterly, so we can see which groups are doing well and which are lagging, and we can use that to push people forward. We can also do things like having people log, in the experimentation platform, why they made a decision. This is important because it allows us to learn and do meta-analysis to detect systematic problems.

Okay. Now I want to read out a comment one of the attendees put in the Q&A, which is related to what we are discussing: is causality used more in products or for product strategies, to assist decisions on road-mapping and so on? I think, to simplify, the question is: how is causality implemented in the production process? Would anyone like to jump in? Again, I won't call names from now on; feel free to jump in.

I can say that at least at LinkedIn, experimentation, which is the gold standard for causality, is really used to drive most if not all production decisions. When we productionize a feature or a new algorithm, we first test it against the existing feature or algorithm to see if it's actually better, and then we ramp the new variant if it does indeed do better for our members or customers. So in that sense, causality is a major component of production decision-making.

Okay. And Michael?

Yeah, similarly, at Meta there's a really strong ingrained culture of running experiments for almost every single change we make to our products or our systems, anywhere from a small change of a variable in how a system operates to something you actually see as a user; it will always be experimented on. And the decision-making process is really focused on these experimentation results. Of course, in practice it's not that simple, because in every experiment you run, several metrics might be up and several might be down. So at the end of the day, the people on the team make a call based on the experiment results, but everything is still focused on causality and experimentation for the most part.

Okay, awesome. So I want to read out one more question from the attendees' Q&A, which I think is related to culture; we can use it to close out the culture section and then move on to methodology. How do we keep politics out of causal modeling, such as experimentation or causal analysis? Let me give one specific example. As data scientists, you may want to be rigorous and take time for experimentation and causal inference before rolling out the next product or feature or change, as you mentioned, Michael, and also Yinyin.
But then the marketing people, the campaign people, say, oh, we have the best product in the world; let's roll it out to everybody. So how do we keep that kind of internal competition out? How do we keep the politics out and get things right? Okay, anyone can jump in.

I'll jump in: I don't think you can fully keep it out. If you look at the companies we represent, we probably index in a particular way on how we handle these things, but there's a huge continuum of how this plays out across companies. I think the important thing to understand is that you have to have some first principles about how you want to operate, but also accept that in some situations you may not get the outcome that you want. Even if you have those stated principles, this is what we're going to push for, we're going to adhere as much as we can to our causal framework, you have to allow for a different set of outcomes and understand that in some situations the business may just decide to do something. And just because you lose once doesn't mean you did the wrong thing. Having an understanding of your first principles is really important, because otherwise, if you judge only by outcomes, you'll probably get pretty frustrated after a while.

Yes, Somit, your hand is up.

Yeah, thanks, Victor. I agree with Eric: think of it as a marathon rather than a sprint. You might have some aberrations, but I think process and overall evaluation criteria can really help. They're two sides of the same thing. The main thing is to make sure you have a falsifiable hypothesis before you start an experiment. Often, politics starts after you've seen the experiment results, when you try to fit a hypothesis that matches those results. So a well-documented, falsifiable hypothesis stated beforehand helps. And when you think about it at scale, you want every experiment, and every person in the organization, to be pushing in the same direction, rolling the boulder the same way. So you should have overall evaluation criteria metrics for a product predefined before somebody even thought of the particular feature they might be really attached to. That way you can be more objective: this is the direction we want to take, these are our goals, and the results might be contrary to them. It then becomes harder to argue against that. You still might be able to, and there might still be good reasons to, but at least the process acts as a gate.

That's interesting: have the falsifiable hypothesis documented before running the analysis, just like writing a doctoral thesis. When I started writing academic papers more than ten years ago, it was the same process. It's good to know we are using that language in the industry now. Sathya, over to you, and then we move on to methodology after that.

Yeah, look, I agree with what Eric and Somit said at a high level, but I think the key thing to understand is that it's not easy to boil decision-making down to a set of numbers all the time. There are a bunch of other criteria in the world around you, in the business around you. There are business strategy reasons, there are regulatory reasons, a competitor may be doing something that you want to protect against. Not all of it is measurable, and not all of it is directly dictated by the numbers you can see and measure in an experiment.
So I think as data scientists, and especially as people who want to make, or help make, more rigorous decisions, we need to be humble and try to absorb as much context as we can from our stakeholders and from the folks who make the decisions, the owners of the products and business areas. And I think it all comes down to transparency and context sharing, right? You may think that because a product manager made a decision that went against the numbers, it's somehow politics, or the incentives are not aligned, but you may just not have the full context for why that decision was made. So we should be humble enough to say: okay, this is one way to make the decision, this is what the numbers are saying, but how you make the decision may go beyond what the tests are saying.

That's how it sounds sometimes: people may not even agree on the metrics to focus on, right, before discussing causality. Sathya, thank you; you've already foreshadowed the next topic I want to touch on, methodology. In general, there are two broad ways to get causality from data, as you can tell from the title: one is experimentation, and the second is causal inference from observational data, without an experiment or with an imperfect experiment. Now, on methodology, maybe we can continue the topic you started, Sathya, on metrics. What are some of the best practices for getting clarity on metrics to start a causal analysis? We'll start with Sathya, and then anybody can jump in after.

Yeah, thank you. I think Somit is probably the expert here on this front, and a lot of what we've learned is inspired by what he and his team have done over the years at Microsoft as well. But I'd say that having an overall evaluation criterion, a north star that your organization aspires toward and that everybody is aligned on, really helps. Now, obviously, that metric may not be sensitive, and it may not even be possible to move it directly with any single change you make. So you get into proxy metrics and guardrail metrics; you figure out what's sensitive but still aligned with the long-term lifetime value, or the long-term revenue, your organization cares about. As long as you have that overall hierarchy of metrics you want to follow, everybody is roughly aligned on it, and you have the processes in place to make decisions, debate, and share context throughout the organization, I think those are the key ingredients.

Awesome, awesome. Anybody else? I think Somit's name was mentioned. Somit?

Yeah, thanks, Sathya. Just building on what you said: this is where, if you've been running experiments in the past, the scrutinizable aspect of your decision-making comes into play. You can use your past experiments as a test data set and ask: if these are the new OEC, the overall evaluation criteria metrics, which of these experiments would I have shipped? And it doesn't need to align with the past ship decisions of those experiments; that's where the strategy angle comes in. Your product strategy might have changed, and you might be making different trade-offs now between top-level metrics. But it gives you a more data-driven way to think about which metrics to focus on, and it lets you vet multiple candidates, because when you just think about ideas for metrics in a vacuum, every idea looks really good.
But when you actually try a candidate metric out on previous experiments, which hopefully are representative of the future experiments you're going to run, it really gives you grounding in the data: will the metric be sensitive? Do we need a proxy? One other quick thing I want to say, especially for metrics: I've put together a framework called STEDII; it's in a post on our platform blog. A metric should be Sensitive, Trustworthy, Efficient, Debuggable, Interpretable, and Inclusive. Those are the six properties, which I summarize as STEDII, that I would look for in any top-level, critical set of metrics for a product.

Okay, thank you, Somit. I'll definitely check out STEDII on the blog. Ben, your hand is up.

Cool, yes. So, incentives again: a metric is an incentive, right? If you start measuring it, that's going to alter behavior, because people will start optimizing for it. You may decide to let anarchy rule and let each group choose their own metrics, or you may decide that your organization needs some level of coordination, that there should be a limited set of metrics the whole organization optimizes toward, so you don't get uncoordinated optimization where different groups cancel each other out. That's really important to watch for. The Microsoft papers that Somit and his team have worked on are great, like STEDII on measuring metrics. There's also a great Yandex paper where they created an optimal OEC using an experimentation corpus of labeled, trustworthy experiments: an OEC that combines several metrics and is both more sensitive and has better directionality. That's another nice way to optimize things, potentially. But all of this requires an organization sophisticated enough to have trustworthy labeled experiments, meaning you're running degradation experiments or replications so that you have high confidence in the ground truth. Not every organization is that mature.

Interesting. And Ben, thank you for that other reference; I'll definitely check it out.
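Ben's and Somit's idea of grading candidate metrics against a corpus of past labeled experiments can be made concrete with a small sketch. The data structures and the sensitivity/directionality scoring below are illustrative assumptions, not any company's actual system:

```python
# Minimal sketch: evaluate a candidate OEC metric against past experiments
# whose overall goodness is already trusted ("labeled"), per Ben and Somit.
from dataclasses import dataclass

@dataclass
class PastExperiment:
    label: int         # +1 = known-good change, -1 = known-bad change
    delta: float       # candidate metric's observed treatment effect
    significant: bool  # did the candidate metric move significantly?

def score_metric(corpus: list[PastExperiment]) -> dict:
    """Sensitivity: how often the metric moves significantly.
    Directionality: how often a significant move agrees with the label."""
    sig = [e for e in corpus if e.significant]
    sensitivity = len(sig) / len(corpus)
    agree = sum(1 for e in sig if e.delta * e.label > 0)
    directionality = agree / len(sig) if sig else 0.0
    return {"sensitivity": sensitivity, "directionality": directionality}

corpus = [
    PastExperiment(label=+1, delta=+0.8, significant=True),
    PastExperiment(label=-1, delta=-0.5, significant=True),
    PastExperiment(label=-1, delta=+0.1, significant=False),
]
print(score_metric(corpus))
# {'sensitivity': 0.667, 'directionality': 1.0}  (values rounded)
```

A candidate that scores high on both axes is a better OEC ingredient than one that merely sounds good in a vacuum.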
Now, once we get clarity on the metrics, that's only the beginning of a causal analysis, right? A very important beginning: we can better align incentives across areas of the business if people focus on the same metrics, and then people can try to get at the ground truth of the impact on those metrics. But what's next? To relate this to the professor's keynote earlier, there are generally two ways to analyze causality: either we have a representation of the world in the form of a model, a causal graph, and we map the evidence onto the model, or we design experiments, compare outcomes, and build evidence from the ground up. Any thoughts on that? Whether we have data or not, once we have the metrics we want to focus on, what's the next step toward getting causality, getting the ground truth? Yinyin?

Yeah, I can say something. I think the next step is a proper experiment design, or a proper analysis design if you're going the observational causal route. Experiments are sometimes fairly expensive. They take time. They can potentially expose our customers or members to a bad experience. So it's really important to make sure that whenever we run an experiment, it is designed in a way that can actually inform future decision-making, that we actually learn what we want from the experiment, and that it's sufficiently powered to get as precise a result as we need.

Okay. And then I think I saw a hand; it just disappeared.

That was me.

Did you raise your hand? Okay.

That was me. Yeah, I agree with all that Yinyin has said here. Just to add to that: especially if you're just starting off with experimentation, getting clarity on what you want to optimize, and making it something you can communicate so that everybody understands it and optimizes the same way, is a big, big one. After that, you may want to start a flywheel of A/B testing, of experimentation. You run a well-designed experiment, as Yinyin talked about, and then hopefully you get interesting insights that create more incentive to invest in the platform. Then you find more hypotheses, based on your existing results as well as other brainstorming, and run more experiments. I like to think of the value of experimentation in the potential outcomes framework itself: hopefully, because you ran an experiment, you're making some decision that you wouldn't have made otherwise. That's the value of experiments. And early on in your product, if you're not running a lot of experiments yet, you'd get those insights pretty quickly, which spurs more investment, and you run more and more experiments and test more and more ideas.

Sathya, and then Eric.

Yeah, just to add to what Yinyin and Somit said: ask everyone to focus on what the business problems and priorities are. It's easy to run analyses or experiments when nobody's looking, or when the business doesn't care about the part of the product you're experimenting on; you can get away with a lot of stuff. But to run a successful experimentation program, if you want this discipline to be successful in your company, I would say engage with the business, understand what the priorities are and what questions they're looking to answer, and then have a candid conversation about whether experiments or causal analysis is even the right tool to answer that question, right? As long as you have that debate and discussion going on with the business, it builds mutual trust, which in turn allows them to trust your results more, and when you have something to ask of them, that dialogue helps build the foundation of trust that makes the program successful.

Yeah, Eric?

Yeah, building on what others have said, I think there are two things. One, try to bring business partners, whoever that might be, along for the ride to the degree that you can. You can't involve everyone at every stage, but working toward a partnership, rather than something that one part of the business does on its own, can solve a lot of problems downstream. And as part of that, be really explicit about the costs of an experiment, from a time perspective, from a potential risk-to-customer perspective; there are a lot of different ways to talk about costs. But the more explicit you can be, the easier it is for people to weigh: is the trade-off here worth it?
Is it worth it for us to take three or four weeks to power this experiment to a sufficient degree, or should we be doing something else?

All right, so it's time for me to read out some questions from the Q&A. Here's one that relates to a lot of the topics we've covered, especially on metrics. What if we have multiple competing or alternative performance metrics? For example, some are short-term outcome metrics for A/B testing, but then there are the long-term goals of the company. How do we incorporate those competing or alternative measurements in our methods? Would anyone like to start? Ben?

So, some of the long-term stuff you may just not be able to move; it may not be very sensitive. And ideally you want your experiment to run in one or two weeks so you don't have to worry about macroeconomic drift, which means you need good surrogate or proxy metrics. It all comes back to a lot of the work Somit and his team did, right? We want STEDII metrics, things that are sensitive and directional, and if you've built your labeled experimentation corpus, you can evaluate these metrics in a data-driven way, and life should be good.

Somit and Michael?

Just to add to what Benjamin said, I think there are two aspects. One: I always think you should focus on the long term; even with short-term metrics, when you think of them in the long term, a lot of these problems go away. For instance, in one of our papers we talked about an early OEC metric for Bing. We wanted to increase query share, but focusing on the short-term metric, the number of queries in an A/B test, would have been bad, because, as Benjamin mentioned, some metrics create wrong incentives. You can increase the number of queries in an experiment by showing really bad results, so that people have to reformulate their queries, producing a large bump in query count. But in the long term, if you ran that experiment for a very long period of time, you would not see those queries increase. A better measure of utility was how many tasks a user is able to complete using your search engine, and a proxy for that was sessions. I believe even LinkedIn at some point used sessions, I don't know if they still do, but that became a good indicator for the long term and resolved the tension with the short-term imbalance you might see. The second aspect is strategic: the goals have to be defined by the team, right? Are you focusing on engagement? On depth of engagement? On monetization? On performance? Those trade-offs ideally should be made once, for all experiments; I think Benjamin was saying this too, so that different teams don't start making different trade-offs and canceling each other out. So that's the strategic aspect of it.

Michael?

Yeah, on the long-term question, I also agree that every organization or team should ultimately focus on the long term, but the average experiment may last about a month.
So one good practice that a lot of companies follow, including Meta, is something called, at least at Meta, a long-term holdout, where every team might have one small experiment, essentially, in which a small 1% of the population doesn't get anything that that particular organization or team launches over the next six months, or even up to a year. That can help you check whether the small experiments that optimize for the short term also optimize for the long term eventually. And then, on the topic of trade-offs between organizations, I agree with Somit that these should be discussed in advance. In a very large organization like Meta, things will, and often do, cancel each other out. So we also have a concept we call metrics defense, where we actually surface these cancellations: this organization ran something that made their metrics go up, which made another organization's metrics go down. It's often hard to make these decisions automatically based on the numbers, but at least then these organizations can start to discuss, settle on a good strategy and a path going forward, and make a decision.

That's interesting. It sounds like a permanent long-term holdout that is not included in any experiments, so we can see what's going on; we can use it as a sort of universal control group to compare against everything else. Interesting. Sathya, you have something to add?

Just on long-term holdouts: again, there's no free lunch, right? You have to be cognizant that you may be giving a degraded experience to 1% of your users for a very long period of time, and they may not benefit from the improvements you've rolled out along the way. So there are pros and cons. The pros, of course, are that you get to study the long-term effects and you understand seasonality; there's a whole bunch of benefits. But you also have to be cognizant that it may not be the right move for every organization.

Right, right. Well, I'm happy as long as I'm not in that 1% permanent holdout on my Facebook. Anyway, okay, the last question before we move on to the next topic, still on methodology. This is a pretty unique venue, with a large number of academic researchers, innovators, and industry practitioners together in the same place. Are there any desirable innovations you would like to see come out of cutting-edge academic research that we could use in industry? Any problems that cannot be solved by existing methods, where you'd like to see innovations in the coming years from the academic space?

There are two areas where I think we could definitely use improvements. One is experimentation velocity, which has to do with the speed at which we can get to an experiment result. Experiment velocity is especially a problem in enterprise experimentation, where the unit of randomization is an advertiser, a company, et cetera, so we have very small samples. For us to get a sufficiently powered result, we would have to wait a very long time, and a slow experiment really is a bottleneck for innovation. So we're always looking for ways to speed up our experimentation. The other area where we're looking for new methodology is privacy. Recent changes in privacy have forced us to revamp the way we do experiments and also our observational causal methodologies.
And so a natural direction to innovate is to develop methodologies that accommodate the privatized data landscape that a lot of industries are moving into.

Okay, let me summarize. One is speed, though ultimately it's about samples: how do we get a sufficiently powered sample? And the second is privacy protection, right? Okay, Somit?

So at the end of 2018, I was lucky enough to host a summit with 13 different organizations, where we discussed this topic, and we have a paper on the top challenges in the industry.

You have a lot of papers, Somit.

Yeah, we like to contribute to the community. I definitely recommend folks read that paper, because a lot of those challenges are still open. Some of the ones that come up often: one, which Michael was referring to, is that we want to focus on the long term, but we need really good proxy or surrogate metrics that we can observe in the short term. Second is fairness. For a lot of experiment analyses, it wouldn't be optimal to just look at the average treatment effect and ship to everybody. We need to be cognizant of how different important segments of the product are reacting, and of how to identify those heterogeneous treatment effects; and, I know LinkedIn has done some work on fairness, is that decision fair to everybody? The other part, which Yinyin already mentioned, is privacy and differential privacy. On top of that, I would also mention deviations from the traditional A/B testing framework. We assume the SUTVA assumption holds, that everybody's response to a given treatment is independent of how others are assigned to treatment or control. But a lot of the time there are network interactions between users, maybe because a lot of products are collaborative in nature, or because you might have multiple cookies on the same device, those kinds of things. Those break our assumptions. So first, how do we know? Is there a red flag that can be reliably raised whenever there is a problem with the analysis? And second, what do we do if we end up in those situations?

Okay, interesting. Eric, would you like to close out this final question on this topic?

Yeah, this is less about methodological innovation, though I echo what everyone has said. We're talking, even on this panel, about a very specific subset of companies, and how we operate and do things. I think it's really important to actually understand how other companies that don't have the same infrastructure make choices and decisions, and what evidence they're open to using. It's not really about methodology, but I think it's really important to document the costs of not doing things this way, in companies that don't operate the way we do. And there's not a lot on that, because we tend to gravitate to the methodological side of things, but I think it's important, especially from a broader industry perspective.

That's an excellent point, Eric. I just realized we are a very small subset of large corporations that may be at a high maturity level of the causal analysis cycle in terms of infrastructure and capacity. So let's move on to the tech topic, and let me follow up on the point you just made, Eric. How do organizations build up that capacity? And by capacity, I think I mean two things.
One is infrastructure, right? Platforms, engineering. And the other is people: the kinds of skills we want when hiring. So let's start with the engineering side of the equation. I noticed that several of the panelists here are leading the development of experimentation platforms. For those who don't know, what is an experimentation platform? What does it do? Maybe start with Michael and Somit, and then Ben and Eric.

Yeah, so an experimentation platform, at least the way we have it at Meta, allows you to run an experiment. It takes care of having an A version and a B version, and it takes care of logging. It ingests the metrics that you've logged, and ultimately it runs our methodologies to give you an experimental result: this metric was moved by this much, up or down. It may run all kinds of additional advanced methodologies. It also checks things like power: do you have enough sample size? And not every experimentation platform does this, but at least at Meta, it also provides an easy-to-use UI for all of this, so that experimentation is democratized and decentralized. Any team, even one without much experience, can come to the experimentation platform. It will prompt you: you need a hypothesis; what are your metrics; here's the experiment; do you want it at one percent or two percent? It guides you through it, and after a month, or whenever you have enough power, it says: here are the results you asked for, and here's how we think you can make a decision.

Okay, so is that platform also accessible to users, to Facebook users, or is it internal?

No, this would just be for internal engineers.

Internal, okay, gotcha. And Somit, what is Microsoft's experimentation platform? I know it's described in several papers, so feel free to cite those papers again.

Actually, we do have a paper on this as well, called The Anatomy of a Large-Scale Experimentation Platform, so I'm going to quote from that. Generally, you can think of an experiment as having two sides, the online side and the offline side, and the experimentation platform, as Michael was saying, should enable running experiments at scale, in a responsible manner, across the entire organization, or wherever the platform's customers are. On the online side, you first need to configure your experiment: decide the audience or population you want to run the experiment on, decide the two or three variants you want to test and how to configure them, and decide your ramp-up strategy: how do you put guardrails in place before you expose a lot of people? Then there's experiment execution, also on the online side. That is where you're actually randomizing the experiment units into treatment and control, and it needs to be done in an efficient, unbiased manner. And lastly, the experimentation platform should be able to control exposure, all the way from zero to 100, if you want to shift it.
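As a concrete illustration of the online side Somit describes, here is a minimal, hypothetical sketch of deterministic, hash-based randomization with exposure control; this is not Microsoft's implementation, just the standard pattern:

```python
# Stable, unbiased assignment: hash (experiment, unit) into buckets, use a
# low range of buckets to control exposure, and split the exposed half/half.
import hashlib

def assign(unit_id: str, experiment: str, exposure_pct: float) -> str:
    """Map a unit to a stable bucket in [0, 10000); same input, same answer."""
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10000
    if bucket >= exposure_pct * 100:   # e.g. 10% exposure -> buckets 0..999
        return "not_in_experiment"     # ramp 0 -> 100 by raising exposure_pct
    return "treatment" if bucket % 2 == 0 else "control"

print(assign("user-42", "homepage-ranker-v2", exposure_pct=10.0))
```

Salting the hash with the experiment name keeps assignments independent across experiments, and raising exposure_pct ramps new users in without reshuffling anyone already assigned.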
On the offline side, you want to define metrics. What are your main metrics? Ideally, you already have a metric set you can leverage, but you might want to define additional metrics on top. Then you want some way for the product logs to connect with those metric definitions, compute those metrics, and produce an analysis, what we internally call a scorecard, which tells you how treatment is doing compared to control. You even want some emergency brakes as part of that analysis, where the platform can send you alert emails saying: hey, this metric is really tanking or seems to be going crazy, take a look, or we're going to shut it down. Finally, you get the final analysis, and you make the ship decision. The experimentation platform enables all of that for our customers across multiple products at Microsoft.

Mm-hmm. Awesome, that sounds so interesting. Sathya? And actually, all the panelists, feel free to jump in if this is in your alley.

Yeah, so my teams don't run an experimentation platform, but we're consumers of the Netflix experimentation platform. A few other things we think about as consumers: who is the audience for the platform? Is it other engineering teams? Is it science teams? Or is it product managers and business owners? Depending on who you decide the primary audience is, you may go with a different set of features, making it point-and-click easy, or providing different layers of modularization and customizability. A good example from the science side: when you're innovating, not all metrics are already defined and computed in the standardized back-end pipelines Somit mentioned. You often need new metrics computed just for that experiment, and if that feature or product doesn't roll out, those metrics are sometimes not needed beyond the life of the experiment. So you don't want to invest in building out a full standardized pipeline while you're still testing things. Having the proper hooks for scientists to write their own metrics and have them computed as part of the pipeline is an example of what we look for as science teams. So thinking about who the audience for the platform is is another good axis of innovation.

And Ben?

Yeah, I think we could probably all talk about this for days, and many of my colleagues have written a lot of papers on it, but let me just offer a few points. First of all, I don't think your experimentation will ever be any better than your culture. Even if you build the greatest platform ever, you're going to be limited by your culture. With a few people overriding results with their spider sense and not trusting data, you won't be happy. And for those in the audience starting out at smaller companies: you might think, maybe we'll start with some third-party experimentation product. That might get you going quickly, but you should worry about whether you're protecting your user data, and you may have latency problems. Ultimately, your experimentation platform will want to be tailored to your business problems, so you probably want to start simply and build something of your own that works for your needs. And as Somit talked about, your experimentation platform, if you've got the resources and it makes sense, should be able to track the entire experimentation process through the whole lifecycle, from "I have an idea" all the way to reporting at the end, taking care of all the important things like power calculations and such.
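Since Ben lists power calculations among the things a platform should handle, here is a self-contained sketch of the standard sample-size formula for a two-proportion z-test; the baseline rate and lift are illustrative:

```python
# Required sample size per arm to detect a relative lift in a conversion
# rate with a two-sided z-test (textbook formula, no external libraries).
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_arm(p_base: float, rel_lift: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    p_new = p_base * (1 + rel_lift)
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p_base + p_new) / 2
    n = ((z_a * sqrt(2 * p_bar * (1 - p_bar))
          + z_b * sqrt(p_base * (1 - p_base) + p_new * (1 - p_new))) ** 2
         / (p_new - p_base) ** 2)
    return ceil(n)

# Detecting a 2% relative lift on a 5% conversion rate takes a lot of users:
print(sample_size_per_arm(0.05, 0.02))  # roughly 750,000 per arm
```

Numbers like this are exactly why the panel keeps returning to sensitivity, proxy metrics, and experiment velocity.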
There's also a great Netflix paper on platform design where they talk about separation of concerns, which really helps you scale effectively: building generic capability to run science in the platform, so engineers can focus on what engineers do best, building infrastructure, and scientists can focus on what they do best, analysis. So, for instance, if my platform initially just supports basic t-tests and I want to add something like quantile treatment effects, so I can look at the distribution of how people respond to treatment, a scientist should be able to write some code in a notebook, we drop it into a pipeline, and everyone's happy. It shouldn't be: oh, we've got six months where we have to get this feature onto the engineering schedule, lots of Jira items get tracked, and people turn gray and lose hair. It should be easy to put new science into the platform as needed.

So you're saying, Ben, the platform should be customizable, so that it can cater to different kinds of users, and it should be easily extensible?

It should be easily extensible. To the extent that it makes sense, a lot of this infrastructure should provide a generic science capability, particularly in the analysis part of the pipeline. You may want to make it easy to add metrics, or you may not, depending on how that's controlled in your organization; that goes back to our earlier discussion about choosing good metrics, and you need to think about the right way to do that. The other thing the platform does, which is really important: we have a lot of attrition in the industry, and there's certain key knowledge you want to institutionalize in the platform, so that if I lose my key person who measures some aspect of marketing, that business process isn't down for months while we're unable to think about how we're going to spend $100 million on advertising. The platform should hold mission-critical knowledge like that, so that if we lose personnel, we're not dead in the water.

To follow up on that topic: how do we institutionalize knowledge given the mobility of people? Anyone, it doesn't have to be Ben.

So, a simple case: I've done a lot of work on performance marketing. You probably have a specific set of ads you support, on Google, on Meta, wherever, and you've come up with some measurement methodology that you think is good, and you want to automate that to the extent possible. These offsite marketing experiments are hard because you don't control everything, right? Some of it is running out on Google or Meta or whatever platform, and that makes it much more challenging. So you want to automate the things that are sensible to automate, not just to eliminate errors and make things happen reliably, but also so that, hopefully, a new MBA can run these kinds of mission-critical processes without requiring scientists. The other thing we haven't talked about is that as you grow the platform, you need the necessary human capital in your organization to run experiments. That may mean you just need a data scientist from a bootcamp, or it may be something more sophisticated, where you need a team of PhDs who produce research papers and can solve thorny problems when things go pear-shaped in bizarre and interesting ways.

So, yeah, so far we have been talking about experimentation, but the people side is also very important: human capital.
Actually, there are several questions asking about exactly that: what kinds of skills will a causal analysis team be looking for? Several panelists here are not only practitioners in causal inference and experimentation but also people managers. So in terms of skills, what are some of the most desirable skills to look for in people entering this space? Sathya, would you like to start?

Sure. Let me be specific: there are statistical technical skills, there are engineering and coding technical skills, there's an understanding of the business, product aptitude, and communication, and then everything else you may want to test for to make sure somebody fits your organization and culture on a holistic basis. On the statistics side, pick up any advanced stats textbook covering both statistical inference and causal inference specifically; most of that is the foundational knowledge you need if you want to be a deep practitioner of this discipline. If you are mainly interested in just running experiments, then foundational basic stats, an understanding of p-values and basic regressions, all the foundational statistical knowledge, is probably key, especially if you're not in a technical discipline and not a scientist. Add to that, depending on your role, the ability to pull your own data and write some code in R or Python in a notebook somewhere. And if you want to get more advanced and work on the platforms, then you need to step up your coding skills as well. But I'd say my teams are business-facing, and we interact pretty heavily with product managers, business managers, engineering, all the cross-functional teams at Netflix. So we specifically look for product aptitude, business aptitude, general critical thinking, and communication, to round out the science profile.

Now I want to switch topic a little, because so far in this section, a lot of the discussion has focused on experimentation, especially experimentation platforms. What about causal inference? When should we introduce it? One panelist mentioned that experimentation is the gold standard for getting causality, but at the same time, you all understand it can be very expensive, and sometimes impractical. So where should we introduce causal inference, and how do we build up the capacity in an organization to run rigorous causal inference?

I can start. So, within observational causal inference, there are various methodologies, and some are more rigorous than others. I think the best combination of observational causal work and experimentation is this: when experimentation is very expensive, use observational causal analysis as a precursor to inform whether or not an experiment should be run. Of course, there are situations where experimentation is just not possible and people resort to observational causal methods. There are benchmarking papers out there that compare observational causal results to experimental results: they take an experiment where we know the ground truth, and apply observational causal methods to try to recover the experimental estimate. And the results of those papers show that observational causal methods sometimes land pretty far from the ground truth.

So you're talking about the paper by Ronny Kohavi?

I think there are multiple papers, multiple benchmarking papers.
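To give a flavor of the observational methods Yinyin is describing, here is a minimal inverse-propensity-weighting sketch on simulated data. It is a textbook illustration, not LinkedIn's platform, and real systems layer diagnostics and validations on top:

```python
# Simulate confounding, then compare the naive difference in means with an
# inverse-propensity-weighted (IPW) estimate of the average treatment effect.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=(n, 2))                    # observed confounders
p_treat = 1 / (1 + np.exp(-(x[:, 0] - x[:, 1])))
t = rng.binomial(1, p_treat)                   # treatment depends on x
y = 1.0 * t + x[:, 0] + rng.normal(size=n)     # true effect of t is 1.0

e = LogisticRegression().fit(x, t).predict_proba(x)[:, 1]  # propensities
ate = (np.average(y[t == 1], weights=1 / e[t == 1])
       - np.average(y[t == 0], weights=1 / (1 - e[t == 0])))
naive = y[t == 1].mean() - y[t == 0].mean()
print(f"naive: {naive:.2f}, IPW: {ate:.2f}  (truth: 1.00)")
```

The naive comparison is biased because the same confounder drives both treatment and outcome; reweighting by estimated propensities recovers something close to the true effect, provided all confounders are observed, which is exactly the assumption experiments let you avoid.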
In terms of your question about platformizing observational causal methods: at LinkedIn we do have such a platform, called Ocelot, where data scientists can leverage a variety of observational causal methods with just a few clicks, and it has built-in validations. Platformizing these solutions is always a good idea: not only does it democratize the methodologies for people who don't know the exact methodological details, it also saves data scientists time and protects them from human error when implementing these methods in their own Python notebooks.

That's interesting. What is it called again?

At LinkedIn, it's called Ocelot.

Ocelot, okay, very cool. I do want to follow up on one point you made, Yiyin. You suggested that observational causal inference could be introduced as a precursor to experimentation. So after the causal inference, how do we know whether we should run an experiment or not?

Usually, it's when the causal method reveals that there is some causal effect to be detected, and it's large enough that it could be picked up by an experiment. For example, one area where experimentation can be very expensive is where there are very few samples, so we have an issue of power and an experiment would have to run for a very long time. In those cases, observational causal methods can tell us whether there is indeed an effect there, which in turn informs whether a long experiment should even be run.

Okay, that's interesting and very helpful. Thank you, Yiyin. Satya, your hand is up; what's on your mind?

Yeah, so my teams also do a fair amount of causal inference beyond A/B testing, and I would say it falls into two big buckets. We do observational studies when we cannot randomize at all, or when the change is being made by a third party that impacts the Netflix business. Let's say Comcast or Disney runs a big promotion and we see our metrics changing; we have no control over that, so we use many of these techniques to understand what the impact was and whether we need to adjust our strategies as a result. There is also a layer in between unit-randomized experiments and observational studies, which we call, and I believe this is standard terminology as well, quasi-experiments. Here we don't randomize units at the most granular level; instead we randomize groups of units. The most obvious grouping is geo. If you're running a global quasi-experiment, we can say that the Nordics behave similarly to each other in terms of the metrics we see, so maybe we put Finland and Norway in one group and Sweden and Denmark in the other, right? These designs have a time-series component as well. So we have a set of causal inference methods that give us more power than purely observational settings, but definitely less than unit-randomized ones.
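A standard way to analyze a geo quasi-experiment like Satya's Nordics example is difference-in-differences on group-level time series. Below is a minimal two-period sketch, assuming parallel trends between the geo groups; the column names and layout are illustrative, not how any particular team structures its data.

    import pandas as pd

    def did_estimate(df: pd.DataFrame) -> float:
        """Two-period difference-in-differences for a geo quasi-experiment.
        Expects columns: 'geo_group' ('treat' or 'control'),
        'period' ('pre' or 'post'), and 'metric'.
        Assumes parallel trends across geo groups; illustrative only."""
        m = df.groupby(['geo_group', 'period'])['metric'].mean()
        treat_delta = m[('treat', 'post')] - m[('treat', 'pre')]
        control_delta = m[('control', 'post')] - m[('control', 'pre')]
        return treat_delta - control_delta

The control group's pre-to-post change stands in for what would have happened to the treated geos without the change, which is exactly the counterfactual logic the panel opened with.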
Okay, pretty cool. Thank you, everyone. This has been a very interesting discussion, but I do want to leave some time to answer questions from the audience. Hey, Paul and German: can the attendees speak, or should they just submit questions through the Q&A?

Yes, the audience can raise their hands and then we can unmute them; you and your co-hosts should be able to do that as well.

Okay. Given the limited time left, let me read the questions out. I might combine some questions in the Q&A if they appear to be similar, and if a question isn't clear, I'll ask the person behind it to elaborate, right? So let me move to the top of the list. Okay, there are several questions that I think can be combined. They ask: given all the difficult assumptions, the modeling techniques, and the complexity of the methods, how can data scientists convey confidence in the results to management? Some people mentioned terms like SUTVA, ignorability, et cetera, which may not be familiar to most day-to-day managers and product managers. So how do we convey confidence in results to the business side? Anyone like to start?

I think being able to communicate is super important, and it's something I certainly struggled with early in my career. I wrote lots of white papers that no one read, that didn't result in decisions, and that wasted a lot of time. So I think it's really important to learn how to communicate with executives. I'm sure everyone in this room has horror stories, and stories about victories where you finally managed to get through. A lot of it is just being able to break things down and keep them very simple. At eBay, we try to target our language so that a new MBA can understand it, and I keep that in mind. I try to write something that's one page, not twelve, and I try not to put equations in it. Graphs are really good; if you find the right visualization, that is often pivotal in getting someone to make a decision.

Cool, anyone else? Samik?

Just to add to what Benjamin said: that's one of the reasons I like to keep things simple to begin with. A lot of complex methodologies with a lot of assumptions are not only more fragile but also less democratizable, because not everybody will understand them, and they have to come with a long disclaimer list to be used. Generally, if this is something you have to do on a regular basis, and I know Satya was saying this, you need a very strong relationship with the product managers and decision makers, to understand their language and what they really care about. So understand your audience really well, and then try to develop a template or a standardized way of presenting your results, so that once they've seen it, they can quickly understand any other result you might share next time.

And Eric?

I think it's also important not to work under the assumption that you have to go over to every business user individually and make it work for them. Part of what you have to focus on is where you want people's baseline level of knowledge to be, right? For product managers and marketers, are there ways to get them up to speed so that you're not customizing for every individual person or telling a story tailored to each background? Saying you're "working with a business partner" often comes off as treating them like they don't know much, but there are ways to recognize that they already have a good facility with some part of statistics or experimentation, and then it becomes easier to build on that and create the partnership over time.

Okay, great.
The next question I'm going to read out, and I don't know if we've already addressed it, is about experimentation agility, probably related to the speed point I mentioned. What do you recommend doing under traffic limitations? What should we do if there isn't enough power, if there's no sufficient sample size?

There are different methodological advances to reduce variance, and also ways to test things offline. As the commenter posted, traffic online is limited, so to the extent that we can reduce the variance of our experimental estimate, that helps us get a more precise estimate with a smaller sample. [A sketch of one such variance-reduction technique appears after this exchange.] There are also ways to do offline policy evaluation, especially when testing different variants of an AI model, which can be used to narrow down the set of models we decide to test online.

Okay, that's helpful. I want to combine several other questions, which are very interesting; they're all about causal inference from observational data. You mentioned that several papers compare the evidence from experimentation against causal inference and that some results are way off from each other. But in industry, we very often don't have ground truth to compare against when we run causal inference, and different methods may give different results on the same observational data. So how should people deal with that: no ground truth, different methods giving different results, and no evidence from experimentation?

One way to think about correctness is an epistemological framework from the nuclear industry called verification, validation, and uncertainty quantification, which I find really useful. The first V is verification: did my code correctly implement the model? We don't yet worry about whether the model is correct; you unit test to make sure your code is right. Step two is validation: does my model have fidelity to reality? And then uncertainty quantification is thinking through the ways my model could fail because of the assumptions I've made: maybe I assumed SUTVA but I've got interference, or people are getting exposed to aspirin of different strengths because some pills were kept in the back of a warehouse for two years before they were given out. It gives you a rigorous way to think through your approach to the problem and whether you're likely to be correct.

That's interesting. What is that framework called again?

Verification, validation, and uncertainty quantification. There's a great book by Oberkampf and Roy on the subject.

Okay, interesting. Thank you, Ben. Yes, Satya.

A couple of things. I think it's never too early to start investing in a source of truth and building it out. To echo what a couple of other panelists mentioned: figure out what you've done in the past that is trustworthy and start building that collection, even if it's only three or five tests, so you have something to start with and can build on over time. The other piece I would re-emphasize is to speak the language of the business, right? Talk to them; see if the results of your analysis pass the smell test. Does it make sense that making this change in this part of the product would move your metrics in this way, or not? A lot of it comes down to whether the results make sense, and then getting the business's buy-in on that aspect of it.
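The variance-reduction point raised at the top of this exchange is most commonly associated with CUPED (controlled experiments using pre-experiment data), a technique the panelists don't name explicitly. Here is a minimal sketch of the core adjustment, assuming a pre-experiment covariate that treatment cannot affect.

    import numpy as np

    def cuped_adjust(y, x):
        """CUPED adjustment: residualize the in-experiment metric y against
        a pre-experiment covariate x (often the same metric measured before
        the experiment). Assumes x is unaffected by treatment; sketch only."""
        theta = np.cov(y, x)[0, 1] / np.var(x, ddof=1)
        return y - theta * (x - np.mean(x))

Because the adjusted metric preserves the treatment effect in expectation while shrinking variance, the same limited traffic yields tighter confidence intervals, which is the agility win the questioner was after.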
Okay, great. I think we have time for maybe two or three more questions. Now, there's one question...

Victor? Sorry, Victor. I think we officially have two more minutes.

Yes, so maybe one final question then, a quick-answer question, I guess, for everyone. Ideally, all changes should be tested, right? They should all go through experiments. But what about companies that are mid-tier, not best in class, that don't test 100% of changes but have been doing experimentation for many years? They have the experimentation capacity, they have the culture, but they're not mature enough to test everything. What would be a good target in terms of the percentage of changes to be tested, given limited resources and limited time: 80%, 50%, 30%?

Let me be biased, I guess, and say it should be 100%. I think the role of an experimentation platform, or any platform in general, should be to reduce the marginal cost of doing the same operation again and again. It shouldn't be the case that the next 100 experiments cost you as much as the first 100. The cost of doing anything else, with a lot of confounding variables and assumptions behind it, ends up being higher, especially if you end up shipping something bad or missing something really, really wonderful because your assumptions were wrong. So I would always keep the goal at 100%, but I'm biased.

Okay, thank you so much. Oh, Yiyin, yes.

I was going to say that maybe the mental framework shouldn't be what percent of product launches should be tested, but rather how to prioritize them if you truly have limited capacity. Prioritization can be based on how impactful the launch is: how widely the product would be used, and how many end users or customers it would affect. So I would think in terms of prioritization instead of percentages.

All right, thank you so much. Apologies that I couldn't read out all the questions, but this has been a wonderful panel session, and if we were in person I would call for a round of applause for everyone. You can use emojis, anything; oh, it's raise-hands only. Okay, thank you so much. Back to you, Paul and German, and I hope everyone enjoys the rest of the conference.