Welcome to What's Next in AI, a seminar series presented by IBM Research in which we spend time with some of our top scientists and researchers, learning about the exciting work they're doing, and then have an opportunity for a discussion with them to get into more of the details as well as think through some of the business applications of their work. I'm Shaheen Parks, and today we'll be speaking with Dr. Soumya Ghosh. Dr. Ghosh is a research scientist at the MIT-IBM Watson AI Lab within IBM Research. His research is broadly focused on statistical machine learning, and he holds a PhD from Brown University. Dr. Ghosh's talk today is about a form of model criticism: assessing the impact of modeling decisions on final output. He'll talk about both how to measure that impact and how to use that measurement to evaluate the robustness of your model. Essentially, it's about putting eyes on the influence some of your early choices have on your final results, allowing you to make an informed decision about how to handle that influence. Before I hand it over to Dr. Ghosh, I'd like to encourage our live audience to please drop any questions or comments you have in the chat, and we'll incorporate them into our discussion after Dr. Ghosh's talk. With that, Soumya, take it away.

All right, thank you for that introduction, Shaheen, and thank you all for joining us today. As Shaheen mentioned, today I'll talk about some of our recent work on studying the effects that modeling assumptions in a statistical data analysis have on its results.

Let's begin with a motivating example. Say we are interested in predicting the concentration of atmospheric carbon dioxide as a function of time. One might be interested in such forecasts for several reasons, including their implications for future climate. The plot here shows the average monthly carbon dioxide concentrations in the atmosphere recorded at the Mauna Loa Observatory in Hawaii. The data stretches from the late 1950s to the present day, but I'm only showing you a subset of the data, up to the year 2003. Based on this historical data, can we predict carbon dioxide concentrations? The answer to this question is obviously yes, and you may even be able to use off-the-shelf statistical models and tools for making such forecasts. But any such prediction requires assumptions: assumptions about the data, assumptions about the model, and assumptions about the procedures used to learn such models from the data. A natural question, then, is whether the results of such an analysis are robust to the various assumptions we have made along the way. Or would the results change substantially under innocuous perturbations to the modeling assumptions, that is, when the modeler swaps one set of assumptions for another, equally plausible, set? If the results change drastically under such perturbations, then one should perhaps not trust the results of the analysis at all. As an example of what such a substantially different prediction might be, consider the carbon dioxide prediction problem again. The predictions from this model are shown in black. As it turns out, the model's predictions are a little too optimistic; reality was much worse. In green, you can see the actual observed carbon dioxide levels over the same period of time.
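For readers who would like to poke at the same data, here is a minimal sketch of loading the Mauna Loa carbon dioxide record and restricting it to the pre-2004 training window. It assumes the weekly CO2 dataset that ships with statsmodels, which may differ slightly in coverage from the exact series shown in the talk.

```python
# Minimal sketch: load the Mauna Loa CO2 record and keep monthly averages
# up to 2003, mirroring the training window described in the talk.
# Assumes the weekly CO2 dataset bundled with statsmodels.
from statsmodels.datasets import co2

weekly = co2.load_pandas().data                    # weekly CO2, parts per million
monthly = weekly["co2"].resample("MS").mean().dropna()
train = monthly[monthly.index < "2004-01-01"]      # training window from the talk
print(train.head(), train.tail(), sep="\n")
```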
In particular, before being breached in 2019, carbon dioxide levels had not exceeded 415 parts per million since the mid-Pliocene, some 3 to 3.5 million years ago. For a scientist back in the early 2000s analyzing forecasts for the year 2019 or 2020, predictions exceeding this threshold of 415 parts per million could reasonably have constituted a substantial change from the predictions provided by the model. So this is one example of what a substantially different prediction might be, but as you can well appreciate, the definition of a substantially different prediction changes from application to application; it is very application-dependent and needs to be settled on a case-by-case basis.

In this talk, we'll focus on tools for assessing robustness to the modeling assumptions in a particular analysis. These tools are a type of model criticism and sit after the model selection step in the typical modeling workflow. So the rest of this talk is going to focus on model criticism, on this bit of the typical modeling workflow. We'll further restrict our attention to model perturbations in a particular class of models: Gaussian processes. Gaussian processes, or GPs, are flexible statistical models that provide accurate predictions as well as useful uncertainties, and they are widely used for modeling spatial and temporal data, especially when that data is low- to medium-dimensional, just like our carbon dioxide prediction problem. They are ideally suited for that sort of problem.

In a bit more detail, a GP defines a distribution over functions. In the one-dimensional case, you can think of a sample from a Gaussian process as the squiggly magenta line shown in the figure here: this is a function sampled from a particular Gaussian process. The GP itself is completely specified by a mean function and a covariance kernel. The covariance kernel is a key modeling choice and will be the focus of our robustness analysis. The importance of covariance kernels stems from the fact that they encode prior beliefs about the functions preferred by a Gaussian process. For example, depending on the choice of kernel, a GP could prefer smooth functions, rough functions, or functions with other kinds of properties, such as local periodicity. This kind of local periodicity is also important for the carbon dioxide example, as you can see from the data we have already looked at. Now, while there are results out there which guarantee that the mean predictions of a Gaussian process are independent of the choice of kernel when we have infinite amounts of data, we do not live in that scenario; in practice, we never have infinite amounts of data. In the finite-data regime, it turns out that the choice of kernel function really matters. For example, here are predictions from GPs with the three different kernels that we saw in the previous slide, fit to the same set of six observations. As you can see, both the uncertainties, shown in the shaded blue regions here, and the mean predictions vary dramatically between the kernels. Gaussian process kernels thus need to be carefully designed or learned from the data, and they must really be probed for robustness: do the results of the analysis change drastically if the choice of kernel changes just a little bit? To assess robustness to kernel choice, we pose the following question: how much do we need to perturb the chosen kernel to produce a substantially different prediction?
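To make the kernel-sensitivity point concrete, here is a small, self-contained sketch, not the exact figure from the talk, that fits GPs with three different kernels (roughly a smooth one, a rough one, and a periodic one) to the same six observations and compares their forecasts. The data and hyperparameter values are illustrative placeholders.

```python
# Illustrative sketch: three GPs, three kernels, one set of six observations.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern, ExpSineSquared

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, size=6)).reshape(-1, 1)   # six observations
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(6)

kernels = {
    "smooth (RBF)":        RBF(length_scale=1.0),
    "rough (Matern 1/2)":  Matern(length_scale=1.0, nu=0.5),
    "locally periodic":    ExpSineSquared(length_scale=1.0, periodicity=3.0),
}

X_test = np.linspace(0, 10, 200).reshape(-1, 1)
for name, kernel in kernels.items():
    gp = GaussianProcessRegressor(kernel=kernel, alpha=0.01).fit(X, y)
    mean, std = gp.predict(X_test, return_std=True)
    # Both the mean forecast and the uncertainty band differ kernel to kernel.
    print(f"{name:>20}: prediction at x=10 is {mean[-1]:.2f} +/- {2 * std[-1]:.2f}")
```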
If it turns out that we need to perturb the kernel by a large amount, to the extent that the new kernel now encodes very different prior beliefs from the original kernel, then maybe there isn't as much to worry about; we simply need to reason about which set of prior beliefs is more appropriate for the task at hand. If, on the other hand, it turns out that we need to perturb the kernel very little, and the perturbed kernel encodes essentially the same prior beliefs as the original kernel but produces substantially different predictions, then the results of the analysis should be up for questioning: perhaps your analysis is not as robust as you'd like it to be, and we need to redo the analysis from scratch.

Okay, so to answer the robustness question we just posed, we propose a recipe. The recipe involves two steps. Given the original kernel, step one involves specifying an epsilon-neighborhood around it and searching this neighborhood for a kernel that causes the predictions to change substantially. If we fail to find such a kernel, we grow the neighborhood by a little bit and repeat the process until eventually we find a neighborhood, denoted here by gamma epsilon, which is large enough and contains an alternate kernel K1 that produces the change in predictions we're looking for. We can express this whole thing more formally as an optimization problem. Now, after this step, we have the alternate kernel K1. Step two involves evaluating whether K1 and K0 are qualitatively interchangeable: do they capture the same prior beliefs that we had, or do they capture very different ones? So the overall recipe is very simple. Given the original kernel, search in some neighborhood of that kernel to find another kernel which produces the desired change in predictions, that is, substantially different predictions from the original kernel, and then go back and evaluate whether these two kernels, the original kernel K0 and the alternate kernel K1, capture the same prior beliefs or not. If both of these things happen, then the analysis is not robust and you need to revisit the problem.

So let's now get a little more into the weeds and look at these steps in more detail. In the first step, we are required to specify a kernel neighborhood. Broadly, we have two desiderata for these kernel neighborhoods. The first is that they should capture any quantifiable prior belief we have about K1, the alternate kernel. For example, we might believe that K1 is stationary; in that case, we would want the neighborhood to be comprised of all stationary kernels within some distance of the original kernel K0. Or we might have stronger prior beliefs still; for example, we might want the neighborhood to be comprised of all kernels that produce functions that are m times differentiable. Both of these things are possible, and we can construct neighborhoods restricted to these kinds of sets. The second requirement stems from the need to search these neighborhoods to find an alternate kernel K1: we want to be able to represent the neighborhood in a form that makes searching over it easy and efficient. So, with these two desiderata in mind, next we'll provide two concrete examples of neighborhoods which make minimal assumptions about the alternate kernel K1, but which give us procedures that allow us to easily search over the neighborhoods.
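To make step one slightly more formal, here is one hedged way to write the search as an optimization problem; this is a sketch consistent with the recipe described above, not necessarily the exact formulation from the paper. Here Gamma is the kernel neighborhood, d is a distance between kernels, mu_K(x_*) is the GP prediction at a test input x_* under kernel K, and tau is the application-specific threshold for a "substantially different" prediction.

```latex
% Sketch: find the least-perturbed kernel in the neighborhood whose prediction
% at the test input crosses the decision-altering threshold.
\begin{aligned}
\min_{K_1 \in \Gamma} \quad & d(K_0, K_1) \\
\text{subject to} \quad & \bigl|\mu_{K_1}(x_*) - \mu_{K_0}(x_*)\bigr| \;\ge\; \tau .
\end{aligned}
```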
Other neighborhoods and parameterizations are certainly possible, but they are not explored here. Okay, so first let's assume that we believe the alternate kernel K1 is stationary; that is, the kernel output depends only on the difference between the inputs and not on the locations of those inputs. Then we can set the kernel neighborhood gamma to be the set of all stationary kernels and leverage existing theoretical results: it turns out that we can represent stationary kernels in the spectral domain by their constituent frequencies. The big advantage of representing kernels by their constituent frequencies is that we can then search over this neighborhood by solving a continuous optimization problem in frequency space, subject to a closeness constraint, shown here, which forces us to stay close to the original kernel K0.

Okay, so stationarity is fine, but maybe it is too restrictive an assumption to make about the alternate kernel; I can certainly imagine cases where the alternate kernel should not be stationary. In that case, we can relax this assumption and expand gamma to include both stationary and non-stationary kernels. Again, for this to be effective, we want the neighborhood to be efficiently searchable. For this, we rely on the observation that warping the inputs of a kernel produces valid kernels, which in general will not be stationary. So we use a neural network with weights w to parameterize a warping function, which takes the original inputs x, warps them, and then feeds them to the original kernel. Searching over this expanded neighborhood now just involves learning the parameters of the neural network, which is again fairly standard, and we do this subject to a regularization constraint which forces the alternate kernel to stay close to the original kernel.

So these were two concrete examples. The key takeaway from both is that we carefully parameterized the kernel neighborhoods, which allowed us to rephrase a difficult discrete search problem over kernels as a continuous optimization problem. The continuous optimization problem is both simpler, thanks to modern automatic differentiation tools, and significantly more computationally efficient.

Okay, so now that we have seen the overall recipe and the mechanics of constructing kernel neighborhoods, let's revisit our carbon dioxide prediction example. It is literally a textbook example of how to carefully design Gaussian process kernels to capture trends in the data; it's taken from Rasmussen and Williams' influential textbook. Their kernel is a composition of several kernels, each capturing a distinct trend observed in the data. The first term captures the long-range, smooth, increasing trend in the data. The second term captures the near-periodicity that we see in the data. The third term captures medium-term irregularities, deviations from the structure captured by the earlier terms. And the fourth and fifth terms capture whatever was missed by the first three terms; in other words, they capture noise. The book was written in the mid-2000s, and the authors trained on data up till 2003, which is the reason we've only been looking at data up till 2003, and predicted forward over the next 20 years or so. I'm showing their prediction in black here. One thing to note is that the prediction is very reasonable, in the sense that it captures the same sort of periodic structure that we saw in the training data.
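Before moving on, here is a rough sketch of what such a composite kernel can look like in code, in the spirit of, but not identical to, the Rasmussen and Williams construction just described. The hyperparameter values are illustrative placeholders; in practice they would be learned, for example by maximizing the marginal likelihood.

```python
# Sketch of a composite kernel: long-term trend + quasi-periodic seasonal term
# + medium-term irregularities + short-range noise-like terms.
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (
    RBF, ExpSineSquared, RationalQuadratic, WhiteKernel, ConstantKernel as C,
)

long_term = C(50.0) * RBF(length_scale=50.0)                         # smooth rise
seasonal  = C(2.0) * RBF(length_scale=100.0) * ExpSineSquared(
    length_scale=1.0, periodicity=1.0)                               # ~yearly cycle
irregular = C(0.5) * RationalQuadratic(length_scale=1.0, alpha=1.0)  # medium-term
noise     = C(0.1) * RBF(length_scale=0.1) + WhiteKernel(noise_level=0.05)

kernel = long_term + seasonal + irregular + noise
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
# Usage sketch (X_train, y_train, X_future are whatever data you have):
# gp.fit(X_train, y_train)
# mean, std = gp.predict(X_future, return_std=True)
```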
The predictive uncertainties, shown by the shaded gray region, also grow as we move farther away from the data. This is again desirable: the farther into the future you try to forecast something, the more uncertain the forecast should be. That being said, we would now like to probe the robustness of these forecasts: do minor perturbations to this particular kernel produce substantially different predictions? If you recall from the earlier part of the talk, substantially different predictions here would be predictions which exceed the threshold of 415 parts per million.

Okay, so to do this, we ran our procedure, our algorithm for assessing robustness, using the non-stationary kernel neighborhood, and discovered an alternate kernel K1. In red are the predictions from that alternate kernel K1, which look just as plausible as the predictions from our original kernel, shown in black, but which exceed the 415 parts per million threshold. So they are substantially different according to our definition. Now all that is left to do is to evaluate whether K1 and K0 encode the same prior beliefs. For this, we propose two tests, one quantitative and one qualitative.

First, let's look at the quantitative evaluation. Here we use the 2-Wasserstein distance to measure the distance between two Gaussian processes. We are going to conclude that the prior beliefs expressed by K0 and K1, the original kernel and the alternate kernel, are close if the distance between them is smaller than the distance between K0 and variants of K0 learned from slightly perturbed versions of the training data, for example from bootstrap samples of the training data. For our example, the distance between K0 and K1 is shown by this red line here, and the distances between K0 and the bootstrap variants of K0 are shown by this blue histogram. Since the red line is significantly closer to zero than the blue histogram, we can conclude that the discovered kernel K1 is much closer to K0 and likely captures the same, or very similar, prior beliefs.

Next, since our data is low-dimensional, we can do a qualitative check: we can visualize functions sampled from the Gaussian process employing K0 and from the one employing K1, and assess qualitatively whether they share the same characteristics. Here I'm showing you ten noise-free random functions sampled from each of the two Gaussian processes. Looking at these figures, one would be hard pressed to say that K1 is expressing meaningfully different prior beliefs about preferred functions than K0.

Okay, so let's recap. What did we learn about the Mauna Loa analysis? We were able to find an alternate kernel K1 that produced substantially different predictions, and that alternate kernel K1 also encodes essentially the same prior beliefs as K0. By our definition, this is a non-robust analysis: it produces drastically different predictions under a kernel that encodes very similar prior beliefs. So we should probably not trust these results, and we deem the entire analysis not robust.

Okay, finally, this talk provided one example of analyzing robustness issues in Gaussian processes. If you'd like to learn more about this and see other examples, including higher-dimensional ones, you can check out our paper, which is up on arXiv and is currently under review.
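As a reference for the quantitative check mentioned above: when two Gaussian processes are compared on a finite grid of inputs, each reduces to a multivariate Gaussian (mean vector from the GP mean function, covariance from the kernel Gram matrix on the grid), and the squared 2-Wasserstein distance between two Gaussians has a convenient closed form. The sketch below implements that closed form; how the grid and the bootstrap reference distribution are chosen in the actual procedure follows the paper and is not reproduced here.

```python
# Squared 2-Wasserstein distance between N(m0, S0) and N(m1, S1):
#   ||m0 - m1||^2 + Tr(S0 + S1 - 2 * (S1^{1/2} S0 S1^{1/2})^{1/2})
import numpy as np
from scipy.linalg import sqrtm

def squared_w2_gaussian(m0, S0, m1, S1):
    """Squared 2-Wasserstein distance between two multivariate Gaussians."""
    S1_half = sqrtm(S1)
    cross_term = sqrtm(S1_half @ S0 @ S1_half)
    # sqrtm can return a tiny imaginary component due to numerical error.
    return float(np.sum((m0 - m1) ** 2)
                 + np.trace(S0 + S1 - 2.0 * np.real(cross_term)))

# Usage sketch: evaluate each GP prior on a grid, then compare.
#   grid = np.linspace(0, 10, 50)[:, None]
#   S0, S1 = kernel_0(grid), kernel_1(grid)          # Gram matrices
#   d2 = squared_w2_gaussian(np.zeros(50), S0, np.zeros(50), S1)
```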
With this line of work, we want to emphasize the crucial role modeling assumptions play in any data analysis. In particular, for Gaussian processes, we found that results can be very sensitive to the choice of kernel. And although such sensitivity has long been suspected, our work provides the first practical tool to check for such issues in a data analysis of interest, perhaps your data analysis. And with that, I'll end and take your questions. Thank you.

Thanks, Soumya. You've given us a lot to chew on there, and I do want to move us into some discussion. As a quick reminder to our live audience, please feel free to drop your questions in the chat and we will do our best to incorporate them. To get us started, Soumya, I wanted to ask you a bit about setting the threshold value. It seems from your talk that this whole analysis is predicated on a comparison to a threshold value, and your example was a somewhat natural fit in terms of picking a threshold value that represented a significant point in climate change, shall we say. So my question is, how do we generalize this? How could you pick a threshold value to make this sort of analysis useful for any particular case?

Right, that's a great question, Shaheen. As much as we would like there to be a universal threshold value, a single threshold that works across problems and across domains, that just isn't the case; these threshold values are inherently problem-dependent. What we have found in practice is that these thresholds naturally arise when you start thinking about what you're going to do with the predictions of your model. Typically, given the predictions of your model, you want to use them to make some decisions downstream, and it's often the case that if your predictions are in a certain range you take decision A, whereas if your predictions are in a different range you take decision B. These kinds of prediction-driven decision changes give rise to the thresholds. They naturally arise when you think about how your decisions would change based on the predictions of your model.

That makes a lot of sense. Can you give us any examples of an alternate way to pick a threshold, or a different situation?

Right. For one, if you look at the paper, we looked at an example of heart rate monitoring. In that setting, a heart rate of around 130 beats per minute turned out to be a natural threshold: you want to predict whether the heart rate is going to change substantially over a period of time or not, and whether it will cross that level. So often, when you're making these kinds of predictions, there is some underlying value that you really care about in the context of the particular application, and you as a practitioner would often care about whether your predictions cross a certain value. That's where these thresholds typically show up.

Yeah, that makes a lot of sense; I can see why generalizing them would be a challenge. Another point that came up through your talk: you talked a bit about making this comparison in a qualitative way and in a quantitative way. My question to you is, how would you pick? When would you use a qualitative comparison versus a quantitative one, which seems a little more precise?

Right. The short answer is, ideally you would do both. And sometimes you can't do both, for a variety of reasons.
The quantitative comparison, while it might seem a little more precise, doesn't really tell you whether the two kernels are capturing the same prior beliefs; it just gives you hints. If two Gaussian processes are close, then they are likely capturing the same prior beliefs, but you can't really tell whether that is the case until you do the qualitative test. So ideally, if you can, you should do both. There are situations where the qualitative test is a lot harder to do. For example, if you live in some very high-dimensional space and you don't have these one-dimensional functions, then you can't really visualize the functions, which makes the qualitative test much trickier. In that case, you have to rely on the seemingly more precise, but, as it turns out in this context, somewhat less precise quantitative test. But ideally, you do both.

Yeah. I think there are a lot of situations in which patterns reveal themselves to your eyes in a way that the mathematical tests don't always show, so that makes sense.

Right. The fundamental issue we run into while doing these kinds of analyses is that some things are just hard to quantify. We can quantify them to a degree, and the quantitative tests are basically telling you how close the kernels are according to the beliefs that you were able to quantify. And then there's a whole host of things that you are not able to precisely quantify, but when you look at the data you say, oh, of course, these things look similar, or these things look very different. That's what the qualitative test gives you.

Hmm. So we have a question from our audience around how to execute this sort of comparison or criticism, asking whether there's a list or a process to follow that might make it easier to run a test like this.

Right. Firstly, the tests that I spoke about today apply to Gaussian processes; if you're working with some other kind of model, you'd have to design a workflow that works for that set of models. But assuming you're working with Gaussian processes, the first thing to do is to think carefully about your application and figure out what decisions you actually care about. Once you figure out what decisions you actually care about, that gives you the notion of the decision-altering threshold that we've been chatting about for a bit. Once you have that decision-altering threshold, you can apply our workflow, the tool we just described, directly to the problem. We are in the process of releasing software which makes this easy for other people to use, and hopefully it'll be out soon, which would allow you to use this tool.

That's great. I think everybody likes easy, and certainly it would be great to have a way to apply this in an easy way. So we have another question, and it is around building this in. The question is, can these concepts be embedded into your process initially, or is it a test that you need to run after the fact?

Right. So right now we view this as a test that you run after the fact. Obviously, it's an iterative process: you run this test, you learn something about your data, and then you try to incorporate that into your next data modeling iteration. So one way in which you might want to use some of this is as follows.
If you run this test and find an alternate kernel which is very close to your original kernel but gives you slightly different, or as it might turn out substantially different, predictions, then in the next data modeling iteration you could imagine averaging over these two kernels, which would give you more robust predictions.

Yeah. Well, you said initially that this is very specific to Gaussian processes. Do you feel there is a logical way to expand that and to use other types of functions and processes?

Yeah. So our analysis is rooted in Gaussian processes; we were interested in understanding the sensitivity of Gaussian processes to the choice of kernel. If you have a completely different model, it makes very different assumptions, and then you have to think about how to understand how those assumptions affect your results. That being said, many models reduce to Gaussian processes. For example, if you have a neural network and you take the infinite limit of that neural network, you make it infinitely wide, it turns out that you recover a Gaussian process. So there might be ways in which you could criticize those kinds of models, under certain assumptions such as making them infinitely wide, via this Gaussian process construction. But if you have a completely different model, then you have to essentially redo this kind of analysis. This is one of the issues you run into when trying to assess sensitivity to modeling assumptions: because every type of model makes potentially very different modeling assumptions, you necessarily have to do this on a case-by-case basis.

So when you think about the diagram that you showed initially, around how model criticism fits into the process, and this being a part of that, do you feel like that's a standard part of the process? Is that something that's being done on a regular basis? Or are models maybe not getting the level of criticism that they need, whether this kind or, in general, this kind of robustness testing? And is that a risk that you see?

Right. So certain types of model criticism are quite standard and are routinely done in practice. Cross-validation, for example, could be thought of as a type of model criticism, and people do this all the time. That being said, this particular type of criticism, of prior assumptions, is not as common in practice. While the robust Bayes community has been advocating for it for a while, it just hasn't caught on. There are several reasons for this. One is that it has a qualitative flavor to it, so it involves a little more work than standard, automated model criticism tools. At the same time, there haven't been good software tools available to let you do this easily. We are hoping to change at least the second bit by making our tool available. You'd still be required to make your qualitative prior assumptions explicit, but the tool would allow you to check some of those prior assumptions you're making.

Well, that sounds like an excellent direction for this work to be moving: to make it a little more practical, a little more usable, and ultimately to ensure that more models have the robustness they need. Indeed. Fantastic. Well, we are at the end of our time today.
So I want to thank you, Soumya, for spending this time with us to explore this topic, and I want to thank everyone in our audience for spending this time with us today. If you enjoyed this, it is part of a series, so go ahead and subscribe and like, and you'll get notifications about some of our other content. And if you're interested in the topic we discussed today, there is some more information in the links accompanying the video. So thanks, everyone, and we'll wrap up. Thank you.