G'day, I'm Jacques. I'm a software developer at Shopify, based in New York City. I'm sorry I couldn't be there in Austin today. What I'm going to talk about today is a surprisingly tricky question, which is how to rank software projects according to the risk that they present to everybody. Not just to their users, not just to their authors, but to everybody who might be affected directly or indirectly if something bad happens. I'm also going to touch lightly on a sub-problem, which is how to identify obscure projects with disproportionate amounts of risk. I call these Lib Nebraska, and here's why, if you didn't get the reference. You've probably seen this comic before in a supply chain context. In fact, you've probably seen it already at this conference. As an exercise, I would like you to estimate how many times in your life you have seen this comic before. Hold on to that number or write it down, because we'll come back to it later.

Now, here's our agenda for today. First, we're going to talk about what problem we are trying to solve by ranking software projects. Second, we will talk about what makes it so hard to create such rankings. Third, we'll survey some different ways that we can rank project risk. Fourth, we'll zoom in on one such option. We'll end with a brief demo and then some conclusions.

So first of all, what is it that we're solving? What is it that we're trying to achieve? If we zoom out, this is the goal for the work that we do: we want to maximize the amount of risk reduction per unit of expense. This framing may seem a little cold to some. It may seem like a motherhood statement to others. But I think it should be kept in mind while we're doing our work.

And speaking of motherhood statements, let's refresh our memories of what we mean by risk. To start with, risk has two orthogonal components. The first component is event frequency: how often does the risk realization occur? Does it happen once in 10 years? Once per year? Once per month or per day? These are all very different risks. The second component is event magnitude: when the risk is realized, how much damage or loss occurs? Now, in theory you can measure this any way you like, but I like to think in terms of dollars, or whatever currency makes sense to you. And I like this because you multiply event frequency by event magnitude and you get a risk exposure in dollars, which makes it easy to compare the amount of money you've invested in reducing risk to the risk itself. Therefore, a reduction in risk is achieved when we reduce the frequency, or the magnitude, or both, of risk realizations.

Now notice what I am not saying. I am not saying most CVEs closed, or lowest CVSS score, or best box in this five-by-five matrix. I want honest probabilities and honest dollars: crisp stuff we can explain to our bosses and our sponsors, stuff which is unambiguous and which can be meaningfully averaged and compared.

But how much risk are we talking about anyhow? One estimate puts the costs, losses, and damage due to cybercrime at one trillion dollars. That's enough even for Congress to notice. This estimate includes spending on cyber security, or to put it another way, spending on reducing the risk. And here's the kicker: it excludes state action. There's still some gigantic amount of actual and potential risk in play. But back to the goal of reducing risk. To take action to reduce risk, we must first know, one, what the risks are and, two, where the risks are.
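To make the frequency-times-magnitude framing concrete, here is a minimal sketch in Python. The event rates and dollar figures are invented purely for illustration; nothing here comes from the talk's data.

```python
# Minimal sketch: risk exposure as frequency x magnitude (illustrative numbers only).

def risk_exposure(events_per_year: float, loss_per_event_usd: float) -> float:
    """Expected annual loss in dollars."""
    return events_per_year * loss_per_event_usd

# A risk realized about once every 10 years, costing ~$500,000 per event...
exposure = risk_exposure(events_per_year=0.1, loss_per_event_usd=500_000)
print(f"${exposure:,.0f} per year")  # $50,000 per year

# ...is directly comparable to spending, say, $20,000/year to halve its frequency.
residual = risk_exposure(events_per_year=0.05, loss_per_event_usd=500_000)
print(f"Saves ${exposure - residual:,.0f} per year for $20,000 spent")
```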
It sounds easy when you put it like that, but here we run into our first hurdle, which is that there are just a lot of projects that might be a Lib Nebraska: at least thousands. And of OSS projects generally, there are probably millions upon millions. This creates a sparsity problem, which looks like this. Suppose for a moment I survey experts about all of the software projects. This matrix represents experts in the columns and projects in the rows. You can imagine this will be a very, very long matrix indeed. Most of it will be completely empty, because the total number of opinions needed to fill it completely is x columns times y rows, and x times y is a large number, because y, the number of projects, is a large number.

This leads to an obvious next idea, which is: let the computer do it. And that's the idea behind the criticality score, which is our first-line tool for identifying projects with high risk. The criticality score works by slurping up a bunch of available data about projects, feeding these to a weighted index, then spitting out a score between zero and one, where one means riskiest. In addition to the criticality score, we have the Harvard Census II of open source software. This used data provided by multiple software composition analysis vendors to identify packages that are being most used in a proprietary setting.

So given that we have both the criticality score and the Harvard census, it's natural to ask at this point why we need expert input at all. For one thing, objective data is objective: uncolored by human bias and directly comparable between projects. So why not just use the score or the census results and be done with it? Well, to be sure, sparsity can be somewhat solved by letting the computer do it, but there's still another problem to solve. And that problem is incomplete data. Only data which are available in numerical or easily counted form are suitable for the criticality score or the Harvard census. Less observable data may paint a different picture from more observable data. Sometimes this is called the context problem, as in: there's a larger context than what the data reveal.

Take download counts. It's fairly straightforward to obtain such counts for lots of software, and each download represents one usage, one import, one build, one something. Except that's not really true. There's no fixed ratio between the number of downloads and the number of usages. For example, a lot of packages see enormous download traffic due to CI systems without caching enabled, downloading the same dependency every time a commit is pushed, sometimes multiple times per commit, but only one in several such commits may ever make it to production. Or, even worse, one download might be the download for the manufacturer of a small part that winds up in millions, perhaps even billions, of devices in the wild. The download count in this case would be actively misleading. Or perhaps there are many downloads for CI in a fast-moving startup that hospitals rely on for their operational safety. In this case, the download count gives us a first guess at event frequency, but we have no idea of the event magnitude that lies beyond the veil. Download count is just one example, and no doubt somebody has an objection about the details of that example, but the general problem holds. There is knowledge in the world which is relevant to our problem and which is not amenable to simple automated collection. Somebody knows that knowledge. Who are they, and how do we get them to share it with us?
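As a rough sketch of the "weighted index" idea mentioned above, and emphatically not the actual criticality score formula or its parameters, each observable signal can be log-scaled against a cap and blended with weights so that the result lands between zero and one. The signals, thresholds, and weights below are invented.

```python
import math

# Sketch of a weighted index in the spirit of the criticality score: each signal
# is log-scaled, capped by a threshold, and combined by weights so the result
# lands between 0 and 1. Signals, thresholds, and weights are illustrative only.

def weighted_index(signals: dict[str, float],
                   thresholds: dict[str, float],
                   weights: dict[str, float]) -> float:
    total_weight = sum(weights.values())
    score = 0.0
    for name, value in signals.items():
        normalized = math.log1p(value) / math.log1p(max(value, thresholds[name]))
        score += weights[name] * normalized
    return score / total_weight

score = weighted_index(
    signals={"contributors": 40, "dependents": 12_000, "commits_last_year": 300},
    thresholds={"contributors": 5_000, "dependents": 500_000, "commits_last_year": 1_000},
    weights={"contributors": 2.0, "dependents": 2.0, "commits_last_year": 1.0},
)
print(round(score, 3))  # somewhere between 0 (least critical) and 1 (most critical)
```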
This is the point at which I'm saying we need, or at least we want, some kind of human input to any ranking scheme: an automated score by itself is not complete and needs augmentation. But of course, humans are not without problems. For example, if I gather a pool of FooLang programmers, all my opinions will be about FooLang projects. That doesn't give me very good coverage of a very large domain. It's strictly better than nothing, but not by a lot. However, I'm going to hand-wave this problem away for the moment, as expert identification is a separate thread of work for the Securing Critical Projects Working Group.

But the FooLang example does bring us to the next problem with humans: some projects are very famous, and so everybody can give an opinion on them. That's swell, but it doesn't help us find Lib Nebraska, which you will recall is a software project with low recognition and high risk. And there's a double whammy, which is that humans overestimate risk when they can easily bring examples of that risk to mind. Events affecting famous software receive wide coverage, and so estimators will find it easy to bring examples of such events to mind. The main countermeasure here is to guide experts to opine on a variety of projects and to capture their level of familiarity with them. Is this your first encounter with a project? Perhaps you've only heard of it. Maybe you're an expert, or even the co-founder. These different levels give us a handle on the degree to which availability bias will shape your opinions.

Now let's talk about our options for eliciting expert opinions. There are three I will talk about, in escalating order of detail and approval. The first is markets, prediction markets to be precise. In a prediction market, participants buy and sell contracts that pay out a certain amount if a certain event occurs. So if you think something is very likely, you would be rationally prepared to bid up to near the total payout available. If you think that it's unlikely, you would instead sell your contract to someone else. The price therefore fluctuates according to the market consensus of how likely something is to occur. Prediction markets have a really cool feature, which is that everybody is motivated to reveal information they possess through prices. If you have information that nobody else has, you can buy something that's cheap or sell something that's overpriced and profit from your knowledge. But by doing so, you affect the price, and that reveals information to other participants. Because you want to profit, you have no incentive to sit on information that gives you an edge. This logic applies to everyone, so the sum of all available information should, over time, plus or minus a whole bunch of noise, show up in the price for each prediction.

But of course there's a catch. This only works if markets have lots of participants who trade actively. One term for this is liquidity. A highly liquid prediction market has lots of people buying and selling. This means information appears frequently and the price moves up and down quickly. But if the prediction is thinly traded, then the information of only a few people will show up, and it will show up infrequently and at a slow pace, because there's just not enough activity to make the process smooth. This is a serious problem for predicting software project risk. There are just too many projects to meaningfully trade.
While your Linux kernels and your Node.js will see lots of trading action and price risk very quickly, Lib Nebraska may languish for years without a single trade occurring. And this is more than a hypothetical problem. It shows up in other kinds of markets as well. Shares of companies included in the S&P 500 index trade very easily. Shares in the so-called over-the-counter market, not so much. For a comparison of scale, consider that there are several thousand stocks traded on the New York Stock Exchange and NASDAQ, and that trading is the occupation of hundreds of thousands of people, directly and indirectly. But there are potentially millions of software projects. So unless there are billions of people looking to trade in software risk prediction markets, most projects will go without predictions. It should be noted that this problem of sparsity is not unique to markets. It's a problem with everything I'll talk about today, and it's why automation will always remain part of the solution. But what I'm driving at is that without liquidity, a prediction market does not buy you much over other, more direct forms of elicitation, given the amount of effort that is required to set up the market in the first place.

Voting, or what theorists call social choice, is the other great mechanism for aggregating opinions that modern societies rely upon. I used to be in student politics, and then I was involved in politics-politics, for which my mortal soul is damned. But it left me with an enduring interest in voting systems. And there are a lot of voting systems, at least hundreds, if not thousands. In no way could I cover everything worth covering about voting systems today, but I will take a moment to touch on several voting system alternatives that could be considered for the job of eliciting software project risk. It's worth noting a key difference between markets and voting, which is that markets can be arbitrarily precise, but voting can't. In voting schemes we are not making an exchange in a fungible currency; instead we are ultimately trying to establish a ranking between alternatives using some sort of statement of preferences gathered from individuals. In a market you can derive a ranking by sorting by prices, but in voting I can't give two votes for John in exchange for six votes for Annie and find out how they rank according to the exchange rates of the votes, unless something that is basically a market is going on in disguise.

But back to voting systems. I'm going to dispose of the simplest and worst option first, which is plurality voting, sometimes called first-past-the-post voting. Each voter gets one vote and they can cast it for a single alternative. This is a very widely used method, but it's also a very poor method if you look at all the things that can go wrong. In our particular case it's bad for two reasons. The first is the recognizability problem. Famous projects will zoom to the top, and every vote past the ones which establish a project's position in the overall ranking is then a wasted vote. It tells us nothing that we did not already know. The second reason is that plurality voting limits us to the number of experts whose opinions are elicited. If it's one expert, one vote, then even if every single vote goes to a different project, we'll still wind up with a long tail of millions of projects that are never ranked at all. They'll just languish in equal last place.
Approval voting works a little bit like plurality voting, except that voters can now cast a vote for any number of alternatives that they wish. You then tally up the approvals for each candidate, and that establishes the final ranking. Approval voting has some nice properties. It doesn't require every single project to be looked at by every expert; it would be enough that some fraction of experts sees each project at least once. It's very simple, only slightly more complex than plurality voting. It's less likely to suffer from the popularity problem, because there's no need to decide where the single precious vote must be spent. And lastly, it can be presented in a form where a single project is shown at a time, because the actual question is: is this a critical project, yes or no?

There is, of course, a big gotcha. On what basis should an expert cast their approving vote for a project? When you ask, is this a critical project, what does that mean exactly? If you leave it to experts to decide, your results will be skewed by each expert's personal opinion of what critical means. But if you define criticality very precisely, so that each expert uses the same criteria, you will just create a cutoff line that forces, again, a long tail of projects to languish in equal last place, just as happens with plurality voting.

One way to improve matters is to have experts describe their explicit individual ranking of projects, as a ranking is more informative than a simple approval. In elections, this is usually done all at once on a single ballot, with voters saying: this candidate is my first preference, that candidate is my second preference, and so on. This system is used in Australia and some U.S. states. But for our experts, all they need to do is answer questions of the form: is A more critical than B, or is B more critical than A? As with approval voting, we don't need every expert to rank every project. There are counting methods that can establish the aggregate ranking from partial rankings by individuals. But all is not well in paradise, because you need enough individual rankings to create usefully large aggregate rankings, and the total number of pairwise comparisons grows quadratically with the number of alternatives. That's much worse than plurality and approval voting, which were only linear in the number of alternatives.

Now we come to the last voting system that I'll discuss, which is score voting. In score voting, voters don't rank preferences or simply give approval to a preference. Instead, they express some degree or amount of approval. And the nice thing about score voting is that it reveals an even more detailed view of the voter's preferences than ranked-choice voting. There are some downsides, however. The most important is that it makes a key assumption, which is that when you say 2 out of 10, and when I say 2 out of 10, our internal values for 2 out of 10 are the same. That's a mighty big assumption, actually. It may be that you gave 2 because you think the project is rubbish, and I gave 2 because I don't know the project and I'm just being conservative. These are not the same opinion, but score voting treats them as if they were.

So obviously I'm building towards my favorite conclusion, which is that we should directly elicit estimates of risk event frequency and magnitude. The basic logic is this. We probably can't get that much from a market system, and the best of the voting systems is probably score voting.
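To put rough numbers on the scaling difference between approval-style judgments and full pairwise rankings mentioned above, here is a small sketch; the project counts are arbitrary.

```python
# Rough sketch of how elicitation workload scales with the number of projects.
# Approval voting needs on the order of one yes/no judgment per project per
# expert, while a full pairwise ranking needs a judgment for every pair.

def approval_judgments(n_projects: int) -> int:
    return n_projects                           # linear

def pairwise_judgments(n_projects: int) -> int:
    return n_projects * (n_projects - 1) // 2   # quadratic

for n in (10, 1_000, 1_000_000):
    print(n, approval_judgments(n), pairwise_judgments(n))
# 10        -> 10 approvals vs 45 pairs
# 1,000     -> 1,000 vs ~500,000
# 1,000,000 -> 1,000,000 vs ~500 billion
```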
But all voting schemes mix frequency and magnitude together, even though they're separate. Furthermore, score voting has the problem that scores are not comparable between experts. Direct elicitation gets around both problems. Compared to markets and non-score voting, it needs only a single expert's opinion to extract some useful information about a project. And compared to all the alternatives, it can distinguish between frequency and magnitude.

Decomposition is an important thing we can do with direct elicitation that we can't do with voting or market schemes. Decades of research across a variety of fields have shown that estimators who break questions into smaller questions produce better estimates, even if that decomposition is as simple as breaking risk down into frequency and magnitude.

How good are the estimates? Well, if you had an objective way to test that instantly, you wouldn't need estimates in the first place. You'd just directly measure things instead. But that said, there is such a thing as estimation ability, or what meteorologists call forecast skill. And one of the ways to talk about estimation ability is called calibration. An example of calibration is that if I estimate something happens 50% of the time, and then it does turn out to happen approximately 50% of the time, then my calibration is good. If instead it happens 20% of the time, or it happens 90% of the time, then my calibration is poor. You can get much more technical, but that is the gist of it.

Now, it turns out that, out of the gate, most humans are poorly calibrated. We tend to give estimates that are overconfident and, depending on the topic, optimistic. Overconfidence means that I am too certain about my estimate. It means that I didn't pick the correct value or values, but I'm sure that I did. Overconfidence is easiest to see if we change to talking about estimates as ranges. Now, I want you to look back at your existing estimate of how many times you've seen that XKCD comic before, and we're going to turn that into a range estimate. What's the lowest plausible number of times you've seen the comic? And what's the highest plausible number of times you've seen it? If you go on gut feel, overconfidence means that this range will be unnecessarily narrow. I bet very few of you gave two as your answer for the lowest plausible bound, even though it's easy to support with a simple argument: if you've seen it in my presentation, that's one, and you obviously recognized it from seeing it before, that's two. The upper bound can be found by walking down from ridiculous numbers until we start to become more plausible. Have you seen it a billion times? Maybe it feels that way, but it's unlikely. A thousand times? Maybe, if you saw it two or three times a day for years. A hundred times? Probably still a little high, if you saw it every week or so. But as you close in on, say, 50, that number becomes more plausible. For myself, I'd estimate about 30 times as a safe upper bound.

Now, studies show that unless trained, estimators are almost always overconfident. Even when coached to give 90% confidence intervals, estimators will give 70% confidence intervals, meaning that fully 30% of true values will fall outside the range that they provided, and this affects calibration negatively. Optimism is different from overconfidence. Optimism, and its rarer mirror pessimism, are where the entire estimate is skewed towards one end of the possible values.
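Here is a minimal sketch of how a calibration check on interval estimates might look, assuming experts were asked for 90% ranges; the ranges and true values below are invented for illustration.

```python
# Minimal sketch of a calibration check for interval estimates: if experts give
# 90% ranges, roughly 90% of the true values should fall inside them.
# The estimates and true values below are invented for illustration.

def coverage(intervals: list[tuple[float, float]], truths: list[float]) -> float:
    hits = sum(low <= truth <= high for (low, high), truth in zip(intervals, truths))
    return hits / len(truths)

stated_90pct_ranges = [(2, 30), (100, 300), (5, 8), (1, 4), (50, 90)]
true_values         = [12, 350, 6, 9, 75]

print(f"{coverage(stated_90pct_ranges, true_values):.0%} of true values fell inside")
# 60% here: well short of 90%, i.e. the estimator is overconfident.
```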
Suppose I am estimating how long it will take to prepare this talk, and suppose that my estimate is approximately three weeks of full-time effort, as it approximately was. If I estimate one week plus or minus two days, then I was optimistic. If I estimate five weeks plus or minus two days, then I was pessimistic.

The good news is that we're not stuck with these problems. Calibration can be tested, and it can be improved with training. Overconfidence and optimism can be demonstrated to experts by giving them a variety of estimation tasks with instant feedback. Some can be general knowledge, for example: how far is Austin from New York, or how confident are you that the ancient Greeks conquered ancient Rome? Or they can be more specific to the domain, such as: how many CVEs for Java-based projects were reported in 2021? Calibration is also helped by teaching some of the techniques that we've covered in passing, such as estimating by starting with absurd values and working down, or by decomposing difficult estimates into smaller, simpler estimates. There's even a school of thought that you should go further and weight expert opinions according to how well they perform on calibration tests. The theory is that there will be variation between the experts according to their estimation ability or forecast skill, so you don't want to dilute the good estimators by blending them with the worst estimators. I'm ambivalent about this idea. I recognize the logic, but it does mean that we would be throwing away a lot of estimates when we are already facing a sparsity problem.

Now it's time for a demo. And luckily for me, this talk is pre-recorded, so nothing can go wrong. What you're about to see is a prototype of what I'm calling Security Expert Elicitation of Risk, or SEER. SEER is not a production app. The goal here is merely to demonstrate what kind of information is gathered and how it is gathered. First, let's look at who we will be eliciting judgments from. We have four experts, some of whom have the requisite silly names. Next we have a list of projects over which we can gather estimates. There is already a bunch of data here. I'll dive into it in a second, but first I want to move on to estimates. And here they are. As with the projects page, this has a lot of densely packed data. You might be wondering, though, what is an estimate? If we click into one of them, we see this page. On the left are the basic data for the estimate. Who made it? For what project? How familiar are they with the project? The range of magnitudes and the range of frequencies. But those are not the shiny thing that drew your eye first; that would be the plot on the right, which needs some explanation.

I'll step out of the demo for a little while to explain how it comes to be, working forwards from the original estimates until we reach the plots. Let's first start with the decomposition of risk into magnitude and frequency. You can imagine each of these is occurring on some sort of number line: the magnitude in dollars, say, and the frequency in events per year. First we estimate plausible minimal values for these. For magnitude, let's say a dollar, and for frequency, let's say once per year. Next we estimate the maximal plausible values. For magnitude, let's say seven dollars, and for frequency, 1.4 times per year. Finally, we estimate what we think are the most likely values. In statistical terms, these are the modes of the estimates.
The mode is the thing that we think is going to occur most often if we sample this particular range bazillions of times. Now, your first instinct might be to multiply each pair of values to get a final result. You'd get a minimum risk of one dollar times one, which is one dollar; a maximum risk of seven dollars times 1.4, which is nine dollars eighty; and a modal risk of five dollars times 1.1, which is five dollars fifty. But it turns out this is quite misleading, because it assumes that all of the values move in lockstep. In particular, not all values are equally likely. Remember, the mode is meant to be the most frequent value, which the number lines don't show very well. So what we need to do is to show that these values form a distribution in which each value has a distinct probability of showing up in a random sample. For simplicity I am using the triangular distribution here. Statistically minded folks amongst you might prefer a different distribution, but for now the triangular is good enough for our purposes; I'll come back to why later in the talk. The key thing to understand is that the closer you get to the modal values, the more often that value is going to show up in a random sample. So while the minima and maxima are possible values that could be chosen at random, they are much less likely to be chosen than the mode or values close to the mode, and that's why the distribution has this shape.

But we still need to combine magnitudes and frequencies somehow, so what should we do? It turns out that multiplying the distributions is what we want to do, and with infinite computational power we would multiply every combination of values sampled and turn that into a combined distribution. Now, we don't have infinite computational power, but we can do something that approximates it. We take a random value from the frequency distribution, with the selection weighted by probability, and we take a random value from the magnitude distribution, again weighted by probability. We multiply these two samples to get a result. This is a single scenario out of all possible scenarios of risk realizations. We repeat this process, each time selecting a different pair of values; which values we choose each time is weighted by the distributions, meaning that we will pick more values near the modes than near the maxima and minima. We collect the results into a table. Finally, we summarize this table into a histogram: that is, we define bins or buckets of equal size and count the scenarios which fall into those bins. If we plot the histogram, we get the plot that we saw on the estimate. You may recognize this: it's the Monte Carlo technique, allegedly first used in the Manhattan Project to solve particularly gnarly integral calculus problems. It's a simple idea with profound power. But let's get back to the demo.

So, returning to the demo, we use a Monte Carlo approach to create 100 bins from 1 million scenarios. This gives a pretty complete approximation. You could use more scenarios, but there are diminishing returns to scale, so 1 million is probably enough for our purposes. Now, what if I'm unhappy with this estimate? Then I can edit it. This page is also the same page that's used to create new estimates. As you can see, there are values for magnitude, there are values for frequency, and the likely fields, the modes, are set as ranges bounded by the maxima and the minima so that users can't accidentally set impossible values. As we discussed earlier, experts can provide statements of their familiarity with the project.
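Here is a minimal sketch of the sampling process just described, using the toy numbers from above (magnitude between $1 and $7 with a mode of $5, frequency between 1.0 and 1.4 with a mode of 1.1) and the prototype's 100 bins over 1 million scenarios. The use of NumPy's triangular sampler and the fixed seed are my own choices, not something taken from the prototype.

```python
import numpy as np

# Sketch of the Monte Carlo combination described above, using the talk's toy
# numbers: magnitude between $1 and $7 with a mode of $5, frequency between
# 1.0 and 1.4 events/year with a mode of 1.1. Bin and scenario counts match
# the prototype (100 bins, 1 million scenarios).

rng = np.random.default_rng(seed=0)
N = 1_000_000

magnitude = rng.triangular(left=1.0, mode=5.0, right=7.0, size=N)   # dollars per event
frequency = rng.triangular(left=1.0, mode=1.1, right=1.4, size=N)   # events per year

risk = magnitude * frequency          # one scenario per sample, in dollars per year
counts, bin_edges = np.histogram(risk, bins=100)

# The "likely" value we report is the bin with the most scenarios (the mode).
modal_bin = counts.argmax()
modal_risk = (bin_edges[modal_bin] + bin_edges[modal_bin + 1]) / 2
print(f"modal annual risk is roughly ${modal_risk:.2f}")
```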
Now, this is probably the most unrealistic part of the prototype. In a real system, users should not be able to select who they are; it would be attached to the user account, and presumably the project would be set automatically when you start the estimation process. But it's good enough for a first rough cut.

So we've got 10 estimates but we've only got 4 projects, and that means some or all of the projects must have multiple estimates, and indeed that's what we see. Let's take a look at the yellow project, which has the most estimates. There's a lot to take in here, so I'm going to break it down and work my way up from the bottom. First there is a table with estimate values. The frequency and magnitude columns are the raw inputs to the Monte Carlo process, and the risk columns are the output of the Monte Carlo process. This table summarizes the values from the multiple estimates. Note that the risk column is based on the underlying Monte Carlo values, not on simple multiplication of minima, maxima, and modes, as we discussed before. These plots give a graphical representation of each of the magnitude and frequency components of the estimates. The bars represent the full range of the component, and the red dots represent the mode, or likely, estimate. The plots are intended to give a viewer some concept of how the estimates are shaped in their components. Now, I'm not sure that these plots would survive to a final version, since for projects with many estimates they'd become quite long, but they are an easy win for the prototype.

I'm going to skip up to this plot. Shown here are the Monte Carlo results for all four estimates. The idea here is to allow some eyeballing of how different estimates compare to each other. But most important is this innocuous red line. It's the mean of the modes of the distributions of the estimates. Maybe that went by a little fast, so I will restate it: I take the most common outcome from each distribution, and then I average them. Now, this may seem a little primitive, and indeed there are alternative ways of combining information from multiple distributions, some surrounded by very dense thickets of mathematical notation, but taking the mean of the modes is the simplest way to combine the information in these distributions. And in fact it's all we really need to do. Remember, our goal here is to provide a ranking of software projects according to their risk. We aren't as interested in the values of the estimates, though that may sometimes be useful. We just need to know whether A is worse than B, or vice versa. When I promised earlier that I'd explain why triangular distributions are fine for our purposes, this is what I was hinting at. So long as the ordering is monotonic with calculated risk, the process is useful, and we don't need to do anything fancier.

The last plot we'll look at is the possibility plot. The idea is this. Any given distribution gives a range of possible values, and normally we think about this range in terms of probability, but we could instead treat any value with probability above zero as a possibility. That means that all the values in the range have possibility one, and in the possibility plot we show where the ranges overlap by stacking them up. So, for example, at value 70 the possibility plot shows that only one distribution contains that value, whereas at 20 there are three distributions which consider it to be a possible value.
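A small sketch of the possibility-plot idea, reproducing the example just given: for each candidate risk value, count how many estimates' ranges contain it. The ranges below are invented so that $20 per year is possible under three estimates and $70 per year under only one.

```python
import numpy as np

# Sketch of the possibility plot: for each candidate risk value, count how many
# estimates' ranges contain it. Ranges here stand in for the min/max of each
# estimate's Monte Carlo output; the numbers are invented for illustration.

estimate_ranges = [(5, 40), (10, 25), (15, 30), (60, 90)]   # (low, high) risk, $/year

values = np.linspace(0, 100, 101)
possibility = [sum(lo <= v <= hi for lo, hi in estimate_ranges) for v in values]

print(possibility[20])   # 3: three estimates consider $20/year a possible value
print(possibility[70])   # 1: only one estimate's range reaches $70/year
```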
The possibility plot just provides another way to look at, study, and interpret our multiple distributions. But in some sense you can think of the values with overlapping possibilities as having higher, or more, possibility than others, and it turns out there's a whole world of possibility theory which resembles probability theory. The differences, and whether possibility theory is really a thing, are a source of considerable theoretical controversy, but it is from possibility theory that I took the idea of the possibility plot.

So now we have multiple ways of comparing multiple estimates on a single project. What can we do with them? Well, I said earlier that we can rank projects using the mean of the modes of the distributions, and in fact this is already being used to rank projects in SEER. This column, average likely risk, is the mean of the modes for that project, and you're probably now noticing that the table is sorted by this field in descending order. Orange project may have only one estimate, but it's more dire than the combined four estimates of yellow project, and so on. By using the mean-of-modes approach we can have projects with varying numbers of estimates being directly compared with each other in the same units of risk.

Of course, a tool like this is just the starting point. First, we need to identify projects and experts to judge those projects. Secondly, right now the prototype has estimates being made in a vacuum, but obviously you would want to present some menu of information and statistics about a project during the ranking process. For example, you could show the criticality score or the Harvard census ranking, where available. You could show statistics from the CHAOSS project that cover things like number of contributors and time since last commit. You could show lines of code, or programming languages. In fact, there are probably hundreds of potential metrics covering product factors such as size, process factors such as frequency of commits, and project factors such as number of active committers. The tricky part will be to decide what information is shown in order to avoid overwhelming experts. One possibility would be to allow those experts to choose the information they see for themselves. Another would be to use other research to identify the most predictive metrics and to show those, although that leads us back to the original problem of sparse and unobservable data that we were trying to get away from in the first place.

Thirdly, we need to decide some order in which to elicit estimates, which leads to an interesting turtles-all-the-way-down problem. Ideally we would judge Lib Nebraska first, but we don't know which library or package is Lib Nebraska; that's the whole point. This leads to a scheduling or resource allocation problem. I have X available experts and I have Y available projects for them to estimate. How should I allocate Y to X? It may be that we bootstrap it by, say, picking the top 100 or 1,000 projects chosen by the criticality score and the Harvard census, then opening up the process to allow experts to pick and choose. Or we might require experts to take whatever we give them, chosen at random. Or there might be a hybrid, some kind of multi-armed bandit optimization where the estimate scheduler tries to create an optimal mix of estimates given the known unknowns. I'm not sure which of these is best, or even if any of them are any good, and I hope someone who is good with this kind of problem will come forward to help solve it. Lastly, we have to re-estimate.
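Pulling those pieces together, here is a sketch of the final ranking step described above: each project's "average likely risk" is the mean of the modes of its estimates, and projects are sorted by it in descending order. The project names echo the demo, but every value is invented.

```python
# Sketch of the final ranking step: each project gets an "average likely risk"
# (the mean of the modes of its estimates' distributions), and projects are
# sorted by it in descending order. Values below are invented.

projects = {
    "orange": [31.0],               # modal risk ($/year) from each estimate
    "yellow": [5.5, 8.2, 12.0, 4.1],
    "teal":   [2.0, 3.5],
    "mauve":  [9.0],
}

ranking = sorted(projects.items(),
                 key=lambda item: sum(item[1]) / len(item[1]),
                 reverse=True)

for name, modes in ranking:
    print(f"{name:8s} average likely risk of about ${sum(modes) / len(modes):.2f}/year")
# orange ranks first despite its single estimate, matching the behaviour described above.
```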
Estimates can't be static. Projects change and evolve, so it will be desirable to periodically re-estimate them. Ideally we would have the same experts re-estimate the same projects; that's not always going to be possible, so we're just going to have to deal with the variation due to that. And of course we'll get another scheduling problem, which is when and why we should re-estimate a project.

An ideal end state for this process is that we wind up accumulating many estimates that, together with the available metrics, can form a training dataset for a machine learning or statistical learning mechanism. The basic idea is pretty old, originally called the lens method. In lens methods we take multiple metrics and multiple estimates and create linear regressions between the two. These regression equations then typically outperform the median estimator, and linear regression is approximately the simplest thing belonging to the bigger category of "define a numerical relationship between inputs a, b, and c and outputs x, y, and z". The fancy and expensive kind is called machine learning. Ultimately such a program could be used in the estimate scheduling problem: things that look problematic to the trained model could be bumped up the work queue. This would also work for re-estimation; as metrics change, the model would give different predictions of what a median expert would estimate, and projects with increasing predictions would be moved up the work queue.

Now, it's almost lunchtime, so I'm going to keep this short. We should rely on data-driven prioritization to the degree that we can, but to the degree that we can't, we need to rely on expert judgment. To do this, I recommend that we directly elicit estimates of risk frequency and risk magnitude, combine these using a Monte Carlo method, and in turn combine estimate modes into a single average risk level. These can then be sorted to produce a ranking of projects according to their risk. A number of outstanding questions remain before this approach can go into production, but I hope that I have at least inspired your confidence in this approach. Never forget that we are here to prioritize the retirement of trillions of dollars of risk. Taking a little time to solidify our approach is time well spent. Thank you. For folks who would like to read more about expert judgment and probability-based risk assessment, these are some of the books I relied on while researching. And stick around, if you don't feel like lunch, for a blooper reel.
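Referring back to the lens method described above, here is a minimal sketch of the idea: fit an ordinary least squares regression from observable project metrics to elicited risk estimates, then use it to predict what a median expert might estimate for an as-yet unjudged project. The metrics, estimates, and resulting coefficients are entirely invented.

```python
import numpy as np

# Sketch of the lens-model idea: fit a linear regression from observable project
# metrics to the experts' elicited risk estimates, then use it to predict what a
# median expert would estimate for projects nobody has judged yet. All numbers
# are invented for illustration.

# Columns: log(downloads), contributors, years since last commit
metrics = np.array([
    [14.0,  3, 2.0],
    [18.0, 40, 0.1],
    [10.0,  1, 5.0],
    [16.0, 12, 0.5],
])
elicited_risk = np.array([120.0, 40.0, 300.0, 80.0])   # mean-of-modes, $/year

# Ordinary least squares with an intercept term.
X = np.column_stack([np.ones(len(metrics)), metrics])
coef, *_ = np.linalg.lstsq(X, elicited_risk, rcond=None)

# Predict risk for an as-yet unjudged project; high predictions get bumped up
# the estimation work queue.
new_project = np.array([1.0, 12.0, 2, 3.0])
print(f"predicted estimate of roughly ${new_project @ coef:.0f}/year")
```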