Hello. I'm happy to present today my work on statistical stopping criteria for automated screening in systematic reviews, which I've done with my colleague Finn Müller-Hansen.

In an evidence synthesis project, we have to identify studies, which means separating relevant from irrelevant documents, and this can be very time consuming. Machine learning can help. We can use active learning pipelines, which means we have iterations of humans screening documents, followed by the training of a machine learning algorithm using those human decisions. The algorithm then predicts a relevance probability for all remaining documents, and the documents are ordered in descending order of predicted relevance. There is a fairly large literature on this, but it's worth saying that the work savings which are identified are often only potential work savings, because they depend on prior knowledge of how many relevant studies we're looking for. To actually achieve these savings, we need to stop screening early. In the paper, we go into a little more detail about why existing ways to do this are insufficient.

In this presentation, I want to make a concrete example using some code. We set up a toy dataset with 2,000 documents, of which 100 are relevant, and we represent this as a vector of zeros and ones: a zero is an irrelevant document and a one is a relevant document. The parameter we're most interested in is recall: the number of relevant documents we've seen divided by the total number of relevant documents. Often we have a target level of recall, something like 95%. If we screen the documents at random, then we will have seen 95% of the relevant documents (that's the y axis) only after seeing about 95% of all documents. So this line here represents the achievement of the target. However, if we can use machine learning to identify relevant documents faster, we will achieve this target sooner, and we could save some work. But to save that work, we need to decide when to stop, and we don't have enough information to make that decision. Here it looks like it might be a good time to stop, but when we have the full information about the dataset, we can see that this is too early.

So what we need to do is calculate, using a bit of maths, when it is actually safe to stop. We do that by using information about the documents we've already screened to infer something about the distribution of relevant documents in the documents that are yet to be screened. We imagine that we stop machine-learning-prioritised screening at some point and start drawing at random without replacement from the remaining documents. In probability theory, we use the analogy of an urn with green marbles, which are relevant documents or successes, and red marbles, which are irrelevant documents or failures. Using the hypergeometric distribution, we can calculate the probability of drawing k relevant documents in a sample of n documents from an urn that contains N documents in total, of which K are relevant. I have to skip these slides a little for time, but we can calculate a lot of these parameters. The problem is that we don't actually know K, so we can't simply do this calculation. What we have to do is substitute a value in for K.
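To make this concrete, here is a minimal sketch in Python of the toy setup and the hypergeometric calculation. The use of numpy and scipy.stats.hypergeom, and the particular sample numbers, are illustrative assumptions rather than the exact code from the paper.

```python
import numpy as np
from scipy.stats import hypergeom

# Toy dataset: 2000 documents, 100 of them relevant.
# 1 = relevant document, 0 = irrelevant document.
labels = np.zeros(2000, dtype=int)
labels[:100] = 1

# Recall = relevant documents seen / relevant documents in total.
def recall(seen_labels, n_relevant_total=100):
    return seen_labels.sum() / n_relevant_total

# Screening in random order: after seeing a quarter of the documents we
# expect to have seen roughly a quarter of the relevant ones.
rng = np.random.default_rng(42)
random_order = rng.permutation(labels)
print(recall(random_order[:500]))   # roughly 0.25

# Hypergeometric distribution: probability of drawing k relevant documents
# in a sample of n documents from an urn containing N documents in total,
# K of which are relevant. scipy's argument order is (k, N, K, n).
N, K, n, k = 2000, 100, 500, 20
print(hypergeom.pmf(k, N, K, n))   # P(exactly k relevant in the sample)
print(hypergeom.cdf(k, N, K, n))   # P(k or fewer relevant in the sample)
```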
We substitute that value in order to develop a hypothesis test for a null hypothesis that we have not yet reached our target. So k_tar is the smallest number of relevant documents that would have been in the urn at the point where we start sampling at random, if we had not yet achieved our recall target. We can calculate a p value for this, and if this p value is below our critical value, then we can reject the null hypothesis and say it's safe to stop screening.

We can calculate this at every point along here, and here are the p values we get for each of these little red dots. You can see that at some point we reach a p value below 0.05, where we can reject the null hypothesis that we haven't achieved 95% recall. And this is a little bit after the point when we actually achieve that recall. But on its own that's not so useful, because we don't want to stop prioritised screening just to start a random sample.

So what we can do instead is treat the previously screened documents as if they had been drawn as a random sample. This is a conservative assumption, as long as the machine learning hasn't completely backfired. We take the last one, the last two, the last three, and so on, of the screened documents, and calculate the p value as if we had started a random sample here, here, here and here. Here are the different p values we can calculate for all of the different subsets of previously screened documents from this point, and you can see that at no point is the p value low enough to reject the null hypothesis. So what we do is calculate this minimum p value at every point during screening, and this gives us these values here. It shows that after seeing about three quarters of all the documents, we have enough evidence to say it's safe to stop screening.

We tested this on 20 systematic review datasets, simulating 100 machine-learning-assisted reviews of each dataset, and our criterion performed reliably. You can have a look at the paper to see how this compares to other approaches, but it's the only one that is actually reliable in this sense.

So thanks very much for listening. All of the code you need to reproduce this, or to calculate this p value using your own data, is in this presentation (a short sketch follows below), and it is also available online, as is all of the code. Thanks very much.
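The following is a minimal, illustrative sketch in Python of the minimum p value calculation described above, again using scipy.stats.hypergeom. The function name and the way k_tar is spelled out in the comments follow the description in this talk rather than any particular released implementation.

```python
from scipy.stats import hypergeom

def stopping_p(screened_labels, n_total, recall_target=0.95):
    """Minimum p value for H0: 'the recall target has not yet been achieved',
    treating the last 1, 2, ..., len(screened_labels) screened documents in
    turn as if they had been drawn as a random sample from the remaining
    documents."""
    n_seen = len(screened_labels)           # documents screened so far, in order
    r_total = sum(screened_labels)          # relevant documents seen so far
    p_min = 1.0
    for n_sample in range(1, n_seen + 1):
        sample = screened_labels[n_seen - n_sample:]
        k = sum(sample)                     # relevant documents in the pretend sample
        r_al = r_total - k                  # relevant documents seen before it
        n_urn = n_total - (n_seen - n_sample)   # documents left when the sample began
        # Smallest number of relevant documents that would have been in the urn
        # if the recall target had not yet been achieved at this stopping point.
        k_tar = int(r_total / recall_target) + 1 - r_al
        if k_tar > n_urn:
            p = 0.0                         # null impossible: too few documents remain
        else:
            # P(k or fewer relevant in a random sample of n_sample draws
            # from an urn of n_urn documents containing k_tar relevant ones)
            p = hypergeom.cdf(k, n_urn, k_tar, n_sample)
        p_min = min(p_min, p)
    return p_min

# Example usage: after screening 1500 documents of the toy dataset and having
# found all 100 relevant ones among them, the p value is far below 0.05,
# so we would reject the null hypothesis and stop screening.
screened = [1] * 100 + [0] * 1400
print(stopping_p(screened, n_total=2000, recall_target=0.95))
```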