Welcome to the next talk. Our next presenter is Steve Schwager, a professor emeritus at Cornell University and a research principal at Medidata Solutions. Steve has also been part of the R/Medicine conference planning this year. He'll be talking about using R to detect outliers and anomalies in clinical trial data.

Good. Thank you, and thanks to everyone for joining us. I'm going to talk about using R to solve, or at least to take a big bite out of, a very old problem that I would have thought we'd have made more progress on by now. Outliers and data anomalies are still very much with us in the clinical trials world.

Here is why we need to solve this problem: the overall success rate of clinical trials, starting from phase one and going on through approval, is under 10 percent. You can see the transitions, phase one to two, two to three, and so forth. These trials take a long time and cost a lot of money, so we need to do better, and we can't afford to make mistakes with the data.

A few slides of basic principles. Errors in clinical trial data can greatly reduce the speed of a trial and can even lead to the trial failing. An important point, which I just alluded to, is that major data quality issues occur much more often than people tend to think, including me, before I got to know the terrain more closely. I have some examples of this later on. There are substantial data inconsistencies within and between clinical research sites, unreported adverse events, and substantial differences among sites or regions in many respects: adverse event reporting, protocol compliance, and others. Some sites do much worse than others, some do better, and many make mistakes in specific areas which differ from site to site. And then there's site misconduct, which is also known as fraud.

Many of the problems I just described cannot be identified, let alone solved, by complete source data verification, where we check everything against the records as they came in. The errors come in from other places and get into the written records, so there's no way to find them by that path. As a result, risk-based monitoring and centralized statistical monitoring have emerged as alternatives to source data verification. The FDA issued draft guidance for industry on risk-based monitoring in 2013 and updated that guidance in March of last year. The goal, just to be clear, is to increase the efficiency of clinical trials by maintaining a high level of data quality while keeping the primary focus not on the whole data set but on the subset of trial data that can affect the quality of the study.

A few more basic principles. Maintaining high data quality is a lot easier said than done, because there is an enormous number of data points, usually in the millions, and outliers, errors, and anomalies in the data can occur in a limitless variety of ways; there's always another way it can happen. So we need to accomplish these goals in an accurate, automated, and comprehensive way; it's too big a job to work through manually. Some key components are clustering of patients and variables, univariate and bivariate error detection, site trends, and fraud detection. These errors in the data arise from many different places: data entry mistakes, instrument malfunctions, protocol non-compliance, fraud; the list goes on.
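[Editor's illustration: the talk mentions univariate error detection as one key component. Below is a minimal sketch of one common way to flag univariate outliers in R, using a robust z-score based on the median and MAD so that extreme values do not inflate the scale they are judged against. The data frame, column names, and cutoff are hypothetical; this is not the production method described in the talk.]

```r
library(dplyr)

# Flag values more than `cutoff` robust standard deviations from the median
flag_univariate <- function(x, cutoff = 4) {
  ctr <- median(x, na.rm = TRUE)
  scl <- mad(x, na.rm = TRUE)   # median absolute deviation, scaled to ~SD
  if (is.na(scl) || scl == 0) return(rep(FALSE, length(x)))
  abs(x - ctr) / scl > cutoff
}

# Hypothetical lab values from two sites, with one implausible entry
labs <- data.frame(
  site_id = rep(c("S01", "S02"), each = 5),
  sbp     = c(118, 125, 131, 122, 119, 135, 128, 900, 140, 133)
)

# Flag suspicious systolic blood pressure values within each site
labs %>%
  group_by(site_id) %>%
  mutate(sbp_outlier = flag_univariate(sbp)) %>%
  ungroup() %>%
  filter(sbp_outlier)
```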
Now, R is extremely well suited as a programming environment for developing the broad suite of algorithms we need to detect these potential problems. R has the flexibility to handle diverse types of data: numeric, text, binary, single measurements, time series, event-based outcomes; it's not an exhaustive list, but it's enough to give you the idea. What we need to do is take an entire clinical trial database, process it quickly, select the appropriate tools for each variable and each pair of variables, execute the procedures, report the results, and flag the anomalous results for investigation. That's a lot to do.

R includes some very powerful tools with crucial capabilities for aspects of what we need to do. For data wrangling, there's the tidyverse. For DevOps, the integration of software development and IT operations, there are usethis, testthat, and log4r. For parallelizing CPU-intensive computations, there are the foreach and future packages.

Patients should be clustered using multivariate methods based on the complete clinical data, so that we can detect anomalies in a sophisticated way, combining information from several related variables. The methodology should be consistent in scoring and comparing data quality across variables, across patients, across clinical research sites, and across studies, if we have a whole portfolio of studies. And we want to do this using best statistical practice: methods like robust distribution fitting and regression, and assessment of residuals and influential data values.

This methodology enables an approach to risk-based monitoring that lets us do several important things. We want to identify anomalous data values, patients, and sites for early action; we want to find and fix them at the earliest possible stage of the trial, not after the fact when it's too late to do anything about it. We want to cluster patients for advanced anomaly detection and compare data quality scores across variables, patients, sites, and studies. If we can put all this information together, it will translate directly into higher data quality, reduced monitoring effort, because this is going to be automated, and the ability to address and ameliorate early site and patient issues that could compromise the trial's success if we don't detect them.

A few words, a couple of slides, about clustering patients and variables. The clustering I have in mind, the kind we use at Medidata, is distance-based clustering. We need to compute a distance matrix between all pairs of variables; the distance metrics are based on the variable type. The variables can be numerical, binary, or categorical, and we have different ways of computing distance for each of those, and for pairs of those. We also have to compute a distance matrix between all pairs of patients. There's a quantity, the Gower distance, that's a good place to start, but it needs to be robustified and adapted, because for each patient there are numerical, categorical, and binary variables that have to be combined, so there's a lot of thinking that needs to go on. It's also useful to have weights based on the CDISC Study Data Tabulation Model. That's a talk for another time. Here's an example of a picture of what clustering patients means. These aren't patients, but they're close enough; you get the idea.
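[Editor's illustration: a minimal sketch of distance-based patient clustering on mixed-type data, using the plain Gower distance from the cluster package rather than the robustified, SDTM-weighted version described in the talk. The data frame and its columns are hypothetical.]

```r
library(cluster)

# Hypothetical patients with numeric, categorical, and binary variables
patients <- data.frame(
  age  = c(34, 61, 47, 29, 58),
  sex  = factor(c("F", "M", "F", "F", "M")),          # categorical
  diab = factor(c("Y", "N", "N", "Y", "Y")),          # binary
  sbp  = c(118, 142, 131, 109, 150)                   # numeric lab value
)

# Gower dissimilarity handles numeric, binary, and categorical columns together
d <- daisy(patients, metric = "gower")

# Hierarchical clustering on the patient-by-patient distance matrix
hc <- hclust(as.dist(d), method = "average")
plot(hc, main = "Patients clustered on mixed-type clinical variables")

# Cut into k groups and look for small, isolated clusters worth reviewing
grp <- cutree(hc, k = 2)
table(grp)
```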
As for clustering variables, different distance metrics are needed, again, for the different kinds of variables: numerical, categorical, and binary. Distance metrics can be based on Spearman correlation, Cramér's V, or Somers' D; again, there's a lot of thinking that has to be done. If there are 1,000 variables, then there are n(n − 1)/2 = 499,500 variable pairs. That's more than I want to look over manually. So clustering, based on computing these pairwise associations in an automated way, often reveals a few thousand pairs, a manageable number, where variables sit together in a cluster. And again, here's a quick picture of a clustering of variables.

Okay. Detecting univariate anomalies, this is outlier detection. I think I'm going to skip this slide to get to some other material; this is something we would cover in a first course in statistical methods. Bivariate anomaly detection is a little more complicated, because it's possible that each individual variable looks okay, but the pair of them looks odd. Someone who's four foot nine and 300 pounds, for example: either of those values might be fine by itself, but together, not so much. So, again, we need different methods for each combination of data types, and this is where the flexibility of R allows us to do different things for different kinds of variables. When we have a variable pair, we should treat each variable as predictor and as response, in both directions, for symmetry, and work out how to identify and address outliers; a small sketch of one such bivariate check appears below.

Okay. This is a standard slide that I'm going to go over quickly, because you've probably seen it a dozen times: the five Vs of big data, volume, variety, velocity, veracity, value. Variety is relevant to us because, as I've been saying, we have many different kinds of variables in a clinical trial. And veracity, well, that's what this whole talk is about. If we can't believe the data, if the data have errors and anomalies and values that clearly, or not so clearly, aren't right, that's going to really sabotage our analysis.

We can use machine learning to find outliers. This is just a bunch of bivariate plots of pairs of variables. A quick example of something that can go wrong: most of the data lie on an arc from the upper left to the lower right, a kind of hyperbolic arc, but there's a line going from the lower left to the upper right. When we focus on that line, it turns out that all of those observations were taken at one site, which clearly was doing something wrong, because it is so different from all the other sites. Finding this and fixing it is an example of what we need to do.

Another example comes from a study we did at Medidata. We looked at 40 clinical trials. I'm going to go quickly through the details, but the bottom line is that every one of them had some kind of substantial data problem of the kind we're talking about, one that could sabotage the success of a trial. This next slide is a summary of the data from the slide you just saw. When I said there were examples coming up, this is the example, and this is just a repeat of what I said before.
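[Editor's illustration: the talk's four-foot-nine, 300-pound example is a bivariate anomaly, where each value is plausible alone but the pair is not. Below is a minimal sketch of one standard way to flag such pairs, a robust Mahalanobis distance based on the minimum covariance determinant (MCD) estimate, so that the outliers cannot mask themselves. The data are simulated; this is an illustration, not the production method from the talk.]

```r
library(MASS)

# Simulated height/weight pairs plus one planted bivariate outlier
set.seed(1)
height_in <- rnorm(200, mean = 67, sd = 3)
weight_lb <- 5.5 * height_in - 200 + rnorm(200, sd = 12)
height_in[1] <- 57    # 4 ft 9 in
weight_lb[1] <- 300   # plausible alone, implausible together

xy <- cbind(height_in, weight_lb)

# Robust center and covariance via the minimum covariance determinant
rob <- cov.rob(xy, method = "mcd")

# Robust squared Mahalanobis distance of every point from the bulk of the data
d2 <- mahalanobis(xy, center = rob$center, cov = rob$cov)

# Flag pairs far outside what a 2-dimensional normal bulk would allow
flagged <- which(d2 > qchisq(0.999, df = 2))
flagged
```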
Here's another kind of problem that can occur at a site. This is one site with five very suspiciously similar patients. You look at those five patients and say, boy, they're just identical; their labs are identical. At a good site, the labs would look like the ones on the right, with some variability. Well, it turns out on investigation, the sponsor found that all five of those patients were completely fabricated. They were all alike because the site just made up the same thing and put it down five times, and they had been entering fake data for two years without detection.

A final example: in another study, we looked at 10 clinical trials run by top-25 global pharma companies, and you can see there were a large number of avoidable data quality issues. Some of them were so serious they could have delayed approval. It turns out that 26% of those avoidable data quality issues, 118 out of 453 across the 10 studies, had the potential to delay drug approval.

So here's my conclusion. R has major strengths for developing new approaches to familiar problems like this one, as well as innovative methods for new problems. Implementing risk-based monitoring and centralized statistical monitoring can provide valuable help in locating anomalies in data values, patients, and sites, increasing data quality, reducing monitoring effort, and identifying early site and patient issues that could prevent clinical trial success. So thank you very much. Beth, that's it for me.

So there are a couple of questions. One of them is: would anomaly detection constitute double dipping with subsequent analyses? That's an excellent question and a timely one, in view of Daniela's great talk. What I would say is, if we look at a value and have substantial confidence that something really did go wrong, if there's a clinical case to be made, for knowledgeable physicians to say, that just couldn't happen, that's not right, we don't know what did happen, but that value is wrong, then I think there's a good case on those grounds for pulling it out. If there's doubt, that's a different story, and one could consider running the analysis both ways, with and without those possibly wrong values. But some values, when you see them, you know that they could never have happened, and I see every reason to pull those out.

Another question, or a thought, was that there is the pointblank package by Rich Iannone at RStudio to automatically detect anomalies, so that might be something worth looking into. Very much so. Yeah, it's called pointblank. Is that by Rich? Yep. OK, thank you. I am writing that down as we speak. OK. I think we're about ready to go into the next session. OK.
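[Editor's illustration: a minimal, hypothetical sketch of the pointblank package mentioned in the question above, declaring validation rules for a trial data table and then interrogating it. The table, columns, and thresholds are made up for illustration and are not from the talk.]

```r
library(pointblank)

# Hypothetical lab table: one duplicated row and one implausible value
labs <- data.frame(
  patient_id = c("P01", "P02", "P02", "P04"),
  sbp        = c(118, 142, 142, 301),
  visit      = c(1, 1, 1, 1)
)

agent <- create_agent(tbl = labs, label = "Hypothetical lab data checks") %>%
  col_vals_not_null(columns = vars(patient_id)) %>%                  # no missing IDs
  col_vals_between(columns = vars(sbp), left = 60, right = 260) %>%  # plausible range
  rows_distinct() %>%                                                # no duplicated rows
  interrogate()

# HTML report of which validation steps passed or failed
get_agent_report(agent)
```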