I'm going to introduce the next and final speaker in the session, Shirley Wang, who will be presenting "Generating Reliable Insights from Real-World Evidence Studies: Highlights from RCT DUPLICATE."

Hi, thanks. I'm Shirley. I'm an associate professor at Brigham and Women's Hospital and Harvard Medical School, and I'll be talking about RCT DUPLICATE today. Because I only have a few minutes, I'm just going to touch on a few highlights from this project. But if you're interested in more details about the methods, the process, or the results, you can check out the paper that was published in JAMA two weeks ago, or this QR code will take you to a full-length recorded presentation.

For some context, in my field there has been a lot of concern about the credibility of real-world evidence and whether this type of evidence can generate valid causal inferences about the effects of drugs. Real-world evidence is basically the buzzword these days for what I'm talking about: non-randomized database studies that make secondary use of large databases, such as health insurance claims and electronic health records. A couple of years ago, a position paper published in the New England Journal of Medicine generated quite a bit of buzz. It was called "The Magic of Randomization versus the Myth of Real-World Evidence," and the position the authors took was essentially that real-world evidence can't support causal inference because it doesn't involve randomization. They backed up this claim by pointing to a few selected examples where a database study and a trial seemed to be asking similar questions but got different results. As a counterpoint, there was a commentary written by a group of international regulators whose basic message was that this false dichotomy between RCTs and RWE is just not helpful. Moving forward, it's not going to be RCTs versus RWE; it's going to be RCTs and RWE, trying to use the right tools to address the right questions and not just using a hammer because you have a hammer.

So where does RCT DUPLICATE come into this? RCT DUPLICATE is a methods demonstration project, a series of projects aimed at understanding and improving the validity of RWE for regulatory decision making. To do this, we have three related aims. The first is to emulate 30 completed trials using database studies and then to predict the results of seven ongoing trials. What we're trying to learn here is: would we have come to the same causal conclusion if, instead of a trial, there had been a database study? The second aim is to test a transparent and reproducible process with the FDA for evaluating database studies. And the third is to look at factors that predict concordance. For this, we had three pre-specified binary agreement metrics, as well as an overall correlation coefficient.

Here I'm showing you a calibration plot of the trial estimates against the database study estimates for a sample of 32 emulations. You can see that the Pearson correlation overall was 0.8. But zooming in a little, you can also see that quite a few points fall away from the diagonal line indicating perfect calibration.
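[Editor's illustration] To make the calibration and agreement analysis concrete, here is a minimal sketch in Python with made-up hazard ratios; it is not the study's data or code. The definitions of the three binary agreement metrics (regulatory agreement, estimate agreement, and a standardized-difference test) are stated as assumptions, since the talk does not spell them out; consult the published RCT DUPLICATE materials for the exact criteria.

```python
# Minimal sketch with made-up numbers, assuming common definitions of the
# agreement metrics; see the published RCT DUPLICATE paper for the real ones.
import numpy as np
from scipy import stats

# Hypothetical (trial, database-study) hazard ratios with 95% CIs: (hr, lo, hi)
trial = [(0.59, 0.42, 0.83), (0.80, 0.65, 0.98), (1.10, 0.90, 1.35)]
rwe   = [(0.75, 0.60, 0.94), (0.85, 0.72, 1.00), (1.05, 0.88, 1.25)]

def se_from_ci(lo, hi):
    """Approximate standard error of the log hazard ratio from a 95% CI."""
    return (np.log(hi) - np.log(lo)) / (2 * 1.96)

def agreement(trial_est, rwe_est):
    """Assumed definitions of the three binary agreement metrics."""
    (t, t_lo, t_hi), (r, r_lo, r_hi) = trial_est, rwe_est
    # 1. Regulatory agreement: same direction, same statistical-significance status
    reg = ((t < 1) == (r < 1)) and ((t_lo > 1 or t_hi < 1) == (r_lo > 1 or r_hi < 1))
    # 2. Estimate agreement: database-study estimate falls within the trial's 95% CI
    est = t_lo <= r <= t_hi
    # 3. Standardized difference of the log hazard ratios is not significant
    z = (np.log(r) - np.log(t)) / np.hypot(se_from_ci(t_lo, t_hi), se_from_ci(r_lo, r_hi))
    return reg, est, abs(z) < 1.96

# Overall calibration: Pearson correlation of the log hazard ratios
log_t = np.log([t[0] for t in trial])
log_r = np.log([r[0] for r in rwe])
corr, _ = stats.pearsonr(log_t, log_r)
print(f"Pearson correlation of log hazard ratios: {corr:.2f}")
for t_est, r_est in zip(trial, rwe):
    print(agreement(t_est, r_est))
```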
Now keep in mind that the main purpose of DUPLICATE is to calibrate the results of the database studies against trials whose designs we are able to closely emulate. However, we also learn a lot from the trials where we had difficulty emulating certain aspects of the design, and I'll get into that a little more in just a minute.

I want to pull on this thread about bias versus design emulation differences, because bias is what everyone is concerned about with these non-randomized, non-interventional database studies. Of course, if you observe divergence in results between a database study and a trial, bias could be a contributing factor. But design emulation differences can also contribute: things like differences in adherence or follow-up, or aspects of trial design, such as requiring abrupt discontinuation of maintenance therapy at randomization, that you simply can't emulate with clinical practice data. To explore the potential role of design emulation differences, we created an exploratory indicator that separated our sample into pairs with few design emulation differences and pairs with more design emulation differences. On the basis of this split, we had half of our pairs in each sub-sample. In the sub-sample with few design emulation differences, we observed higher correlation as well as higher concordance on our pre-specified binary agreement metrics. In the half with more design emulation differences, there was lower correlation and lower agreement.

Now, there's a whole long list of potential sources of bias and design emulation differences that I could go into, but in the interest of time I'm just going to go over two case studies: one on the issue of time-varying effects, and the other on the potential role of chance or other factors.

For my first example, I'm going to talk about the HORIZON Pivotal Trial. This trial looked at zoledronic acid versus placebo on the risk of hip fracture in patients with osteoporosis. In the database study, we looked at zoledronic acid versus raloxifene, where raloxifene was used as an active-comparator placebo proxy because it is not expected to have an effect on hip fracture. For this particular trial, we were actually able to hit all three pre-specified binary agreement metrics. But if you look at this plot, you can see that it's one of the points falling a little further away from the diagonal line. To get into this a little more deeply: HORIZON Pivotal was a 36-month trial, and over that time frame it had a hazard ratio of 0.59. In the database study, due to low adherence, the median follow-up was 18 months, and over that time frame the hazard ratio was 0.75. Taking a look at the cumulative incidence curves for the trial, you can see that there could be time-varying effects, with greater separation between the curves later in follow-up. However, we don't know what the hazard ratio would have been in the database study at 36 months, because so few patients stayed on therapy. What we could do was digitize the image of the trial result to get an estimate of what the trial's hazard ratio would have been at 18 months. When we did this, we obtained a hazard ratio of 0.75, which is what we got in the database study.
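[Editor's illustration] As a rough illustration of that kind of truncated-horizon comparison, here is a back-of-the-envelope sketch rather than the study's actual method (which digitized the published curve): under a proportional hazards assumption, and ignoring censoring and competing risks, a hazard ratio at a given time point can be approximated from the cumulative incidences read off the two curves. All numbers below are hypothetical, not the trial's published figures.

```python
# Minimal sketch, assuming proportional hazards and negligible censoring and
# competing risks; the cumulative incidence values are hypothetical placeholders.
import math

def hr_from_cumulative_incidence(ci_treated: float, ci_control: float) -> float:
    """Under proportional hazards, S1(t) = S0(t) ** HR, so
    HR = log(1 - CI_treated) / log(1 - CI_control)."""
    return math.log(1.0 - ci_treated) / math.log(1.0 - ci_control)

# Hypothetical hip-fracture cumulative incidences read off a digitized curve
print(hr_from_cumulative_incidence(0.010, 0.014))  # at ~18 months: HR closer to the null
print(hr_from_cumulative_incidence(0.014, 0.025))  # at ~36 months: stronger apparent effect
```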
So this is an issue we observed several times: really long follow-up in the trial with a potentially time-varying effect, coupled with low adherence in clinical practice. When we corrected for this design emulation difference, we observed closer calibration in our results. Some of the take-home points are that it can be really difficult to replicate trial findings with a non-randomized database study when the effect is delayed or time-varying. And this is due to something we already recognize: patients in clinical practice may not experience the full benefit observed in a highly explanatory trial with extensive procedures in place to maximize adherence. This is also known as the efficacy-effectiveness gap.

For my second example, I'm going to talk about the role of chance or other factors. Here's the calibration plot I showed you before, but this time I've color-coded the dots based on whether there were few design emulation differences versus more design emulation differences. Zooming in, you can see that by and large the blue dots, indicating few design emulation differences, tend to fall closer to the diagonal line indicating perfect calibration, with the exception of this one right here. This is the EINSTEIN-PE trial. But to understand what's going on with EINSTEIN-PE, I also have to talk about this one, EINSTEIN-DVT. These are two sister trials, more or less twin trials, with virtually identical designs. They were both non-inferiority trials looking at rivaroxaban versus warfarin on the risk of recurrent venous thromboembolism. The one difference was that EINSTEIN-PE enrolled patients presenting with pulmonary embolism, whereas EINSTEIN-DVT enrolled patients presenting with deep vein thrombosis. Because both had virtually identical designs, both were classified as having few design emulation differences. But you can see that there's a qualitatively different interpretation of the concordance between the database studies and these two sister trials.

To get into that a little more deeply, I'm showing you the same results in a different way. In the top row, EINSTEIN-DVT, you can see that we pretty much met our binary agreement metrics, whereas in the second row, EINSTEIN-PE, we missed on all three. In terms of the trial results, both trials met the non-inferiority criteria they were aiming for, and the p-value for homogeneity between the two sister trials was 0.09, indicating weak evidence of a difference in results. But if you look at the point estimate for EINSTEIN-DVT, it was around 0.7 and the confidence interval just barely overlapped the null, which suggests not just non-inferiority but perhaps some evidence for superiority. In contrast, for EINSTEIN-PE the point estimate was on the opposite side of the null, with a very wide confidence interval. For the two database studies emulating these sister trials, we obtained point estimates of around 0.7 with tighter confidence intervals that excluded the null, which made these estimates consistent with EINSTEIN-DVT but not with EINSTEIN-PE.
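[Editor's illustration] For readers who want to see how a homogeneity p-value of this kind is computed, here is a minimal sketch with illustrative hazard ratios that are only in the rough neighborhood of the two sister-trial estimates described above (consult the original trial reports for the exact figures); it is not the study team's code. It also shows the fixed-effect, inverse-variance pooling that a meta-analysis of such trials would report.

```python
# Minimal sketch with illustrative numbers: Cochran's Q test for homogeneity
# between two trial hazard ratios, plus a fixed-effect (inverse-variance)
# pooled estimate. Values are approximate placeholders, not published results.
import numpy as np
from scipy.stats import chi2

def log_hr_and_se(hr, lo, hi):
    """Recover the log hazard ratio and its standard error from a 95% CI."""
    return np.log(hr), (np.log(hi) - np.log(lo)) / (2 * 1.96)

theta_dvt, se_dvt = log_hr_and_se(0.70, 0.45, 1.05)  # DVT-like trial (illustrative)
theta_pe,  se_pe  = log_hr_and_se(1.10, 0.75, 1.70)  # PE-like trial (illustrative)

w = np.array([1 / se_dvt**2, 1 / se_pe**2])           # inverse-variance weights
theta = np.array([theta_dvt, theta_pe])
pooled = np.sum(w * theta) / np.sum(w)                 # fixed-effect pooled log HR
Q = np.sum(w * (theta - pooled) ** 2)                  # Cochran's Q, df = 2 - 1
p_homogeneity = chi2.sf(Q, df=1)

print(f"Fixed-effect pooled HR: {np.exp(pooled):.2f}")
print(f"Cochran's Q = {Q:.2f}, p for homogeneity = {p_homogeneity:.2f}")
```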
So as we were looking into this and trying to understand what might be going on, we found a meta-analysis of six trials comparing different DOAC agents, the class of medication that rivaroxaban belongs to, with warfarin on the risk of recurrent venous thromboembolism. This meta-analysis found that, overall, there was a modest protective effect of DOACs, with no heterogeneity of effect in patients presenting with DVT versus PE. Now, obviously, the trials included in that meta-analysis involved different agents and other aspects of trial design that may not have aligned exactly with the design elements of the EINSTEIN trials. But I'd like to point out that the meta-analysis included the AMPLIFY and RE-COVER II trials, which are also two trials in the DUPLICATE sample; we were able to emulate those with few design emulation differences, and we met all three pre-specified binary agreement metrics for both of them. So with all of this context, it's still really difficult to say what the potential role of chance or other factors is in this modest separation of point estimates for the two sister trials, because even though they're highly overlapping, they're separated enough that it leads us, as I mentioned before, to a qualitatively different interpretation of emulation concordance. I'd also like to point out again that the results of the two database studies emulating these trials were consistent with each other, with EINSTEIN-DVT, and with the meta-analysis, which found a modest protective effect and no heterogeneity by DVT versus PE.

So some take-home points. There's often a lot more nuance to understanding the replicability of trials with database studies than you can capture with these simple binary agreement metrics. Many factors lead to convergence as well as divergence, and sometimes you really have to dig deeply to understand what's going on. A second point is that when we're thinking about when and how database studies can complement trials, we really need to think about the question being asked and the needs of the end user: are they looking for evidence under ideal conditions or under clinical practice conditions? With all of that said, we note that generally, when the data are fit for purpose and you're using proper design and analysis methods, non-randomized database studies can come to similar conclusions as trials about drug effects. But of course the real benefit of database studies comes from being able to generate complementary evidence, tackling relevant clinical questions where, for various reasons, a trial cannot be conducted. And with that, I'll end by recognizing the DUPLICATE study team. If you have any questions, please feel welcome.

Very interesting. You approach these comparisons between database studies and randomized controlled trials with equipoise. Have you thought about what would happen with the same weapon put in the hands of an industry team that stands to benefit highly if it can pass the regulatory goalpost while eschewing the hard work involved in completing a randomized controlled trial?

Well, I think there are definitely different stakeholders that have a stake in which tool is being used and for what purpose. For us, at least, we were working with the FDA; the FDA funded this project. We tried to take a balanced approach and to be very transparent and reproducible in our assessment of when and how we can possibly use these different sources of evidence. Certainly it could be manipulated one way or the other, and it has been in the past.
So an extreme oversimplification of these findings might be that perhaps we should trust results from observational studies more than some of us may have been inclined to in the past. And an extreme oversimplification of the recent research program of Emily Oster, the economist at Brown, is that very often the effect size estimates we get from observational studies are very dissimilar to the effect size estimates we get from randomized controlled trials; she has a really powerful illustration of this in a study of, I believe, the effect of breastfeeding on child outcomes. I'd be curious if you could help those of us who aren't deeply immersed in these kinds of research questions reconcile the perhaps somewhat different conclusions of these research programs.

Sure. I think it probably comes down to the question being asked. Are you asking the same question? Because if you're not asking the same question, of course you're going to get a different answer. Conceptually you can think you're asking the same question, but you have to really get into the operational details: what are you measuring, what is the estimate you're targeting? If those aren't aligned, then you can't possibly expect to get the same answer.

Hi, I'm curious about one thing. I was struck by how tight the correlation was between the real-world evidence and the RCTs, and I'm curious about your sense or intuition about how far real-world evidence databases meet the characteristics that would allow for emulating causal inference from RCTs, if that's even an answerable question.

So in my field, the joke is that the answer is always "it depends." And it does depend: each database is different, each question is different. The data have to be fit for that particular question. There are some databases that I would only ask certain clinical questions with, and so on and so forth. So I'll just leave it at "it depends."

Thank you, Shirley. Thank you.