I'm not sure what's going on with your sound right now. OK. Much better. All right. Thank you. Welcome, everyone, to the first session for today. We're going to start this session with a 10-minute talk by Steve Master called Reproducible Data, Reproducible Analysis: A Model for Clinical Laboratory Data Use. Steve is still muted. Am I on now? All right. It looks like I'm unmuted. OK, thanks. What I want to do with this brief talk format is to discuss something related to reproducible data and reproducible analyses, something that I think is of great interest to those of us in the R Medicine community. But I want to talk about it in a slightly different way: I'm going to talk about it in relation to the clinical lab, because laboratory medicine is, in many ways, an ideal venue for R. The information that comes out of the clinical lab is already datafied, and it's largely quantitative. There's a rich data source from the lab: typical large academic medical centers generate on the order of 15 to 20 million results per year. Those results are then stored in a dedicated lab information system, and that information can, of course, be tied into information in the larger electronic medical record for outcomes research and the like. And, of course, laboratory data are a significant contributor to patient management. For these reasons, I think laboratory data are attractive not only to those in the medical laboratory community, but also, once those data make their way into the larger EMR or into, say, a healthcare enterprise data warehouse, they are very attractive data for modeling by others in the data analytics community. And we can imagine a large set of use cases, some of which we'll hear about later today. Dan Holmes and Patrick Mathias are both going to be giving talks that relate to some of this.
But, of course, you can also imagine using data from the lab in an operational management capacity: real-time quality control, quality management reports. And reproducible workflows in those kinds of environments reduce FTE utilization and improve result quality. There's a disturbing use of Excel that seems to be hanging on in those contexts that needs to be supplanted by R. Additionally, of course, there's the issue of predictive analytics, which is of great interest. Several papers, for example, came out last year using lab data in part to predict acute kidney injury. Just this past week in the lab medicine literature, a classifier based on gradient boosting came out of a group from Weill Cornell Medicine that looked at using routine, non-molecular lab data to predict COVID-19 status. So, of course, predictive analytics is a very important part of what one wants to do with laboratory data. And when we think about reproducible workflows in this context, I think we typically think about what's happening on the right here in the large, light blue box, whether we're looking at turnaround-time descriptive statistics, whether we're talking about moving averages to monitor quality control for assay drift, or whether we're talking about machine learning diagnostic models, as I just alluded to, for prognostic classification. In all cases, we sort of have an intuition for what in the R world will give us reproducibility: we want audited, version-controlled R scripts; we want reproducible reporting using Markdown, et cetera. And so this is the world in which we typically live and think about reproducibility. But what I'd like to highlight here, just very briefly, is that if we only concentrate on the part of reproducibility that we typically think of in the R Medicine community and ignore the reproducibility of the raw lab data, we will do ourselves a disservice in the kinds of models that we actually produce. What do I mean by that?
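As a hypothetical illustration of the moving-averages quality control idea mentioned above, here is a minimal sketch in Python. The target value, tolerance, window size, and the `moving_average_qc` function are all invented for this example; real patient-based QC schemes use validated limits, truncation rules, and analyte-specific windows.

```python
import numpy as np

def moving_average_qc(results, window=20, target=100.0, tolerance=3.0):
    """Flag possible assay drift when the trailing moving average of
    results leaves target +/- tolerance (a simplified sketch of a
    moving-averages QC rule)."""
    results = np.asarray(results, dtype=float)
    kernel = np.ones(window) / window
    ma = np.convolve(results, kernel, mode="valid")  # trailing means
    flags = np.abs(ma - target) > tolerance
    return ma, flags

# Simulate a stable period followed by an upward drift on the analyzer.
rng = np.random.default_rng(0)
stable = rng.normal(100, 1.0, 200)
drifted = rng.normal(105, 1.0, 100)
ma, flags = moving_average_qc(np.concatenate([stable, drifted]))
print(flags[:150].any(), flags[-50:].all())
```

The point of the sketch is only that the smoothed series stays inside the limits while the process is stable and leaves them once the drift is fully inside the window.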
Well, the problem is that lab data may not always be as reproducible as they appear, even though they appear to be the ideal data source for the kinds of analytics that we'd like to do. Again, they're quantitative, and they're already datafied. And of course, one would assume that taking the same measurement with two different machines would give the same result. In fact, that's not always the case. What I'm going to argue, very briefly, is that a multidisciplinary team, like Robert Gentleman talked about yesterday, will be required to build truly reproducible models from lab data in healthcare. And this is going to require the domain expertise both of laboratorians and of data analytics groups. So let me give one example of what I'm talking about when I say this. I'll use the example of standardization, or harmonization. Typically, when one thinks about lab data, I think there's an assumption that it all looks like this. The next few slides are actual data from a crossover study that I did several years ago, moving from a Beckman set of instruments to a Siemens set of instruments. And I think the world typically assumes that everything works like BUN here, like blood urea nitrogen, where the slope is one, the intercept is very close to zero, and everything matches up very nicely with the unity line. The reality is that for many assays, this is not the case. Here's insulin on these same two instruments. Now, if you look at the dotted unity line, you see that the actual relationship between the instruments has a slope far above that; the slope is on the order of 1.7. And again, this is true for a number of different assays, and it's up to the person building the model to know which assays are well standardized or harmonized and which are not. And even if you were to go beyond insulin here and say, well, maybe the Siemens just runs a little higher, that's not necessarily the case.
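To make the crossover-study picture concrete, here is a small simulated sketch in Python. The data are synthetic and the 1.7 slope is built in by construction; this is not the study's actual data, and the units and sample size are arbitrary.

```python
import numpy as np

# Hypothetical paired insulin results on two analyzers, where
# instrument B reads proportionally higher (slope ~1.7), mimicking
# the kind of non-harmonized assay described in the talk.
rng = np.random.default_rng(1)
true_conc = rng.uniform(2, 50, 80)            # "true" insulin, arbitrary units
inst_a = true_conc + rng.normal(0, 0.5, 80)   # instrument A measurements
inst_b = 1.7 * true_conc + rng.normal(0, 0.5, 80)  # instrument B measurements

# Ordinary least-squares fit of B on A. (Method-comparison studies
# typically prefer Deming or Passing-Bablok regression, since both
# axes carry measurement error; OLS is used here only for brevity.)
slope, intercept = np.polyfit(inst_a, inst_b, 1)
print(round(slope, 2), round(intercept, 2))
```

A well-harmonized assay like BUN would recover a slope near one and an intercept near zero from the same procedure.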
Here are two tumor markers, CA 15-3 and CA 125. You can see that there's a positive bias with respect to these two instruments in one case and a negative bias in the other. And although you're seeing a lot of scatter in the points that are very high, you'll see that these same trends continue even down at the low end of the range, suggesting again that if one were to build a model that didn't account for instrument type or assay type, then in fact one would not have reproducibility in the way one wants it. And why is this a hard problem? You would think it would have been solved years ago, but depending on the assay, this can be very tricky to solve. There may be no reference materials. If there are reference materials, those reference materials may not really mimic a patient sample. If they do mimic a patient sample, there could be variability in what's being measured: there may be different endogenous post-translational modifications that affect the ability of one assay versus another to measure a given analyte. There may be different methods. Of course, there are different manufacturers, and sometimes no gold standard. So this lack of harmonization is a known issue within the lab medicine community. We know which assays are better and which assays are worse, and I would argue that this kind of information needs to work its way into the sorts of models that we develop. And why would this be an issue? Well, certainly one can imagine a classifier here. I'm just trying to separate green from red in two dimensions, and I can of course encapsulate that very nicely in a simple decision tree. But if I start to think about the effects of lack of harmonization, or a shift in bias with one assay versus another, very quickly what looks like a perfect classifier can start to call false positives. And in fact, there can be confounding with the outcome if the prevalence differs. Let me show you one other quick example of this.
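The effect described above, where an unaccounted-for bias turns a clean decision cutoff into false positives, can be sketched with made-up numbers. The analyte distributions, the cutoff, and the 20% proportional bias below are all hypothetical, chosen only to show the mechanism.

```python
import numpy as np

# Toy two-class setting: a single analyte cleanly separates "green"
# (healthy) from "red" (disease) with a cutoff at 10.0 learned on
# one instrument. A second, non-harmonized instrument that reads 20%
# high pushes many healthy patients over the same cutoff.
rng = np.random.default_rng(2)
healthy = rng.normal(8.5, 0.6, 500)
disease = rng.normal(12.0, 0.6, 500)
cutoff = 10.0

fp_original = (healthy > cutoff).mean()        # FP rate on training instrument
fp_biased = (healthy * 1.2 > cutoff).mean()    # same patients, biased assay
print(fp_original, fp_biased)
```

The same fixed decision boundary that was essentially perfect on the original instrument misclassifies a large fraction of healthy samples once the bias is introduced, without the model or the cutoff changing at all.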
This is just using some data that we published several years ago; I'm going to take a subset of the data. This was using hematology analyzer information to predict myelodysplastic syndrome. If I just take the top six variables from this model and build a simple logistic regression, I get an AUC of 0.84. If I now take a single analyte, PDW in this case, and create a non-harmonized version of it (and here all I'm really doing is applying the same difference that you saw in the insulin assay I showed you earlier), and then ask what that does to our ability to do classification, now looking at the test set: if I were to train on one analyzer and then test using a mix of the virtual analyzers, what happens? Well, I now lose discriminatory ability with respect to the original assay. Now, this is obviously a very sparse data set, as you can tell by the sort of chunkiness of the ROC curve, but I think you nonetheless get the point: when we start to talk about harmonization issues, we can start to lose power in our models. So I give you this as an example to say that although laboratory medicine remains an ideal venue for R-based reproducible data analytics, that reproducibility needs to begin prior to the analytics. This can't be done in a sort of sandbox of data that is taken in an anonymous way and treated as if it falls from the sky in some platonic way. We need to understand, using domain expertise, what the limitations of those data sets are. And I'd particularly like to highlight the role of laboratory medicine professional societies, such as AACC, which is sponsoring this conference, in actively promoting data analytics literacy, certainly in coding, certainly within the community of laboratorians, but also to highlight that there's the potential for a great partnership between laboratorians and data scientists. Although I definitely fall on the physicians-should-code side of the debate.
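A simulated version of that experiment, training a classifier on one "analyzer" and testing on a mix where a PDW-like feature carries the same 1.7x proportional bias, can be sketched as follows. Everything here is invented for illustration, not the published MDS data: the feature distributions, the `auc` helper, the plain gradient-descent logistic fit, and the choice to bias half the test samples.

```python
import numpy as np

def auc(scores, labels):
    """ROC AUC via the rank-sum (Mann-Whitney) statistic."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = np.random.default_rng(3)
n = 2000
# One informative "PDW-like" analyte (positive-valued, like a real lab
# result) plus five pure-noise features.
pdw = rng.normal(15, 2, n)
noise = rng.normal(0, 1, (n, 5))
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-0.75 * (pdw - 15)))).astype(int)
X = np.column_stack([pdw, noise])

# Standardize with training statistics, then fit logistic regression by
# plain gradient descent (a sketch: no intercept, no regularization).
mu, sd = X.mean(axis=0), X.std(axis=0)
Z = (X - mu) / sd
w = np.zeros(Z.shape[1])
for _ in range(500):
    p = 1 / (1 + np.exp(-Z @ w))
    w -= 0.5 * Z.T @ (p - y) / n

# "Mixed analyzer" test set: half the samples carry a 1.7x proportional
# bias on the PDW-like analyte, standardized with the same training stats.
X_mixed = X.copy()
X_mixed[: n // 2, 0] *= 1.7
Z_mixed = (X_mixed - mu) / sd

auc_orig = auc(Z @ w, y)
auc_mixed = auc(Z_mixed @ w, y)
print(round(auc_orig, 2), round(auc_mixed, 2))
```

Because the analyte has a nonzero mean, a proportional bias shifts the biased half's scores wholesale, scrambling the ranking between the two halves and dropping the mixed-set AUC well below the single-analyzer AUC, which is the qualitative effect the talk describes.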
I would also say that it is certainly the case that data scientists will be needed; laboratorians can't do it all themselves, and neither group will be able to do it alone. And to finish off, I'd like to highlight the fact that there are venues within the laboratory medicine community: the AACC annual scientific meeting is happening in December. It's now fully virtual, December 13th through 17th, and you can go to the link there. What I'm showing you here are a number of the data analytics and R-based sessions being held there. But I'd also like to make the pitch that if you're in data analytics and working with laboratory data, it may be useful to integrate yourself into this kind of community, to learn more about the potential pitfalls in the data source and allow us to achieve true reproducibility. And with that, I'll take any questions. All right. Thanks, Steve. I think, to try to stay on time, we're going to move on. But if you do want to contact Steve with questions, you're more than welcome to do that. Thanks. Thanks.