Hi everybody, and welcome to the fourth installment of the R Adoption Series. I am Michael Rimler from GSK, and I'm joined by my co-lead on the PHUSE Working Group, Mike Stackhouse from Atorus Research. I'm absolutely thrilled to be here to talk about this working group. The team has done a lot of really good work, and we want to share that with those of you on the webinar. What we really hope is that we can engage in some discussion and learn from your experiences, not just share what this team has come up with. We can move to the next slide, Mike, and I will go through the agenda for today. We'll start with an introduction and background on the PHUSE Working Group project, and then the meat of the conversation today will focus on the use cases. Kevin and Brian will lead a discussion on the work they did with linear models, Min-Hua and Mia will talk about survival, Amy and Mike will then turn to CMH, and we will round out with Andy and Kai on mixed models. Within each section, we hope to do a short presentation giving a little background and the key learnings each of the use case subteams has uncovered, and then really engage with you: whether these match your experiences, whether you have different experiences, questions you might have, hopefully a two-way conversation. When we were putting this webinar together (I don't know if people here have been to some of the other R Adoption Series events that have happened throughout the year) some have had breakouts and some have not, but the idea is to engage the community and have discussion. That's really what we want to spend the majority of the time doing. So if you have experience with any of these, we encourage you to engage in that discussion, and we'll round out with a short closing on the next steps for the project and how you can get involved. With that, I will hand it over to Mike Stackhouse to go through our introduction and kick us off.

Thank you very much, Michael. In beginning this project, some of the conversation really focuses on the fact that if you're here, you're interested in R, and in R in pharma, and you'll have noticed that the mentality in the industry is slowly starting to shift from the status quo of doing what we've always done toward using the best tool for the job. A lot of the discussions over the last few years have centered around validation: going to conferences and seeing presentations about R, the question always being "it's not validated, so we can't use it," right? That was the primary focus. But for a lot of those validation conversations around R, we're starting to find solutions; we're finding our way with validated packages and getting validated environments set up so that we can use them in GxP settings. Those problems are starting to be solved. But as you start trying to adapt to this, new questions arise, because validated packages don't guarantee that your results between SAS and R will agree. In this group, we're focusing not just on R versus SAS, but on the broader fact that statistical programming languages inherently have different implementations of the statistical methods. They're written by different authors; they're built from the ground up.
There are a lot of reasons why one package might not agree with another, and it's a path you need to go down to really understand why. That's what this group has been exploring. What we're going to walk through today, and have some discussion around, is the venture this working group, a collaboration between PHUSE and the R Consortium, has gone down: looking at the implementations of several statistical methods, comparing the results between R and SAS, and exploring the questions you need to ask to understand the implementations, some of the reasons why we get discrepant results, and how we should handle and answer those questions.

So, clinical statistical reporting in a multilingual world: the PHUSE working group this is built on has two overarching goals. One is to document and build a repository of common analyses in clinical trial data analysis. Looking at some of the most common ones we identified, there are a few different efforts or subgroups here, and we chose those methods to give a few different variations of use cases to explore, starting with the common clinical trial analyses that would give us the biggest bang for our buck off the bat. That means outlining differences in the implementations, initially R versus SAS, for those statistical methods, and documenting differences in capabilities and known discrepancies, whether in the package implementations or in the capabilities of one language versus the other. We also have an effort to build a white paper around this, where the goal is to establish a general framework for approaching new analyses: when you go down this path, what is the recommended approach for the investigation, what are the things you can expect to encounter along the way, and what is a framework for effectively communicating those findings or potential issue areas? Because when we're going into a submission, when we're going to deliver results, be it vendor to client, one organization to another, or the submitting company to the agency, how do we outline the things we expect them to encounter, and how should we communicate them? We're very used to stating in our statistical analysis plans that we're going to use SAS 9.4, maybe down to the maintenance release. But what should we expect to have to communicate when we're delivering to the agency or anyone else, so that they understand what they'll encounter when they try to replicate that analysis in R, or in Python, or in Julia, what have you?

This project has quite a few organizations working on it, across different pharmaceutical companies, with participation from vendors, some participation from FDA employees, and academia as well. So we have a pretty well-rounded base on the project team, and both Michael Rimler and I have been very pleased with the participation we've had from all these different organizations.
Michael is an economist by training and a programmer by trade; I'm a data scientist. We both know we don't have the statistical prowess to do this ourselves, and we've had a great amount of participation from these representatives, giving us the expertise to ask the educated questions you need to ask when you go through this effort. The current status of the project: the white paper subteam has kicked off, we've begun outlining the initial pieces of that paper, and we expect that to ramp up even more in Q1 2022 as we go into January after the holidays. And we have the proof-of-concept investigation of the four common analyses Michael outlined at the beginning, which we're going to walk through and open up to discussion today: linear models, CMH, mixed models, and survival analysis. Those have been the four subgroups where teams broke out and independently did this investigation so that we could bring it back together. We have a public GitHub repository for the project content, hosted under the PHUSE organization in the repository CSRMLW. All of these links will be sent out after the webinar so that you can access them, and we'll go over how you can get involved, but all of the content we're producing is inherently public. As we build up this content, there will be a bookdown website coming out to deliver it in a more readable and accessible manner, but currently all the work has been taking place within that repository. So if you want to do a deeper dive after the webinar, you can go view it on GitHub. I'll hand it back to you, Michael, so we can begin the subgroup discussions.

Thanks, Mike. So first we are going to look into linear models. Kevin and Brian have prepared a short intro discussion, and then we'd like to bring it to the wider group to probe what you've learned or see if you have any questions around linear models. In this case, again, all of these comparisons are versus SAS, but we're also thinking about other languages. So Kevin and Brian, over to you.

Thanks, Michael. With linear models, we started by looking at a couple of areas. The first is whether the output that comes from the R linear model functions can be tailored to match the output from the SAS procedures, so one of the first things we looked at was what tools we can use to do that. And secondly, Kevin Putschko will discuss contrasts and how you define them in SAS versus R. Next slide, please.

In SAS, you get the default output from the procedures, but you can also traverse all the pieces of the output and turn them into datasets using ODS TRACE ON along with ODS statements; you can create datasets from pretty much any piece of output in SAS. On the R side, for instance with the lm and aov functions, you can use the broom package with functions such as tidy, augment, and glance; there are the anova and summary functions from the stats package; or you can manually traverse the lists and sublists generated by lm and aov.
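For illustration, here is a minimal sketch of the extraction approaches just described. It is not from the webinar materials; it uses R's built-in ToothGrowth data as a stand-in for the balanced example on the slides.

```r
# Minimal sketch of pulling linear-model output into tidy structures,
# analogous to creating datasets from ODS output in SAS.
library(broom)

tg  <- transform(ToothGrowth, dose = factor(dose))
fit <- lm(len ~ dose, data = tg)

tidy(fit)     # coefficient table: estimates, SEs, t statistics, p-values
glance(fit)   # one-row model summary: R-squared, F statistic, AIC, etc.
augment(fit)  # per-observation fitted values and residuals

anova(fit)                  # the ANOVA table, via the stats package
summary(fit)$coefficients   # manual traversal of the fitted-model object
```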
Next slide, please. Here's a picture: at the top is PROC GLM, and we're using balanced, complete data for a simple model. The SAS procedures are on the top and the R functions on the bottom, and the colored squares show the numbers matching up with the numbers from SAS for this type of model. So this shows that we can grab the pieces of our statistical output for linear models from these functions and get similar output to SAS. Next slide, please. And looking at all the parameter estimates, using the SOLUTION option on the MODEL statement in GLM, we can use the tidy function, and you can see the estimates match for the simple linear model. All right, that's all for my part. Next we're going to talk about contrasts, and we can turn it over to Kevin Putschko.

Thank you, Brian. Contrasts in R are defined a little differently than they are in SAS, but the results can match in some cases, sometimes in many cases. In SAS, basically all your contrasts are defined manually using either an ESTIMATE or a CONTRAST statement. In R, there are many, many ways to define a contrast. One of the easier ways is the contrasts argument of whatever model function you are using, and there are some default contrast methods built in, like contr.treatment, contr.SAS, or contr.helmert, that you can use to quickly run the contrasts you're looking for. However, there is a package called emmeans with a contrast function that, in my experience, is a bit easier to use for running these contrasts and having them match the SAS output. Next slide, please.

There are some cases where the math R and SAS do is a little different. In the top example there, in SAS, running PROC GLM, we have an example of a reverse Helmert contrast, basically comparing one level to all prior levels. In R, if you try to do the same thing using contr.helmert, you will see on the right that your estimates differ, but your test statistics and p-values are the same. That's just because of differences in how R does the math, where the estimates are off by a certain factor. However, if we do the same thing using the emmeans package in the bottom panel, it's almost a copy and paste of the contrast definition, and the estimates then match SAS. So that's what we've learned in our experiments comparing R and SAS, and at this point we'll open it up for discussion about anything other people have seen, or concerns they might have, when working with linear models.
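As an aside, here is a minimal sketch of the scaling difference just described. It is illustrative only, again using ToothGrowth rather than the team's data, with contrast coefficients chosen to mirror a reverse-Helmert-style comparison.

```r
library(emmeans)

tg <- transform(ToothGrowth, dose = factor(dose))

# Base R: Helmert coding via the contrasts argument. Test statistics and
# p-values align with SAS, but the estimates come back on a different scale.
fit_helm <- lm(len ~ dose, data = tg,
               contrasts = list(dose = contr.helmert))
summary(fit_helm)$coefficients

# emmeans: the coefficients are written out explicitly, much like a SAS
# ESTIMATE statement, so the estimate itself can match PROC GLM directly.
# The comparison (highest dose vs the mean of the two lower doses) is a
# hypothetical stand-in for the contrast shown on the slide.
fit <- lm(len ~ dose, data = tg)
emm <- emmeans(fit, ~ dose)
contrast(emm, list("dose 2 vs mean(0.5, 1)" = c(-0.5, -0.5, 1)))
```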
Thanks, Brian. Thanks, Kevin. This format is somewhat new, and we're also getting used to Zoom, so please bear with us. The main question I want to ask on any of these is: what are other people's experiences? Have you worked with linear models and compared either implementations or results in R versus SAS? Is your experience similar to what Kevin and Brian are finding? If you have questions or comments, please raise your hand so we can give you the ability to talk, or feel free to drop questions into the chat, and we'll try to make this as seamless a discussion as we can within our technological limitations. Not seeing any hands go up, but I do have a question. Brian and Kevin, it looks like the takeaway here is that, at least with respect to the R and SAS implementations of linear models, we can do very much the same work, the path is clear, and we can even obtain numerical equivalence between the two. Is that right, is my interpretation right?

Yes. What we showed, or at least what I showed, was a balanced data frame with no missing values. As you get into different or more complex models and have missing values in the data, we still have more work to do to document the differences and what you may have to do to get things to match. And some of the functions you run in R may not report all of the output that SAS does, but with some digging, and some trial and error with other functions or packages, you can get those values, if you're not interested in trying to calculate them yourself by hand.

And we have some of this written up and put into our repo, so people can see how you've been able to do this, correct?

Right. Yeah. I do see a question about the rounding issue: was rounding handled before calling lm? We didn't do any rounding beforehand, and you'll notice that when the output gets printed to the console, or in SAS, the rounding is handled differently based on the default display values from the procedure or function. But if you save the results off as a data frame, or in SAS as a dataset, you'll see all the precision there, out to many more significant digits. So hopefully that answers your question.

Do you have an understanding of why the Helmert contrasts from base R, using the lm function, produce different estimates versus emmeans? Not a great understanding. From what I could glean, contr.helmert as defined in base R works by identifying your levels as, I think, binary indicators, whereas using it with emmeans allows for something that resembles more of a typical contrast definition, where I guess the math behind it converts the coefficients to fractions. So I think it's just a difference of linear algebra, and I'm not exactly sure what's going on behind the scenes to put that estimate on a different scale.

Very good, thank you. I should note the philosophy of the working group; no one has brought this up, but I want to comment on it. We started with the philosophy that we're not questioning the validation level of any of these packages; we're simply looking at what's out there and how things compare in this particular use case. And Andy Nicholls, you put a comment in here, so I'll quote it: comforting for a linear model that you can get the same results if needed, because the math should be pretty standard. That's true. What that says is that, based on however a particular organization establishes these packages as reliable, the move between these two languages at the moment doesn't seem to present any additional challenges to adopting either language or using them interchangeably. It'll be interesting to see, as this type of project moves on, whether that holds for other languages, or as we move on to some of the other classes of models today. Any other discussion questions or comments? Again, feel free to raise your hand so we can bring you in and you can actually interact; the platform doesn't allow that to be done easily at the moment, but if you raise your hand we can definitely do that, or drop a question in the chat. All right.
Well, Brian and Kevin, thank you very much for coming today and presenting the work you've done on linear models. We can move to the next slide, Mike. This will be survival analysis from Min-Hua and Mia. There were a few single-day events very recently, later in the year, that presented the work done here in a lot more depth than we can present right now. But for survival, I'll hand it over to Mia and Min-Hua to lead a discussion on what they've been finding.

Thank you, Michael. For the survival analysis, because there exist many survival models and we cannot compare every model, the first step was to identify the most commonly used survival models in clinical trials. Could you move to the next slide, please? Thank you. Generally, the Cox model and Kaplan-Meier methods are the most standard methods in survival analysis. This is a mock-up of a table we typically use to report survival analysis results. The layout might vary among therapeutic areas or companies, but the contents should be very similar if you do standard survival analysis in clinical trials. The first rows are just the percentages of events or censored subjects, so they are simple proportions. The next chunk is the quartile estimates and the landmark estimates, which are based on the Kaplan-Meier method. The last two rows are the hazard ratio and the p-value, which are used to compare survival between two treatment arms: the hazard ratio comes from the Cox proportional hazards model, and the p-value usually comes from the log-rank test. Could you move ahead to the key takeaways slide, please? Thank you.

After identifying the survival models we wanted to compare, we tried several datasets with different scenarios. The results may not match between SAS and R; that may be attributable to different implementations of the algorithms, or it may be due to the data. Where we observed discrepancies, we also tried to explore the reasons for them. Most of the time, R and SAS actually give identical results. When they are not identical, we found there could be two reasons: the first is different default choices, and the second is different algorithms, or different implementations of common algorithms. For an example of different default choices, Michael, could you please move to the next slide? Next one, please. Yes, this one, thank you. The default tie-handling method in R is called Efron, while the default option in SAS is called Breslow. So if ties exist in the dataset, the hazard ratio and its confidence interval can be slightly different if you use R's or SAS's default method. Michael showed a side-by-side comparison of R and SAS results at the beginning of this presentation, and in that slide the discrepancies were indeed due to this default tie-handling reason. Both the SAS and the R default options are valid methods. And since only the defaults differ, you can easily change them: SAS offers R's default method, Efron, and R offers SAS's default method, Breslow. After changing the default options, you can see in this slide that the results are identical.
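To make the default alignment concrete, here is a minimal sketch, assuming a data frame adtte with time, status, and trt columns (hypothetical names, not the team's dataset).

```r
library(survival)

# R's default tie handling is Efron:
fit_efron <- coxph(Surv(time, status) ~ trt, data = adtte)

# Match the SAS default (PROC PHREG uses TIES=BRESLOW unless told otherwise):
fit_breslow <- coxph(Surv(time, status) ~ trt, data = adtte, ties = "breslow")

exp(coef(fit_efron))    # hazard ratio under Efron
exp(coef(fit_breslow))  # hazard ratio under Breslow

# Conversely, SAS can match R's default with TIES=EFRON on the MODEL
# statement of PROC PHREG.
```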
Could you move to the next slide, please? Thank you. The other possible reason we found is that discrepancies may occur through different algorithms, or different implementations of common algorithms. This kind of discrepancy is highly dependent on the data: you might see it in one dataset but not in another. For example, this dataset has 10 observations; the first five observations are all events and the remaining five are all censored. For the median estimate, you can see that SAS and R give different results. The reason is that there are several time points having exactly 50% survival probability; you can see it from the Kaplan-Meier curve. In general, R and SAS use the same logic for the median estimate, but when the data meet certain conditions, like this dataset, there can be slight differences in the algorithm, and you will see discrepancies. There is also another case that depends on how you write the SAS code when the data meet certain conditions; due to time constraints we did not list it here, but it can be found in the GitHub repo.

In conclusion: the Cox model and the Kaplan-Meier method are very standard models in survival analysis, so most of the time R and SAS give identical results. For the cases where they don't give the exact same numbers, especially when it's due to the default methods, even though they are slightly different they are generally consistent when rounded to the level necessary for statistical interpretation. For the second reason, discrepancies caused by different algorithms or different implementations of common algorithms, this usually happens with small-sample datasets, so whatever estimates we get, and whatever software we use, we tend to interpret those statistics with caution. More details, more examples, and the R and SAS code can be found in the GitHub repository. That's my presentation; we can open it up for discussion and questions.

Thanks, Mia. Andy Nicholls put a question in the chat: with respect to language-agnostic code in mind, have we tried changing the options in SAS to match the R defaults? And this is exactly an example of that, with the default tie-handling methods of Efron and Breslow: these two languages' implementations have the same set of options available, just with different defaults. Understanding that, and setting the options the way you want them, you can get, in this case, numerical equivalence, at least to a certain level of significance.

Yes. In reality, at least in my company, the outputs are generally still based on SAS, so when we try to validate those SAS results we tend to change the options in R to match the SAS default, to see identical results. As I mentioned, both options are actually correct, but for validation purposes we still want to see identical results, so we would change the default in R or SAS to match the other software's results.
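As an aside, here is an illustrative toy dataset of the shape Mia described (five events followed by five censored observations); it is a hypothetical reconstruction, not the team's data.

```r
library(survival)

toy <- data.frame(
  time   = 1:10,
  status = c(rep(1, 5), rep(0, 5))  # 1 = event, 0 = censored
)

km <- survfit(Surv(time, status) ~ 1, data = toy)
summary(km)        # the curve drops to exactly 0.50 and then stays flat
quantile(km, 0.5)  # R's median; SAS PROC LIFETEST can report a different
                   # value (or a missing one) when the curve sits at 50%
```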
And I want to point out that this is one of the motivations Mike and I had for getting this project going and getting results like this. To everybody on the call who has worked in survival analysis, particularly these types of analyses within pharma: have you ever thought about which tie-handling method was more appropriate for your particular analysis? What are people's thoughts on this? I have a motivation behind my question, but I'm curious: has anybody ever actually said we should be using Efron, or we should be using Breslow?

Yeah, from my experience, as I mentioned, both methods are valid. We usually don't specify which method we want to use; we tend to use the default method in SAS. However, I do notice that for tie handling there is some literature saying that if the data have many ties, Efron is more accurate than Breslow, so the Efron method is recommended. But most of the time in practice, in clinical trials, we don't see that many ties, so either option should provide very close results, and we can just choose either one. That's my perspective from my experience.

Yeah, and part of our motivation in this project is to start to uncover those differences. Historically we have never put two languages next to each other, right? That has only started to happen over the last couple of years, with the large momentum toward R in parallel with SAS for clinical trial analysis. As we do that, we're starting to uncover these questions that need to be asked. To say you're choosing Efron versus Breslow because one was the default in SAS, and that's the way we've always done it, is not a statistical rationale. You've just provided a statistical rationale: if you have a large number of ties, one method is preferred over the other, and that motivates why you would choose a particular option. What this project is hoping to do is uncover and understand where those differences are, so that those defining and scoping the analyses can make informed decisions about which option to use and why, and also be transparent to any third-party reviewer, for example a regulatory agency, that will be interpreting those results.

And Michael, to that point, we're also working on some sort of algorithm or flowchart to help people work through this kind of discrepancy, to help you identify where things stand for cases where a discrepancy exists but hasn't yet been characterized. Those will be part of the effort going into our white paper, so that's a bit of a heads-up about further development we will provide.

Very good, thanks. We do have a couple of questions in the chat. One from Nikhil about which method provides more accuracy, which I think has been addressed, at least with respect to when there are a lot of ties. I don't know if you want to add any more context to that or not.
I guess the question is whether SAS or R provides more accuracy, but what you were finding, at least, was that to a certain level of precision they were both fairly accurate, right?

Yes, they are both accurate most of the time. Those methods really depend on the data, but as I said, if there are many ties, Efron is more accurate than Breslow. Most of the time the methods do not differ much; they both provide accurate results.

I'm having a tough time finding the attendees. Kiran Martin has a question, and I want to provide the opportunity to come off mute, but I'm not finding him, so I'll just ask the question. Kiran says: I think one of the most challenging results to justify can be where one software calculates an estimate and the other provides a missing value. You have some of that here on this particular screenshot. What are your thoughts on this issue?

Yes, this is because this data has a very special pattern, and it's also a small dataset with only 10 observations, so you see these slightly different results. In practice, when we see a small dataset like this and we see this Kaplan-Meier curve, whether we see a median value or the median shown as not available, we interpret the results with caution. For this dataset I would say both are correct. And because this is simulated data, in practice we usually don't see this special pattern; in most cases these results will be identical. For this special data, I would still say both numbers are correct; whatever the numbers are, we need to pay attention to data with this shape.

Any other questions for Mia and Min-Hua on the experiences so far with survival in R versus SAS? Do you want to dig into Andy's question? He sort of answered his own question, but it says: sure, this is a bit philosophical, but do we need to change results to get an exact match in your PROC COMPARE? An explainable difference is surely okay. From your perspectives, Mia and Min-Hua, what do you think about this issue?

Yeah, I agree with Andy that as long as the difference is explainable, it's okay. We accept the differences; from our perspective they are both correct numbers. However, sometimes when we really need to validate the results against the other software, we still want to change the options to try to match the other software's results. That's my understanding. Min-Hua, do you have anything to add?

I think it also has to do with the situation where you've made your submission and maybe a different software is used between you and the other organization; then we are also trying to cover, to explain, that discrepancy. So that feeds the discussion as well, the need to try to get them matched.

Nikhil offers a follow-up question: how can we document that validated difference?
I think one thing Andy has put in here is an interesting way to do it, which is through a sensitivity analysis: run it both ways and show the numerical match. I would also think it may be difficult to demonstrate in a PROC COMPARE type of way, depending on the situation, although you do have the ability, if you're running PROC COMPARE, to change the criteria so that it still shows there's only a small difference between the results. Mia, Min-Hua, have you put any thought into how you might document the difference when validating, if there is a difference in those statistics?

Yeah. I think it's not just about documenting those discrepancies, but also thinking ahead: when you come to the CSR and you are writing the SAP, you would also document what would be most appropriate, what the defaults are, and what algorithm answers your research question. If you start there, you have everything transparent, and when it comes to the results you might already expect some differences based on what has been found by others or by your own team. So looking ahead, being transparent, and documenting not just at the results end is very critical.

Yeah, I totally agree with Min-Hua. We could specify those methods, for example tie handling, in the SAP, to fix the method beforehand.

Very good, thank you, and thank you for taking the time today to present. I want to move on to the next use case subteam on the agenda, which is CMH, and we have Amy and Mike here to give us a tour of what we've been working on in that space.

Thank you very much, Michael. The CMH subteam has had some interesting findings, and this table here is representative of them: for the analyses SAS puts forth related to CMH, there's really no one solution that meets every requirement within the R package ecosystem. For things like the general association statistic and the MH odds ratio, the mantelhaen.test function from the base-R stats package satisfies those, and to get into some of the other pieces of the analysis there are other packages available, but there are pros and cons to each. That's part of what this group has gone through in their analysis, breaking down these packages. There are also some problematic scenarios that have come up from what the group explored; I'm not going to walk through all of them here. When you look at the CMH tests, what pops out of SAS using PROC FREQ is the different alternative hypotheses, each with its degrees of freedom, statistic value, and p-value. To get the general association statistic, like I showed on the last slide, the mantelhaen.test function can achieve that for you. One of the things I personally found in my experience, when my team at Atorus replicated the CDISC pilot using R, is that the SAP states that treatments will be compared for overall differences using the Cochran-Mantel-Haenszel test, with the alternative hypothesis in SAS being "row mean scores differ."
Going through and trying to replicate this, that alternative hypothesis isn't available in the mantelhaen.test function, and in exploring what is available in R to achieve it, I came across the vcdExtra package. This was an interesting dive we had to do to get the replication to work. In replicating, I found there were some errors I encountered using the CDISC pilot data, and digging into the GitHub issues for vcdExtra, which is a public package hosted on GitHub so you can dive into the source code, there's a note that for large sparse tables with many strata, the CMHtest function will occasionally throw an error from solve.default. Someone in that issue thread actually offered a solution, which the vcdExtra author was hesitant to implement because additional testing would be needed to make sure it works in different data scenarios. But in the replication of the CDISC pilot, I implemented the solution and was able to replicate and get the exact values from the CDISC pilot originally done in SAS back around 2006. That raises some concerns, though, because it doesn't hold up well for the validated environments we use in clinical work: I had to pull the package and make an update to the source code, which creates a whole maintenance loop you need to consider. So it might not be the best choice for a validated environment. Furthermore, as the team dug through and tested vcdExtra in a few different scenarios, they found some other problems in the package with the way types were specified, and some inconsistencies with degrees of freedom and the resulting p-values. This is all outlined in the documentation they've produced.

If we dig into the key takeaways the team found for CMH: to match some of the SAS outputs, more than one R package is needed. While vcdExtra offers some of the most consistent comparisons back to SAS, there's the maintenance concern: the most methodologically mature package is vcdExtra, but it's likely not stable enough for validated environments. Interestingly, a lot of the design of vcdExtra is actually aimed at replicating results presented by SAS. Looking at the packages available in R, one note the team had was that R packages are a lot more sensitive to methodologically questionable designs and data proportions: R speaks out when there are concerns about the integrity of the results being put forth, so you'll get warnings and notes about questionable design, whereas when the same data are run through PROC FREQ with the CMH option, no notes are given. R is a little more verbose in telling you that, hey, these numbers might not hold up. And the last piece was just to consult the documentation for each R function: comparing back to CMH in SAS, you really need to understand the options being chosen. For example, with the mantelhaen.test function, the correct argument applies a continuity correction by default.
When you run PROC FREQ and use the CMH option there, no continuity correction is applied, so you need to turn that option off to get the results to match. Like we've been finding with a lot of these packages, when they're developed independently by different authors, different defaults get chosen. So one of the big things you really need to explore is: what decisions have been made for this analysis, and why? Try to understand what the author was thinking when they produced the function and gave us the parameters that are available, and how to configure those to get the results we want, not necessarily to match the other language, but understanding the defaults chosen in each language. That harks back to the point that CMH is one analysis with a lot of different implementation choices. These are the three packages that the subteam, with Clara, Amy, and Matthew Kumar, put together as addressing the scenarios most consistently. I'd be interested to hear from the audience about any other exploration you've done internally, and findings you've had trying to do CMH in R.

Any questions from people around the work going on within CMH, in this case R versus SAS, or any experience you may have had yourselves, whether with R or otherwise? Then I will ask a question: if someone is looking to report out with these tests, specifying their analyses, and they've traditionally used, let's say, SAS to report out, what would you recommend as the approach if they're going to do this using R?

I think it really comes down to defining the cases you're trying to produce and understanding the approach that's best for each. I've seen different organizations detail the functions they want to use for different pieces, so in that regard it might be a matter of choosing the specific use cases. For example, vcdExtra in this case didn't produce a wrong result once the fix was made, so maybe you pull that in for a specific use case and defend it for yourself; with the way R packages are designed there's a lot of flexibility in how you can do that. But part of what we're trying to do here is outline the best options. One of the reasons I put this table front and center is that the team gave a really great representation of the different pieces of a CMH analysis you might want to do, the tools available to do them, and the scope of those tools, like the notes here that mantelhaen.test works well for 2x2xK designs, or that for the epiDisplay mhor function the odds ratios are limited to a 2x2xK design. And to Andy's point, make sure your teams have a reference, because this isn't something you want to keep in your head. Make sure your organization has the documentation available so you're choosing the right tool for the job when you go to do that analysis, so that everyone's not trying to do this discovery by themselves.
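To make the default concrete, here is a minimal sketch with a hypothetical 2x2x2 table; the counts and dimension names are made up for illustration, not taken from the pilot data.

```r
# Hypothetical response x treatment x center table with made-up counts.
tab <- array(
  c(10, 5, 7, 12,
     8, 6, 9, 11),
  dim = c(2, 2, 2),
  dimnames = list(Response  = c("Yes", "No"),
                  Treatment = c("Active", "Placebo"),
                  Center    = c("C1", "C2"))
)

# mantelhaen.test applies a continuity correction by default in the
# 2x2xK case; PROC FREQ's CMH statistics do not, so turn it off to
# compare like with like:
stats::mantelhaen.test(tab, correct = FALSE)

# For the full set of CMH alternative hypotheses (general association,
# row mean scores differ, nonzero correlation), the vcdExtra package
# offers CMHtest(), with the caveats discussed above:
# vcdExtra::CMHtest(tab)
```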
I also want to come back to the earlier comment. The industry is moving in the direction of R and asking the question, well, does it match SAS, and that's not necessarily the right question to be asking; that's not the philosophy of this working group project. When we look at what's been discussed so far: the outcomes from linear models are almost an entirely rosy story, where you can match and things are very well defined. Then you look at survival, and the key takeaways there are that things might not match but they're explainable, or changing some options can make them match, so you have to decide which path is actually most appropriate for what you're trying to do and be transparent about that. Here we see a completely different type of discrepancy, in the sense that some things available in the SAS world within PROC FREQ may require stitching different packages together, and you have to be more aware of what tools are available. And of course, in the open-source world you can also provide feedback and contribute to the code base. Each of the three classes we've looked at so far has different reasons behind the differences you see when you ask how R compares to SAS and what the right path forward is. To me, the fact that we're seeing different reasons for discrepancies between results is why this sort of initiative and analysis is really, really important. The industry has never really had to face what it means to have different languages doing what we think are similar analyses, and what we have to be careful of. And I know, Mike, you and I agree on this, and I feel there are others on the call who share this philosophy: the results we've always had from SAS are not necessarily the single source of truth. That doesn't mean they're wrong. Small numerical discrepancies may still be acceptable as long as the interpretation is the same, and so far we've seen that it is. I don't know, Mike, if you want to add anything.

Well, Amy, you wrote into the chat; you're a panelist, so you can unmute yourself. Your comment was that statistics, or statistical assumptions, for CMH are important. Do you want to expand on that at all?

Okay, can you guys hear me okay? Thank you; I apologize, I wasn't able to get on video. I just want to say, from the statistical training we got, and in the company as well: when we try to apply something and test cases, we test marginal cases, extreme cases. But in reality, I think it falls to what Mia said earlier: sometimes you have very small datasets, meaning the median is not reached, so you don't use that. For CMH it's the same story: what happens when the data are very sparse, when cell counts are very small, that kind of thing.
I think a statistician I trust will ask themselves: what's the right methodology, what's the right stratification, what's the justification? Be transparent and pre-specify it ahead of time, before you unblind the data. That's what I'm trying to say for any methodology. Sometimes in testing we want to push a package, we want to see how a different package reacts to the lots and lots of things SAS provides. But in reality, for adoption, the first thing we need to ask ourselves is: what is the right, appropriate statistic we want to use here? That's what I can say.

So, Amy, as I try to take in what you're saying: do you think it's fair to say that what we're investigating here, and what we're finding in this case within CMH, is that there may be some assumptions we unknowingly made in the conventional, historical implementations, and those assumptions are now potentially being challenged as we see how implementations look in other languages?

Michael, I think I follow you on that. Like Mike Stackhouse indicated, an R package may be more sensitive. It could be, just because I didn't write that package, that the package considers the general sense of the circumstances in which it will be used, not everything under the sun; SAS, with many, many years of development behind it, may be a different story. For example, to jump back to Mia's example of Efron versus Breslow for ties: in my organization we prescribed using Efron. So in the SAP, the statistical analysis plan, you would specify ahead of time, by assumption, what you are going to use. So, Mike, I would agree. In this group, I think Clara designed a lot of cases, and some cases may not be what CMH was really designed for, but we wanted to see what happens there, as an exercise to push the boundary. For a team using it in reality, it is always good to know what you are trying to solve and what the package's assumptions and development background are. In statistics, I always heard in school that assumptions are everything, so we need to watch out and then apply what needs to be done. Thank you.

Any other questions for the CMH use case? All right, then we can proceed to mixed modeling. I do want to give some credit to Andy Nicholls and Kai Sun. Andy had a paper at PHUSE, I think in 2019, that was actually one of the precursors prompting us to start this group, because that paper did some exploration into some of the commonly found discrepancies between R and SAS across different implementations, and mixed modeling was one of the key pieces there. It's fairly well known that there are challenges getting matches in R for mixed models against what is produced by SAS. So with that, I'll hand it over to Andy and Kai.

All right, thanks. I think I'm the designated speaker for this one. By the way, on the PHUSE 2020 conference: if people are looking for that paper, the other reason I remember is that the conference was canceled three days beforehand because of COVID. It was scheduled for March 2020, and we all know what happened then. But we were able to present virtually, and I believe the recording of that presentation is available online somewhere.
As Mike said, we had the mixed models area of this presentation today. I would like to recognize Kai Sun, Doug Thompson, and Soren and Pauline (Soren, I don't have your last name) for the great help while we worked in this space. The key phrase I'll use for my section today is: more work to be done. Mixed models are a huge area, and the more we got into this, the more we discovered how much we did not know. For background: we've used SAS PROC MIXED to do our mixed models and ANCOVAs for decades, literally decades, and one of the things we discovered as we got into this is how many questions it raises. This includes me as well, because I've done this on countless studies over my 20-year career: I copy and paste from the previous study. Why do I use Kenward-Roger? Why do I use this option or that option? Well, that's what we've always done. So we started asking: why are we doing this, we've been doing the same things, and are we even doing the right thing? We started testing which R functions can replicate what PROC MIXED is doing, to get the same, or at least similar, results. The goal, as was mentioned earlier in this presentation, is not to match SAS precisely, but to verify that we're running the same methods and getting similar, valid results; let's say not exact, but valid. R handles things differently: floating point, rounding, maybe a different order of operations, but it ends up in the same place with respect to the validity of the final results. Next slide, please.

Before we get into the details, some key takeaways: we were able to reproduce a lot of the functionality of PROC MIXED, and the results are broadly aligned, though not exact. As I said, there are many factors behind that. We did some analysis: we made Bland-Altman type plots to look at what the differences were, and we didn't notice any trends; the differences were very small and randomly distributed, with no systematic differences between the languages. So that was good news. But as I said, there's a lot more research to be done. We had a very small sampling of data; we did not take a thousand studies and run a thousand simulations. But based on the data we had, these are what we observed. Next slide, please.

I'll go through this quickly; we can cover it in the Q&A. We analyzed four different R model fits to replicate what we see out of PROC MIXED: two using gls, one using the lme function, and one using lmer. And for some of the least-squares means, we were able to use the emmeans function, which I know one of the previous groups mentioned as well. You can see the code here on the screen that we used in R, and in the bottom right corner the SAS PROC MIXED code. Next slide, please.
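For orientation, here is a minimal sketch of the kind of R counterparts being described. The data frame adlb and the column names (CHG, BASE, TRT, VISIT, SUBJID) are hypothetical, and the covariance choices are illustrative, not the team's exact models.

```r
library(nlme)
library(lme4)
library(emmeans)

# nlme::gls, the closest analogue to PROC MIXED with a REPEATED statement
# and TYPE=UN: unstructured within-subject correlation plus per-visit
# variances.
fit_un <- gls(CHG ~ BASE + TRT * VISIT,
              data = adlb,
              correlation = corSymm(form = ~ 1 | SUBJID),
              weights = varIdent(form = ~ 1 | VISIT))

# TYPE=ARH(1) has no single R option; it is composed from corAR1 plus
# varIdent, as mentioned on the covariance-mapping slide.
fit_arh1 <- gls(CHG ~ BASE + TRT * VISIT,
                data = adlb,
                correlation = corAR1(form = ~ 1 | SUBJID),
                weights = varIdent(form = ~ 1 | VISIT))

# lme4::lmer, a random-effects formulation of a similar model:
fit_lmer <- lmer(CHG ~ BASE + TRT * VISIT + (1 | SUBJID), data = adlb)

# Least-squares means by arm and visit, analogous to LSMEANS in SAS:
emmeans(fit_un, ~ TRT | VISIT)
```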
We did want to show you one example of the output here: an example using the log-likelihood and the estimates. The thing that came up more than anything else, I think, is the degrees of freedom. There seem to be some differences between SAS and R even in determining the degrees of freedom; a lot of the other things are very similar, but the degrees of freedom can of course have an effect on the results, so even when using Kenward-Roger we sometimes found different degrees-of-freedom issues. That is, again, something we need to investigate further. You can see some of our sample sizes there at the bottom across the three visits; we did simulate some attrition from one visit to the next in the study. All right, next slide.

The R results are at the top of the screen, the SAS results at the bottom. On the left side are the four different fits we ran: two with gls, one with lme, and lmer. We had very similar results, in this case exact results to the third decimal point. If you look at the bottom, if you multiply the log-likelihood you get from R by -2, you get, within a couple of decimal points, the same value you see from SAS. At the same time, on the right side, the estimates we get from SAS in this case match almost precisely what we got from R: we looked at the estimates, the standard errors, and the p-values, and the differences are very, very small. All right, next slide.

I want to get to the Q&A, so I'll go through this quickly. We have a guide mapping the covariance matrix options between SAS and R, unstructured, heterogeneous, and so on: what the SAS option is, and what the equivalent option name is in, let's say, gls, lme, or lmer. It's not a one-to-one mapping in every case; with something like heterogeneous first-order autoregressive, you need to combine a couple of things, like varIdent, to be compatible. But these are guidelines for where you can start to get the equivalent covariance matrix options. Then one more slide, or two more slides. This one shows a quick example of some of the residual analysis: as you can see, the residuals are scattered around the zero line, so we're not seeing any systematic differences, and these are very, very small differences compared to the actual results; if you're seeing values like 1500, these are small differences. It's just to show there are no trends that we observed.

The last slide I want to cover before we turn it over to Q&A: there is a lot of future work to be done. This is an enormous area, and it goes back to not only the size of the area but to questioning whether we have been doing the right things in SAS all along. Why do we always default to Kenward-Roger? Is that the right thing to do? Should we be doing Satterthwaite, or some other degrees-of-freedom estimate? We need to look at what happens when we add in RANDOM statements, random effects, or repeated measures; we have a lot more work to do there. This is the tip of the iceberg. We also need to test on a much broader sample of data, bigger sample sizes and smaller sample sizes, to see how everything behaves in different scenarios, and especially do more work in the degrees-of-freedom area. So with that, Kai, I know you're on the line if you wanted to add anything before we turn it over to the Q&A.
I agree with some of the points Amy brought up. There are two aspects to our testing of these different languages. One: we tested it like software; we tried to push it to the extreme and see how the software reacts. The other aspect is statistical: what is reasonable statistical practice, and what is the reasonable range of data we will see and can draw conclusions from? There are differences in these two areas as well, and as a statistician you have to be cautious about both, regardless of the tool you use. Maybe I'll leave it at that and take any questions regarding our work.

Does anyone have any questions? As Michael Rimler dropped in the chat, you can either throw a question into the chat or, if you can find the raise-your-hand feature, we can unmute you and let you talk. In the meantime, Andy and Kai: coming from a SAS-based world, how much do you feel you've learned about mixed models and this area as you've gone through R? Because when I've taken my dives into it, doing things like replicating the CDISC pilot, my experience is that there's not one place to find things related to this method; there are a couple of different places you can go, and it makes you ask a lot of questions. What's been your experience?

Yeah, as I mentioned, I learned that I'd been doing things for 20 years without really questioning them. I'm more on the programming side; I have some graduate stats classes, but I'm not a project statistician. But even the project statisticians seem to rinse and repeat: this is what we used on the previous study, this is what we're going to use on the next one. I learned so much, especially on the degrees of freedom: Kenward-Roger, Satterthwaite, containment, between-within, all the others. I knew the options were out there, but I never really delved into them as much as I did during this project, and I realized how little I questioned the models and the analysis plans I was given. And when we're trying to match R, sometimes, okay, we matched R with SAS, but should that really be our gold standard? Our gold standard should not be matching R to SAS; it should be: what is the appropriate method for this situation, for this study, for this analysis we're doing? We've been using SAS as the gold standard because that's what we've always done. Instead, we need to be questioning what the appropriate statistical method is, and whether we can trust the language to do that method. I think that's the main thing: can we trust the language to do what it says it's doing?

Can you go back to a slide, the CDISC pilot examples? That one will be mine. Go up one more... down one more... okay, there we go. In this comparison I learned that you first have to understand what the procedure or the functions try to optimize. There is an objective function in any statistical method we use, and for the mixed model it's penalized least squares; they try to optimize the penalized least squares. If you can get the objective function to match, that's a huge step: it means your estimates, the betas, and your covariance structure should get to roughly similar places. If you can't get that to match, you don't have to go further; if it's too different, you're probably fitting different things. The second step is that you want the inference on your estimates, and the most appropriate way to get that. We know from statistical history that we rely a lot on large-sample theory to get the t-test, or any approximation, even Kenward-Roger, to work appropriately. Is that true for your own study? You have to judge that on your own. And I think Dr. Bates's opinion is that you can always do simulations: you can always run a parametric bootstrap to see what the distribution of your test statistic is, and use that to compare SAS and R, or whatever language it could be.
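As a pointer, here is a minimal sketch of that suggestion using the pbkrtest package with lme4's built-in sleepstudy data; the models are purely illustrative.

```r
library(lme4)
library(pbkrtest)

# Nested models on lme4's example data; the fixed effect of Days is
# what we are testing.
large <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)
small <- lmer(Reaction ~ 1 + (1 | Subject), data = sleepstudy)

# Kenward-Roger adjusted F test, the analogue of DDFM=KR in PROC MIXED:
KRmodcomp(large, small)

# Parametric bootstrap of the likelihood-ratio statistic, rather than
# relying on a large-sample approximation (the models are refit under
# maximum likelihood internally):
PBmodcomp(large, small, nsim = 500)
```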
The second step is that you want to get the inference on your estimates, and you have to ask what the most appropriate way to get that is. We know from statistical history that we rely a lot on large-sample theory for the t-tests, or for any approximation, even Kenward-Roger, to work appropriately. Is that true for your own study? You have to judge that on your own. And I think Dr. Bates's opinion is that you can always do simulations: you can always run a parametric bootstrap to see what the distribution of your test statistic looks like, and we can use that to compare what SAS and R, or whatever language it may be, are giving us (there's a sketch of this after the Q&A).

Yeah, that's an interesting point. I think it was from a post by the nlme author, or it's either nlme or lme4; Andy, I sent that to you a while back. He was making a similar point, echoing a resistance to implementing p-values on some of this just to try to match the output coming from SAS, and he had a lot of concerns around that. I think that post goes back to maybe 2006, so discussions of R versus PROC MIXED are quite old at this point too. Anyone else have any questions or other points?

Seth said in the chat: the algorithm used in SAS to minimize the objective function is quite optimized, with some approximation in the case of H being not greater than zero, while in R it's not, and this could lead to some discrepancies. I think that kind of echoes your point, Andy, that we're at the tip of the iceberg. Insights like the one Seth is making are exactly what we're trying to uncover. It's trial and error; what these use case subgroups are trying to do is, in some regimented way, start to isolate where we might see discrepancies and why we might see them. Not to provide all the answers, I think that's an after-effect, but to help us understand as practitioners what questions we should be asking. So if we were in a different class of models than one of these four, or we were looking at a different language implementing one of these, what types of questions might we ask, at least under the assumption that it might be similar to these use cases?

So regarding the question about the covariance structure: is that referring to the covariance structure having, or potentially having, singular types of conditions, which SAS could potentially handle a little better than R? Ah, the Hessian, okay, thank you. Yes, that relates to how the algorithms are implemented. A lot of stats procedures are implemented with second-order optimization, so the Hessian is used. That doesn't mean a first-order approximation will give us the wrong results, but more research needs to be done, and if you have a data set that gives you trouble fitting, please send it to us; we would like to see it.
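Before the close-out, here is a minimal sketch of the parametric bootstrap Kai suggested, using lme4::bootMer() on the hypothetical fit from the sketch above; the coefficient name trtActive is made up and would need to match your own model's terms:

```r
library(lme4)

set.seed(101)
pb <- bootMer(
  fit,                                          # hypothetical lmer fit from above
  FUN  = function(m) fixef(m)[["trtActive"]],   # made-up coefficient name
  nsim = 500,
  type = "parametric"                           # simulate from the fitted model, refit each time
)

## Empirical distribution of the statistic under the fitted model;
## compare this against the normal/t approximations the packages report
quantile(pb$t, c(0.025, 0.975))
```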
Okay, very good. I think at this point we can move forward to the close-out, so, Rindler, I'll hand it back to you.

Yeah, great. This is our last slide, and thank you, everybody, for joining and listening. What we wanted to do in this last section is motivate you with a sort of call to action. This is a huge working group and you can get involved. We do have the white paper that's currently in development, and you can contribute to that; we have a team looking to put together the framework itself. Those typically go through a number of different review cycles, one of which is a public review cycle, so when you see it come out for public review, which might not be until later next year, please do give us your comments and feedback.

I think the quickest way to get involved is by reviewing what's out there in the public code repo, which is linked here. As we said, we started the analysis by looking at some very popular use cases where the results could drive real value for people: understanding the differences, asking questions that haven't been asked before, and learning what types of discrepancies you might see between two different implementations of the same statistical model across two different languages. We haven't yet asked how those findings might translate and expand to other languages or other classes of models, but that is where we think we can go, and we would love to hear your thoughts. Is there value in going deeper into the existing use cases? We've obviously identified that that's the case within mixed models. Should we move to new classes of models? What other types of analyses do we typically do within pharma that aren't captured by one of these four, where we might learn more about differences? Or should we expand to other languages, as folks look at, for example, Julia or maybe Python, where you might see different differences, for lack of a better term?

Ultimately, I think a lot of the folks on the team, as well as others we've talked to, really are questioning what the right statistical implementation is. You heard this from people in each of the different presentations: let's not do what we've always done; let's ask what we should be doing. And if what we should be doing, based on sound statistical principles, is implemented in R by default, then even if we were to implement it in SAS, we should be thinking about setting up those models and options so that it's the sound statistical model. Maybe in SAS we want to match R, maybe in R we want to match SAS, or maybe it doesn't matter because the implementations are numerically close enough that the interpretation of the results is consistent, and that's sufficient. These are types of questions that we don't think have been asked or answered before, and that's what this working group is trying to do: deliver a framework for what types of questions might be important to ask when faced with these problems, but also to develop this code. Andy and Kai were talking about some of the additional work that might be done with the mixed models; whether you're part of the working group or not, you still have access to the repository and you can pore over some of those things, or you can reach out to Mike or myself and get directly involved with the working group. This is an ongoing conversation, and I think that as we continue to move in this direction, we're going to find that we need to learn more and more about what differences
matter, what differences don't matter, and what differences can be reconciled, so that ultimately we're actually performing the right analyses on the data to get medicines to patients with quality insights. So thank you very much. The slides will be shared out, and Mike has put the repo in the chat as well. We hope you found this a valuable use of your time, and we wish you the happiest of times as you finish out your year and we head into the next. Mike, any last words from you?

Nope, just that this is a large effort, and anything that anyone can contribute we greatly appreciate. To echo Andy's point in the chat: we have a lot of evolution that we can do. We hope to evolve the content we've been producing into a bookdown site, so it's easy for people to find related information on these different topics, so you have a resource you can go to and say, this is what I'm trying to do, these are the considerations I need to make. So it will keep evolving; keep your eyes peeled in 2022. And I hope that if your organization is going down this route, you can encourage them that this is a public effort and something everyone is trying to do. You're going to have findings, and if you can share them back with the community through your efforts to adopt R and open-source languages, that helps: these are hurdles we all need to overcome, and sharing that information puts less burden on your organization and can save others effort as well. Thank you very much, everyone, have a great day. Have a great day.