All right, good morning, afternoon, and evening, everyone. Thank you all for being here. My name is Eric Milliman. I'm a data scientist at Biogen and the development lead for the riskmetric package, which is part of the R Validation Hub. Today is part one of a two-part mini-series revolving around the use of riskmetric and risk assessments for end-to-end R package validation. A quick disclaimer: all opinions expressed here are my own and do not necessarily reflect those of anybody sponsoring this work, be it Biogen, the R Validation Hub, etc. Actually, a second disclaimer: I apologize for any weirdness in the slides. This is my first time with Quarto, and I eventually gave up on some formatting things. I think most people probably know what the R Validation Hub is, but for those that don't, it's a group of approximately 50 companies, mostly in the pharma and biotech space. Our mission is to enable the use of R by the biopharmaceutical industry in regulatory settings, where outputs from R may be used in things like submissions to regulatory agencies for NDAs or BLAs, among others. So today we're talking about riskmetric and end-to-end R validation, and the assessment of risk in R packages. Right now that revolves around two tools: riskmetric, which is what I'll talk about today, and riskassessment, a Shiny app built on top of riskmetric, which Aaron Clark, its lead developer, will speak to next week. I like to start these talks by grounding in risk and what it is, because risk can mean different things to different people, and there are some aspects of risk we control and some we can't. Risk is really a combination of quality and intended use, or company culture. I like to give the example that at Biogen, in some of our system designs, we have three levels of risk: the system could kill somebody, the system could hurt somebody either temporarily or irreversibly, or it doesn't do anything to people and everyone is safe. Quality is obviously one way to mitigate risk; another is process. But what that means is that something high quality, like high-quality software, can still be high risk, depending on your intended use. So I really think of risk in a two-dimensional space where one dimension is a quality axis. If we think of R packages, high-quality packages might be something like the tidyverse or base R and its recommended packages, and lower-quality packages might be that one-off project the summer intern did six years ago that no one has touched since. That's one axis of the risk calculation. The other, which some might call risk, but that would be redundant, I call the allowed margin of error: will the process I'm performing with said system potentially hurt somebody, or will it maybe just cost us a lot of money, or does it really not matter because it's completely exploratory and for fun? Our risk is the combination of these two aspects. Obviously we don't want to be down in the lower left, where we have a very narrow margin for error and we're using very low-quality code. Where we really want to be is towards the top, and then left to right is dictated by circumstance. So we have to come up with criteria to quantify risk, or maybe better said, quality.
The figure at the top was part of a white paper put out by the R Validation Hub three years ago or so, laying out aspects of software development that can be quantified objectively to get at that first half of the equation in terms of risk. Those things are, I think, pretty common sense when it comes to software development in R packages: is there a copyright license, is the source code available, is there a place for people to report bugs, are bugs being closed or addressed, is there documentation, is there unit test coverage, how many people are using this package? All of these can be indicators of high-quality versus low-quality packages, or, said in terms of risk, a high versus low risk of package failure or error. So I'm going to start by diving right into a way to use riskmetric. It's quite simple: you instantiate with a pkg_ref() call, to which you can pass any number of items, such as a path to a package source directory or just the name of a package; then you assess that set of packages, and then you score it. The pkg_ref() function essentially does its best to find metadata based on what's available on your system, and we get outputs like this, where we have a package, its version, and then the score, metrics, etc. Internally, this is the process. We start with the package ref, where the pkg_ref_cache mechanism aims to collect metadata from different sources, store that raw metadata somewhere, and, as a matter of convenience, be lazy about it so that we don't repeat complex or time-consuming computations; we only do them once we actually need that reference. From the package ref we assess the package, which means creating an objective summary of the metadata. For example, that could be a table of the number of errors, warnings, and notes from running R CMD check on a source directory, or it could be a simple binary indicating whether or not a package has a GitHub repo, has a maintainer, etc. From the assessment we then create the metric, which is a numeric score for that individual assessment, bounded between zero and one, indicating where that assessment falls from low to high in terms of quality or risk. And there's nothing to say we can't create multiple metrics per assessment. For instance, we have an assessment for the number of downloads of a package. As one metric we could just sum the number of downloads over a year, and as a second metric, using the same assessment, we could compute the trend in downloads over that same year. Once we've computed all the metrics for a package, we can then create a score. The score is essentially the average of all the metrics computed, again bounded between zero and one, zero being lower risk and higher quality, one being higher risk and lower quality. And of course we can do things like customize the weights of the metrics in the score, so if you favor unit test code coverage or downloads, you can bump that weight up to your liking.
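To make that concrete, here is a minimal sketch of the pipeline, following riskmetric's documented pkg_ref/pkg_assess/pkg_score usage; the custom weighting at the end is hand-rolled for illustration rather than a riskmetric API, and the metric column names (covr_coverage, downloads_1yr) are the ones I would expect in the scored table:

```r
library(dplyr)
library(riskmetric)

# Reference -> assess -> score, the pipeline described above.
scores <- pkg_ref(c("riskmetric", "utils", "tools")) %>%
  pkg_assess() %>%
  pkg_score()

# Hand-rolled custom weighting (illustrative only, not a riskmetric API):
# emphasize unit test coverage over downloads when summarizing. Assumes the
# scored table exposes metric columns named covr_coverage and downloads_1yr.
scores %>%
  mutate(custom_score = (3 * covr_coverage + downloads_1yr) / 4) %>%
  select(package, version, pkg_score, custom_score)
```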
So, some things to consider when using riskmetric; this is my experience over the last year and a half or so of developing it and from discussions. First is package source. Not all metrics or assessments are available for all types of package sources. For instance, if you reference a package that is installed in your library, you will not be able to get assessments for code coverage, because the unit tests for that installed package are likely not available to run. Similarly for a CRAN remote, where you are referencing a package up at CRAN remotely, things like unit test coverage may not be available. Second, you have ways to be specific about where you want a package ref to be generated from, but if you don't, the package does its best to figure it out, going through the list of sources in a specific order. What that can mean is that you can generate a list of package refs of differing sources, which means your assessments and metrics may not line up exactly for comparison. Based on that, there will be missing information in your outputs. If a metric is only implemented for a specific source type, then its being missing for a ref of another type, say an installed package, is probably okay, because we couldn't generate it; again, code coverage is only available for a package source, as compared to an installed package or even a CRAN ref. There may also be missing values from errors. I would say we've generally been good about handling those and messaging them out in the console, but if you expect a metric to be there and it's not, I would highly recommend reporting it to us so we can figure out why that might be. And there are cases we're not sure how to handle yet, or that different people may want to handle differently, where a metric is missing because it's simply not expected. For example, a package may not have a URL for bug reports, which means you would not expect a metric on bug closures, because you don't know where bugs are being reported. You would have one metric in the bug reporting slot but nothing in the closures, and whether that should be missing or maybe a zero is, I think, up to the user; you could have a healthy debate on it. And lastly, weights. With mixed-type sources, you might have to be careful with custom weighting, because if a metric is not available and you weight it heavily, then the packages that don't have it available may not be weighted to your expectation. As an example, one of our metrics scrapes the CRAN remote checks table and reports back a score based on errors, warnings, and notes, but you would only expect that for CRAN remotes, not for an installed package or a source; if you were to overweight it, the other packages in this example might not be weighted the same as, say, the rules package here.
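One way to sidestep the mixed-source problem is to be explicit about the source when building refs. A hedged sketch, where the source names follow pkg_ref()'s documented options as I understand them:

```r
library(riskmetric)

# Pin the source type so refs, and therefore assessments, are comparable.
ref_remote  <- pkg_ref("riskmetric", source = "pkg_cran_remote")
ref_install <- pkg_ref("riskmetric", source = "pkg_install")

# Metrics like covr_coverage are only computable from a source directory,
# so expect NAs (rather than errors) in the scored output for these refs.
pkg_score(pkg_assess(ref_remote))
```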
So that's how it works and some things to think about. As for where we're going, right now we've got four avenues as we move forward. First is increasing ease of use. We're trying to make the package and the workflow as simple as possible to get people started. That means creating convenient wrapper functions to go end to end, say a compute-package-score function where you give it a package and it does the ref, assess, and score steps all under the hood, but provides you with helpful messaging so you understand what's going on and can learn until you're ready to customize things as you see fit for your own workflow. Along with that is cleaning up reporting and output, so that you can create nicer-looking tables if you wanted to, say, automate this process and do a risk assessment for 500 packages that you want to validate for your GxP environment, or for all of CRAN just because you want to have some fun. Second, we're working through completeness. We've implemented a lot of metrics and assessments, the low-hanging fruit, and now we're starting to sit back and ask where we can fill in the gaps. In this table we show the assessment generic calls and where each dispatches in terms of source. There are a lot of NAs; it's not as bad as it looks, because we have a lot of defaults, but we want to start filling these holes in where we can so that we're very explicit about how we link from assessment to source type. With that comes consistency from source to assessment to metric. That brings us to how we might chain or nest sources together to increase metric coverage for your own analysis. We're proposing this kind of diagram where, from any given source, you could nest in the next source down, maintaining a chain of custody. So if you start with a package CRAN remote and you wanted test coverage, you could nest in the package source information to compute test coverage, and then even install the package temporarily if there were other metrics you wanted to grab. We would avoid starting with an installed package and working back out, though, because there's no way to know when that package was installed, what source code was used to install it, or from what CRAN snapshot or time frame it came. This will really help fill out that assessment and metric completeness table. Another avenue we're working on is more metrics, but more metrics in a modular way. One of the philosophies and principles we try to maintain with riskmetric is keeping our dependency footprint low. But there are a lot of packages out there that provide similar, singular risk metrics, such as oysteR, srr, and the autotest package, with calls to vulnerability databases, extra testing, and so on, and we want to include those without necessarily making riskmetric depend on them. So we're coming up with a framework and API that would add a light set of functions to pull those in if you, say, have oysteR installed and want to run it; if you don't, you're maybe told, hey, you don't have this, you could install it, without it breaking the whole workflow.
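Something like this soft-dependency pattern is what I mean; the wiring below is illustrative, not an existing riskmetric assessment, and the oysteR call and its output column are per oysteR's documentation as I understand it:

```r
# Degrade gracefully when an optional metric provider is absent.
assess_vulnerabilities <- function(pkg_name, pkg_version) {
  if (!requireNamespace("oysteR", quietly = TRUE)) {
    message("Package 'oysteR' is not installed; skipping vulnerability check.")
    return(NA_real_)
  }
  # Query the OSS Index vulnerability database for this package version;
  # no_of_vulnerabilities is the column name I'd expect in the audit result.
  audit <- oysteR::audit(pkg = pkg_name, version = pkg_version, type = "cran")
  sum(audit$no_of_vulnerabilities, na.rm = TRUE)
}

assess_vulnerabilities("abind", "1.4-5")
```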
And then, though I'm not sure when this might arise, we also want to provide an easy facility for ad hoc assessments or metrics. Maybe your company has some super secret proprietary method for assessing risk, and you want to feed it into riskmetric without sharing it publicly. We want to allow that sort of plug-in for private, non-public, or custom on-the-fly functionality. Last is cohorts. Cohorts have been on the roadmap for a very long time, and that's because they're quite difficult. If I say cohort, everyone probably conjures an image of some collection of packages, but we're still working through what that use case would look like: whether it's a set of standalone packages, like the tidyverse, that you want a singular score for, or whether it should be more like scoring an installed environment, so base R, the recommended and priority packages, plus whatever you've deemed necessary for the business, and you score that entire environment. One is probably a subset of the other, but which way that goes is a design and implementation question we're still working through. So that's riskmetric. I'll just say, if you want to contribute, we welcome people. You can file issues; I do my best to stay on top of them and at least discuss them. If you file an issue and feel you can fix it, by all means please do. Also, on our issues pages you can propose metrics. We have some good discussions about the pros and cons of specific metrics, what they represent, how they might be represented, and so on. And if you propose a metric and feel like trying to contribute it, by all means; there are any number already proposed that would welcome a contributor. On metrics, I thought I'd opine about what I feel, and what I think we've seen, makes a good metric in this first phase of single-package risk assessment. First, I think self-contained is an important one. In the current iteration we really focus on self-contained assessments. As an example, there's a proposed metric for license compatibility with dependencies. That's probably more for cohorts, where it would be important to know whether a user-facing package has a license compatible with its dependencies; for assessing a single package, where we don't really look at the surrounding dependencies, it's less ideal. Second is environment agnostic. To make things as comparable as possible, this is important; as you get into environment specifics, that probably leads more into some cohort-style assessment where the underlying architecture, installed packages, and versions matter. The R CMD check assessment already exists, and we're discussing whether it should be moved over into a cohort-style metric, since running R CMD check is not a self-contained action: it relies on the version of R installed, the base packages, dependency package versions, and so on. It's not really as self-contained as it might seem on the surface. Third is clear interpretation. This is a fun one, because it gets into some good, meaty discussions, and I give the example of version release frequency. It's been proposed a few times that version release frequency is an indicator of package risk or quality.
However, it's been difficult to converge on an interpretation of what that means, because you can imagine that over the life cycle of a package there would be differences in version release frequency. Early on, you might see a lot of releases indicating bug fixes, which means clear engagement by the development team, but also a buggy package; then, as a package matures, the release frequency probably spreads out as the package becomes stable. So I think the idea is clear, but it's been difficult to get it down to a singular objective criterion. That leads to the next property: can we represent the metric numerically? If we think about version release frequency again, how might we represent it numerically such that it has a monotonic relationship to risk or quality? It might be the variance in time between releases, but then, from young packages to old packages, how do you account for differences like that? I'll briefly introduce one more resource we're rolling out: a package we're calling riskscore. It's currently on GitHub and it's highly experimental, so don't base any of your decisions on it. It essentially captures the output of the riskmetric package for all of CRAN as of right now. We have plans to add a table of assessments as well, so that you could play with various aspects and maybe test different kinds of metrics and weights. Like I said, it only covers CRAN, using the CRAN remote ref type, so there are a lot of missing metrics, but as we develop riskmetric we will continue to regenerate this riskscore data set. We mean it to be a community resource, to help contextualize risk scores for end users. A common question is: what's a good risk score and what's a bad one? I think that depends, one, on your use case and, two, on your appetite for risk, but this would at least let you put a score in a distribution and see where it falls. We also see it as a way to help the development teams monitor changes to scoring algorithms, catch edge cases, and so on, and it just generates a nice data set to play with and explore in terms of package quality and risk. I will reiterate: it is not a replacement for doing your own risk assessment, especially now while it's an alpha and needs to be QC'd, but it's out there to at least investigate. The data set was first generated by Aaron Clark, the lead of the riskassessment app, and together we've been putting together the table schema and designing what this package will look like, along with Doug.
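As a hedged sketch of how a resource like that might be used, assuming a data frame cran_scores with a pkg_score column (the real riskscore data set's name and shape may differ):

```r
library(riskmetric)

# Score one package, then place it in the CRAN-wide distribution.
my_score <- pkg_score(pkg_assess(pkg_ref("dplyr")))

pct <- mean(cran_scores$pkg_score <= my_score$pkg_score, na.rm = TRUE)
hist(cran_scores$pkg_score, breaks = 50,
     main = "riskmetric scores across CRAN", xlab = "pkg_score")
abline(v = my_score$pkg_score, col = "red", lwd = 2)
message(sprintf("This score sits at percentile %.0f of CRAN", 100 * pct))
```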
Now, some very basic insights from running riskmetric against all of CRAN. The distribution of scores looks pretty good: a relatively uniform distribution, which I think is what we want to expect when scoring packages. Breaking this down by number of downloads, which you might consider a measure of package popularity, and calling out specifically the tidyverse as well as the pharmaverse: the tidyverse, as you might expect, scores as very high quality, being run by a software development organization, so I think that makes sense, and the pharmaverse is not far behind, being packages devoted to validated analysis in clinical trial reporting. The top 100 most-downloaded packages are high quality, and as we move down the ranks in terms of downloads or popularity, we see a shift towards higher and higher scores, which is maybe to be expected. In terms of catching edge cases, I took a look at what's missing from these values. We see a lot of complete metrics; however, there are some cases where we clearly had a failure, which we suspect may be API rate limits and the like, which is why this is not ready for production, but we're working through that. If we look at the binary metrics, has a bug reporting URL, has a maintainer listed, has a website, and so on, we see pretty even splits between yeses and noes, which I think is good, with one obvious exception: has a maintainer, which is a requirement of CRAN's, so that's to be expected, and for a CRAN package ref it's not necessarily informative, though for other types of references it potentially could be. If we plot package risk score versus the binary metrics, you see a pretty obvious trend: when the binary is yes, you tend to have a lower risk score, which again makes sense, as having these items would indicate better SDLC or development practices. Then we have metrics with continuous scores, say bug closure rates, number of dependencies, remote check results, as well as reverse dependencies, with varying distributions; at least now, when a new package comes in, you can see where it falls. Risk score versus these continuous metrics is, interestingly, much less correlated; I'm not sure what to make of that yet, but it's informative for planning scoring in the future. And just for fun, we can take all of that and create a heat map, because what is life without a heat map in data science, and do some clustering, and we can see some patterns start to emerge. I can cut the dendrogram on the rows, which are the packages, to create a couple of clusters, and we can plot the scores by cluster and see some relationship between the different clusters and, maybe, quality.
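For anyone who wants to reproduce that kind of exploration, a minimal base-R sketch, assuming m is a numeric matrix of packages (rows) by metric scores (columns) with NAs already handled:

```r
hc <- hclust(dist(m))            # cluster packages on their metric profiles
clusters <- cutree(hc, k = 4)    # cut the row dendrogram into a few clusters
heatmap(m, Rowv = as.dendrogram(hc), scale = "column")
boxplot(rowMeans(m) ~ clusters,  # compare score levels across clusters
        xlab = "cluster", ylab = "mean metric score")
```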
With that, I'll just thank the dev team, both past and current; it definitely takes a village. And we welcome new members, of course; even if you have just a small amount of time, every little bit helps. With that, I'd say thank you, and I'll take any questions if people unmute themselves. I'm not sure if we're in webinar mode or group call mode. If you want to come off mute and ask your question in person, have at it; if you want to ask a question in chat, I'm also happy to facilitate a little Q&A session and read those out, so either way is totally fine by me. Maybe I'll just kick us off with a question: are you hearing from organizations that are using this in practice, and what does their process look like? It can be a little bit of a plug for the riskassessment app or other alternatives. So, there are a couple of different ways. One is that some groups are using not the score but the individual metrics for decisions. Instead of reducing a package to a singular number, they're saying: we expect code coverage to be above X, and downloads or engagement to be above Y, using the metrics, or a non-transformed version of the metrics, in a rubric that they then binarize themselves. That's one thing I've heard. Another is to use the score for decisions, so some might say no extra work is needed if your score is below 0.3, above 0.3 needs peer review, and above 0.8 maybe we reject outright, because it would be too much work to mitigate the risk. And then I know I've used it to compare packages: I need a package to do a specific statistical test, I find two that claim to do it, and I choose the one with the lower score more or less regardless, because I need the method without writing my own, and the lower of the two would indicate less work on my end to mitigate any risk problems.
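As a hedged sketch of what that kind of score-threshold policy could look like in code, where the cutoffs are the hypothetical ones just mentioned, not riskmetric defaults, and scores is a pkg_score() result as earlier:

```r
library(dplyr)

# Triage packages by their summary score.
triage <- scores %>%
  mutate(decision = case_when(
    pkg_score < 0.3 ~ "accept, no extra work",
    pkg_score < 0.8 ~ "needs peer review",
    TRUE            ~ "reject outright"
  ))
```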
Well, yeah, there's definitely a variety, from what I've heard as well. One thing I think is also interesting: let's say you did write that package on your own for your own statistical process. If you were to run riskmetric against your own ad hoc package, it probably is not going to score as well as either of the two existing solutions. Exactly, yeah; it's probably better to write extra testing around an existing package than to write your own from scratch. I bring that up because this is something we hit up against at Roche occasionally: we need a particular statistical method, and there are maybe a few packages out there, but it's a niche statistical field with a couple of little-used packages. There's this question around what the risk of using those packages is, and yes, they might be high risk, but the risk of not using them is often higher than the risk of using them, maybe finding issues with them, and building from there as a starting point for a low-risk solution. Yeah, cool. I hope that fosters more of an atmosphere of contributing toward existing projects. Yeah, me too. What I was kind of curious about: are you planning to keep a running roadmap with all of this? You mentioned that big matrix you're trying to populate; is it visible somewhere, or should we make it visible somewhere? It is visible; we've got a few project boards up and running, some of which are linked in the slides, especially around contributing. And this table of completion was actually a request from the riskassessment team, so that they know what to expect. I think it's a useful resource for people when running riskassessment or riskmetric, to know what to expect coming out, because we had a case where two people were comparing results and getting different scores, and part of it was that one was comparing package CRAN remotes and the other was comparing a package install to a package CRAN remote, so the metrics were different because they were using different sources. Making that clear, that this is not available to you, is important. And the developer in me can't resist thinking about how to handle this problem; one idea that comes to mind is to impute those NAs with both the lowest risk and the highest risk and then give a range of what the risk could be if those metrics were populated. Yes, we can talk about that later. I'm not seeing other questions rolling in, so it sounds like the material hopefully was clear, and I'm hoping people are walking away with a better understanding of where the project's at, how they can contribute, and how they can use it in their own practice. It's very clear that there's a lot of work ahead, but I think it's really impressive where it's at, and having all this analysis around the metrics has been super informative. So hopefully, if you're using this in your organization, it's something that's already proven productive, and if you need more support, now you know how to feed that information back to us; contributing would be even better. And coming up in just a week, we'll have a discussion around a graphical user interface wrapping around this tool, which is maybe more friendly to people who don't have deeper R backgrounds, to support this actually being embedded in your business workflow. We have that to look forward to: Aaron Clark will be presenting on the riskassessment app. Were there any other logistics items we wanted to hit before we close out for today? If not, I'll invite people back in about a week's time, this time next week, to join us for the riskassessment discussion. And yeah, thanks everyone for attending; we hope to see you next week. Thank you all. Thanks a lot. Take care. Bye bye.