So first of all, just a quick thank you to the R Consortium. I'm GSK's representative on the R Consortium; we've been a member since the turn of the year, but I've been involved with the R Consortium for a couple of years now through something called the R Validation Hub, which I imagine many people on this call will be familiar with. It's been great to be able to lean on the R Consortium to help get this series set up and to work with them on a number of different initiatives in the pharmaceutical industry, so I just want to express some thanks to them. I also want to thank PHUSE, who recently signed an agreement with the R Consortium and will be helping to co-promote this series in future as well.

So lots of people are behind this, and while I'm in the mode of thanking people I want to thank a number of individuals. Bob Engel, who is going to be chairing one of the discussion sessions later, helped come up with this idea, and a number of other people from various companies have been involved in working through to the point where we have this series together.

So what is this series about? The aim is to pitch it at people leading R adoption initiatives. If that's not you, don't worry, this is still open to all and I hope there will be relevant information in here. But really this came about after a number of individual, one-on-one discussions between people who are leading adoption initiatives within their companies. That can include things like training, it can include environments, it can include the kinds of programmes people are unveiling to try and encourage adoption day to day, it can include things like recruitment; all sorts of topics have come up in those individual discussions, and we thought we would try to bring that together in a series of how-to webinars.

Today is obviously the first of the series, so we'll see how it goes, but the typical format we expect will be a presentation, which is exactly what you'll see later today, followed by some focused discussion around the topics raised within that presentation. In terms of what to expect from future topics: today is very much focused on environments, and I'll talk about the environment we built at GSK, but future webinars will look at things like training, R packages and package development, combining SAS and R, and industry standards. We've got a long list of potential future webinars that we hope to run every couple of months over the next year or so.

As for today's session, it's really in two parts. Following this brief introduction, and this is pretty much my last slide, I will switch slide decks and give a case study on scaling R at GSK. I'll talk for maybe 35 minutes and then we'll have a bit of Q&A. That should take us to the top of the hour, and then we'll have a brief break while we reconvene in a new Zoom link. I'll remind you of this as we finish, and I think Daniela is going to post something in the chat window just before we do, but we just need you to follow a new link so you can get into the discussion sessions.
These will be three separate discussion sessions, which are highlighted on the slide in front of you, and that's really where I think some of the value will come. You're obviously going to see the case study, and hopefully that will be relevant for many of you, but when we get into the discussion we'll see what some of the alternatives are and what ways other companies are scaling R. That's something I'm really looking forward to. I mentioned this right at the start, but part one will be recorded. We won't be recording part two, but we will be taking some notes, and the idea is that we'll write those up as a blog post afterwards and share some of the themes with you all. It won't be minuted, though, so please speak as freely as you feel comfortable with in those sessions, and hopefully they'll prove really valuable going forward.

Okay, with that, let me just switch presentations and we'll get going with the main piece. Hopefully you are all seeing a slide that says "Scaling R at GSK"; I'll assume silence is positive unless anyone tells me otherwise. As I mentioned, the aim of the first part of this session is for me to give you a bit of a case study of what's going on at GSK. I'm approaching this largely from an infrastructure perspective, but I am not a technical person, so it will be high level enough for hopefully everyone to follow along. I'm going to talk about a particular environment that we have built within biostatistics at GSK. My group, the Statistics and Data Science Innovation Hub, is part of our biostatistics organisation, which includes clinical statisticians and programmers; it also includes non-clinical statisticians, research statistics, and a manufacturing group as well. So it's quite a big group of data scientists, statisticians and programmers at GSK.

I'm going to talk about WARP a lot, and the success of the WARP platform, so since you'll hear me say the word WARP over and over again, let me give you a little bit of context as to where that comes from. WARP is an acronym. Originally it stood for the Working Area for R and Python; that was the vision we came up with. We called it WARP because it's part of our larger programme called SPACE, so we essentially retrofitted the acronym to the word. The reason for the change, the reason we call it the Working Area for R Programming at the moment rather than the Working Area for R and Python, is that Python is something we've added later on; I'll touch on that a little when I talk about how we developed the platform.

I talk about WARP very much as a success, and the aim of the graph on the right-hand side is to give you an indication of that success. Over the past 12 to 18 months, the graph shows the growth in WARP: the number of users logging in per day. You can see it has a weekend rate down at the bottom and then increasing logins during the weekdays over time. At the time this graph was produced, about a month or so back, we were averaging around 120 users per day during the weekdays. But the figure that's most indicative of the success of WARP is really the one above that on the left-hand side: 350 users active in the last 30 days. So this is an environment that is being used by many, many people, where R was barely used before.
In fact, the top figure there, 700, is what we call our developer accounts: 700 people had accounts on WARP when we took this snapshot, which I think was mid-June. Today that's already over 800, and we're already seeing numbers like 150 users logging in per day. So the growth of WARP has been pretty rapid, albeit steady, over that period of time.

What I'm going to do is talk a little bit about how we got there, and the two key focuses of the presentation are: first, what did we build for R? What's the platform? What is it that all these users are logging into? And then I'll also touch on how we got there: how did we build the platform over this period of time, and how have we added to it and made it what it is?

I do like to include film references in my presentations. Last time I gave a presentation it was something pretty obscure; I'm hoping this one is fairly self-explanatory. We're going to go back in time, back to 2018, and I'll talk a little bit about the state of play with respect to R at GSK in 2018. GSK, and biostatistics in particular, were using R in 2018. Like a lot of companies, the focus and the use of R was for graphics and for simulation. I joined GSK in early 2017, and one of the first things I was asked to do was to develop some new training materials around R: an introductory course, a graphics course and a simulation course. In putting those together, and just generally using R at GSK, I experienced many challenges at the time.

The first challenge is that we didn't have our own biostatistics environment in which to use R. What we had was the option of desktop R: we could go to a sort of centrally managed installation platform at GSK and install desktop R and RStudio, and what we'd find is that those versions were maybe a year, or sometimes even two, out of date. Sometimes one would get updated and the other wouldn't, so you'd end up with a very up-to-date R and a not-so-up-to-date RStudio, or the other way round.
That led to all sorts of problems with things like R Markdown functionality not working as expected, and it was a real pain. What we had to do when we were training people was get everyone being trained to put in a temporary admin request so they would get admin rights on their laptops for 24 hours; they could then go to CRAN and RStudio and install the latest versions. That worked for the training, but it meant everyone was working on different versions of R, and it was difficult to maintain in the longer term. Then, when it came to doing real work, in particular using R for things like simulation, people would have to go and use one of our HPC platforms within GSK, and those versions, like the desktop versions, would again often be a year or two out of date. So people would have to fight against all the issues of package versions not being compatible, they would write code that wouldn't work on the HPC nodes, and so on. They also had to be technically expert users just to use the HPC nodes in the first place. So that's where we were; "difficult" is the one summary word I could put on this slide instead of all of that information.

What we wanted to get to is the situation we're in today, and I'll reveal over the next few slides what that looks like, but essentially we wanted to centralise that usage of R. From handfuls of people using R on the desktop, we wanted to get to a point where R was a primary language within biostatistics, and we wanted to standardise that as much as we could and as much as made sense: statisticians, programmers and data scientists logging on through a web browser on their laptop into a centralised infrastructure that we could then support for all of those users.

Okay, so what did we build and how did we get there? The initial remit for this, given the HPC case I mentioned earlier, was to build a dedicated HPC environment for R. That was our use case: going back to 2018, R was starting to be used a lot more in our trial design, with lots of simulation work going on, and we had a need to build an HPC platform so that we could loop through relatively small clinical trial datasets many, many times. So that's what we were asked for. But, and it's debated somewhat whether Henry Ford actually said this, I think this is a quote many people are probably aware of: "If I'd asked people what they wanted, they would have said faster horses." I suppose that summarises the state of play we were in when we started building WARP. We were asked for faster horses; we were asked for an HPC environment for R. But once we got the funding to build that environment, we took a proper look at it and decided we could do something a bit better. Rather than just build an HPC environment for R, we could look at our future usage of R and build something far more applicable to our future use cases. So in essence we crossed that out straight away and decided to build an all-purpose R environment.

Over the next couple of slides I'll walk you through what that all-purpose R environment looks like. The first component, and this probably would have been the same for the HPC environment as well, is that we wanted to build it around a familiar tool set.
Our users tend to use RStudio products, so we opted for RStudio Server Pro, the professional version of RStudio. There are various reasons for that, but in particular we were able to install it onto two different servers. This is an in-house installation: it's not a cloud environment, it's not containerised, it's installed bare metal, in-house, on two servers with RStudio Server Pro, and we made use of the load balancer for those environments. Essentially, from a user perspective, they go to a single URL and they get logged into whichever server has the most capacity. In addition, if we need to take one down for maintenance, or if there's a problem of some kind, we've got that availability; they're always going to the same URL and we can continue with just the one server up.

That was the first part. We wanted to centre everything around the development environment: everything the users do is done in the development environment where they write code. Bear in mind that many of the early users of WARP were used to desktop RStudio, where they had full flexibility; often they would maintain their admin rights and install a lot of other software as well, so they wanted maximum flexibility. So the team were quick to provide support for the backend tools: obviously R itself, and I'll talk through the maintenance in a few slides' time, plus the other software we wanted to support. Stan and JAGS, because Bayesian modelling was one of the key drivers for the HPC environment initially, so we wanted to make sure those were set up in the back end; C++ and so on. If you click the new script button in RStudio there are lots of different options there, and we wanted to make sure they were all supported. And I mentioned Python as something we added further down the line; fairly recently, in fact, we've added a Python capability within the environment as well. So it's centred around RStudio.

The second piece was the bit we were asked for, and that was the high-performance computing. All we really needed for that was lots of cores and for it to be easy to use. Just a note here, because this will feed into one of the discussions later when we look at cloud usage versus an in-house installation: we did have an option for something else at the time. I'm not going to name the technology, but it was a cloud-based technology that required users to set up their own clusters as and when they were ready to execute their parallelised code, and we felt, from a usability point of view, that was probably a bit further along the line, a bit more advanced, than where our users were at that point. So we opted for an in-house system. The tool of choice is Slurm, so we have a Slurm workload manager installed on a separate node, and even that we felt wasn't user-friendly enough, so we created an add-in, which I'll show you on the next slide, to be able to launch those jobs. Essentially what happens is you launch the job from the RStudio Server Pro environment: you develop your code in the normal way, let's say parallelised with four cores in that primary environment, and then, when you're ready to scale, you go to the add-in and launch the job, and Slurm essentially takes care of the rest of the scheduling, sending the work out to very powerful compute nodes.
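To make that concrete, here is a minimal sketch, and only a sketch, of the pattern the add-in encourages: write the script once with a configurable worker count, develop interactively with a handful of cores, and let the scheduler supply the real core count on the compute nodes. SLURM_CPUS_PER_TASK is a standard Slurm variable; the simulation function is just a placeholder, not GSK code.

```r
library(parallel)

# On the Slurm compute nodes this variable is set by the scheduler;
# interactively in RStudio Server Pro it falls back to 4 cores.
n_cores <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", unset = "4"))

# Placeholder for a clinical trial simulation
simulate_trial <- function(seed) {
  set.seed(seed)
  mean(rnorm(1e5))
}

cl <- makeCluster(n_cores)
results <- parLapply(cl, 1:1000, simulate_trial)
stopCluster(cl)
```

The point is that the same script runs unchanged whether it is tested locally on four cores or submitted through the add-in at forty.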
We asked for lots of cores, and we got them: 192 logical cores on each machine and three terabytes of memory, so fairly expensive machinery in the back end of this installation. Again, for those joining that particular discussion thread at the end, something we can talk a little about is the potential to make this a cloud component in the future, as HPC usage does tend to vary over time.

Just a quick note on the add-in. These days, if you go to RStudio you'll see something called the Job Launcher, which enables you to launch jobs on something like Slurm, or Kubernetes if you're using a cloud infrastructure. At the time, that was only available as a beta release, and we didn't want to take the beta release; generally our IT guys have a policy against that. So we ended up building our own add-in, and it has worked so successfully that we've stuck with it. There are in fact several add-ins, on the left-hand side, that allow you to do things like launch a job initially, then check the status of that job, or jobs if you've got several running, and cancel them, so there's a degree of management there as well. On the right-hand side you can see the kinds of things we collect for the user experience of running a job: I've got a script and I want to run that script, plus a little bit of advice around things like the number of cores. So if I'm using 40 cores in my script, it tells me not to vary that: if you're parallelising and saying use 40 cores, enter 40 cores here. A little bit of advice to try and make sure people are following best practice. But essentially they fill that in, minimal information, click submit, it goes off and runs, and they get email notifications when the job has finished. Very, very easy; there's nothing they really have to do to edit or change their scripts from the tests they would run locally. So we definitely ticked the user-friendly box when we built this piece.

This next slide is just here because there's a bit of information on it; I'm not going to talk through all of it, and it's not the most exciting slide in the world anyway, but it is worth mentioning that when we built WARP we made a conscious decision not to make it an all-singing, all-dancing platform. With respect to R, yes, we wanted lots of functionality, but it's not a data platform. We do have scratch storage, which helps our users, but the idea is that the data should be stored outside the platform, and WARP is really purely an analytics platform. Given that, what we did was make sure it was connected to everything we needed to connect it to. For example, our clinical data currently sits on file shares at GSK, so we made sure we could access that data. One of the problems with the more generic HPC solutions people had to use previously was that we couldn't actually access the clinical data from those platforms, so with our biostatistics-centric environment we made sure we could access all of our clinical data. Equally, there are the other storage options I mentioned, the object store and Hadoop; you can see some of the data examples listed in the third column.
On Hadoop, all our real-world data is there, and we've got historical clinical trial data stored on the Hadoop platform, so we made platforms like that easily accessible from WARP too. The object store I just want to mention, because that was a really nice piece of foresight that proved really useful in the response to COVID-19 last year: we were able to use the object store as an intermediary store for external data that we brought into the environment, such as the data that governments were sharing around COVID restrictions. There's a project that made good use of that last year, as well as the one mentioned on the slide, which was also a game-changing initiative for us, enabling a clinical RNA sequencing pipeline. Plus relational databases, which are always needed for things like Shiny apps, so we made sure we had all the relevant drivers and so on ready and available within the WARP environment. Lots and lots of effort, unsung effort I suppose, went into the data backends, but that's what makes it really easy to use. Normally when another group that isn't biostats comes along, they come along with their own data store; we make sure they can connect to it, and then they're very, very happy with everything else they see.

The bit we added, the bit that went beyond the faster horses, was the content sharing. I say content sharing specifically because, if you know RStudio Connect, which is the component we added here, it allows you to share Shiny apps, but you can also share content through R Markdown or just generic HTML. Of the two apps you see on the screen, the top left was a prototype for a new way of working with clinical data. I know a lot of companies are investing time and effort in building Shiny apps that will allow us to display, and hopefully submit, data to the regulators in the future, so it was really nice to be able to say: yes, we can build this and we can host it on a platform like WARP. The one in the bottom right is something I've talked about publicly before, something called quantitative decision making (QDM), which was already being rolled out across R&D. Basically it's a quantitative way of assessing your trial design and deriving the probability of success. That initiative, as I say, was being rolled out across R&D anyway, but when we put an app around it and showed how an app could be used to get statisticians to engage with the study teams, it proved extremely popular. The fact that we could then publish that on WARP, with all that infrastructure already set up, essentially immediately justified the decision we made to add RStudio Connect as a component, when frankly we weren't asked for it at all.

Otherwise, it's also been really useful to share all of our user guides and training materials on the platform. That was so successful, in terms of sharing that content and the way we can update and maintain it through R Markdown, that the team responsible for our SAS macro user guides actually migrated their content to WARP as well. So you can't run SAS or do anything in SAS from WARP, but the SAS macro user guides are on WARP, written in R Markdown, which I find somewhat ironic but is definitely a big tick in the success box for WARP.
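Coming back to those data backends for a moment, the kind of access this enables is, in spirit, something like the following sketch. The mount point, file name and DSN are hypothetical, purely to illustrate the sort of connectivity WARP provides; they are not GSK's actual names.

```r
library(haven)  # reading SAS datasets from a mounted clinical file share
library(DBI)

# Clinical data sitting on a file share mounted into the environment
adsl <- read_sas("/mnt/clinical_share/study01/adsl.sas7bdat")

# A relational database backing a Shiny app, via a pre-installed ODBC driver
con <- dbConnect(odbc::odbc(), dsn = "warp_reporting_db")
results <- dbGetQuery(con, "SELECT * FROM qdm_results WHERE study_id = 'STUDY01'")
dbDisconnect(con)
```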
But yes, all of this is on the platform. In addition, I haven't mentioned GitHub so far, but GitHub is one of those data backends, or code backends if you like, that we have. For WARP we have GitHub Enterprise installed at GSK, and I'm a massive fan of what I would call the very lightweight CI/CD available through RStudio Connect. Essentially, when we make changes to our training materials, all of those are stored on GitHub, and when we push them to the master branch they get automatically picked up and deployed by RStudio Connect, without us having to install any other CI/CD software. That's a really nice feature of RStudio Connect that I particularly like.

In terms of what that looks like and how it's used, we've got RStudio Connect installed on a single server, and we don't restrict what people can publish. That was a conscious decision we made in the early days. Essentially we had a couple of options: we could restrict or not restrict. The reason to restrict would be to protect production apps, to make sure they're always running, and to have some form of control over what people were doing with Shiny. We felt that would be restrictive when what we were trying to do was encourage people to experiment with and use Shiny, so we felt the easier we could make it for people to publish, the better, and we made it completely unrestricted. That has led to a situation now whereby we have hundreds of apps, and R Markdown documents and so on, published on our single Connect instance, and we are looking at cleaning that up a little, but I think that was a reasonable price to pay; I think we made the right decision for us at the time, because it has encouraged a lot of people to start using and publishing Shiny. The other option we could have gone with, of course, is splitting Connect: a production server and a non-production server. Again, we didn't do that, partly because of the number of servers we had available and the cost of a new one, but we have recently completed an experiment to see if we could put a second instance of RStudio Connect in the cloud. So if we decide we want to split that in the future, we could make that a production server, and it would sit within our cloud infrastructure. If anyone's joining for that discussion, we can talk a little more about why we've done it that way.

Okay, that brings us on to our first discussion topic, so I'll just give you a quick flavour of what we're doing around support and maintenance of the WARP platform, and if you're interested in that topic it's one of the options for the parallel sessions at the turn of the hour. I've seen several models from different companies, and it tends to range from two R installations per year up to maybe one R installation every two years; I don't see too much variation on that for a core R distribution. We've gone, I would say, for the more flexible side: two installations a year, or one installation every six months. That means that pretty much the latest version of R is always available on the platform; we're never more than six months behind in terms of R versions.
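Going back to that lightweight CI/CD for a second: git-backed deployment on RStudio Connect works off a manifest file committed alongside the content. A minimal sketch of the publishing side might look like the following; the directory name is hypothetical and this only illustrates the general mechanism, not our exact setup.

```r
# Write a manifest.json describing the content and its package dependencies,
# then commit it; Connect watches the repository and redeploys on push.
rsconnect::writeManifest(appDir = "training-materials")
```

After that, a push to the watched branch is all it takes for Connect to rebuild and redeploy the content.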
That certainly keeps those of our colleagues who had been using R a lot on the desktop pretty happy and allows that flexibility, but I've certainly seen companies go a little more conservative and add a bit more stability. What we do in terms of stability is create a centralised installation of a core set of R packages, so at the point you log on to the platform and go to a new version of R, around 450 centrally available packages are immediately available to you. In other words, you can type library(tidyverse) and it's there; you don't have to go and install it yourself. Similarly for all of the popular statistical packages and so on, and the team monitor what people install on top of that and update that central list from version to version.

As I just touched on, users can install their own packages, so we're also pretty flexible there: we don't lock that set of 450 packages down, and users are free to install additional packages. One key decision we made, which is probably contrary to what I've seen other people do, is that we don't fix that installation. A lot of companies that I'm aware of will fix a point in time: if you make your R installation in February, what I've seen is that companies will say any other packages you install will all come from that same snapshot taken in February. We actually allow full flexibility. We use RStudio Package Manager; people don't have access to the internet in this environment, but RStudio Package Manager takes daily snapshots from CRAN and allows an individual to install the very latest version of any package, and that includes updates to those 450 centrally available packages. That does allow a great deal of freedom, but obviously it does come with some risk. What we're doing to try and manage that is, for our clinical GxP environment, which I'll mention later, we are creating a separate frozen environment to account for it. So essentially users will have the choice of two environments: a very flexible environment, which is just like using R on the desktop, or a more locked-down environment, both available through the same WARP interface.

And just a quick note on the tiny screenshot you see in the bottom left there: with the new versions every six months, we do retain all of the old versions, so if someone's working on, say, a long-running study or a long-running piece of work, RStudio Server Pro will fix that R version within the project. Even though we install new versions of R and make them available, you don't have to upgrade within your project unless you choose to do so, which is a really nice feature that also adds an element of stability for people.

So that's probably the end, or that is the end, of my technical overview of what the components are. Just as a quick recap: we've got two load-balanced RStudio Server Pro installations; if people want to use the environment for HPC, they can use an internally developed launcher to launch jobs onto the HPC nodes; if they want to publish apps, that's very easy to do directly using the publish buttons within RStudio; or, as I mentioned earlier, they can use the GitHub functionality to publish content directly from GitHub.
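For a flavour of how the flexible and frozen package sources differ, the repository a session points at is just an option. The URLs below are placeholders rather than GSK's internal Package Manager addresses, but they illustrate the daily-snapshot versus pinned-snapshot idea.

```r
# Flexible environment: latest daily CRAN snapshot from Package Manager
options(repos = c(CRAN = "https://rspm.example.com/cran/latest"))

# Frozen, GxP-style environment: a fixed, dated snapshot instead
options(repos = c(CRAN = "https://rspm.example.com/cran/2021-02-01"))

# Either way, installs resolve against whichever snapshot is configured
install.packages("dplyr")
```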
So that's essentially the environment as a whole that we've built, relying very heavily on RStudio products, all an in-house, bare-metal installation. I'm happy to take more questions about that at the end, but what I want to talk about now is how we built this. What we built perhaps isn't even that exciting; I think what we built is fairly obvious in many ways. The interesting bit for me is how we got there.

The first thing is that we started with a high-level vision. After the Henry Ford quote, I put up a high-level vision, and when I say high-level I really mean high-level. We did not go out to 200 users and ask them what they wanted, because they would have said faster horses. What we did instead was take a very small number of people, three or four, who got together and decided what people would need from an R-based environment. We didn't spend a lot of time writing that down, we didn't spend a lot of time justifying the value; we agreed that was the vision we wanted to work towards, and then we started building.

The way we built was with a small agile team. The team was a mixture of business and tech, and when I say agile, yes, they had their ceremonies, they had retrospectives and daily stand-ups and so on, but the key to being agile was to really stick to the agile manifesto. I've put a reminder of some of the key points of the agile manifesto on the screen, for those not familiar with it. One of the things I'd like to highlight is "individuals and interactions over processes and tools". Yes, there were times where we had to follow processes within GSK, but the key to the success, from my perspective, is that there were two individuals, one from the business and one from our tech, our IT organisation, who spoke daily and course-corrected as and when necessary. They had a very tight relationship, which enabled us to react and, from a business perspective, get exactly what we wanted; it wasn't a case of spending lots of time documenting requirements to build this platform. Equally, the second point there: "working software over comprehensive documentation". Documentation was not heavy for the WARP environment, and normally when I share that information and talk about how we did it, the question comes up: well, how did you make it GxP then, given that GxP platforms are normally so reliant on documentation? Well, another key decision we made was that, although we knew we would need to make a component GxP-compliant, we decided to essentially leave that for later. The GxP component was parked, and we built quickly. We made some wrong decisions, but we corrected them and got very quickly to the platform we wanted to build, and now we're looking at the GxP component. I think overall that has saved us time, rather than trying to integrate GxP from the beginning into everything we've done.

A couple more key factors I just want to highlight. Personal touch: it's very tempting to go to a kind of standard support model very quickly on a platform like this. What we did was set up a generic email address, so that when people had issues they could email that address. If their problem was "I can't log in" or something like that, then yes, it would be triaged to more of a standard support model.
But everyone involved in the development of the platform, and the direction it was heading in, was on that email distribution list. So if it was a case of someone misusing the software, misusing in the sense of not using it in an optimal way, as opposed to anything dodgy, we could help educate that person on how they could get more out of it: how they could speed up their simulations, for example, or how they could control the way they were parallelising their code. Or in some cases, like with our RNA sequencing pipeline, we would speak to the individuals, learn what it was they were trying to do, and then actually add additional functionality that made their life easier. So the team were responsive to what users were doing, we spent a lot of time learning how people were using the system, and that personal touch really added to it. As well as talking about the number of users, I also talk about WARP as a success based on the really positive user feedback we've had for the platform.

And lastly, just to touch on future webinars that will run like this one: the platform itself is not all you need. The training programme we developed alongside this, an initiative called Alpha QC which enabled people to use R in their day jobs, and the Shiny examples I showed you before are also really key to the success of the platform; they draw people onto the platform and help make people aware of some of the resources that are available to support them.

So, just to finish up and set us up for the discussion groups: the second discussion topic I mentioned is controlled execution, the GxP environment if you like. I said we left that for later, and this is something we are currently addressing; we've got a fair way along, but it's not available to our users yet. To build this we essentially had a decision to make, which I've called on this slide "where to build the fence". Basically, do you make the entire environment GxP and change the way we work to comply with our GxP processes, or is there a way to target a particular GxP use case and make that a subsection of the environment? We opted for the latter. What I've highlighted in orange on the slide is essentially the GxP workflow that we have, and when I say GxP here, really what I mean is GCP: this is our clinical use case. We built essentially a frozen R installation on our management node, and users within the RStudio Server Pro environment can select that frozen environment; frozen, as in they can't install any additional packages or change that environment in any way. They can select that frozen environment, develop their code in the normal way, and then, when they're ready to execute it, they execute it in batch mode. Essentially we have a dedicated proportion of our HPC compute nodes for the GxP executions, so that code will run in batch mode, much like the HPC code they submit, everything will be contained together, and we'll have our logs and everything else we need around it. So we are essentially validating, if you like, a small proportion of the system, rather than trying to validate the whole system, content sharing and everything else included.
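As a rough illustration of the batch-execution idea, and only an illustration rather than the production workflow, a wrapper along the following lines runs a script non-interactively, keeps the console log next to the outputs, and records the session details of the launching environment for later audit. The paths and directory layout are hypothetical.

```r
run_controlled <- function(script, out_dir) {
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

  # Run the script in a fresh, non-interactive R session, keeping the logs
  status <- system2("Rscript", shQuote(script),
                    stdout = file.path(out_dir, "console.log"),
                    stderr = file.path(out_dir, "errors.log"))

  # Record the R version and attached packages of the launching session
  writeLines(capture.output(sessionInfo()),
             file.path(out_dir, "sessionInfo.txt"))
  invisible(status)
}

run_controlled("analysis/adsl_summary.R", "gxp_runs/2021-07-01_adsl_summary")
```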
To give you a little flavour of what that looks like, or could look like, as this is still somewhat in development: the user, again through an add-in, chooses to execute their script in a GxP way, they type a name for the directory to collect it, and then they execute that script and it collects all the relevant information and the relevant outputs in that directory. What we're looking at now is how we essentially audit that process to make it compliant.

One thing I do want to highlight from this, which I think is a fairly interesting development: when we started building this, we followed the R Validation Hub guidance. For those not familiar with that, one of the key components is that when you're assessing an R package, you look at the packages that users are actually going to use. So, for example, if a user is going to use dplyr, we will spend a lot of time looking at something like dplyr and deciding whether we think it's okay to use or not. But installing dplyr requires 50 or so other packages to be installed as well, and it's pretty difficult to stop users loading and using those packages. So what we've written and developed, and all the credit for this goes to Thilo Blank in my team, both for the work behind this whole flow and for WARP generally, is a means of parsing a script. You can see here there's a column showing whether each package is verified or not. If a package is approved for usage, then fine, it's allowed; but if a user tries to use something that's only a dependency, something that we haven't verified or checked, then it's highlighted here in this checker. What we're discussing at the moment is what happens next: does that stop the user from executing the script, or does it give them a warning so that they can go and check and review that package, say digest in this case, to make sure it does what it's supposed to? So we may need to adapt our QC process to respond to this.
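The internal checker isn't public, but the core idea can be sketched in a few lines: scan a script for the packages it actually uses and flag anything outside the approved list. The approved set and script path here are made up for the example.

```r
# Hypothetical approved list; in practice this would come from the
# package assessment process
approved <- c("dplyr", "ggplot2", "haven", "survival")

# renv::dependencies() statically scans a file for library()/:: usage
used <- unique(renv::dependencies("analysis/adsl_summary.R")$Package)

unverified <- setdiff(used, approved)
if (length(unverified) > 0) {
  warning("Packages used but not yet assessed: ",
          paste(unverified, collapse = ", "))
}
```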
Okay, the last topic, in-house versus cloud, very quickly. I've already mentioned that WARP is what I call a bare-metal installation; it's all an in-house environment, nothing is containerised, nothing is in the cloud today. We have run a couple of POCs for components, and we've done so because GSK's current IT policy is to be cloud-first. WARP is part of a wider programme of work called SPACE, and that wider programme will mostly be built in the cloud, so we are thinking about how we can start to migrate components to the cloud in future. And even though we have built it in-house, we have had some good success working with that hybrid architecture, for example putting RStudio Connect in the cloud and still having it work with the same in-house RStudio Package Manager and so on. What I will say is, I mentioned that we have some very powerful servers; here's a typical day in the life of the WARP servers. We're not really pushing them to the limit, despite having 120, now 150, people using the platform concurrently on a daily basis; we really are not pushing the CPU or the memory to its limit. I guess you could argue that both ways: maybe we spent too much on internal hardware, or, to look at it the other way, we don't really need to go to the cloud just at the moment because our internal systems can very much cope. That chart is for the load-balanced RStudio Server Pro environment, but if you look at our HPC environment it's very much the same picture; we're really not stressing our environment at all. One of the reasons for that, I should say, is that personal touch again. As we start to see users pushing the environment, often we'll check in with those users just to see what it is they're doing, and sometimes they've done things like nested parallelised loops that have started hundreds and hundreds of processes without being aware of it. So actually the management of the users has been really effective in ensuring that we don't push the boundaries and hit 100 percent memory or CPU usage.

So just to summarise: WARP, from the perspective of growth and the number of users we're trying to get from SAS onto R, has been a huge success within biostatistics and beyond, with all of our partners. As well as the biostatistics usage, we have many users from other groups, our pharmacometrics group and bioinformatics partners, using this platform as well. It's an all-purpose R environment dedicated to R, although Python IDEs are on their way within the environment too, so it will soon be a more all-purpose data science environment, if you like, and it's very much an in-house installation based on the RStudio commercial components. The last couple of points, just to summarise: we very much developed in an agile way, and I think that was key to the success of the platform; and on our new GxP workflow that we're piloting, our quality teams are pretty pleased with the work we've done to make that part compliant, so I think we've done a good job of marrying up the agile ways of working and the GxP requirements around documentation and so on, to make sure we're building a compliant platform.

So that's it. Time-wise, we've got around five or ten minutes for immediate questions, and then we'll take a brief break whilst people follow the Zoom links, which will be posted in the chat if they haven't been already, and then we can go to the parallel sessions. I'll come back to that in a minute, but in the meantime I'm ready for questions. My co-host, I don't know if you're on and have been monitoring the questions; I can see some questions on here, so I'll have a look through the chat. Okay, I'll try to pick these off and answer as many as I can. First of all, the obvious easy question: yes, the recording will be shared afterwards. Next question down: does your HPC power your computationally intensive apps on Connect? At the moment, if someone runs simulations, for example our QDM app has a bit of simulation involved, we are just using the power of Connect; the Connect servers are actually as powerful as our HPC servers, so we haven't really got an issue with that. But yes, we've started exploring whether it would be more efficient to use the HPC platforms for that in future. I guess with Shiny apps it's not actually that common that we have a need to process things in a high-performance way.
Because if you think about it from a user-experience perspective, if you click a button and then have to wait for a load of simulations to run, it's not a great experience, so it doesn't come up too much; we try to use mathematics to solve as many problems as we can. But it has been a topic that's come up through our quantitative decision making work, and the intention is to look at it. So far no one is using the HPC through Connect, but theoretically it would be possible.

Next question: who did we need to set this up, in terms of skills? We used a third-party contractor for this, but as I said, the team was essentially one person from the business; that person, Thilo Blank, was in my team, so I was loosely involved in talking through some of the ideas with him. So essentially one person from the business, and an IT team consisting of one technical architect and, for the majority of the time, five other individuals working under that lead architect for the platform, with input from one or two others from our tech organisation. As I say, the technical architect and the five people underneath him were third-party resource. We also had a tech contact who in particular helped with things like compliance with some of our GSK processes, and as we've developed our GxP process we've had input from a couple of other groups as well. But as a core team it really has just been those six or seven individuals: a technical architect and a business person who really knows what they're talking about from a tech perspective. I think that is probably the key for me: having technical people, people on the IT side, who actually understand R. I know RStudio talks about things like an R admin as a role, and I think people who understand R are almost a must for something like this; equally, the business side have to understand what's going on in the back end. If you have those two things, and they can talk to each other effectively, then you have the makings of something really positive like WARP.

Trying to make my way through some of the other questions. Is the script parsing done at a package or function level, and if a function can't be identified, is it caught? The script parsing is still somewhat in development, so I can't answer too many questions on that. It is done at a function level, so it will look through and see all of the functions that are called from each package, and I can say it is intelligent enough to know, if you have any kind of masking, which package a function comes from. Our processes are based around packages in terms of the assessment at this stage, but that said, if there's any testing within a package, we have information at a function level to know whether a given function has been tested as well. So we're still trying to work out exactly what information we need to provide to the user, and at what level of risk, but if you think of risk as a sliding scale: first of all, has the package been assessed? Is it a dependency, an Import or Depends, or is it a package that we've actually looked at? And then potentially it's flexible enough that we could say: has this function explicitly been looked at, how many tests have been written, and so on? From that perspective you really can get quite a nice risk profile around the functions.
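For anyone wanting to experiment with this kind of package-level evidence gathering, the R Validation Hub's riskmetric package pulls together metrics of this sort into a score. A small, hedged example of its general workflow, which is not our internal assessment pipeline, might look like this:

```r
library(riskmetric)
library(dplyr)  # for the pipe

# Gather assessments (downloads, news, bug handling, test coverage, ...)
# for a couple of packages and roll them up into scores
pkg_ref(c("dplyr", "digest")) %>%
  pkg_assess() %>%
  pkg_score()
```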
There's a question around Minitab: what would we say to users considering learning Minitab instead? I'm not sure I can give a good answer to that. What I would say is: wait for our next webinar, which will be on R training. Internally at GSK we have focused very much on the tidyverse, and I think the tidyverse is a really nice way for people who would otherwise look to a system like Minitab to get onto R. Certainly for our SAS users it has proved much more user-friendly than teaching base R, so I can partly answer that question by saying that focusing on the tidyverse makes the jump a little easier versus other software, but I probably can't speak directly to Minitab.

Lots of other questions are coming in, so I'm trying to pick them off as I see them; I'm sorry if I miss any, and we'll try to get to all the questions afterwards, either in the blog or elsewhere. Eric comments: for his organisation, RStudio Connect is deployed on a small virtual server and they recommend users leverage HPC clusters to offload their analyses if the computation is intensive. I assume that's in response to the earlier question on Connect, so I'll skip that one.

Have you encountered any issues with 800 users loose on WARP building apps and running scripts? Has it generally been positive, or was there any reluctance to have it that way? As I say, we haven't really got beyond about 150 concurrent users, and all of the servers in the system are designed as HPC servers, so they all have 192 logical cores; essentially that gives everybody a core each, and we haven't really exceeded that yet. In fact it would be a long way off: we'd have to have around 400 users concurrently using the system before people would no longer get a core each. So certainly no problems there. As I said, we have had some problems on the HPC servers with people doing things like nested parallelised loops, so you've got a foreach loop inside another foreach loop, and that created a huge number of processes; the team on those instances worked to cut that down, but it didn't really push the system. I can recall maybe two or three instances where the system was really stretched, and in each case it was people doing something wrong, if you like, not understanding what their code was doing, and the team worked with them to resolve it. In normal running it hasn't been a problem, but it does take quite careful management to avoid that.

Can you share what extra care needs to be taken when using third-party packages? I'm not sure I'm interpreting this question right, but I assume it relates to what I was talking about with the primary packages and the dependency packages. Essentially, to use the dplyr example again: internally we've made the decision to support the whole of the tidyverse, based on some of the documentation RStudio produced, but if you take dplyr as an example, what we would do is look at dplyr and look at metrics around dplyr: the number of downloads, the response to issues, and the amount of testing within the package.
Basically, we try to get an idea of whether the software development lifecycle follows good practice, and we look at usage metrics: is it being widely used and tested within the R community? We do all of that for the main package, but we wouldn't do it for all the dependencies, so the dependencies just haven't been through that level of scrutiny. The idea is that, if you think about validation, what you're actually trying to do is consider how a user will use the package. The expectation in our case, if we install dplyr, is that users use dplyr, not those dependent packages. If they also wanted to use one of the dependent packages, we would expect them to request that package, and it would then go through the same level of scrutiny. It's a bit like installing R on a Linux system: you go and test R, but the level of testing and scrutiny you give to the Linux operating system underneath is very different. Or if there's a C library that supports an R package, you look at the use of the C library through the R package, not the direct use of the C library, because that's not how you expect users to use the system. So it's essentially risk based, and through the parser we're trying to account for, and mitigate, the risk of people using those dependency packages thinking they've been fully explored by us.

I'll take one more question. Sorry, I've just read a note from Bruce. Hi Bruce; Bruce is one of our GSK tech people and is here, keeping with the personal-touch theme. Thanks Bruce. Yes, the team are very engaged, turning up to things like this as well and learning how users are using the platform, and I think that has been a real success of it. So actually I won't take any other questions; I'll leave it on that comment because we're right on the hour.

What I will do is just share my screen as a reminder. We're going to get the new Zoom link shared in a second. Let's have a look; I'm not sure if you're seeing what I think you're seeing, so let's try that again. Hopefully you're seeing my slides; if you're not, you will be shortly; let me stop and try again. So, we're going to take a five-minute break whilst everyone gets into the next session. This was set up as a webinar, and what we want to do now is move into parallel breakout sessions, so the recording will end soon. Have your five-minute break and then follow the new Zoom link, which is hopefully being posted in the chat now, although I can't see it. You've got a choice of three different streams: you can discuss support and maintenance strategy for R, GxP workflows and controlled execution, or carry on with the in-house versus cloud infrastructure question. Hopefully you've all got time to join those discussions; I think they should be valuable. As you follow the new link, if you end up in a session that's not the one you want to be in, please reassign yourself through Zoom to the one you're interested in. I make it one minute past the hour, so let's say at five minutes past the hour we'll get going with those sessions. You've got five minutes to take a quick bio break and grab a coffee.