Welcome to R/Medicine 2020, day three. I am Peter Higgins and I'll be moderating our session this morning. I wanted to give you a couple of updates. Our workshops from Thursday, one on Intro to R for Clinicians and two on Intro to tidymodels, will be posted tonight. So schedule that in for a bit of Sunday watching if you'd like; they'll be available to all registrants, whether you attended the workshop or not. We will have a feedback form at the end of the meeting. It has only three questions, but because we made a lot of changes this year, from an in-person meeting with cheesesteaks in Philadelphia to an all-virtual meeting, we need a lot of feedback, particularly on the format and how it works: keeping it in person versus virtual, or going to some form of hybrid. We want active participation, so we encourage you to use the Crowdcast chat button to chat in real time. Use the "Ask a question" button and upvote questions that you would like to hear the answers to, whether you ask a question yourself or not. And also participate in the sidebar discussion on Twitter: use the hashtag #rmedicine2020 and follow @r_medicine. There's a lot going on there as well, with people sharing a lot of what's happening in the meeting. This is an example of the chat window: just jump in and state your opinion, or say if things don't make sense. If you're thinking about formulating questions, it's a great place to discuss them. To ask a question, just hit the button at the bottom of the screen called "Ask a question", click on the orange button, and fire away. I also want to encourage future participation. We need people to be on the organizing committee and the programming committee; it takes a lot of work to put all this together. And I particularly want to thank Beth Atkinson, who put the program together.
I also want to thank Kevin Kanuke, who organized the entire meeting, including the platform, and Daniela Mark, who's been helping us keep the platform running all three days. Just email Daniela to volunteer for next year. Today's themes: we're going to start with R in Clinical Practice, with Ewen Harrison giving our keynote, followed by a number of submitted presentations. Theme five will be R/Medicine Shiny Apps, led off by Chris Beeley, and theme six is R in COVID-19, with a keynote by Patrick Mathias looking at the applications of R to the COVID-19 pandemic. We'll have Birds of a Feather sessions on imaging analysis, minorities in R/Medicine, reproducible research, COVID-19, and geospatial mapping. Again, just a reminder about the code of conduct. I think things went very well yesterday, particularly in the chat: no harassment. The full code of conduct is at this website, at the Linux Foundation. And no screenshots, recordings, or photographs of Birds of a Feather sessions, as participants have not consented to have their photographs shared. You can take screenshots of slides during the presentations and share those on Twitter if that's something you want to do, but not personal photos. I want to thank all of our sponsors who helped us put this together, particularly as we changed the format late in the game, and particularly the R Consortium, who gave us bridge funding to keep it going. Today I want to lead off with Ewen Harrison. He's a professor of surgery and data science and honorary consultant surgeon at the University of Edinburgh. He is known in large part for his large surgical research networks, for the finalfit package, and for work in R education, particularly in low-resource countries. He'll be speaking on "From Cancer to COVID: Scale and Agility in Global Health Research Using R." We will switch over to Ewen. [Ewen Harrison] Am I sharing? Can you see me? [Peter Higgins] We can see you. I think you need to share your slides again.
[Ewen Harrison] Is that good? Well, thank you so much for the opportunity to speak at this meeting. It's an absolute delight. I found the format great, I loved yesterday's talks, and I loved the chat; it does feel like this is a new way of doing things. So I'm really grateful to the organizers for putting on such a great meeting. I am a surgeon and a data scientist, and I don't know what life decisions I took to end up where I am today, where it all went wrong. I did a PhD in the lab, and I liked the numbers more than the pipette, and ended up doing a statistics degree after that. So I now spend half of my time caring for patients with cancer, and the other half wrestling, often with R, to try and work out how to make that patient care better. I started with R in 2004, which is making me feel really old. This is how it looked: just the console and the script window, no color. If you did something right, a plot would pop up in a third window, but that was about it. There was no RStudio. There was no tidyverse. There was no R Markdown. But goodness, what opportunities it opened up. So I guess the first promise of health data science is something like this: an interconnected health data ecosystem, where information from electronic patient records, from imaging, from labs, from omics is aggregated together with data that may come directly from the patient, maybe from sensors, maybe from social media. What's important is that it is aggregated, it is analyzed, and it's turned into actionable output that a clinician can use to improve patient care. Some of these arrows exist in some places, but we're really quite far from this reality in most health systems that patients around the world have access to. The second promise of health data science is truly meaningful improvements in patient care. That might be around diagnostics, including all the great work being done in medical imaging. It might be around decision support.
It might be around the actual delivery of treatments; we're now using robots on a day-to-day basis in surgery. It might be about follow-up of patients, or quality assurance of patient care. And probably most importantly, it might be about deriving new treatments and performing clinical trials. So this is a really pragmatic talk. I see it from being at the front line of medicine, where what really matters is the improvement of patient care. A theme that I hope will come out is the importance of the assembly of high-quality data that is actually processed into action in real time. It's not data disappearing into silos; it's data turned into useful information that clinicians can use in real time. I've put the emphasis here on the use of R and the tools we use on a day-to-day basis, together with a few tips that I think are useful, some of which you'll know already, rather than on the details of the studies themselves. I'm going to focus on three areas. First, how do we gather high-quality prospective patient data, often in global settings, at scale? I'm going to use the GlobalSurg 3 cancer study, which is just finishing at the moment, to talk about REDCap and what a wonderful tool REDCap together with R is, to talk about our liberal use of Shiny and some apps that we've put together, and the finalfit package that Peter mentioned earlier. Then I'll go on to something different, which is using smartphones for follow-up, but really practical use of this in real patient care: for granny in Scotland, who doesn't really know what an app is, and certainly doesn't know what Bluetooth or the like is.
I'll use a randomized controlled trial about predicting wound infection using smartphone wound selfies: using questionnaires on wounds and taking photographs of wounds in patients after surgery, in order to try and determine whether they're going to become infected or not. I'm going to talk about integration with Twilio, which, as you'll know, is a cloud-based communications platform, together with Keras and TensorFlow. Then finally, I'm going to talk about the COVID bit: how can we do really speedy, agile, real-time data analysis that is still of high quality? If there's something that has come out over the course of the pandemic, it's the difficulties that arise when research is not of the highest standards. We'll finish up with a few words on HealthyR, our approach to teaching R using notebooks, and everything that Riinu and I have done together wrapped up in "R for Health Data Science", which is going to be published in November of this year. In that final section I'll be talking about RStudio Connect and prognostic modeling, and an approach to prognostic modeling mixing machine learning approaches with classical statistical models. So in 2018, 16 million people around the world were diagnosed with cancer, and four out of five of them will need surgery. But fewer than one in four will have access to high-quality, timely, available surgical care. The incidence of cancer is rising worldwide in all countries; mortality has plateaued in some, but continues to rise, particularly in low- and middle-income countries. But here's the rub. What do we know about the provision of surgical cancer services worldwide? What do we know about the quality of surgical cancer care? What do we know about outcomes after cancer surgery? I'm particularly interested at the moment in the immediate outcomes of surgery, so surgery within 30 days, rather than longer-term oncological outcomes.
And the answer to that is: almost nothing. So over the last five years, colleagues in Birmingham and ourselves have met lots of fantastic people around the world in order to put this research collaboration together. It's now 5,000 clinicians strong, across over 100 countries, all dedicated to the improvement of surgical care around the world. We do research, and the research is ideally driven by the priorities of those living in low- and middle-income countries, certainly not by those in high-income countries. The setting up of this collaborative, how it functions, and what a great bunch of people it is, is all a separate talk, which I didn't think was as relevant to the R flavor of today. I'd be delighted to speak to anyone one-to-one about the details of that afterwards. So the GlobalSurg 3 study was a worldwide cohort study in cancer surgery, looking at early outcomes after cancer surgery and the factors that may predict them. It's a really big study: about 120 variables collected, with lots of sub-studies within it. Today I'm really just going to concentrate on stage of presentation: how advanced is a cancer when it comes to the attention of doctors, within different settings, within different countries, at different income levels, and to what extent does that explain differences, or potential differences, in the outcome of surgery? We're focusing on breast, gastric, and colon cancer, because these are the most common cancers across the world, both in terms of incidence and in terms of mortality. REDCap has become central to our operations, partly because it integrates so well with R and RStudio. I'm sure you'll be familiar with it: an electronic data capture system. Rob Taylor and colleagues at Vanderbilt have done an amazing job keeping it moving and providing such a fantastic system.
So any project that we do now that's not using routinely collected data will often use REDCap as the primary database. In particular, things like the extensive data quality rules that can be implemented really help reduce errors, particularly when you're gathering data in a global setting. It's got an amazing API, which makes interaction with the data, in many ways but particularly using R, an absolute breeze. Kenneth McLean is doing a PhD in the lab with me; I think he presented this at R/Medicine last year. He's put together the collaborator package, which really helps with running these collaborative projects, and particularly with the management of the two or three thousand individuals who all need appropriate data access rights and need to be seeing the right patients in the right hospital, and so on. Using the REDCap API from R is really easy. Here's some code (I promised I would put some code in): using RCurl, a simple postForm() call to REDCap will pull all of the data from a particular project into R. There are some packages now that wrap up the API; REDCapR is great (I don't know if anyone from that package is at this meeting, but we use it a lot), and it does batching really well. It means you can pull the data live at any time onto your R server and do real-time analysis. The API works really well up to about 50 million cells, and then it starts to fail, as APIs always do. None of the packages at the moment make particularly good use of tryCatch or other ways of dealing with errors. But purrr, a package you'll be familiar with (I love it; I use map() every day), has functions which many people don't know about, such as insistently(). You can wrap your function in insistently(), which modifies it to retry a given number of times.
You can implement exponential backoff, meaning the time between each run of the function increases, and this really smooths out the pull. So we now pull up to 500 million cells straight from REDCap onto an RStudio server with no problem, just using a simple wrapper. We really use Shiny a lot. Shiny is fantastic, as I'm sure you all know, just for the speed and agility with which you can get things up and running. These projects are complicated to run, and we need to give national leads, leaders around the world, tools to help them facilitate data access. So we've got REDCap authorship projects and REDCap data projects, and they'll pull routinely into a Shiny app, and then we can put up anything we want to help people run the project. Here, for instance: how well is the project going, which patients were collected at which times, what data is missing, what's the data quality like, which teams have pulled which data, who signed off the data, and so on. That allows a lead to go to an individual and say: look, your data is poor quality, what's going on, you need to improve this. And so we just get the best quality data coming in now; it's absolutely phenomenal. We can push to public-facing websites really easily; RStudio Connect has made that an absolute breeze. We've tried a bit of gamifying data collection, around quality and around completeness. You've got to be careful that people don't just make it up in order to win the game, but you can make the state of play at any given time really easy to show. So what did we get in this GlobalSurg 3 project? Well, across a relatively short space of time, we got 16,000 patients, collected by over 2,000 collaborators, 836 teams, and 428 hospitals across 82 countries.
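The retry-with-backoff wrapper he describes might be sketched like this; the REDCap URL and token are placeholders, and the export parameters follow the standard REDCap API, but the surrounding function is an illustration rather than the project's actual code:

```r
library(RCurl)
library(purrr)

# One-shot export from the REDCap API; uri and token are placeholders.
# content = "record", format = "csv", type = "flat" are standard
# REDCap export parameters.
pull_redcap <- function() {
  csv <- postForm(
    uri     = "https://redcap.example.org/api/",
    token   = "YOUR_API_TOKEN",
    content = "record",
    format  = "csv",
    type    = "flat"
  )
  read.csv(text = csv, stringsAsFactors = FALSE)
}

# insistently() retries on error; rate_backoff() makes the pause
# between attempts grow exponentially, smoothing out large pulls
insistent_pull <- insistently(
  pull_redcap,
  rate  = rate_backoff(pause_base = 2, max_times = 5),
  quiet = FALSE
)

records <- insistent_pull()
```

The same pattern works around any of the wrapper packages (for example a REDCapR call) since insistently() accepts any function.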
Peter mentioned yesterday he has ideas around CONSORT diagrams; that would be fantastic, and I'd be really keen to help with that. We do use DiagrammeR, which someone mentioned in the chat. I find it really difficult to use; the DOT language just doesn't make sense to me. I've actually started pushing at the Lucidchart API (I don't know if anyone's using Lucidchart; it's a service outside R), and that works well, but I'd love to take part in that project. So we have 16,000 patients, with good spread across country income level, and good spread across gastric, breast, and colon cancers, representing the prevalence of those diseases in the community. I said we would think about cancer stage and about mortality, about early outcomes after surgery: to what extent does late presentation of disease in poorer countries translate into poorer early outcomes? Here, on the x-axis you've got country income level, on the y-axis you've got the proportion of patients, and these lines (this is, as I'm sure you can tell, a faceted ggplot) represent the proportion of patients in each income setting by stage of cancer. You can see the blue line there, early-stage disease for breast cancer, dropping off rapidly as you go from environments in which breast screening is just part of the infrastructure to those where it is not. The same for gastric cancer where screening programs exist, and the same for colorectal cancer where screening programs exist. And stage II and stage III disease (locally advanced disease) and stage IV disease (metastatic disease) increase as you go from high- to low-income settings. It is clear from the data, as we'd expect, that cancers present later in low-income settings. This is a really detailed analysis which, just for the purposes of this talk, I've reduced to one slide; I would love to talk about it in detail with others.
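The faceted ggplot he describes has roughly this shape. The data here is purely illustrative (random proportions), standing in for the real stage-by-income summary; only the plot structure (income level on x, proportion on y, one line per stage, one panel per cancer site) matches the slide:

```r
library(ggplot2)

# Illustrative data only: proportion of patients presenting at each
# cancer stage, by country income level and cancer site
plot_df <- expand.grid(
  site   = c("Breast", "Gastric", "Colon"),
  income = factor(c("High", "Upper-middle", "Lower-middle", "Low"),
                  levels = c("High", "Upper-middle", "Lower-middle", "Low")),
  stage  = c("Stage I", "Stage II/III", "Stage IV")
)
plot_df$prop <- runif(nrow(plot_df))  # placeholder proportions

# One panel per cancer site; one line per stage across income levels
ggplot(plot_df, aes(x = income, y = prop, colour = stage, group = stage)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ site) +
  labs(x = "Country income level", y = "Proportion of patients")
```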
So how does mortality differ between countries? Again, as we'd expect, we found a significantly increased rate of mortality in low- and middle-income countries; the top bars there show, for instance, up to 10% mortality after gastric cancer surgery in low- and lower-middle-income country groups. But, skipping through a lot of analysis and a lot of work, these differences do persist in models adjusted for patient characteristics such as performance status, comorbidity, age, and sex; when adjusted for stage of disease, which we've just talked about; and for differences in procedure factors. We do a lot of hierarchical logistic regression modeling, using lme4 (the lmer() function for continuous data and glmer() for GLMs), and, though I'm not talking about it today, using Stan and Bayesian frameworks. Stan is an amazing platform, really fantastic, and there are great R packages which allow it to be run without some of the pain of writing the Stan script itself. Disaggregating the deviance across these models, two thirds of the variation in early mortality after cancer surgery is explained by patient factors in the round; stage is not particularly important in that. A third is explained by hospital- and country-level characteristics. This is really important, because it allows us to think clearly about what interventions can be looked at to improve outcomes. In that green patient part, for instance, better nutritional support is almost certainly an area which will improve outcomes. In the third of variation explained at the hospital, country, and geographical level, better imaging facilities, better perioperative care, and so on are almost certainly going to improve outcomes. I would love to talk about this in more detail, so please do get in touch if you want to know more.
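A hierarchical logistic regression of the kind described, with hospitals nested within countries, might be sketched as follows. The data frame `surg` and every variable name in the formula are hypothetical stand-ins for the study's actual variables:

```r
library(lme4)

# Sketch: 30-day mortality (binary) modelled on patient factors, with
# random intercepts for hospital nested within country. `surg` and all
# variable names are hypothetical.
fit <- glmer(
  mort_30day ~ age + sex + performance_status + comorbidity +
    cancer_stage + (1 | country / hospital),
  data   = surg,
  family = binomial
)

summary(fit)

# Variance of the country- and hospital-level random intercepts, the
# raw material for disaggregating variation across levels
VarCorr(fit)
```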
A lot of our analysis is just done with finalfit. We wrote this package about five years ago, and the aim initially was to get results out of R: to get our final tables, Table 1s and regression tables, out quickly into predominantly Word documents, so that we could get them into manuscripts and send them away. We'd used other packages at the time, tableone for instance, but nothing did quite what we wanted. We've really actively developed it over the last five years, and now we actually use it for most of our primary analysis, just because we find it so flexible. It takes minimal input and has lots of options under the hood, but only ever outputs a data frame, nothing more than that. That gives it real flexibility in terms of where you take it: using knitr, using flextable, with changes in those packages over time that maybe break other packages, we've just had no problem. You can list a set of explanatory variables and a dependent variable, and using one of the big functions (there are about 40 functions in total), summary_factorlist(), you can just get a well-formatted table out. It deals with what you would expect: continuous and categorical data, hypothesis tests, and so on. I loved the talk on gtsummary yesterday, and it's great to see packages like that being developed; really exciting. I think this is a really important area of R development: how do you facilitate getting the information out as quickly as possible? Then you just switch summary_factorlist() to another of the big functions, which is finalfit() itself, the regression function.
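A minimal summary_factorlist() call looks like this, using the colon_s example data shipped with finalfit (the variable choice here is just for illustration):

```r
library(finalfit)

# Table 1: explanatory variables summarised by a dependent variable,
# using finalfit's built-in colon_s example data
explanatory <- c("age.factor", "sex.factor", "obstruct.factor")
dependent   <- "perfor.factor"

# Returns a plain data frame, ready for knitr/flextable or a Word doc;
# p = TRUE adds hypothesis tests
summary_factorlist(colon_s, dependent, explanatory, p = TRUE)
```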
Depending on what the dependent variable is, whether it's continuous, categorical, or a survival object, it will automatically do linear regression, logistic regression via GLM, or Cox proportional hazards, and give you these univariable and multivariable tables. You can take variables out if backwards fitting is your bag; you can do it in different ways. You can pull out model performance metrics, and you can do more complex things by using the constituent functions under the hood which put those tables together. We use this a lot now, both to explore models and to get results quickly out into reports. Once again, just by changing a single function you get plotting, so you can get coefficient plots, odds ratios, or hazard ratios, depending on what you're looking at. We quite often find that reducing a model to that particular format works well, particularly when models are large and complex and are going in the appendix of a journal article. The plotting is there for these different options as well, and we wrap the great survminer package, for instance, again just for our own convenience. I don't know if anyone used Zelig in the past; we used it a lot, and it's a really great set of packages, but it started breaking for me, I'm not sure why. We quite often bootstrap our models: we're interested not necessarily in the beta coefficient of a particular linear regression model, but in how the predictions change across the characteristics of a set of patients.
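Switching to the regression function is a one-line change, again shown on finalfit's bundled colon_s data with an illustrative set of variables:

```r
library(finalfit)

explanatory <- c("age.factor", "sex.factor", "obstruct.factor")
dependent   <- "mort_5yr"   # factor outcome, so logistic regression

# Univariable and multivariable models together in one data frame;
# finalfit() picks the model family from the dependent variable
finalfit(colon_s, dependent, explanatory)

# Odds-ratio plot of the same model, for the appendix-style figure
or_plot(colon_s, dependent, explanatory)
```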
So you can bootstrap on a choice of covariates, and there are various functions in there which we think are quite useful. Then there are some nice missing data pipelines, all the way through from missing data diagnostics (missing completely at random, missing at random, and so on), through some wrapping of the fantastic mice package, to final regression using mice output, which again is what we use for a lot of this. And this is really all about getting results into R Markdown. We quite often now work primarily from R Markdown documents rather than from R scripts, and then make a document at the end. As I'll mention briefly at the end, we teach in notebook style now, so that the information sits beside the code itself. One of the important aspects of this global data platform is getting the data back to collaborators so that they can look at it as easily as possible. Again, Shiny is really good for this, because data governance makes it difficult to get actual raw datasets back to individuals. (I'm getting quite a lot of feedback from someone's mic; my headphones are quite loud, it's making my head shake.) So this is about getting data back to collaborators without necessarily sending the data to them, and I'm going to try to demonstrate that live. Riinu in the team put these data visualizations together. This is a real dataset, and it can be updated really quickly. This is about saying to collaborators: here's your data, you start exploring it while we're still finishing up the analysis, and tell us what you think about it. So you start off with an explanatory variable, then you can split it, and then you can look at a particular outcome.
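The mice end of that pipeline, imputation followed by pooled regression, might look like this; `df`, `outcome`, and the covariates are hypothetical placeholders for a dataset with missing values:

```r
library(mice)

# Sketch: multiple imputation then pooled logistic regression.
# `df`, `outcome`, `age`, and `stage` are hypothetical.
imp  <- mice(df, m = 5, printFlag = FALSE)   # 5 imputed datasets

# Fit the same model in each imputed dataset
fits <- with(imp, glm(outcome ~ age + stage, family = binomial))

# Pool estimates across imputations with Rubin's rules
summary(pool(fits))
```

finalfit also provides diagnostics such as missing_pattern() and glimpse-style helpers that sit upstream of this step.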
You can get quite far just with understanding the data: to what extent does body mass index impact mortality after surgery, by the different cancer types, and so on. We put all the variables into these apps and make them available. It's on GitHub, so you can do this for your own data. Then I decided to extend this further into an actual regression app. This has finalfit under the hood, and it's called shinyfit. You can just add in variables. So this is a logistic regression model, exactly the same as finalfit: click, click, click, and there's a regression model. I can take some variables out, or add some in. You can show final models, you can include model metrics, you can subset to a particular set of data you might be interested in, and you can make missing data explicit if you want to model that. You can then output that as a CSV file, as the raw numbers, and do more with it, or output it into a Word document as a natively formatted table. You can get the plot. You can cross-tab your Table 1 automatically. And we've got our own glimpse functions so you can check the underlying data. This has been really useful for getting our collaborators actually using the data really quickly, rather than having to wait months for an analyst in a hospital or university somewhere else to give them their data back. And you can have a go with your own data. We haven't spent a lot of time developing this, but you can put your own dataset into it: there's a prep file, which is mostly about the labeling, and then you just push that into Shiny and use it yourself. We've got a few example datasets in there. Good. So the second project I was going to talk about is something completely different, I guess.
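The shinyfit idea, pick variables, get a finalfit table back, can be boiled down to a few lines of Shiny. This is a minimal sketch of the pattern, not shinyfit itself, using finalfit's colon_s example data:

```r
library(shiny)
library(finalfit)

# Candidate explanatory variables from the colon_s example data
vars <- c("age.factor", "sex.factor", "obstruct.factor", "perfor.factor")

ui <- fluidPage(
  selectInput("explanatory", "Explanatory variables", vars,
              selected = vars[1:2], multiple = TRUE),
  tableOutput("regression")
)

server <- function(input, output) {
  # Re-fit univariable + multivariable logistic regression on each click
  output$regression <- renderTable({
    req(input$explanatory)
    finalfit(colon_s, "mort_5yr", input$explanatory)
  })
}

shinyApp(ui, server)
```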
This is in high-income settings, although we're setting up in low-income settings as well. It's really about the right-hand side of promise one of health data science: what data can we actually get from patients that's going to be useful? Again, this is trying to be really pragmatic: this has got to help patients, and it's got to help clinicians working in hospitals actually improve patient care. We did it as a trial; there are pros and cons to doing this within an RCT rather than within another structure, but it was set up for that. And it was, as I mentioned at the beginning, around wound infections. We wanted to see if we could speed up the diagnosis of wound infections after patients have been discharged home, and to monitor the wounds remotely. And then there's a neural network project that we did on the back of that. Because granny in Scotland, as I mentioned, is not maybe as technologically au fait as others, we based this on text messaging, on SMS, which older patients still use reasonably often (my dad is the only one still on text messages; everyone else is on WhatsApp). The patients get a text message with a link to a REDCap form. There are questions to fill in about the wound, based on an algorithm (which we're also testing) built on CDC criteria for surgical site infection. And they take a photograph of the wound themselves, a wound selfie, and that comes back. So, I hope this projects; I don't know if you can see my arrow, maybe not, but patients undergoing emergency surgery were recruited and randomized to standard of care or to this intervention. We work up some basic patient data, including their mobile number, and put it into REDCap.
One of the great things about this study and this approach is that the whole thing is done automatically. The text messages get sent via Twilio from REDCap; REDCap has a great Twilio integration set up already. The photograph and the questionnaire come back into REDCap, and an email alert is triggered. A clinician decides whether the patient has a wound infection or not, based on the answers to the questions, the algorithm, and looking at the photograph. They go to a text-message-sending Shiny app and send a message to the patient, who is stratified into what is essentially a low, medium, or high probability of having a wound infection. That translates to the patient as anything from reassurance ("everything looks okay"), to "we're not very sure, see your family doctor", to "come up to the hospital because we think you've got a wound infection". And text messaging from R is really easy. It's just an API, similar to before. Here's the Twilio API: you've got your from number, your to number, a messaging service ID, your account number, your token, and the body, and then you just use httr and a POST call, and ping, the text message appears. It's fantastic. You can obviously wrap that into a Shiny app, which allows you to log in the background and put in a bit of validation to make sure that no mistakes happen. So we had a planned interim pull of the data, so this is that rather than final results. We saw a 6.8% wound infection rate in our patients, which is on the low side, and it tells you something about recruitment in a trial: it was patients having easier, typically laparoscopic, surgery who were recruited into the trial, rather than those having more difficult surgery. So our wound infection rate was maybe lower than expected.
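The httr POST he describes follows the standard Twilio Messages endpoint. The account SID, token, and phone numbers below are placeholders:

```r
library(httr)

# Placeholders: real values come from your Twilio console
account_sid <- "ACxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
auth_token  <- "your_auth_token"

resp <- POST(
  url = paste0("https://api.twilio.com/2010-04-01/Accounts/",
               account_sid, "/Messages.json"),
  authenticate(account_sid, auth_token),   # HTTP basic auth
  body = list(
    From = "+441310000000",                # placeholder sender
    To   = "+447700900000",                # placeholder recipient
    Body = "Your wound check is due today. Please follow the link."
  ),
  encode = "form"
)

status_code(resp)   # Twilio returns 201 when the message is queued
```

Wrapping this in a small Shiny app, as described, adds the login and validation layer around the raw call.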
We do see a non-significant difference of two days in time to diagnosis of wound infection. We need a bit more work there, and there's more data to come; it may show a difference, it may not. And I'm not sure it matters so much whether it meets its primary endpoint, compared with what we've learned using this. Patients liked it: about 60% in the smartphone group felt reassured and found it easy to get hold of advice about their wound. What this is about is spinning up an idea around an intervention that might help patients, while trying to avoid the pain that goes with app development and everything that comes after it. I don't think we're suggesting this is a platform we would use routinely in clinical practice; I know there's a lot of chat at the moment about Shiny in production and REDCap in production. But we really see it as a tool for exploring important clinical questions easily, quickly, and cheaply, because it really doesn't cost any money to do. Many of you, I know, are working on deep learning and neural networks applied to various problems, including computer vision, and many of you will be aware of convolutional neural networks as an approach to interpreting image data. You take an input image and reduce it down, extracting features, before passing it to a set of fully connected layers to classify it into something useful, which is often "dog" or "cat", it seems, but sometimes "car", "truck", or "van". For us, it's wound infection or not wound infection. The big paper in this area was in Nature two or three years ago, from Sebastian Thrun's group: photographs of skin lesions put through a CNN in order to diagnose melanoma, with impressive results, discrimination reported as equivalent to dermatologists.
I mean, as diagnostics go, this is a holy grail. Melanoma is an obvious first target for this technology, and everything that comes beyond that study is going to be more difficult, but nevertheless this is a really impressive statement of how this technology can be deployed. So: RStudio, Keras, TensorFlow. It's amazing, it works really well, and I would really encourage anyone, particularly young people out there listening to this, to get into this and just try it out yourselves. I set this up on an Amazon EC2 instance; I'll talk about that separately. I used the deep learning AMI, although I'm not sure that was necessarily useful, and put RStudio on top of that. You can use a CPU instance to get yourself going; they're free when you first start, with essentially no power or capacity, but you can get everything set up without actually paying any money. To run these imaging-based CNNs, though, you do need access to GPUs; it's just intractable on a CPU, and Amazon now provide these P3 instances. There are other instance types coming along, and other providers' data centres give you access to GPUs and TPUs. You do need to be careful, because they can become quite expensive: some of these run at $30 or $40 an hour, so you can start burning through money quite quickly. I went through $300 in a weekend setting this up, not really realising it, so you need to make sure that whoever is paying knows about it. Anyway, we set up this classification CNN, a standard, basic CNN approach. And this is where a speaker usually says: we've got amazing results, we can predict the existence of our condition with near certainty. Well, that was absolutely not the case.
It was pretty rubbish to begin with, and for good reason. We don't have that many images of wound infections compared with controls. The images are not consistent enough, because the patients have taken them themselves: the lighting isn't consistent, the colour balance is off, some are zoomed in, some are out of focus, some are taken from far away. Those of you who are into this will know that that is only a basic start, and we have since moved on to pre-trained models. Again, for the young people who haven't done this: you really need to get into it. Many of you know that the first layers in these CNNs are about extracting features common to all images, edges and textures and blobs and the like. By utilising models that have been built on large data sets like ImageNet, you can really improve the discrimination in your own particular problem. Augmentation is another approach, stretching and zooming and rotating images, which you wouldn't think would make any difference, but actually does significantly improve discrimination. It does start to get expensive when you do this, because these models take much longer to run and you need a significant GPU resource to do it. But this can all be easily done through RStudio, Keras and TensorFlow. You don't need to know Python to do this; we do a bit of Python, not much, but you can do all of this really easily through the R front end. So my conclusion from that, really, is just to emphasise the importance of the quality of the data. I hope that's a theme that's coming through. If you get patients to take photographs of wounds and the photographs aren't high quality, you will not get good results from these approaches.
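The two fixes he mentions, a pre-trained convolutional base plus augmentation, look roughly like this with keras in R. VGG16 and these particular augmentation settings are illustrative choices on my part, not necessarily what the group used:

```r
library(keras)

# Pre-trained convolutional base: ImageNet weights, classifier top removed
conv_base <- application_vgg16(weights = "imagenet", include_top = FALSE,
                               input_shape = c(224, 224, 3))
freeze_weights(conv_base)   # keep the generic edge/texture/blob features fixed

# New classifier head on top of the frozen base
model <- keras_model_sequential() %>%
  conv_base %>%
  layer_flatten() %>%
  layer_dense(units = 256, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

# Augmentation: random rotations, zooms and flips of the training photographs
train_gen <- image_data_generator(rescale = 1/255,
                                  rotation_range = 20,
                                  zoom_range = 0.2,
                                  horizontal_flip = TRUE)
```

Freezing the base means only the small head is trained on your own images, which is what makes this workable with a few hundred photographs rather than millions.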
So finally, as with many, I suppose all, of you, we pivoted to COVID at the start of the pandemic and became involved in this project. In 2012, the International Severe Acute Respiratory and Emerging Infection Consortium, which is a mouthful (ISARIC), had the foresight to establish what they called a sleeping protocol. This was in the aftermath of SARS and MERS, and it was set up ready to be triggered in the event of another pandemic. The project gathers data on patients admitted to hospital with COVID; this is the UK part of it, but it's a global study. It includes biological samples, deep phenotyping across all of the omics space, together with detailed clinical data on these patients. As soon as the pandemic hit, a team of research nurses across the UK immediately started enrolling patients, and we were asked to help with the data science part of it: reporting to UK government and feeding various data streams to modelling groups who were trying to help, particularly with policy around COVID. There are now 80,000 patients in this data set, and I think it probably represents the largest prospectively collected in-hospital data set of COVID-19 patients in the world. And just a big shout-out to RStudio Connect, which is an amazing platform. I don't have much to do with RStudio as a company; I really admire them, but I'm certainly not on the payroll or anything, so all of my enthusiasm for RStudio products is genuinely as an end user. We push everything now to RStudio Connect: we share results that way, we push our Shiny apps there, and we publish these R Markdown documents. For instance, this was a dynamic report, updating on this data set every half an hour, which was used by policymakers and by scientists informing policymakers about the current state of the data.
And the scheduler on RStudio Connect is fantastic: you run your REDCap API call, the data is pulled in, you run whatever analysis you want, and you get an email notification, if you wish, that your report has been updated. This was incredible; we would sit in meetings and be able to update the data in the meeting. Just in passing, if any of you have used parameterised R Markdown, it's really useful. It allows you to pass any parameter to the markdown document through the YAML header; you can look this up. For instance, it allowed us to quite easily generate these reports by UK home country, so by Scotland, England and Wales (this project didn't cover Northern Ireland). That was really useful; it was Tom Drake in my lab who set all that up. Again, we made extensive use of Shiny dashboards to get this all up and running: to improve data quality, to identify missing data, to motivate teams to collect data. That particular platform, also done by Tom, is on GitHub, and you can pull it down and see how it's used. COVID was, or is, as bad in the UK as it's been anywhere: we have 32% mortality in the inpatient data set, often in older patients over 70, often with comorbidity. These were the sorts of plots that grew over time as we collected the data. Partly because of this project, you realise that when the data is coming in, and quite a lot of it is missing, or not even missing, just not collected yet, you really need to track what numbers are going into what plots. There are about 20 plots and a further 10 tables in these reports. So I started doing something which I thought was really useful, and I thought I would share it with you: using the magrittr T-pipe. magrittr, as you'll know, is where the pipe that has been adopted by the tidyverse originally came from.
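Parameterised R Markdown, for anyone who hasn't used it, is just a `params` entry in the YAML header; the field name `country` here is a made-up example of the home-country split he describes:

```yaml
---
title: "Daily COVID-19 report"
output: html_document
params:
  country: "Scotland"
---
```

Inside the document you refer to `params$country`, and you can generate one report per nation with `rmarkdown::render("report.Rmd", params = list(country = "Wales"))`.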
And we really use this top-to-bottom programming style; we teach it and think it's really useful, and we've completely converted to it from base R. You can use the T-pipe to report the status of the data at a given point in the piped chain. So if you're piping into, for instance, a ggplot, then just before the plot you pipe out the summary data. The T-pipe allows you to send the data coming out of that mutate() into ggplot(), but also, with the double assignment arrow, to save a further object called plot_labels to the environment. You then pass plot_labels back into ggplot to ensure that the numbers you think are there, for instance your N number, really are. These alluvial plots, sometimes called Sankey plots, are great, and we were using them to try and track patients through hospital trajectories. There are a number of people doing work on state-based models using this; we were really just counting up the numbers. What this led to was us being asked to develop a model to try and predict death. Lots of models have been published, some of which promised the Earth and perform pretty poorly in real-world data sets like ours. There were a lot of constraints put on us in this project. Death was to be the outcome, because that was the most robust. The model had to use variables that are easily available on hospital admission; it couldn't use anything fancy. And it had to be usable by clinicians wearing full PPE, full personal protective equipment, which almost certainly meant no smartphones: in our critical care unit you're not allowed to take in a smartphone while in PPE, which is quite limiting in terms of how you build a model. But we were really looking to provide clinicians with the ability to stratify patients at that point in care. So we went through this process; I've split it up because it would otherwise be a little bit small on the slide.
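A minimal version of the T-pipe pattern described above, using the built-in mtcars data rather than the trial data: `%T>%` passes the summary on towards `ggplot()`, while the braced block uses the double assignment arrow `->>` to keep a copy, `plot_labels`, in the environment for labelling:

```r
library(dplyr)
library(magrittr)
library(ggplot2)

p <- mtcars %>%
  count(cyl) %T>%              # tee: run the side branch, then pass the data on
  {. ->> plot_labels} %>%      # side effect: save the summary for labelling
  ggplot(aes(factor(cyl), n)) +
  geom_col() +
  labs(title = paste("Total n =", sum(plot_labels$n)))
```

The plot is built from exactly the data that came down the pipe, and `plot_labels` lets you state the N you think you are plotting.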
I'm just going to go through this quite quickly; as with all these things, I'd love to talk about it at length. We had 35,000 patients at the time we started this modelling, and we used all of them. We then subsequently collected a further 22,000 patients and used those for validation: a geographical and non-random split, so temporal and geographical validation. We had to multiply impute because of the missing values, using mice via our finalfit pipeline. And we used generalised additive models, which I had used a little in the past, but not that much; they're fantastic. We made an a priori decision, based on the brief, to add cut points to continuous data. I'm sure people will feel strongly about that. Then we went into the lasso with logistic regression in order to generate a prognostic index, and alongside that we ran a machine learning comparison. We were trying to do something best in class, so we used a gradient boosted trees approach: XGBoost, which many of you know about, and which wins all the Kaggle prizes. I'd love to talk about that. Anyway: we selected cut points, we went into the lasso, we penalised, we calibrated and then we validated. These generalised additive models, which apply a spline, often penalised, and are very, very flexible, are fantastic via mgcv, one of the original packages doing this; anyone who does this sort of modelling and hasn't tried it, I really recommend it. There's a great ggplot2-based visualisation package now, mgcViz. And then you can pass that into the lasso, a penalised logistic regression, which allows you to cross-validate and choose a penalty, setting lambda to try and reduce overfitting, which is really the problem with a lot of these models. And you can see exactly which of your variables are contributing most. And this is the model that we got out: similar to other models, but in some ways different.
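As a small illustration of the mgcv side of that, with simulated data standing in for the clinical variables (the lasso step, via a package such as glmnet, would then take the chosen variables forward):

```r
library(mgcv)

# Simulated stand-in for the clinical data: age and CRP drive risk of death
set.seed(1)
n <- 1000
d <- data.frame(age = runif(n, 20, 90), crp = rexp(n, 0.02))
d$death <- rbinom(n, 1, plogis(-7 + 0.07 * d$age + 0.01 * d$crp))

# GAM with penalised splines: s() lets the data choose each effect's shape
fit <- gam(death ~ s(age) + s(crp), family = binomial, data = d)
summary(fit)
```

The penalisation on each `s()` term is what keeps the fitted curves from chasing noise, which is the same overfitting concern the lasso's lambda addresses for variable selection.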
So: age, sex, pre-existing illness, physiology at the front door, urea (BUN, blood urea nitrogen) and CRP. Area under the receiver operating characteristic curve is an interesting statistic; I'm not sure I fully understand it in these big, imbalanced data sets. We often don't see what other people report; we're often at 0.8 or 0.85. But we didn't see a lot of loss of discrimination, which is surprising, and maybe something we should talk about: going from a full generalised additive model with continuous variables kept continuous, or from XGBoost with quite a lot of hyperparameter tweaking, through the lasso, to this final pragmatic score to be used on a bit of paper by clinicians in full face masks and gowns. And it calibrated really well across the full range of prediction. It's obviously favourable to compare your own model with other models on the data set you collected yourselves; of course it will look better. But what we have seen is that a lot of those initial models that came out in the press, based on smaller data sets, just do not perform well in our setting. They just don't work. Decision curves and decision utility are important; we've used them here, and we would encourage you to use them to actually show clinical utility. So the model does work well and it's easy to use. That's coming out soon, and it will be interesting to hear people's thoughts on the approach. I've had the bell in my ear, so I'm going to finish up now with the platform that we use. We've got firewalled servers: the REDCap data is held firewalled, and we've got an RStudio instance that we keep firewalled. We push to an RStudio Connect server and have our dashboards on there. I make lots of use of Slack and Trello and GitHub, and we train using notebooks.
All of our training materials are available free of charge at healthyR.surgicalinformatics.org. We're interested in notebooks, and we're interested in education around R itself; R for Health Data Science is the book that's coming out shortly. Just to finish, then: I think what I'm trying to show is the absolute importance of the assembly of high-quality data, and the benefits of processing and actioning data in real time. As we would say in Scotland, R is pure dead brilliant. It is unparalleled in its combination of ease of use and flexibility, and it's got fantastic interfaces to all these powerful tools. We haven't spoken today about Spark, about Stan, about reticulate, but it's essential that R continues to integrate with these in order to stay relevant. Robert Gentleman said yesterday that clinicians won't code; I think clinicians should code, at least some of them. And collaboration is at the heart of this, and I think this meeting exemplifies just what a fantastic community R has. There are so many people behind this; it's a truly collaborative effort. This is the global group meeting recently, and this is the lab. So just to finish, thank you very much for the opportunity to speak to you. I've really enjoyed this meeting and I hope to learn a lot more from the talks to come. Great, thank you. We have questions; we may not get to all of them, but I'm going to try to post the other ones in the chat. The top one is about finalfit: can you combine or specify different hypothesis tests in the same table in finalfit? Yes. Categorical and continuous variables are clearly treated differently: you can specify the particular continuous-variable hypothesis test or the particular categorical-variable hypothesis test. But if you want to do Fisher's and chi-squared in the same table, you can't do that. Okay. And can the finalfit table successfully knit to Word? Yes, very easily.
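On the finalfit question, from my reading of the package documentation the test switches are arguments to `summary_factorlist()`; the argument names used here (`p_cat`, `cont`) should be checked against your installed version, and `colon_s` is the package's bundled example data:

```r
library(finalfit)

# Fisher's exact test for the categorical variables, medians for continuous
summary_factorlist(colon_s,
                   dependent = "mort_5yr",
                   explanatory = c("age.factor", "sex.factor"),
                   p = TRUE, p_cat = "fisher", cont = "median")
```

As he says, the chosen test then applies to all variables of that type in the table; you can't mix Fisher's and chi-squared within one table.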
And there's a vignette on the website showing exactly how to do that, step by step. It works really well: you knit to Word using a template, bring that template back, add it to the YAML header, and you immediately have Word set up exactly the way you want it. It's all there on the finalfit website. Another question: do you run into information governance/GDPR issues using patient data in RStudio Connect, and if yes, how do you overcome them? Yes. All of these data have to go through the appropriate IRB or ethical processes in the jurisdiction in which the data are collected. We anonymise data, and we make sure that our data-sharing agreements are quite specific as to exactly what data is pushed where. For these data, for instance, we wouldn't normally have patient-level data on the RStudio Connect server, but we do for COVID, through special dispensation around research practices in COVID-19. And one more: with TensorFlow being built in C++ and the Keras interaction libraries being in Python, how much additional overhead or CPU cost on a server do you incur by using the R wrappers? Is it minimal and not worth worrying about? I don't know the answer to that. My reading of François Chollet's work is that there isn't extra overhead in using the R front end versus the Python front end or others. I think people who are really deep into this will probably naturally be more Python programmers than R programmers, but we haven't found any problems. Great. Well, thank you very much. I think there are going to be other questions in the chat, but we're going to move on; I think Beth is going to introduce our next speaker. Thanks so much.
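The mechanism that vignette covers is R Markdown's standard `reference_docx` option: knit once, restyle the resulting document in Word, save it, and point the YAML header at it (the filename here is a placeholder):

```yaml
output:
  word_document:
    reference_docx: my-styles.docx
```

All the styles (headings, tables, fonts) defined in the template document are then applied to every subsequent knit.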