Welcome to this next set of sessions. Chris Bailey is going to be talking about machine learning and Shiny, and how this works in production, predicting at the clinic. Okay. Thanks. Sorry, I'm just setting my timer. Yes. So I'm Chris Bailey. I work in the NHS in the UK, and I'm going to be talking to you about using machine learning with Shiny in production in our setting. Just to give a little bit of background: I think a lot of people have heard of the NHS, but maybe people don't always understand it. The NHS is very large. We've got 1.6 million staff, the fifth largest employer in the world, and it's the largest single-payer healthcare system in the world, funded from taxation. But although the NHS is very large, it's actually made of lots of little things, and I work in one of those little things. As a consequence, I think we're not the sort of super-slick operation that people sometimes assume we are. In my little bit of the NHS, I myself have been using R for about ten years, and R with Shiny for about seven. I started using it with patient experience data, which is my other great love in life, apart from Shiny. We've only been using R and Shiny with clinical data for about three years.

We have fragmented data teams within my organization; I'm sure lots of organizations have this problem. We have about three different teams working with data, and I'm in the data science team. The data science team on the whole use R and Shiny; we use a bit of Python here and there, or we're starting to. In the past, what we've done is ask the data engineering team for a spreadsheet, a CSV, and analyze it in R. The data engineering team, in the meantime, are using Microsoft SQL Server and Power BI. The overall effect is that everything is rather fragmented. The data is fragmented: we have insights and variables and information that the data engineering team don't have, and they have information that we don't have. And there's fragmentation for the user as well. We have multiple dashboards running on multiple platforms in multiple locations, and the result for our users is basically extra work and confusion, with lots of things in lots of different places. So I've been trying to get away from that.

I watched a talk quite recently from an RStudio conference by Eduardo Ariño de la Rubia that was very influential for me. I'd recommend it; I'll put a short link on the slide if you're interested in seeing it. One of the things he says in that talk is that there's roughly one data scientist for ten engineers, and I think that's probably about right in my context. I haven't done the exact maths, but it's somewhere around there. What he's talking about in that talk is how small data science teams can maximize the value they deliver. The way he suggests we do that is, basically: we can be innovative, we can test things, we can iterate, and then we need to move on and do something else. That's a really good philosophy and it's been very influential for me. But the problem I have faced is: how do we share the work that we've done with non-R users? So I want to talk about that journey. Phase one started about three years ago, when we started using the clinical data from the data warehouse.
What we did at that point (I've just seen someone describe this exact thing in the chat, and yes, we did do exactly that) was hook into the data warehouse and basically run SELECT * on about seven different tables. Then we used loads of R code, literally 5,000 lines, to sort it and filter it and aggregate it and compute new values and all that kind of stuff. We produced a lot of useful insights, built Shiny dashboards, and did a lot of great stuff, and then tried to communicate that back to the data engineering team in the form of 5,000 lines of R code. The effect of that was absolutely nothing at all. They, quite rightly, could make neither head nor tail of it. And the problem then is that we're stuck with these reports forever. We've got a small team, we've done all this work, and we've basically got to carry it forever. That's what I want to get away from.

I would say we're now slap bang in the middle of phase two, and what I'm talking to you about today is the green shoots of phase three. Phase two is much like phase one, really: we're still processing very large datasets in R on a small Linux server, which is not a computationally sensible thing to do at all. What's a little bit different is that we've been trying to move some of the data processing back to SQL. We might do some computation or some exploration or derive some values, but then we try to feed them back, pushing them to the tables that exist in the data warehouse. We're doing that for two reasons. The first is that all that computation is obviously a lot easier in a big SQL environment than on a small server running R (it's nothing to do with Linux; it's just a small server, basically). But more fundamentally, the reason it's so important is that we want to have one truth, or at least one set of truths. Again, the fragmentation: we have different definitions of things, and it's confusing even for us, to be honest, never mind our users. We're looking at data in different ways, and that's unhelpful.

This approach that I want to talk to you about today, we've pretty much kicked off with this project, and the project is deliberately simple, because I'm really working on the culture here, and I didn't want to work on the culture in a complicated way; that's maybe too much learning all at once. So basically it's just a very simple machine learning model, and the idea is that, given a set of appointments at a clinic, we want to predict who's going to come and who isn't. It's of great interest to us to answer that question, for several reasons I don't really have time to go into, but it's clearly of interest to many people. The algorithm is okay, not amazing; the area under the curve is about 0.75. What I've been telling people is: if you give me a thousand appointments, I can say you should ring these 50 people, and if you ring those 50 people, 20 of them wouldn't have come if you hadn't rung them. If you compare that with just guessing and ringing people randomly, only about seven and a half of them wouldn't have come. So it's not amazing. I think it could do with more iteration in different contexts: we're a very large organization and we see lots of different types of patients, and I think if you tuned the model within each of those contexts, it would probably work better.
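To make the arithmetic behind those figures explicit, here is a quick sketch using only the numbers quoted above:

```r
n_rung        <- 50     # appointments flagged for a phone call, out of 1,000
caught_model  <- 20     # of those 50, how many would not have attended
caught_random <- 7.5    # expected no-shows among 50 randomly chosen calls

baseline_rate   <- caught_random / n_rung   # implied overall DNA rate: 0.15
precision_model <- caught_model / n_rung    # precision in the top 50: 0.40
precision_model / baseline_rate             # lift: roughly 2.7x better than guessing
```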
Still, it's what I would call good enough. This is another philosophy that I've been embracing recently. Again, because we're a small team (we're not Google), we can't be making fractional improvements to areas under the curve. That's not really what we're about. We want to make a change, make something useful, build it out, and get people started with it. That's what all of this is about, really.

A machine learning model is basically only as good as its reach. We can do lots of clever things in R, but if nobody ever sees them, and nobody can understand them once we've moved on to something else, then they're not really that useful. So what we've been trying to do is build simple stuff, build useful stuff, and build it and communicate it in a common framework. As I say, there are many, many clinical environments within my trust, and we see lots of different patients; we can't hope to do all of the tuning and iteration ourselves, and we don't need to. That can be done quite easily by the much bigger, more technically resourced data engineering team. But some of it, the testing and the early iteration, can be done by us, rapidly, in Shiny, and we've used golem along with Shiny.

For those of you who haven't heard of golem, it is, and I quote, "an opinionated framework for Shiny applications". It's a particular way of building Shiny applications, and I think it's very useful. I haven't been using it for that long, but I've started to build more and more with it. It emphasizes particular ways of doing things: the use of modules in particular, but also package-like functionality such as testing. The idea is that it allows you to be fairly agnostic as to where you're going to deploy. I think the reason it was first developed was that they wanted a Shiny code base that could be deployed in lots of different ways, whether on Docker, RStudio Connect, or as a CRAN package, without rewriting the underlying structures.

So, to minimize the friction of these communication flows, we read the data live from the data warehouse. Instead of going to the data engineering team and saying, "these are the 80 variables we need to understand, can you please do a big join and get it all for us", which is more work for them, we can just go and get it ourselves. And we can use golem and Shiny to build something in a modular fashion, so that we enforce a separation. There's a data layer to the Shiny application, which makes use of the pool package; pool is useful for managing SQL connections, and while I don't have time to talk about it, it's a useful thing when you need it. Then you've got the business logic of the application. This isn't so much golem as just good practice in Shiny, though it's often not followed: all your business logic should be defined outside of a reactive context. It shouldn't itself contain reactive values; it should be defined as static functions, tested in a static environment, and then take reactive values as input arguments.
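Here is a minimal sketch of that pattern, with hypothetical names (the app's real code isn't public yet):

```r
library(shiny)

# Business logic: a plain, static function with no reactivity in it.
# It can be unit-tested on ordinary data frames, outside Shiny entirely.
flag_high_risk <- function(appointments, threshold) {
  appointments[appointments$predicted_risk >= threshold, ]
}

ui <- fluidPage(
  sliderInput("threshold", "Risk threshold", min = 0, max = 1, value = 0.5),
  tableOutput("clinic_table")
)

server <- function(input, output, session) {
  appointments <- data.frame(
    patient        = c("A", "B", "C"),
    predicted_risk = c(0.2, 0.6, 0.9)
  )
  # Reactivity stays at the boundary: the reactive value input$threshold
  # is passed into the static function as an ordinary argument.
  output$clinic_table <- renderTable(flag_high_risk(appointments, input$threshold))
}

shinyApp(ui, server)
```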
The overall effect of doing all of those things is to produce a set of components that can be reasoned about and communicated in pieces. That's the idea, partly because, as I say, we're different teams with different structures. What I'm trying to avoid, essentially, is something I've seen many times (and many of these applications are wonderful; I'm not criticizing them): the 2,000-line Shiny application that does everything under the sun, where it's very hard for someone who hasn't been involved in its development to reason about what does what. This is the opposite of that approach, basically.

I just want to very quickly apologize for not bringing the code. We do try to open-source everything in my department. It's early days for us, but we have open-sourced things and we plan to open-source more. I haven't brought this code because I think it's a mess and it would be more confusing than anything else, but just find me on Twitter, @ChrisBailey; it's as simple as finding me, and I promise I'll get back to you when there's shareable code. It's absolutely our intention to share it.

Right, so a little more about those individual stages. The model training, as I mentioned, is done basically straight off the data warehouse. We're using dbplyr, which is very useful because it allows you to use the power of the SQL Server. Instead of bringing everything into R and then having a lot of memory problems, the filtering, the sorting, the arranging, all of that is done on the SQL Server, and then you bring in only what you need. And that's very liberating.
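Here is a sketch of that pattern; the connection details and table names are hypothetical:

```r
library(DBI)
library(dplyr)
library(dbplyr)

con <- dbConnect(odbc::odbc(), dsn = "DataWarehouse")  # hypothetical DSN

appointments <- tbl(con, "Appointments")  # a lazy reference; nothing is pulled yet

attendance_summary <- appointments %>%
  filter(AppointmentDate >= "2020-01-01") %>%  # translated to SQL and executed
  count(ClinicCode, AttendanceStatus) %>%      # on the server, not in R
  collect()                                    # only the small result enters R

# Calling show_query() on the pipeline instead of collect() prints the
# SQL that dbplyr generates.
```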
The machine learning itself is done in mlr3, which is another great package I don't really have time to talk about, but it's worth saying that mlr3 is itself built in a modular way and emphasizes the use of pipelines. Again, the idea is that the steps can be swapped out easily: you can have a particular imputation step or a particular algorithm, swap it out for a different one, and the rest of the chain will still work. That's part of its philosophy, and it's quite useful.

On the Shiny side, there are basically two applications I'm targeting at the moment. There may be more, but I think I need to talk to the organization more first. My first idea was the clinic view. The idea is that people can come in in the morning, turn on the computer, and the computer will say: here are all the people coming in today, and here's the percentage chance that each of them isn't going to come. They can look at it and think, well, I've got a bit of spare time, so I'm going to ring such-and-such a person, because I can see they maybe won't come. It's all in percentages; it doesn't say who you should ring and who you shouldn't, so they can make that choice. The other view I thought would be interesting is for people who look after a whole clinical area and are interested in its performance. They might be interested in the overall rate of "did not attend", the percentage of people we've called, the percentage of those we called who still didn't come, the percentage of people we didn't call who didn't come, all those sorts of things, just to get an idea of how it's going. And obviously also whether it's changing, how well the model is working, whether it's making our practice better, and all that kind of thing.

As I say, golem is really useful in this respect because the data layer is totally separate from the rest of the application: there's a separate module for the data and a separate module for each of the views. That's very liberating on the Shiny side, because it means you can swap them out and reuse them. It may be that you write something over here that could be used somewhere else, or vice versa.
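Here is a stripped-down sketch of that data-module/view-module split, written with plain Shiny modules and hypothetical names (golem organizes an app around exactly this kind of module):

```r
library(shiny)

# Data module: the only part of the app that knows where the data comes from.
# In the real app this would hold the warehouse connection (pool/dbplyr);
# here it is stubbed with a small data frame.
mod_data_server <- function(id) {
  moduleServer(id, function(input, output, session) {
    reactive(
      data.frame(patient = c("A", "B", "C"), predicted_risk = c(0.2, 0.6, 0.9))
    )
  })
}

# View module: takes the data as a reactive argument; knows nothing about SQL.
mod_clinic_ui <- function(id) tableOutput(NS(id, "table"))

mod_clinic_server <- function(id, appointments) {
  moduleServer(id, function(input, output, session) {
    output$table <- renderTable(appointments())
  })
}

ui <- fluidPage(mod_clinic_ui("clinic"))

server <- function(input, output, session) {
  appointments <- mod_data_server("data")
  mod_clinic_server("clinic", appointments)  # a different view module could slot in here
}

shinyApp(ui, server)
```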
And again, thinking about the business logic side of things: as I say, this is not really golem, it's just good practice, but golem certainly helps you work this way. You define all of the outputs, for example the table in the clinic view that shows all the patients, outside of the reactive context. That allows you to get good testing, and what's really nice is that it's not only good practice for Shiny; it also makes those components reusable, including by yourself. If you want to make an R Markdown report or an API or whatever it is, the logic isn't defined within Shiny; it's static and it's reusable out of the box.

So, having done all that, what do we do with it all? The easiest thing, as I mentioned, is that we've already got our R-based reporting, so all of this work very naturally goes into the R environment. That's something we'd have been able to do even without all of this structure; even if we'd done it in quite an arcane, bizarre way, we could still have produced a report in the R environment. The real use comes when we're talking to the rest of the organization. The first thing we can do is hand over the model, which, as I say, we've tested and iterated and evaluated, and say: we trained it on all this data, and all the data was pulled through your SQL views, using your variables; there's nothing strange in here. We pulled it straight through, trained it, and got this. It's then very easy to communicate how we did it so that they can redeploy it, and, as I mentioned before, that means they can iterate it within a particular clinical context or area. Similarly, because the Shiny application is itself modular, it's very easy for us to say: the clinic view has a very well-defined set of inputs, which come from the data layer, which is itself modular, and a very well-defined set of outputs. That allows us to take that individual piece to an individual Power BI developer and say: this is how it works, these are the inputs, those are the outputs, and you can just do it. Similarly, the performance view might go to a different Power BI developer or even a different team, and again, you can pull it off the shelf just the same.

As I say, this is all designed to be the opposite of the 2,000-line Shiny application that does everything, where you've got to take it apart and say, oh, well, actually you should know that an earlier bit written for something else feeds into this. That's the kind of thing we're trying to avoid, to make it easier to talk about. So how has this helped us to work better with the data engineering team, which was the whole aim of this? The simplest, most obvious thing you can do is push the results to a table. As I mentioned, we've already been trying to do things like that: if we've got a set of outputs, we can just let them have them, and that's the first step. We're still holding them, still in charge of them, which isn't ideal, but it's something they can start to understand.
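In practice that first step can be as small as writing the model's scores back to the warehouse. A sketch, with hypothetical table and column names:

```r
library(DBI)

con <- dbConnect(odbc::odbc(), dsn = "DataWarehouse")  # hypothetical DSN

# In practice this data frame would be the model's output; stubbed here.
scored_appointments <- data.frame(
  AppointmentID = c(101, 102, 103),
  PredictedRisk = c(0.82, 0.14, 0.55)
)

# Push the scores to a table the data engineering team can read from their
# own tools; overwrite = TRUE refreshes the table on each scoring run.
dbWriteTable(con, "PredictedDNARisk", scored_appointments, overwrite = TRUE)

dbDisconnect(con)
```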
Then, as I mentioned, they can reimplement the model in the data warehouse. As I said, the model is written in a totally modular format, and it's written pretty much in their language, in the sense that it's trained straight from their own SQL tables. And then they can reimplement the Shiny application, or indeed improve it; they don't necessarily have to reimplement it as-is. What we can say is: we've built a Shiny application, we've shown it to these people, they had these comments about it, we've made these changes, and here it is; here's this bit and this bit and this bit, and now it's yours. You're in control of it. And then we can move on to the next problem.

So, in summary, what I've come to tell you today: as I say, I watched that talk and it was very influential. We are a small team and we really need to punch above our weight. To do that, we need to be able to build new things, test them, make sure they work, and then get them out there and move on, handing them over to people who are really good at scaling and big architectures, because we don't have those skills, and we don't have the manpower either, for that matter. And in order to do that, we need frameworks that, firstly, help us communicate with the data team and, secondly, for our own benefit, help us reason about our work in modules. For those of you who work in Shiny and maybe haven't tried modules, that's the first thing you notice when you start using them. As I say, golem very much encourages you to do this, but it's not specific to golem. Modules really free you up and help you think about your work in a modular way, and that can be very liberating, when you no longer have to worry about how everything is wired up and about unpleasant interactions running right through the application. And that's what I'd like to say. Thank you very much for listening.

Well, thank you very much, Chris. So we have a question here: is recent attendance at the emergency department a predictor of no-show for a follow-up appointment? Oh, that's an excellent question, but I'm afraid I'm not the person to answer it, because I don't work in that kind of environment. We deliver mental health services and community physical health services. What I can tell you off the top of my head are the things that are predictive, the ones you'd expect. There are three obvious things that do predict; I'd like to find some non-obvious things, but we're not there yet. The first is simply how many times we've seen you, and the second is how many times you haven't come before, which is probably something you'd also guess. The third is something we were expecting to find, and wished wasn't there, but it's worth knowing: deprivation. The place where I work has quite a lot of deprived individuals within the city area, and we find that those people are less likely to come. Clearly that's an issue, because it means that those who are perhaps in most need of our services are less likely to come and get them. That's something we'd like to work on.

Can you briefly say anything about the conversion to Power BI? Do you have any code available on that? I'm afraid I don't at the moment. I have been talking to the team, but I haven't sat down next to them yet; obviously I haven't, because of COVID. I haven't got that going just yet. As I say, I want to bring more, really, but it's all tied together with string at the moment. So please do find me on Twitter, and I promise I'll drop you an email whenever there's anything like that. All right, well, thank you for that talk. Yeah.