Thank you very much for coming along to listen to me today. I'm Nicola Rennie, and I'm going to be talking about using the tidymodels and tsfeatures packages to fit machine learning models to time series data, focusing on the application of detecting heart murmurs in sound recording data. But first, let me introduce myself a little bit. I'm a lecturer in health data science based in Lancaster Medical School in the north west of England. My academic background is originally in statistics, and that's where I first started using R about eight or nine years ago. After my PhD, I went on to work in data science consultancy, where I worked on lots of different projects, many of them using R, including some in the health and medicine sectors, before fairly recently making a return to academia. What I'm going to talk about today is something that started out as a bit of a data science side project of mine.

So the first part of many data science projects is: let's get some data. The data I was working with is a phonocardiogram signal dataset, which is essentially sound recordings of heartbeats. One of the reasons this type of data is collected is that phonocardiographs pick up subaudible sounds, so you get more information than if a human alone were to listen to a heartbeat. This dataset has just over 5,200 different recordings covering around 1,500 people, at four different recording locations, that is, four different regions of the heart. Some people don't have recordings for all four locations, and a few people have multiple recordings at the same location. We also have some information on the people. I should also mention here that this is primarily a pediatric dataset; the oldest person in the dataset is 21 years old, I think. Crucially, one of the pieces of information is whether or not a doctor had diagnosed the person with a heart murmur after looking at the data.

In terms of what each recording actually looks like, each one is around ten seconds long, and the data is recorded at 4,000 hertz. So for each time series we have around 40,000 observations, which is quite a lot. One of the key things here, partly because of this very fine resolution, is that not all of the time series are exactly the same length. The aim of this project is essentially to classify the time series into people with and people without heart murmurs.

So we have some time series data, and let's start doing some time series analysis. What we want to do is classify each of these time series into one of two groups: do they or do they not have a heart murmur? It's essentially a binary classification problem. The first thing to know is that classifying time series data is hard, and that's especially true with very high resolution data like this. If you think about it, here we have essentially 40,000 variables for every observation in our dataset, which is quite a lot, and the key thing is that the order of those 40,000 variables is really important, so we can't use your standard dimensionality reduction techniques. So what do people normally do in this situation? Well, there are lots of different ways to classify time series data, and I've listed a few here. What seems to be becoming a little more common, and is actually one of the best ways to classify time series data, is to not classify the time series data, at least not directly. Instead, we calculate some properties of the time series, some features.
If you want to keep it simple, that's properties like the mean, the standard deviation, or the slope, for example, and we use those features in classification algorithms instead of the raw data. You might be thinking that the next obvious question is: what features should we calculate? There are lots of different things you could calculate, lots of different properties of time series data, and for some time series data certain features will tell us useful things and some of them won't. So if you think about our heartbeat recording data and what that looked like, the shape of that time series, is calculating something like the slope going to be useful? Probably not. So which features you calculate is going to depend a lot on the type of model you're going to apply, and we'll come back to that in a second, but also on the domain you're working in and the context of the data you're working with.

In R, we can use the tsfeatures package to calculate features of time series. It has lots of built-in features that can be calculated by default, but you can also use it in combination with functions from other packages, or indeed your own functions, if you want to calculate a specific feature. There are plenty of other packages for working with time series data in R, and other functions for calculating features; this just happens to be the one I've used, and I found it very user-friendly and I quite like it. We can then compare how these features differ between people with heart murmurs and people without. For some features the difference is small, so it might not be an overly useful time series feature for us, but for other features the difference is quite large. And even in the cases where the difference is quite small, like the example you can see here, the differences are persistent across the different recording locations.
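To make that a little more concrete, here is a minimal sketch of what the feature calculation might look like with tsfeatures. The object names (recordings, labels, murmur, model_data) are hypothetical placeholders rather than the names used in the real project, and the features listed are just a handful of the package's built-in ones.

```r
library(tsfeatures)
library(dplyr)

# Hypothetical inputs:
#   recordings: a named list of numeric vectors, one per heartbeat recording
#   labels:     a data frame with one row per recording, containing the outcome
#               as a factor column called `murmur`

# Convert each recording to a ts object (the data were sampled at 4,000 Hz)
rec_ts <- lapply(recordings, ts, frequency = 4000)

# Calculate a handful of built-in features for every recording.
# You can also pass your own functions that take a series and return a named vector.
heart_features <- tsfeatures(
  rec_ts,
  features = c("entropy", "stability", "lumpiness", "crossing_points", "acf_features")
)

# Attach the murmur labels so the features can be used as model inputs
model_data <- bind_cols(labels, heart_features)
```

Since there are up to four recordings per person, one per location, in practice you would calculate a set of features like this for each recording location and join them together per person before modelling.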
So now we have a bunch of different features. In the examples I'll show here, I think I had about 23 different time series features for each of the four different recording locations. You can think of these features as essentially a set of transformed variables, and that means we can start to throw them into some models, and we're going to do that using tidymodels. If you haven't come across tidymodels before, you might be wondering what it is. It's essentially a collection of R packages for statistical modelling and machine learning. It follows tidyverse principles, and in the same way that you can run install.packages("tidyverse") to get all the core tidyverse data wrangling packages, you can do something similar with tidymodels. There are packages for resampling data, fitting models, tuning parameters, evaluating models and so on.

So now we need to think about which models to choose. I will always recommend that you start with the simplest model possible. We know that we have a binary classification problem and multiple, mostly numeric, explanatory variables, so the first thought that starts to come to mind is logistic regression. If you have lots of potential variables for regression models, which we do here, and we don't necessarily know which ones are relevant, then you might also start to think about using lasso logistic regression, which will automatically help you to filter out variables that don't really have an effect. Yes, it's a fairly simple model, but if it works well, you have something that's quick to run and easy to explain. If it doesn't work well, then you have a reason you can use to justify why you need something more complex.

One of the things I like most about tidymodels is that regardless of what type of model you're fitting, the process doesn't feel too different. For example, we almost always start by splitting our data into training and testing sets so that we can evaluate how well our model performs on unseen data, and depending on the model we're fitting, we might also want some validation sets. Then there are potentially some pre-processing steps that we want to perform regardless of the model we choose to fit, for example normalizing the data, or other steps like performing principal component analysis; things that we want to repeat for every model. We can combine these into what's called a recipe. It means we don't have to copy and paste code for each different model, and we also don't have to write our own pre-processing function that does all these steps, because it fits into this workflow. It's fairly simple to construct this reusable set of instructions and save it into a workflow object.

So now we can fit a model. Let's just start with our simple logistic regression model and specify that we want to use lasso logistic regression, and that the hyperparameter in that model should be tuned automatically rather than user-specified. A little word of warning here: if you're ever getting R to automatically choose values for something, don't just trust it blindly. Do double-check it and visualize the values it chooses, to see if they're sensible and to see if it's working properly. We can tune this hyperparameter using a grid search. It's not a parameter that we can calculate from the data; we essentially have to try lots of different values and choose the one that gives the best model performance. I think this is another example of where tidymodels feels very user-friendly: things are named well, so if you want to select the value that gives you the best results, you use the select_best() function. And once we've chosen the best value of the tuning parameter, we can use it to fit the final model.

I just want to compare how fitting a random forest model works within tidymodels, and hopefully you'll see the similarities in the code between the different models. We start by specifying our model, and this is the part that probably looks the most different, because it's a different model. But then we tune our hyperparameters, select the best values and fit the final model. One thing I've found in the past is that when you want to compare completely different models like this, it has maybe meant using functions from completely different packages. Maybe they want completely different inputs, and they give you quite different outputs, which makes comparison a little bit harder. It felt like it made that model comparison a much bigger task than it needed to be, but working with tidymodels took away a lot of that pain. A rough sketch of what the whole process might look like in code is shown after these results.

So let's see some initial results on how well these two models are currently doing. They perform fairly similarly in terms of accuracy: we're getting about 80% of the classifications correct. But this dataset isn't especially balanced; there are more people who don't have a heart murmur in the dataset compared to people who do, so actually looking at the area under the ROC curve is more appropriate here, and we're getting better performance with the random forest model.
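Here is a rough, simplified sketch of what that whole process might look like, assuming the features and a factor outcome column called murmur are sitting in a data frame called model_data (with any identifier columns dropped). The column names, grid sizes and number of folds are placeholders, not the exact choices from the project.

```r
library(tidymodels)

# Split into training and testing sets, stratified on the outcome
set.seed(123)
data_split <- initial_split(model_data, prop = 0.75, strata = murmur)
train_data <- training(data_split)
test_data  <- testing(data_split)

# Reusable pre-processing recipe: normalize all numeric features
murmur_recipe <- recipe(murmur ~ ., data = train_data) %>%
  step_normalize(all_numeric_predictors())

# Lasso logistic regression, with the penalty tuned rather than user-specified
lasso_spec <- logistic_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")

lasso_wf <- workflow() %>%
  add_recipe(murmur_recipe) %>%
  add_model(lasso_spec)

# Grid search over the penalty using cross-validation
folds <- vfold_cv(train_data, v = 5, strata = murmur)
lasso_tuned <- tune_grid(lasso_wf, resamples = folds, grid = 30)

# Pick the best penalty, fit on the training set and evaluate on the test set
best_penalty <- select_best(lasso_tuned, metric = "roc_auc")
lasso_fit <- lasso_wf %>%
  finalize_workflow(best_penalty) %>%
  last_fit(data_split)

# The random forest version looks very similar: only the model spec changes
rf_spec <- rand_forest(mtry = tune(), min_n = tune(), trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("classification")

rf_wf <- workflow() %>%
  add_recipe(murmur_recipe) %>%
  add_model(rf_spec)

rf_tuned <- tune_grid(rf_wf, resamples = folds, grid = 20)
rf_fit <- rf_wf %>%
  finalize_workflow(select_best(rf_tuned, metric = "roc_auc")) %>%
  last_fit(data_split)

# Compare accuracy and ROC AUC on the held-out test set
collect_metrics(lasso_fit)
collect_metrics(rf_fit)
```

The main point is that, apart from the model specification itself, the lasso and random forest code are almost identical, which is what makes comparing quite different models relatively painless.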
So there is some justification for using more complicated models rather than the simple ones in this case. The performance here is still not great, though; the top right corner looks a little bit too high for me, so we're missing out on too many of the heart murmurs in our classification. There's still quite a lot of work to do on this; these are some initial results.

Before I wrap up, I wanted to talk briefly about some of the potential use cases of this, and also about things that we need to be a little more aware of if we're using machine learning in a healthcare or medical setting. When we're thinking about how we use machine learning or time series analysis with this type of data, one of the main potential uses is as an additional diagnostic tool. It's never about replacing a doctor with an ML algorithm; it's about giving more information in a useful way. And because these phonocardiogram signals can pick up subaudible sounds, they give more information. It could also be a fairly cost-effective method of first-line screening, to help speed up the referral process to get a confirmation of diagnosis. There's also potential for doing something similar for longer-term monitoring of patients who have a diagnosis, and for further classification of these signals, thinking a little bit about tailored treatments.

So I do think there's a lot of potential for using machine learning in healthcare and medicine, but there are several things we need to be aware of, and one of those is bias. By bias, I mean systematic error due to incorrect assumptions in your model, and this is something that we see quite a lot of. For example, some machine learning models for diagnosing skin cancer were found to be more accurate on people with lighter skin. So if you say this model can be used to classify skin cancer, then you may very well be systematically missing cases of cancer. And equally, if you say this model can only be used to detect skin cancer in people with lighter skin, then you're excluding entire groups of people from the benefits of earlier and less invasive detection.

The first thing you should be looking at when it comes to bias is making sure that your data are representative of the population. For the skin cancer example, the reason it worked better on lighter skin is that that's what the images in the training data were of, and that doesn't reflect the population of people you want to roll this technology out to. A second step, which doesn't fix the bias itself, is to actually check whether you have bias in your model output. Very often we see people reporting models with X% accuracy, but is it X% accurate for everybody? If you evaluate your model across different groups, as well as on the whole, you might find that your model doesn't perform as well as you think it does.
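One simple way to do that kind of check, sketched below with yardstick, is to compute your evaluation metric within each group as well as overall. The predictions data frame and the age_group column here are hypothetical placeholders; the same idea works for whatever grouping variables are relevant to your data.

```r
library(dplyr)
library(yardstick)

# Hypothetical held-out predictions: the true outcome, the predicted probability
# of a murmur, and a grouping variable such as age group
# predictions <- tibble(murmur = factor(...), .pred_present = ..., age_group = ...)

# Overall performance
roc_auc(predictions, truth = murmur, .pred_present)

# The same metric evaluated separately within each group: a model that looks
# fine overall may perform noticeably worse for particular groups
predictions %>%
  group_by(age_group) %>%
  roc_auc(truth = murmur, .pred_present)
```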
The other key element of machine learning that we need to think about, and it's slightly related to bias, is model evaluation; in particular, which metrics do we use to evaluate models? In some domains, there's little difference in consequences between a false positive and a false negative. In medicine, that is not true. Telling a patient they're perfectly healthy when in fact they have a heart condition, for example, I would argue has far more significant consequences than the opposite way around, and this affects how we evaluate models. So metrics like accuracy, which measure the percentage of correct classifications, don't really work here, especially for rarer conditions. If 1% of people have a condition and you tell everyone that they don't have it, then you've got a model with 99% accuracy, which looks great on the surface, but it isn't. There are other metrics, some of which I've mentioned, that deal with this sort of imbalanced data better, but they're often not as easy for non-technical audiences to interpret. So there is definitely still a question of the best way for us to evaluate whether or not a model is working, and these aren't necessarily questions with an obvious answer, so I'll leave you to ponder those and end it there. The slides for this talk are on my website, as are the blog post and code. So with that, thank you very much for listening.

Thank you, Nicola. We have some great comments in the chat here. So that's great.