Hey, thanks, Tony. First of all, I want to congratulate the HasGeek team for organizing such a cool event; it has been very seamless, and the talks today have been really interesting. I hope you are all caffeinated after the evening tea for one more session.

To give a brief intro about myself, my name is Aditya Patel. I'm part of the data science team at Stasis. Stasis is a healthcare startup working in the continuous vitals monitoring space, and what we're trying to do is develop ML systems to detect patient deterioration. Previously, I've been part of research groups at Tata, and before that at Medtronic, one of the biggest medical device companies in the world.

Today, I want to talk about time series classification. Traditionally, time series classification with neural nets is associated with LSTMs, but in this talk I want to shed some light on different architectures that can be used when you're working on time series classification in real-time settings. To give a brief outline of the talk: I want to start with the context of the problem, why we started working on it, then explain one of the most frequently used algorithms and its advantages and disadvantages, then move on to our methodology, and at the end talk about some other noteworthy methodologies and takeaways.

So, time series data. Time series data is ubiquitous, from IoT devices to self-driving cars; it affects all our lives in ways we can't even imagine. And as anybody working on IoT devices will tell you, the expectation is very different from the reality. At Stasis, we measure five different vitals, blood pressure, heart rate, blood oxygen, respiration, and temperature, coming from four different probes, and for us too the expectation was very different from the reality. What we got in reality was data that was really noisy because of patient movement, plus a lot of missing data, because patients don't wear all the sensors all the time. So building a predictive model on top of it was really challenging.

Given the importance of time series classification, the machine learning community has done a lot of research on it over the past few decades. There is the similarity measure approach, where you apply dynamic time warping to align your time series and then apply a supervised classifier on top of it (a small sketch of this idea follows below). The problem is that dynamic time warping has a very high time complexity, which makes it impractical for real-time use; there have been methods to speed it up, but in general it's really slow. On the statistical side, we have ARIMA modeling, where you exploit the auto-regressive nature of your time series and, based on that, try to predict the sequence. The assumption there is that a linear function of the past predicts the future well, but physiological vital signs data is really non-stationary, so ARIMA modeling wasn't working well for us. And with the advent of deep learning, we have recurrent neural nets.
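Before moving on to recurrent nets, here is a minimal sketch of the DTW-plus-nearest-neighbour idea mentioned above. This is not the implementation we used; the function names and toy series are purely illustrative, and a production version would need a faster, constrained DTW.

```python
import numpy as np

def dtw_distance(a, b):
    """Plain dynamic-time-warping distance between two 1-D series (O(len(a) * len(b)))."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Each cell extends the cheapest of the three allowed alignment moves.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def knn_dtw_predict(train_series, train_labels, query):
    """1-nearest-neighbour classification under the DTW distance."""
    dists = [dtw_distance(query, s) for s in train_series]
    return train_labels[int(np.argmin(dists))]

# Toy usage: two short series of different shape, then a query of a different length.
train = [np.sin(np.linspace(0, 3, 40)), np.linspace(0, 1, 40)]
labels = ["oscillating", "trending"]
print(knn_dtw_predict(train, labels, np.sin(np.linspace(0, 3, 35))))
```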
In the past few years, there has been an influx of papers on time series classification using recurrent neural nets. You see recurrent neural nets employed everywhere in time series, from music generation to machine translation to speech recognition, and rightly so: recurrent neural nets are sequence models that exploit the sequential nature of time series. You have a hidden vector that is propagated through time, and the intuitive appeal of a recurrent neural net is that the hidden activation vector is a representation of the entire time series seen so far.

But when we applied it to physiological vital signs data, we didn't get results as good as we expected. On your right, you can see an autoencoder experiment where we tried classifying two classes of patient deterioration: one class was stable patients and the other was unstable patients. The feature representation wasn't working well for us, and when we started digging into it, we saw that the noisy data and missing data were causing the representation to not come out correctly. The other two disadvantages we saw with the LSTM architecture were, first, that because of the recurrent nature of LSTM networks it's really slow to train (it takes a lot of time to train the network and iterate on it), and second, that because of our data limitations it wasn't able to generalize well.

Given those disadvantages, we started looking into other methods, and the one that really spoke to us was convolutional networks. Our hypothesis was that the convolution layers and the dropout layers would be able to mitigate the negative effects of noisy and missing data. To give a brief intro: convolutional networks come from a neuroscientific principle where you apply convolution locally through kernels and then stack up layers to get a rich feature set you can use to classify your inputs.

To experiment with this architecture, we decided to take patient mortality as the output label. Our experimental setting was pretty standard; we had a very basic convolutional architecture: a convolution layer, then a ReLU to introduce non-linearity, dropout for regularization, and a softmax for classification (there's a small sketch of this kind of stack below). To give a brief summary of our training data, we worked on an open-source dataset from MIT called MIMIC. At this point I would say that anybody working on healthcare data should definitely look into MIMIC; it's a really good baseline for healthcare work. We were working with physiological vital signs such as blood pressure, temperature, heart rate, and SpO2, and what we were trying to predict was: can we predict patient mortality in the next hour?

What we saw was pretty interesting. The recurrent neural net performed reasonably well, with an AUC of 80%. But what was really interesting for us was that the CNN had an AUC of 87%, and it kept working well even with noisy and missing data. The other interesting thing was that, because of the parallel processing, the training time was much lower.
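Here is a minimal Keras sketch of that kind of convolution, ReLU, dropout, softmax stack. The input shape, layer sizes, and library choice are illustrative assumptions, not the exact configuration we used.

```python
from tensorflow.keras import layers, models

def build_1d_cnn(timesteps=24, n_vitals=4, n_classes=2):
    model = models.Sequential([
        layers.Conv1D(32, kernel_size=3, padding="same", activation="relu",
                      input_shape=(timesteps, n_vitals)),   # local temporal filters + ReLU
        layers.Dropout(0.3),                                 # regularization against noisy/missing data
        layers.Conv1D(64, kernel_size=3, padding="same", activation="relu"),
        layers.GlobalAveragePooling1D(),                     # collapse the time axis to one feature vector
        layers.Dense(n_classes, activation="softmax"),       # class probabilities
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_1d_cnn()
model.summary()
```

Global average pooling is just one simple way to collapse the time axis before the softmax; a flatten plus dense head would work as well.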
The CNN's training time was actually about one fourth of the time taken by the LSTM. To give more information on this, we also created a baseline using hand-crafted features: we did feature engineering and then applied traditional supervised algorithms such as logistic regression and dense MLPs. You can find more details in the paper, but here I just want to concentrate on the CNN and the LSTM. What we saw was that the CNN was better in performance, much faster, and robust against missing data and noise as well.

I would also say at this point that if somebody wants to check out more papers on convolutional networks, they should definitely look at Google's WaveNet paper; they have been using convolutional networks to generate raw audio. And last month there was a paper by Facebook doing natural language processing with convolutional networks, which they call convolutional sequence-to-sequence learning. If you were in the talk given by Arnold, he also talked about sentiment and sarcasm detection using convolutional networks, which was pretty interesting to me.

Given the success of convolutional networks, we started looking into different, more generic kinds of convolutional networks, and what we wanted to test was the representational power of these networks. One paper that really spoke to us was proposed by a group from CMU, and they named the architecture temporal convolutional networks. Temporal convolutional networks have three characteristics. The first is that the convolution is causal in nature: there is no future leakage in the convolution process. For example, if you have a kernel of size three and you apply the convolution at time t = 2, the convolution is limited to the current and past values, something like w0·x2 + w1·x1 + w2·x0, because you cannot access future values. The second characteristic is what they call dilation, where you skip a fixed step between two adjacent filter taps. With a dilation of one you have a normal convolution, and as you increase the dilation, the distance between adjacent elements in the convolution increases. What that does is increase the receptive field of the convolution, which in turn increases its representational power. And third, they wrap this entire dilated causal convolution in a residual block (a sketch of one such block follows below). The reason for the residual block is to let you go deeper: it takes care of some of the issues you hit when going deeper, such as vanishing and exploding gradients, because the network only has to learn a residual on top of an identity mapping, which is much easier to learn. In our case, since we weren't going that deep (no more than four or five layers), the residual block wasn't really needed, but to keep the analysis generic we included it anyway. Similarly, a 1x1 convolution is usually needed when your input size has to match your output size, but since we were doing classification, the 1x1 convolution was also not needed.

In this scenario, we moved from an open-source dataset to the dataset we collected at Stasis.
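Here is a minimal Keras sketch of one such residual block, roughly in the spirit of the architecture just described. The filter counts, kernel size, dilation rate, and input shape are illustrative assumptions rather than the exact configuration we used.

```python
from tensorflow.keras import layers, models

def tcn_residual_block(x, filters=32, kernel_size=3, dilation_rate=2, dropout=0.2):
    # padding="causal" pads only on the left, so the output at time t never sees inputs after t.
    y = layers.Conv1D(filters, kernel_size, padding="causal",
                      dilation_rate=dilation_rate, activation="relu")(x)
    y = layers.Dropout(dropout)(y)
    y = layers.Conv1D(filters, kernel_size, padding="causal",
                      dilation_rate=dilation_rate, activation="relu")(y)
    y = layers.Dropout(dropout)(y)
    # 1x1 convolution on the skip path only when the channel counts don't already match.
    if x.shape[-1] != filters:
        x = layers.Conv1D(filters, kernel_size=1, padding="same")(x)
    return layers.Add()([x, y])

# Tiny usage example: two blocks with increasing dilation, then a 3-class softmax head.
inputs = layers.Input(shape=(24, 4))          # 24 time steps of 4 vitals (illustrative shape)
h = tcn_residual_block(inputs, dilation_rate=1)
h = tcn_residual_block(h, dilation_rate=2)
outputs = layers.Dense(3, activation="softmax")(layers.GlobalAveragePooling1D()(h))
model = models.Model(inputs, outputs)
```

Stacking a few of these blocks with doubling dilation rates (1, 2, 4, 8) grows the receptive field quickly without needing many layers.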
We also went one step further, from a binary classifier to a multi-class classifier. We ended up with three classes: one for stable patients and two for patients at different severities, so one class for stable patients, one for medium-risk patients, and one for patients who would need immediate attention in the near future. Our input was pretty standard: we took two hours of data, sampled every five minutes, to predict the next 20 to 25 minutes, so we had 150 values in our input matrix.

The results were as follows. Our 1D CNN still outperformed the temporal convolutional network, though not by much, and both convolution-based networks outperformed the RNN. When I talk about the RNN here, I mostly mean LSTM networks, because that's what gave us the best results among the different recurrent architectures. The other interesting thing we saw was that the temporal convolutional network's score for the highest-risk patients was 91%, which was really interesting to us, because these are the patients who require immediate help; if you can identify them, you reduce the burden on nurses, get them help sooner, and actually save lives. So what we ended up doing was combining the temporal convolutional network and the 1D CNN into an ensemble model, which gives us the robustness and the precision required in our case.

This is a retrospective analysis of a real patient's data. What you can see is that there is a lot of missing data (the patient is not wearing a BP cuff), and there is also variability in respiration and temperature. The ensemble of the 1D CNN and the TCN was able to predict well ahead of time that the patient was going to deteriorate, and both the lead time and the deterioration were significant enough to warrant a nurse alert and a nurse or clinician visit. This system is currently in a beta program in multiple hospitals in Bangalore, and so far we have received good feedback from the clinicians.

Moving on, I also want to talk about some other techniques people should look into when working on time series classification. There is the bag-of-words approach. It's a pretty simple approach: you quantize your time series into different bins and assign a character to each bin, so what you get is your time series converted into a character array, and on that character array you can apply any supervised technique to do the classification (a small sketch of this appears below). For the same multi-class classification task, this simplistic technique, also called piecewise aggregation, gave us its best results when we used it with gradient boosting, with an F1 score of 74%, which is really close to the LSTM, a model that is much more complicated and much harder to explain. You can apply other techniques as well. There is the idea of converting your time series into images: if you think about it, a time series lives in Cartesian coordinates, and what you can do is convert it from Cartesian to polar coordinates.
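Here is a minimal sketch of that binning idea. The segment count, alphabet size, and equal-width bins are illustrative choices; the resulting "word" can then be handed to any ordinary classifier such as gradient boosting.

```python
import numpy as np

def series_to_word(series, n_segments=10, alphabet="abcd"):
    """Piecewise-aggregate a series and quantize each segment mean into one character."""
    series = np.asarray(series, dtype=float)
    # Piecewise aggregation: the mean of each of n_segments roughly equal chunks.
    segment_means = [chunk.mean() for chunk in np.array_split(series, n_segments)]
    # Equal-width bins over the observed range, one character of the alphabet per bin.
    edges = np.linspace(min(segment_means), max(segment_means), len(alphabet) + 1)[1:-1]
    return "".join(alphabet[int(np.digitize(m, edges))] for m in segment_means)

# Toy usage: a synthetic vital-sign-like trace becomes a short character "word".
print(series_to_word(np.sin(np.linspace(0, 6.28, 120))))
```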
Coming back to the image idea: once you are in polar coordinates, you take each point's interaction with every other point, and from a time series of length N you get an N x N image. There are various ways of doing this. There is something called Gramian matrices, which take the summation or the difference between each pair of points and give you these images, and you can then use them as input channels for a convolutional network to classify your time series. In our case, we had five different time series at any given point in time, and stacking them all together was not giving us a worthwhile performance increase. So for us the simplest option was applying GADF (the Gramian angular difference field) with a CNN, which again outperformed the LSTM.

Takeaways. Machine learning is an exciting field, with new research coming out every day of the week, and as machine learning practitioners we want to apply those techniques everywhere; whenever a new algorithm comes out, I want to apply it to the problem I'm working on. But at this point I want to give a word of caution and reiterate that everybody should think about what problem they are trying to solve. For us, the biggest hurdle was the noisy data issue, which convolution addressed, so working on convolutional networks made sense for us. I also want to say that this entire presentation was about understanding the representational power of each network in the most generic sense. There have been recent advances in recurrent neural nets that negate some of the drawbacks I mentioned, such as the attention mechanism; I think there is a talk in Audi 2 about it, which addresses some of those drawbacks. And in the end, I also want to make the case for the 1D CNN as a starting point when you're working in real-time settings; as shown in this presentation, it's a pretty robust one.

For future work, we are working on understanding what causes these activations in a time series sequence, so we can understand what is causing a patient to deteriorate. For that, we are working on deconvolution networks, and hopefully we'll get some validation from clinicians on that as well. Happy to take any questions or comments.

Yeah, so I saw that you converted the time series into a sequence of characters, right? So that is sort of a discretization of the continuous waveform, right?

Yeah, exactly.

So that is some amount of information loss out there.

Sorry?

There will be some amount of information loss out there, right? So what are the kinds of techniques that you apply on top of it, because a lot of information is being lost there? I tried it out a couple of years back and it didn't work out very well for me. So what kinds of things have you tried?

Good question. When I talked about piecewise aggregation, I was also trying to point out the simplicity of the technique and that it can be used as a base model. Like you mentioned, you are quantizing your time series into different bins, so you're losing a lot of the coherence between your points, and that's why convolutional networks work better than this: in this approach, the quantization is linear.
Whereas when you apply convolution locally, you effectively get a non-linear quantization, which gives you better results. So, these are my contact details.

Yeah, so for a patient to be... okay, sorry, thanks. So for a patient to be diagnosed correctly, for how long should we see their observations? Does this technique work with, say, half an hour of historical data, or can it be a month long, or...?

Okay, so in our case we are looking at something on the order of half an hour to 40 minutes, because that's the time required for our clinicians to administer medicine and for it to take effect. If we predicted any sooner, the clinicians would wait for some time before administering anything, because they want to validate it completely.

And what's your opinion about applying these techniques at a larger scale? I mean, for a longer duration of time; let's say we take one year of data and try to figure out what's going wrong with the person. Is it applicable, or does it not work well?

So there is a paper on palliative care by Anand Avati. He took ten years of data and was trying to see whether a patient is going to pass away within the next year. What he was working on was essentially finding the point where you should stop trying to improve the patient's condition and start administering care for comfort. So his task was over a ten-year horizon, and he used something very similar; I think he used an LSTM network in that case to take a ten-year time stream. In our case we want to be more immediate, because we want to employ it for patients who are in the hospital after a surgery, or before a surgery. So for us, two to four hours is good enough.

Thank you. Thank you.

I have two questions. One, when you applied the one-dimensional CNN, in addition to looking at model performance, prediction accuracy, precision, et cetera, did you also look at interpretability?

Yeah, so that's what we are currently working on; we are trying to understand how to interpret it. With a 1D CNN, one way you can do it is to give it a bunch of inputs, see which ones activate it, and then get the ones that activate it validated by the clinicians, so you can understand what's happening. The other way is to work backward: if you were in Dr. Vinit's talk, he used something called Grad-CAM, where you can backprop the activations and actually visualize them. Those are some of the things we are working on right now, interpretability and explainability, because that's what's important as well: if something is getting activated, why is it activated, and why is that related to patient deterioration? That is also something we are actively working on right now.

Okay, and I have one more question. The one you mentioned about binning the data, then treating each bin as a character and then using NLP techniques. Have you come across using LDA or anything like that with this data?

So we haven't done LDA on it, I'll be honest. We tried taking the representation and seeing if something like a PCA can actually make the representation better.
We didn't apply that to the piecewise aggregation; we were mostly focusing on what was performing best for us, which was the 1D CNN. In that, we were able to see a bunch of different clusters as well.

Okay, yeah. And one last question. Did you look at any anomalies in the data when you used these methods, and did you have any observations on that? Anomalous patterns.

So there are always anomalous patterns, and that's why the results are not perfect. As you can see, even our best result is 83, 84%, so it's still not where we'd like it to be. And that's with some constraints in place; for example, you need to have a certain percentage of data available for us to make a prediction. If we can improve our signal quality, remove some of those noisy data issues that happen because a patient is moving, and include that in our model, or detect patient movement and include that in our model, our accuracy is definitely going to improve.

Hello. Yeah, if you want to include some, let's say, external variables (in ARIMA you would call them exogenous variables), for example past patient information, gender, these kinds of things, how can you include them in this model?

So one way is to take the last layer of your CNN, club it with all the exogenous elements you spoke about, and then feed that into a classifier; it will obviously give better accuracy. In this work we even shied away from using demographic information, such as age, sex, and other patient information like patient notes, because what we are trying to build right now should be as real-time as possible. When you're working on an IoT device, that kind of information has to come from another server, which is hard. So we are shying away from it and trying to understand how much we can leverage what we have right now to make it as real-time as possible.

Got it. So within a short span of time, it adapts to whatever the context is.

Exactly. Yeah. Okay. Feel free to reach me at the thirdstaceslabs.com. Would love to chat offline if anybody has any questions. Thank you.