Thank you. My name is Hongjoo Lee, and I'm from South Korea. This slide is who I am, but I have to skip it because I have a lot to talk about.

Today I'm going to share my toy project: logging metrics of a home network, analyzing the data, and doing some forecasting for detecting anomalies. Here's the outline of the whole process, from data collection, through time series analysis and forecasting, to modeling and anomaly detection. We'll go through the items under each step as long as time allows. But instead of completing everything for each stage, I'll give a brief overview at the surface first and gradually get deeper into each process by iterating over the steps. So you will see a lot of figures at the beginning and then some text and code later. There will be almost no math equations — there are some, but almost none — as we don't go that deep.

To start with, a naive approach to anomaly detection. Let me share how this project started, at the very beginning, when I was living in Hong Kong. I'm Korean, but I lived in Hong Kong for more than two years. One day the internet started to fail continuously, so I made a call to the service provider. An engineer came and tested the network with his own device, but at that moment everything was just normal, and I could not reproduce the failure.

From the next day, I installed a speed test app on my smartphone and started to capture the test result every time the network went down. Then I called the engineer again and showed him the captured images of the failures. This time he said a wireless device is just not reliable, and he asked me to test with a wired device. I was pissed off. At that time the only wired device I had was a Raspberry Pi with a LAN port, so I ran speed tests on a regular basis and kept logging for a few days before the engineer's next visit.

This is the graph I showed the engineer at that time, in 2015. As you can see, the disconnections repeated several times a day; in the upper graph, the red crosses at the bottom mark the disconnections. At last the engineer replaced the modem, and the internet service became stable.

In this case the disconnections are the anomalies, but there are other types of anomalies in time series data, which we will see in the next slide. And actually this kind of analysis is not really data analysis — there's no forecasting either. I just waited for the expected failures to be repeated. That's why it's just a naive approach.

Before we go deeper, let's generalize the problem and consider what we should care about. The problem is detecting abnormal states of a home network; in a more general way, we can say anomaly detection for time series. So what is a time series? A time series is a set of observations of a value at different times, and such observations have to be collected at regular time intervals. As for anomalies, there are several types of anomalous patterns in time series. Let's take a look one by one.

First, additive outliers, which are unexpected spikes and drops. The disconnections we just saw are a typical example of this type of anomaly. Next, temporal changes: unusually low or high observations for some short period of time. And next, the level shift: in this case the metric doesn't change its shape, but the overall value of the period changes, since the statistical characteristics have been changed by the shift. There must be many things to be done after detecting such anomalies, so level shifts are a very important type of anomaly that we have to deal with.

Let's go to the next step. The second round starts with data collection. I used speedtest-cli, which is a command line tool written in Python for internet speed tests. It simply gives you the metrics: response time of a ping test, download speed, and upload speed. I ran the test with a cron job every five minutes and collected almost 20,000 observations over three months.

This is what the log output looks like. Each test is separated from the next by a delimiter: three right-angle-bracket symbols in a row. Some of you may have noticed that the tests didn't start at the exact time — there are many cases where the test started one or a few seconds late. But it does not make a huge difference, and it can easily be corrected later; you'll see. This is the iterator class I use, which reads the log string until the next delimiter appears, then parses and stores the metrics and datetimes.
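The slide code isn't reproduced in this transcript, so here is a minimal sketch of such an iterator, under assumptions of my own: a hypothetical log layout where a `>>>` delimiter line carries the timestamp, followed by the `Ping`/`Download`/`Upload` lines that `speedtest-cli --simple` prints. The class name and the exact format are illustrative, not the speaker's actual code.

```python
import re
from datetime import datetime

class SpeedTestLog:
    """Iterate over a speedtest-cli log, yielding one record per test.

    Assumed format: tests are separated by a '>>>' delimiter line that
    carries the timestamp, and each block contains the 'Ping/Download/
    Upload' lines printed by `speedtest-cli --simple`.
    """

    DELIM = re.compile(r'^>>> (?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})')
    METRIC = re.compile(r'^(Ping|Download|Upload): ([\d.]+)')

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        record = None
        with open(self.path) as fh:
            for line in fh:
                m = self.DELIM.match(line)
                if m:                       # a new test block starts here
                    if record:
                        yield record
                    record = {'time': datetime.strptime(
                        m.group('ts'), '%Y-%m-%d %H:%M:%S')}
                elif record:
                    m = self.METRIC.match(line)
                    if m:                   # Ping (ms), Download/Upload (Mbit/s)
                        record[m.group(1).lower()] = float(m.group(2))
        if record:
            yield record
```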
It's time to build a DataFrame with pandas. I make a list of speed test objects by parsing the log strings, and next I build the datetime index for the DataFrame. This is how I deal with the incorrect start times: by explicitly setting seconds and microseconds to zero for each timestamp. The index is very important for time series data; as I mentioned before, by definition the observations have to come at regular time intervals.

Here's the graph showing the raw data: the blue one on top is the ping test, orange is the download speed, and green is the upload speed. We have to handle some missing data. Handling missing data is very important — sometimes it raises unexpected errors in your code, and it can possibly lead us to incorrect results, which is even worse. We can obviously see some accidental missing parts lasting a few days. The first gap was a failure of the Raspberry Pi; for the second one, I don't know, the server was just not responsive. In this case I cannot fill up those missing parts with arbitrary values — the holes are too big. Instead I split the data around them: the first part is plenty for training, and I use the second part for validation and the last part as test data.

In the raw data there are also a few cases of missing values that we can hardly notice in the visualization, but we have to examine such missing data carefully. With the first line of the code we can check whether there is any missing data in the DataFrame, and I handle it by propagating the last valid observation forward into the missing hole. This is one typical way to do it.

Handling a DataFrame with a datetime index is super convenient: I can slice the time series, resample it, group it by a certain period, and do some aggregations. These are a few examples I used. There was actually a talk yesterday about pandas indexing, which I really enjoyed.
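Again the slide code isn't in the transcript, so this is a sketch of the steps just described — the index correction, the missing-data check, the forward fill, and a few DatetimeIndex conveniences — assuming the hypothetical `SpeedTestLog` parser from the previous sketch and illustrative column names and dates:

```python
import pandas as pd

# Hypothetical: `SpeedTestLog` is the parser sketched above.
records = list(SpeedTestLog('speedtest.log'))
df = pd.DataFrame(records)

# Correct the slightly late start times by zeroing seconds and
# microseconds, so every observation lands on its five-minute boundary.
df['time'] = df['time'].apply(lambda t: t.replace(second=0, microsecond=0))
df = df.set_index('time').asfreq('5min')   # regular index; gaps become NaN

print(df.isnull().any())   # is there any missing data in the frame?
df = df.ffill()            # propagate the last valid observation forward

# A few of the conveniences a DatetimeIndex gives you:
week = df.loc['2015-08-03':'2015-08-09']           # slice by date strings
hourly = df['download'].resample('1h').mean()      # hourly averages
weekday_medians = df['download'].groupby(df.index.dayofweek).median()
```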
Frankly speaking, a few years back, when I didn't know much about pandas, I was actually avoiding it because it gave me too much confusion. At that time I used to put the datetime string or datetime object in an ordinary column, search the DataFrame to get a numeric index, and then query again. It was ridiculous, but I actually did that. So don't be scared: the more we know, the less pain we get.

Now let's have a look into the data. This is the hourly plot for each day from Monday to Sunday for one week: 24 hours from midnight on the x axis, and the y axis shows the download speed in megabits per second. As you see, there is no specific pattern repeating each day, but maybe you can notice that there is less fluctuation at night, on the right side of each chart, and the measured speed stays high there.

Next I drew a box plot for each day, and here we can find a pattern within the week. This one is Sunday — my mouse pointer doesn't reach up there, but you can see it on every Sunday. Focusing on the orange line, which is the median download speed for each day, it shows a regular oscillation, and the medians of Saturdays and Sundays go higher than on weekdays. So it shows a clear pattern.

It is this kind of repeating pattern that lets us categorize the components that make up time series data. An observed time series can be decomposed into three components. The trend exists when there is an increasing or decreasing direction in the series; the trend component does not have to be linear, it could be exponential or logarithmic. The seasonal pattern exists when a series is influenced by seasonal factors. And lastly, the random noise: this is the component of the time series left over after the other components have been removed. It is completely random, with zero mean and constant variance, and it plays a very important role for anomaly detection, as we will see later.

So a time series can be formally defined with an additive model, y(t) = trend + seasonality + noise, or a multiplicative model, y(t) = trend × seasonality × noise. We will deal with these components more later; for now, we just try to decompose them with a Python tool and see whether there are a trend and seasonality in our time series. Here I decomposed the daily download time series for a week, from Monday to Sunday, into a seasonal component and a trend component. I used the seasonal_decompose function in the statsmodels package, and you can see that there exists a seasonal pattern and a clear trend, even if it was not obvious when visualizing the original data on the left side.
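For reference, a minimal sketch of that decomposition step, reusing the hypothetical `df` from the earlier sketches (the date range is illustrative):

```python
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# One hypothetical week of download speed at 5-minute resolution.
week = df['download'].loc['2015-08-03':'2015-08-09']

# Additive decomposition into trend, seasonal and residual components.
# period=288 treats one day (24h * 12 samples/hour) as one season;
# older statsmodels versions call this argument `freq` instead.
result = seasonal_decompose(week, model='additive', period=288)
result.plot()
plt.show()
```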
It's time to build a model. But before we go deeper into the modeling algorithm itself, we need to think about how the modeling process for time series differs from an ordinary machine learning process on time-invariant data. There, we can split the data into a training set and a test set, use the training set to fit the model, and generate a prediction for each element in the test set. This is one general way to train and validate a model. Say we divide the data into three parts, A, B, and C: we train a model with parts A and B and validate it with part C, then repeat the same process, this time with B and C as training data and part A as test data. This is the typical process called cross-validation; anyone with expertise in machine learning will be familiar with it.

However, cross-validation cannot be used for time series data, because of the time dependency. Part A is not independent of parts B and C, so it is unreasonable to test the model on part A after training it on parts B and C. Moreover, a model trained on all the old data reflects recent behavior less than one trained on the latest observations, so we have to re-create the model after each new observation is received. This is the so-called rolling forecast.

Here's the piece of code running the rolling forecast. We keep track of all observations in a list called history, which is seeded with the training data initially; later, new observations are appended on each iteration. We step over each new observation in the test data set, build an updated model from the previous observations, and with the updated model we forecast one step ahead, for time t. Then we store the forecast value in a list, and lastly we keep history updated with the new observation at time t.

This is how we do the rolling forecast. On the left side, as the forecasting result, the blue line represents the original data we saw before, and the orange line shows our predictions, starting from the middle of the week. But the more important point here is the residuals, on the right side. The code block calculates the residuals and plots their distribution on the right. A residual is the difference between the actual observation at time t and the predicted value at time t. It follows a normal distribution — you can see the bell curve — meaning it is just white noise. As I mentioned before, this is very important for anomaly detection: once we have residuals from a robust forecasting model, we have Gaussian random noise to work with.
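The slide code isn't in the transcript; here is a sketch of the rolling forecast and residual computation just described, assuming the train/test split mentioned earlier (with `train` and `test` as hypothetical DataFrames) and using statsmodels' ARIMA — introduced later in the talk — as the model being re-fit:

```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA

def rolling_forecast(train, test, order=(1, 1, 1)):
    """Re-fit the model for every new observation, forecasting one step ahead."""
    history = list(train)              # seeded with the training data
    predictions = []
    for obs in test:
        model = ARIMA(history, order=order).fit()
        yhat = model.forecast(steps=1)[0]   # one-step-ahead forecast for time t
        predictions.append(yhat)
        history.append(obs)            # keep history updated with the real value
    return np.array(predictions)

preds = rolling_forecast(train['download'].to_numpy(),
                         test['download'].to_numpy())

# Residuals: actual observation minus predicted value at each time t.
residuals = test['download'].to_numpy() - preds
plt.hist(residuals, bins=50)   # for a good model this resembles a bell curve
plt.show()
```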
With the residuals, outlier detection can be done in several ways: using the interquartile range, the standard deviation, or the median absolute deviation.

The interquartile range is quite popular. Sorting the data, the median is in the middle, and the first and third quartiles are positioned at the lower 25% and the upper 75% respectively. If a data point falls in the red area beyond those bounds, it is considered too far from the center of the values to be reasonable — hence an outlier. I can implement this with NumPy or pandas.

With the standard deviation: if a value is a certain number of standard deviations away from the mean, the data point is identified as an outlier. The specific number of standard deviations is called the threshold; usually we use three standard deviations — three is the most common, I think. Again, we can obtain the outliers with a few lines of NumPy or pandas.

Okay, now for the median absolute deviation, the most powerful approach I've tried. Say we have a univariate data set. The MAD is defined as the median of the absolute deviations from the data's median. That is: get the median of the data first, then take the residual of each data point from that median; the median absolute deviation is the median of the absolute values of those residuals. It's clearer as an equation: MAD = median(|x_i − median(x)|). If a value is a certain number of median absolute deviations away from the median — say 3 MAD — that value is classified as an outlier.

There is a short paper, "Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median," published in 2013. As I remember, it's just four pages, and it gives a super clear idea of why we should use MAD rather than the other methods. I highly recommend reading it if you are interested.
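The talk shows these three detectors as slide code; a minimal NumPy sketch of all three, applied to the residuals from the earlier sketch:

```python
import numpy as np

def outliers_iqr(x):
    """Flag points beyond 1.5 * IQR outside the first/third quartiles."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

def outliers_std(x, threshold=3):
    """Flag points more than `threshold` standard deviations from the mean."""
    return np.abs(x - np.mean(x)) > threshold * np.std(x)

def outliers_mad(x, threshold=3):
    """Flag points more than `threshold` MADs from the median; 1.4826
    scales the MAD to be consistent with the standard deviation under
    normality (as in Leys et al., 2013)."""
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    return np.abs(x - med) > threshold * 1.4826 * mad

anomalies = outliers_mad(residuals)   # boolean mask over the residuals
```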
For the next step, we go through ARIMA. ARIMA is a class of statistical models. It's a classic — developed maybe 60 years ago — but it's still really powerful, and it can be used for modeling, analyzing, and forecasting time series data. ARIMA performs well on stationary time series, so we need to understand what a stationary time series is and how to transform non-stationary data into stationary data.

To understand stationarity, here are the three criteria: the mean, variance, and covariance of the series should be time-invariant. First, the mean of the series should not be a function of time. In the graph, the left-hand series satisfies the condition, whereas the series on the right, in red, has a time-dependent mean, and the mean value keeps increasing as time goes by. Next, the variance of the series should not be a function of time: in the following chart, the blue graph is a stationary series, and you can notice the varying spread of the distribution in the right-hand graph, which is not stationary. Lastly, the covariance of the series should not be a function of time. In the last graph, you will notice that the spread of the series becomes narrower as time increases; hence the covariance is not constant over time for the red series.

We can test stationarity of a time series with a Python library and statistics. We have the Dickey-Fuller test for testing stationarity, and the statsmodels package has an implementation of it. If the test statistic — you can see it at the bottom — goes below the 1% critical value, then we can consider the time series stationary.

But what if the time series is not stationary? The main problem when dealing with time series data is that they are usually just not stationary, so we have to make them stationary before doing anything else. When the data is not stationary, statistical properties like the mean, variance, and maximum or minimum value change over time. Non-stationary data can be made stationary by differencing the values: first-order differencing subtracts the previous value, y'(t) = y(t) − y(t−1). A series which becomes stationary after being differenced d times is said to be integrated of order d, denoted I(d). This "integrated" is what the letter I in the middle of ARIMA stands for.
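A minimal sketch of that test-and-difference loop, using statsmodels' `adfuller` (the augmented Dickey-Fuller test) and the hypothetical `df` from earlier:

```python
from statsmodels.tsa.stattools import adfuller

def is_stationary(series, crit='1%'):
    """Augmented Dickey-Fuller test: a test statistic below the critical
    value rejects the unit-root hypothesis, i.e. the series is stationary."""
    stat, pvalue, _, _, critical_values, _ = adfuller(series)
    return stat < critical_values[crit]

# Difference the series until it passes the test; the number of rounds
# is the order of integration d (a candidate for ARIMA's d parameter).
series, d = df['download'], 0
while not is_stationary(series.dropna()):
    series = series.diff()   # y'(t) = y(t) - y(t-1)
    d += 1
```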
Next, the autoregression. To simplify, an autoregression is just a linear regression of the series on itself, over p time steps of lag. "Auto" means "self" in ancient Greek. An ordinary linear regression has several features, but in an autoregression there is no feature other than the time series itself: it regresses on itself over time. The moving average works in a similar way: a moving average model is also a self linear regression, but not on the actual observations — on the residual errors of the previous time steps.

Putting it all together, here is the summarized ARIMA model and its required parameters. We need p, the number of lagged observations included in the model, for the autoregression; d, the degree of differencing, the number of times the raw observations are differenced, for the integration; and lastly q, the size of the moving average window. Well, actually it's a bit hard to fully understand those concepts, but maybe it's enough to study how to identify the parameters — which is not simple either. But we have the autocorrelation function and the partial autocorrelation function, which tell us how many lags we should consider for forecasting.

Basically, the autocorrelation of a time series observation is calculated against values of the same series at earlier periods of time; that is why we call it auto-correlation. So the autocorrelation function is the correlation of the current time step with previous time steps, and the partial autocorrelation function does the same, but removing the autocorrelation of the intermediate lags between the current time t and the earlier time t − q. Plotting the ACF and PACF sometimes gives us hints for selecting the ARIMA parameters. This is a simplified guideline for selecting p and q from the ACF and PACF plots: roughly, if the PACF cuts off after lag p while the ACF tails off, use an AR(p) term; if the ACF cuts off after lag q while the PACF tails off, use an MA(q) term. In the references there is a guide which is much more precise — I gave you the super-summarized version, but there is a long story, and I recommend reading it if you want to study ARIMA further.

I'll give you a simple example, an easy case for identifying the parameters. This data is not from my own project, but it gives a clear idea. The upper graph is the autocorrelation function, which cuts off after lag 2 — you can see the third bar is lag 2; the first bar is the series with itself, the current time step being exactly the same as itself, so the correlation at lag 0 is exactly one. The bottom graph is the partial autocorrelation function, which tails off. Since the ACF cuts off after lag 2, it's better to use a moving average than an autoregression, so we can parameterize with zero for p and two for q.

But it does not always go that simply. This next one comes from my own data, which we saw previously, and it is more complicated, so I just used a grid search to find the parameters. Do you know what grid search is? Grid search is a way of finding optimal parameters: we take a certain range of parameters and conduct an exhaustive search until we get the best result. We can measure "best" by an arbitrary metric, like mean squared error or the Bayesian information criterion, and so on. It's quite effective for searching optimal ARIMA parameters as well.

Okay, now say we have two sets of residuals, from forecasting download speed and upload speed separately with ARIMA models on the two univariate series. It's time to do anomaly detection again. Sometimes the naive approaches I introduced before don't work well — it depends on the data distribution, because real data is usually skewed; highly skewed data is more common than the normal distribution. However, with residuals distributed according to a Gaussian, we can get more robust results.

One may use parameter estimation. Say the blue graph is the distribution of the download speed and the orange one the distribution of the upload speed — to be more precise, actually the residuals of forecasting each. Then we can estimate the mean and variance of each distribution, obtain a probability density function for each, and by multiplying them we have a model. When a new observation comes, we can test it by cutting off at a threshold.

However, this method has a problem when the data points are covariant and scattered around a certain pattern — say a diagonal ellipse, as you see in the graph. Then the upper-left and bottom-right data points should be anomalies, while the upper-right and bottom-left ones are just normal, even though they are at basically the same distance from the middle. How can we deal with this? We can solve this problem with a multivariate Gaussian distribution. This time we estimate the mean vector and the covariance matrix sigma, and with some formula we get a probability density function, then do the same test. The code is maybe even simpler. This is the multivariate Gaussian anomaly detection: with the SciPy package we can estimate the mean and sigma, calculate the multivariate Gaussian probability density function, and then find the anomalies by conditioning on the threshold. Finding the threshold is another level of problem, so it's not covered in this talk.
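A sketch of that multivariate Gaussian detector with SciPy, where `download_residuals` and `upload_residuals` are hypothetical names for the two sets of forecast residuals, and the epsilon value is purely illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal

# The residuals of the two univariate forecasts, stacked as an (n, 2) matrix.
X = np.column_stack([download_residuals, upload_residuals])

mu = X.mean(axis=0)                 # estimated mean vector
sigma = np.cov(X, rowvar=False)     # estimated 2x2 covariance matrix

pdf = multivariate_normal(mean=mu, cov=sigma)

# A point is anomalous when its probability density falls below epsilon.
# Choosing epsilon well is the separate problem mentioned above.
epsilon = 1e-4
anomalies = pdf.pdf(X) < epsilon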
Okay, I've almost finished — faster than I expected. We can replace the model with others, such as an LSTM. There are many ways to forecast a time series, but one trendy technology is long short-term memory, which is a deep learning technique. An LSTM is useful for sequence learning: it is able to learn long dependencies, and it outperforms other methods in applications such as language modeling and speech recognition. As you see in the figure, the blue boxes at the bottom are the time series inputs, the green boxes in the middle are the LSTM cells, and the yellow boxes represent the cells' outputs, which are propagated to the next cell — so it has a memory that considers the previous time steps. Finally, the red box is the predicted output. We feed a series of time steps from 0 to t−1 to predict the target value at time t, in the red box.

A beauty of LSTM is that each element of the time series can be a vector with multiple features, so we can train on and predict the download speed, upload speed, and response time at once, and do the multivariate Gaussian step efficiently. While using the ARIMA model we had to do the forecasting and take the residuals separately for each feature — download, upload, and the ping test — with the LSTM it's simply done all at once.

Here's a code block. It's just one sample, and honestly on its own it is almost meaningless, because there are so many variations — meaning there are a lot of things to study and understand before you get a robust result out of an LSTM. Actually, I could not get a robust result with an LSTM, and I haven't seen anybody else show a good one either. There are several papers claiming success at robust time series forecasting, but the results are not reproducible, because they didn't publish how they trained the model, or they don't describe how they found and fine-tuned the hyperparameters — they just say they succeeded. So it's ongoing research, and it requires a lot of work to build such a model for time series. It would allow modeling sophisticated and seasonal dependencies in time series, and as I mentioned, it's very helpful with multiple series. But there are still challenges. It can take a long time to run, so it can be very expensive to do a rolling forecast: whenever a new observation comes in, the model has to be updated, and when that costs too much we cannot keep up with the new observations. It also often requires more training data than other models, and it has a lot of input parameters to tune.
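For reference, one possible shape of the kind of LSTM forecaster described above — a sketch only, with the speaker's caveats in mind. It assumes Keras (the talk doesn't name a library), the hypothetical `df` from earlier, and an arbitrary window length and layer size:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

def make_windows(data, n_steps=12):
    """Turn an (n, features) series into (samples, n_steps, features)
    input windows plus the next-step target vector for each window."""
    X, y = [], []
    for i in range(len(data) - n_steps):
        X.append(data[i:i + n_steps])
        y.append(data[i + n_steps])
    return np.array(X), np.array(y)

# Each time step is a feature vector, so one model forecasts all metrics.
values = df[['ping', 'download', 'upload']].to_numpy(dtype='float32')
X, y = make_windows(values)

model = Sequential([
    LSTM(32, input_shape=(X.shape[1], X.shape[2])),  # (time steps, features)
    Dense(values.shape[1]),                          # one output per metric
])
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=10, batch_size=64, validation_split=0.2)

# Multivariate residuals, ready for the Gaussian step in one shot.
residuals = y - model.predict(X)
```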
All right, to summarize: be prepared before calling the engineers about a service failure. Pythonistas have a lot of powerful tools to do all of this, but a Pythonista also needs to understand a few concepts before using the tools — that's the most difficult part; we need to study. Deep learning for forecasting time series is still ongoing research. And most importantly: do try this at home.

Here are my contacts. I'm not familiar with social networks, so it's just email — you can contact me by email. All right, I have a few more minutes for questions.

Host: Thank you for showing us what's happening on our broadband connections. Who has a question?

Audience: Thanks for your talk. Did you consider that other traffic on your home network may have interfered with the data you were generating? In other words, if you were watching video, for example, that may have contended with the download speed you were measuring.

Oh wow — I'm sorry, it's really hard to understand from up here too. That was a really good question. When I got this plot, I was curious about the same thing: while I'm downloading or doing heavy stuff on the network, it would affect this chart, right? I searched the internet for what affects a speed test, and yes, it does: when I'm downloading or doing heavy stuff on my network, the measurements should go down. But actually I found something interesting. On the last two days, the Saturday and Sunday, I was not at home — I was away traveling — but there are still fluctuations in the daytime. So my assumption is that, more than my personal usage, the bigger factor is my neighbors in the village who share the backbone. Such patterns from my neighbors are effectively random to me. If I didn't have such patterns, and the measurements were only affected by my own usage, they could not be treated as random. So this study makes sense precisely because the data is mostly affected by my neighbors.

Host: Another question. In the end, did you fix your internet connection back in Hong Kong?

I finally could manage the connection problem there. But actually there were no such severe connection problems this time — I just did this for fun. Maybe I can do some more things, though: if I get a more robust result, I could forecast and predict when the connection will fail in the future. Or maybe we could collect data from different houses and gather some collective intelligence out of it. That could be interesting.

Host: Now you're prepared. Okay, well, thank you a lot for showing us what's happening in our connections. We'll have the next talk in about five minutes.

Thank you for your attention.