Welcome to the next session. It is, I think, 10 o'clock — yes — and it's Roberto Polli talking about a 101 of systems administration, focusing on SciPy from what I heard. So, enjoy.

Hi everybody, I am Roberto Polli. I work at Babel, which is the proud sponsor of this talk and of my hotel bill. Today we will see how to use elements of statistics with Python — it's not a statistics course. Before starting, I would also like to apologize for my English; I hope that English-speaking people can forgive me.

Going on, we will see a latency issue that affected one of our customers, and how in a very few minutes we were able to understand what was happening and what was not happening. We understood all those things using correlation and combining data. Then we produced a lot of nice plots that allowed our customer to say that what was happening wasn't his fault. Everything was done with SciPy and matplotlib.

The customer's problem was episodic network latency issues. We had log traces with message sizes, the number of peers in the communication, the number of retransmissions, and the errors in the network. The customer asked us: do we need to scale? Are those latency issues related to some peak condition? Well, we found a rapid answer using Python.

How did we do it? Python provides basic statistics like the mean, which we will denote with a bar over the x, and the standard deviation, which is an indicator of how well the mean describes our data series: if the mean is a good descriptor, the standard deviation is low; if the mean is not a good indicator, the standard deviation is high.

The t variable contains an extract of our data.
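As a minimal sketch of this descriptive step (the latency values below are invented for illustration):

```python
# Descriptive statistics for a latency series, as in the talk.
# The sample values are invented for illustration.
import numpy as np

latency = np.array([0.158, 0.159, 0.159, 0.160, 0.162, 0.310])

mean = latency.mean()   # x-bar: the average latency
std = latency.std()     # low when the mean describes the series well

print("mean=%.3fs std=%.3fs min=%.3fs max=%.3fs"
      % (mean, std, latency.min(), latency.max()))
```

The outlier at 0.310 s is what pushes the standard deviation up and tells you the mean alone is not the whole story.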
There is a timestamp, a latency indicator in seconds, the number of peers, and other indicators like the message size and the number of retransmissions. Getting a basic description of all those fields is just one line, because maximum, minimum, mean and standard deviation indicators are built into SciPy.

Now, the distribution. The second thing you do is to create a distribution: on the x axis you have some slots. For example, this one is a ping round-trip distribution: it says that three pings returned between 158 and 159 milliseconds, four pings returned between 159 and 160, and so on. The fastest way to create a distribution with Python is using matplotlib, which is a plotting library. We plot a histogram — for example a histogram of latencies, since the ping round-trip time is actually a network latency. The hist call has two outputs: one is the plot, the other is a tuple. The interesting values in this tuple are the frequencies — how many pings returned in each slot: three is a frequency, four is a frequency, two is a frequency — and the bins. The bins are just buckets on the x axis: the 158-to-159 bin and so on. To get the distribution, just use zip, which ties the two lists, frequencies and intervals, together.

Now, correlation. We have a description of our data, but now we ask: are two data series related? Is there a relation between the number of retries and the latency? If we use delta-x, the difference between an item in the series and the mean, Mr. Pearson — he was a statistician — answered with this formula. It seems complicated if your high-school days are far behind you.
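Stepping back to the histogram for a moment, the plt.hist-plus-zip recipe just described might look like this (the ping times are invented, and the plot is rendered off-screen so the snippet also runs without a display):

```python
# Histogram of ping round-trip times: plt.hist returns both the plot and
# the (frequencies, bins, patches) tuple described in the talk.
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

rtt_ms = [158.2, 158.5, 158.9, 159.1, 159.4, 159.6, 159.8, 160.3, 160.7]

freqs, bins, _ = plt.hist(rtt_ms, bins=[158, 159, 160, 161])
plt.savefig("rtt_histogram.png")

# zip ties bin edges and frequencies together into the distribution:
# three pings in the 158-159 bin, four in 159-160, two in 160-161.
distribution = list(zip(bins, freqs))
print(distribution)
```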
But if you just cast your mind back to high school, it's actually quite easy: the formula just checks whether the values of the X and Y series move together on the same line. If, for example, both X and Y move together, their differences from the mean start out negative together, so the product is positive; if they reach the mean together, they are zero together; and as they keep moving together, the product will still be positive. If you try it in your Python console with some data sets, you will find that this formula is quite reasonable. So rho tells you whether the values move together on the same line.

But anyway, you must plot. These are various scatter plots, with the Pearson value on the first line. We can see that we start with a correlation value of one; when the data begin to be unrelated, that value goes to zero; and it becomes negative when the relation is not direct but inverse — when one data set grows and the other decreases. But even in cases with a zero linear-correlation value, we could find that the data are actually related, or that there are patterns in the data. So you should always plot.

The probability indicator: SciPy provides a correlation function that returns two values. The first one is the correlation coefficient we just described; it lies between minus one — as said before, when one data set grows and the other decreases — and plus one, when both grow together. The other value is the probability indicator. Its definition is quite tricky, but let's say that this value tells us whether such a dataset could have been produced by an uncorrelated system: if the probability is high, the systems are not correlated; if the probability is low, those values are unlikely to have been produced by an uncorrelated system. So if you have a Python shell you can just try it, check, and see what you get.
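For example, a quick shell experiment with SciPy's pearsonr (the sample series are invented):

```python
# pearsonr returns the Pearson coefficient rho (between -1 and +1)
# and the probability (p-value) discussed in the talk.
import random
from scipy.stats import pearsonr

x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 6, 8, 10, 12]          # y moves together with x on a straight line
rho, p = pearsonr(x, y)
print(rho, p)                      # rho is 1.0, p is ~0: not produced by chance

a = [random.random() for _ in range(100)]
b = [random.random() for _ in range(100)]
rho2, p2 = pearsonr(a, b)
print(rho2, p2)                    # |rho| low; p usually high: plausibly unrelated
```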
The a and b values form a straight line, and they have a correlation of one and a probability of zero — it is unlikely that random data can produce a straight line. Taking two truly random datasets instead, we can see that the correlation is low — I don't care whether it's positive or negative, its absolute value is low — but the probability that those data are unrelated is quite high, about 70%.

Now, combinations. Returning to our original problem: we have various datasets, and we want to understand which of them, if any, are related. When we need to do this kind of analysis, the itertools module is a good place to check. Combinations are quite an intuitive concept: they just find all the possible ways you can mix a set of items, without repetitions. We use it to combine all the table keys, so we combine the latency with the errors, the errors with the message size, and so on.

And this is how we get our results: we simply use combinations to fish for all possible correlation and probability values between all our data series. If the correlation is over a given threshold we print something, or if the probability is lower than a threshold, again we print those values. This is just a starting point, but we are concentrating — our customer wanted to know something quickly — so we started by concentrating on what could most likely be related to the latency: is the relation between latency and errors high or not? Is this clear? I think that if you're acquainted with Python, it is.

But remember the slide before: linear correlation is not everything. We should use our eyes, and matplotlib allows us to save the plots. So what will we do? We will save all the possible combinations of our datasets, sticking all the possible information on the plots — the correlation indicator, the probability indicator, the data series we mixed — and then save. That could produce 30 or 40 plots.
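The combinations-plus-threshold loop described above might be sketched like this (the column names, values and thresholds are all made-up examples, not the customer's data):

```python
# Fish for related series: combinations() pairs every column with every
# other one, and we flag pairs whose correlation or probability crosses
# a threshold.
from itertools import combinations
from scipy.stats import pearsonr

table = {
    "latency":  [1.0, 1.2, 1.1, 3.5, 3.8, 1.0],
    "errors":   [0,   0,   0,   7,   9,   0],
    "msg_size": [512, 510, 515, 509, 511, 514],
}

RHO_MIN, P_MAX = 0.7, 0.05   # thresholds are an assumption, tune to taste

for a, b in combinations(table, 2):
    rho, p = pearsonr(table[a], table[b])
    if abs(rho) > RHO_MIN or p < P_MAX:
        print("%s vs %s: rho=%+.2f p=%.3f" % (a, b, rho, p))
```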
But we can just browse them with eog or whatever image viewer you like, and at that point you can easily check whether a plot tells you something.

This is an example plot with the buffer size and the CPU I/O wait. There is a high correlation indicator and a zero probability indicator: those data are probably related. We can see that when the CPU wait is low the buffer is constant, but when there is I/O the buffer increases. So there surely is a relation — whether that relation is a straight line, or whether the buffer moves at a constant rate with the CPU wait and then, when the CPU wait stays at 40 or 50 percent for three or four seconds, starts to grow. Well, that is a further step of the analysis; but if you're searching for something, this kind of plot is a good starting point for an investigation.

What was lacking in the previous graph was colors and a time indicator. We have not plotted time, so we actually don't know whether the right side is the starting point and the left side the endpoint — for example because after the CPU work we flushed the buffer — or whether the left part is the start and the right part the end. Using colors I can understand better what's happening. Again itertools: cycle makes an iterable endless, so with colors cycling over red, green and blue, next returns r, g, b and then r, g, b again. In a simple case we just trace morning, afternoon and night: morning in red, afternoon in green, night in blue. I just use the usual slicing syntax to split the dataset into three chunks, and then I plot the first chunk — the morning — in red, labeling it accordingly. I could even add the Pearson and probability indicators on each single chunk. Then, as always, set the title, save the plot, and so on. Boom — you're going to hate me, I was fast.
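The color-cycling trick might look like this (the hourly samples and chunk boundaries are invented; the plot renders off-screen):

```python
# Color time-of-day chunks with itertools.cycle: morning in red,
# afternoon in green, night in blue.
import matplotlib
matplotlib.use("Agg")  # no display needed
import matplotlib.pyplot as plt
from itertools import cycle

latency = list(range(24))                  # one sample per hour, invented
throughput = [h % 8 for h in range(24)]    # invented as well

colors = cycle("rgb")                      # next() yields 'r', 'g', 'b', 'r', ...
chunks = [(0, 8, "morning"), (8, 16, "afternoon"), (16, 24, "night")]

for start, stop, label in chunks:
    plt.scatter(latency[start:stop], throughput[start:stop],
                color=next(colors), label=label)

plt.title("throughput vs latency by time of day")
plt.legend()
plt.savefig("by_time_of_day.png")
```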
So this is one simple plot with latency on the x axis and throughput on the y axis; the color denotes the time of day. We can see clearly that at higher latencies, above three seconds, it's not a matter of throughput or size: the higher latencies match with lower throughput. Moreover, with this plot we even have an indicator of the speed of the system: if we focus just on the first slot, between zero and one second, we can see that there is actually an influence of throughput on latency, but this influence ends after one and a half seconds, and that line could be a sort of throughput limit of the system. We can see moreover that all the red points with a higher throughput are in the same part of the day; so if, for example, we suspect those data are wrong, or that there is a problem with that kind of data, the plot points us to a precise part of the day to check.

Another correlation: this is another scatter plot, with the size of the packet and the retries. We can see that there is no relation — the latency problem was not related to the size of the packets. We can see moreover that a higher size corresponds to a lower number of retries. So when the packet size is high there are no problems, and the retry problems are concentrated in the green part of the day; so we can check whether in that part of the day there could have been some problem on the network, for example, or some problem related to a subset of our clients.

All those plots were produced in 30 seconds of wall-clock time. Once you have the data, just pass those tuples, get your 40 plots, and they'll tell you almost everything. So, again: latency wasn't related to packet size or system throughput; errors were not related to packet size; we even discovered the system throughput limit from that straight line capping the plots — all this in 30 minutes. The rest of the time was just parsing logs.
That was the hardest part of the problem. As a wrap-up: use statistics — it's easy — but don't use it alone; use plots, plots, plots, and then, yes, continue to collect results. Okay, 24 minutes. That's all, folks. I hope you enjoyed it and that it could be useful. I don't know if there are questions, but well, that's it.

We have some time for questions. Any questions? Go to the microphones. Okay, there is one.

"I didn't understand why you are using combinations. Can you give, like, three examples of what pairs of combinations you are trying, and why you have to go through all of them?"

Okay. As I didn't know how the system worked, the first thing to do was to combine all those data series using combinations, which returns latency and peers, latency and errors, peers and errors. Maybe the screen is too small — okay, let's imagine that instead of a, b, c I've got retries, latency and time. Combinations let me tie every possible data set to every other one. Imagine a is latency, b is throughput, c is retries: I've got all the possible combinations between the data sets, and I can evaluate the relation values, or the plots, on every possible pair.

"And what were the peers?" Peers was the number of computers in the network.

"I have a quick question. I see that with these combinations you get a lot of different pairs. Did you have any problem with spurious correlations with high significance levels? Could you detect them as false correlations, not really related, or didn't you?"

Actually, there is no false correlation, because it's just a number. If I understood the question: the Pearson indicator is just a formula that tells you whether, plotting those data, you would get something like a straight line. For this reason I always say you should plot, and obviously you should then learn how the system works. It's just an indicator. I made 40 plots.
I didn't know the system, so I needed something — I needed a haystack, to steal Costanza's words. So I got a haystack, and then in that haystack I started to find the needle. Thanks.

Thank you very much, Roberto. It was a great talk. Thank you very much.