Hello, everybody. So we're going to have Kalyan Prasad talking about advanced visualization in Jupyter Notebooks. Just now, we had a short talk about how Jupyter Notebooks are used in EDA but not in production. But they play a very, very vital role when you're starting out on a data science problem. So let's have Kalyan come up. Hi, Kalyan. Hi. So we're really looking forward to your talk, and you can take it right on. Thank you. Sure. Thank you so much. I hope you can see my slides. You can see my screen, right? Yes, we are able to see your screen. Thank you so much. So hello, everyone. I am Kalyan Prasad. It's really incredible to be here and such an honor to be given this stage to speak to all of you who are in Python India. Today, I'm going to talk about advanced visualization in Jupyter Notebooks. In this talk, we are going to look at some advanced ways to visualize our data. Before jumping into the actual content, let me quickly introduce myself. I'm a self-taught data scientist. So what do I mean by self-taught data scientist? Let me give you a little background on that. I don't have a formal computer science degree. I'm purely a non-technical guy, a commerce and finance guy. Most of my work was in spreadsheets. I have some experience in Fintech, where most of my role was in Excel spreadsheets. But somehow, I realized that I could do something beyond Excel. That's where I started my journey in data science two years ago, and that's how I turned into a data scientist today. I also love getting involved in communities, and I try to help them as much as I can. Currently, I'm associated with these two groups. Isobat Patzen used the group and also the icon for Isobat.net's company. So feel free to check out these groups in case you are interested. And my areas of interest lie in AI and Fintech. 
It's not just because of the buzzwords floating around on the internet. It's because of how the world and industries are transforming with these technologies. OK. So here is the agenda for today's talk. I tried writing my agenda as a simple Python function. We start with the advanced visualization overview. Then we are going to give an introduction to information-dense visualization. Then we are going to talk about visual data correlation. Then we are going to show a simple model, a linear regression. We'll give you a quick demo of all the concepts we have covered, and then we'll wrap things up. OK. Without any further delay, let's get started. So, advanced visualization overview. Data visualization is a huge field. You can see that just by looking at the various Matplotlib examples on their website. Now, here are some of the key things to take away from this field without becoming an expert in data visualization. It's a delicate skill, so you have to be careful not to exaggerate, even though you have so many tools at your disposal. Sometimes less is more: using a simpler method and doing a cleaner plot shows more information in a clearer way than going for fancy effects. On the other hand, don't underestimate your audience. It is important to show various insights in our graphics. We shouldn't just assume that the average reader of our document, or a person who is going to look at our analysis report, won't be able to understand the concepts we are describing. Instead, we should try to teach the audience how to read the more advanced graphs. And I like to think of visualization as a way of storytelling. So think about what the motivation of your research or work is, and how to use these plots to guide the reader to understand the concepts you have discovered during your data analysis. Introduction to information-dense visualization. 
So a key person in data visualization theory is Edward Tufte. He has really mastered the skill and written multiple books on the subject. I would highly recommend his book, The Visual Display of Quantitative Information, which is really the prime source that combines all of the important concepts here. One of the important concepts he mentions in his book is information density. Think of it this way. If we have a pie chart like this, which basically shows the size of four items, wouldn't it be better if we just replaced this pie chart with four numbers? The pie chart uses a lot of colors, a lot of effects, a lot of shadows, and at the end it just conveys the information about four numbers. So I think it would have been better to use a table in this case, or just describe the numbers in text. One of the graphics Tufte uses in his book is Napoleon's March, which was created by Charles Joseph Minard. This graph plots the losses of Napoleon's army in the Russian campaign of 1812. So this is the graph. It looks quite complex at first sight, but don't get scared. If we break it down, the brown area shows the position of the army and the size of the army as it advanced towards Moscow. The X axis shows the geographical movement and the passage of time. And the black area shows the retreat of the army after their loss. We can see so many variables on this graph. This is called multivariate visualization: we can see the army size and its location over time. And if you also look at the small graph at the bottom, we can see the temperatures during the army's retreat. We can see the temperature getting really, really cold towards the end of the retreat phase. 
So this is an example of an information-dense graphic which shows many aspects and basically tells a story in and of itself, which makes it a masterpiece of data visualization. Now, what sort of graphics are we going to deal with? I'll be showing them quickly once we cover the remaining concepts. Next we have visual data correlation. The main objective of this concept is to look at different methods of comparing two data variables. Suppose we are talking about time series data. The first thing would be a visual comparison, and for that we need to understand which properties to compare. So let's see what a time series is and what sort of patterns we can compare in it. Here is an example of a sample time series plot. It is already split into components. In the first plot you can see the original series. If we look at the second graph, we can already see the seasonal effect, or seasonality. For example, temperature changes during a year as the season changes from spring to summer or winter. We call these patterns seasonal if they repeat over time. Then we have the trend line, which is the third plot in this visualization. Trends are long-term changes over time. So what I want to conclude here is that if we are plotting time series side by side, one way is to look at these properties and then compare the series by them. The method we are going to show, which describes the trend more clearly, is known as the moving average method, and I'll be showing it in the Jupyter notebook. The second kind of plot, which shows similarities between the data and the relationship between them, is called a scatter plot. It plots two variables, one on each axis. Here is an example of a scatter plot of income and education. These are the data points. Each data point is a person with a particular length of education in years and their income. 
What the scatter plot shows here is called correlation. Correlation is a statistical measure of association or dependence. Without going into much detail about it, which is beyond the scope of this talk, you can think of correlation visually. Here are different example plots of correlated data and their correlation values. A perfect correlation of one would be a linear function, which is the first example in our plot. If we look at the correlation with a value of 0.8, we can see a scatter plot where we can still see some sort of direction, but the points are messier: there is variance in the data, outliers and so on. If we jump to the example of correlation zero, where there is a circle of messy points, it clearly says that there is no relationship between the variables in this scatter plot. And one thing that is usually mentioned in relation to correlation is that correlation does not imply causation. It means that if two variables are correlated with each other, it doesn't mean that one affects the other or vice versa. So it is important to understand the data and the factors that affect it. Maybe there is a common cause, and so on. This is a huge field of study, so we are only going to look at a simplified overview of what correlation is. Then we have a modeling technique, which is linear regression. It's a method for fitting a simple mathematical model to our data. The model we are going to show is the linear function, f(x) = ax + b. If you show this graphically, it describes our data's behavior by plotting a straight line which follows the direction of the scatter plot. Again, this is a simple use case, but in reality it is used in science to model real-world behavior from experimental data. So that's a typical use case. 
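To make the correlation values from the slide concrete, here is a minimal sketch that generates three synthetic data sets (a perfect line, a noisy but related cloud, and unrelated noise) and measures each with NumPy's `corrcoef`. The data, the helper names `pearson_r` and `correlated_with`, and the target value of 0.8 are assumptions for illustration; they are not taken from the talk's slides.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is repeatable
x = rng.normal(size=2000)
noise = rng.normal(size=2000)


def pearson_r(a, b):
    """Pearson correlation coefficient of two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


def correlated_with(x, noise, rho):
    """Mix x with independent noise so the result correlates with x at roughly rho."""
    return rho * x + np.sqrt(1.0 - rho**2) * noise


# A straight line f(x) = a*x + b with a > 0 is perfectly correlated with x.
perfect = 2.0 * x + 1.0
# A noisy cloud that still has a visible direction.
strong = correlated_with(x, noise, 0.8)
# Independent noise: no relationship with x.
unrelated = noise

for name, y in [("perfect", perfect), ("strong", strong), ("unrelated", unrelated)]:
    print(f"{name:>9}: r = {pearson_r(x, y):+.2f}")
```

The coefficient only summarizes how closely the points follow a straight line; it says nothing about which variable, if either, causes the other.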
So this is a small step in the direction of actually doing the science part of data science. Now, finally, the concepts I have covered so far we'll show you in a practical demo. For this demo I have selected the Chicago crime and Chicago weather data sets. We'll first analyze these two data sets in relation to each other, then we'll show our graphics with these two data sets, and finally we'll show a simple linear model on the same data. Okay, let me dive into my Jupyter Notebook now. Here is a Jupyter Notebook which I have created for this demonstration. Firstly, I have imported the packages here: pandas, NumPy and Matplotlib. Now we are reading both data sets. One thing I would like to mention at this point is that I went through this crime CSV file and noticed that the data was quite messy and unordered. So I made a small adjustment to this data while parsing. For this demonstration I selected only the 2016 data and threw away the rest due to memory constraints, because this is a very large data set. That's the reason I did this. Once the parsing is done, you can see the crime data set as the final data frame here. Then, while reading the weather data, I also selected the 2016 period so that both data sets cover the same period. Once we have read the data sets, we plot the data side by side. From the weather data, I've selected the temperature column, resampled it by day and calculated the mean, so the daily mean temperature. Now if we plot this, we can see the seasonal effect that we would expect from weather data: gradually hotter over summer and cooler over winter. And I have done the same thing for the crimes as well. 
So I have resampled it by day and counted the number of crime instances. Here you can see the number of crime instances. Next, to simplify things, I've created two variables and assigned them the data I have shown above. Now we plot these on the same graph. If you look at this graph, these two are different kinds of data and their values are nowhere near each other. The temperature values are in the range of 10 to 20, and the crime values are in the range of 600 to 800. So this is a problem. What can we do in this case? In theory, we could add another Y axis and have both plots on the same graph with different Y axes. But that is considered bad practice in data visualization. Instead, we can create two subfigures for these two plots, and for each subfigure we pass the figure size and other Matplotlib properties like the title and the label. Once we plot this, we can clearly see which data set belongs to which graph. And we can also see the temperature and the crimes right next to each other. To compare them better, as I mentioned in the slides, we try smoothing our time series to show the trend more clearly. We can do this by using the pandas rolling function, so we are calculating the moving average here. If you plot this, for each point it takes a certain number of neighbouring points in the crime counts and calculates their average. The more you increase that number, the smoother the curve will be: the longer the period you average over, the more gradual the changes. What remains is the trend, the long-term change. And I have also used the interactive widgets here to make the moving window size adjustable. 
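The daily resampling and moving-average steps described in this part of the demo can be sketched roughly as follows. This is not the notebook from the talk: the hourly temperatures and the incident log are synthetic stand-ins, and the names `temperature`, `incidents`, and the `case` column are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)

# Hypothetical hourly temperature readings for 2016: a seasonal curve plus noise.
hours = pd.date_range("2016-01-01", "2016-12-31 23:00", freq="h")
day_of_year = hours.dayofyear.to_numpy()
temperature = pd.Series(
    10.0 - 15.0 * np.cos(2 * np.pi * day_of_year / 366) + rng.normal(0, 3, len(hours)),
    index=hours,
)

# Hypothetical incident log: one row per reported crime, indexed by timestamp.
offsets = rng.integers(0, 366 * 24 * 3600, size=250_000)
incident_times = pd.Timestamp("2016-01-01") + pd.to_timedelta(np.sort(offsets), unit="s")
incidents = pd.DataFrame({"case": np.arange(len(offsets))}, index=incident_times)

# Resample both to one value per day: mean temperature and number of incidents.
daily_temp = temperature.resample("D").mean()
daily_crimes = incidents.resample("D").size()

# Moving average: a wider window gives a smoother curve that follows the trend.
window = 30
temp_trend = daily_temp.rolling(window).mean()
crime_trend = daily_crimes.rolling(window).mean()

print(daily_temp.head(3))
print(daily_crimes.head(3))
```

To draw the two series in separate panels rather than on a shared Y axis, one option is `fig, axes = plt.subplots(2, 1, sharex=True)` from Matplotlib, plotting each series on its own axis. In a notebook, wrapping the plotting code in a function of `window` and passing it to `interact` from `ipywidgets` is one way to get an adjustable window size.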
So what I've done is I have imported interact here, and I have passed it this argument so that it creates a nice interactive plot. We are calculating the same moving average here. I've selected a period of 100, but you can play with this to check how the curve smooths out. With the period of 100, you can see the curve is very smooth, and it clearly shows that there are similarities between these two time series plots, between the temperature and the crime. To express this better, we can calculate the correlation or draw a scatter plot, as I mentioned previously. To do this, I've selected these two time series and put them into a data frame, with one column for crimes and one column for temperature. So here is the data frame we have. This can be used as the starting point for further analysis. Now that we have a data frame, we can call the plot function. Instead of using the default plot method, which gives a time series plot for time series data, we have given kind as scatter and also specified the x-axis and y-axis. Here you can see the scatter plot. Each point shows, for a certain temperature, what the crime count was on that same day. There is a clear trend here. I mean, there is a correlation here: the lower the temperature, the lower the crime count, and the higher the temperature, the higher the number of crimes. So there is definitely a correlation. If you want to express this statistically, we can call the corr function. If I call the corr function, you can see the correlation is 0.3. By default this uses the Pearson correlation, but you can also explore the other methods of correlation. 
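As a rough sketch of this step, the two daily series can be combined into one data frame and passed to pandas' `corr`. The numbers below are synthetic and only loosely tuned to resemble the ranges mentioned in the demo; the column names `temperature` and `crimes` are assumptions, not the notebook's actual names.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=2)
days = pd.date_range("2016-01-01", "2016-12-31", freq="D")

# Hypothetical daily mean temperature with a seasonal shape.
seasonal = 10.0 - 15.0 * np.cos(2 * np.pi * days.dayofyear.to_numpy() / 366)
temperature = pd.Series(seasonal, index=days)

# Hypothetical daily crime counts that rise slightly with temperature, plus noise.
crimes = 650.0 + 2.0 * temperature + rng.normal(0.0, 60.0, len(days))

# One column per series, aligned on the shared daily index.
df = pd.DataFrame({"temperature": temperature, "crimes": crimes})

# Pearson correlation is the default method for both calls.
r = df["temperature"].corr(df["crimes"])
matrix = df.corr()

print(f"Pearson r = {r:.2f}")
print(matrix)
```

With a frame like this, `df.plot(kind="scatter", x="temperature", y="crimes")` draws the scatter plot described above, with one point per day.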
And I'm not going to dig into much detail about it, but if you remember the slide I showed, a numerical value between zero and one tells how correlated the data is. Coming back to the storytelling side, one thing that pops up in my mind is that there is an outlier here. If you observe, most of the data is clustered in this big area, and suddenly we have a big outlier. Outliers can always lead to an interesting story. So to find the reason behind this outlier, I've selected the data points greater than 1000. If you notice, the data point greater than 1000 is the outlier, so that is what I selected. If you look at the result, it is New Year's Day. So it seems like there were large gatherings, or thefts happening due to big crowds and so on. But again, this is just one year of data. It's good to look at other years' data before we draw a conclusion. And then the last thing we have is the model, which we'll show on top of this. For a linear model, I imported the statsmodels package, which provides statistical functions, and I also imported abline_plot. Okay, so we have only one input variable here: we give the temperature as input, and the output will be the crime rate. When I call model.fit here, we get the model result object, followed by the result parameters as well. These are the a and b parameters. And if you also look at the result summary, it shows some statistical properties like coefficients, standard errors, p-values and so on. Again, I'm not digging into all these details because it goes beyond the scope of this talk. 
But what I highly recommend is to look at all these properties, because this will definitely help in improving the model accuracy. For now we are just sticking to the surface layer of this model. We'll give a new input here, a new temperature, and based on that the mathematical model will generate an output. So how are we doing this? For example, let's say we have a weather forecast that it's going to be 28 degrees. How much crime can we expect? Or during the winter, the temperature is going to be five degrees. What would the crime rate be then? What I'm doing is simply calling the predict function and passing these values as a series. Here you see the predictions. These are the predicted values, but again, take such predictions with a grain of salt. Why? Because, if you observe, our scatter plot has a lot of variance, and moreover we have selected only a single period of data. It's good to look at other periods as well and observe how the predictions behave. The idea here is just to give a quick overview of how to deal with these kinds of linear models. And the last thing we have is, how do we plot this? Of course we can plot the scatter plot as we have shown above. What I'm doing is calling the abline_plot method which I imported earlier. I'm passing the model results object here, followed by the other Matplotlib properties like title, X label and Y label. Here you can see the plot. It shows the scatter plot from before, and it also shows a line which describes the linear function we have obtained through the linear regression method. I think this is a nice graphic. It shows multiple different types of information. 
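The fit-and-predict step can be sketched without the notebook at hand. The talk uses statsmodels; to stay dependency-light, the sketch below swaps in NumPy's `polyfit`, which performs the same least-squares fit of f(x) = ax + b but does not produce the statistical summary. The data are synthetic and the `predict` helper is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# Hypothetical daily observations: temperature in degrees and crimes per day.
temperature = rng.uniform(-10.0, 30.0, size=366)
crimes = 650.0 + 2.0 * temperature + rng.normal(0.0, 60.0, size=366)

# Least-squares fit of f(x) = a*x + b; polyfit returns the highest power first.
a, b = np.polyfit(temperature, crimes, deg=1)


def predict(new_temperature):
    """Evaluate the fitted line at one or more new temperatures."""
    return a * np.asarray(new_temperature, dtype=float) + b


# Forecasts for a hot day (28 degrees) and a cold day (5 degrees).
forecast = predict([28.0, 5.0])
print(f"a = {a:.2f}, b = {b:.1f}")
print("predicted daily crimes at 28 and 5 degrees:", np.round(forecast, 1))

# The spread of the residuals is a reminder to treat these predictions with caution.
residual_std = float(np.std(crimes - predict(temperature)))
print(f"residual standard deviation: {residual_std:.1f}")
```

With statsmodels installed, the equivalent would likely be `sm.OLS(y, sm.add_constant(x)).fit()` for the fit and `abline_plot(model_results=results, ax=ax)` from `statsmodels.graphics.regressionplots` to draw the fitted line over the scatter plot; the exact calls in the talk's notebook are not shown in the transcript.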
So we have used two data sets to obtain this graph, and it also shows our model on top of it, so it has multiple layers. I think this is a nice implementation of an information-dense visualization. To wrap things up, what we have covered so far is this: we have explored different information-dense visualization techniques. We first looked at aligning time series plots and comparing their properties, like seasonality and trends. Then we looked at the scatter plot, which plots two variables on the same graph. We have also discussed the moving average method, and finally we looked at a simple linear model which we obtained through simple linear regression. So yeah, that's all. Here are the references that I used in preparing this talk. The book which I mentioned during my talk, this is the book, and these are some other resources. So yeah, thank you so much. That's all from my end. Hey Kalyan, that was a really nice session. I think I'm pretty fast. That's fine. We do have questions for you. Okay, I'm happy to answer them. A couple of questions for you. Sure, sure. So let me start with the first one, from Anurag in the crowd. The question goes like this. Can you suggest some tools or techniques for visualizing textual data? Okay, let me be very honest. I have never explored visualization techniques for textual data. I have some idea, but I'm not sure, and I don't want to give you a false answer on this. So I'm sorry for that. I can explore it and get back to you. I think one of the ways would be, when you're doing EDA on textual data, one of the first things you would do is create a bag of words and then probably draw a bar graph with its counts. And then... Yeah, correct, but bag of words is something I have never played with on the visual side. So yeah. Yeah, okay. 
So moving on to the next ones which I have. Okay, so there were really interesting visualizations that you showed, where you were comparing the size of the army and stuff like that. One of the things I had in mind: there is a fundamental data science saying, I should say, which is that correlation is not causation, but with visualization you can actually prove the inverse. True, true, true. So I think it's very important to keep in mind that when you want to put out some really interesting or scientific finding from a data set, you could really put it out as a visualization. What are your thoughts on that? Yeah, yeah. So in case you want to compare correlations better, you can compute the correlations during EDA, and what I also suggest is applying a gradient to the background, so that the color highlighting clearly shows how the values are correlated. That's how you can simply see which data points or which columns are correlated with each other. That's how you can easily figure things out. Yeah. And in general, I think visualization is a very, very powerful thing to know, I mean, to do as the first thing. Yeah, sorry to interrupt. Personally, one thing I feel is that, instead of writing a lot of code, one visual can speak more than 100 words. I mean, instead of writing 10 lines of code, if I put up a simple plot in a very constructive way, it answers everything, rather than writing too many lines of code. This is what I have personally learned through my experience. I totally agree with you, because in a data science process, visualization is the first thing and also the last thing. The first thing because you do EDA with it, and the last thing because that is where you put out your insights. So visualization is like the core. And that's what I meant. 
I was mentioning the book The Visual Display of Quantitative Information, which says exactly the same. Instead of writing tons of code, you can simplify most of the process with visualization. You don't need to complicate things. Because at the end, if we are communicating something to stakeholders or clients, it should be easy for them to understand. Whether they are technical or non-technical, visualization definitely helps in that case, as long as we don't complicate the visualization. Yeah. So what are your thoughts on open source visualization tools versus commercial ones? There are things like Tableau and a whole set of tools which are commercial. And then there are open source tools like D3.js or Dojo or stuff along those lines. What do you say? So again, it depends upon the use case. It depends from case to case. For example, if you ask me about my personal or my day-to-day work, I go case by case. Sometimes I do visualizations in Excel itself and communicate the results to my managers or stakeholders. So we need to take a call depending on the occasion. I won't say that open source libraries are better than other visualization tools. It depends on the case. I use Tableau in my regular work. I use Excel in my work. I also do visualization with Matplotlib, Seaborn and Plotly as well. So again, I choose according to the case. Right. And on Jupyter, I think most people would start with Matplotlib if they were to visualize something, and then go on to the various other things like Seaborn. Correct. Why? Because Matplotlib is the basic visualization library if you're doing data visualization with Python. 
So Matplotlib is the primary library you should be aware of, because if you are not sure how to work with Matplotlib, you can't go and explore Seaborn, because Seaborn is built on top of Matplotlib. It simplifies some of the process, but unless you are aware of Matplotlib's properties, it would be difficult to go and explore Seaborn or even Plotly as well. Let's see. Okay, we have another question from the audience. Let's go over it. Could you suggest the best way to build a story in visualization for a beginner? I mean, story in what sense, what exactly do they mean? I was not... What they mean is probably, let's say you have data, how do you put forward to your management whatever you want to convey? So for that, simply, you need to start by getting your hands completely dirty. You need to do data cleansing and all. First of all, you should be aware of your problem statement. What do you want to do? You have data. What do you want to get from it? So you need to write down all your points: how you want to do it, as a step-by-step process. If you ask me about my own learning, I separated things out and wrote down every point. For example, if I want to clean this, I write that down. I design my storyline as a step-by-step process, even the data cleansing process. If you are clear about how you want to work with the data, then at the end that will definitely create a story for you. So as long as you are clear about your problem statement and what you want to achieve from it, that will definitely help you to create a story. Yeah. So in a gist, what I understood is you need to have the story clear in your mind. 
You need to have your data cleansed accordingly. And then, I think a lot of insights from your slides themselves would say that what kind of visualization to use where is something you need to figure out. You cannot always use a pie chart. It might look jazzy, but it may not be the right one for that kind of visual. Yeah, correct. We cannot always go with the pie chart. Sometimes a simple bar chart answers everything, and sometimes a simple heat map says everything. So yeah, that's what I mean. The more you simplify your plot, the better. That's what I was saying: you don't need to go for a fancy plot to express your story. The more you do with simple things, the better it will be. Yeah, so more than things looking jazzy, it's more about conveying the thing simply. Exactly. I mean, if I create some jazzy visuals and convey that message to my stakeholders, at some point they may not understand the reasoning behind all this. That's what one of my examples, the pie chart, was about. It's just four items, so there's no need to create that pie chart with a lot of fancy effects. Instead, if we are just talking about the size of four numbers, simply putting them as numbers in a table or writing them in text will definitely help, because if someone doesn't understand it, what is the point of keeping this pie chart? That's it. True, yeah, it's been a good session and great learning all throughout. So thanks a lot again for the session, Kalyan, about advanced visualization techniques within Jupyter. And we also spoke about advanced visualization in general. So it's been great to have you here. Thank you. 
Thank you so much. And I'm happy to take any questions. Please feel free to connect with me in a little bit, or else you can also connect with me on LinkedIn or Twitter. Just let me show you. These are my Twitter and LinkedIn handles, so feel free to connect with me. Sure, yeah. I'm sure the audience will be happy to connect with you. Thank you so much for having me. Have a good day.