Good morning, everybody. What I'm unsure about is whether my microphone volume is all right, so please give us feedback in one of the chats. But let's start with an introduction. My name is Radovan Bast, and I'm super excited to teach data visualization here with Johan. We will talk about data visualization and Matplotlib, two topics I really like to talk about. I work in Norway, in Tromsø, where I do research software engineering. I like to help researchers improve their code, and I like to teach Python and other things. And with me is Johan. Yeah, good morning, everyone. My name is Johan Hellsvik, and I work at the PDC Center for High Performance Computing in Stockholm, but right now I'm sitting in Uppsala. And I realize that I should take the screen and arrange it. Let me do that, just a sec. No, that's not right. Almost, I will arrange it. And while I open the sharing window, I will close my physical window. All right, good. We will talk about data visualization, and let me first tell you the best way to participate and what to expect. It's good if you have your JupyterLab open. I understand you have that already if you were following the pandas episode, but if you joined later, this is a good moment to open JupyterLab. We will be using it, so we will be doing data visualization inside Jupyter, and we will also explain why that is such a good fit. You can also open up the episode "Data visualization with Matplotlib", which is linked in our collaborative notes. Johan and I will look at these notes, so that's a good place to ask us questions. I have them open in a different window on my screen, and we will try to react to questions live, so please keep the questions coming. We plan to do a bit of introduction for the first 15 minutes. We will then do an exercise block, so you will get 20 minutes to try this on your own. We will then do a break.
And after the break, we will talk about how to make plots even prettier, and then we will have another exercise block. Now the question is how we should show you how this works. In the notebook, you can try things along with us, so if I add something to the notebook, you can do it as well. If it is cognitively a bit too much to listen and also go into a different window and type something, it's also totally fine to just watch what we do. You will have enough time in the exercise blocks to try it yourself. Good. So let's first motivate why we do this, why we chose Matplotlib, and why we do it in a Jupyter notebook. Let me just arrange the windows here on my side so that I can see your questions, and let me also zoom in a bit: data visualization with Matplotlib. Our goal is not to become experts who know everything that can be done with Matplotlib. We want to give you a really good overview of how to find help, how to use it, what good starting points are, and how it connects to the previous episodes, like pandas. That's our goal. In total, we will spend one and a half hours on this topic. So beyond getting started, we also want to show you how you can tweak and improve plots without remembering all the options and commands, because we don't remember them either. The interface of Matplotlib is really big. So, Radovan, is Matplotlib the only Python package for visualization, or are there more? No, there are many. Here are the ones that we know, and there are probably even more. I have tested most of them and I use them in different contexts. So there is Matplotlib, which is the one that we will show you and probably the most popular one, but there is also Seaborn, which builds on top of Matplotlib, there is Vega-Altair, Plotly, ggplot, and then there are different libraries for more specialized use cases.
I would consider the top four or top five here really general. So there are many libraries, and Python is also not the only language that can do plotting, so why do we use Matplotlib? I would say one very good thing about Matplotlib is that, as Python is a completely free ecosystem, it is transferable: if you write your plotting scripts in Matplotlib, you can share them with your colleagues and also with the community at large. This can be in contrast with some commercial plotting packages, where you might have a lock-in effect: you prepare your scripts and share them, and others can perhaps read the script file, but they cannot make use of it because they do not have the program. It may even be the same person, you in a different job in the future. So the fact that both Python and Matplotlib are free and open source is really important for reusability and reproducibility. When I talk about data visualization, I like to start with a quote that I took from the fantastic book by Claus Wilke, Fundamentals of Data Visualization, which you can browse because the book is online. The quote is that one thing I have learned over the years is that automation is your friend. I think figures should be auto-generated as part of a data analysis pipeline, and they should come out of the pipeline ready to be sent to the printer with, and here we paraphrased, minimal post-processing needed. And this is because when we generate plots for posters or publications or a thesis, at least in my experience, I never do it only once. I do it at least twice. I do it once, and then I do it again one day before the deadline because something changed: I want to modify the color, or I get new data, or I realize that there was a tiny mistake and I need to change the figure. And if this is not automated, it's hard.
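To make the idea of auto-generated figures a bit more concrete, here is a minimal sketch, not something shown in the session. The function name `make_figure`, the placeholder data and the file names are assumptions made up for illustration; the point is only that the same script can regenerate the figure whenever the data or the required size changes.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import matplotlib.pyplot as plt


def make_figure(path, width, height):
    """Regenerate the same plot at a given size (in inches) and save it to `path`."""
    x = [1, 2, 3, 4]    # placeholder data, not from the lesson
    y = [1, 4, 9, 16]
    fig, ax = plt.subplots(figsize=(width, height))
    ax.plot(x, y)
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    fig.savefig(path, dpi=150)
    plt.close(fig)  # free the figure once it is written to disk


# hypothetical output files: one size for a manuscript, one for a poster
out_dir = tempfile.mkdtemp()
paper_png = os.path.join(out_dir, "figure_paper.png")
poster_png = os.path.join(out_dir, "figure_poster.png")
make_figure(paper_png, 4, 3)
make_figure(poster_png, 10, 7.5)
```

If the data changes the day before the deadline, rerunning the script is all that is needed to get both versions again.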
You might need different versions of your figures. The precise size and shape of a figure in a manuscript might be different from what you then need on a poster or on slides for a presentation. If you have the figure in script form, you can tune it to the appropriate format and reuse your work. Yeah. Or new data comes in and you want to update the figure. And we will see that if we use tools like Matplotlib in combination with tools like Jupyter Notebook, this becomes a really nice combination. So we will now focus on Matplotlib, but you can of course browse all the other ones as well. At the end of the session, we might also have a look at one of the other libraries. Why did we start with Matplotlib? We motivated it a little bit already. It is the most popular library. If you come from Matlab, it will feel familiar, because it takes a lot of inspiration from how plotting works in Matlab. Even if you choose not to use Matplotlib, maybe because you prefer Seaborn, many of these libraries build on top of Matplotlib, and if you want to tweak and improve your plots, it helps to have an understanding of Matplotlib. But it is relatively low level, in terms of abstraction: we can really modify everything. It doesn't provide statistical functions; some of the other libraries do. But the advantage of Matplotlib is that you can adjust everything. You can really make things publication ready. Everything can be configured and modified. Speaking of statistics, which is not contained in Matplotlib, you can naturally combine the Matplotlib scripting that you do with other Python packages, such as NumPy and SciPy, and also with pandas, which we have been talking about earlier. And I'm having just a quick look at the questions. Thanks for raising them. For instance, question nine: when should I use ax and when should I use plt? This is something that confused me a lot when learning and using Matplotlib.
And we will comment on that, so we will clarify it. This is something I realized maybe 10 years into using it: that there were these different interfaces. Another really good question is how Matplotlib connects to pandas. This is also something we will discuss. Some of the plotting libraries interface with pandas more nicely than others, but with Matplotlib it is also possible to use pandas data frames. We will come back to that. But I think we are ready to open up the notebook and start creating our first plot. At this moment, you can either do it as well, if you have enough screen space and enough cognitive capacity, or you can watch what we do. You will have the chance to test it out in the exercise block. So let's start with a new notebook and not continue from one from before. I will open up a new one. A good first reflex is to rename it. I don't want to have my notebooks called untitled, untitled one, untitled two; I want to give it a good name. I will rename it: right-click, rename notebook. Let's call it plotting. And back to the material. I will copy-paste the code from this block and run it in the notebook, and then let's also explain what is happening. Let me copy, so I copy the whole block. Let's see whether this works at all. Yeah, you are in the python-for-scicomp environment? Yes. Or an Anaconda environment. Yes, yes. So at this moment, all we import is a library called Matplotlib, which is part of the python-for-scicomp environment. It is also part of an Anaconda-based environment. So if you get an error here that Matplotlib is not found, then you are probably in a different environment. I ran this code and I got a first plot, which shows some dots. I have an x-axis, a y-axis, and a title. They are not very concrete yet. And now let's inspect the corresponding code. What did we do? We imported the functionality.
I defined two lists of numbers, x-values and y-values. These values are part of the so-called Anscombe's quartet, which is a really important data set because it is used to motivate why we even do data visualization: it consists of four data sets that look very different when we plot them, but when you look at the statistical values, like the mean, the sample variance, the correlation, and the regression line, they are the same. So if I didn't plot these and only looked at the numbers in the table, I would have less insight. But back to the example, what else can we discuss here? This is the important part, these two lines. We set up a figure and we set up axes. These are objects which we can then use to, for instance, do a scatter plot. Here I send in the data: I say these are the x-values, these are the y-values, and I define a color, which is in a really weird format here, but I will comment later on why we do this. Instead I could also use a named color; I could say red. And if I run that, then the dots will be red. The rest is self-descriptive. Just having a look at the questions. Why do I import it using import matplotlib.pyplot as plt instead of just import matplotlib? Because there is more in Matplotlib than pyplot; pyplot is one of the interfaces that Matplotlib provides. I could also write import matplotlib.pyplot without the alias, and then here, instead of plt, I would have matplotlib.pyplot. I could do that as well, and it would also work. There wouldn't be any noticeable penalty, maybe a little bit more typing. I chose this way because this is often what people do, and this is often what you find when you look for examples on the internet or ask one of these AI chat solutions. Just looking at the other questions. Question 15: is it better to use this way of doing it rather than plt? We will come to this. We recommend doing it this way.
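The first plot described here can be sketched roughly as follows. This is a reconstruction from the description above rather than the exact code block from the lesson material; the variable names and the placeholder axis labels and title are assumptions. The numbers are the first data set of Anscombe's quartet.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; in a Jupyter notebook this line is not needed
import matplotlib.pyplot as plt

# first data set of Anscombe's quartet
data_x = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
data_y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

# the two important objects: a figure and the axes inside it
fig, ax = plt.subplots()

# a named color; a hex string such as "#E69F00" would work as well
ax.scatter(x=data_x, y=data_y, c="red")

# placeholder texts, to be replaced with something meaningful
ax.set_xlabel("we should label the x axis")
ax.set_ylabel("we should label the y axis")
ax.set_title("some title")
```

In a notebook, the figure is displayed below the cell; in a plain script one would typically call `fig.savefig(...)` instead.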
So we show you the more robust way, but we then also need to explain why this is possibly more robust. I think we will do that after the exercise. To go into the first exercise, should we perhaps present it? Yes. So now you have all the tools ready to do the first exercise block. Here it says 15 minutes, but we really want to give you 20, until five minutes past the hour, and I need to explain it clearly first. Your goal will be to do what I did here with Johan: open a notebook, copy the block, and get it to run. Once you get the image that we got here, you are asked to extend it. You should add a second data set and then yet another data set, which is the second one multiplied by two. Here we also wanted to show you one of the many ways in Python to multiply all numbers in a list by a factor. Then you will get a plot that looks like this. Another thing you can try is to browse the documentation, for instance the quick start guide, and find out how to get a legend into the plot that links to the data values. At the end, it should look like this. You can also experiment with modifying the colors. If you get stuck, there is a solution here. If I open this up, whoop, but I will do it only very quickly because we don't want to spoil it, you find a solution for this exercise. And if you are curious about why we chose these particular colors, here is an explanation. Do we have everything we need for the exercise block? So your goal is this exercise, Matplotlib 1, and I will add instructions to the document. We will be back five minutes after the hour, and after that, we will send you into a break. All right, good luck and see you in a bit. Bye. And welcome back from the break. We will continue with Matplotlib.
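For readers following along, a finished version of this exercise might look like the sketch below. It is one possible solution under stated assumptions, not necessarily the one in the lesson material: the variable names and legend labels are made up, the second data set is taken to be the second set of Anscombe's quartet, and the three hex colors are from the Okabe-Ito color-blind-friendly palette.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt

# Anscombe's quartet: sets I and II share the same x-values
data_x = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
data_y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
data2_y = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

# one of several ways to multiply every element of a list by a factor
data2_y_scaled = [y * 2.0 for y in data2_y]

fig, ax = plt.subplots()

# the label of each scatter call is what ends up in the legend
ax.scatter(x=data_x, y=data_y, c="#E69F00", label="set 1")
ax.scatter(x=data_x, y=data2_y, c="#56B4E9", label="set 2")
ax.scatter(x=data_x, y=data2_y_scaled, c="#009E73", label="set 2, scaled")

ax.set_xlabel("we should label the x axis")
ax.set_ylabel("we should label the y axis")
ax.set_title("some title")
ax.legend()
```

Note that `data2_y * 2` would not do the same thing: for a plain Python list it repeats the list instead of scaling the numbers, which is why the list comprehension is used here.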
I also wanted to show you the result of the exercise session, so hopefully you got a result that looks like this. You also find this in the solution. Here we have plotted three sets of x and y values with different colors. I want to save the notebook, and I just wanted to remind you of a practice that I find very useful: before I save a notebook and before I share a notebook with other people, I like to run all cells from top to bottom. If it's not too long, I would recommend doing the restart kernel. Yeah. This is even better, because it will reset and run everything from top to bottom, and this is exactly what the next person will do. The next person opening the notebook will not have anything in memory. They will run the notebook from top to bottom, and I want to make sure that it still produces the results that I wanted. This prevents me from having to run the notebook in a very particular order that nobody will remember. And now I can save. Super. Before I hand over to Johan and before we talk about how we can improve and customize a plot, I wanted to comment on a question that we got a couple of times: which of the two possible ways to use Matplotlib should we use, and why? And I admit, sorry, I need to zoom in here, I admit that although I had been using Matplotlib for quite a while, it wasn't clear to me that there were actually two different interfaces. I got really confused, because every time I asked the internet how to do something in Matplotlib, I saw an answer, but it always looked different from what I remembered. I was doubting myself for a long while until I learned that there are actually two ways to use Matplotlib. One way is the so-called object-oriented way, also called the explicit interface. In this explicit interface, we create these objects and then we use them. This is the method that we use in this lesson. This is also the method that we recommend.
There is another method, which is the so-called pyplot way of doing things, or the so-called implicit interface. It looks shorter, so there is less to type. I don't have to create the figure object, I don't have to create the axes object, I can do this directly. So it looks easier. The downside is that once we start customizing, so if I now change the line width and the format and colors and settings, I will affect the settings for all my plots that come later in the notebook or in my Python code. And sometimes this is not what you want. The explicit way becomes more practical once you start putting this into a function, because then when you change settings, you change them only for the function and the plots that you wanted, instead of changing them implicitly for everything. So that's why we recommend this way. But we show you both ways so that, if you then search Stack Overflow or you ask ChatGPT how to do something with Matplotlib, and you see this kind of answer, you know why it is different from what we have just learned. Here is also an explanation of why we spent a couple of minutes emphasizing this. And with this, Johan will now take over and guide us through styling and customizing plots. We continue watching the collaborative notes, so please continue asking questions. We really appreciate it. We will now touch upon the topic of styling and customizing plots, and a starting notion here is that this is also an aspect of reproducibility. Earlier in my days I often did plots with Matlab, and I would get a certain look and feel of the figures, and perhaps something needed to be tuned a little bit, and I might then do that in a drawing program. It's fine if you do it once or twice, but if you need to do it for 10 figures, it is a lot of extra work.
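To make the earlier comparison of the two interfaces concrete, here is a small side-by-side sketch. The data, the titles and the helper name `plot_points` are assumptions made up for illustration. The explicit interface calls methods on objects that we hold on to, while the implicit pyplot interface acts on whatever figure and axes are "current".

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]    # placeholder data
y = [2, 4, 3, 5]

# explicit (object-oriented) interface: create the objects, then call methods on them
fig, ax = plt.subplots()
ax.scatter(x, y)
ax.set_title("explicit interface")

# implicit (pyplot) interface: pyplot keeps track of a "current" figure and axes
plt.figure()
plt.scatter(x, y)
plt.title("implicit interface")
implicit_ax = plt.gca()  # the axes that pyplot was implicitly working on


def plot_points(ax, x, y):
    """Hypothetical helper: it only touches the axes object it is given."""
    ax.scatter(x, y, c="red")
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    return ax


# with the explicit interface it is unambiguous which plot a function modifies
fig2, (left, right) = plt.subplots(ncols=2)
plot_points(left, x, y)
```

Here only the left panel of the last figure is changed by `plot_points`; the right panel and the earlier figures stay as they were.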
So Matplotlib and other libraries allow you to customize almost every aspect of a plot, and it is good to know what the different parts of a Matplotlib figure are, so that we know what to search for when we want to customize things. So we open up this, and this is for a two-dimensional figure. You can see here that in the object-oriented interface we have all of these elements: you have the minor tick locator on the axis, you have the major tick labels, you have the markers that you can change, and you have a legend. One thing that can be very convenient, which we can show later in the notebook, is that in order to see which properties are available, you can use the help command. Yes, and it's also nice to know what things are even called. If I want to web search for something, I really like this figure, because then I know that I need to search for something called minor tick or legend. Below here on the same web page, you also have an extensive listing of these properties in web format. From this paragraph, we also link to a resource, a GitHub repository with Matplotlib cheat sheets. This is something that you can explore later. There is also a number of predefined style sheets that you can activate with the plt.style.use command. We can show some of these. They come with different collections of colors, marker styles, and line styles that you can choose from, and they are designed to have a good collection of colors. This is very important, because it is not uncommon that a reader or viewer of your figures has a limited capability to distinguish colors: they may be color blind, for instance red-green color blind, or have other vision impairments. So it's good to have a color palette that also works in gray scale, so that the colors range from lighter to darker.
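As a small illustration of the predefined style sheets mentioned above, the sketch below lists a few of the available styles and applies one of them. The choice of the `ggplot` style and the placeholder data are assumptions for illustration. `plt.style.use("ggplot")` would switch the style for everything that follows, whereas the context manager used here limits it to one block.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt

# names of the style sheets that ship with the installed Matplotlib version
available_styles = plt.style.available
print(available_styles[:5])

# apply a style only inside this block; plots created later keep the default look
with plt.style.context("ggplot"):
    fig, ax = plt.subplots()
    ax.plot([1, 2, 3], [1, 4, 9], label="example")  # placeholder data
    ax.legend()
```

The style library also contains a `grayscale` style, which can be a quick way to check whether a figure still reads well without color.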
Yeah, during my PhD, one recommendation was always to print your plot in black and white to see how it looks if somebody later prints the paper on a black and white printer. Later I thought, well, that doesn't make sense anymore, because nobody reads papers on paper anymore; people read them on the computer. But now I know that it does make sense again, because it can help us identify color problems for people with color vision deficiencies. Yes, that's a very good point. So we will now have a hands-on example of styling, and I will do it as a demo. You will also have time to try it in the exercise session. What we'll do here is import a data set using pandas. The exercise is this one, customization one, log scale in Matplotlib, and I will now switch over to do it in the notebook. I start by copying this snippet here. And please remind me, because I was distracted answering something: should we now all do the same thing as you, or should we watch? You can watch. So this has now imported a data set. It contains statistics about countries; it's the Gapminder data set, and we will then use a plot command that I paste. Before we go into the plot, we can maybe reconnect to the previous lesson about pandas. What people see here is that we load a CSV file from the internet with the read_csv command, which we have seen in the previous episode. At the same time, there is .query, so we filter and are only interested in the data for the year 2007. And here we have data for different countries. We have the life expectancy, and we are interested in GDP, the gross domestic product, so roughly how wealthy the country is. I now take this snippet here, paste it in, and execute it. And what we choose to work with here is the life expectancy.
We will visualize life expectancy as a function of the GDP per capita in terms of purchasing power parity. What's that? That's purchasing power parity. Yeah, so it's inflation-adjusted US dollars that are also adjusted to give a fairer comparison. So it's not plain US dollars, but for our purposes, we can think of inflation-adjusted US dollars. Yes. The data is distributed so that a lot of it lies along the vertical line here, and then around this horizontal region here in the uppermost part of the plot. So the question is, how can we make better use of the visualization to highlight the data here? Do you have any suggestions, Radovan? Yeah, one step we can take is to try a logarithmic axis. Because we have such a big difference in orders of magnitude on the x-axis, by switching from a linear to a log axis we will hopefully see the trend more clearly. So we need to set the axis scale to a logarithmic one, and this we will do with set_xscale, which we set to log. I take that line, go to the notebook, and add it here: ax.set_xscale("log"). Yes, and now we have the logarithmic scale on the x-axis, and what we have is much more evenly distributed within the canvas. There is also one attribute, the alpha attribute, that we can play around with. The alpha attribute is at first 0.5 here. You can see what happens if we change it to 0.8. Yeah, then you can see that this affects the transparency of the points. By using this higher value, the dots become more opaque. Great, and a question to both of us. Do you remember all these things, like how to set a log axis or how to set the transparency? I admit I don't. I almost never remember this; I always have to look it up. Yes, that's a good point.
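The two customizations discussed above, the logarithmic x-axis and the transparency of the points, can be sketched as follows. The numbers are invented placeholder values, not the Gapminder data used in the session, and the variable names and axis labels are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt

# invented values spanning several orders of magnitude on the x-axis
gdp_per_capita = [500, 1500, 5000, 12000, 30000, 45000]
life_expectancy = [50, 58, 68, 73, 79, 81]

fig, ax = plt.subplots()

# alpha controls opacity: 0 is fully transparent, 1 is fully opaque
ax.scatter(x=gdp_per_capita, y=life_expectancy, alpha=0.5)

# switch the x-axis from a linear to a logarithmic scale
ax.set_xscale("log")

ax.set_xlabel("GDP per capita (PPP dollars)")
ax.set_ylabel("Life expectancy (years)")
```

With a linear axis most points would be squeezed against the left edge; on the log axis they spread out across the canvas.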
And if you look at which objects we created here, we created the handle fig and the handle ax. The logarithmic scale is something that is set by working with the ax object. So let's see what attributes we have for this object. We can type help(ax), and then we can see that we get a very lengthy listing. First there is a general description of what the object is and what class it is, and then you can see all of these listings. Among them we will find the logarithmic scale, the possibility to set the logarithmic scale. Good. We will, not now but soon, let you go into the exercise session, where you will play around with customization of figures. Let me introduce the exercises here. They work with the same data set. Your task is to make the tick marks and axis labels larger. You can search on the web for the attributes that you need to work with. The target that you're aiming for is to arrive at a figure that looks like this. So this is one of the exercises that you can do. And the second exercise that you could, oh, we should hide the solution there. The third exercise that you can work with is to adapt a gallery example. Here you have some links to some of the other resources, like Seaborn, which is based on Matplotlib, and there is also Vega-Altair, which is a standalone Python package for visualization. Yeah, so people will choose exercise two or three. Exercise three is really close to how I work in real life. In real life, I don't remember all of these commands. I often look through the gallery examples for something that looks similar to what I have in mind. And here you will try that: first, I often take the example and try to run it on my computer.
And then once I get it to work, I try to change the data, I try to put in my own data, and then I tweak. That can be a really fun exploration, and it is at least close to how I work. I don't know about Johan, but I can't remember almost anything. I always start from something that already works that somebody else created. That's also how I do it. Looking at this gallery is a very good source of inspiration, and that's also where you get exposed to the different functions and attributes that you can work with. So is there anything that we can bring up from the HackMD for now? I'm looking. Most questions are answered, and the answers are relatively detailed. One bigger-picture question was whether we will talk about interactive plots, so plots where you don't get just an image but something that you can interact with, like with a slider. I believe that we will not do it in this course, but I will link to a lesson where this is demonstrated. But I think we are almost ready for the exercise session. It will be customization two or three. Until what time, when will we be back? We will take 20 minutes for this exercise. All right. So we'll be back at 55 past, and then we will summarize. We will connect this a bit with pandas data frames and then hand over to the next episode. So exercise customization two or three, starting now. See you again in 20 minutes. Bye. All right, we are back. Five more minutes of Matplotlib. We want to wrap up the session, then we will go into a break, and then we will go into something else. We thought that in the five minutes we could try to do this exploration together on an example. I also realized during the exercise session that the solution we have listed here doesn't perfectly match the Seaborn gallery anymore, because they changed their examples. Anyway, let's try this together. I will now take Seaborn, which is something that builds on top of Matplotlib.
And I take it for a specific reason that I will come back to before the hour is over. So this is often how I start. I open up one of these libraries, I go to the gallery, and I look for an example that looks close to what I have in mind. Here I will try to stay close to the exercise. I want to do a violin plot, which is a way to show the distribution of points and the statistical spread. So I want to have something like this. This is what it looks like, and there is example code. The way I often start is that I take what they have and try to run it on my computer first. Let's try that. Seaborn is a library that should be in your environment. It is in the python-for-scicomp environment, and it is also part of the Anaconda base environment. Now I'm crossing my fingers and running the cell. And I get a plot that looks like what they have, so that's already a big success. I don't fully understand what's going on here, but my next step is often to get an insight into the data, because I want to replace it with my own data. And here I have a feeling that this loads an example data set, the tips data set. This is some data set about some bird? No, it's something about smoking and not smoking. But because I don't know this data set, I would actually split this cell into two, and here I would print. I can either use print, or if I'm in a notebook, I can just write tips directly. I want to see how it looks. Let's run all cells. And this turns out to be something we already recognize: a pandas data frame with columns and rows. I think this data set shows the different tipping behavior of smokers and non-smokers. And now I also understand that what this library is able to do is that we load the data set, and then we can map the x-values to a certain column, the y-values to a different column, and the color to yet another column.
And now I would go in and, instead of using this data set, I would try to put in my own pandas data frame and try to plot that. Only then would I start tweaking, adjusting, and customizing. Here I wanted to show you that Seaborn is a library that is able to use pandas data frames directly and map columns to visual channels: x, y, color. Can we do the same thing in Matplotlib? Back to our lesson. I learned very recently that you can do almost the same thing in Matplotlib. Instead of what we were doing before, sending in a slice, a column of data, as x or as y, we can use this instead: I can say that data is a data frame, and then I can map the x-values to a particular column and the y-values to a different column. So that's very nice. If you then try to do a bit more, like mapping color to continent, it becomes a little bit harder. So we wanted you to know that there are libraries in Python that make this easier. And for those of you who come from R and ggplot2, you can do the same things in Python with libraries like Seaborn or Vega-Altair. But I see now that we are out of time. We will continue answering the questions, so please keep asking questions about plotting and about Matplotlib, but I don't want to eat into the future sessions. So thanks from my side. Johan, any concluding words? Thank you. You covered it all. And as I said, we will continue to answer questions on the HackMD. Yeah, more details there; I will answer there. Thanks so much, everybody, for listening. Thanks to Johan for co-teaching, and I'm looking forward to the next sessions. I think now we go into a break, a 10-minute break, if I understand correctly. Yeah, break until 11 past the hour. Yep. Bye.
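The column mapping in Matplotlib mentioned near the end of the session can be sketched with the `data` keyword. In this sketch the values and the column names `gdp` and `life` are invented for illustration, and a plain dictionary stands in for the table so that the example has no dependency beyond Matplotlib; a pandas data frame can be passed to `data` in the same way, with the strings referring to its column names.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt

# invented values; a pandas DataFrame with these two columns would work the same way
table = {
    "gdp": [800, 4000, 15000, 40000],
    "life": [52, 65, 74, 80],
}

fig, ax = plt.subplots()

# x and y are given as column names and looked up in `data`
ax.scatter(x="gdp", y="life", data=table, alpha=0.5)

ax.set_xscale("log")
ax.set_xlabel("gdp")
ax.set_ylabel("life")
```

Mapping a third column to color, for instance a continent name, takes extra work in plain Matplotlib, which is where libraries such as Seaborn or Vega-Altair are more convenient.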