So, welcome back everyone after the break. I hope you have been making progress with exercises one, two and three. I would like to point out that there are good questions in the HackMD; we will answer them asynchronously and perhaps highlight a few of them tomorrow. I know that some of you have been working on exercises one and two, and we have template solutions for exercise three. It is open-ended, so you could go in different directions, and we have one worked example with the Seaborn library.

We have maybe seven or eight minutes left, so we could try exercise three together; it is maybe the most interesting one, but it probably takes longer than the time we have, so I really invite you to try it out afterwards. What we can do now, for added difficulty, is take one example out of the Matplotlib gallery, try to adapt it a little bit, and make sense of it. This is how I often start: I go into the gallery, and let's take this one here as an example. It shows some scores for men and women, I don't really know exactly what. The first thing I do is copy it into my notebook and run it: copy, paste, Shift-Enter. And I'm pretty happy here, because I get the same result; that is already a really good first step.

Now the second step: before I go into any details, I try to make sense of the data. The data seems to be here; there are labels. Let's simplify it a little bit, maybe I just want two values plotted. There are five things in a list and I see five bars, so what happens if I remove three of them? Probably it will still work. So what are these data types that we have there? These are lists, Python lists: a list of integers here, and here a list of strings. So these are not NumPy arrays, although one could also use NumPy arrays. I run this again and now I have two columns, and I could probably modify it further.

Now let's have a little bit more fun. I will call these "day 1" and "day 2", and I see that the labels changed. Instead of "scores" (where is "scores"? I want to change this, it's probably that one), let's say "number of viewers". Instead of "gender", let's look at our viewers by tool. And instead of men and women, let's imagine we are interested in how many people watch on Twitch and how many on YouTube. We rerun, and it is still kind of working. And now I can imagine that these are the bars, and these will be the error bars, the standard deviations.
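For reference, here is a minimal sketch of roughly what the adapted gallery example could end up looking like. The original gallery code is not captured in the transcript, so the viewer numbers, standard deviations, and variable names below are made up:

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data standing in for the adapted gallery example:
# viewer counts per platform on two days, with made-up standard deviations.
labels = ["day 1", "day 2"]
twitch_viewers = [195, 210]
youtube_viewers = [400, 380]
twitch_std = [20, 25]
youtube_std = [30, 35]

x = np.arange(len(labels))  # label locations
width = 0.35                # width of each bar

fig, ax = plt.subplots()
ax.bar(x - width / 2, twitch_viewers, width, yerr=twitch_std, label="Twitch")
ax.bar(x + width / 2, youtube_viewers, width, yerr=youtube_std, label="YouTube")

ax.set_ylabel("number of viewers")
ax.set_title("Viewers by tool")
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()

plt.show()
```

Note that renaming the lists and labels, as done live above, only touches the data definitions; the plotting calls stay the same.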
And I don't know how many viewers we have; this is not men and women anymore, so we could rename these too. We have a number for today: currently we have 195 viewers on Twitch, so let's put 195 for Twitch, and for the other one, I don't know, 400. I change this and run it, and... well, how come it works? Why does it work? I don't understand why this code is working, because I was expecting that I would also need to change this. Maybe the old arrays were still in memory? Yeah, good point. Let's actually go back, because that is a very good point; let me show this before I finish.

One thing I should really do before I save the notebook (because I think everything is actually good here, it works): I like to rerun all cells. You can even restart the kernel and rerun all cells. Let's try that, because then it should really break; it will restart the whole kernel and run the notebook from top to bottom. Restart... yes. And now we get the error, very good: men_means is not defined, because I changed it. So that was a really good demo effect. Before saving a notebook and sharing it with other people, it is really good to restart and rerun all cells from top to bottom, because that is the first thing the next person will do.

Good, I will not go into more detail here, but back to the lesson. Just to summarize, since we only have a few minutes left: I find it really useful to go through an example that is close to what you are working on right now. You can also try any of these other libraries; they all work great with Jupyter. Some of them are not part of Anaconda, so some need an extra installation step. But also have a look at Seaborn, just a quick peek at the gallery here: a very nice library which builds on top of Matplotlib. For Seaborn we do have an example exploration that we can open in the lesson, so you can go through that, and it will also work in your Anaconda. So that is what I often do: I take something existing, I tweak the data, and if it looks somehow alright, I improve the looks, and then it is ready to publish.

So let's summarize the session. It was very quick, and we could only give you some starting points, but hopefully it was useful. Some points I would like to repeat: automation is our friend. There will be a day when we are really happy that we have everything in a notebook and don't have to redo all the figures by hand, because all of them will regenerate in two minutes. As Johan mentioned, keep the data, the thinking process, the plotting and the figures all in one place, if you can. Sometimes you cannot: for example, when the data is sensitive, it has to be in a different place, or if the data is gigantic, it also needs to be in a different place, but then we can fetch it with pandas, for instance. And on Thursday we will take this a step further, because we will show how to create a Binder instance from our notebook. Then we can share a visualization with others, and they can reuse and reproduce it, and they don't even need Jupyter and Matplotlib on their computer; all they need is a browser. That will be very nice, and we come back to it on Thursday.
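As a rough pointer to the kind of Seaborn example exploration mentioned above, here is a minimal sketch. It is not the lesson's actual example: the built-in penguins dataset and the plot choices are only illustrative (and load_dataset fetches the data over the network):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# One of Seaborn's built-in example datasets: a tidy pandas DataFrame.
penguins = sns.load_dataset("penguins")

# A colorblind-friendly palette (relevant to the color-scale point below).
sns.set_palette("colorblind")

# A one-line statistical plot built on top of Matplotlib.
sns.scatterplot(
    data=penguins,
    x="flipper_length_mm",
    y="body_mass_g",
    hue="species",
)
plt.show()
```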
What did I forget to say? I could perhaps highlight one thing, namely color scales. As you can read down here, an important aspect is that some people are color blind, and in general it is a very good idea to have a color scale that still works if you project it to a black and white scale. This also matters for printing, because sometimes you print things in black and white, and then it is good if the color scale worked from the beginning. And further up in the lesson we have links to resources that give you good color palettes adapted to these different color vision deficiencies. Great point. Thanks so much for watching, for listening and for the questions; we will catch up with the questions, and we will hand over to Simo.

Please take over the screen from me. Hello, so we are in gallery view now. Yeah, thanks for the talk; I really liked the pandas and visualization parts. I think these are the most important parts of the course, the right level between advanced and basic for most people. So now on to data formats. Simo, any other comments? I think not. Your window can be narrower and taller, and the font zoomed in a bit.

So, first, introductions. Hello everyone, my name is Simo Tuomisto. I can't remember my exact title right now, probably systems designer or something like that, at Aalto University, where I work with the Aalto HPC systems and help our users with various problems, and quite often these are IO related. So I'm going to talk a bit about data formats and how you should store your data to make your analysis easier. It is almost funny how often we have problems that are not about CPU or memory (people know how to use those pretty well) but about data IO; that is often, well, not great.

Do you want to have the terminal open at the bottom, or should I have it? Am I sharing my screen, or are you for this part? I'm currently sharing, but if you want the terminal open as well, then let's have you share. Okay, we did it now, here we go. First time lecturing in this course, so a bit of a hiccup.

While the screen share is being set up, let's start talking about why data formats are important. Data formats are something you are using whenever you are using any kind of data at all: whenever you have some sort of Python object, your data is stored in some data format. So even if you don't think about it, you are using data formats constantly.

Is there a difference between the on-disk data format, the in-memory data format, and then the semantic data format? Well, there is a connection between them, and I will be talking about it in a second. Basically, the idea is that your data will be organized in some way in memory, and you want to have it on disk in a similar way, because that makes it easier to work with.
So if you have, let's say, a bag of flour and you want to put it into a container in your kitchen cupboard, you don't want the container to be too big or too small for the bag of flour; you want a container that fits whatever you have. Maybe one with the right properties, too: can you pour from it well, can you scoop it out efficiently without dumping everything? Exactly, exactly.

So let's consider the two most common data formats that we have already used and that you are most likely going to be using. The first of them is the pandas DataFrame, which you have already worked with. In a pandas DataFrame we have the tidy format: columns that each have a specific data type, with the data organized in these columns. So we have a number of rows, and the data is organized as columns. Richard, do you want to run the commands that are in that cell over there, just copy-paste them into a notebook? For everyone else, I would recommend just following along for now; later on we will have exercises where you can do the same things Richard is doing.

Okay, so this is a DataFrame. In this DataFrame we have various different data types in the columns: we have strings, we have timestamps, we have integers, and we have floating point numbers. You can see from the output of the info command that the data types of the different columns are different; the data is organized in these columns. And here the columns are NumPy arrays, and the data types are NumPy data types. Yes, so the DataFrame is a kind of holder for these various columns. It is like the utensils in your kitchen: you have spoons, forks and knives, all in different slots of the utensils drawer; the DataFrame is basically like that.

Now let's look at another example on the web page: a NumPy array that is multidimensional, for example two-dimensional random numbers. Maybe view the shape of the array; and size is the shape multiplied together. So we currently have a million random numbers organized in a two-dimensional block. This is different from the pandas DataFrame, because in pandas we had one-dimensional columns, and here we have a two-dimensional block. Even though you can represent both of them as a matrix, the way the computer stores them is different. And now we have a question: how would we save these fundamentally different data formats to files in a way that keeps the data format intact? Basically, whether you are using tables or NumPy arrays, you want to store them on disk in a similar way to how they are organized in memory. At the bottom there we have a visualization of the NumPy array: the data is organized in this kind of a block of numbers. So pandas provides some useful mechanics for dealing with columns and rows separately, but it's deeper than that.
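The exact cell from the lesson is not reproduced in the transcript, but a minimal sketch along the same lines (made-up column names and sizes) could look like this:

```python
import numpy as np
import pandas as pd

# A tidy DataFrame: each column has its own dtype.
dataset = pd.DataFrame({
    "name": ["run1", "run2", "run3"],                  # strings
    "time": pd.to_datetime(["2021-11-02", "2021-11-03",
                            "2021-11-04"]),            # timestamps
    "count": [10, 20, 30],                             # integers
    "value": [0.1, 0.2, 0.3],                          # floats
})
dataset.info()  # shows one dtype per column

# A multidimensional NumPy array: one dtype, one contiguous block.
data_array = np.random.random((1000, 1000))
print(data_array.shape)  # (1000, 1000)
print(data_array.size)   # 1000000 -> the "million random numbers"
```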
So you're saying that the actual arrangement in memory is different: NumPy is optimized for arrays, and pandas for column-based data? Yes. Pandas usually doesn't work that well if you are working with multidimensional data; NumPy is designed for multidimensional numeric data. Then again, NumPy isn't as well suited if you are going to work with, say, strings, or do the kinds of mixed-column things we did just previously; pandas supports those different types of data much more naturally.

So the question now is: okay, we have these data formats. On one side there is array data. This is common in physics, or if you are doing matrix calculations or linear algebra; you want to store things in these kinds of blocks, and if you are a Matlab user you are very familiar with them. On the other side we have tidy data; if you are an R user, you can think of R data frames in the same way, as basically similar kinds of objects to pandas DataFrames.

Okay, so now you want to choose a file format that keeps the data format intact, so that you can store the data into a file and return to it later. And here it is important to remember one thing: there is no one data format for every use case. This is very important. You shouldn't choose a file format before you know what kind of data you are putting into the file. Instead there are various standard file formats, and this comic (the xkcd about standards) shows the proliferation of standards quite well: you have 14 standards, somebody thinks they can unify the whole landscape with one standard that serves every use case, and in reality you then have 15 standards. This happens a lot in the IO world. So there will be multiple different file formats, and it is all about finding the correct tool.

When you want to choose a file format, you should consider a few questions. First: is everybody else in your field using some particular format? That is probably a good reason to use the same tools everybody else is using, rather than inventing your own. So basically standardization is more important than almost anything? Yes, because then you can utilize work that other people have already done, and you reduce the risk of creating problems for yourself further on, because the file format most likely supports the work you are going to be doing. Other good questions: is it fast? Is it space efficient? Is it easy to use for the work I am going to be doing?

When we talk about efficiency, is this really a concern for small data? Do these considerations matter mostly when you have very large data? That is an excellent question. Most likely not: with really small data you don't notice these problems. Nowadays every laptop has a fast NVMe SSD or something like that, and data loading will be so fast that you don't even notice it with small data. But most of this talk is really about future proofing.
If at some point you are going to work with a bigger dataset, or with data that other people provide, you might want to pick a better data format now, so that you know you won't run into problems in the future. For example, whatever file format you have, you still need to write something to, say, plot the data, and if you have chosen a certain format and can only plot from that format, that can make it harder to work with other data formats later on. And I think we are going to see that it is easier to use good formats and good existing functions than to make your own thing. Yes; I like the future-proofing idea: eventually something gets large, and then it is better to have done the right thing from the start.

And here is an important question to ask yourself: do you want the format to be human readable? Many data formats are human readable, but most of the efficient big-data formats are not; they are binary formats. And the question is: are you really going to look at the data? If we look at the NumPy array we had there, nobody is going to read that as a text file, because it is a million numbers in a text file, and you will be working with the data through code anyway, in Jupyter notebooks. So is it really a benefit to have the data in a human-readable format? It is very enticing, because you always feel like you are in control; you can read the format as well as the computer can. But in reality you might be shooting yourself in the foot, because it can be better to just let the computer read the data its own way. I guess that is why people start with their own formats, or bad formats: when you start off, it is better to be able to understand it. Yes. But then you scale up. Yes, this is exactly the case.

Okay, and then the last question: archival and sharing. There are different use cases for data formats: some formats you want for temporary data throughout your analysis process, and some formats you want for sharing with people. A format that is good for sharing or storing data might not be good as a temporary format, and the other way around. So you might want to look at which part of your workflow you are working on, and choose a different format based on the situation.
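To make the human-readable versus binary trade-off above concrete, here is a small sketch (not from the lesson; file names are arbitrary) comparing the on-disk size of the same array written as text and as binary:

```python
import os
import numpy as np

data_array = np.random.random((1000, 1000))  # a million float64 numbers

# Human-readable text: roughly 25 bytes of characters per number
# with savetxt's default formatting.
np.savetxt("data.txt", data_array)

# NumPy's binary format: exactly 8 bytes per float64, plus a small header.
np.save("data.npy", data_array)

print(os.path.getsize("data.txt"))  # roughly 25 MB
print(os.path.getsize("data.npy"))  # roughly 8 MB
```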
But this is all theoretical and it is getting boring, so let's look at some real-world examples: the most popular data formats, and the first one is the most popular of all, comma-separated values. So what can you tell us about it? Well, CSV is what it says: you have data (numbers, strings, whatever) separated by some separator. Usually it is a comma, like the name says, or a space or a tab or something like that. Basically you have a text format that forms a table, and the data lives in that table. This is very common for sharing data, especially statistical data.

But it is very slow and very space inefficient. We still need to work with it, because it is such a good format for sharing: everybody can understand it, whatever programming language they are using. And I guess there is no risk that someone gets it 50 years from now and can't understand it: you just open it and you see the column names. Yes; the biggest problems with CSV are things like different line endings between Windows, Linux and Mac, and different text encodings, but you can still look at the file, see what the format is, and figure those problems out.

So, should we do a demo? Yes. While working with the Titanic dataset we already used the read_csv function, but instead of reading a dataset in, let's take our already existing dataset and write it into a CSV file. That is really easy, as Richard has just done: now we have a CSV file, and Richard is using the head command to check from the command line what we have in it; it just prints the first lines of the file. Here we have set index=False so that we don't get the numbering index at the front, but you can have it if you feel like it. I guess that is because the index has no meaning here; it is just an integer, and we can recreate it when we load the data. So here we have the same data, and the good thing about functions like to_csv that pandas supplies is that they write the data in a standardized way. If you write data yourself, opening a file and writing strings into it, you can create messy data that somebody else then has to figure out how to read; pandas' to_csv creates a clean CSV file.

Okay, let's read it back in. For that we use the read_csv function; it is very self-evident what it does. All of these functions, to_csv and read_csv, have lots of options: say you have headers or comments at the start of the CSV file, you can skip those, and you can do all kinds of things. And we get the same data back.

So this was the tidy data format, and CSV works much better for tidy data. With numeric array data it is a bit more complicated: not necessarily the writing part, but the reading part can get complicated later on. NumPy has routines for saving and loading these text files, so let's try writing our data array with the savetxt function. I guess this will print a whole lot, because the lines are very long. Should I do it? Let's try it. You have a typo: "array". Okay.
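A compact sketch of the round trips from this demo; the DataFrame and array here are small stand-ins for the ones used in the lesson, and the file names are arbitrary:

```python
import numpy as np
import pandas as pd

# Small stand-ins for the dataset and array used in the demo.
dataset = pd.DataFrame({"name": ["run1", "run2"], "value": [0.1, 0.2]})
data_array = np.random.random((1000, 1000))

# Tidy data: pandas handles the CSV round trip.
dataset.to_csv("dataset.csv", index=False)  # index=False drops the integer index
dataset_csv = pd.read_csv("dataset.csv")

# Array data: NumPy's text routines work, but the files get big and anything
# beyond two dimensions is awkward to represent as text.
np.savetxt("data_array.csv", data_array)
data_array_csv = np.loadtxt("data_array.csv")
```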
But yeah, this shows why text is not a very useful format for this kind of data: you are not going to humanly read this data, and especially if the data has more than two dimensions it becomes really hard to parse. Nevertheless, we can load this kind of data back in, so let's try the loadtxt function too. So it works, but for numeric data this definitely isn't the best way of working with the data.

There is one additional complication with text data: the floating-point precision can easily be reduced. If you have data in NumPy arrays, you usually have double-precision numbers, float64, with about 16 decimal digits of precision, but if you use plain Python to write these numbers out, you can easily lose some of that precision. Look at the example there: we create a number, the square root of 2, and write it into a CSV file using plain Python routines, the kind of quick-and-dirty thing you see all the time: you open a file, you write something to it, you close the file. And when you read it back from the file, the number is not the same. Can you show the test number and the head of the CSV file? Okay: the test number has 16 decimals, but in the writing part there is this percent-f shorthand for floating point, with no mention of what precision we want to write, and the default precision is six decimals. This can easily create problems. That is why, when you are dealing with numeric data where the precision is important, physical simulations for example, you want to store binary data, because then you get the exact same numbers back when you load the data in. Of course, in many cases 16 decimals of precision isn't necessary, say for measurement data from questionnaires, but in many cases it is important, and that is why using CSV files can be a bit risky. For very numerically intensive simulations there is a whole art of dealing with numerical precision, but most people don't need to deal with that.

An additional problem with CSV files is that you need a huge number of characters to write one of these numbers: 16 decimals' worth of characters per number. So what you usually want to do is use binary data, because that is much more space efficient.
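A minimal sketch of that precision pitfall, along the lines of the lesson's example (the file name and formatting choices are illustrative):

```python
import math

test_number = math.sqrt(2)
print(test_number)            # 1.4142135623730951 (full float64 precision)

# Quick-and-dirty text output: '%f' defaults to six decimals.
with open("number.csv", "w") as f:
    f.write("%f\n" % test_number)

with open("number.csv") as f:
    loaded = float(f.read())

print(loaded)                 # 1.414214 -> precision silently lost
print(loaded == test_number)  # False

# Writing with repr(test_number) (or '%.17g') would keep the round trip exact.
```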
Okay, let's move on to actual binary formats. So what are the alternatives to CSV? CSV is the most popular and it is so easy to use, but nowadays there are many really good alternatives that you can work with instead. CSV files are nothing bad, but they are not the best format for every use.

So let's consider the Feather format. Feather was developed by the main developer of pandas together with the main developer of the tidyverse in R (I can't remember the names right now); basically, they developed the format so that they could share data frames between R and pandas more easily. You can try installing support for it yourself with pip install pyarrow; we will talk about package installation later on, but for using Feather you need this extra package.

Feather is basically only for tidy data. It is very space efficient, very fast, and great for tidy data, but not great for anything else. It is very one-note, but really good at what it does, and you should use it for those cases. The best use for it is temporary storage of your data frames: if you have a data frame you are working on, saving it as Feather is most likely going to be fine, unless the data frame contains some really strange Python objects; if it is strings, numbers, things like that, Feather is really good. It has interfaces to at least Python, R and Julia, and pandas has really good integration with many of these formats: similarly to to_csv, you have to_feather, and it is very self-evident.

I am showing it here: we can save it and reread it, and it looks exactly like the other one. It has the same datetime objects and integer objects, which, actually, if we go back up to the CSV reading, it probably doesn't have. Well, the CSV reader is more complicated; it needs to interpret the objects. Oh, actually yeah: here the timestamps, it doesn't recognize them as actual times, it thinks they are strings. That is a really good point; I thought it would convert them. There are options in the CSV reader that will convert them for you: you say this column will be a datetime, and so on. But this is one of the advantages of using these good binary formats: the file carries all the metadata about the types, so you don't have to deal with converting columns, or "oh, this column has NaNs", or anything like that; it is always the same. Yeah, that is a really good point.

So Feather is this quick format for storing temporary data, but there are other formats for tidy data as well. The Parquet format is part of the same pyarrow package that Feather is in. Parquet is the big data format: it comes from the Apache Hadoop ecosystem and is the backend format of many of the really big data frameworks. It is meant for tidy data; it can house binary data too, but that is complicated, so we are not going to go into it (if you want a demonstration, I will be giving a talk at Nordic RSE today where I demo a use case of Parquet for binary data). But it is really good for just storing tidy data. Pandas has similar functions for Parquet as it did for Feather and CSV, so you just do the same. There are not that many demos we can give of these formats, because the interfaces are all the same; it is not a big hassle to start using them, you just change the name of the function.
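A sketch of the Feather and Parquet round trips, same shape as the CSV calls; this assumes pyarrow is installed and uses a small made-up DataFrame:

```python
import pandas as pd

dataset = pd.DataFrame({
    "time": pd.to_datetime(["2021-11-02", "2021-11-03"]),
    "value": [0.1, 0.2],
})

# Feather: fast temporary storage; dtypes such as datetimes survive the trip.
dataset.to_feather("dataset.feather")
dataset_feather = pd.read_feather("dataset.feather")

# Parquet: the same one-line interface, better suited for archival and
# big data tools.
dataset.to_parquet("dataset.parquet")
dataset_parquet = pd.read_parquet("dataset.parquet")

# For comparison, CSV has to be told which columns are datetimes:
dataset.to_csv("dataset.csv", index=False)
dataset_csv = pd.read_csv("dataset.csv", parse_dates=["time"])
```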
Wait, so can you clarify the difference: when do we use Feather instead of Parquet, and vice versa? Well, Feather is really fast: instead of having really big CSV tables you would have Feather tables, so if you just want something done really fast, that is probably the easiest option. Parquet is also very fast, and it is better for archival, because it is designed and used for archival. So I would say that Parquet is in general the better format, but Feather can be good if you just want to get something done. Is it something like: Feather is closer to the in-memory representation, and Parquet is a separate format for storing on disk? Yes, Feather is pretty much "store it as it is in memory", and Parquet does some extra steps, in a way. But both of them are really good; you can use either one, depending on the use case. And Parquet is very widely supported: there are interfaces for Matlab and other languages as well.

Okay, so Parquet is complicated for these block-like formats. If you want to store blocks of numbers, HDF5 is much better. HDF5, also known as Hierarchical Data Format, is very good for storing array data: big blocks of data that you want to store. In pandas there is also an interface called PyTables that allows you to store data frames in HDF5 as well, but that is usually not that efficient, because a lot of the data in data frames is usually strings, and HDF5 isn't that good with strings; it is better with numeric data. So the file sizes can get a bit bigger if you have string data, but for numeric data HDF5 is really good.

Okay, should we do a demo? Yeah, let's do a demo. We are doing the tidy dataset first; it is pretty much the same as the other ones, it just does the same thing. I just realized we have five minutes left, so maybe we should summarize. Okay, so it loads. And there is a separate package to use for the arrays? Yes, for array data you want to use the h5py package, which is really nice and really good for working with these datasets.

I will quickly summarize NetCDF4; we don't really need to go through the demo. NetCDF4 is like HDF5, but organized, so you can have multiple datasets within one data file. Let's say you have a physical simulation, a weather simulation, with a grid of points that represent the weather at different locations around the world. You can store multiple datasets, with one dataset being temperature, another being pressure, and all of them can be huge arrays of numbers. NetCDF4 gives you an organized way of storing this data, so you can load it into various different programs. If you are going to be working with these kinds of data, that might be the format for you. There is a demo there that we unfortunately don't have time to go through, but it is nothing really special; it uses the xarray package, which is really nice for working with this kind of data. Is NetCDF one of these database kinds of things? No, not really. It is basically like HDF5, but everybody has agreed to certain rules for naming the datasets, and that is why it works.
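A small sketch of the array-data side: HDF5 via h5py, and NetCDF4 via xarray. This assumes a netCDF backend such as the netCDF4 package is installed, and the dataset names and grid sizes are made up:

```python
import h5py
import numpy as np
import xarray as xr

data_array = np.random.random((1000, 1000))

# HDF5 via h5py: store and read back a block of numbers.
with h5py.File("data.h5", "w") as f:
    f.create_dataset("data_array", data=data_array)

with h5py.File("data.h5", "r") as f:
    loaded = f["data_array"][:]

# NetCDF4 via xarray: multiple named arrays with named dimensions in one file,
# in the spirit of the weather-simulation example above.
ds = xr.Dataset(
    {
        "temperature": (("lat", "lon"), np.random.random((180, 360))),
        "pressure": (("lat", "lon"), np.random.random((180, 360))),
    }
)
ds.to_netcdf("weather.nc")
ds_loaded = xr.open_dataset("weather.nc")
```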
And last but not least, we have the NumPy data formats. If all of those previous formats for array data looked complicated, you can use NumPy's own format, which is really easy and works really well. You can use the numpy save function to store an array into a file in NumPy's format, and there is also the savez function to store multiple arrays into one file. This is really good for quick-and-dirty storage of NumPy arrays: instead of a lot of CSV files, you can just use these. But it is not very good for sharing data; for that I would still recommend NetCDF4 or HDF5. Still, it is a really good format.

Okay, I think we can leave the exercises as homework; they basically do the same things we did above. On the page itself there is also information about why you should use binary formats: there are benchmarks, and a quick mention of how these different formats fare against each other. And the main question is scaling and future proofing. If you know that you are always going to be working with small data, then nothing of what I have said really matters: you can trust your CSV if you know you will always be using megabytes of data. But if at any point you think you will be working with something bigger than that, you really should look into a good data format for your use case and use that to store the data. There are various benefits, as mentioned here, to binary formats, and they can really help you out. But remember that there is no single correct data format; that is the most important thing to remember. And you really should look into what other people around you are doing, because data is all about sharing, all about moving information around, and you should use the formats other people in your field are using, because that makes it easier to share not only the data itself but also the ways of interacting with the data: plotting it, reading it in, that kind of stuff. There are a few questions here that you can ask yourself when you are choosing a format. Unfortunately this talk was less interactive than the other talks, but I hope you got something out of it.

Okay, so that is our time; maybe we can stick around a bit after. One quick mention, a format I forgot to add here: Matlab's .mat files are also supported on the Python side, through the SciPy IO routines. So if you are using Matlab, as many of the users seem to be, you can read those as well; it is mentioned in the NumPy and SciPy IO pages, you can look it up.
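A quick sketch of the NumPy save/load round trip described above; the file names are arbitrary, and .npy holds one array while .npz bundles several:

```python
import numpy as np

data_array = np.random.random((1000, 1000))

# One array -> .npy
np.save("data_array.npy", data_array)
loaded = np.load("data_array.npy")

# Several named arrays -> .npz
np.savez("arrays.npz", temperature=data_array, pressure=data_array * 2)
arrays = np.load("arrays.npz")
print(arrays["temperature"].shape)  # (1000, 1000)
```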
So, in the HackMD... okay, should we continue with any other discussions? One thing I was wondering about: what about SQLite and these on-disk database formats, do they have any role in this? Yes, SQLite and the various SQL databases are very popular when you have something that creates a lot of new data constantly, where you keep creating new rows of data. Say you have something that iterates, like a Markov process or something like that: you have a state, then you get another state, and another, and you want to keep adding new rows of data. Then SQLite is very good, because adding new rows is very fast in SQL. And also, if you have data where you can use SQL lookups, the SELECT functions, or merge tables based on some keys, it is very good for that as well. I think someone once put it like this: if you can export part of your analysis to the database engine, so that it efficiently uses its indexes and returns only the small amount of data you need, it is good; but not if you are storing a whole data frame and every time you read it you read the whole thing. Yeah, everything has its use case; no single file format supports them all.

I wanted to give a comment on the data formats: really nice lecture. Something I did in my past was to invent my own data formats, and I regret that now. So I think there really is a point in going for standard data formats, even if they are not a 100% fit, as long as they are an okay enough fit. Now I would be very careful about inventing my own format, and I would rather try to adapt to whatever already exists in my community. So that was a really good point there. Yeah, I have done it myself previously as well, and I regret it too, because after people start using your format, or the tools you have created for that specific format, you realize at some point that it gets harder and harder to reimplement it with better tools. And then you need to write your own tools to read it and write it; the nice thing about the standard data formats is that you don't have to write any of that, it is a one-line thing to read it in and write it out, and it is a bit of a boring task to program functions that write your data out into some other data format. Yeah, I think everybody writes maybe a hundred CSV readers throughout their programming lifetime; at least I have. Instead of using the already existing stuff, it is "okay, I'll just do it quick and dirty, I'll create a CSV reader here", and then at some point you realize you need to add features, and then you think: why wouldn't I use the CSV readers already available in this package? That "how hard can it be" attitude can really make things hard for you, because it actually can be quite hard. It can save you a lot of effort and a lot of time if you just trust other people: some engineer somewhere has designed these data formats because they encountered the problems you are dealing with and wanted to solve them, and somebody probably paid them a lot of money to solve them. You don't have the resources to redo the same work.

One thing I remembered earlier and wanted to mention is this idea of structural versus semantic storage. You would almost never want to write your own binary format, but in several of my projects we are using, say, the Feather format or the SQLite format, and "our format" is basically saying: in the Feather file, there will be columns with these names. So my data format is choosing the standard column names, and then using a standard structural format for what is in memory or on disk. That is actually a very good point. And then it is also easy to switch out the backend, as long as it loads into the same semantic format later.
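A tiny sketch of that structural-versus-semantic split: the "format" is just the agreed column names, while the structural backend (Feather here, but it could equally be Parquet or SQLite) stays swappable. The column names are invented for the example:

```python
import pandas as pd

# The *semantic* format: an agreed set of column names and meanings.
COLUMNS = ["timestamp", "sensor_id", "value"]

def save_measurements(df: pd.DataFrame, path: str) -> None:
    """Store measurements using a standard structural format (Feather)."""
    df[COLUMNS].reset_index(drop=True).to_feather(path)

def load_measurements(path: str) -> pd.DataFrame:
    """Read measurements back; swapping Feather for Parquet or SQLite here
    would not change any calling code."""
    return pd.read_feather(path)[COLUMNS]

measurements = pd.DataFrame({
    "timestamp": pd.to_datetime(["2021-11-02", "2021-11-03"]),
    "sensor_id": [1, 2],
    "value": [0.5, 0.7],
})
save_measurements(measurements, "measurements.feather")
print(load_measurements("measurements.feather"))
```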
Yeah. I wanted to just say that when it comes to data access and data formats and all of this, what usually matters is what you can do with the data. The more important talks today were the pandas talk and the plotting talk, but in order to facilitate those, you usually need to load the data in or save it out somewhere. It might seem trivial: "okay, I have a few megabytes of CSV, why would I care?" But make it a thousand times bigger, and once you start getting into the gigabyte range, you don't want to wait three minutes for your data to load so that you can do a plot, when with a better data format you could do it in, let's say, a few seconds. So it really interferes with the other things you actually want to do. IO is basically the waiting screen, the loading screen, of the HPC world: you never want to be stuck at a loading screen, because it doesn't feel productive; you are not doing anything, you are just waiting for the computer to do something. By using better data formats you can minimize the waiting time for the other stuff that is actually important. That is why choosing good data formats is usually really worthwhile, and using frameworks that support these data formats, for example pandas (and many other languages have similar functions), will save you a lot of time for the other things that matter more.