Hello. My name is Simo, and Teemu is here as well. Hi. We're going to be talking about data formats next. Now that we have covered the basic tools, NumPy, Pandas, and the plotting tools, we usually want to figure out how to get outside data into our system, and for that you usually need to choose a suitable data format. If we switch to my screen, we can look at data formats in a bit more detail. So what is a data format? It can mean two separate things. It can mean the data structure: how your data is organized, how the program sees your data, how it's laid out in memory, what kinds of objects it consists of; in short, how it's organized in your code and in the memory of the computer. And then there's the file format: how the data is stored when it's written to disk. Both of these are data formats, but they're a bit different. One is the format while you are working on the data, and the other is the format while the data is waiting to be worked on. Let's look at a few examples. Consider this random data frame that I have here. It contains some strings, some time values, some integers, and some floating point numbers. If we create it and inspect it with the info command, we can see that there are 100,000 entries, some of which are objects, some datetimes, some integers, and some floating point values. So we have different kinds of objects inside this data frame. What is the data format for this kind of a data frame?
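The demo data frame itself isn't shown in the transcript, but a minimal sketch of a mixed-type frame like the one described, with made-up column names and random contents, could look like this:

```python
import numpy as np
import pandas as pd

n = 100_000
rng = np.random.default_rng(42)

# One variable per column, one observation per row, one dtype per column
dataset = pd.DataFrame({
    "name": rng.choice(["alice", "bob", "carol"], size=n),      # strings (object)
    "time": pd.date_range("2024-01-01", periods=n, freq="s"),   # datetimes
    "count": rng.integers(0, 100, size=n),                      # integers
    "value": rng.random(n),                                     # floats
})

dataset.info()  # shows the 100000 entries and the per-column dtypes
```

Running `dataset.info()` prints one line per column with its dtype, which is how you can check that each variable really has its own type.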
Well, the data format is tidy data, which we talked about yesterday. In tidy data, each variable can have its own data type and is stored as a column, and the observations are in the rows. So we have multiple columns with different data types. Let's consider another example. Here I create a simple NumPy array. If we create it, we see that we have an ndarray with a certain shape and certain strides; we were talking about strides yesterday. We notice that it's a contiguous block of data, one big block in memory, and it has a data type of floating point numbers. So what is the data format for this data? It is basically a block in memory: a multi-dimensional array. Of course, as was discussed yesterday, to the computer this is just a one-dimensional block of memory, but using the shape and strides parameters it can be interpreted as a multi-dimensional array whenever that is needed. Now, we have two different data structures here. Wouldn't it be nice if both of these could be stored to disk in a way that retains the same structure? What I mean is that if we save this multi-dimensional array, or our tidy data frame, we would probably want to use a file format that has the same kind of structure inside it, so that we can keep the format as intact as possible. This becomes very important as you move towards big data. Big data is all the rage, well, has been for the last 20 years, and the more data you have, the more important it becomes to keep your data in a similar format in memory as on disk.
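A quick way to see the shape, strides, contiguity, and dtype that were just described is to inspect a small array directly:

```python
import numpy as np

a = np.zeros((3, 4))  # a 3x4 array of 64-bit floats

print(a.shape)    # (3, 4)
print(a.strides)  # (32, 8): one row is 4 floats * 8 bytes, one element is 8 bytes
print(a.flags["C_CONTIGUOUS"])  # True: one contiguous block of memory
print(a.dtype)    # float64
```

The strides are what let NumPy treat the single flat block of memory as a two-dimensional array.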
Because then you don't need to do any conversions when reading or writing the data, which makes all of the accesses much faster and easier. So how do you choose a file format? The most important thing to know is that there is no format that is good for every use case. If you look at this excellent XKCD comic, you can see how standards usually proliferate: you have one standard, somebody decides their use case isn't supported by it and creates another standard, and now you have two standards, and so forth. It's the same with file formats. There is no one file format to rule them all, no format that can satisfy every use case while being efficient and fast at everything. So you usually need to choose a file format based on a number of factors. A good checklist to keep in your head: choose a format that is easy to use for your use case, and that is fast for your use case. You might also want to choose based on what other people are using. There might be frameworks, or other people, already using a certain format, and you might choose based on that alone, because then you can easily collaborate and use their tools for your analysis. Another good question to keep in mind is whether you need the data to be human readable while you're working on it. For example, when we're working in Jupyter, pandas gives us a human-readable view of the data. But the data itself is bytes in memory; we use code to access it, and we don't need to understand how it is stored. So the data is stored as bytes in a binary format in the computer's memory, but we can access it using pandas.
And if you don't need to actually look at the file itself, you might not need to store it in a human-readable format, because if you want computers to do the analysis, you probably want the data in a format the computer can easily work on. The last question is: is this for archiving, or just for now, while you're working on it? A format for archiving might be completely different from the one you use while working on the data, because archive formats might not be as efficient as formats meant for temporary data.

I noticed that you use the word choose, or select, a data format, as in choosing from a set of pre-existing data formats. Is there a scenario where you would actually have to come up with your own data format? Or, better phrased, have you encountered such a situation in your work or in your studies?

Yeah, that's a really great question. I'd say I've never written the file format itself. I have once written what was basically a bad data format, but it was essentially just a dump of a NumPy array. What I have written is a convention: choosing which names to use and how to store the data. That is also a kind of format. If you choose certain conventions, say that the data will always be stored under the name "data" or something like that, you have chosen a convention. And many of the data formats we are going to talk about today are themselves basically conventions built on top of existing technology; they're just more well defined.
So in many cases you might end up choosing certain names and a certain structure, but still use existing dataset writers to actually write the data. But I haven't encountered a situation myself where I've written a data writer from scratch in C or something like that, because usually you don't end up with anything good that way. That's a really great question, and I'm not completely certain what the correct answer would be.

Yeah, exactly, and I have a very similar experience: I have never had to go deep into making my own data format or implementing one. So it's, as it very often is, a question of best practices and community practices. Usually there's a collection of standard solutions, and you choose the one that fits your use case.

Okay, let's look at the overview of the data formats. Here's a big list; you can click on any of these to go to the specific entry. There are different factors we've used to rate which format is good for which situation, but let's look at a few of the most common ones. If you grab the screen, we can look at pickles first. I noticed there's already a question in the HackMD that is basically about the first major downside of pickle, which is that pickles aren't secure. But let's show the key features of pickle first. Yes, a bit below; my scrolling doesn't work very well. The pickle format is basically Python's own serialization library. Serialization means taking some object and storing it to disk as it is. Pickle is the format Python uses for that, so you can put arbitrary Python objects into these files.
This is very good if, say, for debugging, you just want to see what your object has inside of it. But it has major downsides. The biggest is that it's very insecure, because it stores arbitrary Python objects: a pickle could contain an object that runs arbitrary code on your machine when it's loaded. So you should never use it to share anything with anybody, and you shouldn't load pickles from untrusted sources. But say you have code that crashes and you want to store the state of certain variables before the crash, to debug or to inspect the state of the system. Of course you could use a debugger, which is preferable, but if you're not at the machine, or you just want to store your objects so you can continue where you left off, you can use pickles. So let's try saving some random stuff in a pickle. While you were running your code, I ran it as well, so this is the same code. With pickle, you open the file object before you write into it, and then you dump the Python object there; you do the same thing to read it back. Here is the write operation, and here is the loading of what was written. One big downside of pickles: if you have a huge array in memory and store it as a pickle, the file is basically the same size as the original array, and loading it will use that same huge amount of memory. So if you do your analysis on a big-memory system, transfer the results to your laptop, and try to read them as a pickle, you might run out of memory. So pickle is easy, but it's not encouraged for long-term use.
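The dump-and-load pattern just described can be sketched like this (the object and file name here are made up for illustration):

```python
import pickle

# An arbitrary Python object: a dict mixing several types
obj = {"name": "experiment-1", "values": [1.5, 2.5, 3.5], "done": True}

# Write: open the file in binary mode and dump the object into it
with open("state.pkl", "wb") as f:
    pickle.dump(obj, f)

# Read it back. Only ever do this with files you wrote yourself!
with open("state.pkl", "rb") as f:
    restored = pickle.load(f)

print(restored == obj)  # True: the round trip preserves the object
```

Note the `"wb"`/`"rb"` modes: pickle files are binary, not text.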
But let's try it out; you can try it yourself. We have the first exercise here on how to use pickle. These exercises are very small and easy to do. I'll take the screen for a second. In exercise one, create some arbitrary Python object, a string or a list or whatever, then pickle it, read the pickled object back, and check that it matches the original. So let's take about five minutes on this, until 12:33. Let's continue. We're back. Hopefully you managed to do it; if not, well, there are lots of data formats to go through, so we won't spend too much time on the solution. If you look at the solution, it's pretty simple, basically what we were running previously. But let's now jump onward; I'll grab the screen. Pickle is nice and all, but let's focus on actual data formats that better represent our data, and in this case, on tidy data. If we have this kind of data in pandas, with the variables in the columns and the observations in the rows, how do we store it? The most obvious and most popular format is the comma-separated values format, CSV, or tab-separated values, or whatever separator you want to use: text data organized into columns with some separator. It's very popular because it's human readable and you can easily share it with people, but it's usually not the best format for storing big data, because it can get really big. Think about how many bytes you need to store one number as text versus how many bytes you need to store it as a binary floating point value.
It's a lot more once you have multiple decimals there. Still, it's very useful, especially for sharing data with other people, and when you really need to see the data yourself. And pandas has a really good interface for CSV files. So let's show how to write our example dataset into CSV. The dataset is the variable we made earlier, a pandas data frame containing a lot of stuff. So what happens? Yes, something happened. The file might go to a different folder depending on your current working folder, but in any case it stores a CSV file that you can then open up. NumPy has an interface for CSV as well: you can use savetxt to store NumPy arrays as CSV. But I wouldn't recommend it, because of the problem with data precision in CSV. Let's demo that; scroll a bit down and open the warning box there. The problem is that if you store numbers as text, 64-bit floating point values have a lot of precision, so you need a lot of decimals to store a number exactly, and with CSV it's very easy to truncate them. I have myself, in my time, written CSV writers that just write floating point values as text with, say, four decimal places of precision. That can work if you actually know what the errors in your data are and how they propagate. But especially if you're working with numerical code, you usually want your intermediate products, the stuff you have while you're working on the data, to be exact. Let me show you this example.
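The pandas CSV round trip mentioned above looks like this (the tiny frame here is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"city": ["Helsinki", "Espoo"], "temp": [3.2, 2.8]})

# Write human-readable text to disk; index=False skips the row numbers
df.to_csv("data.csv", index=False)

# Read it back; pandas infers the column dtypes from the text
restored = pd.read_csv("data.csv")

print(restored.equals(df))  # True for this simple frame
```

For a frame this simple the round trip is lossless, but as the next example shows, that is not guaranteed once you start controlling the number of decimals yourself.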
In this example, if you copy that, this one, yes: we take NumPy's square root of two and store it with a simple hand-written Python writer. We just write it out as a floating point value in text, and then we load it back in from this manually written CSV. We notice that it's not exactly the same number. If you try running this, you'll see that there's an error. It's not a big error, but it's there, and it will propagate: if you do something with your data, store it as CSV, read it back, and store it as CSV again, you might keep accumulating the error. In some cases it doesn't matter, but in many cases you don't want your code doing something you didn't intend. If I were into physics, say quantum physics, I would probably like to have no such errors. That's why CSV is usually not the best format when you're dealing with intermediate data. But CSV is a good format for sharing data, and if you just need something quick and dirty, it's a very good format, especially for tidy data. We're running out of time, so let's go a bit faster through a few of the more important formats you can use instead of CSV for tidy data. One example is the Feather format, which needs an extra package called PyArrow. It's very fast and very space efficient for storing tidy data temporarily, but it's usually not the best format for long-term storage. A better alternative for that, I would say, is Parquet. The Parquet format is made by the Apache consortium, and it's mainly used in big data.
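The square-root-of-two demo described a moment ago can be sketched like this, assuming a hand-written writer that keeps only four decimals:

```python
import numpy as np

x = np.sqrt(2.0)

# A naive home-made "CSV writer": only four decimal places survive
with open("value.csv", "w") as f:
    f.write(f"{x:.4f}\n")

# Read the number back in
with open("value.csv") as f:
    y = float(f.read())

print(x == y)      # False: precision was lost in the text round trip
print(abs(x - y))  # a small but nonzero error that can propagate
```

Binary formats avoid this entirely because they store the exact bytes of the float instead of a truncated decimal representation.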
It can store arbitrary data; big data warehouses store arbitrary binary data in Parquet. But because it's designed for that kind of big-data use, using it that way requires a bit more technical knowledge, and I maybe wouldn't do it if you're not familiar with the format. For tidy data, though, it is really easy. If you look at the example, pandas has a really good interface for Parquet: you just have the to_parquet and read_parquet functions, basically the same syntax as with CSV in pandas. Pandas has many of these to_something and read_something functions that you can use to write and read various data formats. Parquet is a good format for storing data while you're working on it: if your data is in tidy format, it's very easy to do. It requires you to install the PyArrow package; I'm not certain whether it's included in the Anaconda distribution nowadays, but it's readily available. And it's an interoperable format with support in other languages as well, so it's very popular. Okay, now that we have looked at tidy data a bit, let's have a quick exercise: write something into a CSV so that you can verify that this stuff actually works. A quick five-minute exercise on writing a CSV file. Of course you can do the Parquet example instead if you have PyArrow installed, but if not, the CSV one is a good one to do. Let's return at 47 past and then go through some array data. Bye for now. Hello. Hopefully you managed to do the CSV exercise. These exercises are really small, and you can do them later on, because, as mentioned at the start of the talk, there are a lot of these data formats and no single excellent one. We have to go through many of them, so the exercises can't reflect all of the available data formats, but you can run them later if you find a data format that suits you best.
Now that we have talked about tidy data, let's talk quickly about array data. With array data you usually have a big block of memory that you want to store as it is, as binary floating point numbers in one full block, and then recall later. You don't want to store it as a CSV file or something, because that would use a lot of space. Instead of a text file, you might want to use something like the NumPy array format for storing your intermediate results. It's included in NumPy; it's a binary format, and it stores the array together with the corresponding metadata, the shape and that kind of stuff, so you don't need to record separately how to interpret the data, as you would with a CSV file. If you try the example, it's a very easy thing to use: NumPy's save and load functions write and read these arrays. You give the file name and the object you want to store. It will store the array much faster and in much less space than a CSV file, and it will definitely keep the precision. There's also the NumPy savez function, which you can use to store multiple arrays into one .npz file. It's a ZIP-style archive that can hold multiple arrays, similar in its user interface to MATLAB's MAT files, if you're familiar with those. So you don't end up with hundreds or thousands of separate files. I still wouldn't store thousands of arrays there, but if you have multiple arrays, you can store them together and access them later: you can access the loaded .npz file like a dictionary and get the array you want. For really big arrays, it's better to use something like HDF5 or NetCDF, which we'll talk about later, because you might want to access only a part of the array at a time rather than the full thing.
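The save/load and savez workflow just described can be sketched like this (the file names and arrays are made up):

```python
import numpy as np

a = np.random.default_rng(0).random((100, 100))

# Binary .npy keeps the dtype, shape, and full float64 precision
np.save("array.npy", a)
b = np.load("array.npy")
print(np.array_equal(a, b))  # True: an exact round trip

# Several arrays can go into one .npz archive...
np.savez("arrays.npz", first=a, second=a * 2)

# ...and the loaded archive is accessed like a dictionary
with np.load("arrays.npz") as data:
    print(np.array_equal(data["second"], a * 2))  # True
```

The keyword names passed to savez (`first`, `second`) become the dictionary keys when the archive is loaded back.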
With NumPy's load you can also do memory mapping and that kind of thing to access only part of an array, but it usually becomes complicated; those other formats are better tools for these use cases. For sparse matrices, situations where you have a really big matrix that is mostly zeros with a nonzero only every now and then, you can store the data as a sparse matrix, and there are specialized save functions in SciPy for sparse matrices; if that is your use case, I would use those. But when your data becomes bigger, when the arrays become bigger, if you have physics code and that kind of thing, you usually want a format that is also interchangeable with other tools, because the NumPy format is NumPy specific. If somebody else wants to use your data, they're forced to use NumPy, and that's no fun. The most popular data format for sharing array data is HDF5, the Hierarchical Data Format version 5. You can store arbitrary datasets inside it in a folder-like structure: you create groups, and datasets inside them, and you can put arbitrary data there. It has a lot of complex features, because it was designed for HPC applications; you can create datasets of terabytes in one file, so the format needs to support that.

And you said that it's general, so the receiver of the data is not locked into using NumPy?

Yes, it's a general format: you can store data in various forms inside it, and anybody can read the format.
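The group-and-dataset structure described above can be sketched with the h5py package, one of the Python interfaces for HDF5 (the group and dataset names here are made up):

```python
import numpy as np
import h5py

a = np.arange(12.0).reshape(3, 4)

# An HDF5 file holds named datasets organized into groups, like folders
with h5py.File("results.h5", "w") as f:
    grp = f.create_group("simulation_1")
    grp.create_dataset("temperature", data=a)

# You can read back just the part you need, without loading everything
with h5py.File("results.h5", "r") as f:
    block = f["simulation_1/temperature"][0, :]  # only the first row

print(block)
```

The slice on the last read is the key feature for big arrays: only that row is pulled from disk, not the whole dataset.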
It's not good for random reads, though. If you're doing something that depends on random access, where you want to randomly jump around the data, it's not very good for that, but it is good when you work on big blocks of data. And there are a few Python packages that have good interfaces for it. Then we have NetCDF4, which is built on HDF5 but adds, how could I say it, a well-defined structure on top. It's very often used in a physics context where you have time and space dimensions: if you have atmospheric data or something like that, you have a time dimension and then 3D spatial dimensions, X, Y, and Z coordinates. It has well-defined metadata conventions hard-coded into the format itself, and it's useful if you're working with programs that accept NetCDF4. These are more specialized formats, but it's good to be mindful of them. There's a really great package called xarray, which has a good interface for NetCDF4. But we're running out of time, so maybe we'll leave exercise three, saving the NumPy array, as a home exercise; you can do it later on. There's a really great discussion in the HackMD about JSON, so let's focus on that. One of the more modern and important data formats nowadays is JSON, because it's basically, how could I say it, CSV on steroids. It can represent all kinds of structures, it's a human-readable format, and it's very popular on the web, for web requests and for all kinds of relational data: say you have a customer database and a product database, and that kind of setup where items are related to each other.
What did the customer buy, who tweeted what, which hashtags were they using: data that has lots of connections to other data. You don't want to force that into a tidy data format, because the matrix would become huge and most of it would be empty, since most items wouldn't have connections to most others. For highly connected data, JSON is very popular. Teemu can probably say more on this.

Yeah, JSON is kind of omnipresent nowadays. If you're working with data science or similar stuff, you'll probably encounter JSON every few minutes in some form. And like you said, especially in web development, basically all the data is transferred in JSON format. So it's very good to know and get familiar with; I would recommend it.

I'd also recommend keeping an open mind about what these formats are for, because there's been a big info dump here with lots of different data formats. Even if, say, you're doing physics and only deal with big arrays, you usually need some way to keep track of which simulations you have run, which analyses you have done, which parameters you used for these simulations, and which arrays come from which simulations. The old-fashioned way is to create folders and folder names like parameter_value_parameter_value and that kind of thing.
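A small sketch of the kind of nested, connected data that JSON handles well; the field names and values here are made up:

```python
import json

# Nested, relational-ish data that doesn't fit a flat table well
order = {
    "customer": {"id": 42, "name": "Alice"},
    "items": [
        {"product": "sensor", "qty": 2},
        {"product": "cable", "qty": 5},
    ],
}

text = json.dumps(order, indent=2)  # serialize to human-readable text
restored = json.loads(text)         # parse it back into Python objects

print(restored == order)  # True: dicts, lists, strings, and ints round-trip
```

Note that JSON only covers basic types (objects, arrays, strings, numbers, booleans, null); anything like datetimes or NumPy arrays needs to be converted first.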
I've done it myself, and I'm not ashamed to admit it, but the better way is usually to have something in a tidy format, or maybe a JSON file, that contains this metadata: the relations between the datasets, which parameters you used, and which array data corresponds to which parameters. Even if most of your data lives in some array format, in reality a lot of the important information is metadata: which values, which code, which parameters and arguments you used to create the data. That kind of stuff isn't something you want to store as a NumPy array; it isn't suitable for that format. So you usually need to mix and match these formats. You might have, say, HDF5 files as a back end storing the big arrays, and then a JSON file or something that describes what analysis you have done and which HDF5 arrays hold the data you need. Of course, some formats like Parquet, HDF5, and NetCDF allow you to put metadata into the file itself, right next to the data, but in many cases you need to mix and match different formats, so it's good to know about them. A few other data formats are mentioned on the page. There's Excel, which I won't recommend, but if you're doing something like social studies or economics, a lot of data is provided in Excel, and pandas has a good interface for it. And then there are lots of graph formats: if you have data that can be represented as a graph, you should check what the packages that analyze these graphs recommend using. There are many different formats depending on the structure of your graphs.
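The mix-and-match pattern just described, big arrays in a binary format plus a small metadata file next to them, can be sketched like this; the field names are an invented convention, not any standard:

```python
import json
import numpy as np

# Store the heavy array data in a binary format...
result = np.random.default_rng(1).random(1000)
np.save("run_0042.npy", result)

# ...and keep the run's provenance in a small JSON sidecar file.
metadata = {
    "array_file": "run_0042.npy",
    "parameters": {"dt": 0.01, "n_steps": 1000},
    "code_version": "v1.2.0",
}
with open("run_0042.json", "w") as f:
    json.dump(metadata, f, indent=2)

# Later, the sidecar tells you which array belongs to which parameters
with open("run_0042.json") as f:
    print(json.load(f)["parameters"]["dt"])  # 0.01
```

The point is that the parameters and relations stay human readable and searchable, while the bulk data stays fast and exact.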
So you should look into that. As a summary: there is no one format to rule them all. Think about how your code sees the data, and store the data in a way that is similar to how your code sees it, because that makes everything a lot easier. And follow best practices and community practices.

Yes. I was thinking of an analogy: you can send a vase to your friend in a letter envelope, but then you have to break the vase into small bits, and at the other side, when your friend gets the vase out of the envelope, they have to glue it back together to get the vase back. Or you could just use a padded box, proper packaging, and that would probably be a better format for sending it.

I get it, yeah. But Richard, I think we are at the end of the day. Yeah, so let's go to the notes. The feedback section is already there, so you can let us know what you thought. And let's see if there are any other questions to answer. There seem to have been a lot of issues this year with possible version conflicts, something in the lesson requiring newer versions than the software people had installed. That's something we definitely need to look at for next time. Yes, we'll talk about versions and dependencies tomorrow, or was it Friday, but a lot of this is related to that as well. Yeah, that's a good point; on Friday you'll see how to handle versions and environments. Anyway, were there any more questions? What about this one: how can you make a simple file with modules to load parameters?
Well, I think you're now in the ballpark where it starts to become another format once you start implementing that. I've seen many formats, especially in the machine learning or deep learning world, that try to encapsulate the environment used to create a model, together with the model itself, even having the software included with it, and it can become complicated. But I think the question here is actually how you get the import statements. Yeah, machine learning is the question. In that world there are plenty of formats; one common one is MLflow, which tries to store all of this, but there are others as well. Maybe this is a good question to bring up in the panel discussion on Friday. Yeah. With all the things here, we can put everything together and make something better: modules to load and imports we'll see on Friday, and different libraries tomorrow, and so on. And yeah, we need less copying and pasting; it also slows us down. Okay, should we resume on, what's the next day, Thursday? At the same time as today. We hope to see you there. Is there any preparation for tomorrow? I don't think there's anything extra, really. So, see you then. Thanks. Thanks.