OK, that should be enough at least to go with as background. So thank you everyone for joining today. Today we have the last, or fourth, entry in our mini R in Insurance webinar series with the R Consortium, hosted by George and myself. Today I'm going to talk, broadly speaking, about high-performance programming in R, but I've taken a very small and very practical slice of that: how you can use the arrow package to manage Parquet files and deal with increasing data sizes.

On the background, as already mentioned, today we have our fourth entry. We started with how you can transition from Excel to R, then continued with how you can move to a more production-ready way of using R, and finally how you can think about if and how to get higher performance out of R. That continues today, with a focus more on interacting with data and files as opposed to specific coding practices. The first two entries in this webinar series were delivered by George, who is with us on the call today, and the latter two are or will be delivered by me, Benedikt, both of us from the actuarial control department at Swiss Re Risk Management. We are very happy to be here today.

Maybe as a very brief introduction that some of you may have heard already: I've been using R since my university days, coming from a statistics and actuarial science background with lots of data analysis, and I've been happily using it ever since, from university through my whole working career. That's also why I'm very happy to talk about it today in this forum. With that, I'm happy to hand over to George for a quick introduction.

Thank you very much, Benedikt. Hello, everyone. This is George, heading the Model Development and Analysis team and the Atelier programme. We are sponsored by the Group Chief Actuary, Phillip Long, to help other actuaries within Swiss Re learn how to do programming for our day-to-day work. I discovered R at university in a couple of modules for statistics, but there we only had perfect data to work with. I forgot about it for eight years, then rediscovered how it can help us solve day-to-day issues, and here we are today: a large community of actuaries within Swiss Re working in a programmatic way on our daily data challenges. I'll hand it back to Benedikt. Thank you.

Yeah, thank you very much. As a reminder, last time and also this time, the topic broadly relates to the so-called experience study work that we do in insurance or reinsurance. We essentially write insurance contracts on a forward-looking basis, on whether something is going to happen in the future, and after some time has passed we check back on those events: did they actually happen, and how did our insurance contracts perform compared to what we expected when we wrote them? That is the background for today's topic.

Going to the actual content: I think it's probably no big secret that data volumes are generally increasing. Here we have some historical estimates and also projections from Statista showing that we are by now getting close to 100 zettabytes, where a zettabyte is a billion terabytes, of data being created per year.
And that is a very general, global trend, but it's also something that we very much see in the insurance space. Some of the examples that we highlighted earlier may be very easy to do even in a spreadsheet, or at least, if you need to do something like 300,000 times, that's still well within the capabilities of, let's say, a laptop. But some of the work that we're doing here, with an experience study background, simply doesn't necessarily fit into a typical computer or laptop anymore, and maybe we need to think about some alternative ways of wrangling or managing these kinds of data sizes.

A very basic strategy for dealing with increasing raw data is something we also very much see in, let's say, the music or video space: so-called compression, or lossless compression. Essentially, instead of going with the naive way of writing everything out, you think about other ways of representing the data so you can save some space. In the R or more general data science space, one of these ways of saving hard drive space, and potentially also memory, is a file format called Parquet. Opinions are split on pronunciation: some may say "par-ket", some may say "par-kay". Today I'm going to go with "par-kay", which hopefully is good enough, but if you hear "par-ket", that would be equally valid.

Parquet is, as mentioned, by now pretty much an industry standard that is widely used across many database and storage systems in the industry. It's often used in conjunction with other software systems, popular ones like Spark, and also with on-demand, database-like systems such as Synapse, where Parquet files serve as the storage medium. Very briefly summarized, the idea is that Parquet is a fast and efficient columnar storage format, meaning the data is stored by column. Most R users should be quite familiar with that idea if they think about a data frame, where normally each column holds a single data type. That is also how you can think about Parquet.

I wanted to show some of the tricks that Parquet uses in order to save space. The examples I picked are quite generic, so they apply in many other areas; you may have heard about them already, but if it's your first time they could be quite interesting. The first one is run-length encoding, or RLE. Think about a typical CSV file where the name of a person is repeated for a million rows: instead of repeating the name over and over again, you can store it much more compactly by saying, OK, here is the name, Benedikt, and then as a second item you note down how often it is repeated, in this case a million times. So you save quite a lot of storage: instead of writing Benedikt a million times, you just say it's the name Benedikt, repeated a million times.
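Just to make that idea concrete in plain R (Parquet applies this internally when writing the file, you never call it yourself; the vector and the name here are made up):

    # A column where the same value is repeated for a million rows
    names_col <- rep("Benedikt", 1e6)

    # The naive representation keeps one copy of the string per row
    length(names_col)

    # Run-length encoding keeps each value once, together with its run length:
    # lengths = 1000000, values = "Benedikt"
    rle(names_col)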
A second trick, which is also relatively obvious and which many may have employed already in the past, is so-called dictionary encoding. If you have some entry, some string, that is very long, then instead of writing it out every time you can map it, or abbreviate it, to an identification number, an ID. Depending on what strings you have, that normally saves quite a lot of space, particularly if they are repeated a few million times.

Finally, the maybe more interesting part about Parquet is that it also carries so-called metadata, information about the data in the file. One example of that: Parquet has so-called row groups. Instead of only holding the data on a row-by-row basis, while saving down the file it also calculates some additional aggregate statistics per group, which in this illustration is three rows but in reality is probably thousands or tens of thousands of rows, depending on the exact configuration. For each row group it saves down, for example, the minimum and maximum value. Then when you try to read data back, it uses those statistics to find out which row groups can be skipped entirely, so essentially you only check one statistic every thousand or so entries instead of having to check every single row for whether the data is relevant.

The important part about these technologies, or at least the background for today, is how you can use them in practice. One of the reasons Parquet files are particularly appealing to use with R is that there is very good infrastructure in place already. In particular, there is the arrow package. Arrow is a very broad project that is also supported in many other programming languages like Python, Julia and so on, and the R package version of it essentially gives you a very easy drop-in replacement for what you may have been doing with the more straightforward CSV, or comma-separated values, files.

Looking at a simple comparison: assuming that currently you are reading your files as CSVs with data.table, which is already a quite optimized way of reading CSV files, you may have some code like this: my_data is just data.table::fread, and then you point it at the file. If you wanted to achieve the same with arrow, you can make a more or less one-to-one replacement: instead of data.table::fread you use arrow::read_parquet, and you point it at your Parquet file, which probably has a different file ending, .parquet instead of .csv. Writing data works much the same way: instead of fwrite, the fast write, you can just use write_parquet, and you have, at least to my eyes, a quite similarly easy workflow.
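As a minimal sketch of that drop-in replacement (the file names are just placeholders, not the actual files from the example):

    library(data.table)
    library(arrow)

    # Reading: CSV via data.table versus Parquet via arrow
    my_data <- fread("experience_data.csv")
    my_data <- read_parquet("experience_data.parquet")

    # Writing the data back down again
    fwrite(my_data, "experience_data_out.csv")
    write_parquet(my_data, "experience_data_out.parquet")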
Then the question is, clearly, if you make this sort of drop-in replacement, what can you actually expect out of it? Coming back to our mini case study, the illustration we are using: assume you have a relatively large data set and you want to produce some summarized information. In this case we again have some information about our business, and maybe you want to filter it on a specific year, 2019 here, and then summarize the exposure, so how much we expect we could have to pay out, aggregated over certain benefits, lines of business, age groups, regions and so on. If you do it with dplyr, it may look something like the left-hand side.

If you combine the expressive power, the nice expressions, of dplyr with Parquet and arrow, it could look something like the right-hand side: instead of the read call you can, in this even more advanced version, use open_dataset, and then, if you do exactly the same summarizing operations, you just add a collect statement at the end, which is essentially what fires off the calculation, what actually makes arrow do the aggregation you asked it to do.

I think here we also had a question last time about larger-than-memory file sizes. The left-hand side assumes you have enough memory available to actually read the entire CSV file into memory and then aggregate from there. On the right-hand side, it's possible that this Parquet, or partitioned Parquet, data set is bigger than the memory you have available, or would be once loaded. But there is some smart logic applied to the operations you kick off with the collect call, which tries its best to fit into the memory you have available, even if the whole raw data set would be too large.

In numbers, here are some of the results we saw in one of the examples we looked at, as noted at the top and bottom left. We looked at a data set relating to a particular type of business, with roughly 25 million rows and 50 or so columns, and just under 10 gigabytes of CSV associated with it. For the simple things, as already mentioned, the basic CSV file gets quite a bit smaller, roughly to a fifth in this example. That's a nice saving, and if you have a lot of these data sets, or probably even larger data sizes, it can be quite significant how much hard drive space you save when storing your files.

But from a performance, or R, perspective, the more interesting metrics are how much time you can save when you actually try to do something with them. With the example code I showed previously: for just some generic reading, if you do it with fread and CSV you might take 50 seconds in this example, and you can bring that down to about a tenth by using read_parquet instead. And if you use the memory-mapping or larger-than-memory variants, as also shown on the previous slide, you can save even more time, because arrow doesn't try to read the whole data into memory but only some information about it, saving you here closer to a minute of time just on reading that file.

The other interesting opportunity is when you are not only reading your file into memory and writing it back down once you're done, but also doing typical grouping or aggregation transformations like the ones I showed on the previous slide. Even when you already have all the data in memory and you just want to do some dplyr aggregation, you can save quite a bit of time if you use arrow's ability to optimize these operations before everything is in memory. It very much depends on your data and what kind of aggregation you do, but you can easily save a factor of ten, or in the most favourable cases a factor of a hundred, of the time, if you are smart about how you manage your data and what you want to do with it.
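A rough sketch of the two variants being compared; the column names (year, line_of_business, exposure) and the paths are illustrative, not the exact ones from the case study:

    library(dplyr)
    library(arrow)

    # Left-hand side: read the full CSV into memory, then filter and aggregate
    result_csv <- data.table::fread("experience_data.csv") %>%
      filter(year == 2019) %>%
      group_by(line_of_business) %>%
      summarise(total_exposure = sum(exposure))

    # Right-hand side: point arrow at a (possibly partitioned) Parquet dataset.
    # Nothing is read yet; the query only runs when collect() is called, and
    # arrow only pulls in the row groups / partitions it actually needs.
    result_parquet <- open_dataset("experience_data_parquet/") %>%
      filter(year == 2019) %>%
      group_by(line_of_business) %>%
      summarise(total_exposure = sum(exposure)) %>%
      collect()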
Coming back from these examples to the main points of what I wanted to say today: as mentioned, Parquet, and particularly the arrow package, are quite valuable tools in your toolbox if you are working with R. It's also quite important to know that Parquet is a very popular standard across the industry, used a lot in what is broadly called the data science space, which actuarial or insurance professionals often describe themselves as being part of as well. Other than that, you can normally expect to save some time, be it just on reading and writing files, and you can get even more optimization when the file sizes are so large that they wouldn't fit into your memory anymore, or when you are smart about how you group or partition your files, for example by year: if you only need one year's data, it can be much faster to look up and read.

Finally, a word of caution. The first two entries in this webinar series, delivered by George, showed very typical and very important workflows of moving from Excel to R, or how R and Excel can live together. One challenge with Parquet is that, to my current knowledge, it's not well supported by Excel. So it may not be the ideal solution if the colleagues you're working with are still very much Excel-based, or if very meaningful parts of your workflow live in Excel. At the same time, if you're thinking of tens of gigabytes of data, it's very unlikely you're going to start with an Excel spreadsheet anyway; maybe you use Excel at the very end, once you have some very aggregated form of the data for presentation.

Thank you very much for your attention. I believe we can start with the Q&A soon; Zoom has a Q&A feature where you can write your questions. While you're thinking about them, and maybe writing them up, I'd also like to use some time to talk about the R Consortium. Swiss Re joined the consortium a while ago, I don't quite remember exactly how long, and in particular it gives us the opportunity to do events like today's, to speak to you and show the kind of work that we as an insurance or reinsurance company do in the R space. It also gives us the opportunity to support the general R community more broadly, through the many other events shown here, like R-Ladies, useR! and LatinR, and the many other ways in which the R Consortium supports the global R community. We are very happy to be part of it, and that's also something you can consider, by sending an email to Joseph as shown on the right. With that, let me see whether we already have some questions.

OK, we have a nice and very relevant question from Aaron; let me just quickly go back to this slide for that. Aaron asks: living in an Excel-dominated world, are there easy ways to convert in and out of Parquet? How do you approach this issue? Yeah, I think that's a very good and very relevant question. The good thing, with these examples on the slide, is that you can mix and match them as much as you want. Imagine that you started with a very large file and therefore chose to do large parts of your work in Parquet, until the data gets down to a manageable size again and you want to expose it to Excel or CSV, let's say. Clearly, once the data size is small enough, you still have the opportunity to mix and match. For example, you can just use data.table's fwrite to write the data you have in memory back down as a CSV file, which can then happily be picked up by Excel directly, or by a pivot table as its source data. Or, depending on what infrastructure you have available, you can also write it straight into some database system that is then accessible from Excel. So as long as you manage to get the data to a size that fits into your memory again, and even that isn't strictly necessary, you can also chunk CSVs, generally speaking you can easily switch between the Parquet and CSV formats you're working with.
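As a tiny sketch of that last hand-over step (the file names are again made up):

    # Once the result is small enough to sit comfortably in memory,
    # read it back from Parquet and hand it to the Excel world as a plain CSV
    small_result <- arrow::read_parquet("aggregated_result.parquet")
    data.table::fwrite(small_result, "aggregated_result.csv")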
OK, we have another question from Jill: "I was provided with a 15 gigabyte CSV file recently and was struggling with it using readr's read_csv_chunked, so today's material opens a new world for me." Good to hear; I think that was very much the background we found ourselves in too. Then a follow-up question: would asking my data team to write a Parquet file be a standard request? Generally speaking, I would expect so. Clearly the world is very diverse, and I don't know what company you're working in or what the IT infrastructure looks like, but at least from my own background and from what I've seen many companies in this area do: if, even five years ago, I had talked to someone in my IT team and asked, can I have this data as Parquet, they would have understood the request and probably, or surely, been able to fulfil it and give me a Parquet file instead of a CSV. I would hope you have the same experience, and to the best of my knowledge, depending on the infrastructure you're working with, the large cloud providers like Microsoft Azure or Amazon AWS definitely have very standardized Parquet offerings, so if you're working with them, that should be possible. I see Jill gave a quick reply; that doesn't seem to be a new question, but good that it helps.

Can I ask a question, Benedikt? The open dataset function, which package does it belong to? Apologies, you mean open_dataset? Yeah, open_dataset, yes; that's from parquet, right? No, that's from the arrow package. From arrow, yes. Yeah, thank you very much; that's also a very good clarification point. On the left-hand side I'm reusing the function from the previous slide, and on the right-hand side the idea was to use the arrow technology, so a function from the arrow package. I guess "arrow technology" sounds a bit grand for something that, from our perspective, is a function call at the end of the day.

And when you describe the right-hand side, you seem to have chosen your words quite carefully about how much is transferred to your local environment versus what is in the original data set. Are we saying that this filtering and summarizing happens somewhere in the middle? Where exactly does it happen? Essentially, we have the data stored on a hard disk, and the hard disk doesn't do calculations, but what you collect is a summarized version. So where is the summarizing and filtering done, and on whose computer? Yeah, that's also a very good question. Maybe starting with a very simplified example: imagine we had the same 10 gigabytes of CSV, let's say, with one gigabyte for each of ten years.
One way of partitioning the data is that for each year, or calendar year here, you have what essentially becomes one subfolder in your partitioned Parquet directory. Then if you just execute the first two lines and say, OK, give me this data set, but only the data for the year 2019, the arrow package is smart enough, using the metadata, to say: nine out of these ten subfolders do not apply, so I don't need to check those nine gigabytes of data, I only need to check this one gigabyte. Assuming for simplicity that you then wanted to load that fully into memory at the same size, you would only have needed one gigabyte of memory instead of ten, as opposed to the left-hand side, where you first get all ten gigabytes and then throw nine of them away again.

But isn't it true that this filtering and summarizing, on the server side so to speak, also happens even if you just have a single Parquet file and not a partitioned Parquet folder? Or does it always have to be partitioned in order to utilize this filtering and summarizing? Yeah, that's also a very good question. These optimizations also exist on the row-group level; that's one of the concepts I briefly alluded to on the earlier slide. So there is some smartness in skipping over rows on a single-file basis as well. I took the partitioned version as a particularly easy example where it's quite straightforward to see what happens; with row groups it depends on how your data is sorted and so on, so it can be quite difficult to know what actually happens in practice. Thank you; I'll stop here so other people can ask, since I can also reach you offline. Yeah, thank you very much.
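A minimal sketch of that partitioned set-up, assuming a data frame with a year column and made-up paths:

    library(arrow)
    library(dplyr)

    # Write the data partitioned by year: this creates one subfolder per year,
    # e.g. experience_data_parquet/year=2019/
    write_dataset(my_data, "experience_data_parquet", partitioning = "year")

    # When querying, arrow can skip the nine subfolders that don't match
    # and only reads the year=2019 partition
    exposure_2019 <- open_dataset("experience_data_parquet") %>%
      filter(year == 2019) %>%
      collect()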
We got another few questions coming in, one from James: what are the advantages of storing data in Parquet files versus storing it in a SQL table? That is also a very good question. To my understanding, the guiding principle is that just storing the data in Parquet files is comparatively cheap, because the only thing you really need is a hard drive, which you can often buy at very reasonable prices from cloud providers. The part that costs you more is when you actually process the files again, for example reading them into your R session on some computer or laptop, and there you can use the machine you've probably already paid for, in the form of your laptop or maybe a virtual machine you're using for other purposes anyway. Whereas for a SQL table, and there are many kinds of SQL but taking the most typical ones, you probably have a dedicated database server somewhere that needs to be running 24/7, and that combines processing and storing the data in one system. The SQL database will also store the data in some binary format and, when you query it, give it back to you in some hopefully smart way. However, you need to have that SQL database system installed on some server in order to get at the data, which can be inconvenient in cases where you just want to benefit from cheap storage on hard drives and are happy to use your existing computing infrastructure for reading. At the same time, SQL servers or relational databases can give you other advantages, so it's definitely a trade-off.

OK, we got another question, from Rob: once the data is read into memory from a Parquet file, is the data stored the same way it is structured within a data frame? With the example I showed here with read_parquet, you actually have two options. I don't think I showed them, but the default is that it reads the data into memory as a table that is more or less identical to a data frame, with the appropriate data types. There's also the option to read it as a so-called arrow Table, with a capital T, which is a specific arrow data structure with somewhat different characteristics from typical data frames. So there are more detailed options available to you, but the default is just a regular data frame.

We got another question, and I hope I pronounce the name correctly: how do the R-native data formats, .Rdata and .rds, compare to Parquet and CSV files? Is there a reason to avoid them? That's also a very good question, starting from even further back. As you already say, these are R-native data formats, so RDS and Rdata are very much geared towards R: you essentially need R itself to read the data back into an appropriate format, like this example's data frame. But you can also use these Rdata or RDS binary, R-specific formats to save down models, say, or any other kind of object or variable you may have in your R session. From an R perspective they're very flexible; however, that comes at the cost of interoperability. While R itself may be quite good at understanding RDS files, other systems, or maybe some colleagues you're working with, may not be as happy about it. A typical workflow, similar to what another person asked earlier about being given a CSV file, is that nowadays you could be given a Parquet file instead, because it's well supported across many different programming languages and technologies: hand someone a Parquet file and someone with Python can read it, someone with R can read it, someone with Julia, someone with C++, and so on. With an RDS file that may not be as easy, so I would see a strong preference for CSV and Parquet when you want to be able to use many different tools. Other than that, on a pure compression and read-speed basis, RDS can be quite good at compressing, but things like the optimized aggregations I showed here are clearly something RDS cannot do: you would always have to read all the data back into memory, so you wouldn't get those optimized aggregations or the larger-than-memory advantages, which, to be fair, you don't get with a CSV either in most cases.
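A small sketch of the difference, with made-up objects, columns and file names: RDS can hold arbitrary R objects but needs R to read them back, while Parquet holds tabular data that other tools can read too.

    library(arrow)

    # RDS: any R object, e.g. a fitted model, but only readable from R
    model <- lm(exposure ~ age, data = my_data)
    saveRDS(model, "model.rds")
    model_again <- readRDS("model.rds")

    # Parquet: tabular data, readable from R, Python, Julia, C++, ...
    write_parquet(my_data, "my_data.parquet")
    df_again  <- read_parquet("my_data.parquet")                        # a data frame by default
    tbl_again <- read_parquet("my_data.parquet", as_data_frame = FALSE) # an arrow Table instead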
It seems like that was all the questions for now, and as far as I'm aware we are probably also going over time. So thank you everyone for joining today. I hope this particular webinar, and also the other entries in the series, were helpful to you. Let me also give George a minute, or a second, to make any comments he may have on his earlier entries. No particular comments; we hope to find a way to share most of the material we have from the webinars in a way that can be useful to you, and thank you very much for attending, and thank you very much, Benedikt, for co-presenting, and Elenia for her support. My pleasure. My pleasure as well. Thank you all. Thank you very much. Thank you. Bye.