But yeah, I guess we can get started. So hello everyone, and welcome to this week's edition of the R/Insurance webinar series from the R Consortium. Today we have a short introduction to R performance culture, or at least how I think about code performance in R.

As a bit of background: this is the third part of the R/Insurance webinar series, and it is called "R Performance Culture". Next week we have another piece that is somewhat related, but looks at a particular component of it with a concrete example. The first two pieces were delivered by my colleague here, George Spakalukas. George and I work for C3, in Actuarial Control, which is part of C3's Group Risk Management. And since we both work for C3 at the moment, the views we express today and the content of this presentation are for your information only and not any official view of C3 or any related entity.

As a brief introduction: I'm Benedikt Schumbeger. I've worked for C3 for a number of years now, eight or so, and recently joined the Model Development and Analytics team, which is also behind the Atelier programme; you can read an interview with George about it online. I've been using R since my university days, and I come from a statistics, quantitative finance and actuarial science background. With that, I'll hand over to George briefly for a short introduction.

Hello everyone, this is George Spakalukas. I've been at C3 for 15 years now and am relatively new to R: a brief introduction at university, then I didn't use it for several years, and then six or seven years ago my chance came along, and the way it transformed my workflow from Excel to R was a very, very rewarding experience. Hopefully now, with the sponsorship of Philip Longa, our Group Chief Actuary, we're able to share that experience with others internally, and with the help of the R Consortium we hope to connect with the wider industry and share experience and knowledge. Thank you very much, Benedict, over to you.

Yeah, thank you very much. As additional background, while the two of us have the honor of presenting a slice of the work that we do today, we're also backed by a very strong R community across C3, now also called Atelier: communities of people who interact with R coders and the content they produce, probably well more than a thousand people, and a few hundred, 500 or so, active R coders. So there's a strong community there. And the example, some of the background of what we share today, relates to something called experience studies.
The basic premise of insurance is that, as an insurance company, you offer a contract to someone who then becomes a policyholder, and you don't really know what's going to happen to that policyholder over the next, let's say, year or so, or how much they're going to utilize that insurance contract or not. You come in with some assumptions about what's likely to happen, but a year or so later you essentially take stock of what did actually happen, and clearly not only for a single person but over what you call a portfolio, so over all the different policies or insurance contracts that you have written for a particular region, such as the UK or Germany, and a particular line of business. When you think about something like that, which may involve quite big data sizes, that brings us to today's topic.

There's a quite famous quote that premature optimization is the root of all evil, and I believe that's also what I often see in the context of R. You have someone transitioning their work, maybe from Excel to R, most of the time because whatever they've been doing, the data sizes or the complexity of the calculation they want to do, outstrips what Excel can easily handle. So what they're finding is: oh, Excel can't do it, or Excel is very slow, so now I need to find something that is very fast, and they look for the most performant thing to do what they do. At the same time, I think the truth is that, at least for typical problems, you probably don't really need to think that much about optimization; there are other considerations that are much more important. What we're going to talk about today is what's here called the critical 3%: what can you do in those few rare cases where you've actually identified a real need to be fast and you would like to use R?

That also brings us to one of the common challenges. R has been designed for flexibility; it tries to enable you to do many things, and that flexibility, particularly from the perspective of metaprogramming, so programming with programs, is something that enables many of the very powerful tools that we have at our hands today. Particular things like the tidyverse are only possible because R is quite flexible and you can express what you want to do in various different ways. At the same time, there is a fundamental trade-off: if you try to be very flexible, then you're probably not the best at any one thing; you cannot be the most flexible and the best specialist at the same time. So what you would expect, and what is true, is that R is also not that fast, generally speaking, but at the same time it can be fast enough for most of the things that people actually do.

Here's a particular example, a relatively well-known benchmark. As you can see, many different programming languages, packages and libraries have been compared, and the comparison is a typical database-like data manipulation: you want to group some data by some attributes, or maybe you want to sum over all the values in a particular country, so you had more granular data and you want a less granular aggregation of it.
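To make the kind of operation being benchmarked concrete, a minimal sketch of such a grouped aggregation, on a small synthetic table with made-up column names, might look like this in dplyr and data.table:

```r
library(dplyr)
library(data.table)

# Small synthetic stand-in for a much larger claims table (hypothetical columns)
claims <- data.frame(
  country     = sample(c("UK", "DE", "FR"), 1e5, replace = TRUE),
  product     = sample(c("term", "whole_life"), 1e5, replace = TRUE),
  paid_amount = rexp(1e5, rate = 1 / 1000)
)

# dplyr: group by attributes and sum to a less granular view
claims |>
  group_by(country, product) |>
  summarise(total_paid = sum(paid_amount), .groups = "drop")

# data.table equivalent
as.data.table(claims)[, .(total_paid = sum(paid_amount)), by = .(country, product)]
```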
Here you can find that there are some frameworks or libraries like Polars, written in a programming language called Rust and mostly used from Python, and then the famous data.table, written in C and mostly used from R, but also other things like DataFrames in Julia, real database systems like ClickHouse, and the infamous Spark. And what you can see here is that maybe R isn't the fastest option, and this data is by now slightly old, so clearly, depending on what you're doing and how technology evolves, the exact ranking will move around. But what I take away is that R can probably be fast enough: while it may not be the fastest option for a particular problem at a particular point in time, it probably is fast enough for what you're actually trying to do.

On top of having this sort of performance on paper, R also gives you several different tools to improve the performance of the code you're working on, starting with the documentation, or the general tips and tricks component. There's a very good e-book from Hadley Wickham, Advanced R in particular; that book has a chapter on performance tuning in R that I've linked in the slides. There you can learn some general tips and tricks and general ideas about how to tune the performance of R, or how to adopt patterns and workflows that are generally believed to be faster.

On top of that, there are also many very practical tools if you want to make your R code faster. One of them relates to code profiling, where you have a script or a piece of code, or in most cases you've written a function, and you have determined that your function isn't fast enough, so you want to see which parts of that function are slow. Say your function, as in this example here, is only four or five lines of code, and you don't really know which of those is the expensive one that slows you down. You can use a tool like profvis, for code profiling visualization, to see which of those lines is actually the one you spend your time on. In this mini example there is quite clearly a single line, the fitting of the linear model, that takes all of your time, so if you want it to be faster, that's probably where you have to look to improve.

And assuming you've come up with some alternatives and want to compare them: as with many things, there are quite a few different ways of achieving your goal, and then it becomes about what is the most appropriate for the situation. If you're driven by speed, by how fast the code executes, there are tools like the bench R package or the microbenchmark R package that give you very easy ways to run a line of code, or run your function, and see how it compares to alternative ways of achieving the same goal. Based on that, you can make your choices.
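As a minimal sketch of that workflow, not the slide's actual code: profile a small toy function with profvis, then compare two ways of doing the expensive step with bench::mark():

```r
library(profvis)
library(bench)

slow_fn <- function(n = 1e5) {
  x <- runif(n)
  y <- 2 * x + rnorm(n)
  fit <- lm(y ~ x)   # in this toy example, typically the expensive line
  coef(fit)
}

profvis::profvis(slow_fn())   # interactive, per-line view of where time is spent

# Compare alternatives for the expensive step
x <- runif(1e5)
y <- 2 * x + rnorm(1e5)
bench::mark(
  lm     = coef(lm(y ~ x)),
  lm_fit = .lm.fit(cbind(1, x), y)$coefficients,
  check  = FALSE   # outputs are structured differently; compare timings only
)
```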
But on top of these tools and ways of thinking about making your code faster, I think it's also quite important to keep other considerations in mind at the same time. In particular, what may happen is that your quicker code is more complex, and complexity is hard to measure. There are some quantitative guidelines, like cyclomatic complexity, developed mostly in the computer science space, that try to measure how complex a piece of code is. But there are also other dimensions. In the R universe the tidyverse is very popular, and there are millions of people across the globe who know the tidyverse and, let's say, dplyr, so a piece of code written with dplyr is generally thought to be less complex than some other piece of code written in some niche package that only a thousand people in the world know, simply because many more people are familiar with it and the next person reading it has a much better chance of getting it quickly. Similarly, there's the more straightforward aspect of just how many lines of code you have written: if you have a faster piece of code but you need to write twice as many lines, is that really worth the cost? The next person coming in needs to read twice as many lines of code, you have many more opportunities to introduce mistakes, it just takes the next person much longer to get up to speed, and maybe they never really fully understand it because they don't have enough time to read all the code. On top of that, next to the code itself, there's how well it is documented and how user-friendly it is. That's both at the level of what you're using, so if you're using some existing R package in your code, how well is that documented and how well known and user-friendly is it, and at the level of the code you produce yourself: you clearly have to make sure that the person using it can also easily feel comfortable with it.

With that, going back to our specific case study: I'm going to show you four very short snippets of code, just to make this a bit more tangible, with different popular ways of achieving the same or a very similar goal in R. As I already mentioned, it comes from a case study that we did around having relatively large data sets, a few tens of millions to 100 million rows, and wanting to do some calculations on top of that. One basic piece of that is simply, for an insurance or reinsurance company, to calculate how much risk you are exposed to: how many of those policies are active at a particular point in time, or which period does a policy belong to? As you can see, it's only a very small snippet of that, and also a simplified version, so there is no input checking or error checking or anything like that; it's just some simple part of the logic.

If you use the dplyr version, it's relatively straightforward. I've split it into two mutates because the first one is maybe a bit more complex, an if_else, because you're filtering on some specific rows, but the remaining part is very much straightforward, so it should be relatively easy for the next person to understand what you're doing here: you just have some date, you do some date manipulation, you extract some components out of it, or calculate some minimum or maximum against something else.
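As a hedged illustration of what such a dplyr snippet could look like: the column names and the exact exposure logic below are hypothetical, not the actual case-study code:

```r
library(dplyr)
library(lubridate)

# Simplified, illustrative exposure logic; no input or error checking
exposure_dplyr <- function(policies, period_start, period_end) {
  policies |>
    mutate(
      # keep the lapse date only where it falls inside the study period
      end_date = if_else(!is.na(lapse_date) & lapse_date <= period_end,
                         lapse_date, period_end)
    ) |>
    mutate(
      start_date     = pmax(issue_date, period_start),
      exposure_days  = as.numeric(end_date - start_date) + 1,
      exposure_years = exposure_days / 365.25,
      study_year     = year(start_date)
    )
}
```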
The typically next more complex version, which can already be fast, is then a dtplyr, or data.table-dplyr, version. For those of you who don't know it yet, there is a nice package called dtplyr that tries to give you the best of both worlds: the nice syntax of dplyr while also allowing you to get some of the performance improvements that data.table is typically known for, and that I showed at the start. Here you can see this version achieving the same thing. Some things to highlight: you can see at the start I need to introduce an additional line of code to convert my data to a data.table, and I also have that lazy_dt() line, so a lazy data table, that I need to introduce. I need to be a bit more careful about my mutates, too: if I want to use a newly created column in the next mutate, I have to make sure that I split up those mutates, and at the very end I need to add an as.data.frame(). But broadly speaking, this is at a very similar complexity while only introducing a little bit of overhead, which may already give you quite strong performance increases.

The next most complex version is then real data.table code, and again there are many ways of achieving your goal, but here you can see there's a lot more going on. There are many more brackets, mostly empty, sometimes with a non-empty row-filtering argument, and then there's data.table's j argument, as you can see, assigning new values to columns in place. There's definitely a lot more complexity, a lot more brackets going on. As mentioned earlier, for an experienced data.table user this may not be more complex than the dplyr version, but given the relative prevalence, it's probably easier to maintain the dplyr version or the dtplyr one than this data.table one. (A rough sketch of both the dtplyr and data.table styles follows below.)

Finally, the, let's say, benchmark that we created for these simple cases is a version that utilizes R's C API. C is a separate programming language that is mostly known for its simplicity and speed, and it's also what R is based on, and you can utilize it in your own code if you want to. At the same time, as you can maybe already see here, the code becomes quite a bit longer: I've already redacted quite a few parts of it, and you only see a small piece of the calculation compared to what you saw in the previous versions. Here you can see that maybe the R code itself becomes very easy, because it only calls some C function instead, but in the C function there's a lot of overhead: you need to include some C header files and, in order to be faster, deal with the OpenMP framework, which is a parallelization framework, and then there's a lot more complexity going on to achieve the same thing. But it can be much faster.

All these options are available to you, and you can do all of them in R to be very fast. For me the most important takeaway is that R can be very performant, and you can, in many cases, also scale it, at least to a reasonable size. There are discussions about the petabyte scale of some technologies, and I think those are not the sizes we typically see in an insurance context, but for the data sizes we actually encounter R can be very performant and scalable. You can do that just with one of the easy tidyverse tools, or you can add a little bit of sophistication with data.table or dtplyr to get more speed out of it. Also, being performant is relatively well supported in the R ecosystem, so you have many other tools at your disposal, like the code profiling visualization tool I mentioned earlier or the micro-benchmarking tools, to give you some ideas about how to be faster if you have really determined that you need to.
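To make the comparison concrete, here is a rough sketch of the same hypothetical exposure logic in the dtplyr and raw data.table styles; again illustrative, not the slide's actual code:

```r
library(data.table)
library(dtplyr)
library(dplyr)

# dtplyr: dplyr verbs on top of a lazy data.table
exposure_dtplyr <- function(policies, period_start, period_end) {
  lazy_dt(as.data.table(policies)) |>
    mutate(end_date = if_else(!is.na(lapse_date) & lapse_date <= period_end,
                              lapse_date, period_end)) |>
    # split mutates so the newly created end_date can be reused
    mutate(start_date    = pmax(issue_date, period_start),
           exposure_days = as.numeric(end_date - start_date) + 1) |>
    as.data.frame()
}

# Raw data.table: columns assigned in place via the j argument and `:=`
exposure_dt <- function(policies, period_start, period_end) {
  dt <- as.data.table(policies)
  dt[, end_date := fifelse(!is.na(lapse_date) & lapse_date <= period_end,
                           lapse_date, period_end)]
  dt[, `:=`(start_date    = pmax(issue_date, period_start),
            exposure_days = as.numeric(end_date - start_date) + 1)]
  dt[]
}
```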
But at the same time, you definitely have to be mindful of the trade-offs that you're making. Speed is, most of the time, not the first of your concerns: only once you know that you're doing the right thing, and you also have reasonable confidence that what you're doing is easy enough for the next person who has to look at it to understand and maintain, that's when you can think about speed. With that, I think that brings us to the end of the presentation part, and I'm happy to take some questions.

Ben, can I ask a question while we are waiting for other people to post their questions?

Yeah, that makes sense while people are typing.

Can you please comment on the performance of R itself when it comes to discussions about it being a single-threaded programming language, what that means in practice, and how the impact of that can be mitigated?

Yeah, I think that's a very good question. R itself is single-threaded. Single-threaded essentially means it can run one line of R code after the other; a multi-threaded language, by contrast, would be able to execute two or more lines of code independently of one another, in parallel, so at the same time. Broadly speaking, being single-threaded has the effect that it makes programming easier, because you know that line 5 gets executed before line 6, it will always be that way, and there's no risk of things getting jumbled up. At the same time, as already mentioned, there are packages or frameworks like data.table that are multi-threaded themselves: while you can only run a single line of data.table code at a time, data.table itself is able to parallelize its computations over as many CPU cores as you have available. So while at the highest level R itself is single-threaded, you can still utilize the power of your computer at a lower level. Similarly, there are other frameworks like the parallel package or the future package which use other approaches to give you different ways of parallelization within R. While R itself is still single-threaded at the highest level, you can utilize some multi-threading, some more performance, underneath.

Thank you very much. Maybe going to the questions, the first one from an anonymous attendee: do you have any experience with parallel computing using R, and could you share some best practices?

I think that fits very much into the question that George just highlighted. Generally speaking, as I mentioned, I'm personally a regular data.table user, and many of the operations that data.table does are already parallelized in C code with that OpenMP framework I just mentioned, so you get the benefit of parallelization while R itself stays in a sequential, single-threaded mode. I think that is probably enough for most people. If you are outside the data.table world, or related worlds, and still want some benefit from parallel computing, there are again the parallel package, or also related packages like future.apply or furrr, that make it very straightforward. With respect to best practices, that's a good question: we have internally developed some best practices and written them down, on how we think people at C3 can benefit from multi-threading with the different packages that are available.
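A minimal sketch of those options, assuming the data.table, parallel and future.apply packages are installed:

```r
library(data.table)

getDTthreads()    # threads data.table's C code will use
setDTthreads(4)   # e.g. cap it on a shared server

# Explicit parallelism over independent tasks with the parallel package
library(parallel)
cl  <- makeCluster(4)
res <- parLapply(cl, 1:100, function(i) sqrt(i))
stopCluster(cl)

# The future ecosystem offers a similar interface
library(future.apply)
plan(multisession, workers = 4)
res2 <- future_lapply(1:100, sqrt)
plan(sequential)
```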
With respect to general best practice, I'd probably say that if you're using something like data.table or dtplyr, you'll probably already benefit from parallel computing quite well. If you need more than that, looking into something like the furrr package and how it handles parallelization is a quite sensible step, and hopefully that's enough to get you started.

From Noelle: I noticed your samples didn't have any comments anywhere in the code. Do you believe in only code that is self-documenting, and when do you use comments to explain the functionality of some part of the code?

I think that's also a very good question and a very good observation. To a large degree it is governed by this being a presentation: as I mentioned, it's only a small snippet of the code, highlighting some of the simpler functionality. I would say, if you have something like the snippet shown at the moment, the dtplyr version, I would hope that it is self-explanatory enough that no additional comments are needed beyond the code, given that dtplyr tries to be very expressive already, and given that you follow very reasonable naming rules for the domain that you're in, with good naming conventions for the columns and variables that you're using.

Do I believe only in code that is self-documenting? There are different schools of thought here. I personally like to use comments when I do something that may not be completely obvious or self-documenting. As I said, with something like dplyr or the tidyverse we often benefit from very evocative wording, like mutate and if_else and other names that hopefully make it quite obvious what you're trying to achieve. But some of the comments, like here in the C code, saying that, oh yes, we are using OpenMP, that is maybe more opaque and not as immediately obvious to a user. Most of the time I think comments in code are useful for very unexpected behavior, or where you do something that seems a bit odd but is necessary due to, let's say, still supporting some legacy behavior of your code, or some very specific configuration that is only important for special cases. Highlighting those is quite valuable; otherwise I'd hope that most of the documentation of the code is done via good naming and good splitting of what you're trying to do. And then the part that I find very important is the actual documentation: with R code that is often done in the roxygen2 framework, so you should definitely try to spend a lot of effort documenting your parameters, what your function expects, returns and does, and also writing some examples, because that's probably what most actual users will look at. I hope that gives some answer to that question.
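As a small, hypothetical illustration of that kind of roxygen2 documentation (the function and its columns are made up for this example, not our actual code):

```r
#' Calculate exposure in years for a set of policies
#'
#' @param policies A data.frame with columns `issue_date` and `lapse_date`
#'   (`lapse_date` may be `NA` for in-force policies).
#' @param period_start,period_end `Date` scalars delimiting the study period.
#'
#' @return The input data.frame with an additional numeric column
#'   `exposure_years`.
#'
#' @examples
#' policies <- data.frame(issue_date = as.Date("2020-03-01"),
#'                        lapse_date = as.Date(NA))
#' calc_exposure(policies, as.Date("2021-01-01"), as.Date("2021-12-31"))
calc_exposure <- function(policies, period_start, period_end) {
  end_date   <- pmin(policies$lapse_date, period_end, na.rm = TRUE)
  start_date <- pmax(policies$issue_date, period_start)
  days       <- pmax(as.numeric(end_date - start_date) + 1, 0)
  policies$exposure_years <- days / 365.25
  policies
}
```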
And then a question from Archbold: could you please summarize the main approaches, tools and techniques you use to improve the performance of R? What runtimes have you observed in practice for complex models that use R?

With respect to tools and approaches, I hopefully mentioned the most important ones here; I think those are the typical go-tos. There's the education piece: hopefully you've had some experience with R, or have had the opportunity to read some of the nice material that many in the community produce, so you have some idea about the typical approaches that normally yield slow R code. Once you're beyond that, as mentioned, the first goal should be that you have correct code that does what you want it to do, that is solid and, going back to the previous question, well documented and maybe commented, so that it's very understandable. After that, if you actually realize in practice that the runtime isn't as good as you want it to be for your use case, then you can look at some of the tools like profvis.

To take a real example we had: someone had written a function that was doing something in a loop, and the person essentially said, okay, running this function takes, I don't quite remember any more, they ran it overnight, so-and-so many hours, and that's too slow for my use case, can you take a look? So you run profvis over it, probably on a smaller sample, because you don't want to wait eight hours every time; you cut it down a bit so you can actually iterate in a reasonable amount of time. I then found a particular part that was very slow and did exactly as described here: for the slow part I tried some alternative approaches of achieving the same thing. Maybe I'm a data.table guy, but I found an alternative data.table approach that reduced it from something that took several minutes to single seconds, and in the end the whole runtime went from so-and-so many hours to a few minutes, I think it was eight or so. I think that's a very potent starting point. As for how to find those alternative approaches, it's normally Google, or again those quite strong open-source contributions from people like Hadley Wickham, that are important for getting some idea about typical issues and faster alternatives.

The last component, about runtimes of complex models: for me that's a very difficult question to answer, given that what a model is is very diverse. A model could be a very simple deterministic calculation, maybe like what George showed, and there you can run your expected monthly installment model for a few hundred thousand policies in a minute, so that's probably good enough. There are other models where having, probably not a single, but a thousand realizations of a complex Monte Carlo model can already take you minutes to hours, and I've seen both. Developing code takes time, so you would have to come across something that is slow and used often enough to make your investment into better code actually useful. If you have something very slow but you only run it once a year, it probably doesn't matter that it's very slow; if you have something very slow and you run it twice a day, then that may be a worthwhile investment.

Okay, we have another question from Jinye: hey Benny, do you have a view on good use cases for Rcpp? Same question for the use of data formats that support partial reads.

That's also a very good question, and there are probably different preferences coming out here. As I mentioned in my introduction, I've been using R for a long while; in particular, I learned R when there was no tidyverse and Rcpp wasn't really developed yet, so I learned to make do with base R and also the R C API. With that background, I probably find alternative approaches like cpp11 more appealing, because they have less of the overhead of something like Rcpp and are closer to writing only C or C++ code. But I can understand that, depending on your familiarity with the different approaches and where you're at, that can be quite different, and Rcpp definitely makes it very easy to port a simple algorithm from R to some mostly equivalent C++ code. Broadly speaking, where I find good use cases for this sort of lower-level code is when you have some sequential algorithm, so essentially a loop where the next iteration depends on the previous value. These can be more difficult to implement in a nicely parallelized or nicely vectorized way, depending on the exact calculation, so it may make sense to take just the smallest part of the algorithm, the part with the loop, and implement that with Rcpp, or cpp11, or just the C API, while you still use the power of R for the argument checking, the data checking and all the niceness it does.
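As a hedged sketch of that pattern, assuming the Rcpp package is installed: a toy sequential recursion, where each value depends on the previous one, moved into a few lines of C++ via Rcpp::cppFunction(), while a thin R wrapper keeps doing the argument checking:

```r
library(Rcpp)

# Toy recursion: out[i] depends on out[i - 1], so it is awkward to vectorize
cppFunction("
NumericVector cumulate(NumericVector x, double decay) {
  int n = x.size();
  NumericVector out(n);
  out[0] = x[0];
  for (int i = 1; i < n; ++i) {
    out[i] = decay * out[i - 1] + x[i];
  }
  return out;
}
")

# R keeps the argument checking and the user-facing interface
cumulate_safe <- function(x, decay = 0.9) {
  stopifnot(is.numeric(x), length(x) > 0,
            is.numeric(decay), decay >= 0, decay <= 1)
  cumulate(as.numeric(x), decay)
}

cumulate_safe(c(1, 2, 3, 4))
```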
For partial reads and parquet, I think that's probably an easier sell. For me it's relatively typical to have very large CSV files or large databases, and often what you want to do is some aggregation, or let's say you want to filter the data to only have a few months or a few weeks at a time. Those are typical use cases where parquet can be quite nice. Next week I'm actually showing a use case that we had for that, so I'll just leave the rest of the answer for next week's webinar.

Then we have a question from Michael: when considering whether to write more performant code and moving from R to C, do you think about implementing this (a) at the start of the analysis, (b) only when you run into performance issues, or (c) only once you are reaching the production stage?

That's also a very good question. From my perspective, probably not at the start, I'd say, and it links nicely into the previous two questions. A basic consideration for me is always: how often is this code going to be run, and how much time is there even to save by having more performant R code? If you have something that takes one minute and you can make it much, much faster, down to one second, that's nice; but if it is only being run, say, 100 times in total, then you've saved 100 minutes, and how much code can you really develop in 100 minutes? Probably not that much. So you really need to think first about how much time saving is even possible before you invest a lot in the implementation. That speaks to the start of your analysis: how often will this run, and how much saving could there realistically be from more performant code? The other component is, I guess from experience, identifying the kinds of code that tend to perform poorly in regular R, say some sequential loops that are not easy to parallelize or vectorize. If you know you're facing that and you'll run it very often, then maybe even from the start you're thinking about higher performance. But realistically, the main part is really (b): you have written correct code, it does what it does, it is robust enough, and now you find that, overall, running all of it is too slow for your expectations and use cases. That's really when I go into, I guess, number three shown on the slide here, and say, okay, where can we pick the specific pieces to make faster to get within our expectations?

Okay, and we have a final question from Ivan, which I hopefully pronounce correctly: do you utilize object-oriented programming in R in your real-world projects, e.g. R6 and S4, and does it have an effect on the performance?
I guess the easy answer is yes to both. I have personally used, and seen used, R6, S4 and S3 in real life, in practice. And does it have an effect on performance? Here I'd come back to what I tried to highlight in the rest of the talk: it probably has some effect on performance, but it didn't really matter in practice. Where I've been utilizing these object-oriented programming frameworks, it wasn't at the levels where performance matters, so the performance wasn't really affected. To some degree it's also my personal preference: I'm more of a functional programming guy than an encapsulated object-oriented programming guy, so maybe the approaches I choose are already more geared towards functional programming, instead of, say, having some agent network that I would try to simulate in an object-oriented style. On the other hand, maybe I have some higher-level structure in S4, let's say, but then the actual long-running part of it is some data transformation or something like that, and that is then implemented in data.table, let's say, or dtplyr. So while there is some performance impact from the choice of using R6, S4 or just regular functions, the actually meaningful part of the performance work is probably relegated to something lower-level anyway, so it's probably not that relevant to the performance discussion, at least not in what I've seen.

Okay, I think that brings us to the end of the questions, and I think we have also overrun our slot a little bit. So thank you everyone for staying with us and for the questions and interest, and I hope to see you again next week. Thank you everyone, bye.