I'm Peter Higgins, and I'm going to be talking about using the arrow and DuckDB packages to wrangle bigger-than-RAM medical data sets of over 60 million rows, with a bit about data.table. The motivating problem was Open Payments data from the Centers for Medicare and Medicaid Services (CMS). I'm going to talk about the limits of R in speed and RAM, the arrow package, the DuckDB package, and generally about wrangling very large data. You can see this includes 12 million records per year and a total dollar value of $10.9 billion in 2021.

I usually analyze smallish data sets, carefully collected and validated data of 500 to 1,000 rows, maybe 10,000 rows on a big project. But digital data, which we increasingly have access to from CMS or clinical data warehouses, can give us much bigger data: easily over 100 million rows, often greater than 50 gigabytes. That doesn't fit in an email, and it also doesn't always work well in R, which was designed for data in RAM.

The motivating problem that led me down this rabbit hole was a paper that just came out about payments to gastroenterologists. The highlighted sentence in yellow jumped out at me: four individuals received over a million dollars over the course of a year at least once, and one physician received over a million dollars every year since 2014. Now that's puzzling, because while they strongly suggest these docs are being paid off by pharma, the classic pharma drug talk only pays about $2,000. The speaker talks about a new drug or device while the audience eats dinner on weeknights, Monday through Thursday, and most academic institutions no longer allow this; it was a thing of the 90s to early 2000s. But how did the million-dollar docs actually do this? Are they actually giving 500 talks a year? When do they actually do their main job? Did they ever go home for dinner? Did they travel constantly? How is it even possible? There are only 208 weekday (Monday through Thursday) dinner slots per year. Something seemed kind of fishy here.

So CMS has the data available publicly. You can download it, and the first thing they tell you is that these are really large files; be careful before you download. This includes every single payment from pharma or a device company to a prescriber (they added NPs and PAs in 2021). A typical year is about six gigabytes, 11 to 12 million rows and 75 variables per year. The big problem I immediately figured out was that my M1 Mac mini has 16 gigabytes of RAM, and the total data from 2013 to 2021 is 94 gigabytes. R works on data sets in memory; that's not going to fit. So I'll walk through how I approached this, and I'll use run times measured with the tictoc package on the M1 Mac mini. Your mileage may vary; some of these are faster on Windows, and there may be some issues optimizing for Apple Silicon, but the general trends should hold.

I started by trying to work on one year, six gigabytes at a time, extracting the summary data, then pulling across years. It was slow and clunky, and I needed to find a better way. My first pilot attempt: 11 million rows, 75 variables, 6.4 gigabytes. I thought it might be a little slow, so I decided to use read_csv rather than base read.csv. It still took 49 minutes to read, not good for just one year. So now I have two problems, both speed and RAM capacity. More data produces more problems. Then I recalled some advice from Frank Harrell on Twitter: the cost of mastering data.table is high, but the benefits are twice the cost.
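As a rough illustration of how these timings were captured, here is a minimal sketch wrapping tictoc around a readr::read_csv call; the file name is a placeholder, not the actual CMS file name.

    # minimal timing sketch (file name is hypothetical -- substitute your
    # downloaded Open Payments CSV for one year)
    library(tictoc)
    library(readr)

    tic("read one year of Open Payments data with readr::read_csv")
    op_2016 <- read_csv("data_years/open_payments_2016.csv")
    toc()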
So I looked into this, and data.table has the fread function for fast reading. I actually read in the 2016 data in 93.3 seconds; that's roughly a 30x improvement, much better. I also looked at the vroom function from the vroom package: better, but 14.3 minutes, about a 3.5x speed improvement. The vroom format works well with dplyr wrangling, but it's way slower than data.table. So data.table actually has a lot to like. It's very fast, but a lot of people are reluctant because the syntax is a bit tricky to read. I'll show you an example of a classic filter, group by, and summarize: the filter comes first, where we filter for covered physicians who are medical doctors and gastroenterologists, then we group by physician profile ID and calculate the total payments in the summarize step. It's doable and it makes sense, but it's a little bit different. For sheer speed, data.table is definitely worth using.

I should point out that the dtplyr package is a good back-end workaround. It translates dplyr verbs to data.table syntax: you can basically just library dtplyr and then write in the dplyr dialect, and it translates to the data.table dialect in the background with a very small speed cost. But it doesn't address the bigger-than-RAM data problem, and vroom doesn't either.

So just a moment about data on disk versus data in RAM. It's easy to get big disks; you can get terabyte disks now, multiple terabytes. They used to be slow, physical platters, but now with solid state drives they're much faster. They're still a little pricey: 16 terabytes runs around $1,800. And you need a fair amount of space, including scratch disk space for writing intermediate data. But this has been the standard strategy of SAS since the 1960s. Data in RAM is different. It's dynamic. It used to be much, much faster than disk; it's still faster, but not as dramatically so with SSD drives. And it's limited to gigabytes, not terabytes, which is a big issue. The standard Mac configuration is 8 gigabytes; the biggest is 128 gigabytes. With big data, you can often have insufficient RAM. But this is the standard strategy of R: RAM-based calculations. You could say, just buy more RAM. It's not unreasonable; RAM is cheap, five to ten dollars per gigabyte. But my M1 Mac mini can only address 16 gigabytes, so I would need a new computer at this point. I can hear Frank Harrell on my shoulder saying, come on, just buy a new computer. I decided not today, though it was tempting; I always love new computer day. But the data sets just keep getting bigger.

So how can you wrangle bigger-than-RAM data in R? That was when I went back to Twitter, and Twitter suggested the arrow package. It uses a data structure that works across languages and is very efficient. So I asked, well, can it read big data fast? I tried the read_csv_arrow function for single files. On the 2016 data set: 68.8 seconds, very nice, about 30% faster than data.table's fread. Encouraging on the speed front.

So what is arrow? It's part of the Apache Arrow project. It relies on a C++ library, and it uses a different data format: columnar, organized for efficient analytic operations on modern hardware. It enables zero-copy data sharing between R and Python, and it's available for 12 different languages. The nice thing about columnar is that it's very fast, especially with modern hardware. Rather than the traditional row-oriented layout, which scatters each variable across the in-memory buffer, it stores data by column.
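As a sketch of that filter, group-by, and summarize in both dialects, here is roughly what it looks like; the column names are assumptions based on the Open Payments general payments file, so adjust them to your data.

    # data.table dialect: DT[filter, summarize, by]
    library(data.table)

    op <- fread("data_years/open_payments_2016.csv")   # hypothetical file name

    pay_by_doc <- op[
      Covered_Recipient_Type == "Covered Recipient Physician" &
        grepl("Gastroenterology", Physician_Specialty),
      .(total_payments = sum(Total_Amount_of_Payment_USDollars)),
      by = Physician_Profile_ID
    ]

    # the same query in dplyr dialect, translated to data.table by dtplyr
    library(dtplyr)
    library(dplyr)

    lazy_dt(op) |>
      filter(Covered_Recipient_Type == "Covered Recipient Physician",
             grepl("Gastroenterology", Physician_Specialty)) |>
      group_by(Physician_Profile_ID) |>
      summarize(total_payments = sum(Total_Amount_of_Payment_USDollars)) |>
      as_tibble()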
It keeps each variable's values together in the memory buffer, which allows what's called SIMD, single instruction, multiple data operations, which are pretty standard in modern processors. So with a single instruction, we can process several data items at once. It's a bit like parallel processing with multiple cores, but at the micro, vector level, which is a significant speedup, and it's available for most modern chips.

Looking at wrangling speed, here's a comparison of a basic filter, group by, and summarize workflow on 6.4 gigabytes, one year of data: pure data.table, 0.3 seconds; data.table with the dtplyr translation, 1.5 seconds; arrow with dplyr, 4.7 seconds; and vroom with dplyr was too long to measure. So data.table is still freaky fast, dtplyr is almost as fast, and arrow is at least in the same ballpark and getting faster, but everyday dplyr is much slower for big data, and you need some of these alternatives if you're looking at big data sets. Arrow can speed-wrangle, but not all dplyr verbs are supported yet. The supported ones are all listed here, including most of the basic verbs, group_by and summarize and joins. If you want to use other verbs, like rank or across, that aren't yet supported in arrow, you have to collect the results back into an R table first.

One side benefit of multilingual arrow: arrow-formatted data can be used by many languages. Collaboration the old way went: somebody has a good Python routine, you copy data from R to Python, run the Python routine, and copy the results back to R. With arrow, it's multilingual with less friction. All of these languages can use the arrow format; they can read, process, and output in arrow format, so you can have a multilingual pipeline.

But what about the bigger-than-RAM problem? Can arrow actually handle this? As of arrow 6.0, it can analyze and process multi-file, larger-than-memory data sets, and it's now at version 9.0. A previous solution to this, disk.frame, was soft-deprecated in favor of arrow. So how does arrow actually do this? It partitions the data on disk into RAM-sized chunks, loads a chunk at a time into RAM, analyzes it, pools the results on disk, and does this all in the background, so you don't have to manage RAM, chunk size, or pooling results. It's automagical.

So let's try it out and create a data object in arrow from these six data sets. I downloaded them all to one folder, which I call data_years, and I use arrow's open_dataset, tell it where the files are and that the format is CSV, and load them as a dataset. It's stored in the columnar arrow format as an R6 object, partitioned by year, and it can hold all of these data in one object, 33.8 gigabytes, even though that's bigger than RAM. Now arrow has a dplyr back end, so you can write code with many of the dplyr verbs. This is a typical example; it looks very familiar if you're used to using dplyr in the tidyverse, but it doesn't produce a result if you stop there. If you add the collect function, it pulls the result back to R as a table. The resulting table then lets you use all of the dplyr or data.table functions that arrow doesn't have yet, as long as the results now fit into RAM. So it's generally a good idea to select variables and filter rows to shrink the data size, if you can, before you collect. Do those data-shrinking steps early, then collect, and then you can do whatever you want within RAM. But what if the data are still bigger than RAM?
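Here's a minimal sketch of that pattern, assuming the yearly CSVs sit in a data_years folder and using the same assumed Open Payments column names as above; nothing is pulled into RAM until collect() is called.

    # open all yearly CSVs as a single arrow Dataset (bigger than RAM is fine)
    library(arrow)
    library(dplyr)

    op_all <- open_dataset("data_years", format = "csv")

    top_paid <- op_all |>
      filter(Covered_Recipient_Type == "Covered Recipient Physician") |>
      group_by(Physician_Profile_ID) |>
      summarize(total_payments = sum(Total_Amount_of_Payment_USDollars)) |>
      arrange(desc(total_payments)) |>
      collect()    # only this step pulls the (now small) result into RAM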
What happens if you want to use a tidyverse verb that arrow does not have yet, like across? I wanted to convert all the first names and last names to uppercase, so I used mutate with across. You get an error: the expression is not supported in arrow; call collect() first to pull all of the data into R. But that won't work if the data are still bigger than RAM. It seems like a no-win situation, but there is sort of a Kobayashi Maru workaround, a function called to_duckdb.

So what is DuckDB? It was designed to be used as an embeddable, on-the-fly database, like SQLite. There's no database setup required; you just load the duckdb library and go. DuckDB is an OLAP database (online analytical processing). It has a columnar structure, organized by field or variable, designed to make complex queries fast, rather than making transactional lookups fast, which is what the other kind of database, OLTP, is for. This structure, like arrow's, also takes advantage of SIMD instruction sets to achieve speedups on modern CPUs, and it can work with the full dplyr verb set.

So a database on disk, what's that? Basically it stores a list, not a data frame. Instead of 33.8 gigabytes in RAM, you actually only have 135 KB, because what's in RAM after you convert to DuckDB is mostly pointers to data on disk. If you open up your environment and look at it, it's pointers, pointers, pointers all the way down. Then you collect your partitioned results after you filter, group by, and summarize. All this partitioning and analysis is abstracted away; it automatically happens under the hood with arrow and, if needed, DuckDB, and you get nice tabular results. So we processed 66 million rows, 34 gigabytes of bigger-than-RAM data, in 44.3 seconds, a task that was not even possible in 16 gigabytes of RAM with standard tools. And you can see here some of the contenders: one guy made $12 million over six years; this guy made $4 million. There's no way they're doing that by giving talks. So what was going on? We dug into it.

Now, a couple of warnings. There are a couple of things about arrow that can get you, and this is one gotcha moment with arrow. Arrow has more than 25 data types in order to work with multiple languages; for example, integers can be int8, int16, int32, or int64. This can make joins fail due to mismatched data types, even when both fields look like integers and look the same. You'll get an error: incompatible data types for corresponding join field. It turned out the national provider identifier was int64 in one table and the NPI ID was int32 in the other; that's going to fail. You can handle this by defining the data schema when you're reading in the data tables; you need to add a schema argument to define the data types.

Another warning, and benefit: arrow is rapidly evolving. It's rapidly iterating; major releases, from arrow 8 to arrow 9, are coming out every three months. They're adding new dplyr verbs, and it's increasingly optimized for speed, and in some cases this provides faster big-data wrangling than data.table. Limitations that are present today may be resolved soon.

So generally, there are three approaches to bigger-than-RAM, or BTR, data. Read it in arrow format first and wrangle with standard verbs to reduce the data size. If it now fits in RAM, you collect back to R and can wrangle with all the dplyr verbs, or with data.table if you want the most speed. If it's still bigger than RAM, but you only need the arrow subset of dplyr verbs, you keep wrangling with that subset in arrow and collect the results back to R.
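Here's a rough sketch of that workaround under the same assumed column names: hand the arrow Dataset to to_duckdb(), and then verbs that arrow doesn't support yet, like across, run through the embedded DuckDB engine without pulling the full data into RAM.

    # sketch: use DuckDB for dplyr verbs arrow can't handle yet, like across()
    library(arrow)
    library(duckdb)
    library(dplyr)

    op_all <- open_dataset("data_years", format = "csv")

    upper_names <- op_all |>
      to_duckdb() |>                                   # hand off to embedded DuckDB
      mutate(across(c(Physician_First_Name,
                      Physician_Last_Name), toupper)) |>
      collect()                                        # back to an in-RAM tibble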
If it's still bigger than RAM and you need more dplyr verbs than are available in the current arrow subset, you can use to_duckdb and then wrangle with all the dplyr verbs.

So, back to the question: did these physicians actually give hundreds of drug talks per year? No. Nearly all of the top 50 in payments were receiving royalties from patents: greater than one million dollars per year in royalties for John Fordtran, and number two is Bennett Roth. What were these patents for? Well, it turns out the first guy, number one: GoLYTELY. He was the inventor of GoLYTELY, and there's a lot of money in that, a lot of money in getting people cleaned out for colonoscopies. And Dr. Bennett Roth created the Roth Net for retrieving polyps; again, colonoscopies are big business. But there's always more than one way to do it, and here's an example: one person made $495,000 in income in 2018 with no royalties, from drug talks, consulting, et cetera, with a total of 101 payments for drug talks out of 208 possible slots. And there were 78 other people with over $500K over six years without royalties.

So, take-home points in terms of speed: read_csv is two to three times faster than base read.csv, but it gets really slow when you get over 10 million rows of data. Faster options include data.table's fread, read_csv_arrow, and vroom. Faster options for wrangling big data: data.table is great as long as it fits in RAM; dtplyr is a dplyr front end for data.table that can speed up your dplyr code; and arrow with the dplyr back end gives you good speed, a growing set of dplyr verbs, and the ability to go bigger than RAM.

Take-home points for bigger-than-RAM data: SAS folks have owned bragging rights for years with data on disk, claiming only SAS can process bigger-than-RAM data sets. Now, with arrow and sometimes adding DuckDB, R can do this also, in three easy steps. Read it in as an arrow structure with read_csv_arrow or open_dataset. Then wrangle, sometimes directly, with dplyr functions: if it's small enough for RAM, you can collect and wrangle with all the verbs, or go with data.table; if it's still bigger than RAM and you need particular functions, you can convert to DuckDB and then wrangle with dplyr functions on disk. And remember to always collect your results out of arrow, and you'll be in shape for bigger-than-RAM data sets.

There are a bunch of resources for learning arrow, DuckDB, and data.table; I put links here on the slide. Thank you very much. I'm happy to take questions.