Hello, my name is Konrad Siek, and this is a collaborative effort with Colette Kerr. We're both from the Programming Languages Research Lab at the Faculty of Information Technology of the Czech Technical University in Prague.

Did you ever have a really big file, say 117 gigabytes? And did you ever try loading that into R? Let's try it now. Let's use read.csv to read it in. Oh no. And it crashed the R session too? This is terrible. Why does this happen?

Well, the function read.csv returns a data frame. In essence, a data frame is a list of columns. Each column is represented by a vector containing integers, numeric values, logical values... you get my drift. Each vector is a fixed-size, contiguous region of memory with a specific structure. It has a header that contains meta-information about the vector, for instance the type of its elements. It has a body that mostly contains information about the vector's length. And then it has a data region where the values of all of its elements are stored.

The function read.csv tries to create a vector for each column in the CSV file. It tries to allocate enough memory to fit the contents of each column, and so eventually it will try to fit the entire contents of the CSV into memory. Since there is not enough memory available to fit all these vectors at once, R issues an error and RStudio panics.

So what can you do? There are packages like bigstatsr, bigmemory, matter, and ff. They create R objects that mimic structures like vectors, matrices, and data frames. These structures are file-backed: they create files on disk to store their data. When the structure is touched, it reads or writes data from the appropriate file. This is usually done using an operating system feature called memory mapping, or mmap. A memory-mapped file is associated with an area of virtual memory. Virtual memory is a construct of the operating system. It does not contain any data itself, so it does not use up actual memory.
Instead, all the data is stored in the associated file. When the virtual memory is accessed, the operating system seamlessly reads from or writes to that file. To the R programmer, it feels just like accessing a data frame, a matrix, or a vector.

It works. Well, most of the time it works, anyway. Here we tried using the dplyr verb summarize to do the same thing, get the sum of column 1, and we find out that no applicable method for summarize applied to an object of class ffdf exists. This is because these structures do not perfectly mimic R's structures, so packages like dplyr may not be equipped to handle them correctly. Because of this, packages like bigstatsr, bigmemory, ff, and matter each come with collections of helper functions. For instance, Cardinal is a package that builds on top of matter and provides a whole framework for the analysis of mass spectrometry imaging data sets.

On the other hand, many packages out there are implemented in C and C++, often for performance reasons. Code in these packages will try to access the mimicking structures through R's C interface for vectors. For instance, it will try to get the length of a vector by calling the function XLENGTH, and it will try to get individual elements of the vector by calling INTEGER_ELT. But these C functions simply cannot be overridden by the mimics, so they fail.

There's another problem too. Since the memory is file-backed, whenever you load a file into one of these structures, you have two representations of its contents on disk: the original file and the memory map. This may be okay if the file weighs around 6 GB, but it can quickly become a problem if it weighs 100.

So what else is there? There is the alternative representation of vectors, or ALTREP. ALTREP creates custom vectors that are transparent to the user. ALTREP is part of R itself and is available in R 3.5 and later. You might not even know it, but you are already using ALTREP vectors. Let's see if we can show them to you.
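The ffdf problem described above can be sketched in a couple of lines. The file name and column name here are assumptions for illustration; the error message is what R actually reports for this situation:

```r
library(ff)
library(dplyr)

# A file-backed data frame ("huge.csv" is a placeholder file name).
df <- read.csv.ffdf(file = "huge.csv")

# dplyr does not know how to handle ff's mimic structures:
summarize(df, total = sum(V1))
# Error: no applicable method for 'summarize' applied to an object of class "ffdf"
```

To work on such objects you have to fall back on the helper functions shipped by the package itself rather than the generic data-frame tooling.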
First, let's try making a big but ordinary R vector. Let's call it one_billion_integers and try to fit that many integers into it. It fails. But if we use a sequence constructor instead, one billion integers is just the sequence from 1 to 1 billion. This works, and if we check its length, there are indeed 1 billion elements inside.

But why does this work? Surely we should run out of memory. The second vector is not an ordinary R vector. It is an instance of an ALTREP class, a compact integer sequence. Unlike an ordinary R vector, it just remembers the beginning, the step, and the end of the sequence it represents. Therefore, it does not have a data section containing the values of all of its elements. Instead, it keeps around a three-element vector holding just the sequence's parameters: the beginning, the step, and the end.

You probably never noticed anything different, though, because these are meant to be indistinguishable from ordinary vectors. So if you take one of them and execute a sum, or anything else, you should get the same result as if it were an ordinary vector, even from the perspective of C and C++ code.

Another pertinent ALTREP class, apart from compact integer sequences, is memory-mapped vectors. Memory-mapped vectors use mmap to implement file-backed, larger-than-memory vectors. They are available via the rmio package and are used by bigstatsr. More importantly, though, you can make your own ALTREP vector implementations to suit your specific needs. It requires some C and C++ savvy, but there are some helpful examples available.

However, ALTREP has some problems of its own. There are certain operations that will not work. For instance, if you try to add 1 to the billion-integer vector, you will get an error. This is because when an arithmetic operator is evaluated, it creates a new vector to store the results. The new vector is always an ordinary R vector. This vector must be large enough to fit the result, so it can exceed memory when one of the operands is large.
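The demonstration above fits in a few lines; where exactly the allocation fails depends on how much memory your machine has:

```r
x <- integer(1000000000)  # ordinary vector: one ~4 GB allocation, likely fails

s <- 1:1000000000         # succeeds instantly: an ALTREP compact integer sequence
length(s)                 # 1000000000 -- yet only (start, step, end) are stored

s + 1L                    # error: the result must be an ordinary ~4 GB vector
```

The last line is the arithmetic limitation: the operands may be compact, but the result vector is always allocated as an ordinary R vector.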
This problem occurs with all arithmetic operators, with variants of the map and apply functions, and with some cases of subsetting. However, it is possible that this will be resolved in future versions of R. Let's hope.

There is one more problem, though. Let's try to print a large vector. Let's print our billion integers. This runs out of memory and fails. Actually, okay, it does not fail anymore: as of R 4.0, this has been fixed. So let's take a look at another example of the same issue. Let's try writing to an ALTREP vector. Say we try to write the number 42 into the first slot of the billion-integer vector. This also fails.

So what's going on? The implementation of subscript assignment and the old implementation of print both call a function named DATAPTR. DATAPTR retrieves a pointer to an area of memory containing all the elements of a vector. Since a compact sequence does not have any area of memory where it keeps all of its elements, it cannot fulfill this request directly. So if such a request is issued, the vector must be materialized: all elements of the sequence are loaded into memory. In the case of compact integer sequences, that means a new ordinary R vector is created with all the elements of the sequence in its data section, and a pointer to that data section is then returned. But of course, in our case, the vector would need to be larger than available memory, so materialization fails.

Many ALTREP implementations suffer from this. One exception is the memory-mapped vector class. Since it uses mmap, it always has an area of memory handy that it can return a pointer to when asked. But we started thinking: is there a way to remove the need for materialization completely? And Colette came up with an idea. The Linux kernel recently added a new feature to the mmap mechanism called userfaultfd. Userfaultfd lets us instrument an area of virtual memory rather than just map it onto a file.
This means that when the virtual memory is accessed, we can have the operating system execute a custom procedure to populate it. The procedure can be used to load data into memory from disk, or to calculate it. This redirection is reminiscent of ALTREP. What is different is how we associate an R vector with an area of virtual memory.

We create UserFault Objects, or UFOs, by using the custom allocator mechanism in R. When a vector is created, it can be allocated using a standard allocator, like malloc, or a user-specified function. We use a custom function that returns an area of instrumented virtual memory. Outwardly, the resulting vector is just an ordinary R vector that lives entirely inside virtual memory.

When the virtual memory is accessed, the UFO framework checks whether the access falls into the header, the body, or the data. The framework automatically populates the header and body appropriately. But when an element is accessed, we allocate a small fragment of actual memory to fit it and the elements around it, and we execute a populate function. This function contains the logic instructing the UFO framework how to fill that area of memory. Once such a chunk of memory is populated, it hangs around, and subsequent accesses to data in it are simple memory operations.

Over time, more and more such chunks are populated. The UFO framework keeps track of how much memory is collectively used by the chunks. If the memory in use hits a threshold, we destroy some of the chunks, starting from the oldest, until a sufficient amount of memory is freed. If a chunk is reclaimed, it will have to be repopulated when it is next used.

The populate function itself will be different for different UFO classes. Our intention is that, as with ALTREP, programmers would create their own UFO implementations to scratch their own particular itches. With that in mind, we have tried to make these implementations easy to write.
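From the R side, using such a UFO might look roughly like this. This is a sketch: the constructor name below is an assumption for illustration, and the real API is documented in the package's programming guide vignette:

```r
library(ufovectors)  # the UFOs package from our repository

# Hypothetical constructor: a UFO-backed integer sequence. No elements
# are computed yet; the vector lives entirely in instrumented virtual memory.
s <- ufo_integer_seq(1, 1000000000)

# Touching an element triggers a userfault: the framework allocates one
# small chunk of real memory and runs the populate function to fill it.
s[123456789]

# A later access to the same chunk is just a plain memory read.
s[123456790]
```

The point of the design is that only the touched chunks ever occupy real memory, and the framework can throw them away and repopulate them later.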
A prospective programmer only needs to provide a populate function and a data structure for passing data between invocations of the populate function. The UFO programming guide vignette in the UFOs package walks you through the process of writing a custom UFO vector.

But we do have four example implementations. We provide a compact integer sequence, analogous to the built-in ALTREP one. We also provide vectors and matrices that map onto binary files. And the implementation we're most proud of reads directly from a CSV file into an ordinary data frame, where each column is a UFO vector. Finally, we also provide an empty UFO implementation.

Why would an empty UFO be useful? Because UFOs can be written to. So if you create a loop, and in that loop you start assigning elements to the empty UFO, it retains those values. When a chunk is loaded into memory, it is just ordinary memory, so it can freely be read from or written to. And since a UFO has the shape of an ordinary R vector, R, C, and C++ code can operate on it as if it were an ordinary vector.

Now, when a chunk is reclaimed, the UFO framework checks whether there are any changes to it using a checksum, a SHA-256 checksum to be exact. If there were no changes, the data can be safely discarded and regenerated later by the populate function, if need be. If the chunk was written to, the UFO framework creates a temporary file on disk and stores the modified chunk there. From then on, this chunk is read from disk rather than being regenerated.

Apart from being writable, UFO vectors never need to materialize, because they already look like ordinary R vectors. Thus, DATAPTR can readily find where the elements of the vector live in memory.

Nothing, however, is perfect, and UFOs have a few important limitations. First of all, UFOs only work on new Linux kernels. We would like to extend UFOs to Windows and Mac in the future, but we lack the prerequisite expertise. If you'd like to help, let us know.
Second, UFOs suffer from the same problems with arithmetic operators, apply functions, and subsetting as ALTREP vectors. If you try to add 1 to a billion integers, you will also get an error. The reasons are the same: an ordinary vector is created to store the result, and it does not fit into memory. However, this problem can be partially alleviated by performing such operations element-wise and writing the results into an empty UFO. Now, this is far from perfect, but it does the job in a pinch.

Finally, the populate function can be called at any point in the life cycle of the R interpreter. Because of this, we cannot guarantee that the interpreter will be in a consistent state when populate is called. Thus, it is best not to bother the R interpreter inside a custom populate function. Just don't touch it.

Apart from those limitations, UFOs are fully functional and can be retrieved from our repository on GitHub. This work received funding from the European Research Council under the European Union's Horizon 2020 Research and Innovation Programme, and from the Czech Ministry of Education, Youth and Sports.