Welcome to session 2A on data management. My name is Christopher Maronga. I'm a co-organizer of the Nairobi R user group and a longtime data manager working in health data management and analysis. We have four talks in this session, as you're going to hear from the presenters. The sponsor of the day is Appsilon, and the session sponsor is cynkra.

The first talk is from Neal Richardson on solving big data problems with Apache Arrow. Neal leads an engineering team at Ursa Computing and is the maintainer of several widely used R packages, including the package for today's talk, arrow. He previously trained in political science and has worked in survey analytics. Neal Richardson, please go ahead; the floor is yours.

Hi, I'm Neal Richardson. Thanks for joining me today. I'm going to talk about solving big data problems with Apache Arrow and R, and to do so I'm going to share some success stories from Arrow users about how they have used Arrow to solve their problems.

What are we talking about when we talk about big data problems? First of all, the actual size of the data is not really something I'm going to get into, whether your data is big enough to be considered big data. Size, of course, is relative: what was too big for me 10 years ago is not too big for my laptop today, because it has a lot more memory. What I do want to talk about is the point where you go from data that is small enough to work with on your computer, on a single node, to where it starts to push the limits of what you can do there. That's what I call big data problems.

When I start getting data that's bigger than the memory I have, whatever amount that is, the normal tools we use in R to work with data frames tend not to work as well; you have to stretch a little bit there. And when you have data that's bigger than you can work with on your machine, this generally comes with some other things that are responsible for why the data is that size to begin with. It could be that the data is not local to your machine: it could reside on cloud storage somewhere like Amazon S3, or on a network file system, but it's on a system that's bigger than yours. Because it's outside of your machine, it may also be data that you don't have exclusive ownership over; you may need to work with others who are either responsible for generating the data or who also need to work with it. Data may also be getting uncomfortably large for your machine because it is continually being updated. One example would be a process that generates logs: you get a log file every day with transactions or events, the data keeps growing, and you end up with data that's split across multiple files.

So how do we solve these problems? One solution, if I just have data that is too big for my machine, is to get a bigger machine. That could be expensive and may not be an option. Another idea is: it's too big for my computer, so maybe I need something like Apache Spark, some big data MapReduce type of job to query this data. Of course, if you need to spin up a Spark instance, maybe you get into a world where you're actually becoming a sysadmin, because you're managing a cluster of computers, permissions, roles, YAML files, and all of that.
And so maybe you've solved your big data problem, but you've also turned it into other problems. What I'm going to show you today are some examples where people have used Arrow to effectively shrink their big data problems, to help you find small solutions to your big data problems so that they're no longer big data problems. They're just data problems, data problems that you already know how to solve.

Before I get into that, I want to talk a little bit about Arrow itself. The Arrow project started in 2016, over five years ago. We had our 1.0 release last year, which marked the stability of the columnar format that is Arrow underneath it all. Our latest release is 4.0, and we've got a 5.0 release coming out later this month; we do releases every three months. Arrow is fundamentally a format for how data should be represented in memory, in a columnar layout, and we have implementations in 12 different languages, R being one of them. It really provides a shared foundation for data analysis. It draws on lessons from databases and other data frame libraries to take advantage of what we've learned there, as well as of the modern hardware we have on our machines.

The arrow R package is on CRAN; it's also on conda-forge. We've got a pkgdown site, and the link is there on the slides. We have nightly packages that we build every night, with binaries for Mac and Windows. None of the features I'm going to demonstrate or discuss in the case studies here require nightly features, but if you want to check out the latest things we're doing, you can download those; instructions are all on the website.

I asked on Twitter a while back for examples where Arrow had solved a real data problem, to get a sense of what sorts of issues people were using Arrow to solve in the world. I know, as a maintainer of the package, what I think Arrow's capabilities are, and I was curious to see what people were doing with Arrow in practice. A few common themes emerged, and I will walk through them next. Some were simply about the ability to write Parquet files, and I'll talk in a second about why Parquet is such a compelling file format for data. Many people were taking advantage of Arrow's features for reading multi-file datasets and using dplyr to query them, where you can essentially select a subset of your data to pull into memory and do whatever analysis you want on it without having to read everything in first. Another case was being able to write data into that partitioned format in the first place. And finally, there were cases of using Arrow's features for accessing cloud storage directly as a way of pulling that data into memory. A common theme of all of these was that Arrow was letting people work within their resource constraints, whatever those were, without needing bigger machines or a big data solution like Spark and all the complexity that comes with it. It allowed them to solve their problems without going beyond what their current hardware provided.

The first theme I mentioned is Parquet. Parquet is a columnar file format, whereas Arrow is a memory format. Parquet has a number of good features, compressions, and encodings that allow it to produce very small files that are also very fast to read and process, and it's widely used in big data systems.
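As a quick illustration of what that looks like in the arrow package, converting data to and from Parquet is a pair of one-liners; this is a minimal sketch, and the file names are made up:

```r
library(arrow)

# Read a large CSV once...
df <- read_csv_arrow("events.csv")

# ...then write it out as Parquet: compressed, columnar, and much faster to re-read.
write_parquet(df, "events.parquet")

df2 <- read_parquet("events.parquet")
```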
So even if you don't necessarily write Parquet files yourself, you may encounter services that write Parquet files, and you'll need to read them. Arrow has simple functions for reading and writing Parquet.

One interesting example came from Mike Thomas. He had a client that was building a Shiny app on top of some CSV data, but the CSV data was so large that it would not fit on shinyapps.io for hosting, and it was also not very performant. Just by reading the CSV, writing it to Parquet, and then using Arrow to read it, they could get the file size small enough to host there, and the app was much more performant. What's more, Mike put together a great example Shiny app, with the link here, where you can check it out and play with it yourself. Very cool stuff.

Second, building on top of that: when data gets too big to read into memory, often we want to split it into multiple files so that we can touch them separately. Beyond that, by partitioning the data into multiple files, we can encode information about how it's split into the directory paths, so that when we query it with Arrow we only touch the files we need. We can use that directory information to avoid even having to open a file to do the filtering. It is common for big data generating systems to write data in partitioned form, and with Arrow we can read it, and we can also write datasets in order to partition them ourselves.

So what does this look like? As a very small toy example, take the mtcars dataset. Using arrow, we take mtcars, use group_by to indicate how we want to partition (group by cylinder), and then call Arrow's write_dataset function with a target directory. When I inspect what's in the directory, I see subdirectories where cylinder equals four, cylinder equals six, and cylinder equals eight; that's encoded in the file path. So if I were to call open_dataset on this directory and then filter where cylinder equals six, I don't have to touch the other two files. Obviously with mtcars this is trivial and I don't need that savings, but with bigger data it can really pay off and make your queries a lot faster.

There were a number of cases people discussed with me where this was a really big win for them. One was in the fraud detection space, with a dataset of reasonable size, millions of rows and hundreds of columns. They found it prudent to convert their data with Arrow from CSVs to Parquet and to partition it by month. Parquet was a good choice because they were working with a bunch of other teams, data engineering and modeling, that needed to access this data too, and Parquet is an efficient standard that people in the Python, Java, and other ecosystems can work with. And because the arrow package lets you treat this directory, where everything is split out by month, as a single entity and query it, there's no extra overhead for me as a developer or as a data scientist in having the data in multiple files: I can treat it as one thing. After repartitioning the data in that way, an example query they ran selected a subset of columns from the data and looked for places where there's missing data within each month. For any data cleaning and processing tasks like this, we only have to read into memory the columns that we select.
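Here is roughly what that mtcars example looks like in code; the directory name is made up, and collect() is what finally pulls the filtered result into an R data frame:

```r
library(arrow)
library(dplyr)

# Partition mtcars by cylinder count: each group becomes its own
# subdirectory (cyl=4/, cyl=6/, cyl=8/) under the target directory.
mtcars %>%
  group_by(cyl) %>%
  write_dataset("mtcars_ds")

list.files("mtcars_ds", recursive = TRUE)

# Query the directory as a single dataset; the filter on cyl means the
# cyl=4 and cyl=8 files are never read.
open_dataset("mtcars_ds") %>%
  filter(cyl == 6) %>%
  select(mpg, hp, wt) %>%
  collect()
```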
Another really interesting use case came from the government sector, in the British Columbia provincial government. They were trying to estimate the size of their homeless population. Their data was stored across lots of fixed-width files from different sources, corresponding to different time windows. Aside from the complexity of that interface, they had struggled because there was one big machine available to them, the only one that could open these files for analysis, and they had to compete with other groups for time on that machine. By getting some time on the big machine and using Arrow, they could write these files to Parquet, so they'd be much smaller, and then partition them into standard chunks by year and month. This had a number of benefits for them downstream. Now all the data is standardized: everything is by year and month, so you don't have to wonder how a certain file is grouped. You've got this interface with Arrow's open_dataset function where you can just point to the whole directory and filter on it. I'm filtering on year in this example query, and that means all of the files that are not in that year don't even have to be touched. So I can pull in the subset I care about, and I can do that on the small machine; I don't need access to the big machine to do this. This unlocked a lot of work and analyses for them.

Another use case of treating multi-file datasets as one comes from Chie, from a group that was monitoring COVID statistics. They used Arrow in a similar way, querying with dplyr these datasets that are backed by multiple files. There's a cool Shiny app here that you can check out, and they also have a paper published in Science looking at COVID incidence and mortality based on this data. Very cool stuff.

The last example I want to talk about is reading data from the cloud. As I said up front, one way you commonly interact with data that is bigger is that it's stored in an S3 bucket or on Google Cloud; it's there because it's bigger, and in order to let others work with it. Arrow has some nice features to let you work with S3, or any file system that has an S3 interface (it doesn't necessarily have to be on Amazon), to read datasets and read single files. If it's on S3, you can just use the S3 URL, s3://, to point to your file and read and write. Additionally, using the dplyr multi-file dataset backend like we saw before, where we only pull into memory the data we're working with, we can also download from the cloud only the subset that we care about, and this speeds things up quite a bit.

Jared Lander wrote on his blog a while back about analyzing temperature data from a bunch of thermostats and sensors throughout his house. It's a very comprehensive blog post, going from collecting the data all the way to visualization and inference. He had a process where data from the sensors is pushed to DigitalOcean, another cloud vendor, which has an S3-like file system interface. Using Arrow, he was able to point at his DigitalOcean bucket and use dplyr on it to pull in just the columns and just the rows he cared about, and explore this data without having to download everything locally.
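A hedged sketch of what that cloud access can look like; the bucket, prefix, and column names are invented, and non-AWS S3-compatible stores generally also need an endpoint override in the URI or an explicit S3FileSystem:

```r
library(arrow)
library(dplyr)

# Hypothetical bucket and schema; any S3-compatible store can be
# addressed the same way once credentials and endpoint are configured.
sensors <- open_dataset("s3://my-bucket/sensor-readings/")

sensors %>%
  select(sensor_id, time, temperature) %>%
  filter(sensor_id == "office") %>%
  collect()   # only the needed columns and rows are downloaded
```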
So: Parquet files, multi-file datasets for querying, and accessing data in the cloud. All of these are things that let people really take advantage of the computing resources they already have at hand and avoid making big data problems out of their data problems. I want to thank everyone who helped with this talk, particularly those who gave me the examples used here; I really appreciate it. Arrow is a growing, thriving community. We have over 600 contributors and many more people using it and reporting issues and feature requests. We can't do it without you, and we really appreciate all of the support we have from the community. Our website, arrow.apache.org, has lots of information, and there's my Twitter handle, @enpiar, if you want to ask me any questions. Thanks so much for your time.

Thank you so much, Neal, for the great presentation. We won't waste much time; we're going to move to the next presenter. The title of the talk is A framework for creating relational data, with applications via the package respectables. Our presenter is Gabriel Becker. Gabriel is a statistical computing researcher and a consultant working in the pharmaceutical space. Gabriel has contributed multiple new features to R in collaboration with the R Core development team; for instance, Gabe collaborated with Luke Tierney on the implementation of the ALTREP framework, which was introduced in R version 3.5.0. So, Gabriel, the floor is yours. Please go ahead and present.

Hi, everyone. Thanks, Christopher. As you said, I'm Gabriel Becker, and I'm going to talk today about the respectables package. This is a little bit of a different angle on the type of data we may be hearing more about later, in that we are going to be creating the data rather than accessing or using it. This package is funded by and copyright Roche, and it's been released under the Artistic 2.0 open source license. The source code and development version are available on the Roche public GitHub account, and a CRAN release will be forthcoming but has not occurred yet.

First off, what do we mean when we talk about relational data? I suspect many people have heard of it, but just a very brief introduction: relational data is when you break a dataset into multiple tables so that data that doesn't need to be repeated isn't. In this example, we have students taking classes, and you can see that the names of the students occur only a single time in the entire database, even though the students are taking multiple classes. By linking these student IDs across the different tables, we're able to do what's called normalization of the data, which is a much more efficient representation, and it's what happens in most actual database systems that you might hear about.

That's not typically how data is represented in R, but Kirill Müller's dm package does support this type of multi-table relational model in R, including both accessing database systems themselves and combining multiple data frames that live in memory, owned by R, into a single sort of virtual relational database. It can declare these foreign key relationships, which is what the relationship between the student ID columns in the different tables is called, and it can check whether the data meets those constraints or not, because you can't have valid data that violates a constraint like this.
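As a rough sketch of that students-and-classes idea with dm (the table, column, and value names here are made up for illustration):

```r
library(dm)
library(dplyr)

students <- data.frame(
  student_id = c(1, 2, 3),
  name       = c("Ada", "Grace", "Ross")
)
enrollments <- data.frame(
  student_id = c(1, 1, 2, 3),
  class      = c("Stats 101", "Linear Algebra", "Stats 101", "Databases")
)

uni_dm <- dm(students, enrollments) %>%
  dm_add_pk(students, student_id) %>%               # primary key
  dm_add_fk(enrollments, student_id, students)      # foreign key relationship

# Checks that every student_id in enrollments refers to a row in students.
dm_examine_constraints(uni_dm)
```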
So, you may be asking: if dm exists and dm does this already, then why are we here? What am I talking about? The key is that dm assumes the data already exists. It gives you some really nice tools for filtering the data in ways that understand these constraints, but it doesn't have any tooling for creating data under these types of constraints. That is what we're going to be talking about for the remainder of this talk.

Respectables is what I'm going to call a recipe-based system for the simulation, or creation, of relational data. Data simulation, especially under these types of constraints, may seem pretty complicated, so a recipe may not seem like a great fit, but I hope I can convince you that this is actually a pretty good model for what we want to do.

Before we get into what an actual recipe for this type of data might look like, let's talk a little bit about the different types of interdependence these data have when we're creating them rather than just accessing them. The first is intra-table, within a single table. This covers things like the distribution of height depending on the gender of the observations you're simulating, because different genders have different height distributions: cases where the variables in a single table are related to each other in some distributional way that your simulation or creation of the data needs to respect. The second type is inter-table, or between-table, relationships. These are the foreign key relationships I was talking about previously, where an ID in one column actually specifies an observation in a different table, where you can look up more information about that observation. In the example we saw, these were students and classes; in the example we'll be using for this talk, they will be customers that we're making up.

So, speaking of customers, this is a toy example. It is not real data, but it has the types of constraints and relationships that real data might have, in a simplified enough form that I think it can be illustrative in the small amount of time we have. We're going to simulate silly customers. There are a few different things we know about them: they'll have a unique ID; they'll have a stuff level, which is either low or high; they'll have a key size, which doesn't mean anything because the whole dataset is made up, but it's normally distributed, with a mean that depends on whether the stuff level is low or high; and they will have dates for when their account was opened and when it was closed.

When we're talking about recipes, the first thing you need to make a recipe are ingredients. The ingredients here are the ways we can create these different pieces of the dataset. We've got sample_fct, which basically just samples a factor; that gives us our low or high stuff level. We've got the key size function, and so on. sample_fct is provided by the respectables package; the key size function is one I wrote for this talk. We're not going to go over exactly what it looks like, but the source is in the Rmd I used to make these slides, which is publicly available.
You can look at that later if you choose, but that function gives us these key sizes. It takes a data frame that already has stuff level in it, and that's going to be really important. We can see here that the first two observations are low and the second two are high in stuff level, and that the first two key sizes are lower than the second two. Finally, we have the account dates function, which again we're not going to go over in detail, but the source is in the Rmd file. It spits out the account open date and the account closed date, which is NA if the account hasn't been closed, which happens to be the case for the first two here; if it has been closed, it has a date, and the date is guaranteed to be after the open date, which is obviously important.

Okay, so these are our ingredients, and this is what our recipe looks like. Our recipe is just a tibble, or a data frame with list columns if you prefer. It has five columns: variables, dependencies, function, function arguments, and keep. Variables specifies which variables in the dataset this row has instructions for; that can be a single variable or multiple variables, and that's important. Dependencies says which variables need to have already been generated before this row can be performed, and "no deps" is just a sentinel for the case where there aren't any. The function, which can be a string or a function object, is what function should be used to generate those variables. Function args are additional arguments that should be passed to it, and keep is whether those variables should ultimately be in the final dataset or not.

Once we have our recipe, we just call gen_table_data, tell it we want 500 observations, and give it the recipe. Because we gave it the dependency structure, it can tell which rows need to be performed before others, so we don't have to worry about that ourselves; we just put the rows into the recipe in any order we want, and it takes care of it, and out comes a dataset with the variables I described. It just so happens that all of the account closed dates are NA for the first six rows shown here, but there are non-NA ones in there as well.

A brief aside: I'm not going to go too much into this code, but you could do what we've done so far with dplyr; it's not particularly hard. But there are benefits to using this recipe structure. It's easily reusable, and it's automatically parameterized by the combination of func and func args. If I want a different distribution for a particular variable, all I have to do is swap out a particular value in a data frame and call gen_table_data again, and I have a completely new dataset that is, say, Cauchy distributed instead of normally distributed, or whatever change you want to make. The other really important thing is that it supports jointly generating multiple variables. We saw that the account open and account closed dates were generated by a single function at the same time. That can be because there's some covariance structure you want, or some constraint like we had here, where closed can't be before open. It's also really easy to add new rows to the recipe data frame, which translates to adding new columns to the data you're generating. So that's nice. Still, though, that's pretty straightforward.
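To make the shape of a recipe concrete, here is a hedged sketch of constructing one as a tibble with list columns. The column and function names mirror what's said in the talk, but I haven't checked them against the released package, so treat them as placeholders rather than the exact respectables API (the talk's Rmd has the real thing):

```r
library(tibble)

recipe <- tibble(
  variables    = list("stuff_level",
                      "key_size",
                      c("account_open", "account_closed")),   # generated jointly
  dependencies = list(character(0),        # nothing needed first
                      "stuff_level",       # key size depends on stuff level
                      character(0)),
  func         = c("sample_fct", "key_size", "account_dates"),
  func_args    = list(list(levels = c("low", "high")),
                      list(),
                      list()),
  keep         = c(TRUE, TRUE, TRUE)
)

# Per the talk, something like gen_table_data(500, recipe) would then resolve
# the dependency order and return a 500-row data frame.
```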
The thing that gets a little more complicated, and where I think respectables is going to be of more help to people, is when we're simulating under these foreign key constraints. This is a two-stage process. First, we translate the foreign table, the table the foreign key comes from, into what I'm calling a scaffold table. The scaffold table just has the actual rows from the foreign table, but the dimension has been transformed by replicating those rows out into whatever the new dimensions should be. So you can have foreign keys that appear multiple times, and foreign keys that don't appear at all; the point is that we're not generating new values of anything, we're just transforming the dimension. Then, once the dimension has been transformed, we do the same process I just described and create all of the new variables in the new table. By dividing it into this two-step process, each step ends up being quite simple.

For the rest of the talk, we're going to have these silly customers go shopping. Just for simplicity, because we don't have much time, some of them will never go shopping, so we'll have some IDs that don't appear in the purchase table; some of them go many times; and they all buy a single item whenever they go shopping. The item is random, and it doesn't matter what they've purchased before, but it does depend on whether that item is actually being offered at the time. That means we have to know when the shopping trip was before we simulate what was purchased.

Scaffolding is easy. Respectables provides a couple of different scaffolding functions, or function factories I should say, and the one we're going to use here is rand_per_key, which, as you might guess from the name, generates a random number of events, or rows, in the scaffold for each row in the foreign table, for each ID. We say that the ID column is the one we care about, the minimum number of shopping trips is zero, the maximum is five, and the proportion present is one. Proportion present is just another way of specifying how many of them had any shopping trips at all, but since we're allowing the minimum count to be zero, we don't need to worry about it here. That gives us a function (remember, this is a function factory), and we then call that with the silly customer data we made before, which gives us a scaffold table. Here we see how many of the customers were simulated to have each number of shopping trips from one to five. There were 500 separate customers in the data we simulated, if you recall, and only 418 of them appear in the scaffold at all, so it is indeed the case that some of them didn't go shopping, like we said.

The scaffolding join recipe looks pretty similar to the recipe we were talking about before. Again, it's just a table, or a data frame with list columns; in fact, some of the columns don't even need to be list columns. You specify the name of the table the foreign key comes from, the name of the key in that table, the function that is actually going to do the scaffolding, and any additional arguments to that function. That's it, and we'll use it in just a second. And then these are the products we're going to let them buy.
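Again as a hedged sketch of the structure being described, with placeholder column names rather than the exact respectables API:

```r
library(tibble)

scaffold_recipe <- tibble(
  foreign_tbl = "silly_customers",          # where the foreign key lives
  foreign_key = "id",                       # the key column in that table
  func        = "rand_per_key",             # scaffolding function factory named in the talk
  func_args   = list(list(min_count = 0,    # 0 to 5 shopping trips per customer
                          max_count = 5,
                          prop_present = 1))
)
```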
This is mostly here just for information. And then here are the ingredients of our new purchase table. We've got the buy date function, which takes the scaffold and an end date and generates buy ID and buy date, and then the prods function, which accepts something that comes out of the temp buys and gives us product ID and product description. The reason it takes the output of temp buys is that, as I said, not all of the products are always available, so we need buy date and we depend on buy date, which is why that's occurring there. Then we've got our recipe, which looks the same as last time. Then we call gen_rel_join_table (it's possible the names of these functions will change by the time it is on CRAN, but that's what it's called right now). We specify what I'm calling the scaffolding join recipe, the recipe for making the scaffold table; we specify the recipe for the new data; and we specify the database of things that have already been created. You won't actually have to call this manually, as I'll show at the end, but this is how we get this piece of it to work. And there you go: we've got ID, which is the customer ID from before, plus buy ID, buy date, product ID, and description. We can see that this time the first ID went shopping a lot; all four of the rows shown are for the first customer.

So we've got a way to generate a single table that relies on other tables through this foreign key mechanism, and we have a way of specifying tables that don't rely on foreign keys. But we're talking about a set of multiple tables, so how do you do that? Well, when you combine recipes, you get a cookbook, right? The cookbook does the same thing we've been doing this whole time: it's another table, with what table you're making, what the scaffolding recipe is, and what the data recipe is. The other thing it accepts, which I didn't have time to get into, is a recipe for missingness injection: you can inject missingness after the simulation step, at random or systematically or however you want, and the cookbook supports that. Then we just call gen_data_db and give it the cookbook. Notice we're not passing it any of the data we've previously created; it's actually going to create these tables in order. If we look at the head of each, we can see that, again, we've got these customers and then their buys. And that is it: we've got a cookbook that combines these recipes, and we can customize all of these things. And that is respectables in just over 15 minutes. Again, it will be on CRAN; it's not yet, but it is available on GitHub and is open source. With that, if there are any quick questions, I think I can take those before the next talk. Thank you for listening.

Thank you so much, Gabe, for the presentation. We'll keep an eye on the questions from the audience and take them at the end of the fourth presentation. So I'm going to move quickly to the next presenter, Peter Meissner. Peter is going to talk to us about going big and fast with the kafkaesque package. Peter is a consultant, a long-time R user, and both a package and a book author. He's a data scientist and web scraping veteran, and is also part of a local R user meetup group. He has several years of experience in analyzing and modeling data.
These days, Peter spends much of his time supporting both data science and IT projects as a technical expert in software development, data pipelining, and general IT consulting. So, Peter Meissner, please, the floor is yours. Go ahead and present.

Hello everyone, welcome to my talk, Going Big and Fast, where I want to talk about the kafkaesque package for accessing Kafka. I'm Peter Meissner, a consultant at virtual7, a software consultancy based in Germany and Romania, and personally I tweet at @peterlovesdata. So let's get started.

kafkaesque is an R package to connect to a Kafka cluster, to read data from Kafka and to write data to Kafka. Kafka is an industry-standard big data technology, and the package is available on GitHub at petermeissner/kafkaesque. You can install it easily with the remotes package, using the install_github function.

Why yet another package? There are two packages which also provide Kafka access: rkafka and franz. rkafka is on CRAN, Java-based, and served as a blueprint for kafkaesque; unfortunately, it's not maintained anymore and does not work with recent versions of Kafka. franz is on GitHub, C++-based, but never got finished.

So what is this Kafka, and why should I even care? Kafka is a message queue: you can put in data and later on retrieve it, first data in, first data out. Kafka is also a log, meaning it's ordered, it's persistent, and messages carry a timestamp. It's distributed over networks and servers. It's fault tolerant: your data is safe, and the system will still work if parts of it fail. Kafka is scalable, so you can always add more servers and get more performance out of it. It's asynchronous, so you don't have to consume data immediately once it's put into the queue; you can decide when to consume it. And it's a big data technology, so you can think big, you can think in high throughput while having low latency at the same time. It's an infrastructure technology, so think of it more like a database than an R package.

Kafka consists of various parts. You have the Kafka cluster, usually consisting of multiple Kafka instances; that's where the data is stored. Data is stored in so-called topics, which are something like database tables for messages. Those topics can also be partitioned, maybe by timestamp, maybe by some round-robin algorithm, or maybe by a key that you provide: data from Germany goes into one partition, data from France goes into another partition, and so on. Then you have producers, which can connect to Kafka and put in data, and consumers, which can also connect to Kafka and consume data, and both are just pieces of software.

Okay, but why should I care as a data scientist? Kafka is open and language-agnostic by nature, which makes it ideal for sharing data with other parts of a system, with other users, with other languages, between software components or parts of an IT system. You can, for example, access data already ingested in Kafka, or you can make it easy for producers of data to share data with you, and the other way around, you can have an easy time sharing data with other parts of the system. Kafka is low latency and high throughput, which allows you to use it for any kind of stream processing; think real-time or near-real-time analytics, processing, and data analytics applications.
It's distributed by nature, so you can spread data around, and you can also use it to spread tasks among R workers, for example. In particular, you can decouple the time and location of data production from the time and location of data consumption: it does not have to be the same server, and it does not have to be right now. Kafka is fault tolerant and comes with some data persistence guarantees, making it a nice tool for debugging, for exploring data after the fact, for replaying data if needed, and for auditing purposes. If you have to prove that things in your system have happened, or you need a way to see what happened when and where, Kafka might be a good choice. And since Kafka is great, it's a good thing that there's a package for that.

kafkaesque binds to Kafka's native Java libraries; Kafka is written in Java and Scala. kafkaesque uses rJava for R-to-Java communication, and itself only depends on jsonlite, data.table, magrittr, and R6, with a light dependency footprint on the Java side as well.

Now let's have a live demo of the package. The first thing we have to do is actually spin up a Kafka cluster. I prepared a Docker setup for this, starting ZooKeeper and Kafka and adding some messages to the system. Next we go into R: we load the library, create the consumer object, spin it up and let it connect to Kafka, retrieve a topic list, subscribe to a specific topic, and then start consuming messages one by one. We see that a message consists of a topic, a key, a partition number, a specific offset, a timestamp for when the message was created, and, here, the message value.

Instead of consuming messages one by one, we can also use another pattern: we start the consumer, spin it up, subscribe to a topic, and then use the consumer's consume-loop method to loop and constantly consume messages one by one. This will take some time; we wanted to consume 10,000 messages. It's done now: we consumed the messages, we also did some aggregations, and we see that consuming 10,000 messages took us around 11 seconds. In that example we consumed and processed the messages one by one, but we don't have to do it that way. Kafka provides a way to consume messages in batches, and I prepared an example; let's see how fast this is at consuming the same amount of messages. It only took 0.69 seconds to consume 10,000 messages. Producing messages is easy as well: we load the library, create a producer object, start it up, and can immediately send messages to specific Kafka topics.

Last but not least, let's have an example showing real-time interaction between consumer and producer. For this I prepared a consumer application that just prints out message values, and a producer application that puts messages into the user2021 topic. Now I switch to the command line, spin up my demo consumer application, and spin up my producer. We see the consumer has already consumed some messages, and now the producer will produce messages and the consumer should pick them up. And this is what we're actually seeing: a constant stream of messages is produced by the producer and consumed by the consumer. If we stop the producer, the consumer should also stop printing out messages, and if we start it again, this should start working again.
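A rough sketch of the flow Peter narrates. The constructors and method names below are illustrative assumptions mirroring the demo, not the verified kafkaesque API; check the package README for the exact calls:

```r
library(kafkaesque)

# Consumer side (method names are assumptions based on the demo narration).
consumer <- kafka_consumer()
consumer$start()                        # connect to the cluster
consumer$topics_list()                  # which topics exist?
consumer$topics_subscribe("user2021")   # subscribe to one topic
msg <- consumer$consume_next()          # one message: value, key, offset, timestamp, ...

# Producer side (again, assumed method names).
producer <- kafka_producer()
producer$start()
producer$send(topic = "user2021", msg = "hello from R")
```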
So this was a quick live demo showing what the package can do, how it works, and how you can use it for your analytics pipelines. There's more to it: you can tweak how Kafka behaves, you can add partitions to topics to make use of consumer groups, and so on, but for a live demo I think this should be all for now.

Having talked about Kafka, the package, and the live demo, it's time to look back. A lot of the work on the package was connecting R with Java, which was actually better than rJava's reputation suggests. It's a little bit low-level, everything is typed, and you're kind of restricted to method calls, scalars, and vectors, but basically it is okay. If you want to do something like that on your own, I'd advise using an existing project as a blueprint; I did it that way, and you can use kafkaesque as a blueprint. I'd also suggest using VS Code for the Java development; that was quite easy.

Concerning putting the package on CRAN: that's not done yet, it's not on CRAN. This was kind of a bummer. I tried to put it on CRAN, but the package and its Java dependencies sum up to something like 11 megabytes, and CRAN has some strict package size restrictions. So the only way for me is to develop my own routines or functions for downloading the Java dependencies, which I'm a little bit hesitant about, because I think there should be a common way to do it. Maybe in the future we can have something like that, or the community can work on it together; that would make life much easier.

To conclude, Kafka is a solid technology with use cases in data science and data analytics as well. kafkaesque might be a little bit limited to text data at the moment, but it can be extended, and it's at the moment the only working R binding for Kafka, allowing the R community to control the major Kafka API endpoints and to have access to this industry-standard big data technology. Thanks everyone, that's all from me. See you, bye-bye.

Thank you so much, Peter, for the presentation. I think we have a couple of minutes, maybe enough for a question or two, but I think we'll just quickly move to the last presentation so that we have time for more questions at the end. Okay, great. Our last presentation is titled Scaling R for Enterprise Data, and our presenter is Mark Hornick. Mark is a senior director of product management for Oracle Machine Learning, and he has over 20 years of experience integrating and leveraging machine learning with Oracle technologies, working with both internal and external customers. Mark is also the Oracle representative on the R Consortium Board of Directors. Mark, please take the floor and present.

Hello and welcome to this session on Scaling R for Enterprise Data. My name is Mark Hornick, senior director of Oracle Machine Learning product management. R is a powerful analytical tool that has made significant performance gains with each new release. However, with ever-increasing data volumes, it can be the data itself that poses some of the biggest challenges. What can we do to avoid or minimize moving data, yet still get the benefits of R? As we discuss in this session, Oracle Machine Learning for R (OML4R) enables you to work with database tables and views using familiar R syntax and functions, among other capabilities. Let's first take a look at some common enterprise machine learning pain points. Common themes include data access latency: taking too long to get or load data.
Sometimes data access requires making explicit requests to DBAs for data extracts, and it may take a few iterations to get the needed data. Users may have programmatic access to data sources, but this still requires pulling data into separate analytical engines. Next is a lack of scalability and performance: does all the data fit into memory? Can the algorithms take advantage of multiple CPUs? Another issue is complexity and the time it takes to put models into production. You've likely heard the statistic that 87% of data science projects fail to make it to production, or that only 20% of such projects deliver business outcomes; deployment complexity contributes to these statistics. In other cases there are concerns about data security, backup, and recovery: how are copies of data managed and secured? In deployed solutions, how well is backup and recovery addressed and tested?

Let's look at a few specific scenarios. When it comes to enterprise data, data scientists commonly deal with multi-gigabyte datasets, likely several of them. Just loading data into R's memory can take a surprising amount of time. For example, as a baseline, using read.csv on a gaming PC with 16 GB of RAM to load a 4 GB CSV file of six columns and 100 million rows took eight minutes. Doing that too frequently could affect your productivity. Assuming linear scaling, a 16 GB file would be expected to take over half an hour, and a 32 GB file could take over an hour. We can use the data.table package's fread to do significantly better, but this is still not ideal, and we still have a 3.6 GB R data frame in memory. More advanced packages like vroom don't load the entire file; instead they index where each record is located, and data is read when you use it, but this has some drawbacks as well. Jim Hester has some excellent videos analyzing the performance benefits of vroom; you should check them out. (A short code sketch of this loading comparison follows after this passage.)

The sampling dilemma we're faced with is this: to fit in memory, data needs to be sampled, but to be sampled, data needs to fit in memory. Ideally, you'd be able to determine your sample at the source and pull exactly the subset of data you're interested in. Newer capabilities like those found in the vroom package have many benefits in this space, but not necessarily for database data.

Now, deploying a solution often involves a job scheduling environment and then starting the R engine. We then load the R script, likely need to load source data for model building or data scoring, and possibly read or write models or other objects to and from R. We might store results back to the data source, perhaps programmatically, and finally we stop the R engine until the next time this needs to be run. So while deployment complexity may vary, there are at least a few potential failure conditions that need to be accounted for.

We overcome these and other challenges by leveraging Oracle Database as a high-performance compute engine. OML4R provides immediate access to database tables and views using R data frame proxy objects. These proxy objects overload familiar R functions and produce SQL transparently behind the scenes, for scalable, high-performance in-database processing without moving data to a separate server. In-database parallelized and distributed machine learning algorithms further eliminate data movement and take advantage of multi-node and multi-processor hardware.
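To make the earlier loading comparison concrete, here is roughly what the three approaches look like; the file name is made up, and the timings quoted in the talk will vary by disk and machine:

```r
# Baseline: base R, slowest for large files.
system.time(df_base <- read.csv("flights_4gb.csv"))

# data.table::fread is typically much faster, but the full data frame
# still lives in RAM.
library(data.table)
system.time(df_dt <- fread("flights_4gb.csv"))

# vroom indexes the file and reads values lazily as you use them.
library(vroom)
system.time(df_vroom <- vroom("flights_4gb.csv"))
```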
R users can create and store user-defined R functions, as well as R objects like ML models, in the database, avoiding the need to manage and keep track of separate flat files for deployed solutions. For enterprise application and dashboard developers, accessing database data and results via SQL is pretty much routine, so the ability to invoke user-defined R functions from SQL can speed production deployment in the final stretches of an R-based project.

Using proxy objects in the transparency layer, users operate on database data with familiar R syntax and overloaded functions. Because query optimization, column indexes, data parallelism, and even storage-level partitioning benefit SQL query performance, data exploration and preparation from R gain the same benefits. In-database algorithms are exposed through familiar and seamless R interfaces that operate on proxy objects and use the R formula specification. The resulting models reside in the database as well, with their own proxy objects, both for prediction and for inspecting model details. Through embedded R execution, user-defined R functions can be invoked in database-spawned and controlled R engines; there's no need to configure parallel environments or frameworks. Embedded R execution simplifies data-parallel and task-parallel processing.

So how do we do this? OML4R introduces proxy objects that behave like R data frames but map to database tables and views. Here we show how the ore.frame proxy is a subclass of R's data.frame class. While the R data frame contains the actual data, the proxy object contains metadata, including a reference or query to the database table or view.

Regarding the benefits of OML4R algorithms, consider this example. We built an R linear model using lm on a 100-million-row database table with seven variables, to predict arrival delay from on-time flight data. Our VM has 32 cores and 2.8 terabytes of RAM. This required loading 7.6 gigabytes of data, which took about 21 minutes. Building the lm model, single-threaded, took 38 minutes, so the total time to get the first model was 58 minutes. Using models that operate where the data resides, in the database with OML4R, even when run single-threaded we see a 1.6x performance improvement, simply because we didn't need to move the data to a separate analytical engine. As we increase the number of threads to 64, parallelism brings the initial hour down to 1.3 minutes, a 42x performance improvement. Many factors affect machine learning performance: not just data volume, but also algorithm choice, the number of concurrent users, the load on the system, and the available hardware.

As we see here, the combination of in-database parallel computation and no data movement has significant benefits. This plot shows essentially linear scalability across multiple in-database classification algorithms, with data ranging from 100 million to 800 million rows. At the high end, we build a Naive Bayes model on 800 million rows in just over two minutes, and an SVM model in under 16 minutes. These results were run on Oracle Autonomous Database in the Oracle Cloud using OML4SQL, our SQL interface to the in-database algorithms, but OML4R exposes the same in-database algorithms from the R interface for on-premises and database cloud service databases.

In this example, we illustrate how embedded R execution provides data-parallel functionality to build multiple models, where the data is partitioned by the values of a particular column or set of columns.
Perhaps we want to build one model per customer or per zip code, even using third-party packages. With the data accessible as a database table or view, we write an R script that builds a model for an individual customer or zip code. This is then wrapped as a user-defined function that is stored in the R script repository. Using the group-apply functionality spawns the requested number of R engines in parallel, loads the user-defined function, and automatically passes one partition of data to each R engine to build models until all the data is processed. The resulting models, instead of being stored in multiple R data files, can be stored in the database in an OML4R datastore.

Here's the corresponding R function invocation that realizes the scenario. Using the function ore.groupApply, we pass in our customer data proxy object and say that we want to partition on customer ID. Then we pass in the user-defined R function, which takes two arguments: the data, which will be automatically passed in as an R data frame, and the name of the datastore where we want the resulting model stored. We build an lm model, which could also leverage one or more third-party packages, and use ore.save to store the model in the datastore. To complete the group-apply function arguments, we pass in the string for the datastore name, indicate that we'll be connecting back to the database, and say that we want to run this in parallel.

So let's switch to a quick demo of OML4R. In this demo we're going to use the RStudio IDE, although many IDEs are compatible with Oracle Machine Learning for R. In the transparency layer, we first load the ORE library; ORE stands for Oracle R Enterprise, the previous name of this component. Next, we connect to the database. To get the proxy objects we want, we use ore.sync on the two tables we'd like to use in this demo, and we attach the environment to the R search path. Next, we can see the tables that are available in this environment. In this case, NARROW is an ore.frame, a proxy object, and we can use overloaded functions to see which columns are present, as well as its dimensions. We can compute summary statistics on this ore.frame, which does all the computation in the database, so whether we have 1,500 rows or 15 million rows, there's no data movement except for the final results. We can also pull data from the database into the R environment, and we see that we get back an R data frame. Here we extract the year, destination, and arrival delay columns, and we see that the class of the result is also an ore.frame proxy object; we view its dimensions and the first few rows.

Loading the OREdplyr library allows us to use overloaded dplyr functions, whether to select columns or filter rows. For joining data, let's create a couple of data frames and merge them using the standard merge function. Next, we drop tables from the database and then create two tables based on the data frames we just created, DF1 and DF2. We can use the overloaded merge function to do the join in the database itself, and of course we can use the overloaded dplyr function to do this join as well. For machine learning, let's build an ore.lm model to predict arrival delay based on distance and departure delay for our proxy object ONTIME_S. We can get the summary results as well, and we see that this is exactly what we would expect from R's lm function, but it is able to take advantage of parallelism.
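Pulling the narrated demo together, here is a hedged sketch of what those calls look like. The connection details are placeholders, and the exact argument names (particularly for ore.groupApply and ore.save) should be checked against the OML4R documentation rather than taken as definitive:

```r
library(ORE)

# Placeholder connection details.
ore.connect(user = "oml_user", sid = "orcl", host = "dbhost", password = "...")
ore.sync(table = "ONTIME_S")
ore.attach()

class(ONTIME_S)              # "ore.frame": a proxy object, no rows pulled into R
dim(ONTIME_S)
summary(ONTIME_S$ARRDELAY)   # computed in the database

# In-database regression on the proxy object, using the familiar formula interface.
mod <- ore.lm(ARRDELAY ~ DISTANCE + DEPDELAY, data = ONTIME_S)
summary(mod)

# One model per destination, built by database-spawned R engines and saved
# to a datastore instead of local .RData files.
ore.groupApply(
  ONTIME_S,
  INDEX = ONTIME_S$DEST,
  function(dat, ds_name) {
    mod <- lm(ARRDELAY ~ DISTANCE + DEPDELAY, data = dat)
    ore.save(mod, name = ds_name, append = TRUE)
    TRUE
  },
  ds_name = "dest_delay_models",   # passed through to the function
  ore.connect = TRUE,              # spawned engines connect back to the database
  parallel = TRUE
)
```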
Next, let's use the Titanic dataset with the in-database Naive Bayes algorithm. We first create a temporary table of the Titanic data in the database, do some recoding, and create a factor for survived, mapping zero and one to no and yes. We can summarize this data in the database as well, and then we sample our train and test sets; we're using row indexing, which allows us to get exactly the rows we want for our sample. We can specify priors for the Naive Bayes algorithm and then build the model to predict survived based on a number of other variables. We predict on our test set, and again the results of prediction in the database are also an ore.frame, a proxy object, because these results can themselves be very large; we get the first few rows of the result. With the results still in the database, we use the overloaded table function to compute a confusion matrix.

So let's move on to embedded R execution. In the case of group apply, let's say we wanted to build one linear model per destination to predict arrival delay. Here we do our data preparation; we see that we have 51,000-plus rows and 25 columns, and then we invoke ore.groupApply, partitioning the data on destination. You see our function is very simple: build a linear model to predict arrival delay based on distance and departure delay. We can see the summary of the Boston Logan Airport results. Now, switching to the SQL Developer environment, we can not only create this function in the R interface, we can also do the same thing in SQL, and then we run our function, which generates images of random red dots in the database. We can then invoke a SQL query to return the ID and the image from our function invocation; looking at the results, we see our images returned from the database. Now let's say we wanted the structured content. We can return the ID and value that we created as the return value of our function's data frame by specifying the table definition, in this case select 1 id, 1 val from dual, and you see that we get back a table that could be joined with other tables and views as well. For more information on Oracle Machine Learning, go to oracle.com/machine-learning. Thanks for learning more about Oracle Machine Learning for R.

Fantastic presentation. Thank you so much, Mark Hornick. I've seen quite an interesting discussion on the Slack, and I encourage everyone to head there and engage with the speakers. We've saved a couple of minutes, so I'm just going to pick a few of the questions that have come up so we can wrap up the session.

Okay, so there are three questions for Neal Richardson. The first question is from Eva Castillo; I hope I pronounced that name correctly. Can the arrow package be used to work directly with Athena queries in AWS, that is, Amazon Web Services? Neal?

Yeah, thanks for that. That's a good question. If I'm not mistaken, AWS Athena can emit data in the Arrow format, so you could write a wrapper that would query Athena. I was just Googling this after you asked, and I think there are one or two wrappers out there using JDBC to connect to Athena, so you probably could speed that up a little bit by getting Arrow into the mix there. But out of the box, no, it does not do that.

Okay, thank you. The second question is: what's the benefit of using Parquet over the Feather format?
I think this is a trade-off question, right? Yeah, so Parquet is an on-disk file format, whereas Feather, which is the name of the Arrow file format, is literally just the Arrow memory written to disk. So there are some trade-offs there. On the Arrow website there's an FAQ that goes into this a little bit, and at ursalabs.org, on our blog last year, we did an exploration of Feather versus Parquet and a discussion of what some of the trade-offs are. Basically, Parquet is going to be smaller on disk; it has features that allow it to compress and encode data efficiently, and there are also column statistics and other things in the format that we can take advantage of when scanning across a big dataset, letting us skip over big chunks of data if our filter doesn't match any of those rows. Those are some nice features on the Parquet side. On the Feather side, there's no processing required to start working on it: you don't have to decode or decompress anything, so depending on your disk speed that might actually be more advantageous. So there are some trade-offs, but check out the Arrow website and the Ursa Labs blog if you want to learn more.

Okay, that was actually a question from Daniel. The last question is from Peter Meissner. Peter is asking whether we can do in-memory sharing yet, that is, sharing memory with Arrow.

So you can essentially share memory within a process, for example between Python and R: if you are using reticulate in R to use some Python project and it is yielding Arrow data, you can take that into R and you're sharing that memory, taking it without moving or copying it. That's in-process. If you're talking about sharing across processes, there's a service called Plasma that's part of the Apache Arrow project. It's primarily a Python project, and it's not super actively maintained these days, but it's a shared-memory object store, and you can use it to share across processes.

Okay, thank you. Related to that, has there been work on backing an ALTREP vector by Arrow? I talked to Wes about that a few years ago at this point, and there were some gotchas around the fact that you're using bitmasks and such, but they seemed solvable. Has there been work towards that?

Yeah, there's been a little bit recently, and I think in the next Arrow release, coming out later this month, there'll be the first pieces of ALTREP from Arrow to R. It definitely seems tractable, and as I've learned more about it, and from what François has been working on for those integrations, we're starting to bring some of that in. I think we can make it work even with the issue of the NA sentinel versus the validity bitmask; maybe there are some things we can do to finesse that too. But yeah, it's coming, and it looks pretty nice when it works.

Okay, thank you. Thank you, Gabriel, for the contribution. So I have a couple of questions for Peter Meissner. Peter, the first question comes from Matt Mannet; I hope that's the correct pronunciation. Peter, what do you use it for in practice, and what are your applications for Kafka?

Yeah, I had some use cases in the talk. Basically, accessing data that's already in Kafka is, I think, the straightforward use case. But why we got into using it in the first place was something else.
It was about having a way to distribute work among servers. We wanted to spread out work packages to some simple R scripts which were doing some jobs, and we did not want to be imperative: we didn't want to push the work to the workers, we wanted to just spin workers up, or scale them down, and when they're ready, they pull in some new work. For this, Kafka was very nice because it has a lot of nice properties. You can have consumer groups, so a lot of workers can share the same offset when reading the messages, which means they coordinate in a way, and you can also replay data. So if you made some mistake, especially during development, you can say: okay, we messed up, please go back and just do it again. That was our use case: basically building a massively parallel web scraping framework with R processes. I hope this answers the question.

Thank you. Gabriel Becker was also wondering whether you have looked into combining the kafkaesque package with the future package. Yeah, I mean, in my mind the future package is super cool, but when you use future you kind of push work or data around, right? So it's not independent in a way, and so for me those two concepts were always kind of separate. But there was a comment by Gabriel that both things are kind of asynchronous, so there might be something very interesting there, making Kafka requests in the background, being non-blocking, or something like that. That's something I haven't thought about, but it might be promising, so it's worth thinking about for the future. Yeah, thanks.

Thank you, Peter. In the interest of time, I just want to ask one question of Mark Hornick. I think this is also a question from Gabriel Becker. Mark, what happens when you subset a proxy object and then subset the result again? I hope that comes across well.

Sure. When you subset a proxy object, essentially you're getting another proxy object back. Whatever the corresponding SQL is that would be produced by the operation you did in R, you're essentially creating a view, and the new proxy object refers to that view. So you end up getting stacked SQL if you invoke multiple operations in sequence.

Okay, thank you. I think our time's up. Just to remind you, the sponsor of the day was Appsilon and the session sponsor was cynkra. I thank everyone for taking the time to attend, and I also thank the presenters for taking the time to put these pieces together and present. So, I'll hand it back over to you.