We're going to be going through "Big Memory for R" for the first talk, then "Bridging the Unproductive Valley: building data products strictly without magic", followed by "Serverless data science with R and OpenFaaS". The sponsor of the day is Roche and the session sponsor is cynkra. So, Jingchao or Austin, if you could please share your slides. Okay, no problem. Sure.

Hello, everyone. My name is Jingchao Sun from MemVerge. Today we're going to talk about big memory for R. This is actually a joint presentation with Austin, who is a research scientist from TGen, and Dr. Khan from Analytical Biosciences.

First, the motivation for big memory for R. R is very popular right now in a lot of different areas. However, as we are in the big data era, we are facing a number of challenges with R. For example, the data is becoming bigger and bigger, and because of that we are seeing very long execution times, which means a higher chance of a crash or failure in a data pipeline. To protect against failure, a lot of developers and scientists need to save the data frequently and load it back, and that IO costs a lot of time, especially when the data is large. And although R supports multi-threading now, there is a lot of legacy code that does its processing sequentially. That's another challenge.

All of these drawbacks really come from the computer architecture. We have two kinds of media: DRAM and storage. DRAM is very small and very expensive, but it has one great benefit: it is extremely fast. Storage, such as SSDs, is large and cheap and has the capability of persistency, but it is very slow. So there is a huge performance gap between DRAM and storage, and this kills performance, especially when you want to save data from DRAM to storage in order to get persistency. What we want to do is eliminate that IO. We want to make use of a new kind of hardware called persistent memory to remove the IO between DRAM and disk, so that we get a large memory capacity with performance close to DRAM. And, more importantly, no application running on top of our software needs to be changed.

The new hardware is persistent memory, introduced by Intel four years ago. It has very large capacity and it's cheap, much cheaper than DRAM, with similar performance: a little slower than DRAM right now, but much, much faster than an SSD. More importantly, it has the capability of persistency, which DRAM does not have.

On top of this hardware, our big memory computing architecture looks like this. At the bottom are the DRAM and PMEM hardware. In the middle is Memory Machine, software that helps users manage the data across DRAM and PMEM. Applications running on top of Memory Machine do not need to care whether the data is in PMEM or in DRAM; you just write your code the way you did before, and we manage the data placement in DRAM and PMEM for you. Usage is very simple: you can use the GUI, the command line, or a REST API.
And what's the benefit? The first one, of course, is that you get a tiered, very large memory: the capacity is the DRAM size plus the PMEM size, and as I introduced, PMEM can support up to six terabytes per machine. That's a huge amount of memory, which is not possible in a conventional system. We put hot data in DRAM and cold data in PMEM, which keeps things fast, and we do the swapping automatically: if data becomes hot, we move it to DRAM; if it becomes cold, we move it to PMEM. The user does not need to worry about it. The ratio of DRAM to PMEM is also flexible, and users can adjust it dynamically themselves. If you are cost sensitive, you can use more PMEM and less DRAM; if you are performance sensitive, you can use more DRAM and less PMEM. In the future, we will also support remote memory, which means you can use RDMA to borrow another machine's memory and grow this memory pool dramatically.

Another very big benefit is snapshots, snapshots that involve no IO. As I mentioned, persistent memory has persistency, which DRAM does not. In a conventional system, if you want persistency, you need to write the data from DRAM to disk, with a serialization step and a transfer from memory to disk. Because PMEM has persistency, we do not need to move the data from memory to disk; the data in memory already is persistent. The only thing we need to do is flush the data from DRAM and the CPU cache to PMEM, which is extremely fast and finishes in seconds. If you take multiple snapshots, we only store the difference between two snapshots, so that is also extremely fast. And if you want to restore a process, you take a snapshot of it, and later you can restore the process and let it run again. It can be restored into different namespaces, so you can run multiple copies of the process in parallel. We also support replication: you can copy a snapshot from one machine's memory to another machine's memory over the network instead of through IO, which is extremely fast. And for users who need files, we can also support exporting a snapshot from memory to disk.

Then the application scenarios. First, this is very useful for R jobs that require a huge amount of memory. Second, it's very useful for long-running jobs: it can do auto-save, so if a job fails you can restore from the latest snapshot. It's also very useful for debugging, especially when your job needs to run for a long time before it hits a bug. Reproducing that bug would normally take a long time, but if you take snapshots, you can reproduce the issue quickly and easily. And for workloads like iterative analysis, you might fit machine learning models and want to try different parameters to reach a final result. Instead of loading the files from disk multiple times, you can restore directly from memory in seconds.
That greatly reduces the time you spend on this. And, as I mentioned, you can use restore to parallelize your R job across different namespaces, which also accelerates your processing dramatically. I think that covers the basic introduction of our big memory framework. I will now hand over to Dr. Khan to talk about his scenario, and then to Austin to talk about his. I will play the slides for Dr. Khan. If you want the full version of this presentation, you can find it here.

Hello. I'm really glad to be introducing our work at Analytical Biosciences on accelerating single-cell sequencing analysis with big memory technology in R. A little background: single-cell sequencing is a technique that has emerged in recent years, with which we can profile the transcriptomes and genomes of individual cells. It helps answer important biological questions, such as what the identity of a cell is and what cell components make up a microenvironment. It's very typical for a disease like cancer to have a very complex microenvironment consisting of different cell types with dramatically different functions. Using this technology, we can also interrogate the functional characteristics of different cell types and unveil some of the cell dynamics; for example, we can track individual cells, their movements, and their transitions into other cell types. We can also do functional analysis on important components, for example the TCR and BCR sequences, which in turn become the antibodies our body uses to target pathogens. Using a number of techniques, which I will introduce later, we can stratify the identities of these cells and, for example, identify which genes are turned on or off along a developmental trajectory where progenitor cells turn into mature cell types.

The general workflow of single-cell analysis looks like this. We use a number of mathematical transformations and bioinformatics methods. Essentially, deep single-cell analysis is largely data science with very high dimensions and intensive modeling. For example, we might face a matrix of 1,000 or 10,000 genes by one million cells; this is very typical. We then apply feature selection, dimensionality reduction, and further steps to interrogate higher-level information about the cells.

By using the big memory technology introduced by our friends at MemVerge, we transformed our conventional workflow into a new Memory Machine based workflow where, at each stage, instead of loading files and saving files as we would in a typical R scenario, we snapshot the R process and restore the R process. Using this method, we can restore in seconds for typical workloads, compared to the conventional workflow, where we typically see minutes of overhead to dump the data into R data files and load it back for a new iteration of the analysis. It's important to mention that many of the steps need to be repeated three to ten times to find the best, or optimum, parameters, so even more time is wasted waiting for data to write and load. We designed a benchmark strategy that takes our reference single-cell workflow and runs it in either the conventional working environment or the Memory Machine based environment.
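To make the contrast concrete, here is a minimal sketch of the conventional checkpointing pattern that the snapshot-and-restore workflow replaces. The stage names, file paths, and transformations are illustrative only, not the actual Analytical Biosciences pipeline.

```r
# Conventional workflow: every stage is written to disk and reloaded later,
# which for multi-gigabyte single-cell objects can take minutes each time.
counts <- readRDS("counts_matrix.rds")                         # genes x cells count matrix

norm <- log1p(sweep(counts, 2, colSums(counts), "/") * 1e4)    # per-cell normalisation
saveRDS(norm, "01_normalized.rds")                             # checkpoint 1: slow file IO

norm      <- readRDS("01_normalized.rds")                      # reload before the next stage
top_genes <- order(apply(norm, 1, var), decreasing = TRUE)[1:2000]
pca       <- prcomp(t(norm[top_genes, ]), rank. = 50)
saveRDS(pca, "02_pca.rds")                                     # checkpoint 2: slow file IO again

# In the Memory Machine workflow described above, each saveRDS()/readRDS() pair
# is replaced by snapshotting the live R process and restoring it in seconds.
```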
We have achieved some very promising results using big memory technology. For example, for a dataset of one million cells, the full analysis takes about six hours conventionally. If we switch to the Memory Machine platform and break up the phases using snapshots instead of file IO, we eliminate the file IO overhead and complete the full analysis in only 2.5 hours. That gives us more time to work on the actual biology and reduces the time to insight, which matters in some cases; for example, we want results as soon as possible when analyzing new or urgent data, such as the COVID data we are currently working on at Analytical Biosciences.

Apart from the file IO overhead, there are other challenges facing single-cell data science that call for this evolution. We have observed exponential growth in the number of cells per study over the last few years, and the most recent study has more than two million cells. There are also other modalities of data beyond RNA: we now have single-cell DNA data and single-cell chromatin data thanks to technological advances. So the growth of data really is exponential. But the maximum number of elements in an R matrix is restricted to 2^31 - 1, about 2.1 billion, because of constraints in the hardware assumptions and source code of R. If we try to keep a matrix of one million cells, we end up with room for only about 2,000 features if the matrix is dense. Such a matrix may only be around 20 gigabytes in memory, but we could not allocate it due to the restrictions in R.

Finally, at Analytical Biosciences our goal is to enhance biomedical research with cutting-edge single-cell technologies, including deep single-cell profiling, in-depth bioinformatics analysis, and big data mining. None of this is possible without advanced computing and analytics power, which is why we're very excited to participate in this event, use R, and share this new technology with everyone. Thank you.

Okay, next I will hand over to Austin, and I will play the slides for him.

Okay, thank you for being here today. My name is Austin Terres and I'm a bioinformatician in Dr. Nicholas Banovich's lab at the Translational Genomics Research Institute in Phoenix, Arizona. Today I'm going to talk about how we use big memory and R to study the genetics of a rare disease called idiopathic pulmonary fibrosis, or IPF.

Next. IPF is a progressive disorder of unknown cause that leads to respiratory failure and death. It's a rapidly progressive disease that leads to either death or a lung transplantation within three to five years of diagnosis. Unfortunately, there is no known cure, and the current treatments only slow the progression of the disease, with major side effects.

Next. We use a technology called single-cell mRNA sequencing to characterize the transcriptomic profiles of individual cells from 25 fibrotic and 10 healthy lungs. This is a general overview of the workflow before sequencing and our data analysis in R.
Next. After sequencing and initial processing in R, we use a dimensionality reduction algorithm called UMAP for both visualization and clustering. Every dot in this graph is an individual cell for which we have gene expression counts. After that, we manually classified each cell by its gene expression profile. Our dataset contains nearly 40 cell types from the human lung, each with its own distinct transcriptomic profile. We identified four major cell populations: immune, endothelial, mesenchymal, and epithelial cells. Today we'll be focusing on the epithelial cell population.

Next. When we were classifying the cell types in the epithelial cell population, shown here on the left, we noticed a distinct population that couldn't be characterized by traditional gene markers, and we noticed this cell type was specific to fibrotic lungs. This led us to believe it was a novel, uncharacterized cell, which we named the keratin 5-negative / keratin 17-positive cell type. As you can see on the right, the gene expression profile of the keratin 5-negative cells does not match that of any other epithelial cell type.

Next. Because the keratin 5-negative / keratin 17-positive cells were disease specific, we used fluorescent probes that bind to the keratin 17 gene to determine where these cells are located in the human lung. We saw that they localize in the fibrotic regions of the lung, and we think they play an important role in the progression and regulatory processes of pulmonary fibrosis.

Next. If you'd like to take a deeper dive into our analysis and results, you can check out our publication, which was featured on the cover of Science Advances nearly one year ago today.

Next. Now let's move on to the computational challenges we encountered during this analysis. Single-cell data is generally stored as a matrix where rows are genes and columns are cells, and this matrix contains observed counts for each gene in every unique cell. In our dataset, we had a matrix of 30,000 genes by 114,000 cells, which requires a large amount of memory in R. In addition, our datasets keep growing in size, which means our memory and computational requirements also keep growing.

Next. To speed up this analysis pipeline, we teamed up with MemVerge to accelerate our processing workflow. Our analysis pipeline took six and a half hours from start to finish, because our pipeline is stubbornly single threaded. But by easily utilizing the snapshotting and cloning capabilities of Memory Machine, we were able to parallelize the processing in our pipeline. As a result, we can now save nearly 36% of the computational time while also taking advantage of the big memory Optane nodes. This will save our team a significant amount of time for downstream analysis, and we're really excited to continue working on this project together.

Next. Yes, all the code and data is available for download on our GitHub page. And next, you can download the full presentation at memverge.com. Thank you.

Please give a hand to Jingchao, Chris, as well as Austin. Due to time, we won't answer questions just now, but we do have ten minutes at the end of the session where all the presenters will be available to answer your questions. So please feel free to keep asking, and we'll come back to the questions at the very end.
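A quick back-of-the-envelope calculation in R, using the figures quoted in both talks, illustrates why these matrices strain a conventional workstation. The numbers below are derived from the talks rather than measured, and whether an allocation actually fails also depends on the R version and the functions involved.

```r
gib <- 2^30                             # bytes per GiB; dense doubles use 8 bytes per element

# Austin's matrix: 30,000 genes x 114,000 cells
30000 * 114000 * 8 / gib                # ~25 GiB if stored densely

# Dr. Khan's example: one million cells against R's classic 2^31 - 1 element limit
.Machine$integer.max                    # 2,147,483,647
.Machine$integer.max %/% 1e6            # ~2,147 dense features alongside 1e6 cells
1e6 * 2147 * 8 / gib                    # ~16 GiB, yet the allocation can still fail in practice
```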
Please welcome next Max Held, who is going to be talking to us about bridging the Unproductive Valley. Max, please take it away.

Good morning, everyone. Can you hear me and see the presentation? Yes, I can. Great. Good morning, everyone. My talk is about bridging what I've dramatically called the Unproductive Valley: building data products strictly without magic.

First, a little bit about the motivation. Perhaps somewhat counterintuitively, I'd argue that the reason we use R is scalability. Now, R isn't exactly known for being super scalable, but that reputation is about scalability at runtime, and in mostly scripted data science, runtime isn't our biggest concern; the limiting factor is development time. We usually don't have thousands and thousands of users on our dashboards or other data products. Our limiting factor is how fast we can get ideas for an analysis from the developer's or data scientist's head into code, and then show the results to our stakeholders. And I mean scalability here broadly: not just scaling by the rows in your data frame or what have you, but broadly, allowing projects to become big and very productive. For that, code needs to be very expressive, and I think that's what R is famous for. It's not that its for loops are particularly fast, but it has a lot of idioms, native vectorization, and a bunch of other things that make code very expressive. Onboarding should be fast, so new data scientists can join the team quickly. Turnaround should be fast, meaning you should be able to add new features relatively quickly. And of course, like any other software project, it should be maintainable: it should be possible to pass on responsibility to someone else, take on new team members, and so on.

And here I think we face a bit of a problem. Now, this graph is of course a little loosey-goosey, because I don't have empirical data here. I'm charting the number of person-hours contributed to a given project on the x-axis and, on the y-axis, the visible output, the output your stakeholders care about. Traditionally, with something like Excel, or maybe Tableau or other point-and-click software, you get to outputs relatively quickly: a pretty steep curve at the beginning, which then flattens off as complexity eats up your productivity, so you can't deliver many more features, or the thing just becomes too big. That's why we've always promised that for real data science you need scripted data science, using R or Python for example. But I think this doesn't always work. We advertise it, but it doesn't always work. What always happens is that your initial output, the output the stakeholders are interested in, is much smaller. You always have this initial trough in your productivity, not necessarily your real productivity, but the productivity your stakeholders see. It takes more time to set things up. And of course we do this because we promise that in the end, once we've invested a little more, we'll get super productive and there will be all these amazing dashboards and insights we can give to our stakeholders. But sometimes this doesn't work out.
And that's what I'm worried about. It's a bit unpleasant, but I think we have to consider the possibility that you end up with the worst of both worlds: you have the high upfront cost that comes with building real software as opposed to quick-and-dirty one-off scripts, but you also don't really scale, because what you have is poorly defined, hacked-together, quick-and-dirty scripts. That is the cell here, high upfront cost and does not scale, the prototype-as-a-script, that we really want to avoid. Because, quite honestly, for a lot of organizations, if this is what they're faced with, they may be better off either doing it in point-and-click software or even in spreadsheets, or, if they need something that scales but cannot afford the high upfront cost, just using software as a service. That doesn't exist in all industries, but Google Analytics would be an obvious example, and there may be others elsewhere. What we do want to be building is what I'll call, for the purpose of this presentation, data products as real software.

Now, this sounds a little too clever, and you might be asking: haven't we been doing this already, and what does this Max Held guy have to say about it? I'm not even a computer scientist by training or anything. The only thing I have is a lot of scars to show for it. If you look at a lot of my GitHub repos towards the end of these projects, many of them have commit messages like this. So I've been doing this for a couple of years, and I have the scars to show how these things can fail to work out. Muggle is our attempt at the University of Göttingen to bring together the ideas we've had, wrap them up, and suggest to other people how some things might work better.

Muggle is, on the one hand, just an R package. You can find it on GitHub, and this is the landing page. It doesn't do an awful lot; mostly it's a wrapper around a relatively fancy Docker container, plus a bunch of usethis-style setup and helper functions. Now you might be asking: do we really need another one of those? There's a bunch of these helper DevOps-type packages out there. To briefly summarize how muggle is different: we do a lot less, and we take being non-magical very seriously. Other projects like renv do a lot more and impose a greater overhead, with more things to worry about; they can also do a lot more. We do a lot less and try to be very slim. The Rocker project, an obvious comparison, has containers that are ready-made for interactive use, but they're much more generic and, as a result, much bigger. And then there's a whole ecosystem around holepunch and Binder. We also try not to do what they do, which is to infer developer intent. We try to be very strict about never inferring developer intent, never doing anything magical, and never polluting the source tree with boilerplate; basically everything committed in our projects has been written by a human.

Now, a lot of what I'll talk about is run-of-the-mill software engineering advice, not particularly original, just applied to R. I really recommend The Pragmatic Programmer.
It came out two years ago, I think, in a 20th-anniversary edition by David Thomas and Andrew Hunt. I really recommend it, and a lot of this is based on it.

We have three big lessons that I'll focus on, which muggle institutes. The first is that everything we do is a package. In all of our data science practice, we try never to do one-off scripts or even one-off dashboards; everything we do, we turn into a data product and ship as a package. For example, this is one of the projects we're currently working on, and you'll recognize this kind of website: it's a package, with a website, and everything we do always lives in packages. You could click through to the reference here, but that's not so important. There has been some debate about this "project as an R package" design, and we try to get around some of the problems of that approach: we don't ship data in the packages, and we try to have a lot of packages, so that any given project has a very thin package which really only does user-facing things, like rendering the output or creating a dashboard, while a lot of the deeper business logic and data transformation happens in the other packages that I'll talk about in a bit.

The second thing we do, and this is arguably where the meat and potatoes of muggle is, is that we rigorously standardize our compute environment. Of course, the industry standard, as the old joke has it, is Docker: the industry's answer to "it works on my machine" is "then we'll ship your machine", or in any case a container with the important parts. Our Docker image in muggle has a couple of extra features. It's built on the r-base image produced by RStudio, simply because that has RSPM support built in and it's pretty small. On top of that we have multi-stage builds, so, like a Russian doll, we build different versions off of this one base image. The smallest one is our runtime version; we ship that off to Azure or Google whenever we want to publish things, and it of course needs to be very small. That's our production image. Then we have a buildtime image, which is what's used in CI/CD and has a bunch more dependencies, to package and test things. And then we have a devtime image, which also comes with RStudio and usethis. At the core, these all have the same system dependencies and the same R dependencies. We use ONBUILD instructions, another neat Docker feature that I learned about only a year ago: you can give a Dockerfile ONBUILD instructions, which get triggered once another image is based on that Dockerfile. All of our projects are built on these muggle Dockerfiles, so those ONBUILD instructions get triggered, and the nice thing, as you can see on the screen, is that this leads to a really small git footprint, because we don't have to copy-paste a lot of boilerplate. We lock down dependencies using RSPM, the RStudio Package Manager, which gives you a snapshotting feature: you give it a date and you get dependencies only from that date. We're pretty proud that we have very small production images; ours are usually around 1.4 gigabytes for most of our projects, whereas something like rocker/tidyverse is 2.6 gigs, because it comes with a bunch more packages, of course.
But you may not need those in any given product. We cache our dependencies on GitHub Actions, as well as our knitr artifacts and that sort of thing, and so we're pretty proud to get a CI/CD turnaround of under five minutes for most projects.

As I mentioned, the footprint is small. This is basically everything you need to set up a project. You specify the version you want from muggle, which locks down the Ubuntu version, the R version, and the RSPM snapshot, and then you say: okay, I want the buildtime and the runtime image, and in this case, because we're shipping this out to Azure, we also want the startup command for the Shiny app behind it. We make these images really reproducible because we also push them: we build them on GitHub Actions and push them to the GitHub package registry on every commit, so that even if these dependencies become unavailable, say RSPM disappears, which I hope it won't, you can always get the complete containers from your package registry, in this case GitHub Packages. So we think it's very reproducible and easy to get started with for people who are new on the team. We use the exact same image in CI/CD; this is GitHub Actions running inside this container, in this case failing. We use the exact same images and runtime versions in production with our container-as-a-service vendor; this is Azure, and we've also used it on Google Cloud. The exact same container. This is the Shiny app in action, running off that container on Azure. And being a container, it's really easy to debug things: you can just type docker run locally on your machine, and you might recognize this output here. This is the output of a flexdashboard page spinning up, and any output the Shiny session would write to the terminal, you see right here. You can then point your browser on your local machine to this URL and see the app running in exactly the same compute environment as it would on Azure, which has saved us a lot of headache. You can also, of course, explore it interactively, just get into the machine and get rid of all of this "oh, it doesn't work on my machine, the context is slightly different" business. I see that I'm almost out of time; I have five minutes left, so I'll skip through the rest in maybe two minutes so that I get a chance to take some questions.

We also try to be really strict about functional programming. We stick to the tidyverse design principles, build mostly pure functions, make sure they're all length- and type-stable, and fail early. And we build around this dictum from Rob Pike: data structures, not algorithms, are central, and once the data structures are right, everything else follows. If you want to see a really simple example of the typical kind of package we build: we have one small package which so far does nothing but parse DOIs, digital object identifiers. That doesn't seem like a lot, but if you get all of that logic out of the rest of your packages and out of the stakeholder-facing data products, it makes life so much easier to deal with it once. We're really glad we got to do this.

Now, to wrap up: is this still agile? I think it is. The problem with agile is that it's often misunderstood to mean that you're building a car while driving it down the highway at 100 miles per hour.
And that's not really agile. We try to start with really small but functional things, even if they're just playthings, and then make them more sophisticated down the road. Then you don't have this trade-off between building quick-and-dirty and building well: if you build something small, you can build it well, and it will still be a relatively quick turnaround. All of this requires some organizational changes too, I'm afraid, and it's sometimes hard to argue for. You have to explain to stakeholders that prototypes are not the same as agile development, that sometimes it's better to be slow than fast, and that tight focus, and saying no to a lot of things, is really important. We think, and I personally think, it's really important to break the cycle where you're always building technical debt, your stakeholder gets impatient and wants more features, so you build more technical debt, and at some point it just becomes a mess of spaghetti code that no one can deal with, which is frustrating for everyone in the end. I think it's a hard lesson that especially small organizations with small data teams such as ours, which is four or five people, have to learn. It can be hard, but I think it's worth it for the stakeholders. And personally, I have to say I feel life is too short to build bad code. I've noticed, with COVID and everything, that I want to build things that have at least a shot at still being useful in two or three years, maybe five years; that's already long-term for software. So yeah, I'll leave you with that. Thanks so much, and let's see whether we get any questions in the remaining two minutes. Otherwise, I'll yield my time to the next speaker so we can catch up a little bit. Doesn't seem like there are any questions, unless I missed any.

No, it's okay. Thank you, Max. It takes time for people to type, so we should probably give them a chance. I want to ask a quick question before we move to the next speaker. I think it's a great effort you're putting in, and certainly there needs to be more reproducibility like this, so that's solid. I guess the one part I always get a bit afraid of is whether the reproducibility of these tools will be maintained in the long run. Even if I commit to this, can I use this actual tool five years later?

Yeah, that's fair. The answer is that you don't need muggle. The muggle package only has setup functions, usethis-style, so you can say "use muggle", but that will essentially just paste these four lines of Dockerfile into your project. So if I get hit by a bus and no longer maintain muggle, it doesn't matter, because the container is still out there. Admittedly, to update to new versions of R and to change the RSPM snapshot date, someone has to rebuild the upstream image, but if you have someone in your organization who knows their way around Docker a little bit, I think it would be pretty easy to adapt. And that's what I was saying: muggle isn't really much of a package, it doesn't really do anything. I'm lazy, and I document everything I do as a package, so this became a package. But it's more a way of thinking about a project, a place to write these things down, and one example of what we think is a pretty good way to use Docker.
But certainly, muggle has nowhere near the sophistication that a lot of these other projects have, or the longevity, to justify people taking a big bet on it. I'm aware of that. But ideally you shouldn't need to. I'd love to hear more of what you think.

So we'll come back to this in the question time. Now let's move on. Thank you very much, Max. Next up: Joti, if you could share the video for Peter's talk.

Have you ever wondered if you should try running your R scripts as serverless functions? Have you followed tutorials and wondered if there is a better way? I did, and I'm going to share with you what I've learned along the way and how I made serverless data science a little bit better than I found it. I'm Peter Solymos, biologist, data scientist, and co-founder of Analythium. I build data analytics pipelines and web applications using R. I value resilience, freedom, and simplicity, and I expect data engineering solutions to align with these values. Let's see what I mean by that.

First of all, why bother? What is the most common reasoning for a serverless architecture? This is what I call a coupled analytics architecture, where the front end is server-side rendered and in constant communication with the back end. This is the design behind Shiny, and it is best suited for exploratory applications that involve a lot of reactivity at the level of the data. You can see here that cost is proportional to the resources required by the front end and back end combined, and this can easily lead to a constant margin. Other applications do not require the same kind of reactivity and statefulness; as a result, we can get away with fewer round trips between the client, front end, and back end. Imagine a plumber API as the back end and static HTML, or any JavaScript framework, as the front end. We can push most of the reactivity to the client side. Decoupling the back-end analytics from the front end also means we are no longer bound to Bootstrap and can optimize the bundle using tree shaking, code splitting, and lazy loading.

After these convincing arguments, let's see what I mean by resiliency. Here's a back-end API, a microservice. It has two endpoints: the first depends on packages one and two, and the second function depends on packages two and three. You might use plumber for this API. As you make changes to the first endpoint, you update package two, and this breaks endpoint number two. Of course, if you use unit tests, this will never make it into production, but even in that case you have to fix bar before you can send your change to production. That's extra work: fixing something that worked perfectly fine before. What if we used containers to isolate the two functions? Updating foo has no impact on bar. This is where Docker and similar container technologies come into the picture. Docker provides immutability for the image layers, which makes it easy to version images, test and release, or even roll back changes. You can even deploy both versions, to roll out gradually or to experiment. Isolation prevents the conflicts you saw a moment ago, and it also simplifies managing resources, such as memory or CPU limits, at the container level. You can read more about the state of container technology in R in the paper by Daniel Nüst et al.

Next is freedom, free as in free speech. Here are the three largest cloud providers and a very brief summary of their function-as-a-service offerings. Note also that each of these vendors has its own CLI.
That means there's an R package wrapping each CLI. This leads to repeated effort and makes migration less straightforward. I see both of these as potential risks: as an owner, I have less control over costs; as a consultant, I'm niching down into a segment defined by a vendor. What is common to all three is that there is no R runtime, so you have to use containers. Managing containers at scale is not trivial, and managing serverless infrastructure is also often outsourced to public cloud providers.

This is where OpenFaaS comes in. The OpenFaaS project was born to mitigate these problems and to avoid vendor lock-in. OpenFaaS is an open source framework to deploy functions and microservices anywhere, at any scale. The project was created by Alex Ellis and is maintained by the OpenFaaS core team and the community. OpenFaaS has an emphasis on Kubernetes; it provides auto-scaling, metrics, and an API gateway, and it is language agnostic. What's Kubernetes? Kubernetes is a container orchestration engine for automating deployment, scaling, and management of containerized applications. As you can see in this graph, the cluster consists of a set of worker machines, or nodes, that run containerized applications. The worker nodes host the pods that are the components of the application workload, and the control plane manages the worker nodes and the pods in the cluster. OpenFaaS is great at abstracting away unimportant details and creating a simple multi-cloud experience where the user is still in full control.

Let's outline the function we are going to build. The function takes some request parameters and, based on those, retrieves data from an API that serves daily COVID-19 case counts based on the Johns Hopkins University global dataset. Then we fit an exponential smoothing time series model to these observations. Finally, we make a forecast and send a response containing the expected values and the prediction intervals around them. This is what the observed case counts look like for Canada; the white lines and blue ribbons indicate forecasts made with two different subsets of the data. So we want the function to return the numbers we can use to draw the white line and the blue ribbons.

Step zero: set up your cluster. This can be a local cluster using kind (Kubernetes in Docker), Minikube, or k3s; it can be a managed Kubernetes offering from one of the many cloud providers; or it can be a server with the lightweight faasd, which is a stripped-down single-node cluster. Check out the OpenFaaS docs for instructions and tutorials. The next step is to install the faas-cli. It is available for different operating systems, including Windows PowerShell, too. This is the step where the instructions deviate from all the other templates: instead of a Go, Node, or Python template, we are going to pull templates for R. The templates are in the analythium/openfaas-rstats-templates GitHub repository, and we can use the faas-cli template pull command to fetch them. There are 18 templates, based on the parent Docker image and the framework being used. You can choose between three parent images: the Debian-based rocker/r-base represents the bleeding edge; the Ubuntu-based r-ubuntu is for long-term support and uses RSPM binaries for faster R package installs; and the Alpine-based r-hub/r-minimal has the smallest image size, think 36 MB. The stdio template passes the HTTP request in via standard input and reads the response from standard output.
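To make the stdio idea concrete, here is a minimal sketch of what such a handler could look like, assuming the template simply runs an R script that reads the request body from standard input and writes a JSON response to standard output. The parameter names and the computation are made up for illustration; this is not the actual template code from the analythium repositories.

```r
#!/usr/bin/env Rscript
# Minimal stdio-style handler: request body arrives on stdin, response goes to stdout.
library(jsonlite)

body <- paste(readLines(file("stdin")), collapse = "\n")

# Parse the request, falling back to a default when the body is empty (illustrative).
params <- if (nzchar(body)) fromJSON(body) else list(n = 5)

result <- list(
  n       = params$n,
  squares = seq_len(params$n)^2        # stand-in for the real computation
)

cat(toJSON(result, auto_unbox = TRUE))
```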
Other HTTP frameworks, such as httpuv, plumber, fiery, beakr, and ambiorix, are all packages you can find in the R ecosystem, and each of these frameworks has its own pros and cons for building standalone applications. Let's pick plumber with the r-base image. Plumber is one of the oldest of these frameworks; it has gained popularity and corporate adoption, and there are many examples and tutorials out there to get you started. Now that we have decided which template to use, the next step is to create a new function. The template naming follows the rstats-{parent image}-{server framework} pattern. We define two environment variables: one for the Docker Hub username, or any other registry prefix, and the second for the IP address or domain name of your OpenFaaS cluster. We pass the template name, the function name, and the prefix as CLI arguments.

I'm sure you've started wondering if I'm ever going to show R code at this R conference. Here you go. This is the handler.R file inside the newly created covid-forecast folder, which is going to be the name of the cloud function. Using the familiar plumber annotation syntax, lines starting with #*, we define an endpoint at the root. Plumber will pass URL parameters to the function as arguments. Inside the function, we do some type conversion and call the COVID forecast function, which spits out the forecast information that will be formatted as JSON as part of the response body. You can see what parameters can be passed to the function. If your handler requires extra packages, you can add those to the DESCRIPTION file; here we import forecast. For other use cases, you might need to add remotes or system requirements that will be installed into the Docker image before attempting to install the R packages. You can also pin R package versions. Finally, we are ready to build the Docker image, push it to the Docker registry defined by our prefix, and deploy the function to the OpenFaaS cluster. You can do all of this with one up command, or you can do it in three steps, especially if you want to test the Docker image locally before pushing and deploying. When all this is done, it is time to invoke the function: go to your OpenFaaS URL /function/<function name>, provide some parameters, and this is the JSON you can use on the front end to provide a numeric summary or draw an interactive graph. If you got excited about replicating this example, head over to openfaas.com/blog/r-templates, where I have recently written this up as an example.

To recap: use the faas-cli to create a new function; edit the handler.R file, and put here also any data you might want to load, for a scoring engine and so on; define the dependencies; then build, test, push, and deploy. Once the cluster is ready, you can invoke the function. We all know how well R is suited to data science thanks to its diverse tooling and its ability to leverage and integrate with other languages and solutions. I believe that R can truly shine in the multilingual serverless landscape. Thank you for listening to me talk about serverless data science in R using OpenFaaS. You can find and read about the R templates for OpenFaaS in these GitHub repos, and find examples in the analythium openfaas-rstats-templates and rstats examples repositories. Feedback and contributions are much appreciated. Cheers, and let's open the floor for questions.

Thank you, Peter, for your talk. Peter, I also have a question for you. This looks really useful; OpenFaaS seems to be quite professional.
May I ask how long it has been going on, or is it still quite a new product?

OpenFaaS started in 2016, so it has quite a bit of history, and it's also a CNCF-incubated project, so it has some backing from a foundation and is unlikely to totally disappear. I started the R templates in 2019; I'd been thinking about it for a year before that, but I only got to it because I needed it. It has gone through, I think, two iterations so far, and I'm looking at how to streamline it even more. I particularly like the muggle images; I have to look into those, because image sizes are my pet peeve and I like to make them slim.

That was going to be my next question, whether you can actually use muggle as one of the possible parent images for a worker.

I need to look at that, but yeah. Whenever I really need to slim down images, I use r-minimal, although installing all the dependencies is not always easy with Alpine Linux. It is quite doable, it just takes more upfront time.

You said the one you showcased was an R example. Does this actually work with other languages as well?

As long as you have a Docker image which ingests an HTTP request, yes. There's a little watchdog which you insert in front of it, which runs the executable, and that's how the life cycle is managed.

I see. So we have a question from Anna: thanks for the talk; how do you manage the dependencies of packages that you need to import? Are the packages that you define in the imports also installed when the image is built, as well as their dependencies?

Yes. This is kind of a gray area, especially when it comes to the system dependencies themselves, because then you have these build-time and run-time things and you might need to do some cleanup. But I like this explicit approach of the user or developer defining what they want, and that can be a lot less than, for example, what renv picks up, which shows in the sizes of the images. So I think if you keep adding until it runs as expected, that's the optimal size, or the dependency structure you need. You can start broader and then take things away, but if it's already running, why spend more time? To me, there are some really obvious dependencies that you would state, and three or ten lines is not a big deal to just put in a DESCRIPTION file. Then you might run into some issues when you're building the Docker image, and you basically read what the error says and add that line. So it's really straightforward, and I think the DESCRIPTION file is a really good way of doing that. The only downside is that it's not very easy to pin package versions, because, for example, remotes doesn't support that out of the box: if you state that version information, it is not going to just do it. That's why I added that extra line, which is not standard DESCRIPTION file notation, but I just added it. So if I happen to know that one particular package update breaks my app, which has happened to me in the past, I don't really care if other packages have been updated, as long as they work as expected, but for that particular one I know I need 1.4.2, and that's how I got around it. Hope this makes sense. So I try to be really specific where I have to be, and in other places I just leave it up to the major R version and whatever latest packages came with it at that time.

Thank you, Peter. So it's 4:30. I invite Austin as well as Jingchao to turn on your video.
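On the version-pinning point Peter raised: aside from the template's own non-standard DESCRIPTION field, one commonly used way to pin a single troublesome package while an image is being built is remotes::install_version(). A minimal sketch, with the package name and version purely illustrative:

```r
# Install most dependencies at whatever versions the dated RSPM snapshot provides,
# but pin the one package known to break the app when it updates.
install.packages("remotes")
remotes::install_version("somepkg", version = "1.4.2",
                         repos = "https://cloud.r-project.org")
```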
We now have questions that you can direct to any of our speakers in this session. We do have one more question for you, Peter. Pedro is asking: is there support for Singularity? That is a Docker-like container alternative.

I think I've seen that, but I can't tell for sure how it's handled with OpenFaaS. I'm pretty sure some people have already run into this, so I wouldn't be surprised if there were support for it.

Okay. Pedro also asked a question earlier for the very first speaker, I think while Chris's video was playing, and I believe Jingchao has already answered it, but just so that it is streamed, I will ask again, so if you could re-answer that question, Jingchao. Pedro asked: can this type of memory be easily plugged into nodes, or even classical PCs, without changing the machine configuration?

Some of the machine configuration needs to be changed, but that's very easy to change. It also needs recent CPUs; Intel has listed all the CPUs in recent generations which support persistent memory. And it can be inserted directly into the DIMM slots, exactly the same slots that DRAM uses right now, so both DRAM and PMEM can be inserted there. I hope that answers the question for you, Pedro.

I can see some people are typing away on Slack, and here we go. So, Peter, what is the URL of the blog at OpenFaaS that has the R templates? Peter, maybe you could answer that one on Slack. Yes, I'll do that. That would be fantastic. I can see more questions coming, because there's some typing going on, but it probably takes a while for people to type. For attendees as well, if you'd like to ask your questions in person, just raise your hand; we can unmute you and you can ask live in this session as well.

Thanks, Peter. And thanks, Amy. So, I asked the question about installing the R packages and you gave me the answer. I'd be really curious to see your DESCRIPTION file, because I've been having issues installing R packages using an AMI on AWS, which I know is a different way of doing things from what you've described. One of the issues is that sometimes, when a package fails to install, it doesn't return a non-zero exit code. So I found that adding just the packages to the DESCRIPTION file is not enough, and I'm using an install script where I specifically install all the packages that I need. I'm wondering whether you have other comments around that, or whether what you've told me before is going to be enough and I have to try different ways of making sure that all the packages I need are actually installed. I don't know if my question makes sense.

Yeah, this is a question for me, right? I'm pretty sure Max could answer some of this as well as I could, but I agree that sometimes you just keep adding packages and they might still not install, because usually it's not another R package that is missing but a system dependency. And there are two kinds of system dependencies; Jeroen Ooms made a really good package, and a related vignette, about how to spot which are build-time dependencies, which you need to be able to build the package, and which are run-time dependencies, which the package keeps calling because it has dynamic linkages to them. Most often I run into these issues when some of those build dependencies are missing, so the package can't compile.
And that can even happen when you are using a somewhat mismatched binary; I don't know how best to frame it, but sometimes with RSPM binaries you install them and they just won't work. I've run into this in the past. That means you need to specify the package somewhere else, or you just have to wait until everything builds from source. What I found works best, when you want to rebuild one package from source but want everything else that works installed from RSPM, is to list it in Imports, which will use RSPM, and then also refer to the dev package on GitHub, which will then be built from source. That's how I usually deal with these things. But then you also have to go to those databases and look up the dependencies for the specific Linux OS you are using. Okay, thank you. Yep. And thank you, Anna, for your question.

We have a question from Dennis, which was actually one that I was also thinking about. Dennis says, to Max: sometimes it's hard to get people to change their behavior. How was your opinionated framework received? How did you sell it?

Yeah, that's a good question. I mean, the struggle is real and it's still ongoing. I'm in a bit of a lucky situation in that we're working at the university libraries and we have something of a mandate to publish open source work, so there is some time to work on these kinds of things. I think what worked best was showing my colleagues an existing project after I had migrated it to apply these best practices. And I should be clear about what muggle is: I just put it in a package because I put everything in a package, but it does very little. It won't get into your project, it won't even be in your DESCRIPTION, so there's very little you have to adopt; it's more of a way to put these best practices in one place, together with a hopefully decent practice for Docker images. What was really helpful in selling it was a project that previously wasn't even a package, just a bunch of scripts that people had to source, and then source this other thing, and then pray and hope for a full moon so that maybe the Shiny app would work. What people really liked, once I had migrated it, was looking at the diff and seeing how much smaller the package had become, once all of the cruft in git that no one really knew the purpose of, and that wasn't well documented, was gone.
I think that was, for the other developers and data scientists, something of a light-bulb moment: this stuff can make projects so much smaller. And I think that's something that is hard for us as data analysts and data scientists to learn, and to convince our colleagues of: ideally, the stakeholder-facing data product, whether that's a Shiny dashboard or an R Markdown report or whatever, should be very thin. It almost shouldn't matter whether it's R Markdown or Shiny, because it's so thin; only the interactivity, or the reactive graphics if it's Shiny, lives there, and all of the other logic lives in other packages. I think it helps to show people that this recombination, the UNIX philosophy, really works for data products: the kinds of things you use across several different projects in your organization get outsourced into their own packages, each doing one thing, and then by recombining them you get a lot faster, and your user-facing or stakeholder-facing products become a lot smaller. I think that has helped. But the ongoing struggle, where, to be honest, I have also failed, is to explain that slow is fast, that you have to say no a lot, and that you have to keep the focus tight. And this is a personal opinion, but sometimes I worry whether we're just living through something that has affected software engineering and software companies in general, which is that if you're very small and you don't focus, you fail. We've already had a bit of a hangover from some of the AI and ML hype, and I think there is a risk that we'll get a further hangover of failed scripted data science products if, after two or three years of spending a lot of money, we don't deliver a lot of value; not having that tight focus might be really dangerous. So I hope that answers the question. The struggle is ongoing, I think it's real, and the benefit is real. And I want to be very clear that muggle doesn't really do much; it's just a place to document this, and these are maybe me-too wisdoms, except that the Docker container still has some fancy stuff. Thanks for the question, Dennis.

I also wanted to add two notes to Anna's question. I've used two services to identify system dependencies in muggle. The older one, which I've now gone back to, is r-hub, which also has an R package: you provide it with your platform identifier and the package you're using, and it gives you back a set of install instructions. The other one is linked to RSPM, which also has a list of system dependencies, run-time system dependencies I should say, in both cases.

Thank you very much, Max. We'll close the session now. If you have any questions, please feel free to put them on Slack so you can continue to interact. Again, I want to thank the sponsor of the day, Roche, and the session sponsor, cynkra, and thank you to our speakers, as well as our audience, for all your questions. Hope to see you next time. Thanks. Thank you. Bye. Take care.