Hello everyone, in this talk I would like to give you a quick overview of tools and patterns for writing production-ready R code and for making it easy to share within your team. My name is Marcin Dubel, I am a data scientist and software engineer at Appsilon, and my main area of interest is helping teams optimize their R project structure and work efficiently.

From my experience I see two challenges ahead of data science teams. The first is to create value; this is basically their daily business, that is what they do and what they are experts in. But there is also an area common to all data science teams: how to share that value among team members, with the community, or with clients, how to deploy it, and how to make sure it is free of errors, stable, and reproducible. So in this talk I will focus on tools and guidelines that help make sure this value can be shared easily.

I see three building blocks, three areas where problems can occur. The first is that the code we are using is different; this is basically solved with version control systems such as Git, so we will go briefly through it and some extensions to it. The second is the environment: whether we are using the same versions of the packages, the system, and other dependencies. This is also solved quite easily, you just need to know the solutions. The third is that the data can be different: for example, the code was built on some test example, and in production the real data set is used and there are differences, so we will see how to make sure the workflow is correct and checked.

So let's start with version control, the simple thing. You are probably all familiar with it, but I want to make sure that you and your teams are using Git properly. The first rule is to use it always, in every project, and to use it heavily, with all of its features; it will build a good habit. The crucial thing is to use branches. I have seen a lot of teams committing directly to master, and this caused problems; branches allow you to collaborate independently and to switch between features easily. Make sure that your commit messages are informative and clearly describe the changes that are happening; it will help you scan through the history of what happened and find your changes in the future. Also, manage a task board. It is a great way to organize your view of what you are doing, to split your work into smaller issues, and to work on them efficiently; when you think through in advance what you want to achieve, your branch structure becomes more effective.

Another part related to the repository is setting up continuous integration: a set of steps triggered automatically, for example running unit tests before a piece of code is merged. This is a crucial part, but it is also a somewhat tedious and boring process, and it can happen that you skip it when you are delivering features fast, so having it automated is a great way to be sure it happens. I would like to introduce GitHub Actions to you. I think they are a great way to set up continuous integration, because you can use a lot of ready-made tools from the marketplace, but you also have a lot of flexibility in building your own solutions and configuring them the way you want them to work. Check out those links to learn more; it is very easy, so why not use it. A minimal sketch of adding such a workflow from R follows below.
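As an illustration only (this snippet is not from the talk), one common way to bootstrap such a CI workflow for an R package is through the usethis helpers, which copy ready-made workflow files from the r-lib/actions examples into .github/workflows/:

```r
# A minimal sketch, assuming the usethis package is installed and the project
# lives in a GitHub repository.

# Standard R CMD check (including unit tests) on every push and pull request:
usethis::use_github_action("check-standard")

# Optionally also report test coverage:
usethis::use_github_action("test-coverage")
```

After the generated YAML files are committed, GitHub runs the checks automatically whenever code is pushed or a pull request is opened.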
I would also like to recommend using GitHub templates, they are really cool. Your organization or team probably has a set of standards and good practices for how to build projects, and with templates they can be shared easily. We are doing it at Appsilon with our Shiny project structure, we have our own way of doing this, and with GitHub templates we are sure that the whole team follows the same pattern.

Now the second building block, the environment. The lack of a defined development environment is the root of a lot of problems, because you can write the same code and it will behave differently on different machines simply because it uses different versions. It can be really painful and it can also affect your other work, so make sure that you have a development environment set up. Luckily there is a great solution to this in our world, and it is renv. What renv does is create a separate environment for each project, so that the packages and their versions belong only to this project and do not affect any others. It is also super easy to share: the whole configuration is saved in a single file called renv.lock, you can commit it and push it to the repository, and your colleagues can just take it and reuse it. It is very simple to set up. You might be familiar with Packrat, which also has an icon in RStudio, but please don't use it, please stick to renv. It is much simpler and more intuitive, and the defaults are better; I find Packrat very hard to set up, while with renv it is just one command, as you will see on the next slide. renv also simplifies a lot of the process of using Docker and building images, because all the restore logic is already in the renv.lock file: you can just snapshot, push your new version of the image to Docker Hub, and the environment will be restored from renv.lock, so you don't need to modify the Dockerfile.

Working with renv is super easy, you basically need four commands. You initialize it in your project, you install the packages that you need, and then you snapshot the package versions used in the project together with their dependencies, which updates the renv.lock file. You can then share that file, and your colleagues can restore your environment, and that's all: you are sure that you are all using the same environment and working on the same code. You can also run the same restore step on the server, and the deployment will be smooth. The four commands are sketched below.
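A minimal sketch of that four-step renv workflow; the dplyr package here is just an example:

```r
# 1. In a fresh project: create a private, per-project library
renv::init()

# 2. Install whatever the project needs (dplyr is only an example)
renv::install("dplyr")

# 3. Record the exact package versions (and their dependencies) in renv.lock
renv::snapshot()

# 4. A colleague, or the server, recreates the same environment from renv.lock
renv::restore()
```

Commit renv.lock together with the code; in a Docker image the build step typically just copies renv.lock and runs renv::restore(), which is why the Dockerfile itself rarely needs changes.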
Now the data workflow, our third area. I would like to show you the drake package, which I think is really useful for keeping the data process clean. It organizes your project, your files, functions, and results, in a single plan, which acts as a single source of truth. You can have a mess in your scripts, but you can always check in the plan what is actually happening in your data workflow, step by step, and what the dependencies are. For this, drake creates a really cool visual graph of all the dependencies, showing what is happening where. What I also really like about drake is that it is kind of lazy in doing updates: in this example the function that creates the histogram was modified, so that target is outdated, and when you rerun the whole process only this part and its dependencies will be rebuilt, not the others that are up to date. drake keeps track of it, which is especially useful when heavy calculations are involved; you don't need to worry about whether you should update a given part or not, drake takes care of it, and this is super cool (see the drake sketch below).

I would like to show you one example from real life, from a real project I was working on with a client. You can see that the data process is a little bit complicated here, but having this structure really helped me to go through it with the client, to discuss it, and to make sure that the business logic here is correct. It is also great for finding bottlenecks: here, for example, loading the raw data into the application takes rather long in comparison to the other parts, so if we wanted to speed up the process we would know which parts to optimize.

The other crucial thing when we are talking about data is making sure that all the assumptions we have in the code are fulfilled when we load a new data set. This is why at Appsilon we created the data.validator package. It is available on GitHub, it is open source, and it creates a nice visual report based on rules from the assertr package, which you might be familiar with. The report, as you can see on GitHub, is a nice HTML report that you can share, for example via email with management, and it is completely independent from R. You can also trigger such a report on certain events, for example on RStudio Connect, so it is automated (see the data.validator sketch below).

The last part about data is how to load it into Shiny efficiently. I recommend the plumber package here, because it allows you to load only what is needed; usually there is no need to load a whole big data set into the application. plumber allows you to build a REST API that simply exposes your logic as endpoints, so you can grab only what you truly need. It is also super easy to deploy with Docker or RStudio Connect, so check the plumber documentation. One more point: make sure that you are using efficient data libraries like fst, data.table, or arrow; they can vastly improve the performance of your application (see the plumber sketch below).

So, three takeaways, one for each area. Always use version control, follow the rules, use branches, and do code reviews, this is super important. Set up a working development environment for the project, with renv, with Docker, or with both; this is a must for every project to make sure that everything works correctly. And organize and validate your data, use plumber when needed, and be careful not to overload your application with too big data sets.

Thank you, I will be super excited to discuss these and other related issues with you. You can reach me on Twitter or by email, and check the Appsilon site and the Appsilon blog for nice solutions. Thank you very much.
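A minimal sketch of a drake plan in the spirit of the workflow described in the talk. The data path and the helper functions clean_data() and create_histogram() are hypothetical placeholders, not from the original project:

```r
library(drake)

# Hypothetical helpers; in a real project these would live in R/ scripts
clean_data <- function(raw) raw[complete.cases(raw), ]
create_histogram <- function(data) hist(data[[1]], main = "Example histogram")

# The plan is the single source of truth for the data workflow
plan <- drake_plan(
  raw_data = read.csv(file_in("data/raw.csv")),  # hypothetical input file
  data     = clean_data(raw_data),
  histo    = create_histogram(data)
)

make(plan)             # builds only the targets that are outdated
vis_drake_graph(plan)  # interactive graph of the dependencies
```

If only create_histogram() changes later, a new make(plan) rebuilds just the histo target and leaves the up-to-date ones alone.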
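A minimal sketch of the data.validator and assertr combination mentioned above, following the pattern shown in the package README; the mtcars data set and the rules are just examples:

```r
library(assertr)
library(data.validator)

report <- data_validation_report()

# Example rules on a built-in data set; a real project validates its own data
validate(mtcars, name = "Verifying cars dataset") |>
  validate_if(mpg > 0, description = "mpg is positive") |>
  validate_cols(in_set(c(3, 4, 5)), gear, description = "gear is 3, 4 or 5") |>
  add_results(report)

# Writes the shareable HTML report to disk
save_report(report)
```

The resulting HTML file can be opened and emailed without R installed, which is what makes it convenient to share with management.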
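Finally, a minimal sketch combining the plumber and fst ideas: a tiny API that reads only the requested column from an fst file instead of loading the whole data set into the Shiny app. The file path, the column parameter, and the port are hypothetical:

```r
# plumber.R -- a hypothetical endpoint; save this file and run it with:
#   plumber::plumb("plumber.R")$run(port = 8000)

#* Return only one column of a (hypothetical) big data set stored as fst
#* @param column name of the column to return
#* @get /column
function(column) {
  # fst reads just the selected columns from disk, not the whole file
  fst::read_fst("data/big_dataset.fst", columns = column)
}
```

The Shiny application can then request just that slice over HTTP (for example with httr) rather than keeping the full data set in memory.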