This part is about how we combine these two languages in an analysis. I will start with the motivation for why we would want to do that at all, and then quickly go through the different levels at which we could integrate the two languages. In the practical parts we will then see several of these levels in action. Just as a disclaimer at the beginning: I have not been doing this for a long time. I only started maybe half a year ago and I'm still learning a lot of new things. Also, the theory and exercises in this part barely scratch the surface of what's out there for integrating the two languages, so questions will probably come up that I won't be able to answer, and I would like to ask you to bear with me.

So why would we want to integrate these two languages, particularly for data science? They have some things in common: both are open-source programming languages, and both have really large, very active communities that are continuously providing new libraries and tools. Both languages also have a well-defined interface to compiled languages like C or C++, which makes it easy to go to that lower level if efficiency is really what's asked for.

But they are also different. Python is a general-purpose programming language. It's cleanly designed, efficient and very modular, and therefore it's usually the language of choice for people who want to deploy software, who want to roll it out to their customers, and who have large projects with a lot of interacting parts. There are also some really great Python packages, like Scanpy or scVelo if you are interested in single-cell data, or scikit-learn, Keras or PyTorch if you're interested in machine learning. Those are all things that speak for using Python.
Similarly, R has a number of things speaking for it. It's more of a statistical modeling and data analysis language, designed by statisticians, and is therefore really great for exploring and visualizing data. There, too, we have a number of packages, mostly from the Bioconductor project, that are really strong motivators to use that language. Just to name a few: edgeR, DESeq2, scran or scater, the tidyverse, ggplot2 and Shiny.

If you google Python versus R you will find a million and one different comparisons of the two languages. I quite like this tweet here, which compared R to Batman and Python to Superman. But that's not really what I want to do here. I don't want to discuss the pros and cons of each, because why would you choose if you can have both?

In a typical data analysis project we already combine various tools from different sources, because it's very unlikely that any single tool or programming language will give us everything we need. At the same time, it's nice to have an analysis workflow that is as simple as possible, with, let's say, few scripts or steps and few intermediate files. In my personal case, the reason I got into this is that I'm usually more of an R and Bioconductor user, and I use that primarily to analyze single-cell transcriptomics data. But there is a very nice package from the Theis lab called scVelo that implements RNA velocity analysis; you will hear more about that package tomorrow. So I was in the situation where I had to combine scVelo with my existing R/Bioconductor workflow.

When you're thinking about integrating the two, you can of course do it in many ways. A simple way is to break your analysis workflow, your pipeline, into chunks that are homogeneous, so each chunk is something that happens either purely in R or purely in Python, and you run them one after the other, and
that works really easily. The only problem is that these chunks don't really talk to each other. At the end of each chunk you need to save the state of the project, usually in the form of some output files, and at the beginning of the next chunk you need to read that state back in. So what you usually end up with in a project like that is a lot of disk space dedicated to these intermediate files, plus some plumbing code that writes them out or reads them back in.

For a long time already there have been what I call bridge packages. These are either a Python package that allows you to interact with R, or an R package that allows you to interact with Python. I call them bridges because you still primarily work in one language, but through that special package you get a kind of bridge to the other language. We will see such bridge packages in action later in the exercises. I think they are really great, and some of them are quite comprehensive in the functionality they provide. But they are specialized packages, and they force you to understand the functionality and also the vocabulary of that package. So in order to use Python from R, it's not enough to just know Python; you also need to know the interface of the bridging package, which would be reticulate, for example. The same is true for the other direction: you need to know how rpy2, for example, works and what kinds of methods and functions it provides in order to use that bridge.

So for me this is not ideal, and it also doesn't allow me to just have my unmodified R code and my unmodified Python code and have them talk to each other. That would be something I would call a truly integrated workflow, where I can write a single script that has R parts and Python parts without having to use a special interface, so a lot of the integration of the two worlds would
happen transparently for me. That would be my preferred solution, and it turns out that solutions like that actually exist; we will also see them in the exercises. What's really special about them is that objects are shared between the two worlds. For example, we can have a data frame in R that is available as a pandas DataFrame object in Python. The nice thing about that, of course, is that we don't have to write that many output files anymore.

Just so you know, if you have a question, feel free to interrupt me anytime by raising your hand in the Zoom feedback.

Okay, so let me very quickly give you a few words on each of these three approaches. As I said, breaking things up into pure R or pure Python chunks is something we have maybe already done. We can still organize such workflows and integrate them to some degree, for example by using Makefiles or Snakemake or KNIME or whatever common workflow language. That's super flexible, but as I said, it's not really integration. It's more one after the other, with a requirement to connect the different chunks yourself. So that's not really what I want to talk about in this block.

In the bridge approach we have two prominent packages: one called reticulate, which allows you to use Python from R, and one called rpy2, which allows you to use R from Python. We will use both of them in the exercises. I think they would be the method of choice if you want to primarily use one language anyway and just do a few things in the other language. But you need to know a bit about those packages, which I see as their biggest disadvantage.

And then finally we have the truly integrated approach. Before I go there, just a quick word about display conventions. On the slides, and also later in the exercises, you see these blocks with colored backgrounds. They show code, and the background color indicates
what language that code is from. Gray is always for R, while this yellowish background color always indicates Python code. I don't think I actually have any bash blocks in the exercises, but those would be blue.

So, a few examples before we go to the truly integrated level. If I want to use reticulate to call Python from R, it's actually quite easy. You load the reticulate library, and then there is an import function, which is an R function that allows you to import Python modules. What you get back is an object that is a kind of handle to the methods and functions defined in that Python module, and you can then call them using the dollar operator. In Python you would use the dot operator, but here you use the dollar, a bit as if the function were a list element of the Python module object; it behaves a bit like a list. In this case, the example is just listing the files in the current directory. In R we get back an R character vector. If we do the same thing in Python, what we get back is a list of strings. So you see that this list of strings is automatically converted into an R character vector when I call the function from R. You could say that, in that sense, there is not so much vocabulary you need to learn with reticulate; this is relatively close to how the code would look in Python.

These type conversions are available for a lot of default data types. Here I have the conversion table: you see the R data type and the Python data type, reticulate automatically converts between the two in both directions, and here are some examples. So I guess any standard data structure in R is covered here. But of course special objects like S4 classes, for example the SingleCellExperiment object that we may use later in the course, are not automatically translated. I will talk more about that later. And
because reticulate really does this bridging very nicely, it's also used widely in other packages, which may even hide from you as an R user the fact that you're calling Python code in the background. Two prominent examples of that are the R tensorflow and keras packages, which call the Python libraries in the background through functionality provided by reticulate.

There is a question about whether Python can import sparse matrices via reticulate. Yes, it can, and there will actually be an example doing that in the exercises.

So let's look at the other direction. If you're more of a Python programmer and you would like to use R functionality, you can use rpy2. It's actually much older than reticulate: reticulate is maybe only two years old, while rpy2 is, I think, at least 10 years old, and there was even an rpy before that. As you will see in the exercises, rpy2 is a huge package with a lot of different functionality. There is, for example, a low-level and a high-level interface, plus specific support for JupyterLab notebooks.

The high-level interface looks like this. I import the library in Python, and I can then use the r object from robjects as an entry point into the R world. I can basically ask it for things a bit as if this r object were a dictionary, where the dictionary key I'm asking for refers to the object name in the R world. What I get back is automatically translated into a Python object of the suitable type; in this case you see it's actually a list of floats. And here is the same thing in R directly. It's actually the same value; it's just that R does some rounding when it prints that variable. The type conversions are, I would say, more flexible in rpy2, but also much more complicated. The low-level interface is very efficient. It hardly copies anything over; it's just wrapping the R objects
and gives you Python pointers to them, so that's good for speed and efficiency. The high-level interface does do some copying and conversion using these converter classes, and you can also create new converter classes for specific objects that may not be supported yet. One of these that we will see later in the exercises converts a SingleCellExperiment object into an AnnData object. I can't really go into much more detail here; if you are interested, you will find more information under this link.

So finally, the truly integrated level, where I'm using a single script or notebook that uses both languages. These are available in RStudio, where they are driven by reticulate in the background, and they are also supported by Jupyter, where they are driven by rpy2, specifically by rpy2's IPython R magic. We will see both of these in the exercises. I think those are probably the easiest to use for me. My personal favorite is RStudio, writing R Markdown documents with reticulate in the background, which allows me to just include Python chunks. Of course things need to be set up properly: you need a Python installation, the Python modules need to be there, and you need to tell reticulate which Python to use. We will see how to do that in the exercises.

Before we actually start doing that, I would like to say a few words about notebooks in general. I found this hilarious presentation by Joel Grus, a data scientist and Python programmer who has also written a couple of books about data science. He gave this presentation at JupyterCon 2018, in which he says he doesn't like notebooks. He was talking about Jupyter notebooks, but I think his arguments apply equally to RStudio notebooks. In a nutshell, he says notebooks are bad because they allow you to run the different
chunks of code out of order. By doing that, or by editing, let's say, an upstream block and rerunning just that block but not all the downstream blocks, you could create an incoherent state of your notebook, and that may be confusing. That's essentially his message, and I fully agree with it, which is why I wanted to mention it quickly.

So what does this look like in RStudio? Here you see a screenshot. You can really have these code blocks that are marked by three backticks, followed in curly brackets by the language that the code will be parsed with. So this is an R block and this is a Python block, and that's pretty much all there is to it. You can use pure Python in that script, just as you use pure R, without having to learn the bridge syntax. Still, you are able to access objects from the other side, using the special py object in R or the special r object in Python. For example, here you see an R chunk that accesses y, which is a NumPy array, via the py object. I don't want to spend too much time here; we will see that in the exercises.

Similarly, in Jupyter you can do that by using the special %R or %%R indicators at the beginning of a line or of a cell to indicate that it contains R code. So with that, I'm through the introduction.
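
As a reference for the rpy2 high-level interface described earlier (the r object used like a dictionary keyed by R object names), here is a minimal sketch rather than the code from the slides. It assumes that rpy2 and a working R installation may or may not be present, so the import is guarded; the helper name r_pi is hypothetical and only serves the illustration. The looked-up object is R's built-in constant pi.

```python
import math

try:
    # Assumption: rpy2 and a working R installation are available.
    import rpy2.robjects as robjects
except Exception:  # rpy2 missing, or R could not be located/loaded
    robjects = None


def r_pi():
    """Hypothetical helper: return R's constant pi as a Python float.

    Returns None when rpy2 (or R) is not available on this machine.
    """
    if robjects is None:
        return None
    # robjects.r acts as the entry point into the R world; indexing it
    # with an R object name looks that object up, like a dictionary key.
    r_vector = robjects.r["pi"]  # R numeric vector of length 1
    return r_vector[0]           # first element, as a Python float


value = r_pi()
if value is None:
    print("rpy2 (or R) is not available; nothing to look up")
else:
    # Python shows more digits than R's default printing does,
    # but the underlying value is the same.
    print(value, math.isclose(value, math.pi))
```

When rpy2 is available, the object returned by the lookup is rpy2's wrapper around an R numeric vector, which is why the sketch indexes it to get a plain Python float.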