So hi, I'm Kelly Jean, and today I'll be talking about how to turbocharge your data science with Python... or R. Just kidding, I lied to you already: I'm going to be talking about how to turbocharge your data science with Python AND R.

But first, what the heck is a data scientist? It depends on the company, but here are a few roles you can have. For example, there are data science analysts. Traditionally these are data analysts or business analysts; they focus on ensuring metrics are accessible to the right stakeholders and on analyzing the health of the business. They also use data to find areas that could be made more efficient or optimized. Then there are product data scientists: they partner with product managers and engineers to focus on product initiatives, for example building models to improve the user experience of the product. Then there are experimentation data scientists; these are data scientists who work on things like A/B testing and other experiments to measure the impact of changes. And then there are growth and marketing data scientists, who focus on things like optimizing Google AdWords spend, SEO analysis, LTV modeling, and other things like that. This isn't an exhaustive list of the types of data scientists you can encounter in the wild, and these aren't mutually exclusive roles, so many data scientists actually do bits and pieces of all of these roles.
It really depends on the company. But what all these roles have in common is that they leverage data to solve problems and gain insights. I realize that this is a very broad definition for a data science role, but the role really is that broad. I'm largely a product data scientist, so I'm going to be focusing on working through the workflow of a product data scientist.

But before jumping into this workflow, what's R? According to the R website, R is a language and environment for statistical computing and graphics. Basically, it's a programming language built by statisticians, not computer scientists; in fact, it was built by statisticians whose first names started with the letter R. It has some differences from Python. For example, you'll see that indexing starts at one, not zero, but the syntax is largely the same: loops have the same general structure, lists in Python are called vectors in R and have similar properties to the ones they have in Python, and data frames exist in both R and Python and share similar properties that make it a whole lot easier to work with data. One big difference is that when typical people say "Python" they're referring to a type of snake, while for R they're referring to a letter of the alphabet. R and Python are actually very similar when it comes to doing most data science work.

So there is this great debate: should I use Python or R in data science? You see it pop up on Reddit all the time: "R versus Python", "Python versus R", "Should I learn R or Python? Somewhat experienced programmer...", "Is R better than Python for anything?", "I started learning R half a year ago, and I wonder if I should switch". You also get the opposite.
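To make the syntax comparison concrete, here is a minimal Python sketch of the points above (indexing from zero, loop structure, lists vs. vectors), with rough R equivalents in the comments; the toy data is made up for illustration:

```python
# Python indexes from 0; R indexes from 1.
names = ["Bella", "Max", "Coco"]   # in R: names <- c("Bella", "Max", "Coco")

first = names[0]                   # in R: names[1]
print(first)                       # prints "Bella"

# Loops have the same general structure in both languages:
for name in names:                 # in R: for (name in names) { print(name) }
    print(name)
```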
"Why do we need Python when R is so brilliant", blah blah. "Which is a fast and easy language for sentiment analysis on Twitter data, R or Python?" So you get the idea: they just keep asking, R or Python? There's also a lot of this on Twitter, but I got tired of screenshotting.

To further emphasize that the two languages are very similar, you also have equivalent packages in Python and R. As I mentioned, we have data frames in both: Python has pandas and NumPy, and R has base R (when I say base R, I mean you just need R without any packages); then you can use dplyr in R to do data manipulation. For plotting, there's matplotlib, seaborn, and bokeh in Python, and R has base graphics, ggplot2, and highcharter. For statistics, there's statsmodels in Python, and R has base R. For common machine learning models, there's scikit-learn in Python, and in R you actually need a lot of different packages to get the equivalent of scikit-learn: caret, glm, xgboost, etc., etc. For deep learning, they both have TensorFlow. And you also have packages to connect to the other language: for example, to run R inside Python you have rpy2 and others, and to get Python code running in R you have reticulate, SnakeCharmR, etc.

So when I'm asked "Python or R?": as I already mentioned, as a data scientist I use both. I'll walk you through a modeling problem that is a pretty basic data science task, one that most data scientists have done at one point or another in their career, and I'll show you where I would use R and where I'd use Python. This modeling problem is just: given a binary response, a binary variable, how can we use the other features in a dataset to predict that binary variable?
So I have a New York City dog dataset that I found. It's publicly available, and it has things like the dog's name, the gender, the coloring, the zip code, and also a feature recording whether or not the dog has been spayed/neutered. I'm going to use this dataset to build a model to predict whether a New York City dog has been spayed or neutered.

Before we jump into this problem, I'm going to discuss what my typical data scientific method looks like. First there's ETL: extract, transform, and load. What this means in general is that you want data that is as clean as possible and can be readily explored, so you get your data to that state, and that's what ETL is for. If you're lucky, you have a data engineer to do that for you instead. Data engineers are great; they like to do those things. As a data scientist, I don't really like to do those things, so hopefully your company has someone who does.

Next there's pre-learning, by which I mean EDA (exploratory data analysis) and any feature engineering. You might think, okay, maybe if I apply some sort of transformation to a feature it might be more predictive; you can do that during this phase, while you're visualizing the data. The next step is learning: modeling your data, or training your model. Data scientists say this a lot, that they're "training the model", and what they mean, essentially, is that they're sending data through an algorithm or function to optimize another function. Then there's post-learning: you've trained a model, and now you should definitely evaluate it and see how it's doing on your data. You should probably also document it and present it, so you can share it with other stakeholders in a consumable format. If all goes well, then it's deployment time.
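As a sketch of the pre-learning step described above, here is what a simple engineered feature might look like in pandas. The column names and values are invented stand-ins, not the real NYC dog dataset's schema:

```python
import pandas as pd

# A toy stand-in for the NYC dog dataset; real column names may differ.
dogs = pd.DataFrame({
    "dog_name": ["Bella", None, "Max", None],
    "gender": ["F", "M", "M", "F"],
    "spayed_neutered": [1, 0, 1, 0],
})

# Feature engineering: a transformed feature can be more predictive than
# the raw one. Here "is the dog name missing?" becomes a binary feature
# instead of using the free-text name directly.
dogs["name_missing"] = dogs["dog_name"].isna().astype(int)
print(dogs[["dog_name", "name_missing"]])
```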
So you can think of creating a microservice, or data science as a service, to call the model in production. I'm going to focus on the middle three sections, because that's where I think using R and Python together is the most interesting. For deployment it's clear: you should use Python, not R. And for ETL, I don't really like to do any ETL work, so I'm not going to talk about that. So I'll focus on those middle three.

Okay, so the plan of action for this dog dataset is to use the other variables, a.k.a. features, such as the dog's name, gender, etc., to produce a prediction for whether or not we believe a dog is spayed or neutered. The pre-learning I'll do in R, and the rest I'll do in Python.

Okay, so the pre-learning, the EDA part: it can be done quickly in R, and we can easily share it if we use R Markdown. By now you've seen a lot of presentations with Jupyter notebooks, and you've seen how great they are: they allow for reproducible analysis, you can organize your code into chunks, and it's easy to provide reports to others. R Markdown is very similar, but I would say one big pro for R Markdown is that once you get a handle on the weird syntax in R, it provides very clean visuals. So at this stage I typically don't use Python; I use R Markdown.

Let me show you an R Markdown file for the dog dataset. You can see I have code chunks right here: to define a code chunk, I start it with three backticks, and then I end the chunk with three backticks to show that it's done. What I like about R Markdown is that I can run each individual command; when I load something, you'll see there's a green status bar that shows where that line of code is in its execution. So I have the dogs dataset loaded, and I can, in the R console, just look at this dataset very quickly.
I can do other stuff with this dataset too: I can run a summary of it without affecting the R Markdown file, and, like in Jupyter notebooks, I can run a code chunk and then see that chunk's output. When I'm running, say, a big code chunk, I can very easily see how it's progressing, where it's stuck, and where it errors out. And I can do what we call knitting the document, which essentially generates HTML or PDF. In Jupyter notebooks you can do a similar thing, but with R Markdown you can see, okay, right now it's running the code chunk "geography", so if I see an error there, I know where to look, and I know what's taking a long time. Right now it's 65 percent complete; it's still running.

And what I get back... sorry, I don't know where my presentation went; okay, here we go. I'll get an HTML document, and you'll see it has a nice table of contents and a header. Where this comes from is that, in the R Markdown file, I start the document with YAML: I type the title, I say it's going to be an HTML document, and I say it has a table of contents. Then I write the content in Markdown. That's why it's called R Markdown: you write in Markdown, and you get this nice output. It's very similar to Jupyter notebooks, but for me personally it's just more organized. So I can see box plots, and if I want to include code, like I can in Jupyter notebooks, I can do that as well.
I can include maps, different visualizations, etc. So you can see things like: okay, most people have just not entered their dog's name, and that can potentially drive some feature engineering, like "is the dog name missing", and things like that.

Okay, so R is great at that, but Python is a lot better once you get to the modeling phase, at least for me personally. Python plus scikit-learn, which provides the machine learning algorithms and models you can use to fit data, is great, and adding pandas and NumPy just makes working with data and modeling it a lot cleaner and easier for me than it would be in R. So I've created a Jupyter notebook where I've imported pandas, NumPy, and all these scikit-learn helpers and functions. I've loaded the data and split it into training and test sets; this is just a prebuilt function in scikit-learn. Then you can immediately start fitting your model. Here I'm running a logistic regression with an L2 penalty, which means I'm running a ridge regression, and I can look at different model metrics, feature importances, and other plots. Here I'm looking at the AUC score, which is an ML model metric that goes between zero and one. It should be higher than 0.5; if it's less than 0.5, you're doing something wrong, so make sure it's above 0.5. And what's great about scikit-learn is that I can fit a bunch of different models.
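The modeling step just described can be sketched in a few lines of scikit-learn. Since the talk's actual dog dataset isn't reproduced here, this sketch uses synthetic data with a binary target as a stand-in:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dog dataset (binary response, as in the talk).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Split into training and test sets with scikit-learn's prebuilt helper.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Logistic regression with an L2 penalty, i.e. ridge regression.
ridge = LogisticRegression(penalty="l2", max_iter=1000)
ridge.fit(X_train, y_train)

# AUC lives between 0 and 1; below 0.5 means something is wrong.
auc = roc_auc_score(y_test, ridge.predict_proba(X_test)[:, 1])
print(f"ridge AUC: {auc:.3f}")
```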
So here I'm fitting another logistic regression, but this time with an L1 penalty, and I can essentially run the same function, because it has the same structure and features that the ridge regression had, and fit another model. I can do the same thing with a tree-based model, a GBM, and get the same sort of report without any additional effort. So I can quickly fit different models and see improvements in my model's score very easily and very quickly.

Now I have two files, though: a file in R and a file in Python, and it's kind of annoying because my workflow is very separated out. Is there a way to connect these two languages, connect these two files, and make this a much smoother process? Fortunately, there is. In my Jupyter notebook, I was actually reading the data by calling R, and this was done using rpy2: I import the package, and I have just a few lines of code that activate R; then I use the R function readRDS to read an R object, an RDS file, into Python, and the last line converts that RDS object into a pandas data frame. I can also do the reverse: in R, particularly in R Markdown with reticulate, I can easily run Python code just by specifying "python" instead of "r" in the code chunk header, and then all of a sudden that code chunk runs Python. So I can run a for loop, and I can even import papermill and execute Jupyter notebooks. I'm essentially connecting and automating the Jupyter notebook I created to fit different models from my R Markdown file, and this lets me report, in my R Markdown file, on the Python models that I built and evaluated in Python.

So that's my talk, and you can find the slides and the code on my GitHub. Thanks.
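Because every scikit-learn estimator shares the same fit/predict interface, the "same function, different model" idea above can be sketched as one loop. Again the data is a synthetic stand-in for the dog dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A ridge-penalized (L2) and lasso-penalized (L1) logistic regression,
# plus a tree-based GBM; all share the same fit/predict_proba structure,
# so one loop produces the same report for each with no extra effort.
models = {
    "logreg_l2": LogisticRegression(penalty="l2", max_iter=1000),
    "logreg_l1": LogisticRegression(penalty="l1", solver="liblinear"),
    "gbm": GradientBoostingClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(name, round(scores[name], 3))
```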