Hey everyone, my name is Jeffrey Miu. I'm a product manager here at Microsoft on the Python, data science, and AI team for VS Code, and by the end of this talk I hope everyone can see how to accomplish your own data science tasks faster and more efficiently in VS Code. This is the last session before lunch, so I'm extra thankful that everyone came. Hopefully you're all true VS Code supporters, because you're taking time out of your lunch to come watch this talk.

Today I'll be going through a typical data science project in VS Code. I'll walk you through the project step by step: getting started, data preprocessing and cleaning, model selection and training, and finally deployment of the model. Along the way I'll showcase the data science features VS Code has to offer to speed up your development, things like VS Code notebooks, Data Wrangler (which is a completely brand-new project), GitHub Codespaces, and more. Hopefully by the end of these 30 minutes you'll see how VS Code revolutionizes the way you do data science. Even if you've used Python or VS Code before, much of what you'll see today will be features and workflows you haven't seen before: brand new, really cool, cutting-edge stuff.

A quick bit about myself before we get started. I've been on the Python team at VS Code for around three years now. I love dogs (as you can see on the screen), cats, and Python, the language; I'm still unsure about the snake. You can find my contact info on the screen, and if you have any questions during the presentation, feel free to come find me at the Microsoft booth we have downstairs.

Let's start with a quick demo of what we're actually building today. Everyone has probably heard of the Great Resignation happening right now in the world, especially prominent in the tech industry that most of us are in. Because of this, I want to analyze some salary data, along with skill sets and backgrounds from the Python industry, to see if I can develop some sort of crude model to predict what your salary should be given your technical skill set and background, with the hope that it helps determine your personal worth and value in the industry.

So let's go over to my favorite editor, VS Code. I have a simple notebook, a VS Code notebook, with some ipywidgets that just create a form to make this a little more interactive and showcase that capability. For the first prediction, let's pick a full-stack developer with a bachelor's degree, someone typical who works with some SQL databases,
knows Azure and AWS, works with React, Docker, and Git, and I'll put in three years of coding experience. Then let me run the prediction and see what we get: this model predicts that their annual salary should be around $123,000 US. I'll go over the dataset we're using in a bit, but I want to note that it comes from over 80,000 different tech salaries around the world, so it's a statistically significant sample size. Now that we've seen a sneak peek of what we're going to be building today, let's get started on actually building it from scratch.

Let me just jump back. Okay, let me move this so it's easier to see. Today we're going to walk through the typical data science workflow to build our salary prediction model, and at each step I'll show you how VS Code improves the way you'd typically do it. At a high level, I like to break it down into three phases. The first phase is data exploration: finding the appropriate dataset, getting the data into data stores, doing the data cleaning and preprocessing so it can be fed into models, and doing exploratory analysis to look for trends or patterns in your data. The second phase is model training: building the training script, figuring out which models to use, figuring out the compute (because most likely you don't want to be running it all on your local machine), and tuning the parameters. The final phase I like to call productionization: packaging and saving your model, deploying it to the cloud, and creating APIs so you can call your inferencing service from your application. By inferencing here I just mean model prediction.

The way this diagram is laid out doesn't necessarily show how much time is spent on each step. In the latest survey from Anaconda, they found that the majority of a data scientist's time is actually spent in the first phase, the data exploration part, specifically around data cleaning, preprocessing, and preparation; as you can see on the screen, it takes up over 50% of working hours on average. So the effort is definitely not proportional: this one phase at the beginning takes up over half of your time. That correlates with the majority of data scientists saying it's the least enjoyable part of the job, because of how mundane and tedious the work is. So I'm going to show some of the features we have in VS Code that can speed up your time in these slower, less enjoyable areas of data science.

Let's start with the data exploration phase. We'll be using the latest Stack Overflow Developer Survey, which asks a bunch of questions like what languages and frameworks you work with, as well as things about your job, such as what role you work in and your salary. That's the data we'll be using to create this model.
There's a lot of relevant data we can extract features from, and as I mentioned, this data has been shared publicly with the community (you can see the link up here), and there are over 80,000 responses as well.

Cool, that's enough slides for now. Let's move on to the live demo, which will be the majority of this talk. Before you start with any coding, of course, you'll need VS Code downloaded. If you don't have it, you can go to code.visualstudio.com and download it for your computer; it's completely free, open source, and cross-platform if you haven't tried it. Once you have VS Code downloaded, go to the Extensions tab. This is what VS Code looks like on a Mac, and the Extensions tab is the icon that looks like four cubes. Search for the keyword "Python"; the first result should be the one from Microsoft, and installing it will get you all the Python and Jupyter notebook functionality for VS Code. That's where all of the data science features I'll showcase today live as well.

Cool, so let's go back to the top of our notebook. We'll be using three super common Python data science packages for this project, all installable via Anaconda or pip: pandas and NumPy for data ingestion, cleaning, and preprocessing, and scikit-learn for model training and inferencing. Once everything is installed and set up, we can get started.

I have my Jupyter notebook here and I'll walk through step by step what I've done. I might step through the code a little quickly because it's a 30-minute talk, and the focus here is mostly on the tooling and the productivity it provides, so if you want to look at the code itself, feel free to pause and re-watch this recorded talk anytime to look at it in more detail.

One of the first things you need to do in any data science project is import your dataset. You can see that in the second cell here (hopefully the font is a good size, but let me know if I need to zoom in). The first thing you'll want to do is use pandas: as I type pd.read..., IntelliSense in VS Code gives me intelligent suggestions and autocompletions right in the cell. It can tell I'm trying to call read_csv, so I can just press Tab to accept it, and it shows me the docstring for read_csv along with some examples. Then I point it at the Stack Overflow survey dataset, and because VS Code knows this file is in my Explorer, it suggests the path for me too. To run the cell I can either click the Run button here or use the Jupyter hotkey Shift+Enter. A rough sketch of what that cell ends up looking like is below.

Now that we've created this new variable df: oftentimes you'll actually end up with hundreds of different variables in your notebook, and it gets hard to keep track of them all. Notebooks are great for flexibility, since you can run cells out of order and multiple times, but one of the biggest annoyances for data scientists is keeping track of the state of all those variables.
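Here is roughly what that import-and-load cell looks like. This is a minimal sketch; the survey file name is an assumption, not necessarily what's shown on screen in the talk.

```python
import pandas as pd
import numpy as np

# Load the Stack Overflow Developer Survey results (one row per respondent)
df = pd.read_csv("survey_results_public.csv")

df.shape  # on the order of 80,000 rows and 50+ columns
```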
To make that easier, we built a variable explorer directly into the notebook editor. If you click this icon at the top right, it brings up the bottom panel, and if you click on Jupyter Variables you'll see essentially all of the active variables in your notebook and running kernel, along with their most up-to-date values. You can see the df I just created here, with information like its type and size; you can see it's over 80,000 rows. I have a bunch of other variables in here because I've already run this notebook for the sake of the demo, but you can see how everything is alphabetized and it's really easy to keep track of everything happening in your notebook.

Now that we've imported our data into the notebook, we'll want to look at the data itself, and probably the most common API every data scientist uses for that is df.head(). Once I run df.head() we see a snippet of our data. It says there are actually 50 columns, but if I scroll across, there are definitely not 50 being shown here. That's because by default Jupyter truncates large datasets. If you want to see the entire dataset, you could go to Google or Stack Overflow and search for the code you need to write to show everything (roughly the snippet shown below), but this view is always going to be pretty limited because it has to fit inside a notebook cell output. And this is where Data Wrangler in VS Code can come in and help.

Today I'm excited to show you a sneak peek of a new project we've been working on on the VS Code team called Data Wrangler (sorry, I'm having trouble scrolling for some reason). It hasn't been released to the public yet, so I think this might actually be one of the first public audiences to see it. In a nutshell, Data Wrangler is a free, code-centric data cleaning and preprocessing tool that you can interact with through an Excel-like interface, but its bread and butter is automatically generating the underlying Python code in the background as you perform your data operations. Rather than giving a whole marketing spiel about what it is, let's just go ahead and use it.

When you're working with VS Code notebooks and you output a DataFrame, you'll now see a Launch Data Wrangler button above the table. Clicking it takes you to a new tab where you can perform all of your data cleaning and preparation tasks, and the cool part is that it's a sandboxed environment, so you can explore and manipulate the data to your heart's content without ever worrying about messing up your original dataset. When you first launch it, you'll see a grid view of your entire dataset rather than the truncated view you saw earlier, so it gives you way more information.
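As an aside, the notebook-side workaround mentioned a moment ago (telling pandas not to truncate its output) looks roughly like the following; it's shown here just for comparison with what Data Wrangler gives you for free.

```python
import pandas as pd

# Ask pandas to display every column instead of an abbreviated view
pd.set_option("display.max_columns", None)
df.head()
```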
In that grid view you can see every single column, and you can also see summary statistics, which is super nice. On the left-hand side it shows which columns have missing values, so it's really easy to tell whether the Stack Overflow data, or whatever dataset you're working with, needs to be cleaned, and if so, which areas of the data you actually need to look at.

Going back to the problem at hand: the dataset has a ton of information, 50-plus columns, and we have to sift through it to figure out the relevant data to feed to our model. Let's start with something simple, such as filtering for Python developers only. I've obviously looked at this dataset already, so I know some of the key columns we'll be looking at. You can see the column for languages I've worked with here; we want to filter it for the keyword "Python", so I can right-click, choose Filter, and then, because a lot of the data is aggregated in this one cell, search for anything that contains the keyword "Python". As I set up the operation, not only do I get a preview of what's happening to the data grid (hopefully the contrast shows it, but the rows being removed are highlighted in red), but at the bottom Data Wrangler also automatically generates the code required to do that operation. That really instills trust in the tool: rather than being a black box that just does things in the background, it tells you exactly what's happening. If I like what I see, I just click Accept. A rough hand-written equivalent of these first few operations is sketched below.

Another thing data scientists typically have to do is pick which columns to keep. Like I mentioned, there are around 50-plus columns; a lot of the data is interesting and relevant and might help us, but some of it won't be useful, like whether the respondent has a Stack Overflow account or how often they check Stack Overflow. We can select columns interactively, just like in Excel, with a Ctrl+click, and if I want to remove them I can do a simple Drop columns, click Accept, get a preview of what's happening, and they're all removed.

One other really useful thing: if we look at the compensation data in the last column, there's a column called ConvertedCompYearly, and that's the data we actually want because it's the annual salary, but it has a lot of missing values. One thing we can do, as many data scientists would, is remove those missing values. To do so, I can again right-click the column and choose Drop missing values, and it gives me a preview of exactly what's being removed.
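For reference, a rough hand-written equivalent of those first few steps in pandas might look like the sketch below. The column names come from the survey but are illustrative here, and the code Data Wrangler actually generates will differ.

```python
# Keep only respondents whose language list mentions Python
df = df[df["LanguageHaveWorkedWith"].str.contains("Python", na=False)]

# Drop columns that aren't useful for the model
# (e.g. Stack Overflow account / visit-frequency questions)
df = df.drop(columns=["SOAccount", "SOVisitFreq"])

# Drop rows with no reported annual compensation
df = df.dropna(subset=["ConvertedCompYearly"])
```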
I can click Accept, and again you can see all the code that Data Wrangler automatically generated for me.

The next thing I need to do is convert this compensation column from a currency-formatted string into some sort of number, so I can actually do numeric operations on it, and so our model can understand it, because it's not going to understand a string. Normally I'd need to write some custom code, maybe some regex, to extract the value. Many of you are probably cringing at the word "regex" right now, because every data scientist dreads it, but we can instead leverage a built-in transformation in Data Wrangler called Flash Fill. If I right-click and choose Flash Fill, it uses an AI model in the back end to understand what information you're trying to derive from the original source column, based on just a single example. Say I want to extract the actual numeric value: I type in 51,552 for this row, and based on that one example the model infers the rest of the values for the new column. The other great thing is that it did it perfectly and also gives me the exact code, and we try to make that code as human-readable as possible. It sticks to built-in Python functionality (you can see there are no additional imports needed), and it comes with a clear definition and a really nice docstring, so it's extremely clear and reproducible for users. A hand-written equivalent of this step, the regex route that Flash Fill saves you from, is sketched just after this part. This looks good to me, so I click Accept, and now I have this brand-new column. That took a few seconds; doing it with regex and Googling might have taken five or ten-plus minutes.

The last part I want to quickly show: the new column is still of type object, so basically a string, and like I mentioned, we want something the model can work with numerically. So let's convert it, say to float64, and click Accept; now it's a float value.

The other really interesting thing is that after converting it, we can see Data Wrangler also has built-in visualizations. This histogram was generated automatically; I didn't write any code for it. You'll notice there are a lot of outliers: most of the data is concentrated at the very left. We can zoom into the histogram just by dragging, and we can see that most of the data sits below a $500K salary.
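For comparison, the hand-written route being avoided here (extract the number from the currency string, then cast the column to a float) might look roughly like this in pandas. The column names are illustrative and this is not the code Data Wrangler generates.

```python
# Pull the digits out of a currency-formatted string (e.g. "USD 51,552")
# and store the result as a float64 column
df["AnnualComp"] = (
    df["ConvertedCompYearly"]
    .str.replace(r"[^\d.]", "", regex=True)  # strip symbols and commas
    .astype("float64")
)
```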
The salary data points here range from essentially zero to over $10 million, so we're going to want to remove a lot of these outliers, as we call them in the industry. Looking at the summary statistics again, the minimum value is 48, and I think that's noise as well, because most likely nobody is making $48 annually as a software engineer. So we can do another filter operation for a sensible range. Let me hide this real quick and type it in: let's say, somewhat arbitrarily, we want to keep data points greater than $10,000 a year, and let's also add a maximum, say less than $500,000. As I type all of this, the code is automatically generated for me and the preview is updated live; you can see these rows highlighted in red because they're at $500,000 or over. Once everything looks good, I click Accept, my data is filtered, and if I look back at the visualization the data looks a lot cleaner, with far fewer outliers. If I wanted to remove even more, I could set the filter lower.

The last thing I want to quickly show is that a lot of the data is aggregated into single columns. If I want to look at what packages people work with, you can see it's all packed into one column, so normally I'd need to split that column out and encode it for my model to understand. Instead I can leverage Data Wrangler's built-in multi-label text binarizer: just right-click, specify a delimiter of a semicolon, and Data Wrangler automatically figures out all the categories based on that delimiter and creates a new column for each of them (one for .NET Core, .NET Framework, Apache Spark, and so on), setting a value of 1 to encode whether the respondent actually uses it. Again, I can quickly click Accept.

I'll skip the rest of the data cleaning operations for the sake of time, but I want to point out that when you're done, you can just click "Export code back to notebook and exit", and all of the code you effectively just wrote is generated and dropped back into your notebook, where you can rerun it as normal. A rough pandas equivalent of those last two steps is sketched below, to give a feel for what the exported cell contains. And again, none of this modified the original data unless you actually run that code yourself. Hopefully you can see how much time this saves during the most time-consuming part of the data science workflow.

Let's jump back to the slides and quickly summarize. Data Wrangler is a code-centric data cleaning and preparation tool built directly into notebooks, so I didn't have to use any separate tools. It has smart suggestions, quick insights like summary statistics that tell you what you need to do with your dataset, a bunch of built-in data transformations (I only touched the tip of the iceberg), and built-in data visualizations, like the histogram you saw.
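Here is a rough pandas equivalent of those last two operations, the salary-range filter and one-hot encoding a semicolon-separated column, to give a sense of what ends up in the exported cell. It's a sketch with illustrative column names, not the literal generated code.

```python
# Keep salaries in a plausible range to drop the outliers
df = df[(df["AnnualComp"] > 10_000) & (df["AnnualComp"] < 500_000)]

# Expand a semicolon-separated column into one 0/1 indicator column
# per category (multi-label binarization)
tech = df["MiscTechHaveWorkedWith"].str.get_dummies(sep=";")
df = df.join(tech)
```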
So you saw with like the histogram as well So because we're working with Now that we've been actually cleaned and encoder our data We can go ahead to the second phase of training which is around Which is training so because we're working with the continuous data set of salary data We want to we can easily leverage sk learns existing library of linear models rather than writing your models from scratch So here's just a simple graphic of how you can actually choose which models right for your given problem space But just specifically for this problem I chose ridge regression because for the salary For the salary prediction case we're working with the regression problem But with under a hundred thousand data points But there's potentially a lot of different features that we can determine some salary So you can see if I just follow that flow chart. I end up on ridge regression But let's just jump back to the notebook again to quickly see the actual code for this So let me just jump back To the model training section so you can see here again I'll need to first split my data into test and training sets I'm just using the train test split function from sk learn makes it super simple Saves me from running a lot of code and you can see I'm allocating 20% of my test data for training and the rest will be Used for testing sorry 20% for training testing the rest will be used for training The input for my model will be a bunch of features that I've cleaned and pre-processed that I've already done with data wrangler ahead of time Just for the sake of this demo things like you saw in the beginning of the presentation like your batch ground education level What ml packages or what web packages are using so you can see that's my x value and my y is just converted Compurely because that's what we're actually going to be predicting upon and Then here I'm just defining the model and then the fit functions during the actual training So because I'm just using a simple tabular data set. 
Because this is just a simple tabular dataset, no images or anything too complex, the training is pretty quick to run, so I can run it on my local machine. But if I were doing something like a convolutional neural network on millions of images, my laptop would definitely be too slow, and this is where the Remote - SSH feature can come in and save the day. If you want to leverage more compute power for your training job, or just have it run in the cloud, VS Code makes it super easy to connect to remote servers over SSH. All you need to do is download the Remote - SSH extension, again from the marketplace: just search "remote" and you'll see it. Once you have it, click the green button in the bottom left, choose Connect to Host, and type in the host IP of your server. VS Code's back end then runs on that server, but the front end stays exactly the same, so as the end user all your training runs on that remote machine while the experience in the editor doesn't change.

Once I'm done training, you can see I get an accuracy of around 87%, which is pretty solid for this simple, trivial example. When I'm done with the training and happy with the code, and I want to share it with others or even use Git to deploy it, the great thing is that VS Code has native Git integration with full support for notebooks. If I go into the Source Control tab and want to see what changes I made to this notebook, VS Code puts the diff into a really nice UI: rather than viewing a raw JSON diff like you would in a terminal, VS Code understands what I changed in the actual input cells, what changed in the metadata, and what changed in the outputs, and organizes all of that for you, so it's super easy to do a Git diff when you need to.

The other amazing thing about the Git integration is that the entire data science experience you just saw can be done with GitHub Codespaces. GitHub Codespaces is essentially VS Code built into GitHub: if you have a GitHub repository, like the one you see here, you click the Code button, and in the drop-down you'll now see a Codespaces option. Click that and it launches something like this: the exact same interface, the exact same features, my same notebooks, the same dataset, everything, but now I'm running in the browser.
So I don't actually need to have VS Code downloaded to do everything you saw today. That was the training phase in a nutshell.

The last step is to productionize the model. Once we're happy with it (let me jump back to the code), I need to serialize it; I kept this super simple for the sake of time and just used the pickle package, so I can save the model to my local directory and then upload and deploy it to the cloud. A rough sketch of that save-and-load step is at the end of this section. Once the model is saved, we can use cloud services like Azure to deploy it, and what's great is that VS Code has direct integrations with these Azure services: things like Azure Machine Learning if you want to run experiments, or Azure Functions if you want to host an endpoint, all directly integrated into VS Code, again just through the Extensions tab. I won't get into the details in this talk, but I'll quickly showcase some things you can do with the Azure Machine Learning extension: you can deploy your models from here, and it automatically keeps track of the versioning of your models for you; you can create endpoints to actually hit that model if you're building an API; and there are computes and data stores if you want to store your dataset there as well. For the sake of time I'm going to skip through this a bit.

To close out: I know we talked about a lot of different features in VS Code today, so let me summarize everything into some easy takeaways, going back to the data science workflow diagram you saw earlier. We started with data exploration, and we did that with Data Wrangler: all of the code was automatically generated, and you saw how much faster it was. The next step was training, where we used VS Code notebooks with Jupyter: we developed the training script, figured out the compute, and saw that we can connect over Remote - SSH for remote compute power. And the last part was productionization, where we used the various Azure extensions for VS Code, like Azure Machine Learning, to do inferencing, store models in the cloud, and so on. But the most important thing, which many of you might not have realized, is that I did all of this inside VS Code. I didn't need to context-switch back and forth between multiple tools or do much setup at all, and that's how VS Code is revolutionizing the way you do data science.
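Here is that save-and-load sketch mentioned above: pickle the trained model to disk, then load it back later (for example behind an endpoint, or in the little prediction form from the start of the talk) and call predict. The file name is just illustrative.

```python
import pickle

# Serialize the trained model so it can be uploaded and deployed
with open("salary_model.pkl", "wb") as f:
    pickle.dump(model, f)

# Later, wherever inference happens: load it back and predict
with open("salary_model.pkl", "rb") as f:
    loaded_model = pickle.load(f)

print(loaded_model.predict(X_test.iloc[[0]]))  # predicted annual salary
```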
So, what's next? If you're interested in trying out your own data science workflow in VS Code, you can go to aka.ms/vscode-jupyter to see how to get started. And like I mentioned, Data Wrangler hasn't officially been released to the public yet; we're targeting some sort of beta build, potentially this summer, so stay tuned for that. If you want a sneak peek, or to join our mailing list for early insiders access, feel free to click this link and submit your information; you'll be notified when the official beta is released and you'll be added to our mailing list for beta builds. We have a lot more exciting new features for Python and data science coming in the pipeline, so hopefully I'll be back again soon to share them with you. But thank you so much for your time, and thank you.