Hello everyone and welcome to another episode of Code Emporium, where we are going to talk about some Python libraries that you should be using as a data scientist. Python has been climbing the ranks as one of the most popular languages out there, especially in terms of job opportunities. Studies have shown that about 75% of open data scientist positions list Python as a prerequisite. So this video is going to focus mainly on Python libraries that you'll be using as a data scientist. I have two years of experience as a data scientist, and I can tell you Python is indeed important, and I'm hoping my experience can help you and others like you out there too. The way I'm going to structure this video is that I'll walk through the different parts of the machine learning pipeline and, for each one, give the Python library that I think is the most useful in that domain. Then we'll talk a little bit about the details in each case. Before we get started, please do give this video a little thumbs up because it'll help the algorithm spread the video to people like you who would like content like this too. Also, if you like what you see, please do consider subscribing. We now have a Discord server, and I'll be super active on it. We can come on, hang out, and talk about the stuff that geeks in artificial intelligence talk about. It's going to be super fun, so please do check that out. The links are in the description down below. And with that out of the way, let's get back to the video. So the first phase in the machine learning pipeline is data wrangling, and the most important Python library in this category goes to pandas. Data wrangling is the process of manipulating data, either the structure of table data or the actual values within the tables themselves. Typically this data is in a tabular format, and if it is, it can be easily manipulated with pandas.
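To make that concrete, here's a minimal sketch of data wrangling with pandas. The table, column names, and values are all made up for illustration:

```python
import pandas as pd

# A tiny made-up table of people and salaries.
df = pd.DataFrame({
    "name": ["Ada", "Grace", "Alan"],
    "age": [36, 45, 41],
    "salary_usd": [120000, 150000, 135000],
})

# Structural change: rename a column.
df = df.rename(columns={"salary_usd": "salary"})

# Value change: give everyone a 10% raise.
df["salary"] = df["salary"] * 1.10

# Filter rows: keep only people over 40.
over_40 = df[df["age"] > 40]
print(over_40["name"].tolist())  # ['Grace', 'Alan']
```

The same few verbs (rename, assign, filter) cover a surprising amount of day-to-day wrangling work.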
So pandas treats tables as a built-in data type called a DataFrame, and with that we can perform operations like changing structures and changing values pretty easily. It's definitely one of the most important Python libraries out there, and I recommend you learn it if it's the first thing you ever do with Python. The second phase of the machine learning pipeline we're going to look at is data visualization, and the top contender here goes to matplotlib, with secondary and tertiary mentions of Seaborn and Plotly. Data visualization is essentially turning all the tables and numbers into nice infographics and pictures, which is super important because a picture speaks a thousand words. This matters especially when you're talking to stakeholders, as you often will be in data science: people who don't necessarily have a data background and will be easily overwhelmed by large tables of numbers. Showing them a nice little visualization goes a long way in pushing your point forward, and they get to understand exactly what you're talking about without being bogged down by the numbers. Data visualizations are also super important just for you to understand exactly what's going on, be it through bar charts, line plots, or whatever else. Matplotlib is a great Python library since it supports all of this: bar charts, line charts, box plots, anything you'd want and anything you'd need. I have a special mention for Seaborn because it's built on top of matplotlib. It lets us do the same things matplotlib does, but with less code, and it has good built-in visualizations that come out of the box that I really like for certain cases, so I throw in Seaborn whenever I can. I also mentioned Plotly, which is great for creating interactive diagrams and charts.
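As a quick sketch of what a basic matplotlib chart looks like, here's a bar chart of some made-up monthly sales numbers. The Agg backend just lets it run without a display attached:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; renders without a screen
import matplotlib.pyplot as plt

# Hypothetical monthly sales figures.
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 160, 150]

fig, ax = plt.subplots()
ax.bar(months, sales)          # one bar per month
ax.set_title("Monthly sales")
ax.set_ylabel("Units sold")
fig.savefig("sales.png")       # write the chart to disk as an image
```

Swapping `ax.bar` for `ax.plot` or `ax.boxplot` gives you line charts and box plots with the same structure.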
Basically, you might have seen these charts where you can hover over some part of them with your mouse and a tooltip pops up on the screen with some additional information. Plotly is great for these interactive charts, and I've also seen it used in a couple of business intelligence tools. So I definitely recommend checking out Plotly for some additional fluff in your data science visualization arsenal. It's pretty cool, so please do check it out. There are also a ton of other data visualization libraries out there that I haven't mentioned, but I haven't really used them much. You can always explore them as and when you wish, but I thought I'd give you only the most important ones. Moving on to number three in the machine learning pipeline, that is feature engineering, and the most important Python library in this category goes to statsmodels. Feature engineering is essentially trying to figure out what features can be used in your model, whether it's a classification model or a regression model. The idea is to convert and make sense of the data you have in order to extract features you can feed into a potential model, and here statsmodels can help because it lets you build very simple models, like logistic regression and linear regression, super easily. Essentially, if you have an idea for a feature in mind, you would use it as an X variable, a potential input feature, with the Y variable being the target, and you can find the correlation between these two variables within just a few lines of code.
You can then decide whether to include the feature in your model if you see a high correlation, or try increasing the complexity of that feature, maybe by using squared or cubed terms of it, or a log or a square root or something of that sort, and see if it still has some correlation with the target. If it does, maybe you decide to include it. If it doesn't, it's probably not worth it. So statsmodels can be super useful for engineering features for your model. And now coming up at number four, we have the modeling phase, and the most important Python library for modeling goes to scikit-learn; no surprise here. Scikit-learn has always been touted as one of the go-to libraries for modeling: you want to create a model, you use scikit-learn. But something that is underrated about scikit-learn is the ability to create machine learning pipelines. Machine learning pipelines are end-to-end flows that cover not just creating a model, but also the preprocessing stages, like standard scaling or encoding of your variables, then piping the data into a machine learning model, training that model so you have a trained model in hand, and making sure that all of your data is in the correct format throughout. This entire flow can be coded in a very reproducible and reusable manner with scikit-learn. What that means is you can take a pipeline that you create with scikit-learn for a specific problem and transfer it to another problem using literally the same code. All you would need to do is change the input data, and it can be reused super quickly and easily once you've built one model out completely. And I feel like this aspect of scikit-learn has not been getting enough traction.
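To sketch what such a pipeline looks like, here's a minimal example chaining standard scaling into a logistic regression, using scikit-learn's built-in iris dataset as a stand-in problem:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Preprocessing and model chained into one reusable object.
pipe = Pipeline([
    ("scale", StandardScaler()),                  # standard scaling step
    ("clf", LogisticRegression(max_iter=1000)),   # the model itself
])

pipe.fit(X_train, y_train)         # the scaler is fit on training data only
print(pipe.score(X_test, y_test))  # accuracy on held-out data
```

Reusing this on a different problem means swapping only the data-loading lines; the pipeline definition, fitting, and scoring code stay exactly the same, which is the reusability being described above.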
And in order to address that concern, I actually created a video on the machine learning pipeline to explain exactly how scikit-learn can be used to create such pipelines, and I encourage everybody to code out their pipelines keeping this in mind. So do check it out after this video. Coming up at number five in the machine learning flow, we have model interpretability, and the most important Python library in this category has to go to SHAP. Model interpretability is essentially taking a model and trying to determine which features actually matter most to that model. SHAP is a great Python library because it also offers the kind of reusability I mentioned in the machine learning pipeline section, number four. By reusability here, I mean that we could use the same SHAP code to establish the interpretability of a scikit-learn built-in model and reuse that same code to interpret a customized PyTorch neural network. It is all the same code; it's literally a copy and paste with a different model in it. SHAP is a library that uses game theory to assign importance to a set of individual features: you have a number of features that essentially compete for model importance, and because this can be treated in a way that is abstracted from the actual model itself, it can be used to interpret any machine learning model out there. In fact, it is so reusable that I've created an entire video just on SHAP and how powerful it can be, using some example models for interpretability. So if you're really interested in expanding your view of model interpretability, please do check out that video. And as number six in the machine learning flow, kind of a bonus, I've included SQL processing, and I've given this vote to pandasql.
So typically you don't really need this, because the way you usually get your data is from a data warehouse: you write queries against the warehouse, process the results with pandas, and move forward with the machine learning flow. But there are situations where you might be reading data from a CSV file and want to process everything within pandas itself from the very beginning. Sometimes it's a little tough to do that processing with pure pandas commands, and it's just easier and more intuitive to write group-bys, aggregations, and more complex operations in SQL. pandasql allows us to manipulate DataFrames with SQL code, and that's why I've included it here. It can be useful in some cases, but it's definitely not a 100% priority, and it should be easy to pick up regardless. And that's all I have for you now. I hope you really did like these tips and these libraries. Did you learn something new here? Did you not learn anything new? Please let us know in the comments down below. Also, please do remember to give this video a like to spread the word, do subscribe, and join the Discord server linked in the description below. Like I said before, we're going to have a lot of activity going on there. I'm just setting it up now, but I'm sure we're going to make it a big, fun place for all geeks to enjoy. And until next time, I will see you very soon.