Hello everyone, and welcome to another episode of Code Emporium, where we are going to talk about some data science interview questions. Right here we have this wonderful site, Intellipaat, and if we scroll down we have 78 data science interview questions with some answers. I'm going to go through some of these questions and their answers, maybe only 10 or 15 of them at the basic level, and give my two cents on what I would add to actually make them interview ready. Typically, when questions like these are in blog post form, the answers seem very technical and very textbook, and I want to show how you can make them more interview ready.

If you haven't hit that like button on the video already, please do. We now have a Discord server, so join the community; the links are in the description below. We're going to talk about so many things there, and I'm super active, so we would love to have you. With that, let's get started.

All right, starting with some of the basic data science interview questions: what do you understand by linear regression, and what do you understand by logistic regression? Questions like these are extremely open-ended, and the way I would typically deal with them is, before you even start answering, to ask the interviewer what they're actually looking for. If I'm asked what I understand by linear regression, I understand a lot of things, but it might not be what the interviewer is actually looking for. Do they want you to start with a simple explanation and the four main assumptions of linear regression? Do they want you to talk about the loss function, how it has a closed-form solution, and how you can also solve it iteratively with a gradient descent approach? Or do they want to know something else? Remember, an
interview is not just you answering questions; it is a conversation between you and the interviewer. You need to get to know them as much as they need to get to know you, so do be communicative.

Now, a question like this one, what is a confusion matrix, and the next question here, what do you understand by true positive rate and false positive rate, are very different from "what is logistic regression" and "what is linear regression", because these questions have a very precise definition that you can probably explain in a couple of sentences. I'd also encourage you to add examples when you're explaining this. In this case they do have the cool confusion matrix table, which is good, but many interviews are verbal, and you're going to have to communicate these ideas in words. So you might want to practice a one- or two-line definition in words so that you don't stumble too much during your interview on these simple definitions. In addition, you might want to define true positive rate and false positive rate with an example. I have a video about this that I made a while back, so if you are interested, please do check it out.

Coming to our next questions: what is data science, and how is data science different from traditional application programming? As an interviewer, I would actually not ask these very vague and broad questions, because even if a candidate is able to explain to me what data science is, I don't get as much information from them as I would if I gave them a case study or some of the more technical questions, like the first four. But if for some reason you are given this very broad question, I think these first two sentences have some pretty good information that you could use.

Coming to the next question: explain the differences between supervised and unsupervised learning. I like the way that they split this up into a tabular column, and they do provide
some relevant points. Supervised learning contains both inputs and the expected output, that is, labeled data, whereas unsupervised learning uses unlabeled data. That is true. Then, supervised learning is used to create models that can be employed to predict or classify things. Sure. And unsupervised learning is used to extract meaningful information from large volumes of data. Maybe give some examples there, because these two definitions are a little bit vague; some practical applications would be pretty useful here. Then, commonly used supervised learning algorithms are linear regression, decision trees, etc., and for unsupervised learning we have k-means, the Apriori algorithm, etc. One little tidbit for when you're actually talking to an interviewer: I don't like the word "etc." If you're going to use "etc.", you might as well say "this, this, and this", or just say "this and this" without using "etc."

What is dimensionality reduction? Dimensionality reduction is the process of converting a data set with a high number of dimensions to a data set with a low number of dimensions, and this is done by dropping some fields or columns from the data set. However, this is not done haphazardly. In this process, the dimensions or fields are dropped only after making sure the remaining information will still be enough to succinctly describe similar information.

This answer is a little incomplete even on a technical level, because it only talks about feature selection, that is, dropping features. You can also take algebraic combinations of multiple features and input each combination as a single feature into your network, and this is the rationale behind principal component analysis, for example. I will say that principal component analysis does make it hard to interpret the model in the end. Interpretability is very important when dealing with production systems in industry, and it might be a little overlooked in academia. I would still mention PCA if given the question of dimensionality reduction, though, because it paints the complete picture.

What is bias in data
science? Maybe the question framing could be a little different, because "data science" is vague and "bias" itself has multiple definitions, but we'll see what they have here. Bias is a type of error that occurs in a data science model because of using an algorithm that is not strong enough to capture the underlying patterns or trends that exist in the data. In other words, this error occurs when the data is too complicated for the algorithm to understand, so it ends up building a model that makes simple assumptions, and this leads to lower accuracy because of underfitting.

Overall, this is true. Underfitting is a result of having an extremely complex data pattern but a model that is not complex enough to capture all of those patterns in the data. High bias can definitely occur in a lot of these more simplistic models. But in an interview, when you are asked this question, you might also be asked about the tradeoff between high bias and high variance, since there is a tug of war that goes on between the two.

Why is Python used for data cleaning in data science, and why is R used in data visualization? I'm reading these two questions out together because they seem very related to each other. Honestly, the use of R in data visualization is probably more of a thing of the past. R is still great for data visualizations, and I personally used to use R for visualizing data because at some point I thought it was better than Python. But I think Python has evolved to a point where there isn't much you can do with R on a daily basis that you cannot do with Python. So for the most part I would just stick to Python, because it is a more general programming language that can be used for a variety of other applications, even within the data science and machine learning domain.

What are popular libraries used in data science? They mention TensorFlow, pandas, Matplotlib, and PyTorch. I'm not sure this question would be asked as is; it would be more
of, like, what are libraries that you know of, or what are applications that you have built, rather than the tools that you used to build those applications. In the end, the final products and actionable outcomes are more important than the underlying tools you used to build them, because you could use any tools as long as they get you there.

What is variance in data science? This harkens back to the bias question and how I said that variance and bias are often asked about together. In general, they say that variance is a type of error that occurs in a data science model when the model ends up being too complex and learns features from the data along with the noise that exists in it. This kind of error can occur if the algorithm used to train the model has a high complexity even though the data and the underlying patterns and trends are quite easy to discover. This makes the model a very sensitive one that performs well on the training data set but poorly on the test data set and on any kind of data that the model has not yet seen. Variance generally leads to poor accuracy in testing and results in overfitting.

Overall, this is actually a pretty good and well-rounded answer. In fact, before getting too much into the weeds of this answer, I would probably talk to the interviewer first and ask them how much they want to know. When you're explaining this kind of concept, it's also easy to give examples: certain neural networks that may have a tendency to overfit, or maybe random forest classifiers and ensemble learning, and how that entire modeling approach can be used to curb overfitting, even if only a little.

And that's all that we have time for today. I hope this basic interview session helped you understand what you can expect from even the simpler questions and how you should be answering them.
Things may seem pretty simple up front, but the manner in which you respond makes a huge difference to how you're perceived in the interview. So I hope this video helped you understand a little more about that communication dynamic. Remember, communication in an interview is absolutely key.

Thank you all so much for watching. In future videos I might continue this playlist of data science interview questions, so do drop a like and comment down below if you like this series. Do subscribe for more, and also hit us up on Discord. We have that Discord server link down in the description below, and we would love to have you join our wonderful community. We're planning some big things here, so we'll keep you posted on that, and I will see you in the next one. Bye bye.
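
As a companion to the advice above about defining true positive rate and false positive rate with an example, here is a minimal sketch in plain Python. It is not taken from the video or the blog post: the function names `confusion_counts` and `rates` and the ten toy labels are assumptions made up for illustration, with 1 standing for the positive class.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = positive, 0 = negative)."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1  # positive correctly flagged
        elif actual == 0 and predicted == 1:
            fp += 1  # negative wrongly flagged as positive
        elif actual == 0 and predicted == 0:
            tn += 1  # negative correctly left alone
        else:
            fn += 1  # positive that was missed
    return tp, fp, tn, fn


def rates(tp, fp, tn, fn):
    """Return (true positive rate, false positive rate)."""
    # TPR: of all actual positives, what fraction did we catch?
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    # FPR: of all actual negatives, what fraction did we wrongly flag?
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


# Hypothetical toy data: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
tpr, fpr = rates(tp, fp, tn, fn)
print(f"TP={tp} FP={fp} TN={tn} FN={fn}")
print(f"TPR={tpr:.2f} FPR={fpr:.2f}")
```

In words, for this toy data: the model caught 3 of the 4 actual positives, so the true positive rate is 3/4, and it wrongly flagged 1 of the 6 actual negatives, so the false positive rate is 1/6. That is roughly the kind of one- or two-line verbal definition worth practicing.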