Hello everyone, and welcome to another episode of Code Emporium, where we're going to talk about data science interview questions. In this video, we're going to look at a blog post by Simplilearn listing the top 50 data science interview questions and answers for 2021. The reason I'm making this video, and probably a series, is that so many blog posts out there have very textbook questions with very textbook answers. In reality, when you're answering these questions during an actual interview, you might want to spice things up and add a bit more practicality. So in this video I'll read through these questions, look at their answers, and give my own two cents on how I would improve some of them. I hope you enjoy it; it's a nice little trial for me, so let's get to it.

Before we do, though, I would really appreciate it if you could give this video a good old like. The more you like, the more other people will see the video, and then they'll like it, and the phenomenon goes on. I'll become a really happy person, you'll be a happy person, and it helps everyone. Also, please check out our Discord server in the description below. We're going to be talking about a bunch of things over there, and I'm pretty active, so hop on over and let's start a community together. With that, let's get started.

Question number one: what are the differences between supervised and unsupervised learning? I like this little table here. They say supervised learning works on labeled data as input, and unsupervised learning works on unlabeled data as input. Fair.
Supervised learning has a feedback mechanism. I don't know what a feedback mechanism is in this context, but if you do, please comment below. And if you're in an interview, do try to give at least a one-line description of what that actually is, because I'm sure your interviewer will be pretty curious too. Let's see: for supervised learning you have decision trees, logistic regression, and support vector machines, which is cool, whereas for unsupervised learning you have k-means, hierarchical clustering, and the Apriori algorithm. I love that they gave example models here, but in addition I would also like to hear a real-world application of where you would use supervised learning and where you would use unsupervised learning. It could be something as simple as fraud classification for supervised learning and topic modeling for unsupervised learning. You can go into the finer details of both if your interviewer wants that. Try to gauge the room, see if your interviewer is interested, and only then give more information.
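To make the labeled-versus-unlabeled distinction concrete, here's a minimal sketch in plain Python. Everything in it is an assumption for illustration: the six data points and their "legit"/"fraud" labels are made up, and the helper names (`fit_supervised`, `predict`, `kmeans`) are hypothetical. For the supervised side I'm swapping in a simple nearest-centroid classifier, which is not one of the models the blog lists, just to keep the example dependency-free. The unsupervised side is a bare-bones k-means.

```python
from math import dist
from statistics import mean

# Toy, made-up data: 2-D points that form two obvious groups.
points = [(1.0, 1.2), (0.8, 0.9), (1.1, 1.0), (5.0, 5.2), (4.8, 5.1), (5.2, 4.9)]
# Labels exist only in the supervised setting (hypothetical fraud labels).
labels = ["legit", "legit", "legit", "fraud", "fraud", "fraud"]


def centroid(group):
    """Mean point of a non-empty list of 2-D points."""
    return (mean(p[0] for p in group), mean(p[1] for p in group))


def fit_supervised(points, labels):
    """Supervised: use the labels to learn one centroid per class."""
    by_label = {}
    for point, label in zip(points, labels):
        by_label.setdefault(label, []).append(point)
    return {label: centroid(group) for label, group in by_label.items()}


def predict(class_centroids, point):
    """Predict the label whose class centroid is closest to the point."""
    return min(class_centroids, key=lambda label: dist(class_centroids[label], point))


def kmeans(points, k, iterations=10):
    """Unsupervised: k-means never sees labels; it only groups nearby points."""
    centroids = list(points[:k])  # naive deterministic init, fine for a toy example
    assignment = []
    for _ in range(iterations):
        # Assign each point to the index of its nearest centroid.
        assignment = [
            min(range(k), key=lambda i: dist(centroids[i], p)) for p in points
        ]
        # Move each centroid to the mean of its assigned points.
        for i in range(k):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = centroid(members)
    return assignment


class_centroids = fit_supervised(points, labels)
prediction = predict(class_centroids, (4.9, 5.0))  # a new, unseen point
clusters = kmeans(points, k=2)

print(prediction)  # a label name, learned from the labeled examples
print(clusters)    # arbitrary cluster ids; they carry no meaning by themselves
```

The thing to notice is the function signatures: `fit_supervised` needs `labels`, while `kmeans` only takes `points`. The supervised model can tell you "fraud", but the clustering can only tell you "group 0" or "group 1", and it's up to you to interpret what those groups mean.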
How is logistic regression done? Logistic regression measures the relationship between the dependent variable and one or more independent variables by estimating probabilities using its underlying logistic function, and they have a couple of images here. This is a pretty broad question, and I would first ask the interviewer what specifically they want to hear. If you start going on a rant about the likelihood estimation of logistic regression, they might just tune out and think, "that's not what I wanted to hear." So clarify what they want before giving your entire response and going super deep into it. It's communication that's important here.

Explain the steps in making a decision tree. Take the entire data set as input. Calculate the entropy of the target variable as well as the predictor attributes. Calculate the information gain of all attributes. Choose the attribute with the highest information gain as the root node. Repeat the same procedure on every branch until the decision node of each branch is finalized. They give a little example there. I kind of like this response because it's simple and to the point and doesn't go into very deep details. Again, if more detail is needed, your interviewer will ask for it, probably along the lines of "what is information gain?" or "what is entropy, and can you explain it in simpler words?" All in all, a good explanation.

How do you build a random forest model? A random forest is built up of a number of decision trees. If you split the data into different packages and make a decision tree on each of the different groups of data, the random forest brings all those trees together. First of all, this definition seems a little vague. I would probably go a little more technical than what is given here, but it's still a pretty good description for a one- or two-line answer. I'd maybe also talk a little bit about ensemble learning
algorithms and how random forest combats overfitting, but all in all, a pretty good answer.

How can you avoid overfitting your model? Overfitting refers to a model that is only set for a very small amount of data and ignores the bigger picture. There are three main methods to avoid overfitting: keep the model simple, use cross-validation techniques, and use regularization techniques. Overall, they have the right key points here. However, I want to really emphasize the first point. Keeping the model simple is probably the most important of the three, because by keeping your model simple and only using certain features, you eliminate so much of the hassle of interpreting your model. Model interpretability is extremely important in industry, although it's sometimes a little overlooked in academia. So keeping your model simple is very important, and I'd probably emphasize that more than the others.

What are the feature selection methods used to select the right variables? There are two main methods for feature selection: filter methods and wrapper methods. Filter methods involve linear discriminant analysis, ANOVA, and chi-square, and wrapper methods involve forward selection, backward selection, and recursive feature elimination. This is an example of a very textbook answer. When I talk about feature selection, especially in an interview, I would talk more from the data analysis standpoint: trying to determine the correlations that exist between your potential features and your labels, if there are any, and then using those to decide which features to select and which not to. In addition, there is the semantic side of things: the meaning of the features actually matters. If two features have very similar meanings, you would only want to select one, and specifically the one that makes more sense semantically. So these are just some extra
points that you want to keep in mind when answering these interview questions in a more practical sense.

You're given a data set consisting of variables with more than 30% missing values. How do you deal with them? The following are ways to handle missing data values: if the data set is large, we can simply remove the rows with missing data values. Okay, so you should not really be resorting to removing rows, because removing rows means removing data that could potentially be used by your model. Instead, I would take the column that has the 30% missing values and see if you can impute values in it. A good tool for this is scikit-learn's SimpleImputer. Once you've imputed the values, try to determine the correlation between this column and the label column. If a correlation exists, that means that even though this column is missing so many values, it is still useful for predicting the output of your model. If there is no correlation, on the other hand, you can think about removing the column itself, not the rows of data.

What is dimensionality reduction, and what are its benefits? Dimensionality reduction refers to the process of converting a data set with vast dimensions into data with fewer dimensions that conveys similar information concisely. This reduction helps in compressing data and reducing storage space. It also reduces computation time, as fewer dimensions lead to less computing. It removes redundant features; for example, there's no point in storing a value in two different units. I think this is also a little too simplistic for dimensionality reduction, but it gets the broad strokes right. In a practical sense, though, there are also quite a few cons to using dimensionality reduction, specifically in terms of data interpretability. When you create these algebraic
combinations of features and use those combinations as the actual features of your model, you lose a sense of interpretation, which is very important in production systems. So there might be situations where the cons actually outweigh the pros, even though the pros are that you can compute faster and train faster with a large amount of data. It's important to note that and keep it in mind, at least so your interviewer knows that you're aware of some of the shortcomings that dimensionality reduction has.

How will you calculate the eigenvalues and eigenvectors of the following 3x3 matrix? This is a very technical problem, and they just solve it very technically as well. I would also recommend that you understand what eigenvalues and eigenvectors actually are. 3Blue1Brown has a great video resource on this, so I recommend you check that out too.

How should you maintain a deployed model? The steps to maintain a deployed model are, first, monitor: constant monitoring of all models is needed to determine their performance accuracy. When you change something, you want to figure out how your changes are going to affect things. This needs to be monitored to ensure that the model is doing what it's supposed to do. That's absolutely right. Every single time you deploy a model, you need to monitor it. How many requests are coming in per second? How many 500s are we seeing because of an internal error? If we do see 500s, do we have logs that we can use to debug them? Especially as soon as you deploy new code, these graphs are something you really need to check. Then evaluate: evaluation metrics of the current model are calculated to determine if a new algorithm is needed. Compare: the new models are compared to each other to determine which model performs best. Rebuild: the best-performing model is rebuilt on the current state of the data. I would say that these last three points on evaluate, compare, and
rebuild are not part of the deployment phase; they should actually happen much earlier, in the model selection and training phase. So you don't really need to mention them here.

What are recommender systems? Here again, a super broad question, where they talk about collaborative filtering and content-based filtering. These are fine, but the main problem is that we don't use collaborative filtering and content-based filtering on their own because of scalability issues. There are just too many real-time systems and way too much data to use these techniques on their own. Instead, a lot of recommendation systems create a vector for every user, called an embedding vector, which encodes the meaning of that user, or a product vector, which encodes the meaning of a single product. These vectors can be plotted in some n-dimensional space, and then for a given user or a given product we can find the nearest users or the nearest products to give recommendations. Spotify, for example, actually does this: they take a user vector and find the nearest user vectors, or take a song vector and find the nearest song vectors, using something called approximate nearest neighbors, which is a super fast approximation of the nearest neighbors algorithm. Here's a GitHub repository showing that code. I encourage you to look at it; I'll link it in the description below. All in all, just make sure you're aware that content-based filtering and collaborative filtering in their base form are not very usable in a production environment.

What is the significance of a p-value? A p-value typically less than 0.05 indicates strong evidence against the null hypothesis, so you reject it. A p-value greater than 0.05 is weak evidence against the null hypothesis, so you accept the null hypothesis. Not quite. A p-value at 0.05 is considered to be marginal, meaning you can go either way. This
is an example of an answer that isn't really technically correct either. In the simplest terms, a p-value represents a probability: a probability of how ridiculous the null hypothesis seems. More precisely, it's the probability of seeing data at least as extreme as yours if the null hypothesis were true. Cassie Kozyrkov, who is Chief Decision Scientist at Google, has an amazing explanation of what a p-value is, and I highly recommend you watch her video. Adding to this, the 0.05 is a fairly arbitrary hard cutoff. You don't refuse to make a change just because the p-value is 0.051, for example. There may be situations where there's just no reason to keep the current system, and as long as the new system doesn't drastically harm anything, we would use it despite what the p-value might say, even if it isn't quite below 0.05.

How can outliers be treated? You can drop outliers only if they are garbage values, and they give an example. If the outliers have extreme values, they can be removed; for example, if all data points are clustered between 0 and 10 but one point lies at 100, then we can remove this point. If you cannot drop outliers, you can try a different model, try normalizing the data, or try using algorithms that are less affected by outliers. For this answer, I would actually err towards not dropping outliers by default. If there are outliers in the data, I would pick them out and look at them individually to see what makes them outliers: why do they exist, and how did they get their values? If we see that some outliers happened because of actual human error, then maybe we can toss them. But if they appeared organically, you're better off including them in your data so that your model can pick up on them. Again, it really depends on the data and the problem.

Can you calculate accuracy using a confusion matrix? I like how they gave that figure here, and yes, you can calculate simple accuracy: true positives plus true negatives, divided by all four counts, which is exactly what
they do. With a confusion matrix you can also calculate precision and recall pretty easily, and that's exactly what they do in the next question, so that's cool.

"People who bought this also bought..." recommendations seen on Amazon are a result of which algorithm? The textbook definition says collaborative filtering, but like I said, recommendation algorithms can be implemented very differently these days in order to compensate for scalability.

Write a basic SQL query that lists all orders with customer information. The way I would actually give a SQL question is, instead of one question after another with no relation between them, to give a scenario. Start with simple questions, maybe a very simple one like just getting the orders, and then build on top of that with more and more complex queries. This gives a good sense of how well a candidate can break down a problem and convert it into code, and an idea of how much SQL they really know.

You are given a data set on cancer detection. You have built a classification model and achieved an accuracy of 96%. Why shouldn't you be happy with this model's performance, and what can you do about it? Cancer detection results in imbalanced data. In an imbalanced data set, accuracy should not be used as a measure of performance. It is important to focus on the remaining 4%, which represents the patients who were wrongly diagnosed. Yeah, this is pretty true. Adding to this, even if we had a model that was basically "def model_response: return False" and just gave the same value for everything, we could get an accuracy of around 96%, because that is the majority case. That happens because of the class imbalance. So maybe you can provide a very small example like that instead of just a basic textbook definition, and also add some evaluation metrics that would be useful, maybe precision, recall, or something else that you think
would be useful depending on the problem at hand.

Which of the following machine learning algorithms can be used for imputing missing values of both categorical and continuous variables? For actually imputing values, I wouldn't use a specific machine learning algorithm, because machine learning models are very noisy. I would rather impute these values with the mean of that column, or the median, or anything else that makes sense. For a categorical variable, you might want to impute with an extra class label of "unknown". In any case, I would probably use scikit-learn's SimpleImputer, like I mentioned before, rather than putting the burden of imputation on another model, which could be another source of error that you would need to deal with.

Below are eight actual values of the target variable in the train file. What is the entropy of the target variable? It's a pretty simple entropy calculation, but I would also want to make sure that you understand what entropy is, what information gain is, and how they relate to each other from a big-picture standpoint, because in many cases interviewers won't just give you a typical exam question like this.

We want to predict the probability of death from heart disease based on three risk factors: age, gender, and blood cholesterol level. What is the most appropriate algorithm for this case? The answer seems to be logistic regression. If we wanted to predict a probability of death, something like logistic regression, which actually outputs a "probability" value, quote unquote, would be useful. It's also important to note that logistic regression doesn't necessarily produce well-calibrated probability values, especially if the data set is imbalanced. I have an entire video on model calibration that attests to this.

After studying the behavior of a population, you have identified four specific individual types that are valuable to your study. You would like to find all the
users who are most similar to each individual type. Which algorithm is the most appropriate for the study? They say the answer here is k-means clustering, but it might be even better to model this as a supervised learning problem. Since we know that there are four types of users, we can create a four-way classifier and set a very high probability threshold, say 0.8 or 0.9, for a given class. Only if the predicted probability is above this threshold would we categorize a user as that particular class. There are other ways you could do this too. Maybe you create four different binary classifiers, one for each class, each with a very high threshold. In general, k-means clustering is not the only way to look at this problem. We can formulate a problem in multiple ways.

Your organization has a website where visitors randomly receive one of two coupons. It is also possible that visitors to the website will not receive a coupon. You have been asked to determine if offering a coupon to website visitors has an impact on purchase decisions. Which analysis method should you use? They say one-way ANOVA, but that is a very broad answer to give, and there are so many concepts from A/B testing that you would need to illustrate here. Whether you're using hypothesis testing or Bayesian testing, you first want to define your problem. You want to make it clear what your KPIs are; in this case, it's purchase conversion. You also want to state what other considerations you would make before conducting the test, such as sample size, test duration, or anything else. During this entire process, do ask your interviewer questions and go back and forth. Ask about any little detail that you think is not very clear, because it's super important for them to know, and they will also think that you are asking the right questions, as
you should be.

Well, that's all I'm going to look at for this episode. Everything else here is just on basic concepts, so I'm going to leave the video here. I hope you all liked this kind of video; I'm planning on making this a recurring series, so comment down below if you'd really like that. If you liked the video, please give it a good old thumbs up, subscribe for more content, and join our Discord server for more amazing chats with yours truly and so many others like you. We're trying to grow a community here, so I appreciate your support. Thank you all so much for watching, and I'll see you soon. Bye bye!
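Coming back to the cancer detection question, here is the kind of very small example I had in mind, as a rough sketch in plain Python. The numbers are assumptions chosen to match the 96% figure from the question: 100 hypothetical patients, of which 4 actually have cancer. The `model_response` function is the "always return False" model, and the accuracy, precision, and recall are computed directly from the confusion matrix counts.

```python
# Hypothetical data: 96 healthy patients (False) and 4 patients with cancer (True).
y_true = [False] * 96 + [True] * 4


def model_response(patient_id):
    """A 'model' that ignores its input and always predicts the majority class."""
    return False


y_pred = [model_response(patient_id) for patient_id in range(len(y_true))]

# Confusion matrix counts.
pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t and p)          # cancer, predicted cancer
tn = sum(1 for t, p in pairs if not t and not p)  # healthy, predicted healthy
fp = sum(1 for t, p in pairs if not t and p)      # healthy, predicted cancer
fn = sum(1 for t, p in pairs if t and not p)      # cancer, predicted healthy

accuracy = (tp + tn) / (tp + tn + fp + fn)
# Convention assumed here: report 0.0 when the denominator is zero.
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

The accuracy comes out at 96% even though the model never finds a single cancer patient, which is what the recall of zero tells you. That's the point I'd make in the interview: on imbalanced data, pair accuracy with recall, precision, or whatever metric fits the cost of the mistakes in your problem.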