Hello everyone, A.J. Hawthorne here, and thank you all so much for 20,000 subscribers! To commemorate, let's go through 100 data science interview questions.

What are some of the main steps for data wrangling and data cleaning before machine learning? Before anything else, it's important to take a step back, talk to your stakeholders, and understand what problem they want you to solve; sometimes they don't even know what you can do with the data, so it's your job to communicate that too. Once you have a goal in mind and you know exactly what your stakeholders want, you do some exploratory data analysis, EDA. This involves swimming through your data and picking apart every feature that might be indicative of the outcome you care about. You can throw up a bunch of plots, or you can fit a small linear model with a handful of independent variables you think would be useful plus your dependent variable, and see how they correlate with each other. Although it looks like you're training a model, this is still very much exploratory analysis that helps you understand the relationships between variables. Your model is really only going to be as good as your data, so you want to make sure you spend enough time on EDA.

How do you deal with unbalanced binary classification? This kind of depends on your problem. I would typically go for an under-sampling approach, but there are over-sampling approaches too (there's a small sketch of what I mean a bit further down, after the cross-validation answer).

What is the difference between a boxplot and a histogram? How they look. But also, I actually prefer using boxplots simply because you can represent a large amount of data in a small space without overlapping, unlike histograms, where you could represent multiple distributions by overlapping them, but it becomes pretty crowded. For example, if you're asked to plot the total number of orders every person has made in every single month, you can have 12 distributions, each one a boxplot for a month, sitting right next to each other, as opposed to 12 histograms overlapping each other that are just difficult to read. But to each their own; they're both pretty good visual tools.

Describe the different types of regularization methods, such as L1 and L2. L1 regularization acts as feature selection, since it can shrink coefficients all the way to zero. L2 regularization also penalizes coefficients, but not to the extent of zeroing them out.

What is cross-validation? In modeling, and specifically supervised learning, you need to estimate the parameters of a model, which is done through training. However, there are other parameters that aren't learned during training and that you need to set yourself: these are hyperparameters. To figure out good values for them, you can use cross-validation techniques. I would typically do a grid search over hyperparameters: a grid search essentially tries out combinations of values within certain ranges, and it chooses the set of hyperparameters that gives you the best cross-validated score. Grid search takes time, but you can also do this manually and try to figure out which hyperparameters work best to minimize your loss.
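Here's a minimal sketch of the kind of grid search I mean, assuming scikit-learn and a random forest; the parameter grid and the toy data are just placeholders, not anything from a real project.

```python
# Minimal grid-search sketch (assumes scikit-learn; model, grid, and data are placeholders).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Candidate hyperparameter values to map out.
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, None],
}

# 5-fold cross-validation over every combination in the grid.
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)  # the combination with the best cross-validated score
print(search.best_score_)
```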
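And going back to the unbalanced classification question for a second, here's roughly what I mean by under-sampling. It's just a sketch assuming a pandas DataFrame with a binary label column; the column name and the helper function are made up for illustration.

```python
# Rough under-sampling sketch (assumes a pandas DataFrame with a binary "label" column; names are made up).
import pandas as pd

def undersample(df: pd.DataFrame, label_col: str = "label", random_state: int = 0) -> pd.DataFrame:
    counts = df[label_col].value_counts()
    minority = counts.idxmin()
    n_minority = counts.min()
    # Keep all minority-class rows, and sample the majority class down to the same size.
    parts = [
        df[df[label_col] == minority],
        df[df[label_col] != minority].sample(n=n_minority, random_state=random_state),
    ]
    # Concatenate and shuffle the balanced result.
    return pd.concat(parts).sample(frac=1, random_state=random_state)
```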
How do you define a set of metrics? A very business-oriented question. Of course, we can use metrics such as precision, recall, and accuracy, but those are generic metrics used for machine learning problems in general. As you get into a specific business, there are particular things you want to optimize. It could be a function of precision and recall, or something completely different. You pick your features and model the problem according to whatever you actually need to optimize.

Explain what precision and recall are. Precision: of the people you said had the virus, how many of them actually had the virus? Recall: of the total number of people who had the virus, how many of them did you say had the virus?

Explain what false positives and false negatives are, why the difference matters, and give examples of when false positives are more important than false negatives and vice versa. Somebody who doesn't have the virus, but your model says they do: that's a false positive. A false negative: a person has the virus, but your model says they don't. Which one matters more depends on the situation. In the virus example, if a person has the virus and is flagged as not having it, that can be extremely dangerous; in that situation, a false negative is much worse. In a court of law, with the positive class being guilty and the negative class being innocent, an innocent person being marked guilty (a false positive) is worse than a guilty person being flagged as innocent.

What is the difference between supervised and unsupervised learning? Give concrete examples. In supervised learning you have training examples with labels, and you train a model by giving it both the question and the answer: support vector machines, KNN, random forests. In unsupervised learning you only have examples with no labels, so the model has to learn from the interactions and patterns within the data itself. Clustering is the classic example of unsupervised learning.

Why would you use random forest versus SVM? Typically I would use a random forest classifier over an SVM in general, since random forests are more interpretable. When you're working on a business problem, you need your variables to be interpretable: there should be a traceable relationship between your input variables and your output label. Without that, you're building a model but you can't really assess the importance of one feature over another, and that kind of becomes pointless. I would also use extreme gradient boosting or the like, because it's just much faster than SVM; SVM with its kernelization can become pretty convoluted.

Why is dimensionality reduction important? I would typically use it to bring a large number of features down to about two or three dimensions so that you can visualize what's going on. However, it's important to note that the reduced components can lose their meaning, and interpreting them is not the same as looking at the original features yourself. But in a lot of cases it really does help with visualization.

Why is naive Bayes so bad? And how can you improve spam detection algorithms that use naive Bayes? Naive Bayes is considered naive because of its independence assumption. For a spam detector, a classifier using naive Bayes will basically treat every word in an email as independent of every other word, which we know is not how English works: English is grammar and tokens that interact with each other. So it oversimplifies the problem.
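To make the "every word is independent" point concrete, here's a minimal sketch of that kind of bag-of-words naive Bayes spam classifier, assuming scikit-learn and a made-up toy corpus.

```python
# Minimal naive Bayes spam classifier sketch (assumes scikit-learn; the toy corpus is made up).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = [
    "win a free prize now",         # spam
    "claim your free money",        # spam
    "meeting agenda for tomorrow",  # ham
    "lunch at noon?",               # ham
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

# Bag-of-words counts + naive Bayes: every word contributes independently to the spam score.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)

print(model.predict(["free prize meeting"]))  # each word is weighed on its own; word order is ignored
```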
How would you improve a spam detection algorithm that uses naive Bayes? They say you can decorrelate the features so that the independence assumption holds better, but I just wouldn't use naive Bayes for this.

What are the drawbacks of a linear model? I guess the biggest one is simplicity: you won't be able to capture the patterns in a very complex problem using something like linear or logistic regression. But I will say they are good for doing some initial EDA and trying to understand how certain features are correlated with each other.

Do you think that 50 decision trees are better than one large one? Why or why not? This depends. Say we have a decision tree and you train it on 100 samples, and we have one test sample that we feed in to get an output. Now suppose we keep redrawing the inputs: we keep the same test value, but use a different set of 100 training samples every time we train a decision tree. The first time you get one value for the test sample, the second time you get another value, and this value could be jumping around a lot; in other words, it has a very high variance. To decrease that variance, you train 50 decision trees, feed the test sample to all of them, take the average of the 50 outputs, and that's your final answer. This average has a much lower variance than the output of a single decision tree, so it can lead to better performance (there's a tiny simulation of this a bit further down, right before the random forest question). But on the flip side, suppose hypothetically that your single decision tree doesn't give you a high variance: even if you trained it on different sets of samples, its prediction for that test point would barely change. In that case there's really no need to use 50 predictors, because all 50 are just going to predict the same thing; whether you do it once or 50 times, it's just slower and you're getting the same answer anyway.

Why is the mean squared error a bad measure of model performance, and what do you suggest instead? The mean squared error is not very robust to outliers. For something like this you can use the absolute loss, which is far less sensitive to outliers. But there are some cases, say if you have two cohorts of data points, where neither the L1 nor the L2 loss is really going to work in your favor, and there you can use something called the pseudo-Huber loss, which combines the two: it behaves like the squared loss for small errors and like the absolute loss for large ones.

What is collinearity, and what do you do to deal with it? How do you remove multicollinearity? It's pretty common in machine learning that two of the predictors you're using are very, very highly correlated. The effect of this on larger, more flexible models is actually not too bad: you can have multicollinearity and your model will still give good results. However, it's good practice to minimize it as much as possible, so if you have two predictors that are basically doing the same thing, you can remove one of them. In most modeling techniques, using two very collinear features doesn't destroy your model unless it's something like linear regression, where it can really mess with the coefficient estimates. Removing multicollinearity is pretty interesting, because in a business setting your input features are more than likely tangible things: you know they have some meaning. So I would try to create different charts of what your data represents, maybe each of those features, look at their distributions, and see how they vary over time. If you see that two of them have very similar patterns throughout, you might as well omit one of them.
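As a rough sketch of that kind of check, assuming a pandas DataFrame of numeric features, you can scan the pairwise correlations and flag pairs that are almost perfectly correlated; the threshold and names here are arbitrary.

```python
# Rough multicollinearity check (assumes a pandas DataFrame of numeric features; threshold is arbitrary).
import pandas as pd

def highly_correlated_pairs(features: pd.DataFrame, threshold: float = 0.9):
    corr = features.corr().abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], corr.iloc[i, j]))
    return pairs  # candidate pairs where you might as well drop one of the two
```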
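And here's the tiny simulation I promised in the 50-decision-trees answer, since it leads nicely into the next question. It's a sketch assuming scikit-learn and a synthetic dataset: it compares how much a single tree's prediction for one fixed test point jumps around versus the average of 50 trees, each trained on its own fresh sample.

```python
# Variance sketch: one tree vs. the average of 50, each trained on a fresh 100-sample draw.
# Assumes scikit-learn; the synthetic data and repeat counts are arbitrary.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def make_data(n=100):
    X = rng.uniform(-3, 3, size=(n, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=n)
    return X, y

x_test = np.array([[1.0]])  # one fixed test point

single_preds, ensemble_preds = [], []
for _ in range(100):  # repeat the whole experiment to measure variance
    # A single tree trained on a fresh sample of 100 points.
    X, y = make_data()
    single_preds.append(DecisionTreeRegressor().fit(X, y).predict(x_test)[0])

    # 50 trees, each on its own fresh sample, with their outputs averaged.
    preds = []
    for _ in range(50):
        X, y = make_data()
        preds.append(DecisionTreeRegressor().fit(X, y).predict(x_test)[0])
    ensemble_preds.append(np.mean(preds))

print("single tree variance:   ", np.var(single_preds))
print("50-tree average variance:", np.var(ensemble_preds))  # much smaller
```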
What is a random forest, and why is it good? A random forest is a collection of decision trees that averages the outputs of multiple decision trees for its overall prediction. This is good because it reduces the variance of your output, like we talked about before.

What is a kernel, and can you explain the kernel trick? Well, I tend to shy away from this kind of question because it's not very representative of what you would actually be asked in an interview. In fact, I don't really use SVMs much at all at work; I tend to use gradient boosting variants because they're just a lot faster and the features are more interpretable, which is more important than anything.

What is overfitting? When your model starts memorizing the training data instead of generalizing.

What is boosting? Boosting is more of a concept and not necessarily tied to decision trees. The idea is that you combine a bunch of weak learners to make a strong learner. I have an entire video on this with a cool story, so I think you should watch it.

And that's it. I hope you enjoyed this little data science interview stint. I might do more of these in the future, hopefully with a better setting. Thank you all so much for subscribing and watching, and I will see you all in the next one.