Data Science Interview Questions | Data Science Jobs | Data Science Career Opportunities

Hey guys, welcome to this session by Intellipaat. Data is today’s oil, and data scientists are the rockstars of this era. If you want to be a data scientist, you obviously have to attend a data science interview, and the interviewer could ask a wide range of questions related to statistics, machine learning and complicated puzzles. So we’ve put together this session on data science interview questions so that you can ace a data science interview at any of the major
companies. Before we start off with the Q&A, let’s go through this interesting puzzle. Let’s say there are five lanes on a racetrack and 25 horses in total. You have to find the minimum number of races needed to determine the three fastest horses. So how would you do it?

So, what do you understand by linear regression? Well, linear regression is a supervised learning algorithm which helps us find the linear relationship between two variables. One is the predictor, or independent variable, and the other is the response, or dependent variable. We try to understand how the dependent variable changes with the independent variable.

Let’s say there’s a telecom company called Neo, and the data scientist at this company wants to understand if there’s a linear relationship between the monthly charges incurred by a customer and the tenure of that customer. So he collects all of the data and builds a linear model between the monthly charges and the tenure. Here, monthly charges would be the dependent variable and tenure would be the independent variable.

In linear regression, there can be more than one independent variable. If there’s just one independent variable, it is known as simple linear regression, and if there’s more than one, it is known as multiple linear regression. So guys, this is the underlying concept of linear regression: we have one dependent variable and one or more independent variables, and we try to understand the linear relationship between them.

Now we have our next question over here. So the question is, what do you understand
by logistic regression? Well, logistic regression is a classification algorithm which can be used when the dependent variable is binary. Let’s take this example. Here we are trying to determine whether it will rain or not on the basis of temperature and humidity. That is, temperature and humidity are the independent variables and rain is the dependent variable.

The logistic regression algorithm produces an S-shaped curve. Let’s say the x-axis here represents the number of runs scored by Virat Kohli and the y-axis represents the probability of Team India winning the match. Let’s say this point over here denotes 50 runs. What we can see from this graph is that if Virat Kohli scores more than 50 runs, the probability of Team India winning the match is greater than 50 percent. Similarly, if he scores less than 50 runs, the probability of Team India winning the match is less than 50 percent.

Let’s take this value here. Say the number of runs scored by Virat Kohli is around 60; then the probability of Team India winning the match would be, let’s say, around 65 percent or so. Again, take this value here, around 95 or 97 runs: if Virat Kohli scores that many, the probability of Team India winning the match is close to one, that is, nearly 100 percent. Similarly, take this value here, around 5 or 10 runs: then the probability of Team India winning the match is close to 0. So basically, in logistic regression the y value lies within the range 0 to 1. And this is how logistic regression works.

Now let’s head on to the next question. So what is the confusion matrix? A confusion matrix is a table which is used to estimate the performance of a model. It tabulates actual values and the predicted
values in a 2x2 matrix. So these are the actual values and these are the predicted values. What you see here are the true positives: these denote all of those records where the actual values were true and the predicted values were also true. After that we have the false negatives. False negatives denote all of those records where the actual value was true but the predicted value was false. Then we have the false positives, where the actual value is false but the predicted value is true. And finally we have the true negatives, where the actual values are false and the predicted values are also false. So the correct predictions are all of the true positives and the true negatives. And this is how the confusion matrix works.

Now let’s head on to the next question. So, what do you understand by true positive rate and false positive rate? Let’s start with true positive rate. In machine learning, the true positive rate
(also referred to as sensitivity or recall) is used to measure the percentage of actual positives which are correctly identified. The formula for true positive rate is true positives divided by all the actual positives, that is, TP / (TP + FN).

Now let’s look at the false positive rate. The false positive rate is the probability of falsely rejecting the null hypothesis for a particular test. It is calculated as the ratio of the number of negative events wrongly categorized as positive (all of the false positives) to the total number of actual negative events, that is, FP / (FP + TN). So this is how we can calculate true positive rate and false positive rate.

Now we have the next question, and we are supposed to explain what an ROC curve is. The ROC curve, which stands for Receiver Operating Characteristic, is basically a plot between the true positive
rate and the false positive rate, and it helps us find the right trade-off between the true positive rate and the false positive rate for different probability thresholds of the predicted values. On the plot, the true positive rate is on the y-axis and the false positive rate is on the x-axis. The closer the curve is to the upper left corner, the better the model is. In other words, whichever curve has the greater area under it would be the better model. Let’s say we have this curve over here, and there is another curve which goes like this, nearer to the upper left corner. In that case, since the second curve covers a greater area under it, that would be a better model than the first. So the ROC curve helps us find the area under the curve as well as the right trade-off between the true positive rate and the false
positive rate.

Now we’ll head on to the next question. So what do you understand by a decision tree? A decision tree is a supervised learning algorithm which is used for both classification and regression. So the dependent variable can be either a numerical value or a categorical value. It has a flowchart-like structure where the topmost node is known as the root node, the internal nodes with children are known as the branch nodes, and the final nodes without children are known as the leaf nodes. Each internal node denotes a test on an attribute, each edge represents an outcome of the test, and each leaf node holds a class label (or, for regression, a numeric value).

Let’s take this first node over here, where we test the age of the patient. Let’s say we check whether the age of the patient is greater than 50. If the condition is true, we’ll come here; if it is false, we’ll come here. After that, over here, we’ll check whether the person smokes or not. If the person smokes, we’ll come here; if not, we’ll come here. Similarly, over here the test condition could be whether the patient has any children or not. If the condition is true, we’ll come here; if it is false, we’ll come here. So this is how the decision tree works.

And finally we’ll have class labels over here. These represent the individual class labels. Let’s say this one represents that the person has cancer, and this one that the person does not have cancer. Similarly, this would represent that the person has cancer, and again this would represent that the person does not have cancer. So in a decision tree, we basically have a series of test conditions which give us the final class labels.
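
To make the walk from root node to leaf node concrete, here is a minimal sketch in Python of the tree described above. The tree is written by hand rather than learned from data, and the names `Node`, `Leaf`, `predict` and `toy_tree`, the record keys, and the choice of which leaf says "cancer" are all assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Leaf:
    label: str  # class label held by a leaf node


@dataclass
class Node:
    test: Callable[[dict], bool]   # test on an attribute of the record
    if_true: Union["Node", Leaf]   # edge followed when the test passes
    if_false: Union["Node", Leaf]  # edge followed when the test fails


def predict(tree, patient):
    """Walk from the root to a leaf, applying one test per internal node."""
    node = tree
    while isinstance(node, Node):
        node = node.if_true if node.test(patient) else node.if_false
    return node.label


# Hypothetical tree: the leaf labels are illustrative, not learned from data.
toy_tree = Node(
    test=lambda p: p["age"] > 50,          # root node: age test
    if_true=Node(
        test=lambda p: p["smokes"],        # branch node: smoking test
        if_true=Leaf("cancer"),
        if_false=Leaf("no cancer"),
    ),
    if_false=Node(
        test=lambda p: p["has_children"],  # branch node: children test
        if_true=Leaf("cancer"),
        if_false=Leaf("no cancer"),
    ),
)

print(predict(toy_tree, {"age": 62, "smokes": True, "has_children": False}))
```

A real decision tree algorithm would choose the tests and their order from training data; this sketch only shows how a finished tree turns a series of test conditions into a class label.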

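Looking back at the confusion matrix, true positive rate, false positive rate and ROC curve questions, the arithmetic can be sketched in a few lines of plain Python. The labels and scores below are made up for illustration, the function names are hypothetical, and the sketch assumes the data contains at least one actual positive and one actual negative (otherwise the divisions would fail).

```python
def confusion_counts(actual, predicted):
    """Return (TP, FN, FP, TN) for two equal-length lists of booleans."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a and p:
            tp += 1      # actual true, predicted true
        elif a and not p:
            fn += 1      # actual true, predicted false
        elif not a and p:
            fp += 1      # actual false, predicted true
        else:
            tn += 1      # actual false, predicted false
    return tp, fn, fp, tn


def rates(actual, predicted):
    """True positive rate TP/(TP+FN) and false positive rate FP/(FP+TN)."""
    tp, fn, fp, tn = confusion_counts(actual, predicted)
    return tp / (tp + fn), fp / (fp + tn)


def roc_points(actual, scores, thresholds):
    """One (FPR, TPR) point per threshold; positive when score >= threshold."""
    points = []
    for t in thresholds:
        predicted = [s >= t for s in scores]
        tpr, fpr = rates(actual, predicted)
        points.append((fpr, tpr))
    return points


# Made-up labels and predicted probabilities, for illustration only.
actual = [True, True, True, False, False, False, False, True]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.6]

predicted = [s >= 0.5 for s in scores]
print(confusion_counts(actual, predicted))  # (3, 1, 1, 3)
print(rates(actual, predicted))             # (0.75, 0.25)
print(roc_points(actual, scores, [0.0, 0.5, 1.1]))
```

Each threshold gives one point on the ROC curve; sweeping many thresholds and joining the points traces the full curve whose area is compared between models.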