This article was originally published on V7. You’ve found and applied for your dream job. They called you back with an invitation for an interview. Hell yeah! – you think. Finally, a chance to work on a data science project you care about. But after a few minutes of excitement, reality hits hard: now you absolutely have to nail your job interview. How do you prepare to answer a gazillion technical questions? Don’t worry! It’s not as hard as it seems. This article will walk you through a list of the most popular data science interview questions and answers to help you ace your interview and land the job of your dreams.
- Basic data science concepts questions
- Technical knowledge questions
- Statistics questions
- Cultural fit questions
💡 V7 is currently hiring for a few technical positions. Check out V7 Careers to learn more about working for one of the fastest-growing ML startups. We are looking forward to your applications!
Data science interview questions on basic concepts
Let’s start with a few simple questions exploring the basic concepts of data science.
What is “Deep Learning”?
Deep learning is a subfield of machine learning and artificial intelligence that uses deep neural networks – algorithms roughly inspired by the human brain – to teach computers to imitate how humans think and learn.
Unlike traditional machine learning models, deep learning models store millions, or even billions, of parameters, which makes them harder to interpret but very powerful at understanding data. A deep learning model is like a “hologram” of information, storing interconnected weights that represent fragments of what it learned. Learning can be supervised, semi-supervised, or unsupervised, and the adjective “deep” refers to the use of multiple layers in the network.
Can you name a few Deep Learning Frameworks?
The three frameworks most commonly used in cutting-edge research are:
Other less commonly used deep learning frameworks include:
- Swift for TensorFlow
To impress your interviewer, make sure that you stay up-to-date with new emerging deep learning frameworks even if they aren’t widely used at the moment.
💡 Pro tip: Ask your employer what framework they’re using as early as you can in the interview process so that you can read up on it in advance. At V7, we use PyTorch and generally prefer speaking to people who use it.
Can you explain the fundamentals of neural networks?
Artificial Neural Networks (ANN) are computational networks that loosely model the neurons in the human brain. Their goal is to recognize the relationships between various data points in a given dataset and to learn (often autonomously) to perform specific tasks by analyzing relevant examples. Neural networks are organized into several layers, including an input layer, an output layer, and hidden layers. Within these layers are “weights” that carry information about the training set, such as the features found within images or audio waveforms.
During training, a neural network uses gradient descent to minimize a loss function (hint: make sure you’re very comfortable with these definitions!). At each step, backpropagation computes the gradient of the loss with respect to every weight, and the weights are nudged in the direction that reduces the loss. The simplest neural network model is called the perceptron, a supervised learning algorithm invented in 1957. It consists of only two layers: an input layer and an output layer. Today’s most advanced transformer networks can contain hundreds of layers and billions of parameters, and can be trained on billions of training data items.
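To make the perceptron concrete, here is a minimal sketch in plain Python. The AND-gate data, the learning rate, and the epoch count are illustrative assumptions, not something prescribed by the original algorithm description.

```python
# Minimal perceptron sketch: an input layer wired directly to one output unit.
# The AND-gate data, learning rate, and epoch count are illustrative assumptions.

def predict(weights, bias, inputs):
    """Step activation: output 1 if the weighted sum is positive, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0

def train_perceptron(samples, learning_rate=0.1, epochs=20):
    """Perceptron learning rule: shift weights by (target - prediction) * input."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias

# Logical AND is linearly separable, so a perceptron can learn it.
and_gate = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
weights, bias = train_perceptron(and_gate)
predictions = [predict(weights, bias, inputs) for inputs, _ in and_gate]
```

A perceptron can only separate classes with a straight line, which is one reason deeper networks with hidden layers are needed for harder problems.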
Please explain the role of data cleaning in data analysis.
Data cleaning is a key part of data preparation. It aims to identify incomplete, incorrect, or irrelevant parts of the training data and then replace or delete them. For example, outliers, mislabeled data, or biased samples can reduce the accuracy of a resulting model. Data cleaning ensures better quality training data, and therefore, a more accurate model.
How should you maintain a deployed model?
To maintain a deployed model, you have to do the following.
Monitor the model proactively: The first step in ensuring your models’ accuracy is to monitor them after deployment. Any time you change a model, or replace it with a new version in the case of deep learning, check it against a common test set to catch deterioration or catastrophic forgetting. Often your corpus of training data will grow, as your models need to tackle more varied examples. Ensure that your test sets also grow to represent the increased diversity of your data in production!
Measure the model’s accuracy: Testing for accuracy doesn’t end at the training phase. The model’s average precision, recall, or accuracy per class may vary over time, either because the model is retrained or, more often, because your data changes as the world moves on. Ensure you have a way of continuously sampling production data, labeling it as ground truth, and testing the latest version of the model against it for the most up-to-date accuracy. A model trained in 2019 to predict the stock market would do terribly in 2021, for example.
Uptime is everything: If your model is running on multiple GPUs, you’ll need a robust inference engine to keep it running at unpredictable scale. If DevOps is not your strong suit, look for third-party tools to keep models running, or make sure your employer has the talent in place to do so. If you’re working on computer vision, for example, you can use V7’s inference engine to keep models running on up to 100 GPU servers.
Can you explain the Decision Tree algorithm?
The Decision Tree is a supervised machine learning algorithm used for classification and regression problems. We use a Decision Tree to train a model that predicts the value of a target variable. The algorithm represents the problem as a tree in which each internal node tests an attribute and each leaf node corresponds to a class label (or, for regression, a predicted value). We can distinguish two types of Decision Tree, based on the type of target variable:
- Categorical Variable Decision Tree: The Decision Tree has a categorical target variable.
- Continuous Variable Decision Tree: The Decision Tree has a continuous target variable.
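How does a tree decide which attribute test to put in an internal node? One common approach, assumed here for illustration, is to pick the split with the lowest Gini impurity. The sketch below handles a single numeric feature; the function names and the toy data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure node, higher for mixed class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Try each midpoint between sorted feature values and return the
    threshold with the lowest weighted Gini impurity."""
    pairs = sorted(zip(values, labels))
    best_threshold, best_score = None, float("inf")
    for (left_value, _), (right_value, _) in zip(pairs, pairs[1:]):
        if left_value == right_value:
            continue
        threshold = (left_value + right_value) / 2
        left = [label for value, label in pairs if value <= threshold]
        right = [label for value, label in pairs if value > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

# Hypothetical toy data: hours studied -> exam outcome.
hours = [1, 2, 3, 6, 7, 8]
outcome = ["fail", "fail", "fail", "pass", "pass", "pass"]
threshold, impurity = best_split(hours, outcome)
```

A full tree repeats this search recursively on each side of the chosen threshold until the leaves are pure enough or a depth limit is reached.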
Can you compare the validation set with the test set?
A validation set is used for model tuning during training. It’s the set of data that the model “checks against” to see how well it’s doing during the training process. It should look similar to the training set but be varied enough to show whether the model can identify brand new examples. A test set is a further set of data that the fully trained model is evaluated on once training is complete. It consists entirely of data the model has never seen before and is used to check for generalization.
💡 Pro tip: Looking for high quality data for your projects? Check out: 65+ Best Free Datasets for Machine Learning.
Think of it this way: The training set is your studying material, the validation set is the set of question/answer pairs you check against while studying to make sure you’re remembering everything, and the test set is your final exam.
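In practice, the three sets are usually carved out of one shuffled dataset. Here is a small sketch; the 70/15/15 proportions and the helper name `split_dataset` are assumptions for illustration, not a fixed rule.

```python
import random

def split_dataset(items, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle once, then carve out disjoint validation and test sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# 100 hypothetical example IDs split into training, validation, and test sets.
train, val, test = split_dataset(range(100))
```

The important property is that the three sets never overlap, so the test set stays truly unseen until the final evaluation.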
What are tensors?
In data science, tensors are a type of data structure used in linear algebra to describe a multilinear relationship between sets of algebraic objects within a vector space. Tensors generalize scalars, vectors, and matrices to higher dimensions. Start with a scalar (a single point), add a dimension and you get a vector (a line of values), add another and you get a matrix (a grid of values), then stack matrices together and you get a 3D tensor.
A single-dimensional tensor can be represented as a vector, while a two-dimensional tensor is represented as a matrix. Colour images are technically 3D tensors, containing a grid of R, G, and B values (and sometimes a fourth alpha channel).
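The progression from scalar to 3D tensor can be sketched with nested Python lists. The `shape` helper below is a hypothetical illustration that assumes non-empty, rectangular nesting; real frameworks provide their own tensor types.

```python
def shape(tensor):
    """Return the dimensions of a nested-list tensor, e.g. (rows, cols)."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        tensor = tensor[0]
    return tuple(dims)

scalar = 7                                  # 0D: a single value
vector = [1, 2, 3]                          # 1D: shape (3,)
matrix = [[1, 2, 3], [4, 5, 6]]             # 2D: shape (2, 3)
# 3D: a tiny 2x2 colour image, each pixel holding (R, G, B) values.
image = [[[255, 0, 0], [0, 255, 0]],
         [[0, 0, 255], [255, 255, 255]]]
```

Here `shape(image)` is `(2, 2, 3)`: two rows, two columns, and three colour channels per pixel.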
Data science interview questions on technical concepts
Now, let’s have a look at some of the technical questions you might get when interviewing for a data scientist position.
What are the differences between supervised and unsupervised machine learning?
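In short, supervised learning trains on labeled data, where each input comes with a known output, and is used for tasks such as classification and regression. Unsupervised learning works on unlabeled data and looks for structure in it on its own, as in clustering and dimensionality reduction.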
How does logistic regression work?
Logistic regression (also called the logit model) is a statistical method used to analyze a dataset in which one or more independent variables determine an outcome. It models the relationship between the dependent variable and the independent variable(s) using the logistic (sigmoid) function, which outputs a probability. It’s used to model a binary outcome: a variable that can have only two possible values, 0 and 1.
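Here is a minimal sketch of the idea with a single independent variable, fitted by plain gradient descent on the log loss. The toy data, learning rate, and number of epochs are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    """Logistic function: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(sigmoid(w * x + b) - y for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Hypothetical data: hours studied (x) and whether the exam was passed (y).
hours = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 5.0]
passed = [0, 0, 0, 0, 1, 1, 1, 1]
w, b = fit_logistic(hours, passed)
p_low = sigmoid(w * 1.0 + b)    # predicted probability of passing after 1 hour
p_high = sigmoid(w * 4.0 + b)   # predicted probability of passing after 4 hours
```

To turn the probability into a 0/1 prediction, you compare it with a threshold, commonly 0.5.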
What is the bias-variance tradeoff?
Bias is the error that occurs when the results produced by an algorithm are systematically skewed, either because the model makes overly simple assumptions or because of faulty data points. High bias is a common cause of underfitting. We distinguish between low bias and high bias in traditional machine learning algorithms.
- Low bias machine learning algorithms include: Decision Trees, k-NN, SVM
- High bias machine learning algorithms include: Linear Regression and Logistic Regression
In deep learning, bias can be troublesome as it can be hard to spot. For example, training a stock market predictor mostly on technology stocks will bias the model towards thinking most companies behave like tech companies. To avoid this, it is paramount that you spend more time on your training data than you do on modeling, and create bias-busting test sets to prevent damaging results.
Variance is the error that comes from the model’s sensitivity to small fluctuations in the training data. The model picks up even the most minor details about the relationship between features and target. It also learns the noise in the training data set and, as a result, performs poorly on the test data set. High variance leads to overfitting.
To achieve good prediction performance, you need both low bias and low variance. In simple words, the bias-variance tradeoff is the balance between the bias error and the variance error.
Here’s what it means: if your model is too simple (it has only a few parameters), it’s characterized by high bias and low variance, leading to underfitting. On the other hand, if your model is too complex, it has high variance and low bias, and you will be dealing with overfitting. Essentially, the bias-variance tradeoff is about finding the right balance without overfitting or underfitting the data.
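If the interviewer asks for the math, the tradeoff is usually written as a decomposition of the expected squared error. The notation below is an assumption introduced for illustration: the data follows y = f(x) + ε with zero-mean noise of variance σ², and f̂ is the model learned from a random training set.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Making a model more complex tends to shrink the first term and grow the second, while the noise term stays put no matter which model you choose.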
Can you explain the concepts of overfitting and underfitting?
Overfitting is a modeling error where the machine learning model learns “too much” from the training data, paying attention to data points that are noisy or irrelevant. Overfitting negatively impacts the model’s ability to generalize. Underfitting is a scenario where a statistical or machine learning model cannot accurately capture the relationships between the input and output variables. Underfitting occurs when the model is too simple, typically as a result of too little training time, too few features, or too much regularization.
What are exploding gradients?
Exploding gradients occur when large error gradients accumulate, resulting in very large updates to the neural network’s weights during training. This, in turn, leads to an unstable network. The weight values can also become so large that they overflow and turn into NaN (not-a-number) values.
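A toy illustration of why this happens: a backpropagated gradient is roughly a product of per-layer factors, so factors larger than 1 compound exponentially with depth. The sketch below deliberately simplifies each layer to a single constant factor, which is an assumption, not how a real network behaves.

```python
import math

def chained_gradient(per_layer_factor, num_layers):
    """Backpropagated gradient through a chain of layers, modelled here as a
    product of identical per-layer factors (a deliberate simplification)."""
    gradient = 1.0
    for _ in range(num_layers):
        gradient *= per_layer_factor
    return gradient

stable = chained_gradient(1.0, 100)       # stays at 1.0
exploding = chained_gradient(3.0, 100)    # roughly 5e47
overflowed = chained_gradient(1e10, 100)  # exceeds the float range, becomes inf
weight_update = overflowed - overflowed   # inf - inf produces NaN
```

Once a single NaN appears in the weights, it spreads through every later computation, which is why training runs that explode rarely recover on their own.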
What is dimensionality reduction and what are its benefits?
Dimensionality reduction is the process of reducing the number of input variables in a dataset. As the number of features increases, your model becomes more complex, making a predictive modeling task more challenging.
Some of the advantages of dimensionality reduction include:
- Higher model accuracy thanks to clean data.
- Faster model training – fewer dimensions mean shorter computing time.
- Reduced storage space needs.
- Fewer dimensions allow usage of algorithms unfit for a large number of dimensions.
- It helps remove redundant features.
How do Backpropagation and Feed-forward work?
Backpropagation stands for “backward propagation of errors.” It refers to the algorithm used for training feedforward neural networks by repeatedly adjusting the network’s weights and biases to minimize the cost function, that is, the difference between the actual output vector of the net and the desired output vector. The gradients of the cost function with respect to each weight and bias determine how much that parameter is adjusted. Feedforward propagation occurs when the input data is fed in the forward direction through the network. Each hidden layer receives the input data, processes it (using an activation function), and passes it on to the next layer.
In the feedforward propagation, the Activation Function is a mathematical “gate” in between the input feeding the current neuron and its output going to the next layer.
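Both passes can be sketched on the smallest possible network: one input, one hidden neuron, and one output neuron. The sigmoid activation, squared-error loss, starting weights, and learning rate below are all assumptions chosen to keep the example short.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, b1, w2, b2):
    """Feedforward pass: input -> hidden neuron -> output neuron."""
    hidden = sigmoid(w1 * x + b1)
    output = sigmoid(w2 * hidden + b2)
    return hidden, output

def backward(x, target, w1, b1, w2, b2):
    """Backpropagation: apply the chain rule from the loss back to each
    parameter. The loss is squared error, 0.5 * (output - target) ** 2."""
    hidden, output = forward(x, w1, b1, w2, b2)
    d_output = (output - target) * output * (1 - output)  # gradient at output pre-activation
    grad_w2 = d_output * hidden
    grad_b2 = d_output
    d_hidden = d_output * w2 * hidden * (1 - hidden)      # gradient at hidden pre-activation
    grad_w1 = d_hidden * x
    grad_b1 = d_hidden
    return grad_w1, grad_b1, grad_w2, grad_b2

def loss(x, target, params):
    _, output = forward(x, *params)
    return 0.5 * (output - target) ** 2

def train(x, target, params, steps=500, learning_rate=1.0):
    """Repeat forward and backward passes, moving each parameter against its gradient."""
    w1, b1, w2, b2 = params
    for _ in range(steps):
        g_w1, g_b1, g_w2, g_b2 = backward(x, target, w1, b1, w2, b2)
        w1 -= learning_rate * g_w1
        b1 -= learning_rate * g_b1
        w2 -= learning_rate * g_w2
        b2 -= learning_rate * g_b2
    return w1, b1, w2, b2

initial = (0.5, 0.0, -0.5, 0.0)
trained = train(x=1.0, target=0.9, params=initial)
```

Comparing `loss(1.0, 0.9, initial)` with `loss(1.0, 0.9, trained)` shows the error shrinking as the weights are adjusted. Deep learning frameworks compute these gradients automatically, but the chain rule underneath is the same.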
What are autoencoders?
An autoencoder is an unsupervised learning technique used to learn efficient data encodings. The goal of an autoencoder is to learn a lower-dimensional representation (encoding) for higher-dimensional data by training the network to capture the most important parts of the input image. Autoencoders consist of 3 parts: encoder, bottleneck, and decoder. They find application in dimensionality reduction, image denoising, and even generation of image and time series data.
💡 Pro tip: If you’d like to learn more about autoencoders, you can check out: An Introduction to Autoencoders: Everything You Need to Know.
What is pooling in CNNs?
Pooling in convolutional neural networks is a form of non-linear down-sampling. The pooling layer is one of the building blocks of a CNN, and it is used to reduce the dimensions of the feature maps. You are effectively squeezing patches of the image into compressed representations. There are three common forms of pooling: max, average, and sum.
During convolutions, you are pooling information from your original image. The size of your convolutional kernel can be tweaked for your data’s needs, and you can pad the border of the input, for example with 1 pixel of padding.
Another parameter of your convolutions is the stride: the number of pixel steps the kernel takes as it moves across the image.
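The three pooling variants differ only in how each patch is reduced to a single number. Below is a minimal sketch on a 4x4 feature map stored as nested lists; the function name `pool2d`, the 2x2 window, and the stride of 2 are assumptions for illustration, and no padding is applied.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Slide a size x size window over a 2D feature map and reduce each patch."""
    reducers = {
        "max": max,
        "average": lambda patch: sum(patch) / len(patch),
        "sum": sum,
    }
    reduce_patch = reducers[mode]
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for top in range(0, rows - size + 1, stride):
        pooled_row = []
        for left in range(0, cols - size + 1, stride):
            patch = [feature_map[top + i][left + j]
                     for i in range(size) for j in range(size)]
            pooled_row.append(reduce_patch(patch))
        pooled.append(pooled_row)
    return pooled

feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 4, 3, 8],
]
max_pooled = pool2d(feature_map, mode="max")          # [[6, 4], [7, 9]]
average_pooled = pool2d(feature_map, mode="average")  # [[3.75, 2.25], [3.5, 5.0]]
sum_pooled = pool2d(feature_map, mode="sum")          # [[15, 9], [14, 20]]
```

With a stride of 2 the 4x4 map shrinks to 2x2. Dropping the stride to 1 makes the windows overlap and yields a 3x3 output instead.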
What are the different layers in CNN?
In its simplest form, a CNN architecture consists of convolutional layers, pooling layers, and fully connected (FC) layers. In addition to these, a CNN’s architecture also includes two other crucial components: the dropout layer and the activation function.
What is RNN?
RNN stands for Recurrent Neural Network, a type of artificial neural network that works with sequential or time series data. Like other algorithms, RNNs use training data to learn. What makes them unique, however, is their internal state (memory), which they use to process sequences of inputs. In simple terms, RNNs use information from prior inputs to influence how the current input is processed and what output is produced. They are used for speech recognition, language translation, music composition, text summarization, video tagging, and more.
What is the significance of the p-value?
A p-value (probability value) is the probability of obtaining a result at least as extreme as the one observed, purely by random chance, when the null hypothesis is assumed to be true. The p-value is used in statistical significance tests. In simple terms, the p-value helps us answer the following question: if the null hypothesis were true, how surprising would our data be? The smaller the p-value, the stronger the evidence against the null hypothesis.
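A worked example helps here. Suppose a coin lands heads 60 times in 100 flips and the null hypothesis is that the coin is fair. The sketch below computes a one-sided p-value exactly from the binomial distribution; the scenario and the 0.05 threshold are illustrative assumptions.

```python
from math import comb

def binomial_p_value(successes, trials, null_probability=0.5):
    """One-sided p-value: probability of seeing at least `successes` successes
    in `trials` attempts if the null hypothesis (null_probability) is true."""
    return sum(
        comb(trials, k) * null_probability ** k * (1 - null_probability) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Hypothetical experiment: 60 heads in 100 flips. Is the coin fair?
p_value = binomial_p_value(60, 100)
significant = p_value < 0.05   # 0.05 is a conventional threshold, assumed here
```

The result is roughly 0.028, so a fair coin would produce 60 or more heads in fewer than 3 out of 100 such experiments, and at the 0.05 level you would reject the null hypothesis.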
Can you explain the SVM machine learning algorithm in detail?
Support Vector Machines (SVMs) are a set of supervised learning models used for solving regression and classification problems and for outlier detection. Support vectors are the data points from each class that lie closest to the boundary between the classes. A hyperplane is the decision boundary that linearly separates and classifies a set of data: a line in 2D, a plane in 3D. The distance between the hyperplane and the nearest data point from either class is known as the margin. An SVM’s goal is to choose the hyperplane with the greatest possible margin to any point within the training set, giving new data a greater chance of being classified correctly.
What are support vectors in SVM?
Support vectors are data points that lie closest to the decision surface or the hyperplane. They influence the position and orientation of the decision surface or the hyperplane and are most difficult to classify. By using the support vectors, you can maximize the margin of the classifier, and by removing them, you will change the position of the decision surface/hyperplane.
What are the different kernel functions in SVM?
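A kernel function takes data as input and transforms it into the form required for processing. Kernel functions provide shortcuts that avoid complex calculations: they return the similarity between two points as if the points had been mapped to a higher-dimensional space, without computing that mapping explicitly. Some of the most popular kernel functions in SVM include: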
- Linear Kernel
- Polynomial Kernel
- Gaussian Radial Basis Function (RBF)
- Sigmoid Kernel
- Gaussian Kernel
- Bessel function Kernel
- ANOVA kernel
- Hyperbolic Tangent Kernel
- Laplace RBF Kernel
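A few of the kernels listed above are simple enough to write out by hand. The sketch below covers the linear, polynomial, Gaussian RBF, and sigmoid (hyperbolic tangent) kernels; the parameter names `gamma`, `degree`, and `coef0` and their default values are illustrative assumptions.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def linear_kernel(a, b):
    return dot(a, b)

def polynomial_kernel(a, b, degree=2, coef0=1.0):
    return (dot(a, b) + coef0) ** degree

def rbf_kernel(a, b, gamma=0.5):
    """Gaussian RBF: similarity decays with squared Euclidean distance."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)

def sigmoid_kernel(a, b, gamma=0.1, coef0=0.0):
    """Sigmoid kernel, based on the hyperbolic tangent."""
    return math.tanh(gamma * dot(a, b) + coef0)

u, v = [1.0, 2.0], [3.0, 4.0]
similarities = {
    "linear": linear_kernel(u, v),          # 11.0
    "polynomial": polynomial_kernel(u, v),  # (11 + 1) ** 2 = 144.0
    "rbf": rbf_kernel(u, v),                # exp(-4), about 0.018
    "sigmoid": sigmoid_kernel(u, v),        # tanh(1.1), about 0.80
}
```

Each function takes two points and returns a single similarity score, which is all an SVM needs from a kernel.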
What are the types of biases that can occur during sampling?
There are three common types of biases you might encounter during sampling:
- Selection bias
- Undercoverage bias
- Survivorship bias
Within each there can be many more specific types:
- Instrument bias: The device used to collect the data applied specific features to it (e.g. compression, color tint, audio quality), which the model picks up and overfits to.
- Population bias: The populations to which the selection was applied are not representative of the total distribution of these objects.
- Background bias: Background elements, such as the background color of an image or background noises in audio, influence the model’s classification.
- Parametric bias: The hyperparameters used during training make the model classify some objects better or differently than others.
- Out-of-distribution bias: When given examples that are very different from the training set, the model defaults towards a specific set of classes that have nothing to do with the query.
What is ensemble learning?
Ensemble learning refers to the method of combining a diverse set of learners to improve the stability and predictive power of the model. Two types of ensemble learning methods are:
- Bagging: The bagging method trains similar learners on small samples of the population and combines their predictions, which helps make the predictions more stable.
- Boosting: The boosting method trains learners sequentially, with each new learner concentrating on the examples the previous ones got wrong.
What is Random Forest and how does it work?
Random Forest is a machine learning method used for regression and classification tasks. It consists of a large number of individual decision trees that operate as an ensemble and form a powerful model. The adjective “random” relates to the fact that the model uses two key concepts:
- A random sampling of training data points when building trees
- Random subsets of features considered when splitting nodes
In the case of classification, the output of the random forest is the class selected by most trees, and in the case of regression, it takes the average of outputs of individual trees.
Random Forest is also used for dimensionality reduction and treating missing values and outlier values.
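The two sources of randomness and the final aggregation step can be sketched without building full trees. All function names and the toy rows below are hypothetical; a real implementation would train a decision tree on each bootstrap sample.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Sample rows with replacement, same size as the original data."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(num_features, subset_size, rng):
    """Pick the feature indices a split is allowed to consider."""
    return sorted(rng.sample(range(num_features), subset_size))

def majority_vote(tree_predictions):
    """Classification: the class chosen by most trees wins."""
    return Counter(tree_predictions).most_common(1)[0][0]

def average_prediction(tree_predictions):
    """Regression: average the outputs of the individual trees."""
    return sum(tree_predictions) / len(tree_predictions)

rng = random.Random(42)
rows = [(5.1, 3.5, "a"), (4.9, 3.0, "a"), (6.2, 2.9, "b"), (5.9, 3.0, "b")]
sample = bootstrap_sample(rows, rng)
features = random_feature_subset(num_features=4, subset_size=2, rng=rng)
label = majority_vote(["cat", "dog", "cat"])     # "cat"
value = average_prediction([2.0, 3.0, 4.0])      # 3.0
```

Because every tree sees a slightly different sample and a different set of candidate features, their individual errors are less correlated, and the combined vote is more stable than any single tree.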
What is a confusion matrix?
The confusion matrix (also known as an error matrix) is a table that describes the performance of a classification model (or classifier) on a set of test data for which the true values are known. The matrix compares the actual target values with the values predicted by the machine learning model. For binary classification, it is a 2x2 table with two possible predicted classes and four outcomes:
- True positive (TP): Correct positive prediction
- False positive (FP): Incorrect positive prediction
- True negative (TN): Correct negative prediction
- False negative (FN): Incorrect negative prediction
Various measures that can be derived from a confusion matrix include:
- Accuracy: (TP+TN)/total
- Precision: TP/predicted yes
- Error rate: (FP+FN)/total
- Specificity: TN/actual no
- Sensitivity: TP/actual yes
- False positive rate: FP/actual no
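These measures are easy to compute by hand, which interviewers sometimes ask for. Here is a small sketch on made-up labels; the helper names and the toy data are assumptions, and the code assumes every denominator is non-zero.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, TN, FN for a binary classifier."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(actual, predicted):
        if guess == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def metrics(tp, fp, tn, fn):
    """Derive the standard measures from the four confusion-matrix cells."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "precision": tp / (tp + fp),            # TP / predicted yes
        "sensitivity": tp / (tp + fn),          # TP / actual yes, also called recall
        "specificity": tn / (tn + fp),          # TN / actual no
        "false_positive_rate": fp / (fp + tn),  # FP / actual no
    }

actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
tp, fp, tn, fn = confusion_counts(actual, predicted)
scores = metrics(tp, fp, tn, fn)
```

For this toy data the counts are TP=3, FP=1, TN=5, FN=1, giving an accuracy of 0.8 and a precision and sensitivity of 0.75 each. Note that specificity and the false positive rate always add up to 1.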
Can you name three disadvantages of using a linear model?
A few of the biggest disadvantages of the linear model include:
- The linear model is limited to linear relationships: It only considers linear relationships between dependent and independent variables, assuming there is a straight-line relationship between input and output variables. This assumption is often incorrect, as the relationship between values is sometimes curved (e.g. age plotted against income).
- The linear model is sensitive to outliers: Poor quality data causes the linear model to underperform. With a large number of outliers, the model will be skewed away from the actual underlying relationship.
- The linear model assumes that the data is independent: In cases of clustering in space and time, this assumption is incorrect.
What are recommender systems?
A recommender system is a subclass of information filtering systems. It’s a machine learning system that aims to predict users’ interests and recommend items they are likely to be interested in. The data used in building recommender systems is derived from users’ ratings and preferences. These systems operate using either a single input (a song, a video, etc.) or multiple inputs within the platform. Companies like Netflix, Spotify, and YouTube are among the best-known heavy users of recommender systems.
Data science interview questions on statistics
What is sampling? Can you name a few sampling methods?
Sampling refers to the selection of individual elements or a subset/group that you will collect data from in your research. A few of the most common sampling methods include:
- Simple Random Sampling
- Systematic Sampling
- Stratified Sampling
- Cluster Sampling
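The four methods above can each be sketched in a few lines. The function names, the population of 20 numbers, and the odd/even strata are all hypothetical choices for illustration.

```python
import random

def simple_random_sample(population, k, rng):
    """Every element has the same chance of being picked."""
    return rng.sample(population, k)

def systematic_sample(population, step, start=0):
    """Pick every `step`-th element, beginning at index `start`."""
    return population[start::step]

def stratified_sample(population, key, per_group, rng):
    """Split into groups (strata) by `key`, then sample randomly within each."""
    strata = {}
    for item in population:
        strata.setdefault(key(item), []).append(item)
    return [item for group in strata.values() for item in rng.sample(group, per_group)]

def cluster_sample(clusters, num_clusters, rng):
    """Pick whole clusters at random and keep every element inside them."""
    chosen = rng.sample(clusters, num_clusters)
    return [item for cluster in chosen for item in cluster]

rng = random.Random(0)
population = list(range(20))
simple = simple_random_sample(population, 5, rng)
systematic = systematic_sample(population, step=4)             # [0, 4, 8, 12, 16]
stratified = stratified_sample(population, lambda n: n % 2, 3, rng)
clustered = cluster_sample([[0, 1], [2, 3], [4, 5], [6, 7]], 2, rng)
```

Stratified sampling guarantees that every group is represented, while cluster sampling trades some of that balance for convenience, since you only need to reach a few whole groups.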
What is the difference between type I and type II errors?
A type I error (false-positive) occurs when a researcher rejects a null hypothesis that is actually true in the population; a type II error (false-negative) occurs when the researcher fails to reject a null hypothesis that is actually false in the population.
What are the assumptions required for linear regression?
There are four assumptions associated with a linear regression model:
- Linearity: The relationship between X and the mean of Y is linear.
- Homoscedasticity: The variance of the residuals is the same for any value of X.
- Independence: Observations are independent of each other, so there is no correlation between consecutive residuals in time series data.
- Normality: For any fixed value of X, the model’s residuals are normally distributed.
What is selection bias?
Selection bias is an error that occurs when the participants or elements in a study are not selected at random. This lack of randomization in the sample collection distorts the statistical analysis. Selection bias is sometimes also referred to as the selection effect. When selection bias isn’t taken into account, the study results might be inaccurate.
Data science interview questions on cultural fit
- How did you become interested in data science?
- Looking back at your early career or studies, what common mistakes did you make that you have now improved upon?
- Can you name any data scientists or researchers that you admire?
- What branch of data science are you most excited about for its outlook in the next 5 years?
- What is the latest data science book/paper you read? What did you learn from it?
- What datasets have you enjoyed using the most so far? Can you describe their nuances in detail?
- What is the largest corpus of data that you’ve worked with so far? Describe the challenges you found in modeling it.
💡 V7 currently has a few open positions that you can check out here: V7 Careers.
How to land a data scientist job: Summary
As the famous saying goes: “If you fail to plan, you plan to fail.” Here at V7, we know that attending job interviews can be a stressful experience for some candidates. But knowing what to expect and preparing yourself in advance is a surefire way to nail your next interview. Make sure to do your research and take time to explore more in-depth questions on technical knowledge that you might encounter. Finally, one last piece of advice from the V7 team: stick to the job’s responsibilities in your answers. If it’s a deep learning job, don’t spend time talking about traditional machine learning. If you’ll be working with structured data, don’t spend time talking about 100-layer CNNs!