
ChatGPT answers interview questions
I have been trying out ChatGPT for a few days, and I am impressed by what it can already do. Yes, the writing is a bit juvenile, and sometimes the code it spits out looks like it has been lifted from Stack Overflow. But the fact that it understands what you are asking and gives you something coherent and not totally off on a tangent speaks volumes about its ability and potential.
Recently, I went to an interview for a machine learning developer position at a Berlin medical imaging startup (name withheld to protect the innocent). As soon as I set foot on the premises, the HR lady gave me a 17-question multiple-choice test and 25 minutes to finish it. The test read like trivia, and I got the impression they were mostly trying to show me how much they know. In retrospect, I should have just asked these bozos where the door was …
Long story short, I thought it would be fun to see what ChatGPT would make of these questions. I have copy-pasted an excerpt of the questions below, followed by the answers ChatGPT chose and its justifications. I would be interested in what you think of these answers.
Questions:
Question 1:
You are given observations of planet locations assuming planar (2D) orbits (ellipses) around unknown origins (x,y,z) (3D). Each sample consists of 10 different observations. A colleague suggests finding a latent representation to capture the different orbits to compress the samples. How would you go about it? (bottleneck = latent space dimension size):
- Using a variational autoencoder (VAE) with a bottleneck of 4
- Using a variational autoencoder (VAE) with a bottleneck of 8
- Using an autoencoder (AE) with a bottleneck of 4
- Using an autoencoder (AE) with a bottleneck of 8
Question 2:
You want to train a model for an image segmentation task. Each pixel can be 1 of 3 classes. There is a known prior about the histogram (normalized) of pixel classes in the input samples. How can you leverage this prior:
- Using forward KL Divergence in the loss function “comparing” the prediction histogram (normalized) and prior
- Penalizing the entropy difference of prior and prediction of the sample batch
- Using IOU regularization comparing histogram intersection over union for the 3 classes
- Train 3 learners one for each class and take majority voting as final prediction
Question 3:
You are given some image data set with the objective of finding a semantic segmentation of the images using a CNN. In addition, you observe that the correct segmentation of a pixel is, on average, a function of “far” pixels. How would you best use this observation in your design?
- Use “big” convolution kernels with a deep network
- Make sure that the effective receptive field is sufficient
- Divide the image into small patches and train to solve for each
- Make sure that the objective takes into account global behavior
Question 4:
You observed that in your problem the ground truth is invariant to some transformation of your input data. What technique can help you leverage this?
- Boosting
- Dimensionality reduction
- Data Augmentation
- Symmetric architecture
Question 5:
You are trying to learn the best sequence of actions to win some game (single player). In the beginning you have no knowledge of the game objective, but you can observe the score you get when you complete the game. You can tweak the level of stochasticity in the actions you take (compared to the greedy choice). How would you set this knob?
- If the game is deterministic set to 0, otherwise decrease over time
- Start high and decrease as you gain more confidence in the predictions
- Start low and increase as you gain more confidence in the predictions
- Introduce it randomly to encourage exploitation
Question 6:
How would you not use an autoencoder?
- To reduce the dimensionality of data
- Use the encoder as a generative model to create more samples in latent space
- To predict the labels of input data using latent representations
- To generate more samples from an encoded latent vector
Question 7:
You want to model a Discrete Fourier Transform (DFT). Suggest a baseline architecture:
- ResNet
- Inception
- Fully Connected Layer
- Transformer based architecture
Question 8:
You want to model a Discrete Fourier Transform (DFT), a linear transformation. How would you approximate the sufficient number of layers for the model?
- One Layer is enough
- Add layers until the error over the training data is low enough
- Compute the VC dimension according to the problem and use the theoretical bound
- Hyperparameter optimization
AND NOW … The ChatGPT Responses:
Question 1: (1) or (2)
Using a variational autoencoder (VAE) with a bottleneck of 4 or 8 would be a good approach for finding a latent representation to capture the different orbits and compress the samples. The VAE is a type of autoencoder that is specifically designed for generating new data, which makes it well suited for finding a compressed representation of the input data that can be used to generate new samples. The bottleneck size (latent space dimension size) can be adjusted to control the level of compression, with a smaller bottleneck size resulting in more compression but potentially losing some of the original information.
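For concreteness, here is a rough sketch of what a VAE with a bottleneck of 4 could look like in PyTorch for this problem, assuming each sample is 10 observations of (x, y, z) flattened into a 30-dimensional vector; the hidden size is just a placeholder:

import torch
import torch.nn as nn

class OrbitVAE(nn.Module):
    def __init__(self, input_dim=30, hidden=64, latent=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden), nn.ReLU())
        self.to_mu = nn.Linear(hidden, latent)       # mean of q(z|x)
        self.to_logvar = nn.Linear(hidden, latent)   # log-variance of q(z|x)
        self.decoder = nn.Sequential(
            nn.Linear(latent, hidden), nn.ReLU(), nn.Linear(hidden, input_dim)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.decoder(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    recon = ((x - x_hat) ** 2).sum(dim=1)                           # reconstruction term
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1)  # KL to N(0, I)
    return (recon + kl).mean()

Dropping the reparameterization and the KL term turns the same skeleton into the plain AE options.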
Question 2: (1) or (2)
- Using forward KL Divergence in the loss function “comparing” the prediction histogram (normalized) and prior is a good way to leverage the known prior about the histogram of pixel classes in the input samples. The KL divergence is a measure of the difference between two probability distributions, so using it in the loss function will encourage the model to produce predictions that are similar to the known prior distribution
- Penalizing the entropy difference of prior and prediction of the sample batch is also a way to leverage the known prior, since entropy is a measure of the disorder or randomness of a probability distribution. By penalizing the difference in entropy between the prediction and the prior, the model will be encouraged to produce predictions that are similar to the prior
- Using IOU regularization comparing histogram intersection over union for the 3 classes is not a good way to leverage the known prior, since IOU is a measure of the overlap between two sets and does not relate to the probability distribution
- Training 3 learners, one for each class, and taking majority voting as the final prediction is not a good way to leverage the known prior, since it trains 3 independent models and doesn’t take the prior information into account
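The forward-KL option is easy to make concrete. Here is a rough sketch in PyTorch; the prior values are placeholders, and in practice you would add this penalty to the usual per-pixel cross-entropy loss:

import torch
import torch.nn.functional as F

prior = torch.tensor([0.7, 0.2, 0.1])  # hypothetical prior over the 3 classes

def histogram_kl_penalty(logits, prior):
    """logits: (batch, 3, H, W) raw network outputs."""
    probs = F.softmax(logits, dim=1)      # per-pixel class probabilities
    hist = probs.mean(dim=(0, 2, 3))      # normalized predicted class histogram
    # forward KL(prior || predicted histogram)
    return torch.sum(prior * (prior.log() - hist.clamp_min(1e-8).log()))

# total_loss = cross_entropy_loss + lambda_kl * histogram_kl_penalty(logits, prior)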
Question 3: (4)
- Using “big” convolution kernels with a deep network may not be beneficial, as larger kernels can be sensitive to noise, and deep networks can be difficult to train and prone to overfitting
- Making sure that the effective receptive field is sufficient can be beneficial, but it is not enough
- Dividing the image into small patches and training to solve for each is not a good approach, as it will not take into account the relationship between pixels in the image
- Making sure that the objective takes into account global behavior is the best approach for leveraging the observation that the correct segmentation of a pixel on average is a function of “far” pixels. One way to do this is to use a global context loss function, such as a fully connected CRF or a non-local block, which considers the interactions between all pixels in the image. This will help the model take into account the relationships between distant pixels when making segmentation predictions. Another approach is to use a multi-scale architecture that combines features from different scales; this can be achieved with skip connections, or by adding a module to the architecture that handles different scales (like a pyramid pooling module)
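As a small illustration of the receptive-field point that ChatGPT dismisses as “not enough”: stacking dilated 3×3 convolutions is one cheap way to let far-away pixels influence a prediction without huge kernels. A sketch, assuming PyTorch:

import torch.nn as nn

def dilated_block(channels, dilations=(1, 2, 4, 8)):
    # Each dilated 3x3 conv adds 2*dilation to the receptive field
    layers = []
    for d in dilations:
        layers += [nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d),
                   nn.ReLU()]
    return nn.Sequential(*layers)

# With dilations 1, 2, 4, 8 the receptive field of this block is
# 1 + 2*(1+2+4+8) = 31 pixels, versus 9 for four plain 3x3 convolutions.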
Question 4: (3)
- Boosting is an ensemble method that can be used to improve the performance of a model but it does not address the issue of data invariance
- Dimensionality reduction can be useful for reducing the complexity of data but it does not address the issue of data invariance
- Data Augmentation is a technique that can help leverage the observation that the ground truth is invariant to some transformation of your input data. Data augmentation involves applying various operations such as rotation, scaling, flipping, and translation to the input data in order to artificially increase the size of the dataset. This can help to improve the robustness of the model by exposing it to different variations of the input data and making it invariant to certain transforms
- Symmetric architecture is a technique where the architecture is identical for both the input and output. It can be useful to improve the performance of the model but it does not address the issue of data invariance
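The data-augmentation answer is the easiest one to show in code. A minimal sketch using torchvision; the specific transforms and parameters are just examples of label-preserving operations:

from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),
    transforms.ToTensor(),
])
# Pass `augment` as the `transform` argument of the training dataset so each
# epoch sees different, label-preserving variations of the same images.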
Question 5: (3)
- If the game is deterministic set to 0, otherwise decrease over time may not be a good approach since it does not take into account the model’s confidence in its predictions
- Start high and decrease as you gain more confidence in the predictions may lead to exploration for too long, and the agent may not learn the best strategy quickly
- Start low and increase as you gain more confidence in the predictions is a good approach for setting the knob for the level of stochasticity in the actions taken. Starting with a low level of stochasticity allows the model to explore different options and gain a better understanding of the game before making more confident, greedy choices. As the model gains more confidence in its predictions, the level of stochasticity can be increased to encourage exploitation and refine the learned strategy
- Introduce it randomly to encourage exploitation may cause the model to not learn a consistent strategy
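For reference, the “knob” in this question is usually implemented as epsilon in an epsilon-greedy policy. A tiny sketch in plain Python (the names and values are illustrative), with a linear schedule you can point in either direction:

import random

def choose_action(q_values, epsilon):
    """With probability epsilon take a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def linear_schedule(step, total_steps, start=1.0, end=0.05):
    """Interpolate epsilon from `start` to `end`; swap the two to increase instead."""
    frac = min(step / total_steps, 1.0)
    return start + frac * (end - start)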
Question 6: (3)
To predict the labels of input data using latent representations is a case where an autoencoder is not the best approach. An autoencoder is a type of neural network that is trained to reconstruct the input data; it is not designed to predict labels or perform classification tasks. Other types of neural networks, such as supervised feedforward networks, are better suited for this task and can be trained using labeled data to predict the labels of input data.
Question 7: (3)
- ResNet is a convolutional neural network architecture designed for image classification tasks; it is not well suited for DFT modeling
- Inception is a convolutional neural network architecture designed for image classification tasks; it is not well suited for DFT modeling
- A Fully Connected Layer is a baseline architecture to model a Discrete Fourier Transform (DFT). The DFT is a linear transformation that can be represented by a matrix multiplication; a fully connected layer can represent this matrix and perform the DFT on an input signal. The fully connected layer can be trained with a suitable loss function to approximate the DFT matrix
- Transformer-based architectures are used for sequence modeling tasks and are not well suited for DFT modeling
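The justification for the fully connected layer is easy to verify numerically: the DFT really is a single matrix multiplication. A quick numpy check (in a real network you would split the matrix into real and imaginary parts so the layer stays real-valued):

import numpy as np

n = 8
k = np.arange(n)
W = np.exp(-2j * np.pi * np.outer(k, k) / n)   # the n x n DFT matrix

x = np.random.randn(n)
assert np.allclose(W @ x, np.fft.fft(x))        # matches numpy's FFT exactly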
Question 8: (2)
- One Layer is not always enough, as the DFT is a complex linear transformation and a single layer may not be able to capture the complexity of the problem
- Adding layers until the error over the training data is low enough is a practical approach for approximating the sufficient number of layers for the DFT model. The DFT is a linear transformation, and as such it can be represented by a matrix multiplication, which can be approximated by a neural network with enough layers. By training the model with a suitable loss function and monitoring the error over the training data, you can increase the number of layers until the error is low enough, indicating that the model is able to approximate the DFT matrix well
- Computing the VC dimension according to the problem and using the theoretical bound is a theoretical approach, it may not always be practical
- Hyperparameter optimization is a useful technique, but it should be used in conjunction with other methods such as monitoring the error over the training data to determine the sufficient number of layers for the model
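For what it’s worth, here is a rough sketch of the “add layers until the training error is low enough” loop that ChatGPT picked, assuming PyTorch; the data, threshold, and training budget are all placeholders:

import torch
import torch.nn as nn

def make_mlp(in_dim, out_dim, depth, hidden=64):
    # depth=1 gives a single linear layer; larger depth inserts hidden ReLU layers
    layers, d = [], in_dim
    for _ in range(depth - 1):
        layers += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

def train_mse(model, x, y, steps=2000, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()

# x: input signals, y: their DFTs with real and imaginary parts stacked (placeholders)
# for depth in range(1, 6):
#     err = train_mse(make_mlp(x.shape[1], y.shape[1], depth), x, y)
#     if err < 1e-4:
#         break

Whether you agree with that choice or with “one layer is enough” for a linear map, I leave to you, as with the rest of these answers.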