Author: SUNIL RAY | Compiled by: Flin | Source: analyticsvidhya

Introduction

If you ask me about the two most intuitive algorithms in machine learning, they would be k-Nearest Neighbors (kNN) and tree-based algorithms. Both are easy to understand, easy to explain, and easy to demonstrate. Fittingly, we put both algorithms to a skill test last month.

If you are new to machine learning, make sure you test yourself on your understanding of both of these algorithms. They are simple but powerful and are widely used in industry. This skill test is designed to test your knowledge of kNN and its applications.

More than 650 people signed up for the test. If you are one of the people who missed out on this skill test, this article contains the questions and solutions. Here is the leaderboard of the participants who took the test:

  • Datahack.analyticsvidhya.com/contest/ski…

Useful resources

Here are some resources to delve into the topic.

  • Fundamentals of machine learning algorithms (with Python and R code)
    • www.analyticsvidhya.com/blog/2017/0…
  • K-Nearest Neighbor (kNN) algorithm
    • www.analyticsvidhya.com/blog/2018/0…

Skills Quiz

1) The k-NN algorithm does more computation at test time than at training time.

A) true B) false

Solution: A

The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples.

In the test phase, a test point is classified by assigning the label that is most frequent among the k training samples nearest to the query point, which requires more computation.
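To make this concrete, here is a minimal from-scratch sketch (the class name and structure are illustrative, not from the article): fit only stores the training set, while predict does all the distance computation and voting.

```python
import numpy as np

class SimpleKNN:
    """Minimal kNN classifier sketch; illustrative, not production code."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # Training phase: just store the feature vectors and class labels.
        self.X_train = np.asarray(X, dtype=float)
        self.y_train = np.asarray(y)
        return self

    def predict(self, X):
        preds = []
        for x in np.asarray(X, dtype=float):
            # Test phase: compute the distance to every training point...
            dists = np.sqrt(((self.X_train - x) ** 2).sum(axis=1))
            # ...then take a majority vote among the k nearest labels.
            nearest = self.y_train[np.argsort(dists)[: self.k]]
            labels, counts = np.unique(nearest, return_counts=True)
            preds.append(labels[np.argmax(counts)])
        return np.array(preds)
```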

2) Assuming you are using the k-nearest neighbor algorithm, which of the following will be the best value for k in the image below?

A) 3 B) 10 C) 20 D) 50

Solution: B

The validation error is at its minimum when the value of k is 10.

3) Which of the following distance measures cannot be used in k-NN?

A) Manhattan B) Minkowski C) Tanimoto D) Jaccard E) Mahalanobis F) All of these can be used

Solution: F

All of these can be used as distance measures in k-NN.

4) Which of the following is true of the K-NN algorithm?

A) It can be used for classification B) It can be used for regression C) It can be used for both classification and regression

Solution: C

We can also use k-NN for regression problems. In that case, the prediction is based on the mean or median of the k most similar instances.
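For example, here is a hedged sketch of kNN regression with scikit-learn (the toy data is made up for illustration): the prediction for a query point is the mean of the targets of its k nearest neighbors.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy 1-D regression data (illustrative values only).
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2, 6.1])

# The prediction is the mean of the targets of the 3 nearest neighbors.
model = KNeighborsRegressor(n_neighbors=3).fit(X, y)
print(model.predict([[3.5]]))
```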

5) Which of the following statements is true about the K-NN algorithm?

  1. k-NN performs much better if all of the data have the same scale
  2. k-NN works well with a small number of input variables (p), but struggles when the number of inputs is very large
  3. K-nn makes no assumptions about the functional form of the problem being solved

A) 1 and 2 B) 1 and 3 C) only 1 D) all of the above

Solution: D

All of the above statements are assumptions of the kNN algorithm.
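Statement 1 is why features are usually rescaled before applying kNN. A minimal sketch with scikit-learn (X_train and y_train are assumed placeholders, not from the article):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Standardizing all features to the same scale keeps one large-valued
# feature from dominating the distance computation.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
# knn.fit(X_train, y_train)    # X_train / y_train assumed to exist
# print(knn.predict(X_test))
```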

6) Which of the following machine learning algorithms can be used to estimate missing values for categorical and continuous variables?

A) K-nn B) linear regression C) Logistic regression

Solution: A

The k-NN algorithm can be used to impute missing values of both categorical and continuous variables.
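As an illustration, scikit-learn's KNNImputer fills a missing entry with the mean of that feature over the nearest rows. This sketch covers the continuous case; categorical variables would need to be encoded first (for example, filled with the mode of the neighbors). The data is made up for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy numeric data with one missing entry (np.nan); values are illustrative.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [np.nan, 6.0],
              [8.0, 8.0]])

# Each missing value is replaced by the mean of that feature over the
# 2 nearest rows, with distances measured on the observed features.
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
```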

7) Which of the following is true about the Manhattan distance?

A) It can be used for continuous variables B) It can be used for categorical variables C) It can be used for both categorical and continuous variables D) None of these

Solution: A

The Manhattan distance is designed to calculate distances between real-valued (continuous) features.

8) Which of the following distance measures should we use for categorical variables in k-NN?

  1. Hamming distance
  2. Euclidean distance
  3. Manhattan distance

A) 1 B) 2 C) 3 D) 1 and 2 E) 2 and 3 F) 1, 2 and 3

Solution: A

Euclidean and Manhattan distances are used for continuous variables, while the Hamming distance is used for categorical variables.
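For reference, the Hamming distance simply counts the positions at which two categorical vectors differ. A tiny illustrative helper (the example values are made up):

```python
# Hamming distance: the number of positions at which two equal-length
# categorical vectors disagree.
def hamming(a, b):
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

print(hamming(["red", "small", "round"], ["red", "large", "round"]))  # 1
```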

9) Which of the following is the Euclidean distance between the two data points A(1, 3) and B(2, 3)?

A) 1 B) 2 C) 4 D) 8

Solution: A

sqrt((1-2)^2 + (3-3)^2) = sqrt(1^2 + 0^2) = 1

10) Which of the following is the Manhattan distance between the two data points A(1, 3) and B(2, 3)?

A) 1 B) 2 C) 4 D) 8

Solution: A

|1-2| + |3-3| = 1 + 0 = 1 (the Manhattan distance is the sum of absolute differences; no square root is involved)
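Both calculations can be checked in a couple of lines of plain Python (nothing here is from the article):

```python
# Checking questions 9 and 10 for A(1, 3) and B(2, 3).
A, B = (1, 3), (2, 3)

euclidean = ((A[0] - B[0]) ** 2 + (A[1] - B[1]) ** 2) ** 0.5
manhattan = abs(A[0] - B[0]) + abs(A[1] - B[1])

print(euclidean)  # 1.0
print(manhattan)  # 1
```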

Context for questions 11-12:

Suppose you are given the following data, where x and y are two input variables and Class is the dependent variable.

The following scatter plot shows the above data in 2D space.

11) Suppose you want to use the Euclidean distance in 3-NN to predict the class of a new data point x = 1, y = 1. Which class does the data point fall into?

A) + B) — C) cannot judge D) None of these

Solution: A

All three closest points are of the + class, so this point will be classified as the + class.

12) In the previous question, suppose you now want to use 7-NN instead of 3-NN. Which class does the point x = 1, y = 1 belong to?

A) + class B) − class C) Can’t judge

Solution: B

This point will now be classified as the − class because the nearest circle contains four − class points and three + class points.

Context for questions 13-14:

Suppose you are given the following data with two classes, where “+” represents the positive class and “-” represents the negative class.

13) Which of the following values of k in k-NN gives the least leave-one-out cross-validation error?

A) 3 B) 5 C) Both have the same D) None of these

Solution: B

5-NN gives the least leave-one-out cross-validation error.

14) Which of the following is the leave-one-out cross-validation accuracy when k = 5?

A) 2/14 B) 4/14 C) 6/14 D) 8/14 E) None of the above

Solution: E

With 5-NN, the leave-one-out cross-validation accuracy is 10/14, which is not among the options.

15) Regarding k in k-NN, which of the following is true with respect to bias?

A) As you increase k, the bias increases B) as you decrease k, the bias increases C) Can’t judge D) None of these

Solution: A

A large k yields a simpler model, and simpler models are regarded as having high bias.

16) Regarding k in k-NN, which of the following is true with respect to variance?

A) As you increase k, the variance increases B) as you decrease k, the variance increases C) Can’t tell D) None of these are true

Solution: B

A simple model (large k) is regarded as having low variance, so decreasing k increases the variance.

17) You have been given the two distance measures (Euclidean and Manhattan) that we generally use in k-NN algorithms. These distances are between the points A(x1, y1) and B(x2, y2).

Your task is to label the two distances by looking at the following two figures. Which of the following is true about the figures below?

A) Left is the Manhattan distance and right is the Euclidean distance B) Left is the Euclidean distance and right is the Manhattan distance C) Neither is the Manhattan distance D) Neither is the Euclidean distance

Solution: B

The left figure depicts how the Euclidean distance works and the right one depicts the Manhattan distance.

18) When you find noise in the data, which of the following would you consider in k-NN?

A) I will increase the value of k B) I will decrease the value of k C) Noise does not depend on the value of k D) None of these

Solution: A

To be more sure of your classifications, you can try increasing the value of k.

19) In k-NN, overfitting is likely due to the curse of dimensionality. Which of the following options would you consider to address this problem?

  1. Dimensionality reduction
  2. Feature selection

A) 1 B) 2 C) 1 and 2 D) None of these

Solution: C

In this case, you can use a dimensionality reduction algorithm or a feature selection algorithm
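As a sketch of both options with scikit-learn (the component and feature counts are placeholders, not values from the article):

```python
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier

# Option 1: project onto a few principal components before kNN.
knn_pca = make_pipeline(PCA(n_components=10), KNeighborsClassifier())

# Option 2: keep only the most informative original features.
knn_select = make_pipeline(SelectKBest(f_classif, k=10),
                           KNeighborsClassifier())
```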

20) Below are two statements. Which of them is/are true?

  1. k-NN is a memory-based approach, in which the classifier immediately adapts as we collect new training data.
  2. In the worst case, the computational complexity of classifying a new sample grows linearly with the number of samples in the training dataset.

A) 1 B) 2 C) 1 and 2 D) None of these

Solution: C

21) Suppose you are given the following images (1 left, 2 middle, 3 right). Your task is to compare the value of k used in k-NN for each image, where k1 corresponds to the first figure, k2 to the second, and k3 to the third.

A) k1 > k2 > k3 B) k1 < k2 C) k1 = k2 = k3 D) None of these

Solution: D

The value of k is highest in k3 and lowest in k1.

22) In the figure below, which of the following values of k gives the lowest leave-one-out cross-validation accuracy?

A) 1 B) 2 C) 3 D) 5

Solution: B

If the value of k is kept at 2, the cross-validation accuracy is lowest. You can try it yourself.

23) A company built a kNN classifier that achieved 100% accuracy on training data. When they deployed the model on the client side, they found that it was not accurate at all. Which of the following could be wrong?

Note: The model was successfully deployed and no technical issues were found on the client side other than model performance

A) The model may be overfitted B) The model may be underfitted C) Can’t judge D) None of these

Solution: A

An overfitted model seems to perform well on training data, but it does not generalize well enough to give the same results on new data.

24) Given the following two statements, which of them is/are true in the case of k-NN?

  1. If k is very large, we may include points of other classes in the neighborhood.
  2. If the value of k is too small, the algorithm is very sensitive to noise

A) 1 B) 2 C) 1 and 2 D) None of these

Solution: C

Both statements are true and fairly self-explanatory.

25) Which of the following statements is true for the K-NN classifier?

A) The higher the k value is, the better the classification accuracy is

B) The smaller the k value is, the smoother the decision boundary is

C) The decision boundary is linear

D) K-NN does not require explicit training steps

Solution: D

Option A: Not always. You have to make sure that k is not too high or too low.

Option B: This statement is not true. The decision boundary can be somewhat jagged.

Option C: Same as option B


26) True or False: Is it possible to construct a 2-NN classifier using 1-NN classifiers?

A) true B) false

Solution: A

You can implement a 2-NN classifier by ensembling 1-NN classifiers.

27) In k-NN, what happens to the decision boundary when you increase or decrease the value of k?

A) The larger the K value, the smoother the boundary

B) The boundary becomes smoother as the value of K decreases

C) The smoothness of the boundary is independent of K value

D) None of these

Solution: A

By increasing the value of k, the decision boundary becomes smoother.

28) Below are two statements about the k-NN algorithm. Which of them is/are true?

  1. We can select the optimal value of k by means of cross validation
  2. The Euclidean distance treats each feature equally

A) 1 B) 2 C) 1 and 2 D) None of these

Solution: C

Both statements are correct
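A minimal sketch of statement 1 with scikit-learn (X_train and y_train are assumed placeholders, not from the article): search a grid of k values and keep the one with the best cross-validated accuracy.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Try several values of k and pick the one with the best
# 5-fold cross-validated accuracy.
param_grid = {"n_neighbors": [1, 3, 5, 7, 9, 11]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
# search.fit(X_train, y_train)     # X_train / y_train assumed to exist
# print(search.best_params_)
```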

Context for questions 29-30: Suppose you have trained a k-NN model and now want to make predictions on test data. Before obtaining the predictions, suppose you calculate the time k-NN takes to predict the classes for the test data.

Note: Calculating the distance between two observations will take time D.

29) If there are N (very large) observations in the test data, how much time will 1-NN take?

A) N * D B) N * D * 2 C) (N * D) / 2 D) None of these

Solution: A

N is very large, so choice A is correct

30) What is the relation between the time taken by 1-nn, 2-nn, 3-nn?

A) 1-nn > 2-nn > 3-nn B) 1-nn < 2-nn < 3-nn C) 1-nn ~ 2-nn ~ 3-nn D) None of these

Solution: C

In the kNN algorithm, the time taken is essentially the same for any value of k, since the dominant cost of computing distances to all training points does not depend on k.

Overall score distribution

Here’s the score distribution of the participants:

You can access the scores here: datahack.analyticsvidhya.com/contest/ski… More than 250 people took the skill test and the highest score was 24.

Original article: www.analyticsvidhya.com/blog/2017/0…
