3_knn

K-Nearest Neighbors

1. K Nearest Neighbors (KNN)• The basic idea of KNN is to figure out what an unlabeled example is according to its neighbors.• Parameter

k

indicates the number of neighbors we're considering.• The label is determined by majority voting.– For example, for

k = 7

, and 6 of examples →

1

and 1 example →

0

6 ⁄ 7 = 85.7 % \to 1

and

1 ⁄ 7 = 14.3 % \to 0

.– Note: In order to avoid ties, it's better to pick odd numbers for

k

.• Distance is KNN is defined as Euclidean Distance →

d (\vec{a}, \vec{b}) = \sqrt{(a_{1} - b_{1})^{2} + \dots + (a_{m} - b_{m})^{2}}

– Note: We should use some feature normalization like Minmax Scaling before calculating distance.– Note: We could also use Manhattan distance →

d (\vec{a}, \vec{b}) = | a_{1} - b_{1} | + . . . + | a_{m} - b_{m} |

• KNN can be susceptible to outliers. – In order to determine if an unlabeled example is an outlier, we can average its distance to

k

nearest neighbors and if the average distance is greater than some threshold, we can label it as an outlier (an odd example).– We can determine the threshold through cross validation.• How to account for categorical features in KNN?– For ordinal categorical features, we can use Gower Distance.– For comparing two asymmetric binary variables, we can use Jaccard Distance.* Note: For two asymmetric binary variables, we exclude the zeros. But for a symmetric binary we include the zeros as well.– For multi-category variables we can use Hamming Distance.• How does KNN predict regression models?– It just averages the value of

k

nearest neighbors.• Partial Considerations– KNN performs poorly when features are in dimensionality (especially euclidean distance).* One way to mitigate this problem is dimensionality reduction.* Another way is to do forward feature selection using cross validation.– KNN is sensitive to scaling (this is true for any model that's based upon distance).– One way to improve the performance of KNN is to weigh the votes.* For any node

i

(

i ⩽ k

), the weight is calculated by →

\frac{d_{1} + d_{2} + \dots + d_{k}}{d_{i}}

* So, if the distance to neighbor node

i

d_{i}

, gets smaller it will get higher weight.* Note: Numerator is fixed and only the denominator changes.– KNN is computationally expensive →

O (n . d . k)

where

n

is the number of nodes, and

d

is the dimension of each node.* To reduce the complexity down, we can use a k-d tree data structure which allows us to locate the relevant points faster than iterating over all the points. * Note that k-d tree also suffers from increase in dimensionality.• Example: Cyber Security– Every once in a while these programs will make calls to the operating system (i.e. system calls).– Our goal is to detect intrusive processes/programs in all the machines in a system.– Our features are going to be system calls which look like this:*

p i d_{1} = [' m k d i r',' m o u n t',' p o l l',' c h o w n']

p i d_{2} = [' c h r o o t',' b i n d',' f o r k',' o p e n',' p r e a d y']

p i d_{3} = . . .

– The labels are

i n s t r u s i v e \to 1

and

n o n - i n t r u s i v e \to 0

.– We can treat system calls as just words and perform TF-IDF on the them to make them numerical.*

p i d_{1} = [0.15, 0.12, 0.13, 0.31]

p i d_{2} = [0.16, 0.31, 0.46, 0.21, 0.11]

p i d_{3} = . . .

– For an unlabeled example, we calculate its distance to all other nodes, and determine its label based on the majority voting of its closest

k

neighbors.– Let's say we also have a categorical features named 'priority'. It has three values: High, Medium, Low. Let's calculate Gower distance.* High → 0, Medium → 1, Low → 2* Gower Distance =

v a l u e ⁄ \max (v a l u e)

d_{G o w e r} (h i g h) = 0 ⁄ 2, d_{G o w e r} (m e d i u m) = 1 ⁄ 2, d_{G o w e r} (l o w) = 2 ⁄ 2