Simple Implementation of Nearest-Neighbor Classification

The nearest-neighbor classifier is a simple classifier that assigns a test sample to a category by finding the nearest point in the training data and using that point's label.

"Nearest" is measured with the Euclidean distance; for example, the distance between two D-dimensional points \(p\) and \(q\) is

\[ d(p, q) = \sqrt{\sum_{i=1}^{D} (p_i - q_i)^2} \]
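
As a quick sketch of this formula in NumPy (the two points here are made-up values, purely for illustration):

import numpy as np

# two illustrative 3-dimensional points
p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 6.0, 3.0])

# Euclidean distance: square root of the summed squared differences
d = np.sqrt(np.sum((p - q) ** 2))
print(d)  # 5.0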

To find the point in the training data with the smallest distance, we do not need to compute the distance to each training sample one by one; we can handle it with matrix operations, as sketched after the following steps:

  1. Subtract the test sample from each row of the training data matrix.
  2. Call the resulting matrix \(A\); then each entry on the diagonal of \(AA^T\) is the squared distance between the test sample and the corresponding training sample.
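
A minimal sketch of this trick on a toy matrix (the training rows and test sample below are made-up values; only the diagonal of \(AA^T\) matters):

import numpy as np

# toy training data (3 samples, 2 features) and a test sample
train = np.array([[0.0, 0.0],
                  [3.0, 4.0],
                  [6.0, 8.0]])
sample = np.array([3.0, 4.0])

A = train - sample               # step 1: subtract the test sample from every row
sq_dist = np.diag(A.dot(A.T))    # step 2: diagonal of AA^T = squared distances
print(sq_dist)                   # [25.  0. 25.]

The full classifier below applies the same idea to the training data and returns the index of the smallest squared distance.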
import numpy as np

def nn(train_data, sample):
    # Step 1: subtract the test sample from every row of the training data
    diff = train_data - sample
    # Step 2: the diagonal of diff * diff^T holds the squared distances
    sq_dist = np.diag(np.dot(diff, diff.T))
    # Return the index of the nearest training sample
    return np.argmin(sq_dist)

if __name__ == "__main__":
    # Pima Indians Diabetes data: feature columns followed by the class label in column 8
    data = np.genfromtxt("pima-indians-diabetes.data", delimiter=",")
    np.random.shuffle(data)

    # Split the shuffled data into equal-sized training and test sets
    split = len(data) // 2
    train_data = data[:split]
    test_data = data[split:]

    # Use only feature columns 1-3 for the distance computation
    active_feat = [1, 2, 3]

    correct = 0
    wrong = 0

    for i in range(len(test_data)):
        nn_index = nn(train_data[:, active_feat], test_data[i][active_feat])
        if test_data[i][8] == train_data[nn_index][8]:
            correct += 1
        else:
            wrong += 1

    accuracy = correct * 1.0 / (correct + wrong)
    print(accuracy)

The dataset used in the code is the Pima Indians Diabetes Data Set, in which each row holds eight feature columns followed by a binary class label in column index 8.