Machine learning algorithms have revolutionized the way we make predictions, classify data, and extract insights from complex datasets. Among the plethora of machine learning algorithms available, k-Nearest Neighbors (k-NN) stands out as a simple yet powerful method for classification and regression tasks. In this blog, we will explore the k-NN algorithm in depth and provide you with ten practical code examples to help you master this versatile technique.
Understanding k-Nearest Neighbors (k-NN)
The k-NN algorithm is a non-parametric and instance-based machine learning technique used for classification and regression tasks. It operates on the principle that objects with similar characteristics tend to belong to the same class or have similar values. In essence, k-NN makes predictions by finding the k nearest data points to a given query point and determining the majority class or averaging their values.
What is k-Nearest Neighbors?
The k-Nearest Neighbors algorithm is a supervised machine learning method used for both classification and regression tasks. It’s a non-parametric, instance-based learning algorithm, which means it doesn’t make any assumptions about the underlying data distribution. Instead, it relies on the data itself to make predictions.
At its heart, k-NN operates based on the idea that similar data points should have similar labels or values. The “k” in k-NN represents the number of nearest neighbors to consider when making a prediction. It works as follows:
- Calculate the distance between the input data point and all other data points in the dataset.
- Select the top “k” closest data points.
- For classification, take a majority vote among the “k” neighbors to determine the class label. For regression, calculate the mean or weighted mean of the “k” neighbors’ values to predict a continuous value.
Let’s dive into the key concepts of the k-NN algorithm before we move on to the code examples.
- Distance Metric: k-NN relies on a distance metric (e.g., Euclidean, Manhattan) to measure the similarity between data points. The choice of distance metric can significantly impact the algorithm’s performance.
- Choosing k: The hyperparameter “k” represents the number of nearest neighbors to consider when making predictions. Selecting an appropriate value for k is crucial and can be determined through techniques like cross-validation.
- Majority Voting: In classification tasks, k-NN uses majority voting among the k-nearest neighbors to assign a class label to the query point. The class with the highest number of occurrences among the neighbors is selected.
- Regression in k-NN: For regression tasks, k-NN computes the average (or weighted average) of the target values of the k-nearest neighbors to predict the target value for the query point.
Now that we have a fundamental understanding of k-NN, let’s explore ten code examples to see how this algorithm can be applied in various scenarios.
1. K-NN for Classification
# Import necessary libraries from sklearn.neighbors import KNeighborsClassifier # Create a k-NN classifier with k=3 knn = KNeighborsClassifier(n_neighbors=3) # Fit the classifier to your training data knn.fit(X_train, y_train) # Make predictions on new data predictions = knn.predict(X_test)
2. K-NN for Regression
# Import necessary libraries from sklearn.neighbors import KNeighborsRegressor # Create a k-NN regressor with k=5 knn = KNeighborsRegressor(n_neighbors=5) # Fit the regressor to your training data knn.fit(X_train, y_train) # Predict values for new data points predictions = knn.predict(X_test)
3. Choosing an Optimal k
from sklearn.model_selection import cross_val_score # Initialize an empty list to store cross-validation scores cv_scores =  # Perform 10-fold cross-validation for different values of k for k in range(1, 21): knn = KNeighborsClassifier(n_neighbors=k) scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy') cv_scores.append(scores.mean()) # Find the optimal k with the highest cross-validation score optimal_k = np.argmax(cv_scores) + 1
4. Scaling Features for k-NN
from sklearn.preprocessing import StandardScaler # Initialize a scaler scaler = StandardScaler() # Fit the scaler to your training data scaler.fit(X_train) # Transform both the training and testing data X_train_scaled = scaler.transform(X_train) X_test_scaled = scaler.transform(X_test)
5. Weighted k-NN
# Create a weighted k-NN classifier with custom weights knn = KNeighborsClassifier(n_neighbors=3, weights='distance') # Fit the classifier to your training data knn.fit(X_train, y_train) # Make predictions on new data predictions = knn.predict(X_test)
6. Distance Metrics
# Create a k-NN classifier with Manhattan distance knn_manhattan = KNeighborsClassifier(n_neighbors=3, metric='manhattan') # Create a k-NN classifier with Minkowski distance (custom p value) knn_minkowski = KNeighborsClassifier(n_neighbors=3, metric='minkowski', p=4)
7. Handling Imbalanced Classes
from imblearn.over_sampling import SMOTE # Apply SMOTE to balance the classes sm = SMOTE(random_state=42) # Resample the training data X_resampled, y_resampled = sm.fit_resample(X_train, y_train)
8. Using Ball Tree Algorithm
from sklearn.neighbors import BallTree # Create a BallTree for efficient k-NN queries tree = BallTree(X_train) # Find the indices of the k-nearest neighbors distances, indices = tree.query(X_test, k=3)
9. Visualization of k-NN
import matplotlib.pyplot as plt # Visualize decision boundaries of k-NN x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1 y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1 xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1), np.arange(y_min, y_max, 0.1)) Z = knn.predict(np.c_[xx.ravel(), yy.ravel()]) Z = Z.reshape(xx.shape) plt.contourf(xx, yy, Z, alpha=0.8)
10. Performance Evaluation
from sklearn.metrics import classification_report, confusion_matrix # Evaluate the k-NN classifier's performance print(confusion_matrix(y_test, predictions)) print(classification_report(y_test, predictions))
In this comprehensive guide, we’ve explored the k-Nearest Neighbors (k-NN) algorithm in detail, covering its fundamental concepts and providing you with ten practical code examples. Whether you’re dealing with classification or regression tasks, imbalanced datasets, or need to optimize hyperparameters like k and distance metrics, k-NN offers a versatile and intuitive approach.
By mastering the k-NN algorithm and experimenting with the code examples, you’ll be well-equipped to apply this technique to a wide range of machine learning problems. Remember that k-NN is just one tool in your machine learning toolbox, and choosing the right algorithm for a given task is essential for success in the field of data science and machine learning.