Understanding K-means Clustering and Its Use Cases in the Security Domain

rishabhsharma
3 min read · Jul 19, 2021

K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.

“The objective of K-means is simple: group similar data points together and discover underlying patterns. To achieve this objective, K-means looks for a fixed number (k) of clusters in a dataset.”

“A cluster refers to a collection of data points aggregated together because of certain similarities.”

K-means Algorithm

The K-means algorithm is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters), where each data point belongs to only one group. It tries to make the data points within a cluster as similar as possible while keeping the clusters as different (far apart) as possible. It assigns data points to clusters such that the sum of the squared distances between the data points and their cluster's centroid (the arithmetic mean of all the data points that belong to that cluster) is minimized. The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster.
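The objective described above can be written compactly. The notation is introduced here only for illustration: x_i is a data point, C_k is the set of points assigned to cluster k, and \mu_k is that cluster's centroid.

```latex
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{\lvert C_k \rvert} \sum_{x_i \in C_k} x_i
```

K-means searches for the assignment of points to clusters that makes J as small as possible.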

How the K-means algorithm works

The K-means algorithm works as follows:

  1. Specify the number of clusters K.
  2. Initialize the centroids by first shuffling the dataset and then randomly selecting K data points as the centroids, without replacement.
  3. Repeat steps 4 to 6 until the centroids stop changing, i.e. the assignment of data points to clusters no longer changes.
  4. Compute the squared distance between each data point and every centroid.
  5. Assign each data point to the closest cluster (centroid).
  6. Recompute the centroid of each cluster by taking the average of all the data points that belong to it.
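The steps above can be sketched in plain Python. This is a minimal illustration rather than a production implementation: the function name `kmeans`, the parameters `max_iter` and `seed`, and the sample data are all assumptions made for this example, and an empty cluster simply keeps its previous centroid.

```python
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups.

    Returns (centroids, labels), where labels[i] is the cluster index of points[i].
    """
    rng = random.Random(seed)
    # Step 2: pick K data points at random, without replacement, as initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Steps 4-5: assign each point to the centroid with the smallest squared distance.
        new_labels = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        # Step 3: stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 6: recompute each centroid as the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels


# Hypothetical 2-D data: two well-separated groups.
data = [(0, 0), (1, 0), (2, 0), (50, 5), (51, 5), (52, 5)]
centroids, labels = kmeans(data, k=2)
print(centroids, labels)
```

Which group receives label 0 and which receives label 1 depends on the random initialization, so only the grouping itself is meaningful, not the label values.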

Advantages of K-means Clustering

  1. It is fast, easy to understand, and comparatively efficient.
  2. It gives the best results when the clusters in the data are distinct and well separated.
  3. It is flexible and easy to interpret.
  4. It has a low computational cost: each iteration scales linearly with the number of data points.

Disadvantages of K-means Clustering

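  1. If two groups of data overlap heavily, K-means cannot distinguish them and cannot tell that there are two clusters.
  2. It cannot handle outliers and noisy data well, because a centroid is a mean and is pulled towards extreme values.
  3. It does not work well on data whose clusters are not linearly separable, it lacks consistency (results depend on the random initialization), and it is sensitive to the scale of the features.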
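The sensitivity to scale is easy to see with a small example. The sketch below uses hypothetical records of the form (bytes transferred, failed logins) and z-score standardization, a common mitigation that this article does not otherwise cover. The helper names `standardize` and `nearest_to_first` are made up for this illustration.

```python
import math
import statistics


def standardize(points):
    """Rescale every feature to zero mean and unit standard deviation (z-scores)."""
    columns = list(zip(*points))
    means = [statistics.fmean(col) for col in columns]
    # Guard against constant features, whose standard deviation is zero.
    stdevs = [statistics.pstdev(col) or 1.0 for col in columns]
    return [tuple((v - m) / s for v, m, s in zip(p, means, stdevs)) for p in points]


def nearest_to_first(points):
    """Index of the point closest to points[0] by Euclidean distance."""
    return min(range(1, len(points)), key=lambda i: math.dist(points[0], points[i]))


# Hypothetical records: (bytes transferred, failed logins).
records = [(1000.0, 0.0), (1100.0, 10.0), (1400.0, 0.0), (1500.0, 10.0)]

print(nearest_to_first(records))               # → 1
print(nearest_to_first(standardize(records)))  # → 2
```

On the raw values the byte counts dominate the distance, so the first record looks closest to a record with very different login behaviour. After standardization both features contribute comparably, and its nearest neighbour becomes the other record with zero failed logins.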

Use Cases in the Security Domain

With advances in technology and the growing number of digital sources, the quantity of data increases every day, and with it the quantity of cyber security related data. Traditional security systems such as Intrusion Detection Systems (IDS) are not capable of handling such a growing amount of data in real time.

K-means clustering is one of the commonly used clustering algorithms in cyber security analytics. It divides security related data into groups of similar entities, which in turn can help in gaining important insights into known and unknown attack patterns. This lets a security analyst focus the analysis on the data in specific clusters only. To improve performance, K-means can exploit the triangle inequality to skip many point-to-center distance computations without affecting the clustering results.
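To make the triangle inequality idea concrete, here is a minimal sketch of one assignment step. It relies on a single bound: if the distance between two centroids c and c' is at least twice the distance from a point x to c, then c' cannot be closer to x than c, so the distance from x to c' never needs to be computed. Accelerated variants such as Elkan's algorithm build on this bound and add further bookkeeping that is omitted here. The function names and data are assumptions for illustration.

```python
import math


def assign_brute_force(points, centroids):
    """Assign each point to its nearest centroid, computing every distance."""
    return [
        min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        for p in points
    ]


def assign_with_pruning(points, centroids):
    """Same assignment, skipping distances ruled out by the triangle inequality.

    Returns (labels, skipped), where skipped counts the avoided distance computations.
    """
    k = len(centroids)
    # Centroid-to-centroid distances, computed once and reused for every point.
    cc = [[math.dist(a, b) for b in centroids] for a in centroids]
    labels, skipped = [], 0
    for p in points:
        best, best_dist = 0, math.dist(p, centroids[0])
        for j in range(1, k):
            # d(c_best, c_j) >= 2 * d(p, c_best) implies d(p, c_j) >= d(p, c_best),
            # so centroid j cannot be strictly closer and can be skipped.
            if cc[best][j] >= 2 * best_dist:
                skipped += 1
                continue
            d = math.dist(p, centroids[j])
            if d < best_dist:
                best, best_dist = j, d
        labels.append(best)
    return labels, skipped


# Hypothetical points and centroids.
centroids = [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0)]
points = [(0.5, 0.2), (9.5, 10.1), (19.8, 0.3), (0.1, 0.9), (10.2, 9.7)]
labels, skipped = assign_with_pruning(points, centroids)
print(labels, skipped)
```

The pruned version returns the same labels as the brute-force version. The saving grows when clusters are well separated, because more centroids can be ruled out without measuring the distance to them.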

