K-means Clustering & Use Cases in the Security Domain
K-means clustering is a very popular unsupervised learning algorithm. In this article I want to provide a bit of background about it, and show how we could use it in an anecdotal real-life situation.
K-means Clustering Algorithm
K-Means Clustering is an algorithm that, given a dataset, will identify which data points belong to each one of the k clusters. It takes your data and learns how it can be grouped.
Through a series of iterations, the algorithm creates groups of data points — referred to as clusters — that have similar variance and that minimize a specific cost function: the within-cluster sum of squares.
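Within-Cluster Sum of Squares:
$$\text{WCSS} = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$$
where C_i is a cluster, \mu_i is the mean of the data points in cluster C_i, and k is the number of clusters.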
By using the within-cluster sum of squares as cost function, data points in the same cluster will be similar to each other, whereas data points in different clusters will have a lower level of similarity.
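To make the cost function concrete, here is a minimal sketch that computes the within-cluster sum of squares for a given grouping. The function names and the 2-D points are invented for illustration; they are not from any particular library or dataset.

```python
def mean_point(points):
    """Component-wise mean of a list of equal-length points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def wcss(clusters):
    """Sum, over all clusters, of squared Euclidean distances to the cluster mean."""
    total = 0.0
    for points in clusters:
        mu = mean_point(points)
        for p in points:
            total += sum((a - b) ** 2 for a, b in zip(p, mu))
    return total


# The same four hypothetical points, grouped in two different ways.
tight = [[(1.0, 1.0), (1.0, 3.0)], [(9.0, 1.0), (9.0, 3.0)]]
mixed = [[(1.0, 1.0), (9.0, 1.0)], [(1.0, 3.0), (9.0, 3.0)]]

print(wcss(tight))  # 4.0
print(wcss(mixed))  # 64.0
```

The grouping that keeps nearby points together has the much lower cost, which is exactly what the algorithm tries to achieve.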
K-Means Clustering is part of a group of learning algorithms called unsupervised learning. In this type of learning model, there is no explicit identification of a label/class/category for each data point.
Each data point in your dataset is a vector of attributes, i.e., features without a specific label that could assign it to a specific cluster or class. The algorithm will then learn on its own how to group data points that have similar features and cluster them together.
One important detail about K-Means Clustering is that, even though it identifies which data point should be part of which cluster, you'll have to specify the parameter K, representing the total number of clusters that you want to use to "distribute" your data.
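The iterations described above can be sketched in a few lines of plain Python. This is a simplified version under stated assumptions: the function name `k_means`, the random choice of starting centroids, and the toy points are all illustrative. Library implementations such as scikit-learn's `KMeans` use smarter initialization. Note that K is an input you choose, not something the algorithm discovers.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, iterations=100, seed=0):
    """Return (centroids, labels) after alternating assign/update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
        if new_labels == labels:  # no point changed cluster: converged
            break
        labels = new_labels
    return centroids, labels


# Hypothetical 2-D points forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = k_means(points, k=2)
print(labels)
```

The first three points end up sharing one label and the last three the other; which group is called 0 or 1 depends on the random start.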
Consider an example. Your preferred workout is jogging and, since you're extremely data-inclined, you make sure to keep track of your performance. So you end up compiling a dataset of your workout sessions.
It consists of the date you went jogging, the total distance run (km), the duration (min) and the number of days since your last workout. Each row in your dataset contains the attributes or features of one workout session.
Ultimately, the question you want to answer is: "How can I identify similar workout sessions?"
By identifying workout sessions that are similar to each other you can have a better understanding of your overall performance and get new ideas on how to improve.
A clustering algorithm like K-Means Clustering can help you split the data into distinct groups, so that the data points in each group are as similar to each other as possible.
A good practice in Data Science & Analytics is to first get a good understanding of your dataset before doing any analysis.
Taking an exploratory view of your dataset, you start by plotting a pair-plot, in order to get a better idea about the correlation between different features.
The plots on the diagonal correspond to the distribution of each feature, while the others are scatterplots of each pair of features.
One interesting plot is distance vs duration, which shows a linear correlation.
In contrast, when we compare distance and duration with the number of days since the last workout session, there is no evident correlation between them.
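The visual impression from the scatterplots can be backed by a number. The sketch below computes the Pearson correlation coefficient by hand; the three lists are hypothetical values invented for illustration, not the article's actual dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical workout values, made up for this example.
distance_km = [3.0, 5.0, 8.0, 10.0, 4.0, 6.0]
duration_min = [18.0, 31.0, 50.0, 63.0, 25.0, 37.0]
days_since_last = [2.0, 5.0, 1.0, 3.0, 1.0, 4.0]

print(round(pearson(distance_km, duration_min), 2))
print(round(pearson(distance_km, days_since_last), 2))
```

A coefficient close to 1 matches the linear pattern seen for distance vs duration, while a value near 0 matches the shapeless scatter for days since the last workout.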
Now that you have a good idea of what your dataset looks like, it's time to use K-Means and see how the individual sessions are grouped. You can start off by splitting your dataset into two distinct groups.
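One practical point before clustering: K-Means relies on distances, so a feature measured in large numbers (minutes) can dominate one measured in small numbers (days). The article does not say whether the features were scaled, so treat the following as an optional preprocessing sketch. The `sessions` rows and the `standardize` helper are hypothetical.

```python
import statistics

# Hypothetical sessions: (distance_km, duration_min, days_since_last).
sessions = [
    (3.0, 18.0, 2.0),
    (5.0, 31.0, 5.0),
    (8.0, 50.0, 1.0),
    (10.0, 63.0, 3.0),
]


def standardize(rows):
    """Rescale every column to zero mean and unit (population) std deviation.

    Assumes no column is constant, otherwise the division would fail.
    """
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stds))
        for row in rows
    ]


scaled = standardize(sessions)
for row in scaled:
    print(tuple(round(v, 2) for v in row))
```

The scaled rows can then be passed to any K-Means implementation with K set to 2.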
Real Use Cases:-
K-means can typically be applied to data that has a small number of dimensions, is numeric, and is continuous. Think of a scenario in which you want to make groups of similar things from a randomly distributed collection of things; k-means is very suitable for such scenarios.
Here is a list of five interesting use cases for k-means.
1. Document classification:-
Cluster documents into multiple categories based on tags, topics, and the content of the document. This is a very standard classification problem and k-means is a highly suitable algorithm for this purpose. Initial processing is needed to represent each document as a vector, using term frequency to identify commonly used terms that help classify the document. The document vectors are then clustered to identify groups of similar documents.
2. Delivery store optimization:-
Optimize the process of goods delivery using truck drones by combining k-means, to find the optimal number of launch locations, with a genetic algorithm, to solve the truck route as a traveling salesman problem.
3. Insurance fraud detection:-
Machine learning has a critical role to play in fraud detection and has numerous applications in automobile, healthcare, and insurance fraud detection. Utilizing historical data on fraudulent claims, it is possible to isolate new claims based on their proximity to clusters that indicate fraudulent patterns. Since insurance fraud can potentially have a multi-million dollar impact on a company, the ability to detect fraud is crucial.
4. Cyber-profiling criminals:-
Cyber-profiling is the process of collecting data from individuals and groups to identify significant correlations. The idea of cyber-profiling is derived from criminal profiles, which provide information to the investigation division to classify the types of criminals who were at the crime scene.
5. Identifying crime localities:-
With crime data available for specific localities in a city, the category of crime, the area of the crime, and the association between the two can give quality insight into crime-prone areas within a city or a locality.
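As an illustration of the "proximity to clusters" idea from the fraud detection use case in the list above, here is a minimal sketch that assigns a new data point to its nearest centroid. Everything in it is hypothetical: the centroid coordinates, the two scaled features, and the assumption that cluster 1 was dominated by known fraudulent claims.

```python
import math


def nearest_centroid(point, centroids):
    """Return (index, distance) of the centroid closest to point."""
    distances = [math.dist(point, c) for c in centroids]
    index = min(range(len(centroids)), key=distances.__getitem__)
    return index, distances[index]


# Hypothetical centroids obtained by clustering past claims on two scaled features.
centroids = [(0.2, 0.1), (0.9, 0.8)]
fraud_like = {1}  # assumption: cluster 1 mostly contained known fraud cases

new_claim = (0.85, 0.75)
cluster, distance = nearest_centroid(new_claim, centroids)
needs_review = cluster in fraud_like
print(cluster, round(distance, 3), needs_review)
```

Here the new claim lands close to the fraud-like centroid, so it would be flagged for manual review. The same nearest-centroid step applies to the other use cases once clusters have been learned.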
Thank you for reading the article. I hope you liked it!
Keep Learning Keep Sharing :)