K-means Clustering and its real use-case in the Security Domain
Clustering is one of the most common exploratory data analysis techniques used to get an intuition about the structure of the data. It can be defined as the task of identifying subgroups in the data such that data points in the same subgroup (cluster) are very similar while data points in different clusters are very different. In other words, we try to find homogeneous subgroups within the data such that data points in each cluster are as similar as possible according to a similarity measure such as Euclidean-based distance or correlation-based distance. The decision of which similarity measure to use is application-specific.
Clustering analysis can be done on the basis of features, where we try to find subgroups of samples based on features, or on the basis of samples, where we try to find subgroups of features based on samples. We’ll cover clustering based on features here. Clustering is used in market segmentation, where we try to find customers that are similar to each other in terms of behaviors or attributes; in image segmentation and compression, where we try to group similar regions together; in document clustering based on topics; and so on.
Unlike supervised learning, clustering is considered an unsupervised learning method because we don’t have ground-truth labels against which to compare the output of the clustering algorithm and evaluate its performance. We only want to investigate the structure of the data by grouping the data points into distinct subgroups.
Clustering is used to group the data so that the data of interest can be found easily. It classifies similar objects into several different groups and is usually applied in the analysis of statistical data in fields such as machine learning, data mining, pattern recognition, image analysis, and bioinformatics.
Clustering is a form of unsupervised learning. Four clustering algorithms have been compared on performance: K-means, hierarchical clustering, self-organizing map (SOM), and expectation maximization (EM) clustering. The test results suggest that K-means and EM perform better than hierarchical clustering. In general, partitioning algorithms such as K-means and EM are recommended for large datasets, whereas hierarchical clustering performs well on small datasets.
In this post, we will cover only K-means, which is considered one of the most used clustering algorithms due to its simplicity.
The K-means algorithm is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters) where each data point belongs to only one group. It tries to make the intra-cluster data points as similar as possible while also keeping the clusters as different (far apart) as possible. It assigns data points to a cluster such that the sum of the squared distances between the data points and the cluster’s centroid (the arithmetic mean of all the data points that belong to that cluster) is at a minimum. The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster.
The main objective of the K-means algorithm is to minimize the sum of squared distances between the points and their respective cluster centroids.
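The objective described above, the within-cluster sum of squared distances, can be written compactly. The symbols are notational assumptions rather than anything defined in this post: $x_i$ is a data point, $C_k$ is the set of points assigned to cluster $k$, and $\mu_k$ is the centroid of that cluster.

```latex
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
```

A smaller $J$ means the points sit closer to their own centroids, which is the "less variation within clusters" property mentioned earlier.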
The k-means algorithm works as follows:
- Specify the number of clusters K.
- Initialize the centroids by first shuffling the dataset and then randomly selecting K data points for the centroids without replacement.
- Repeat the following steps until the centroids no longer change, i.e. the assignment of data points to clusters stops changing:
  - Compute the squared distance between each data point and every centroid.
  - Assign each data point to the closest cluster (centroid).
  - Recompute the centroid of each cluster by taking the average of all the data points that belong to it.
The approach k-means follows to solve the problem is called Expectation-Maximization.
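As a minimal sketch of the steps listed above, the following plain-Python code runs the assign-and-update loop on a handful of 2-D points. The function names (`kmeans`, `squared_distance`, `mean_point`), the `max_iter` cap, the fixed `seed`, and the sample points are all illustrative assumptions, not part of any particular library. The assignment step plays the role of the expectation step, and the centroid update plays the role of the maximization step.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Arithmetic mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, max_iter=100, seed=0):
    """Return (centroids, labels) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Initialization: pick K data points, without replacement, as centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = mean_point(members)
    return centroids, labels


# Made-up sample data: two visibly separated groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
```

On this toy data the first three points end up in one cluster and the last three in the other. Which cluster gets index 0 depends on the random initialization, so only the grouping is meaningful, not the label numbers.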
Use-Cases in the Security Domain
K-means clustering is used in many domains, and it has many use cases in the security domain. Here is one important topic where K-means clustering is a good approach.
Analyzing Logs from Proxy Server and Captive Portal Using K-Means Clustering Algorithm
Traffic on the World Wide Web is increasing rapidly, and an enormous amount of data is generated by users’ various interactions with websites. Web data has therefore become one of the most valuable resources for information retrieval and knowledge discovery. The study utilized the logs from the Proxy Server and Captive Portal database and used Web Usage Mining to discover useful and interesting patterns in the web data. The k-means clustering algorithm was then used to group user access patterns, specifically by the number of user sessions and the number of websites accessed by the network users. The results showed that, most of the time, users were heavily engaged in using the internet.
1. Operational Framework
The operational framework has five phases: 1) data collection, 2) preprocessing, 3) data transformation, 4) pattern discovery, and 5) pattern analysis.
1.1. Data Collection
Data collection is the first step in the web usage mining process. Proxy servers are employed to improve navigation speed through caching, and they collect data from users accessing large groups of web servers. The web logs from the proxy server contain the actual HTTP requests from multiple clients to multiple web servers.
1.2. Data Preprocessing
The purpose of preprocessing is to transform the unstructured raw data into a set of user profiles. It has three major tasks: 1) data cleaning, 2) user identification, and 3) session identification. Data cleaning is the removal of irrelevant data. User identification determines which user made each session, while session identification is based on the login and logoff activities of the users.
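To illustrate session identification from login and logoff activity, here is a minimal sketch. The event layout (user, ISO timestamp, action), the sample values, and the function name `identify_sessions` are hypothetical; the study's actual log format is not given in this post.

```python
from datetime import datetime

# Hypothetical captive-portal events: (user, ISO timestamp, action).
events = [
    ("alice", "2021-03-01T08:00:00", "login"),
    ("bob", "2021-03-01T08:05:00", "login"),
    ("alice", "2021-03-01T09:30:00", "logoff"),
    ("bob", "2021-03-01T08:45:00", "logoff"),
    ("alice", "2021-03-01T13:00:00", "login"),
    ("alice", "2021-03-01T13:20:00", "logoff"),
]


def identify_sessions(events):
    """Pair each login with the same user's next logoff."""
    open_logins = {}  # user -> login time of the session in progress
    sessions = []     # (user, start, end, duration in minutes)
    # ISO timestamps in one format sort chronologically as strings.
    for user, stamp, action in sorted(events, key=lambda e: e[1]):
        when = datetime.fromisoformat(stamp)
        if action == "login":
            open_logins[user] = when
        elif action == "logoff" and user in open_logins:
            start = open_logins.pop(user)
            minutes = (when - start).total_seconds() / 60
            sessions.append((user, start, when, minutes))
    return sessions


sessions = identify_sessions(events)

# Count sessions per user, one of the features the study clusters on.
sessions_per_user = {}
for user, _start, _end, _minutes in sessions:
    sessions_per_user[user] = sessions_per_user.get(user, 0) + 1
```

A logoff without a matching login is simply ignored here. A real pipeline would have to decide how to treat such incomplete records during data cleaning.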
1.3. Data Transformation
In this stage, data from the user sessions database is extracted and transformed into a comma-separated values (CSV) file. This file contains the dataset needed for discovering session patterns, and it serves as the input from which the MATLAB software generates the clusters using k-means.
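The transformation step can be sketched as follows. The study itself used MATLAB for clustering; this Python snippet only shows how per-user rows could be written out as CSV. The column names (`user`, `sessions`, `websites`), the sample values, and the helper `to_csv` are assumptions made for illustration.

```python
import csv
import io

# Hypothetical per-user summary rows; the column names are illustrative.
user_rows = [
    {"user": "alice", "sessions": 2, "websites": 14},
    {"user": "bob", "sessions": 1, "websites": 3},
]


def to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv(user_rows, ["user", "sessions", "websites"])
```

Writing to an in-memory buffer keeps the example self-contained. To produce a file, the same writer could be pointed at one opened with `open(path, "w", newline="")`.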
1.4. Pattern Discovery
Once all user transactions have been identified, a variety of data mining techniques are applied for pattern discovery in web usage mining, and one of those is clustering. Clustering techniques are widely used in web usage mining (WUM) to capture similar trends and interests among users accessing a website. Clustering aims to divide a dataset into groups, or clusters, such that inter-cluster similarity is minimized while intra-cluster similarity is maximized. The knowledge discovered from clustering can be used to analyze the session patterns of the users. The k-means clustering algorithm is one of the most commonly used methods for partitioning data. It partitions n observations into k clusters in which each observation belongs to the cluster with the nearest mean, using Euclidean distance as the metric. Its main advantages are simplicity and speed, which allow it to run on large datasets. The algorithm is iterative: it repeatedly reassigns each object to its nearest cluster mean and recomputes the means, and it stops once the assignments are stable. It follows the steps described earlier in this post.
1.5. Pattern Analysis
The last phase of web usage mining is pattern analysis, which deals with visualizing and interpreting the unusual patterns for users and filtering out unusable information. Visualization helps an analyst navigate the data, so visualization techniques such as graphing the patterns are used to make the results easier to interpret.
The study analyzed the logs coming from the proxy server and captive portal traffic in the network. Based on the results, it was found that, most of the time, users were heavily engaged in using the internet. The approach can also be used to identify when students and employees are active in browsing the internet and how many websites they access. The study also recommends exploring clustering algorithms other than k-means for identifying and grouping web user patterns.
Example: this approach can be applied to test data from the school management sector.
The idea of cyber profiling is derived from criminal profiling, which gives the investigation division information for classifying the types of criminals who were at the crime scene. More specifically, profiling is based on what is known and not known about the criminal.
A profile is information about an individual or group of individuals that is accumulated, stored, and used for various purposes, for example by monitoring their behavior through their internet activity.
The cyber profiling process can be directed to the benefit of:
1. Identification of users of computers that have been used previously.
2. Mapping the subject's family, social life, work, or network-based organizations, including those for whom he/she worked.
3. Provision of information about the user regarding his/her ability, level of threat, and how vulnerable he/she is to threats.
4. Identification of the suspected abuser.
The way in which Cyber Profiling works :
Other Use-Cases in the Security Domain
Here is a list of some other interesting use cases of K-means in the security domain:
1. Identifying crime localities
When crime data is available for specific localities in a city, the category of crime, the area of the crime, and the association between the two can give quality insight into crime-prone areas within that city or locality.
2. Insurance fraud detection
Machine learning has a critical role to play in fraud detection and has numerous applications in automobile, healthcare, and insurance fraud detection. Using historical data on fraudulent claims, it is possible to isolate new claims based on their proximity to clusters that indicate fraudulent patterns. Since insurance fraud can potentially have a multi-million dollar impact on a company, the ability to detect fraud is crucial.
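The proximity idea can be sketched as a nearest-centroid check. Everything concrete here is assumed for illustration: the two centroids, the choice of two features already scaled to the 0..1 range, the set `fraud_like` marking which clusters were dominated by known fraudulent claims, and the function names.

```python
import math

# Hypothetical centroids from clustering historical claims on two
# features scaled to 0..1 (for example claim amount and claim frequency).
centroids = [(0.2, 0.8), (0.9, 0.1)]

# Assumed: cluster 1 was dominated by claims later confirmed as fraud.
fraud_like = {1}


def nearest_cluster(point, centroids):
    """Return (index, distance) of the centroid closest to the point."""
    distances = [math.dist(point, c) for c in centroids]
    best = min(range(len(centroids)), key=distances.__getitem__)
    return best, distances[best]


def needs_review(claim, centroids, fraud_like):
    """Flag a claim whose nearest cluster is marked as fraud-like."""
    cluster, _distance = nearest_cluster(claim, centroids)
    return cluster in fraud_like


suspicious = needs_review((0.85, 0.15), centroids, fraud_like)
ordinary = needs_review((0.25, 0.70), centroids, fraud_like)
```

Here `suspicious` is `True` and `ordinary` is `False`. A flag like this would only prioritize a claim for human review, since proximity to a cluster is not proof of fraud.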
3. Call record detail analysis
A call detail record (CDR) is the information captured by telecom companies during the call, SMS, and internet activity of a customer. This information provides greater insight into the customer’s needs when used with customer demographics. We can cluster customer activity over 24 hours using the unsupervised k-means clustering algorithm to understand customer segments with respect to their usage by hour.
4. Automatic clustering of IT alerts
Components of large enterprise IT infrastructure, such as network, storage, or database systems, generate large volumes of alert messages. Because alert messages potentially point to operational issues, they must be manually screened and prioritized for downstream processes. Clustering the data can provide insight into categories of alerts and mean time to repair, and can help in failure prediction.
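Before customers can be clustered by hourly usage, each customer needs a 24-element activity vector. The sketch below builds one; the record layout (customer id, ISO timestamp), the sample rows, and the name `hourly_profiles` are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical CDR rows: (customer id, ISO timestamp of the activity).
records = [
    ("c1", "2021-03-01T09:15:00"),
    ("c1", "2021-03-01T09:40:00"),
    ("c1", "2021-03-02T21:05:00"),
    ("c2", "2021-03-01T23:30:00"),
]


def hourly_profiles(records):
    """Map each customer to a 24-element list of activity counts by hour."""
    profiles = defaultdict(lambda: [0] * 24)
    for customer, stamp in records:
        hour = datetime.fromisoformat(stamp).hour
        profiles[customer][hour] += 1
    return dict(profiles)


profiles = hourly_profiles(records)
```

Each value in `profiles` is a point in 24-dimensional space, so the resulting vectors can be passed to any k-means implementation as they are.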
5. Rideshare data analysis
The publicly available Uber ride information dataset provides a large amount of valuable data around traffic, transit time, peak pickup localities, and more. Analyzing this data is useful not just in the context of Uber but also in providing insight into urban traffic patterns and helping us plan for the cities of the future.
6. Crime document classification
Documents are clustered into multiple categories based on tags, topics, and content. This is a very standard classification problem, and k-means is a highly suitable algorithm for this purpose. Initial processing is needed to represent each document as a vector, using term frequency to identify commonly used terms that help classify the document. The document vectors are then clustered to help identify similarity within document groups.
These are only a few use cases, and the list goes on. In the security domain or any other, K-means is an effective and easy way to do clustering in machine learning.
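The vectorization step can be sketched with raw term counts. The two sample documents, the simple letters-only tokenizer, and the function names are assumptions made for illustration; a real pipeline would typically also remove stop words and weight the terms.

```python
import re
from collections import Counter

# Made-up sample documents.
docs = [
    "Burglary reported at a warehouse, burglary suspect fled",
    "Phishing email led to account fraud",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_frequency_vectors(docs):
    """Return (vocabulary, vectors) with one count vector per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    # Shared, sorted vocabulary so every vector has the same layout.
    vocabulary = sorted(set().union(*counts))
    vectors = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, vectors


vocabulary, vectors = term_frequency_vectors(docs)
```

Every document becomes a vector of the same length, so the vectors can be clustered directly with k-means.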