Clustering in Machine Learning
What is Clustering?
- Clustering is the process of dividing a population or dataset into groups
- such that data points in the same group are more similar to each other than to data points in other groups.
- Clustering is important because it determines the intrinsic grouping among the unlabeled data.
- It is used as a method to find useful patterns, useful features, and commonalities in a sample.
- Common applications include: statistical data analysis, social network analysis, etc.
Example:
- Clustering is used by Amazon in its recommendation system to provide recommendations based on a user's past product searches.
- Netflix also uses this technique to recommend movies and series to its users based on their viewing history.
Types of Clustering
- Partitioning Clustering (e.g. K-means): Separates data into distinct groups based on their centroids.
- Density-Based Clustering (e.g. DBSCAN): Identifies clusters by recognizing areas of high data density and separating them from sparser regions.
- Distribution Model-Based Clustering (e.g. Expectation-Maximization with GMM): Divides data based on the likelihood of belonging to a specific distribution, often assuming shapes like Gaussian curves.
- Fuzzy Clustering (e.g. Fuzzy C-means): A soft method where a data point can belong to multiple clusters with varying degrees of membership.
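As a rough, side-by-side sketch of how these approaches are invoked in practice, the example below runs K-means, DBSCAN, and a Gaussian mixture model on the same toy data with scikit-learn; the data and parameter values (eps, min_samples, etc.) are arbitrary illustrative choices, and fuzzy c-means is omitted because scikit-learn does not provide it.
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.mixture import GaussianMixture

# Toy 2-D data: two loose blobs (values chosen only for illustration)
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(4, 0.5, (20, 2))])

# Partitioning clustering: assigns every point to one of K centroids
km_labels = KMeans(n_clusters=2, n_init=10).fit_predict(data)

# Density-based clustering: groups dense regions, marks outliers as -1
db_labels = DBSCAN(eps=0.8, min_samples=3).fit_predict(data)

# Distribution-based clustering: fits K Gaussian components via EM
gm_labels = GaussianMixture(n_components=2).fit_predict(data)

print(km_labels, db_labels, gm_labels, sep="\n")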
K-means Clustering
- K-means is a clustering algorithm that groups an unlabeled dataset into different clusters.
- Here K defines the number of clusters to be created in the process: if K = 2 there will be two clusters,
- if K = 3 there will be three clusters, and so on.
- The k-means algorithm partitions the given data into k clusters.
- Each cluster has a cluster center called centroid. k is specified by the user.
- It is a center-based algorithm where each cluster is associated with a center.
- The main goal of this algorithm is to minimize the sum of distances between the data points and the centroids of their corresponding clusters.
The k-means algorithm mainly performs two tasks:
- Determines the best positions for the K centroids through an iterative process.
- Assigns each data point to its nearest centroid.
- The data points that are close to a particular centroid form a cluster.
- Hence each cluster contains data points that are similar to one another and different from the data points of other clusters.
K-means algorithm
- Step-1: Select the number K to decide how many clusters are needed.
- Step-2: Select K random points as the initial centroids. (These need not be points from the input dataset.)
- Step-3: Assign each data point to its nearest centroid, which forms the K predefined clusters.
- Step-4: Calculate the variance and place a new centroid in each cluster, i.e. move each centroid to the mean of the points assigned to it.
- Step-5: Repeat step 3, reassigning each data point to the nearest of the new centroids.
- Step-6: If any point changed cluster, go back to step 4; otherwise the algorithm has converged.
- Step-7: The model is ready.
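The following is a minimal from-scratch sketch of steps 1-7 using NumPy only; the toy points, K = 2, the random seed, and the iteration cap are arbitrary illustrative choices. The scikit-learn example that follows performs the same clustering with a library call.
import numpy as np

points = np.array([[4, 21], [5, 19], [10, 24], [4, 17], [3, 16],
                   [11, 25], [14, 24], [6, 22], [10, 21], [12, 21]], dtype=float)
k = 2

# Step 2: pick K random points as the initial centroids
rng = np.random.default_rng(0)
centroids = points[rng.choice(len(points), size=k, replace=False)]

for _ in range(100):  # iteration cap, arbitrary
    # Steps 3/5: assign each point to its nearest centroid
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 4: move each centroid to the mean of its assigned points
    # (assumes no cluster becomes empty)
    new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    # Step 6: stop when no centroid moves any more
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(labels, centroids)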
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Toy data: ten points in two dimensions
x = [4, 5, 10, 4, 3, 11, 14, 6, 10, 12]
y = [21, 19, 24, 17, 16, 25, 24, 22, 21, 21]

# Visualise the raw points
plt.scatter(x, y)
plt.show()

# Combine the two lists into (x, y) pairs
data = list(zip(x, y))
print(data)

# Fit K-means with K = 2 and colour the points by cluster label
kmeans = KMeans(n_clusters=2)
kmeans.fit(data)
plt.scatter(x, y, c=kmeans.labels_)
plt.show()
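K must be supplied by the user. One common heuristic for choosing it, not described above, is the elbow method: fit K-means for several values of K and look for the point where the inertia (within-cluster sum of squared distances) stops dropping sharply. The sketch below continues from the data, KMeans, and plt names defined in the example above.
# Elbow method: plot inertia for K = 1..6 and look for the "elbow"
inertias = []
for k in range(1, 7):
    inertias.append(KMeans(n_clusters=k, n_init=10).fit(data).inertia_)

plt.plot(range(1, 7), inertias, marker='o')
plt.xlabel('number of clusters K')
plt.ylabel('inertia')
plt.show()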
Hierarchical Clustering
- Hierarchical clustering, also known as hierarchical cluster analysis or HCA, is another unsupervised machine learning algorithm used to group unlabeled datasets into clusters.
- In this algorithm, we build a hierarchy of clusters in the form of a tree; this tree-shaped structure is known as a dendrogram.
- Sometimes the results of K-means clustering and hierarchical clustering may look similar, but the two algorithms work differently.
- An advantage of hierarchical clustering is that there is no need to predetermine the number of clusters, as we do in the K-means algorithm.
Methods of Hierarchical clustering
There are two methods of hierarchical clustering:
- Agglomerative: A bottom-up approach in which the algorithm starts by treating each data point as a separate cluster and then merges the closest clusters until only one cluster remains.
- Divisive: The opposite of the agglomerative approach, it is top-down: it starts with all data points in a single cluster and splits it repeatedly.
Agglomerative Hierarchical clustering
- Step 1: Treat each data point as a single cluster. If there are N data points, there will be N clusters.
- Step 2: Take the two closest data points or clusters and merge them into a single cluster, leaving N-1 clusters.
- Step 3: Again take the two closest clusters and merge them into a single cluster, leaving N-2 clusters.
- Step 4: Repeat step 3 until only one cluster remains.
- Step 5: Once all the points have been combined into one big cluster, develop the dendrogram and cut it to divide the clusters as the problem requires (a SciPy sketch of a dendrogram follows this list).
- The way the distance between two clusters is measured is important for hierarchical clustering.
- There are many ways to measure the distance between two clusters, such as the Euclidean distance, and the chosen measure determines the merging rule.
- These measures are called linkage methods.
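The dendrogram mentioned in step 5 can be drawn directly with SciPy, as in the sketch below; it reuses the same toy points as the scikit-learn example that follows, and method='ward' is just one possible linkage choice.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

x = [4, 5, 10, 4, 3, 11, 14, 6, 10, 12]
y = [21, 19, 24, 17, 16, 25, 24, 22, 21, 21]
data = np.array(list(zip(x, y)))

# Build the merge hierarchy with Ward linkage and plot it as a dendrogram
Z = linkage(data, method='ward')
dendrogram(Z)
plt.show()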
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering

# Same toy data as the K-means example
x = [4, 5, 10, 4, 3, 11, 14, 6, 10, 12]
y = [21, 19, 24, 17, 16, 25, 24, 22, 21, 21]

plt.scatter(x, y)
plt.show()

data = list(zip(x, y))

# Agglomerative clustering with two clusters; Ward linkage always uses Euclidean distance
hierarchical_cluster = AgglomerativeClustering(n_clusters=2, linkage='ward')
labels = hierarchical_cluster.fit_predict(data)

plt.scatter(x, y, c=labels)
plt.show()
Kohonen Self-Organizing Maps
- The concept of the self-organizing map (SOM) was first proposed by Kohonen.
- It is a dimensionality-reduction method: an unsupervised neural network trained with unsupervised learning to produce a discrete, low-dimensional representation of the input space of the training samples.
- This representation is called a map.
- SOM is therefore used to project high-dimensional objects onto a lower-dimensional grid (dimensionality reduction) while preserving their neighborhood relationships.
- The output layer and the input layer are the two layers that make up the SOM, which is why it is also called a Kohonen map.
- The main advantage of using a SOM is that the resulting map is easy to read and interpret.
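Below is a very small from-scratch sketch of a SOM training loop in NumPy; the grid size, learning rate, neighborhood radius, decay factors, and random input data are all arbitrary illustrative choices rather than part of any standard implementation.
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((200, 3))          # 200 random 3-D input vectors (arbitrary)

grid_h, grid_w = 5, 5                # 5x5 output grid (the "map")
weights = rng.random((grid_h, grid_w, 3))

# Grid coordinates of every output neuron, used for the neighborhood function
coords = np.stack(np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing='ij'), axis=-1)

lr, sigma = 0.5, 1.5                 # learning rate and neighborhood radius
for epoch in range(20):
    for x in data:
        # Find the best-matching unit (BMU): the neuron whose weights are closest to x
        dists = np.linalg.norm(weights - x, axis=2)
        bmu = np.unravel_index(dists.argmin(), dists.shape)
        # Neighborhood factor: neurons close to the BMU on the grid move more
        grid_dist = np.linalg.norm(coords - np.array(bmu), axis=2)
        h = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))
        # Pull the weights of the BMU and its neighbors toward the input vector
        weights += lr * h[..., None] * (x - weights)
    lr *= 0.9                        # decay the learning rate each epoch
    sigma *= 0.9                     # shrink the neighborhood each epoch

print(weights.shape)                 # (5, 5, 3): each grid cell holds a prototype vector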
Feature selection and Dimensionality reduction
- The number of input variables or features of the dataset is called dimensionality.
- A large number of input features makes the modeling task much harder; this is often referred to as the curse of dimensionality.
- High-dimensional data can also lead to overfitting,
- where the model fits the training data too closely and generalizes poorly to new data.
- It is therefore often desirable to reduce the number of input features.
- Reducing the number of input features reduces the dimensionality of the feature space,
- hence the term "dimensionality reduction."
- Dimensionality reduction is a data preparation/preprocessing technique applied before modeling.
- It can be done after data cleaning and data scaling and before training the prediction model.
- There are two main approaches to dimensionality reduction: feature selection and feature extraction.
Feature selection
- Feature selection selects a subset of the original features that are relevant to the problem at hand.
- The goal is to reduce the size of the dataset while preserving the most important features.
- There are many feature selection methods, including filter methods, wrapper methods, and embedded methods.
- Filter methods rank features according to their relationship with the target variable (see the sketch after this list).
- Wrapper methods use the performance of a model as the criterion for selecting features,
- and embedded methods perform feature selection as part of training the model itself.
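As a small illustration of a filter method, the sketch below keeps the two features most strongly related to the target using scikit-learn's SelectKBest with the ANOVA F-test; the Iris dataset and k = 2 are arbitrary choices.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)     # 4 input features, 3 classes

# Filter method: score every feature against the target, keep the best k
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)               # F-scores of the four original features
print(X_selected.shape)               # (150, 2): only two features remain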
Feature Extraction
- Feature extraction creates new features by combining or transforming the original features.
- The aim is to create a set of features that captures the essence of the original data in a lower-dimensional space.
- There are many methods for feature extraction, including principal component analysis (PCA) and linear discriminant analysis (LDA); an LDA sketch follows below.
- Note: Feature selection and feature extraction are both methods used to reduce the number of features.
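A supervised feature-extraction step with LDA might look like the sketch below, again using the Iris data purely as an example; with three classes, at most two discriminant components can be extracted.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# LDA builds new features (linear combinations of the originals)
# that maximize the separation between the class labels
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

print(X_lda.shape)                    # (150, 2): two extracted features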
Principal Component Analysis
- Principal component analysis is an unsupervised learning algorithm used for dimensionality reduction in machine learning.
- Dimensionality reduction converts a large set of features into a smaller one.
- The method was proposed by Karl Pearson.
- It works on the principle that when data in a high-dimensional space is mapped to a lower-dimensional space,
- the variance of the data in the lower-dimensional space should be as large as possible.
- It is a statistical technique that uses an orthogonal transformation to convert a set of correlated features into a set of linearly uncorrelated features.
- These transformed features are called the principal components.
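A typical use of PCA in scikit-learn looks like the sketch below, reducing the four Iris features to two principal components; the dataset is chosen only for illustration.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Project the 4-D data onto the two directions of largest variance
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

print(X_pca.shape)                    # (150, 2)
print(pca.explained_variance_ratio_)  # share of the variance kept by each component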
Some terms used in the PCA algorithm
- Dimension: It is the number of features or variables in the given data. More simply, it is the number of columns in the dataset.
- Correlation: It shows how strongly two variables are related to each other.
- For example, when one variable changes, the other variable also changes. Correlation ranges from -1 to +1.
- Orthogonal: It means the variables are uncorrelated with each other, so the correlation between a pair of orthogonal variables is zero.
- Eigenvector: Given a square matrix M and a non-zero vector v, if Mv is a scalar multiple of v, then v is an eigenvector of M.
- Covariance Matrix: A matrix containing the covariance between every pair of variables; PCA computes the eigenvectors of this matrix (see the sketch after this list).
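The terms above can be tied together in a few lines of NumPy: center the data, compute the covariance matrix, take its eigenvectors, and project onto the directions with the largest eigenvalues. This is, in essence, what PCA does; the random data below is only for illustration.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # 100 samples, 3 features (arbitrary data)

# Center the data and compute the covariance matrix of the features
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)         # 3x3 covariance matrix

# Eigen-decomposition: eigenvectors are the principal directions,
# eigenvalues measure the variance along each direction
eigvals, eigvecs = np.linalg.eigh(cov)

# Sort directions by decreasing variance and keep the top two
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order[:2]]

# Project the centered data onto the two principal components
X_reduced = Xc @ components
print(X_reduced.shape)                 # (100, 2)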