Clustering in Machine Learning
What is Clustering?
- Clustering is the process of dividing a population or dataset into groups
- such that data points in the same group are more similar to each other than to data points in other groups.
- Clustering is important because it determines the intrinsic grouping among the unlabeled data.
- It is used as a method to find useful patterns, useful features, and commonalities in a sample.
- Common applications include: statistical data analysis, social network analysis, etc.
Example:
- Clustering is used by Amazon in its recommendation system to provide recommendations based on a user's past product searches.
- Netflix also uses this technique to recommend movies and series to its users based on their viewing history.
Types of Clustering
- Partitioning Clustering (e.g. K-means): Separates data into distinct groups based on their centroids.
- Density-Based Clustering (e.g. DBSCAN): Identifies clusters by recognizing areas of high data density and separating them from sparser regions.
- Distribution Model-Based Clustering (e.g. Expectation-Maximization with GMM): Divides data based on the likelihood of belonging to a specific distribution, often assuming shapes like Gaussian curves.
- Fuzzy Clustering (e.g. Fuzzy C-means): A soft method where a data point can belong to multiple clusters with varying degrees of membership.
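As a rough, side-by-side sketch of how these approaches are invoked in practice, the example below runs K-means, DBSCAN, and a Gaussian mixture model on the same toy data with scikit-learn; the data and parameter values (eps, min_samples, etc.) are arbitrary illustrative choices, and fuzzy c-means is omitted because scikit-learn does not provide it.
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.mixture import GaussianMixture

# Toy 2-D data: two loose blobs (values chosen only for illustration)
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(4, 0.5, (20, 2))])

# Partitioning clustering: assigns every point to one of K centroids
km_labels = KMeans(n_clusters=2, n_init=10).fit_predict(data)

# Density-based clustering: groups dense regions, marks outliers as -1
db_labels = DBSCAN(eps=0.8, min_samples=3).fit_predict(data)

# Distribution-based clustering: fits K Gaussian components via EM
gm_labels = GaussianMixture(n_components=2).fit_predict(data)

print(km_labels, db_labels, gm_labels, sep="\n")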
K-means Clustering
- K-means is a clustering algorithm that groups an unlabeled dataset into different clusters.
- Here K defines the number of clusters to be created in the process: if K = 2 there will be two clusters,
- if K = 3 there will be three clusters, and so on.
- The k-means algorithm partitions the given data into k clusters.
- Each cluster has a cluster center called centroid. k is specified by the user.
- It is a center-based algorithm where each cluster is associated with a center.
- The main goal of this algorithm is to minimize the sum of distances between the data points and the centroids of their corresponding clusters.
The k-means algorithm mainly performs two tasks:
- Determines the best positions for the K centroids through an iterative process.
- Assigns each data point to its nearest centroid.
- The data points that are close to a particular centroid form a cluster.
- Hence each cluster contains data points that are similar to one another and different from the data points of other clusters.
K-means algorithm
- Step-1: Select the number K to decide how many clusters are needed.
- Step-2: Select K random points as the initial centroids. (These need not be points from the input dataset.)
- Step-3: Assign each data point to its nearest centroid, which forms the K predefined clusters.
- Step-4: Calculate the variance and place a new centroid in each cluster, i.e. move each centroid to the mean of the points assigned to it.
- Step-5: Repeat step 3, reassigning each data point to the nearest of the new centroids.
- Step-6: If any point changed cluster, go back to step 4; otherwise the algorithm has converged.
- Step-7: The model is ready.
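The following is a minimal from-scratch sketch of steps 1-7 using NumPy only; the toy points, K = 2, the random seed, and the iteration cap are arbitrary illustrative choices. The scikit-learn example that follows performs the same clustering with a library call.
import numpy as np

points = np.array([[4, 21], [5, 19], [10, 24], [4, 17], [3, 16],
                   [11, 25], [14, 24], [6, 22], [10, 21], [12, 21]], dtype=float)
k = 2

# Step 2: pick K random points as the initial centroids
rng = np.random.default_rng(0)
centroids = points[rng.choice(len(points), size=k, replace=False)]

for _ in range(100):  # iteration cap, arbitrary
    # Steps 3/5: assign each point to its nearest centroid
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 4: move each centroid to the mean of its assigned points
    # (assumes no cluster becomes empty)
    new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    # Step 6: stop when no centroid moves any more
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(labels, centroids)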
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Toy data: ten points in two dimensions
x = [4, 5, 10, 4, 3, 11, 14, 6, 10, 12]
y = [21, 19, 24, 17, 16, 25, 24, 22, 21, 21]

# Visualise the raw points
plt.scatter(x, y)
plt.show()

# Combine the two lists into (x, y) pairs
data = list(zip(x, y))
print(data)

# Fit K-means with K = 2 and colour the points by cluster label
kmeans = KMeans(n_clusters=2)
kmeans.fit(data)
plt.scatter(x, y, c=kmeans.labels_)
plt.show()
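K must be supplied by the user. One common heuristic for choosing it, not described above, is the elbow method: fit K-means for several values of K and look for the point where the inertia (within-cluster sum of squared distances) stops dropping sharply. The sketch below continues from the data, KMeans, and plt names defined in the example above.
# Elbow method: plot inertia for K = 1..6 and look for the "elbow"
inertias = []
for k in range(1, 7):
    inertias.append(KMeans(n_clusters=k, n_init=10).fit(data).inertia_)

plt.plot(range(1, 7), inertias, marker='o')
plt.xlabel('number of clusters K')
plt.ylabel('inertia')
plt.show()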
Hierarchical Clustering
- Hierarchical clustering, also known as hierarchical cluster analysis or HCA, is another unsupervised machine learning algorithm used to group unlabeled datasets into clusters.
- In this algorithm, we build a hierarchy of clusters in the form of a tree; this tree-shaped structure is known as a dendrogram.
- Sometimes the results of K-means clustering and hierarchical clustering may look similar, but the two algorithms work differently.
- An advantage of hierarchical clustering is that there is no need to predetermine the number of clusters, as we do in the K-means algorithm.
Methods of Hierarchical clustering
There are two methods of hierarchical clustering:
- Agglomerative: A bottom-up approach in which the algorithm starts by treating each data point as a separate cluster and then merges the closest clusters until only one cluster remains.
- Divisive: The opposite of the agglomerative approach, it is top-down: it starts with all data points in a single cluster and splits it repeatedly.
Agglomerative Hierarchical clustering
- Step 1: Treat each data point as a single cluster. If there are N data points, there will be N clusters.
- Step 2: Take the two closest data points or clusters and merge them into a single cluster, leaving N-1 clusters.
- Step 3: Again take the two closest clusters and merge them into a single cluster, leaving N-2 clusters.
- Step 4: Repeat step 3 until only one cluster remains.
- Step 5: Once all the points have been combined into one big cluster, develop the dendrogram and cut it to divide the clusters as the problem requires (a SciPy sketch of a dendrogram follows this list).
- The way the distance between two clusters is measured is important for hierarchical clustering.
- There are many ways to measure the distance between two clusters, such as the Euclidean distance, and the chosen measure determines the merging rule.
- These measures are called linkage methods.
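The dendrogram mentioned in step 5 can be drawn directly with SciPy, as in the sketch below; it reuses the same toy points as the scikit-learn example that follows, and method='ward' is just one possible linkage choice.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

x = [4, 5, 10, 4, 3, 11, 14, 6, 10, 12]
y = [21, 19, 24, 17, 16, 25, 24, 22, 21, 21]
data = np.array(list(zip(x, y)))

# Build the merge hierarchy with Ward linkage and plot it as a dendrogram
Z = linkage(data, method='ward')
dendrogram(Z)
plt.show()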
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering

# Same toy data as the K-means example
x = [4, 5, 10, 4, 3, 11, 14, 6, 10, 12]
y = [21, 19, 24, 17, 16, 25, 24, 22, 21, 21]

plt.scatter(x, y)
plt.show()

data = list(zip(x, y))

# Agglomerative clustering with two clusters; Ward linkage always uses Euclidean distance
hierarchical_cluster = AgglomerativeClustering(n_clusters=2, linkage='ward')
labels = hierarchical_cluster.fit_predict(data)

plt.scatter(x, y, c=labels)
plt.show()
Kohonen Self-Organizing Maps
- The concept of the self-organizing map (SOM) was first proposed by Kohonen.
- It is a dimensionality-reduction method: an unsupervised neural network trained with unsupervised learning to produce a discrete, low-dimensional representation of the input space of the training samples.
- This representation is called a map.
- SOM is therefore used to project high-dimensional objects onto a lower-dimensional grid (dimensionality reduction) while preserving their neighborhood relationships.
- The output layer and the input layer are the two layers that make up the SOM, which is why it is also called a Kohonen map.
- The main advantage of using a SOM is that the resulting map is easy to read and interpret.
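Below is a very small from-scratch sketch of a SOM training loop in NumPy; the grid size, learning rate, neighborhood radius, decay factors, and random input data are all arbitrary illustrative choices rather than part of any standard implementation.
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((200, 3))          # 200 random 3-D input vectors (arbitrary)

grid_h, grid_w = 5, 5                # 5x5 output grid (the "map")
weights = rng.random((grid_h, grid_w, 3))

# Grid coordinates of every output neuron, used for the neighborhood function
coords = np.stack(np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing='ij'), axis=-1)

lr, sigma = 0.5, 1.5                 # learning rate and neighborhood radius
for epoch in range(20):
    for x in data:
        # Find the best-matching unit (BMU): the neuron whose weights are closest to x
        dists = np.linalg.norm(weights - x, axis=2)
        bmu = np.unravel_index(dists.argmin(), dists.shape)
        # Neighborhood factor: neurons close to the BMU on the grid move more
        grid_dist = np.linalg.norm(coords - np.array(bmu), axis=2)
        h = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))
        # Pull the weights of the BMU and its neighbors toward the input vector
        weights += lr * h[..., None] * (x - weights)
    lr *= 0.9                        # decay the learning rate each epoch
    sigma *= 0.9                     # shrink the neighborhood each epoch

print(weights.shape)                 # (5, 5, 3): each grid cell holds a prototype vector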
Feature selection and Dimensionality reduction
- The number of input variables or features of the dataset is called dimensionality.
- A large number of input features makes the modeling task much harder; this is often referred to as the curse of dimensionality.
- High-dimensional data can also lead to overfitting,
- where the model fits the training data too closely and generalizes poorly to new data.
- It is therefore often desirable to reduce the number of input features.
- Reducing the number of input features reduces the dimensionality of the feature space,
- hence the term "dimensionality reduction."
- Dimensionality reduction is a data preparation/preprocessing technique applied before modeling.
- It can be done after data cleaning and data scaling and before training the prediction model.
- There are two main approaches to dimensionality reduction: feature selection and feature extraction.
Feature selection
- Feature selection selects a subset of the original features that are relevant to the problem at hand.
- The goal is to reduce the size of the dataset while preserving the most important features.
- There are many feature selection methods, including filter methods, wrapper methods, and embedded methods.
- Filter methods rank features according to their relationship with the target variable (see the sketch after this list).
- Wrapper methods use the performance of a model as the criterion for selecting features,
- and embedded methods perform feature selection as part of training the model itself.
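As a small illustration of a filter method, the sketch below keeps the two features most strongly related to the target using scikit-learn's SelectKBest with the ANOVA F-test; the Iris dataset and k = 2 are arbitrary choices.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)     # 4 input features, 3 classes

# Filter method: score every feature against the target, keep the best k
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)               # F-scores of the four original features
print(X_selected.shape)               # (150, 2): only two features remain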
Feature Extraction
- Feature extraction creates new features by combining or transforming the original features.
- The aim is to create a set of features that captures the essence of the original data in a lower-dimensional space.
- There are many methods for feature extraction, including principal component analysis (PCA) and linear discriminant analysis (LDA); an LDA sketch follows below.
- Note: Feature selection and feature extraction are both methods used to reduce the number of features.
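A supervised feature-extraction step with LDA might look like the sketch below, again using the Iris data purely as an example; with three classes, at most two discriminant components can be extracted.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# LDA builds new features (linear combinations of the originals)
# that maximize the separation between the class labels
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

print(X_lda.shape)                    # (150, 2): two extracted features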
Principal Component Analysis
- Principal component analysis is an unsupervised learning algorithm used for dimensionality reduction in machine learning.
- Dimensionality reduction converts a large set of features into a smaller one.
- The method was proposed by Karl Pearson.
- It works on the principle that when data in a high-dimensional space is mapped to a lower-dimensional space,
- the variance of the data in the lower-dimensional space should be as large as possible.
- It is a statistical technique that uses an orthogonal transformation to convert a set of correlated features into a set of linearly uncorrelated features.
- These transformed features are called the principal components.
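A typical use of PCA in scikit-learn looks like the sketch below, reducing the four Iris features to two principal components; the dataset is chosen only for illustration.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Project the 4-D data onto the two directions of largest variance
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

print(X_pca.shape)                    # (150, 2)
print(pca.explained_variance_ratio_)  # share of the variance kept by each component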
Some terms used in the PCA algorithm
- Dimension: It is the number of features or variables in the given data. More simply, it is the number of columns in the dataset.
- Correlation: It shows how strongly two variables are related to each other.
- For example, when one variable changes, the other variable also changes. Correlation ranges from -1 to +1.
- Orthogonal: It means the variables are uncorrelated with each other, so the correlation between a pair of orthogonal variables is zero.
- Eigenvector: Given a square matrix M and a non-zero vector v, if Mv is a scalar multiple of v, then v is an eigenvector of M.
- Covariance Matrix: A matrix containing the covariance between every pair of variables; PCA computes the eigenvectors of this matrix (see the sketch after this list).
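The terms above can be tied together in a few lines of NumPy: center the data, compute the covariance matrix, take its eigenvectors, and project onto the directions with the largest eigenvalues. This is, in essence, what PCA does; the random data below is only for illustration.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # 100 samples, 3 features (arbitrary data)

# Center the data and compute the covariance matrix of the features
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)         # 3x3 covariance matrix

# Eigen-decomposition: eigenvectors are the principal directions,
# eigenvalues measure the variance along each direction
eigvals, eigvecs = np.linalg.eigh(cov)

# Sort directions by decreasing variance and keep the top two
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order[:2]]

# Project the centered data onto the two principal components
X_reduced = Xc @ components
print(X_reduced.shape)                 # (100, 2)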