Clustering in Machine learning

What is Clustering?

  • Clustering is the process of dividing a population or set of data points into groups such that points in the same group are more similar to each other than to points in other groups.
  • Clustering is important because it reveals the intrinsic grouping within unlabeled data.
  • It is used to find useful patterns, features and commonalities in a sample.
  • Common applications include statistical data analysis, social network analysis, etc.


  • Amazon uses clustering in its recommendation system to suggest products based on a user's past searches.
  • Netflix also uses this technique to recommend movies and web series based on users' viewing history.

Types of Clustering

  • Partitioning Clustering (e.g. K-means): Separates data into distinct groups based on their centroids.
  • Density-Based Clustering (e.g. DBSCAN): Identifies clusters by recognizing areas of high data density and separating them from sparser regions.
  • Distribution Model-Based Clustering (e.g. Expectation-Maximization with GMM): Divides data based on the likelihood of belonging to a specific distribution, often assuming shapes like Gaussian curves.
  • Fuzzy Clustering (e.g. Fuzzy C-means): Soft method where a data point can belong to multiple clusters with varying degrees of membership.

K-means Clustering

  • K-means is a clustering algorithm that groups an unlabeled dataset into different clusters.
  • Here K defines the number of clusters to be created: if K = 2 there will be two clusters, if K = 3 there will be three clusters, and so on.
  • The k-means algorithm partitions the given data into k clusters.
  • Each cluster has a cluster center called the centroid. k is specified by the user.
  • It is a centroid-based algorithm where each cluster is associated with a centroid.
  • The main goal of this algorithm is to minimize the sum of distances between the data points and the centroids of their clusters.
The k-means clustering algorithm mainly performs two tasks:
  • It determines the best positions for the K centroids by an iterative process.
  • It assigns each data point to its nearest centroid.
  • The data points that are closest to a given centroid form a cluster.
  • So each cluster contains data points that share some similarities and is separated from the other clusters.
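The goal stated above is commonly written as a within-cluster sum of squares. The formula below is a sketch under that assumption: the symbols C_j (the set of points in cluster j), \mu_j (its centroid) and x_i (a data point) are notation introduced here, not taken from these notes, and squared Euclidean distance is assumed as the distance measure.

```latex
J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i
```

Assigning each point to its nearest centroid and then moving each centroid to the mean of its points can never increase J, which is why the iteration settles.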

K-means algorithm

  • Step 1: Choose the number K to decide the number of clusters.
  • Step 2: Select K random points as initial centroids. (They need not be points from the input dataset.)
  • Step 3: Assign each data point to its nearest centroid, which forms the K clusters.
  • Step 4: Calculate the variance and place a new centroid at the center of each cluster.
  • Step 5: Repeat step 3, that is, reassign each data point to the nearest of the new centroids.
  • Step 6: If any reassignment occurred, go to step 4; otherwise, finish.
  • Step 7: The model is ready.
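import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

x = [4, 5, 10, 4, 3, 11, 14, 6, 10, 12]
y = [21, 19, 24, 17, 16, 25, 24, 22, 21, 21]
plt.scatter(x, y)
plt.show()

data = list(zip(x, y))  # pair the coordinates into (x, y) points
# Fit K-means with two clusters, then color each point by its cluster label
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(data)
plt.scatter(x, y, c=kmeans.labels_)
plt.show()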
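
To make the listed steps concrete, here is a minimal from-scratch sketch in NumPy that reuses the same ten points. It is an illustration, not how scikit-learn implements K-means: the function name kmeans, the parameters n_iters and seed, and the choice to draw the initial centroids from the data points are all assumptions made for this example.

```python
import numpy as np

def kmeans(points, k, n_iters=100, seed=0):
    """Plain K-means on an (n, d) float array; returns labels and centroids."""
    rng = np.random.default_rng(seed)
    # Step 2: pick K distinct data points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.full(len(points), -1)
    for _ in range(n_iters):
        # Step 3: assign every point to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 6: stop once no point changes cluster
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its assigned points
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

x = [4, 5, 10, 4, 3, 11, 14, 6, 10, 12]
y = [21, 19, 24, 17, 16, 25, 24, 22, 21, 21]
data = np.array(list(zip(x, y)), dtype=float)

labels, centroids = kmeans(data, k=2)
print(labels)     # cluster index (0 or 1) for each point
print(centroids)  # one (x, y) centroid per cluster
```

The result depends on the random starting centroids, which is why library implementations usually rerun the algorithm from several starts and keep the best run.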

Hierarchical Clustering

  • Hierarchical clustering, also known as hierarchical cluster analysis or HCA, is another unsupervised machine learning algorithm used to group unlabeled datasets into clusters.
  • In this algorithm, we build a hierarchy of clusters in the form of a tree, called a dendrogram.
  • Sometimes the results of K-means clustering and hierarchical clustering look similar, but the two algorithms work differently.
  • An advantage of hierarchical clustering is that there is no need to predetermine the number of clusters as we do in the K-means algorithm.

Methods of Hierarchical clustering

There are two methods of hierarchical clustering:
  • Agglomerative: A bottom-up approach where the algorithm first treats each data point as its own cluster and keeps merging clusters until only one remains.
  • Divisive: The opposite of the agglomerative algorithm. It is a top-down approach that starts with all data points in one cluster and splits it repeatedly.

Agglomerative Hierarchical clustering

  • Step 1: Treat each data point as its own cluster. If there are N points, the number of clusters is also N.
  • Step 2: Take the two closest points or clusters and merge them into a single cluster. There are now N-1 clusters.
  • Step 3: Again, take the two closest clusters and merge them into a single cluster. There are now N-2 clusters.
  • Step 4: Repeat step 3 until only one cluster remains.
  • Step 5: Once all the clusters are combined into one large cluster, build the dendrogram and cut it to divide the clusters as the problem requires.
  • The distance between two clusters is crucial for hierarchical clustering.
  • There are several ways to measure the distance between two clusters, such as single, complete, average and Ward linkage, and the chosen measure determines the rule for merging.
  • These measures are called linkage methods.
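
As an illustration of how cluster-to-cluster distance can be measured, the sketch below computes three common linkage distances between two small clusters. The function names and the two toy clusters are made up for this example, and Euclidean distance between individual points is assumed.

```python
import numpy as np

def pairwise_distances(a, b):
    """Euclidean distance between every point in a and every point in b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

def single_linkage(a, b):
    # Distance between the two closest points of the clusters
    return pairwise_distances(a, b).min()

def complete_linkage(a, b):
    # Distance between the two farthest points of the clusters
    return pairwise_distances(a, b).max()

def average_linkage(a, b):
    # Mean distance over all cross-cluster pairs of points
    return pairwise_distances(a, b).mean()

cluster_a = np.array([[0.0, 0.0], [0.0, 1.0]])
cluster_b = np.array([[3.0, 0.0], [4.0, 0.0]])

print(single_linkage(cluster_a, cluster_b))    # closest pair: (0, 0) and (3, 0)
print(complete_linkage(cluster_a, cluster_b))  # farthest pair: (0, 1) and (4, 0)
print(average_linkage(cluster_a, cluster_b))
```

Single linkage tends to produce long, chained clusters, while complete linkage favors compact ones, so the choice changes which clusters get merged first.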
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering

x = [4, 5, 10, 4, 3, 11, 14, 6, 10, 12]
y = [21, 19, 24, 17, 16, 25, 24, 22, 21, 21]
plt.scatter(x, y)
plt.show()

data = list(zip(x, y))  # pair the coordinates into (x, y) points
# Ward linkage always uses Euclidean distance, so no distance argument is passed
# (the affinity parameter was removed in recent scikit-learn releases)
hierarchical_cluster = AgglomerativeClustering(n_clusters=2, linkage='ward')
labels = hierarchical_cluster.fit_predict(data)
plt.scatter(x, y, c=labels)
plt.show()
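
The dendrogram itself can be built with SciPy. The sketch below is one possible way to do it, assuming SciPy is installed alongside scikit-learn; it reuses the same ten points and Ward linkage, and the variable names Z and tree are just choices made here.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

x = [4, 5, 10, 4, 3, 11, 14, 6, 10, 12]
y = [21, 19, 24, 17, 16, 25, 24, 22, 21, 21]
data = np.array(list(zip(x, y)), dtype=float)

# Each row of Z records one merge: the two cluster ids, the distance
# between them, and the number of points in the merged cluster
Z = linkage(data, method="ward")

# Cut the tree so that at most two flat clusters remain
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)

# no_plot=True only computes the tree layout; drop it to draw the
# dendrogram with matplotlib instead
tree = dendrogram(Z, no_plot=True)
print(tree["ivl"])  # order of the leaves along the horizontal axis
```

With N points the linkage matrix always has N-1 rows, one per merge, matching the N, N-1, N-2, ... progression in the steps above.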

Kohonen Self-Organizing Maps

  • The concept of the self-organizing map (SOM) was first proposed by Kohonen.
  • It is an unsupervised neural network that is trained using unsupervised learning to produce a low-dimensional, discretized representation of the input space of the training samples.
  • This representation is called a map, so a SOM is also a way to reduce the dimensionality of data.
  • SOMs are used for clustering and mapping (or dimensionality reduction): they project multidimensional data onto a lower-dimensional space, which reduces complex problems to a form that is easier to interpret.
  • The input layer and the output layer are the two layers that make up a SOM. It is also called a Kohonen map.
  • The main advantage of using a SOM is that the data becomes easy to interpret and understand.
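
A SOM can be sketched in a few lines of NumPy. The code below is a simplified illustration rather than a reference implementation: the function names, the 4 x 4 grid, the linear decay schedules and the Gaussian neighborhood are all assumptions chosen for this example, and the input is random 3-dimensional data.

```python
import numpy as np

def train_som(data, grid_shape=(4, 4), n_iters=500, lr0=0.5, sigma0=1.5, seed=0):
    """Train a small SOM; returns weights of shape (rows, cols, n_features)."""
    rng = np.random.default_rng(seed)
    rows, cols = grid_shape
    # One weight vector per output node, initialized from random data samples
    weights = data[rng.integers(0, len(data), size=rows * cols)].astype(float)
    # Position of every output node on the 2-D grid
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    for t in range(n_iters):
        frac = t / n_iters
        lr = lr0 * (1.0 - frac)               # learning rate decays over time
        sigma = sigma0 * (1.0 - frac) + 1e-3  # neighborhood radius shrinks over time
        sample = data[rng.integers(len(data))]
        # Best matching unit: the node whose weights are closest to the sample
        bmu = np.argmin(np.linalg.norm(weights - sample, axis=1))
        # Nodes near the BMU on the grid are pulled toward the sample more strongly
        grid_dist2 = np.sum((coords - coords[bmu]) ** 2, axis=1)
        influence = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
        weights += lr * influence[:, None] * (sample - weights)
    return weights.reshape(rows, cols, -1)

def map_to_grid(som, sample):
    """Return the (row, col) of the node that best matches the sample."""
    rows, cols, _ = som.shape
    flat = som.reshape(rows * cols, -1)
    bmu = np.argmin(np.linalg.norm(flat - sample, axis=1))
    return divmod(int(bmu), cols)

rng = np.random.default_rng(1)
data = rng.random((200, 3))  # 200 random samples with 3 features each
som = train_som(data)
print(som.shape)                  # (rows, cols, n_features)
print(map_to_grid(som, data[0]))  # 2-D grid position of a 3-D sample
```

Each 3-dimensional sample ends up represented by a position on the 2-dimensional grid, which is the dimensionality reduction described above.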

Feature selection and Dimensionality reduction

  • The number of input variables or features of a dataset is called its dimensionality.
  • A large number of input features often makes a predictive modeling task more difficult, a problem generally called the curse of dimensionality.
  • High-dimensional data can also lead to overfitting, where the model fits the training data too closely and does not generalize well to new data.
  • Therefore it is often desirable to reduce the number of input features.
  • Reducing the number of input features reduces the dimensionality of the feature space, hence the name "dimensionality reduction."

  • Dimensionality reduction is a data preparation/preprocessing technique applied to data before modeling.
  • It can be done after data cleaning and data scaling and before training the prediction model.
  • There are two main approaches to dimensionality reduction: feature selection and feature extraction.

Feature selection

  • Feature selection picks a subset of the original features that are relevant to the problem at hand.
  • The goal is to reduce the dimensionality of the dataset while preserving the most important features.
  • There are many feature selection methods, including filter methods, wrapper methods, and embedded methods.
  • Filter methods rank features according to their relationship with the target variable.
  • Wrapper methods use the performance of a model as the criterion for selecting features, and embedded methods perform feature selection as part of the model training process.
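
A filter method can be illustrated with plain NumPy. The sketch below ranks features by the absolute Pearson correlation with the target, which is only one of several possible filter criteria; the function name select_by_correlation and the synthetic dataset (where the target depends on features 0 and 3 only) are assumptions made for this example.

```python
import numpy as np

def select_by_correlation(X, y, k):
    """Filter method: keep the k features most correlated (in absolute value) with y."""
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    # Indices of the k highest scores, returned in ascending column order
    top = np.sort(np.argsort(scores)[::-1][:k])
    return top, scores

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Synthetic target for illustration: only features 0 and 3 carry signal
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

selected, scores = select_by_correlation(X, y, k=2)
X_reduced = X[:, selected]
print(selected)         # column indices that were kept
print(X_reduced.shape)  # same rows, fewer columns
```

Because the scores are computed without training any model, filter methods are cheap, but they can miss features that are only useful in combination.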

Feature Extraction

  • Feature extraction creates new features by combining or transforming the original features.
  • The aim is to create a set of features that captures the essence of the original data in a lower-dimensional space.
  • There are many methods for feature extraction, including principal component analysis (PCA) and linear discriminant analysis (LDA).
  • Note: Feature selection and feature extraction are the two approaches used to reduce the number of features.

Principal Component Analysis

  • Principal component analysis is an unsupervised learning algorithm used for dimensionality reduction in machine learning.
  • Dimensionality reduction converts a set of features into a smaller set of features.
  • This method was introduced by Karl Pearson.
  • It works on the condition that when data in a high-dimensional space is mapped to a lower-dimensional space, the variance of the data in the lower-dimensional space should be maximal.
  • It is a statistical technique that converts observations of correlated features into a set of linearly uncorrelated features with the help of an orthogonal transformation.
  • These new transformed features are called principal components.
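
The description above can be sketched with NumPy using an eigendecomposition of the covariance matrix. This is a minimal illustration, not the routine a library such as scikit-learn uses internally; the function name pca and the synthetic 3-feature dataset are assumptions made for this example.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components."""
    # Center each feature so the covariance is computed around the mean
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    # eigh is meant for symmetric matrices and returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]              # columns are orthogonal directions
    explained = eigvals[order] / eigvals.sum()  # share of total variance per component
    return X_centered @ components, components, explained

rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 3))  # arbitrary mixing matrix, for illustration only
X = latent @ mixing + 0.05 * rng.normal(size=(300, 3))  # 3 correlated features

projected, components, explained = pca(X, n_components=2)
print(projected.shape)  # 300 samples, 2 principal components
print(explained)        # fraction of variance captured by each component
```

The projected columns are uncorrelated with each other and ordered by decreasing variance, which matches the maximal-variance condition stated above.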

Some terms used in the PCA algorithm

  • Dimension: It is the number of features or variables in the given data. More simply, it is the number of columns in the dataset.
  • Correlation: It shows how strongly two variables are related to each other.
  • For example, if one variable changes, the other variable also changes. The range of correlation is -1 to +1.
  • Orthogonal: It means that the variables are not correlated with each other, so the correlation between a pair of variables is zero.
  • Eigenvector: Given a square matrix M and a non-zero vector v, v is an eigenvector of M if Mv is a scalar multiple of v.
  • Covariance Matrix: A matrix containing the covariance between each pair of variables.
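
These terms can be checked numerically. The short sketch below uses a small symmetric matrix and two made-up variables, all chosen only for illustration, to show the eigenvector property and how covariance and correlation are computed.

```python
import numpy as np

# Eigenvector check: M @ v must equal (eigenvalue * v)
M = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigenvalues, eigenvectors = np.linalg.eig(M)
for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]  # eigenvectors are stored as columns
    lam = eigenvalues[i]
    print(lam, np.allclose(M @ v, lam * v))

# Covariance matrix and correlation of two nearly proportional variables
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.0, 4.1, 5.9, 8.2, 9.8])
cov_matrix = np.cov(a, b)              # 2 x 2: variances on the diagonal
correlation = np.corrcoef(a, b)[0, 1]  # always between -1 and +1
print(cov_matrix)
print(correlation)
```

The off-diagonal entries of the covariance matrix are equal because the covariance of a with b is the same as that of b with a.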