# Regression and Classification in ML (Machine Learning)

## What is Regression in Machine Learning?

- Regression in machine learning involves predicting a continuous output or numerical value.
- In simpler terms, it's like estimating a quantity.
**Evaluation:**Mean Squared Error (MSE), R-squared, etc**Examples of Algorithms :**Linear Regression, Ridge Regression, lasso Regression.- For example, predicting house prices based on features like size, number of bedrooms, and location is a regression task.

## Classification in Machine Learning

- Classification, on the other hand, is about assigning items into predefined categories or classes.
- Instead of predicting a continuous value, the algorithm categorizes input data into distinct groups.
**Evaluation :**Accuracy, Precision, Recall, F1 Score and Confusion Matrix.**Examples of Algorithms**Support Vector Machines, Decision Trees, etc.- For instance, classifying ema
**ils as spam or not spam, or identifying whether an image contains a cat or a dog.** - classification involves training a model to recognize patterns that define different classes
- , and then using those patterns to assign new, unseen data to the correct category.

## Binary Classification in Machine Learning

- Binary classification is a type of machine learning task where the goal is to categorize input data into one of two possible classes or categories.
- It's like making a yes/no decision, true/false prediction, or putting things into one of two boxes.

**Example: Spam Email Detection**

- Imagine you're building a spam filter for emails.
- The two classes in this scenario are "spam" and "not spam" (ham).
- The algorithm learns from labeled examples of emails – some marked as spam and some as not spam.
- After training, when you give it a new, unseen email, it predicts whether it belongs to the "spam" category or the "not spam" category.

## Multiple Classification in Machine Learning

- Multiple classification, also known
**as multiclass classification**, involves categorizing input data into more than two classes or categories. - Instead of making a binary decision (yes/no, spam/not spam),
- the algorithm is trained to recognize and assign input data into several distinct categories.
- Each instance belongs to one and only one class.
- Clear decision boundaries separating classes.
**Example: Handwritten Digit Recognition**- Consider the task of recognizing handwritten digits from 0 to 9.
- This is a multiclass classification problem because there are multiple
**classes (0, 1, 2, ..., 9).** - The algorithm is trained on a dataset containing examples of each digit.
- After training, when you present it with a new handwritten digit, the model predicts which digit it is.

## Multilabel Classification in Machine Learning

- Multilabel classification is a type of machine learning task where each input can belong to multiple classes simultaneously.
- In other words, instead of assigning just one label or category to each piece of data,
- Each instance can be associated with multiple labels simultaneously.
- Overlapping or shared decision boundaries as instances may belong to multiple labels.
- the algorithm can assign multiple labels, indicating that the input belongs to multiple categories at the same time.
**Example: Topic Tagging for Articles**- Consider a scenario where you have a collection of articles, and each article can be about multiple topics such as science, technology , etc.
- In a multilabel classification system, an article discussing both science and technology would be assigned labels for both categories.

## Confusion Matrix in Machine Learning

- A confusion matrix is a table that helps evaluate the performance of a machine learning model, especially in classification tasks.
- It provides a comprehensive breakdown of the model's predictions by comparing is to the actual outcomes.

### Components of a Confusion Matrix:

#### True Positives (TP):

- Instances where the model correctly predicts the positive class.
- Example: The model correctly identifies 50 spam emails.

#### True Negatives (TN):

- In this model correctly predicts the negative class.
- Example: The model correctly identifies 100 non-spam emails.

#### False Positives (FP):

- Instances where the model predicts the positive class incorrectly (false alarm).
- Example: The model incorrectly classifies 10 non-spam emails as spam.

#### False Negatives (FN):

- Instances where the model predicts the negative class incorrectly (miss).
- Example: The model misses 5 actual spam emails.

### Usefulness of Confusion Matrix:

#### Accuracy

It helps calculate the overall accuracy of the model using the formula (TP + TN) / (TP + TN + FP + FN).

#### Precision

Precision is the ratio of correctly predicted positive observations to the total predicted positives, calculated as TP / (TP + FP).

#### Recall (Sensitivity)

Recall is the ratio of correctly predicted positive observations to all actual positives, calculated as TP / (TP + FN).

- Recall, also known as sensitivity or true positive rate, is the ratio of correctly predicted positive observations to all actual positives.
- A high recall indicates that the model is capturing a large proportion of the actual positive instances.

```
1# Assume True Positives (tp) and False Negatives (fn) are calculated or obtained
2tp = 120
3fn = 20
4
5# Calculate Recall
6recall = tp / (tp + fn)
7
8print(f"Recall: {recall}")
```

#### F1 Score

- The F1 score is the harmonic mean of precision and recall, providing a balance between the two metrics.
- To calculate accuracy using a confusion matrix, you can use the following

```
1# Assume precision and recall are calculated or obtained
2precision = 0.85
3recall = 0.75
4
5# Calculate F1 score
6f1_score = 2 * (precision * recall) / (precision + recall)
7
8print(f"F1 Score: {f1_score}")
```

### Trade-off between Precision and Recall:

#### High Precision, Low Recall:

The model is cautious in making positive predictions but may miss some positive instances.

#### High Recall, Low Precision:

The model predicts many positive instances, but some of them may be incorrect.

### ROC Curve

- The Receiver Operating Characteristic (ROC) curve is a graphical representation.
- It illustrates the performance of a binary classification model across different thresholds.
- It is particularly useful when evaluating the trade-off between the true positive rate (sensitivity or recall).

## Linear Regression with One Variable

- Linear regression with one variable, also known as simple linear regression,
- It is a basic method for predicting a dependent variable based on a single independent variable.
- Linear regression with one variable aims to find a straight line that best fits the relationship,
- between the independent variable (X) and the dependent variable (Y).

```
1Y= mx+b
2
```

## Linear Regression with Multiple Variables

- Linear regression with multiple variables is an extension of simple linear regression,
- where instead of just one independent variable (x), we have multiple independent variables (x1,x2,...,xn).
- This allows us to consider more factors that may influence the dependent variable (y).

## Logistic Regression

- Logistic regression is a type of machine learning algorithm used for classification tasks.
- Unlike linear regression, which predicts a continuous output,
- logistic regression predicts the probability that an instance belongs to a particular category.
- Imagine predicting whether a student will pass (1) or fail (0) based on the number of hours they studied.
- Logistic Regression can output the probability of passing, and if it's above a threshold (e.g., 0.5), we classify it as a pass.

**Key Concepts:**

#### 1. Binary Classification:

Logistic regression is commonly used for binary classification problems where there are two possible outcomes (e.g., spam or not spam).

#### 2. Probability Output:

- Instead of predicting a specific value, logistic regression predicts the probability that an instance belongs to the positive class.
- The output is between 0 and 1

```
1# Importing necessary libraries
2from sklearn.linear_model import LogisticRegression
3import pandas as pd
4from sklearn.model_selection import train_test_split
5from sklearn import metrics
6import matplotlib.pyplot as plt
7
8# Reading the dataset
9df = pd.read_csv("diabetes.csv")
10
11# Displaying information about the dataset
12print(df.info())
13
14# Splitting data into feature and target variables
15X = df.drop('Outcome', axis=1)
16y = df['Outcome']
17
18# Checking the label distribution
19print(y.value_counts())
20
21# Splitting data into training and test sets
22X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=16)
23
24# Checking the label distribution in the test set
25print(y_test.value_counts())
26
27# Creating and training the Logistic Regression model
28logr = LogisticRegression(max_iter=1000) # Setting max_iter to avoid warnings
29logr.fit(X_train.values, y_train)
30
31# Making predictions on the test set
32predicted = logr.predict(X_test.values)
```

## Advanced Python

### 1. NumPy

- NumPy is a powerful Python library for numerical computing.
- It introduces a multidimensional array object (numpy array) that efficiently handles large datasets and
- provides a wide range of mathematical functions to operate on these arrays.

### Pandas

- Pandas is a data manipulation library built on top of NumPy.
- It provides data structures like DataFrames, which are ideal for working with structured data.
- Pandas simplifies tasks like data cleaning, exploration, and analysis.

### Scikit-Learn

- Scikit-Learn is a comprehensive machine learning library in Python.
- It provides tools for various machine learning tasks, including classification,
- regression, clustering, and more. It's built on NumPy, SciPy, and Matplotlib.

## Conclusion

Regression predicts numerical values, classification assigns data to predefined categories, confusion matrices evaluate model performance, logistic regression is for binary classification with probability output.