Regression and Classification in ML (Machine Learning)

BCA Labs



What is Regression in machine learning?

  • Regression in machine learning involves predicting a continuous output or numerical value.
  • In simpler terms, it's like estimating a quantity.
  • Evaluation: Mean Squared Error (MSE), R-squared, etc.
  • Examples of Algorithms: Linear Regression, Ridge Regression, Lasso Regression.
  • For example, predicting house prices based on features like size, number of bedrooms, and location is a regression task.
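As a rough illustration of the evaluation metrics listed above, the sketch below computes MSE and R-squared by hand. The house prices and predictions are made-up values, not results from a real model.

```python
# Hypothetical house prices (in thousands) and a model's predictions.
actual = [200.0, 250.0, 300.0, 350.0]
predicted = [210.0, 240.0, 310.0, 340.0]

n = len(actual)

# Mean Squared Error: the average of the squared prediction errors.
mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n

# R-squared: 1 - (residual sum of squares / total sum of squares).
mean_actual = sum(actual) / n
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r_squared = 1 - ss_res / ss_tot

print(f"MSE: {mse}")
print(f"R-squared: {r_squared}")
```

A lower MSE means smaller errors on average, while an R-squared closer to 1 means the predictions explain more of the variation in the actual values.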

Classification in Machine Learning

  • Classification, on the other hand, is about assigning items into predefined categories or classes.
  • Instead of predicting a continuous value, the algorithm categorizes input data into distinct groups.
  • Evaluation: Accuracy, Precision, Recall, F1 Score and Confusion Matrix.
  • Examples of Algorithms: Support Vector Machines, Decision Trees, etc.
  • For instance, classifying emails as spam or not, or identifying whether an image contains a cat or a dog.
  • Classification involves training a model to recognize patterns that define different classes and then using those patterns to assign new, unseen data to the correct category.

Binary Classification in Machine Learning

  • Binary classification is a machine learning task that aims to categorize input data into one of two possible classes or categories.
  • It's like making a yes/no decision, a true/false prediction, or putting things into one of two boxes.

Example: Spam Email Detection
  • Imagine you're building a spam filter for emails.
  • The two classes in this scenario are "spam" and "not spam" (ham).
  • The algorithm learns from labeled examples of emails - some marked as spam and some as not spam.
  • After training, the model takes a new, unseen email and predicts whether it belongs to the "spam" category or the "not spam" category.

Multiple Classification in Machine Learning

  • Multiple classification, also known as multiclass classification, involves categorizing input data into more than two classes or categories.
  • Instead of making a binary decision (yes/no, spam/not spam), the algorithm is trained to recognize and assign input data into several distinct categories.
  • Each instance belongs to one and only one class.
  • The classes are mutually exclusive, so the model learns decision boundaries that separate them.

Example: Handwritten Digit Recognition
  • Consider the task of recognizing handwritten digits from 0 to 9.
  • This is a multiclass classification problem because there are multiple classes (0, 1, 2, ..., 9).
  • The algorithm is trained on a data set containing examples of each digit.
  • After training, when you present it with a new handwritten digit, the model predicts which digit it is.
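One common way a multiclass model arrives at a single prediction is to produce one score per class and pick the highest. The sketch below assumes such scores are already available; the values are invented for illustration.

```python
# Hypothetical scores a trained model might output for one image,
# one score per digit class 0-9 (made-up values for illustration).
scores = [0.01, 0.02, 0.05, 0.70, 0.02, 0.10, 0.03, 0.04, 0.02, 0.01]

# Multiclass prediction: pick the single class with the highest score.
predicted_digit = max(range(len(scores)), key=lambda i: scores[i])

print(f"Predicted digit: {predicted_digit}")
```

Because only the top-scoring class is kept, each input ends up with exactly one label.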

Multilabel Classification in Machine Learning

  • Multilabel classification is a machine-learning task where each input can belong to multiple classes simultaneously.
  • In other words, instead of assigning just one label or category to each piece of data, the algorithm can assign several labels to the same input.
  • Decision boundaries may overlap or be shared, because an instance can fall under more than one label at once.

Example: Topic Tagging for Articles
  • Consider a scenario where you have a collection of articles, and each article can be about multiple topics such as science, technology, etc.
  • An article discussing science and technology would be labeled for both categories in a multilabel classification system.
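A simple way to picture multilabel prediction is to score each label independently and keep every label whose score passes a cut-off. The topic scores and the 0.5 threshold below are assumptions made for this sketch.

```python
# Hypothetical per-topic scores for one article (made-up values).
topic_scores = {
    "science": 0.82,
    "technology": 0.67,
    "sports": 0.05,
    "politics": 0.12,
}

threshold = 0.5  # assumed cut-off; in practice it is tuned for the problem

# Multilabel prediction: keep every topic whose score passes the threshold.
predicted_topics = [
    topic for topic, score in topic_scores.items() if score >= threshold
]

print(f"Predicted topics: {predicted_topics}")
```

Unlike the multiclass case, the result is a list that may contain zero, one, or several labels.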

Confusion Matrix in Machine Learning

  • A confusion matrix is a table that helps evaluate the performance of a machine learning model, especially in classification tasks.
  • It provides a comprehensive breakdown of the model's predictions by comparing them to the actual outcomes.

Components of a Confusion Matrix:

True Positive (TP):

  • Instances where the model correctly predicts the positive class.
  • Example: The model correctly identifies 50 spam emails.

True Negative (TN):

  • Instances where the model correctly predicts the negative class.
  • Example: The model correctly identifies 100 non-spam emails.

False Positives (FP):

  • Instances where the model incorrectly predicts the positive class (false alarm).
  • Example: The model incorrectly classifies 10 non-spam emails as spam.

False Negatives (FN):

  • Instances where the model predicts the negative class incorrectly (miss).
  • Example: The model misses 5 actual spam emails.
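The four components can be counted directly from a list of actual labels and a list of predicted labels. The sketch below uses a small made-up sample of eight emails, where 1 means spam (positive) and 0 means not spam (negative).

```python
# Made-up labels for eight emails: 1 = spam (positive), 0 = not spam (negative).
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

pairs = list(zip(actual, predicted))

tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # spam caught
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # non-spam passed through
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false alarm
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # missed spam

print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
```

The four counts always add up to the total number of instances, which is a quick sanity check when building the matrix by hand.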

The usefulness of the Confusion Matrix:

Accuracy

The confusion matrix helps calculate the overall accuracy of the model using the formula (TP + TN) / (TP + TN + FP + FN).

Precision

Precision is the ratio of correctly predicted positive observations to the total predicted positives, calculated as TP / (TP + FP).
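As a worked example, the sketch below plugs the counts from the spam examples in the components section (50 true positives, 100 true negatives, 10 false positives, 5 false negatives) into the accuracy and precision formulas.

```python
# Counts taken from the spam examples in the components section.
tp = 50   # spam correctly identified
tn = 100  # non-spam correctly identified
fp = 10   # non-spam incorrectly flagged as spam
fn = 5    # spam that was missed

# Accuracy: share of all predictions that were correct.
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Precision: share of predicted positives that were actually positive.
precision = tp / (tp + fp)

print(f"Accuracy: {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
```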

Recall (Sensitivity)

Recall is the ratio of correctly predicted positive observations to all actual positives, calculated as TP / (TP + FN).

  • Recall is also known as sensitivity or the true positive rate.
  • A high recall indicates that the model is capturing a large proportion of the actual positive instances.

# Assume True Positives (tp) and False Negatives (fn) are calculated or obtained
tp = 120
fn = 20

# Calculate Recall
recall = tp / (tp + fn)

print(f"Recall: {recall}")

F1 Score

  • The F1 Score is the harmonic mean of precision and recall, providing a balance between the two metrics.
  • To calculate the F1 Score from precision and recall, you can use the following:

# Assume precision and recall are calculated or obtained
precision = 0.85
recall = 0.75

# Calculate F1 score
f1_score = 2 * (precision * recall) / (precision + recall)

print(f"F1 Score: {f1_score}")

Trade-Off between Precision and Recall:

High Precision, Low Recall:

The model is cautious in making positive predictions but may miss some positive instances.

High Recall, Low Precision:

The model predicts many positive instances, but some of them may be incorrect.
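The trade-off can be seen by changing the decision threshold on a model's predicted probabilities. The probabilities, labels, and thresholds below are made up for illustration, and `precision_recall` is a helper written for this sketch rather than a library function.

```python
# Made-up predicted probabilities and true labels (1 = positive).
probs = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]


def precision_recall(threshold):
    """Return (precision, recall) when predicting positive at prob >= threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in zip(labels, preds) if y == 0 and yhat == 1)
    fn = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# A strict threshold makes few positive predictions; a lenient one makes many.
for threshold in (0.8, 0.25):
    precision, recall = precision_recall(threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

With this sample, the strict threshold gives high precision but lower recall, and the lenient threshold gives the reverse.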

ROC Curve

  • The Receiver Operating Characteristic (ROC) curve is a graphical representation.
  • It illustrates the performance of a binary classification model across different thresholds.
  • It is particularly useful when evaluating the trade-off between the true positive rate (sensitivity or recall) and the false positive rate.
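An ROC curve is drawn by plotting the true positive rate against the false positive rate at each threshold. The sketch below computes a few such points by hand; the probabilities, labels, and thresholds are made up, and `roc_point` is a helper written for this example.

```python
# Made-up predicted probabilities and true labels (1 = positive).
probs = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]


def roc_point(threshold):
    """Return (false positive rate, true positive rate) at one threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in zip(labels, preds) if y == 0 and yhat == 1)
    fn = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 0)
    tn = sum(1 for y, yhat in zip(labels, preds) if y == 0 and yhat == 0)
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return fpr, tpr


# Sweep from a threshold nothing passes to one everything passes.
thresholds = [1.1, 0.8, 0.5, 0.25, 0.0]
roc_points = [roc_point(t) for t in thresholds]

for t, (fpr, tpr) in zip(thresholds, roc_points):
    print(f"threshold={t}: FPR={fpr:.2f}, TPR={tpr:.2f}")
```

Plotting these (FPR, TPR) pairs and connecting them gives the ROC curve; a curve that rises quickly toward the top-left corner indicates a better classifier.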

Linear Regression with One Variable

  • Linear regression with one variable, also known as simple linear regression, is a basic method for predicting a dependent variable based on a single independent variable.
  • It aims to find the straight line that best fits the relationship between the independent variable (x) and the dependent variable (y).

y = mx + b

Here m is the slope of the line and b is the y-intercept.
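The text does not say how the best-fitting line is found. A common choice is ordinary least squares, which is what the sketch below assumes. The data points are made up and lie exactly on a line so the result is easy to check.

```python
# Made-up data: hours studied (x) and exam score (y).
# The points lie exactly on y = 2x + 50, so the fit is easy to verify.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [52.0, 54.0, 56.0, 58.0, 60.0]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Ordinary least squares for one variable:
#   slope m     = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   intercept b = mean_y - m * mean_x
numerator = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
denominator = sum((xi - mean_x) ** 2 for xi in x)
m = numerator / denominator
b = mean_y - m * mean_x

print(f"m = {m}, b = {b}")
print(f"Prediction for x = 6: {m * 6 + b}")
```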

Linear Regression with Multiple Variables

  • Linear regression with multiple variables is an extension of simple linear regression, where instead of just one independent variable (x), we have multiple independent variables (x1, x2, ..., xn).
  • This allows us to consider more factors that may influence the dependent variable (y).
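With several variables, the prediction becomes an intercept plus one weighted term per feature. The sketch below shows only the prediction step; the coefficients are hypothetical values chosen for illustration, not fitted from data, and `predict` is a helper written for this example.

```python
def predict(features, weights, bias):
    """Return bias + w1*x1 + w2*x2 + ... + wn*xn."""
    return bias + sum(w * x for w, x in zip(weights, features))


# Hypothetical coefficients for a house-price model (not fitted from data).
weights = [100.0, 5000.0]  # price per square foot, price per bedroom
bias = 20000.0             # base price

house = [1500.0, 3.0]      # 1500 square feet, 3 bedrooms
price = predict(house, weights, bias)

print(f"Predicted price: {price}")
```

In practice the weights and bias would be learned from training data, for example with a library such as Scikit-Learn.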

Logistic Regression

  • Logistic regression is a type of machine learning algorithm used for classification tasks.
  • Unlike linear regression, which predicts a continuous output, logistic regression predicts the probability that an instance belongs to a particular category.
  • Imagine predicting whether a student will pass (1) or fail (0) based on the number of hours they studied.
  • Logistic regression can output the probability of passing; if it is above a threshold (e.g., 0.5), we classify the student as a pass.

Key Concepts

1. Binary Classification

Logistic regression is commonly used for binary classification problems with two possible outcomes (e.g., spam or not spam).

2. Probability Output

  • Instead of predicting a specific value, logistic regression predicts the probability that an instance belongs to the positive class.
  • The output is between 0 and 1.
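Logistic regression produces a value between 0 and 1 by passing a linear combination of the inputs through the sigmoid (logistic) function. The sketch below applies this to the pass/fail example; the `weight` and `bias` values are hypothetical, not learned from data.

```python
import math


def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical parameters for the pass/fail example (not learned from data).
weight = 1.5
bias = -6.0


def predict_pass(hours, threshold=0.5):
    """Return (probability of passing, predicted class) for hours studied."""
    probability = sigmoid(weight * hours + bias)
    return probability, int(probability >= threshold)


for hours in (2.0, 4.0, 6.0):
    probability, label = predict_pass(hours)
    print(f"hours={hours}: probability={probability:.3f}, class={label}")
```

Raising or lowering the threshold changes which probabilities count as a pass, which is the same lever that drives the precision and recall trade-off.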

# Importing necessary libraries
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Reading the dataset (assumes diabetes.csv is in the working directory)
df = pd.read_csv("diabetes.csv")

# Displaying information about the dataset
df.info()

# Splitting data into feature and target variables
X = df.drop('Outcome', axis=1)
y = df['Outcome']

# Checking the label distribution
print(y.value_counts())

# Splitting data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=16)

# Checking the label distribution in the test set
print(y_test.value_counts())

# Creating and training the Logistic Regression model
logr = LogisticRegression(max_iter=1000)  # Setting max_iter to avoid convergence warnings
logr.fit(X_train, y_train)

# Making predictions on the test set
predicted = logr.predict(X_test)

# Evaluating the predictions with a confusion matrix and accuracy
print(metrics.confusion_matrix(y_test, predicted))
print(metrics.accuracy_score(y_test, predicted))

Advanced Python

1. NumPy

  • NumPy is a powerful library for numerical computing.
  • It introduces a multidimensional array object (numpy array) that efficiently handles large datasets and provides various mathematical functions to operate on these arrays.

2. Pandas

  • Pandas is a data manipulation library built on top of NumPy.
  • It provides data structures like DataFrames, ideal for working with structured data.
  • Pandas simplifies tasks like data cleaning, exploration, and analysis.

3. Scikit-Learn

  • Scikit-Learn is a comprehensive machine-learning library in Python.
  • It provides tools for various machine-learning tasks, including classification, regression, clustering, and more. It's built on NumPy, SciPy, and Matplotlib.


Regression predicts numerical values, classification assigns data to predefined categories, confusion matrices evaluate model performance, and logistic regression is for binary classification with probability output.
