Data Preprocessing
Data preprocessing is the process of preparing raw data and making it suitable for a Data Science model. It is the first and most important step in building one.
Why do we need Data Preprocessing?
Real-world data generally contains noise and missing values, and may be in an unusable format that cannot be fed directly to machine learning models. Data preprocessing covers the tasks required to clean the data and make it suitable for a machine learning model, which also increases the accuracy and efficiency of a Data Science model.
It involves the following processes:
- Data Encoding
- Normalization
- Standardization
- Imputation of missing values
- Discretization
Dataset Description
The Iris dataset was used in R.A. Fisher’s classic 1936 paper, The Use of Multiple Measurements in Taxonomic Problems, and can also be found on the UCI Machine Learning Repository.
It includes three iris species with 50 samples each as well as some properties about each flower. One flower species is linearly separable from the other two, but the other two are not linearly separable from each other.
Import Python libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Read the dataset file:
The dataset has five feature columns: an id and the length and width of the petal and sepal. The goal is to predict the particular species from these measurements.
The raw data contains null entries, missing values, and many kinds of non-uniform structure, which cause problems and lead to inefficient results from a Data Science model.
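A minimal sketch of loading the dataset with pandas; the file name iris.csv is an assumption and may differ in your copy of the data.

import pandas as pd

df = pd.read_csv("iris.csv")  # the file name is an assumption
print(df.head())              # inspect the first few rows
print(df.isnull().sum())      # count missing values per column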
Data Encoding
Encoding is the process of converting data, or a given sequence of characters, symbols, alphabets, etc., into a specified format. Here we assign a unique value to each category of a categorical attribute, e.g. pass as 1 and fail as 0.
There are two types of encoding:
1. Label encoding
Label encoding refers to converting the labels into numeric form so as to make them machine-readable. Machine learning algorithms can then better decide how those labels should be handled. It is an important preprocessing step for structured datasets in supervised learning.
As you can see, the ‘Species’ column has 3 categories of flower. After applying a label encoder, the data is labeled numerically.
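A short sketch of label encoding with scikit-learn's LabelEncoder; the column name "Species" is an assumption based on the dataset description above.

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# Replace the three species names with integer labels 0, 1, 2
df["Species"] = le.fit_transform(df["Species"])
print(le.classes_)  # the original category names, in label order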
2. One-hot encoding
Though label encoding is straightforward, it has the disadvantage that algorithms can misinterpret the numeric values as having some sort of hierarchy or order. This ordering issue is addressed by a common alternative approach called one-hot encoding. In this strategy, each category value is converted into a new column that is assigned a 1 or 0 (the notation for true/false).
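A minimal one-hot encoding sketch using pandas' get_dummies (scikit-learn's OneHotEncoder works similarly); the column name "Species" is again an assumption.

import pandas as pd

# One 0/1 column per species: 1 where that species applies, 0 elsewhere
dummies = pd.get_dummies(df["Species"], prefix="Species")
df = pd.concat([df.drop(columns="Species"), dummies], axis=1)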
Normalization
Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. It is also known as Min-Max scaling.
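A minimal sketch of Min-Max scaling with scikit-learn; the measurement column names are assumptions based on the usual Iris CSV layout.

from sklearn.preprocessing import MinMaxScaler

cols = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]  # assumed column names
scaler = MinMaxScaler()
# Rescale each column to [0, 1]: x' = (x - min) / (max - min)
df[cols] = scaler.fit_transform(df[cols])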
Standardization
Standardization is another scaling technique where the values are centered around the mean with a unit standard deviation. This means that the mean of the attribute becomes zero and the resultant distribution has a unit standard deviation.
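A corresponding sketch with scikit-learn's StandardScaler, under the same assumed column names.

from sklearn.preprocessing import StandardScaler

cols = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]  # assumed column names
scaler = StandardScaler()
# Center each column at mean 0 with unit standard deviation: x' = (x - mean) / std
df[cols] = scaler.fit_transform(df[cols])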
Imputation of missing values
Missing values are data points that are not available in the dataset. A single value can be missing, or only one value may be available with all the others missing.
Below is a sketch of simple imputation, filling missing values with the column mean.
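This sketch uses scikit-learn's SimpleImputer; the column names are the same assumptions as above.

from sklearn.impute import SimpleImputer

cols = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]  # assumed column names
imputer = SimpleImputer(strategy="mean")
# Replace each NaN with the mean of its column
df[cols] = imputer.fit_transform(df[cols])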
Discretization
Discretization is the process through which we can transform continuous variables, models or functions into a discrete form. We do this by creating a set of contiguous intervals (or bins) that go across the range of our desired variable/model/function.
There are 3 types of discretization available in scikit-learn (a sketch follows the list):
1. Quantile Discretization Transform
2. Uniform Discretization Transform
3. K Means Discretization Transform
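All three are provided by scikit-learn's KBinsDiscretizer, selected via its strategy parameter; the bin count of 4 and the column names below are assumptions for illustration.

from sklearn.preprocessing import KBinsDiscretizer

cols = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]  # assumed column names
# strategy can be "quantile", "uniform", or "kmeans", matching the list above
disc = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="quantile")
df[cols] = disc.fit_transform(df[cols])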
For Implementation
Use the following links for reference.