Pandas for Data Science: A Step-by-Step Guide
Pandas is an open-source Python library widely used for data pre-processing. It offers numerous predefined functions for managing raw data efficiently, and we will walk through them in detail with a practical example.
First of all, we need a dataset. We can create one ourselves using Excel or SQL, or download one from sources such as Kaggle, AWS, or other websites:
https://www.kaggle.com/datasets/saadharoon27/diwali-sales-dataset
Diwali Data Set
When you receive data, it often comes in raw form and isn’t directly usable for machine learning models. This is where data processing comes in. Pandas is a Python library designed specifically for data analysis, organization, and cleaning. With Pandas, you can easily handle data manipulation tasks such as sorting, filtering, and cleaning, making it a crucial tool for turning raw data into a format suitable for machine learning algorithms.
To use Pandas, you first need to install it with pip, Python’s package installer. You can do this by running the command pip install pandas in your command line or terminal. After installing Pandas, you can import it into your Python script or notebook.
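A typical setup looks like this:

# Install pandas once from the command line or a notebook cell:
#   pip install pandas

# Import pandas in your script or notebook
import pandas as pd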
Pandas provides various functions to handle null values, remove unused columns, and convert categorical values into proper formats. These functions make data preprocessing tasks easier and more efficient, so you can clean and prepare your data for machine learning effectively.
In the below example, we will analyze sales data using pandas:
The sample file used in the examples below was created by us in Excel.
Import Python Libraries
Note: we are using a Jupyter Notebook to write these commands.
Read Data
We load the sales file named “Salese2.csv” here. It should be kept in the same folder as the notebook.
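A minimal sketch of this step, assuming the DataFrame is stored in a variable called data (the same name used later with dropna):

import pandas as pd

# Read the sales file; Salese2.csv must sit in the same folder as this notebook
data = pd.read_csv("Salese2.csv")

# Preview the first few rows
data.head()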
Data Information
Using the shape attribute, we can find out how many rows and columns are in the data table.
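For example:

# Returns a (rows, columns) tuple
data.shape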
According to the above command, there are 10 rows (the first value in shape) and 10 columns (the second value in shape).
With info, we can get complete information about the dataset, such as how many columns and rows it has and the data type of each column.
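For example:

# Summary of columns, non-null counts, and data types
data.info()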
In datasets, some values are often missing; these are called nulls, and they are a very common problem. Nulls are like poison for a dataset: because of them, our insights become biased and predictions go wrong. So it is highly recommended to first remove or fill the nulls (which one you choose depends on the domain of your dataset).
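The null counts shown in the original screenshot are typically obtained with isnull().sum(), for example:

# Number of missing values per column
data.isnull().sum()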
In the output, we can see that we have 3 nulls in Brand and 1 in dis.
In the above dataset, we fill the null Brand values with “others”.
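A minimal sketch of this step, assuming the column is named Brand as in the text above:

# Replace missing Brand values with "others"
data['Brand'] = data['Brand'].fillna('others')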
Here we are filling the null dis values with 0%.
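Similarly, assuming the discount column is named dis and stored as a number:

# Replace missing discount values with 0
data['dis'] = data['dis'].fillna(0)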
Sometimes the dataset is large, say around 25,000 rows, and there are only 2 or 3 nulls; in that case we can drop the nulls instead of filling them.
Example: data.dropna(inplace=True) (whether to drop or fill depends on the domain and the dataset)
Filter Data
We can filter our data; for example, below we select only the Samsung brand.
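A sketch of this filter, using the Brand column from earlier:

# Keep only the rows where Brand is Samsung
samsung_data = data[data['Brand'] == 'Samsung']
samsung_data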
Categorizing Data
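The screenshot for this step is not reproduced here; a minimal sketch of common categorization steps, assuming the same Brand column, would be:

# Convert Brand to the category dtype (more memory-efficient for repeated labels)
data['Brand'] = data['Brand'].astype('category')

# Count how many rows belong to each brand
data['Brand'].value_counts()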
Pandas has many more functions; click on the link below for a video tutorial on pandas.