Pandas for Data Science: A Step-by-Step Guide

Pandas is an open-source Python library widely used for data pre-processing. It offers numerous predefined functions for managing raw data efficiently. We will discuss it in detail using a worked example.

First of all, we need a dataset. We can create one ourselves using Excel or SQL, or download one from sources such as Kaggle, AWS, or other websites:

https://www.kaggle.com/datasets/saadharoon27/diwali-sales-dataset

Diwali Data Set

When you receive data, it often comes in raw form and isn't directly usable for machine learning models. This is where data processing comes in. Pandas is a Python library specifically designed for data analysis, organization, and cleaning. It helps structure and prepare the data so that it can be used effectively in machine learning models. With Pandas, you can easily handle data manipulation tasks such as sorting, filtering, and cleaning, which makes it a crucial tool for turning raw data into a format suitable for machine learning algorithms.

To use Pandas, you first need to install it using pip, the Python package installer. You can do this by running the command pip install pandas in your command line or terminal. After installing Pandas, you can import it into your Python script or notebook.

Pandas provides various functions to handle null values, remove unused columns, and convert categories into proper formats. These functions make data preprocessing tasks easier and more efficient.

In the example below, we will analyze sales data using pandas:

We created this file ourselves in Excel.

Import Python Libraries

Note: we are using Jupyter Notebook to write these commands:
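A minimal sketch of the import cell, assuming pandas is the only library needed for the steps in this guide. The alias pd is a widely used convention, not a requirement.

```python
# Run once from a terminal (not inside Python):  pip install pandas
import pandas as pd

# Printing the version is a quick check that the import succeeded.
print(pd.__version__)
```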

Read Data

Here we load the sales file named “Salese2.csv”. It should be kept in the same folder as the notebook.
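The loading step might look like the sketch below. So that it runs without the real file, it reads from an in-memory string instead. The column names Product and Price and all the values are invented for illustration; only Brand and dis are column names mentioned in this guide.

```python
import io
import pandas as pd

# Hypothetical stand-in for the CSV file's contents (invented values).
csv_text = """Product,Brand,Price,dis
Galaxy S21,Samsung,60000,10%
iPhone 13,Apple,70000,5%
Redmi Note,,15000,
"""

# pd.read_csv accepts a file path or a file-like object.
# With the real file you would write: data = pd.read_csv("Salese2.csv")
data = pd.read_csv(io.StringIO(csv_text))

# head() shows the first rows so you can check the data loaded correctly.
print(data.head())
```

Empty fields in the CSV, such as the missing Brand in the last row, are read as nulls (NaN).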

Data Information

Using the shape attribute, we can find out how many rows and columns are in the data table.

According to shape, our table has 10 rows (the first value in shape) and 10 columns (the second value in shape).

With info(), we get a summary of the dataset: the number of rows and columns, the data type of each column, the non-null count per column, and so on.
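As a small illustration of how shape is read, here is a sketch on a hypothetical three-row table (the values are invented, so the numbers differ from the 10 by 10 table described above).

```python
import pandas as pd

# Hypothetical table with invented values.
data = pd.DataFrame({
    "Product": ["Galaxy S21", "iPhone 13", "Redmi Note"],
    "Brand": ["Samsung", "Apple", "Xiaomi"],
})

# shape is an attribute (no parentheses) holding a (rows, columns) tuple.
rows, columns = data.shape
print(rows, columns)
```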
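A minimal sketch of calling info(), again on a hypothetical table with invented values. The method prints its summary directly and returns nothing.

```python
import pandas as pd

# Hypothetical table with one missing Brand value.
data = pd.DataFrame({
    "Product": ["Galaxy S21", "iPhone 13", "Redmi Note"],
    "Brand": ["Samsung", "Apple", None],
    "Price": [60000, 70000, 15000],
})

# Prints the row count, each column's non-null count and dtype,
# and the approximate memory usage.
data.info()
```

In the printed summary, a column whose non-null count is lower than the number of rows contains missing values.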

In datasets, some values are often missing; these are called nulls. It is a very common problem. Nulls are like poison for a dataset: because of them, the insights we draw from the data can be biased and predictions can be wrong. So it is highly recommended to remove or fill nulls first (which one depends on the domain of your dataset).

Here we can see that we have 3 nulls in Brand and 1 in dis.

In the above dataset, we filled the null brands with “others”.

Here we are filling dis with 0%.

Sometimes we have a large dataset, say around 25,000 rows, with only 2 or 3 nulls. In that case we can drop the rows that contain nulls instead of filling them.

For example: data.dropna(inplace=True). Whether to drop or fill depends on the domain and the dataset.
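One common way to get such per-column null counts is isnull().sum(). The sketch below uses a hypothetical table built to contain 3 nulls in Brand and 1 in dis; the individual values are invented.

```python
import pandas as pd

# Hypothetical table: None marks a missing value.
data = pd.DataFrame({
    "Brand": ["Samsung", None, "Apple", None, None],
    "dis": ["10%", "5%", None, "0%", "15%"],
})

# isnull() flags each missing cell; sum() counts the flags per column.
null_counts = data.isnull().sum()
print(null_counts)
```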
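Both fills can be sketched with fillna(). This assumes dis is stored as text such as "10%", which is a guess about the file. The table itself is hypothetical.

```python
import pandas as pd

# Hypothetical table with one missing Brand and one missing dis.
data = pd.DataFrame({
    "Brand": ["Samsung", None, "Apple"],
    "dis": ["10%", "5%", None],
})

# Replace missing brands with the placeholder "others".
data["Brand"] = data["Brand"].fillna("others")

# Replace missing discounts with "0%" (assumes dis is a text column).
data["dis"] = data["dis"].fillna("0%")

print(data)
```

Assigning the result back to the column keeps the change explicit and avoids relying on in-place modification.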
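To see what dropna() does, here is a sketch on a hypothetical four-row table. It uses the non-inplace form, which returns a new table and leaves the original untouched, so the two can be compared.

```python
import pandas as pd

# Hypothetical table with one row that has a missing Brand.
data = pd.DataFrame({
    "Brand": ["Samsung", None, "Apple", "Samsung"],
    "Price": [60000, 15000, 70000, 25000],
})

# dropna() removes every row that contains at least one null.
cleaned = data.dropna()

# Compare row counts before and after dropping.
print(len(data), len(cleaned))
```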

Filter Data

We can filter our data. For example, here we select only the rows for the Samsung brand.

Categorize Data

Pandas has many more functions. Click the link below for a video tutorial on pandas.
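A filter like this is usually written with a boolean mask. The sketch below uses a hypothetical table; the Product values are invented.

```python
import pandas as pd

# Hypothetical table with two Samsung rows and one Apple row.
data = pd.DataFrame({
    "Product": ["Galaxy S21", "iPhone 13", "Galaxy A52"],
    "Brand": ["Samsung", "Apple", "Samsung"],
})

# The comparison yields True/False per row; indexing keeps the True rows.
samsung = data[data["Brand"] == "Samsung"]
print(samsung)
```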
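One possible reading of this step, and it is only an assumption about what was intended, is converting a text column to the pandas category type and counting the rows in each category. The table below is hypothetical.

```python
import pandas as pd

# Hypothetical table with a repeated text column.
data = pd.DataFrame({"Brand": ["Samsung", "Apple", "Samsung", "Apple"]})

# Convert the text column to the category dtype.
data["Brand"] = data["Brand"].astype("category")

# List the distinct categories, then count rows per category.
print(data["Brand"].cat.categories)
print(data["Brand"].value_counts())
```

The category type stores each distinct value once, which can save memory on large columns with few unique values.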
