Pandas for Data Science: A Step-by-Step Guide
Pandas is an open-source Python library widely used for data pre-processing. It offers numerous predefined functions for managing raw data efficiently, and we will walk through them in detail with a practical example.
First of all, we need a dataset. We can create one ourselves using Excel or SQL, or download one from sources such as Kaggle, AWS, or other websites:
https://www.kaggle.com/datasets/saadharoon27/diwali-sales-dataset
Diwali Data Set
When you receive data, it often comes in raw form and isn’t directly usable for machine learning models. This is where data processing comes in. Pandas is a Python library designed specifically for data analysis, organization, and cleaning. With Pandas, you can easily handle data manipulation tasks such as sorting, filtering, and cleaning, making it a crucial tool for turning raw data into a format suitable for machine learning algorithms.
To use Pandas, you first need to install it with pip, Python’s package installer. You can do this by running the command pip install pandas in your command line or terminal. After installing Pandas, you can import it into your Python script or notebook.
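A typical setup looks like this:

# Install pandas once from the command line or a notebook cell:
#   pip install pandas

# Import pandas in your script or notebook
import pandas as pd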
Pandas provides various functions to handle null values, remove unused columns, and convert categorical values into proper formats. These functions make data preprocessing tasks easier and more efficient, so you can clean and prepare your data for machine learning effectively.
In the below example, we will analyze sales data using pandas:
The sample file used in the examples below was created by us in Excel.
Import Python Libraries
Note: we are using a Jupyter Notebook to write these commands.
Read Data
We load the sales file named “Salese2.csv” here. It should be kept in the same folder as the notebook.
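A minimal sketch of this step, assuming the DataFrame is stored in a variable called data (the same name used later with dropna):

import pandas as pd

# Read the sales file; Salese2.csv must sit in the same folder as this notebook
data = pd.read_csv("Salese2.csv")

# Preview the first few rows
data.head()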
Data Information
Using the shape attribute, we can find out how many rows and columns are in the data table.
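For example:

# Returns a (rows, columns) tuple
data.shape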
According to the above command, there are 10 rows (the first value in shape) and 10 columns (the second value in shape).
With info, we can get complete information about the dataset, such as how many columns and rows it has and the data type of each column.
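For example:

# Summary of columns, non-null counts, and data types
data.info()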
In datasets, some values are often missing; these are called nulls, and they are a very common problem. Nulls are like poison for a dataset: because of them, our insights become biased and predictions go wrong. So it is highly recommended to first remove or fill the nulls (which one you choose depends on the domain of your dataset).
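The null counts shown in the original screenshot are typically obtained with isnull().sum(), for example:

# Number of missing values per column
data.isnull().sum()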
In the output, we can see that we have 3 nulls in Brand and 1 in dis.
In the above dataset, we fill the null Brand values with “others”.
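A minimal sketch of this step, assuming the column is named Brand as in the text above:

# Replace missing Brand values with "others"
data['Brand'] = data['Brand'].fillna('others')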
Here we are filling the null dis values with 0%.
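Similarly, assuming the discount column is named dis and stored as a number:

# Replace missing discount values with 0
data['dis'] = data['dis'].fillna(0)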
Sometimes the dataset is large, say around 25,000 rows, and there are only 2 or 3 nulls; in that case we can drop the nulls instead of filling them.
Example: data.dropna(inplace=True) (whether to drop or fill depends on the domain and the dataset)
Filter Data
We can filter our data; for example, below we select only the Samsung brand.
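A sketch of this filter, using the Brand column from earlier:

# Keep only the rows where Brand is Samsung
samsung_data = data[data['Brand'] == 'Samsung']
samsung_data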
Categorizing Data
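The screenshot for this step is not reproduced here; a minimal sketch of common categorization steps, assuming the same Brand column, would be:

# Convert Brand to the category dtype (more memory-efficient for repeated labels)
data['Brand'] = data['Brand'].astype('category')

# Count how many rows belong to each brand
data['Brand'].value_counts()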
Pandas has many more functions; click on the link below for a video tutorial on pandas.