Complete Guide to Pandas DataFrame with real-time use case

Last Updated on September 8, 2022 by Editorial Team

Author(s): Muttineni Sai Rohith

Originally published on Towards AI.

After my PySpark series, where readers were mostly interested in the PySpark DataFrame and PySpark RDD, I received suggestions and requests to write about the Pandas DataFrame, so that one can compare PySpark and Pandas in terms of syntax rather than resource consumption. So in this article, we are going to concentrate on the functionality of the Pandas DataFrame using the Titanic dataset.

The name Pandas is derived from "panel data" and is also read as "Python data analysis". In general terms, Pandas is a Python library used to work with datasets.

A DataFrame in Pandas is a two-dimensional data structure, that is, a table with rows and columns. It provides methods for creating, analyzing, cleaning, exploring, and manipulating data.

Installation:

pip install pandas

Importing pandas:

import pandas as pd
print(pd.__version__)

This will print the Pandas version if the Pandas installation is successful.

Creating DataFrame:

Creating an empty DataFrame:

df = pd.DataFrame()
df.head(5)  # returns the first 5 rows of the DataFrame (none here, since it is empty)

Creating a DataFrame from a dict:

employees = {'Name': ['chandu', 'rohith', 'puppy'], 'Age': [26, 24, 29], 'Salary': [180000, 130000, 240000]}
df = pd.DataFrame(employees)
df.head()

Creating a DataFrame from a list of lists:

employees = [['chandu', 26, 180000], ['rohith', 24, 130000], ['puppy', 29, 240000]]
df = pd.DataFrame(employees, columns=["Name", "Age", "Salary"])
df.head()
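A list of dictionaries is another input that pd.DataFrame() accepts, which is handy when records arrive row by row. The sketch below is only an illustration: the variable names are made up for the example, and it checks that the row-wise and column-wise constructions give the same table.

```python
import pandas as pd

# Hypothetical sample records, mirroring the employee data used above
records = [
    {"Name": "chandu", "Age": 26, "Salary": 180000},
    {"Name": "rohith", "Age": 24, "Salary": 130000},
    {"Name": "puppy", "Age": 29, "Salary": 240000},
]
# List of dicts: each dict is one row, its keys become the column names
df_from_records = pd.DataFrame(records)

# Dict of lists: each key is one column
df_from_dict = pd.DataFrame({
    "Name": ["chandu", "rohith", "puppy"],
    "Age": [26, 24, 29],
    "Salary": [180000, 130000, 240000],
})

# Both constructions describe the same table
same = df_from_records.equals(df_from_dict)
print(same)
```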

Importing data from a CSV file, here the Titanic training dataset:

df = pd.read_csv("/content/titanic_train.csv")
df.head(5)

As shown above, df.head(n) will return the first n rows from DataFrame while df.tail(n) will return the last n rows from DataFrame.

print(df.shape)    # prints the shape of the DataFrame as (rows, columns)
print(df.columns)  # prints the column names

value_counts() returns the unique values in a particular column along with their counts.

df["Embarked"].value_counts()
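value_counts() also takes a few options that are useful when a column has missing values. The sketch below uses a small hypothetical Series in place of the real Embarked column, so the values are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the Embarked column, with one missing value
embarked = pd.Series(["S", "C", "S", "Q", None, "S"], name="Embarked")

# Missing values are excluded by default
counts = embarked.value_counts()

# dropna=False counts the missing values as their own entry
with_missing = embarked.value_counts(dropna=False)

# normalize=True returns relative frequencies instead of counts
shares = embarked.value_counts(normalize=True)

print(counts)
print(with_missing)
print(shares)
```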

df.describe() returns summary statistics for all numerical columns in the DataFrame.

df.describe()

df.info() prints the non-null count and data type of every column in the DataFrame.

df.info()

As seen above, the non-null counts of Age and Cabin are below 891, the total number of rows, so those columns contain missing values. The output also lists the dtype of each column in the DataFrame.

Handling Missing Values

Getting the count of missing values:

df.isnull().sum()

As seen above, the columns "Age", "Cabin" and "Embarked" have missing values.

To get the percentage of missing values:

df.isnull().sum() / df.shape[0] * 100

As we can see, more than 75% of the Cabin values are missing, so let's drop the column.

df = df.drop(['Cabin'], axis=1)

drop() with axis=1 removes the listed columns from the DataFrame.
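The same decision can be automated when there are many columns. This is a minimal sketch on a hypothetical frame, and the 70% cut-off is an assumption chosen for the example rather than a rule from the article.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: "Cabin" is mostly missing, "Age" only partly
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# The mean of a boolean column is the fraction of True values,
# so this gives the percentage of missing values per column
missing_pct = sample.isnull().mean() * 100

threshold = 70  # assumed cut-off, in percent
to_drop = missing_pct[missing_pct > threshold].index.tolist()

# Drop every column whose missing share exceeds the cut-off
cleaned = sample.drop(columns=to_drop)
print(to_drop)
print(cleaned.columns.tolist())
```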

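Imputing Missing Values

Let's impute the missing values in the Age column with the mean value.

df['Age'] = df['Age'].fillna(df['Age'].mean())

And impute the missing values in the Embarked column with the mode value.

df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

In the above example, mode() returns a Series, because a column can have more than one most frequent value, so [0] picks the first one as a scalar. Assigning the result back to the column is preferred over calling fillna(..., inplace=True) on a selected column, which recent pandas versions warn about. All the missing values should now be handled; let's check:

df.isnull().sum()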
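A minimal, self-contained sketch of the impute-then-check step, using a tiny hypothetical frame in place of the Titanic data:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column
sample = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Embarked": ["S", None, "S"],
})

# Numeric column: fill with the mean of the observed values
sample["Age"] = sample["Age"].fillna(sample["Age"].mean())

# Categorical column: fill with the most frequent value
sample["Embarked"] = sample["Embarked"].fillna(sample["Embarked"].mode()[0])

# Every count should now be zero
remaining = sample.isnull().sum()
print(remaining)
```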

Renaming Columns

df=df.rename(columns={'Sex':'Gender','Name':'Full Name'})
df.head()

Adding/Modifying Columns

df['last_name']=df['Full Name'].apply(lambda x: x.split(',')[0])
df['first_name']=df['Full Name'].apply(lambda x: ' '.join(x.split(',')[1:]))
df.head(5)
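For string columns, the vectorized .str accessor is an alternative to apply() with a lambda. The sketch below assumes names in the "Last, Title First" format and uses two sample values; the frame name is hypothetical.

```python
import pandas as pd

# Two sample names in "Last, Title First" format
names = pd.DataFrame({
    "Full Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
})

# Split on the first comma only; expand=True returns one column per part
parts = names["Full Name"].str.split(",", n=1, expand=True)

# strip() removes the space left over after the comma
names["last_name"] = parts[0].str.strip()
names["first_name"] = parts[1].str.strip()
print(names)
```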

Adding Rows: DataFrame.append() was removed in pandas 2.0, so we use pd.concat() to add the rows

row = {'Age': 24, 'Full Name': 'Rohith', 'Survived': 1}
df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)
df.tail()

A new row is created, and NaN values are filled in for the columns that were given no value.

Using the .loc indexer:

df.loc[len(df.index)]=row
df.tail()

Deleting Rows

Using df.index to get the label of the last row:

df = df.drop(df.index[-1], axis=0)  # deletes the last row
df.tail()

Encoding Columns

Most machine learning algorithms require numerical data rather than strings, so encoding categorical columns is an essential step.

df['Gender'] = df['Gender'].map({'male': 0, 'female': 1})
df.head(5)

Writing such a mapping by hand quickly becomes tedious when there are many columns. Encoders such as LabelEncoder, OneHotEncoder, and MultiColumnLabelEncoder can encode a DataFrame instead. They are explained in the article below:

Encoding Methods to encode Categorical data in Machine Learning

Filtering the Data

Selecting the rows where Age is greater than 25:

df[df["Age"] > 25].head(5)

Similarly, we can use the <, >=, <=, == and != operators, and combine conditions with & (and) and | (or), wrapping each condition in parentheses.

df[(df["Age"] < 25) & (df["Gender"] == 1)].head(5)

This selects the rows where Age is less than 25 and Gender is 1. In the same way, we can filter on multiple columns.
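A few more filtering patterns, sketched on a small hypothetical frame. The column names follow the article, but the values and variable names are made up for the example.

```python
import pandas as pd

# Hypothetical passengers
people = pd.DataFrame({
    "Age": [22, 38, 4, 61, 27],
    "Gender": [0, 1, 1, 0, 1],
    "Embarked": ["S", "C", "S", "Q", "S"],
})

# OR uses |, and each condition needs its own parentheses
young_or_old = people[(people["Age"] < 15) | (people["Age"] >= 60)]

# NOT uses ~
not_from_s = people[~(people["Embarked"] == "S")]

# isin() tests membership in a list of values
from_s_or_c = people[people["Embarked"].isin(["S", "C"])]

# between() is inclusive on both ends by default
twenties = people[people["Age"].between(20, 29)]

# query() expresses the same kind of filter as a string
young_women = people.query("Age < 25 and Gender == 1")
print(young_women)
```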

apply() function:

Let's assume that people younger than 15 or aged 60 and over are going to be saved first. Let's make a save_first column using the apply() function.

def save_first(age):
    # 1 for children under 15 and adults aged 60 or over, 0 otherwise
    if age < 15 or age >= 60:
        return 1
    return 0

df['save_first'] = df['Age'].apply(save_first)
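For a simple rule like this one, a vectorized boolean expression gives the same column without calling a Python function per row. A minimal sketch on hypothetical ages:

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including the boundary values 15 and 60
ages = pd.DataFrame({"Age": [4, 15, 30, 59, 60, 72]})

# Boolean mask converted to 0/1
mask = (ages["Age"] < 15) | (ages["Age"] >= 60)
ages["save_first"] = mask.astype(int)

# np.where produces the same result and allows arbitrary labels
ages["save_first_np"] = np.where(mask, 1, 0)
print(ages)
```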

Selecting particular Columns and Rows:

df_1 = df[['Age', 'Survived', 'Gender']]
df_1.head()

Using .iloc: it uses integer positions to return particular rows and columns of the DataFrame.

df_2 = df.iloc[0:100, :]
df_2.head()

returns the first 100 rows and all columns of the DataFrame.

df_2 = df.iloc[0:100, [0, 1, 2]]
df_2.head()

returns the first 100 rows and the first 3 columns of the DataFrame.

.loc: similar to .iloc, but it selects by index labels and column names instead of integer positions. Unlike .iloc, a .loc slice includes its end label, so 0:100 selects the labels 0 through 100.

df_2 = df.loc[0:100, ['Age', 'Survived', 'Gender']]
df_2.head()
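The difference between positional and label-based selection is easy to check on a small frame. This is a sketch with hypothetical data; it also shows that the two indexers diverge once the rows are reordered.

```python
import pandas as pd

# Hypothetical frame with the default index labels 0..9
frame = pd.DataFrame({"Age": range(10, 20), "Survived": [0, 1] * 5})

# iloc slices by position and excludes the end: positions 0, 1, 2
by_position = frame.iloc[0:3]

# loc slices by label and includes the end: labels 0, 1, 2, 3
by_label = frame.loc[0:3]

# After sorting, positions and labels no longer line up
shuffled = frame.sort_values("Age", ascending=False)
first_by_position = shuffled.iloc[0]["Age"]  # row now in first position
age_at_label_zero = shuffled.loc[0, "Age"]   # row still labelled 0

print(len(by_position), len(by_label))
print(first_by_position, age_at_label_zero)
```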

Sorting

We can sort a DataFrame using the sort_values() method.

df = df.sort_values(by=['Age'], ascending=False)
df.head()

We can also sort by multiple columns: rows are sorted by the first column, and ties are broken by the second column.

df = df.sort_values(by=['Age', 'Survived'], ascending=False)
df[15:20]
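The ascending argument also accepts one flag per column, which allows mixed sort directions. A minimal sketch with hypothetical values:

```python
import pandas as pd

# Hypothetical frame with ties in Age
passengers = pd.DataFrame({
    "Age": [30, 22, 30, 22],
    "Survived": [0, 1, 1, 0],
})

# Age descending, then Survived ascending within each age
ordered = passengers.sort_values(
    by=["Age", "Survived"], ascending=[False, True]
)

# Renumber the rows so the index follows the new order
ordered = ordered.reset_index(drop=True)
print(ordered)
```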

Join

A join combines multiple DataFrames based on a particular column. In the snippets below, df1 and df2 stand for any two DataFrames and column_name for a column they share.

Let's look at 5 types of joins: cross, inner, left, right, and outer.

cross join: also known as a cartesian join, it returns all combinations of rows from the two tables.

cross_join = pd.merge(df1, df2, how='cross')

inner join: returns only those rows that have matching values in both DataFrames.

inner_join = pd.merge(df1, df2, how='inner', on='column_name')

left join: returns all the rows from the first DataFrame. Where there is no matching value, the columns from the second DataFrame are filled with null values.

left_join = pd.merge(df1, df2, how='left', on='column_name')

right join: returns all the rows from the second DataFrame. Where there is no matching value, the columns from the first DataFrame are filled with null values.

right_join = pd.merge(df1, df2, how='right', on='column_name')

outer join: returns all the rows from both DataFrames. Rows of the first DataFrame with no match get null values in the second DataFrame's columns, and vice versa.

outer_join = pd.merge(df1, df2, how='outer', on='column_name')
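The join types can be tried on two small frames. In this sketch, passengers, tickets, and their columns are hypothetical; the two frames share the keys 2 and 3 only, which makes the row counts easy to predict.

```python
import pandas as pd

# Hypothetical frames: keys 1-3 on the left, 2-4 on the right
passengers = pd.DataFrame({"PassengerId": [1, 2, 3], "Name": ["A", "B", "C"]})
tickets = pd.DataFrame({"PassengerId": [2, 3, 4], "Fare": [10.0, 20.0, 30.0]})

# Only keys present in both frames: 2 and 3
inner = pd.merge(passengers, tickets, how="inner", on="PassengerId")

# All left keys: 1, 2, 3 (Fare is NaN for key 1)
left = pd.merge(passengers, tickets, how="left", on="PassengerId")

# All right keys: 2, 3, 4 (Name is NaN for key 4)
right = pd.merge(passengers, tickets, how="right", on="PassengerId")

# Union of keys: 1, 2, 3, 4
outer = pd.merge(passengers, tickets, how="outer", on="PassengerId")

# Every left row paired with every right row: 3 x 3 rows, no key needed
cross = pd.merge(passengers, tickets, how="cross")

print(len(inner), len(left), len(right), len(outer), len(cross))
```

Note that how="cross" takes no on argument, and that it requires pandas 1.2 or later.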

groupby()

This method groups the rows of a DataFrame based on the values of one or more columns.

groups = df.groupby('Survived')
groups.get_group(1)

The get_group() method returns the rows belonging to a given group.

groupby() is generally used with aggregation functions such as mean(), count(), min(), and max().

groups['Age'].mean()
groups['Age'].count()
groups['Age'].min()
groups['Age'].max()

Using the .agg() function

group_agg = df.groupby('Survived').agg({'Age': 'mean'})
group_agg.head()
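Several statistics can be computed in one call with named aggregation, where each keyword becomes an output column. This is a minimal sketch on hypothetical data; the output column names are chosen for the example.

```python
import pandas as pd

# Hypothetical passengers
sample = pd.DataFrame({
    "Survived": [0, 0, 1, 1, 1],
    "Age": [40.0, 20.0, 10.0, 30.0, 50.0],
})

# Each keyword is output_column=(input_column, aggregation)
summary = sample.groupby("Survived").agg(
    mean_age=("Age", "mean"),
    youngest=("Age", "min"),
    oldest=("Age", "max"),
    passengers=("Age", "count"),
)
print(summary)
```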

I think I have covered most of the basic concepts related to the Pandas DataFrame.

Happy Coding...


Complete Guide to Pandas DataFrame with real-time use case was originally published in Towards AI on Medium.