15 Steps to Getting Started with Pandas — Complete Beginner’s Guide
Last Updated on July 17, 2023 by Editorial Team
Author(s): Fares Sayah
Originally published on Towards AI.
Essential Pandas functions for working with data — Read, Write and Manipulate Data
Pandas is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool, built on top of the Python programming language. Mastering Pandas will take your analysis skills to the next level and knowing best practices will save you a lot of time and energy.
In this article, we will explore the Pandas library and how to use it to handle different types of data that you may face during your analysis. By the end of the tutorial, you’ll be more fluent in using Pandas functionalities.
Links to Data used in this Article:
To read the data, you just need to paste the link into the pd.read_csv() function.
Table of Contents
1. How to read/write a tabular data file using Pandas?
2. How do I select a pandas Series from a DataFrame?
3. How do I rename columns in a pandas DataFrame?
4. How do I remove columns from Pandas DataFrames?
5. How do I sort a Pandas DataFrames or Series?
6. How do I filter rows of a Pandas DataFrames by column value?
7. How do I use string methods in pandas?
8. How do I change the data type of a pandas Series?
9. When should I use a “groupby” in pandas?
10. How do I handle missing values in pandas?
11. What do I need to know about the Pandas index?
12. How do I select multiple rows and columns from a pandas DataFrame?
13. How do I work with dates and times in pandas?
14. How to find and remove duplicate rows in pandas?
15. How do I apply a function to a pandas Series or DataFrame?
1. How to read/write a tabular data file using Pandas?
pandas.read_csv() is a simple and flexible way to read a CSV file, and its many parameters cover most cases. To read only the columns we need, pass a list of column names to usecols. We can also limit the number of rows read by passing a number to nrows.
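As a minimal sketch of the parameters described above, the example below reads from an in-memory string instead of a file or URL. The CSV content and the variable names are assumptions made for illustration; they are not the article's dataset. Writing with to_csv() is included because the section title mentions writing.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file path or URL.
csv_text = """City,Colors Reported,Shape Reported,State,Time
Ithaca,,TRIANGLE,NY,6/1/1930 22:00
Willingboro,,OTHER,NJ,6/30/1930 20:00
Holyoke,,OVAL,CO,2/15/1931 14:00
Abilene,,DISK,KS,6/1/1931 13:00
"""

# Read only two columns and only the first three data rows.
df = pd.read_csv(io.StringIO(csv_text), usecols=["City", "State"], nrows=3)
print(df)

# Write the result back out as CSV text; index=False omits the row labels.
# Passing a file path as the first argument would write to disk instead.
out = df.to_csv(index=False)
print(out)
```

With a real file, the first argument of read_csv() would simply be the path or URL.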
2. How do I select a Pandas Series from a DataFrame?
We can select a pandas Series by accessing the column directly, e.g. df['City']. Another way is to access the column as an attribute, e.g. df.City, but in this case the column name must follow Python's variable naming rules (no spaces, begins with a letter or underscore, and so on) and must not clash with an existing DataFrame attribute or method.
City Shape Reported State
0 Ithaca TRIANGLE NY
1 Willingboro OTHER NJ
2 Holyoke OVAL CO
3 Abilene DISK KS
4 New York Worlds Fair LIGHT NY

0 Ithaca
1 Willingboro
2 Holyoke
3 Abilene
4 New York Worlds Fair
Name: City, dtype: object
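The two access styles can be sketched as follows. The small DataFrame is hand-made for illustration and only mimics a few rows of the output above.

```python
import pandas as pd

# Hypothetical three-row frame, not the article's full dataset.
df = pd.DataFrame({
    "City": ["Ithaca", "Willingboro", "Holyoke"],
    "Shape Reported": ["TRIANGLE", "OTHER", "OVAL"],
    "State": ["NY", "NJ", "CO"],
})

city = df["City"]               # bracket notation works for any column name
state = df.State                # attribute access needs a valid identifier
shape = df["Shape Reported"]    # the space rules out attribute access here

print(type(city))
print(city)
```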
3. How do I rename columns in a Pandas DataFrame?
One way of renaming columns in a pandas DataFrame is the rename() method. It is especially useful when only some columns need renaming, because we specify new names only for those columns.
Index(['City', 'Colors Reported', 'Shape Reported', 'State', 'Time',
'Location'],
dtype='object')

Index(['City', 'Colors_Reported', 'Shape_Reported', 'State', 'Time',
'Location'],
dtype='object')

Index(['city', 'colors reported', 'shape reported', 'state', 'time',
'location'],
dtype='object')

Index(['city', 'colors_reported', 'shape_reported', 'state', 'time',
'location'],
dtype='object')
The columns can also be renamed by directly assigning a list of new names to the columns attribute of the DataFrame. The disadvantage of this method is that we must provide names for all the columns, even if we want to rename only some of them.
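Both renaming approaches can be sketched on an empty frame that has only column names. The frame and the new names are assumptions chosen to resemble the outputs above.

```python
import pandas as pd

# Hypothetical frame with column names only; no data is needed to rename.
df = pd.DataFrame(columns=["City", "Colors Reported", "Shape Reported", "State"])

# Rename selected columns only; rename() returns a new DataFrame.
renamed = df.rename(columns={"Colors Reported": "Colors_Reported",
                             "Shape Reported": "Shape_Reported"})

# Replace every name at once by assigning to the columns attribute.
lowered = df.copy()
lowered.columns = ["city", "colors reported", "shape reported", "state"]

# Apply a string method to all names: spaces become underscores.
lowered.columns = lowered.columns.str.replace(" ", "_", regex=False)

print(renamed.columns)
print(lowered.columns)
```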
4. How do I remove columns from Pandas DataFrames?
Dropping one or more columns from a DataFrame can be achieved in multiple ways. The most common is the .drop() method, which can drop multiple columns or rows.
Index(['City', 'Colors Reported', 'Shape Reported', 'State', 'Time',
'Location'],
dtype='object')

Index(['City', 'Shape Reported', 'State', 'Time', 'Location'], dtype='object')

Index(['Shape Reported', 'State', 'Time'], dtype='object')
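A short sketch of .drop() on a hand-made frame follows. The data is hypothetical, and the columns= and index= keywords are used here as one common way to say what is being dropped.

```python
import pandas as pd

# Hypothetical two-row frame for illustration.
df = pd.DataFrame({
    "City": ["Ithaca", "Willingboro"],
    "Colors Reported": [None, None],
    "Shape Reported": ["TRIANGLE", "OTHER"],
    "State": ["NY", "NJ"],
})

one_dropped = df.drop(columns="Colors Reported")            # a single column
two_dropped = df.drop(columns=["City", "Colors Reported"])  # several columns
rows_dropped = df.drop(index=[0])                           # rows, by index label

print(one_dropped.columns)
print(two_dropped.columns)
print(rows_dropped)
```

Each call returns a new DataFrame and leaves df unchanged.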
5. How do I sort a Pandas DataFrames or Series?
To sort a pandas DataFrame, we use the .sort_values() method. It can sort values in ascending or descending order.
star_rating title duration
0 9.3 The Shawshank Redemption 142
1 9.2 The Godfather 175
2 9.1 The Godfather: Part II 200
3 9.0 The Dark Knight 152
4 8.9 Pulp Fiction 154

star_rating title duration
941 7.4 A Bridge Too Far 175
938 7.4 Alice in Wonderland 75
975 7.4 Back to the Future Part III 118
933 7.4 Beetlejuice 92
972 7.4 Blue Valentine 112
We can sort by multiple criteria by passing a list of the columns to sort by.
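The sorting options above can be sketched as follows. The five-row frame is hypothetical and is not the movie dataset used for the outputs above.

```python
import pandas as pd

# Hypothetical movie data; two rows share a rating so the second key matters.
movies = pd.DataFrame({
    "title": ["The Godfather", "Pulp Fiction", "The Dark Knight",
              "Beetlejuice", "12 Angry Men"],
    "star_rating": [9.2, 8.9, 9.0, 7.4, 8.9],
    "duration": [175, 154, 152, 92, 96],
})

by_title = movies.sort_values("title")                        # ascending
longest_first = movies.sort_values("duration", ascending=False)

# Sort by rating (descending), breaking ties by duration (ascending).
multi = movies.sort_values(["star_rating", "duration"],
                           ascending=[False, True])
print(multi)
```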
6. How do I filter rows of a Pandas DataFrames by column value?
Filtering is a common operation in data analysis, and Pandas provides a variety of ways to filter data points. Here we use logical operators, alone and in combination. There are other filtering techniques as well, such as .isin() and .query().
To filter by multiple criteria, use '&' and '|' instead of 'and' and 'or', and wrap each condition in parentheses. If we have a long chain of 'or' conditions on the same column, we can use the isin() method instead.
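The filtering techniques mentioned in this section can be sketched on a small hypothetical frame. The column names and thresholds are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical movie data for illustration.
movies = pd.DataFrame({
    "title": ["The Godfather", "Pulp Fiction", "Beetlejuice", "Alice in Wonderland"],
    "genre": ["Crime", "Crime", "Comedy", "Animation"],
    "duration": [175, 154, 92, 75],
})

# A single condition produces a boolean Series used as a row mask.
long_movies = movies[movies["duration"] >= 150]

# Combine conditions with & (and) or | (or); each one needs parentheses
# because & and | bind more tightly than comparison operators.
long_crime = movies[(movies["duration"] >= 150) & (movies["genre"] == "Crime")]
short_or_comedy = movies[(movies["duration"] < 80) | (movies["genre"] == "Comedy")]

# isin() replaces a chain of "or" comparisons on the same column.
selected = movies[movies["genre"].isin(["Crime", "Animation"])]

# query() expresses the same filter as a string.
queried = movies.query("duration >= 150 and genre == 'Crime'")
print(queried)
```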
7. How do I use string methods in pandas?
String methods are available on a Series or an Index through the .str accessor. They are especially useful for cleaning up or transforming DataFrame columns.
0 CHIPS AND FRESH TOMATO SALSA
1 IZZE
2 NANTUCKET NECTAR
3 CHIPS AND TOMATILLO-GREEN CHILI SALSA
4 CHICKEN BOWL
Name: item_name, dtype: object

0 chips and fresh tomato salsa
1 izze
2 nantucket nectar
3 chips and tomatillo-green chili salsa
4 chicken bowl
Name: item_name, dtype: object

0 False
1 False
2 False
3 False
4 False
Name: item_name, dtype: bool
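A minimal sketch of the .str accessor follows. The item names and price strings are hypothetical values that resemble the outputs above.

```python
import pandas as pd

# Hypothetical item names for illustration.
items = pd.Series(["Chips and Fresh Tomato Salsa", "Izze", "Chicken Bowl"],
                  name="item_name")

upper = items.str.upper()
lower = items.str.lower()
has_chicken = items.str.contains("Chicken")   # boolean Series

# String methods can be chained; each step goes through .str again.
prices = pd.Series(["$2.39 ", "$3.39 "])
cleaned = prices.str.replace("$", "", regex=False).str.strip()

print(upper)
print(has_chicken)
print(cleaned)
```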
8. How do I change the data type of a pandas Series?
To check the types of your data, use .dtypes; it returns a pandas Series that maps each column to its dtype. The simplest way to convert a pandas column to a different type is astype().
order_id int64
quantity int64
item_name object
choice_description object
item_price object
dtype: object

dtype('float64')
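The sketch below shows a type conversion on a hypothetical orders frame. Stripping the dollar sign before converting is an assumption about how price strings like these would need to be cleaned.

```python
import pandas as pd

# Hypothetical order data; item_price is stored as text.
orders = pd.DataFrame({
    "order_id": [1, 1, 2],
    "quantity": [1, 1, 2],
    "item_price": ["$2.39", "$3.39", "$16.98"],
})
print(orders.dtypes)

# Remove the currency symbol, then convert the text to floats.
orders["item_price"] = (
    orders["item_price"].str.replace("$", "", regex=False).astype(float)
)

# Numeric identifiers can be converted to strings the same way.
orders["order_id"] = orders["order_id"].astype(str)

print(orders["item_price"].dtype)
```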
9. When should I use a “groupby” in pandas?
groupby() groups a DataFrame or Series using a mapper or by a Series of columns. A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.
genre
Action 126.485294
Adventure 134.840000
Animation 96.596774
Biography 131.844156
Comedy 107.602564
Crime 122.298387
Drama 126.539568
Family 107.500000
Fantasy 112.000000
Film-Noir 97.333333
History 66.000000
Horror 102.517241
Mystery 115.625000
Sci-Fi 109.000000
Thriller 114.200000
Western 136.666667
Name: duration, dtype: float64

count mean max min
genre
Action 136 126.485294 205 80
Adventure 75 134.840000 224 89
Animation 62 96.596774 134 75
Biography 77 131.844156 202 85
Comedy 156 107.602564 187 68
Crime 124 122.298387 229 67
Drama 278 126.539568 242 64
Family 2 107.500000 115 100
Fantasy 1 112.000000 112 112
Film-Noir 3 97.333333 111 88
History 1 66.000000 66 66
Horror 29 102.517241 146 70
Mystery 16 115.625000 160 69
Sci-Fi 5 109.000000 132 91
Thriller 5 114.200000 120 107
Western 9 136.666667 175 85
Multiple aggregation functions can be applied simultaneously.
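The split, apply, combine pattern can be sketched on a small hypothetical frame. The genres and durations are made-up values, not the dataset behind the tables above.

```python
import pandas as pd

# Hypothetical movie data for illustration.
movies = pd.DataFrame({
    "genre": ["Crime", "Crime", "Comedy", "Comedy", "Action"],
    "duration": [175, 154, 92, 108, 152],
})

# Split by genre, take the mean duration of each group, combine into a Series.
mean_duration = movies.groupby("genre")["duration"].mean()

# Several aggregations at once produce one column per function.
summary = movies.groupby("genre")["duration"].agg(["count", "mean", "max", "min"])

print(mean_duration)
print(summary)
```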
10. How do I handle missing values in pandas?
Missing data is a very common problem in real-life scenarios. In pandas, missing data is represented by two values: NaN or None. Pandas has several useful methods for detecting, removing, and replacing null values in a DataFrame: .isna() to find NaN, .dropna() to remove NaN, and .fillna() to fill NaN with a specific value.
(18241, 6)

City 25
Colors Reported 15359
Shape Reported 2644
State 0
Time 0
Location 25
dtype: int64

(2486, 6)
(2486, 6)
(18237, 6)
2644
0

VARIOUS 2977
LIGHT 2803
DISK 2122
TRIANGLE 1889
OTHER 1402
Name: Shape Reported, dtype: int64
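The three methods can be sketched on a tiny hypothetical frame. The how= and subset= arguments and the fill value "VARIOUS" are illustrative choices, not necessarily the calls that produced the numbers above.

```python
import numpy as np
import pandas as pd

# Hypothetical sightings with a few missing entries.
ufo = pd.DataFrame({
    "City": ["Ithaca", None, "Holyoke", "Abilene"],
    "Colors Reported": [np.nan, np.nan, "RED", "GREEN"],
    "Shape Reported": ["TRIANGLE", "OTHER", np.nan, "DISK"],
})

# Count missing values per column.
missing_per_column = ufo.isna().sum()

any_dropped = ufo.dropna(how="any")   # drop rows with at least one missing value
all_dropped = ufo.dropna(how="all")   # drop rows where every value is missing

# Only consider some columns when deciding which rows to drop.
subset_dropped = ufo.dropna(subset=["City", "Shape Reported"], how="any")

# Replace missing shapes with a placeholder value.
filled = ufo["Shape Reported"].fillna("VARIOUS")

print(missing_per_column)
print(filled.value_counts())
```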
11. What do I need to know about the Pandas index?
It is common in tabular data to use an integer index running from 0 to len(data) - 1. For specific cases (like time series data), we need to change the index to something more meaningful. To set an index, we simply pass the column name to .set_index().
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8,
9,
...
18231, 18232, 18233, 18234, 18235, 18236, 18237, 18238, 18239,
18240],
dtype='int64', length=18241)

Index([ 'Ithaca , NY', 'Willingboro , NJ',
'Holyoke , CO', 'Abilene , KS',
'New York Worlds Fair , NY', 'Valley City , ND',
'Crater Lake , CA', 'Alma , MI',
'Eklutna , AK', 'Hubbard , OR',
...
'Pismo Beach , CA', 'Lodi , WI',
'Anchorage , AK', 'Capitola , CA',
'Fountain Hills , AZ', 'Grant Park , IL',
'Spirit Lake , IA', 'Eagle River , WI',
'Eagle River , WI', 'Ybor , FL'],
dtype='object', name='Location', length=18241)
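Setting and resetting the index can be sketched as follows, using a hypothetical three-row frame.

```python
import pandas as pd

# Hypothetical frame for illustration.
ufo = pd.DataFrame({
    "City": ["Ithaca", "Willingboro", "Holyoke"],
    "State": ["NY", "NJ", "CO"],
})
print(ufo.index)   # the default integer index, 0 to len(ufo) - 1

# Use the City column as the index; set_index() returns a new DataFrame.
by_city = ufo.set_index("City")

# Rows can now be looked up by city name.
state_of_holyoke = by_city.loc["Holyoke", "State"]

# reset_index() moves the index back into an ordinary column.
restored = by_city.reset_index()

print(by_city)
print(state_of_holyoke)
```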
12. How do I select multiple rows and columns from a pandas DataFrame?
Pandas is built on top of NumPy, so it tries to follow NumPy's slicing conventions. iloc selects by integer position and behaves like indexing a NumPy array: the end of a slice is excluded. This is not the case for loc, which selects by label and includes both endpoints of a slice.
City Holyoke
Shape Reported OVAL
State CO
Name: 2, dtype: object

City Shape Reported State
0 Ithaca TRIANGLE NY
1 Willingboro OTHER NJ
2 Holyoke OVAL CO

City State
0 Ithaca NY
1 Willingboro NJ
2 Holyoke CO
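The difference between label-based and position-based selection can be sketched on a hypothetical four-row frame with the default integer index.

```python
import pandas as pd

# Hypothetical frame for illustration.
ufo = pd.DataFrame({
    "City": ["Ithaca", "Willingboro", "Holyoke", "Abilene"],
    "Shape Reported": ["TRIANGLE", "OTHER", "OVAL", "DISK"],
    "State": ["NY", "NJ", "CO", "KS"],
})

row_two = ufo.loc[2, :]                        # the row labeled 2, all columns
first_three = ufo.loc[0:2, :]                  # label slice: label 2 is included
city_state = ufo.loc[0:2, ["City", "State"]]   # rows by label, columns by name
by_position = ufo.iloc[0:2, 0:2]               # position slice: position 2 is excluded

print(first_three)
print(by_position)
```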
13. How do I work with dates and times in pandas?
A datetime value combines a date and a time in the format “yyyy-mm-dd HH:MM:SS”, where yyyy-mm-dd is the date and HH:MM:SS is the time. Storing our dates as datetime64 values gives us access to a lot of date and time information through the .dt accessor.
pd.to_datetime() converts strings representing dates into datetime64[ns] values.
0 1930-06-01 22:00:00
1 1930-06-30 20:00:00
2 1931-02-15 14:00:00
3 1931-06-01 13:00:00
4 1933-04-18 19:00:00
Name: Time, dtype: datetime64[ns]

0 22
1 20
2 14
3 13
4 19
Name: Time, dtype: int64

0 Sunday
1 Monday
2 Sunday
3 Monday
4 Tuesday
Name: Time, dtype: object
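A minimal sketch of parsing and using datetimes follows. The input strings and their month/day/year format are assumptions about what the raw Time column might look like.

```python
import pandas as pd

# Hypothetical raw time strings; the format is assumed to be month/day/year.
times = pd.Series(["6/1/1930 22:00", "6/30/1930 20:00", "2/15/1931 14:00"],
                  name="Time")

# Parse the strings; an explicit format avoids ambiguous guesses.
parsed = pd.to_datetime(times, format="%m/%d/%Y %H:%M")

hours = parsed.dt.hour            # hour of day as integers
weekdays = parsed.dt.day_name()   # weekday names as strings

# Datetimes can be compared against a Timestamp to filter rows.
after_1930 = parsed[parsed >= pd.Timestamp("1931-01-01")]

print(parsed)
print(weekdays)
```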
14. How to find and remove duplicate rows in pandas?
An important part of data analysis is finding duplicate values and removing them. The pandas duplicated() method only helps with finding them. It returns a boolean Series that is True for every row that repeats an earlier row; by default, the first occurrence is marked False. To remove the duplicates, use drop_duplicates().
(943, 4)
148
7
(936, 4)
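Finding and removing duplicates can be sketched on a small hypothetical frame. The keep= values shown are two of the available options.

```python
import pandas as pd

# Hypothetical user data in which the first row appears three times.
users = pd.DataFrame({
    "age": [24, 53, 24, 33, 24],
    "zip_code": ["85711", "94043", "85711", "15213", "85711"],
})

is_dup = users.duplicated()   # True for rows that repeat an earlier row
n_dup = is_dup.sum()          # number of duplicate rows

# duplicated() also works on a single column.
n_dup_zip = users["zip_code"].duplicated().sum()

deduped = users.drop_duplicates(keep="first")   # keep the first occurrence
none_kept = users.drop_duplicates(keep=False)   # drop every repeated row

print(is_dup)
print(deduped.shape)
```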
15. How do I apply a function to a pandas Series or DataFrame?
Series.apply() lets users pass a function and apply it to every single value of a pandas Series. DataFrame.apply() applies a function along an axis of the DataFrame, that is, to each column or each row.
age gender occupation zip_code
user_id
1 24 M technician 85711
2 53 F other 94043
3 23 M writer 32067
4 24 M technician 43537
5 33 F other 15213

user_id
1 1
2 0
3 1
4 1
5 0
Name: gender, dtype: int64

Man 889
Child 54
Name: age, dtype: int64
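Applying functions can be sketched as follows. The frame, the age_group helper, and its "Adult"/"Child" labels are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical user data for illustration.
users = pd.DataFrame({
    "age": [24, 53, 15, 33],
    "gender": ["M", "F", "M", "F"],
})

# map() translates each value through a dictionary.
gender_code = users["gender"].map({"F": 0, "M": 1})


def age_group(age):
    """Hypothetical helper: label an age as Adult or Child."""
    return "Adult" if age >= 18 else "Child"


# Series.apply() calls the function once per value.
age_groups = users["age"].apply(age_group)

# DataFrame.apply() calls the function once per column by default.
spans = users[["age"]].apply(lambda col: col.max() - col.min())

print(gender_code)
print(age_groups.value_counts())
```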
Conclusion
Mastering Pandas will take your analysis skills to the next level and knowing best practices will save you a lot of time and energy. In this article we covered:
- 15 hands-on recipes to quickly start using Pandas. All of them are useful and come in handy for particular cases.
- Pandas is a powerful library for both data analysis and manipulation. It provides numerous functions and methods to handle data in tabular form. As with any other tool, the best way to learn Pandas is through practicing.
Thank you for reading. Please let me know if you have any feedback or suggestions.
Published via Towards AI