Text Analysis with Pandas Guide
Last Updated on July 25, 2023 by Editorial Team
Author(s): Fares Sayah
Originally published on Towards AI.
Hands-On guide on how to use Pandas to perform analysis on textual data
Raw data often arrives in a form that makes analysis difficult. Python provides many built-in methods for manipulating string objects.
We can write our own functions to manipulate strings and apply them with DataFrame.apply(), but this can be slow on large datasets.
Instead, we can use the string methods built into pandas. Some of them are presented in this article; if you need more, check the documentation.
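To make the comparison concrete, here is a minimal sketch of both approaches on a small Series. The sample values and the helper name shout are assumptions made for illustration; they are not taken from the original code.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, invented for this illustration
names = pd.Series(["lev gor'kov", np.nan, "brillouin"])


def shout(value):
    # A custom function has to deal with missing values itself,
    # because NaN is a float and has no .upper() method
    return value.upper() if isinstance(value, str) else value


# Approach 1: call the Python function once per element
via_apply = names.apply(shout)

# Approach 2: use the built-in string method, which leaves missing values missing
via_str = names.str.upper()

print(via_apply)
print(via_str)
```

Both calls should give the same uppercase strings. The .str version is shorter and does not need an explicit check for missing values.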
Table of Contents
1. Manipulate Case
2. Split Strings
3. Replace String
4. Concatenate
5. Additional Methods:
6. Information Extraction from Text
1. Manipulate Case
Pandas provides several methods for string manipulation, available through the .str accessor of a Series or Index:
.lower(): Converts all uppercase characters in each string to lowercase and returns the lowercase strings.
.upper(): Converts all lowercase characters in each string to uppercase and returns the uppercase strings.
.strip(): Removes spaces from the beginning and end of each string.
.islower(): Checks whether all characters in each string are lowercase and returns a Boolean value.
.isupper(): Checks whether all characters in each string are uppercase and returns a Boolean value.
.isnumeric(): Checks whether all characters in each string are numeric and returns a Boolean value.
.swapcase(): Swaps the case of each character, lower to upper and vice versa.
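The original code for the results printed in this section is not shown, so the following is only a sketch of how such results could be produced. The Series contents are assumed from the printed values, and .capitalize() and .title() are the pandas methods assumed to stand behind "Uppercase the first letter" and "Uppercase the first letter of each word".

```python
import numpy as np
import pandas as pd

# Sample data assumed from the printed results in this section
names = pd.Series(
    ["lev gor'kov", np.nan, "brillouin", "albert einstein", "carl m. bender"]
)

lower = names.str.lower()             # all letters lowercase
upper = names.str.upper()             # all letters uppercase
capitalized = names.str.capitalize()  # first letter of each string uppercase
titled = names.str.title()            # first letter of each word uppercase

print("Lowercase all letters:")
print(lower)
print("Uppercase all letters:")
print(upper)
print("Uppercase the first letter:")
print(capitalized)
print("Uppercase the first letter of each word:")
print(titled)
```

Missing values stay missing in every result; the string methods skip them instead of raising an error.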
Lowercase all letters:
0 lev gor'kov
1 NaN
2 brillouin
3 albert einstein
4 carl m. bender
dtype: object
Uppercase all letters:
0 LEV GOR'KOV
1 NaN
2 BRILLOUIN
3 ALBERT EINSTEIN
4 CARL M. BENDER
dtype: object
Uppercase the first letter:
0 Lev gor'kov
1 NaN
2 Brillouin
3 Albert einstein
4 Carl m. bender
dtype: object
Uppercase the first letter of each word:
0 Lev Gor'Kov
1 NaN
2 Brillouin
3 Albert Einstein
4 Carl M. Bender
dtype: object
2. Split Strings
.split(" "): Splits each string on the given separator. In the example below, we have a pandas Series of physicist names that we want to separate into First Name and Last Name and format nicely (title case). Using expand=True returns the result as a DataFrame, which can easily be concatenated with another DataFrame.
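A minimal sketch of this split follows. The original code is not shown, so the use of n=1 (split only at the first space) and the order of the calls are assumptions chosen to be consistent with a two-column result.

```python
import numpy as np
import pandas as pd

# Sample data assumed from the printed results in this section
names = pd.Series(
    ["lev gor'kov", np.nan, "brillouin", "albert einstein", "carl m. bender"]
)

# Title-case first, then split at the first space only (n=1).
# expand=True returns a DataFrame with one column per piece.
split_names = names.str.title().str.split(" ", n=1, expand=True)
split_names.columns = ["First Name", "Last Name"]

print("Before Splitting:")
print(names)
print("After Splitting:")
print(split_names)
```

Because n=1 limits the split to the first space, a middle initial stays with the last name, and a single-word name gets a missing value in the second column.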
Before Splitting:
0 lev gor'kov
1 NaN
2 brillouin
3 albert einstein
4 carl m. bender
dtype: object
After Splitting:
First Name Last Name
0 Lev Gor'Kov
1 NaN NaN
2 Brillouin None
3 Albert Einstein
4 Carl M. Bender
3. Replace String
When working with text data, you will often want to remove some characters or words from the text.
.replace(a, b): Replaces the value a with the value b. In the example below, we replace Dr. and Pr. with an empty string.
If the text you want to remove or replace does not follow a fixed form, you can use regular expressions.
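Both variants can be sketched as follows. The Series contents are assumed from the printed "Before Replacing" values, and the regular expression is an illustrative assumption, not the pattern used in the original article.

```python
import numpy as np
import pandas as pd

# Sample data assumed from the "Before Replacing" output in this section
names = pd.Series(
    ["lev gor'kov", np.nan, "Dr. brillouin", "Pr. albert einstein", "carl m. bender"]
)

# Literal replacement: regex=False treats "." as a plain dot
literal = (
    names.str.replace("Dr. ", "", regex=False)
         .str.replace("Pr. ", "", regex=False)
)

# Regex replacement: remove either title, plus any following whitespace,
# but only at the start of the string
title_pattern = r"^(?:Dr|Pr)\.\s*"
cleaned = names.str.replace(title_pattern, "", regex=True)

print(literal)
print(cleaned)
```

Passing regex explicitly is a deliberate choice here: the default has changed between pandas versions, and an unescaped dot would otherwise match any character when the pattern is treated as a regex.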
Before Replacing:
0 lev gor'kov
1 NaN
2 Dr. brillouin
3 Pr. albert einstein
4 carl m. bender
dtype: object
After Replacing:
First Name Last Name
0 Lev Gor'Kov
1 NaN NaN
2 Brillouin None
3 Albert Einstein
4 Carl M. Bender
4. Concatenate
Concatenating two columns is a common task when working with text data. This can be done with the .str.cat() method.
.cat(sep=" "): Concatenates the elements of a Series or Index into one string, or joins each string with the corresponding element of another Series, using the given separator. In the example below, we have two pandas Series (First Name and Last Name) that we want to concatenate into one pandas Series.
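A minimal sketch of this concatenation, assuming two "string"-dtype Series whose values are taken from the printed results in this section (the original code is not shown):

```python
import pandas as pd

# Values assumed from the printed results in this section
first = pd.Series(["Albert", "John", "Robert", pd.NA, "Jack"], dtype="string")
last = pd.Series(["Doe", "Piter", "David", "Eden", "Carl"], dtype="string")

# Without na_rep, a missing value on either side makes the result missing
propagated = first.str.cat(last, sep=" ")

# With na_rep, missing values are replaced by the given text before joining
filled = first.str.cat(last, sep=" ", na_rep="-")

print(propagated)
print(filled)
```

The na_rep parameter is what decides whether a row with a missing first name ends up as a missing value or as a string starting with the placeholder.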
Concatenate and ignore missing values:
0 Albert Doe
1 John Piter
2 Robert David
3 <NA>
4 Jack Carl
dtype: string
Concatenate and replace missing values with "-":
0 Albert Doe
1 John Piter
2 Robert David
3 - Eden
4 Jack Carl
dtype: string
5. Additional Methods:
.startswith(pattern): Returns True if the string starts with the pattern.
.endswith(pattern): Returns True if the string ends with the pattern.
.repeat(value): Repeats each string the given number of times; with a value of 2, each string appears twice in a row.
.find(pattern): Returns the position of the first occurrence of the pattern in each string, or -1 if the pattern is not found.
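The four methods can be tried on a small Series as sketched below. The sample values and the patterns are assumptions chosen for illustration; na=False is an optional parameter that reports missing values as False instead of leaving them missing.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, invented for this illustration
names = pd.Series(["albert einstein", np.nan, "carl m. bender"])

# na=False makes missing values count as "no match"
starts = names.str.startswith("al", na=False)
ends = names.str.endswith("er", na=False)

# Each string is repeated twice; missing values stay missing
doubled = names.str.repeat(2)

# Zero-based position of the first "e"; -1 when the pattern is absent
position = names.str.find("e")
missing = names.str.find("z")

print(starts)
print(ends)
print(doubled)
print(position)
print(missing)
```

Note that .startswith() and .endswith() compare against literal text, while .find() returns a zero-based index, so a match at the very start of a string is reported as 0.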
6. Information Extraction from Text
When working with data, especially in NLP tasks, you will need to do some basic analysis of your data (find long texts, clean text, count words…).
.len(): Computes the length of each string; for missing values it returns NaN.
.count(pattern): Returns the number of times the pattern appears in each element. For example, counting the spaces in each string gives an estimate of the number of words.
.findall(pattern): Returns a list of all occurrences of the pattern. For example, a regex can be passed to find times in the data.
These columns can be computed with method chaining instead of adding them to the DataFrame one at a time. Method chaining is a programming style in which multiple method calls are invoked sequentially, each call performing an action on an object and returning an object for the next call. Method chaining can make the code considerably more readable.
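A minimal sketch of this chained style, using DataFrame.assign() to derive the new columns. The messages, the column names, and the time regex are all assumptions invented for illustration; the original example is not shown.

```python
import pandas as pd

# Hypothetical messages, invented for this illustration
df = pd.DataFrame(
    {"text": ["meeting at 10:30 today", "call me", "lunch 12:00 then demo 15:45"]}
)

# Assumed pattern for times written as H:MM or HH:MM
time_pattern = r"\d{1,2}:\d{2}"

result = (
    df.assign(
        n_chars=lambda d: d["text"].str.len(),                # length of each string
        n_spaces=lambda d: d["text"].str.count(" "),          # number of spaces
        times=lambda d: d["text"].str.findall(time_pattern),  # list of matched times
    )
    # Spaces plus one estimates the word count for single-spaced text
    .assign(n_words=lambda d: d["n_spaces"] + 1)
)

print(result)
```

Each assign() call returns a new DataFrame, so the original df is left unchanged and the whole transformation reads top to bottom as one expression.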
Conclusion
- We have covered some of the Pandas functions for manipulating text data. All of them are useful and come in handy in particular cases.
- Pandas is a powerful library for both data analysis and manipulation. It provides numerous functions and methods to handle data in tabular form. As with any other tool, the best way to learn Pandas is through practice.
Thank you for reading. Please let me know if you have any feedback or suggestions.
Published via Towards AI