Text Analysis with Pandas Guide

Last Updated on July 25, 2023 by Editorial Team

Author(s): Fares Sayah

Originally published on Towards AI.

Hands-On guide on how to use Pandas to perform analysis on textual data

Photo by Stone Wang on Unsplash

Raw data often arrives in a form that makes analysis difficult, and text is no exception. Python provides many built-in methods for manipulating string objects.

We can write our own string functions and apply them with DataFrame.apply(), but this can be slow on large datasets.

Instead, we can use the string methods built into pandas. This article presents some of them; if you need more, check the pandas documentation.
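As a rough illustration of the two approaches, the sketch below lowercases a small Series once with a hand-written function passed to Series.apply() and once with the built-in .str.lower(). The sample data and the helper name to_lower are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value
names = pd.Series(["Lev Gor'kov", np.nan, "Brillouin"])

# Hand-written helper: it has to skip missing values itself
def to_lower(value):
    return value.lower() if isinstance(value, str) else value

via_apply = names.apply(to_lower)

# Built-in string method: missing values are passed through automatically
via_str = names.str.lower()

print(via_apply)
print(via_str)
```

Both calls give the same values here. The .str version needs no special handling for missing entries.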

Table of Contents

1. Manipulate Case
2. Split Strings
3. Replace String
4. Concatenate
5. Additional Methods:
6. Information Extraction from Text

1. Manipulate Case

Pandas provides several methods for string manipulation. They are called through the .str accessor of a Series or Index (for example, s.str.lower()), and they leave missing values as NaN:

.lower(): Converts all uppercase characters in each string to lowercase.
.upper(): Converts all lowercase characters in each string to uppercase.
.capitalize(): Uppercases the first character of each string and lowercases the rest.
.title(): Uppercases the first character of each word in each string.
.strip(): Removes whitespace from the beginning and end of each string. It does not touch spaces inside the string.
.islower(): Checks whether all cased characters in each string are lowercase and returns a Boolean value.
.isupper(): Checks whether all cased characters in each string are uppercase and returns a Boolean value.
.isnumeric(): Checks whether all characters in each string are numeric and returns a Boolean value.
.swapcase(): Converts lowercase characters to uppercase and vice versa.
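The outputs that follow can be reproduced with a sketch along these lines. The Series is rebuilt from the printed values, and .capitalize() and .title() are the pandas string methods assumed to be behind the last two outputs.

```python
import numpy as np
import pandas as pd

# Series reconstructed from the printed outputs in this section
s = pd.Series(["lev gor'kov", np.nan, "brillouin", "albert einstein", "carl m. bender"])

lower = s.str.lower()             # all letters lowercase
upper = s.str.upper()             # all letters uppercase
capitalized = s.str.capitalize()  # first letter of each string uppercase
titled = s.str.title()            # first letter of each word uppercase

print("Lowercase all letters:", lower, sep="\n")
print("Uppercase all letters:", upper, sep="\n")
print("Uppercase the first letter:", capitalized, sep="\n")
print("Uppercase the first letter of each word:", titled, sep="\n")

# The checks return Booleans (NaN where the value is missing)
print(s.str.islower())
```

Note that .title() also capitalizes the letter after an apostrophe, which is why "gor'kov" becomes "Gor'Kov".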

Lowercase all letters:
0 lev gor'kov
1 NaN
2 brillouin
3 albert einstein
4 carl m. bender
dtype: object

Uppercase all letters:
0 LEV GOR'KOV
1 NaN
2 BRILLOUIN
3 ALBERT EINSTEIN
4 CARL M. BENDER
dtype: object

Uppercase the first letter:
0 Lev gor'kov
1 NaN
2 Brillouin
3 Albert einstein
4 Carl m. bender
dtype: object

Uppercase the first letter of each word:
0 Lev Gor'Kov
1 NaN
2 Brillouin
3 Albert Einstein
4 Carl M. Bender
dtype: object

2. Split Strings

.split(' '): Splits each string on the given separator. In the example below, we have a pandas Series of physicist names that we want to separate into First Name and Last Name columns and format in title case. Passing expand=True returns the result as a DataFrame, which is easy to concatenate with another DataFrame.
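A minimal sketch of this step is shown here. The original listing is not available, so the use of n=1 (split only on the first space) and the rename call are assumptions chosen to produce two named columns.

```python
import numpy as np
import pandas as pd

s = pd.Series(["lev gor'kov", np.nan, "brillouin", "albert einstein", "carl m. bender"])

# Assumption: n=1 limits the split to the first space, giving at most two columns
names = (
    s.str.title()
    .str.split(" ", n=1, expand=True)
    .rename(columns={0: "First Name", 1: "Last Name"})
)

print(f"Before Splitting:\n{s}\n")
print(f"After Splitting:\n{names}")
```

With n=1, a three-part name such as "Carl M. Bender" puts "Carl" in the first column and "M. Bender" in the second. Names with a single word get a missing value in the second column.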

Before Splitting:
0 lev gor'kov
1 NaN
2 brillouin
3 albert einstein
4 carl m. bender
dtype: object

After Splitting:
 First Name Last Name
0 Lev Gor'Kov
1 NaN NaN
2 Brillouin None
3 Albert Einstein
4 Carl M. Bender

3. Replace String

When working with text data, you will often want to remove some characters or words from the text. .replace(a, b) replaces the value a with the value b. In the example below, we replace Dr. and Pr. with an empty string.

If the text you want to remove or replace varies and cannot be written as one fixed string, you can use regular expressions.
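Both variants can be sketched as follows. The regular expression is one possible pattern for the two titles, not necessarily the one the original code used.

```python
import numpy as np
import pandas as pd

s = pd.Series(["lev gor'kov", np.nan, "Dr. brillouin", "Pr. albert einstein", "carl m. bender"])

# Literal replacement: regex=False treats "Dr. " as plain text, so the dot is not a wildcard
literal = s.str.replace("Dr. ", "", regex=False).str.replace("Pr. ", "", regex=False)

# Regex replacement: remove either title in one pass, only at the start of the string
with_regex = s.str.replace(r"^(?:Dr|Pr)\.\s+", "", regex=True)

print(literal)
print(with_regex)
```

Passing regex explicitly is a deliberate choice here, because the default value of that parameter has changed between pandas versions.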

Before Replacing:
0 lev gor'kov
1 NaN
2 Dr. brillouin
3 Pr. albert einstein
4 carl m. bender
dtype: object

After Replacing:
 First Name Last Name
0 Lev Gor'Kov
1 NaN NaN
2 Brillouin None
3 Albert Einstein
4 Carl M. Bender

4. Concatenate

Concatenating two columns is a common task when working with text data. This can be done with the .cat() method.

cat(sep=' '): Concatenates the strings of a Series or Index with those of another, element by element, using the given separator. In the example below, we have two pandas Series (First Name and Last Name) that we want to concatenate into one Series. By default, a missing value on either side makes the result missing; passing na_rep substitutes a placeholder instead.
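The two outputs below can be reproduced with a sketch like this one. The two Series are reconstructed from the printed results, and the variable names are illustrative.

```python
import pandas as pd

# Reconstructed from the printed outputs; the fourth first name is missing
first = pd.Series(["Albert", "John", "Robert", pd.NA, "Jack"], dtype="string")
last = pd.Series(["Doe", "Piter", "David", "Eden", "Carl"], dtype="string")

# Default behaviour: a missing value on either side gives a missing result
propagated = first.str.cat(last, sep=" ")

# na_rep substitutes a placeholder for missing values before joining
filled = first.str.cat(last, sep=" ", na_rep="-")

print(propagated)
print(filled)
```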

Concatenate and ignore missing values:
0 Albert Doe
1 John Piter
2 Robert David
3 <NA>
4 Jack Carl
dtype: string

Concatenate and replace missing values with "-":
0 Albert Doe
1 John Piter
2 Robert David
3 - Eden
4 Jack Carl
dtype: string

5. Additional Methods:

.startswith(pattern): Returns True for each string that starts with the pattern. The pattern is matched literally, not as a regular expression.
.endswith(pattern): Returns True for each string that ends with the pattern, again matched literally.
.repeat(value): Repeats each string the given number of times. For example, .repeat(2) produces two back-to-back copies of each string.
.find(pattern): Returns the position of the first occurrence of the pattern in each string, or -1 if it is not found.
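A short sketch of these four methods on made-up sample data:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value
s = pd.Series(["albert einstein", np.nan, "carl m. bender"])

starts = s.str.startswith("albert")  # literal prefix check
ends = s.str.endswith("bender")      # literal suffix check
repeated = s.str.repeat(2)           # each string written twice
position = s.str.find("e")           # index of the first "e"
missing = s.str.find("z")            # -1 where "z" does not occur

print(starts, ends, repeated, position, missing, sep="\n\n")
```

Missing values stay missing in every result, so the Boolean and integer outputs contain NaN in the second row.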

6. Information Extraction from Text

When working with data, and especially in NLP tasks, you will need to run some basic analysis on your text (find long texts, clean text, count words, and so on).

.len(): Computes the length of each string. Missing values return NaN.
.count(pattern): Returns the number of times the pattern appears in each string. For example, counting the spaces in each string and adding 1 gives the number of words, provided words are separated by single spaces.
.findall(pattern): Returns a list of all occurrences of the pattern in each string. For example, a regex can be passed to find times in the data.
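The three methods can be sketched on a small, made-up Series. The time regex is one simple pattern for HH:MM values and is an assumption, not the original article's expression.

```python
import numpy as np
import pandas as pd

# Hypothetical sample text with one missing value
s = pd.Series(["meeting at 10:30 and lunch at 12:15", np.nan, "no time given"])

lengths = s.str.len()                     # characters per string, NaN if missing
word_counts = s.str.count(" ") + 1        # spaces + 1, assuming single spaces
times = s.str.findall(r"\d{1,2}:\d{2}")   # every HH:MM-like match as a list

print(lengths, word_counts, times, sep="\n\n")
```

Strings without a match get an empty list from .findall(), while missing values remain NaN.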

These methods can be chained instead of being assigned to the DataFrame one new column at a time. Method chaining is a programming style in which several method calls are invoked in sequence, each call acting on the object returned by the previous one. In pandas, each call typically returns a new object rather than modifying the original. Used with care, chaining can make code easier to read.
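One way to write such a chain is with DataFrame.assign(), as in the sketch below. The DataFrame, the column names (n_chars, n_words, times), and the sorting step are all hypothetical and only serve to show the style.

```python
import pandas as pd

# Hypothetical DataFrame with a single text column
df = pd.DataFrame(
    {"text": ["meeting at 10:30", "lunch at 12:15 then call at 14:00", "no time given"]}
)

# Each step returns a new DataFrame, so df itself is left unchanged
summary = (
    df.assign(
        n_chars=lambda d: d["text"].str.len(),
        n_words=lambda d: d["text"].str.count(" ") + 1,
        times=lambda d: d["text"].str.findall(r"\d{1,2}:\d{2}"),
    )
    .sort_values("n_chars", ascending=False)
    .reset_index(drop=True)
)

print(summary)
```

The same result could be built with three separate column assignments. The chained form keeps the steps in one expression and avoids modifying df in place.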

Conclusion

We have covered some of the pandas functions for manipulating text data. Each of them is useful in particular cases.
Pandas is a powerful library for both data analysis and manipulation. It provides numerous functions and methods for handling data in tabular form. As with any other tool, the best way to learn pandas is through practice.

Thank you for reading. Please let me know if you have any feedback or suggestions.
