Regular Expression (RegEx) in Python : The Basics

Last Updated on May 25, 2022 by Editorial Team

Author(s): Hrishikesh Patel

Originally published on Towards AI the World’s Leading AI and Technology News and Media Company. If you are building an AI-related product or service, we invite you to consider becoming an AI sponsor. At Towards AI, we help scale AI and technology startups. Let us help you unleash your technology to the masses.

Regular Expression (RegEx) in Python: The Basics

Master the fundamentals of RegEx in Python

Consider you have a lot of text data, and you want to extract meaningful information. For example, you might want to extract hashtags, @mentions, URLs, etc. from any tweets. What’s the best way to do it? You got it right — it is to use a regular expression or regex. The regex is a sequence of characters that form a pattern that matches the text. We can define a pattern for hashtags, and it can be used to match any hashtags in the given tweets.

Though regex implementation is mostly similar across different programming languages, there can be minor differences. In this story, you’ll learn to use regex in Python. This story covers the basics of regex. I’ll write another story for advanced regex.

RegEx in Python

Python has a dedicated package called ‘re’ for working with regex. Click here to read its documentation. It has different functions such as .search(), .split(), .findall(), .sub(), etc. I will show the usage of the findall() function to find all desired information from the text using the regex pattern.

Illustrating re.findall() function (image by author)

You may wonder what the small character “r” means before the regex pattern and how we can generate different patterns. Let’s dive into that!

Raw string in Python

Before delving deep into the regex, it is crucial to understand what raw string is. Python has special characters such as newline character (\n), tab space(\t), etc. in strings. What if we need \n to be a part of the string instead of being treated specially? In this case, we should use raw strings. The following example illustrates the difference between normal and raw strings. Using raw strings for regex patterns is recommended to avoid the Python interpreter treating the strings unexpectedly.

Summary of typical regex metacharacters

Metacharacters are characters with special meaning in the regex pattern. For example, metacharacter \d represents a digit from 0 to 9. The following table summarizes basic metacharacters used in regex.

Important metacharacters used in regex pattern (image by author)

1. Literal match

In the absence of metacharacters, you can get an exact match.

Illustrating literal string match (image by author)

2. Match a digit using \d

\d represents any digit from 0 to 9.

Illustrating usage of \d (image by author)

3. Match a non-digit using \D

\D matches any single non-digit character.

4. Match a word character using \w

\w matches any single word character. It can include anything from A to Z, a to z, numbers 0 to 9, and an underscore(_).

Illustrating usage of \w (image by author)

5. Match a non-word character using \W

Non-word characters include anything except the word characters mentioned above.

6. Match whitespace with \s

\s allows to match single whitespace character.

Illustrating usage of \s (image by author)

7. Match a non-whitespace with \S

\S can be used to match single non-whitespace character.

Quantifiers in regex

Let’s first extract a phone number from the text.

Since \d matches a single digit, we must write it ten times to extract a
ten-digit number. But wait — it doesn’t look pretty. Here’s the solution — use quantifiers for characters in the pattern.

Summary of regex quantifiers (image by author)

1. Match one or more times using +

The + matches one or more occurrences of its preceding character. So \d+ means match one or more occurrences of a digit.

Illustrating usage of +(plus) quantifier (image by author)

Similarly, you can match zero or more occurrences of its preceding character using *. So \w* means to match zero or more occurrences of a word character.

2. Match exactly n occurrences using {n}

The {n} matches exactly n occurrences of its preceding character in the pattern. So

Illustrating usage of {n} quantifier (image by author)

Other variations:

{n,m} — Matches its preceding character at least n and at most m times e.g., \d{2, 4} will match a digit at least two times and at most 4 times.
{n,} — Matches its preceding character at least n times and there is no upper limit e.g., \w{4,} will match a word character at least four times with no upper limit.
{,m} — Matches its preceding character from zero to m times e.g., \D{,4} will match any non-digit character at most four times, while it can be zero time as well.

3. Match zero or one-time using?

The ? matches its preceding character zero or one time. For example, cats? will match cat as well as cats.

Illustrating usage of? (question-mark) quantifier (image by author)

Note — All these quantifiers are applied to their preceding characters, not the entire word e.g., in mango+ pattern, the + only applies to the last character o, not the word mango.

But what if you want to match the special characters like \ ,*, +,? etc. Since these are special characters, you cannot directly match them as shown below:

Special characters cannot be matched directly (image by author)

The solution is to use the escape character \ before the special character to be matched e.g., use \+ in the regex to match + . Similarly, you can use \\ in the pattern to match \ in the text.

Escape character (\) is required to match special characters such as +, \, *, ?, etc. (image by author)

Thanks for reading this far. I now have a bonus for you.

Bonus

There is a really cool website https://regex101.com/, where you can test your regex pattern. It also supports different programming languages. Go check out the page and have some fun with regex.

Thanks for reading my first ever story on medium. I will appreciate your feedback, and please feel free to post your questions in the comments. Follow me on Medium if you’d like more stories like this.

Regular Expression (RegEx) in Python : The Basics was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Join thousands of data leaders on the AI newsletter. It’s free, we don’t spam, and we never share your email address. Keep up to date with the latest work in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.

Published via Towards AI

Frequently Used, Contextual References

Resources

Publication

Regular Expression (RegEx) in Python : The Basics

Author(s): Hrishikesh Patel

Regular Expression (RegEx) in Python: The Basics

Master the fundamentals of RegEx in Python

RegEx in Python

Raw string in Python

Summary of typical regex metacharacters

1. Literal match

2. Match a digit using \d

3. Match a non-digit using \D

4. Match a word character using \w

5. Match a non-word character using \W

6. Match whitespace with \s

7. Match a non-whitespace with \S

Quantifiers in regex

1. Match one or more times using +

2. Match exactly n occurrences using {n}

3. Match zero or one-time using?

Bonus

Towards AI Team

Feedback ↓ Cancel reply

Popular posts

Best Laptops for Deep Learning, Machine Learning (ML), and Data Science for 2023

Best Workstations for Deep Learning, Data Science, and Machine Learning (ML) for 2022

Descriptive Statistics for Data-driven Decision Making with Python

Best Machine Learning (ML) Books - Free and Paid - Editorial Recommendations for 2022

Best Data Science Books - Free and Paid - Editorial Recommendations for 2022

Updates

Recent Posts

The Fundamental Mathematics of Machine Learning

Built-In AI Web APIs Will Enable A New Generation Of AI Startups

Auditing Predictive A.I. Models for Bias and Fairness

Why is Llama 3.1 Such a Big deal?

5 AI Real-World Projects To Set Foot in The Door

The World’s Leading AI and Technology Publication.

Company

CONTACT US

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Frequently Used, Contextual References

Resources

Publication

Regular Expression (RegEx) in Python : The Basics

Author(s): Hrishikesh Patel

Regular Expression (RegEx) in Python: The Basics

Master the fundamentals of RegEx in Python

RegEx in Python

Raw string in Python

Summary of typical regex metacharacters

1. Literal match

2. Match a digit using \d

3. Match a non-digit using \D

4. Match a word character using \w

5. Match a non-word character using \W

6. Match whitespace with \s

7. Match a non-whitespace with \S

Quantifiers in regex

1. Match one or more times using +

2. Match exactly n occurrences using {n}

3. Match zero or one-time using?

Bonus

Towards AI Team

Related posts

Feedback ↓ Cancel reply

Popular posts

Updates

Recent Posts

The World’s Leading AI and Technology Publication.

Company

CONTACT US

GDPR CCPA Statement