Regex for Python developers

Arnav Goel
9 min read · Jun 17, 2023


Introduction

A regular expression (regex or regexp) is a sequence of characters that specifies a match pattern in text. Different regular expression engines are not fully compatible with each other; the syntax and behavior of a particular engine is called a regular expression flavor. Popular flavors include Perl, PCRE, PHP, .NET, Java, JavaScript, XRegExp, VBScript, Python, Ruby, Delphi, R, Tcl, and POSIX. This article covers the Python flavor.

Regex functionality in Python resides in a module named re, which contains many useful functions and methods. Before using them, we need to understand the basics of how regex works, starting with the special sequences, boundaries, and punctuation characters described below.

What is Regex Used For?

Regex matching a webpage

Regex can be used for string validation in frontend and backend web development. Today it is most commonly used for scraping text data or finding particular text in a large amount of data. Typical uses include:

  1. Finding and replacing specific text in a text editor (e.g. Vim or another common IDE)
  2. Finding specific text in logs for debugging
  3. Data scraping, such as extracting specific text from web pages, e.g. finding all pages that contain a certain set of words, possibly in a specific order
  4. Validating text, such as checking that an email address or other string is well-formed
  5. Data wrangling (transforming data from one format to another)

Understanding Regex

Literal Characters

The most basic regular expression consists of a single literal character, such as a. It matches the first occurrence of that character in the string. Beyond literals, there are non-printable characters and special sequences to keep in mind. The list below is not comprehensive, and we will talk about more of them in the future:

\t    tab character (ASCII 0x09)
\r    carriage return (ASCII 0x0D)
\n    line feed (ASCII 0x0A)
\d    Matches any Unicode decimal digit. Includes [0-9] and many other digit characters.
\D    Matches any character that is not a decimal digit. The opposite of \d.
\s    Matches whitespace characters.
\S    Matches any character that is not whitespace. The opposite of \s.
\w    Matches word characters: alphanumeric characters and the underscore.
      If the ASCII flag is used, only [a-zA-Z0-9_] is matched.
\W    Matches any character that is not a word character. The opposite of \w.
\b    Word boundary
\B    Not a word boundary
^     Beginning of a string
$     End of a string
\A    Matches only at the start of the string.
\Z    Matches only at the end of the string.

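Regular expressions use the backslash to introduce special sequences, but Python string literals also treat the backslash as an escape character, so in a normal string literal every regex backslash would have to be written twice. The solution is to use Python’s raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'. So r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline.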
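A minimal sketch of the difference, using only the standard re module. The sample strings are made up for illustration.

```python
import re

# A normal string literal turns "\n" into one newline character;
# a raw string literal keeps the backslash and the letter n.
newline_length = len("\n")
raw_length = len(r"\n")

# Without the r prefix, "\b" is a backspace character (ASCII 0x08),
# so the regex engine never sees the word-boundary sequence.
plain_hits = re.findall("\bcat\b", "cat concat cat")
raw_hits = re.findall(r"\bcat\b", "cat concat cat")

print(newline_length, raw_length)  # 1 2
print(plain_hits)                  # []
print(raw_hits)                    # ['cat', 'cat']
```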

Special Characters

There are 12 special characters that are reserved so that we can express more patterns in regex. These special characters are often called “metacharacters”. Most of them are not valid patterns on their own: a pattern consisting only of * or (, for example, raises an error.

. (period or dot)      In the default mode, this matches any character
                       except a newline. If the DOTALL flag has been specified,
                       this matches any character including a newline.
\ (backslash)          Either escapes special characters (permitting you to
                       match characters like '*' or '?'), or signals a
                       special sequence.
^ (caret)              Matches the start of the string, and in MULTILINE mode
                       also matches immediately after each newline.
$ (dollar sign)        Matches the end of the string or just before the
                       newline at the end of the string, and in MULTILINE
                       mode also matches before a newline.
| (vertical bar)       A|B, where A and B can be arbitrary REs, creates a regular
                       expression that will match either A or B.
? (question mark)      Causes the resulting RE to match 0 or 1 repetitions of
                       the preceding RE. ab? will match either ‘a’ or ‘ab’.
* (asterisk)           Causes the resulting RE to match 0 or more repetitions
                       of the preceding RE, as many repetitions as are possible.
                       ab* will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s.
+ (plus sign)          Causes the resulting RE to match 1 or more repetitions of
                       the preceding RE. ab+ will match ‘a’ followed by any
                       non-zero number of ‘b’s; it will not match just ‘a’.
( ) (parentheses)      Match whatever regular expression is inside the
                       parentheses, and indicate the start and end of a group;
                       the contents of a group can be retrieved after a match
                       has been performed, and can be matched later in the string.
[ (square bracket)     Used to indicate a set of characters.
{ (curly brace)        Specifies the number of copies of the previous RE to be matched.
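A few of these metacharacters can be tried out directly. The snippet below is a small sketch with made-up sample text, showing the dot with and without DOTALL, alternation, and escaping with a backslash.

```python
import re

text = "line one\nline two"

# By default the dot stops at a newline; re.DOTALL lets it cross lines.
dot_default = re.search(r"one.line", text)
dot_all = re.search(r"one.line", text, re.DOTALL)

# Alternation: match either word.
pets = re.findall(r"cat|dog", "cat, dog, bird")

# A backslash turns the metacharacter + into a literal plus sign.
literal_plus = re.findall(r"\+", "1+1=2")

print(dot_default)           # None
print(repr(dot_all.group())) # 'one\nline'
print(pets)                  # ['cat', 'dog']
print(literal_plus)          # ['+']
```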

Quantifiers — * + ? and {}

  • * matches 0 or more repetitions of the preceding RE
  • + matches 1 or more repetitions of the preceding RE
  • ? causes the resulting RE to match 0 or 1 repetitions of the preceding RE
  • {m} specifies that exactly m copies of the previous RE should be matched
  • {m,n} causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible (write it without a space after the comma, otherwise Python treats the braces as literal text)
abc*          matches a string that has ab followed by zero or more c
abc+          matches a string that has ab followed by one or more c
abc?          matches a string that has ab followed by zero or one c
abc{2}        matches a string that has ab followed by 2 c
abc{2,}       matches a string that has ab followed by 2 or more c
abc{2,5}      matches a string that has ab followed by 2 up to 5 c
a(bc)*        matches a string that has a followed by zero or more copies of the sequence bc
a(bc){2,5}    matches a string that has a followed by 2 up to 5 copies of the sequence bc
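The table above can be checked with a short script. This is a sketch: whole_match is a hypothetical helper, and it uses re.fullmatch so that a candidate only counts when the entire string fits the pattern.

```python
import re

def whole_match(pattern, candidates):
    """Hypothetical helper: return the candidates matched in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

# "abcccccc" is ab followed by six c characters.
candidates = ["ab", "abc", "abcc", "abcccccc"]

print(whole_match(r"abc*", candidates))      # ['ab', 'abc', 'abcc', 'abcccccc']
print(whole_match(r"abc+", candidates))      # ['abc', 'abcc', 'abcccccc']
print(whole_match(r"abc?", candidates))      # ['ab', 'abc']
print(whole_match(r"abc{2}", candidates))    # ['abcc']
print(whole_match(r"abc{2,}", candidates))   # ['abcc', 'abcccccc']
print(whole_match(r"abc{2,5}", candidates))  # ['abcc']
print(whole_match(r"a(bc){2,5}", ["abc", "abcbc"]))  # ['abcbc']
```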

Greedy and Lazy match

The quantifiers (* + ? {}) are greedy operators, so they expand the match as far as they can through the provided text.

For example, <.+> matches <div>simple div</div> in This is a <div>simple div</div> test.

In order to catch only the div tag we can use a ? to make it lazy:

<.+?>       matches any character one or more times, but as few times as possible, between < and >

Notice that a better solution should avoid the usage of . in favor of a more strict regex:

<[^<>]+> same as <.+?> but stricter
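The three patterns can be compared side by side on the sample sentence. This is a small sketch using re.findall.

```python
import re

text = "This is a <div>simple div</div> test."

greedy = re.findall(r"<.+>", text)      # expands to the last > in the text
lazy = re.findall(r"<.+?>", text)       # stops at the first > it can
strict = re.findall(r"<[^<>]+>", text)  # cannot cross another < or >

print(greedy)  # ['<div>simple div</div>']
print(lazy)    # ['<div>', '</div>']
print(strict)  # ['<div>', '</div>']
```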

Anchors — ^ and $

^The         matches any string that starts with "The"
end$         matches a string that ends with "end"
^The end$    matches a string that is exactly "The end" (exact match)
roar         matches any string that has the text roar in it

A good example is checking whether a string consists only of digits, using ^\d+$. On its own, \d+ matches one or more digits anywhere in the string, so it would also find a match in abcd4cd. Adding ^ before it and $ after it makes the pattern match the whole string.
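The digit check can be sketched as follows. is_all_digits is a hypothetical helper name, not something from the re module.

```python
import re

def is_all_digits(value):
    """Hypothetical helper: True if value is made up only of digits."""
    return re.search(r"^\d+$", value) is not None

# Without anchors, \d+ finds the digit buried inside the text.
unanchored = re.search(r"\d+", "abcd4cd").group()

print(unanchored)                # 4
print(is_all_digits("abcd4cd"))  # False
print(is_all_digits("2023"))     # True
print(is_all_digits(""))         # False
```

One caveat: $ also matches just before a trailing newline, so is_all_digits("2023\n") returns True. When that matters, re.fullmatch(r"\d+", value) is the stricter check.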

Grouping and capturing

  • Parentheses: group the regular expression inside them
1. a(bc)            parentheses create a capturing group with value bc
2. a(?:bc)*         using ?: we disable the capturing group
3. a(?P<foo>bc)     using ?P<foo> we give the group the name foo (this is the Python syntax; some other flavors write ?<foo>)
  • Bracket expressions: match any single character from the set listed inside the brackets
[abc]               matches a string that has either an a, a b, or a c
[a-c]               same as previous
[a-fA-F0-9]         a string that represents a single hexadecimal digit, case insensitively
[0-9]%              a string that has a character from 0 to 9 before a % sign
[^a-zA-Z]           a string that has a character that is not a letter from a to z or from A to Z (^ is used as negation)
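The following sketch shows how groups are read back from a match object and how bracket expressions behave. Python spells named groups as (?P<name>...). The sample strings are made up.

```python
import re

# Capturing group: group(0) is the whole match, group(1) the first group.
captured = re.search(r"a(bc)", "xabcx")
print(captured.group(0), captured.group(1))  # abc bc

# Named group: the captured text is available by name.
named = re.search(r"a(?P<foo>bc)", "xabcx")
print(named.group("foo"))  # bc

# Non-capturing group: it still groups for the *, but captures nothing.
non_capturing = re.search(r"a(?:bc)*", "abcbcd")
print(non_capturing.group(0), non_capturing.groups())  # abcbc ()

# Bracket expressions match one character from the set.
hex_digits = re.findall(r"[a-fA-F0-9]", "0xBeef!")
percents = re.findall(r"[0-9]%", "up 5%, down 12%")
non_letters = re.findall(r"[^a-zA-Z]", "a1 b2")
print(hex_digits)   # ['0', 'B', 'e', 'e', 'f']
print(percents)     # ['5%', '2%']
print(non_letters)  # ['1', ' ', '2']
```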

Boundaries — \b and \B

\b is an anchor, like ^ and $. It matches positions where one side is a word character (as matched by \w) and the other side is not a word character (for instance, the beginning of the string or a space character).

\babc\b      performs a "whole words only" search

It comes with its negation, \B. This matches all positions where \b doesn’t match, and can be used to find a search pattern fully surrounded by word characters.

\Babc\B      matches only if pattern is fully surrounded by word characters
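A quick way to see the difference is to print where each pattern matches. This is a sketch using re.finditer on a made-up string.

```python
import re

text = "abc xabcx abcdef"

# (start, end) positions of every match.
whole_words = [m.span() for m in re.finditer(r"\babc\b", text)]
inside_words = [m.span() for m in re.finditer(r"\Babc\B", text)]

print(whole_words)   # [(0, 3)]  the standalone word "abc"
print(inside_words)  # [(5, 8)]  the "abc" inside "xabcx"
```

The "abc" at the start of "abcdef" appears in neither list: it has a word boundary on its left but not on its right.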

Flags

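In many other flavors (JavaScript and Perl, for example), a regex is written in the form /abc/, where the search pattern is delimited by two slash characters /. Flags are appended after the closing slash, and they can be combined with each other:

  • g (global) does not return after the first match, restarting the subsequent searches from the end of the previous match
  • m (multi-line) when enabled, ^ and $ will match the start and end of a line, instead of the whole string
  • i (insensitive) makes the whole expression case-insensitive (for instance /aBc/i would match AbC)

Python does not use slash delimiters and has no g flag. Flags are passed through the flags argument (re.MULTILINE and re.IGNORECASE correspond to m and i), and functions such as re.findall() return every match rather than only the first.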

Python Regex Specific Flags

The re module provides more flags, some of which were mentioned earlier:

  • re.ASCII: Make \w, \W, \b, \B, \d, \D, \s and \S perform ASCII-only matching
  • re.DEBUG: Display debug information about compiled expression
  • re.IGNORECASE (re.I): Perform case-insensitive matching; expressions like [A-Z] will also match lowercase letters.
re.match('test', 'TeSt', re.IGNORECASE) # Regex python function defined below
  • re.LOCALE (re.L): Make \w, \W, \b, \B and case-insensitive matching dependent on the current locale. re.LOCALE can be used only with bytes patterns and is not compatible with re.ASCII.
  • re.MULTILINE (re.M): When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline).
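The effect of a few of these flags can be sketched as follows. The sample text is made up, and "\u0664" (ARABIC-INDIC DIGIT FOUR) is used only as an example of a Unicode digit outside [0-9].

```python
import re

text = "first line\nSecond line"

# Without re.MULTILINE, ^ matches only at the very start of the string.
start_only = re.findall(r"^\w+", text)
# With re.MULTILINE, ^ also matches right after each newline.
each_line = re.findall(r"^\w+", text, re.MULTILINE)

# Flags are combined with the bitwise OR operator.
combined = re.findall(r"^second", text, re.MULTILINE | re.IGNORECASE)

# re.ASCII restricts \d to the characters 0-9.
unicode_digits = re.findall(r"\d", "4\u0664")
ascii_digits = re.findall(r"\d", "4\u0664", re.ASCII)

print(start_only)  # ['first']
print(each_line)   # ['first', 'Second']
print(combined)    # ['Second']
print(len(unicode_digits), len(ascii_digits))  # 2 1
```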

Regex Functions in Python

Regex example in Python

These functions are described in detail at https://docs.python.org/3/library/re.html#functions.

re.compile(pattern, flags=0)

Compile a regular expression pattern into a regular expression object, which can be used for matching using its match(), search() and other methods, described below.
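Compiling is mostly useful when the same pattern is applied many times. A minimal sketch follows; the pattern name ERROR_CODE and the log lines are hypothetical.

```python
import re

# Compile once, then reuse the pattern object's methods.
ERROR_CODE = re.compile(r"\bE\d{3}\b")

log_lines = ["ok", "failed with E404", "E500 then E503"]
codes = [code for line in log_lines for code in ERROR_CODE.findall(line)]

print(codes)  # ['E404', 'E500', 'E503']
```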

re.search(pattern, string, flags=0)

Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern.

re.match(pattern, string, flags=0)

If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern.

re.fullmatch(pattern, string, flags=0)

If the whole string matches the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern.
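The three functions differ only in where the match is allowed to sit. This sketch runs the same pattern through each of them.

```python
import re

pattern = r"\d+"

search_hit = re.search(pattern, "abc123")    # anywhere in the string
match_miss = re.match(pattern, "abc123")     # must start at position 0
match_hit = re.match(pattern, "123abc")      # the prefix is enough
full_miss = re.fullmatch(pattern, "123abc")  # must cover the whole string
full_hit = re.fullmatch(pattern, "123")

print(search_hit.group())  # 123
print(match_miss)          # None
print(match_hit.group())   # 123
print(full_miss)           # None
print(full_hit.group())    # 123
```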

re.split(pattern, string, maxsplit=0, flags=0)

Split string by the occurrences of pattern. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list.

# Split by non-alphanumeric characters
>>> re.split(r'\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
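The trailing empty string in the result above appears because the final '.' is itself a separator with nothing after it. The sketch below also shows maxsplit, and the documented behavior that a capturing group in the pattern keeps the separators in the result.

```python
import re

sentence = "Words, words, words."

# At most one split; the rest of the string is the final element.
limited = re.split(r"\W+", sentence, maxsplit=1)

# With a capturing group, the separators are returned as well.
with_separators = re.split(r"(\W+)", sentence)

print(limited)          # ['Words', 'words, words.']
print(with_separators)  # ['Words', ', ', 'words', ', ', 'words', '.', '']
```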

re.findall(pattern, string, flags=0)

Return all non-overlapping matches of pattern in string, as a list of strings or tuples.

# Find all words that start with f
>>> re.findall(r'\bf[a-z]*', 'which foot or hand fell fastest')
['foot', 'fell', 'fastest']
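What re.findall returns depends on the groups in the pattern: whole matches when there are none, the group's text when there is one, and tuples when there are several. A short sketch with a made-up settings string:

```python
import re

settings = "set width=20 and height=10"

no_groups = re.findall(r"\w+=\d+", settings)
one_group = re.findall(r"\w+=(\d+)", settings)
two_groups = re.findall(r"(\w+)=(\d+)", settings)

print(no_groups)   # ['width=20', 'height=10']
print(one_group)   # ['20', '10']
print(two_groups)  # [('width', '20'), ('height', '10')]
```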

re.sub(pattern, repl, string, count=0, flags=0)

repl can be either a string or a function.

  • If repl is a string
# Replace the word "and" with "&"
>>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
'Baked Beans & Spam'
  • If repl is a function

The function is called for every non-overlapping match and receives the match object; the string it returns is used as the replacement.

>>> def dashrepl(matchobj):
...     if matchobj.group(0) == '-':
...         return ' '
...     return '-'
...
>>> re.sub('-{1,2}', dashrepl, 'pro--gram-files')
'pro-gram files'
# "--" was converted to "-" and the single "-" was converted to " "
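A string repl can also refer back to captured groups, which is handy for the data-wrangling use case mentioned earlier. The sketch below reorders a date and shows the count argument. The sample values are made up.

```python
import re

# \1, \2 and \3 in the replacement refer to the three captured groups.
iso_date = "2023-06-17"
reordered = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", iso_date)

# count limits how many matches are replaced.
first_only = re.sub(r"-", " ", "pro--gram-files", count=1)

print(reordered)   # 17/06/2023
print(first_only)  # pro -gram-files
```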

re.escape(pattern)

Escapes the special characters in pattern, so that a string which may contain metacharacters is matched literally.

>>> print(re.escape('https://www.python.org'))
https://www\.python\.org
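re.escape matters most when part of the pattern comes from outside the program. In this sketch, user_input stands in for hypothetical user-supplied search text.

```python
import re

user_input = "1+1"  # hypothetical text typed by a user
text = "Is 1+1 equal to 11?"

# Unescaped, the + is a quantifier: "one or more 1, then a 1".
unescaped_hits = re.findall(user_input, text)

# Escaped, the + is matched as a literal plus sign.
escaped_hits = re.findall(re.escape(user_input), text)

print(unescaped_hits)  # ['11']
print(escaped_hits)    # ['1+1']
```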

Conclusion

In conclusion, I hope this article helps readers understand regular expressions and read or write regex code quickly. It is important to practice regex yourself to confirm your understanding and to read patterns correctly. [Regex Practice Link]. Finally, the Python regex documentation is clear and concise, and I highly recommend reading it after this article. [Python regex link]

Appendix

For more comprehensive documentation, please look at:

Regex tutorial

Comprehensive Regex cheatsheet

For python specific regex information,
