Regular Expressions

By Sterling Clark
March 13, 2019

Introduction to REGEX

REGEX is short for regular expressions: sequences of characters that describe a pattern to look for in a string. In Python, regular expression matching is provided by the built-in “re” module. The concept applies to simple words, phone numbers, email addresses, or any number of other patterns. For example, if you search for the letter “f” in the sentence “For the love of all that is good, finish the job,” the goal is to find each occurrence of the character “f” in the sentence. This is the most basic application of regular expressions. You can also look for one kind of character in a string that mixes letters, numbers, and special characters: in the string “a2435?#@s560”, you could choose to match only the letters. You could also search a text specifically for phone numbers (###-###-####). The format of a phone number is a specific pattern of digits and hyphens rather than a single character, and we’ll discuss the general syntax for writing such patterns next.

First, it should be noted that regex is case-sensitive by default: the letter “a” and the letter “A” are considered separate characters. Also, when dealing with numbers, a single position in a pattern matches only one digit at a time, since no single character represents anything beyond 0 through 9. Let’s go through some of the important meta-characters used to write the patterns we need to look for. A pattern is written as an ordinary Python string, so it is enclosed in quotation marks. So let’s say you’re looking for occurrences of the letter “e”: you can write exactly “e”. If you’re looking for a phrase, a part of a word, or a whole word such as “was”, then you can write exactly “was”. In both cases, the pattern is no different from a regular string.
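
As a rough illustration of the examples so far, the sketch below runs them with Python’s “re” module, which is introduced later in this article. It uses re.findall, a standard “re” function that this article does not otherwise cover. The text containing phone numbers is made up for the example, and the set and brace syntax in the last two patterns is explained in the following sections.

```python
import re

sentence = "For the love of all that is good, finish the job"

# Case-sensitive: the capital "F" in "For" is not counted.
f_hits = re.findall("f", sentence)
print(f_hits)  # ['f', 'f']

# A literal word is written exactly as it appears.
was_hit = re.search("was", "It was late")
print(was_hit.group())  # was

# Only the letters in a string of mixed characters.
letters = re.findall("[a-zA-Z]", "a2435?#@s560")
print(letters)  # ['a', 's']

# Phone numbers in the form ###-###-#### (sample text is hypothetical).
contacts = "Office: 555-123-4567, mobile: 555-987-6543"
phones = re.findall("[0-9]{3}-[0-9]{3}-[0-9]{4}", contacts)
print(phones)  # ['555-123-4567', '555-987-6543']
```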

Using meta-characters to build patterns

Now let’s get into something special: we can use the period (.) to represent any single character other than a newline. Let’s say the pattern you’re looking for is “h.s”: this means any one character, whether a letter, a number, or a special character, can sit between the “h” and the “s”. Next, we have two characters that reference the specific position of a pattern.

  • The caret (^) looks for a pattern that starts the string or text. So if you had the sentence “This looks like a tree” and you look for the pattern “^This”, it will successfully match since “This” is at the beginning. The caret must be the first character of the pattern.
  • On the opposite end of the spectrum, we have the dollar sign ($), which indicates the pattern must be at the end. So taking the previous example, if the pattern is “tree$”, you will get a successful match since the word “tree” ends the string. The dollar sign must always conclude the pattern.
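
The period, caret, and dollar sign described above can be tried out with a short sketch. It assumes Python’s “re” module and its re.search and re.findall functions (findall is not covered in this article), and the sample string used for “h.s” is invented for illustration.

```python
import re

# The period stands for exactly one character that is not a newline.
samples = "his has h9s h?s hs"
dot_hits = re.findall("h.s", samples)
print(dot_hits)  # ['his', 'has', 'h9s', 'h?s']
print(re.search("h.s", "h\ns"))  # None: "." does not match a newline

sentence = "This looks like a tree"

# The caret anchors the pattern to the start of the string.
starts_with_this = re.search("^This", sentence)
starts_with_tree = re.search("^tree", sentence)
print(starts_with_this is not None)  # True
print(starts_with_tree is not None)  # False

# The dollar sign anchors the pattern to the end of the string.
ends_with_tree = re.search("tree$", sentence)
print(ends_with_tree is not None)  # True
```

Note that “hs” is not matched by “h.s”, because the period requires one character to be present between the two letters.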

The next couple of meta-characters refer to the number of times a regex occurs in a string.

  • The asterisk (*) checks for zero or more occurrences of the character or group that precedes it. This means that whether or not that specific character actually occurs, the pattern can still match. For example, if we had the pattern “abc*”, then as long as we have a string containing “ab” it will pass. The “c” can occur or not and the string will still meet the requirements. So the strings “ab”, “abc”, and “abccc” all match the pattern.
  • The plus sign (+) checks for one or more occurrences of the preceding character or group. This means that as long as it is matched at least once, a successful match has been made. No occurrence means that the match was unsuccessful. You can also use braces ({}) and enter between them the specific number of occurrences you are looking for, as in “c{3}”. All of these meta-characters follow the part of the regex they apply to.
  • The vertical bar (|), much like in programming languages, represents “or”. If you had the sentence “I’m departing from Miami at six o’clock” and the regex is “go|departing”, the match would be successful because even though “go” isn’t present, “departing” is.
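
A minimal sketch of the asterisk, plus sign, braces, and vertical bar, assuming Python’s “re” module. It uses re.fullmatch and re.search, both described later in this article, so that each test string has to fit the pattern from start to finish.

```python
import re

candidates = ["ab", "abc", "abccc"]

# "*" allows zero or more of the preceding "c".
star = [s for s in candidates if re.fullmatch("abc*", s)]
print(star)  # ['ab', 'abc', 'abccc']

# "+" requires at least one "c".
plus = [s for s in candidates if re.fullmatch("abc+", s)]
print(plus)  # ['abc', 'abccc']

# "{3}" requires exactly three of the preceding "c".
exact = [s for s in candidates if re.fullmatch("abc{3}", s)]
print(exact)  # ['abccc']

# "|" matches either alternative.
trip = "I'm departing from Miami at six o'clock"
either = re.search("go|departing", trip)
print(either.group())  # departing
```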

Sets in REGEX

Next, we’ll discuss sets created by brackets ([]). A set expands the possibilities for making patterns, and represents exactly 1 character. For example, if you have the pattern “abc”, then that means you’re literally looking for “abc”. However, when the pattern is “[abc]”, you’re looking for a single occurrence of “a”, “b”, or “c”. Similarly, “0123” means you are literally looking for “0123”. If you have “[0123]”, then you’re looking for a single occurrence of 0, 1, 2, or 3.

A hyphen (-) between two letters or characters means that any occurrence of a character between the two is a match. So “[0-9]” refers to all numerical digits while “[a-zA-Z]” refers to all alphabetical characters, whether they are lower case or upper case. You can also limit the characters: for example, “[4-7]” or “[p-v]” are perfectly acceptable as well.
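
The following sketch shows sets and ranges in action. It assumes Python’s “re” module and the standard re.findall function (not covered in this article); the sample text is hypothetical.

```python
import re

text = "Room 4B, seat 27, gate p9"  # hypothetical sample text

# Every single digit in the text.
digits = re.findall("[0-9]", text)
print(digits)  # ['4', '2', '7', '9']

# Only the characters 0, 1, 2, or 3.
one_of = re.findall("[0123]", text)
print(one_of)  # ['2']

# A narrower range of digits.
limited_digits = re.findall("[4-7]", text)
print(limited_digits)  # ['4', '7']

# A narrower range of lower-case letters.
limited_letters = re.findall("[p-v]", text)
print(limited_letters)  # ['s', 't', 't', 'p']
```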

The function of a caret (^) changes within a set. When it is the first character inside the brackets, the caret matches everything except the characters you entered. So if you have “[^abc]”, you want to match any character except “a”, “b”, or “c”. Most other meta-characters lose their special function inside a set. That means that “[+]” is literally looking for occurrences of the character “+”, which is no longer considered a meta-character. If you want to apply meta-characters to sets, you use them outside the set, like “[0-9]*” or “[G-N]$”. You can make many different patterns by combining sets like “[v-z][a-g]”. This is how you find numbers with multiple digits: you can use “[0-9][0-9]” to search for a two-digit number.
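
As a sketch of these rules, assuming Python’s “re” module and re.findall (not covered in this article), with made-up sample strings. The third pattern uses “+” after the set rather than “*” so that the result contains no empty matches.

```python
import re

# Inside a set, a leading caret negates it.
not_abc = re.findall("[^abc]", "abcde")
print(not_abc)  # ['d', 'e']

# "+" inside a set is a literal plus sign.
plus_signs = re.findall("[+]", "1+1=2")
print(plus_signs)  # ['+']

# A meta-character placed after the set applies to the whole set.
digit_runs = re.findall("[0-9]+", "7 apples, 42 pears, 365 days")
print(digit_runs)  # ['7', '42', '365']

# Two sets side by side match exactly two characters.
two_digits = re.findall("[0-9][0-9]", "7 apples, 42 pears, 365 days")
print(two_digits)  # ['42', '36']
```

The last result shows that “[0-9][0-9]” also picks up the first two digits of a longer number, so a stricter pattern is needed if only two-digit numbers should count.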

Special sequences using the backslash

Lastly, we’ll briefly discuss special sequences. Special sequences are initiated by another meta-character not previously discussed, the backslash (\), followed by a particular letter depending on the sequence. They perform special functions just like meta-characters, and some of them share the same function as a meta-character. The sequences “\A”, “\b”, and “\B” refer to the specific position of the characters, just like the caret and the dollar sign. When you write these patterns in Python, use a raw string such as r"\beb": in an ordinary string literal, “\b” is read as a backspace character before the regex ever sees it.

The “\A” sequence checks if the pattern matches the beginning of the string. For example, if we had the pattern “\AThe” and we had the string “The Tree”, then the pattern matches. However, if we had the string “Find The Tree”, then there is no match because “The” does not begin the string.

The sequence “\b” indicates that a pattern either begins or ends a word within the string.

  • If you would like to see if a word begins with “eb”, the pattern would look like “\beb”.
  • If you would like to see if a word ends with “eb”, the pattern would look like “eb\b”.
  • If we had the word “celeb”, it will not match the pattern “\beb” since it does not start with “eb”.

The word “celeb” will match the pattern “eb\b” since the word ends with “eb”. The sequence “\B” is used the same way as “\b” but has the exact opposite meaning: it matches only where the pattern does not sit at the beginning or end of a word. Let’s look at the previous example again. If we have the word “celeb” and the pattern “\Beb”, then the pattern matches since “eb” does not start the word. If we have the pattern “eb\B”, the word would not match the pattern since “eb” ends the word.
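
The position sequences can be checked with a short sketch, assuming Python’s “re” module and re.search (described later in this article). The patterns are written as raw strings (the r prefix) so that Python passes the backslash through to the regex engine unchanged.

```python
import re

# "\A" anchors the pattern to the beginning of the string.
print(re.search(r"\AThe", "The Tree") is not None)       # True
print(re.search(r"\AThe", "Find The Tree") is not None)  # False

word = "celeb"
starts_with_eb = re.search(r"\beb", word)  # "eb" at the start of a word
ends_with_eb = re.search(r"eb\b", word)    # "eb" at the end of a word
not_at_start = re.search(r"\Beb", word)    # "eb" not at the start of a word
not_at_end = re.search(r"eb\B", word)      # "eb" not at the end of a word

print(starts_with_eb is not None)  # False
print(ends_with_eb is not None)    # True
print(not_at_start is not None)    # True
print(not_at_end is not None)      # False
```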

Many of the other sequences are meant to segregate specific types of characters. For example, “\d” returns a match for any character that is a digit and “\D” returns matches for anything but a digit. For this reason, special sequences are used for very broad applications. If you just want to search all numbers, letters, or anything just as broad, special sequences are more convenient. Otherwise, the other meta-characters are recommended.
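
For instance, “\d” and “\D” split the mixed string from the introduction into digits and everything else. This is a minimal sketch assuming Python’s “re” module and re.findall (not covered in this article).

```python
import re

mixed = "a2435?#@s560"

digits = re.findall(r"\d", mixed)      # every digit
non_digits = re.findall(r"\D", mixed)  # everything that is not a digit

print(digits)      # ['2', '4', '3', '5', '5', '6', '0']
print(non_digits)  # ['a', '?', '#', '@', 's']
```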

Python programming using REGEX

Now, we can talk about the module that allows the Python programming language to use REGEX, “re”. The “re” module is part of Python’s standard library, so no separate installation is required if you already have Python installed. In order to use the module, all you have to do is import it.

import re

Let’s first talk about the compile function. Using “re.compile”, we can turn a REGEX into an object to be used later. You can also pass the pattern string directly to the module-level functions such as re.search, but creating an object is more convenient when you reuse the same pattern.

import re
pattern = "[abcABC]+"
regex = re.compile(pattern)
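
To illustrate the point about convenience, the sketch below compares passing the pattern string directly with reusing the compiled object. It relies on re.search, which is described next, and the list of names is made up for the example.

```python
import re

pattern = "[abcABC]+"
regex = re.compile(pattern)
names = ["Jaime", "Ron", "Archer"]  # hypothetical sample data

# Passing the pattern string to the module-level function each time.
direct = [re.search(pattern, name) is not None for name in names]

# Reusing the compiled object.
compiled = [regex.search(name) is not None for name in names]

print(direct)    # [True, False, True]
print(compiled)  # [True, False, True]
```

Both forms give the same results; the compiled object simply avoids repeating the pattern at every call.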

Next up is the search function. This will return a match object after finding the first instance of a regex in a string. For example:

import re
pattern = "[abcABC]+"
regex = re.compile(pattern)
string1 = "Jaime lives in Florida."
Results = regex.search(string1)
print(Results)
#Results = <re.Match object; span=(1, 2), match='a'>  (shown as _sre.SRE_Match before Python 3.7)

If there is no occurrence found, then only “None” is returned.

import re
pattern = "[abcABC]+"
regex = re.compile(pattern)
string2 = "Ron lives in New Jersey"
Results = regex.search(string2)
print(Results)
#Results = None
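
Because search returns either a match object or “None”, code that uses the result usually checks it first. Here is a minimal sketch using the match object’s standard group and span methods, which this article does not otherwise cover. The helper name describe is hypothetical.

```python
import re

regex = re.compile("[abcABC]+")

def describe(text):
    """Return a short summary of the first match, or a fallback message."""
    result = regex.search(text)
    if result is None:
        return "no match"
    # group() is the matched text; span() is its (start, end) position.
    return f"{result.group()!r} at {result.span()}"

print(describe("Jaime lives in Florida."))  # 'a' at (1, 2)
print(describe("Ron lives in New Jersey"))  # no match
```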

Now let’s cover the match function. The function checks if the beginning of the string matches the REGEX. If it does, it returns the match object; otherwise, it returns “None”. Note that match fails more often than search because of this extra restriction.

import re
pattern = "[abcABC]+"
regex = re.compile(pattern)
string3 = "Luis never makes excuses."
Results = regex.match(string3)
print(Results)
#Results = None
string4 = "Archer never makes excuses."
Results = regex.match(string4)
print(Results)
#Results = <re.Match object; span=(0, 1), match='A'>

Finally, there’s a function called “fullmatch”. Unlike re.match, re.fullmatch will check if the entire string matches the pattern exactly. For example:

import re
pattern = "[abcABC]+"
regex = re.compile(pattern)
string5 = "Another one bites the dust"
Results = regex.fullmatch(string5)
print(Results)
#Results = None
string6="ABCABabbcaa"
Results = regex.fullmatch(string6)
print(Results)
#Results = <re.Match object; span=(0, 11), match='ABCABabbcaa'>
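
To see the three functions side by side, the sketch below applies search, match, and fullmatch to the same strings. The helper name which_match and the test strings are hypothetical.

```python
import re

regex = re.compile("[abcABC]+")

def which_match(text):
    """Report which of the three functions finds a match in text."""
    return {
        "search": regex.search(text) is not None,
        "match": regex.match(text) is not None,
        "fullmatch": regex.fullmatch(text) is not None,
    }

print(which_match("xxabc"))  # only search succeeds
print(which_match("abcxx"))  # search and match succeed
print(which_match("abc"))    # all three succeed
```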

Choosing between re.match and re.search

It should be noted that, like many string methods, the search, match, and fullmatch methods of a compiled pattern can limit the scope of the search: you pass the start and end positions of the desired section of the string as optional arguments. Even so, re.match and re.search have similar purposes, which raises the question of which one is better to use.
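
A minimal sketch of limiting the scope, assuming a compiled pattern object. Its search and match methods accept optional start and end positions (named pos and endpos in the Python documentation), with the end position excluded from the search. The module-level re.search and re.match functions do not take these arguments.

```python
import re

regex = re.compile("[abcABC]+")
string1 = "Jaime lives in Florida."

# Start looking at index 2, which skips the "a" in "Jaime".
later = regex.search(string1, 2)
print(later.span())  # (21, 22)

# Restrict the search to indexes 2 through 9 ("ime live").
window = regex.search(string1, 2, 10)
print(window)  # None

# With a start position, match checks for the pattern at that index.
at_one = regex.match(string1, 1)
print(at_one.span())  # (1, 2)
```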

The main issue with re.match and re.fullmatch is that they are both very limited in which parts of the string they examine. Let’s say you had all the contents of a book in a text file, and you wanted to check the entire document for a particular, rare pattern in the form of a single word or phrase. The only real way to have a productive search with re.fullmatch is to split the entire document into a very large list of words and check each word individually. That can take time and memory. Now, how would we apply re.match to this problem?

In short, we would have to do much the same, since re.match only checks the beginning of a string. Here re.search is the better choice: it scans the entire string, so you don’t have to split the document at all.
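
The difference can be sketched on a small scale. The one-line document below is only a stand-in for the contents of a book, and the word being searched for is an arbitrary choice.

```python
import re

# A short stand-in for the contents of a book.
document = "It was the best of times, it was the worst of times"
regex = re.compile("worst")

# search scans the whole string in one call.
found = regex.search(document)
print(found.group())  # worst

# With fullmatch, the text has to be split and each word checked.
words = document.split()
word_hits = [w for w in words if regex.fullmatch(w)]
print(word_hits)  # ['worst']

# match only looks at the beginning, so it misses the word.
at_start = regex.match(document)
print(at_start)  # None
```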

Let’s look at a case where we only need to confirm that the beginning of a string matches the regex. This is quite easy if we use re.match. If we use re.search instead, it returns a match wherever re.match does, but because re.search looks for any occurrence within the string, it may also return a match where it absolutely shouldn’t. In this case, we could change the regex and add a caret (^) at the start in order to correct this. However, it would be easier to use re.match.
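
Reusing the “Find The Tree” string from earlier, a short sketch of this situation might look as follows. The variable names are hypothetical.

```python
import re

title = "Find The Tree"

# match only succeeds if "The" is at the beginning.
at_start = re.match("The", title)
print(at_start)  # None

# search finds "The" later in the string, which is not what we want here.
anywhere = re.search("The", title)
print(anywhere is not None)  # True

# Adding a caret makes search behave like match for this string.
anchored = re.search("^The", title)
print(anchored)  # None
```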

Concluding thoughts

Ultimately, Regex is a versatile tool for analyzing any form of text. You can scan through documents of many formats and of any volume to pull out specific information. The applications range from scanning a book to find all of the occurrences of a word to scanning an online directory to find the contact information of specific companies. Automating these detail-intensive tasks is valuable, and after reading this article, you have taken the first steps toward mastering this tool.