8  Regular Expressions

Regular expressions are a tiny language for pattern matching. They allow complex patterns to be expressed in very few characters. They are used for finding patterns in text, replacing some patterns with others, and checking that a given text conforms to a known pattern.

Let us see a small example.

>>> import re
>>> m = re.match("ab+", "abbbbcd")
>>> m
<re.Match object; span=(0, 5), match='abbbb'>
>>> m.group()
'abbbb'

The standard library module re provides regular expression support in Python.

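The pattern ab+ matches the character a followed by one or more b characters. The function re.match looks for the pattern only at the beginning of the string, which is why the match above covers positions 0 to 5.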
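
The behaviour of re.match with ab+ can be checked with a short sketch. The second input string, "cabbb", is made up for illustration, and re.search is not used elsewhere in this chapter; it is brought in here only to contrast with re.match.

```python
import re

# re.match tries the pattern only at the very start of the string.
m = re.match(r"ab+", "abbbbcd")
matched = m.group()                          # 'abbbb'

# The string starts with 'c', so there is no match at position 0
# and re.match returns None instead of a match object.
no_match = re.match(r"ab+", "cabbb")

# re.search scans the string and returns the first place where the
# pattern matches, wherever that is.
found = re.search(r"ab+", "cabbb").group()   # 'abbb'
```

Because re.match returns None when nothing matches, it is safer to test the result before calling group() on it.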

Let us look at common patterns supported by regular expressions.

Patterns:

c - the character c itself (an ordinary character matches itself)
. - any character (except a newline, by default)
[abcd] - any one of the characters listed inside the brackets
[^abcd] - any character other than the ones listed inside the brackets
x* - zero or more occurrences of x (x could be any of the above patterns)
x+ - one or more occurrences of x
x? - zero or one occurrence of x
(x) - match x and also remember it for use in substitution

\d - any digit
\s - any whitespace

^ - beginning of a string
$ - end of a string
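
As a rough illustration of the patterns listed above, the following sketch tries each one with re.findall. The sample strings and variable names are invented for this example, and the range [a-z] is used as shorthand for "any lowercase letter", in the same way that [0-9] stands for any digit.

```python
import re

# . matches any single character
any_char = re.findall(r"a.c", "abc a-c ac")          # ['abc', 'a-c']

# [abcd] matches one of the listed characters, [^abcd] anything else
in_set = re.findall(r"[abcd]", "bread")              # ['b', 'a', 'd']
not_in_set = re.findall(r"[^abcd]", "bread")         # ['r', 'e']

# *, + and ? control how many times the preceding pattern may repeat
star = re.findall(r"ab*", "a ab abbb")               # ['a', 'ab', 'abbb']
plus = re.findall(r"ab+", "a ab abbb")               # ['ab', 'abbb']
optional = re.findall(r"colou?r", "color colour")    # ['color', 'colour']

# \d matches a digit and \s matches a whitespace character
digits = re.findall(r"\d", "a1b22")                  # ['1', '2', '2']
spaces = re.findall(r"\s", "a b\tc")                 # [' ', '\t']

# ^ and $ tie the pattern to the beginning and the end of the string
at_start = re.findall(r"^[0-9]+", "10 apples and 20 mangos")   # ['10']
at_end = re.findall(r"[a-z]+$", "10 apples and 20 mangos")     # ['mangos']
```

The patterns are written as raw strings (with an r prefix) so that backslashes in \d and \s reach the re module unchanged.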

Let us look at a simple example.

>>> text = "10 apples and 20 mangos"

Extract all numbers from text.

>>> re.findall("[0-9]+", text)
['10', '20']
>>> re.findall("[0-9]+", "1 apple, 2 oranges and 3 mangos")
['1', '2', '3']
>>> re.sub("[0-9]+", "xx", text)
'xx apples and xx mangos'
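
The (x) pattern from the list above remembers what it matched so that the replacement can reuse it. A minimal sketch, assuming we want to rewrite each "number word" pair as "word: number"; the output format and the variable names are chosen only for illustration.

```python
import re

text = "10 apples and 20 mangos"

# ([0-9]+) remembers the number and ([a-z]+) remembers the word after it.
pattern = r"([0-9]+) ([a-z]+)"

# With groups in the pattern, findall returns one tuple per match.
pairs = re.findall(pattern, text)
# [('10', 'apples'), ('20', 'mangos')]

# In the replacement string, \1 and \2 stand for the first and second group.
swapped = re.sub(pattern, r"\2: \1", text)
# 'apples: 10 and mangos: 20'
```

The word "and" is left alone because it is not preceded by a number, so the pattern does not match there.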

Problem: Write a function squeeze to replace each run of consecutive space characters with a single space.

>>> squeeze("a   b   c d")
'a b c d'

8.1 Resources

The Regular Expressions chapter in Dive Into Python is a very good resource to learn about regular expressions in depth.