Text Processing
- String methods
- Regular Expressions
- Pattern matching and extraction
- Search and Replace
- Compiling Regular Expressions
- Further Reading on Regular Expressions
String methods
- translate string characters
- `str.maketrans()` to get translation table
- `translate()` to perform the string mapping based on translation table
- the first argument to `maketrans()` is string characters to be replaced, the second is characters to replace with and the third is characters to be mapped to `None`
- character translation examples

    >>> greeting = '===== Have a great day ====='
    >>> greeting.translate(str.maketrans('=', '-'))
    '----- Have a great day -----'

    >>> greeting = '===== Have a great day!! ====='
    >>> greeting.translate(str.maketrans('=', '-', '!'))
    '----- Have a great day -----'

    >>> import string
    >>> quote = 'SIMPLICITY IS THE ULTIMATE SOPHISTICATION'
    >>> tr_table = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)
    >>> quote.translate(tr_table)
    'simplicity is the ultimate sophistication'

    >>> sentence = "Thi1s is34 a senten6ce"
    >>> sentence.translate(str.maketrans('', '', string.digits))
    'This is a sentence'
    >>> greeting.translate(str.maketrans('', '', string.punctuation))
    ' Have a great day '
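All three arguments of `maketrans()` can also be combined in a single table. The following is a minimal sketch, assuming a made-up input string, that lowercases ASCII letters and drops punctuation in one pass:

```python
import string

# map uppercase to lowercase and delete punctuation with one translation table
table = str.maketrans(string.ascii_uppercase, string.ascii_lowercase,
                      string.punctuation)

text = 'Hello, World! Have a GREAT day.'   # illustrative input
normalized = text.translate(table)
print(normalized)           # hello world have a great day
print(normalized.split())   # ['hello', 'world', 'have', 'a', 'great', 'day']
```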
- removing leading/trailing/both characters
- only consecutive characters from start/end string are removed
- by default whitespace characters are stripped
- if more than one character is specified, it is treated as a set and all combinations of it are used
    >>> greeting = ' Have a nice day :) '
    >>> greeting.strip()
    'Have a nice day :)'
    >>> greeting.rstrip()
    ' Have a nice day :)'
    >>> greeting.lstrip()
    'Have a nice day :) '

    >>> greeting.strip(') :')
    'Have a nice day'

    >>> greeting = '===== Have a great day!! ====='
    >>> greeting.strip('=')
    ' Have a great day!! '
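Because the argument is treated as a set of characters, the strip methods are easy to mistake for prefix/suffix removal. A small sketch of the pitfall and one possible workaround (the file name is made up for illustration):

```python
name = 'data.tab'   # illustrative file name

# '.tab' is read as the set {'.', 't', 'a', 'b'}, so more than the suffix is removed
over_stripped = name.rstrip('.tab')
print(over_stripped)   # d

# to remove an exact suffix, check for it and slice it off
suffix = '.tab'
base = name[:-len(suffix)] if name.endswith(suffix) else name
print(base)            # data
```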
- styling
- width argument specifies total output string length
    >>> ' Hello World '.center(40, '*')
    '************* Hello World **************'
- changing case and case checking
    >>> sentence = 'thIs iS a saMple StrIng'
    >>> sentence.capitalize()
    'This is a sample string'
    >>> sentence.title()
    'This Is A Sample String'
    >>> sentence.lower()
    'this is a sample string'
    >>> sentence.upper()
    'THIS IS A SAMPLE STRING'
    >>> sentence.swapcase()
    'THiS Is A SAmPLE sTRiNG'

    >>> 'good'.islower()
    True
    >>> 'good'.isupper()
    False
- check if string is made up of numbers
    >>> '1'.isnumeric()
    True
    >>> 'abc1'.isnumeric()
    False
    >>> '1.2'.isnumeric()
    False
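`isnumeric()` tests each character, so strings with a sign or a decimal point fail the check. One common workaround is to attempt the conversion and catch the failure. The sketch below uses a hypothetical helper named `is_float`, which is not part of the standard library:

```python
def is_float(text):
    """Return True if text can be converted by float(), else False."""
    try:
        float(text)
    except ValueError:
        return False
    return True

print(is_float('1.2'))    # True
print(is_float('-3'))     # True
print(is_float('abc1'))   # False
```

Note that `float()` also accepts values such as `'nan'` and `'1e5'`, so this check is broader than "digits with an optional decimal point".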
- check if character sequence is present or not
    >>> sentence = 'This is a sample string'
    >>> 'is' in sentence
    True
    >>> 'this' in sentence
    False
    >>> 'This' in sentence
    True
    >>> 'this' in sentence.lower()
    True
    >>> 'is a' in sentence
    True
    >>> 'test' not in sentence
    True
- get number of times character sequence is present (non-overlapping)
    >>> sentence = 'This is a sample string'
    >>> sentence.count('is')
    2
    >>> sentence.count('w')
    0

    >>> word = 'phototonic'
    >>> word.count('oto')
    1
- matching character sequence at start/end of string
    >>> sentence
    'This is a sample string'
    >>> sentence.startswith('This')
    True
    >>> sentence.startswith('The')
    False
    >>> sentence.endswith('ing')
    True
    >>> sentence.endswith('ly')
    False
- split string based on character sequence
- returns a list
- to split using regular expressions, use `re.split()` instead

    >>> sentence = 'This is a sample string'
    >>> sentence.split()
    ['This', 'is', 'a', 'sample', 'string']

    >>> "oranges:5".split(':')
    ['oranges', '5']
    >>> "oranges :: 5".split(' :: ')
    ['oranges', '5']

    >>> "a e i o u".split(' ', maxsplit=1)
    ['a', 'e i o u']
    >>> "a e i o u".split(' ', maxsplit=2)
    ['a', 'e', 'i o u']

    >>> line = '{1.0 2.0 3.0}'
    >>> nums = [float(s) for s in line.strip('{}').split()]
    >>> nums
    [1.0, 2.0, 3.0]
- joining list of strings
    >>> str_list
    ['This', 'is', 'a', 'sample', 'string']
    >>> ' '.join(str_list)
    'This is a sample string'
    >>> '-'.join(str_list)
    'This-is-a-sample-string'

    >>> c = ' :: '
    >>> c.join(str_list)
    'This :: is :: a :: sample :: string'
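`split()` and `join()` are often used together. As a minimal sketch with made-up input strings, the following normalizes runs of whitespace and converts a colon-separated record to a comma-separated one:

```python
messy = '  This   is \t a  sample\nstring '

# split() without arguments splits on any run of whitespace
# and ignores leading/trailing whitespace
clean = ' '.join(messy.split())
print(clean)      # This is a sample string

# change the delimiter of a record
record = 'oranges:5:fresh'
csv_line = ','.join(record.split(':'))
print(csv_line)   # oranges,5,fresh
```

`join()` expects every element to be a string, so convert numbers with `str()` first.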
- replace characters
- third argument specifies how many times replace has to be performed
- variable has to be explicitly re-assigned to change its value
    >>> phrase = '2 be or not 2 be'
    >>> phrase.replace('2', 'to')
    'to be or not to be'

    >>> phrase
    '2 be or not 2 be'

    >>> phrase.replace('2', 'to', 1)
    'to be or not 2 be'

    >>> phrase = phrase.replace('2', 'to')
    >>> phrase
    'to be or not to be'
Further Reading
Regular Expressions
- Handy reference of regular expression (RE) elements
| Meta characters | Description |
|---|---|
| `\A` | anchor to restrict matching to beginning of string |
| `\Z` | anchor to restrict matching to end of string |
| `^` | anchor to restrict matching to beginning of line |
| `$` | anchor to restrict matching to end of line |
| `.` | Match any character except newline character `\n` |
| `\|` | OR operator for matching multiple patterns |
| `(RE)` | capturing group |
| `(?:RE)` | non-capturing group |
| `[]` | Character class - match one character among many |
| `\^` | prefix `\` to literally match meta characters like `^` |

| Greedy Quantifiers | Description |
|---|---|
| `*` | Match zero or more times |
| `+` | Match one or more times |
| `?` | Match zero or one times |
| `{m,n}` | Match `m` to `n` times (inclusive) |
| `{m,}` | Match at least `m` times |
| `{,n}` | Match up to `n` times (including 0 times) |
| `{n}` | Match exactly `n` times |

Appending a ? to greedy quantifiers makes them non-greedy
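
The difference between greedy and non-greedy matching is easiest to see side by side. A short sketch with made-up input strings:

```python
import re

tags = '<a> <b> <c>'

# greedy: .* grabs as much as possible, so there is one long match
greedy = re.findall(r'<.*>', tags)
print(greedy)   # ['<a> <b> <c>']

# non-greedy: .*? stops at the first position that lets the match succeed
lazy = re.findall(r'<.*?>', tags)
print(lazy)     # ['<a>', '<b>', '<c>']

# the same applies to the {m,n} form
most = re.search(r'\d{2,4}', '12345')[0]
least = re.search(r'\d{2,4}?', '12345')[0]
print(most, least)   # 1234 12
```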
| Character classes | Description |
|---|---|
| `[aeiou]` | Match any vowel |
| `[^aeiou]` | `^` inverts selection, so this matches any character other than a vowel (consonants, digits, spaces, etc) |
| `[a-f]` | `-` defines a range, so this matches any of `abcdef` characters |
| `\d` | Match a digit, same as `[0-9]` |
| `\D` | Match non-digit, same as `[^0-9]` or `[^\d]` |
| `\w` | Match alphanumeric and underscore character, same as `[a-zA-Z0-9_]` |
| `\W` | Match any character other than alphanumeric and underscore, same as `[^a-zA-Z0-9_]` or `[^\w]` |
| `\s` | Match white-space character, same as `[\ \t\n\r\f\v]` |
| `\S` | Match non white-space character, same as `[^\s]` |
| `\b` | word boundary, see `\w` for characters constituting a word |
| `\B` | not a word boundary |

For `str` patterns, `\d`, `\w` and `\s` are Unicode-aware by default, so the bracket equivalents shown above hold for ASCII text or when the `re.ASCII` flag is used.

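
A brief sketch of `\b`, `\B` and `\w` in action. The sample strings are made up, and the non-ASCII letters are written as escapes so that the snippet does not depend on file encoding:

```python
import re

words = 'par spar apparent spare part'

# \b on both sides: only the whole word 'par'
whole = re.findall(r'\bpar\b', words)
print(whole)   # ['par']

# \B on both sides: 'par' only when surrounded by word characters
inner = re.sub(r'\Bpar\B', 'X', words)
print(inner)   # par spar apXent sXe part

# \w matches Unicode letters for str patterns unless re.ASCII is set
accented = 'h\u00e9llo w\u00f6rld'
unicode_words = re.findall(r'\w+', accented)
ascii_words = re.findall(r'\w+', accented, flags=re.ASCII)
print(len(unicode_words), len(ascii_words))   # 2 4
```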
| Flags | Description |
|---|---|
| `re.I` | Ignore case |
| `re.M` | Multiline mode, `^` and `$` anchors work on lines |
| `re.S` | Singleline mode, `.` will also match `\n` |
| `re.X` | Verbose mode, for better readability and adding comments |

See Python docs - Compilation Flags for more details and long names for flags
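
The effect of each flag can be checked with a few one-liners. This is a minimal sketch; the input text and the date pattern are made up for illustration:

```python
import re

text = 'Hi there\nHave a nice day'

# without re.M, ^ matches only at the start of the string
first_only = re.findall(r'^\w+', text)
each_line = re.findall(r'^\w+', text, flags=re.M)
print(first_only, each_line)   # ['Hi'] ['Hi', 'Have']

# without re.S, . does not match the newline character
no_match = re.search(r'there.*Have', text)
across = re.search(r'there.*Have', text, flags=re.S)
print(no_match, repr(across[0]))   # None 'there\nHave'

# re.I ignores case
print(re.search(r'have', text, flags=re.I)[0])   # Have

# re.X allows whitespace and comments inside the pattern
date_re = re.compile(r'''
    (\d{4}) -   # year
    (\d{2}) -   # month
    (\d{2})     # day
    ''', flags=re.X)
date_parts = date_re.search('due 2024-01-31').groups()
print(date_parts)   # ('2024', '01', '31')
```

Multiple flags can be combined with the `|` operator, for example `flags=re.I|re.M`.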
| Variable | Description |
|---|---|
| `\1`, `\2`, `\3` ... `\99` | backreferencing matched patterns |
| `\g<1>`, `\g<2>`, `\g<3>` ... | backreferencing matched patterns, prevents ambiguity |
| `\g<0>` | entire matched portion |

\0 and \100 onwards are considered as octal values, hence cannot be used as backreference.
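
The ambiguity that `\g<1>` avoids shows up when a digit has to follow the backreference. A small sketch, with a made-up input string:

```python
import re

fruits = '52 apples and 31 mangoes'

# \g<1>0 means "group 1 followed by a literal 0"
appended = re.sub(r'(\d+)', r'\g<1>0', fruits)
print(appended)   # 520 apples and 310 mangoes

# \10 is read as a reference to group 10, which does not exist here
try:
    re.sub(r'(\d+)', r'\10', fruits)
    raised = False
except re.error as err:
    raised = True
    print('error:', err)
```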
Pattern matching and extraction
To match/extract sequence of characters, use
- `re.search()` to see if input string contains a pattern or not
- `re.findall()` to get a list of all matching portions
- `re.finditer()` to get an iterator of `re.Match` objects of all matching portions
- `re.split()` to get a list from splitting input string based on a pattern

Their syntax is as follows:

    re.search(pattern, string, flags=0)
    re.findall(pattern, string, flags=0)
    re.finditer(pattern, string, flags=0)
    re.split(pattern, string, maxsplit=0, flags=0)

- As a good practice, always use raw strings to construct RE, unless other formats are required
- this will avoid clash of backslash escaping between RE and normal quoted strings
- examples for `re.search`

    >>> sentence = 'This is a sample string'

    # using normal string methods
    >>> 'is' in sentence
    True
    >>> 'xyz' in sentence
    False

    # need to load the re module before use
    >>> import re
    # check if 'sentence' contains the pattern described by RE argument
    >>> bool(re.search(r'is', sentence))
    True
    >>> bool(re.search(r'this', sentence, flags=re.I))
    True
    >>> bool(re.search(r'xyz', sentence))
    False
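The advice about raw strings matters as soon as the RE contains a backslash. In a normal string, `'\b'` is the backspace character, not the two characters the `re` module needs to see for a word boundary. A minimal sketch:

```python
import re

sentence = 'This is a sample string'

# normal string: '\b' becomes a backspace character, so nothing matches
plain = re.findall('\bis\b', sentence)
print(plain)     # []

# raw string: the backslash reaches the re module intact
raw = re.findall(r'\bis\b', sentence)
print(raw)       # ['is']

# the normal-string equivalent needs doubled backslashes
escaped = re.findall('\\bis\\b', sentence)
print(escaped)   # ['is']
```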
- examples for `re.findall`

    # match whole word par with optional s at start and e at end
    >>> re.findall(r'\bs?pare?\b', 'par spar apparent spare part pare')
    ['par', 'spar', 'spare', 'pare']

    # numbers >= 100 with optional leading zeros
    >>> re.findall(r'\b0*[1-9]\d{2,}\b', '0501 035 154 12 26 98234')
    ['0501', '154', '98234']

    # if multiple capturing groups are used, each element of output
    # will be a tuple of strings of all the capture groups
    >>> re.findall(r'(x*):(y*)', 'xx:yyy x: x:yy :y')
    [('xx', 'yyy'), ('x', ''), ('x', 'yy'), ('', 'y')]

    # normal capture group will hinder ability to get whole match
    # non-capturing group to the rescue
    >>> re.findall(r'\b\w*(?:st|in)\b', 'cost akin more east run against')
    ['cost', 'akin', 'east', 'against']

    # useful for debugging purposes as well before applying substitution
    >>> re.findall(r't.*?a', 'that is quite a fabricated tale')
    ['tha', 't is quite a', 'ted ta']
- examples for `re.split`

    # split based on one or more digit characters
    >>> re.split(r'\d+', 'Sample123string42with777numbers')
    ['Sample', 'string', 'with', 'numbers']

    # split based on digit or whitespace characters
    >>> re.split(r'[\d\s]+', '**1\f2\n3star\t7 77\r**')
    ['**', 'star', '**']

    # to include the matching delimiter strings as well in the output
    >>> re.split(r'(\d+)', 'Sample123string42with777numbers')
    ['Sample', '123', 'string', '42', 'with', '777', 'numbers']

    # use non-capturing group if capturing is not needed
    >>> re.split(r'hand(?:y|ful)', '123handed42handy777handful500')
    ['123handed42', '777', '500']
- backreferencing
    # whole words that have at least one consecutive repeated character
    >>> words = ['effort', 'flee', 'facade', 'oddball', 'rat', 'tool']
    >>> [w for w in words if re.search(r'\b\w*(\w)\1\w*\b', w)]
    ['effort', 'flee', 'oddball', 'tool']
- The `re.search` function returns a `re.Match` object from which various details can be extracted like the matched portion of string, location of matched portion, etc
- Note that output here is shown for Python version 3.7

    >>> re.search(r'b.*d', 'abc ac adc abbbc')
    <re.Match object; span=(1, 9), match='bc ac ad'>
    # retrieving entire matched portion
    >>> re.search(r'b.*d', 'abc ac adc abbbc')[0]
    'bc ac ad'

    # capture group example
    >>> m = re.search(r'a(.*)d(.*a)', 'abc ac adc abbbc')
    # to get matched portion of second capture group
    >>> m[2]
    'c a'
    # to get a tuple of all the capture groups
    >>> m.groups()
    ('bc ac a', 'c a')
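Location details are available through the `span()`, `start()` and `end()` methods of the match object. Also, `re.search` returns `None` when nothing matches, so a check is needed before indexing the result. A short sketch reusing the input string from the example above:

```python
import re

text = 'abc ac adc abbbc'
m = re.search(r'a(.*)d(.*a)', text)

# location of the entire match and of individual groups
print(m.span())               # (0, 12)
print(m.span(1))              # (1, 8)
print(m.start(2), m.end(2))   # 9 12

# the span can be used to slice the original string
second = text[m.start(2):m.end(2)]
print(second)                 # c a

# guard against a failed search before using the result
miss = re.search(r'xyz', text)
found = miss[0] if miss else 'no match'
print(found)                  # no match
```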
- examples for `re.finditer`

    >>> m_iter = re.finditer(r'(x*):(y*)', 'xx:yyy x: x:yy :y')
    >>> [(m[1], m[2]) for m in m_iter]
    [('xx', 'yyy'), ('x', ''), ('x', 'yy'), ('', 'y')]

    >>> m_iter = re.finditer(r'ab+c', 'abc ac adc abbbc')
    >>> for m in m_iter:
    ...     print(m.span())
    ...
    (0, 3)
    (11, 16)
Search and Replace
Syntax
    re.sub(pattern, repl, string, count=0, flags=0)
- examples
- Note that as strings are immutable, `re.sub` will not change the value of the variable passed to it; the result has to be explicitly assigned

    >>> ip_lines = "catapults\nconcatenate\ncat"
    >>> print(re.sub(r'^', r'* ', ip_lines, flags=re.M))
    * catapults
    * concatenate
    * cat

    # replace 'par' only at start of word
    >>> re.sub(r'\bpar', r'X', 'par spar apparent spare part')
    'X spar apparent spare Xt'

    # same as: r'part|parrot|parent'
    >>> re.sub(r'par(en|ro)?t', r'X', 'par part parrot parent')
    'par X X X'

    # remove first two columns where : is delimiter
    >>> re.sub(r'\A([^:]+:){2}', r'', 'foo:123:bar:baz', count=1)
    'bar:baz'
- backreferencing
    # remove any number of consecutive duplicate words separated by space
    # quantifiers can be applied to backreferences too!
    >>> re.sub(r'\b(\w+)( \1)+\b', r'\1', 'aa a a a 42 f_1 f_1 f_13.14')
    'aa a 42 f_1 f_13.14'

    # add something around the matched strings
    >>> re.sub(r'\d+', r'(\g<0>0)', '52 apples and 31 mangoes')
    '(520) apples and (310) mangoes'

    # swap words that are separated by a comma
    >>> re.sub(r'(\w+),(\w+)', r'\2,\1', 'a,b 42,24')
    'b,a 24,42'
- using functions in replace part of `re.sub()`
- Note that Python version 3.7 is used here

    >>> from math import factorial
    >>> numbers = '1 2 3 4 5'
    >>> def fact_num(n):
    ...     return str(factorial(int(n[0])))
    ...
    >>> re.sub(r'\d+', fact_num, numbers)
    '1 2 6 24 120'

    # using lambda
    >>> re.sub(r'\d+', lambda m: str(factorial(int(m[0]))), numbers)
    '1 2 6 24 120'
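Since the function receives the match object, it can also look the matched text up in a dictionary. That makes it possible to swap words in one pass, which chained `str.replace` calls cannot do. A minimal sketch with made-up data:

```python
import re

# hypothetical mapping, for illustration only
swap = {'cat': 'dog', 'dog': 'cat'}

swapped = re.sub(r'\b(?:cat|dog)\b', lambda m: swap[m[0]], 'cat chases dog')
print(swapped)   # dog chases cat

# arithmetic on the matched number
doubled = re.sub(r'\d+', lambda m: str(int(m[0]) * 2), '4 apples and 10 mangoes')
print(doubled)   # 8 apples and 20 mangoes
```

The function must return a string, hence the `str()` call in the second example.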
Compiling Regular Expressions
- Regular expressions can be compiled using `re.compile` function, which gives back a `re.Pattern` object
- The top level `re` module functions are all available as methods for this object
- Compiling a regular expression helps if the RE has to be used in multiple places or called upon multiple times inside a loop (speed benefit)
- By default, Python maintains a small list of recently used RE, so the speed benefit doesn't apply for trivial use cases

    >>> pet = re.compile(r'dog')
    >>> type(pet)
    <class 're.Pattern'>
    >>> bool(pet.search('They bought a dog'))
    True
    >>> bool(pet.search('A cat crossed their path'))
    False

    >>> remove_parentheses = re.compile(r'\([^)]*\)')
    >>> remove_parentheses.sub('', 'a+b(addition) - foo() + c%d(#modulo)')
    'a+b - foo + c%d'
    >>> remove_parentheses.sub('', 'Hi there(greeting). Nice day(a(b)')
    'Hi there. Nice day'
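A typical use of a compiled pattern is applying the same RE to many strings. The sketch below uses made-up input lines and also shows that flags are passed once, at compile time, while arguments like `count` are still given to the method:

```python
import re

# compile once, reuse for every line
word_re = re.compile(r'\b\w*(?:st|in)\b')
lines = ['cost akin more', 'east run against', 'nothing here']
found = [word_re.findall(line) for line in lines]
print(found)   # [['cost', 'akin'], ['east', 'against'], []]

# flags go to re.compile, not to the methods
pet = re.compile(r'dog', flags=re.I)
print(bool(pet.search('A DOG barked')))   # True

# other arguments, such as count, are passed to the method as usual
first_only = pet.sub('cat', 'Dog days, dog tags', count=1)
print(first_only)   # cat days, dog tags
```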
Further Reading on Regular Expressions
- Python re(gex)? - a book on regular expressions
- Python docs - re module
- Python docs - introductory tutorial to using regular expressions
- Comprehensive reference: What does this regex mean?
- rexegg - tutorials, tricks and more
- regular-expressions - tutorials and tools
- CommonRegex - collection of common regular expressions
- Practice tools
- regex101 - visual aid and online testing tool for regular expressions, select flavor as Python before use
- debuggex - railroad diagrams for regular expressions, select flavor as Python before use
- regexone - interactive tutorial
- cheatsheet - one can also learn it interactively
- regexcrossword - practice by solving crosswords, read 'How to play' section before you start