
Text Processing

String methods

translate string characters
- str.maketrans() to get translation table
- translate() to perform the string mapping based on translation table
the first argument to maketrans() is the string of characters to be replaced, the second is the equal-length string of replacement characters, and the optional third is the string of characters to be mapped to None (i.e. deleted)
character translation examples

>>> greeting = '===== Have a great day ====='
>>> greeting.translate(str.maketrans('=', '-'))
'----- Have a great day -----'

>>> greeting = '===== Have a great day!! ====='
>>> greeting.translate(str.maketrans('=', '-', '!'))
'----- Have a great day -----'

>>> import string
>>> quote = 'SIMPLICITY IS THE ULTIMATE SOPHISTICATION'
>>> tr_table = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)
>>> quote.translate(tr_table)
'simplicity is the ultimate sophistication'

>>> sentence = "Thi1s is34 a senten6ce"
>>> sentence.translate(str.maketrans('', '', string.digits))
'This is a sentence'
>>> greeting.translate(str.maketrans('', '', string.punctuation))
' Have a great day '
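
str.maketrans() can also be called with a single dictionary argument, which helps when a character has to be replaced by more than one character. The following is a minimal sketch; the mapping and the sample text are made up for illustration.

```python
# Sketch only: the mapping and the sample text are assumptions for illustration.
# Dictionary form: keys are single characters, values are replacement strings
# (of any length) or None to delete the character.
table = str.maketrans({'<': '&lt;', '>': '&gt;', '&': '&amp;', '!': None})
escaped = 'a < b && b > c!'.translate(table)
print(escaped)  # a &lt; b &amp;&amp; b &gt; c
```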

removing leading/trailing/both characters
only consecutive characters from the start/end of the string are removed
by default whitespace characters are stripped
if more than one character is specified, the argument is treated as a set of characters and any combination of them is removed

>>> greeting = '      Have a nice day :)     '
>>> greeting.strip()
'Have a nice day :)'
>>> greeting.rstrip()
'      Have a nice day :)'
>>> greeting.lstrip()
'Have a nice day :)     '

>>> greeting.strip(') :')
'Have a nice day'

>>> greeting = '===== Have a great day!! ====='
>>> greeting.strip('=')
' Have a great day!! '
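
Because the argument is a set of characters and not a literal prefix/suffix, stripping a file extension this way can remove more than intended. A small sketch of the pitfall and one possible workaround; the filename is a hypothetical example.

```python
# Hypothetical filename, used only to illustrate the set behavior of rstrip
filename = 'report.txt'

# '.', 't' and 'x' are each removed from the end, including the 't' of 'report'
trimmed = filename.rstrip('.txt')
print(trimmed)  # repor

# One way to remove an exact suffix: check with endswith, then slice
suffix = '.txt'
base = filename[:-len(suffix)] if filename.endswith(suffix) else filename
print(base)  # report
```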

styling
width argument specifies total output string length

>>> ' Hello World '.center(40, '*')
'************* Hello World **************'
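
The ljust() and rjust() methods take the same width and fill character arguments as center(). A short sketch, assuming some made-up item names and quantities, that lines up two columns:

```python
# Hypothetical sample data for illustration
items = [('apple', 3), ('fig', 12)]

# left-align the name in 10 columns, right-align the quantity in 4 columns
rows = [name.ljust(10, '.') + str(qty).rjust(4) for name, qty in items]
for row in rows:
    print(row)
# apple.....   3
# fig.......  12
```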

changing case and case checking

>>> sentence = 'thIs iS a saMple StrIng'

>>> sentence.capitalize()
'This is a sample string'

>>> sentence.title()
'This Is A Sample String'

>>> sentence.lower()
'this is a sample string'

>>> sentence.upper()
'THIS IS A SAMPLE STRING'

>>> sentence.swapcase()
'THiS Is A SAmPLE sTRiNG'

>>> 'good'.islower()
True

>>> 'good'.isupper()
False

check if string is made up of numeric characters only (a sign or decimal point makes the check fail)

>>> '1'.isnumeric()
True
>>> 'abc1'.isnumeric()
False
>>> '1.2'.isnumeric()
False
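
Since isnumeric() rejects signs and decimal points, one common way to check whether a string represents a number is to attempt the conversion. A minimal sketch; the helper name is_float is hypothetical and not part of the standard library.

```python
def is_float(text):
    """Return True if float() can parse text (hypothetical helper)."""
    try:
        float(text)
    except ValueError:
        return False
    return True

print(is_float('1.2'))     # True
print(is_float('-3'))      # True
print(is_float('abc1'))    # False
print('-3'.isnumeric())    # False, the sign is not a numeric character
```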

check if character sequence is present or not

>>> sentence = 'This is a sample string'
>>> 'is' in sentence
True
>>> 'this' in sentence
False
>>> 'This' in sentence
True
>>> 'this' in sentence.lower()
True
>>> 'is a' in sentence
True
>>> 'test' not in sentence
True

get number of times character sequence is present (non-overlapping)

>>> sentence = 'This is a sample string'
>>> sentence.count('is')
2
>>> sentence.count('w')
0

>>> word = 'phototonic'
>>> word.count('oto')
1
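
If overlapping occurrences need to be counted, one possible approach is a regular expression lookahead, which matches a position without consuming characters. This is only a sketch; lookarounds are not covered elsewhere in these notes.

```python
import re

word = 'phototonic'

# str.count skips past each match, so the second 'oto' (sharing an 'o') is missed
non_overlapping = word.count('oto')
print(non_overlapping)  # 1

# (?=oto) matches at every position where 'oto' starts
overlapping = len(re.findall(r'(?=oto)', word))
print(overlapping)  # 2
```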

matching character sequence at start/end of string

>>> sentence
'This is a sample string'

>>> sentence.startswith('This')
True
>>> sentence.startswith('The')
False

>>> sentence.endswith('ing')
True
>>> sentence.endswith('ly')
False

split string based on character sequence
returns a list
to split using regular expressions, use re.split() instead

>>> sentence = 'This is a sample string'

>>> sentence.split()
['This', 'is', 'a', 'sample', 'string']

>>> "oranges:5".split(':') 
['oranges', '5']
>>> "oranges :: 5".split(' :: ') 
['oranges', '5']

>>> "a e i o u".split(' ', maxsplit=1) 
['a', 'e i o u']
>>> "a e i o u".split(' ', maxsplit=2) 
['a', 'e', 'i o u']

>>> line = '{1.0 2.0 3.0}'
>>> nums = [float(s) for s in line.strip('{}').split()]
>>> nums
[1.0, 2.0, 3.0]

joining list of strings

>>> str_list = ['This', 'is', 'a', 'sample', 'string']
>>> ' '.join(str_list)
'This is a sample string'
>>> '-'.join(str_list)
'This-is-a-sample-string'

>>> c = ' :: '
>>> c.join(str_list)
'This :: is :: a :: sample :: string'

replace characters
third argument specifies how many times replace has to be performed
variable has to be explicitly re-assigned to change its value

>>> phrase = '2 be or not 2 be'
>>> phrase.replace('2', 'to')
'to be or not to be'

>>> phrase
'2 be or not 2 be'

>>> phrase.replace('2', 'to', 1)
'to be or not 2 be'

>>> phrase = phrase.replace('2', 'to')
>>> phrase
'to be or not to be'

Regular Expressions

Handy reference of regular expression (RE) elements

Meta characters	Description
`\A`	anchor to restrict matching to beginning of string
`\Z`	anchor to restrict matching to end of string
`^`	anchor to restrict matching to beginning of line
`$`	anchor to restrict matching to end of line
`.`	Match any character except newline character `\n`
`|`	OR operator for matching multiple patterns
`(RE)`	capturing group
`(?:RE)`	non-capturing group
`[]`	Character class - match one character among many
`\^`	prefix `\` to literally match meta characters like `^`

Greedy Quantifiers	Description
`*`	Match zero or more times
`+`	Match one or more times
`?`	Match zero or one times
`{m,n}`	Match `m` to `n` times (inclusive)
`{m,}`	Match at least m times
`{,n}`	Match up to `n` times (including `0` times)
`{n}`	Match exactly n times

Appending a ? to greedy quantifiers makes them non-greedy
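
A small sketch of the difference, using an arbitrary colon-separated sample string:

```python
import re

text = 'foo:123:bar:baz'

# greedy: .* extends as far as possible, up to the last ':'
greedy = re.search(r':.*:', text)[0]
print(greedy)  # :123:bar:

# non-greedy: .*? stops at the earliest ':' that completes the match
lazy = re.search(r':.*?:', text)[0]
print(lazy)  # :123:
```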

Character classes	Description
`[aeiou]`	Match any vowel
`[^aeiou]`	`^` inverts selection, so this matches any character other than these vowels (consonants, digits, whitespace, etc)
`[a-f]`	`-` defines a range, so this matches any of abcdef characters
`\d`	Match a digit, same as `[0-9]`
`\D`	Match non-digit, same as `[^0-9]` or `[^\d]`
`\w`	Match alphanumeric and underscore character, same as `[a-zA-Z0-9_]`
`\W`	Match any character other than alphanumeric and underscore characters, same as `[^a-zA-Z0-9_]` or `[^\w]`
`\s`	Match white-space character, same as `[\ \t\n\r\f\v]`
`\S`	Match non white-space character, same as `[^\s]`
`\b`	word boundary, see `\w` for characters constituting a word
`\B`	not a word boundary

For string (Unicode) patterns, `\d`, `\w` and `\s` also match non-ASCII digits, word characters and whitespace by default; the bracket equivalents shown above hold exactly when the `re.A` flag is used

Flags	Description
`re.I`	Ignore case
`re.M`	Multiline mode, `^` and `$` anchors work on lines
`re.S`	Singleline mode, `.` will also match `\n`
`re.X`	Verbose mode, for better readability and adding comments

See Python docs - Compilation Flags for more details and long names for flags
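
A brief sketch of how these flags change matching. The sample text is made up; multiple flags can be combined with the `|` operator.

```python
import re

text = 'Hi there\nhave a nice day'

# re.I ignores case, re.M lets ^ match at the start of every line
starts = re.findall(r'^h\w+', text, flags=re.I | re.M)
print(starts)  # ['Hi', 'have']

# without re.S, '.' does not match the newline between the two lines
print(re.search(r'there.have', text))  # None
across = re.search(r'there.have', text, flags=re.S)[0]
print(repr(across))  # 'there\nhave'

# re.X ignores whitespace in the pattern and allows comments
nums = re.findall(r'''
    \b0*          # optional leading zeros
    [1-9]\d{2,}   # a non-zero digit followed by two or more digits
    \b
    ''', '0501 035 154 12 26 98234', flags=re.X)
print(nums)  # ['0501', '154', '98234']
```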

Variable	Description
`\1`, `\2`, `\3` ... `\99`	backreferencing matched patterns
`\g<1>`, `\g<2>`, `\g<3>` ...	backreferencing matched patterns, prevents ambiguity
`\g<0>`	entire matched portion

\0 and \100 onwards are considered as octal values, hence cannot be used as backreference.
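
The ambiguity shows up when a backreference has to be followed by a literal digit. A minimal sketch, assuming a pattern with a single capture group:

```python
import re

text = '52 apples and 31 mangoes'

# \g<1>0 is unambiguous: first capture group followed by a literal '0'
tenfold = re.sub(r'(\d+)', r'\g<1>0', text)
print(tenfold)  # 520 apples and 310 mangoes

# \10 is read as a reference to group 10, which this pattern doesn't have
try:
    re.sub(r'(\d+)', r'\10', text)
    error_raised = False
except re.error:
    error_raised = True
print(error_raised)  # True
```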

Pattern matching and extraction

To match/extract sequence of characters, use

re.search() to see if input string contains a pattern or not
re.findall() to get a list of all matching portions
re.finditer() to get an iterator of re.Match objects of all matching portions
re.split() to get a list from splitting input string based on a pattern

Their syntax is as follows:

re.search(pattern, string, flags=0)
re.findall(pattern, string, flags=0)
re.finditer(pattern, string, flags=0)
re.split(pattern, string, maxsplit=0, flags=0)

As a good practice, always use raw strings to construct RE, unless other formats are required
- this will avoid clash of backslash escaping between RE and normal quoted strings
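
As an illustration of the clash, `\b` means a backspace character in a normal string but a word boundary in a RE. A short sketch:

```python
import re

# In a normal string literal '\b' is a single backspace character;
# in a raw string it stays as backslash + 'b' for the re module to interpret
print(len('\b'), len(r'\b'))  # 1 2

normal = re.findall('\bpar', 'par spar')    # searches for backspace + 'par'
raw = re.findall(r'\bpar', 'par spar')      # word boundary + 'par'
escaped = re.findall('\\bpar', 'par spar')  # same RE, backslash doubled by hand

print(normal)   # []
print(raw)      # ['par']
print(escaped)  # ['par']
```
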
examples for re.search

>>> sentence = 'This is a sample string'

# using normal string methods
>>> 'is' in sentence
True
>>> 'xyz' in sentence
False

# need to load the re module before use
>>> import re
# check if 'sentence' contains the pattern described by RE argument
>>> bool(re.search(r'is', sentence))
True
>>> bool(re.search(r'this', sentence, flags=re.I))
True
>>> bool(re.search(r'xyz', sentence))
False

examples for re.findall

# match whole word par with optional s at start and e at end
>>> re.findall(r'\bs?pare?\b', 'par spar apparent spare part pare')
['par', 'spar', 'spare', 'pare']

# numbers >= 100 with optional leading zeros
>>> re.findall(r'\b0*[1-9]\d{2,}\b', '0501 035 154 12 26 98234')
['0501', '154', '98234']

# if multiple capturing groups are used, each element of output
# will be a tuple of strings of all the capture groups
>>> re.findall(r'(x*):(y*)', 'xx:yyy x: x:yy :y')
[('xx', 'yyy'), ('x', ''), ('x', 'yy'), ('', 'y')]

# normal capture group will hinder ability to get whole match
# non-capturing group to the rescue
>>> re.findall(r'\b\w*(?:st|in)\b', 'cost akin more east run against')
['cost', 'akin', 'east', 'against']

# useful for debugging purposes as well before applying substitution
>>> re.findall(r't.*?a', 'that is quite a fabricated tale')
['tha', 't is quite a', 'ted ta']

examples for re.split

# split based on one or more digit characters
>>> re.split(r'\d+', 'Sample123string42with777numbers')
['Sample', 'string', 'with', 'numbers']

# split based on digit or whitespace characters
>>> re.split(r'[\d\s]+', '**1\f2\n3star\t7 77\r**')
['**', 'star', '**']

# to include the matching delimiter strings as well in the output
>>> re.split(r'(\d+)', 'Sample123string42with777numbers')
['Sample', '123', 'string', '42', 'with', '777', 'numbers']

# use non-capturing group if capturing is not needed
>>> re.split(r'hand(?:y|ful)', '123handed42handy777handful500')
['123handed42', '777', '500']

backreferencing

# whole words that have at least one consecutive repeated character
>>> words = ['effort', 'flee', 'facade', 'oddball', 'rat', 'tool']

>>> [w for w in words if re.search(r'\b\w*(\w)\1\w*\b', w)]
['effort', 'flee', 'oddball', 'tool']

The re.search function returns a re.Match object from which various details can be extracted like the matched portion of string, location of matched portion, etc
Note that output here is shown for Python version 3.7

>>> re.search(r'b.*d', 'abc ac adc abbbc')
<re.Match object; span=(1, 9), match='bc ac ad'>
# retrieving entire matched portion
>>> re.search(r'b.*d', 'abc ac adc abbbc')[0]
'bc ac ad'

# capture group example
>>> m = re.search(r'a(.*)d(.*a)', 'abc ac adc abbbc')
# to get matched portion of second capture group
>>> m[2]
'c a'
# to get a tuple of all the capture groups
>>> m.groups()
('bc ac a', 'c a')
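
Location details are available through the span(), start() and end() methods, which optionally take a group number. The sketch below reuses the capture group example and also shows that re.search returns None when there is no match, so the result is worth checking before subscripting.

```python
import re

m = re.search(r'a(.*)d(.*a)', 'abc ac adc abbbc')

print(m.span())               # (0, 12), location of the entire match
print(m.start(1), m.end(1))   # 1 8, location of the first capture group
print(m.span(2))              # (9, 12), location of the second capture group

# no match gives None, so subscripting directly would raise TypeError
no_match = re.search(r'xyz', 'abc ac adc abbbc')
matched_text = no_match[0] if no_match else None
print(matched_text)           # None
```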

examples for re.finditer

>>> m_iter = re.finditer(r'(x*):(y*)', 'xx:yyy x: x:yy :y')
>>> [(m[1], m[2]) for m in m_iter]
[('xx', 'yyy'), ('x', ''), ('x', 'yy'), ('', 'y')]

>>> m_iter = re.finditer(r'ab+c', 'abc ac adc abbbc')
>>> for m in m_iter:
...     print(m.span())
... 
(0, 3)
(11, 16)

Search and Replace

Syntax

re.sub(pattern, repl, string, count=0, flags=0)

examples
Note that as strings are immutable, re.sub will not change the value of the variable passed to it; the result has to be explicitly assigned

>>> ip_lines = "catapults\nconcatenate\ncat"
>>> print(re.sub(r'^', r'* ', ip_lines, flags=re.M))
* catapults
* concatenate
* cat

# replace 'par' only at start of word
>>> re.sub(r'\bpar', r'X', 'par spar apparent spare part')
'X spar apparent spare Xt'

# same as: r'part|parrot|parent'
>>> re.sub(r'par(en|ro)?t', r'X', 'par part parrot parent')
'par X X X'

# remove first two columns where : is delimiter
>>> re.sub(r'\A([^:]+:){2}', r'', 'foo:123:bar:baz', count=1)
'bar:baz'

backreferencing

# remove any number of consecutive duplicate words separated by space
# quantifiers can be applied to backreferences too!
>>> re.sub(r'\b(\w+)( \1)+\b', r'\1', 'aa a a a 42 f_1 f_1 f_13.14')
'aa a 42 f_1 f_13.14'

# add something around the matched strings
>>> re.sub(r'\d+', r'(\g<0>0)', '52 apples and 31 mangoes')
'(520) apples and (310) mangoes'

# swap words that are separated by a comma
>>> re.sub(r'(\w+),(\w+)', r'\2,\1', 'a,b 42,24')
'b,a 24,42'

using functions in replace part of re.sub()
Note that Python version 3.7 is used here

>>> from math import factorial
>>> numbers = '1 2 3 4 5'
>>> def fact_num(m):
...     # m is the re.Match object passed in by re.sub for each match
...     return str(factorial(int(m[0])))
... 
>>> re.sub(r'\d+', fact_num, numbers)
'1 2 6 24 120'

# using lambda
>>> re.sub(r'\d+', lambda m: str(factorial(int(m[0]))), numbers)
'1 2 6 24 120'
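
A replacement function is also handy for dictionary lookups. A minimal sketch, where the mapping and the helper name to_digit are hypothetical:

```python
import re

# Hypothetical mapping, for illustration only
word_to_num = {'one': '1', 'two': '2'}

def to_digit(m):
    # m[0] is the matched word; words not in the mapping are kept as they are
    return word_to_num.get(m[0], m[0])

result = re.sub(r'\b\w+\b', to_digit, 'one apple and two mangoes')
print(result)  # 1 apple and 2 mangoes
```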

Compiling Regular Expressions

Regular expressions can be compiled using re.compile function, which gives back a re.Pattern object
The top level re module functions are all available as methods for this object
Compiling a regular expression helps if the RE has to be used in multiple places or called upon multiple times inside a loop (speed benefit)
By default, Python maintains a small cache of recently compiled REs, so the speed benefit is negligible for trivial use cases
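
A short sketch of a compiled pattern in use, reusing the whole-word `par` RE from the re.findall examples; the variable names are arbitrary.

```python
import re

# compile once, then call the methods without passing the pattern again
pat = re.compile(r'\bs?pare?\b')

words = pat.findall('par spar apparent spare part pare')
print(words)  # ['par', 'spar', 'spare', 'pare']

replaced = pat.sub('X', 'par spar apparent spare part')
print(replaced)  # X X apparent X part

has_match = bool(pat.search('apparent'))
print(has_match)  # False

# flags are given to re.compile instead of the individual methods
ci_pat = re.compile(r'this', flags=re.I)
found = bool(ci_pat.search('This is a sample string'))
print(found)  # True
```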