Python: character, string


immutable string

#!/usr/bin/python -u

if __name__ == "__main__":
    txt = "Hello"
    for c in txt:
        print c
    """
    H
    e
    l
    l
    o
    """

    # txt[1] = "X"
    # print txt
    # TypeError: 'str' object does not support item assignment
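Because a string cannot be changed in place, the usual workaround is to build a new string. The snippet below is a minimal sketch of two common approaches (slicing, and a round trip through a list); the variable names are only illustrative, and the code is written so that it runs on both Python 2 and Python 3.

```python
txt = "Hello"

# Slicing: copy the parts to keep and concatenate them around the new character.
replaced = txt[:1] + "X" + txt[2:]

# For several edits, convert to a list (which is mutable), change items, then join.
chars = list(txt)
chars[1] = "X"
rebuilt = "".join(chars)

print(replaced)  # HXllo
print(rebuilt)   # HXllo
print(txt)       # Hello (the original string is unchanged)
```

Both approaches allocate a new string; the original object is never modified.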

escape special characters

#!/usr/bin/python -u

if __name__ == "__main__":
    print "\\"  # \
    print "%%" % ()  # % (%% collapses to % only under %-formatting; a bare print "%%" prints %%)
    print "\""  # "
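The two kinds of escape behave differently: backslash escapes are resolved when the literal is parsed, while %% is only collapsed when the % formatting operator is applied. The following is a small sketch of that difference, with illustrative variable names, written to run on both Python 2 and Python 3.

```python
backslash = "\\"  # one character: a single backslash
quote = "\""      # one character: a double quote

percent_literal = "%%"           # no formatting applied, so still two characters
percent_formatted = "%d%%" % 50  # the % operator turns %% into a single %

raw = r"\n"       # raw literal: backslash and n are kept as two characters

print(len(backslash), len(quote))  # lengths are 1 and 1
print(percent_literal)             # %%
print(percent_formatted)           # 50%
print(len(raw))                    # 2
```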

string literals in Python

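As the code below shows, Python 2 has two kinds of string literal: a plain literal creates a byte string (str), and a u-prefixed literal creates a Unicode string (unicode). Calling encode on a byte string makes Python 2 first decode it implicitly with the ASCII codec, which is why the first two calls fail with UnicodeDecodeError rather than an encode error.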

val1 = "aaé"
print val1        # aaé
print type(val1)  # <type 'str'>
 
print val1.encode('utf-8')
"""
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)
"""
 
print val1.encode('ascii')
"""
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)
"""
 
val2 = u"aaé"
print val2                  # aaé
print type(val2)            # <type 'unicode'>
print val2.encode('utf-8')  # aaé
 
print val2.encode('ascii')
"""
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 2: ordinal not in range(128)
"""
 
 
print val2.encode('ascii', 'ignore') # aa
# é is missing
 
 
import unicodedata
print unicodedata.normalize('NFKD', val2).encode('ascii', 'ignore')
# aae
# NFKD splits é into e plus a combining accent; 'ignore' then drops the accent, leaving e
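The same encode and normalize steps can be checked value by value. The sketch below assumes the text is u"aaé", written with the \xe9 escape so the source file stays pure ASCII; the variable names are illustrative, and the code is written to run on both Python 2 and Python 3.

```python
import unicodedata

text = u"aa\xe9"  # u"aaé"

# Encoding turns text into bytes; decoding with the same codec restores it.
utf8_bytes = text.encode("utf-8")        # é becomes the two bytes 0xc3 0xa9
round_trip = utf8_bytes.decode("utf-8")

# 'replace' substitutes ? for characters the target codec cannot represent.
replaced = text.encode("ascii", "replace")

# NFKD decomposes é into e followed by U+0301 (combining acute accent).
decomposed = unicodedata.normalize("NFKD", text)
# 'ignore' then drops the combining accent, which has no ASCII equivalent.
stripped = decomposed.encode("ascii", "ignore")

print(len(text), len(utf8_bytes), len(decomposed))  # lengths are 3, 4 and 4
```

Note that the NFKD approach only helps for characters that decompose into an ASCII base letter plus combining marks; characters without such a decomposition are still dropped by 'ignore'.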

This can be tricky when an external service returns different types of string to your Python program, as described here:

Frustration #1: Inconsistent Errors

Although converting when possible seems like the right thing to do, it’s actually the first source of frustration. A programmer can test out their program with a string like: The quick brown fox jumped over the lazy dog and not encounter any issues. But when they release their software into the wild, someone enters the string: I sat down for coffee at the café and suddenly an exception is thrown. The reason? The mechanism that converts between the two types is only able to deal with ASCII characters. Once you throw non-ASCII characters into your strings, you have to start dealing with the conversion manually.

So, if I manually convert everything to either byte str or unicode strings, will I be okay? The answer is…. sometimes.

Overcoming frustration: Correctly using unicode in python2

Here’s a quick solution:

def convert_to_str(st):
    """ use this function to convert all strings to str"""
    if isinstance(st, unicode):
        return st.encode('utf-8')
    return str(st)
 
 
val1 = "ébc"
val2 = u"ébc"
 
print val1, type(val1), convert_to_str(val1), type(convert_to_str(val1))
# ébc <type 'str'> ébc <type 'str'>
 
print val2, type(val2), convert_to_str(val2), type(convert_to_str(val2))
# ébc <type 'unicode'> ébc <type 'str'>
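The opposite direction, decoding everything to Unicode text as soon as it enters the program, is a common alternative. The helper below is a hypothetical counterpart to convert_to_str, not something from the original notes: the name convert_to_unicode, the text_type alias, and the UTF-8 default are all assumptions. It is written to run on both Python 2 and Python 3.

```python
text_type = type(u"")  # unicode on Python 2, str on Python 3


def convert_to_unicode(value, encoding="utf-8"):
    """Return value as Unicode text, decoding byte strings with `encoding`."""
    if isinstance(value, text_type):
        return value                   # already text, nothing to do
    if isinstance(value, bytes):       # bytes is an alias of str on Python 2
        return value.decode(encoding)  # raises UnicodeDecodeError on bad input
    return text_type(value)            # numbers and other objects


from_bytes = convert_to_unicode(b"\xc3\xa9bc")  # UTF-8 bytes for "ébc"
from_text = convert_to_unicode(u"\xe9bc")
from_int = convert_to_unicode(42)

print(from_bytes == from_text)  # True
```

This sketch assumes incoming byte strings are UTF-8; data in another encoding has to be decoded with the matching codec, otherwise the call either fails or silently produces the wrong characters.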

But there are many other corner cases you need to consider, as explained here.