Message 129460 - Python tracker

Author	vstinner
Recipients	belopolsky, ezio.melotti, lemburg, vstinner
Date	2011-02-25.23:03:06
Message-id	<1298674986.91.0.66075330086.issue11322@psf.upfronthosting.co.za>
In-reply-to

Content

We should first implement the same algorithm in all 3 normalization functions and add tests for them (at least for the function in normalization):

 - normalize_encoding() in encodings: it doesn't convert to lowercase and keeps non-ASCII letters
 - normalize_encoding() in unicodeobject.c
 - normalizestring() in codecs.c

normalize_encoding() in encodings is more lax than the two other functions: it normalizes "  utf   8  " to 'utf_8'. But it doesn't convert to lowercase and it keeps non-ASCII letters: "UTF-8é" is normalized to "UTF_8é".

I don't know whether the normalization functions should be more or less strict, but I think that they should all give the same result.
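
To make the difference concrete, here is a minimal Python sketch. It is not the code from encodings, unicodeobject.c or codecs.c: normalize_lax() is a hypothetical reimplementation that only reproduces the two results quoted above, and normalize_unified() is one possible stricter variant (lowercase, non-ASCII characters dropped). Whether a shared algorithm should drop non-ASCII characters or reject them is an open question, so treat that choice as an assumption.

```python
def normalize_lax(name):
    """Hypothetical sketch of the lax behavior described above.

    Runs of characters that are neither alphanumeric nor '.' collapse
    into a single underscore; leading and trailing runs are dropped.
    Case and non-ASCII letters are left untouched.
    """
    chars = []
    pending_sep = False
    for c in name:
        if c.isalnum() or c == ".":
            # Emit one underscore for the whole preceding run,
            # but never at the very start of the result.
            if pending_sep and chars:
                chars.append("_")
            chars.append(c)
            pending_sep = False
        else:
            pending_sep = True
    return "".join(chars)


def normalize_unified(name):
    """Hypothetical stricter variant: ASCII only, lowercase.

    Dropping non-ASCII characters is an assumption made for this
    sketch, not a decision taken in this issue.
    """
    ascii_only = "".join(c for c in name if ord(c) < 128)
    return normalize_lax(ascii_only).lower()


# Table-driven check: the same inputs go through both functions,
# which is the kind of shared test the three real functions could use.
CASES = ["  utf   8  ", "UTF-8é"]
for raw in CASES:
    print(repr(raw), "->", repr(normalize_lax(raw)), "/", repr(normalize_unified(raw)))
```

With a shared table of inputs like CASES, any difference between the implementations shows up as a failing test instead of a surprise at lookup time.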

History
Date	User	Action	Args
2011-02-25 23:03:06	vstinner	set	recipients: + vstinner, lemburg, belopolsky, ezio.melotti
2011-02-25 23:03:06	vstinner	set	messageid: <1298674986.91.0.66075330086.issue11322@psf.upfronthosting.co.za>
2011-02-25 23:03:06	vstinner	link	issue11322 messages
2011-02-25 23:03:06	vstinner	create