WDDX cannot deserialize serialized UTF-8 encoded non-ASCII text
| Bug #37571 | WDDX cannot deserialize serialized UTF-8 encoded non-ASCII text | ||||
|---|---|---|---|---|---|
| Submitted: | 2006-05-23 22:50 UTC | Modified: | 2008-09-07 18:06 UTC | ||
| From: | jdolecek at NetBSD dot org | Assigned: | |||
| Status: | Closed | Package: | WDDX related | ||
| PHP Version: | 5.1.4 | OS: | * | ||
| Private report: | No | CVE-ID: | None | ||
[2006-05-23 22:50 UTC] jdolecek at NetBSD dot org
Description:
------------
WDDX cannot be used to encode certain UTF-8-encoded iso-8859-1 text, particularly those iso-8859-1 characters whose UTF-8 encoding contains bytes with values in the 128-160 range, which are recognized as control characters. WDDX turns control characters into a <char code="XX"/> sequence.
wddx_deserialize() expects a UTF-8 encoded string and implicitly converts the text back to iso-8859-1 before deserializing the structure. This is done _before_ the <char code="XX"/> is replaced by the character. The < is thus treated as part of the UTF-8 sequence, the two-byte sequence is recoded to a single-byte character, and the result contains invalid XML (the fragment 'char code="XX"/>'). Deserialization thus fails silently.
I.e.:
1. the iso-8859-1 character is Z (ord(Z) > 128)
2. its UTF-8 encoding is the two-byte string XY
3. WDDX serializes that as X<char code="ord(Y)"/>
4. the deserializer converts the UTF-8 input to iso-8859-1 before starting deserialization; X< is recoded to a single character B, so the result is Bchar code="ord(Y)"/>
5. the deserializer detects invalid XML, aborts the decode, and returns an empty string
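The steps above can be traced at the byte level with a small standalone sketch. This is not the code from wddx.c: serialize_bytes(), naive_utf8_to_latin1() and is_control_in_latin1() are hypothetical helpers, the control-character range is hard-coded instead of depending on the locale, and the decoder is assumed not to validate continuation bytes, as the description implies. Escaping of <, > and & is left out.

```c
#include <stdio.h>
#include <string.h>

/* Assumption: in the affected locales iscntrl() reports bytes 0x80-0x9F as
 * control characters. Hard-coded here so the sketch ignores the real locale. */
static int is_control_in_latin1(unsigned char c)
{
    return c < 0x20 || c == 0x7F || (c >= 0x80 && c <= 0x9F);
}

/* Hypothetical stand-in for the serializer's string loop: control bytes
 * become <char code="XX"/>, every other byte is copied through unchanged. */
static void serialize_bytes(const unsigned char *in, size_t len,
                            char *out, size_t cap)
{
    size_t used = 0;
    out[0] = '\0';
    for (size_t i = 0; i < len && used + 32 < cap; i++) {
        if (is_control_in_latin1(in[i])) {
            used += (size_t)snprintf(out + used, cap - used,
                                     "<char code=\"%02X\"/>", in[i]);
        } else {
            out[used++] = (char)in[i];
            out[used] = '\0';
        }
    }
}

/* Naive UTF-8 to iso-8859-1 conversion: any two-byte lead byte consumes the
 * next byte without checking that it is a valid continuation byte. */
static size_t naive_utf8_to_latin1(const char *in, unsigned char *out)
{
    size_t n = 0;
    for (size_t i = 0; in[i] != '\0'; i++) {
        unsigned char c = (unsigned char)in[i];
        if ((c & 0xE0) == 0xC0 && in[i + 1] != '\0') {
            unsigned char next = (unsigned char)in[++i];
            out[n++] = (unsigned char)(((c & 0x1F) << 6) | (next & 0x3F));
        } else {
            out[n++] = c;
        }
    }
    out[n] = '\0';
    return n;
}
```

Feeding the two bytes 0xC3 0x88 (the UTF-8 form of character 200) through serialize_bytes() and then naive_utf8_to_latin1() leaves one stray byte followed by the text char code="88"/>, which is the invalid XML fragment from step 4.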
Fix:
Only recode ASCII control characters to a <char code="XX"/> sequence:
--- wddx.c.orig 2006-05-24 00:39:34.000000000 +0200
+++ wddx.c
@@ -399,7 +399,8 @@ static void php_wddx_serialize_string(wd
break;
default:
- if (iscntrl((int)*(unsigned char *)p)) {
+ if (iscntrl((int)*(unsigned char *)p)
+ && isascii((int)*(unsigned char *)p)) {
FLUSH_BUF();
sprintf(control_buf, WDDX_CHAR, *p);
php_wddx_add_chunk(packet, control_buf);
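The condition introduced by the patch can be looked at in isolation. The following is a minimal sketch, not part of the patch: is_ascii_control() is a hypothetical helper name, and the explicit range check stands in for isascii(), which is not part of standard C.

```c
#include <ctype.h>

/* Returns non-zero only for control characters in the 7-bit ASCII range.
 * Bytes 0x80 and above are never reported, whatever iscntrl() says about
 * them in the current locale. */
static int is_ascii_control(unsigned char c)
{
    return c < 0x80 && iscntrl(c);
}
```

With this check, a byte such as 0x88 from a UTF-8 sequence is copied through unchanged, while a byte such as 0x07 is still rewritten as a char element.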
Note: this patch also makes the problem in Bug #37569 go away, but that patch is still worth applying for code clarity.
This bug is probably the same problem as Bug #35241.
Reproduce code:
---------------
On UNIX with an iso-8859-1 locale or Windows with a Windows-1250 locale:
var_dump(
wddx_deserialize(wddx_serialize_value(utf8_encode(chr(200))))
);
Expected result:
----------------
string(1) "Č"
Actual result:
--------------
string(0) ""
[2006-05-24 06:46 UTC] derick@php.net
[2006-05-25 12:28 UTC] jdolecek at NetBSD dot org
[2006-08-02 15:45 UTC] iliaa@php.net