Issue41928
Created on 2020-10-04 11:21 by ivan.sorokin.tech, last changed 2021-01-22 01:30 by andreaerdna.
| Files | | |
|---|---|---|
| File name | Uploaded | Description |
| 23.zip | ivan.sorokin.tech, 2020-10-04 11:21 | |
| Pull Requests | | |
|---|---|---|
| URL | Status | Linked |
| PR 23736 | open | andreaerdna, 2020-12-10 19:45 |
| Messages (3) | |||
|---|---|---|---|
| msg377931 - (view) | Author: Ivan Sorokin (ivan.sorokin.tech) | Date: 2020-10-04 11:21 | |
See the attached sample. The well-known unzip command-line tool lists its contents correctly:
$ unzip -l 23.zip
Archive:  23.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
    81408  2012-10-23 19:03   Β' ΦΑΣΗ ΠΕ06 ΣΧΟΛΕΙΑ ΕΑΕΠ (ΙΝΤ).xls
---------                     -------
    81408                     1 file
But ZipFile lists the same file inside this archive as
ü' öÇæå Åä06 æòÄèäêÇ äÇäÅ (êîÆ).xls
This happens because ZipFile completely ignores the Unicode Path Extra Field (0x7075) in the zip header.
See the .ZIP specification for details on this field's meaning and usage:
https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT
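For illustration, here is a minimal sketch of how the 0x7075 field could be read from an entry's raw extra data. It assumes the layout commonly documented for this field (a one-byte version equal to 1, a CRC-32 of the original filename bytes, then the UTF-8 name). The helper name `unicode_path_from_extra` is hypothetical and is not part of the zipfile module.

```python
import struct
import zlib

UNICODE_PATH_ID = 0x7075  # header ID of the Unicode Path Extra Field


def unicode_path_from_extra(extra: bytes, raw_name: bytes):
    """Return the UTF-8 name stored in a 0x7075 extra field, or None.

    `extra` is the raw extra-field blob of a zip entry and `raw_name`
    is the undecoded filename from the same header (assumed inputs).
    """
    pos = 0
    while pos + 4 <= len(extra):
        # Each extra record starts with a 2-byte ID and a 2-byte data size.
        header_id, size = struct.unpack_from("<HH", extra, pos)
        data = extra[pos + 4 : pos + 4 + size]
        pos += 4 + size
        # Skip other record types and truncated or too-short records.
        if header_id != UNICODE_PATH_ID or len(data) != size or size < 5:
            continue
        version, name_crc = struct.unpack_from("<BL", data, 0)
        # The stored CRC must match the header filename; otherwise the
        # field is stale and should be ignored.
        if version != 1 or name_crc != zlib.crc32(raw_name):
            continue
        try:
            return data[5:].decode("utf-8")
        except UnicodeDecodeError:
            return None
    return None
```

With zipfile, `ZipInfo.extra` would supply `extra`. When the UTF-8 flag is not set and no custom metadata encoding is in use, `info.orig_filename.encode("cp437")` should recover the original filename bytes, although that is an assumption about how the name was decoded.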
|
|||
| msg377945 - (view) | Author: Ivan Sorokin (ivan.sorokin.tech) | Date: 2020-10-04 15:24 | |
Grand unified algorithm to read filenames from zip files correctly:
1. Does the zip entry have a «Unicode Path Extra Field» (0x7075)? Use it for the file name.
2. Is the Unicode flag (0x800) set in the «Flags» field of the zip entry? Assume the «Filename» field is in UTF-8.
3. Does the «HostOS» field of the zip entry have a value of 0 (FAT) or 11 (NTFS)? Assume the «Filename» field is in the OEM charset corresponding to the system locale.
4. Otherwise, assume the «Filename» field is in UTF-8.
p7zip with the oemcp patch (https://github.com/unxed/oemcp/) uses exactly this method and is able to process all zip files in my test set correctly (my test set contains several zips generated by different packers on Windows, macOS, and Linux, and by online services). The same algorithm should be used in any zip unpacker that wants to process non-Latin filenames as gracefully as possible.
|
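The four steps can be sketched in Python roughly as follows. This is an illustration under stated assumptions and not the p7zip or zipfile implementation. The function name `decode_zip_name` is hypothetical, the caller is assumed to have already extracted the Unicode Path value (or `None`), and the OEM code page is passed in explicitly with a default of `cp437`, because looking up the system locale's OEM charset is platform specific.

```python
from typing import Optional

UTF8_FLAG = 0x800            # general-purpose bit 11: name is UTF-8
OEM_HOSTS = (0, 11)          # HostOS values the message lists: FAT, NTFS


def decode_zip_name(raw_name: bytes, flag_bits: int, host_os: int,
                    unicode_path: Optional[str] = None,
                    oem_encoding: str = "cp437") -> str:
    """Pick a filename for a zip entry following the four steps above."""
    # 1. A valid Unicode Path Extra Field (0x7075) wins outright.
    if unicode_path is not None:
        return unicode_path
    # 2. The UTF-8 flag says the header name is already UTF-8.
    if flag_bits & UTF8_FLAG:
        return raw_name.decode("utf-8", errors="replace")
    # 3. Archives made on FAT/NTFS hosts: assume the OEM code page.
    if host_os in OEM_HOSTS:
        return raw_name.decode(oem_encoding, errors="replace")
    # 4. Everything else: assume UTF-8 (undecodable bytes are replaced,
    #    an assumption of this sketch rather than part of the algorithm).
    return raw_name.decode("utf-8", errors="replace")
```

With zipfile, `flag_bits` and `host_os` would correspond to `ZipInfo.flag_bits` and `ZipInfo.create_system`. The garbled name in the original report is consistent with this picture: encoding the Greek name as cp737 and decoding it as cp437 reproduces it, so passing `oem_encoding="cp737"` in step 3 would yield the correct text for such an archive.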
|||
| msg385467 - (view) | Author: Andrea Giudiceandrea (andreaerdna) * | Date: 2021-01-22 01:30 | |
More than a month ago I submitted a PR that adds support for the Unicode Path Extra Field in ZipFile. The PR https://github.com/python/cpython/pull/23736 is awaiting review before it can be merged. |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2021-01-22 01:30:26 | andreaerdna | set | messages: + msg385467 |
| 2020-12-10 19:45:19 | andreaerdna | set | keywords: + patch; stage: patch review; pull_requests: + pull_request22595 |
| 2020-12-09 17:19:56 | andreaerdna | set | nosy: + andreaerdna |
| 2020-10-04 15:24:54 | ivan.sorokin.tech | set | messages: + msg377945 |
| 2020-10-04 11:21:41 | ivan.sorokin.tech | create | |