[Python-Dev] Issue 11406: adding os.scandir(), a directory iterator returning stat-like info
MRAB
python at mrabarnett.plus.com
Fri May 10 16:30:54 CEST 2013
On 10/05/2013 11:55, Ben Hoyt wrote:
> A few of us were having a discussion at
> http://bugs.python.org/issue11406 about adding os.scandir(): a
> generator version of os.listdir() to make iterating over very large
> directories more memory efficient. This also reflects how the OS gives
> things to you -- it doesn't give you a big list, but you call a
> function to iterate and fetch the next entry.
>
> While I think that's a good idea, I'm not sure just that much is
> enough of an improvement to make adding the generator version worth
> it.
>
> But what would make this a killer feature is making os.scandir()
> generate tuples of (name, stat_like_info). The Windows directory
> iteration functions (FindFirstFile/FindNextFile) give you the full
> stat information for free, and the Linux and OS X functions
> (opendir/readdir) give you partial file information (d_type in the
> dirent struct, which is basically the st_mode part of a stat, whether
> it's a file, directory, link, etc).
>
> Having this available at the Python level would mean we can vastly
> speed up functions like os.walk() that otherwise need to make an
> os.stat() call for every file returned. In my benchmarks of such a
> generator on Windows, it speeds up os.walk() by 9-10x. On Linux/OS X,
> it's more like 1.5-3x. In my opinion, that kind of gain is huge,
> especially on Windows, but also on Linux/OS X.
>
> So the idea is to add this relatively low-level function that exposes
> the extra information the OS gives us for free, but which os.listdir()
> currently throws away. Then higher-level, platform-independent
> functions like os.walk() could use os.scandir() to get much better
> performance. People over at Issue 11406 think this is a good idea.
>
> HOWEVER, there's debate over what kind of object the second element in
> the tuple, "stat_like_info", should be. My strong vote is for it to be
> a stat_result-like object, but where the fields are None if they're
> unknown. There would be basically three scenarios:
>
> 1) stat_result with all fields set: this would happen on Windows,
> where you get as much info from FindFirst/FindNext as from an
> os.stat()
> 2) stat_result with just st_mode set, and all other fields None: this
> would be the usual case on Linux/OS X
> 3) stat_result with all fields None: this would happen on systems
> whose readdir()/dirent doesn't have d_type, or on Linux/OS X when
> d_type was DT_UNKNOWN
>
> Higher-level functions like os.walk() would then check the fields they
> needed are not None, and only call os.stat() if needed, for example:
>
> # Build lists of files and directories in path
> files = []
> dirs = []
> for name, st in os.scandir(path):
>     if st.st_mode is None:
>         st = os.stat(os.path.join(path, name))
>     if stat.S_ISDIR(st.st_mode):
>         dirs.append(name)
>     else:
>         files.append(name)
>
> Not bad for a 2-10x performance boost, right? What do folks think?
>
> Cheers,
> Ben.
>
[snip]

In the python-ideas list there's a thread "PEP: Extended stat_result"
about adding methods to stat_result.

Using that, you wouldn't necessarily have to look at st.st_mode. The
method could perform an additional os.stat() if the field was None. For
example:

# Build lists of files and directories in path
files = []
dirs = []
for name, st in os.scandir(path):
    if st.is_dir():
        dirs.append(name)
    else:
        files.append(name)

That looks much nicer.
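
To make the idea concrete, here is a rough, runnable sketch of how such a method could hide the fallback. The names LazyStat, scandir_sketch and split_entries are hypothetical and only for illustration; they are not part of any proposal in this thread. The stand-in iterator is built on os.listdir(), so it always starts with st_mode unknown (scenario 3 above) and therefore shows only the behaviour, not the speed-up.

```python
import os
import stat


class LazyStat:
    """Hypothetical stat-like object: st_mode is None until it is known."""

    def __init__(self, path, st_mode=None):
        self._path = path
        self.st_mode = st_mode  # None means the OS did not supply it

    def is_dir(self):
        # Only fall back to a real os.stat() call when st_mode is unknown,
        # then cache the result so later calls are free.
        if self.st_mode is None:
            self.st_mode = os.stat(self._path).st_mode
        return stat.S_ISDIR(self.st_mode)


def scandir_sketch(path):
    """Stand-in for the proposed iterator: yields (name, LazyStat) tuples.

    Assumption: built on os.listdir(), so no mode information is
    available up front and every st_mode starts out as None.
    """
    for name in os.listdir(path):
        yield name, LazyStat(os.path.join(path, name))


def split_entries(path):
    """Build lists of files and directories in path."""
    files = []
    dirs = []
    for name, st in scandir_sketch(path):
        if st.is_dir():
            dirs.append(name)
        else:
            files.append(name)
    return files, dirs
```

With a real implementation, the iterator would pass in st_mode whenever the OS provides it, and is_dir() would then return without touching the file system. Note that this sketch uses os.stat(), which follows symlinks and raises an error on a broken link; a real design would have to decide how to handle both.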