[Python-ideas] BetterWalk, a better and faster os.walk() for Python
John Mulligan
phlogistonjohn at asynchrono.us
Fri Nov 23 17:08:03 CET 2012
Hi,

I've been idly watching python-ideas and this thread piqued my interest, so I'm unlurking for the first time.

I'm really happy that someone is looking into this. I've done some similar work for my day job and have some thoughts about the APIs and approach. I come at this from a C & Linux POV and wrote some wrappers similar to your iterdir_stat.

What I do similarly is to provide a "flags" field like your "fields" argument (in my case a bitmask) that controls what extra information is returned/yielded from the call. What I do differently is that I (a) always return the d_type value instead of falling back to stat'ing the item, and (b) do not provide the pattern argument.

I like returning the d_type directly because in the Unix-style APIs the dirent structure doesn't provide the same information as the stat result, and I don't want to trick myself into thinking I have all the information available from the readdir call. I also like to have my Python functions map pretty closely to the C calls, so I know that my Python is only issuing the same syscalls that the equivalent C code would.

In addition, I like the control over error handling that calling stat as a separate call gives me. For example, there is a potential race condition between calling readdir and calling stat, such as the object being removed between the two calls. I can be very granular (for lack of a better word) about my error handling in these cases.

Because I come at this from a Linux platform, I am also not so keen on the built-in pattern matching that comes for "free" from the Windows FindFirst/FindNext API. It just feels like this should be provided at a higher layer. But I can't say for sure, because I don't know how much of a performance boost this is on Windows.

I have a confession to make: I don't often use an os.walk equivalent when I use my library. I often call the listdir equivalents directly. So I've never benchmarked any os.walk equivalent, even though I wrote one for fun!
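To illustrate the error-handling point above, here is a minimal sketch of keeping the directory read and the stat as two separate steps, so that an entry vanishing in between can be handled explicitly. It uses only the standard os module; the function name list_then_stat is made up for this example and is not part of fsnix or BetterWalk.

```python
import os

def list_then_stat(path):
    """Yield (name, stat_result_or_None, error_or_None) for each entry.

    Hypothetical helper: the listing and the stat are separate calls, so
    the caller sees exactly which entries failed and why.
    """
    for name in os.listdir(path):              # step 1: read the directory
        full = os.path.join(path, name)
        try:
            st = os.lstat(full)                # step 2: separate stat call
        except (FileNotFoundError, PermissionError) as exc:
            # FileNotFoundError: the entry was removed between the two steps.
            # PermissionError: the entry could not be examined.
            yield name, None, exc
        else:
            yield name, st, None
```

A caller can then decide per entry whether a missing file is worth logging, retrying, or silently skipping, rather than having that choice made inside the listing call.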
In addition, I have a fditerdir call that supports a directory file descriptor as the first argument. This is handy because I also have a wrapper for fstatat (this was all created for Python 2, before 3.3 was released).

I really like how your library is better in that you can get more fields from the direntry. I only support the d_type field at this time and have been meaning to extend the API. I can only yield tuples at the moment, but a namedtuple style would be much nicer.

IMO, I think the ideal value would be some sort of abstract direntry structure that could be filled in with the values that readdir or FindFirst provide, and then possibly a higher-level function that combines iterdir + stat if you get DT_UNKNOWN. In other words, provide an easy call like iterdir_stat that builds on an iterdir that gets the detailed dentry data.

PS. If anyone is curious, my library is available here: https://bitbucket.org/nasuni/fsnix

Thanks!
-- John M.

On Friday, November 23, 2012 12:39:42 AM Ben Hoyt wrote:
> In the recent thread I started called "Speed up os.walk()..." [1] I
> was encouraged to create a module to flesh out the idea, so I present
> you with BetterWalk:
>
> https://github.com/benhoyt/betterwalk#readme
>
> It's basically all there, and works on Windows, Linux, and Mac OS X.
> It probably works on FreeBSD too, but I haven't tested that. I also
> haven't written thorough unit tests yet, but intend to after some
> further feedback.
>
> In terms of the API for iterdir_stat(), I settled on the more explicit
> "pass in what stat fields you want" (the 'fields' parameter). I also
> added a 'pattern' parameter to allow you to make use of the wildcard
> matching that FindFirst/FindNext provide (it's useful for globbing on
> POSIX too, but not a performance improvement).
>
> As for benchmarks, it's about what I saw earlier on Windows (2-6x on
> recent versions, depending). My initial tests on Mac OS X show it's
> 5-10x as fast on that platform!
> I haven't double-checked those results yet though.
>
> The results on Linux were somewhat disappointing -- only a 10% speed
> improvement on large directories, and it's actually slower on small
> directories. It's still doing half the number of system calls ... so I
> believe this is because cached os.stat() is super fast on Linux, and
> so the slowdown from using ctypes / pure Python is outweighing the
> gain from not doing the system call. That said, I've also only tested
> Linux in a VirtualBox setup, so maybe that's affecting it too.
>
> Still, if it's a significant win for Windows and OS X users, it's a
> good thing.
>
> In any case, I'd love it if folks could run the benchmark on their
> system (with and without -s) and comment further on the idea and API.
>
> Thanks,
> Ben.
>
> [1] http://mail.python.org/pipermail/python-ideas/2012-November/017770.html
> _______________________________________________
> Python-ideas mailing list
> Python-ideas at python.org
> http://mail.python.org/mailman/listinfo/python-ideas
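
The layering suggested in the reply (a low-level iterdir that reports only what the directory listing provides, plus an iterdir_stat helper that falls back to stat on DT_UNKNOWN) could be sketched roughly as follows. This is an illustration under assumptions, not the fsnix or BetterWalk implementation: the DirEntry namedtuple and both function bodies are hypothetical, os.scandir stands in for a real readdir wrapper, and the DT_* numbers are the Linux/BSD values from <dirent.h>.

```python
import os
import stat
from collections import namedtuple

# Linux/BSD dirent type values; the names mirror <dirent.h>.
DT_UNKNOWN, DT_DIR, DT_REG, DT_LNK = 0, 4, 8, 10

# Hypothetical abstract entry: fields the listing cannot fill stay None.
DirEntry = namedtuple("DirEntry", "name d_type st")

def iterdir(path):
    """Low level: report only the name and a coarse type for each entry."""
    with os.scandir(path) as it:
        for e in it:
            if e.is_symlink():
                d_type = DT_LNK
            elif e.is_dir(follow_symlinks=False):
                d_type = DT_DIR
            elif e.is_file(follow_symlinks=False):
                d_type = DT_REG
            else:
                # Stand-in for this sketch: anything else counts as unknown.
                d_type = DT_UNKNOWN
            yield DirEntry(e.name, d_type, None)

def iterdir_stat(path):
    """Higher level: call lstat only for entries whose type is unknown."""
    for entry in iterdir(path):
        if entry.d_type == DT_UNKNOWN:
            st = os.lstat(os.path.join(path, entry.name))
            # Same arithmetic as the C IFTODT() macro: file-type bits >> 12.
            entry = entry._replace(d_type=stat.S_IFMT(st.st_mode) >> 12, st=st)
        yield entry
```

Keeping iterdir free of any stat fallback means callers who only need names and types pay for nothing extra, while iterdir_stat remains a thin convenience built on top of it.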