DT_GNU_HASH: ~ 50% dynamic linking improvement
Michael Meeks
michael.meeks@novell.com
Thu Jun 29 19:39:00 GMT 2006
More information about the Binutils mailing list
Thu Jun 29 19:39:00 GMT 2006
- Previous message (by thread): [PATCH] DT_GNU_HASH: ~ 50% dynamic linking improvement
- Next message (by thread): DT_GNU_HASH: ~ 50% dynamic linking improvement
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Hi Jakub, On Wed, 2006-06-28 at 19:09 +0200, Jakub Jelinek wrote: > The following patches introduce an optional ELF hash section Dudie; you rock ! - thanks for taking this on board; this is really nice :-) and here was I thinking I was going to have to write a length paper to try to convince Ulrich :-) You made my month (really). > The initial design was done by Ulrich Drepper and was discussed with > Michael Meeks a few months ago as well. But nothing came off of these > discussions and the proposed approach is quite different. This looks rather like -zdynsort + -zhash-vals prototype; but I'm too pleased to quibble :-) Anyhow - I've been thinking about various other optimisations, many of which we can easily & neatly fit into this nice framework, I've numbered them to help chewing them over: 1. Extensibility + I want to see a multiply per lookup. + ie. since the L2 cache misses are what hammer us - if [sic] we want to extend the .chain -Bdirect support in future - being able to put that in the same record is really important. + a 'record size' field gives us that with ~no bloat as of now. + [ and I really think some form of direct linking is the future - but we can discuss that later clearly ] + ie. having an extensibility story that is better than 'just add a new section [ somewhere miles away in memory/not-in-cache ] is not so fun. 2. chain walking: + the 'not having a .chain <-> .dynsym offset equivalence is an interesting innovation over my approach; having said that I'm not sure why you did that. Most chains are rather short - and adding a set of bytes that are almost always length 1/2/3 seems strange; may I suggest a better approach ? + Since we can sort (almost all) of .dynsym - there is no problem ensuring that ~all symbols that we don't want in .hash occur at the end of the .dynsym table [ cf. the next UNDEF point ]. + Furthermore - by inspection it can be seen that 1/2^32 ~=== 1/2^30 [1] + using this theorem, we can easily steal 1/2 bits from the top of the stored hash and use them as chain terminators. + ie. lets have: .hash[symhash % nbuckets] -> .chain[5] .chain[5] -> [ (not-end) | hashval[5] & 0x3fff ] .chain[6] -> [ (not-end) | hashval[6] & 0x3fff ] .chain[7] -> [ (is-end) | hashval[7] & 0x3fff ] + that saves us several bytes per chain, and cf. above, no measurable loss in performance. [ of course chain[5] <-> .dynsym[5] here as before ] + in fact, for extensibility a per-shlib hash comparison 'mask' field would make for a rather better behaviour - so we can snarf bits from here in future if we come up with brighter ideas. 3. What we hash: + it is incredible to me that UNDEF symbols are included in the symbol hash; I see no legitimate use for that [ eg. 'free' is an expensive hit in the hash for virtually every app ;-] + with 2 bits, a 'chain terminator' and a 'defined terminator' if we don't bin the undef's from the hash completely we can at least sort them to the end of the chain and often avoid comparing them - [ currently they would be filtered out only in check_match after a .dynsym check for "if ((sym->st_value == 0 /* No value. */... || (type_class & (sym->st_shndx == SHN_UNDEF)))" + This may seem an insignificant win: most 'normal' libs I've seen only import 1/3rd of their symbols + OTOH - C++ 'plugins' [ which we can't load RTLD_LOCAL because of vague symbols tend to have map files / visibility markup to reduce their exposed symbol count substantially, eg. + as an example 'writer' lives in 'libsw' using objdump -T libsw680li.so | grep UND | wc -l: Undefined: 6465 Defined: 3580 + ie. for every 'defined' symbol that someone might legitimately want to look up we have ~2 other symbols that they just -cannot- want to compare. + ie. in some cases by dropping UNDEF symbols, we can have our new / larger .chain section and still save space [ on disk & in cache ]. So - some comments on your patch. I notice you don't sort .dynstr as I did ;-) that simplifies life clearly; of course - I have devious reasons for wanting that. Wrt. -Bdirect if we sort our relocations right and use -Bdirect, we can turn linking into a linear walk of both the library list, .hash, .chain, .dynsym, and .dynstr [ if we play our cards right ;-]; but that's for another day [ I'd just love to make linking completely disappear from the L2 cache miss profile (when using C) - I think it can be done ;-]. Reading the patch - I really like it :-) my calculation logic to lookup the hash stuff in a separate section was *really* ugly and evil, and would have required some re-factoring to make it at all pleasant, your chain walking code is sweet and pleasant to the eye etc. :-) Anyhow - we can build some other rather nice optimisations on top of this I think; and if we can do some of the above we can save space as well for a pure gnu-hash system I think. My only concern really is with extensibility. Thanks so much for doing this - of course, as cache sizes increase this may appear to be less useful[2]; but the faster we can make the linker, the more we can split up OO.o into more libs, lazy load more, componentize more and (hopefully) save size/time on startup & at runtime. Regards, Michael. [1] - so your paragraph on the hash: "530000 unique dynamic symbols ... Dan Bernstein's hash ... fewest collisions (only 29) ... The standard SysV ELF hash function gave bad results ... 1726 collisions" Is interesting - and I'm sure the hash you propose is great - but of course, with the hash comparison the reduction in the number of L2 cache misses is so great that the 'quality' of the hash at this level is rather uninteresting wrt. the optimisation [ IHMO etc. ]. ie. 29/530000 ~= 1726/530000, well nearly anyway ;-) focusing on a 0.005% hit vs. 0.3% is interesting but not worth fighting over. More interestingly can we pick a few bits to steal and only degrade that by a factor of 4 or so ? ie. 0.005% vs. 0.02% ;-) [2] - the Core Duo here with it's 2Mb cache is amazingly faster for OO.o linking, but clearly at smaller sizes there is this appalling cliff to drop-off wrt. linking performance. -- michael.meeks@novell.com <><, Pseudo Engineer, itinerant idiot
- Previous message (by thread): [PATCH] DT_GNU_HASH: ~ 50% dynamic linking improvement
- Next message (by thread): DT_GNU_HASH: ~ 50% dynamic linking improvement
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the Binutils mailing list