Questions regarding address relaxation on IA-64
Jim Wilson
wilson@specifix.com
Thu Mar 22 20:31:00 GMT 2007
More information about the Binutils mailing list
On Thu, 2007-03-22 at 20:21 +0300, Alexander Monakov wrote:
> It seems that on IA64 addresses of global variables are loaded with two
> instructions: "addl rXX = r1, <offset>" and "ld8 rXX = [rXX]", with the
> latter being later changed to "nop" by the linker. This causes the
> following questions:

These are the R_IA_64_LTOFF22X and R_IA_64_LDXMOV relocations, used for link-time rewriting of the code to optimize global variable references.

You may want to consult the Itanium Processor-specific Application Binary Interface (psABI) and the Itanium Software Conventions and Runtime Architecture Guide (SCRA). Both are available from the developer.intel.com web site, among other places. The first one covers relocations, and the second one covers coding conventions. In this case, this is the code sequence emitted to access external data, e.g. a global variable.

> * Is the purpose of the "ld8" instruction to load the correct offset if
> it does not fit into "addl" immediate operand?

In PIC code, a global variable is accessed indirectly via the GOT. So the addl instruction computes the address of the GOT entry that holds the address of the global variable, and then the ld8 loads that address into a register.

As an optimization, at link time, if we detect that the global variable address is within range of the GP register, then we can compute the address directly with the addl instruction, and the ld8 is no longer needed. This results in faster code by eliminating a load. Because deleting an instruction is hard, we replace the load with a nop. This optimization could potentially be performed on any target, but as far as I know only the IA-64 target does this.

> * Is it possible to use "movl rXX = <offset>" (move long immediate, in
> MLX bundle) + "addl rXX = r1, rXX" for the same purpose?

That would no longer be position independent (PIC), and hence would violate the Itanium ABI, which requires all code to be PIC.
It would also not function in a shared library, which can only work when code is PIC.

> * Is it possible to tell compiler and linker that offsets will be small
> enough so that only "addl rXX = r1, <offset>" will be needed (and if it is
> not possible, why)?

The offset between the GP value and the global variable address won't be known until link time, as we don't know which object file will define the global variable, and we don't know whether it will be linked in early or late on the command line. The position of the defining object file on the link command line may affect whether it is close enough to the GP value. We also don't know whether it will even be linked in; it might be in a shared library, for instance. We also won't know the size of the GOT and other sections put in the same segment as the GOT until link time. There are probably also other factors I can't think of immediately. Since we have to decide at compile time whether to emit the addl/ld8, we have no choice but to emit both insns, and let the linker optimize.

Note that if a variable is defined static in the same module that we are compiling, then we can know a little about where the variable will end up. We can and do emit a different code sequence in this case. See the discussion of "own" data in the SCRA.

> I have noticed that with -mno-pic GCC generates "movl rXX = <address>"
> (MLX bundle). This causes a couple of questions, too:

-mno-pic violates the Itanium ABI, and can't be used for application code. This exists only for use by the kernel and some low-level drivers. Or maybe it is EFI (the BIOS) that uses it. I don't remember exactly.

> * Is it possible to use "mov rXX = <offset-or-address>" (short immediate
> form) + "ld8 rXX = [rXX]", with ld8 being changed to "nop" by linker if
> necessary?

Introducing a load instruction will make code slower, which is undesirable in general.
Beyond that, there is the problem that this works only if you can put the global variable address someplace convenient to load it from. If you have non-PIC code, then you don't have a GP reg or GOT, which means there isn't anyplace convenient to store the address.

> * Why is mov+ld8 preferred in PIC code, and movl - in non-PIC code?

It is addl+ld8 in PIC code, not mov+ld8. This is a fundamental property of PIC code. PIC means position-independent code. PIC code can be loaded anyplace in memory without requiring additional relocations (an oversimplification, but details aren't important now). PIC code works by having one special value, the GP reg, which gets initialized at load time. The GP points at the GOT (global offset table), and the GOT contains the address of every global variable used in the code. We can now access any global variable without any code changes, no matter where the code is loaded in memory, by using the addl+ld8 sequence.

In non-PIC code, we don't need the GOT. We just load the address of the variable directly via a movl instruction.

Please see the psABI and SCRA.

> On Itanium2, 8-byte loads can issue from memory ports 0 and 1 only, so our
> scheduler places stop bits after each pair of ld8s to avoid stalls due to
> resource oversubscription.

Does emitting these extra stop bits gain us anything? Either way, the hardware is going to stall, whether it figures that out on its own, or whether we tell it to stall. If there is no penalty for letting the hardware stall on its own, then maybe we should.

> What can you suggest to solve this problem? Maybe linker should be taught
> to delete stop bit following a bundle, if it relaxed the bundle so that it
> consists of nops only, and there is a stop bit preceding this bundle?

Sounds reasonable. We still have the resource oversubscription problem, as there are only so many nops we can execute before the hardware stalls, so we have to be careful about how many stop bits we delete.
The linker currently doesn't know anything about resource constraints or templates, as it doesn't have to, so it would be complicated to do anything clever there. But this matters only if we want to avoid resource stalls. If we don't care about resource stalls due to too many nops, we could just delete the now unnecessary stop bits and not worry about it.

If we want to get more involved, we could try to delete entire bundles that end up as nops. The reason the IA-64 linker relaxation doesn't try to delete instructions is that we have to worry about keeping the bundles correct. It is too much of a hassle to try to reorganize bundles at link time. But if linker relaxation gives us an entire bundle of nops, then there would be no problem with deleting the entire bundle, and that would avoid resource stalls from issuing useless nops. It would take a bit of work to write the code, though. The IA-64 linker relaxation stuff is already a bit complicated. See elfNN_ia64_relax_ldxmov in binutils src/bfd/elfNN-ia64.c.

--
Jim Wilson, GNU Tools Support, http://www.specifix.com
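
As a rough illustration of the link-time decision described in this message (turn addl+ld8 into addl+nop when the variable is within addl range of GP), here is a minimal sketch in C. It is not the binutils code; the function and type names are hypothetical, and the only hardware fact assumed is that addl takes a signed 22-bit immediate, so it can reach from GP - 2^21 to GP + 2^21 - 1. A symbol that is not defined in this link (for example, one that lives in a shared library) is assumed to keep its GOT load.

```c
#include <stdbool.h>
#include <stdint.h>

/* addl has a signed 22-bit immediate: offsets -2^21 .. 2^21 - 1 from GP. */
#define ADDL_MAX_FORWARD  ((UINT64_C(1) << 21) - 1)
#define ADDL_MAX_BACKWARD (UINT64_C(1) << 21)

/* Hypothetical names, for illustration only. */
typedef enum {
    KEEP_GOT_LOAD,   /* addl computes the GOT entry address, ld8 loads from it */
    RELAX_TO_DIRECT  /* addl computes the variable address, ld8 becomes a nop  */
} relax_decision;

/* True if (sym_addr - gp) fits in the addl immediate. */
static bool gp_offset_fits_imm22(uint64_t gp, uint64_t sym_addr)
{
    if (sym_addr >= gp)
        return sym_addr - gp <= ADDL_MAX_FORWARD;
    return gp - sym_addr <= ADDL_MAX_BACKWARD;
}

/* Decide, once final addresses are known, whether the two-instruction
   sequence emitted by the compiler can be relaxed. */
static relax_decision decide_relaxation(uint64_t gp, uint64_t sym_addr,
                                        bool defined_in_this_link)
{
    if (!defined_in_this_link)
        return KEEP_GOT_LOAD;  /* address unknown until load time */
    return gp_offset_fits_imm22(gp, sym_addr) ? RELAX_TO_DIRECT
                                              : KEEP_GOT_LOAD;
}
```

For example, under these assumptions decide_relaxation(gp, gp + 8, true) selects the direct form, while a variable 2^21 bytes or more above GP keeps the GOT load. The compiler cannot make this choice itself because neither gp nor sym_addr is known before link time.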