[PATCH v9 04/19] gas: implement parsing of object attributes v2

Jan Beulich jbeulich@suse.com
Thu Nov 6 17:05:42 GMT 2025
On 04.11.2025 12:52, Matthieu Longo wrote:
> On 2025-10-24 15:19, Jan Beulich wrote:
>> On 01.09.2025 18:56, Matthieu Longo wrote:
>>> --- a/bfd/elf-attrs.c
>>> +++ b/bfd/elf-attrs.c
>>> @@ -255,8 +255,147 @@ bfd_elf_set_obj_attr_contents (bfd *abfd, bfd_byte *buffer, bfd_vma size)
>>>     write_obj_attr_section_v1 (abfd, buffer, size);
>>>   }
>>>   
>>> +/* The first two tags in gnu-testing namespace are known, and so have a name and
>>> +   can be initialized to the default value ('0' or NULL) depending on the
>>> +   encoding specified on the subsection.  Any tags above 1 will be considered
>>> +   unknown, so will be default initialized in the same way but its status will
>>> +   be set to obj_attr_subsection_v2_unknown.  */
>>> +static const obj_attr_info_t known_attrs_gnu_testing[] =
>>> +{
>>> +  {
>>> +    .tag = {"GNUTestTag_0", 0},
>>> +    .default_value.uint_val = 0,
>>> +  },
>>> +  {
>>> +    .tag = {"GNUTestTag_1", 1},
>>> +    .default_value.uint_val = 0,
>>> +  },
>>> +};
>>
>> While discussing v8 you confirmed that these aren't part of the spec. Afaict the
>> comment didn't change, though, and hence I'm still wondering where "known" comes
>> from.
> 
> I mean "known" from the perspective of GNU binutils. The "known" here 
> does not relate to any spec. Depending on whether a tag is known or 
> unknown, ld's processing will vary between merging, dropping, and errorring.
> 
> Also in the context of the GNU testing tags, the tags above 1 are 
> considered "unknown", which means that ld can drop them if they are 
> inside an optional subsection, whereas ld will raise an error in a 
> required subsection.
> 
> I am open to change the name if you don't like it.

The name is fine; it's the comment which doesn't make clear what "known" really
means here.

>>> +/* List of known GNU subsections.
>>> +   Note: this list needs to be sorted.  */
>>> +static const known_subsection_v2 obj_attr_v2_known_gnu_subsections[] =
>>> +{
>>> +  {
>>> +    /* Note: the currently set values for the subsection name, its optionality,
>>> +       and encoding are irrelevant for a testing subsection.  These values are
>>> +       unused.  This entry is only a placeholder for list of known GNU testing
>>> +       tags.  */
>>> +    .subsec_name = NULL,
>>
>> When using dedicated initializers, NULL and 0 (and false) initializers can be
>> omitted, and imo in most cases should be.
> 
> And so how does it work ? If you omit it, is it initialized to 0 or 
> uninitialized ?
> 
> The fact that I am wondering it means that it creates ambiguity. Then I 
> need to figure it out by searching what the C specs say.

Well, that's the thing: The C spec defines the behavior. There's no
ambiguity.

Quote:

"...; all subobjects that are not initialized explicitly shall be initialized
 implicitly the same as objects that have static storage duration."

> Expressiveness is better than implicitness when the resulting behavior 
> is not obvious.

Otoh extra code means extra time to read and understand it. Every little
bit counts.

>>> +/* Search for a subsection matching NAME in the list of subsections known from
>>> +   bfd (generic or backend-specific).  Return the subsection information if it
>>> +   is found, or NULL otherwise.  */
>>> +const known_subsection_v2 *
>>> +obj_attr_v2_identify_subsection (const struct elf_backend_data *bed,
>>> +				 const char *name)
>>> +{
>>> +  /* Check known backend subsections.  */
>>> +  const known_subsection_v2 *known_subsections
>>> +    = bed->obj_attr_v2_known_subsections;
>>> +  const size_t known_subsections_size = bed->obj_attr_v2_known_subsections_size;
>>> +
>>> +  for (unsigned i = 0; i < known_subsections_size; ++i)
>>> +    {
>>> +      int cmp = strcmp (known_subsections[i].subsec_name, name);
>>> +      if (cmp == 0)
>>> +	return &known_subsections[i];
>>> +      else if (cmp > 0)
>>
>> Another such "else" (and more further down).
> 
> Given the discussion in patch 3/19, do you still request to change it ?

Well, I would much prefer their removal, but I'm not going to insist.

>>> --- a/gas/config/obj-elf-attr.c
>>> +++ b/gas/config/obj-elf-attr.c
>>> @@ -22,8 +22,70 @@
>>>   
>>>   #ifdef TC_OBJ_ATTR
>>>   
>>> +#include "obstack.h"
>>>   #include "safe-ctype.h"
>>>   
>>> +/* A variant type to store information about known OAv2 identifiers.  */
>>> +typedef union {
>>> +  uint8_t u8;
>>> +  bool b;
>>> +} oav2_identifier_variant_value;
>>> +
>>> +typedef enum {
>>> +  OAv2_ASM_ID_VALUE_UNDEFINED = 0,
>>> +  OAv2_ASM_ID_VALUE_U8,
>>> +  OAv2_ASM_ID_VALUE_BOOL,
>>> +} oav2_identifier_variant_type_info;
>>> +
>>> +typedef struct {
>>> +  oav2_identifier_variant_value val;
>>> +  oav2_identifier_variant_type_info vtype;
>>> +} oav2_identifier_variant_t;
>>> +
>>> +typedef struct {
>>> +  const char *const name;
>>> +  const oav2_identifier_variant_t value;
>>> +} oav2_identifier_t;
>>> +
>>> +/* A variant type to store the argument values of an assembly directive.  */
>>> +struct arg_variant_t;
>>> +
>>> +typedef struct {
>>> +  size_t len;
>>> +  struct arg_variant_t *elts;
>>> +} arg_variant_list;
>>> +
>>> +typedef union {
>>> +  const char *string;
>>> +#if (TC_OBJ_ATTR_v1)
>>> +  uint32_t u32;
>>> +#endif
>>
>> How does v1 come into play here?
>>
> 
> There is no good reason to keep this.
> The tag is a uint32_t so I guess that at the beginning, I wanted to save 
> the parsed value as a uint32_t. Then I moved everything to use uint64_t, 
> keys and associated unsigned integer values.
> 
> Since it is inconsistent, I propose you to move the definition of 
> obj_attr_tag_t to uint64_t with the following changes (at the HEAD of my 
> branch, so the diff will be splitted to the respective patches).

Hmm, it didn't even occur to me that obj_attr_tag_t would be involved here.
But yes, if that's what it takes to make things consistent, so be it.

>>> +/* Extract a string literal ("[^.]+") from the input.  */
>>> +static bool
>>> +extract_string_literal (arg_t *arg_out)
>>> +{
>>> +  skip_whitespace (input_line_pointer);
>>> +
>>> +  if (*input_line_pointer != '"')
>>> +    {
>>> +      as_bad ("missing '\"', expected a string literal");
>>> +      return false;
>>> +    }
>>> +
>>> +  int len;
>>> +  char *obstack_buf = demand_copy_C_string (&len);
>>
>> The function already skips whitespace and checks for a starting '"', so
>> why open-code that here again?
> 
> I can remove the previous lines, but I cannot detect missing closing '"' 
> in my understanding. So I am losing the error message too.

Aiui you'd hit

      as_warn (_("unterminated string; newline inserted"));

in next_char_of_string(). That's what other users of demand_copy_C_string()
also get, i.e. the overall result is going to be more consistent with the
open-coding dropped.

> The issue with the existing parsing utility functions is that they do 
> too much, or not enough. And their interface is not necessarily friendly 
> for what I want to do.

I think in such a case the general infrastructure would want improving,
rather than everyone rolling their own "clone". Yet for not it's still not
quite clear to me what exactly you're missing from the functions that are
available.

>>> +  switch (known_identifier->value.vtype)
>>> +    {
>>> +    case OAv2_ASM_ID_VALUE_U8:
>>> +      val_out->val.u64 = known_identifier->value.val.b;
>>> +      val_out->vtype = VALUE_UNSIGNED_INTEGER;
>>> +      break;
>>> +    case OAv2_ASM_ID_VALUE_BOOL:
>>> +      val_out->val.u64 = known_identifier->value.val.u8;
>>> +      val_out->vtype = VALUE_UNSIGNED_INTEGER;
>>> +      break;
>>> +    default:
>>> +      abort ();
>>> +    }
>>
>> How come bool and u8 are all that need handling? Surely the reader could be
>> helped by having a comment tio this effect?
> 
> Sorry, I don't understand the questions.
> The only known identifiers are declared in 
> known_identifiers_subsection_optional and 
> known_identifiers_subsection_encoding.
> Only bool and u8 are required, hence 3 values in 
> oav2_identifier_variant_type_info (u8, bool, undefined).

This being a (relatively) general purpose helper function, I don't see why it
may (silently!) rely on what is being passed to it. If there was a comment
saying "we only need this much for now", things would be quite a bit more
clear. A person wanting to understand why nothing else if handled here would
then also know the answer without having to guess.

>>> +  return true;
>>> +}
>>> +#endif /* TC_OBJ_ATTR_v2 */
>>> +
>>> +#if (TC_OBJ_ATTR_v1)
>>> +/* Look up attribute tags defined in the backend (object attribute v1).  */
>>> +static bool
>>> +obj_attr_v1_lookup_known_attr_tag_symbol (const char *identifier,
>>> +					  arg_token_t token_type,
>>> +					  arg_t *val_out)
>>> +{
>>> +#ifndef CONVERT_SYMBOLIC_ATTRIBUTE
>>> +#define CONVERT_SYMBOLIC_ATTRIBUTE(a) -1
>>> +  (void) identifier;
>>
>> As indicated in v8 review - the common way is use of ATTRIBUTE_UNUSED.
> 
> Fixed but unsure about the formatting (see below).
> 
> diff --git a/gas/config/obj-elf-attr.c b/gas/config/obj-elf-attr.c
> index 82ec6a79783..b574d46b80c 100644
> --- a/gas/config/obj-elf-attr.c
> +++ b/gas/config/obj-elf-attr.c
> @@ -354,15 +354,16 @@ resolve_if_matching (const char *identifier,
>   #if (TC_OBJ_ATTR_v1)
>   /* Look up attribute tags defined in the backend (object attribute 
> v1).  */
>   static bool
> -obj_attr_v1_lookup_known_attr_tag_symbol (const char *identifier,
> -                                         arg_token_t token_type,
> -                                         arg_t *val_out)
> -{
> +obj_attr_v1_lookup_known_attr_tag_symbol (
>   #ifndef CONVERT_SYMBOLIC_ATTRIBUTE
> -#define CONVERT_SYMBOLIC_ATTRIBUTE(a) -1
> -  (void) identifier;
> +  #define CONVERT_SYMBOLIC_ATTRIBUTE(a) -1
> +  const char *identifier ATTRIBUTE_UNUSED,
> +#else
> +  const char *identifier,

Well, no. See other uses of ATTRIBUTE_UNUSED: It's applied if any build
configuration would leave a parameter unused. No #ifdef-ary like this. 

>>> +/* Look up known symbols, and try to resolve the given identifier.  */
>>> +static bool
>>> +lookup_known_symbols (const char *identifier,
>>> +		      arg_token_t token_type,
>>> +		      arg_t *val_out)
>>> +{
>>> +  if (identifier == NULL)
>>> +    return false;
>>> +
>>> +  /* The identifier should match the value in val_out.  */
>>> +  gas_assert (val_out->val.string == identifier);
>>> +
>>> +  arg_token_t high_ttype = (token_type & HT_MASK);
>>> +
>>> +#if (TC_OBJ_ATTR_v2)
>>> +  static const oav2_identifier_t known_identifiers_subsection_optional[] = {
>>> +    { "optional", {
>>> +	.val.b = true,
>>> +	.vtype = OAv2_ASM_ID_VALUE_BOOL
>>> +      }
>>> +    },
>>> +    { "required", {
>>> +	.val.b = false,
>>> +	.vtype = OAv2_ASM_ID_VALUE_BOOL
>>> +      }
>>> +    },
>>> +  };
>>
>> Interesting - the "keywords" a user would specify need to be all lowercase here,
>> but ...
>>
>>> +  static const oav2_identifier_t known_identifiers_subsection_encoding[] = {
>>> +    { "ULEB128", {
>>> +	.val.u8 = obj_attr_encoding_v2_to_u8 (OA_ENC_ULEB128),
>>> +	.vtype = OAv2_ASM_ID_VALUE_U8
>>> +      }
>>> +    },
>>> +    { "uleb128", {
>>> +	.val.u8 = obj_attr_encoding_v2_to_u8 (OA_ENC_ULEB128),
>>> +	.vtype = OAv2_ASM_ID_VALUE_U8
>>> +      }
>>> +    },
>>> +    { "NTBS", {
>>> +	.val.u8 = obj_attr_encoding_v2_to_u8 (OA_ENC_NTBS),
>>> +	.vtype = OAv2_ASM_ID_VALUE_U8
>>> +      }
>>> +    },
>>> +    { "ntbs", {
>>> +	.val.u8 = obj_attr_encoding_v2_to_u8 (OA_ENC_NTBS),
>>> +	.vtype = OAv2_ASM_ID_VALUE_U8
>>> +      }
>>> +    },
>>> +  };
>>
>> ... can be all low or all upper case here (but then not e.g. Uleb128). Wouldn't
>> case-insensitive comparison be preferable, leaving more freedom to users?
>>
> 
> For reference, here below is the discussion we had about it in revision 
> 8, patch 4.
> 
>>>>> --- a/gas/doc/c-aarch64.texi
>>>>> +++ b/gas/doc/c-aarch64.texi
>>>>> @@ -479,6 +479,33 @@ The AArch64 architecture uses @sc{ieee} floating-point numbers.
>>>>>   
>>>>>   @c AAAAAAAAAAAAAAAAAAAAAAAAA
>>>>>   
>>>>> +@cindex @code{.aeabi_subsection} directive, AArch64
>>>>> +@item .aeabi_subsection @var{name}, @var{comprehension}, @var{encoding}
>>>>> +Create or switch the current object attributes subsection to @var{name}.  Valid
>>>>> +values for @var{name} are following the pattern @code{[a-zA-Z0-9_-]+}.
>>>>> +
>>>>> +The subsection property @var{comprehension} determines how a program processing
>>>>> +the attributes handles attributes that it does not recognize (perhaps because
>>>>> +the object file was generated by a different version of the toolchain).  A
>>>>> +subsection that is marked @code{optional} can be skipped if it is not
>>>>> +understood.  A subsection marked @code{required} implies that information
>>>>> +conveyed by the attribute is required for correct processing of the object file;
>>>>> +a fatal diagnostic must be generated if a tool does not recognize either the tag
>>>>> +or the value associated with it.
>>>>> +
>>>>> +@var{encoding} specifies the expected encoding of the attributes recorded in the
>>>>> +subsection.  Currently supported values are @code{ULEB128} and @code{NTBS}
>>>>> +(null-terminated byte string).
>>>>> +
>>>>> +@cindex @code{.aeabi_attribute} @var{tag}, @var{value}
>>>>> +@item .aeabi_attribute @var{tag}, @var{value}
>>>>> +Create an attribute with the pair @var{tag}, @var{value} in the current
>>>>> +subsection.  @var{tag} can either be an integer value, or a known named key.
>>>>> +@var{value} can either be an integer or a string.
>>>>> +
>>>>> +The complete list of subsections and tags supported on AArch64 is documented
>>>>> +in @cite{Build Attributes for the Arm 64-bit Architecture (AArch64)}.
>>>>
>>>> Throughout you additions here I think it would be helpful if it was clarified
>>>> which of the items are case-sensitive, and which ones are not. For example
>>>> you spell "optional" and "required" all lower-case, but "ULEB128" and "NTBS"
>>>> all upper-case. Without disambiguation it doesn't become clear whether that
>>>> has any particular significance.
>>>>
>>>
>>> I personally don't like this choice between capitalized and 
>>> non-capitalized version for the encoding. The reason why capitalized 
>>> version exists for encoding is because ULEB128 and NTBS are acronyms, 
>>> and it is more natural in English to use the capitalized version for 
>>> those. I would have preferred only the capitalized version to exist, but 
>>> the spec proposed it so Gas supports it.
>>>
>>> I wrote this documentation in a prescriptive way and expect people to 
>>> use the version from the documentation. I omitted on purpose the 
>>> possibility of using lower case for the encoding because having several 
>>> options opens the door to inconsistencies across a code base.
>>>
>>> If you disagree, I can add it.
>>
>> Imo you want to state precisely what's accepted. If case doesn't matter,
>> people should be free to use spellings of their choice. No matter what
>> you or I like or dislike.
>>
> 
> In the next revision, the identifiers are changed to lower case before 
> the comparison.

But this doesn't answer my question: Why can the comparisons not simply be
case-insensitive, to allow people to spell things as they like them in their
sources?

>>> +/* Look up the symbol table of this compilation unit, and try to resolve the
>>> +   given identifier.  */
>>> +static bool
>>> +lookup_symbol_table (const char *identifier,
>>> +		     const arg_token_t expected_ttype,
>>> +		     arg_t *val_out)
>>> +{
>>> +  if (identifier == NULL)
>>> +    return false;
>>> +
>>> +  /* Note: signed integer are unsupported for now.  */
>>> +  gas_assert (expected_ttype & UNSIGNED_INTEGER);
>>> +  /* The identifier should match the value in val_out.  */
>>> +  gas_assert (val_out->val.string == identifier);
>>
>> Hmm, maybe herein lies part of the answer to an earlier question. Yet then why
>> would those functions take redundant parameters?
>>
> 
> The wanted interface for such a function is:
> - identifier: IN
> - expected_type: IN
> - val_out: OUT
> but I acknowledge that the free() in those functions is misleading.
> 
> What about this ?

That still doesn't look to address the apparent redundancy of what needs
passing into this function.

Jan


More information about the Binutils mailing list