Spring Cleaning by taspelund · Pull Request #648 · oxidecomputer/maghemite

added 24 commits

February 19, 2026 12:01
Three tests failed under cargo test (thread-per-test) but passed
under cargo nextest (process-per-test) due to shared global state
in the channel-based test network simulation:

1. unnumbered_peering_helper() leaked dispatcher threads by not
   returning dispatcher handles to callers. Leaked threads would
   periodically re-bind to addresses in the global NET endpoint
   map, stealing entries from concurrently running tests.

2. The simulated Network had no unbind mechanism, so endpoint
   entries persisted after a Listener was dropped, leaving stale
   state for subsequent tests.

3. Several tests used hardcoded scope_ids (2, 3) for link-local
   addresses, causing collisions in the shared NET endpoint map
   when tests ran as concurrent threads in the same process.

Fix by:
- Adding Network::unbind() and Listener::Drop to clean up NET
  endpoint entries when a listener goes out of scope
- Returning dispatchers from unnumbered_peering_helper() so all
  callers can shut them down during test cleanup
- Switching hardcoded scope_ids to next_scope_id() for unique
  per-test addresses, matching the newer test infrastructure
This adds logic to send_update() to be able to split a large amount of
prefixes that would cause the Update to exceed the max BGP packet size.
Now, if we have more originated prefixes than can fit into a single
Update, we will split the prefixes into chunks that fit the remaining
available space (max size - header - fixed length fields - attributes).

This also updates the basic_update* tests to exchange and validate 4k
routes instead of just 1. This not only exercises the Update chunking,
but also recreates the conditions where #634 was observed.

Fixes: #614
Allow static routes to use cross-address-family next-hops
(e.g. IPv4 prefix via IPv6 next-hop and vice versa).

Introduce API version 7 (EXTENDED_NH_STATIC) with StaticRoute4 and
StaticRoute6 types that accept IpAddr for the nexthop field
instead of the AF-specific Ipv4Addr/Ipv6Addr. The previous
types are retired as V1 variants with version-gated endpoints.

The internal layers (StaticRouteKey, Path, RIB) already handle
cross-AF next-hops; this closes the gap at the API boundary.

Closes: #581
When stdout is piped into a program that closes early (e.g.
`mgadm bgp status neighbors 47 | head`), Rust's println!
macro panics on the resulting EPIPE, causing a SIGABRT.

Add println_nopipe!/print_nopipe! macros to mg-common that
silently exit on BrokenPipe instead of panicking. Use them
throughout mgadm's non-test print calls. Also catch EPIPE
errors propagated via ? from TabWriter writeln/flush paths
at both mgadm and ddmadm main() boundaries.

Fixes #567
The Ord impl we have for Path was only checking nexthop_interface and
vlan_id for an identity check which returns Ordering::Equal if bgp is
None and the (nexthop, nexthop_interface, vlan_id) are the same.
However, when all these fields are not the same that impl moved onto a
tie-breaker which compared the nexthop before moving onto the bgp cmp,
without ever doing a cmp for nexthop_interface or vlan_id. This meant
that any two Paths with the same nexthop but different nexthop_interface
and/or vlan_id would be treated as identical paths.

BTreeSet relies on this Ord impl to determine whether an insertion
should replace or add an element. The above issue in the Ord impl would
result in elements being replaced instead of added, which was observed
in #528 when we static routes are pulled out of the persistent sled DB
and into a BTreeSet before being returned to the API client. The impact
of this is not limited to just `mgadm get-v{4,6}-routes` returning
misleading information, but also missing static routes if mgd were to
restart.

However, this bug does not affect ECMP paths with separate next-hops
since the nexthop field was compared as part of the tie-breaker when
the composite identity check did not match.

Fixes: #528
Add handling for changes to enforce-first-as that doesn't involve a hard
reset of the session. When enabled, walk the RIB and remove paths from
the peer that fail the first-as test. When disabled, send a
route-refresh to the peer so we can re-learn any routes we may have
filtered out.

Fixes: #620
Consolidate paired address-family-specific methods into generic
or AF-parameterized equivalents throughout the RIB database layer.
Origin methods now use the Prefix enum for consistency with existing
static routing methods. Rename BGP removal methods for clarity and fix
over-approximate PCN notifications.
There's a lot of code duplication across {create,get,set}_origin{4,6}
methods, which can be simplified to just use Prefix + AddressFamily.
This does exactly that -- unifies each method to just a single version
that accepts Prefix (AF-agnostic) and uses AddressFamily to distinguish
what should be installed where.
When an Update was marked treat-as-withdraw as a result of an error
(either by the parser or by checks done by the FSM), it was never read
or handled by the FSM. So no treat-as-withdraw error would ever actually
result in routes being withdrawn... even worse, routes that should have
been withdrawn due to errors caught by the parser would get installed if
there wasn't a duplicated check in the FSM.

There were also a couple other bugs that needed fixing too:
1) EnforceFirstAs was unconditionally applied to all peers, but should
   only be applied to eBGP peers.
2) FSM layer failures in check_update() would result in an early return
   rather than a withdrawal, which could lead to routes staying
   installed indefinitely with outdated attributes.

For example, we would see issues as a result of (2) in this scenario:
  A) We receive an UPDATE for Route X with valid attrs
  B) We receive an UPDATE for Route X with invalid attrs

If we return early after (B) then the second UPDATE is ignored, leaving
us with Route X installed with the earlier (now outdated) valid attrs.
The correct response to (B) is to withdraw Route X. Let's not stick our
fingers in our ears and pretend the second UPDATE never happened.

EnforceFirstAs was also explicitly checking for a missing AS_PATH path
attribute, however this was already checked by the parser as a protocol
level error (mandatory attribute missing == treat-as-withdraw). So this
check has been removed, since we now catch treat-as-withdraw flagged by
the parser.
There were multiple places where we kicked off a RIB walker as a result
of an event pertaining to a BGP peer which unconditionally locked and
walked the RIBs for both IPv4 and IPv6. Update these spots to only walk
the RIBs for address-families that have been negotiated with the peer.
Add logic to send a NOTIFICATION if we can't find common ground with our
BGP peer. If the peer sends us MP capabilities, then we treat the AFIs
they send us as an exhaustive list of what they're willing to exchange.
If the peer doesn't send us MP capabilities, then we treat them as a
legacy peer that supports only IPv4 Unicast and only with traditional
encoding. If we haven't enabled IPv4 Unicast and our peer doesn't
support MP-BGP, then we do not have any AFIs in common and we'll reset
the session.

Also adds a basic integration test to ensure that two peers with non
overlapping AFI/SAFIs never move into Established.
Adds a missing call to handle_open() in fsm_established(). This covers
the case where we've enabled deterministic_collision_resolution and we
get an OPEN while we're in Established. The OPEN still needs to be
validated here too.
Updates is_major_event() to categorize more FSM events as major so they
don't live just in the "all" buffer. Now the only events that are not
major are Keepalives, ConnectRetryTimerExpires, IdleHoldTimerExpires,
and KeepaliveTimerExpires. Every other message or event is significant
since they're either low frequency or high importance (triggering an
FSM state change, changing a connection state, updating the RIB, etc.)
Plumb major/all buffers through the BGP Message History, mirroring the
major/all buffer split of the BGP FSM History. In this case, the only
message type that doesn't live in the "major" buffer is keepalives.
Expands the advertised/imported prefix counters to be per AFI.
Adds logic to track the last reset timestamp/reason.
Adds logic to track the last sent/received NOTIFICATION and timestamp.
Adds unit tests to validate the prefix counters and the last reset and
last NOTIFICATION.
Adds clippy lints to deny wildcard match arms within the BGP FSM. This
ensures that any future work will always perform an exhaustive match so
there's no chance of an FSM event handler getting missed. The existing
match arms were already exhaustive in almost all cases, but this adds an
extra level of confidence that it won't change in the future.
Adds a configurable DSCP value for IP packets carrying BGP sessions.

Fixes: #565
Adds logic to update the DSCP and TTL values on TCP sockets in-flight
for all active connections of a SessionRunner. This means that we update
the sockets and allow BGP to manage the aftermath.
e.g.
We won't explicitly close the connection when setting the TTL to 1, but
that doesn't mean the connection isn't doomed. The connection may drop
due to the practical implications of changing the TTL on the socket.

Manual testing confirms (via snoop) that the QoS marking is properly
updated (added/removed/changed) along with the BGP neighbor config,
without causing a reset of the session.
The outgoing hop-limit and incoming min-hop filter for IPv6 BGP
peers were broken in two ways:

1. apply_min_ttl() called TcpStream::set_ttl() unconditionally,
   which sets IP_TTL — an IPv4-only option. For IPv6 sockets this
   is silently ignored. Branch on address family and use
   setsockopt(IPV6_UNICAST_HOPS) for IPv6 peers.

2. set_ip_minttl() used IPV6_HOPLIMIT as the IPv6 equivalent of
   IP_MINTTL. IPV6_HOPLIMIT is a recvmsg ancillary-data option
   that delivers the received hop-limit as a cmsg — it does not
   filter on it. Replace with IPV6_MINHOPCOUNT, which is the
   actual IPv6 counterpart to IP_MINTTL.

Define IPV6_MINHOPCOUNT as a local constant (value 73 for Linux, 47 for
Illumos) since it is not exposed by the libc crate for either platform.

Fixes: #647
Instead of asking the kernel to reset the TTL back to its own default,
let's just manage this explicitly ourselves. This introduces a const
BGP_DEFAULT_TTL which is set to 255 -- we are not a 40 year old routing
platform that needs to have 100 different nerd knobs to make TTL work
in a sane fashion, so let's avoid all the hubbub with making things
different for eBGP/iBGP and just set our TTL to 255 from the get-go.

@taspelund

socket2 has SockRef methods for setting DSCP/TCLASS values for IPv4/IPv6
headers, but not for illumos. Revert back to using the unsafe libc
*sockopt functions for setting and testing those functions.

@taspelund

Adds support for BGP Extended Message Size Capability (RFC 8654).
Updates PeerConnection to store a new PeerCabilities struct, which has
the negotiated state of each supported capability. This is in addition
to the BTreeSets of sent/received capabilities, which are still exposed
via the API so operators can see exactly what caps were exchanged rather
than just the resulting usable state (e.g. if we wanted to figure out
what capability was sent to us + what it contained).
Actually store ENHE in caps_tx so we can track the negotiation state of
that capability for a given peer.
When IPv6 NLRI are exchanged using IPv4 next-hops, they use IPv4-Mapped
IPv6 addresses for the next-hop rather than relying on ENHE capabilities
(Extended Next-Hop Encoding) describing IPv4 NH-AFI for IPv6 Unicast.

This updates our nexthop selection to return an IPv4-Mapped IPv6 address
on the sending side (for cases where IPv6 Unicast has been configured
with an explicit nexthop override where the address is IPv4), and our
UPDATE processing to normalize the nexthop prior to RIB installation on
the receiving side.
e.g.
IpAddr::V6(::ffff:10.0.0.1) becomes IpAddr::V4(10.0.0.1) prior to RIB
insertion so mg-lower/ddm/dendrite never see a mapped address.
Last bits of logic to ensure extended nexthops work for numbered peers.
A nexthop override configuration already takes an IpAddr, allowing for
either v4 or v6 next-hops to be configured. This ensures that we use
proper encoding anytime nexthop-afi != nlri-afi (ENHE for v4-over-v6,
v4-mapped v6 addresses for v6-over-v4). Unit and integration tests are
added to cover these cases. Also adds in missing type conversions for
ENHE and BGP Extended Message capabilities, ensuring they both appear as
the correct enum variant rather than the catch-all unknown.
Add unit tests for negotiate_capabilities() covering AF
negotiation, legacy implicit IPv4, extended messages, route
refresh, and ENHE element matching. Add select_nexthop tests
for configured cross-AF nexthop with and without ENHE.

Add ExtendedNexthopElement::v4_over_v6() constructor; gate
v6_over_v4 helpers behind cfg(test).
Adds logic into update_session_info() to reset the BGP session if a
nexthop configuration change results in ENHE capability being toggled.
i.e.
If the NH-AFI changes for IPv4 Unicast, ENHE must be re-negotiated which
requires a full session reset.
Bump TEST_ROUTE_COUNT to 32k to ensure we always originate enough routes
to trigger UpdateMessage chunking, even when BGP Extended Message Size
has been negotiated.

@taspelund

Use mg-common's ManagedThread so we retain JoinHandles in a structured
fashion in case we want to make explicit use of these later.
Also introduce AddPeerRequest to group arguments defining a peer config.
This avoids clippy's too many arguments warning and provides some rough
structure for the peer add flow.

@taspelund

After converting to using a thread builder instead of thread::spawn()
(builder::spawn returns result, thread::spawn panics), the Result should
be propagated back out to the callers so they can do proper error
handling.
ensure() was returning a Result<UdpSocket> but the caller just dropped
it, so now we just return Result<()>.
Removes redundant presence check for peer during add_peer callpath.
Updates egress() to use recv_timeout() to avoid possible deadlocks where
recv() hangs on a dead channel and we miss the killswitch (AtomicBool).
The BGP SessionClock thread was not being stopped when the FSM was put
into shutdown, although all its timers were. After migrating to thread
Builder to have thread names, the thread leak test began failing due to
the SessionClock threads not being stopped/joined. This adds a stop()
method to ManagedThread, allowing the thread itself to be joined so it
doesn't linger indefinitely.

@taspelund

@taspelund

Fix PolicyAction::from_str() to convert "deny" into Self::Deny instead
of Self::Allow.

@taspelund

@taspelund

@taspelund

We skipped both 127.0.0.1 and ::1 during install, but were not skipping
::1 during uninstall. Let's make that symmetric.
Move from using an installed bool to using a u32 to track the number of
intra-process users of an IP, which prevents early uninstalls.
Move loopback-skipping logic (127.0.0.1 / ::1) into a dedicated function
and make a few methods private that don't need to be public.
If an installation fails for one of the requested IPs, rollback the IPs.
Removes an unused uninstall() method.

@taspelund

@taspelund

zeeshanlakhani