Spring Cleaning by taspelund · Pull Request #648 · oxidecomputer/maghemite
added 24 commits
February 19, 2026 12:01Three tests failed under cargo test (thread-per-test) but passed under cargo nextest (process-per-test) due to shared global state in the channel-based test network simulation: 1. unnumbered_peering_helper() leaked dispatcher threads by not returning dispatcher handles to callers. Leaked threads would periodically re-bind to addresses in the global NET endpoint map, stealing entries from concurrently running tests. 2. The simulated Network had no unbind mechanism, so endpoint entries persisted after a Listener was dropped, leaving stale state for subsequent tests. 3. Several tests used hardcoded scope_ids (2, 3) for link-local addresses, causing collisions in the shared NET endpoint map when tests ran as concurrent threads in the same process. Fix by: - Adding Network::unbind() and Listener::Drop to clean up NET endpoint entries when a listener goes out of scope - Returning dispatchers from unnumbered_peering_helper() so all callers can shut them down during test cleanup - Switching hardcoded scope_ids to next_scope_id() for unique per-test addresses, matching the newer test infrastructure
This adds logic to send_update() to be able to split a large amount of prefixes that would cause the Update to exceed the max BGP packet size. Now, if we have more originated prefixes than can fit into a single Update, we will split the prefixes into chunks that fit the remaining available space (max size - header - fixed length fields - attributes). This also updates the basic_update* tests to exchange and validate 4k routes instead of just 1. This not only exercises the Update chunking, but also recreates the conditions where #634 was observed. Fixes: #614
Allow static routes to use cross-address-family next-hops (e.g. IPv4 prefix via IPv6 next-hop and vice versa). Introduce API version 7 (EXTENDED_NH_STATIC) with StaticRoute4 and StaticRoute6 types that accept IpAddr for the nexthop field instead of the AF-specific Ipv4Addr/Ipv6Addr. The previous types are retired as V1 variants with version-gated endpoints. The internal layers (StaticRouteKey, Path, RIB) already handle cross-AF next-hops; this closes the gap at the API boundary. Closes: #581
When stdout is piped into a program that closes early (e.g. `mgadm bgp status neighbors 47 | head`), Rust's println! macro panics on the resulting EPIPE, causing a SIGABRT. Add println_nopipe!/print_nopipe! macros to mg-common that silently exit on BrokenPipe instead of panicking. Use them throughout mgadm's non-test print calls. Also catch EPIPE errors propagated via ? from TabWriter writeln/flush paths at both mgadm and ddmadm main() boundaries. Fixes #567
The Ord impl we have for Path was only checking nexthop_interface and vlan_id for an identity check which returns Ordering::Equal if bgp is None and the (nexthop, nexthop_interface, vlan_id) are the same. However, when all these fields are not the same that impl moved onto a tie-breaker which compared the nexthop before moving onto the bgp cmp, without ever doing a cmp for nexthop_interface or vlan_id. This meant that any two Paths with the same nexthop but different nexthop_interface and/or vlan_id would be treated as identical paths. BTreeSet relies on this Ord impl to determine whether an insertion should replace or add an element. The above issue in the Ord impl would result in elements being replaced instead of added, which was observed in #528 when we static routes are pulled out of the persistent sled DB and into a BTreeSet before being returned to the API client. The impact of this is not limited to just `mgadm get-v{4,6}-routes` returning misleading information, but also missing static routes if mgd were to restart. However, this bug does not affect ECMP paths with separate next-hops since the nexthop field was compared as part of the tie-breaker when the composite identity check did not match. Fixes: #528
Add handling for changes to enforce-first-as that doesn't involve a hard reset of the session. When enabled, walk the RIB and remove paths from the peer that fail the first-as test. When disabled, send a route-refresh to the peer so we can re-learn any routes we may have filtered out. Fixes: #620
Consolidate paired address-family-specific methods into generic or AF-parameterized equivalents throughout the RIB database layer. Origin methods now use the Prefix enum for consistency with existing static routing methods. Rename BGP removal methods for clarity and fix over-approximate PCN notifications.
There's a lot of code duplication across {create,get,set}_origin{4,6}
methods, which can be simplified to just use Prefix + AddressFamily.
This does exactly that -- unifies each method to just a single version
that accepts Prefix (AF-agnostic) and uses AddressFamily to distinguish
what should be installed where.
When an Update was marked treat-as-withdraw as a result of an error (either by the parser or by checks done by the FSM), it was never read or handled by the FSM. So no treat-as-withdraw error would ever actually result in routes being withdrawn... even worse, routes that should have been withdrawn due to errors caught by the parser would get installed if there wasn't a duplicated check in the FSM. There were also a couple other bugs that needed fixing too: 1) EnforceFirstAs was unconditionally applied to all peers, but should only be applied to eBGP peers. 2) FSM layer failures in check_update() would result in an early return rather than a withdrawal, which could lead to routes staying installed indefinitely with outdated attributes. For example, we would see issues as a result of (2) in this scenario: A) We receive an UPDATE for Route X with valid attrs B) We receive an UPDATE for Route X with invalid attrs If we return early after (B) then the second UPDATE is ignored, leaving us with Route X installed with the earlier (now outdated) valid attrs. The correct response to (B) is to withdraw Route X. Let's not stick our fingers in our ears and pretend the second UPDATE never happened. EnforceFirstAs was also explicitly checking for a missing AS_PATH path attribute, however this was already checked by the parser as a protocol level error (mandatory attribute missing == treat-as-withdraw). So this check has been removed, since we now catch treat-as-withdraw flagged by the parser.
There were multiple places where we kicked off a RIB walker as a result of an event pertaining to a BGP peer which unconditionally locked and walked the RIBs for both IPv4 and IPv6. Update these spots to only walk the RIBs for address-families that have been negotiated with the peer.
Add logic to send a NOTIFICATION if we can't find common ground with our BGP peer. If the peer sends us MP capabilities, then we treat the AFIs they send us as an exhaustive list of what they're willing to exchange. If the peer doesn't send us MP capabilities, then we treat them as a legacy peer that supports only IPv4 Unicast and only with traditional encoding. If we haven't enabled IPv4 Unicast and our peer doesn't support MP-BGP, then we do not have any AFIs in common and we'll reset the session. Also adds a basic integration test to ensure that two peers with non overlapping AFI/SAFIs never move into Established.
Adds a missing call to handle_open() in fsm_established(). This covers the case where we've enabled deterministic_collision_resolution and we get an OPEN while we're in Established. The OPEN still needs to be validated here too.
Updates is_major_event() to categorize more FSM events as major so they don't live just in the "all" buffer. Now the only events that are not major are Keepalives, ConnectRetryTimerExpires, IdleHoldTimerExpires, and KeepaliveTimerExpires. Every other message or event is significant since they're either low frequency or high importance (triggering an FSM state change, changing a connection state, updating the RIB, etc.)
Plumb major/all buffers through the BGP Message History, mirroring the major/all buffer split of the BGP FSM History. In this case, the only message type that doesn't live in the "major" buffer is keepalives.
Expands the advertised/imported prefix counters to be per AFI. Adds logic to track the last reset timestamp/reason. Adds logic to track the last sent/received NOTIFICATION and timestamp. Adds unit tests to validate the prefix counters and the last reset and last NOTIFICATION.
Adds clippy lints to deny wildcard match arms within the BGP FSM. This ensures that any future work will always perform an exhaustive match so there's no chance of an FSM event handler getting missed. The existing match arms were already exhaustive in almost all cases, but this adds an extra level of confidence that it won't change in the future.
Adds a configurable DSCP value for IP packets carrying BGP sessions. Fixes: #565
Adds logic to update the DSCP and TTL values on TCP sockets in-flight for all active connections of a SessionRunner. This means that we update the sockets and allow BGP to manage the aftermath. e.g. We won't explicitly close the connection when setting the TTL to 1, but that doesn't mean the connection isn't doomed. The connection may drop due to the practical implications of changing the TTL on the socket. Manual testing confirms (via snoop) that the QoS marking is properly updated (added/removed/changed) along with the BGP neighbor config, without causing a reset of the session.
The outgoing hop-limit and incoming min-hop filter for IPv6 BGP peers were broken in two ways: 1. apply_min_ttl() called TcpStream::set_ttl() unconditionally, which sets IP_TTL — an IPv4-only option. For IPv6 sockets this is silently ignored. Branch on address family and use setsockopt(IPV6_UNICAST_HOPS) for IPv6 peers. 2. set_ip_minttl() used IPV6_HOPLIMIT as the IPv6 equivalent of IP_MINTTL. IPV6_HOPLIMIT is a recvmsg ancillary-data option that delivers the received hop-limit as a cmsg — it does not filter on it. Replace with IPV6_MINHOPCOUNT, which is the actual IPv6 counterpart to IP_MINTTL. Define IPV6_MINHOPCOUNT as a local constant (value 73 for Linux, 47 for Illumos) since it is not exposed by the libc crate for either platform. Fixes: #647
Instead of asking the kernel to reset the TTL back to its own default, let's just manage this explicitly ourselves. This introduces a const BGP_DEFAULT_TTL which is set to 255 -- we are not a 40 year old routing platform that needs to have 100 different nerd knobs to make TTL work in a sane fashion, so let's avoid all the hubbub with making things different for eBGP/iBGP and just set our TTL to 255 from the get-go.
socket2 has SockRef methods for setting DSCP/TCLASS values for IPv4/IPv6 headers, but not for illumos. Revert back to using the unsafe libc *sockopt functions for setting and testing those functions.
Updates PeerConnection to store a new PeerCabilities struct, which has the negotiated state of each supported capability. This is in addition to the BTreeSets of sent/received capabilities, which are still exposed via the API so operators can see exactly what caps were exchanged rather than just the resulting usable state (e.g. if we wanted to figure out what capability was sent to us + what it contained).
Actually store ENHE in caps_tx so we can track the negotiation state of that capability for a given peer.
When IPv6 NLRI are exchanged using IPv4 next-hops, they use IPv4-Mapped IPv6 addresses for the next-hop rather than relying on ENHE capabilities (Extended Next-Hop Encoding) describing IPv4 NH-AFI for IPv6 Unicast. This updates our nexthop selection to return an IPv4-Mapped IPv6 address on the sending side (for cases where IPv6 Unicast has been configured with an explicit nexthop override where the address is IPv4), and our UPDATE processing to normalize the nexthop prior to RIB installation on the receiving side. e.g. IpAddr::V6(::ffff:10.0.0.1) becomes IpAddr::V4(10.0.0.1) prior to RIB insertion so mg-lower/ddm/dendrite never see a mapped address.
Last bits of logic to ensure extended nexthops work for numbered peers. A nexthop override configuration already takes an IpAddr, allowing for either v4 or v6 next-hops to be configured. This ensures that we use proper encoding anytime nexthop-afi != nlri-afi (ENHE for v4-over-v6, v4-mapped v6 addresses for v6-over-v4). Unit and integration tests are added to cover these cases. Also adds in missing type conversions for ENHE and BGP Extended Message capabilities, ensuring they both appear as the correct enum variant rather than the catch-all unknown.
Add unit tests for negotiate_capabilities() covering AF negotiation, legacy implicit IPv4, extended messages, route refresh, and ENHE element matching. Add select_nexthop tests for configured cross-AF nexthop with and without ENHE. Add ExtendedNexthopElement::v4_over_v6() constructor; gate v6_over_v4 helpers behind cfg(test).
Adds logic into update_session_info() to reset the BGP session if a nexthop configuration change results in ENHE capability being toggled. i.e. If the NH-AFI changes for IPv4 Unicast, ENHE must be re-negotiated which requires a full session reset.
Bump TEST_ROUTE_COUNT to 32k to ensure we always originate enough routes to trigger UpdateMessage chunking, even when BGP Extended Message Size has been negotiated.
Use mg-common's ManagedThread so we retain JoinHandles in a structured fashion in case we want to make explicit use of these later. Also introduce AddPeerRequest to group arguments defining a peer config. This avoids clippy's too many arguments warning and provides some rough structure for the peer add flow.
After converting to using a thread builder instead of thread::spawn() (builder::spawn returns result, thread::spawn panics), the Result should be propagated back out to the callers so they can do proper error handling.
ensure() was returning a Result<UdpSocket> but the caller just dropped it, so now we just return Result<()>. Removes redundant presence check for peer during add_peer callpath. Updates egress() to use recv_timeout() to avoid possible deadlocks where recv() hangs on a dead channel and we miss the killswitch (AtomicBool).
The BGP SessionClock thread was not being stopped when the FSM was put into shutdown, although all its timers were. After migrating to thread Builder to have thread names, the thread leak test began failing due to the SessionClock threads not being stopped/joined. This adds a stop() method to ManagedThread, allowing the thread itself to be joined so it doesn't linger indefinitely.
We skipped both 127.0.0.1 and ::1 during install, but were not skipping ::1 during uninstall. Let's make that symmetric.
Move from using an installed bool to using a u32 to track the number of intra-process users of an IP, which prevents early uninstalls. Move loopback-skipping logic (127.0.0.1 / ::1) into a dedicated function and make a few methods private that don't need to be public.
If an installation fails for one of the requested IPs, rollback the IPs. Removes an unused uninstall() method.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters