@lihua-cls
Contributor

Pre-submission checklist

  • I've run the linters locally and fixed lint errors related to the files I modified in this PR. You can install the linters by running pip install -r requirements-dev.txt && pre-commit install
  • pre-commit run
[INFO] Stashing unstaged files to /root/.cache/pre-commit/patch1764668592-1594938.
clang-format.............................................................Passed
trim trailing whitespace.................................................Passed
fix end of files.........................................................Passed
check yaml...........................................(no files to check)Skipped
check json...........................................(no files to check)Skipped
check for merge conflicts................................................Passed

Summary

When running the AgentMirroringTest cases on Santa Barbara (tahansb800bc), the tests fail consistently with errors like this:

AgentMirroringTests.cpp:133: Failure
Expected equality of these values:
  newOutPkt - oldOutPkts
    Which is: 0
  count
    Which is: 1

Root Cause Analysis

When the agent test is configured for 400G mode, the test framework automatically creates over 200 L3 interfaces and assigns IP addresses to them (e.g., 1.0.0.1/24, 2.0.0.1/24, ..., 201.0.0.1/24,...):

(unidiag)[root@localhost fboss]# ./bin/fboss2 show interface

...
+-----------+--------+-------+------+------+--------------------+-------------+
| eth1/51/7 | up     | 400G  | 2200 | 9000 | 201.0.0.1/24       |             |
|           |        |       |      |      | 201::/64           |             |
|           |        |       |      |      | fe80::ff:fe00:1/64 |             |
+-----------+--------+-------+------+------+--------------------+-------------+
| eth1/52/1 | up     | 400G  | 2201 | 9000 | 202.0.0.1/24       |             |
|           |        |       |      |      | 202::/64           |             |
|           |        |       |      |      | fe80::ff:fe00:1/64 |             |
+-----------+--------+-------+------+------+--------------------+-------------+
| eth1/52/3 | up     | 400G  | 2202 | 9000 | 203.0.0.1/24       |             |
|           |        |       |      |      | 203::/64           |             |
|           |        |       |      |      | fe80::ff:fe00:1/64 |             |
+-----------+--------+-------+------+------+--------------------+-------------+
...

The AgentMirroringTest sends test packets to a destination IP of 201.0.0.x, assuming this traffic will be forwarded via the default route (0.0.0.0/0). However, due to the interface setup described above, a more specific route (201.0.0.0/24) exists in the routing table:

(unidiag)[root@localhost fboss]# ./bin/fboss2 show route

...
Network Address: 0.0.0.0/0
        via 1.0.0.2 dev fboss2000 weight 1
...
Network Address: 201.0.0.0/24
        via 201.0.0.1 dev fboss2200 weight 1
...
(unidiag)[root@localhost fboss]#

This output clearly shows that traffic destined for the 201.0.0.0/24 network will be routed via 201.0.0.1, not the default route's next-hop 1.0.0.2.
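
The route selection described above is ordinary longest-prefix matching. Below is a minimal sketch using Python's standard ipaddress module. It assumes only the two routes shown in the output and a hypothetical destination of 201.0.0.10 (the description only says 201.0.0.x). longest_prefix_match is an illustrative helper, not FBOSS code.

```python
import ipaddress

# Routes copied from the `show route` output above: prefix -> next hop.
ROUTES = {
    ipaddress.ip_network("0.0.0.0/0"): "1.0.0.2",
    ipaddress.ip_network("201.0.0.0/24"): "201.0.0.1",
}


def longest_prefix_match(dst, routes=ROUTES):
    """Return (prefix, next_hop) for the most specific route covering dst."""
    addr = ipaddress.ip_address(dst)
    candidates = [net for net in routes if addr in net]
    best = max(candidates, key=lambda net: net.prefixlen)
    return best, routes[best]


if __name__ == "__main__":
    # 201.0.0.10 is a hypothetical stand-in for the test's 201.0.0.x address.
    print(longest_prefix_match("201.0.0.10"))
    # 201.201.0.10 is not covered by 201.0.0.0/24, so only 0.0.0.0/0 matches.
    print(longest_prefix_match("201.201.0.10"))
```

In this model, any destination inside 201.0.0.0/24 resolves to the interface route, and only addresses outside every more specific prefix fall through to the default route.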

In addition, the test code only creates neighbor entries for the next-hops it expects to use: the default route's next-hop (1.0.0.2) and the mirror destination's next-hop (2.0.0.3). It does not create a neighbor entry for the next-hop the packet is actually routed to (201.0.0.1). The logs confirm that only the two expected neighbors are added, neither of which is the one the packet needs:

V0702 06:20:51.088154 773516 SaiNeighborManager.cpp:95] addNeighbor 1.0.0.2
V0702 06:20:51.088200 773516 SaiNeighborManager.cpp:145] Add Neighbor: create neighborNeighborEntry:: MAC: 06:00:00:00:00:01 IP: 1.0.0.2 classID: None  Encap index: None isLocal: Y Port: PhysicalPort-1 NeighborState: Reachable type: DYNAMIC_ENTRY noHostRoute: --

V0702 06:20:51.096617 773516 SaiNeighborManager.cpp:95] addNeighbor 2.0.0.3
V0702 06:20:51.096642 773516 SaiNeighborManager.cpp:145] Add Neighbor: create neighborNeighborEntry:: MAC: 06:00:00:00:00:02 IP: 2.0.0.3 classID: None  Encap index: None isLocal: Y Port: PhysicalPort-3 NeighborState: Reachable type: DYNAMIC_ENTRY noHostRoute: --

(unidiag)[root@localhost fboss]# ./bin/fboss2 show mac details
MAC Address             Port/Trunk         VLAN          TYPE               CLASSID
06:00:00:00:00:02       eth1/1/3           2001          Validated          -
06:00:00:00:00:01       eth1/1/1           2000          Validated          -

(unidiag)[root@localhost fboss]#

When the hardware ASIC receives the packet for forwarding, it performs the L3 lookup and correctly determines that the packet should be sent to the next-hop 201.0.0.1. However, the subsequent L2 lookup for that next-hop's destination MAC address fails because the neighbor was never resolved. The hardware therefore cannot forward the packet and traps it to the CPU as an exception. This is confirmed by the CPU port counters in drivshell:

drivshell>show c
show c
RPKT.cpu0(0)                          :                     1                  +1
TPKT.cpu0(0)                          :                     1                  +1
PERQ_PKT_MC(0).cpu0(0)                :                     1                  +1
PERQ_BYTE_MC(0).cpu0(0)               :                   550                +550
drivshell>
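
The failure path described above can be modeled in a few lines. This is a simplified sketch, not the ASIC pipeline or FBOSS code: the route and neighbor tables are copied from the outputs in this description, forward is a hypothetical helper, and 201.0.0.10 is an assumed stand-in for the test's 201.0.0.x destination.

```python
import ipaddress

# Routes from the `show route` output: prefix -> next hop.
ROUTES = {
    ipaddress.ip_network("0.0.0.0/0"): "1.0.0.2",
    ipaddress.ip_network("201.0.0.0/24"): "201.0.0.1",
}

# Neighbors the test actually programs, per the SaiNeighborManager logs.
NEIGHBORS = {
    "1.0.0.2": "06:00:00:00:00:01",
    "2.0.0.3": "06:00:00:00:00:02",
}


def forward(dst):
    """Model the L3 lookup followed by the neighbor (L2 rewrite) lookup."""
    addr = ipaddress.ip_address(dst)
    # L3: pick the most specific route that covers the destination.
    best = max((net for net in ROUTES if addr in net), key=lambda net: net.prefixlen)
    next_hop = ROUTES[best]
    # L2: without a resolved neighbor there is no destination MAC to use.
    mac = NEIGHBORS.get(next_hop)
    if mac is None:
        return ("trap_to_cpu", next_hop)
    return ("forward", mac)


if __name__ == "__main__":
    print(forward("201.0.0.10"))    # unresolved next hop 201.0.0.1
    print(forward("201.201.0.10"))  # default route, neighbor 1.0.0.2 is resolved
```

In this model the packet only leaves a front-panel port when the selected next-hop has a neighbor entry, which is why the test's packet counter does not increase while the CPU counters do.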

Solution

Change the test destination IP address to one that does not fall within any interface subnet, for example 201.201.0.10.
The IPv6 test address had a similar issue and is fixed as well.
The same fix is applied to the AgentL4PortBlackholingTests and AgentRouteTests cases.
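
One way to sanity-check a candidate test address is to verify that it does not land inside any auto-created interface subnet. The sketch below assumes the addressing pattern seen in the interface output (interface N gets N.0.0.1/24) and an interface count of 203, which is only the highest value visible in this description. interface_subnets and colliding_subnet are hypothetical helpers, not part of the test framework.

```python
import ipaddress


def interface_subnets(count=203):
    """Assumed model: interface N is assigned N.0.0.1/24, i.e. subnet N.0.0.0/24."""
    return [ipaddress.ip_network(f"{n}.0.0.0/24") for n in range(1, count + 1)]


def colliding_subnet(dst, subnets):
    """Return the interface subnet that contains dst, or None if there is none."""
    addr = ipaddress.ip_address(dst)
    for net in subnets:
        if addr in net:
            return net
    return None


if __name__ == "__main__":
    subnets = interface_subnets()
    # 201.0.0.10 is a hypothetical stand-in for the old 201.0.0.x test address.
    print(colliding_subnet("201.0.0.10", subnets))
    # 201.201.0.10 is the replacement address named in this PR.
    print(colliding_subnet("201.201.0.10", subnets))
```

Under this assumed addressing scheme, an address whose second octet is non-zero cannot collide with any N.0.0.0/24 interface subnet, so it is always forwarded via the default route.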

Test Plan

[       OK ] warm_boot.AgentMirroringScaleTest/5.MaxMirroringTest (15762 ms)
[       OK ] warm_boot.AgentMirroringScaleTest/6.MaxMirroringTest (15815 ms)
[       OK ] warm_boot.AgentMirroringScaleTest/7.MaxMirroringTest (15582 ms)
Summary:
   OK : 36
   FAILED : 0
   SKIPPED : 0
   TIMEOUT : 0

[       OK ] warm_boot.AgentL4PortBlackHolingTest.v6UDP (16944 ms)
[       OK ] warm_boot.AgentL4PortBlackHolingTest.v4UDP (16909 ms)
Summary:
   OK : 2
   FAILED : 0
   SKIPPED : 0
   TIMEOUT : 0

[       OK ] warm_boot.AgentRouteTest/1.verifyHostRouteChange (28598 ms)
[       OK ] warm_boot.AgentRouteTest/1.verifyCpuRouteChange (48247 ms)
[       OK ] warm_boot.AgentRouteTest/1.VerifyDefaultRoute (26469 ms)
Summary:
   OK : 20
   FAILED : 0
   SKIPPED : 0
   TIMEOUT : 0

The full log is in Gdrive.
The config file used to run the agent HW test is in Gdrive.

@meta-cla meta-cla bot added the CLA Signed label Dec 2, 2025

meta-codesync bot commented Dec 4, 2025

@mikechoifb has imported this pull request. If you are a Meta employee, you can view this in D88407761.
