OPNsense Failover Kills States When No Viable Failover Target Exists
This setup ran flawlessly for months: an OPNsense firewall, Unbound doing DNS-over-TLS to Cloudflare, and a single WAN connection with a disabled USB 5G backup interface. Then in early February, intermittent DNS hangs started hitting every device on the network: phones and laptops hung, and Claude Code sessions stalled at random. Switching a device to cellular bypassed the problem every time, which pointed straight at the firewall.
The Actual Bug
The secondary gateway (USB5GPHONE) and its interface were both disabled. There was nothing to fail over to. The system knew this; it logged "ROUTING: ignoring down gateways: WAN_DHCP, USB5GPHONE_DHCP". The routing logic correctly determined there was no failover target and didn't switch. But the state-killing logic ran anyway, destroying working connections for zero benefit.
Two subsystems in the same failover process reached different conclusions. The routing logic was smart enough to skip failover when there's no target. The state-killing logic wasn't. The firewall was causing the exact outage it was supposed to protect against.
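To make the disagreement concrete, here is a minimal Python sketch of the guard the state-killing path appears to be missing. It is not OPNsense's actual code; the Gateway class and both function names are hypothetical, and the sketch only assumes that a usable failover target must be enabled and currently reachable.

```python
from dataclasses import dataclass


@dataclass
class Gateway:
    name: str
    enabled: bool  # hypothetical flag: gateway and its interface are enabled
    up: bool       # hypothetical flag: the monitor currently reports it reachable


def viable_targets(gateways, failed_name):
    """Return gateways other than the failed one that could carry traffic."""
    return [
        g for g in gateways
        if g.name != failed_name and g.enabled and g.up
    ]


def should_kill_states(gateways, failed_name):
    """Only tear down states when traffic has somewhere else to go."""
    return len(viable_targets(gateways, failed_name)) > 0


# The situation from this post: WAN marked down, 5G backup disabled.
gateways = [
    Gateway("WAN_DHCP", enabled=True, up=False),
    Gateway("USB5GPHONE_DHCP", enabled=False, up=False),
]
decision = should_kill_states(gateways, "WAN_DHCP")
```

With the backup disabled, `decision` is False, so existing connections would be left alone. Enable the backup and mark it up, and the same check would allow the state kill.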
The Debugging Journey
The symptoms pointed at DNS, so that's where debugging started. Unbound's logs showed TCP connection reuse errors ("outtcp got tcp error -1"), with TLS connections to Cloudflare being decommissioned and rebuilt repeatedly. Stats showed a 7% cache hit rate with zero prefetches. Almost every query was going upstream.
Point-in-time DNS tests looked fine: 6ms cached, 29ms uncached. The problem was intermittent, so spot checks always passed. The breakthrough came from watching the system logs in real time:
ROUTING: killing states for unreachable gateway WAN_DHCP
Repeating every 1–3 minutes. The gateway logs told the rest of the story—dpinger was declaring the WAN gateway DOWN based on latency alone, with 0% packet loss.
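Rather than eyeballing a scrolling log, the same check can be scripted. The following is a rough Python sketch that counts state-kill events per gateway from any stream of log lines. The regular expression is built from the message quoted above; the function name is made up for illustration, and how you feed it lines (a saved log file, a piped live log) is left to your setup.

```python
import re
from collections import Counter

# Pattern taken from the log message shown above; the gateway name is
# assumed to be the next whitespace-delimited token.
KILL_RE = re.compile(r"ROUTING: killing states for unreachable gateway (\S+)")


def count_state_kills(lines):
    """Count 'killing states' events per gateway in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = KILL_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "ROUTING: killing states for unreachable gateway WAN_DHCP",
    "ROUTING: ignoring down gateways: WAN_DHCP, USB5GPHONE_DHCP",
    "ROUTING: killing states for unreachable gateway WAN_DHCP",
]
kills = count_state_kills(sample)
```

Passing `sys.stdin` instead of `sample` lets you pipe a live log into it. A count that keeps climbing while the connection is otherwise usable is the pattern described here.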
What Changed
The ISP gateway started responding slowly to ICMP around February 5th. OPNsense's Reporting → Health → Quality graph confirmed it: rock solid performance from December through January, then latency and loss spikes starting in early February.
dpinger saw the elevated latency and escalated:
Alarm: none -> delay RTT: 291.2 ms Loss: 0.0 %
Alarm: delay -> down RTT: 502.7 ms Loss: 0.0 %
When dpinger declared WAN_DHCP down, OPNsense killed all TCP states for that gateway—including Unbound's TLS connections to Cloudflare on port 853. DNS hung until Unbound rebuilt those connections. Then dpinger declared the gateway back up, things stabilized for a minute or two, and the cycle repeated.
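The escalation in those two log lines can be modeled as a simple threshold check. This Python sketch is an approximation for illustration, not dpinger's or OPNsense's implementation. The function name is hypothetical, and the thresholds (200 ms / 500 ms latency, 10% / 20% loss) are assumed values that happen to be consistent with the log lines above; check your own gateway settings for the real ones.

```python
def classify_gateway(rtt_ms, loss_pct,
                     latency_low=200.0, latency_high=500.0,
                     loss_low=10.0, loss_high=20.0):
    """Map a latency/loss sample to an alarm state (simplified model).

    All four thresholds are assumptions for this sketch.
    """
    # Either metric crossing its high threshold marks the gateway down,
    # which is why high latency alone is enough with 0% loss.
    if rtt_ms > latency_high or loss_pct > loss_high:
        return "down"
    if loss_pct > loss_low:
        return "loss"
    if rtt_ms > latency_low:
        return "delay"
    return "none"


first = classify_gateway(291.2, 0.0)    # sample from the first alarm line
second = classify_gateway(502.7, 0.0)   # sample from the second alarm line
relaxed = classify_gateway(502.7, 0.0, latency_high=1000.0)
```

Under these assumed thresholds, `first` is "delay" and `second` is "down", matching the logged transitions. Raising the high latency threshold, as in `relaxed`, keeps the same sample at "delay", so no state kill would be triggered.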
The Fix
There are a few ways to fix the symptom:
- Disable gateway monitoring on WAN_DHCP entirely (what I did)
- Change the monitoring IP to something more reliable than your ISP gateway (e.g., 1.1.1.1 or 8.8.8.8)
- Adjust the latency/loss thresholds so dpinger doesn't declare the gateway down over slow ICMP
- Disable "Failover States" in the gateway group settings to stop the state killing
I went with disabling gateway monitoring because I didn't want to disable Failover States—I actually want state killing to work when I plug in my USB 5G modem or Starlink as a backup and there's a real failover target available. The feature is useful; it just shouldn't fire when there's nothing to fail over to.
To replace the alerting I lost, I set up Monit with an independent ping check to 1.1.1.1 for email notifications—monitoring without the destructive side effects.
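The idea behind that check, stripped of Monit specifics, is alert-only monitoring: probe a target, count consecutive failures, and notify without touching routing or states. Below is a minimal Python sketch of that logic. It is not Monit configuration; the function names and the three-failure threshold are assumptions for illustration, and `ping_once` relies only on the common `ping -c 1` form.

```python
import subprocess


def ping_once(host="1.1.1.1"):
    """Return True if a single ping succeeds (assumes a `ping -c` binary)."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", host],
            capture_output=True,
            timeout=5,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def run_checks(probe, cycles, fail_threshold=3):
    """Run `probe` once per cycle; return the cycle indexes where an alert fires.

    An alert fires once, when consecutive failures first reach the
    threshold. Any success resets the counter. Nothing else is touched.
    """
    alerts = []
    consecutive = 0
    for cycle in range(cycles):
        if probe():
            consecutive = 0
        else:
            consecutive += 1
            if consecutive == fail_threshold:
                alerts.append(cycle)
    return alerts


# Simulated probe results so the logic can be exercised without a network.
results = iter([True, False, False, False, False, True, False])
alert_cycles = run_checks(lambda: next(results), cycles=7)
```

In real use you would pass `ping_once` as the probe, sleep between cycles, and send an email where this sketch records an alert. The key property is that a failed check only produces a notification.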
The Bug Report
Filed as opnsense/core #9789. The maintainer's response: "you're referring to default gateway switching features… sometimes it's hard to admit you haven't found a bug but a configuration issue." Closed without addressing why the system kills states when it already knows there's no failover target.
The suggested fixes—increase the latency threshold, disable "Failover States"—are workarounds for the symptom. The core issue remains: the routing logic and state-killing logic are not in agreement. One checks whether failover is possible before acting. The other doesn't.
Takeaways
- Symptoms can be far removed from the root cause. DNS-over-TLS hangs traced back to gateway monitoring killing firewall states.
- Point-in-time tests miss intermittent issues. Spot-checking DNS resolution always passed. You need to watch logs in real time or use continuous monitoring.
- OPNsense Health graphs are your friend. Reporting → Health → Quality provides passive gateway monitoring history without the destructive dpinger/failover behavior.
- If your only failover target is disabled, be aware that state killing still runs. OPNsense will destroy working connections even when it knows there's nowhere to fail over to.
- ISP gateways responding slowly to ICMP does not mean your internet is down. dpinger monitoring an ISP hop you don't control is a fragile foundation for aggressive failover actions.
The best failover systems check whether failover is actually possible before taking destructive action. Killing states to "prepare" for a failover that can't happen isn't defensive—it's a self-inflicted outage.