
Configure Failover Behavior

When an SD-WAN member fails (SLA out of bounds), what happens depends on your rules and failover settings. Tune the response speed and recovery behavior to match your tolerance.

Key Settings

Failure detection time

config system sdwan
    config health-check
        edit "Google-DNS-Health"
            set interval 500        # ms between checks
            set failtime 5          # consecutive failures to mark down
            set recoverytime 5      # consecutive successes to mark up
        next
    end
end

Detection time is roughly interval × failtime. Default: 500 ms × 5 = 2.5 seconds to detect failure. Aggressive: 200 ms × 3 = 0.6 seconds (more sensitive, more flapping). Lazy: 1000 ms × 10 = 10 seconds (less flapping, slower failover).
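The arithmetic above can be sketched in a few lines of Python. This is a rough model only: it assumes detection time is simply the probe interval multiplied by failtime and ignores probe timeouts and jitter. The function name and the profile labels are illustrative, not part of FortiOS.

```python
def detection_time_s(interval_ms: int, failtime: int) -> float:
    """Approximate seconds until a member is marked down.

    Simplified model (an assumption): the number of consecutive failed
    probes required (failtime) multiplied by the probe interval.
    Probe timeout and network jitter are ignored.
    """
    return interval_ms * failtime / 1000


# The three profiles discussed above: (interval in ms, failtime).
profiles = {
    "default": (500, 5),
    "aggressive": (200, 3),
    "lazy": (1000, 10),
}

for name, (interval_ms, failtime) in profiles.items():
    print(f"{name}: {detection_time_s(interval_ms, failtime)} s")
```

Running the same function against your own interval and failtime values gives a quick estimate of the detection delay before you commit a change.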

Hold timer (prevent flapping)

config system sdwan
    config service
        edit 1
            set hold-down-time 30    # seconds before bringing member back after recovery
        next
    end
end

The hold-down timer keeps traffic from oscillating between members when a link hovers around its SLA threshold.
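To illustrate why a hold-down helps, here is a toy simulation in Python. It is a sketch under stated assumptions, not a model of the actual FortiOS state machine: the primary's health is sampled once per second, traffic leaves the primary as soon as it is unhealthy, and it returns only after the primary has been healthy for more than the hold-down period. The function name and the trace are hypothetical.

```python
def count_switches(primary_ok: list[bool], hold_down_s: int) -> int:
    """Count path switches for a per-second health trace of the primary.

    Toy model (an assumption): fail over immediately when the primary is
    unhealthy; fail back only after the primary has been healthy for more
    than hold_down_s consecutive seconds.
    """
    on_primary = True
    healthy_streak = 0
    switches = 0
    for ok in primary_ok:
        healthy_streak = healthy_streak + 1 if ok else 0
        if on_primary and not ok:
            on_primary = False          # fail over to the backup
            switches += 1
        elif not on_primary and ok and healthy_streak > hold_down_s:
            on_primary = True           # hold-down expired, fail back
            switches += 1
    return switches


# A borderline link: 5 s healthy, 1 s unhealthy, repeated for a minute.
trace = ([True] * 5 + [False]) * 10

print(count_switches(trace, hold_down_s=0))   # switches on every dip and recovery
print(count_switches(trace, hold_down_s=30))  # fails over once, then stays on backup
```

In this model the 30-second hold-down trades a slower return to the primary for far fewer path changes on the borderline link.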

Per-rule behavior on SLA failure

In an SD-WAN rule:

  • Auto — automatically falls to next member in priority list.
  • Manual — sticks with the failed member regardless (use cautiously).
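The difference between the two modes can be sketched as a small selection function. This is a simplified illustration of the behavior described above, not FortiOS logic: the function name, the mode strings, and the choice to return None when no member meets its SLA are all assumptions.

```python
from typing import Optional


def pick_member(mode: str, priority_list: list[str],
                sla_ok: dict[str, bool]) -> Optional[str]:
    """Choose the outgoing member for a rule.

    mode          -- "auto" or "manual" (labels taken from the list above)
    priority_list -- members in priority order, highest first
    sla_ok        -- member name -> True if its health check is within SLA
    """
    if mode == "manual":
        # Pinned to the configured member, even if it is failing its SLA.
        return priority_list[0]
    # Auto: first member in priority order that currently meets its SLA.
    for member in priority_list:
        if sla_ok.get(member, False):
            return member
    return None  # assumption: no healthy member means no selection


health = {"wan1": False, "wan2": True}
print(pick_member("auto", ["wan1", "wan2"], health))    # wan2
print(pick_member("manual", ["wan1", "wan2"], health))  # wan1
```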

Visual Failover Test

  1. Network → SD-WAN → SD-WAN Members — note current member used.
  2. Physically unplug wan1.
  3. Within ~2-5 seconds, the member list reshuffles and traffic moves to wan2.
  4. Plug wan1 back in. After recoverytime (and any hold-down), traffic returns.
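If you run a continuous ping through the firewall during the test, the observed outage can be estimated from the probe results. The Python sketch below assumes you have already collected a list of (timestamp, reply received) samples by some means; the sample data and the function name are hypothetical.

```python
def longest_outage_s(samples: list[tuple[float, bool]]) -> float:
    """Return the longest gap, in seconds, between the first lost probe
    and the next successful one.

    samples -- (timestamp_seconds, reply_received) pairs in time order.
    An outage still in progress at the end of the trace is not counted.
    """
    longest = 0.0
    outage_start = None
    for ts, ok in samples:
        if not ok and outage_start is None:
            outage_start = ts                       # outage begins
        elif ok and outage_start is not None:
            longest = max(longest, ts - outage_start)
            outage_start = None                     # outage ends
    return longest


# Hypothetical trace: one probe every 0.5 s, wan1 unplugged at t = 1.0 s.
samples = [
    (0.0, True), (0.5, True),
    (1.0, False), (1.5, False), (2.0, False), (2.5, False), (3.0, False),
    (3.5, True), (4.0, True),
]

print(longest_outage_s(samples))  # 2.5
```

Comparing this figure with interval × failtime shows whether the measured failover matches the configured detection time.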

Logging Failover Events

By default, SD-WAN logs major state changes:

# Recent failover events:
execute log filter category 1     # Event log (includes system events)
execute log display

Or in GUI: Log & Report → System Events, filter by subtype = sdwan.

CLI Equivalent

config system sdwan
    set status enable
    set neighbor-hold-down enable
    set neighbor-hold-down-time 30
end

Common Issues

  • Failover takes 30+ seconds. The health check is too slow. Lower interval and failtime.
  • Failover is instant but app sessions still drop. Stateful apps can't survive a change of WAN IP. Keep the same external IP across members (possible if it is BGP-advertised), or accept the reset for non-persistent apps.
  • Constant flapping. The health check is too aggressive, or the link quality is genuinely borderline. Increase failtime and the hold-down time.
  • wan1 came back but traffic stays on wan2. No "preferred" member is configured. Set the priority in the rule.