Opened 9 months ago

Last modified 9 months ago

#2455 new defect

RouterWatchdog barking when too few UDP errors

Reported by: Zlatin Balevsky
Owned by:
Priority: maintenance
Milestone: undecided
Component: router/general
Version: 0.9.38
Keywords:
Cc:
Parent Tickets:
Sensitive: no

Description

I cannot understand the logic of the following statement in RouterWatchdog.java:

ok = ok && (verifyClientLiveliness() || netErrors >= 5);

(from 2005, jrand0m, 56031a55b74eae45b7def88baab96a876d834a82)

The "ok" flag is initialized with the status of the JobQueue, and "netErrors" is initialized from the "udp.sendException" metric. The client liveliness check looks for even one client with a leaseset older than 10 minutes.

Let's assume the job queue is fine but there is a client with an old leaseset. Why does having 5 or more exceptions when writing to the datagram socket make the router ok? Or is the reasoning "the client leaseset is old but we're having network problems anyway, so don't bark"? If that's the case, why is only UDP checked?
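
To make the question concrete, here is a minimal standalone sketch of how the quoted statement evaluates. It is not the actual RouterWatchdog code: the class name, the isOk helper, its parameters, and the use of a long for the error count are all assumptions made for illustration. Only the expression itself and the threshold of 5 come from the statement above.

```java
// Standalone sketch. WatchdogCheckDemo, isOk and the parameter names are
// hypothetical and are not part of the I2P API.
class WatchdogCheckDemo {
    // Threshold taken from the quoted statement (netErrors >= 5).
    static final long NET_ERROR_THRESHOLD = 5;

    // Mirrors: ok = ok && (verifyClientLiveliness() || netErrors >= 5);
    // jobQueueOk stands in for the initial value of "ok",
    // clientsLive stands in for the result of verifyClientLiveliness().
    static boolean isOk(boolean jobQueueOk, boolean clientsLive, long netErrors) {
        boolean ok = jobQueueOk;
        ok = ok && (clientsLive || netErrors >= NET_ERROR_THRESHOLD);
        return ok;
    }

    public static void main(String[] args) {
        boolean[] flags = {true, false};
        long[] errorCounts = {0, 4, 5};
        // Print every combination so the effect of the threshold is visible.
        for (boolean jobQueueOk : flags) {
            for (boolean clientsLive : flags) {
                for (long netErrors : errorCounts) {
                    System.out.printf(
                        "jobQueueOk=%b clientsLive=%b netErrors=%d -> ok=%b%n",
                        jobQueueOk, clientsLive, netErrors,
                        isOk(jobQueueOk, clientsLive, netErrors));
                }
            }
        }
    }
}
```

The case in question is jobQueueOk=true with clientsLive=false: the result is false while netErrors is below 5 and flips to true once it reaches 5.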

Change History (1)

comment:1 Changed 9 months ago by zzz

Yeah, the point is to not bark if the problem is basic network connectivity. We have other ways of notification (console summary bar notice, log entries) if we're completely disconnected. Dumping threads to wrapper log is not helpful if it's a network problem.

Having said that, the network connectivity check isn't perfect. The netErrors >= 5 test is a hack that predates a lot of improvements and state machines that try to report the state of the connectivity accurately.

Also, the whole watchdog thing, which is a jrandom thing that dates back 15 years or more, might need a review. Is it still useful, or is it too confusing for users?

So there could be work to do here, but it's not a logic error.
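
One way to read the intent described in this comment is to invert the condition and name its parts. The sketch below is an illustration only, not the RouterWatchdog implementation: the class name, shouldBark, and likelyNetworkProblem are hypothetical names, and the threshold of 5 is taken from the statement quoted in the ticket. By De Morgan's laws, "not ok" becomes "job queue bad, or clients stale with no sign of a network problem".

```java
// Standalone sketch. WatchdogBarkDemo, shouldBark and likelyNetworkProblem
// are hypothetical names and are not part of the I2P API.
class WatchdogBarkDemo {
    static final long NET_ERROR_THRESHOLD = 5; // from "netErrors >= 5"

    // The expression from the ticket, kept for comparison.
    static boolean originalOk(boolean jobQueueOk, boolean clientsLive, long netErrors) {
        return jobQueueOk && (clientsLive || netErrors >= NET_ERROR_THRESHOLD);
    }

    // Negated form with the network heuristic given a name.
    static boolean shouldBark(boolean jobQueueOk, boolean clientsLive, long netErrors) {
        // Many UDP send exceptions are treated as a hint that connectivity is the problem.
        boolean likelyNetworkProblem = netErrors >= NET_ERROR_THRESHOLD;
        // Bark on a bad job queue, or on stale clients unless the network looks broken.
        return !jobQueueOk || (!clientsLive && !likelyNetworkProblem);
    }

    public static void main(String[] args) {
        boolean[] flags = {true, false};
        long[] errorCounts = {0, 4, 5, 100};
        boolean equivalent = true;
        // Check that the rewritten form is the exact negation of the original.
        for (boolean jobQueueOk : flags) {
            for (boolean clientsLive : flags) {
                for (long netErrors : errorCounts) {
                    boolean bark = shouldBark(jobQueueOk, clientsLive, netErrors);
                    boolean ok = originalOk(jobQueueOk, clientsLive, netErrors);
                    if (bark == ok) {
                        equivalent = false;
                    }
                }
            }
        }
        System.out.println("shouldBark is the negation of ok: " + equivalent);
    }
}
```

Written this way, the error count only ever suppresses the stale-client bark and never the job queue one.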
