Opened 4 years ago

Closed 4 years ago

#1700 closed defect (not our bug)

Threads hang in InetAddress

Reported by: zzz Owned by: zzz
Priority: minor Milestone: undecided
Component: api/utils Version: 0.9.22
Keywords: Cc:
Parent Tickets: Sensitive: no

Description

Symptoms identical to those described at https://wiki.zimbra.com/wiki/Configuring_for_IPv4

Not our bug; recording it for future reference. See the first comment for details.

Subtickets

Change History (2)

comment:1 Changed 4 years ago by zzz

Owner: set to zzz
Status: new → accepted

<kytv> zzz: another weird problem, this time "Dropping tunnel requests: High load".
<kytv> Current system load per uptime: 1 user, load average: 0.12, 0.15, 0.14
<kytv> Job lag:0
<kytv> Message delay:0
<kytv> Backlog:3
<kytv> Exploratory: 2
<kytv> Client: 3
<kytv> Participating: 0
<kytv> Share ratio: 0.00
<kytv> 3 sec: 0.09 / 0.00 KBps
<kytv> 5 min: 0.07 / 0.02 KBps
<kytv> Total: 2.54 / 2.45 MBps
<kytv> Used: 1.78 TB / 1.71 TB
<kytv> Active: 12 / 12
<kytv> Fast: 0
<kytv> High capacity: 0
<kytv> Integrated: 9
<kytv> Known: 685
<kytv> is there anything I should get before restarting the router?

<zzz> don't panic
<zzz> usually transient
<kytv> no panic, but it's been hosed for some time now
<kytv> (fetching logs)
<zzz> odd
<zzz> never seen that associated with 'hosed'
<kytv> hmm…it's been hosed for a while now. Seems to have started on Saturday. Everything dropped to zero at that point.
<zzz> worth investigating before restarting
<zzz> hopefully something in logs
<kytv> The start:
<kytv> ERROR [uterWatchdog] client.ClientManagerFacadeImpl: Client xxx has a leaseSet that expired 10m
<kytv> ERROR [uterWatchdog] 2p.router.tasks.RouterWatchdog: Ready and waiting jobs: 0
<kytv> ERROR [uterWatchdog] 2p.router.tasks.RouterWatchdog: Job lag: 0
<kytv> ERROR [uterWatchdog] 2p.router.tasks.RouterWatchdog: Participating tunnel count: 82
<kytv> ERROR [uterWatchdog] 2p.router.tasks.RouterWatchdog: 1 minute send processing time: 349ms
<kytv> ERROR [uterWatchdog] 2p.router.tasks.RouterWatchdog: Outbound send rate: 1.81KBps
<kytv> ERROR [uterWatchdog] 2p.router.tasks.RouterWatchdog: Memory: 371.41MB / 3.56GB
<kytv> CRIT [uterWatchdog] 2p.router.tasks.RouterWatchdog: Router appears hung, or there is severe network congestion. Watchdog starts barking!
<zzz> sadly that's just a symptom
<kytv> indeed
<kytv> but that's when it seems to have died
<kytv> more of the symptoms:
<kytv> (now)
<kytv> ERROR [uterWatchdog] client.ClientManagerFacadeImpl: Client xxx has a leaseSet that expired 29h
<kytv> ERROR [uterWatchdog] 2p.router.tasks.RouterWatchdog: Ready and waiting jobs: 26790
<kytv> ERROR [uterWatchdog] 2p.router.tasks.RouterWatchdog: Job lag: 106395967
<kytv> ERROR [uterWatchdog] 2p.router.tasks.RouterWatchdog: Participating tunnel count: 0
<kytv> ERROR [uterWatchdog] 2p.router.tasks.RouterWatchdog: 1 minute send processing time: 0ms
<kytv> ERROR [uterWatchdog] 2p.router.tasks.RouterWatchdog: Outbound send rate: 0Bps
<kytv> ERROR [uterWatchdog] 2p.router.tasks.RouterWatchdog: Memory: 589.37MB / 3.56GB
<kytv> (still looking in the logs)
<zzz> well, it's now completely hosed due to piled up jobs (26790!)
<zzz> but that wasn't the case at the beginning
<zzz> so it hasn't built client tunnels in 29 hours
<zzz> does it have any connected peers? seems like it's lost all connectivity
<zzz> please look on /jobs and see what all the jobs are
<kytv> 9/9 active. Checking elsewhere

<zzz> ooh, so the job queue is stuck
<zzz> thread dump time
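
A thread dump like the one discussed below is typically captured by running jstack against the router's pid, or by sending the JVM SIGQUIT (kill -3); the same information is also available from inside the JVM. A minimal sketch (not I2P code) using the standard Thread API:

    import java.util.Map;

    // Minimal sketch, not I2P code: print every thread's name, state,
    // and stack -- the same data a jstack / kill -3 thread dump shows.
    public class DumpThreads {
        public static void main(String[] args) {
            for (Map.Entry<Thread, StackTraceElement[]> e
                    : Thread.getAllStackTraces().entrySet()) {
                Thread t = e.getKey();
                System.out.println("\"" + t.getName() + "\" state=" + t.getState());
                for (StackTraceElement frame : e.getValue()) {
                    System.out.println("    at " + frame);
                }
            }
        }
    }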

<zzz> maybe, but doubt it
<zzz> this feels like a deadlock

<kytv> this is with -19

<zzz> no deadlock…
<zab> this one looks wrong:
<zab> | "JobQueue? 4/5" prio=10 tid=0x00007fb2c8015000 nid=0x55ab runnable [0x00007fb3738f7000]
<zab> | java.lang.Thread.State: RUNNABLE

<zab> if the stack dump caught it in RUNNABLE state that means it's stuck

<zzz> in today's dump, all 5 JobQueue threads are stuck in java.net.InetAddress
<zzz> like DNS is hung up
<zzz> in checkLookupTable()
<kytv> hm
<kytv> fwiw I can resolve domains via DNS
<zzz> it's like the JVM has gone insane and everything is stuck in the java cache
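
For context, the calls that were wedged are ordinary hostname resolutions: anything that resolves a name goes through java.net.InetAddress, and per the thread dump every JobQueue thread was parked inside its internal checkLookupTable(). A minimal sketch of such a blocking lookup (the hostname is hypothetical, not from the ticket):

    import java.net.InetAddress;
    import java.net.UnknownHostException;

    // Minimal sketch: the kind of blocking name lookup the stuck
    // JobQueue threads were performing (hostname is hypothetical).
    public class LookupDemo {
        public static void main(String[] args) throws UnknownHostException {
            InetAddress addr = InetAddress.getByName("example.org");
            System.out.println(addr.getHostAddress());
        }
    }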

<kytv> unfortunately there wasn't a syslog daemon running so there are no system logs to look at
<zzz> here's a 1999 bug fixed in java 1.3 :(
<zzz> http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4291325

<zzz> java version?
<zzz> https://glassonionblog.wordpress.com/2013/04/27/jvm-threads-getting-locked-when-using-ipv6-methods-with-ipv4-address/
<zzz> https://wiki.zimbra.com/wiki/Configuring_for_IPv4

<zzz> this looks identical to the issue in the last link above
<kytv> Java version: Oracle Corporation 1.7.0_79 (OpenJDK Runtime Environment 1.7.0_79-b14)
<kytv> The host doesn't have IPv6 support. (AFAIK it's supposed to but the routing has been broken for as long as I've had access)
<zzz> pretty convinced it's a JVM bug. not clear why I can't find an open openjdk/oracle ticket about it even though the zimbra link above is 4 years old and was updated recently
<zzz> I see some interesting stuff in your dump to look at to possibly reduce our lookups, but
<zzz> ultimately it's not our bug
<zzz> feel free to restart… set the ipv4 setting if you like, or not
<kytv> IPv6 is already disabled. Will restart.
<zzz> but the above link claims: "With java.net.preferIPv4Stack set to true, Java will not execute this code and the problem should be avoided."
<kytv> I mean in the router
<kytv> I didn't modify wrapper.config but IPv6 is disabled at /confignet
<zzz> seems like you need the system property to fix it
<kytv> interesting that for the last year it's seemingly been fine. Since there's a newer update than -19 I'll update the wrapper.config to restrict it to IPv4 only, shutdown, then start on -21 or w/e the latest is
<zzz> we do have a lookup cache of our own in Addresses.java, which will minimize the use of the problematic java code
<zzz> but we may not use it everywhere we should, and the cache may not be big enough… investigating…

<zzz> it looks like we're using our version in all the important places
<zzz> I'll double the cache size
<zzz> I'm going to open and close a ticket on this so we don't lose it
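
The lookup cache zzz refers to lives in I2P's own Addresses.java. As an illustration only (names and sizes are hypothetical, not the actual implementation), a bounded hostname-to-IP cache in front of InetAddress might look roughly like this:

    import java.net.InetAddress;
    import java.net.UnknownHostException;
    import java.util.LinkedHashMap;
    import java.util.Map;

    // Illustrative sketch only -- not the actual Addresses.java.
    // Caches hostname -> IP bytes so repeated lookups avoid re-entering
    // the problematic java.net.InetAddress path.
    public class LookupCache {
        private static final int MAX_ENTRIES = 512;  // hypothetical size

        // access-ordered LRU map; drops the eldest entry past MAX_ENTRIES
        private final Map<String, byte[]> cache =
            new LinkedHashMap<String, byte[]>(MAX_ENTRIES, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                    return size() > MAX_ENTRIES;
                }
            };

        /** Raw IP bytes for host, or null if it cannot be resolved. */
        public synchronized byte[] getIP(String host) {
            byte[] ip = cache.get(host);
            if (ip == null) {
                try {
                    ip = InetAddress.getByName(host).getAddress();
                    cache.put(host, ip);
                } catch (UnknownHostException uhe) {
                    return null;
                }
            }
            return ip;
        }
    }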

comment:2 Changed 4 years ago by zzz

Resolution: not our bug
Status: accepted → closed

Doubled the cache size in 7db2f97711361f598cb14aa579cb008ac8438577, to be 0.9.22-22.

This is a JVM bug; see the workaround in the link in the OP.
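
For reference, the workaround described there is to force the JVM onto the IPv4 stack. In a wrapper-based I2P install that would mean a line along these lines in wrapper.config, followed by a full router restart (the index must be unused; 25 here is just a placeholder):

    wrapper.java.additional.25=-Djava.net.preferIPv4Stack=true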
