Opened 7 years ago

Closed 7 years ago

#825 closed defect (not a bug)

Watchdog barking (was: strange error message)

Reported by: DISABLED Owned by:
Priority: minor Milestone: 0.9.5
Component: api/i2cp Version: 0.9.4
Keywords: Cc: killyourtv
Parent Tickets: Sensitive: no

Description

Hi there,
I recently get this error message which I do not understand:

30.12.12 10:14:08 FEHLER [uterWatchdog] client.ClientManagerFacadeImpl?: Client uwHNaI has a leaseSet that expired 47m
30.12.12 10:14:08 FEHLER [uterWatchdog] 2p.router.tasks.RouterWatchdog?: Ready and waiting jobs: 0
30.12.12 10:14:08 FEHLER [uterWatchdog] 2p.router.tasks.RouterWatchdog?: Job lag: 0
30.12.12 10:14:08 FEHLER [uterWatchdog] 2p.router.tasks.RouterWatchdog?: Participating tunnel count: 241
30.12.12 10:14:08 FEHLER [uterWatchdog] 2p.router.tasks.RouterWatchdog?: 1minute send processing time: 25.18957394300789
30.12.12 10:14:08 FEHLER [uterWatchdog] 2p.router.tasks.RouterWatchdog?: Outbound send rate: 708802.1598920054 Bps
30.12.12 10:14:08 FEHLER [uterWatchdog] 2p.router.tasks.RouterWatchdog?: Memory: 130,38M/1,04G

30.12.12 10:15:08 FEHLER [uterWatchdog] client.ClientManagerFacadeImpl?: Client uwHNaI has a leaseSet that expired 48m
30.12.12 10:15:08 FEHLER [uterWatchdog] 2p.router.tasks.RouterWatchdog?: Ready and waiting jobs: 0
30.12.12 10:15:08 FEHLER [uterWatchdog] 2p.router.tasks.RouterWatchdog?: Job lag: 0
30.12.12 10:15:08 FEHLER [uterWatchdog] 2p.router.tasks.RouterWatchdog?: Participating tunnel count: 242
30.12.12 10:15:08 FEHLER [uterWatchdog] 2p.router.tasks.RouterWatchdog?: 1minute send processing time: 25.18957394300789
30.12.12 10:15:08 FEHLER [uterWatchdog] 2p.router.tasks.RouterWatchdog?: Outbound send rate: 708802.1598920054 Bps
30.12.12 10:15:08 FEHLER [uterWatchdog] 2p.router.tasks.RouterWatchdog?: Memory: 146,61M/1,04G

I get this once a minute. Also the watchdog is barked sometimes recently, but there is no restart or such thing:
2012/12/30 09:37:13 | KRITISCH [uterWatchdog] 2p.router.tasks.RouterWatchdog?: Router appears hung, or there is severe network congestion. Watchdog starts barking!
2012/12/30 09:37:13 | KRITISCH [uterWatchdog] et.i2p.router.tasks.ThreadDump?: Threads dumped to wrapper log

Wrapper settings should be relaxed enough to not make it bark:
wrapper.java.maxmemory=1200
wrapper.ping.interval=300
wrapper.ping.timeout=360

Restarting didn't help. Still all things are running, tho I "feel" with a little less overall performance.

I2P version: 0.9.4-0
Java version: Oracle Corporation 1.7.0_09 (OpenJDK Runtime Environment 1.7.0_09-b30)
Wrapper version: 3.1.1
Server version: 6.1.26
Servlet version: Jasper JSP 2.1 Engine
Platform: Linux amd64 3.5.0-21-generic
Processor: Phenom II / Opteron Gen 3 (Shanghai/Deneb/Heka/Callisto?, 45 nm) (athlon64)
Jbigi: Locally optimized native BigInteger? library loaded from file
Encoding: UTF-8
Charset: UTF-8

Router is running 24/7 with "1.200 kBytes/s shared" and high load on i2psnark "Gesamt: 136 Torrents, 325,46 GB, 393 verbundene Teilnehmer, 464 DHT-Gegenstellen".

Subtickets

Change History (10)

comment:1 Changed 7 years ago by zzz

Summary: strange error messageWatchdog barking (was: strange error message)

You said "restarting didn't help". You got the watchdog messages again after restarting?

It may be a problem with a particular client.

The client with the problem (uwHNaI) can be identified by looking under "local destinations" in the summary bar of your console. Hover over each destination with your mouse, you will see the first 4 letters/numbers of each one. In general, only servers keep the same ID after restart, so it may or may not be there now.

If it isn't there, go to /tunnels, look for client tunnel sections where there are no tunnels, or it says it is dead.

comment:2 Changed 7 years ago by DISABLED

Resolution: not a bug
Status: newclosed

Well, the messages stoped without any reason shortly after my post. Maybe I have some other problem, my router just restarted due to hung JVM which is unusual. I'll close this ticket.

comment:3 Changed 7 years ago by zzz

Cc: killyourtv added
Component: unspecifiedapi/i2cp

Maybe, maybe not. Killyourtv just saw this too.

There are several changes in 0.9.4-0 re: leaseset request/grant between the router and the client, which try to recover if a request or grant is dropped. It's possible that the changes didn't fix it, or made things worse or introduced a new bug.

Leaving closed for now but cc'ing kytv and will keep an eye on it. Will also go back thru the recent changes and see if I see anything wrong.

comment:4 Changed 7 years ago by zzz

Were you using iMule? If so see #819

comment:5 Changed 7 years ago by elgo

Hi,

Since 0.9.4 and (surely not related) kernel upgraded to 3.7.0 (from 3.5.4), I'm seeing lots of this watchdog barking, and really "soon" (~800 tunnels).
Really strange, because previous i2p version could process 16k tunnels without sweating with 1260 MB of RAM allocated to wrapper (even if dealing with so much tunnels may have been related to the %rate tunnel build bug fixed since, in my understanding).

I'm using bare I2P router + i2psnark (~20 torrents), i2p http proxy is almost never used, everything is running 24/7.

JVM/wrapper memory set to 1024MB.
CPU and RAM and I/O are fine.
Runs on Gentoo Linux.

I2P version: 0.9.4-0
Java version: Sun Microsystems Inc. 1.6.0_24 (OpenJDK Runtime Environment 1.6.0_24-b24)
Wrapper version: 3.5.16
Server version: 6.1.26
Servlet version: Jasper JSP 2.1 Engine
Platform: Linux amd64 3.7.0-hardened
Processor: Mobile Athlon II/Turion II/Phenom II/Sempron/V-series (Regor/Caspian/Champlain?, 45 nm) (athlon64)
Jbigi: Native BigInteger? library libjbigi-linux-athlon64_64.so loaded from resource
Encoding: UTF-8
Charset: UTF-8


15/01/13 23:40:30 ERROR [uterWatchdog] 2p.router.tasks.RouterWatchdog?: Memory: 59,04M/1,09G
15/01/13 23:40:30 ERROR [uterWatchdog] 2p.router.tasks.RouterWatchdog?: Outbound send rate: 214685.55943321701 Bps
15/01/13 23:40:30 ERROR [uterWatchdog] 2p.router.tasks.RouterWatchdog?: 1minute send processing time: 135.4322931459846
15/01/13 23:40:30 ERROR [uterWatchdog] 2p.router.tasks.RouterWatchdog?: Participating tunnel count: 546
15/01/13 23:40:30 ERROR [uterWatchdog] 2p.router.tasks.RouterWatchdog?: Job lag: 0
15/01/13 23:40:30 ERROR [uterWatchdog] 2p.router.tasks.RouterWatchdog?: Ready and waiting jobs: 0
15/01/13 23:40:30 ERROR [uterWatchdog] client.ClientManagerFacadeImpl?: Client Io52yA has a leaseSet that expired 12m


2013/01/15 23:39:16 | Heap
2013/01/15 23:39:16 | PSYoungGen total 4928K, used 4432K [0x000002f2ead60000, 0x000002f2eb350000, 0x000002f305000000)
2013/01/15 23:39:16 | eden space 3904K, 98% used [0x000002f2ead60000,0x000002f2eb1201b0,0x000002f2eb130000)
2013/01/15 23:39:16 | from space 1024K, 57% used [0x000002f2eb250000,0x000002f2eb2e4040,0x000002f2eb350000)
2013/01/15 23:39:16 | to space 1088K, 0% used [0x000002f2eb130000,0x000002f2eb130000,0x000002f2eb240000)
2013/01/15 23:39:16 | PSOldGen total 66560K, used 65600K [0x000002f2b6800000, 0x000002f2ba900000, 0x000002f2ead60000)
2013/01/15 23:39:16 | object space 66560K, 98% used [0x000002f2b6800000,0x000002f2ba810370,0x000002f2ba900000)
2013/01/15 23:39:16 | PSPermGen total 25152K, used 24939K [0x000002f2ac200000, 0x000002f2ada90000, 0x000002f2b6800000)
2013/01/15 23:39:16 | object space 25152K, 99% used [0x000002f2ac200000,0x000002f2ada5ace0,0x000002f2ada90000)


Is there some more usefull data I can provide with?

Regards.

comment:6 Changed 7 years ago by elgo

Resolution: not a bug
Status: closedreopened

comment:7 Changed 7 years ago by elgo

I got I2P crashing with the same error for the second time, as it ran about 900 tunnels:


2013/01/16 18:00:52 | JVM received a signal SIGSEGV (11).
2013/01/16 18:00:52 | JVM process is gone.
2013/01/16 18:00:52 | JVM exited unexpectedly.
2013/01/16 18:01:06 | Launching a JVM…
2013/01/16 18:01:07 | WrapperManager?: Initializing…
2013/01/16 18:01:08 | Starting I2P 0.9.4-0
2013/01/16 18:01:08 | ERROR: There appears to be another router already running!
2013/01/16 18:01:08 | Please make sure to shut down old instances before starting up
2013/01/16 18:01:08 | a new one. If you are positive that no other instance is running,
2013/01/16 18:01:08 | please delete the file /opt/i2p/home/.i2p/router.ping
2013/01/16 18:01:10 | ←- Wrapper Stopped


Kernel side:

[432670.177274] skbuff: skb_over_panic: text:ffffffff81573639 len:1568 put:640 head:ffff880009a41800 data:ffff880009a41920 tail:0x740 end:0x6c0 dev:<NULL>
[432670.179451] invalid opcode: 0000 #1 SMP
[432670.180552] CPU 1
[432670.180575] Pid: 24403, comm: java Not tainted 3.7.0-hardened #8 HP ProLiant? MicroServer?
[432670.182821] RIP: 0010:[<ffffffff8151d357>] [<ffffffff8151d357>] skb_put+0x97/0xa0
[432670.184016] RSP: 0018:ffff88007b173ac8 EFLAGS: 00010282
[432670.185195] RAX: 000000000000008a RBX: ffff8800099741c0 RCX: 0000000000000031
[432670.186406] RDX: 0000000000000027 RSI: 0000000000000046 RDI: ffffffff81cb8dd4
[432670.187636] RBP: ffff88006377a700 R08: 0000000000000000 R09: 0000000000000000
[432670.188885] R10: 8000000000000000 R11: 00000000000002fc R12: 0000000000000280
[432670.190149] R13: 0000000000000420 R14: 00000000000052c0 R15: 0000000000000280
[432670.191426] FS: 000002f273bfb700(0000) GS:ffff88007dd00000(0000) knlGS:0000000000000000
[432670.192742] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[432670.194087] CR2: 000002f30f5c7b00 CR3: 000000000164b000 CR4: 00000000000007f0
[432670.195463] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[432670.196841] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[432670.198232] Process java (pid: 24403, threadinfo ffff88007b128ac8, task ffff88007b1286d0)
[432670.199655] Stack:
[432670.201099] ffff880009a41920 0000000000000740 00000000000006c0 ffffffff81807b6f
[432670.202618] 0000000000000420 ffffffff81573639 0000000000000001 ffff88007ad00660
[432670.204148] ffff88007b137220 ffff88007b173db8 0000000000000000 ffff88007a891c00
[432670.205703] Call Trace:
[432670.207252] [<ffffffff81573639>] ? tcp_sendmsg+0x6e9/0x1550
[432670.208801] [<ffffffff81512089>] ? sock_aio_write+0x109/0x140
[432670.210352] [<ffffffff810d135a>] ? do_sync_write+0x9a/0xe0
[432670.211894] [<ffffffff810d1551>] ? vfs_write+0x1b1/0x200
[432670.213436] [<ffffffff810d1690>] ? sys_write+0x50/0xa0
[432670.214977] [<ffffffff8163cc98>] ? system_call_fastpath+0x18/0x1d
[432670.216529] [<ffffffff81113d39>] ? sys_epoll_wait+0xd9/0x1a0
[432670.218085] [<ffffffff8163ccbe>] ? sysret_check+0x1c/0x58
[432670.219656] Code: 89 54 24 10 8b 97 b8 00 00 00 48 89 54 24 08 48 8b 97 c8 00 00 00 48 c7 c7 a0 f3 80 81 48 89 14 24 48 8b 54 24 28 e8 ed 5d 11 00 <0f> 0b 0f 1f 80 00 00 00 00 41 55 89 f0 41 54 41 89 d4 55 89 f5
[432670.223499] RIP [<ffffffff8151d357>] skb_put+0x97/0xa0
[432670.225279] RSP <ffff88007b173ac8>
[432670.232714] —-[ end trace 9917b14e206d9580 ]—-


So right know, I don't know if the watchdog "barking" and final crash are related.

comment:8 Changed 7 years ago by zzz

elgo, JVM crashes (SISSEGV) are almost never are fault. Generally either a JVM bug or a hardware problem (bad RAM).

I suggest you run a RAM test (memtest86)

comment:9 in reply to:  8 Changed 7 years ago by elgo

Replying to zzz:

elgo, JVM crashes (SISSEGV) are almost never are fault. Generally either a JVM bug or a hardware problem (bad RAM).

I suggest you run a RAM test (memtest86)

Ok, "but":
*tested RAM 4 months ago (memtest86++ for 12 hours) when got this new hardware: OK
*RAM is ECC, and no RAM error reported by system (edac-utils)

As I said, maybe it's due to kernel upgrade then, but I hardly see how. In this case, have you suggestion how I can go further troubleshooting this (except downgrading kernel and wait for 48h to see if I2P still crashes, of course ;))?

I'll try to go from JVM icedtea v6.1.11.5 to v7.2.3.3 to see if it changes anything (as you pointed a possible JVM bug).

Thanks for your help.

comment:10 Changed 7 years ago by elgo

Resolution: not a bug
Status: reopenedclosed
Note: See TracTickets for help on using tickets.