Opened 4 years ago

Closed 23 months ago

#1411 closed defect (fixed)

"NTCP Pumper" thread is eating all available CPU power.

Reported by: djjeshk Owned by:
Priority: critical Milestone: 0.9.24
Component: router/transport Version: 0.9.17
Keywords: NTCP Pumper thread failure Cc:
Parent Tickets:

Description

After two day usage java started to eat all available CPU power.
Opened it with Process Explorer and saw that one thread is eating CPU. Suspended it. Looks like one tunnel failed unexpectedly and i2p does not care about it. After failed thread suspension CPU load is back to normal. I2P works as expected before and after thread suspension.

After some minutes I2P hung, resumed that thread and I2P restarted.
Log:
wrapper | Pinging the JVM took 727 seconds to respond.
jvm 4 | CRIT [mer 6WmH 4/4] net.i2p.util.Clock : Large clock shif
t forward by 17m
jvm 4 | CRIT [uterWatchdog] 2p.router.tasks.RouterWatchdog?: Router appears h
ung, or there is severe network congestion. Watchdog starts barking!

Subtickets

Attachments (1)

wrapper.log (405.3 KB) - added by djjeshk 4 years ago.
new wrapper.log

Download all attachments as: .zip

Change History (26)

comment:1 Changed 4 years ago by zzz

  • Status changed from new to infoneeded_new

Need version info from top of /logs please

comment:2 Changed 4 years ago by zzz

still need version info as requested

comment:3 Changed 4 years ago by djjeshk

  • Resolution set to no response
  • Status changed from infoneeded_new to closed

comment:4 Changed 4 years ago by zzz

Unclear why OP closed own ticket 'no response' with no comment. Leaving closed.

comment:5 Changed 4 years ago by djjeshk

  • Resolution no response deleted
  • Status changed from closed to reopened

comment:6 Changed 4 years ago by djjeshk

  • Version changed from 0.9.16 to 0.9.17

After some days I2P started to eat all CPU power again. Did thread dump. Assuming entry nid is thread id failing component is "NTCP Pumper".

2014/12/04 23:58:08 | "NTCP Pumper" daemon prio=6 tid=0x030c5800 nid=0xcb4 runnable [0x0485f000]
2014/12/04 23:58:08 |    java.lang.Thread.State: RUNNABLE
2014/12/04 23:58:08 | 	at sun.nio.ch.WindowsSelectorImpl$SubSelector.poll0(Native Method)
2014/12/04 23:58:08 | 	at sun.nio.ch.WindowsSelectorImpl$SubSelector.poll(Unknown Source)
2014/12/04 23:58:08 | 	at sun.nio.ch.WindowsSelectorImpl$SubSelector.access$400(Unknown Source)
2014/12/04 23:58:08 | 	at sun.nio.ch.WindowsSelectorImpl.doSelect(Unknown Source)
2014/12/04 23:58:08 | 	at sun.nio.ch.SelectorImpl.lockAndDoSelect(Unknown Source)
2014/12/04 23:58:08 | 	- locked <0x1e416340> (a sun.nio.ch.Util$2)
2014/12/04 23:58:08 | 	- locked <0x1e416350> (a java.util.Collections$UnmodifiableSet)
2014/12/04 23:58:08 | 	- locked <0x1e4162c8> (a sun.nio.ch.WindowsSelectorImpl)
2014/12/04 23:58:08 | 	at sun.nio.ch.SelectorImpl.select(Unknown Source)
2014/12/04 23:58:08 | 	at net.i2p.router.transport.ntcp.EventPumper.run(EventPumper.java:176)
2014/12/04 23:58:08 | 	at java.lang.Thread.run(Unknown Source)
2014/12/04 23:58:08 | 	at net.i2p.util.I2PThread.run(I2PThread.java:84)

I will attach full wrapper.log file.

Changed 4 years ago by djjeshk

new wrapper.log

comment:7 Changed 4 years ago by djjeshk

This came up when i used another CPU intensive program (played a fullscreen game bit). They were large message delays due to unavailable CPU power. Developer might check what exactly I2P does when high message delay is detected. Message delays now are 10 times larger than usual, but it is still functionating well. Maybe after router lockdown it tries to do all jobs without sleeping at all?

comment:8 Changed 4 years ago by zzz

  • Status changed from reopened to infoneeded

Thanks for the update. However, this is the 3rd time I've asked for full version info. See comments 1 and 2 above for the other times. We cannot help you unless you provide the information requested. If not provided I will close the ticket again.

comment:9 Changed 4 years ago by djjeshk

  • Status changed from infoneeded to assigned

I2P version: 0.9.17-0
Java version: Oracle Corporation 1.7.0_71 (Java(TM) SE Runtime Environment 1.7.0_71-b14)
Wrapper version: 3.5.25
Server version: 8.1.16.v20140903
Servlet version: Jasper JSP 2.1 Engine
Platform: Windows XP x86 5.1
Processor: Core 2 (45nm) (core2)
Jbigi: Locally optimized native BigInteger? library loaded from file
Encoding: Cp1257
Charset: windows-1257

comment:10 Changed 4 years ago by djjeshk

  • Keywords NTCP Pumper added; native java runtime removed
  • Summary changed from One of i2p msvcr100.dll!_endthreadex+0x6a threads has turned into unconsistent state and eating all available CPU power. to "NTCP Pumper" thread is eating all available CPU power.

I would want to edit ticket text since it is unclear but it is not possible.

comment:11 Changed 4 years ago by djjeshk

  • Component changed from unspecified to router/transport
  • Priority changed from major to critical

This happen again...

comment:12 Changed 4 years ago by djjeshk

I completely disabled TCP, this may be working workaround to avoid hitting this bug

comment:13 Changed 4 years ago by zzz

Another report with possible clues: http://zzz.i2p/topics/1613?page=6#p9056

comment:14 Changed 4 years ago by djjeshk

UDP was too unreliable for me, now i run router under control of BES – Battle Encoder Shirase (available on clearnet http://mion.faireal.net/BES/ ) and set Target Sleep/Awake? cycle lower than 40ms and 5%. Now if router starts to fail, it detects that and restarts. Interesting, but UDP transport has not that type of failure. If NTCP would have same data queuing scheme, there would not be such type of failure. Yes, of course, UDP is not TCP, but in this case UDP tries to be reliable like TCP. If there is two versions of TCP, failing version could be fixed.

comment:15 Changed 4 years ago by zzz

  • Milestone set to 0.9.18
  • Status changed from assigned to testing

Simplistic workaround in 34c818def2d17b313a6e39c8bec90833cba3b94a 0.9.17-16
May need tuning. Root cause unknown.
See also #551 #1075

comment:16 Changed 4 years ago by zzz

Several more changes in -17...-23. We're seeing high numbers of inbound requests that are being dropped after handshake. This causes the DH pool to run dry, so DH must be generated in the pumper thread, and it gets behind, and it all goes bad.

Changes include:

  • Blocking subsequent inbound connections from an IP that did this, for a few minutes
  • Increasing DH pool size and decreasing check and regenerate time

At this time it does not appear to be an NIO bug per se, we are not looping too fast.

Not declaring victory yet, more research to do.

comment:17 Changed 4 years ago by djjeshk

Currently running CPU stress test with NTCP enabled and this issue does not appear so far. Looks good news for NTCP enabled routers.

comment:18 Changed 4 years ago by djjeshk

Loaded 2000x1000 stat graph while running stress test and router continues working without rejecting tunnels and router CPU load is normal. Stress test time - 12 minutes.

comment:19 Changed 4 years ago by djjeshk

This issue seems fixed only partially, I am again getting high CPU usage by this, but message delay is not increased significantly so at 0.9.18-11-rc this issue is not critical. 55% of CPU used by NTCP Pumper thread, 74% of CPU total by router, 100% total. Set java priority to Above normal for a while and it eated 99-100% CPU.

Five thread dumps of NTCP Pumper thread:
One of this
2015/03/26 20:53:06 | "NTCP Pumper" #74 daemon prio=5 os_prio=0 tid=0x438eac00 nid=0xc5c waiting on condition [0x46a2f000]
2015/03/26 20:53:06 | java.lang.Thread.State: TIMED_WAITING (sleeping)
2015/03/26 20:53:06 | at java.lang.Thread.sleep(Native Method)
2015/03/26 20:53:06 | at net.i2p.router.transport.ntcp.EventPumper?.run(EventPumper?.java:299)
2015/03/26 20:53:06 | at java.lang.Thread.run(Unknown Source)
2015/03/26 20:53:06 | at net.i2p.util.I2PThread.run(I2PThread.java:84)

And four of these
2015/03/26 20:55:47 | "NTCP Pumper" #74 daemon prio=5 os_prio=0 tid=0x438eac00 nid=0xc5c runnable [0x46a2f000]
2015/03/26 20:55:47 | java.lang.Thread.State: RUNNABLE
2015/03/26 20:55:47 | at sun.nio.ch.WindowsSelectorImpl?$SubSelector?.poll0(Native Method)
2015/03/26 20:55:47 | at sun.nio.ch.WindowsSelectorImpl?$SubSelector?.poll(Unknown Source)
2015/03/26 20:55:47 | at sun.nio.ch.WindowsSelectorImpl?$SubSelector?.access$400(Unknown Source)
2015/03/26 20:55:47 | at sun.nio.ch.WindowsSelectorImpl?.doSelect(Unknown Source)
2015/03/26 20:55:47 | at sun.nio.ch.SelectorImpl?.lockAndDoSelect(Unknown Source)
2015/03/26 20:55:47 | - locked <0x1948e498> (a sun.nio.ch.Util$2)
2015/03/26 20:55:47 | - locked <0x1948e4a8> (a java.util.Collections$UnmodifiableSet?)
2015/03/26 20:55:47 | - locked <0x1948e420> (a sun.nio.ch.WindowsSelectorImpl?)
2015/03/26 20:55:47 | at sun.nio.ch.SelectorImpl?.select(Unknown Source)
2015/03/26 20:55:47 | at net.i2p.router.transport.ntcp.EventPumper?.run(EventPumper?.java:184)
2015/03/26 20:55:47 | at java.lang.Thread.run(Unknown Source)
2015/03/26 20:55:47 | at net.i2p.util.I2PThread.run(I2PThread.java:84)

PS: router is using over 400KBps each direction, floodfill disabled

Last edited 4 years ago by djjeshk (previous) (diff)

comment:20 Changed 4 years ago by djjeshk

Can we programmatically detect that router sits too much on sun.nio.ch.WindowsSelectorImpl??$SubSelector??.poll0(Native Method) that causes 100% usage? If yes, router might wait 11 minutes for NTCP affected tunnels expiration and restart only NTCP related stuff without restarting whole router.

PS. If these network transport only restarts are possible, then we can add two new restart buttons [Restart NTCP] and [Restart UDP] that does things gracefully like [Restart]

comment:21 follow-up: Changed 3 years ago by djjeshk

This bug has minimal impact if TCP is set to be firewalled.

comment:22 in reply to: ↑ 21 Changed 3 years ago by dg

Replying to djjeshk:

This bug has minimal impact if TCP is set to be firewalled.

With minimal config (preferably vanilla/out-of-the-box with only bandwidth settings set and a few tunnels created), can you please tell us if the problem in the OP* occurs in 0.9.20?

  • The problem specifically relating to the NTCP Pumper thread.

comment:23 Changed 3 years ago by djjeshk

Minimal config I2P installation (unused shared tunnels only) does not integrates into network at all, stars are yellow for some hours, reseed helps only for some minutes, so I am unable to reproduce it due to network issues. Fully loaded I2P installation with eepsites and i2psnark hits NTCP Pumper thread bug after some hours.

In short words: It occurs when I2P is fully used and it does not occurs when minimal config is used.

comment:24 Changed 3 years ago by djjeshk

It occured in freshly installed I2P version 0.9.20-0 without client tunnels (only participating tunnels) so problem is still here.

comment:25 Changed 23 months ago by zzz

  • Milestone changed from 0.9.18 to 0.9.24
  • Resolution set to fixed
  • Status changed from testing to closed

I believe the root cause of this was the DH precalc thread running out of keypairs. This caused keypairs to be created in the EventPumper? thread. Once this happened, the EventPumper? thread could get stuck in high-CPU and the DH thread couldn't catch up, and things generally stopped working well as it added a lot of latency to connection setup. Changes to the DH thread to increase priority and capacity, and to interrupt the thread sleep when empty, have dramatically reduced the amount of DH-empty instances. The DH thread priority issue in particular seemed to have a big effect on Windows. These changes were in 0.9.21 and 0.9.24.

Note: See TracTickets for help on using tickets.