Opened 6 years ago

Closed 3 years ago

#1075 closed defect (fixed)

NTCP pumper thread uses too much cpu

Reported by: Zlatin Balevsky
Owned by: zzz
Priority: minor
Milestone: 0.9.24
Component: router/transport
Version: 0.9.8.1
Keywords: ntcp spin jstack
Cc: zab@…, zzz@…, killyourtv@…
Parent Tickets:
Sensitive: no

Description

"greyman" on irc2p reported that i2p is using 100% cpu. Further investigation points to the NTCP event pumper thread. See the attached files for more information.

From j/top_H.txt:

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND           
 1149 abcdef    20   0 2942m 290m  13m R  98.7  3.8   1850:18 java  

1149 decimal == 0x47d hex, which matches the nid of the NTCP Pumper thread in the jstack output.

Then: grep 47d j/jstack*

jstack0.txt:"NTCP Pumper" daemon prio=10 tid=0x00007ffe480e2800 nid=0x47d runnable [0x00007ffea1f23000]
jstack1.txt:"NTCP Pumper" daemon prio=10 tid=0x00007ffe480e2800 nid=0x47d runnable [0x00007ffea1f23000]
jstack2.txt:"NTCP Pumper" daemon prio=10 tid=0x00007ffe480e2800 nid=0x47d runnable [0x00007ffea1f23000]
jstack3.txt:"NTCP Pumper" daemon prio=10 tid=0x00007ffe480e2800 nid=0x47d runnable [0x00007ffea1f23000]
jstack4.txt:"NTCP Pumper" daemon prio=10 tid=0x00007ffe480e2800 nid=0x47d runnable [0x00007ffea1f23000]
jstack5.txt:"NTCP Pumper" daemon prio=10 tid=0x00007ffe480e2800 nid=0x47d runnable [0x00007ffea1f23000]
jstack6.txt:"NTCP Pumper" daemon prio=10 tid=0x00007ffe480e2800 nid=0x47d runnable [0x00007ffea1f23000]
jstack7.txt:"NTCP Pumper" daemon prio=10 tid=0x00007ffe480e2800 nid=0x47d runnable [0x00007ffea1f23000]
jstack8.txt:"NTCP Pumper" daemon prio=10 tid=0x00007ffe480e2800 nid=0x47d runnable [0x00007ffea1f23000]
jstack9.txt:"NTCP Pumper" daemon prio=10 tid=0x00007ffe480e2800 nid=0x47d runnable [0x00007ffea1f23000]

Subtickets

Attachments (1)

j.tar.bz2 (26.6 KB) - added by Zlatin Balevsky 6 years ago.
stack traces with 'jstack' and 'top -H'


Change History (7)

Changed 6 years ago by Zlatin Balevsky

Attachment: j.tar.bz2 added

stack traces with 'jstack' and 'top -H'

comment:1 Changed 6 years ago by zzz

See also (possible duplicate) #551 and from way way back, #46

<topiltzin> greyman: there could be many reasons why. This patch takes care of one possibility. Please apply and see if the problem reoccurs http://pastethis.i2p/show/5761/
<zzz> topz_afk, greyman, there's a couple of pumper stats that may be interesting if you enable full stats
<zzz> e.g. ntcp.pumperLoopsPerSecond
<greyman> k, got it. will report back. don't know if you got this but I wrote: this happened with Oracle 7 although it took longer to manifest. The server has been running for 7 days but the thread reports running full out for only 52 hours

Most of our EventPumper problems went away with Java 7. What's left could well be a bug of ours, possibly with the interest ops. The whole "fail safe" part of the code papers over a lot of other issues.

I like reading whatever Jetty says about NIO because they've had similar problems over the years and have completely rewritten their NIO code several times.

See also:
http://webtide.intalio.com/2012/10/why-detecting-concurrent-issues-can-be-difficult/
http://wiki.eclipse.org/Jetty/Feature/JVM_NIO_Bug
http://webtide.intalio.com/2004/02/nio-and-the-servlet-api/ (very old)

The "difficult" blog post above about fixing a Jetty NIO bug with interest ops is good. At the time it came out I looked at it pretty carefully and decided we didn't have the same problem in EventPumper? but who knows.

I don't think anything can be fixed by changing buffer sizes; I think there's a bug in there somewhere. Odd that the OP hits it so often.
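
For readers unfamiliar with the problem, a minimal sketch (not the actual EventPumper code; class names and thresholds below are invented) of the usual selector-rebuild workaround for the JDK epoll spin bug looks roughly like this:

import java.io.IOException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

// Illustrative only: detect a Selector that returns 0 immediately instead of
// blocking (the JDK epoll spin bug) and rebuild it. Not the I2P code.
class SpinSafeSelectorLoop implements Runnable {
    private static final int SPIN_THRESHOLD = 512;   // invented tuning value
    private Selector selector;

    SpinSafeSelectorLoop(Selector selector) { this.selector = selector; }

    public void run() {
        int prematureEmptySelects = 0;
        try {
            while (!Thread.currentThread().isInterrupted()) {
                long before = System.currentTimeMillis();
                int selected = selector.select(1000);
                long elapsed = System.currentTimeMillis() - before;
                if (selected == 0 && elapsed < 100) {
                    // Returned instantly with nothing ready: possible spin.
                    if (++prematureEmptySelects > SPIN_THRESHOLD) {
                        selector = rebuild(selector);
                        prematureEmptySelects = 0;
                    }
                    continue;
                }
                prematureEmptySelects = 0;
                for (SelectionKey key : selector.selectedKeys()) {
                    // ... dispatch read/write/accept/connect events here ...
                }
                selector.selectedKeys().clear();
            }
        } catch (IOException ioe) {
            // Selector failure; a real implementation would log and recover.
        }
    }

    // Open a fresh Selector and move every valid registration onto it.
    private Selector rebuild(Selector old) throws IOException {
        Selector fresh = Selector.open();
        for (SelectionKey key : old.keys()) {
            if (key.isValid() && key.channel().isOpen())
                key.channel().register(fresh, key.interestOps(), key.attachment());
        }
        old.close();
        return fresh;
    }
}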

comment:2 Changed 6 years ago by Zlatin Balevsky

The reason I'm looking at buffer sizes and the read path is that stack traces 1, 2, 3, 4, 7, 8, and 9 are processing a read event; only stack traces 0, 5, and 6 are not.

(grep -A 5 0x47d jstack*)

comment:3 Changed 6 years ago by killyourtv

Cc: killyourtv@… added

comment:4 Changed 5 years ago by str4d

Milestone: 0.9.9

comment:5 Changed 5 years ago by zzz

Milestone: 0.9.18
Status: new → testing

Simplistic workaround in 34c818def2d17b313a6e39c8bec90833cba3b94a 0.9.17-16
May need tuning. Root cause unknown.
See also #1411 #551
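
I don't know the exact contents of that changeset, but a workaround of that general shape (all names and thresholds below are invented; this is only a sketch) might look like:

// Illustrative sketch of a loops-per-second guard: if the pumper loop runs
// far more often than it plausibly should, back off briefly so a spinning
// selector cannot pin a core. Names and thresholds are invented.
class PumperLoopGuard {
    private static final int MAX_LOOPS_PER_SECOND = 250;  // invented value
    private long windowStart = System.currentTimeMillis();
    private int loops;

    /** Call once per pumper loop iteration. */
    void countLoop() throws InterruptedException {
        long now = System.currentTimeMillis();
        if (now - windowStart >= 1000) {
            windowStart = now;
            loops = 0;
        }
        if (++loops > MAX_LOOPS_PER_SECOND)
            Thread.sleep(25);   // brief back-off instead of spinning
    }
}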

comment:6 Changed 3 years ago by zzz

Milestone: 0.9.18 → 0.9.24
Resolution: fixed
Status: testing → closed

I believe the root cause of this was the DH precalc thread running out of keypairs, which caused keypairs to be created in the EventPumper thread. Once this happened, the EventPumper thread could get stuck at high CPU, the DH thread couldn't catch up, and things generally stopped working well, as it added a lot of latency to connection setup. Changes to the DH thread to increase priority and capacity, and to interrupt the thread sleep when empty, have dramatically reduced the number of DH-empty instances. The DH thread priority issue in particular seemed to have a big effect on Windows. These changes were in 0.9.21 and 0.9.24.
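
To make the shape of the fix concrete for later readers, a rough sketch of that kind of precalculation pool (class names, capacity, priorities, and timings below are invented; this is not the actual I2P code):

import java.util.concurrent.LinkedBlockingQueue;

// Illustrative only: a pool of precomputed keypairs kept topped up by a
// higher-priority daemon thread that consumers can wake when the pool runs
// dry. Names, capacity, and timings are invented; not the actual I2P code.
class PrecalcPool {
    private static final int CAPACITY = 50;               // invented capacity
    private final LinkedBlockingQueue<Object> pool =
            new LinkedBlockingQueue<Object>(CAPACITY);
    private final Thread filler;

    PrecalcPool() {
        filler = new Thread(new Runnable() {
            public void run() {
                while (true) {
                    try {
                        if (pool.remainingCapacity() == 0)
                            Thread.sleep(60 * 1000);       // idle until woken
                        else
                            pool.put(generateKeypair());   // refill ahead of demand
                    } catch (InterruptedException ie) {
                        // Woken early: a consumer found the pool empty; refill now.
                    }
                }
            }
        }, "Precalc Filler");
        filler.setDaemon(true);
        filler.setPriority(Thread.NORM_PRIORITY + 1);      // elevated priority
        filler.start();
    }

    /** Take a precomputed keypair, falling back to inline generation if empty. */
    Object acquire() {
        Object kp = pool.poll();
        if (kp == null) {
            filler.interrupt();          // wake the filler so it can catch up
            kp = generateKeypair();      // expensive inline fallback
        }
        return kp;
    }

    private static Object generateKeypair() {
        return new Object();             // placeholder for real DH keypair generation
    }
}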
