Opened 10 years ago

Closed 5 years ago

#551 closed defect (fixed)

High CPU usage by NTCP Pumper thread

Reported by: slow Owned by: zzz
Priority: minor Milestone: 0.9.24
Component: router/transport Version: 0.8.11
Keywords: NTCP Cc:
Parent Tickets: Sensitive: no


I've checked 2 routers, both running 0.8.11-0, for CPU usage by Java threads. Here is the CPU usage of the 7 busiest threads, reverse sorted.

$ ps -C java H -o lwp,pcpu k pcpu |tail -n 7
18118  2.1
18128  2.1
18077  4.1
18109 13.1
18110 13.1
18108 13.1
18163 69.6

$ ps -C java H -o lwp,pcpu k pcpu |tail -n 7
26362  4.6
26360  4.6
26359  4.7
26318  8.8
26319  8.8
26320  8.8
26358 61.5

Both top threads are NTCP Pumper threads; the next 3 are BuildHandlers.

The first router has 3k participating tunnels and 1k NTCP connections; the second has 17k and 4.5k NTCP connections.
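The same per-thread CPU breakdown can be cross-checked from inside the JVM using the standard ThreadMXBean API, which also gives thread names directly instead of LWP ids. The sketch below is illustrative (the class and method names are my own, not I2P code) and prints the busiest threads first, similar to the ps output above.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

public class ThreadCpu {
    /** Return "cpu-ms  thread-name" lines for all live threads, busiest first. */
    public static List<String> topThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<long[]> rows = new ArrayList<>();  // each row is {cpuNanos, threadId}
        if (mx.isThreadCpuTimeSupported() && mx.isThreadCpuTimeEnabled()) {
            for (long id : mx.getAllThreadIds()) {
                long ns = mx.getThreadCpuTime(id);   // -1 if the thread has died
                if (ns >= 0)
                    rows.add(new long[] { ns, id });
            }
        }
        // Sort descending by cumulative CPU time.
        rows.sort((a, b) -> Long.compare(b[0], a[0]));
        List<String> out = new ArrayList<>();
        for (long[] row : rows) {
            ThreadInfo info = mx.getThreadInfo(row[1]);
            if (info != null)
                out.add(String.format("%6d ms  %s", row[0] / 1_000_000, info.getThreadName()));
        }
        return out;
    }

    public static void main(String[] args) {
        topThreads().forEach(System.out::println);
    }
}
```

Note this reports cumulative CPU time since thread start, not the instantaneous percentage ps shows; sampling it twice and diffing gives a rate.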


Change History (15)

comment:1 Changed 10 years ago by zzz

Component: unspecified → router/transport
Owner: set to zzz
Status: new → accepted

I don't see any obvious bugs but I think there's opportunity to either optimize things or split it into 2 or more threads. I don't have a complete plan yet but I'm thinking about it.

comment:2 Changed 10 years ago by zzz

OK, I spent quite some time going through the thread and tweaking things. Mostly it's a bunch of "micro-optimizations" that hopefully will make a difference as a whole. I didn't find any big problems.

The changes are in 0.8.11-5.

comment:3 Changed 10 years ago by slow

Well, on the first router the NTCP Pumper now consumes the same amount of CPU as the BuildHandlers.

On the second router, with 13k participating tunnels, the NTCP Pumper is still the top Java thread at 70% CPU, versus 9% for the BuildHandlers. This is after ~8 hours of uptime. Interestingly, I watched the CPU usage for the first 1.5 hours and it was fine, like on the first machine.

comment:4 Changed 10 years ago by slow

And after some time the NTCP Pumper on the first router uses 70% CPU, the same as on the second.

comment:5 Changed 10 years ago by zzz

Lots more optimizations (that probably won't help much) and debugging in 0.8.11-6. Again, we still don't know if there are bugs or if this is just the way it is.

New stats (full stats must be enabled):

Should be about the same as your average number of NTCP connections. If it's much bigger there's a problem or leak somewhere.

Should be about 1 or maybe 2-5 on a busy router? Much higher and it's getting really backlogged.

5-50 on a slow router. Just guessing; maybe a few hundred on a fast router? If it's some huge number, then the pumper may be stuck in an infinite loop.

comment:6 Changed 10 years ago by zzz

There were more changes in -7, and an NPE fix in -8. 'slow' reported on IRC that he hasn't seen a problem since -6. Leaving open for now as we test further.

comment:7 Changed 10 years ago by zzz

Resolution: fixed
Status: accepted → closed

Closing, optimistically.

comment:8 Changed 10 years ago by slow

Resolution: fixed
Status: closed → reopened

Happened again on another router: I2P 0.8.11-23-rc, 3 days of uptime. Unfortunately, stats weren't enabled here.

comment:9 Changed 10 years ago by str4d


comment:10 Changed 10 years ago by zzz


The Jetty 6 source directory modules/jetty/src/main/java/org/mortbay/jetty/nio/ contains a number of workarounds and comments for Sun NIO bugs.

There are comments, counters, and delay injection. We should review the Jetty source and see which workarounds we already have and which we should add.
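For reference, the best-known workaround of this kind is the "selector spin" guard for the Sun epoll bug, where Selector.select() keeps returning 0 immediately and pins a core: count consecutive zero-key wakeups that return well before the timeout, and rebuild the selector past a threshold. This is a hedged sketch of that technique; the class name, threshold, and method signatures are illustrative, not Jetty's or I2P's actual code.

```java
import java.io.IOException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

public class SpinGuard {
    /** Consecutive empty early wakeups before we assume the selector is broken. */
    public static final int SPIN_THRESHOLD = 512;
    private int zeroSelects = 0;

    /** Record one select() result; returns true when a rebuild is warranted. */
    public boolean recordSelect(int selectedKeys, long elapsedMillis, long timeoutMillis) {
        if (selectedKeys == 0 && elapsedMillis < timeoutMillis) {
            // Woke up early with nothing ready: possible epoll spin.
            return ++zeroSelects >= SPIN_THRESHOLD;
        }
        zeroSelects = 0;  // a normal wakeup resets the counter
        return false;
    }

    /** Re-register all valid keys on a fresh selector, closing the broken one. */
    public static Selector rebuild(Selector broken) throws IOException {
        Selector fresh = Selector.open();
        for (SelectionKey key : broken.keys()) {
            if (key.isValid())
                key.channel().register(fresh, key.interestOps(), key.attachment());
        }
        broken.close();
        return fresh;
    }
}
```

The pumper loop would call recordSelect() after each select() and swap in rebuild()'s result when it returns true.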

@slow and others with this issue, please add your JVM version to the comments.

comment:11 Changed 9 years ago by slow

Reproduced on
I2P version: 0.9-19
Java version: Sun Microsystems Inc. 1.6.0_17 (Java™ SE Runtime Environment 1.6.0_17-b04)

comment:12 Changed 9 years ago by zzz

Still need to review the workarounds in the Jetty 6 SelectorManager and port them over if appropriate.

In the meantime, my advice is to use the latest Java 6 version available. While Java 5 had a lot of NIO issues, they aren't all gone in Java 6.

comment:13 Changed 9 years ago by str4d

Keywords: NTCP added
Milestone: 0.9.3

comment:14 Changed 7 years ago by zzz

Milestone: 0.9.18
Status: reopened → testing

Simplistic workaround in 34c818def2d17b313a6e39c8bec90833cba3b94a (0.9.17-16).
May need tuning; root cause unknown.
See also #1411 and #1075.

comment:15 Changed 5 years ago by zzz

Resolution: fixed
Status: testing → closed

I believe the root cause of this was the DH precalc thread running out of keypairs, which caused keypairs to be created in the EventPumper thread instead. Once this happened, the EventPumper thread could get stuck at high CPU, the DH thread couldn't catch up, and things generally stopped working well, as it added a lot of latency to connection setup. Changes to the DH thread to increase its priority and capacity, and to interrupt its sleep when the pool is empty, have dramatically reduced the number of DH-empty instances. The DH thread priority fix in particular seemed to have a big effect on Windows. These changes were in 0.9.21 and 0.9.24.
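The shape of that fix can be sketched as a bounded precalculation pool: a filler thread at raised priority keeps the pool topped up, sleeps while it is full, and is interrupted out of its sleep whenever a consumer finds the pool empty. All names, the capacity, and the sleep interval below are illustrative assumptions, not the actual I2P DH precalc code.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Supplier;

public class PrecalcPool<T> {
    private final LinkedBlockingQueue<T> pool;
    private final Thread filler;

    public PrecalcPool(int capacity, Supplier<T> generator) {
        pool = new LinkedBlockingQueue<>(capacity);
        filler = new Thread(() -> {
            while (true) {
                if (pool.remainingCapacity() > 0) {
                    // Run the expensive generator only while there is room.
                    pool.offer(generator.get());
                } else {
                    try {
                        // Pool is full: sleep until a consumer wakes us.
                        Thread.sleep(30_000);
                    } catch (InterruptedException e) {
                        // Woken early because the pool ran low; loop and refill.
                    }
                }
            }
        }, "Precalc filler");
        // Raised priority so busy consumer threads can't starve the filler
        // (the priority change described in the comment above).
        filler.setPriority(Thread.NORM_PRIORITY + 1);
        filler.setDaemon(true);
        filler.start();
    }

    /** Non-blocking take: null means the pool ran dry (the failure mode above). */
    public T getNext() {
        T t = pool.poll();
        if (t == null)
            filler.interrupt();  // break the filler out of its sleep to refill now
        return t;
    }
}
```

A caller that gets null falls back to generating inline, which is exactly the slow path the pool exists to avoid, so the interrupt-on-empty keeps that window short.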
