Opened 8 years ago

Closed 2 years ago

#551 closed defect (fixed)

High CPU usage by NTCP Pumper thread

Reported by: slow Owned by: zzz
Priority: minor Milestone: 0.9.24
Component: router/transport Version: 0.8.11
Keywords: NTCP Cc:
Parent Tickets:

Description

I've checked 2 routers both running 0.8.11-0 for CPU usage by java threads. Here is CPU usage of highest 7 threads, reverse sorted.

$ ps -C java H -o lwp,pcpu k pcpu |tail -n 7
18118  2.1
18128  2.1
18077  4.1
18109 13.1
18110 13.1
18108 13.1
18163 69.6

$ ps -C java H -o lwp,pcpu k pcpu |tail -n 7
26362  4.6
26360  4.6
26359  4.7
26318  8.8
26319  8.8
26320  8.8
26358 61.5

Both top threads are NTCP Pumper threads, next 3 are BuildHandlers?.

First router have 3k participating tunnels and 1k ntcp connections, and second have 17k and 4.5k ntcp conns.

Subtickets

Change History (15)

comment:1 Changed 8 years ago by zzz

  • Component changed from unspecified to router/transport
  • Owner set to zzz
  • Status changed from new to accepted

I don't see any obvious bugs but I think there's opportunity to either optimize things or split it into 2 or more threads. I don't have a complete plan yet but I'm thinking about it.

comment:2 Changed 8 years ago by zzz

OK, I spent quite some time going through the thread and tweaking things. Mostly it's a bunch of "micro-optimizations" that hopefully will make a difference as a whole. I didn't find any big problems.

The changes are in 0.8.11-5.

comment:3 Changed 8 years ago by slow

Well, at the first router NTCP Pumper now consumes the same amount of CPU as BuildHandlers?.

At the second router with 13k part tunnels NTCP Pumper still top java thread - 70% CPU, BuildHandlers? - 9%. This is after ~8 hours uptime. What interesting, that I've watched on CPU usage first 1.5 hours, and it was fine like at the 1st machine.

comment:4 Changed 8 years ago by slow

And after some time NTCP Pumper at the first router uses 70% CPU same as at 2nd.

comment:5 Changed 8 years ago by zzz

Lots more optimizations (that probably won't help much) and debugging in 0.8.11-6. Again, we still don't know if there are bugs or if this is just the way it is.

New stats: (full stats must be enabled)

ntcp.pumperKeySetSize
Should be about the same as your average number of NTCP connections. If it's much bigger there's a problem or leak somewhere.

ntcp.pumperKeysPerLoop
Should be about 1 or maybe 2-5 on a busy router? Much higher and it's getting really backlogged.

ntcp.pumperLoopsPerSecond
5-50 on a slow router. Just guessing, maybe a few hundred on a fast router? If it's some huge number then the pumper may be infinite looping

comment:6 Changed 7 years ago by zzz

There were more changes in -7, and an NPE fix in -8. 'slow' reported on IRC that he hasn't seen a problem since -6. Leaving open for now as we test further.

comment:7 Changed 7 years ago by zzz

  • Resolution set to fixed
  • Status changed from accepted to closed

Closing, optimistically.

comment:8 Changed 7 years ago by slow

  • Resolution fixed deleted
  • Status changed from closed to reopened

Happened again on another router, I2P 0.8.11-23-rc, 3 days uptime. Unfortunately stats wasn't enabled here.

comment:9 Changed 7 years ago by str4d

  • Milestone changed from 0.8.12 to 0.8.14

comment:10 Changed 7 years ago by zzz

  • Milestone changed from 0.8.14 to 0.9.3

The Jetty 6 source file modules/jetty/src/main/java/org/mortbay/jetty/nio/SelectorManager.java has a number of workarounds and comments for Sun NIO bugs.

There's some comments, counters, and delay injection. We should review the Jetty source and see which workarounds we already do and which we should add.

@slow and others with this issue, please add your JVM version to the comments.

comment:11 Changed 7 years ago by slow

Reproduced on
I2P version: 0.9-19
Java version: Sun Microsystems Inc. 1.6.0_17 (Java(TM) SE Runtime Environment 1.6.0_17-b04)

comment:12 Changed 7 years ago by zzz

Still need to review the workarounds in Jetty 6 SelectorManager? and port them over if appropriate.

In the meantime, my advice is to use the latest Java 6 version available. While Java 5 had a lot of NIO issues, they aren't all gone in Java 6.

comment:13 Changed 6 years ago by str4d

  • Keywords NTCP added
  • Milestone 0.9.3 deleted

comment:14 Changed 4 years ago by zzz

  • Milestone set to 0.9.18
  • Status changed from reopened to testing

Simplistic workaround in 34c818def2d17b313a6e39c8bec90833cba3b94a 0.9.17-16
May need tuning. Root cause unknown.
See also #1411 #1075

comment:15 Changed 2 years ago by zzz

  • Milestone changed from 0.9.18 to 0.9.24
  • Resolution set to fixed
  • Status changed from testing to closed

I believe the root cause of this was the DH precalc thread running out of keypairs. This caused keypairs to be created in the EventPumper? thread. Once this happened, the EventPumper? thread could get stuck in high-CPU and the DH thread couldn't catch up, and things generally stopped working well as it added a lot of latency to connection setup. Changes to the DH thread to increase priority and capacity, and to interrupt the thread sleep when empty, have dramatically reduced the amount of DH-empty instances. The DH thread priority issue in particular seemed to have a big effect on Windows. These changes were in 0.9.21 and 0.9.24.

Note: See TracTickets for help on using tickets.