Opened 9 months ago

Last modified 6 months ago

#2355 new defect

Single thread goes wild on ARM

Reported by: jogger Owned by:
Priority: major Milestone: undecided
Component: router/general Version: 0.9.37
Keywords: Cc:
Parent Tickets: Sensitive: no

Description (last modified by zzz)

This identifies the basic problem touched on in #1411, #1943 and #2314.

The general effect is reduced throughput: one thread maxes out a single core, and several other packet-processing threads run at CPU levels too high relative to the reduced traffic.

Seen this in the last years on RPi2 (quad ARM32), Odroid C2 (quad ARM64) and Odroid HC1 (~XU4, octa ARM32), on Java 8 and 9. On the quad-core machine total CPU load goes > 90%; on the octa-core, total CPU stays well under 80% while the thread going wild clearly stands out.

This is not a high-usage bug; in the last event total CPU had been at around 20% for days when the bug struck.

I could identify NTCP Pumper twice and I2CP Reader once. One has to restart to get back to old traffic levels. Thread dump and top output are attached. The top output shows that the threads consuming the most CPU at the time of the problem had considerably less total CPU run time before; it was a clear, sudden change.

PID USER PR NI S %CPU %MEM P TIME COMMAND

21473 e 3 -17 R 97,1 28,8 4 529:00 I2CP Reader 191
12569 e 14 -6 R 67,1 28,8 7 135:51 I2CP Internal R
12528 e 7 -13 S 54,7 28,8 5 954:38 Tunnel GW pumpe
12527 e 7 -13 S 51,5 28,8 2 955:02 Tunnel GW pumpe
12542 e 14 -6 R 31,3 28,8 5 610:10 NTCP writer 1/1
12551 e 14 -6 R 31,3 28,8 6 16,48 UDP message rec
12540 e 11 -9 S 26,4 28,8 1 16,44 NTCP reader 2/3
12539 e 11 -9 S 25,4 28,8 6 16,46 NTCP reader 1/3
12541 e 11 -9 S 25,4 28,8 0 16,44 NTCP reader 3/3
12538 e 14 -6 S 16,0 28,8 6 18,05 NTCP Pumper
12552 e 17 -3 S 12,7 28,8 0 23,41 UDP packet push

edit zzz: dump moved to attachment

Subtickets

#2381: Make connection limits soft goals (accepted, zzz)
#2382: Fix isUnreachable() and wasUnreachable() in TransportImpl.java (closed, zzz)

Attachments (1)

2355-dump.txt (83.1 KB) - added by zzz 9 months ago.
dump moved from OP


Change History (35)

comment:1 Changed 9 months ago by jogger

Description: modified (diff)

comment:2 Changed 9 months ago by zzz

Status: new → infoneeded_new

which thread is the problem?

comment:3 Changed 9 months ago by jogger

Status: infoneeded_new → new

Finally nailed it down. The thread going wild varies. I have seen it with NTCP Pumper several times; in the documentation below it is NTCP Writer running far above its long-time average, with Tunnel GW Pumper running high during the entire shutdown. The problem is that i2p (or the JVM) starts process thrashing and traffic drops. Basically I believe it is the combination of ARM not being the fastest and Linux CFS handing out relatively short time slices of 6 ms max, which get very short if many processes (threads) want to run. So the packet-processing threads preempt each other multiple times before they get their work done. A solution might be to cut down on the number of threads. I have done this in the source; with more threads the bug occurred more frequently and with sharper traffic drops. Proof of this below (it happened after 6 days uptime).

Total system was at 80 to > 90% CPU with traffic levels (60 min average) far below the max.

Top output:

PID USER PR NI S %CPU %MEM P TIME COMMAND

30981 e 9 -11 R 83,0 31,6 4 69,28 UDP packet push
30980 e 10 -10 S 74,3 31,6 4 48,00 UDP message rec
30970 e 12 -8 R 33,4 31,6 2 31,00 NTCP Pumper
29374 e 10 -10 R 27,2 31,6 7 30,47 Tunnel GW pumpe
29375 e 10 -10 R 31,6 31,6 2 30,46 Tunnel GW pumpe
30971 e 8 -12 R 57,3 31,6 0 30,25 NTCP reader 1/3
30972 e 8 -12 R 60,1 31,6 6 30,25 NTCP reader 2/3
30973 e 8 -12 R 55,7 31,6 3 30,25 NTCP reader 3/3
30974 e 10 -10 R 58,5 31,6 5 19,43 NTCP writer 1/1
30978 e 15 -5 S 19,5 31,6 1 931:20 UDP Packet hand
14559 e 20 0 S 9,6 31,6 6 832:04 I2CP Reader 45
29367 e 20 0 S 13,0 31,6 2 558:37 SimpleTimer2 1/
29368 e 20 0 R 11,8 31,6 2 557:48 SimpleTimer2 2/
30976 e 20 0 S 8,0 31,6 4 434:24 UDPSender 2
30979 e 20 0 R 9,3 31,6 3 433:29 UDP ACK sender

Averages from stats page:
bw.receiveBps: Message receive rate (bytes/sec)
60 sec rate: Average: 996.137,75; Highest average: 1.762.161,75. There was 1 event in this period. The period ended 86 sec ago. (Average event count: 1,199; Events in peak period: 1)
5 min rate: Average: 844.344,5; Highest average: 1.356.251,0. There were 6 events in this period. The period ended 4 min ago. (Average event count: 5,999; Events in peak period: 6)
60 min rate: Average: 650.497,062; Highest average: 995.977,312. There were 72 events in this period. The period ended 29 min ago. (Average event count: 72,238; Events in peak period: 72)
Lifetime average value: 516.707,844 (10,619 events)

bw.recvRate: Low-level receive rate (bytes/sec)
60 sec rate: Average: 1.973.561,875; Highest average: 3.487.543,25. There was 1 event in this period. The period ended 86 sec ago. (Average event count: 1,199; Events in peak period: 1)
5 min rate: Average: 1.751.331,125; Highest average: 2.812.168,25. There were 6 events in this period. The period ended 4 min ago. (Average event count: 5,999; Events in peak period: 6)
60 min rate: Average: 1.391.899,375; Highest average: 2.341.244,25. There were 72 events in this period. The period ended 29 min ago. (Average event count: 72,238; Events in peak period: 72)
Lifetime average value: 1.013.717,062 (10,619 events)

bw.sendBps: Message send rate (bytes/sec)
60 sec rate: Average: 1.635.342,625; Highest average: 2.882.780,0. There was 1 event in this period. The period ended 86 sec ago. (Average event count: 1,199; Events in peak period: 1)
5 min rate: Average: 1.446.055,0; Highest average: 2.283.735,0. There were 6 events in this period. The period ended 4 min ago. (Average event count: 5,999; Events in peak period: 6)
60 min rate: Average: 1.174.283,75; Highest average: 1.907.248,5. There were 72 events in this period. The period ended 29 min ago. (Average event count: 72,238; Events in peak period: 72)
Lifetime average value: 1.100.247,25 (10,619 events)

bw.sendRate: Low-level send rate (bytes/sec)
60 sec rate: Average: 2.204.005,75; Highest average: 3.739.685,75. There was 1 event in this period. The period ended 86 sec ago. (Average event count: 1,199; Events in peak period: 1)
5 min rate: Average: 1.974.525,375; Highest average: 3.011.623,75. There were 6 events in this period. The period ended 4 min ago. (Average event count: 5,999; Events in peak period: 6)
60 min rate: Average: 1.597.892,5; Highest average: 2.565.552,0. There were 72 events in this period. The period ended 29 min ago. (Average event count: 72,238; Events in peak period: 72)
Lifetime average value: 1.325.021,125 (10,619 events)

I analysed further: the additional CPU was spent in user space; kernel functions were at around 13%, less than at full traffic speed.

Came up with a script detailing 10 sec intervals of context switches:

Thread nonvol vol
SimpleTim 2634 2735
Tunnel_GW 1640 3147
I2CP_Read 1235 284
NTCP_read 1195 3272
UDP_messa 1141 964
UDP_Packe 967 2894
NTCP_Pump 936 5095
NTCP_writ 914 1686
UDP_ACK_s 830 136
UDPSender 818 2976
UDP_packe 785 688
BuildHand 660 56
UDPReceiv 450 4437
G1_Refine 398 138
GC_Thread 382 278
PRNG 326 26
BuildExec 251 56
UDP_Estab 223 88
DH_Precal 220 95
JobQueue_ 182 484
I2CP_Inte 94 323
NTCPSendF 92 5425

100 nonvol switches per second for many threads, with the associated waiting time, is just too high and slows down the router.
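
The script itself was not attached; a rough sketch of one way to collect such per-thread voluntary/nonvoluntary counts from /proc over a 10-second window (the PID lookup and file names are illustrative assumptions, not jogger's actual code) might look like this:

#!/bin/bash
# Rough sketch: per-thread voluntary/nonvoluntary context switches over a 10 s window.
javapid=$(pgrep -f i2ptunnel | head -n 1)   # assumes the router JVM command line mentions i2ptunnel
snapshot() {
        for t in /proc/"$javapid"/task/*; do
                tid=${t##*/}
                name=$(tr ' ' '_' < "$t/comm")
                nonvol=$(awk '/^nonvoluntary_ctxt_switches/ {print $2}' "$t/status")
                vol=$(awk '/^voluntary_ctxt_switches/ {print $2}' "$t/status")
                echo "$tid $name $nonvol $vol"
        done
}
snapshot | sort > /tmp/ctx.before
sleep 10
snapshot | sort > /tmp/ctx.after
# print "thread nonvol vol" deltas, threads with the most nonvoluntary switches first
join /tmp/ctx.before /tmp/ctx.after | awk '{print $2, $6-$3, $7-$4}' | sort -k2,2nr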

comment:4 Changed 9 months ago by zzz

good research! But I still don't have the info I need…
Which threads did you reduce the count for?

Changed 9 months ago by zzz

Attachment: 2355-dump.txt added

dump moved from OP

comment:5 Changed 9 months ago by zzz

Description: modified (diff)

moved dump from OP to attachment

comment:6 Changed 9 months ago by jogger

I am running with 3 NTCP readers and build handlers, 2 SimpleTimers, Job Queue runners and Tunnel GW pumpers, and 1 for everything else. As said before: more threads, more frequent lockups.

The same happened again today after 4 days uptime, a more extreme case: 98% total CPU, traffic drop > 50%. I2CP Reader going wild, > 80% through the entire shutdown phase, Tunnel GW Pumpers and NTCP Writer also very high. top and context switches below. Additional info: garbage collection frequency is down, expected as fewer packets are processed. Nothing unusual in jconsole, no deadlocks. Time spent in kernel calls is also down 50% (see top below), also expected with less traffic.

Interesting: I created a heap dump using jmap twice. jmap caused a full GC cycle and then CPU levels dropped for some seconds. Is it possible that the JVM is shuffling memory around without resulting in GC? 90% of heap memory is G1 old gen according to jconsole, but there are no G1 old gen collections, only G1 new gen. Nevertheless G1 old gen varies between 150 and > 300 MB. Where should I post the 56 MB gzipped heap dump, if anybody is interested?
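
For reference, the heap dump and the GC observations described here can be captured with standard JDK tools; the exact invocations below are assumptions, not necessarily what was run, and the file paths are placeholders:

jmap -dump:live,format=b,file=/tmp/i2p-heap.hprof <javapid>   # "live" forces a full GC before dumping
jmap -histo:live <javapid> > /tmp/i2p-histo.txt               # class histogram, also after a forced full GC
jstat -gcutil <javapid> 10s                                   # G1 generation occupancy and GC counts every 10 s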

top - 11:10:59 up 25 days, 14:34, 0 users, load average: 16,17, 17,93, 18,52
Threads: 541 total, 16 running, 459 sleeping, 0 stopped, 1 zombie
%Cpu(s): 89,4 us, 6,9 sy, 0,0 ni, 2,0 id, 0,0 wa, 0,0 hi, 1,6 si, 0,0 st
KiB Mem : 2044844 total, 143884 free, 1526864 used, 374096 buff/cache
KiB Swap: 2096124 total, 2094076 free, 2048 used. 429216 avail Mem

PID USER PR NI S %CPU %MEM P TIME COMMAND

12107 e 13 -7 S 35,3 29,4 0 71,57 UDP packet push
12091 e 13 -7 R 29,6 29,4 7 49,21 NTCP Pumper
12106 e 11 -9 R 53,7 29,4 7 49,19 UDP message rec
12093 e 9 -11 R 41,4 29,4 6 40,45 NTCP reader 2/3
12092 e 9 -11 S 39,7 29,4 6 40,45 NTCP reader 1/3
12094 e 9 -11 R 42,7 29,4 4 40,45 NTCP reader 3/3
12080 e 9 -11 R 91,8 29,4 6 33,27 Tunnel GW pumpe
12081 e 9 -11 R 89,3 29,4 2 33,25 Tunnel GW pumpe
12095 e 9 -11 R 85,2 29,4 7 25,39 NTCP writer 1/1
12104 e 20 0 R 9,6 29,4 7 17,16 UDP Packet hand
12073 e 20 0 R 13,2 29,4 3 632:16 SimpleTimer2 1/
12074 e 20 0 R 12,1 29,4 0 632:02 SimpleTimer2 2/
12644 e 10 -10 R 80,8 29,4 5 628:01 I2CP Reader 64

Thread nonvol vol
NTCP_read 4607 1784
GC_Thread 4317 340
I2CP_Read 3149 234
SimpleTim 3053 667
NTCP_Pump 2515 3265
UDP_packe 2379 736
Tunnel_GW 2148 697
NTCP_writ 2083 426
UDP_messa 1779 640
UDP_Packe 1495 520
BuildHand 1094 107
UDP_ACK_s 855 213
I2CP_Inte 705 112
UDP_Estab 451 118
JobQueue_ 376 699
G1_Refine 349 154
DH_Precal 344 176

comment:7 Changed 9 months ago by zzz

I see from the dump that you're using the OpenJDK JRE. Our standing advice for many years is to use Oracle JRE for ARM - the OpenJDK was much slower and buggy, almost unusable. This was in the early days of Java 8. We still say that on our download page as we haven't ever had anybody say 'I use OpenJDK on RPi and it works great'.

Our advice may or may not still be good - but perhaps you can try the Oracle JRE and see how it works?

comment:8 Changed 9 months ago by Zlatin Balevsky

Hi Jogger,

In the thread dump you provided, NTCP Pumper is stuck in a very unexpected place. That rules out any 100% CPU bugs in select() calls and supports your theory of frequent context switches.

A single thread dump can only tell us so much. The way to use thread dumps is to take, say, 10 thread dumps 10 ms apart (via a script, of course) and then analyze statistically where each thread is spending its time. Can you do that for us and attach the results to this ticket?
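
A minimal sketch of such a sampling script (the interval, the count and the way the router PID is found are assumptions; adjust as needed):

#!/bin/bash
# Take 10 thread dumps a short, fixed interval apart for later statistical analysis.
javapid=$(pgrep -f i2ptunnel | head -n 1)   # assumes the router JVM command line mentions i2ptunnel
for i in $(seq 1 10)
do
        jstack -l "$javapid" > "dump-$i-$(date +%H%M%S.%N).txt"
        sleep 0.01   # 10 ms between dumps; increase if the machine cannot dump that fast
done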

Regarding reducing the number of threads, please take a look at ticket #2252. You can go down to 1 thread for every stage of the pipeline and not lose any throughput (with some caveats).

Based on your recent forum posts I'm guessing you're doing heavy seeding with BiglyBT. Can you also try spinning up an "empty" router on your ARM setup that does nothing but route traffic? I'm curious how the profile will change and if an investigation into the I2CP code is warranted.

comment:9 Changed 8 months ago by jogger

I will try to come up with a number of dumps when the lockup hits again, which will be more than two days away.

Yes, I ran Oracle JDK 8 on ARM 64 a year ago and this bug hit more frequently. I tested Oracle JDK 8 and all of OpenJDK 8 through 11 on ARM 32. OpenJDK 9 is definitely a big improvement over Oracle JDK 8, especially much faster. I am currently preferring it because 10 and especially 11 use more memory. The warning in the logs is definitely misleading and wrong.

You cannot go down to 1 thread each on an 8-core system. I will comment on #2252.

The "empty" router will not work, I have only 50 Mbps outbound which allows just one router at full speed. The bug hits with Snark internal as well though.

comment:10 Changed 8 months ago by jogger

The bug hit again. I took 10 thread dumps 10 seconds apart; the machine cannot go faster. Also heap dump histograms before and after. During the shutdown phase, when most threads came down, both I2CP Readers went near 100% CPU. However, this is not a straight I2CP bug; it has been seen several times before, when port 7654 was not in use, with NTCP Pumper or Writer going wild.

I cannot currently attach the files; with both Firefox and Safari I get "IndexError: pop from empty list". Where should I put them?

comment:11 Changed 8 months ago by zzz

zip them up and email to zzz@… and zab@…
thanks

comment:12 Changed 8 months ago by jogger

Finally found proof of a bug that is most likely the root cause of the problem. NTCP Pumper has a CPU leak: it needs more and more cycles over time to do its work. To show this, I came up with a slow-moving metric that shows in a linear fashion that things go wrong (very slowly; it will probably go unnoticed between i2p releases if you have little traffic, a fast CPU or just one or two cores, causing only some performance degradation).

I divided NTCP Pumper's total CPU time by the total CPU time of all other threads, measured in per mille, separated into user and system cycles. If the bug is not ARM-specific - which I suppose - then running the following bashism against any busy Linux router (for some hours, after some hours of uptime) will more than clearly exhibit the bug. Some code profiling, which I am unable to do, should find the problematic code.

#!/bin/bash
# Track NTCP Pumper CPU relative to all other threads, reported in per mille (0/00).
# Find the router JVM: the java process whose command line mentions i2ptunnel.
javapid=$(ps -C java -o 'pid args' | grep i2ptunnel | sed 's/^ *//' | cut -f 1 -d " ")
# Find the thread id (TID) of the NTCP Pumper thread.
read pumpertid dummy < <(ps hH -p $javapid -o 'tid comm' | grep "NTCP P")
while true
do
        # Per-thread stat: the thread name "(NTCP Pumper)" contains a space, so after
        # whitespace splitting utime/stime end up at array indices 14/15.
        stat=($(</proc/$javapid/task/$pumpertid/stat))
        pumperusercpu=${stat[14]}
        pumpersyscpu=${stat[15]}
        # Whole-process stat: comm is "(java)" without a space, so utime/stime are at 13/14.
        stat=($(</proc/$javapid/stat))
        otherusercpu=$((stat[13] - pumperusercpu))
        othersyscpu=$((stat[14] - pumpersyscpu))
        # Pumper CPU as a rounded per-mille share of all other threads' CPU, plus raw tick counts.
        echo "PumperUserCPU 0/00:" $(((pumperusercpu * 1000 + otherusercpu / 2) / otherusercpu)) "PumperSysCPU 0/00:" $(((pumpersyscpu * 1000 + othersyscpu / 2) / othersyscpu)) "Raw:" $pumperusercpu $otherusercpu $pumpersyscpu $othersyscpu
        sleep 999
done

edited post for formatting - sl

Last edited 7 months ago by slumlord (previous) (diff)

comment:13 Changed 8 months ago by Zlatin Balevsky

Hi,

The thing that comes closest to a smoking gun is that it takes a long time to count the active NTCP connections. Have you changed the default number of NTCP connections? For unlimited bandwidth that number is 700.

To get to this conclusion, notice that in the stack dump "NTCP Pumper" is frequently "BLOCKED". Then, when you search for the hexadecimal value it is "waiting to lock", you'll see the lock is held by an "NTCP Reader" which is "RUNNABLE" and inside the NTCPTransport::countActivePeers method.

What is strange is that all that method does is iterate through the list of NTCP connections and call some getters. Unless you have a huge number of NTCP connections, this method should return very quickly. We may be able to do some micro-optimizations to that code, but without a major redesign I would not expect miracles.

Can you try reducing the number of NTCP connections?

comment:14 Changed 8 months ago by Zlatin Balevsky

Here is an example micro-optimization which takes the call to get current system time out of the tight loop. Who knows, maybe that call is slow on ARM32. Give it a try if you get a chance, the patch is against latest monotone head.

http://paste.crypthost.i2p/?00c2548094be2e4a#AwWZ/QinTVWpHZ+0YAs15GU+Q0NfPeS8tlMAPnjI0Kk=

comment:15 Changed 8 months ago by Zlatin Balevsky

@jogger: I just committed the patch in the previous comment, so just build latest from monotone.

comment:16 Changed 8 months ago by jogger

Hi, bad news for someone who bought and mastered a professional-grade router to support 10.000+ TCP connections for i2p. The top-of-the-line consumer product I purchased before maxed out at 2.000 conns, while i2p was fine with that number.

However, I applied your patch and left everything else unchanged. Definitely an improvement. At first glance no noticeable difference - there must be other kernel calls heavily used by NTCP Pumper. But I have to wait some days now for any final word. Meanwhile I will be looking at the number of TCP connections (1.111 after 3 hours uptime, but my routers need 12-24 hours to stabilize).

When I was young I organized lists too expensive to traverse or to sort as heaps (as known from the classic heapsort). Predictable times for insertion/deletion/selection. Remember BASIC on the Apple II (1 MHz, 48 kB RAM)?

comment:17 Changed 8 months ago by Zlatin Balevsky

10,000+ TCP connections? If you don't mind us asking, how many routers is that split between? Or if you can please give us an estimate of how many connections end up per router.

NTCP Pumper does an EPoll system call which in turn makes the kernel iterate over the file descriptors to check each one for readiness. So there is no getting around a linear scan of a huge list. The best we can do short of massive re-engineering is micro-optimizations here and there.

I'm going to guess that after a while your routers will reach the degenerate state again, only this time the bottleneck will be a linear scan elsewhere. Then we'll play whack-a-mole for a while with ever diminishing returns.

comment:18 Changed 8 months ago by jogger

First, a BIG thumbs up!!! Your patch resulted in HUGE CPU savings. Overall system call cycles are down; NTCP Pumper, Reader, Writer, Tunnel GW Pumper and SimpleTimer are all down noticeably. First time seeing I/O wait at all. Now getting > 2 MBps participating traffic at just 50% CPU on ARM32. You might verify this yourself on a busy router. Any such "micro-optimizations" are highly welcome. So now it's probably wait-and-see for a long time.

Regarding the limits on TCP conns, I was misled by a remark from the devs, probably zzz, years ago that the 700 limit was chosen with respect to consumer-type routers. In fact with 800+ conns my entire internet access slowed to a crawl. I bought a bigger box, same game at 2.000 conns. Then I bought this EdgeRouter, advertised at 10.000+, and got 2.500+ from i2p at 5% router CPU. I can't reproduce this currently because throughout this winter I will be sitting on a line with the IP changing daily. So currently I am getting 1.300+ max.

comment:19 Changed 8 months ago by zzz

Yeah, if you're trying to put huge numbers of connections into an ARM machine, that's good to know, and maybe lead with that if you're filing any other tickets. It's easy for us to be confused and start looking for bugs when we're simply trying to get rid of performance bottlenecks. Also, just so we can stop guessing, please include the full version information from the top of /logs; that's especially important when investigating performance issues.

One thing I do see in the stack traces is a LOT of threads stuck in the middle of SHA256. I wonder if the JVM's implementation is particularly slow or has some spot where it's naturally getting preempted.

comment:20 Changed 8 months ago by jogger

I have not seen this SHA256 thing in stack dumps again after the patch. The guys at wolfCrypt offer a native JNI and JCE solution for replacing the standard Sun security provider, but it is probably not worth the hassle.

After the patch NTCP Pumper runs low on CPU, with little time spent in the EPoll at 1.400 TCP conns.

I think we are still digging at the same topic, call it bug or bottleneck. zab's patch is a complete game changer: lower garbage collection, the charts in jconsole look noticeably different and, most importantly, context switches are up drastically.

Most major threads (NTCP*, UDP*) now seem to run one packet at a time. At 50% CPU, with the .5 ms delay for transfer to the next CPU, throughput is stuck at around 1.000 packets each for TCP and UDP. I looked at the CPU usage of the threads and found most invocations run for only .1 ms or less. This clearly defeats part of the scheduler, which is designed to deal with time slices of .75 ms and upward.

I played a bit with taskset and found a limit of 10.000+ ctx/sec. I used taskset to assign all NTCP*, UDP*, Tunn* threads (compiled with 1 thread each) to dedicated cores at higher priority to avoid the transfer penalty. This raised top speeds from 2 MBps at 50% CPU to 3 MBps at 70% CPU, driving context switches further up in a linear fashion. Example: NTCP Pumper going from 1.000 voluntary switches to 1.500/sec. See below the number of non-vol and vol ctx switches for a 10 sec period at 3 MBps.

NTCP_Pump 75 15903
NTCPSendF 1 14033
NTCP_writ 42 12318
UDPReceiv 1 5121
UDPSender 29 4704
UDP_Packe 2739 3900
SimpleTim 1814 3014
NTCP_read 13 2522
UDP_messa 4994 2106
Tunnel_GW 192 1959
UDP_packe 74 1498
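
A sketch of the kind of per-thread pinning described above (the thread name, core number, nice value and PID lookup are illustrative, not jogger's exact setup):

# Pin one packet-processing thread to a dedicated core and raise its priority.
javapid=$(pgrep -f i2ptunnel | head -n 1)
tid=$(ps hH -p "$javapid" -o 'tid comm' | grep "NTCP Pumper" | awk '{print $1}')
sudo taskset -cp 2 "$tid"      # restrict this thread to CPU core 2
sudo renice -n -5 -p "$tid"    # on Linux, renicing a TID affects only that thread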

comment:21 Changed 7 months ago by jogger

Long story short: my tests indicate that implementing #2381 and #2382 will solve the problem. I ran equivalent patches for days and never saw the CPU hikes and widely varying CPU usage in NTCP Pumper, UDP message receiver and Tunnel GW Pumper that I had for three weeks before, when the bug in the OP hit 3 times (I was logging 1000-second samples of time spent in kernel and user space for individual threads). The patch above alone was an improvement but not a final solution.

Example: the variation over time of Tunnel GW Pumper's relative CPU usage in user space went down from >350% to <20%.

comment:22 Changed 7 months ago by slumlord

I have set up and started an I2P Router on an RPi3 running pi64 - https://github.com/bamarni/pi64

The router is running stock 0.9.38, no changes to the router other than configuring networking to ensure reachability and setting bandwidth limits.

It initially ran for around 24 Hours before my consumer-grade router running dd-wrt started rebooting frequently (unsure if this is related to the I2P router). Before the networking issues, the I2P router had a bandwidth utilization of about 800 KBps both rx & tx. There were no issues on the RPi3 and no issues with high usage for a single thread as described in this ticket. I will continue to run the router and check if I can reproduce the issue.

I spent some time updating my router's firmware (which had been very stable up till now) and improving the configuration. The I2P router is back online and currently routing about 300 KBps of traffic both rx & tx, around 400-500 participating tunnels, load average is around 1 ± 0.5, memory usage around 100 MB.

jogger, if you have made any other changes to your router or OS, please share them so that I may apply them to my router & OS as well. Thanks.

comment:23 Changed 6 months ago by zzz

Add a subticket #2382.

comment:24 Changed 6 months ago by zzz

Add a subticket #2381.

comment:25 Changed 6 months ago by jogger

I am currently running with all the WIP patches for my open tickets implemented, so that is more than hard to share.

Since running patches equivalent to the above subtickets, problems have not reoccurred for me. I must admit that I do frequent restarts to check my changes.

What I think still exists, but have not traced down, is a clear 1% or larger hike after 24-48 h uptime in the ratio (NTCP Pumper CPU spent in user space) / (total i2p CPU). Maybe this can be accelerated by using taskset to assign i2p less CPU than it needs.

comment:26 Changed 6 months ago by jogger

Here I have proof of the effect I am talking about. It is 100% reproducible and irreversible. The following log tracks the relative CPU usage of NTCP Pumper and Tunnel GW Pumper, separated into user space and kernel calls. There is a 10% relative hike for NTCP Pumper CPU spent in user space compared to the kernel calls, beginning at the 6.6 reading.

The hike occurred after 18 h uptime; it may or may not coincide with an IP change. I would think the two are unrelated, as the hike occurs before the CPU drop due to the disconnect.

 user.NCTP.CPU%: 6.1 kernel.NTCP.CPU%: 6.2 user.T.GW.CPU%: 9.2 kernel.T.GW.CPU%: .96
 user.NCTP.CPU%: 5.7 kernel.NTCP.CPU%: 5.8 user.T.GW.CPU%: 11.1 kernel.T.GW.CPU%: 1.07
 user.NCTP.CPU%: 5.8 kernel.NTCP.CPU%: 6.0 user.T.GW.CPU%: 10.5 kernel.T.GW.CPU%: 1.05
 user.NCTP.CPU%: 6.1 kernel.NTCP.CPU%: 6.2 user.T.GW.CPU%: 9.6 kernel.T.GW.CPU%: 1.03
 user.NCTP.CPU%: 6.1 kernel.NTCP.CPU%: 6.0 user.T.GW.CPU%: 8.9 kernel.T.GW.CPU%: .94
 user.NCTP.CPU%: 6.6 kernel.NTCP.CPU%: 6.2 user.T.GW.CPU%: 9.6 kernel.T.GW.CPU%: .99
 user.NCTP.CPU%: 4.4 kernel.NTCP.CPU%: 4.0 user.T.GW.CPU%: 8.6 kernel.T.GW.CPU%: .91
 user.NCTP.CPU%: 4.7 kernel.NTCP.CPU%: 4.2 user.T.GW.CPU%: 10.2 kernel.T.GW.CPU%: .99
 user.NCTP.CPU%: 5.0 kernel.NTCP.CPU%: 4.4 user.T.GW.CPU%: 10.7 kernel.T.GW.CPU%: 1.01
 user.NCTP.CPU%: 4.9 kernel.NTCP.CPU%: 4.4 user.T.GW.CPU%: 9.1 kernel.T.GW.CPU%: .91
Last edited 6 months ago by jogger (previous) (diff)

comment:27 Changed 6 months ago by jogger

I finally managed to force NTCP Pumper into using all available CPU and sticking in this state:
Running ARM32 with 2000 tunnels and 1500 TCP conns, unrestricted settings.

After seeing the above effect I used taskset to exercise CPU pressure (allocating 200%, down from 800%). NTCP Pumper CPU went up immediately and garbage collection started fighting harder (according to top and the jconsole charts). There was now another 50% relative hike for NTCP Pumper CPU spent in user space compared to the kernel calls. Garbage collection consumed 50-100% CPU. This went on for some hours.

Then I lifted the CPU restrictions. GC immediately went down from 50-100% to 10%. NTCP Pumper jumped up to > 80% CPU on a single core. But no lockup anymore; other threads were unaffected. The router runs 2 MBps now; just NTCP Pumper is high on CPU.

Log from above continued with comments inserted - intervals are 1000 seconds:

 user.NCTP.CPU%: 4.9 kernel.NTCP.CPU%: 4.4 user.T.GW.CPU%: 9.1 kernel.T.GW.CPU%: .91
 user.NCTP.CPU%: 4.8 kernel.NTCP.CPU%: 4.3 user.T.GW.CPU%: 9.8 kernel.T.GW.CPU%: .94
 user.NCTP.CPU%: 5.3 kernel.NTCP.CPU%: 4.8 user.T.GW.CPU%: 10.3 kernel.T.GW.CPU%: .99
 user.NCTP.CPU%: 5.5 kernel.NTCP.CPU%: 5.0 user.T.GW.CPU%: 9.5 kernel.T.GW.CPU%: .97
>>> CPU limits enforced during the following period :
 user.NCTP.CPU%: 6.9 kernel.NTCP.CPU%: 5.5 user.T.GW.CPU%: 7.8 kernel.T.GW.CPU%: .95
 user.NCTP.CPU%: 7.4 kernel.NTCP.CPU%: 5.7 user.T.GW.CPU%: 6.8 kernel.T.GW.CPU%: .94
 user.NCTP.CPU%: 9.6 kernel.NTCP.CPU%: 7.2 user.T.GW.CPU%: 6.5 kernel.T.GW.CPU%: .94
 user.NCTP.CPU%: 8.8 kernel.NTCP.CPU%: 6.8 user.T.GW.CPU%: 7.2 kernel.T.GW.CPU%: 1.02
 user.NCTP.CPU%: 8.0 kernel.NTCP.CPU%: 6.1 user.T.GW.CPU%: 6.0 kernel.T.GW.CPU%: .93
 user.NCTP.CPU%: 9.6 kernel.NTCP.CPU%: 7.2 user.T.GW.CPU%: 6.8 kernel.T.GW.CPU%: .98
 user.NCTP.CPU%: 7.8 kernel.NTCP.CPU%: 6.1 user.T.GW.CPU%: 7.2 kernel.T.GW.CPU%: 1.02
 user.NCTP.CPU%: 8.5 kernel.NTCP.CPU%: 5.7 user.T.GW.CPU%: 6.7 kernel.T.GW.CPU%: 1.02
 user.NCTP.CPU%: 9.0 kernel.NTCP.CPU%: 5.8 user.T.GW.CPU%: 6.7 kernel.T.GW.CPU%: 1.00
 user.NCTP.CPU%: 9.1 kernel.NTCP.CPU%: 5.9 user.T.GW.CPU%: 7.3 kernel.T.GW.CPU%: 1.03
 user.NCTP.CPU%: 7.7 kernel.NTCP.CPU%: 5.0 user.T.GW.CPU%: 6.9 kernel.T.GW.CPU%: .98
 user.NCTP.CPU%: 8.0 kernel.NTCP.CPU%: 5.2 user.T.GW.CPU%: 6.5 kernel.T.GW.CPU%: .94
 user.NCTP.CPU%: 9.3 kernel.NTCP.CPU%: 5.7 user.T.GW.CPU%: 6.9 kernel.T.GW.CPU%: .99
 user.NCTP.CPU%: 8.7 kernel.NTCP.CPU%: 5.5 user.T.GW.CPU%: 6.7 kernel.T.GW.CPU%: .99
 user.NCTP.CPU%: 10.1 kernel.NTCP.CPU%: 6.2 user.T.GW.CPU%: 7.0 kernel.T.GW.CPU%: 1.00
 user.NCTP.CPU%: 8.5 kernel.NTCP.CPU%: 5.4 user.T.GW.CPU%: 6.1 kernel.T.GW.CPU%: .94
>>> CPU limits removed during the following period :
 user.NCTP.CPU%: 14.7 kernel.NTCP.CPU%: 9.7 user.T.GW.CPU%: 7.9 kernel.T.GW.CPU%: .91
 user.NCTP.CPU%: 16.4 kernel.NTCP.CPU%: 10.9 user.T.GW.CPU%: 9.0 kernel.T.GW.CPU%: .93
 user.NCTP.CPU%: 17.3 kernel.NTCP.CPU%: 11.3 user.T.GW.CPU%: 8.3 kernel.T.GW.CPU%: .88

slumlord should be able to reproduce this by executing "taskset -a -p 1 <javapid>" (or -p 3) after the router is up to speed and "taskset -a -p f <javapid>" a few hours later. To watch, use "top -H", jconsole and the script from comment 12.

comment:28 Changed 6 months ago by zzz

Sure, the EventPumper thread relies on the DH precalc threads to create keypairs. If you artificially starve the precalc threads, the pumper has to generate keypairs in its own thread and it falls behind. We'll wake up the DH thread when this happens to generate more, but if you don't let it catch up, we'll be stuck.

Perhaps for massive numbers of connections we should let the DH threads get further ahead, but you'd have to look at the stats to help us tune it. Not clear from comment 26 if this is a real problem or just a demo.

comment:29 Changed 6 months ago by jogger

I will give any help if you advise me what to analyse or what to report.

The above is not a demo, but a real problem. I have been logging NTCP Pumper CPU usage constantly for 6 months now and know how it behaves.

a) The persistent 10% relative hike in NTCP Pumper user-space CPU within the first 48 h of uptime comes out of the blue; after the last patches my systems run below 50% total CPU all the time, and it is still there.

b) The CPU pressure thing showed up when I tested my patch for #2412 under CPU pressure. You gave a good reason for the intensive GC going on (possibly caused by the pumper generating keys), but that has to be verified. In any case, CPU stays high afterwards and does not go down. There is definitely something going wrong. See also https://trac.i2p2.de/ticket/1943#comment:13

To put it another way: if we fix this thing, then anybody will be able to run i2p with limited CPU.

comment:30 Changed 6 months ago by Zlatin Balevsky

So this is where it gets very difficult to tell what is going on without a proper profiler. If you absolutely don't want to try YourKit, there is Oracle's "Mission Control", which is free but only works on Oracle JVMs, not OpenJDK. Lastly, there's always the jstack-in-a-loop approach, but that doesn't generate any useful info unless you're at 100% CPU.

comment:31 Changed 6 months ago by jogger

I now repeated the CPU pressure exercise with #2434 implemented. Garbage collection progressed unchanged; NTCP Pumper CPU was down, versus up in the previous exercise. I was now able to get 2.5 MBps top speed out of just 2 ARM32 cores.

After lifting the restrictions everything was back to normal. Except for the 10% hike, which needs multi-day logging to rule out, I consider this ticket done.

comment:32 Changed 6 months ago by jogger

The CPU hike reappeared much faster and more significantly with #2434 implemented.

@zab: It is nice that you recommend your advanced tools, but I can't profile what I don't understand first. So good old code analysis did it. It turns out that there is a second lockup hidden in the failsafe stuff.

When the keyset size rises along with the CPU load, a point is reached where the else clause cannot fire any more. With no more sleep(), even my ARM32 drives thousands of loops per second through the EPoll. Proof that the EPoll is darn fast even on low-end systems.
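
One rough way to check for such busy-looping (a sketch: pumpertid as found by the script in comment 12, strace needs ptrace permissions, and some JVMs use epoll_pwait instead of epoll_wait):

# Count the epoll calls made by the NTCP Pumper thread over a 10 second window.
timeout -s INT 10 strace -c -e trace=epoll_wait,epoll_pwait -p "$pumpertid"
# Thousands of calls per second, each returning almost immediately, would confirm the spinning loop.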

As there is a TO-DO regarding the sleep() anyway, I have a patch in testing that reduces the failsafe code and provides more sleep() and less looping, thus driving CPU down on all platforms.

Once I have load-tested that patch, I will add it to #2434. Then we are finally done with this never-ending story.

comment:33 Changed 6 months ago by zzz

See my thoughts over on #2434, but the goal is to make incremental improvements. We've been working on the pumper for years. We may never be "done".

comment:34 Changed 6 months ago by slumlord

My (home) router is running stable now, after the repeated reboots that were happening.

My I2P router was running in a stable manner as well though it seems like my RPi3 had an unexpected reboot recently. Not sure what the cause was, perhaps it overheated, perhaps it ran out of memory & swap. I thought I had some heatsinks for the RPi3 lying around but have not been able to find them.

My I2P router is running version 0.9.38-6; peak participating traffic has been ~1.5 MBps with around 1.5K participating tunnels. I have so far not been able to observe the behavior described in this ticket - i.e. 100% usage of a single thread in top -H - without applying any resource "starvation" to the I2P router. I will continue monitoring my I2P router & RPi3 OS.
