Opened 5 years ago

Last modified 3 years ago

#1943 needs_work defect

Arm: High CPU usage and participating bandwidth drop

Reported by: subatomic Owned by:
Priority: minor Milestone: undecided
Component: router/general Version: 0.9.28
Keywords: arm reliability participation Cc:
Parent Tickets: Sensitive: no


After about 48 hours of uptime, CPU usage becomes high and spiky.
Participating bandwidth drops from ~8 Mbit/s to ~2 Mbit/s.
Client tunnels work fine.
Restarting helps for another 48 hours.
There are no suspicious messages in wrapper.log.
log-router-0.txt contains many similar messages:

ERROR [te #18593]: ] Invalid datagram received. Error loading datagram
     Caused by: EOF after reading 0 bytes of 256 byte value
ERROR [te #18593]: ] .datagram.I2PDatagramDissector: Error loading datagram EOF after reading 0 bytes of 256 byte value
ERROR [te #18593]: ] Can't get new message from I2PSession.
     net.i2p.client.I2PSessionException: Error decompressing message
     Caused by: invalid distance too far back

Munin graphs:
CPU usage - http://ah4r4vzunzfa67atljlbrdgtg3zak5esh7ablpm6xno6fhqij35q.b32.i2p/sha1/b631c7f989c0220d99137df380830a77c5f6539a/cpu-week.png
Bandwidth - http://ah4r4vzunzfa67atljlbrdgtg3zak5esh7ablpm6xno6fhqij35q.b32.i2p/sha1/5e86505b8a502b6cfb0a5aa0f79d7229639ce930/if_eth0-week.png

System info excerpts:
root@host:~# free -ho
             total       used       free     shared    buffers     cached
Mem:          2.0G       783M       1.2G        15M        19M       158M
Swap:         127M         0B       127M
root@host:~# cat /etc/issue
Debian GNU/Linux 8 \n \l
I2P version: 0.9.28-0-1ubuntu1
Java version: Oracle Corporation 1.8.0_121 (Java(TM) SE Runtime Environment 1.8.0_121-b13)
Wrapper version: 3.5.25
Server version: 8.1.16.v20140903
Servlet version: Jasper JSP 2.1 Engine
JSTL version: standard-taglib 1.2.0
Platform: Linux arm 3.4.113-sun8i
Processor: (armcortexa7)
Jbigi: Locally optimized native BigInteger library loaded from file
Jbigi version: 2
GMP version: unknown
Encoding: UTF-8
Charset: UTF-8


Change History (13)

comment:1 Changed 5 years ago by zzz

Component: unspecified → router/general
Status: new → open

Could be related to the years-old NTCP Event Pumper issue. Coincidentally I just closed a few tickets about those (#1411 #1075 #551) as finally fixed, after no reports in a while. Reviewing those tickets may be helpful.

The router log errors above are unrelated (see #1915).

To see which thread is eating all the CPU, run top with the option to show all threads separately. Get the PID of that thread, convert it to hex, take a thread dump in I2P, and search for that "nid". Copy the trace for that thread here.
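The steps above can be sketched in shell; the TID 18593 and the dump path are hypothetical placeholders, not values from this ticket:

```shell
# 1. Run `top -H` (show each thread separately) and note the TID of the
#    thread that is eating the CPU.
# 2. Convert that TID to hex; Java thread dumps list it as nid=0x...
tid=18593                        # hypothetical TID read from top
nid=$(printf 'nid=0x%x' "$tid")
echo "$nid"                      # prints: nid=0x48a1
# 3. Take a thread dump from the I2P console (or send SIGQUIT / kill -3
#    to the router JVM) and search the dump for that nid:
# grep -A 20 "$nid" /path/to/wrapper.log
```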

Is your participating bandwidth 8 Mbit as stated in the OP, or 8 MByte (all stats in our console are in bytes)?

A workaround that will probably work is to limit your bandwidth on the /config page in the console. After experimenting, you will probably find a value that keeps the router from tipping into this overload situation.

There may also be issues specific to ARM here. What hardware platform is this?

comment:2 Changed 5 years ago by subatomic

I have narrowed down the cause to an option in wrapper.config.
Commenting this option out solves the issue for me. The reason I uncommented it in the first place is that the explanation next to it promised higher performance.

My bandwidth settings are 9000/9000 KB/s in/out (~70 Mbit/s) and 100% share. I really do have that much bandwidth. Normally my participating bandwidth is around 700 KByte/s (~8 Mbit/s).

The ARM hardware platform is Orange Pi Plus 2E based on Allwinner H3 SoC.

I will reproduce the issue and post a thread dump as asked.

comment:3 Changed 5 years ago by subatomic

Status: open → testing

I have reproduced the problem and checked thread CPU usage with htop. No single thread showed abnormal CPU usage. Let me know if I can provide any other info.

comment:4 Changed 4 years ago by jogger

Reproduced the problem with JDKs up to 8 ea 162, running 0.9.32 on an Odroid HC1 (32-bit 8-core big.LITTLE) with Armbian 5.36 Jessie. Identical results. According to the system monitor there are no traffic or CPU peaks immediately preceding the lockup. When the lockup occurs, nothing unusual can be seen in jconsole or the log. It is also not caused by overheating, as the CPU is throttled down.

Client mode produces frequent random JVM crashes for me; this is load-dependent.

I will implement a cron job that simply restarts I2P when CPU usage averages more than 90% for a minute.
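A minimal sketch of such a watchdog, using the 1-minute load average as a rough stand-in for "90% CPU for a minute"; the script layout, threshold and commented-out restart command are assumptions, not part of any I2P install:

```shell
#!/bin/sh
# Hypothetical watchdog -- run from cron every minute, e.g.:
#   * * * * * /usr/local/bin/i2p-watchdog.sh
# Restarts I2P when the 1-minute load average, as a percentage of all
# cores, exceeds the threshold.
THRESHOLD=90
CORES=$(nproc 2>/dev/null || echo 1)
LOAD1=$(awk '{print $1}' /proc/loadavg)    # 1-minute load average
PCT=$(awk -v l="$LOAD1" -v c="$CORES" 'BEGIN { printf "%d", (l * 100) / c }')
echo "1-min load ${LOAD1} on ${CORES} cores = ${PCT}%"
if [ "$PCT" -gt "$THRESHOLD" ]; then
    # i2prouter restart    # uncomment; the restart command depends on the install
    echo "overloaded, would restart I2P"
fi
```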

comment:5 Changed 4 years ago by zzz

Status: testing → needs_work
Summary: High CPU usage and participating bandwidth drop → Arm: High CPU usage and participating bandwidth drop

comment:6 Changed 4 years ago by zzz

jogger - OpenJDK or Oracle? OpenJDK is terrible on ARM; you need to use the Oracle JVM.

comment:7 Changed 4 years ago by jogger

I tested Oracle JDK 8u152 and the 8u162 early-access build, with identical results.

Oracle compiles those with an ancient prerelease of GCC 4.7.2 - absolutely unprofessional.

I would like to know whether it would make sense to try building OpenJDK 9 myself on this 32-bit platform. OpenJDK 8 is better than Oracle on 64-bit ARM, but much worse on 32-bit.

comment:8 Changed 4 years ago by jogger

I think we are seeing build issues here. Any research on Oracle Java 8 is wasted time given their crappy toolchain. I compiled OpenJDK 9: crash after 9 hours. I then compiled I2P against it: it has run solid for more than 2 days now at record throughput levels. I will not go back to Java 8; this speed is worth even frequent crashes.

Next steps on ARM32: I will try archlinux on an RPi2, which ships JDK 9 as standard, and build jbigi, the wrapper and I2P there. If it works out, I will migrate one of my HC1 boxes to archlinux, build there and test again.

comment:9 Changed 4 years ago by jogger

Test results for ARM32: I2P (compiled against Java 9 on the machine) with Java 9 crashes consistently on RPi2/archlinux after about 12 hours of moderate traffic. I tried both the cross-compiled version from the archlinux repos and a build I made myself on the machine from the OpenJDK 9 Mercurial sources.

On the Odroid HC1 / Armbian 5.3.6 I get more than 4-5 days of high class-X throughput before the crash with Java 9 compiled from the OpenJDK 9 Mercurial sources. So restarting I2P every 4 days currently works around the problem.

I welcome any advice on what else I could test. Also, the latest commit to the OpenJDK 9 Mercurial repository is from August. Maybe newer versions resolve the current problems? Where can I get those? I googled a lot and found nothing.

comment:10 Changed 4 years ago by jogger

Now tested with 0.9.33 compiled for Java 9, using Java 9.0.4 compiled on the machine:

frequent crashes on RPi2/archlinux - giving up.

I upgraded my Odroid HC1 machines to Armbian stretch and compiled Java 9.0.4: it still always crashes after more than 4 but less than 6 days of uptime.

comment:11 Changed 4 years ago by jogger

Since the Armbian crashes seem to occur at a certain frequency, I suspect some table or counter is overflowing.

Recently I increased my TCP connection count by more than a factor of two, and the crashes came more than twice as fast. There was no other major change in settings in this period, so this could be worth a look.

2018/03/04 18:26:02 | Launching a JVM…
2018/03/04 18:26:02 | OpenJDK Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
2018/03/04 18:26:03 | WrapperManager: Initializing…
2018/03/04 18:26:05 | Starting I2P 0.9.33-0
2018/03/04 18:26:05 | INFO: Locally optimized library loaded from file
2018/03/04 18:26:07 | Warning: Java version: 9.0.4-internal OS: Linux arm 4.9.61-odroidxu4
2018/03/04 18:26:07 | Warning: OpenJDK is not recommended for ARM. Use Oracle Java 8
2018/03/04 18:26:07 | Warning: Java 9 support is beta, and not recommended for general use
2018/03/06 11:25:02 | TERM trapped. Shutting down.
2018/03/06 11:25:02 | CRIT [r 1 shutdown] net.i2p.router.Router : Shutting down the router…
2018/03/06 11:25:02 | CRIT [r 1 shutdown] net.i2p.router.Router : Starting final shutdown(3)
2018/03/06 11:25:56 | CRIT [r 1 shutdown] net.i2p.router.Router : Shutdown(3) complete
2018/03/06 11:25:57 | <-- Wrapper Stopped

comment:12 Changed 3 years ago by jogger

I have to correct myself. The problem in the OP still exists in JDK 9.0.4+12; it took months to surface. Since OpenJDK 9 performs so much better, with my old network it probably just did not get beyond the tipping point, which I estimate at above 4000 connections, 2500 tunnels and 90% CPU. Now trying JDK 10.

comment:13 Changed 3 years ago by jogger

I finally found the culprit: it is the NTCP Pumper. I played around with taskset and schedtool, thereby overloading the machine. At some point NTCP Pumper went over 90% CPU, with symptoms exactly as described above. I did a graceful restart, and its CPU usage declined very slowly. When all other threads had terminated, it still showed 65% in top -H. So it must be some kind of endless loop.
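A hedged sketch of that overload experiment; the PID is a placeholder, and the commands are built as strings (printed, not executed) so you can review them before running:

```shell
# Hypothetical reproduction sketch: pin the router JVM to one core and/or
# drop its priority so a single busy thread can saturate it.
JVM_PID=${JVM_PID:-12345}               # placeholder; use the real router JVM pid
PIN_CMD="taskset -cp 0 $JVM_PID"        # restrict the whole JVM to CPU 0
NICE_CMD="schedtool -n 19 $JVM_PID"     # and/or give it the lowest priority
echo "$PIN_CMD"
echo "$NICE_CMD"
# After running these, watch per-thread usage with `top -H -p $JVM_PID`
# and look for "NTCP Pumper" climbing toward 100%.
```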

Note: See TracTickets for help on using tickets.