Opened 8 years ago

Closed 7 years ago

Last modified 7 years ago

#519 closed defect (not a bug)

I2P hogs CPU when unable to connect to peers

Reported by: mk
Owned by: zzz
Priority: minor
Milestone: 0.9
Component: router/transport
Version: 0.8.8
Keywords: GNU Classpath sucks
Cc:
Parent Tickets:
Sensitive: no

Description

When using JamVM (Gentoo: jamvm-1.5.4, gnu-classpath-0.98+gmp), I2P stalls the CPU at 100% when unable to connect to peers, either due to lack of network connectivity, or due to clock difference that was not fixed. Apparently, the router enters a tight loop in this case.

This is very problematic when starting I2P as a service automatically from network manager, and without console or other apps (Liberté Linux).

In addition, it would be very desirable if I2P could recover from its 100% CPU state when finally able to connect to peers (or when clock is set to correct time, but that's a different issue).

According to zzz @ #i2p-dev, this is easy to reproduce with i2p.vmCommSystem=false.

Configuration is available at /opt/i2p (wrapper) and /var/lib/i2p (router). Nothing appears in the logs at the configured levels during the tight loop.

Attachments (5)

wrapper-log (5.4 KB) - added by mk 8 years ago.
wrapper: NTCP Pumper exceptions
router-log (2.4 KB) - added by mk 8 years ago.
router: NTCP Pumper exceptions
wrapper-log-2 (1.3 KB) - added by mk 8 years ago.
wrapper: NonReadableChannel exceptions
router-log-2 (6.7 KB) - added by mk 8 years ago.
router: NotYetConnected and other exceptions
openJDK7-x86_64-wrapper.tar.gz (139.1 KB) - added by Astral2012 7 years ago.
60-75% CPU usage when disconnects are not detected. Modern GNU/Linux environment.

Change History (45)

comment:1 Changed 8 years ago by mk

Summary: I2P hogs CPU when uunable to connect to peers → I2P hogs CPU when unable to connect to peers

comment:2 Changed 8 years ago by zzz

Component: router/general → router/transport
Milestone: 0.8.9 → 0.9
Owner: set to zzz
Priority: major → minor
Status: new → accepted

Correction, to reproduce use i2p.vmCommSystem=true

This has been the behavior for approximately forever. Pretty sure it's the tunnel-building loop. Would be nice to fix it.

comment:3 Changed 8 years ago by mk

It's probably easy to pinpoint the busy loop location with VisualVM. I think it's a major problem, since for instance, on laptops the battery will be spent for no reason whenever there is no network access (when say, the interface is up and I2P has been started).

comment:4 Changed 8 years ago by zzz

Whether we call it minor or major won't really change how soon it gets fixed.

It's almost certainly the tunnel BuildExecutor loop.

On my Atom netbook the usage ranges from 15-40% (of one CPU) with i2p.vmCommSystem=true. Probably gnu/jamvm is part of why you're at 100%.

As we discussed on IRC, one possibility is to hook into dbus on linux, like firefox does, to monitor network connection state. That's a big job though.

comment:5 Changed 8 years ago by mk

I think that monitoring the network connection state is not I2P's responsibility - I2P can be started and stopped by the network manager, if necessary. But on the other hand, 15% of CPU usage when idle is just as unacceptable as 100%, in my opinion. Whether one JVM handles a busy loop better than the other is unimportant - the core problem is the busy loop. If I2P is to be used in unattended setups, such issues are show-stoppers.

comment:6 Changed 8 years ago by zzz

There is a delay in the build loop.

If the router doesn't know (because it isn't "I2P's responsibility" to know) whether the network is connected, how will it know whether to aggressively keep trying to build tunnels, or wait longer than usual in the loop?

I'm looking at it but it isn't clear how to do what you want cleanly.

comment:7 Changed 8 years ago by zzz

I've made two changes in 0.8.8-16. They are partial solutions because, as explained above, we don't know for sure when the network is connected.

1) Limit number of parallel builds on slow machines
2) Don't retry immediately if a build fails quickly, as it is a symptom of a connection problem.

Please test.
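
The second change can be pictured roughly as follows. This is only a sketch of the idea, not the code that went into 0.8.8-16: the class name BuildBackoff, the 500 ms threshold for a "quick" failure, and the delay bounds are all assumptions made for illustration.

```java
/**
 * Hypothetical sketch: pick the delay before the next tunnel build attempt.
 * Every name and constant here is an illustrative assumption, not an I2P value.
 */
final class BuildBackoff {
    private static final long QUICK_FAILURE_MS = 500;  // assumed "failed quickly" threshold
    private static final long MIN_DELAY_MS = 1_000;    // assumed normal delay between builds
    private static final long MAX_DELAY_MS = 60_000;   // assumed upper bound on the delay

    private long delayMs = MIN_DELAY_MS;

    /**
     * Record the outcome of one build attempt and return how long to wait
     * before the next one. A failure that comes back almost immediately is
     * treated as a symptom of a connection problem, so the delay doubles.
     */
    long nextDelay(boolean succeeded, long elapsedMs) {
        if (succeeded || elapsedMs >= QUICK_FAILURE_MS) {
            delayMs = MIN_DELAY_MS;                      // normal case: no extra waiting
        } else {
            delayMs = Math.min(delayMs * 2, MAX_DELAY_MS); // quick failure: back off
        }
        return delayMs;
    }
}
```

With these assumed constants, repeated quick failures stretch the wait from 2 seconds up to the 60 second cap, and a success or a slow failure resets it to 1 second.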

comment:8 Changed 8 years ago by mk

Will test, but I suspect that there is another busy loop, perhaps in the time synchronization part. After disabling the patches in ticket 522, I observed full CPU load even when I2P has built tunnels and works correctly (i.e., eepSites are accessible). I neglected to mention that the CPU load originally reported here was on only one of two CPUs, whereas the load I observed recently (after removing the patches and starting I2P when the clock is already correct) was on both CPUs.

Or perhaps it's the parallel builds that you mention, that hog the CPU even when the time is correct (especially since the tunnels are set up via Tor). So I will test after reapplying the patches from ticket 522, and then test your changes.

comment:9 Changed 8 years ago by mk

I tested with the patches from ticket 522 reapplied: still constantly high CPU load, with intermittent low load sometimes (even after several hours), both when going via Tor and when using direct connections, with TCP or UDP, more memory in -Xmx, fewer entries in netDb, etc. Most of the time I2P connections didn't work, too, so I removed the patches again.

So in summary, I observe high CPU load (both cores) most of the time - although I2P is started when the clock is already correct, the network is accessible, and no code is changed. I can't tell what changed compared to the situation previously when high CPU load mostly stopped after I2P has established tunnels. Perhaps it's the transition of most NTCP OR/OfR clients (which compose the netDb entries in my setup) to 0.8.8, but overall I have no idea.

I will next try to test 0.8.8-16 as you suggested.

I am also attaching some logs with exceptions observed while running I2P.

Changed 8 years ago by mk

Attachment: wrapper-log added

wrapper: NTCP Pumper exceptions

Changed 8 years ago by mk

Attachment: router-log added

router: NTCP Pumper exceptions

Changed 8 years ago by mk

Attachment: wrapper-log-2 added

wrapper: NonReadableChannel exceptions

Changed 8 years ago by mk

Attachment: router-log-2 added

router: NotYetConnected and other exceptions

comment:10 Changed 8 years ago by zzz

Thanks for the logs, I will take a look at them soon.

I'm struggling to keep track of what is going on with the high CPU, but apparently it's happening now even with peers connected and a good clock.

One thing for sure though, every time I've tried JamVM + gnu libs with I2P, the performance has been so bad it's essentially unusable. Maybe there's something we can do on our side to make it better, maybe not. Sure, we should try to fix the bugs, but in the meantime we strongly recommend Sun/Oracle or OpenJDK JVMs.

comment:11 Changed 8 years ago by mk

You can also test JamVM + I2P in my exact environment if you want: just run the 200 MiB ISO in a VM like VirtualBox. I2P is enabled by adding gentoo=i2p to the kernel command line by pressing Tab on a menu entry - I recommend the third entry for booting into console. The root user is accessible on the second terminal (Alt-Right). The firewall setup is in /usr/local/sbin/fw-reload, which can be just executed after changing (e.g., to change transparent I2P-to-Tor forwarding to a more sane rule that permits I2P traffic).

comment:12 Changed 8 years ago by zzz

Unchecked exceptions in NTCP EventPumper and I2PTunnelRunner now caught in 0.8.8-20.

I'll try running JamVM here when I get a chance. Somewhat busy fixing my own bugs atm, been creating a lot of them recently.

comment:13 Changed 8 years ago by mk

What's the easiest way to get 0.8.8-xx - by checking out from monotone? No snapshots similar to stable releases?

comment:14 Changed 8 years ago by zzz

Either checkout and build from monotone, or use echelon's builds by enabling unsigned development builds on the config update page with the URL http://echelon.i2p/update/i2pupdate.zip

comment:15 Changed 7 years ago by zzz

Resolution: not a bug
Status: accepted → closed

As best as I can tell reviewing the above, the behavior is not 'high CPU usage when no peers' but 'high CPU usage when using JamVM'. That matches my test results here. I do not have high CPU usage when the network is disconnected. I do have high CPU usage when I use JamVM. JamVM is absolutely and hopelessly slow.

Yes, the tunnel build loop does use a fair amount of CPU, that's due to the encryption required for tunnel build messages. Nothing we can do about that.

Closing this ticket as not-a-bug. If you have a specific (and fixable) place to point to where we are busy looping or other bad behavior (not when using JamVM!), please open a new ticket.

comment:16 Changed 7 years ago by mk

Keywords: RAM is cheap added

JamVM is absolutely and hopelessly slow.

That's not true — it works fine once I2P has established tunnels, for instance. Maybe you don't have gmp enabled in gnu-classpath. Maybe you have a busy loop where Oracle JRE introduces a delay, and JamVM doesn't.

Yes, the tunnel build loop does use a fair amount of CPU, that's due to the encryption required for tunnel build messages. Nothing we can do about that.

That doesn't make sense if the CPU load happens when network connectivity is lost. If you use encryption that's then discarded, in a loop, it's a bug.

Closing this ticket as not-a-bug.

I will also add a “RAM is cheap” tag. You know, the Java way.

comment:17 Changed 7 years ago by DISABLED

Keywords: GNU Classpath sucks added; RAM is cheap removed

Maybe you have a busy loop where Oracle JRE introduces a delay, and JamVM doesn't.

Let me enlighten you: if that statement is true, then JamVM is broken.

I will also add a “RAM is cheap” tag. You know, the Java way.

I changed the keywords to "GNU Classpath sucks". You know, the Reality way.

comment:18 in reply to:  17 Changed 7 years ago by Zlatin Balevsky

Signing in to take credit for the comment above. Arrogant trolls who have no idea what they're talking about sometimes get the better of me.

comment:19 Changed 7 years ago by mk

Let me enlighten you: if that statement is true, then JamVM is broken.

Not really, it would mean that you are relying on some undocumented or unintended feature of Oracle JRE, for instance. But it is far more likely that you are doing something trivially wrong, like encrypting messages to peers in a busy loop before making sure that a peer is actually reachable. Then, you run the code on a fast computer with highly optimized JIT JRE, do not see the slowdown, and expect me to find the problematic code portion for you in your code.

Arrogant trolls who have no idea what they're talking about sometimes get the better of me.

Perhaps you should reread the comment by zzz where he experienced 15-40% load on non-JamVM JRE. Then the load was lowered via some hack, and suddenly it's not a bug anymore. The wonders of Java world! But hey, I am just an arrogant troll, right?

comment:20 Changed 7 years ago by zzz

I wasn't waving my hand to make the problem go away - there was a real problem, reproduced by me, and it was fixed in 0.8.8-16, 13 months ago (see comment 7 above).

What remained was, for me, that JamVM (yes, more correctly, GNU classpath) is terrible - for me. If it isn't for you, great. But if JamVM is not poor for you, I don't understand what the ticket is about anymore. As I said above when I closed the ticket, I'm having a little trouble understanding what the issue even is anymore after 14 months.

I claim that I fixed the high CPU when the network is disconnected (the original issue) in 0.8.8-16, 13 months ago. If the only JVM it still happens on is JamVM, then I claim not-a-bug, and I explained why above. If I'm wrong, please let me know.

comment:21 in reply to:  19 ; Changed 7 years ago by zzz

Replying to mk:

But it is far more likely that you are doing something trivially wrong, like encrypting messages to peers in a busy loop before making sure that a peer is actually reachable.

Maybe, maybe not. The tunnel build message is fully generated and encrypted before it's sent to the first peer. The tunnel building is several layers above the transport code so it's architecturally a little messy. The real problem is that the only way to "make sure a peer is actually reachable" is to open a socket to it (for NTCP) or establish an SSU session for UDP. In other words, pre-establish the connection on one of our two transports. That also requires resources, including some crypto precomputation for DH exchange.

It's doable but not straightforward.

It sounds like your underlying theory is that finding out if the network is connected, or a peer is reachable, is "trivial" and won't use much resources. It's relatively straightforward to determine that a computer has no active network interfaces at all - only loopback addresses - but if it has a network address from the firewall, but the modem on the other side, providing the connection to the ISP, is down, that's problematic. Even in the first case though, we can probably do better to recognize it.

Don't get frustrated because you think the bug is "trivial". Maybe it is to you, and I'm not understanding, in that case I need more help. Or maybe it's not easy at all.

…and expect me to find the problematic code portion for you in your code.

No, we don't "expect" that. But help is always appreciated, if you are willing and able.

comment:22 in reply to:  19 Changed 7 years ago by Zlatin Balevsky

Replying to mk:

Let me enlighten you: if that statement is true, then JamVM is broken.

Not really, it would mean that you are relying on some undocumented or unintended feature of Oracle JRE, for instance.

For instance, I happen to have an intimate knowledge of the Oracle JDK as I have modified and deployed custom JVMs in an environment. There is no "undocumented feature" like the one you're referring to.

In fact, adding sleeps inside busy-loops is completely against the standard JVM specifications. Anything that claims to be a JVM needs to comply with those and if it doesn't, then it's not a JVM.

comment:23 Changed 7 years ago by Zlatin Balevsky

But it is far more likely that you are doing something trivially wrong, like encrypting messages to peers in a busy loop before making sure that a peer is actually reachable.

If you cannot reproduce this on an {Oracle,OpenJDK,Apple,IBM,JRockit,Android} jvm then it's far more likely that the inferior GNU Classpath "jvm" is doing something wrong.

comment:24 Changed 7 years ago by Zlatin Balevsky

@mk: you like to make accusations without having any knowledge of the facts whatsoever. Since you are clearly not part of the "java world" you are clearly not qualified to judge which jvms are good and which are bad.

Hence, you are an arrogant troll. Learn WTF you're talking about OR tone down your attitude OR be ignored.

@zzz: my recommendation (as you may have guessed already) is to explicitly drop support for GNU Classpath as well as any derivatives. It's a waste of time. Just like any further discussion on this ticket.

comment:25 in reply to:  21 Changed 7 years ago by mk

Replying to zzz:

Maybe, maybe not. The tunnel build message is fully generated and encrypted before it's sent to the first peer. The tunnel building is several layers above the transport code so it's architecturally a little messy.

So the architectural problem is clear, isn't it? Messages are encrypted to peers without making sure that the peers are reachable, which results in waste of resources in a busy loop in case of network going down.

The real problem is that the only way to "make sure a peer is actually reachable" is to open a socket to it (for NTCP) or establish an SSU session for UDP. In other words, pre-establish the connection on one of our two transports. That also requires resources, including some crypto precomputation for DH exchange.

In case of TCP at least, you can move the socket opening part (a TCP handshake without DH exchange) to the phase when a message is built, and then hand-off the connection to the lower layer.
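
As a rough illustration of that suggestion, the sketch below performs only the TCP handshake, with a timeout, and returns the connected socket so that a lower layer could take it over. It uses a plain blocking java.net.Socket; the class ReachabilityProbe and its method are hypothetical and do not reflect how the NTCP transport is actually structured.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical helper: check that a peer accepts TCP connections before doing any crypto. */
final class ReachabilityProbe {
    /**
     * Attempt a TCP handshake with the peer.
     *
     * @return the connected socket, ready to be handed to the transport layer,
     *         or null if the peer could not be reached within timeoutMs
     */
    static Socket tryConnect(InetSocketAddress peer, int timeoutMs) {
        Socket socket = new Socket();
        try {
            socket.connect(peer, timeoutMs);   // TCP handshake only, no DH exchange
            return socket;
        } catch (IOException e) {
            try {
                socket.close();                // release the descriptor on failure
            } catch (IOException ignored) {
                // nothing useful to do if close itself fails
            }
            return null;
        }
    }
}
```

A caller would build and encrypt the message only when tryConnect returns a non-null socket. The handshake still costs a round trip and a file descriptor per attempt, so this moves the resource cost rather than removing it.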

It sounds like your underlying theory is that finding out if the network is connected, or a peer is reachable, is "trivial" and won't use much resources.

I didn't intend to imply that a solution is trivial, only that the cause is probably simple, which it seems to be.

No, we don't "expect" that. But help is always appreciated, if you are willing and able.

I spent a lot of time going over the code in some other issues (with time sync, I think), but here it seems to be an architectural issue, and it's not possible to do anything without being thoroughly familiar with the design and the code.

comment:26 in reply to:  23 Changed 7 years ago by mk

Replying to zab:

If you cannot reproduce this on an {Oracle,OpenJDK,Apple,IBM,JRockit,Android} jvm then it's far more likely that the inferior GNU Classpath "jvm" is doing something wrong.

Android's Dalvik is not a JVM. GNU Classpath is not a JVM either — it is a library, that can use GMP for numeric computations, which is not compiled with support for advanced ISAs in my case. It's a simple speed difference. JamVM also has JIT turned off, because it is executed in a hardened environment. You know about those, right? Some people who run I2P like to use secure OSes.

you like to make accusations without having any knowledge of the facts whatsoever.

Sounds like I guessed the root cause correctly, though. You, on the other hand, went to venture about some completely arbitrary meaning of undocumented JVM behavior.

Since you are clearly not part of the "java world" you are clearly not qualified to judge which jvms are good and which are bad.

Does being a part of the “Java world” include knowing what a JVM is?

Seriously, I make one little snide remark about the Java ecosystem, and you accuse me of a thousand sins. I assure you I know more than you about inner workings of JVMs, but I didn't come here for a pissing contest. Did you?

comment:27 Changed 7 years ago by zzz

fyi we have a new ticket #743 that Jetty 6 (in I2P as of 0.9) does not work with gij, that may also affect JamVM. Will try to reproduce here soon.

comment:28 Changed 7 years ago by mk

FWIW, I ship I2P without Jetty in Liberté. I.e., the router console is disabled. Here are the shipped libraries for 0.9:

commons-logging.jar
i2p.jar
i2ptunnel.jar
mstreaming.jar
router.jar
streaming.jar

comment:29 Changed 7 years ago by zzz

JamVM worked fine for me with Jetty. JamVM 1.5.3 with I-don't-know-what version of classpath. With gij 4.6.3, Jetty was a definite fail. Fixing in 0.9.2-15.

For that and other reasons (Jetty NIO really is flaky on Java 5), the fix is to force the console Jetty to use non-NIO on Java 5, or on JamVM or gij.

You are the packager for I2P in Liberte? Didn't know that… I2P without a console can't be very easy for people… how do they manage it?

comment:30 Changed 7 years ago by zzz

And @zab - yes you have some good points - however you asked the other day on IRC for people to tell you when you were being a *. Chill out a little :)

comment:31 Changed 7 years ago by DISABLED

@mk "most likely you have a bug which is hidden by some undocumented feature that you don't know about"

How exactly do you think this is perceived? I know the non-java world is rife with undocumented features but don't extrapolate your personal experiences into areas you are simply not familiar with. Thank you for your contributions to i2p and FYI disabling jit usually destroys performance. I admit I was a * and I'm looking forward to working with you in the future.

@zzz thank you, I'm still human. Please don't ever hesitate to correct me.

—zab from the Swiss alps

comment:32 in reply to:  29 Changed 7 years ago by mk

Replying to zzz:

JamVM worked fine for me with Jetty.

Yes, it works fine, the console is just excluded to save memory and image space.

I2P without a console can't be very easy for people… how do they manage it?

Similarly to Tor, I guess. :) It runs as a daemon, and provides connectivity to .i2p network via Privoxy.

comment:33 in reply to:  31 ; Changed 7 years ago by mk

Replying to guest:

How exactly do you think this is perceived? I know the non-java world is rife with undocumented features but don't extrapolate your personal experiences into areas you are simply not familiar with.

Seriously… Different JVMs can wrap the same method around different OS calls, one resulting in a context switch and one not. So in a busy loop, you will have big CPU load difference. Just one example. Call it unspecified behavior if you want. Or, I don't know, JIT can optimize away a long loop doing nothing in one JVM, but not in another.

FYI disabling jit usually destroys performance.

But it is a good way to verify that a program works well on slow machines.

comment:34 in reply to:  33 Changed 7 years ago by Zlatin Balevsky

Seriously… Different JVMs can wrap the same method around different OS calls, one resulting in a context switch and one not. So in a busy loop, you will have big CPU load difference. Just one example. Call it unspecified behavior if you want. Or, I don't know, JIT can optimize away a long loop doing nothing in one JVM, but not in another.

Fair point, however a context switch will not cause a drop in cpu usage; it will just move it to the "system" type of cpu usage. Now if the system call itself is adding the sleep()'s then we need to be looking in a completely different direction but there are tools that can give that answer (strace, whatever).

FYI disabling jit usually destroys performance.

But it is a good way to verify that a program works well on slow machines.

Since I don't know anything about hardened systems (and 30 minutes on wikipedia are not going to make me an expert) I'm not going to make any comments on whether disabling jit is necessary there or not. If it is then it is, just be aware that there usually is a significant degradation in performance.

comment:35 Changed 7 years ago by zzz

Of course we should be JVM-agnostic, and work on slow machines. Our Android port has been incredibly helpful in finding issues.

Unfortunately the performance-improvement task is never done. For issues on lesser-used JVMs, it's a matter of priority. As you can see in #743, the console broke with gij two releases ago, and nobody noticed until now.

WRT the connect-before-encrypting-the-tunnel-message idea, it's very interesting. It's definitely difficult. Worth it? I don't know at this point.

Changed 7 years ago by Astral2012

60-75% CPU usage when disconnects are not detected. Modern GNU/Linux environment.

comment:36 Changed 7 years ago by zzz

@Astral2012 Router version? What kind of disconnect? At PC (ethernet cable) or farther away? How were they "not detected"?

comment:37 in reply to:  36 Changed 7 years ago by Astral2012

Replying to zzz:

@Astral2012 Router version?

1st time with 0.9.2-13.
2nd time with 0.9.2-15.

What kind of disconnect? At PC (ethernet cable) or farther away?

802.11n/g + DHCP, 100% packet loss but the interface was still up, IP address still attached.

How were they "not detected"?

Network reports that it's Ok. Tunnels still open.

comment:38 Changed 7 years ago by zzz

<Meeh> I got an improvement suggestion. If an i2p computer lost internet connection for a while, RIs will be removed from netdb since they are unavailable. would it be an idea to check if i2p is online once in a while to stop the rapid delete of RIs? or something like that.
<Meeh> if an i2p installation got 4k< of RIs, and for some mistake lost connection for 10-20min, it could result in <100 RIs, and it would need to reseed
<zzz> Meeh there's some code to do that now, and I added a simple network checker in -1, but there's possibly more improvement needed
<Meeh> see the point of removing dead RIs, but a feature for checking if 100> RIs have been deleted over a little time, it could check if it actually has internet connection before deleting more RIs
<zzz> ticket 519 is related
<Meeh> I will check it out and add the suggestion
<zzz> it's a tough problem, and the various mechanisms are a little ad hoc now. It needs attention if you'd like to research it further
<Meeh> true. It is also a question on what's the best way to do it
<zzz> what I added in -1 is a check for a network interface every few minutes. That doesn't handle upstream problems
<zzz> there's also something in the watchdog that looks for number of outbound UDP connection failures. If it's too many it assumes disconnected
<zzz> but I don't think that's used anywhere else
<zzz> better would be to assume connected every time a connection in or out succeeds on either transport
<zzz> or maybe even if something is acked
<zzz> and then declare disconnect after some time
<Meeh> acked as in tcp ACK?
<zzz> we don't have visibility to tcp acks
<zzz> but we do for udp, streaming, crypto tag sets, tunnel builds, etc
<zzz> so we could build up these sources and feed them to some central spot, perhaps
<zzz> and then use that central indication in lots of places
<zzz> so -1 is just a baby step toward that
<Meeh> but this will not work on i2p installs with only tcp enabled then?
<zzz> you could use some of the other indicators - NTCP connections in/out, and higher-layer stuff
<zzz> but yeah, gotta take all that into account, what's enabled, how much traffic, etc
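
The "central spot" idea from this log could be sketched as below: any subsystem reports evidence of connectivity (a UDP ack, a successful NTCP connection, a tunnel build reply), and consumers ask whether evidence has been seen recently. The class ConnectivityTracker, its method names, and the use of a single timeout are assumptions for illustration, not an existing I2P API.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical central connectivity indicator fed by several subsystems. */
final class ConnectivityTracker {
    private static final long NEVER = Long.MIN_VALUE;

    private final long timeoutMs;
    private final AtomicLong lastEvidenceMs = new AtomicLong(NEVER);

    /** @param timeoutMs how long to keep assuming "connected" after the last evidence */
    ConnectivityTracker(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    /** Called by any source that just saw proof the network works. */
    void recordEvidence(long nowMs) {
        // Keep the most recent timestamp even if reports arrive out of order.
        lastEvidenceMs.accumulateAndGet(nowMs, Math::max);
    }

    /** True if some source reported evidence within the timeout window. */
    boolean isProbablyConnected(long nowMs) {
        long last = lastEvidenceMs.get();
        return last != NEVER && nowMs - last <= timeoutMs;
    }
}
```

A consumer such as the job that expires RouterInfos could consult isProbablyConnected before deleting more entries, which is roughly the behavior Meeh asks for. How to weight the different sources, and what timeout suits a TCP-only router, is left open here just as it is in the discussion.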

comment:39 Changed 7 years ago by zzz

As mentioned above, 0.9.3-1 contains a rudimentary network monitor that only detects total lack of network interfaces. Will not detect upstream problems.
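
A check of that kind can be written with the standard java.net.NetworkInterface API. The sketch below is an assumption about the general approach, not the monitor shipped in 0.9.3-1; the class and method names are hypothetical.

```java
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.Collections;
import java.util.Enumeration;

/** Hypothetical check for "no network interfaces at all except loopback". */
final class InterfaceCheck {
    /** An address counts only if it is neither loopback nor the wildcard address. */
    static boolean isUsable(InetAddress addr) {
        return !addr.isLoopbackAddress() && !addr.isAnyLocalAddress();
    }

    /** True if at least one interface that is up carries a usable address. */
    static boolean hasUsableInterface() {
        try {
            Enumeration<NetworkInterface> nifs = NetworkInterface.getNetworkInterfaces();
            if (nifs == null) {
                return false;                  // some platforms report "none" as null
            }
            for (NetworkInterface nif : Collections.list(nifs)) {
                if (!nif.isUp()) {
                    continue;
                }
                for (InetAddress addr : Collections.list(nif.getInetAddresses())) {
                    if (isUsable(addr)) {
                        return true;
                    }
                }
            }
        } catch (SocketException e) {
            // Assumption for this sketch: an enumeration error is treated as "no interface".
        }
        return false;
    }
}
```

As the comment says, a true result only shows that a local address exists. It says nothing about whether traffic actually gets past the access point or modem, which is the case Astral2012 reported.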

comment:40 Changed 7 years ago by zzz

This recent LWN article has a good discussion of the challenges in connectivity detection: https://lwn.net/Articles/523058/
