Opened 2 years ago

Closed 17 months ago

#1939 closed defect (fixed)

I2CP: Poor loopback performace due to drops

Reported by: zzz Owned by: zzz
Priority: major Milestone: 0.9.33
Component: streaming Version: 0.9.28
Keywords: streaming Cc:
Parent Tickets:

Description

Large transfer from one local dest to another starts out fast but then stalls and proceeds very slowly. This is due to the router's i2cp transmit queue overflowing. Related may be that we don't return failure to the sender, see ClientManager?.DistributeLocal?.runJob(). Streaming is also to blame, doesn't recover quickly from large number of dropped sequence numbers, also built-in delays are far too large for loopback.

Fixes could be in router-side or client-side I2CP or in streaming:

  • Client side doesn't know if dest is local, would it help if it did?
  • Block instead of dropping on router side? Or would another queue just overflow?
  • Return failure code via I2CP to sender?
  • Reduce delays in streaming
  • Improve recovery after massive drops in streaming, this is something we've worked on before, but not always successfully

While this is most apparent and easily reproducible with loopback, it could affect remote dests as well. This test was with a live router. Not sure if it also happens with LocalClientManager? http://zzz.i2p/topics/1882 or if we use blocking there.

Subtickets

Change History (18)

comment:1 Changed 2 years ago by zzz

  • Milestone changed from undecided to 0.9.30
  • Owner set to zzz
  • Status changed from new to accepted

It's actually MessageInputStream? in streaming that is overflowing. The receiver sends a choke (delay > 60000) to the sender, but the sender doesn't do anything with it?

The I2CP queue (I2CPMessageQueueImpl) isn't overflowing, at least in the test I'm doing, but I'll implement and test changes proposed in the OP anyway.

comment:2 Changed 2 years ago by zzz

Failure code returned on I2CP queue overflow in 0254435c0dc705876bbedc63518d64eaa05a21b7 to be 0.9.28-6 but this isn't the problem, and message status messages generally aren't actually sent through I2CP anyway.

Still investigating the poorly-documented optional delay and choke behavior.

comment:3 Changed 2 years ago by zzz

  • Milestone changed from 0.9.30 to 0.9.29
  • Resolution set to fixed
  • Status changed from accepted to closed

In bd037c8b542fe0f6125aa16fb3bff5d257b4e955 0.9.28-6

Also, added new sections to http://i2p-projekt.i2p/en/docs/api/streaming

comment:4 Changed 2 years ago by zzz

  • Component changed from api/i2cp to streaming

comment:5 Changed 2 years ago by zzz

  • Resolution fixed deleted
  • Status changed from closed to reopened

Report on IRC, may or may not be related, to be investigated:

<val> Hi guys, I found a bug somewhere in i2p or java. If I create HTTP server tunnel I'm unable to download some files (mostly big ones), download always hangs even locally over i2p proxy. It downloads 300 kb and then stalles.
<val> Standard tunnels work fine locally but give very low speeds remotely (4 KB/s over 0-hop tunnels on both sides)
<val> I'm using 0.9.29 with jdk8-openjdk 8.u121
<val> Hrm, the problem persist even with standard tunnel and sufficiently large file (300 MB). It downloads 7 MB and that's all.
<val> This feels like it's an MTU issue but i'm running everything locally
<val> fallocate -l 300M file.zzz
<val> python3 -m server.http -b 127.0.0.1 8811
<val> create server tunnel of type HTTP to that port
<val> curl -x http://127.0.0.1:4444 -v http://b32address.b32.i2p/file.zzz > /dev/null
<val> In most cases the whole file won't be downloaded. It would stall.

comment:6 Changed 19 months ago by zzz

ref: optional delay, see #1046
ref: comment 5, see #2071

comment:7 follow-up: Changed 19 months ago by zab

I did several experiments, and I observed that in the "stall" the window size abruptly goes from 128 to 1 and never recovers.

I don't fully understand why, but changing the minimum RTO from 100ms to 1ms seems to solve the issue or at least band-aid around it. Anecdotally, browsing eepsites is massively more responsive with min RTO of 1.

comment:8 in reply to: ↑ 7 Changed 19 months ago by val

Replying to zab:

I did several experiments, and I observed that in the "stall" the window size abruptly goes from 128 to 1 and never recovers.

I don't fully understand why, but changing the minimum RTO from 100ms to 1ms seems to solve the issue or at least band-aid around it. Anecdotally, browsing eepsites is massively more responsive with min RTO of 1.

Did you change hardcoded MIN_RESEND_DELAY or some other parameter?

comment:9 follow-up: Changed 19 months ago by zab

I think I found the issue. We never ever unchoke the connection. Here is a proposed patch with which I have been able to sustain 27MB/s on loopback. http://zerobin.i2p/?5f14986c7325a3c6#kxQUP6q6jTV6naGdF83F5lzi6smyaS9eN18bwZW5COM=

comment:10 in reply to: ↑ 9 Changed 18 months ago by val

Replying to zab:

I think I found the issue. We never ever unchoke the connection. Here is a proposed patch with which I have been able to sustain 27MB/s on loopback. http://zerobin.i2p/?5f14986c7325a3c6#kxQUP6q6jTV6naGdF83F5lzi6smyaS9eN18bwZW5COM=

Can confirm, this patch fixes the issue. I get 160 MB/s on loopback, the speed does not drop and the file fully downloads.
Thank you.

I found MIN_RTO in router/java/src/net/i2p/router/transport/udp/PeerState.java. Is that what you changed in comment:7? Is it sane to set it to 10 or less?

Last edited 18 months ago by val (previous) (diff)

comment:11 Changed 18 months ago by zab

@Val:

No, the min rto is the MIN_RESEND_DELAY in apps/streaming/.../Connection.java By all means experiment with setting it lower, but I'm not advocating it should be made default network-wide. Certainly worth researching though.

comment:12 Changed 18 months ago by zzz

  • Milestone changed from 0.9.29 to 0.9.33
  • Priority changed from minor to major

After discussion in IRC, it appears that the real change is a botched cleanup by me that was part of the "fix" referenced in comment 3. 'not allowAck' should be 'allowAck' at line 141.

The consequences of isNew almost always being false should have caused a lot of other issues. For further investigation.

To be confirmed by me, first with loopback stub testing and then on a live router. If all goes well I'll check in the fix in a few days.

comment:13 Changed 18 months ago by zzz

Confirmed after testing with my loopback stub that zab's findings in comment 9 are correct, we were never unchoking.

Confirmed in the testing that isNew was almost always false, that's a severe issue, which caused every received packet to look like a duplicate and thus speed up the acks. That may have made the connection go faster, but at a severe efficiency cost.

In my testing, it chokes after a window size of 64, slams back to 1 as designed, but after that never gets above about 20. Don't know why.

Checked in log tweak in 9c63ae7c06590cb2932aecd01396025f96a9e6f5 to be 0.9.32-6 to assist with testing.

One other change to investigate is to reduce the DEFAULT_INITIAL_ACK_DELAY in ConnectionOptions?.java, currently 750. Last reduced (from 1000) in April 2015. I propose 375 or 500.

As discussed above, reducing the MIN_RESEND_DELAY in Connection.java is also worth looking at.

comment:14 Changed 18 months ago by zzz

see the TODO at ConnectionPacketHandler?.java line 184 for another place for loopback speedups. Reducing the 250 does indeed increase the speed.

comment:15 Changed 18 months ago by zzz

First impression is that i2psnark downloads are much slower

comment:16 Changed 18 months ago by zzz

I don't take my 'impression' as an important data point, however zab has pointed out reasons why it could or would be slower - that we're acking everything more quickly when isNew is always false.

But I've concluded it doesn't matter, that we can't make any tweaks to timing parameters until we fix this horrible bug and restore the intended ack behavior. Not to mention fix the not-unblocking bug which val and zab uncovered and caused this ticket to be reopened.

One-line version of zab's patch in comment 9, in cbf25d0631cfa0a862ffec3310b22798488d12ed 0.9.32-7

Leaving ticket open for testing and tweaks prior to the 0.9.33 release.

comment:17 Changed 18 months ago by zzz

Prevent window size getting stuck at 1 by doubling RTO on congestion.
Hard to trigger on loopback testing unless you tweak a lot of parameters,
but presumably happening on the live network.
In f9f8bf7b7e4d8a696a6f2123fbcc55bbd8a1c201 to be 0.9.32-10

comment:18 Changed 17 months ago by zzz

  • Resolution set to fixed
  • Status changed from reopened to closed

presumed fixed

Note: See TracTickets for help on using tickets.