#2237 closed defect (fixed)

Not writing to outbound ntcp connections up to 2 seconds

Reported by: Zlatin Balevsky Owned by: zzz
Priority: minor Milestone: 0.9.36
Component: router/transport Version: 0.9.34
Keywords: ntcp nio Cc:
Parent Tickets: Sensitive: no

Description

When establishing an outgoing connection, NTCPConnection::outboundConnected overwrites the interest ops of the SelectionKey to READ, but those may have (and most often have) been set to WRITE because of outgoing handshake data. So, the new connection sits idle until the fail-safe code in EventPumper manually turns on writing, which happens every 2 seconds.

Steps to reproduce:

  1. Turn on INFO logging in the ntcp package and restart router
  2. Observe frequent "Failsafe write for NTCP conn" log statements
  3. Apply patch http://zerobin.i2p/?724e2cb7cc733729#SQVUvjvnIoQFdt71HynU7xNG4gRgDUG0Xwp4eOrzioo= and restart.
  4. Observe much less frequent such events.

Subtickets

Change History (8)

comment:1 Changed 17 months ago by Zlatin Balevsky

The following, more invasive patch http://zerobin.i2p/?edb68713c3fd7d7b#MR5BCqhu7ttYJqYcuw8snHkLAyLk97EUhcRT6Bk5QrQ= almost completely gets rid of "Failsafe write" events.

Highlights:

  • it includes the interest ops change in the original patch
  • elevates the log level of failsafe write events to WARN for easier tracking
  • moves "processDelayedEvents" after the select operation
  • Changes NTCPConnections._writeBufs to a LinkedList and plain locking around it. Also, the "EventPumper.processWrite" call locks coarsely on that lock to prevent other threads from enqueuing buffers while the current ones are being sent.

comment:2 Changed 17 months ago by zzz

Owner: set to zzz
Status: newaccepted

Investigating the OP report and patch now. My preliminary findings are that it is a bug, and probably safe enough to include for .35.

I haven't looked at the patch in comment 1 yet, but my preference will probably be to investigate and test it after 35 is out and NTCP2 is checked into trunk, on or about July 1.

More on the OP issue:

The line in question is unchanged from jrandom's original NTCP checkin 2006-07-04. Unconditionally setting ops to OP_READ and then calling wantsWrite() in outboundConnected() basically assumes that the socket got connected before EstablishState?.prepareOutbound() completes and sets wantsWrite(). Since that isn't always the case, it's racy. Of course wantsWrite() doesn't set the ops, only queues the con in the pumper, so it won't always caused a missed write. But since processRead() is before processWrite() in the pumper loop, if we did go into the loop with both ops set, we will miss the write.

In my testing, it's about 50/50 whether OP_WRITE is set when outboundConnected() clears it. Perhaps, in the 2006, jrandom didn't have a keygen precalc thread (DHSessionKeyBuilder) so prepareOutbound() was slow doing the ElG keygen, so he didn't see this. But with our precalc thread prepareOutbound() is fast, and I expect NTCP2 to be even faster.

This could be a major cause of latency in our network for new connections, so I'd like to get it out there for 35. If all goes well I'll commit it in a day or two.

One note, we have stats for all the failsafe events, so you can measure the results of any changes, don't need to stare at the logs.

comment:3 Changed 17 months ago by zzz

re: comment 1 patch - what's the reason for, or effect of, moving runDelayedEvents() after select() ?

comment:4 in reply to:  3 Changed 17 months ago by Zlatin Balevsky

Replying to zzz:

re: comment 1 patch - what's the reason for, or effect of, moving runDelayedEvents() after select() ?

It would cause many false-positive "Failsafe write" events because the failsafe-checking code would run before any newly registered connections in _wantsWrite would be processed.

comment:5 Changed 16 months ago by zzz

Milestone: undecided0.9.36

OP fix in a8fec46b089c3196f87541544eec493e7c0c6cd0 0.9.34-19
Leaving open for comment 1 issues, to be addressed in .36 cycle, with related #2243

comment:6 Changed 16 months ago by Zlatin Balevsky

Further investigation shows that moving runDelayedEvents() before the failsafe code alone is enough to drastically reduce the number of failsafe write events

comment:7 Changed 16 months ago by zzz

runDelayedEvents change from comment 6 in a2896fe70b279198a0ae06a7045a81a9bbfeea3c to be 0.9.35-1

comment:8 Changed 16 months ago by zzz

Resolution: fixed
Status: acceptedclosed

OP says ticket may be closed

Note: See TracTickets for help on using tickets.