Opened 3 years ago
Closed 3 years ago
#2237 closed defect (fixed)
Not writing to outbound ntcp connections up to 2 seconds
Reported by: | Zlatin Balevsky | Owned by: | zzz |
---|---|---|---|
Priority: | minor | Milestone: | 0.9.36 |
Component: | router/transport | Version: | 0.9.34 |
Keywords: | ntcp nio | Cc: | |
Parent Tickets: | Sensitive: | no |
Description
When establishing an outgoing connection, NTCPConnection::outboundConnected overwrites the interest ops of the SelectionKey to READ, but those may have (and most often have) been set to WRITE because of outgoing handshake data. So, the new connection sits idle until the fail-safe code in EventPumper manually turns on writing, which happens every 2 seconds.
Steps to reproduce:
- Turn on INFO logging in the ntcp package and restart router
- Observe frequent "Failsafe write for NTCP conn" log statements
- Apply patch http://zerobin.i2p/?724e2cb7cc733729#SQVUvjvnIoQFdt71HynU7xNG4gRgDUG0Xwp4eOrzioo= and restart.
- Observe much less frequent such events.
Subtickets
Change History (8)
comment:1 Changed 3 years ago by
comment:2 Changed 3 years ago by
Owner: | set to zzz |
---|---|
Status: | new → accepted |
Investigating the OP report and patch now. My preliminary findings are that it is a bug, and probably safe enough to include for .35.
I haven't looked at the patch in comment 1 yet, but my preference will probably be to investigate and test it after 35 is out and NTCP2 is checked into trunk, on or about July 1.
More on the OP issue:
The line in question is unchanged from jrandom's original NTCP checkin 2006-07-04. Unconditionally setting ops to OP_READ and then calling wantsWrite() in outboundConnected() basically assumes that the socket got connected before EstablishState?.prepareOutbound() completes and sets wantsWrite(). Since that isn't always the case, it's racy. Of course wantsWrite() doesn't set the ops, only queues the con in the pumper, so it won't always caused a missed write. But since processRead() is before processWrite() in the pumper loop, if we did go into the loop with both ops set, we will miss the write.
In my testing, it's about 50/50 whether OP_WRITE is set when outboundConnected() clears it. Perhaps, in the 2006, jrandom didn't have a keygen precalc thread (DHSessionKeyBuilder) so prepareOutbound() was slow doing the ElG keygen, so he didn't see this. But with our precalc thread prepareOutbound() is fast, and I expect NTCP2 to be even faster.
This could be a major cause of latency in our network for new connections, so I'd like to get it out there for 35. If all goes well I'll commit it in a day or two.
One note, we have stats for all the failsafe events, so you can measure the results of any changes, don't need to stare at the logs.
comment:3 follow-up: 4 Changed 3 years ago by
re: comment 1 patch - what's the reason for, or effect of, moving runDelayedEvents() after select() ?
comment:4 Changed 3 years ago by
Replying to zzz:
re: comment 1 patch - what's the reason for, or effect of, moving runDelayedEvents() after select() ?
It would cause many false-positive "Failsafe write" events because the failsafe-checking code would run before any newly registered connections in _wantsWrite would be processed.
comment:5 Changed 3 years ago by
Milestone: | undecided → 0.9.36 |
---|
OP fix in a8fec46b089c3196f87541544eec493e7c0c6cd0 0.9.34-19
Leaving open for comment 1 issues, to be addressed in .36 cycle, with related #2243
comment:6 Changed 3 years ago by
Further investigation shows that moving runDelayedEvents() before the failsafe code alone is enough to drastically reduce the number of failsafe write events
comment:7 Changed 3 years ago by
runDelayedEvents change from comment 6 in a2896fe70b279198a0ae06a7045a81a9bbfeea3c to be 0.9.35-1
comment:8 Changed 3 years ago by
Resolution: | → fixed |
---|---|
Status: | accepted → closed |
OP says ticket may be closed
The following, more invasive patch http://zerobin.i2p/?edb68713c3fd7d7b#MR5BCqhu7ttYJqYcuw8snHkLAyLk97EUhcRT6Bk5QrQ= almost completely gets rid of "Failsafe write" events.
Highlights: