Opened 5 weeks ago

Closed 3 weeks ago

#2476 closed defect (fixed)

NTCP2 100% usage on read() due to con not closed

Reported by: jogger Owned by: zzz
Priority: major Milestone: 0.9.40
Component: router/transport Version: 0.9.39
Keywords: Cc: zab
Parent Tickets:

Description

On multiple routers I got something like the following:

DEBUG [NTCP Pumper ] ter.transport.ntcp.EventPumper?: Read -1 bytes total in 0 times from NTCP2 conn 86845 to 104.176.43.65 port 14933 c~OaCW not established created 4h ago, last send 4h ago, last recv 4h ago, sent 1, rcvd 0
DEBUG [NTCP Pumper ] ter.transport.ntcp.EventPumper?: EOF on NTCP2 conn 86845 to 104.176.43.65 port 14933 c~OaCW not established created 4h ago, last send 4h ago, last recv 4h ago, sent 1, rcvd 0

Repeats indefinitely and drives CPU to the max. Con does not get closed by processRead(), the failsafe stuff also claims to do it, but does not.

Only change I had within the last days was enabling IPv6. Possibly problem with connecting to the same router twice?

Subtickets

Change History (22)

comment:1 Changed 5 weeks ago by zzz

  • Cc zab added
  • Milestone changed from undecided to 0.9.40

I think it may be a bug in processRead() where we get an EOF on the 2nd time through the while loop, so readThisTime < 0 && read > 0. I guess we can't call con.close() right there but have to do it later? Should be a better way than hoping the failsafe code catches it.

comment:2 Changed 5 weeks ago by zzz

full version info please

comment:3 Changed 5 weeks ago by jogger

I2P version: 0.9.39-0
Java version: N/A 12-internal (OpenJDK Runtime Environment 12-internal+0-adhoc..jdk12)
Wrapper version: 3.5.34
Server version: 9.2.25.v20180606
Servlet version: Jasper JSP 2.3 Engine
JSTL version: standard-taglib 1.2.0
Platform: Linux arm 4.14.94-odroidxu4
Processor: (armcortexa7)
JBigI status: Locally optimized library libjbigi-linux-armv7.so loaded from file
GMP version: 6.1.2
JBigI version: 4
Encoding: UTF-8
Charset: UTF-8
Built By: unknown

+ my NTCP Pumper patch applied. JDK, wrapper, jbigi, GMP with tuneup built from source by me.

@zab: It is the first time through the loop (says 0 times above), so read=0 and readThisTime=-1. Problem seems to be that read() does now throw any exception and close() does nothing. Also noteworthy: "not established".

comment:4 Changed 5 weeks ago by jogger

Also happens with NTCP1:
DEBUG [NTCP Pumper ] ter.transport.ntcp.EventPumper?: Read -1 bytes total in 0 times from NTCP1 conn 1450162 to 95.165.172.77 port 36401 NMVu30 not established created 4m ago, last send 4m ago, last recv 4m ago, sent 0, rcvd 0
DEBUG [NTCP Pumper ] ter.transport.ntcp.EventPumper?: EOF on NTCP1 conn 1450162 to 95.165.172.77 port 36401 NMVu30 not established created 4m ago, last send 4m ago, last recv 4m ago, sent 0, rcvd 0

Peer NMVu not listed on /peers.

Last edited 5 weeks ago by jogger (previous) (diff)

comment:5 Changed 5 weeks ago by zzz

Sure, pumper doens't know about NTCP 1 vs 2, would happen to both.

So, the design intent of processRead() is that if it receives data and then an EOF while in the method, it "forgets" the EOF and stays interested. The next time around, it gets an immediate EOF (no data), prints "read -1 bytes in 0 times" and "EOF on ...", and calls con.close().

close() calls locked_close() which calls _chan.close() (NTCPConnection line 533) and we should be done. The only way it wouldn't be called is if _chan was null. locked_close is synchornized but setChannel() isn't synchronized. Is that the problem? Or is there some race where the pumper is hitting EOF before setChannel() is called?

Both the logs above are for outbound connections. Javadoc in NTCPConnection says chan will be non-null for outbound "shortly after creation".

comment:6 Changed 5 weeks ago by zzz

Possible fix??? Please test.

#
# old_revision [bdae038d4ac7d94db4cb93af5fcd4badb177b105]
#
# patch "router/java/src/net/i2p/router/transport/ntcp/NTCPConnection.java"
#  from [f4e7b9b8edb6a8db1bd4ceb2dc2e5eed0f31d76d]
#    to [90d2011fd6b28236d840a770b2f0a06bdf50b96e]
#
============================================================
--- router/java/src/net/i2p/router/transport/ntcp/NTCPConnection.java	f4e7b9b8edb6a8db1bd4ceb2dc2e5eed0f31d76d
+++ router/java/src/net/i2p/router/transport/ntcp/NTCPConnection.java	90d2011fd6b28236d840a770b2f0a06bdf50b96e
@@ -288,14 +288,14 @@ public class NTCPConnection implements C
     /**
      *  Valid for inbound; valid for outbound shortly after creation
      */
-    public SocketChannel getChannel() { return _chan; }
+    public synchronized SocketChannel getChannel() { return _chan; }
 
     /**
      *  Valid for inbound; valid for outbound shortly after creation
      */
-    public SelectionKey getKey() { return _conKey; }
-    public void setChannel(SocketChannel chan) { _chan = chan; }
-    public void setKey(SelectionKey key) { _conKey = key; }
+    public synchronized SelectionKey getKey() { return _conKey; }
+    public synchronized void setChannel(SocketChannel chan) { _chan = chan; }
+    public synchronized void setKey(SelectionKey key) { _conKey = key; }
 
     public boolean isInbound() { return _isInbound; }
     public boolean isEstablished() { return _establishState.isComplete(); }

comment:7 Changed 5 weeks ago by zzz

Also want to analyze why the failsafe didn't catch this. Even if it happens, shouldn't get stuck forever.

comment:8 Changed 5 weeks ago by zzz

or is the problem that we need to clear the read interest ops somewhere?

comment:9 Changed 5 weeks ago by zzz

_chan can't be null or getChannel() would have returned null and processRead() would have NPE'd. So I don't think synch is the solution.

comment:10 Changed 4 weeks ago by zab

NTCPConnection.close() is guarded by an AtomicBoolean to make sure close() is executed only once.

@jogger: can you enable DEBUG loglevel in NTCPConnection as well if you see this happening again?

comment:11 Changed 4 weeks ago by zab

I would try synching outboundConnected but this is also a wild guess at this stage:

#
# old_revision [00a02fe89f8354c2626bdf1970c0a96edbc77523]
#
# patch "router/java/src/net/i2p/router/transport/ntcp/NTCPConnection.java"
#  from [f4e7b9b8edb6a8db1bd4ceb2dc2e5eed0f31d76d]
#    to [5f16d45f569dbc2f8a1b56638adf83b51b2ccb3b]
#
============================================================
--- router/java/src/net/i2p/router/transport/ntcp/NTCPConnection.java	f4e7b9b8edb6a8db1bd4ceb2dc2e5eed0f31d76d
+++ router/java/src/net/i2p/router/transport/ntcp/NTCPConnection.java	5f16d45f569dbc2f8a1b56638adf83b51b2ccb3b
@@ -1125,7 +1125,7 @@ public class NTCPConnection implements C
      * async callback after the outbound connection was completed (this should NOT block, 
      * as it occurs in the selector thread)
      */
-    void outboundConnected() {
+    synchronized void outboundConnected() {
         _conKey.interestOps(_conKey.interestOps() | SelectionKey.OP_READ);
         // schedule up the beginning of our handshaking by calling prepareNextWrite on the
         // writer thread pool

comment:12 Changed 4 weeks ago by zzz

quote zab: "I think the connection is being close()-d during or shortly after tcp connect, but before read interest is set. As a result a closed connection gets registered with the selector. Further close() call do nothing because of the AtomicBoolean? guard"

maybe? I don't see it but looking futher.

The reason I think it may be a synch issue is that OP's arm system has a history of uncovering synch issues before anybody else does, I think the fast arm hardware is looser with the consistency than x86

comment:13 Changed 4 weeks ago by jogger

May have happened to me before Java 12 but may have not noticed since I was hunting other 100% CPU causes.

I think comments 8 and 12 are close to the cause. After being absent for some days I remembered the following release note which may be helpful:

https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8201169

comment:14 Changed 4 weeks ago by jogger

@zab: as you expected

DEBUG [NTCP Pumper ] .transport.ntcp.NTCPConnection: close() loop in NTCPConnection NTCP2 conn 854634 to 86.30.156.97 port 27623 bWY8ag not established created 70m ago, last send 70m ago, last recv 70m ago, sent 1, rcvd 0

comment:15 Changed 4 weeks ago by zab

Well this patch should fix the symptom, i.e. force the failsafe code to close the connection and prevent an infinite loop. The underlying cause is still being researched...

#
# old_revision [7b304521198bd6cf1f6d662d79376c4a5ec59691]
#
# patch "router/java/src/net/i2p/router/transport/ntcp/EventPumper.java"
#  from [9c34961402b2d9e4aec83a15c13cdd737cca9bdd]
#    to [801ebd801b2d20d8abf2d575fae222f9d5d1b7c5]
#
============================================================
--- router/java/src/net/i2p/router/transport/ntcp/EventPumper.java	9c34961402b2d9e4aec83a15c13cdd737cca9bdd
+++ router/java/src/net/i2p/router/transport/ntcp/EventPumper.java	801ebd801b2d20d8abf2d575fae222f9d5d1b7c5
@@ -268,6 +268,7 @@ class EventPumper implements Runnable {
                                         _log.info("Removing invalid key for " + con);
                                     // this will cancel the key, and it will then be removed from the keyset
                                     con.close();
+                                    key.cancel();
                                     failsafeInvalid++;
                                     continue;
                                 }
@@ -298,6 +299,7 @@ class EventPumper implements Runnable {
                                      con.getTimeSinceReceive(now) > expire) {
                                     // we haven't sent or received anything in a really long time, so lets just close 'er up
                                     con.sendTerminationAndClose();
+                                    key.cancel();
                                     if (_log.shouldInfo())
                                         _log.info("Failsafe or expire close for " + con);
                                     failsafeCloses++;

comment:16 Changed 4 weeks ago by zab

And if I'm right about the connection being closed during or shortly after tcp layer connect, this patch should prevent the loop from ever starting:

#
# old_revision [7b304521198bd6cf1f6d662d79376c4a5ec59691]
#
# patch "router/java/src/net/i2p/router/transport/ntcp/NTCPConnection.java"
#  from [f4e7b9b8edb6a8db1bd4ceb2dc2e5eed0f31d76d]
#    to [6aae37a593c8e156d2f499f217e9a43d40671c5b]
#
============================================================
--- router/java/src/net/i2p/router/transport/ntcp/NTCPConnection.java	f4e7b9b8edb6a8db1bd4ceb2dc2e5eed0f31d76d
+++ router/java/src/net/i2p/router/transport/ntcp/NTCPConnection.java	6aae37a593c8e156d2f499f217e9a43d40671c5b
@@ -1125,7 +1125,11 @@ public class NTCPConnection implements C
      * async callback after the outbound connection was completed (this should NOT block, 
      * as it occurs in the selector thread)
      */
-    void outboundConnected() {
+    synchronized void outboundConnected() {
+        if (_establishState == EstablishBase.FAILED) {
+            _conKey.cancel();
+            return;
+        }
         _conKey.interestOps(_conKey.interestOps() | SelectionKey.OP_READ);
         // schedule up the beginning of our handshaking by calling prepareNextWrite on the
         // writer thread pool

comment:17 Changed 4 weeks ago by zzz

could _conKey ever be null in that 2nd patch?

should we also call key.cancel() or clear the interest ops after hitting EOF in processRead()? That's what I was getting at in comment 8.

comment:18 Changed 4 weeks ago by zab

if _conKey is null it would have NPE'd all over - as you can see there is no null check when setting the read interest.

We may call key.cancel() in processRead, although I'd like jogger to first test at least the second patch to see if it eliminates the loop

comment:19 Changed 4 weeks ago by jogger

Patch from comment 16 is running on 2 machines now for some hours. Since the error has been occurring every 2-4 days on each machine for some weeks now, five days uptime should give a good indication.

comment:20 Changed 4 weeks ago by zzz

Great. Our checkin deadline for 0.9.40 is Friday May 3 so we'll need to have test results before then. Thanks for testing.

comment:21 Changed 3 weeks ago by jogger

Ran patch from comment 16 on two routers with up to 2.500 NTCP conns each for more than 6 days. Error from OP is gone. Nothing in the logs.

However than restarted due to this very old unresolved rare and unrelated issue https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8023983. zzz noticed this happening before.

comment:22 Changed 3 weeks ago by zzz

  • Resolution set to fixed
  • Status changed from new to closed

Merged in all 4 patches (comments 6, 11, 15, 16) in 197108bb675f56d2ecbffab608df607478d882f6 0.9.39-9

Note: See TracTickets for help on using tickets.