Opened 3 months ago

Last modified 3 days ago

#2434 new enhancement

Speed up EventPumper.java

Reported by: jogger
Owned by: zzz
Priority: major
Milestone: undecided
Component: router/transport
Version: 0.9.38
Keywords:
Cc:
Parent Tickets:

Description

I did some code analysis and found the following:

  • While I have not gone deep enough to review the 8K buffer size, MAX_MINB is definitely too low at 12. Under load the buffers run dry, causing them to be allocated and disposed of on the fly. I suggest setting it to 1024, which scales up to 16 GB of RAM and still provides > 100 buffers on a 2 GB system; a sketch of the scaling follows below.
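
For illustration, a sketch of the clamp I have in mind (hypothetical: the constant names match EventPumper, but the 16 MB-per-buffer divisor is my assumption about the sizing formula):

    // Hypothetical sizing sketch: derive MIN_BUFS from the max heap in
    // ~16 MB steps, clamped to [MIN_MINB, MAX_MINB].
    //   2 GB heap  -> 1 + 2048/16 = 129 buffers (> 100, but capped at 12 today)
    //   16 GB heap -> hits the proposed cap of 1024
    private static final int MIN_MINB = 4;
    private static final int MAX_MINB = 1024;   // proposed; currently 12
    private static int minBufs(long maxMemory) {
        long scaled = 1 + maxMemory / (16 * 1024 * 1024);
        return (int) Math.max(MIN_MINB, Math.min(MAX_MINB, scaled));
    }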

I believe the following code to be correct, as it returns all outstanding I/O according to the docs.

                    int count = _selector.selectNow();
                    if (count == 0)
                        count = _selector.select(SELECTOR_LOOP_DELAY);

For me there is a noticeable speed boost; the NTCP Pumper spends less time in the kernel.


Attachments (1)

sexy_spikes.png (269.0 KB) - added by zab 3 months ago.
jogger-5 is patched with OP and #2412


Change History (18)

comment:1 Changed 3 months ago by zab

It makes sense under load, but also depends on the specific implementation of the select calls. Does select(int timeout) always put itself on the waiting queue? Only digging through the source code of the JVM and the OS kernel for the given platform can answer that.

But overall I support this change.

comment:2 Changed 3 months ago by jogger

I think we do not need to go to that source. This solves #2355 and it's darn fast. Just saw 3.5 MBps out at 50% CPU.

selectNow() is guaranteed to return everything outstanding. select() is permitted to return zero when a wakeup is pending. This is documented, and the implementation may change at any time. Calling select() only makes sense with previous wakeups cleared, and selectNow() does exactly that.

So the answer to your question is: no, select() is not waiting; it returns zero and we make a fruitless loop through our remaining work. During that time a new wakeup comes in and the cycle repeats. We get nothing done and the lockup is complete. That is what we have been chasing for 2 years now, since #1943.
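
To spell out the documented contract behind the two statements in the OP (the comments are mine, paraphrasing the Selector javadoc):

    int count = _selector.selectNow();   // non-blocking; also clears the effect of any prior wakeup()
    if (count == 0)                      // nothing ready, and no stale wakeup left to trip on
        count = _selector.select(SELECTOR_LOOP_DELAY);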

I would also remove the sleep(), as there are not even any 10 ms pauses in inbound TCP.

comment:3 Changed 3 months ago by zzz

A couple of words of caution:

  • We've been working on the pumper for years. There are several tickets, some still open, and you've added this one and #2432 in quick succession.
  • NIO behavior and performance vary wildly by platform and Java version. Back in the Java 5/6 days, Windows boxes were horrible because epoll sucked. That's hopefully a thing of the past with 8+, but we still support 7 and of course the Android 7-equivalent. There's a long history of NIO bugs, poor performance, and undocumented behavior, and some of the failsafe code and other things are there because of that.

It's not possible to test all platforms and flavors. Some of the workaround code may only apply to things we can't test.

From the comment in #2355 and comment 2 above, it seems you're converging on a change you're happy with?

If so, please provide a unified diff for us to review and test. We have nothing to look at so far.

comment:4 Changed 3 months ago by jogger

I take your advice seriously and have again tested the code below clean on 3 more machines, incl. Mac and Win. However, I think good code must withstand a code review first; if it passes but does not test clean afterwards, I can file tickets. I do not doubt you have put lots of effort into the EventPumper, but this is about resolving clear faults that no one can characterize as "has always worked". Here I stayed conservative: I changed only a little bit and touched neither the network logic nor the failsafe connection handling.

Then I went through the open tickets. Can't judge the PNGs, but my patch below addresses #2251, executing a sleep only if the pumper loop does not fall into wait().

I tried 3 approaches to redesigning the main loop for the problems discovered while investigating #2355, all of which differ little in actual CPU under load. Just using the two statements in the OP without any wait() or sleep() can be ruled out, since that drives CPU up really high.

  • forced sleep(1) followed by selectNow() and possibly select(SELECTOR_LOOP_DELAY)

gave me 600-650 iterations per sec under load (key set size 1600).
Advantages: failsafe, good for low traffic, does not miss packets
Disadvantages: does not appeal to speed freaks, wastes time under CPU pressure

  • wait(1) together with notify() (intentionally ignoring notify while processing keys in order to get some wait) and then selectNow()

gave me 1250-1450 iterations per sec under load.
Advantages: simple, fast
Disadvantages: not inherently failsafe, wastes CPU on low traffic

  • implementing the intentions of the current code: selectNow(), then possibly wait(SELECTOR_LOOP_DELAY) and selectNow() again, with the forced sleep moved to the failsafe clause

gave me around 2500 iterations per sec at top speeds.
Advantages: fastest, failsafe, good with low traffic and under CPU pressure, does not miss any event.
Disadvantage: eats a bit more CPU but that is the price for lower latency and not missing a packet. Same situation as with my patch for getNextVolley().

Coded and load-tested the 3rd approach, as it resembles what I did for SSU. There is now a forced sleep() of up to MIN_FAILSAFE_WAITS ms within the failsafe clause, if fewer than that many waits have been performed before. The timeout call and the blocked-IP check have been moved out of the hot path into the failsafe clause. Kicked out one exception that would never strike. wakeup() has been replaced by notify(). NTCP Pumper CPU is now on the same level as the UDP Packet pusher.

--- "EventPumper orig.java"	2019-02-11 11:37:03.180913002 +0100
+++ "EventPumper patch.java"	2019-02-11 11:44:39.022894613 +0100
@@ -83,8 +83,8 @@
      * the time to iterate across them to check a few flags shouldn't be a problem.
      */
     private static final long FAILSAFE_ITERATION_FREQ = 2*1000l;
-    private static final int FAILSAFE_LOOP_COUNT = 512;
     private static final long SELECTOR_LOOP_DELAY = 200;
+    private static final long MIN_FAILSAFE_WAITS = 25;
     private static final long BLOCKED_IP_FREQ = 3*60*1000;
 
     /** tunnel test now disabled, but this should be long enough to allow an active tunnel to get started */
@@ -103,7 +103,7 @@
     //private static final String PROP_DIRECT = "i2np.ntcp.useDirectBuffers";
 
     private static final int MIN_MINB = 4;
-    private static final int MAX_MINB = 12;
+    private static final int MAX_MINB = 1024;
     private static final int MIN_BUFS;
     static {
         long maxMemory = SystemVersion.getMaxMemory();
@@ -143,8 +143,9 @@
     
     public synchronized void stopPumping() {
         _alive = false;
-        if (_selector != null && _selector.isOpen())
-            _selector.wakeup();
+        synchronized (_selector) {
+            _selector.notify();
+        }
     }
     
     /**
@@ -162,7 +163,9 @@
     public void register(ServerSocketChannel chan) {
         if (_log.shouldLog(Log.DEBUG)) _log.debug("Registering server socket channel");
         _wantsRegister.offer(chan);
-        _selector.wakeup();
+        synchronized (_selector) {
+            _selector.notify();
+        }
     }
 
     /**
@@ -173,17 +176,19 @@
             _log.debug("Registering " + con);
         _context.statManager().addRateData("ntcp.registerConnect", 1);
         _wantsConRegister.offer(con);
-        _selector.wakeup();
+        synchronized (_selector) {
+            _selector.notify();
+        }
     }
     
     /**
      *  The selector loop.
-     *  On high-bandwidth routers, this is the thread with the highest CPU usage, so
+     *  On high-bandwidth routers, this is a thread with high CPU usage, so
      *  take care to minimize overhead and unnecessary debugging stuff.
      */
     public void run() {
         int loopCount = 0;
-        int failsafeLoopCount = FAILSAFE_LOOP_COUNT;
+        long failSafeWaits = 0;
         long lastFailsafeIteration = System.currentTimeMillis();
         long lastBlockedIPClear = lastFailsafeIteration;
         while (_alive && _selector.isOpen()) {
@@ -189,9 +194,17 @@
         while (_alive && _selector.isOpen()) {
             try {
                 loopCount++;
-
                 try {
-                    int count = _selector.select(SELECTOR_LOOP_DELAY);
+                    int count;
+                    if ((count = _selector.selectNow()) == 0) {
+                       try {
+                            synchronized (_selector) {
+                                _selector.wait(SELECTOR_LOOP_DELAY);
+                            }
+                        } catch (InterruptedException ie) {}
+                        failSafeWaits++;
+                        count = _selector.selectNow();
+                    }
                     if (count > 0) {
                         Set<SelectionKey> selected = _selector.selectedKeys();
                         //_context.statManager().addRateData("ntcp.pumperKeysPerLoop", selected.size());
@@ -201,14 +214,10 @@
                     }
                     runDelayedEvents();
                 } catch (ClosedSelectorException cse) {
-                    continue;
+                    break;
                 } catch (IOException ioe) {
                     if (_log.shouldLog(Log.WARN))
                         _log.warn("Error selecting", ioe);
-                } catch (CancelledKeyException cke) {
-                    if (_log.shouldLog(Log.WARN))
-                        _log.warn("Error selecting", cke);
-		    continue;
 		}
                 
                 long now = System.currentTimeMillis();
@@ -223,11 +232,8 @@
                         int lastKeySetSize = all.size();
                         _context.statManager().addRateData("ntcp.pumperKeySetSize", lastKeySetSize);
                         _context.statManager().addRateData("ntcp.pumperLoopsPerSecond", loopCount / (FAILSAFE_ITERATION_FREQ / 1000));
-                        // reset the failsafe loop counter,
-                        // and recalculate the max loops before failsafe sleep, based on number of keys
-                        loopCount = 0;
-                        failsafeLoopCount = Math.max(FAILSAFE_LOOP_COUNT, 2 * lastKeySetSize);
                         
+                        loopCount = 0;
                         int failsafeWrites = 0;
                         int failsafeCloses = 0;
                         int failsafeInvalid = 0;
@@ -315,22 +321,25 @@
                     } catch (ClosedSelectorException cse) {
                         continue;
                     }
-                } else {
-                    // another 100% CPU workaround 
-                    // TODO remove or only if we appear to be looping with no interest ops
-                    if ((loopCount % failsafeLoopCount) == failsafeLoopCount - 1) {
+
+                    if (lastBlockedIPClear + BLOCKED_IP_FREQ < now) {
+                        _blockedIPs.clear();
+                        lastBlockedIPClear = now;
+                    }
+
+                    expireTimedOut();
+
+                    // sleep a bit if we did not wait enough above
+                    if ((failSafeWaits -= MIN_FAILSAFE_WAITS) < 0) {
                         if (_log.shouldLog(Log.INFO))
                             _log.info("EventPumper throttle " + loopCount + " loops in " +
                                       (now - lastFailsafeIteration) + " ms");
                         _context.statManager().addRateData("ntcp.failsafeThrottle", 1);
                         try {
-                            Thread.sleep(25);
+                            Thread.sleep(-failSafeWaits);
                         } catch (InterruptedException ie) {}
                     }
-                }
-                if (lastBlockedIPClear + BLOCKED_IP_FREQ < now) {
-                    _blockedIPs.clear();
-                    lastBlockedIPClear = now;
+                    failSafeWaits = 0;
                 }
             } catch (RuntimeException re) {
                 _log.error("Error in the event pumper", re);
@@ -452,7 +461,9 @@
      */
     public void wantsWrite(NTCPConnection con) {
         if (_wantsWrite.add(con)) {
-            _selector.wakeup();
+            synchronized (_selector) {
+                _selector.notify();
+            }
         }
     }
 
@@ -463,7 +474,9 @@
      */
     public void wantsRead(NTCPConnection con) {
         _wantsRead.offer(con);
-        _selector.wakeup();
+        synchronized (_selector) {
+            _selector.notify();
+        }
     }
 
     /**
@@ -881,15 +894,7 @@
                 if (_log.shouldLog(Log.WARN)) _log.warn("Error registering", cce);
             }
         }
-        
-        long now = System.currentTimeMillis();
-        if (_lastExpired + 1000 <= now) {
-            expireTimedOut();
-            _lastExpired = now;
         }
-    }
-
-    private long _lastExpired;
 
     private void expireTimedOut() {
         _transport.expireTimedOut();

comment:5 Changed 3 months ago by zab

I'm absolutely against adding forced wait()s or sleep()s in this time-critical code. In fact, I've been running the two statements in the OP, with the failsafe sleep() completely removed, on my busy router, and the bandwidth chart looks amazing compared to vanilla.

It is essential that incoming traffic gets processed as quickly as possible, because otherwise the underlying OS logic will adjust the various TCP timers (more specifically, the RTT estimate) and as a result slow down transmission on the TCP link. Deployed network-wide, such a slowdown has multiplicative effects and impacts the user experience.

comment:6 Changed 3 months ago by jogger

@zab: glad you support my choice. My patch above only enforces a short sleep as a failsafe measure when, over a 2-second window, selectNow() returns > 0 in all but fewer than 25 loops. That would mean 99.9% CPU. Just looking at the stats: NTCP Pumper at 35% CPU at 2.8 MBps out on my 32-bit ARM. So this is really a precaution, in case something somewhere in the code goes terribly wrong. A worked example follows below.
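
As a worked example of the failsafe math (a sketch only, using the names from the diff above):

    // One 2-second failsafe window with MIN_FAILSAFE_WAITS = 25:
    // a loop that waited only 5 times has a deficit of 25 - 5 = 20,
    // so the failsafe sleeps 20 ms, about 1% of the window.
    long deficit = failSafeWaits - MIN_FAILSAFE_WAITS;
    if (deficit < 0) {
        try {
            Thread.sleep(-deficit);      // band-aid: at most MIN_FAILSAFE_WAITS ms
        } catch (InterruptedException ie) {}
    }
    failSafeWaits = 0;                   // reset for the next window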

comment:7 Changed 3 months ago by zab

_selector.wait(SELECTOR_LOOP_DELAY);

That's what I'm against: if any data arrives on a TCP connection during those 200 ms, it will just sit in the OS buffers.

comment:8 Changed 3 months ago by jogger

OK, I see. But I have benchmarked this, see above. There are lots of notify() calls. At any speed the loop makes more than 1,000 iterations per sec per MB received (and this is TCP and UDP combined). On my slow machines I get 2,500+ iterations/sec under load. That's faster than TCP packets can arrive from I2P and means an average wait under 0.5 ms. Under load most of the packets are harvested by selectNow() anyway, so zero delay.

But if others support your view, I will be happy to use the statements in the OP; that makes things a bit shorter. I have benchmarked those at > 3,500 iterations/sec on 32-bit ARM, at a noticeable cost in CPU, although still less CPU than the current code.

Comments?

comment:9 Changed 3 months ago by zab

It's true that in your load profile there will be frequent notify() calls. However, on much less loaded routers the frequency of notify() calls will be lower, and data will wait in buffers more often.

I like the statements in the OP; I also like the platform-specific approach I proposed in #2251 comment 8. Your adaptive-sleep approach is also something I'm willing to consider on platforms for which we don't know whether select() has bugs.

Anyway, the 25 ms sleep is too long and it kicks in too often, imho, even after the changes described in #2251 comment 7. During a recent profiling run of 73 wall-clock seconds, 3 seconds were spent in sleep() - that's about 4.1%.

I'll post a bandwidth graph with even sexier looking spikes shortly.

Changed 3 months ago by zab

jogger-5 is patched with OP and #2412

comment:10 Changed 3 months ago by jogger

OK, so if you support the statements in the OP, I will rework the patch. My charts look sexy either way. Easy, will do tomorrow. Please note I have moved the sleep() into the failsafe clause. At the traffic levels we discussed it will not even kick in below an average of 200 consecutive selectNow() > 0 results, equivalent to the CPU running away completely. That's fair enough: much less than before and only a band-aid. It will never kick in at CPU < 95%, and if it does, it only frees 1.25% CPU max, just enough for an emergency user interrupt.

Also, I am against platform-specific coding unless it is 100% necessary. Platforms change too fast. Example: the latest Windoze upgrade did away with the clock jump in JDK 11.0.2 (unchanged) on my machine, as tested with Time.class.

comment:11 Changed 3 months ago by jogger

Had a network outage, so I will postpone the diff until tomorrow after load testing. But I will describe my findings now:

  • to be thread-safe, the wantsWrite() function must call wakeup() in any case
  • as we cannot rely on specifics of the selector implementation, we must guard against race conditions there if we do not want to fall into select(timeout) with traffic due, or return from it with no result because of an uncleared wakeup.

I have therefore amended the code in the OP with a "_threadSafeWait" boolean that tracks outstanding traffic and prevents the wait if there is any. This is also a failsafe measure against errors in our own code.

It looked good so far before the outage: less queued traffic on the /peers page, but still #2346.

Potential issues I stumbled upon in the current code which I do not understand:

  • When a read is bandwidth-limited, READ is removed from the interestOps, but I can't find where it gets re-enabled. According to the comments this should not occur, but does it really not?
  • Within runDelayedEvents when processing write, there is:

{{{
if (key == null)
    continue;
}}}
This clause does trigger and the connection disappears later, but why would we queue up a connection for writing through the selector when we have no key for it?


comment:12 Changed 3 months ago by jogger

It turned out not to be that easy. The code in the OP only seemed to work because select() often fell through on wakeups that selectNow() had not cleared during operation. Guarding against concurrent wakeups seemed a good idea at first, but then the router could not even reach 1 MBps of TCP traffic. This is because, with no wakeup or traffic outstanding, select() causes a context switch. When it wakes up, it has zero to one packets ready to work on, but there is still a 0.5 ms penalty for the switch. During this period our code cannot react.

The effect is that select() returns after 0.5 ms with at most one packet, and at higher traffic our code starts working packet-at-a-time. This was checked with pidstat: NTCP Pumper doing 1200 ctx/sec. Using taskset I was able to push this to 2000 ctx/sec with higher traffic, but that is no solution; CPU was unacceptably high either way.

Solution for any CPU size or traffic profile:

After selectNow() returns no result, we sleep for the duration of one context switch (500,000 ns on Linux), as this is our penalty anyway. If we can write after that, we update the interestOps and do a full harvest using selectNow(). If not, we call select(), hoping it will immediately return some read without context switching. If neither a read nor a write is pending, we incur the cost of a second context switch. This is a rare case with high traffic, and it also fits slow traffic, since the switch costs only time but little CPU.

This works well: the Pumper stays well under 800 ctx/sec, and with higher traffic voluntary switches go down while non-voluntary switches go up, showing that it increasingly uses up its entire time slice doing real work.

--- "EventPumper orig.java"	2019-02-11 11:37:03.180913002 +0100
+++ "EventPumper patch.java"	2019-02-15 21:40:40.271500090 +0100
@@ -56,6 +56,7 @@
     private final ObjectCounter<ByteArray> _blockedIPs;
     private long _expireIdleWriteTime;
     private static final boolean _useDirect = false;
+    private AtomicBoolean _threadSafeWait = new AtomicBoolean();
     
     /**
      *  This probably doesn't need to be bigger than the largest typical
@@ -83,8 +84,8 @@
      * the time to iterate across them to check a few flags shouldn't be a problem.
      */
     private static final long FAILSAFE_ITERATION_FREQ = 2*1000l;
-    private static final int FAILSAFE_LOOP_COUNT = 512;
     private static final long SELECTOR_LOOP_DELAY = 200;
+    private static final long MIN_FAILSAFE_WAITS = 25;
     private static final long BLOCKED_IP_FREQ = 3*60*1000;
 
     /** tunnel test now disabled, but this should be long enough to allow an active tunnel to get started */
@@ -103,7 +104,7 @@
     //private static final String PROP_DIRECT = "i2np.ntcp.useDirectBuffers";
 
     private static final int MIN_MINB = 4;
-    private static final int MAX_MINB = 12;
+    private static final int MAX_MINB = 512;
     private static final int MIN_BUFS;
     static {
         long maxMemory = SystemVersion.getMaxMemory();
@@ -178,12 +179,12 @@
     
     /**
      *  The selector loop.
-     *  On high-bandwidth routers, this is the thread with the highest CPU usage, so
+     *  On high-bandwidth routers, this is a thread with high CPU usage, so
      *  take care to minimize overhead and unnecessary debugging stuff.
      */
     public void run() {
         int loopCount = 0;
-        int failsafeLoopCount = FAILSAFE_LOOP_COUNT;
+        long failSafeWaits = 0;
         long lastFailsafeIteration = System.currentTimeMillis();
         long lastBlockedIPClear = lastFailsafeIteration;
         while (_alive && _selector.isOpen()) {
@@ -189,26 +190,32 @@
         while (_alive && _selector.isOpen()) {
             try {
                 loopCount++;
-
                 try {
-                    int count = _selector.select(SELECTOR_LOOP_DELAY);
+                    // _threadSafeWait guards against waiting with I/O pending
+                    // and against calling select with uncleared wakeups
+                    _threadSafeWait.set(true);
+                    int count = _selector.selectNow();
+                    if (count == 0 && _threadSafeWait.get()) {
+                        failSafeWaits++;
+                        try {
+                            Thread.sleep(0, 500000); // minimum Linux context switch
+                        } catch (InterruptedException ie) {}
+                        if (_threadSafeWait.get()) //we have no write and hope for an immediate read
+                            count = _selector.select(SELECTOR_LOOP_DELAY);
+                    }
                     if (count > 0) {
                         Set<SelectionKey> selected = _selector.selectedKeys();
                         //_context.statManager().addRateData("ntcp.pumperKeysPerLoop", selected.size());
                         processKeys(selected);
-                        // does clear() do anything useful?
+                        // clear() must be called!! select() does not clear this.
                         selected.clear();
                     }
                     runDelayedEvents();
                 } catch (ClosedSelectorException cse) {
-                    continue;
+                    break;
                 } catch (IOException ioe) {
                     if (_log.shouldLog(Log.WARN))
                         _log.warn("Error selecting", ioe);
-                } catch (CancelledKeyException cke) {
-                    if (_log.shouldLog(Log.WARN))
-                        _log.warn("Error selecting", cke);
-		    continue;
 		}
                 
                 long now = System.currentTimeMillis();
@@ -223,11 +230,8 @@
                         int lastKeySetSize = all.size();
                         _context.statManager().addRateData("ntcp.pumperKeySetSize", lastKeySetSize);
                         _context.statManager().addRateData("ntcp.pumperLoopsPerSecond", loopCount / (FAILSAFE_ITERATION_FREQ / 1000));
-                        // reset the failsafe loop counter,
-                        // and recalculate the max loops before failsafe sleep, based on number of keys
-                        loopCount = 0;
-                        failsafeLoopCount = Math.max(FAILSAFE_LOOP_COUNT, 2 * lastKeySetSize);
                         
+                        loopCount = 0;
                         int failsafeWrites = 0;
                         int failsafeCloses = 0;
                         int failsafeInvalid = 0;
@@ -315,22 +319,25 @@
                     } catch (ClosedSelectorException cse) {
                         continue;
                     }
-                } else {
-                    // another 100% CPU workaround 
-                    // TODO remove or only if we appear to be looping with no interest ops
-                    if ((loopCount % failsafeLoopCount) == failsafeLoopCount - 1) {
+
+                    if (lastBlockedIPClear + BLOCKED_IP_FREQ < now) {
+                        _blockedIPs.clear();
+                        lastBlockedIPClear = now;
+                    }
+
+                    expireTimedOut();
+
+                    // sleep a bit if we did not wait enough above
+                    if ((failSafeWaits -= MIN_FAILSAFE_WAITS) < 0) {
                         if (_log.shouldLog(Log.INFO))
                             _log.info("EventPumper throttle " + loopCount + " loops in " +
                                       (now - lastFailsafeIteration) + " ms");
                         _context.statManager().addRateData("ntcp.failsafeThrottle", 1);
                         try {
-                            Thread.sleep(25);
+                            Thread.sleep(-failSafeWaits);
                         } catch (InterruptedException ie) {}
                     }
-                }
-                if (lastBlockedIPClear + BLOCKED_IP_FREQ < now) {
-                    _blockedIPs.clear();
-                    lastBlockedIPClear = now;
+                    failSafeWaits = 0;
                 }
             } catch (RuntimeException re) {
                 _log.error("Error in the event pumper", re);
@@ -451,10 +458,10 @@
      *  Only wakeup if new.
      */
     public void wantsWrite(NTCPConnection con) {
-        if (_wantsWrite.add(con)) {
+        _wantsWrite.add(con);
+        _threadSafeWait.set(false);
             _selector.wakeup();
         }
-    }
 
     /**
      *  This is only called from NTCPConnection.complete()
@@ -463,6 +470,7 @@
      */
     public void wantsRead(NTCPConnection con) {
         _wantsRead.offer(con);
+        _threadSafeWait.set(false);
         _selector.wakeup();
     }
 
@@ -881,16 +889,8 @@
                 if (_log.shouldLog(Log.WARN)) _log.warn("Error registering", cce);
             }
         }
-        
-        long now = System.currentTimeMillis();
-        if (_lastExpired + 1000 <= now) {
-            expireTimedOut();
-            _lastExpired = now;
-        }
     }
 
-    private long _lastExpired;
-
     private void expireTimedOut() {
         _transport.expireTimedOut();
     }

comment:13 Changed 3 months ago by zab

as we cannot rely on specifics of the selector implementation

The selector is implemented in Java, and the synchronization is available for review in the JDK source code. The underlying OS system call - be it epoll, kqueue (FreeBSD), or whatever it is on Windows - is irrelevant as far as interest-ops thread safety is concerned. To verify this, try changing the interest ops of a SelectionKey while the selector is blocked in select() - you can't; your thread will block until select() returns.
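
A minimal standalone harness for that experiment (hypothetical code; whether interestOps() actually blocks here depends on the JDK version and selector implementation, so measure on your own platform):

    import java.net.InetSocketAddress;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.ServerSocketChannel;

    public class InterestOpsBlockTest {
        public static void main(String[] args) throws Exception {
            Selector sel = Selector.open();
            ServerSocketChannel ssc = ServerSocketChannel.open();
            ssc.bind(new InetSocketAddress(0));
            ssc.configureBlocking(false);
            SelectionKey key = ssc.register(sel, SelectionKey.OP_ACCEPT);

            Thread selecting = new Thread(() -> {
                try { sel.select(2000); } catch (Exception e) { e.printStackTrace(); }
            });
            selecting.start();
            Thread.sleep(100);                  // let select() park first

            long start = System.nanoTime();
            key.interestOps(0);                 // may block until select() returns
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println("interestOps() took " + ms + " ms");
            selecting.join();
            sel.close();
        }
    }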

select() causes a context switch

I'm not sure about this. According to the documentation it should return immediately if any of the interest ops are ready. The reason I support the OP statements is that the thread may be yielding and putting itself on the ready queue unnecessarily, which would result in higher OS CPU time. This is where the definition of "immediately" becomes important - is it immediate from a userland perspective or from an overall perspective?

Calling sleep(), however, definitely incurs a context switch, and as before I am against adding any additional sleeps or waits in this time-critical code. (In fact I would rather get rid of the failsafe sleep as well, but that's a different story.)

Also, I think you're unnecessarily complicating things with _threadSafeWait, as the selector is thread-safe to begin with. I don't really see what you're achieving with it, and besides, an AtomicBoolean only makes sense if you're going to use compareAndSet() or lazySet(); regular get/set operations are identical to using a volatile boolean.

comment:14 Changed 3 months ago by jogger

I would like to make things clearer (all numbers are from my machines; all scenarios were tested over a long period):

Why _threadSafeWait?
During selectNow() a wakeup may have occurred that will make select() fall through with a zero return. This fact is recorded by _threadSafeWait, and we avoid the call, because we want to update our interestOps first and then go to selectNow() to harvest as much as possible.

Why not call select() directly after selectNow() returns zero (tested with and without _threadSafeWait)?
With nothing to do, select() will now go directly into a wait and cause a context switch. It will return with one packet read, and all further traffic - until the scheduler lets us run again - queues up in the network buffers. Added up over a second, this is 400 ms of forced sleep that we would all like to avoid.
This means we spend a context switch to get just one packet, and we have 1200 of them to spend per second. This acts as a throughput limiter: our code spends near-zero time working on that packet, and then the same cycle begins again.

Why sleep(500000 nanos)?
If we had called select() in the above situation, we would have spent a context switch for just one packet. Now we have control: if we have something to write, see the first paragraph. If there is nothing to write, we have spent the 0.5 ms that the unavoidable switch would have cost anyway, and we now have a very good chance of calling select() and getting a whole bunch of packets that queued up during the sleep, without a second switch. This bunch of reads makes the performance difference.
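
If anyone wants to check what the sub-millisecond sleep really costs on their box, here is a throwaway micro-benchmark (my assumption: many JVM/OS combinations round sub-ms sleeps up to the timer granularity, so measure rather than trust the 0.5 ms figure):

    public class NanoSleepCost {
        public static void main(String[] args) throws Exception {
            final int runs = 1000;
            long start = System.nanoTime();
            for (int i = 0; i < runs; i++)
                Thread.sleep(0, 500_000);       // request 0.5 ms, as in the patch
            long avgUs = (System.nanoTime() - start) / runs / 1000;
            System.out.println("average sleep: " + avgUs + " us");
        }
    }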

Try all scenarios for yourself (you will need #2438 for that).
With the sleep I get 50%-100% more traffic (measured as calls to the SendFinisher) while the Pumper switches in the 700/sec area.
Without the sleep the Pumper races through the loop, gets little done, and switches 1200/sec. That's 600 ms for context switches, and during most of that time my code can do nothing.

comment:15 Changed 3 months ago by zab

During selectNow() a wakeup may have occurred

selectNow() clears any pending wakeups. Let's take a look at java.base/sun/nio/ch/EPollSelectorImpl.java:

    @Override
    public Selector wakeup() {
        synchronized (interruptLock) {
            if (!interruptTriggered) {
                try {
                    IOUtil.write1(fd1, (byte)0);
                } catch (IOException ioe) {
                    throw new InternalError(ioe);
                }
                interruptTriggered = true;
            }
        }
        return this;
    }

As you can see, wakeup() is implemented by writing to a file descriptor; that write causes the next select() operation to return immediately.

So this is an argument against using selectNow() followed by select(timeout): we call wakeup() whenever we want to queue something for writing or reading, and if selectNow() clears that call, the following select(200) will indeed sit for 200 ms while there is a write waiting to be done.
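
This is easy to demonstrate in isolation (a throwaway sketch; the timing print is mine):

    import java.nio.channels.Selector;

    public class WakeupClearDemo {
        public static void main(String[] args) throws Exception {
            try (Selector sel = Selector.open()) {
                sel.wakeup();                   // pending wakeup
                int n = sel.selectNow();        // clears it, returns 0
                long start = System.nanoTime();
                sel.select(200);                // blocks the full 200 ms
                long ms = (System.nanoTime() - start) / 1_000_000;
                System.out.println("selectNow()=" + n + ", select(200) waited " + ms + " ms");
            }
        }
    }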

I no longer support the statements in the OP, much less the subsequent patches. I believe you are reducing CPU usage at the expense of latency. If we wanted to do that, a simple sleep() would have the same effect, but I'm against that.

comment:16 Changed 3 months ago by jogger

I understand that we do not differ at all on how the selector works. I started this to implement an improvement which:

  • solves the high-load bug in #2355 (see comment 2)
  • solves the CPU-spike bug from #2355, for which I changed the forced sleep above
  • tries to avoid the situation where nothing is outstanding, select() goes into a wait, and, before returning with zero to one packets to work on, lets the pumper sit idle until the context-switch wait is over
  • tries to get as much work as possible out of each call to the selector, thus reducing CPU on every platform.

For these goals my last patch works best; of course alternative solutions exist. One could even substitute (part of) the NTCP Writer code for the sleep; that would bring real additional performance, saving > 1,000 ctx/sec.

comment:17 Changed 3 days ago by jogger

I would still like to see this go forward. Let's try a different perspective:

  • on low traffic, any code needs two pumper loops per key worked on: one for the wakeup, one for the select
  • the current code needs more than one loop on average per key produced, at any traffic level
  • my patch is light on CPU, needs less than one loop per key from about 600 keys/sec upward, and behaves nicely under load.

If you do not like the 500 µs sleep (which I do not understand if you run on a system that does context switching), I have finally found and benchmarked a solution that uses no sleep at all (not even in the failsafe) and still needs less than one loop per key on average at higher traffic levels, from about 1,500 keys/sec depending on the HW.

It is a longer writeup and a long patch, as it involves code reordering, so I will only do so if someone is interested and I have a chance of getting it released.
