Opened 5 months ago

Last modified 4 months ago

#2617 new enhancement

Speedup outbound I2CP

Reported by: jogger Owned by: zzz
Priority: major Milestone: undecided
Component: router/general Version: 0.9.42
Keywords: Cc:
Parent Tickets: Sensitive: no

Description

While working on the UDP transport, I noticed that messages are trickling in slowly all the time, making the router work packet-at-a-time. For inbound this is unavoidable, given the speed of today's CPUs. However, for the Tunnel GW Pumper zzz claimed there was batching. Batching can't work for inbound, but it is reasonable for outbound.

Running with #2432 comment 11, I saw the I2CP reader spending 40% more CPU creating a message than the pump() needed for it. Thus the pumper queue could never fill up: since each message takes longer to create than to pump, the queue never holds the at least 3 messages needed for batching to work.

One could put message creation into a thread pool to parallelize it (the use case being 16 consecutive messages for a bittorrent chunk). However, there is already a hook for this in ClientMessagePool:

            if (true) // blocks the I2CP reader for a nontrivial period of time
                j.runJob();
            else
                _context.jobQueue().addJob(j);

Change that to false and we have a valid use case for multiple job queue runners. I checked this in my test bed, and the number of invocations of the Tunnel GW Pumper went down. I further checked that those jobs ran short enough not to be preempted on 1.8 GHz ARM32 (otherwise the number of threads would have to be limited to the number of CPUs).

So I came up with 16 job queue runners for the above use case, together with #2432 comment 11. Results in my test bed:

  • Tunnel GW Pumper runs 20-30% less frequently than the I2CP Reader, so batching works.
  • Since message creation is distributed across several threads, there is no more preemption, which occurs frequently with the current code when the I2CP Reader loops over its input.

To make our torrent friends really happy, one could as a follow-up tweak the UDP transport with a minimum send window of 16*1033 bytes and MAX_ALLOCATE_SEND = 16.
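For a quick back-of-envelope check of those numbers (illustrative only: the 1033-byte message size and MAX_ALLOCATE_SEND = 16 come from this ticket, while the class and variable names here are made up), note that a standard bittorrent chunk request is 16 KiB, so a 16-message window just covers one chunk:

    // Illustrative arithmetic only; 1033 and MAX_ALLOCATE_SEND = 16 are the
    // values suggested above, everything else is a made-up name.
    public class SendWindowMath {
        public static void main(String[] args) {
            final int MESSAGE_PAYLOAD = 1033;     // bytes per message, per this ticket
            final int MAX_ALLOCATE_SEND = 16;     // messages allocated per pass
            final int TORRENT_CHUNK = 16 * 1024;  // standard 16 KiB bittorrent block

            int minWindow = MAX_ALLOCATE_SEND * MESSAGE_PAYLOAD;
            System.out.println("min send window = " + minWindow + " bytes");  // 16528
            System.out.println("covers one chunk: " + (minWindow >= TORRENT_CHUNK));  // true
        }
    }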

Subtickets

#2432: Overhaul Tunnel GW Pumper (status: new, owner: zzz)

Attachments (1)

I2CP_analysis.ods (13.1 KB) - added by Zlatin Balevsky 4 months ago.


Change History (15)

comment:1 Changed 5 months ago by jogger

Addition: The batching also carries over to other threads like the NTCP Pumper and NTCP Writer, which loop less frequently, saving some CPU.

Furthermore: This mod just gave me > 1 MBps torrent output in the testbed with 4*1.8 GHz cores. I had previously seen such speed only with 8 cores (4*2.0, 4*1.5).

Last edited 5 months ago by jogger

comment:2 Changed 4 months ago by zzz

Component: router/transport → router/general

This change was made in 2005 by jrandom with the message:

      Adjust the router's queueing of outbound client messages when under
      heavy load by running the preparatory job in the client's I2CP handler
      thread, thereby blocking additional outbound messages when the router is
      hosed.

Would changing this back to being on the job queue be of benefit, independent of #2432? Would we need to increase the job runner count (again, assuming we haven't implemented #2432)?

comment:3 Changed 4 months ago by zzz

I'm testing this, not looking at any stats yet, but I don't see how this could help. The I2CP reader is still taking messages out of the queue one at a time, in its own thread. The batching happens in BatchedPreprocessor and PumpedTunnelGateway. Having the I2CP reader spin off part of its work into OCMOSJ on the job queue isn't going to help the batching at all, as far as I can see. Yes, OCMOSJ does most of the work, but it's still being fed messages one at a time by the I2CP reader. If the batching scheme in the preprocessor/PTG/TGP isn't working, then we should look at that.

I guess what this is really doing is parallelizing the OCMOSJ work across multiple threads, rather than it being single-threaded in the I2CP reader. There is still going to be a lot of lock contention in OCMOSJ, and this could also increase out-of-order delivery, but you said you had good results. Any improvement here, I think, is just due to the parallelizing; it's not about batching improvements.

comment:4 Changed 4 months ago by jogger

This change might overload the job queue (on loaded slow machines only, at about 1500 KBps torrent out on my ARM32), but there is a comment about this on the /jobs page. It is not real overloading, but some router steering depends on job queue metrics, so a separate thread pool might be considered as well, sized to Runtime.getRuntime().availableProcessors(). A sketch of that idea follows below.
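For illustration, a minimal sketch of that separate-pool alternative (not actual I2P code: the class name and the Runnable hand-off are assumptions, and the real OCMOSJ is a router Job rather than a plain Runnable):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Sketch only: a pool sized to the CPU count, so message-creation work
    // runs in parallel without distorting the shared job queue metrics.
    class OcmosjPoolSketch {
        private final ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());

        void submit(Runnable messageCreationJob) {
            pool.execute(messageCreationJob);
        }

        void shutdown() {
            pool.shutdown();
        }
    }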

In my testbed the I2CP Readers consumed more CPU than a single outbound pumper.

With the use case being 16 messages from a torrent chunk, the benefit of parallelization is clear. Message generation is more time-consuming than any following task, hence packet-at-a-time processing. On a fast machine with 8 job queue runners configured, I see 6 to 7 jobs finish within the same millisecond all the time. This finally allows for batching.

Then we have #2432. The current TGW Pumper would benefit from this change; however, the code is awkward and produces way too many context switches. I would favor using the inbound code from comment 11 for outbound as well, doing away with the thread pool: maximum looping/batching from 2 messages upward, minimum context switches. See the sketch below.
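A minimal sketch of that batching idea (not the actual #2432 patch; the queue, the Gateway interface, and all names here are assumed for illustration): drain whatever has accumulated and pump it in one pass, so a burst of N requests costs one wakeup instead of N.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Sketch only: one wakeup per burst instead of one context switch per message.
    class BatchedPumpSketch implements Runnable {
        interface Gateway { void pump(); }

        private final BlockingQueue<Gateway> wantsPumping = new LinkedBlockingQueue<>();

        void requestPump(Gateway gw) {
            wantsPumping.offer(gw);
        }

        public void run() {
            List<Gateway> batch = new ArrayList<>();
            try {
                while (true) {
                    batch.add(wantsPumping.take());  // block for the first request
                    wantsPumping.drainTo(batch);     // grab the rest without blocking
                    for (Gateway gw : batch)
                        gw.pump();                   // 2+ messages pump in one loop
                    batch.clear();
                }
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();  // exit cleanly on shutdown
            }
        }
    }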

comment:5 Changed 4 months ago by zzz

(One more clarification on the ticket title: this is about outbound client messages, which are inbound I2CP from the router's perspective.)

Clearly this change wouldn't help if you only have one processor core; it would just add another context switch. It probably wouldn't help with two cores either, as there are other things going on too. Leaving OCMOSJ inline in the I2CP reader thread does provide a natural throttling that keeps the router's job queue from getting overloaded. What do you think about enabling this only for "fast" processors, where we define "fast" as:

    SystemVersion.getCores() >= 4 && !SystemVersion.isSlow()

Last edited 4 months ago by zzz

comment:6 Changed 4 months ago by Zlatin Balevsky

    SystemVersion.getCores() >= 4 && SystemVersion.isSlow()

I would get rid of the second predicate, as fast cores can benefit from such a change as well, I think.

Once someone proposes a self-contained patch for this I'll benchmark it in my testnet.

comment:7 Changed 4 months ago by zzz

The markdown ate the exclamation point, edited above.

Here's my patch, untested (although I have been testing the one-liner to thread everything for a few days):

--- router/java/src/net/i2p/router/ClientMessagePool.java	917ef181d8aa052bc151900a9a40d1aad31f97f9
+++ router/java/src/net/i2p/router/ClientMessagePool.java	a4ccb36ed4ed7545f83cead3adaeda28f0b69db7
@@ -11,6 +11,7 @@ import net.i2p.util.Log;
 import net.i2p.router.message.OutboundCache;
 import net.i2p.router.message.OutboundClientMessageOneShotJob;
 import net.i2p.util.Log;
+import net.i2p.util.SystemVersion;
 
 /**
  * Manage all of the inbound and outbound client messages maintained by the router.
@@ -24,6 +25,9 @@ public class ClientMessagePool {
     private final Log _log;
     private final RouterContext _context;
     private final OutboundCache _cache;
+
+    private static final boolean OCMOSJ_INLINE = SystemVersion.getCores() < 4 ||
+                                                 SystemVersion.isSlow();
     
     public ClientMessagePool(RouterContext context) {
         _context = context;
@@ -73,10 +77,18 @@ public class ClientMessagePool {
             if (_log.shouldLog(Log.DEBUG))
                 _log.debug("Adding message for remote delivery");
             OutboundClientMessageOneShotJob j = new OutboundClientMessageOneShotJob(_context, _cache, msg);
-            if (true) // blocks the I2CP reader for a nontrivial period of time
+            // see discussion in ticket #2617
+            if (OCMOSJ_INLINE) {
+                // Slower processors.
+                // Blocks the I2CP reader for a nontrivial period of time
+                // but prevents job queue overrun
                 j.runJob();
-            else
+            } else {
+                // Faster processors.
+                // Allows better batching at the TunnelGatewayPumper
+                // by parallelizing OCMOSJ
                 _context.jobQueue().addJob(j);
+            }
         }
     }
     

Changed 4 months ago by Zlatin Balevsky

Attachment: I2CP_analysis.ods added

comment:8 Changed 4 months ago by Zlatin Balevsky

See the attached I2CP_analysis.ods for eepget speed readings of a 1 MB file transfer between two nodes in my testnet. No latency or packet loss was added. As you can see, the patch in comment 7 causes performance to drop significantly.

comment:9 Changed 4 months ago by jogger

That result comes as no surprise, as you are repeatedly locking up the entire TGW Pumper, for inbound as well, and causing two context switches for most pumps.

Already last winter I urged changing that with #2432. I am testing > 100 GB of traffic per machine per day on the live net and have been able to reduce context switches by more than 80% with #2432 comment 11. Going further, I will suggest wrapping up all pumps within a single job, as already shown for inbound.

comment:10 Changed 4 months ago by zzz

In comment 2 I asked:

Would changing this back to being on the job queue be of benefit, independent of #2432? Would we need to increase the job runner count (again, assuming we haven't implemented #2432)?

In comment 4 jogger replied:

The current TGW Pumper would benefit from this change

However in comments 8 and 9, zlatinb and jogger now say that this is not a benefit.

I conclude that this ticket now depends on #2432, and cannot be implemented before #2432 - correct?

comment:11 Changed 4 months ago by Zlatin Balevsky

To me the patch in comment 7 makes sense, and I am surprised it didn't result in an improvement in throughput. I would like to understand why that was the case before making larger changes.

Also, will this "non-trivial" period of time mentioned in the code comment be affected by the new crypto (prop 144, others)? If so, maybe it's worth revisiting this ticket after those proposals are implemented.

One possible explanation for the reduction in throughput is that when using the existing inline code, the crypto code stays better cached in the CPU instruction cache, and there is no overhead of handing off to other threads. If that is true, I would say it makes sense to look in the opposite direction, i.e. de-parallelize as much as possible.

comment:12 Changed 4 months ago by jogger

Re comment 10: correct. I have run the various patches from #2432 since last winter. All of them would do, the last being the best. The current code cannot properly deal with a burst on one tunnel. I have increased the job queue runner count to the number of cores.

Re comment 11: good point. Cache invalidation must always be considered. In my test bed the I2CP Readers received context switches anyway, and the job queue runners also loop over the crypto, but less often. From my figures (1000-second samples with 98k loop runs of the reader) I see no difference in CPU or run queue wait times from this change. That is reasonable, because the tight inner loops do not change. However, messages now arrive in batches at the pump(), where there is also expensive crypto. My proposal for #2432 loops over the pump() from 2 messages up without any context switch when fed from the job queue.

comment:13 Changed 4 months ago by zzz

Added subticket #2432 (Overhaul Tunnel GW Pumper).

comment:14 Changed 4 months ago by zzz

re: comment 11 and the impact of new crypto:

  • widespread adoption of new crypto is probably a year or more from now
  • while x25519 is much, much faster than ElG (10x?), we will be doing more operations for increased security. The overall speedup for new session messages (the ones that do asymmetric crypto) might be 2-3x. There's probably no significant change for the existing session messages (symmetric crypto). The mix of the two depends on the traffic mix (long sessions or short ones); a rough worked example follows below.
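To make that estimate concrete, an illustrative calculation (the 10x per-operation figure is the guess above; the operation counts are assumptions, not measurements of the actual protocols):

    // Illustrative arithmetic only; per-op costs and op counts are assumed.
    public class CryptoSpeedupEstimate {
        public static void main(String[] args) {
            double elgCost = 1.0;              // relative cost of one ElGamal op
            double x25519Cost = elgCost / 10;  // "10x?" faster per operation
            int elgOps = 1;                    // asymmetric ops per new-session msg (assumed)
            int x25519Ops = 4;                 // more ops for increased security (assumed)

            double speedup = (elgOps * elgCost) / (x25519Ops * x25519Cost);
            System.out.printf("new-session speedup ~ %.1fx%n", speedup);  // ~2.5x
        }
    }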