Opened 7 months ago

Last modified 6 weeks ago

#2252 new enhancement

Reduce number of Reader and Writer threads

Reported by: zab
Owned by: zab
Priority: minor
Milestone: undecided
Component: router/transport
Version: 0.9.34
Keywords: ntcp nio
Cc:
Parent Tickets:

Description

Adding this one on my TODO list to research.

Hypothesis:

Profiling with YourKit shows that the Reader._pendingConnections and Writer._pendingConnections locks are still very heavily contended, even though a few years ago the data structures were changed to LinkedHashSet. It is possible these threads are spending more time blocked than doing work. Reducing the number of threads should alleviate this and may increase throughput.
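
For reference, here is a minimal sketch of the kind of guarded-set handoff being described: several runner threads contending on a single monitor around a LinkedHashSet. This is only an illustration of the pattern, not the actual Reader/Writer code; the PendingPool name and the wantsWork/nextWork methods are made up for the example.

    import java.util.LinkedHashSet;
    import java.util.Set;

    // Illustrative only: several runner threads contend on one monitor
    // around a LinkedHashSet, which is where the lock contention shows up.
    class PendingPool<C> {
        private final Set<C> _pendingConnections = new LinkedHashSet<C>(16);

        // Called by the NIO event thread when a connection needs servicing.
        void wantsWork(C con) {
            synchronized (_pendingConnections) {
                _pendingConnections.add(con);   // the set also deduplicates
                _pendingConnections.notify();   // wake one runner
            }
        }

        // Every runner thread loops here; with N runners, all N block on
        // the same monitor while waiting for or claiming work.
        C nextWork() throws InterruptedException {
            synchronized (_pendingConnections) {
                while (_pendingConnections.isEmpty())
                    _pendingConnections.wait();
                C con = _pendingConnections.iterator().next();
                _pendingConnections.remove(con);
                return con;
            }
        }
    }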

Subtickets

Attachments (1)

1thread.png (673.5 KB) - added by zab 7 months ago.
5 day bw graph comparing multiple vs 1 thread


Change History (7)

comment:1 Changed 7 months ago by zzz

Both AES and ChaChaPoly, which is what these threads do, are fast on most platforms. But the Reader threads can also "push through" a message all the way through the router from read to write, with various places where it could be blocked along the way.

The thread count is based on available memory: more threads if we have more memory. We may actually have that backwards; perhaps we should have more threads on e.g. Android, where there's more chance of blocking?

I'd be more comfortable reducing writers than readers, for the reasons stated above.

It may also help to add some short-circuit code that avoids the locking and other overhead when there's only one Reader or Writer; one possible shape for that is sketched below.
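
A possible single-thread short-circuit, sketched under the assumption that the pool knows its configured size at construction time (the class and method names here are hypothetical): with one runner, the synchronized set plus wait/notify could be replaced by a single-consumer blocking queue, so the producer never fights other runners for the monitor.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Hypothetical single-runner variant: the producer does a plain offer()
    // and the lone runner blocks in take(), so there is no shared monitor
    // for multiple threads to contend on.
    class SingleRunnerPool<C> {
        private final BlockingQueue<C> _pending = new LinkedBlockingQueue<C>();

        void wantsWork(C con) {
            _pending.offer(con);        // no synchronized block, no notify()
        }

        C nextWork() throws InterruptedException {
            return _pending.take();     // single consumer, nothing to contend with
        }
    }

One detail to keep in mind: the current LinkedHashSet also deduplicates a connection that is queued twice before being serviced, and a queue-based variant would need to preserve that property (for example with a per-connection "queued" flag), so a change like this needs careful vetting.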

Changed 7 months ago by zab

5 day bw graph comparing multiple vs 1 thread

comment:2 Changed 7 months ago by zab

I ran a more extreme experiment: I reduced the number of threads to 1 each for the NTCP Reader, Writer, UDP processor, JobQueue and TunnelGatewayPumper pools. As you can see in the graph, there was no noticeable effect on throughput.

I think we should consider such a reduction, as it can simplify the code a lot and get rid of synchronization in many places.

comment:3 Changed 7 months ago by zzz

This is good stuff. I will be able to help out in a few weeks. Both transports were written by jrandom for Java 4, predating the Java 5 concurrency libraries, then converted to the concurrent collections by me and updated over the years with your help.

As I briefly said in comment 1 above, the threads may be there more for potential blocking than for speed. Testing on a modern, fast, and busy router is a good start, but there's more to consider, especially blocking issues. Quick checklist to research for each pool:

  • Why created - speed or blocking? Wild guess or good reasons?
  • History of thread count changes and why
  • Any crypto or other expensive operations in the thread?
  • Blocking / contention possibilities in the thread
  • Average / max time for a single run on a slow platform, e.g. RPi 2 (see the timing sketch after this list)
  • Does the thread stay local in the particular transport or can it "push through" a message through the entire router, potentially back to this or the other transport (read ... router ... write)
  • Potential for deadlocks or effective deadlock (thread exhaustion) if we reduce to single thread
  • Do we want more threads for slow platforms, fast platforms, both, or neither?
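
For the per-run timing item above, a minimal measurement sketch; the processOne callable is a hypothetical stand-in for one iteration of a pool's work loop, and this is not existing router code.

    import java.util.concurrent.Callable;

    // Hypothetical helper to collect average and max time for a single
    // iteration of a thread pool's work loop, e.g. when testing on an RPi 2.
    class RunTimer {
        private long _count, _totalNanos, _maxNanos;

        <T> T time(Callable<T> processOne) throws Exception {
            long start = System.nanoTime();
            try {
                return processOne.call();
            } finally {
                long elapsed = System.nanoTime() - start;
                _count++;
                _totalNanos += elapsed;
                if (elapsed > _maxNanos)
                    _maxNanos = elapsed;
            }
        }

        @Override
        public String toString() {
            if (_count == 0)
                return "no runs";
            return "runs=" + _count
                 + " avg=" + (_totalNanos / _count / 1000) + "us"
                 + " max=" + (_maxNanos / 1000) + "us";
        }
    }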

JobQueue, for example, is a place where things are fast 99% of the time but can sometimes be slow or blocked. We have to be very careful not to introduce hangs, deadlocks, or crashes that could be really hard to diagnose or reproduce.

comment:4 Changed 7 months ago by zzz

I also need to measure NTCP1 AES vs. NTCP2 AEAD/ChaCha/Poly, preferably on a slow box w/o AES-NI. AEAD is probably slower, but maybe not by much.
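
A rough way to get that number without touching the router is a standalone JCE micro-benchmark. Note that the "ChaCha20-Poly1305" transformation only exists in stock Java 11+, and I2P's own crypto classes may perform differently, so this gives only a ballpark per-platform comparison.

    import java.security.SecureRandom;
    import javax.crypto.Cipher;
    import javax.crypto.spec.IvParameterSpec;
    import javax.crypto.spec.SecretKeySpec;

    // Ballpark comparison of AES/CBC vs ChaCha20-Poly1305 throughput using
    // the stock JCE providers (ChaCha20-Poly1305 requires Java 11+).
    public class CipherBench {
        public static void main(String[] args) throws Exception {
            SecureRandom rnd = new SecureRandom();
            byte[] data = new byte[16 * 1024];      // one "message"
            rnd.nextBytes(data);
            byte[] key = new byte[32];
            rnd.nextBytes(key);

            bench("AES/CBC/NoPadding", new SecretKeySpec(key, "AES"), 16, data);
            bench("ChaCha20-Poly1305", new SecretKeySpec(key, "ChaCha20"), 12, data);
        }

        private static void bench(String transform, SecretKeySpec key,
                                  int ivLen, byte[] data) throws Exception {
            Cipher c = Cipher.getInstance(transform);
            int iterations = 10000;
            long start = System.nanoTime();
            for (int i = 0; i < iterations; i++) {
                byte[] iv = new byte[ivLen];
                iv[0] = (byte) i;           // vary the nonce: ChaCha20-Poly1305
                iv[1] = (byte) (i >> 8);    // rejects key/nonce reuse across inits
                c.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
                c.doFinal(data);
            }
            long ms = (System.nanoTime() - start) / 1000000;
            System.out.println(transform + ": " + iterations + " x "
                               + data.length + " bytes in " + ms + " ms");
        }
    }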

comment:5 Changed 7 weeks ago by jogger

I am running an 8-core ARM board, and my NTCP readers and tunnel GW pumpers each run at > 100% CPU combined. Otherwise I support your idea for <= 4 cores. I could use more UDP packet pushers; that thread maxes out from time to time on my system. There are also ARM boards out there with 16+ cores; these need still more parallelization.

comment:6 Changed 6 weeks ago by jogger

I would like to emphasize the need for more UDP packet pushers. That thread constantly runs at 60-80% of a single core on my system, and I suspect it is the source of "high message delay - rejecting tunnels", even though the entire system runs below 50% total CPU at the same time. After a short glance at the source, I think that message delay is calculated from UDP only.

Of course this would become moot if a new SSU version brought down the CPU needed for this thread.
