Opened 7 years ago

Closed 7 years ago

#697 closed enhancement (fixed)

Heavy monitor contention in ntcp.{Reader,Writer}$Runner.run

Reported by: Zlatin Balevsky
Owned by: zzz
Priority: minor
Milestone: 0.9.2
Component: router/transport
Version: 0.9.1
Keywords: NTCP monitor contention
Cc: Zlatin Balevsky, Meeh
Parent Tickets:
Sensitive: no

Description

I'm observing very frequent and heavy contention on the _pendingConnections monitor in both the NTCP Reader and Writer.

There are some potentially expensive operations happening inside those monitors, like removing from the head of an ArrayList, but before I start changing code I'll try running the high-capacity router with just a single Reader thread and a single Writer thread. Modern CPUs should be able to handle many megabits of traffic on a single core, and I suspect the number of those threads was decided through the "favorite number" method.
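
For reference, the contended pattern looks roughly like this (a minimal sketch with simplified, hypothetical names, not the actual router code): a plain ArrayList guarded by its own monitor, with producers calling notifyAll() and the Runner threads wait()-ing and removing from the head.

    import java.util.ArrayList;
    import java.util.List;

    // Minimal sketch of the contended pattern (hypothetical names, not the router code):
    // one ArrayList guarded by its own monitor; producers notifyAll(), Runner threads
    // wait() and pull work from the head of the list.
    class PendingQueue<T> {
        private final List<T> _pending = new ArrayList<T>();

        // producer side, e.g. what Reader.wantsRead() / Writer.wantsWrite() roughly do
        void offer(T con) {
            synchronized (_pending) {
                if (!_pending.contains(con))   // O(n) scan while holding the monitor
                    _pending.add(con);
                _pending.notifyAll();          // wakes every sleeping Runner
            }
        }

        // consumer side, e.g. what Reader$Runner.run() / Writer$Runner.run() roughly do
        T take() throws InterruptedException {
            synchronized (_pending) {
                while (_pending.isEmpty())
                    _pending.wait();
                return _pending.remove(0);     // O(n) shift of the backing array
            }
        }
    }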

I'll be updating this post as I get more data.

Subtickets

Change History (28)

comment:1 Changed 7 years ago by zzz

Yeah, those are pretty bad; they are pretty much unchanged since the beginning of NTCP in 2006… I've never rewritten them for concurrency. It's all ArrayLists, locking, notifies, ….

#689 is related.

comment:2 Changed 7 years ago by zzz

… but as we discussed elsewhere, the Readers push data all the way through the router and I don't think it's wise to reduce them to a single thread

comment:3 Changed 7 years ago by Zlatin Balevsky

yourkit snapshot of monitor profiling with 0.9.1-12 demonstrating the contention:

http://nnz2serwcyvndjwqufqpubiu2khqjqelrwrim4pxiyhk7ixwnhnq.b32.i2p/0.9.1-12/monitor.snapshot.bz2

am now running 0.9.1-14 with 2 reader and 2 writer threads instead of 4

comment:4 Changed 7 years ago by Zlatin Balevsky

Example trace where a reader thread bumps into a writer thread:

http://nnz2serwcyvndjwqufqpubiu2khqjqelrwrim4pxiyhk7ixwnhnq.b32.i2p/0.9.1-12/yourkit-monitors/reader-bumps-writer.html

So, any reader thread can bump into any writer thread + the event pumper! This is pretty bad. (quadratic contention? looks like it)

comment:5 Changed 7 years ago by Zlatin Balevsky

Encouraging results from running -14 with LinkedHashSet instead of ArrayList ( http://pastethis.i2p/show/1684/ )

http://nnz2serwcyvndjwqufqpubiu2khqjqelrwrim4pxiyhk7ixwnhnq.b32.i2p/0.9.1-14-LHS/monitor.snapshot.bz2

The time spent waiting to acquire monitors has dropped by 66%!

comment:6 Changed 7 years ago by zzz

Milestone: 0.9.2
Status: new → accepted

LinkedHashSets in 0.9.1-16, similar to paste 1684, to reduce time spent in the _pendingConnections lock. An easy and obvious improvement.
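
In code, the swap amounts to roughly the following (an illustrative sketch with hypothetical names, not the exact -16 diff): set membership replaces the linear contains() scan, and removing the oldest entry no longer shifts an array, so less work is done while the monitor is held.

    import java.util.Iterator;
    import java.util.LinkedHashSet;
    import java.util.Set;

    // Sketch of the LinkedHashSet variant (hypothetical names, not the exact -16 diff).
    class PendingSet<T> {
        private final Set<T> _pending = new LinkedHashSet<T>();

        void offer(T con) {
            synchronized (_pending) {
                _pending.add(con);          // set semantics replace the O(n) contains() scan
                _pending.notifyAll();
            }
        }

        T take() throws InterruptedException {
            synchronized (_pending) {
                while (_pending.isEmpty())
                    _pending.wait();
                Iterator<T> iter = _pending.iterator();
                T con = iter.next();        // oldest entry; insertion order is preserved
                iter.remove();              // O(1), no array shift
                return con;
            }
        }
    }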

Not sure why "bumping into" is bad, or what you mean by "quadratic contention". 4x4 chances of blocking? But how would reducing the number of threads make it better?

Leaving ticket open as there are probably more improvements possible. And to await results of thread count changes.

comment:7 Changed 7 years ago by Zlatin Balevsky

results from -15 with patch (almost identical to -16) but with 2 reader & 2 writer threads

http://nnz2serwcyvndjwqufqpubiu2khqjqelrwrim4pxiyhk7ixwnhnq.b32.i2p/0.9.1-15-LHS-2r2w/

Included are bandwidth, peers and participating tunnel graphs as well as yourkit snapshot. Immediate first impression - the bandwidth use is much higher than in previous runs, whereas the number of tunnels and peers is comparable. I have asked KillYourTV to confirm this independently.


Why bumping impacts throughput:

Bumping is bad because it can put the thread into a BLOCKED state, which means it is waiting in a queue associated with the monitor object and it's up to the scheduler to wake it up when the monitor gets released. That can happen a non-deterministic amount of time later, and waking the thread typically involves a system call, i.e. it is expensive and slow.

It is important to distinguish between the queue for threads in wait() (WAITING or TIMED_WAITING) and the queue for threads in the BLOCKED state. They have different underlying semantics and are very much platform-dependent.

Since these threads are responsible for handling TCP traffic, any delay can affect the various window and RTT calculations and impact throughput.


Why reducing the number of threads reduces contention:

There are always two monitors to compete for (one in the Reader, one in the Writer). With 4 readers, 4 writers and the event pumper there are 9 threads competing for these 2 monitors; with 2 and 2 there are only 5. Since notifyAll() is used, in the first case 4 threads will always be notified at once, in the second only 2. Consider the following scenario:

  1. There is nothing to write; all 4 writer threads are sleeping.
  2. Somebody (maybe a reader, maybe not) calls wantsWrite() and issues notifyAll().
  3. The 4 writer threads wake up; one of them acquires the monitor.
  4. On a different CPU core, a reader or the pumper thread decides it wants to call wantsWrite(). However, the monitor is held, so it gets in line in the monitor queue.
  5. The writer thread that acquired the monitor proceeds to write and releases it.
  6. The remaining 3 writer threads acquire the monitor, and release it because there is nothing to write.
  7. Only now can the reader or pumper thread acquire the monitor and proceed with what it was doing.

If there are only 2 writer threads, step 6 takes significantly less time and the NTCP engine can get back to handling traffic sooner.
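
To make step 6 concrete, here is a small self-contained demo (illustrative only, hypothetical code, not the router's): four "writer" threads park in wait(), one notifyAll() wakes all of them, and three of them re-acquire and release the monitor only to find the queue empty again.

    import java.util.ArrayDeque;
    import java.util.Queue;

    // Illustrative demo of the scenario above (hypothetical, not router code):
    // one notifyAll() wakes all four "writers", but only one finds work; the other
    // three still have to re-acquire and release the monitor before anyone else can.
    public class NotifyAllDemo {
        private static final Queue<String> pending = new ArrayDeque<String>();

        public static void main(String[] args) throws InterruptedException {
            for (int i = 0; i < 4; i++) {
                final int id = i;
                Thread t = new Thread(() -> {
                    try {
                        synchronized (pending) {
                            while (pending.isEmpty()) {
                                pending.wait();
                                System.out.println("writer-" + id + " woke, queue size " + pending.size());
                            }
                            System.out.println("writer-" + id + " got: " + pending.poll());
                        }
                    } catch (InterruptedException ignored) {}
                });
                t.setDaemon(true);
                t.start();
            }
            Thread.sleep(200);                 // let the writers park in wait()
            synchronized (pending) {
                pending.add("one write job");
                pending.notifyAll();           // all 4 writers wake; 3 find nothing to do
            }
            Thread.sleep(200);                 // let the wakeups print before exiting
        }
    }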

comment:8 Changed 7 years ago by zzz

thanks for the detailed explanation of your thinking.

re: bumping, sure, blocking is expensive, but the reason to have multiple threads is so something else can get done if something blocks. If NTCP readers only went to NTCP writers, maybe that would imply fewer threads, but an incoming packet could also end up going to I2CP (internal) or SSU.

re: contention, interesting, I hadn't thought about notifyAll() being expensive. Seems like we don't need to do it every time - if the conn was already in _pendingConnections, we probably don't need to notify anybody, and if it wasn't in there, couldn't we get away with just notify() to wake up one thread?
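
Building on the PendingSet sketch in comment 6 above (still hypothetical, not a tested patch), that cheaper producer side could look roughly like this:

    // Hypothetical refinement of the earlier PendingSet sketch, not a tested patch:
    // only wake anybody when the connection was newly queued, and wake a single
    // Runner rather than all of them.
    void offer(T con) {
        synchronized (_pending) {
            if (_pending.add(con))     // Set.add() returns false if it was already queued
                _pending.notify();     // one waiting Runner is enough
        }
    }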

comment:9 Changed 7 years ago by zzz

Ahh but notify() could wake up something on the input side instead of the output side.

Looking at it further, the Writer really doesn't do much - just calls NTCPConnection.prepareNextWrite(), which does a quick AES encryption and tells NIO. So maybe we just need one Writer thread.

But if it's that fast, then maybe we can scrap the Writer thread altogether and just have Writer.wantsWrite() call NTCPConnection.prepareNextWrite() directly.

But if we're going to do that, we should fix the horrid O(N^2) FIXME in NTCPConnection.prepareNextWriteFast(). That one's my fault, I did it years and years ago when I was not so smart and nobody was around to slap me for it. A couple of years ago I started to use a real priority queue but I never finished it. I'm still uncertain whether we need to worry about priorities.
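
As a rough sketch of what dropping the Writer thread(s) might look like, purely an assumption rather than a patch (and ignoring the prioritization question above):

    // Hypothetical sketch, not a patch: with no Writer$Runner threads there is no
    // _pendingConnections queue or monitor on the write side at all; the caller
    // (a Reader thread or the event pumper) does the small amount of work inline.
    public void wantsWrite(NTCPConnection con) {
        // per the comment above: a quick AES encryption, then hand the buffer to NIO
        con.prepareNextWrite();
    }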

comment:10 Changed 7 years ago by Zlatin Balevsky

Ahh but notify() could wake up something on the input side instead of the output side.

The queue for acquiring the monitor is different from the queue for wait()-ing on the monitor. From the code it looks like only readers wait() on Reader._pendingConnections and only writers wait() on Writer._pendingConnections, so notify() should wake up a thread of the right kind.

This notify() vs notifyAll() may be irrelevant because of the other points you make above; I'm going to try -17 with 2r,1w next. The ideal solution would be to get rid of the writer thread(s) completely; don't know yet how much effort that would be and whether you want it for 0.9.2 or 0.9.3.

comment:11 Changed 7 years ago by Zlatin Balevsky

General plan for the next few days: if the LimeWire traffic patterns are relevant to I2P then weekends would show naturally higher values, so I have to be careful before drawing conclusions.

  1. 24 hrs with -17, 2r1w threads
  2. 24 hrs with vanilla (-17 or whatever) to confirm baseline
  3. 24 hrs with vanilla + s/notifyAll()/notify() ( http://pastethis.i2p/show/1709/ )
  4. analyze results, and fork:
     a) 24 hrs with 2r1w + notify()
     b) 24 hrs with 4r1w + notify() ← based on a nebulous hunch but that's my style
     c) hack up your suggestion to get rid of writer threads, 24 hrs with that
     d) something entirely different, yet to be determined ???

Meanwhile….

KYTV was running with the default number of max NTCP connections, so changing the number of threads did not affect him. Over IRC I instructed him:

<zab_> Let's do the following:
<zab_> 1. set i2np.ntcp.maxConnections=1024
<zab_> 2. restart with vanilla -17 and run that for 24 hrs
<zab_> 3. apply patch 1709 to -17 and after another 24 hrs compare

comment:12 Changed 7 years ago by Zlatin Balevsky

  1. 24 hrs with vanilla (-17 or whatever) to confirm baseline

On second thought, make that 48 hours. This covers Monday and rules out the possible weekend effect. Meeh isn't upgrading until Wednesday so I still get at least 24 hours for step 3. If you push any changes related to this into trunk I'll stick to -17.

comment:13 Changed 7 years ago by Zlatin Balevsky

  1. 24 hrs with -17, 2r1w threads

http://nnz2serwcyvndjwqufqpubiu2khqjqelrwrim4pxiyhk7ixwnhnq.b32.i2p/0.9.1-17-2r1w/

nothing conclusive, so continuing with

24 hrs with vanilla (-17 or whatever) to confirm baseline

This will also be an opportunity to see how the churn improvements in -18 ( #699 ) are doing.

comment:14 Changed 7 years ago by Zlatin Balevsky

24 hrs with vanilla (-17 or whatever) to confirm baseline

On second thought, make that 48 hours.

On third thought, make that 36 hours.

After looking at the charts at http://nnz2serwcyvndjwqufqpubiu2khqjqelrwrim4pxiyhk7ixwnhnq.b32.i2p/0.9.1-18/ , either there is no day-of-week factor, and/or the uptime factor is much more dominant, and/or something entirely different is happening. Included is a screenshot demonstrating the contention as well as a memory usage graph that shows nice little improvements from #699.

Proceeding with

  1. 24 hrs with vanilla + s/notifyAll()/notify() ( http://pastethis.i2p/show/1709/ )

except it will probably be 36 hours to match the baseline.

comment:15 Changed 7 years ago by Zlatin Balevsky

(the last comment was really meant to be posted 10 hours ago)

Very good-looking results so far:
contention vs. vanilla

bandwidth ramp-up also the best so far

Will continue running this for a total of 36 hours as planned. In the meantime, will look into completely getting rid of writer threads. Hopefully there's plenty of time until the tag freeze…

comment:16 Changed 7 years ago by Zlatin Balevsky

Final results from 0.9.1-19 + notify() are clearly the best ones so far. If I had to make a recommendation right now it would without a doubt be http://pastethis.i2p/show/1709

However, there are a few more days until the tag freeze, so I'm going to proceed with

c) hack up your suggestion to get rid of writer threads, 24 hrs with that

and run dev router with the more intrusive http://pastethis.i2p/show/1709

comment:17 Changed 7 years ago by Zlatin Balevsky

more intrusive http://pastethis.i2p/show/1733 not 1709

comment:18 Changed 7 years ago by Zlatin Balevsky

make that http://pastethis.i2p/show/1744/ which contains 1709 + 1733

comment:19 Changed 7 years ago by Zlatin Balevsky

http://pastethis.i2p/show/1744/ + 4 read threads caused a deadlock. Aborting the test and continuing with 1744 + 1 read thread just for the heck of it.

comment:20 Changed 7 years ago by Zlatin Balevsky

At this point it's pretty clear s/notifyAll()/notify() ( http://pastethis.i2p/show/1709 ) will be the final recommendation. I'll keep playing with some wild ideas but I doubt anything realistic will come up in time for 0.9.2.

comment:21 Changed 7 years ago by zzz

Paste 1709 checked in as 0.9.1-21. I included some comments on how it could be improved further, by only calling notify() when necessary. Haven't tested that, just speculation.

Nice job with the extensive testing.

comment:22 Changed 7 years ago by Zlatin Balevsky

About to restart with http://pastethis.i2p/show/1746/ - notify only if added

comment:23 Changed 7 years ago by Zlatin Balevsky

Not too impressed with selective notify(), results after 24 hours

Trying -21 with 2 reader, 2 writer threads next… ( http://pastethis.i2p/show/1698/ )

comment:24 Changed 7 years ago by Zlatin Balevsky

Not too impressed with 2r 2w threads either; results after 24 hours

Trying -22 with 8 reader, 8 writer threads next… ( http://pastethis.i2p/show/1758/ )

comment:25 Changed 7 years ago by Zlatin Balevsky

Cc: Zlatin Balevsky added

adding cc

comment:26 Changed 7 years ago by Zlatin Balevsky

After 24 hours things look alright, but I can't see any difference from vanilla, and #707 is much more interesting so I'm jumping to that.

comment:27 Changed 7 years ago by Meeh

Cc: Meeh added

comment:28 Changed 7 years ago by zzz

Resolution: fixed
Status: accepted → closed

seems like we're done with this one. 0.9.2 contains the changes.
