Opened 6 years ago

Closed 6 years ago

#697 closed enhancement (fixed)

Heavy monitor contention in ntcp.{Reader,Writer}$

Reported by: zab
Owned by: zzz
Priority: minor
Milestone: 0.9.2
Component: router/transport
Version: 0.9.1
Keywords: NTCP monitor contention
Cc: zab@…, meeh@…
Parent Tickets:


I'm observing very frequent and heavy contention of the _pendingConnections monitor in both ntcp reader and writer.

There are some potentially expensive operations happening inside those monitors, like removing from the head of an ArrayList, but before I start changing code I'll try running the high-capacity router with just a single Reader and a single Writer thread. Modern CPUs should be able to handle many megabits of traffic on a single core, and I suspect the number of those threads was decided through the "favorite number" method.
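For reference, a minimal sketch of why head removal from an ArrayList is expensive (the class and method names below are made up for illustration, not the actual router code): each remove(0) copies the entire remaining tail down one slot, so draining n elements is O(n²), while a deque pops the head in constant time.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

public class HeadRemovalSketch {
    // Drains a list from the head, ArrayList-style; O(n) per removal.
    static int drainList(List<Integer> list) {
        int n = 0;
        while (!list.isEmpty()) {
            list.remove(0);   // arraycopy of the whole tail each time
            n++;
        }
        return n;
    }

    // Drains a deque from the head; O(1) per removal.
    static int drainDeque(Queue<Integer> q) {
        int n = 0;
        while (q.poll() != null) n++;
        return n;
    }

    public static void main(String[] args) {
        List<Integer> list = new ArrayList<>();
        Queue<Integer> deque = new ArrayDeque<>();
        for (int i = 0; i < 10000; i++) { list.add(i); deque.add(i); }
        // Same result, very different cost under a contended monitor.
        System.out.println(drainList(list));
        System.out.println(drainDeque(deque));
    }
}
```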

I'll be updating this post as I get more data.


Change History (28)

comment:1 Changed 6 years ago by zzz

Yeah, those are pretty bad; they are pretty much unchanged since the beginning of NTCP in 2006... I've never rewritten them for concurrency. It's all ArrayLists, locking, notifies, ....

#689 is related.

comment:2 Changed 6 years ago by zzz

... but as we discussed elsewhere, the Readers push data all the way through the router and I don't think it's wise to reduce them to a single thread

comment:3 Changed 6 years ago by zab

yourkit snapshot of monitor profiling with 0.9.1-12 demonstrating the contention:


am now running 0.9.1-14 with 2 reader and 2 writer threads instead of 4

comment:4 Changed 6 years ago by zab

Example trace where a reader thread bumps into a writer thread:


So, any reader thread can bump into any writer thread + the event pumper! This is pretty bad. (quadratic contention? looks like it)

comment:5 Changed 6 years ago by zab

Encouraging results from running -14 with LinkedHashSet instead of ArrayList ( http://pastethis.i2p/show/1684/ )


The time spent waiting to acquire monitors has dropped by 66%!
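For readers following along, a rough model of what the LinkedHashSet swap buys (class and method names are hypothetical, not the actual patch): FIFO order is preserved like the old ArrayList, but duplicate suppression and membership checks drop from O(n) to O(1), and head removal no longer shifts the tail.

```java
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;

public class PendingSetSketch {
    private final Set<String> pending = new LinkedHashSet<>();

    // Queues a connection for attention; returns true only if newly queued.
    synchronized boolean wantsAttention(String conn) {
        return pending.add(conn); // duplicate adds are no-ops, no linear scan
    }

    // Pops the oldest pending connection, or null if none.
    synchronized String next() {
        Iterator<String> it = pending.iterator();
        if (!it.hasNext()) return null;
        String c = it.next();
        it.remove();
        return c;
    }

    public static void main(String[] args) {
        PendingSetSketch s = new PendingSetSketch();
        System.out.println(s.wantsAttention("conn-1")); // true: newly queued
        System.out.println(s.wantsAttention("conn-1")); // false: duplicate ignored
        System.out.println(s.next());                   // conn-1, FIFO order kept
    }
}
```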

comment:6 Changed 6 years ago by zzz

  • Milestone set to 0.9.2
  • Status changed from new to accepted

LinkedHashSets in 0.9.1-16, similar to paste 1684, to reduce time spent in the _pendingConnections lock. An easy and obvious improvement.

Not sure why "bumping into" is bad, or what you mean by "quadratic contention". 4x4 chances of blocking? But how would reducing the number of threads make it better?

Leaving ticket open as there are probably more improvements possible. And to await results of thread count changes.

comment:7 Changed 6 years ago by zab

results from -15 with patch (almost identical to -16) but with 2 reader & 2 writer threads


Included are bandwidth, peers and participating tunnel graphs as well as yourkit snapshot. Immediate first impression - the bandwidth use is much higher than in previous runs, whereas the number of tunnels and peers is comparable. I have asked KillYourTV to confirm this independently.

Why bumping impacts throughput:

Bumping is bad because it can put the thread into a BLOCKED state, which means it is waiting in a queue associated with the monitor object, and it's up to the scheduler to wake it up when the monitor gets released. That can happen a non-deterministic amount of time later, and it involves a system call, i.e. it is expensive and slow.

It is important to distinguish between the queue for threads in wait() (the WAITING/TIMED_WAITING states) and the queue for threads in the BLOCKED state. They have different underlying semantics and are very much platform-dependent.

Since these threads are responsible for handling TCP traffic, any delay can affect the various window and rtt calculations and impact throughput.

Why reducing the number of threads reduces contention:

There are always two monitors to compete for. In one case there will be 9 threads competing for these 2 monitors, in the other 5. Since notifyAll() is used, in the first case 4 threads will always be notified, in the second only 2. Consider the following scenario:

  1. There is nothing to write; all 4 writer threads are sleeping.
  2. Somebody (maybe a reader, maybe not) calls wantsWrite() and issues notifyAll().
  3. The 4 writer threads wake up; one of them acquires the monitor.
  4. On a different CPU core, a reader or the pumper thread decides it wants to call wantsWrite(). However, the monitor is held, so it gets in line in the monitor queue.
  5. The writer thread that acquired the monitor proceeds to write and releases it.
  6. The remaining 3 writer threads acquire the monitor in turn, and release it because there is nothing to write.
  7. Only now can the reader or pumper thread acquire the monitor and proceed with what it was doing.

If there are only 2 writer threads, step 6 takes significantly less time and the NTCP engine can get back to handling traffic sooner.
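The scenario above can be condensed into a runnable skeleton (all names here are hypothetical; this is the generic wait/notifyAll pattern being described, not the actual I2P code). Every wantsWrite() call wakes all sleeping writers, even though only one can win the monitor and usually only one has work to do:

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.CountDownLatch;

public class WriterPoolSketch {
    private final Queue<String> pending = new ArrayDeque<>();

    void wantsWrite(String conn) {
        synchronized (pending) {
            pending.add(conn);
            pending.notifyAll(); // step 2: every sleeping writer wakes (steps 3-6)
        }
    }

    String awaitWork() throws InterruptedException {
        synchronized (pending) {
            while (pending.isEmpty())
                pending.wait();    // step 1: nothing to write, go to sleep
            return pending.poll(); // step 5: the winner takes the work
        }
    }

    public static void main(String[] args) throws Exception {
        WriterPoolSketch pool = new WriterPoolSketch();
        CountDownLatch done = new CountDownLatch(1);
        new Thread(() -> {
            try {
                System.out.println("processed " + pool.awaitWork());
                done.countDown();
            } catch (InterruptedException ignored) {}
        }).start();
        pool.wantsWrite("conn-1");
        done.await(); // prints "processed conn-1"
    }
}
```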

comment:8 Changed 6 years ago by zzz

thanks for the detailed explanation of your thinking.

re: bumping, sure, blocking is expensive, but the reason to have multiple threads is so something else can get done if something blocks. If NTCP readers only went to NTCP writers, maybe that would imply fewer threads, but an incoming packet could also end up going to I2CP (internal) or SSU.

re: contention, interesting, I hadn't thought about notifyAll() being expensive. Seems like we don't need to do it every time - if the conn was already in _pendingConnections, we probably don't need to notify anybody, and if it wasn't in there, couldn't we get away with just notify() to wake up one thread?
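That suggestion sketched in code (names hypothetical; this combines the "skip if already queued" and single-notify() ideas, which were later tested as pastes 1709 and 1746): since Set.add() already reports whether the element was new, the wake-up can be skipped entirely for re-queued connections.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class SelectiveNotifySketch {
    private final Set<String> pendingConnections = new LinkedHashSet<>();

    // Returns true if a worker was actually notified (for illustration).
    boolean wantsWrite(String con) {
        synchronized (pendingConnections) {
            if (pendingConnections.add(con)) {
                pendingConnections.notify(); // was notifyAll()
                return true;
            }
            return false; // already queued: nobody new to wake
        }
    }

    public static void main(String[] args) {
        SelectiveNotifySketch w = new SelectiveNotifySketch();
        System.out.println(w.wantsWrite("conn-1")); // true: newly queued, one thread woken
        System.out.println(w.wantsWrite("conn-1")); // false: duplicate, no notify at all
    }
}
```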

comment:9 Changed 6 years ago by zzz

Ahh but notify() could wake up something on the input side instead of the output side.

Looking at it further, the Writer really doesn't do much - just calls NTCPConnection.prepareNextWrite(), which does a quick AES encryption and tells NIO. So maybe we just need one Writer thread.

But if it's that fast, then maybe we can scrap the Writer thread altogether and just have Writer.wantsWrite() call NTCPConnection.prepareNextWrite() directly.

But if we're going to do that, we should fix the horrid O(N²) FIXME in NTCPConnection.prepareNextWriteFast(). That one's my fault; I did it years and years ago when I was not so smart and nobody was around to slap me for it. A couple of years ago I started to use a real priority queue but I never finished it. I'm still uncertain about whether we need to worry about priorities.
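As a sketch of the priority-queue direction (the message type and priorities below are invented for illustration, not the router's actual outbound structures): a binary heap gives "pick the highest-priority message" in O(log n) per operation, instead of rescanning the full list on every send, which is how a naive loop becomes the O(N²) pattern mentioned above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class OutboundQueueSketch {
    static final class Msg {
        final String id;
        final int priority;
        Msg(String id, int priority) { this.id = id; this.priority = priority; }
    }

    // Drains messages highest-priority first; each poll() is O(log n).
    static List<String> sendOrder(List<Msg> msgs) {
        PriorityQueue<Msg> outbound =
            new PriorityQueue<>(Comparator.comparingInt((Msg m) -> m.priority).reversed());
        outbound.addAll(msgs);
        List<String> order = new ArrayList<>();
        while (!outbound.isEmpty())
            order.add(outbound.poll().id);
        return order;
    }

    public static void main(String[] args) {
        System.out.println(sendOrder(Arrays.asList(
            new Msg("a", 100), new Msg("b", 500), new Msg("c", 200))));
        // prints [b, c, a]: highest priority first
    }
}
```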

comment:10 Changed 6 years ago by zab

Ahh but notify() could wake up something on the input side instead of the output side.

The queue for acquiring the monitor is different from the queue for wait()-ing on the monitor. From the code it looks like only readers wait() on Reader._pendingConnections and only writers wait() on Writer._pendingConnections, so notify() should wake up a thread of the right kind.
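This per-monitor wait-set behavior can be demonstrated directly (lock names hypothetical): notify() on one object can only wake threads that called wait() on that same object, never a thread sleeping on a different monitor.

```java
import java.util.concurrent.CountDownLatch;

public class WaitSetSketch {
    static final Object readerLock = new Object();
    static final Object writerLock = new Object();

    static void sleeper(Object lock, String name, CountDownLatch ready, CountDownLatch woke) {
        new Thread(() -> {
            synchronized (lock) {
                ready.countDown(); // signal "about to wait" while still holding the lock
                try { lock.wait(); } catch (InterruptedException e) { return; }
            }
            System.out.println(name + " woke");
            woke.countDown();
        }).start();
    }

    public static void main(String[] args) throws Exception {
        CountDownLatch rReady = new CountDownLatch(1), rWoke = new CountDownLatch(1);
        CountDownLatch wReady = new CountDownLatch(1), wWoke = new CountDownLatch(1);
        sleeper(readerLock, "reader", rReady, rWoke);
        sleeper(writerLock, "writer", wReady, wWoke);
        rReady.await();
        wReady.await();
        // Entering synchronized here is only possible once the sleeper is inside wait().
        synchronized (readerLock) { readerLock.notify(); } // wakes only the reader
        rWoke.await();
        synchronized (writerLock) { writerLock.notify(); } // writer needs its own wake-up
        wWoke.await();
    }
}
```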

This notify() vs notifyAll() may be irrelevant because of the other points you make above; I'm going to try -17 with 2r,1w next. The ideal solution would be to get rid of the writer thread(s) completely; don't know yet how much effort that would be and whether you want it for 0.9.2 or 0.9.3.

comment:11 Changed 6 years ago by zab

General plan for the next few days: if the LimeWire traffic patterns are relevant to I2P, then weekends would show naturally higher values, so I have to be careful before drawing conclusions.

  1. 24 hrs with -17, 2r1w threads
  2. 24 hrs with vanilla (-17 or whatever) to confirm baseline
  3. 24 hrs with vanilla + s/notifyAll()/notify() ( http://pastethis.i2p/show/1709/ )
  4. analyze results, and fork:
     a) 24 hrs with 2r1w + notify()
     b) 24 hrs with 4r1w + notify() <- based on a nebulous hunch but that's my style
     c) hack up your suggestion to get rid of writer threads, 24 hrs with that
     d) something entirely different, yet to be determined ???


KYTV was running with default number of max NTCP connections so changing the number of threads did not affect him. Over IRC I instructed him:

<zab_> Let's do the following:
<zab_> 1. set i2np.ntcp.maxConnections=1024
<zab_> 2. restart with vanilla -17 and run that for 24 hrs
<zab_> 3. apply patch 1709 to -17 and after another 24 hrs compare

comment:12 Changed 6 years ago by zab

  2. 24 hrs with vanilla (-17 or whatever) to confirm baseline

On second thought, make that 48 hours. This covers Monday and rules out the possible weekend effect. Meeh isn't upgrading until Wednesday, so I still get at least 24 hours for step 3. If you push any changes related to this to trunk I'll stick to -17.

comment:13 Changed 6 years ago by zab

  1. 24 hrs with -17, 2r1w threads


nothing conclusive, so continuing with

24 hrs with vanilla (-17 or whatever) to confirm baseline

This will also be an opportunity to see how the churn improvements in -18 ( #699 ) are doing.

comment:14 Changed 6 years ago by zab

24 hrs with vanilla (-17 or whatever) to confirm baseline

On second thought, make that 48 hours.

On third thought, make that 36 hours.

After looking at the charts http://nnz2serwcyvndjwqufqpubiu2khqjqelrwrim4pxiyhk7ixwnhnq.b32.i2p/0.9.1-18/ either there is no day-of-week factor, and/or the uptime factor is much more dominant, and/or something entirely different is happening. Included is a screenshot demonstrating the contention, as well as a memory usage graph that shows nice little improvements from #699 .

Proceeding with

  3. 24 hrs with vanilla + s/notifyAll()/notify() ( http://pastethis.i2p/show/1709/ )

except it will probably be 36 hours to match the baseline.

comment:15 Changed 6 years ago by zab

(the last comment was really meant to be posted 10 hours ago)

Very good-looking results so far:
contention vs. vanilla

bandwidth ramp-up also the best so far

Will continue running this for a total of 36 hours as planned. In the meantime, will look into completely getting rid of writer threads. Hopefully there's plenty of time until the tag freeze...

comment:16 Changed 6 years ago by zab

Final results from 0.9.1-19 + notify() are clearly the best ones so far. If I had to make a recommendation right now it would without a doubt be http://pastethis.i2p/show/1709

However, there are a few more days until the tag freeze, so I'm going to proceed with

c) hack up your suggestion to get rid of writer threads, 24 hrs with that

and run dev router with the more intrusive http://pastethis.i2p/show/1709

comment:17 Changed 6 years ago by zab

more intrusive http://pastethis.i2p/show/1733 not 1709

comment:18 Changed 6 years ago by zab

make that http://pastethis.i2p/show/1744/ which contains 1709 + 1733

comment:19 Changed 6 years ago by zab

http://pastethis.i2p/show/1744/ + 4 read threads caused a deadlock. Aborting the test and continuing with 1744 + 1 read thread just for the heck of it.

comment:20 Changed 6 years ago by zab

At this point it's pretty clear s/notifyAll()/notify() ( http://pastethis.i2p/show/1709 ) will be the final recommendation. I'll keep playing with some wild ideas but I doubt anything realistic will come up in time for 0.9.2.

comment:21 Changed 6 years ago by zzz

Paste 1709 checked in as 0.9.1-21. I included some comments on how it could be improved further, by only calling notify() when necessary. Haven't tested that, just speculation.

Nice job with the extensive testing.

comment:22 Changed 6 years ago by zab

About to restart with http://pastethis.i2p/show/1746/ - notify only if added

comment:23 Changed 6 years ago by zab

Not too impressed with selective notify(); results after 24 hours

Trying -21 with 2 reader, 2 writer threads next... ( http://pastethis.i2p/show/1698/ )

comment:24 Changed 6 years ago by zab

Not too impressed with 2r 2w threads either; results after 24 hours

Trying -22 with 8 reader, 8 writer threads next... ( http://pastethis.i2p/show/1758/ )

comment:25 Changed 6 years ago by zab

  • Cc zab@… added

adding cc

comment:26 Changed 6 years ago by zab

After 24 hours things look alright, but I can't see any difference from vanilla, and #707 is much more interesting so I'm jumping to that.

comment:27 Changed 6 years ago by meeh

  • Cc meeh@… added

comment:28 Changed 6 years ago by zzz

  • Resolution set to fixed
  • Status changed from accepted to closed

seems like we're done with this one. 0.9.2 contains the changes.

Note: See TracTickets for help on using tickets.