Opened 7 years ago

Closed 5 years ago

#674 closed defect (wontfix)

NTCP Event pumper should not have unbounded byte cache

Reported by: Zlatin Balevsky
Owned by: zzz
Priority: minor
Milestone:
Component: router/transport
Version: 0.9.1
Keywords: NTCP OOM
Cc:
Parent Tickets:
Sensitive: no

Description

example OOM: http://pastethis.i2p/show/1544/

reason: when receiving data too fast, the NTCP event pumper thread will exhaust its buffer cache and keep creating new byte buffers. Since TCP has built-in backoff (flow control), it should be safe not to create new buffers and simply wait until cached ones become available.
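
Something like this would do it (a minimal sketch with made-up names, not actual router code): acquire() just waits for a buffer to come back instead of allocating a fresh one, and TCP flow control throttles the sender in the meantime.

    import java.nio.ByteBuffer;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    class BlockingBufferPool {
        private final BlockingQueue<ByteBuffer> _pool;

        BlockingBufferPool(int count, int bufSize) {
            _pool = new ArrayBlockingQueue<ByteBuffer>(count);
            for (int i = 0; i < count; i++)
                _pool.offer(ByteBuffer.allocate(bufSize));
        }

        /** Blocks until a buffer is returned; never allocates a new one. */
        ByteBuffer acquire() throws InterruptedException {
            return _pool.take();
        }

        /** Hands a buffer back; the caller must not touch it afterwards. */
        void release(ByteBuffer buf) {
            buf.clear();
            _pool.offer(buf);
        }
    }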

Subtickets

Change History (10)

comment:1 Changed 7 years ago by zzz

Maybe, but then we have to:
1) Pre-fill the queue with some number of buffers (wasteful)
2) Make darn sure we never leak a buffer

So perhaps better to:
1) Increase max queue size based on max mem or max bandwidth
or
2) adjust the queue size algorithm in releaseBuf()
or
3) sleep and retry on OOM

I've never seen this OOM before.

comment:2 Changed 7 years ago by Zlatin Balevsky

All of the above work except 3), because by definition anything that extends java.lang.Error is a fatal event (I'm starting to sound like a broken record on this).

The reason you may not have seen this OOM is because stack traces from OOMs can be misleading. Thread A can have a bug that fills up 99.9% of the available heap, then Thread B allocates a small object as part of normal operation and BAM - you have an OOM in Thread B with a stack trace that sends you in the wrong direction.

After KYTV reported this OOM he reduced his bandwidth limits, and his router has been more stable since, which strongly suggests this is the actual cause. In the exp. builds I've opted for simple pre-allocation; let's see how that works out.

comment:3 Changed 7 years ago by Zlatin Balevsky

btw, I asked KYTV to get a histogram from the heap dump with jhat, and the byte[] objects were by far the largest heap user, both in count and in total size, which also points in the direction of the NTCP event pumper thread.

comment:4 Changed 7 years ago by Zlatin Balevsky

actually, what's the problem with "2) Make darn sure we never leak a buffer"? All it takes is code review, and there are plenty of tools to help detect that case; whereas there will always be a risk of OOM as long as there are unbounded queues anywhere in the code.

comment:5 Changed 7 years ago by zzz

re: 3) I'm not lobbying hard for it, but only because it's unlikely to help, of course for the reason you stated above, that the OOM can be far from the memory hog. I don't think there's anything wrong with catching an OOM and trying to do something about it.

re: byte[], there's a zillion uses of byte[] in the router. It's a huge leap to assume this is the one that's the hog.

re: 2), sure it's not impossible, just another source of potential problems

comment:6 Changed 7 years ago by Zlatin Balevsky

here are a few more OOM traces KYTV took just now, all pointing all over the place:
http://pastethis.i2p/show/1550/
and here is a heap histogram with the biggest offender being byte[]s, followed closely by UDPPackets (which are also unbounded and created in identical fashion):
http://pastethis.i2p/show/1549/

OOMs should never happen in the first place. If you get to the state of OOM, you're most likely not going to recover, and it's much better to

  1. dump the heap for post-mortem analysis via -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=<file name> (example below)
  2. restart the JVM. IIRC the wrapper already restarts the router automatically?
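
e.g., on a plain java command line (the dump path here is just an example):

    java -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/router.hprof ...

or, for a wrapper-based install, the same flags as wrapper.config entries (the index depends on what's already in the file):

    wrapper.java.additional.10=-XX:+HeapDumpOnOutOfMemoryError
    wrapper.java.additional.11=-XX:HeapDumpPath=/tmp/router.hprof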

Unbounded queues & unlimited object creation are never a good idea, especially in response to network events. It shouldn't be done even for internal events because it will mask some other bug with an OOM.

comment:7 Changed 7 years ago by zzz

Milestone: set to 0.9.2

The ticket title - and your argument above - are a little confusing, because of course the queue (cache) *is* bounded now. What isn't bounded is that we do indeed make more objects if the cache is empty. Which is the easiest, safest, most bug-resistant way to do it. This is a design pattern that's used in a dozen or more places throughout the code.
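
Roughly, the pattern looks like this (a sketch with made-up names, not the actual code):

    import java.nio.ByteBuffer;
    import java.util.concurrent.LinkedBlockingQueue;

    class BufCache {
        private static final int MAX_CACHED = 16;      // the cache itself is bounded
        private static final int BUF_SIZE = 16 * 1024;
        private final LinkedBlockingQueue<ByteBuffer> _cache =
                new LinkedBlockingQueue<ByteBuffer>(MAX_CACHED);

        ByteBuffer acquireBuf() {
            ByteBuffer buf = _cache.poll();
            if (buf != null)
                return buf;
            // cache empty: allocate instead of waiting; simple and leak-proof,
            // but under a sustained input burst this is where the heap fills up
            return ByteBuffer.allocate(BUF_SIZE);
        }

        void releaseBuf(ByteBuffer buf) {
            buf.clear();
            _cache.offer(buf);    // silently dropped (and GC'd) if the cache is full
        }
    }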

Somewhat paradoxically, given the ticket title, it seems to me that the solution is to make the queue *bigger*, or at least to make the min size bigger, or to tweak the adaptive min-size adjustment to better handle high-bandwidth routers, i.e. large input bursts.
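
e.g. something along these lines (a made-up, untuned heuristic):

    class CacheSizing {
        /**
         * Scale the minimum cached-buffer count with the configured inbound
         * bandwidth, keeping roughly one second's worth of input cached.
         */
        static int minCachedBuffers(int inboundKBps, int bufSizeKB) {
            return Math.max(16, inboundKBps / bufSizeKB);
        }
    }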

I'll stop advocating catching the OOM (if it was ever a serious proposal to begin with).

Note that we do have an inbound bandwidth throttle, although it may be after buffers are created, I'll have to check.

comment:8 Changed 7 years ago by str4d

Milestone: 0.9.2 deleted

comment:9 Changed 5 years ago by zzz

Priority: major → minor

comment:10 Changed 5 years ago by zzz

Resolution: wontfix
Status: new → closed