Opened 9 years ago

Closed 7 years ago

#762 closed defect (worksforme)

NTCP outbound queue stuck

Reported by: zzz
Owned by: zzz
Priority: minor
Milestone:
Component: router/transport
Version: 0.9.3
Keywords: NTCP
Cc: Zlatin Balevsky
Parent Tickets:
Sensitive: no


I occasionally see, on the /peers page, an NTCP connection with several hundred pending messages (this is the NTCPConnection._outbound size) and no activity for several seconds or minutes (I've seen over 10 minutes).

It isn't clear this is a bug - perhaps the far-end went away. But at the very least we should find out faster. 300 messages sitting in a queue for several minutes is a memory hog if nothing else.

This isn't new, but I can't recall if I saw it before the redesigns of this summer, or the more recent codel work. But NTCP is not using codel now, only PriBlockingQueue. My guess is it's been around longer than all that but I don't have any evidence.

The failsafe code in EventPumper.run() only checks NTCPConnection.isWriteBufEmpty(), but the queue backing up is before that, i.e. _outbound, which must be pulled by the Writer thread to get stuffed into a write buf.

I've never understood the point of the _currentOutbound field and all the synchronization around it in NTCPConnection. Perhaps related, perhaps not.

#689 is about the inbound side but is somewhat related as it discusses the general architecture surrounding EventPumper et al.

Perhaps for starters, enhance the failsafe code to kick the Writer, or at least log the state if a connection is in this jam, and proceed with further debugging.
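A minimal sketch of what such a failsafe check might look like, assuming the failsafe pass can see each connection's queue depth and last send time. The OutboundView interface, the OutboundStallCheck class, and both thresholds are hypothetical names and values for illustration; none of them exist in the router code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical read-only view of one connection; not an existing I2P interface. */
interface OutboundView {
    /** Messages waiting to be pulled by the writer (stand-in for the _outbound size). */
    int queuedMessages();
    /** Wall-clock time in ms of the last successful send on this connection. */
    long lastSendTime();
}

/** Flags connections whose queue is deep while nothing has been sent for a while. */
final class OutboundStallCheck {
    private final int minQueued;   // assumed threshold, e.g. 100 messages
    private final long maxIdleMs;  // assumed threshold, e.g. 30 seconds

    OutboundStallCheck(int minQueued, long maxIdleMs) {
        this.minQueued = minQueued;
        this.maxIdleMs = maxIdleMs;
    }

    /** True when the queue is at least minQueued deep and idle for at least maxIdleMs. */
    boolean isStalled(OutboundView c, long now) {
        return c.queuedMessages() >= minQueued
            && now - c.lastSendTime() >= maxIdleMs;
    }

    /** Returns the stalled connections so the caller can log them or kick the writer. */
    <T extends OutboundView> List<T> findStalled(Iterable<T> conns, long now) {
        List<T> stalled = new ArrayList<T>();
        for (T c : conns) {
            if (isStalled(c, now))
                stalled.add(c);
        }
        return stalled;
    }
}
```

The check looks at the queue in front of the write buffer, which is the state the existing isWriteBufEmpty() test does not cover. What to do with a flagged connection (log, kick the Writer, or close) is left to the caller.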


Change History (5)

comment:1 Changed 9 years ago by Zlatin Balevsky

Next time you catch this live, check whether the JVM process has an open socket to the remote host. If not, then it's a matter of properly cleaning up state. If so, I'd start looking at how the SelectionKey.interestOps is managed.
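For the interestOps case, a rough sketch of a diagnostic that could be logged per connection. InterestOpsCheck and its methods are hypothetical helpers, and whether OP_WRITE is expected to be set whenever messages are queued is an assumption about the design, not something confirmed here; only the SelectionKey constants are real API.

```java
import java.nio.channels.SelectionKey;

/** Hypothetical helper for logging selector interest state; not part of the router. */
final class InterestOpsCheck {
    /**
     * True when messages are pending but the key is not registered for OP_WRITE,
     * so the selector would never report the channel as writable.
     */
    static boolean writeInterestMissing(int interestOps, int pendingMessages) {
        return pendingMessages > 0 && (interestOps & SelectionKey.OP_WRITE) == 0;
    }

    /** Renders an interest set such as "READ|WRITE" for a log line. */
    static String describe(int interestOps) {
        StringBuilder sb = new StringBuilder();
        append(sb, interestOps, SelectionKey.OP_ACCEPT, "ACCEPT");
        append(sb, interestOps, SelectionKey.OP_CONNECT, "CONNECT");
        append(sb, interestOps, SelectionKey.OP_READ, "READ");
        append(sb, interestOps, SelectionKey.OP_WRITE, "WRITE");
        return sb.length() == 0 ? "NONE" : sb.toString();
    }

    private static void append(StringBuilder sb, int ops, int bit, String name) {
        if ((ops & bit) != 0) {
            if (sb.length() > 0)
                sb.append('|');
            sb.append(name);
        }
    }
}
```

On a live key this would be called as InterestOpsCheck.writeInterestMissing(key.interestOps(), queued), after checking key.isValid(), since interestOps() throws CancelledKeyException on a cancelled key.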

There is one non-bug case where this would be possible and that is if there is starvation and the connection never actually gets to write anything. This would be detectable by counting the number of writes to the SocketChannel that write 0 bytes.
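Counting the zero-byte writes could be done with a small wrapper around the channel write, roughly like this. WriteCounter and looksStarved() are hypothetical names; the sketch takes a WritableByteChannel (which SocketChannel implements) so it can be exercised without a real socket.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;

/** Hypothetical per-connection counter of write attempts that made no progress. */
final class WriteCounter {
    private long attempts;
    private long zeroByteWrites;
    private long bytesWritten;

    /** Writes buf to ch and records whether any bytes were accepted. */
    int write(WritableByteChannel ch, ByteBuffer buf) throws IOException {
        boolean hadData = buf.hasRemaining();
        int n = ch.write(buf);
        attempts++;
        if (n == 0 && hadData)
            zeroByteWrites++;      // channel accepted nothing although data was offered
        else if (n > 0)
            bytesWritten += n;
        return n;
    }

    long getAttempts() { return attempts; }
    long getZeroByteWrites() { return zeroByteWrites; }
    long getBytesWritten() { return bytesWritten; }

    /** True when at least minAttempts writes were tried and none made progress. */
    boolean looksStarved(long minAttempts) {
        return attempts >= minAttempts && zeroByteWrites == attempts;
    }
}
```

A connection that shows many attempts with a high zero-byte count is trying to write and getting nowhere. One with no attempts at all while its queue grows points back at the Writer or at the interestOps handling.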

I'll look into it more on Meeh's router because I suspect we may be having similar problems with SSU.

comment:2 Changed 9 years ago by Zlatin Balevsky

Cc: Zlatin Balevsky removed

comment:3 Changed 9 years ago by Zlatin Balevsky

Cc: Zlatin Balevsky added

trac sux

comment:4 Changed 8 years ago by str4d

Keywords: NTCP added
Milestone: 0.9.5

comment:5 Changed 7 years ago by zzz

Resolution: worksforme
Status: new → closed

I haven't seen this in a long time. Optimistically closing.
