Opened 12 months ago

Last modified 10 months ago

#2640 new defect

Fix UDP dups

Reported by: jogger Owned by: zzz
Priority: major Milestone: 0.9.45
Component: router/transport Version: 0.9.42
Keywords: Cc:
Parent Tickets: Sensitive: no

Description

The UDP transport contains a whole bunch of timing quirks and logic flaws that I have patches for. Here I am presenting only those that call for an immediate fix, as they seriously impact router operation.

UDP speed is severely impacted by the send pauses that follow retransmissions. There is a bug somewhere that I am not sure I have solved completely, but the patches for the following obvious bugs cut retransmissions, with their associated send pauses, from 1-3% to far below 1% as viewed on /peers (totals of the Dup Rx column). CPU usage for the ACKSender is also down more than 50%. Please test on the LIVE net.

  • OMF puts the ACKs into a set while buildPacket() expects an ordered list. This nearly completely disables piggyback ACKing.
  • We keep and send resend ACKs for 5 minutes, although they are useless after _rto, wasting CPU, memory and bandwidth.
  • retrieveACKBitfields() does not call removeACKMessage() as required.
  • ACKs for dups should be handled with priority, since the other end is stalled.
  • _sendFullyTime should be set correctly; it is used for router steering to some degree.
  • ACKSender should start ACKing when the other side may have used up 1/3 of its send window.
--- i2p-0.9.42/router/java/src/net/i2p/router/transport/udp/OutboundMessageFragments.java
+++ i2p-0.9.42 patch/router/java/src/net/i2p/router/transport/udp/OutboundMessageFragments.java
@@ -386,7 +386,7 @@
         int piggybackedPartialACK = partialACKBitfields.size();
         // getCurrentFullACKs() already makes a copy, do we need to copy again?
         // YES because buildPacket() now removes them (maybe)
-        Set<Long> remaining = new HashSet<Long>(msgIds);
+        List<Long> remaining = new ArrayList<Long>(msgIds);
 
         // build the list of fragments to send
         List<Fragment> toSend = new ArrayList<Fragment>(8);
@@ -474,11 +474,12 @@
             newFullAckCount = Math.max(0, newFullAckCount - (before - after));
 
             int piggybackedAck = 0;
+            long now = _context.clock().now();
             if (msgIds.size() != remaining.size()) {
                 for (int j = 0; j < msgIds.size(); j++) {
                     Long id = msgIds.get(j);
                     if (!remaining.contains(id)) {
-                        peer.removeACKMessage(id);
+                        peer.removeACKMessage(id, now);
                         piggybackedAck++;
                     }
                 }
--- i2p-0.9.42/router/java/src/net/i2p/router/transport/udp/PeerState.java
+++ i2p-0.9.42 patch/router/java/src/net/i2p/router/transport/udp/PeerState.java
@@ -339,9 +339,6 @@
     /** for small MTU */
     private static final int MAX_RESEND_ACKS_SMALL = MAX_RESEND_ACKS / 5;
 
-    private static final long RESEND_ACK_TIMEOUT = 5*60*1000;
-
-    
     /**
      *  @param rtt from the EstablishState, or 0 if not available
      */
@@ -860,6 +857,8 @@
         } else {
             //if (true || _retransmissionPeriodStart + 1000 < _context.clock().now()) {
                 _packetsReceivedDuplicate++;
+		// rush dup ACKs
+                _wantACKSendSince = 1;
             //} else {
             //    _retransmissionPeriodStart = _context.clock().now();
             //    _packetsReceivedDuplicate = 1;
@@ -986,7 +985,7 @@
             int sz = _currentACKsResend.size();
             List<Long> randomResends = new ArrayList<Long>(sz);
             if (sz > 0) {
-                long cutoff = _context.clock().now() - RESEND_ACK_TIMEOUT;
+                long cutoff = _context.clock().now() - _rto * 2;
                 int i = 0;
                 for (Iterator<ResendACK> iter = _currentACKsResend.iterator(); iter.hasNext(); ) {
                     ResendACK rack  = iter.next();
@@ -995,7 +994,7 @@
                     } else {
                         iter.remove();
                         if (_log.shouldLog(Log.INFO))
-                            _log.info("Expired ack " + rack.id + " sent " + (cutoff + RESEND_ACK_TIMEOUT - rack.time) +
+                            _log.info("Expired ack " + rack.id + " sent " + (cutoff + _rto * 2 - rack.time) +
                                       " ago, now " + i + " resend acks");
                     }
                 }
@@ -1009,19 +1008,19 @@
      * The ack was sent.
      * Side effect - sets _lastACKSend
      */
-    public void removeACKMessage(Long messageId) {
+    public void removeACKMessage(Long messageId, Long now) {
             boolean removed = _currentACKs.remove(messageId);
             if (removed) {
                 // only add if removed from current, as this may be called for
                 // acks already in _currentACKsResend.
-                _currentACKsResend.offer(new ResendACK(messageId, _context.clock().now()));
+                _currentACKsResend.offer(new ResendACK(messageId, now));
                 // trim happens in getCurrentResendACKs above
                 if (_log.shouldLog(Log.INFO))
                     _log.info("Sent ack " + messageId + " now " + _currentACKs.size() + " current and " +
                               _currentACKsResend.size() + " resend acks");
             }
             // should we only do this if removed?
-            _lastACKSend = _context.clock().now();
+            _lastACKSend = now;
     }
     
     /** 
@@ -1060,18 +1059,17 @@
                 maxResendAcks = MAX_RESEND_ACKS_LARGE;
             List<ACKBitfield> rv = new ArrayList<ACKBitfield>(maxResendAcks);
 
-            // save to add to currentACKsResend later so we don't include twice
-            List<Long> currentACKsRemoved = new ArrayList<Long>(_currentACKs.size());
             // As explained above, we include the acks in any order
             // since we are unlikely to get backed up -
             // just take them using the Set iterator.
             Iterator<Long> iter = _currentACKs.iterator();
+            long now = _context.clock().now();
             while (bytesRemaining >= 4 && iter.hasNext()) {
                 Long val = iter.next();
                 iter.remove();
                 long id = val.longValue();
                 rv.add(new FullACKBitfield(id));
-                currentACKsRemoved.add(val);
+                removeACKMessage(id, now);
                 bytesRemaining -= 4;
             }
             if (_currentACKs.isEmpty())
@@ -1096,13 +1094,6 @@
                         bytesRemaining -= 4;
                     //}
                 }
-                if (!currentACKsRemoved.isEmpty()) {
-                    long now = _context.clock().now();
-                    for (Long val : currentACKsRemoved) {
-                        _currentACKsResend.offer(new ResendACK(val, now));
-                    }
-                    // trim happens in getCurrentResendACKs above
-                }
             }
 
         int partialIncluded = 0;
@@ -1209,6 +1200,7 @@
         
         _consecutiveFailedSends = 0;
         // _lastFailedSendPeriod = -1;
+        long now = _context.clock().now();
         if (numSends < 2) {
             if (_context.random().nextInt(_concurrentMessagesAllowed) <= 0)
                 _concurrentMessagesAllowed++;
@@ -1234,8 +1226,6 @@
         }
         if (_sendWindowBytes > MAX_SEND_WINDOW_BYTES)
             _sendWindowBytes = MAX_SEND_WINDOW_BYTES;
-        _lastReceiveTime = _context.clock().now();
-        _lastSendFullyTime = _lastReceiveTime;
         
         //if (true) {
             if (_sendWindowBytesRemaining + bytesACKed <= _sendWindowBytes)
@@ -1243,6 +1233,9 @@
             else
                 _sendWindowBytesRemaining = _sendWindowBytes;
         //}
+        _lastReceiveTime = now;
+        // to avoid clock jumps would need to get state._startedOn
+        _lastSendFullyTime = Math.max(now - lifetime, _lastSendFullyTime);
         
         if (numSends < 2) {
             // caller synchs
@@ -1442,7 +1435,7 @@
     public boolean unsentACKThresholdReached() {
         //int threshold = countMaxACKData() / 4;
         //return _currentACKs.size() >= threshold;
-        return _currentACKs.size() >= MAX_RESEND_ACKS / 2;
+        return _currentACKs.size() >= DEFAULT_SEND_WINDOW_BYTES / 3 / 1033;
     }
 
     /**

Subtickets

Change History (6)

comment:1 Changed 12 months ago by zzz

Milestone: 0.9.43 → 0.9.44
Priority: critical → major

The .43 check-in deadline was yesterday, but we can take a look at this for .44.

comment:2 Changed 12 months ago by jogger

Maybe here is a reason to do something for .43: I have now found the root cause, which is partly mitigated by the patches above since they provide better ACKing. Since #2505 we let large messages through even if the connection has not grown its send window to a sufficient size, in order to avoid starving.

Since we usually fire them into a 1000 ms RTO, we produce lots of dups and retransmission delays in the net, as most UDP connections are not fast enough for this.

As a quick fix I set MIN_RTO to 3000, allowing for one retry with 6000 ms before the message times out after 10 sec. This brings the retransmission rate seen on /stats back down to around 1%.

Alternatively, one could set nextSendTime depending on message size, but that would involve reworking the RTO logic, which is modelled after TCP and totally inadequate here, since we meld single- and multi-fragment messages, piggyback ACKs, and ACKSender delays into one figure.
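
A minimal sketch of the timeline behind that quick fix, assuming the RTO simply doubles after the single retry; the expiry figure is the one mentioned above, and the sketch is illustrative only, not the router's actual backoff and expiry logic.

// Illustration only: assumes the RTO doubles once after the retry;
// the router's real backoff and expiry handling is more involved.
public class MinRtoSketch {
    public static void main(String[] args) {
        final int minRto = 3000;           // proposed MIN_RTO (ms)
        final int messageTimeout = 10_000; // expiry mentioned above (ms)

        long firstSend = 0;
        long retry = firstSend + minRto;   // single retry at t = 3000 ms
        long backedOffRto = 2L * minRto;   // assumed doubled RTO of 6000 ms after the retry

        System.out.println("initial send at t=" + firstSend + " ms");
        System.out.println("retry at t=" + retry + " ms, then wait up to " + backedOffRto + " ms for an ACK");
        System.out.println("retry window ends at t=" + (retry + backedOffRto)
                           + " ms, inside the " + messageTimeout + " ms expiry");
    }
}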

comment:3 Changed 12 months ago by zzz

Sorry, that horse has left the barn. I'll turn my attention to .44 tickets in a couple weeks.

comment:4 Changed 10 months ago by zzz

Milestone: 0.9.44 → 0.9.45

via email from jogger:

As discussed, here we still have one single fix possible
without reworking much of the code. This is the issue from comment 2.
The solution is to grow the RTO for multi-fragment messages.

I have tested the following for PeerState.java, line 1873:

state.setNextSendTime(now + rto + (state.getFragmentCount() - 1) * ACKSender.ACK_FREQUENCY); 

This gives the remote end enough time to ACK. Ideally one would send
the smallest fragment first in case another message is batched onto
this long one.

The OP still stands and will be included in the list we discussed.
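
A rough, self-contained illustration of how the setNextSendTime() adjustment above scales the retransmission deadline with the fragment count; the ACK_FREQUENCY value below is only a placeholder, the real constant is whatever ACKSender.java defines.

// Illustration only: ACK_FREQUENCY here is a placeholder value;
// the real constant is defined in ACKSender.java.
public class NextSendTimeSketch {
    static final int ACK_FREQUENCY = 350; // ms, placeholder for illustration

    // single-fragment messages keep the plain RTO; each extra fragment
    // gives the remote end one more ACK interval before we retransmit
    static long nextSendTime(long now, int rto, int fragmentCount) {
        return now + rto + (long) (fragmentCount - 1) * ACK_FREQUENCY;
    }

    public static void main(String[] args) {
        // e.g. a 4-fragment message with a 1000 ms RTO: 1000 + 3 * 350 = 2050 ms
        System.out.println(nextSendTime(0, 1000, 4));
    }
}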

Last edited 10 months ago by zzz

comment:5 Changed 10 months ago by zzz

This looks reasonable. I'll be testing it over the next few days to see if I get the same results.

comment:6 Changed 10 months ago by zzz

On a fast floodfill I see a modest reduction in retransmissions (10-30%). I presume the real win is on a low-bandwidth box.
In 61d60600d48b41439435e377f1a98e22ac4393f1, to be 0.9.44-5.
Leaving the ticket open for the more complex changes to follow.
