Opened 2 years ago

Closed 2 years ago

Last modified 2 years ago

#1964 closed enhancement (fixed)

Probability based tunnel rejection kicks in at hardcoded value of participating tunnels

Reported by: Mysterious Owned by:
Priority: minor Milestone: 0.9.30
Component: router/general Version: 0.9.29
Keywords: Cc:
Parent Tickets:

Description

By default, once the participating tunnel count reaches 1000 (configurable via router.minThrottleTunnels), the router starts throttling tunnel acceptance using a mechanism that seeks to limit tunnel growth.

When playing with the router.maxParticipatingTunnels value in the past, I as a user didn't understand why this counterintuitive behavior occurred. Increasing that value should also adjust the related throttling behavior, so that the net effect is a capacity increase or decrease. router.minThrottleTunnels is a fairly obscure setting; I only found it this time because I was looking at the code.

Also, the default maximum number of participating tunnels is insanely high at 10000.
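
For readers unfamiliar with the code, a minimal sketch of the shape of this check follows - hypothetical names and simplified logic, not the actual RouterThrottleImpl code; the defaults shown are the 1000/10000 values mentioned above:

import java.util.Random;

// Hypothetical, simplified sketch - not the actual RouterThrottleImpl logic.
public class ThrottleSketch {
    private final Random random = new Random();

    boolean acceptTunnelRequest(int numTunnels, int minThrottle, int maxTunnels) {
        if (numTunnels >= maxTunnels)          // e.g. 10000 by default
            return false;                      // hard limit: always reject
        if (numTunnels > minThrottle) {        // e.g. 1000 by default
            // probabilistically reject some requests to slow tunnel growth;
            // the real router derives the probability from recent growth stats
            if (random.nextDouble() < rejectProbability(numTunnels))
                return false;
        }
        return true;
    }

    // placeholder: the real code bases this on 10-minute tunnel stats
    private double rejectProbability(int numTunnels) {
        return 0.0;
    }
}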

Example patch coming soon.

Subtickets

Change History (34)

comment:1 Changed 2 years ago by Mysterious

Please check this blog post for patch: http://mysterious.i2p/blog/poking-into-the-i2p-router.html

comment:2 Changed 2 years ago by echelon

  • Resolution set to not a bug
  • Status changed from new to closed

Those limits are there for a reason: to not overload a single router with too much load and too many connections.
I2P is not always installed on high-capacity systems with lots of resources.
E.g. some SOHO routers will die horribly under too many connections, or will not accept too many connection requests in a minute.
These throttles on participating tunnels limit:

  • the number of connections per node
  • the number of tunnels per requesting node
  • the CPU load of the node

comment:3 Changed 2 years ago by Mysterious

I'm well aware of those routers that die; I used to own one that hardlocked on me at ~1500 tunnels. What I'm proposing here is something that increases this limit using the obvious "hey, I have plenty of capacity" slider. Did you notice that my proposed patch left the default threshold at 1000?

comment:4 Changed 2 years ago by echelon

Hi,
It is not about the single router; it is about the network, and about limiting the number of connections and participating tunnels from single nodes to a router.
You already enable high-capacity routers with the bandwidth setting (look into router class X, e.g.). That's the way it is set up; no need for a slider, as it also sets other options to make sure the capacity is used up to these limits.

comment:5 Changed 2 years ago by Mysterious

Hi Echelon,

I'm aware that a typical internet gateway has a limited tolerance for traffic before it breaks down. And I also understand that the bandwidth limits set by the typical user, and the associated bandwidth classes, regulate this.

All of this is based on assumptions, on safe bets about what a typical SOHO setup will handle. I'm not proposing changing this default mechanism; it works for a typical user. Instead I'm talking about the advanced options, which are for power users, for people who know their network infrastructure.

Some of the advanced options that are fairly simple to understand, and which I found way before I even looked into the code:
router.maxParticipatingTunnels
i2np.ntcp.maxConnections
i2np.udp.maxConnections

My expectation was that these are the limiters in play: one for tunnels in general, since that determines router load, and one per transport, since that depends on the capacities of the network gateway.

What I'm asking is that the other limiters that complement these, in this case router.minThrottleTunnels, scale with these kinds of obvious advanced settings.

The reason I ask is that for someone who understands their network, who understands the connection load their internet gateway can handle, it isn't obvious that I2P will still try to throttle their connections even after they have increased these limits to match their network capabilities.

In essence I'm asking that you give the power user, who sits between the typical user who just uses defaults and the software engineer who can dig through the codebase, a somewhat more pleasant experience.

comment:6 Changed 2 years ago by echelon

So you ask for a 2h code session for the 0.01% power-user part of the network, for whom the 3 advanced config options are not enough and who need a slider for those three values?
Sure, it is an enhancement, code to be inserted with a patch. But usually not worth the effort.
Also, the throttle code is there for a reason: to not overload a router with too many tunnel build requests, too many tunnels from a single router, and too many tunnels overall. It is not a resource option alone; it is also there for security reasons.
E.g. some attacks overload a router with too many tunnel requests and shut it down that way, or try to route all tunnels through one single peer to catch all traffic, ...
That's what I meant with "network reasons". Several attacks are mostly mitigated with throttles/limits.
And the 1000 is the minimum level at which these limits start, which seems quite sane: it is not hit by low-end routers, but even for class X routers it is sane to have this throttle to not overload the single router. And instead of a hard limit, there is a kind of adaptive throttle built in, as hard limits would show a zig-zag line in active participating tunnels due to network setup (e.g. "router is not capable, ignore for x days...").

comment:7 Changed 2 years ago by Mysterious

I didn't open this ticket with the expectation that someone else would work on it; it's more for tracking purposes (I checked with zzz, and he likes to have tickets for bugs/improvements that are found). I'm glad to do the work myself - that's how open source development works, because I'm the one who cares.

But for any changes I would like to propose/make, I do need buy-in. I simply identified those 3 advanced options as the ones that a power user is likely to tweak and can relate to their CPU and network power. In this case I'm proposing to derive router.minThrottleTunnels from router.maxParticipatingTunnels. Not a slider, which wouldn't suit power users anyway.

As far as security/reliability (which is interesting feedback), you seem to mention several things:

1) Bringing down a node by doing too many requests
--> Isn't this the node owner's responsibility, to set limits that can be handled load-wise? We are talking advanced settings here.

2) Bandwidth class X I2P routers that are still prone to overload
--> This is definitely true; the default settings should not allow class X routers to be truly unlimited in the number of tunnels. Home broadband can meet the 20 Mbit/s needed for this without having an internet gateway that is up to the job. Again, this falls under standard settings, not advanced.

3) Routing too much traffic through a single peer, and the associated anonymity risk
--> A client would never want to purposefully expose themselves and their tunnels, so I'm assuming you are referring to destination tunnels where the service is part of traffic analysis of some kind. Doesn't the service already know everything about the traffic through the tunnels associated with it? Or is the interest in the entry point of the client tunnels?

4) Not having hard limits kick in frequently
--> Agreed, this seems a sensible graceful-degradation strategy, at the very least for all but the very high end of I2P routers. Even there it might have merit.

comment:8 Changed 2 years ago by zzz

  • Resolution not a bug deleted
  • Status changed from closed to reopened

I haven't yet carefully reviewed the above or the blog post, but in the meantime I'm reopening so I can find it more easily. Not necessarily committing to accepting it, just that it's worth further review.

comment:9 Changed 2 years ago by Mysterious

As promised, the patches that directly affect this ticket:
http://p66g2a4nzfkvidd3l7nwphcnfa3ttyu5kiolcb4czec2rn2kvwsq.b32.i2p/assets/patches-06-03-2017/0004-RouterThrottleImpl-make-tunnel-throttling-dependent-.patch

I incorporated some of echelon's feedback in this one. These are slightly updated versions of yesterday's. Please also check the blog for cleanup patches 5 and 6.

comment:10 Changed 2 years ago by zzz

Some initial thoughts and principles:

1) I don't run a very-high-speed router. hottuna and echelon do, each with over 10 years experience, and echelon is probably our expert in configuration for it. I think he's even written a guide.

2) We must be very careful not to make changes that inadvertently reduce the overall capacity of the network. That would be very bad.

2a) We must be careful not to make low-powered routers more prone to getting hung up.

3) For those people in 1) that have tuned and configured their router, possibly following a configuration guide that's out there, I don't want to make it run slower or be more congested.

4) If there is a configuration option that even echelon doesn't know about, or that isn't in any of his guides, it's probably safe to change the default or the way it works for non-default values, as long as we adhere to 2) and 3). I'd like to hear from echelon whether he knows about the router.minThrottleTunnels option. Maybe the most important result of this ticket would be to document it better.

5) All the stuff in RouterThrottleImpl is just one part of the throttling. There's a lot more in BuildHandler.

6) DEFAULT_MAX_TUNNELS is supposed to be insanely high. It's a last resort. The dozen other throttle checks will limit almost any hardware before that. Due to 2) and 3), NACK on reducing it from 10000 to 3000.

7) The probabilistic test in question here, in acceptTunnelRequest(), defaults to 1.3x growth every 10 minutes, or about 4.8x/hour, starting at 1000. So it would allow roughly 23,000 after two hours (see the sketch at the end of this comment).

8) If the sole purpose of the probabilistic test is to gradually start rejecting tunnels as we approach some limit, and that limit is knowable (because of a configuration), then it makes sense that the threshold for that test (currently 1000 by default) have a default that's some fraction of the limit. However:

9) If the probabilistic test has other benefits, e.g. keeping the router from being quickly overwhelmed (before other lagging stats catch up, allowing the multiple other throttle checks to work correctly), then it makes sense to keep the threshold at 1000 by default, or even lower it to protect low-powered devices. Alternatively, the default could be lowered for ARM and Android.

Agreeing on what we're trying to do here (8 and 9) is a good first step.

The message in the console sidebar - that changes based on which throttle test did the rejection - is good for quick glances to see what's going on. For serious experiments, you may need to resort to logging, or at least looking at the stats.
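
To sanity-check the growth numbers in item 7) above, here is a small, purely illustrative calculation, assuming the 1.3x-per-10-minutes factor and the 1000-tunnel starting threshold stated there:

// Illustrative only: compound 1.3x growth per 10-minute period from 1000 tunnels.
public class GrowthCheck {
    public static void main(String[] args) {
        double tunnels = 1000;
        for (int period = 1; period <= 12; period++) {
            tunnels *= 1.3;
            System.out.printf("after %3d min: %.0f tunnels%n", period * 10, tunnels);
        }
        // after 60 min: ~4827 tunnels (about 4.8x/hour)
        // after 120 min: ~23298 tunnels
    }
}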

comment:11 Changed 2 years ago by echelon

Ok, I did not know that router.minThrottleTunnels is an advanced option available at all; I had no knowledge of it and was under the impression it was a hardcoded value.
DEFAULT_MAX_TUNNELS should be high, in the 10k area, as I often hit 3k or 4k tunnels.

About mysterious: some values are not about the client itself, but about the network. Sure, a client wants to configure itself to not be overloaded, but if we change a value for all clients, a single client cannot stop others from DDOSing it, whatever limits it has set. Even if throttle/limit code seems to act only locally, it needs to take the impact on the network as a whole into account.

comment:12 Changed 2 years ago by Mysterious

@Echelon: I'm not surprised by 3K/4K (I've seen 3K myself), but I sincerely doubt that a default-config router can hit that.
@Echelon: I know that the defaults, and the easy-to-access tuneables, are part of a well-intentioned router and thus of the network. I would like to know what the stance is on all peers being untrusted and potentially malicious.
@zzz/5: I know, the cleanup patches specifically add comments about that ;-)
@zzz/6: Why have an insanely high limit?
@zzz/8: For mid- to high-end routers I suspect this is the case.
@zzz/9: One of my follow-up ideas was to make the max tunnels + threshold dependent on bandwidth, because, like you say, a low-end device will not be served by a limiter that kicks in at 1000. My experience from the past is that 1000, or a bit more, is a nice value for a class X router with default connection settings.

comment:13 Changed 2 years ago by echelon

Hm, a default-config router is class X, and that hits 3k or 4k tunnels rather easily over time with good usage. Also, I once had >30k participating tunnels during an attack, which resulted in even more throttles and limits. But 10k is a sane limit. Routers with less capacity are constrained by a lot of other limits first and will never hit the 10k. No need to lower that limit. Currently 10k is not hit, but 5-7k can be done on good routers. No need to lower the 10k and later raise it again in times of higher need.
And with all the limits implemented, the max participating tunnels are limited by the bandwidth selection (and therefore the speed class of the router). Not directly, but indirectly.
And a class X router is "unlimited"; no need to cut it off at 1000.
Remember: I2P depends on connections, not on bandwidth. We need the connections AND the tunnels; bandwidth does not matter much, as long as tunnels can be built.

comment:14 Changed 2 years ago by zzz

re: why high limit

I think our goal should be that a router on a supercomputer, with only a single configuration change from the defaults (setting the in/out bandwidths to some huge number), gets up to 10K participating tunnels. That's a nice goal, and a reasonable limit, as we don't want one router to swallow ALL the participating tunnels in the network.

If the proposed changes help us get there, while not breaking our other goals in comment 10 above, I'm in favor.

comment:15 Changed 2 years ago by Mysterious

@echelon: A class X router, with floodfill, has a ~4.8K connection limit, with 3/4 of it being SSU. Even if you max that out, you need 2 connections to sustain a tunnel, so in an ideal world the limit is 2400 tunnels, ignoring client and exploration tunnels. I acknowledge that I2P is a connection game, and rarely is bandwidth the problem.

My example of 1000+ tunnels is based on a 1:1 NTCP/SSU usage ratio, where only half of the 2400 theoretical maximum is used.

I also know that if you raise the connection limits, and you're not CPU-bound, you can go a long way - essentially limited by demand, which would be extreme in times of a DDOS-type attack.

I'll give some thought to whether we can use the individual connection limits of the NTCP and SSU transports as the measure of capacity, and to what kind of graceful-degradation strategy could benefit embedded routers, typical routers, and high-end ones.

And on the side, also consider the extreme case of enterprise-grade hardware, where connection limits and bandwidth stop correlating.

Many thanks for the input; if you have any more, let me know.

comment:16 Changed 2 years ago by echelon

Yeah, a limit on connections is built in to avoid overload. Those limits look rather sane, as those numbers are not hit often. Do you think we should raise them?
Also, 2400 tunnels per 4800 connections may be a good theoretical value, but in practical terms most tunnels are idle and unused, so you get a 1:1 ratio or worse. No need to really change the max tunnels because of connection limits. Feel free to allow more tunnels if needed.
Also, do not be fixated on a 1:1 TCP/UDP ratio; that's only a theoretical value that will not happen in real life, though there are different costs and preferences.
Do you want to change the costs? Are those values not what the real costs in resources are?

The idea behind the classes was to get I2P adapted to these setups: small network (ISDN), better network (DSL), good network (cable), superb network (gigabit line), also in correlation with the CPU and memory resources needed. But except for class X, that was set in 2003 - a long time ago. We should look again at the distribution of these classes and maybe adapt some settings, e.g. so the lowest class does not route any traffic at all, ...
Over time the classes moved away from a "network resources" basis towards a "base set of functions this class is able to perform" category, although still based on bandwidth.
And I am in favor of setting those based on bandwidth for the user; power users may use advanced config options to tweak their routers, which they surely already do.

comment:17 Changed 2 years ago by zzz

The math in comment 15 isn't valid; there's not a 1:1 mapping from tunnels to connections.

We could - perhaps - raise the class X conn limits even higher, or add a test for whether there's a UPnP device in the path before increasing them. On the NTCP side, one other issue is file descriptors (ulimit).

For the default minThrottleTunnels, what do you think about 100 for Android, else 500 for ARM, else 1000?

On the settings: we currently use the max bandwidth as set by the user as a proxy for the conn limit settings as well. This has worked pretty well. We ask users to configure one thing, something they understand. Turning that around would, I think, be problematic.

comment:18 Changed 2 years ago by Mysterious

Your proposal seems like an improvement, but it misses a fundamental consideration: throttling thresholds that are in place for connection reasons should somehow be related to the maximum number of connections, and I don't see that here. Or is this threshold purely determined by CPU load?

As promised, I wrote down my thoughts, because I realized I was oversimplifying things. I've put them in a blog post and linked it from this forum post: http://zzz.i2p/topics/2254-my-ideas-on-how-to-handle-finite-resources-in-an-i2p-context

comment:19 Changed 2 years ago by Mysterious

Example: a low-bandwidth router on an x86 platform is never going to hit that 1000 threshold, so there it doesn't serve a purpose.

comment:20 Changed 2 years ago by zzz

The probabilistic stuff is unrelated to connection limits.

Conn limit throttling is based on real-time conditions, not 10-minute stats.

A lot of the other throttling is based on 10-minute stats, so the probabilistic section limits growth per 10 minutes, so that the router isn't overwhelmed before the stats have a chance to 'catch up'. I do believe this is more about CPU.

The 100/500/1000 proposal in comment 17 is intended to address only my item 9) in comment 10.

Now that I look at it more, the probabilistic throttling seems only to be concerned with growth rate - not with dropping as we approach max tunnels. The lower limit minThrottleTunnels exists only because the code section might be a little expensive. If that's true, and we want it to stay that way, then item 8) in comment 10 is wrong, and while we've learned a lot in this discussion, the premise of this ticket doesn't hold. What are your thoughts on this - do we now, or should we, do probabilistic dropping as we approach maxTunnels? If so, how would you fix the code?
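
One possible shape for such dropping is sketched below. This is purely illustrative - not the actual RouterThrottleImpl code - and the 75%-of-max soft threshold is an assumption:

import java.util.Random;

// Illustrative only - not the actual RouterThrottleImpl code.
// Reject tunnel requests with increasing probability as the
// participating-tunnel count approaches the configured maximum.
public class SoftLimitDemo {
    static boolean accept(int current, int max, Random rnd) {
        double softStart = 0.75 * max;        // assumption: start rejecting at 75% of max
        if (current < softStart)
            return true;                      // well under the limit: always accept
        if (current >= max)
            return false;                     // hard limit reached: always reject
        // linear ramp: 0% rejection at softStart, 100% at max
        double rejectProb = (current - softStart) / (max - softStart);
        return rnd.nextDouble() >= rejectProb;
    }

    public static void main(String[] args) {
        Random rnd = new Random();
        for (int n : new int[] { 5000, 7500, 8750, 9500, 10000 })
            System.out.println(n + " tunnels -> accept: " + accept(n, 10000, rnd));
    }
}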

comment:21 Changed 2 years ago by Mysterious

I'd be tempted to replace the entire thing with a mechanism that works the same way all the time (and doesn't have any hysteresis-like properties); acting on growth rate isn't something I can explain easily.

But we should agree on a couple of things first:

  • Are we limiting the CPU load of the I2P node?
  • Or are we limiting connections?
  • What have we implicitly tied to tunnel count that needs throttling/graceful degradation?

The rest will depend on the answers to these questions.

The idea is that we can slowly introduce better understood metrics and throttling criteria, starting with this one, but it would be kind of silly just to take it out and start with a new random one.

I'm more than willing to code something up, but I do need to know what kind of "knobs" we want, because any new metric will always require tuning.

comment:22 Changed 2 years ago by echelon

We limit all of these, not just one aspect. We want to limit the CPU usage if it gets too high (which increases lag, to which we react by killing tunnels).
We want to limit connections, to keep enough free.
We want to limit participating tunnels, both in total and per single node, and also in creation rate and in the share of client vs. participating tunnels.

comment:23 Changed 2 years ago by zzz

  • 1) We don't directly measure or limit CPU load, and it's not really possible from Java. But several of the checks elsewhere measure and react to delays and/or queue sizes, which will be indirectly affected once we start running out of CPU.
  • 2) We limit connections, as we limit almost everything, but the probabilistic code isn't related to connections, as I stated in comment 20.
  • 3) I think the best way I can answer that, and your question in comment 18 ("is this purely a CPU-load-determined threshold?"), is this: the probabilistic throttling / growth-limiting code is there to protect all of the other throttling checks that depend on lagging stats. This is what I tried to explain in comment 10 item 9). By limiting growth, the router won't be driven into overload (and perhaps be unable to recover) before the lagging stats have time to 'catch up' and start rejecting tunnels at the various other checks. So it's a general-purpose protection in addition to all the other specific checks and limits.

If we agree with that, then I don't think the probabilistic code needs to be replaced - on the contrary, I think it needs to kick in a lot earlier on low-power platforms, as I proposed in comment 17. Something like the following:

--- router/java/src/net/i2p/router/RouterThrottleImpl.java	e34313db907278c0881f442a559110b58d71c3ff
+++ router/java/src/net/i2p/router/RouterThrottleImpl.java	3f961a5e2b05681b47d8ea7911dadd32209c0673
@@ -8,6 +8,7 @@ import net.i2p.util.SimpleTimer;
 import net.i2p.stat.RateStat;
 import net.i2p.util.Log;
 import net.i2p.util.SimpleTimer;
+import net.i2p.util.SystemVersion;
 
 /**
  * Simple throttle that basically stops accepting messages or nontrivial 
@@ -33,6 +34,8 @@ public class RouterThrottleImpl implemen
     private static final String PROP_MAX_PROCESSINGTIME = "router.defaultProcessingTimeThrottle";
     private static final long DEFAULT_REJECT_STARTUP_TIME = 10*60*1000;
     private static final String PROP_REJECT_STARTUP_TIME = "router.rejectStartupTime";
+    private static final int DEFAULT_MIN_THROTTLE_TUNNELS = SystemVersion.isAndroid() ? 100 :
+                                                            SystemVersion.isARM() ? 500 : 1000;
 
     /**
      *  TO BE FIXED - SEE COMMENTS BELOW
@@ -461,7 +464,7 @@ public class RouterThrottleImpl implemen
     
     /** dont ever probabalistically throttle tunnels if we have less than this many */
     private int getMinThrottleTunnels() { 
-        return _context.getProperty("router.minThrottleTunnels", 1000);
+        return _context.getProperty("router.minThrottleTunnels", DEFAULT_MIN_THROTTLE_TUNNELS);
     }
     
     private double getTunnelGrowthFactor() {

comment:24 Changed 2 years ago by Mysterious

A protection that is instant instead of time-delayed sounds fine, although this is definitely worth a few lines of comments explaining it. Something like this should probably be put above the code block that does the rejection:

/* 
 * Lag-based statistics use a moving average window (of, for example, 10 minutes);
 * they are therefore sensitive to sudden rapid growth of load, which is not
 * instantly detected by these metrics. Reduce tunnel growth if we are growing
 * faster than the lag-based metrics can reliably detect.
 */

Something does bother me about the proposed values: if we are protecting against inrush, why do we even want a minimum? I remember you mentioned calculation cost, but is that really justified? And why make it dependent on CPU architecture? A high-end ARM server could be on par with some x86 systems.
From a simplicity point of view (and thus also a reliability one), removing the minimum makes more sense.

FYI I'm not being nitpicky for the sake of it, I just want these things to be well understood.

Last edited 2 years ago by Mysterious

comment:25 Changed 2 years ago by zzz

The reason you don't start the probabilistic calculation at zero (or one) is that we have to assume a minimum level of capability, or else we throttle things way too early.

We wouldn't want to wait half an hour to go from 1 tunnel to 2, another half hour to go from 2 to 4, ...
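
To put numbers on that, a quick back-of-the-envelope calculation, again assuming the 1.3x-per-10-minutes growth factor from comment 10; the floor values are just examples:

// Hypothetical: minutes needed to ramp up to 1000 tunnels from a given
// starting floor, at 1.3x growth per 10-minute period.
public class RampTime {
    public static void main(String[] args) {
        for (int floor : new int[] { 1, 100, 500, 1000 }) {
            int periods = (int) Math.ceil(Math.log(1000.0 / floor) / Math.log(1.3));
            System.out.printf("floor %4d -> ~%d minutes to reach 1000%n", floor, periods * 10);
        }
        // floor 1: ~270 min; floor 100: ~90 min; floor 500: ~30 min; floor 1000: 0 min
    }
}

Starting at 1, reaching 1000 would take over four hours; starting at 100, about 90 minutes - the scenario discussed in the next two comments.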

comment:26 Changed 2 years ago by Mysterious

You make a good point (which deserves a code comment, IMO). What happens if we put the limit at 100 for all platforms? It would take 90 minutes to reach 1000 participating tunnels. And if the growth limiter happens to have a problem, a lot of routers would see it, rather than only the Android routers. Alternative code paths are usually undertested and potentially broken, in my experience.

comment:27 Changed 2 years ago by zzz

90 minutes to 1000 tunnels is waaay too slow and violates some of my principles in comment 10 above.

We already have people complaining that we don't accept any tunnels for the first 10 minutes, so it takes 15-20 minutes for big routers to fill up after a restart. We can't extend that to 2 hours to get to 2000 tunnels.

I propose applying the patch in comment 23 as-is for 0.9.30. We can review and adjust after the release, or even prior to the release if it gets enough testing.

Alternatively, we can keep discussing and hold off until 0.9.31 or whenever we've agreed on the details.

comment:28 Changed 2 years ago by Mysterious

I'm open to incremental improvements, although I do ask you to consider the code comment I mentioned in comment 24, especially since a true long-term solution isn't going to be a quick thing to get right.

As for the longer-term thing: if a variable growth rate for different systems is a must (which I can understand), and CPU is the primary lag introducer, then why not use java.lang.management to check CPU usage?

comment:29 Changed 2 years ago by zzz

  • Milestone changed from undecided to 0.9.30
  • Resolution set to fixed
  • Status changed from reopened to closed

patch from comment 23 and comment from comment 24 in 36bccf88c6e3b8c856c8acc147337a207be25703 to be 0.9.29-13

comment:30 Changed 2 years ago by zzz

I neglected to answer your java.lang.management question.

I'm not familiar with it, and I don't recall anybody suggesting it before. I see it's not available on Android; I don't know if there are any other availability issues.

If you have a specific suggestion on how we could use it effectively, please either open a separate ticket or start a thread on zzz.i2p.

comment:31 Changed 2 years ago by zzz

FYI, I did a test putting the load average from java.lang.management into the console, and it worked - basically:

String.format("%5.3f", ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage())

still unsure how we could use it.
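
For reference, the one-liner wrapped into a self-contained form (java.lang.management is a standard JDK API; as noted below, the value is not available on every platform):

import java.lang.management.ManagementFactory;

public class LoadAvg {
    public static void main(String[] args) {
        // 1-minute system load average, or a negative value if the
        // platform does not provide one.
        double load = ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();
        System.out.println(String.format("%5.3f", load));
    }
}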

comment:32 Changed 2 years ago by zzz

getSystemLoadAverage() returns -1 (not available) on Windows.

comment:33 Changed 2 years ago by Mysterious

For some reason I thought it would be simple CPU usage (something that could be normalized to a 0-100 range); load averages are far from perfect indicators. Anyway, if the availability is poor across platforms, then it makes no sense to even consider using these things.

Since my original intent was to help I2P: what does I2P need (at the router level)? If it's a long answer, you can always throw it in a mail to t2ooN5-nwmiPTZrEfwQQvq~nWuJpndvQLl4w0uudBaM-8VLb3eSBeGFryxy~X0lwHfDUmKvcPLERAh2FeB0oWC

comment:34 Changed 2 years ago by zzz

The best place to look for things we need help with is right here on trac. zzz.i2p also has discussions on a number of topics. See what interests you. Talk to us on IRC #i2p-dev for more info.
