Opened 7 years ago

Closed 6 years ago

#819 closed defect (fixed)

high job lag

Reported by: Eche|on
Owned by: mkvore
Priority: major
Milestone: 0.9.5
Component: api/i2cp
Version: 0.9.4
Keywords: job lag many scheduled
Cc: mkvore@…
Parent Tickets:
Sensitive: no

Description

After some hours of 0.9.4 being active in the live net, my 0.9.3-15 (and now 0.9.4) router got a very high job lag of 20-30 sec.
Connections broke down (but ONLY SSU/UDP) and tunnel builds were not going well.
I have now restarted for the third time in 24h because of this. It only affects the UDP connections so far.
Also: I did limit SSU to a max of 500 peers, I limited connections to the high-load eepsite echelon.i2p, I limited other services… no luck yet.
On the other server with nearly the same load there is no such problem.

In the log I do only see:
12/19/12 10:00:55 PM WARN [P reader 1/4] net.i2p.router.JobQueue: Dropping job due to overload! # ready jobs: 22869: job = HandleFloodfillDatabaseLookupMessageJob: Job 14492061: Handle Database Lookup Message
12/19/12 10:00:53 PM ↓↓↓ 1 similar message omitted ↓↓↓
12/19/12 10:00:29 PM WARN [P reader 2/4] net.i2p.router.JobQueue: Dropping job due to overload! # ready jobs: 21649: job = HandleFloodfillDatabaseLookupMessageJob: Job 14482160: Handle Database Lookup Message
12/19/12 10:00:17 PM ↓↓↓ 3 similar messages omitted ↓↓↓
12/19/12 10:00:03 PM WARN [P reader 3/4] net.i2p.router.JobQueue: Dropping job due to overload! # ready jobs: 22497: job = HandleFloodfillDatabaseLookupMessageJob: Job 14469626: Handle Database Lookup Message

No other log entries related to the job lag.

echelon

Subtickets

Change History (23)

comment:1 Changed 7 years ago by Eche|on

Moin

Got more information. Looks like a problem with Jetty: the Jetty pool was set to X receivers, but in the I2P tunnel I allowed 0 (aka unlimited). If the number of requests via I2PTunnel exceeds X (as set in the jetty.xml config file), the high job lag does occur.
And it will not go down fast.
But why it only affects UDP and not TCP connections, and why only the floodfill errors appear in the log file, remains open here.
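As a rough, JDK-only illustration of that mismatch (a hypothetical sketch, not I2P or Jetty code): a fixed pool of X workers fed from an unbounded queue never rejects anything, so once requests arrive faster than X threads can serve them, the backlog and the lag only grow.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class PileupSketch {
        public static void main(String[] args) {
            final int X = 4; // stands in for the Jetty pool size from jetty.xml
            ExecutorService pool = Executors.newFixedThreadPool(X); // bounded workers, unbounded queue
            for (int i = 0; i < 1000; i++) {   // "0 = unlimited" requests arriving from the tunnel side
                final int req = i;
                pool.submit(() -> {
                    try { Thread.sleep(1000); } catch (InterruptedException ignored) { }
                    return req;                // each request takes ~1 s to serve
                });
            }
            // With only X workers and no cap on the queue, the last request waits ~250 s;
            // nothing is dropped, the latency just keeps climbing.
            pool.shutdown();
        }
    }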

echelon

comment:2 Changed 7 years ago by zzz

So far, every time I've seen high job lag it is "real" - caused by extreme memory pressure, or high CPU usage by other things running on the PC. This case - queueing of HTTP server requests - is a new one.

As you have discovered, it's important to configure things correctly, and there are some decisions to make. When things get really busy, do you start dropping HTTP requests, or queue them? What's the threshold for doing that? And there are a few queues to worry about - one in i2ptunnel, and one in Jetty, configured in jetty.xml. Especially with inbound requests, it's good to start dropping at some point, or you can get DoSed.

The overload dropping algorithm (in JobQueue.shouldDrop()) is really crappy. There are dozens of different job types. Only one is dropped on overload - HandleFloodfillDatabaseLookupJob. No matter what the cause of a huge job queue, dropping only those probably isn't sufficient. We need to look for other low-priority jobs we can add to the list, so the router has a better chance of recovering from overload.
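A minimal sketch of the kind of change suggested here (illustrative only, not the actual net.i2p.router.JobQueue code; the threshold value and the second job name are placeholders):

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    class OverloadDropSketch {
        // today only the floodfill lookup handler is dropped; the idea is to
        // grow this set with other low-priority job types
        private static final Set<String> DROPPABLE = new HashSet<>(Arrays.asList(
                "HandleFloodfillDatabaseLookupMessageJob",
                "SomeOtherLowPriorityJob"));            // hypothetical addition
        private static final int OVERLOAD_THRESHOLD = 1000; // illustrative value

        boolean shouldDrop(Object job, int readyJobs) {
            if (readyJobs < OVERLOAD_THRESHOLD)
                return false;                           // not overloaded: run everything
            return DROPPABLE.contains(job.getClass().getSimpleName());
        }
    }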

comment:3 Changed 7 years ago by zzz

Possibly related: #779

comment:4 Changed 7 years ago by Eche|on

Component: unspecified → apps/SAM

Moin

Some more information from simple testing:
Running I2PSnark with >50 torrents raises the message lag from 50 to 900 msec while starting, and to 200-300 msec while running (80-90 connected peers, 200/200 kB/sec traffic).

Now the interesting job lag: iMule!
Looks like SAMv3 has a serious bug.
When running iMule with some traffic, the job lag rises slowly. If traffic rises, job lag rises (I had 1 sec with upload limited to 50 kB/sec in iMule). Stopping iMule and stopping its tunnels shows: job lag of 0.

Sorry for the extra work now…

echelon

comment:5 Changed 7 years ago by sponge

No SAM here, but I got job pileup!
ERROR [uterWatchdog] 2p.router.tasks.RouterWatchdog: Ready and waiting jobs: 6101

Yes, you read that correctly… 6101 jobs - no wonder it's lagging!

I will be keeping an eye on http://localhost:7657/jobs
Perhaps an additional clue can be found there.

comment:6 Changed 7 years ago by zzz

Do you guys think 0.9.4 is worse or has a new bug, or is it just that the network is now faster and it's pumping more data and getting clogged up?

i.e. did you have this level of traffic before?

comment:7 Changed 7 years ago by sponge

Component: apps/SAM → router/transport
Keywords: many scheduled added

Looks like some kind of job leak or duplicate job creation on Update Delivery Status to me.
The scheduled jobs keep growing… Are some of these jobs rescheduling themselves while some other code is adding them as well?
Scheduled jobs: 784

Changing component to 'router transport' because it seems to be related to delivery of packets.

comment:8 Changed 7 years ago by zzz

Owner: zzz deleted
Status: new → assigned

huhwhut?

what evidence do you have of 'update delivery status' issues? (this is an I2CP thing)

what evidence do you have of a transport problem? (which is unrelated to I2CP)

you should never see 'update delivery status' except when running, maybe, SAM or BOB or another app that sets message reliability to non-default (default is 'none')

please supply data / evidence, all I see is guesses.

comment:9 Changed 7 years ago by Eche|on

Moin

From my side: load/traffic/usage of the router has not changed much, just slightly more bandwidth overall due to 0.9.4.
But in iMule I do not get more traffic; it is limited to a fixed value for down-/uploading.
I do not run BOB, I run everything like before 0.9.4 without any problems.
But as soon as iMule starts up and pushes data to other iMule users, the job lag grows slightly and the I2P CPU usage grows from 40% up to 200%.
As soon as iMule is stopped, job lag goes down to 0 and CPU usage down to 40% again.
It could be that the better net makes iMule send too much data at once and it piles up in the I2P router, I do not know. Sorry. Is there anything I could easily provide to help debug?

echelon

comment:10 Changed 7 years ago by zzz

Component: router/transport → api/i2cp

sponge will add more info here on his fix.

echelon, do you see 'update delivery status' on /jobs when running imule? It doesn't look like it's added in SAM (can't be sure though), but it could be set by iMule itself.

comment:11 Changed 7 years ago by Astral2012

In iMule-2.3.2.1-src/src/libs/i2p/CSamSession.cpp:

    void CSam::sam_datagram_session_create ()
    {
        wxString cmd =
            wxString (wxT ("SESSION CREATE STYLE=DATAGRAM DESTINATION="))
            << m_privKey << wxT (" ID=")
            << getSessionName ()
            << wxT (" i2cp.messageReliability=BestEffort ")
            << m_session_options << wxT ("\n");

        sam_queue (prepareMsg (cmd));
    }

    void CSam::sam_stream_session_create ()
    {
        wxString cmd =
            wxString (wxT ("SESSION CREATE STYLE=STREAM DESTINATION="))
            << m_privKey << wxT (" ID=")
            << getSessionName ()
            << wxT (" i2cp.messageReliability=BestEffort ")
            << m_session_options << wxT ("\n");

        sam_queue (prepareMsg (cmd));
    }

comment:12 Changed 7 years ago by zzz

OK, finally getting to the bottom of this. Definitely an I2CP issue on the router side.

There's a real bad problem in 0.9.4 if i2cp.messageReliability=BestEffort. Those Update Delivery Status jobs will requeue every half second, forever.

This should be fixed in 0.9.4-4 24025a1dbd3d2d7633e683cfa025c6926f248b12

In addition, iMule should not be setting BestEffort. None is much more efficient. I'm going to fix SAM to force None even if the client sets something else.
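The failure mode can be sketched roughly like this (illustrative names only, not the real router classes): a per-message status job that always reschedules itself half a second later never leaves the queue, so every BestEffort message adds one more job that lives forever.

    import java.util.Timer;
    import java.util.TimerTask;

    class UpdateDeliveryStatusSketch {
        private final Timer jobQueue = new Timer(true);

        // called once per outbound message when reliability is BestEffort
        void scheduleStatusCheck() {
            jobQueue.schedule(new TimerTask() {
                public void run() {
                    // ...check whether delivery was confirmed...
                    scheduleStatusCheck();   // requeues itself every 500 ms, forever
                }
            }, 500);
        }
    }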

comment:13 Changed 7 years ago by zzz

ech, please confirm on /jobs:

  • Thru 0.9.4-3, when running with imule, you'll see a bazillion Update Delivery Status jobs queued, and eventual explosion
  • As of 0.9.4-4, you won't see them queued (more than a couple anyway) but you'll still see them counted in the table at the bottom

comment:14 Changed 7 years ago by sponge

Turning off best effort worked for me. I had it on for irc, from way back in the days when such things actually mattered.

comment:15 Changed 7 years ago by Eche|on

Moin

For mkvore: http://pastethis.i2p/show/2565/
So far it looks good after 1h uptime. Far better than with 0.9.4-3.
Will report if anything changes.

echelon

comment:16 Changed 7 years ago by zzz

ech please confirm the info in comment 13 above - the Update Delivery Status counts (queued and run) on /jobs in -3 and -4

comment:17 Changed 7 years ago by Eche|on

Moin

I can confirm the info in comment 13.

echelon

comment:18 Changed 7 years ago by zzz

Thanks. I'm leaving the ticket open until I force "None" in SAM as described in comment 12 above. Then I'll reassign to mkvore.

See also http://forum.i2p/viewtopic.php?t=7436

kytv has patched imule and uploaded new packages to https://launchpad.net/~i2p.packages/+archive/i2p

comment:19 Changed 7 years ago by str4d

I am going to assume my problem is related:

Ever since the update to 0.9.4, one of my routers has been hanging every few hours (sometimes a couple, sometimes almost a day). The graphs show a spike in job lag just before the hang, followed by 5 minutes of no graph stats, and then the wrapper restarts the router. The spike isn't always large (sometimes 0.5s, sometimes several seconds), but I'm assuming that for the smaller ones the spike just doesn't get logged properly.

I updated to 0.9.4-4 and the problem persisted (the logs below are from 0.9.4-4).

An example in wrapper.log:

    XXXX/XX/XX 00:10:XX | Please fasten seatbelts, SeedlessServer is now launched.
    XXXX/XX/XX 00:11:XX | Please fasten seatbelts, EepAnnouncer is now launched.
    XXXX/XX/XX 10:30:XX | JVM appears hung: Timed out waiting for signal from JVM.
    XXXX/XX/XX 10:30:XX | JVM did not exit on request, terminated
    XXXX/XX/XX 10:30:XX | Launching a JVM...

And the corresponding log-1.txt (the remaining timestamps correspond to those above):

    XX/XX/XX 0:06:XX AM WARN  [Addressbook ] .naming.BlockfileNamingService: The hosts database was not closed cleanly or is still open by another process
    XX/XX/XX 6:38:XX AM WARN  [P reader 1/4] net.i2p.router.JobQueue       : Dropping job due to overload!  # ready jobs: 104: job = HandleFloodfillDatabaseLookupMessageJob: Job 488288: Handle Database Lookup Message
    XX/XX/XX 6:38:XX AM ^^^ 3 similar messages omitted ^^^
    XX/XX/XX 9:12:XX PM INFO  [JobQueue 2/5] b.kademlia.FloodfillMonitorJob: Only 339 ff peers and we want 360 so we are becoming floodfill
    XX/XX/XX 10:19:XX PM WARN  [receiver 4/5] net.i2p.router.JobQueue       : Dropping job due to overload!  # ready jobs: 101: job = HandleFloodfillDatabaseLookupMessageJob: Job 786946: Handle Database Lookup Message
    XX/XX/XX 10:19:XX PM ^^^ 57 similar messages omitted ^^^
    XX/XX/XX 10:22:XX PM WARN  [P reader 1/4] net.i2p.router.JobQueue       : Dropping job due to overload!  # ready jobs: 101: job = HandleFloodfillDatabaseLookupMessageJob: Job 793055: Handle Database Lookup Message
    XX/XX/XX 10:22:XX PM ^^^ 149 similar messages omitted ^^^
    XX/XX/XX 10:26:XX PM WARN  [P reader 1/4] net.i2p.router.JobQueue       : Dropping job due to overload!  # ready jobs: 101: job = HandleFloodfillDatabaseLookupMessageJob: Job 803832: Handle Database Lookup Message
    XX/XX/XX 10:26:XX PM ^^^ 44 similar messages omitted ^^^
    XX/XX/XX 10:27:XX PM WARN  [P reader 3/4] net.i2p.router.JobQueue       : Dropping job due to overload!  # ready jobs: 102: job = HandleFloodfillDatabaseLookupMessageJob: Job 805592: Handle Database Lookup Message
    XX/XX/XX 10:27:XX PM ^^^ 17 similar messages omitted ^^^
    XX/XX/XX 10:27:XX PM WARN  [P reader 4/4] net.i2p.router.JobQueue       : Dropping job due to overload!  # ready jobs: 135: job = HandleFloodfillDatabaseLookupMessageJob: Job 805747: Handle Database Lookup Message
    XX/XX/XX 10:27:XX PM ^^^ 46 similar messages omitted ^^^

I have checked all my config files; i2cp.messageReliability=BestEffort is not set anywhere.

comment:20 in reply to:  15 Changed 7 years ago by deadman

Replying to echelon:

For mkvore: http://pastethis.i2p/show/2565/
echelon

Thank you. I've committed changes to the i2p.imule.v2 branch.
It has run for more than 8 hours without huge I2P CPU usage.

Will build Mac OS and Windows versions soon.

comment:21 Changed 7 years ago by mkvore

Cc: mkvore@… added

@deadman: I did not see your changes in i2p.imule.v2. I made them again.
@zzz: I cannot sync with mtn.i2p2.i2p with key "mkvore-transport@…". I only synced with mtn.i2p-projekt.i2p. Maybe that explains why I did not see deadman's changes?

comment:22 Changed 7 years ago by zzz

Owner: set to mkvore

@mkvore you'll have to talk to the guy that runs mtn.i2p2.i2p (welterde) to get that fixed.

Alternates are i2p-projekt and killyourtv's servers; they all sync to each other.

As I promised in comment 18 above, I fixed SAM to force 'none' in 17077999dc2c94bd5e65b1c3c6dcb6220d691d0f (0.9.4-5).
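The shape of that override, as a hedged sketch rather than the actual SAM source: whatever i2cp.messageReliability the client sent in SESSION CREATE is overwritten before the I2CP session is created.

    import java.util.Properties;

    class ForceReliabilitySketch {
        Properties sanitize(Properties clientOpts) {
            Properties opts = new Properties();
            opts.putAll(clientOpts);
            // ignore whatever the client sent, e.g. BestEffort from iMule
            opts.setProperty("i2cp.messageReliability", "none");
            return opts;
        }
    }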

Reassigning to mkvore.

comment:23 Changed 6 years ago by zzz

Resolution: fixed
Status: assigned → closed

mkvore released a new version with the changes
