Opened 4 years ago

Last modified 4 years ago

#1705 testing defect

net.i2p.util.SimpleTimer2 : timer error

Reported by: hummingbird Owned by:
Priority: minor Milestone: 0.9.24
Component: api/utils Version: 0.9.22
Keywords: Cc: Zlatin Balevsky
Parent Tickets: #1694 Sensitive: no

Description

I've no idea if this is something to report or not. It's just write in red.

Uptime: 18 hours.

I2P version: 0.9.22-24-rc
Java version: Oracle Corporation 1.7.0_85 (OpenJDK Runtime Environment 1.7.0_85-b01)
Wrapper version: 3.5.25
Server version: 8.1.17.v20150415
Servlet version: Jasper JSP 2.1 Engine
Platform: Linux amd64 3.16.0-38-generic
Processor: (athlon64)
Jbigi: Locally optimized library libjbigi-linux-athlon64_64.so loaded from file
Encoding: UTF-8
Charset: UTF-8

08/11/15 15:04:42 ERROR [leTimer2 ¾] net.i2p.util.SimpleTimer2 : timer error

java.lang.IllegalStateException?: net.i2p.router.transport.OutboundMessageRegistry?$CleanupTask?@39cf2ebd not possible to be in RUNNING
at net.i2p.util.SimpleTimer2$TimedEvent?.run2(SimpleTimer2.java:418)
at net.i2p.util.SimpleTimer2$TimedEvent?.run(SimpleTimer2.java:397)
at java.util.concurrent.Executors$RunnableAdapter?.call(Executors.java:471)
at java.util.concurrent.FutureTask?.run(FutureTask?.java:262)
at java.util.concurrent.ScheduledThreadPoolExecutor?$ScheduledFutureTask?.access$201(ScheduledThreadPoolExecutor?.java:178)
at java.util.concurrent.ScheduledThreadPoolExecutor?$ScheduledFutureTask?.run(ScheduledThreadPoolExecutor?.java:292)
at java.util.concurrent.ThreadPoolExecutor?.runWorker(ThreadPoolExecutor?.java:1145)
at java.util.concurrent.ThreadPoolExecutor?$Worker.run(ThreadPoolExecutor?.java:615)
at java.lang.Thread.run(Thread.java:745)

Subtickets

Attachments (7)

SimpleTimer2Test.2.java (8.5 KB) - added by kay 4 years ago.
junit test for SimpleTimer2.java
SimpleTimer2Test.java (9.0 KB) - added by kay 4 years ago.
junit test for SimpleTimer2.java
SimpleTimer2Test.3.java (10.0 KB) - added by kay 4 years ago.
version including forceReschedule
test-log.tgz (149.2 KB) - added by kay 4 years ago.
the full log up to the first error
1705-patch.diff (4.9 KB) - added by kay 4 years ago.
patch
finer-patch.diff (12.0 KB) - added by kay 4 years ago.
finer patch
SimpleTimer2Test.4.java (12.8 KB) - added by kay 4 years ago.
test for the two main design goals

Download all attachments as: .zip

Change History (60)

comment:1 Changed 4 years ago by zzz

Component: unspecifiedapi/utils
Parent Tickets: 1694

Dup of #1694, we'll call this a failed test

comment:2 Changed 4 years ago by kay

The exception message suggests that it shouldn't be possible that the CleanupTask?,
which extends SimpleTimer2.TimedEvent?, is running, but actually nothing prevents this.

For the underlying ScheduledThreadpoolExecutor? repeated submissions of one and the same
task object (there is always only one CleanupTask? instance) are seen and treated as different tasks.
The schedule method decorates the task and submits it to the internal queue.
Thus, repeated submissions via schedule() are different tasks.

Hmm, possible remedies could be a rewrite of TimedEvent? that makes use of the returned
ScheduledFuture? from the schedule() call. This could also avoid keeping track of the
event's states (CANCELED, RUNNING,…). Quite some rewrite, I guess.

A second possibility might be to override the decorateTask() method from ScheduledThreadPoolExecutor?.
But care has to be taken, that this doesn't cause problems for the scheduler internal queue.

An interesting problem.

comment:3 Changed 4 years ago by zzz

Milestone: undecided0.9.23

You say "actually nothing prevents" it being queued twice, but everything in ST2 actually tries pretty hard to prevent this. You say repeated schedule()s are different but we don't do that. The STPE javadoc says that decorateTask() does nothing so the object isn't replaced.

The one window I can think of is where the task is dequeued by a timer thread and hasn't run yet. At that point would a schedule() or cancel() put it on the queue again, so now we have two?

Please help me see what you're seeing. In the meantime, going to do some logging, and also ask zab for help.

comment:4 Changed 4 years ago by kay

The following is quoted from ScheduledThreadPoolExecutor?.java (jdk1.8, but
a quick web-search gave me the same for jdk7 and 6):

   public ScheduledFuture<?> schedule(Runnable command,                        
                                       long delay,                              
                                       TimeUnit unit) {                         
        if (command == null || unit == null)                                    
            throw new NullPointerException();                                   
        RunnableScheduledFuture<?> t = decorateTask(command,                    
            new ScheduledFutureTask<Void>(command, null,                        
                                          triggerTime(delay, unit)));           
        delayedExecute(t);                                                      
        return t;                                                               
    }    

and, you're right decoratedTask() dosn't change Runnable,
but returns the second argument for further processing.
So, the comment is somewhat missleading.

    /**
     * Modifies or replaces the task used to execute a runnable.
     * This method can be used to override the concrete
     * class used for managing internal tasks.
     * The default implementation simply returns the given task.
     *
     * @param runnable the submitted Runnable
     * @param task the task created to execute the runnable
     * @return a task that can execute the runnable
     * @since 1.6
     */
    protected <V> RunnableScheduledFuture<V> decorateTask(
        Runnable runnable, RunnableScheduledFuture<V> task) {
        return task;
    }

I couldn't reproduce the exception in my router, but
I will try to make up a test case reproducing this,
when I have some time.

Last edited 4 years ago by kay (previous) (diff)

comment:5 Changed 4 years ago by zzz

@OP the error was at 18 hours of uptime, or that's when you saw it? If the latter, how long after startup was it?

comment:6 Changed 4 years ago by Zlatin Balevsky

Yes, ScheduledExecutorService does wrap the Runnable into a Future, but that's why we keep track of the state of the Runnable using that state machine. It is intended to prevent dups so if dups are still appearing then it's either a bug in the state machine or something else.

comment:7 Changed 4 years ago by Zlatin Balevsky

Cc: Zlatin Balevsky added

comment:8 Changed 4 years ago by hummingbird

@zzz 18 hours uptime is when I saw it. I cannot find the startup time in the log, all I can find is that at "08/11/15 00:22:16" the node was up.

comment:9 Changed 4 years ago by zzz

OK thanks, so it had been up at least 15 hours. i've seen a regular burst of timer activity an hour after startup, I was wondering if that was it.

For future reference, you can find the exact start time on /events in the console.

comment:10 Changed 4 years ago by hummingbird

09/11/15 12:09:44 Stopped router 2
08/11/15 05:28:09 Enabled floodfill
08/11/15 01:02:21 Watchdog warning
07/11/15 23:48:20 Started router 0.9.22-24-rc
07/11/15 23:48:20 Updated router from 0.9.22-22 to 0.9.22-24-rc
07/11/15 23:47:39 Stopped router 5
07/11/15 23:38:45 Started router 0.9.22-22

comment:11 Changed 4 years ago by zzz

Status: newtesting

zab and I think it's a race in cancel() (via reschedule()) that can cause a dup if the task has been dequeued but hasn't hit the first synch block in run2() yet. We don't have proof, and don't have a foolproof fix yet.

But hopeful that this is an effective workaround: 5650f4a2c634873750403488ae6493b32c67599d in 0.9.22-25-rc, please test.

comment:12 Changed 4 years ago by kay

I do not have a nice junit test b/c 'retrieving' the IllegalStateExceptions?
seems quite tricky from junit.

But I created a test-hack which I attach here.
Please, review, if it makes sense.

In addition, I noticed that changing the various parameters of the test lead to quite different results.
Thus, I could trigger all possible ISE from SimpleTimer2.

The below test results were triggered using java1.8(oracle) and the attached SimpleTimer2Test.java

The 'usage' comment from the attached SimpleTimer2Test:

/** (see #1694 and #1705 for background)
 * This test will always fail due to timeout!!!
 *
 * This is not a good junit test, but a quick hack that requires 
 * adding a line to SimpleTimer2.java and looking at the output
 * of the junit task's log, if there are any IllegalStateExceptions
 * thrown in SimpleTimer2.
 * (It ought to be possible to tweak this hack-test and turn 
 * it into a real one w/t having to add the Systen.err.println to
 * SimpleTimer2)
 *
 * The repeated* methods below create some stressing workload
 * to trigger IllegalStateExceptions. 
 * Changing the load or uncommenting these methods lead 
 * to different results.
 *
 * 
 * Add something like
 *    System.err.println("testing timer error: " + re.getMessage());
 * to the catch clause of run() in SimpleTimer2.TimedEvent
 * in order to have some output.
 */

Test results for version 7c7c5cd7a371b07c27174d83ac90349

Many "reschedule after run" and one "not possible to be RUNNING"
during the 60sec of test run.

junit.test:
    [mkdir] Created dir: /some/path/i2p.i2p-7c7c5cd7a371b07c27174d83ac90349/reports/core/junit
    [junit] Running net.i2p.util.SimpleTimer2Test
    [junit] => starting repeated scheduling 
    [junit] => starting repeated cancelling 
    [junit] => starting repeated re-scheduling 
    [junit] => starting repeated unproductive work 
    [junit] =>  waiting ... 
    [junit] testing timer errorjava.lang.IllegalStateException: net.i2p.util.SimpleTimer2Test$DummyTimedEvent@7341dbce rescheduleAfterRun cannot be true here
    [junit] testing timer errorjava.lang.IllegalStateException: net.i2p.util.SimpleTimer2Test$DummyTimedEvent@7341dbce rescheduleAfterRun cannot be true here
    [junit] testing timer errorjava.lang.IllegalStateException: net.i2p.util.SimpleTimer2Test$DummyTimedEvent@7341dbce rescheduleAfterRun cannot be true here
    [junit] testing timer errorjava.lang.IllegalStateException: net.i2p.util.SimpleTimer2Test$DummyTimedEvent@7341dbce rescheduleAfterRun cannot be true here
    [junit] testing timer errorjava.lang.IllegalStateException: net.i2p.util.SimpleTimer2Test$DummyTimedEvent@7341dbce rescheduleAfterRun cannot be true here
    [junit] testing timer errorjava.lang.IllegalStateException: net.i2p.util.SimpleTimer2Test$DummyTimedEvent@7341dbce not possible to be in RUNNING
    [junit] testing timer errorjava.lang.IllegalStateException: net.i2p.util.SimpleTimer2Test$DummyTimedEvent@7341dbce rescheduleAfterRun cannot be true here
    .....

Test results for version 5650f4a2c634873750403488ae6493b32c67599d

Many "reschedule after run" and no "not possible to be RUNNING"
during the 120 sec of test run.
I saw a "not possible to be RUNNING" during one test run, but I couldn't
reproduce it.

junit.test:
    [mkdir] Created dir: /some/path/i2p.i2p-5650f4a2c634873750403488ae6493b32c67599d/i2p.i2p/reports/core/junit
    [junit] Running net.i2p.util.SimpleTimer2Test
    [junit] => starting repeated scheduling 
    [junit] => starting repeated cancelling 
    [junit] => starting repeated re-scheduling 
    [junit] => starting repeated unproductive work 
    [junit] =>  waiting ... 
    [junit] testing timer error: net.i2p.util.SimpleTimer2Test$DummyTimedEvent@388d97d1 rescheduleAfterRun cannot be true here
    [junit] testing timer error: net.i2p.util.SimpleTimer2Test$DummyTimedEvent@388d97d1 rescheduleAfterRun cannot be true here
    [junit] testing timer error: net.i2p.util.SimpleTimer2Test$DummyTimedEvent@388d97d1 rescheduleAfterRun cannot be true here
    [junit] testing timer error: net.i2p.util.SimpleTimer2Test$DummyTimedEvent@388d97d1 rescheduleAfterRun cannot be true here
    ....

comment:13 Changed 4 years ago by kay

addendum for version 5650f4a2c634873750403488ae6493b32c67599d

setting the reschedule parameter to -7 (i.e. reschedule(-7)) causes several "not possible to be RUNNING" ISEs:

/i2p.i2p-5650f4a2c634873750403488ae6493b32c67599d/i2p.i2p/reports/core/junit
    [junit] Running net.i2p.util.SimpleTimer2Test
    [junit] => starting repeated scheduling 
    [junit] => starting repeated cancelling 
    [junit] => starting repeated re-scheduling 
    [junit] => starting repeated unproductive work 
    [junit] =>  waiting ... 
    [junit] testing timer error: net.i2p.util.SimpleTimer2Test$DummyTimedEvent@198444aa not possible to be in RUNNING
    [junit] testing timer error: net.i2p.util.SimpleTimer2Test$DummyTimedEvent@198444aa rescheduleAfterRun cannot be true here
    [junit] testing timer error: net.i2p.util.SimpleTimer2Test$DummyTimedEvent@198444aa not possible to be in RUNNING
    [junit] testing timer error: net.i2p.util.SimpleTimer2Test$DummyTimedEvent@198444aa not possible to be in RUNNING
    [junit] testing timer error: net.i2p.util.SimpleTimer2Test$DummyTimedEvent@198444aa not possible to be in RUNNING
    [junit] testing timer error: net.i2p.util.SimpleTimer2Test$DummyTimedEvent@198444aa rescheduleAfterRun cannot be true here
    [junit] testing timer error: net.i2p.util.SimpleTimer2Test$DummyTimedEvent@198444aa rescheduleAfterRun cannot be true here
    [junit] testing timer error: net.i2p.util.SimpleTimer2Test$DummyTimedEvent@198444aa rescheduleAfterRun cannot be true here
    [junit] testing timer error: net.i2p.util.SimpleTimer2Test$

This was the worst case, but three runs in a row yielded one, four, and two of these.

comment:14 Changed 4 years ago by Zlatin Balevsky

@kay:
can you try the test on the latest version of ST2 from mtn?

comment:15 in reply to:  14 Changed 4 years ago by kay

Replying to zab:

@kay:
can you try the test on the latest version of ST2 from mtn?

yes, could you post the revision; i currently only see changes from
kytv 6c9304cfe2f746a6d9cd9918b158f157f5f33293

comment:16 Changed 4 years ago by Zlatin Balevsky

@kay:
since you so kindly volunteered to write a (really good!) unit test, I'm gonna throw something more at you - http://pastethis.i2p/show/9699/

The logic behind that paste is that if the Future is already running, _future.cancel(true) will interrupt the thread it's running on. Then inside the first sync in block in run checking for Thread.currentThread::isInterrupted() should do the trick.

Could you please run a few tests after applying that patch? It's against trunk revision 6c9304cfe2f746a6d9cd9918b158f157f5f33293

comment:17 Changed 4 years ago by kay

I 'littered' the ST2 with another System.err.println inside the if(...isInterrupted()){} part.

The testruns reliably produce a few "rescheduleAfterRun cannot be true here" each time.

I saw one "not possible to be RUNNING", but am still investigating how to repeat this.
(As I managed to overwrite the respective log, it might have been a ghost.)

log excerpt revision 6c9304cfe2f746a6d9cd9918b158f157f5f33293 and zab's patch from pastethis

... /i2p.i2p-6c9304cfe2f746a6d9cd9918b158f157f5f33293-latest/reports/core/junit
    [junit] Running net.i2p.util.SecondSimpleTimer2Test
    [junit] => starting repeated scheduling 
    [junit] => starting repeated cancelling 
    [junit] => starting repeated re-scheduling 
    [junit] => starting repeated unproductive work 
    [junit] =>  waiting ... 
    [junit] I was interrupted in run, state CANCELLED event net.i2p.util.SecondSimpleTimer2Test$DummyTimedEvent@512677c9
      ...
    [junit] I was interrupted in run, state CANCELLED event net.i2p.util.SecondSimpleTimer2Test$DummyTimedEvent@512677c9
    [junit] I was interrupted in run, state SCHEDULED event net.i2p.util.SecondSimpleTimer2Test$DummyTimedEvent@512677c9
                                            ^^^^^^^^^
    [junit] testing timer error net.i2p.util.SecondSimpleTimer2Test$DummyTimedEvent@512677c9 rescheduleAfterRun cannot be true here
    [junit] I was interrupted in run, state CANCELLED event net.i2p.util.SecondSimpleTimer2Test$DummyTimedEvent@512677c9
    [junit] I was interrupted in run, state CANCELLED event net.i2p.util.SecondSimpleTimer2Test$DummyTimedEvent@512677c9
    ....

comment:18 Changed 4 years ago by kay

With a little reflection I changed the hack-test into a unit test.

It doesn't require touching the ST2 anymore.
I just intercept the logger and retrieve all error and warn statements.
The junit test fails if there are any errors logged during execution.

Please find the newly attached SimpleTimer2Test.java .

Changed 4 years ago by kay

Attachment: SimpleTimer2Test.2.java added

junit test for SimpleTimer2.java

comment:19 Changed 4 years ago by zzz

Status: testingneeds_work

The kytv checkin 6c93 was about reseed, unrelated to ST2, so it's equivalent to 0.9.22-25-rc 5650.

Test results on 5650 in comment 13 indicate that it didn't fix the problem.
Not sure why zab in comment 14 requested to try on 'the latest'.

Will review zab's proposed changes soon.

Last edited 4 years ago by zzz (previous) (diff)

Changed 4 years ago by kay

Attachment: SimpleTimer2Test.java added

junit test for SimpleTimer2.java

comment:20 Changed 4 years ago by kay

added an execution counter in order to make the test fail when deadlocks or similar occur

the file wasn't replaced. this is the one with the counter
https://trac.i2p2.de/attachment/ticket/1705/SimpleTimer2Test.java

Last edited 4 years ago by kay (previous) (diff)

comment:21 Changed 4 years ago by hummingbird

"But hopeful that this is an effective workaround: 5650f4a2c634873750403488ae6493b32c67599d in 0.9.22-25-rc, please test."

@zzz: after 41 hours up with 0.9.22-25-rc, no errors in my log.

comment:22 Changed 4 years ago by kay

As there is a really long list of comments now:

But, in total there are less and less errors.

  • comment 20 points to the latest junit test for this problem.

kind regards,
k

comment:23 Changed 4 years ago by Zlatin Balevsky

@kay
can you confirm that you saw (or didn't see) the "cannot be in RUNNING" with my patch?

comment:24 Changed 4 years ago by kay

I just repeated the test several times and can confirm ISE "not possible in RUNNING" did show, but it is rare. See the second 'caused by…' below:

junit.framework.AssertionFailedError: => errors logged
=> ERROR logged: timer error
caused by --->java.lang.IllegalStateException: net.i2p.util.SimpleTimer2Test$DummyTimedEvent@5e2fc13f rescheduleAfterRun ccannot be true here
=> ERROR logged: timer error
caused by --->java.lang.IllegalStateException:: net.i2p.util.SimpleTimer2Test$DummyTimedEvent@5e2fc13f not possible to be in RUNNING
                                                                             ^^^^^^^
=> ERROR logged: timer error                                   
caused by --->java.lang.IllegalStateException: net.i2p.util.SimpleTimer2Test$DummyTimedEvent@5e2fc13f rescheduleAfterRun cannot be true here=> all warninngs and errors in interception order 

I also expanded the test (will post the newer version soon) to track the size of the ScheduledThreadPoolExecutor?'s queue, which should maximally be one during the test.
And, found this

   ...
=> no. of execs: 30814
=> [java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@3a988a87]
=> no. of execs: 32344
=> [java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@4f0b4aa, java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@3bcde019]
=> no. of execs: 32345
=> [java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@1ae3627a]
=> no. of execs: 60069
=> [java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@64d10f13]
=> no. of execs: 60164
=> [java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@6505b2c7, java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@319c4b56]
=> no. of execs: 60165
=> [java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@30ed1fad]
=> no. of execs: 60260
  ...

I modified the test's logging for a third run:

=> errors logged
=> ERROR logged: timer error caused by --->java.lang.IllegalStateException: net.i2p.util.SimpleTimer2Test$DummyTimedEvent@9836b79 rescheduleAfterRun cannot be true here
=> ERROR logged: Found more that one element in executor queue: 2 caused by --->null
=> ERROR logged: timer error caused by --->java.lang.IllegalStateException: net.i2p.util.SimpleTimer2Test$DummyTimedEvent@9836b79 rescheduleAfterRun cannot be true here
=> ERROR logged: Found more that one element in executor queue: 2 caused by --->null
=> ERROR logged: timer error caused by --->java.lang.IllegalStateException: net.i2p.util.SimpleTimer2Test$DummyTimedEvent@9836b79 rescheduleAfterRun cannot be true here

The queue size is just a sample not continuous monitoring, as it is checked when a warning or error is logged.

In addition, I'm concerned that cancelling a running task could cause a problem elsewhere.
What did I overlook here?

comment:25 in reply to:  23 Changed 4 years ago by kay

Replying to zab:

@kay
can you confirm that you saw (or didn't see) the "cannot be in RUNNING" with my patch?

When the test runs longer and is not stopped after collection three to five errors
the "not possible to be RUNNING" show more often.

comment:26 in reply to:  23 Changed 4 years ago by kay

Third reply @zab:
after adding a test thread that triggers forceReschedule repeatedly there are more "not possible to be in RUNNING" ISEs

comment:27 Changed 4 years ago by Zlatin Balevsky

I too am not sure if cancelling a running task is a good idea. If the exception persists it means duplicates are still somehow making their way into the queue; so interrupting the executing thread may persist onto the next execution.

Can you try my patch without resetting the status to CANCELLED in the isInterrupted() block?

comment:28 Changed 4 years ago by Zlatin Balevsky

Also, are you getting any "could not cancel" errors? If not then we're looking at the wrong place.

Are you on IRC2P? It might be easier to bounce things back and forth there.

Last edited 4 years ago by Zlatin Balevsky (previous) (diff)

comment:29 Changed 4 years ago by Zlatin Balevsky

Also, see if you can reproduce this with DEBUG log level and if yes please attach a log of that here. Thanks!

comment:30 Changed 4 years ago by Zlatin Balevsky

I've checked in a modified version of the interrupt() patch in revision be0fe594126eb1975b9ba2941a6de086ca858a5f (or -28) Please test that with DEBUG logging enabled and post the log here.

Last edited 4 years ago by Zlatin Balevsky (previous) (diff)

comment:31 in reply to:  30 Changed 4 years ago by kay

Replying to zab:

I've checked in a modified version of the interrupt() patch in revision be0fe594126eb1975b9ba2941a6de086ca858a5f (or -28) Please test that with DEBUG logging enabled and post the log here.

I'll take a look at this revision soon.

@comment28: there were no "could not cancel errors"

comment:32 Changed 4 years ago by zzz

FYI for everybody, the rev in comment 30 is 0.9.22-28-rc, please test

I'm now running the test here. Simplest way if you don't want to fight with junit: copy to core/java/test/junit/net/i2p/util/, add a main(), ant buildTest, java -cp build/i2ptest.jar net.i2p.util.SimpleTimer2Test

I did get a test CME in the test at the end:

11/14 09:17:30.998 WARN  [Thread-3    ] net.i2p.util.SimpleTimer2Test : Resched. timeout < 0: net.i2p.util.SimpleTimer2Test$DummyTimedEvent@5c5ce611 timeout = -7 state: RUNNING
=> no. of execs: 149580
Exception in thread "main" java.util.ConcurrentModificationException
	at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:859)
	at java.util.ArrayList$Itr.next(ArrayList.java:831)
	at net.i2p.util.SimpleTimer2Test.printNicely(SimpleTimer2Test.java:182)
	at net.i2p.util.SimpleTimer2Test.testTimerEvent(SimpleTimer2Test.java:97)
	at net.i2p.util.SimpleTimer2Test.main(SimpleTimer2Test.java:274)
11/14 09:17:31.022 WARN  [Thread-3    ] net.i2p.util.SimpleTimer2Test : Resched. timeout < 0: net.i2p.util.SimpleTimer2Test$DummyTimedEvent@5c5ce611 timeout = -7 state: SCHEDULED
11/14 09:17:31.045 WARN  [Thread-3    ] net.i2p.util.SimpleTimer2Test : Resched. timeout < 0: net.i2p.util.SimpleTimer2Test$DummyTimedEvent@5c5ce611 timeout = -7 state: SCHEDULED
11/14 09:17:31.068 WARN  [Thread-3    ] net.i2p.util.SimpleTimer2Test : Resched. timeout < 0: net.i2p.util.SimpleTimer2Test$DummyTimedEvent@5c5ce611 timeout = -7 state: CANCELLED

zab and I are both stumped by the 'cannot be true' errors reported above. Haven't seen here yet, but haven't given up.

comment:33 Changed 4 years ago by kay

Just ran a few tests and I think it looks good except for these warnings:

=> No exceptions logged, but some warnings
...
 => WARN logged: I was interrupted in run, state SCHEDULED event net.i2p.util.SimpleTimer2Test$DummyTimedEvent@278d638a future: isCancelled = false, isDone = false caused by --->null
 => WARN logged: I was interrupted in run, state SCHEDULED event net.i2p.util.SimpleTimer2Test$DummyTimedEvent@278d638a future: isCancelled = false, isDone = false caused by --->null
 => WARN logged: I was interrupted in run, state SCHEDULED event net.i2p.util.SimpleTimer2Test$DummyTimedEvent@278d638a future: isCancelled = false, isDone = false caused by --->null
 => WARN logged: I was interrupted in run, state SCHEDULED event net.i2p.util.SimpleTimer2Test$DummyTimedEvent@278d638a future: isCancelled = false, isDone = false caused by --->null
 => WARN logged: I was interrupted in run, state SCHEDULED event net.i2p.util.SimpleTimer2Test$DummyTimedEvent@278d638a future: isCancelled = false, isDone = false caused by --->null
 => WARN logged: I was interrupted in run, state SCHEDULED event net.i2p.util.SimpleTimer2Test$DummyTimedEvent@278d638a future: isCancelled = false, isDone = false caused by --->null
 => WARN logged: I was interrupted in run, state SCHEDULED event net.i2p.util.SimpleTimer2Test$DummyTimedEvent@278d638a future: isCancelled = false, isDone = false caused by --->null
 => WARN logged: I was interrupted in run, state SCHEDULED event net.i2p.util.SimpleTimer2Test$DummyTimedEvent@278d638a future: isCancelled = false, isDone = false caused by --->null

I adapted the interrupted part:

--- core/java/src/net/i2p/util/SimpleTimer2.java	56048756f81635fbde8cd9f7445df4cc6a4d46a8
+++ core/java/src/net/i2p/util/SimpleTimer2.java	a6958181098c7ac78ac5271ca86b024a96edac49
@@ -423,7 +423,12 @@ public class SimpleTimer2 {
             long delay = 0;
             synchronized(this) {
                 if (Thread.currentThread().isInterrupted()) {
-                    _log.warn("I was interrupted in run, state "+_state+" event "+this);
+                    if(_future.isCancelled() && _future.isDone()){
+                        _state = TimedEventState.CANCELLED;
+                        return;
+                    }
+                    _log.warn("I was interrupted in run, state " + _state + " event " + this
+                              + " future: isCancelled = " + _future.isCancelled() + ", isDone = " + _future.isDone());
                     return;
                 }
                 if (_rescheduleAfterRun)

b/c a cancelled task that is done should be reflected in _state. Does it make sense?

comment:34 in reply to:  32 ; Changed 4 years ago by kay

Replying to zzz:

FYI for everybody, the rev in comment 30 is 0.9.22-28-rc, please test

I'm now running the test here. Simplest way if you don't want to fight with junit: copy to core/java/test/junit/net/i2p/util/, add a main(), ant buildTest, java -cp build/i2ptest.jar net.i2p.util.SimpleTimer2Test

I edited the core build.xml to only include the SimpleTimer2Test.

I did get a test CME in the test at the end:

That seems to be the tests fault: the threads were not finished writing to the list that is printed.
Will work that out.

zab and I are both stumped by the 'cannot be true' errors reported above. Haven't seen here yet, but haven't given up.

What are these 'cannot be true' errors? What do I miss here?

In general, a thread is not cancelled when running or more often when it is done.
That's why I made the adaption in the above comment:33

Question:
What should the state be after SCHEDULED ... future: isCancelled = false, isDone = false (in the if(…interrupted()){..}) ? CANCELLED would be ok, too.
Only the cases with _state RUNNING will be problematic, but can they show up still?

comment:35 Changed 4 years ago by Zlatin Balevsky

What should the state be after SCHEDULED … future: isCancelled = false, isDone = false (in the if(…interrupted()){..}) ?

That's the bit I'm not sure about.

comment:36 Changed 4 years ago by zzz

Status: needs_worktesting

for the purposes of your unit test, you can probably ignore WARN altogether, those are just for zab and I to see what's going on. That also would fix the CME.

I've run your test about 10 times (on 0.9.22-28-rc with ST2 tweaked to not log the warnings, just for less noise and no CME) without error.

comment:37 in reply to:  34 Changed 4 years ago by kay

Addendum to 34:

That seems to be the tests fault: the threads were not finished writing to the list that is printed.

Please replace the tests printNicely() method with the following:

    private String printNicely(String header, List<Throwable> list){
        StringBuilder sb = new StringBuilder(header);
        sb.append('\n');
        List<Throwable> copyList = new ArrayList<Throwable>(list.size());
        copyList.addAll(list);
        String prefix = " caused by --->";
        for(Throwable t : copyList){
            sb.append(t.getMessage())
                .append(prefix).append(t.getCause()).append('\n');
        }
        return sb.toString();
    }
    

comment:38 in reply to:  35 Changed 4 years ago by kay

Replying to zab:

What should the state be after SCHEDULED … future: isCancelled = false, isDone = false (in the if(…interrupted()){..}) ?

That's the bit I'm not sure about.

The future wasn't cancelled b/c it was executing (in the STPE sense), but not yet RUNNING in the TimedEventState? sense. Thus it is ok to set _state to CANCELLED and return from run2().
I think.

comment:39 Changed 4 years ago by Zlatin Balevsky

@kay:38

yes but you were getting ISEs with the patch setting _state to CANCELLED. Can you re-test both with and without setting that?

comment:40 in reply to:  39 Changed 4 years ago by kay

Replying to zab:
You're right that ISEs showed up when everything was set to CANCELLED (comment:26).

The following hints that those ISEs were caused by interrupted threads in _state RUNNING
with both _future.isCancelled and isDone equal to false:

I polished the test a little and added repeatedForceReschedule(). (see the attached file)

Log excerpt:

   ...
 => ERROR logged: timer error caused by ---> java.lang.IllegalStateException: Interrupted in run, state RUNNING event net.i2p.util.SimpleTimer2Test$DummyTimedEvent@337b2546 future: isCancelled = false, isDone = false (856)
 => ERROR logged: timer error caused by ---> java.lang.IllegalStateException: Interrupted in run, state RUNNING event net.i2p.util.SimpleTimer2Test$DummyTimedEvent@337b2546 future: isCancelled = false, isDone = false (2608)
 => ERROR logged: timer error caused by ---> java.lang.IllegalStateException: Interrupted in run, state RUNNING event net.i2p.util.SimpleTimer2Test$DummyTimedEvent@337b2546 future: isCancelled = false, isDone = false (3005)
 => ERROR logged: timer error caused by ---> java.lang.IllegalStateException: Interrupted in run, state RUNNING event net.i2p.util.SimpleTimer2Test$DummyTimedEvent@337b2546 future: isCancelled = false, isDone = false (3428)
 => ERROR logged: timer error caused by ---> java.lang.IllegalStateException: Interrupted in run, state RUNNING event net.i2p.util.SimpleTimer2Test$DummyTimedEvent@337b2546 future: isCancelled = false, isDone = false 
  ...

Added these lines to ST2:

--- core/java/src/net/i2p/util/SimpleTimer2.java	56048756f81635fbde8cd9f7445df4cc6a4d46a8
+++ core/java/src/net/i2p/util/SimpleTimer2.java	b7b41a10213098a4d178d43bee40725affc11192
@@ -423,6 +423,11 @@ public class SimpleTimer2 {
             long delay = 0;
             synchronized(this) {
                 if (Thread.currentThread().isInterrupted()) {
+                    if(_state == TimedEventState.RUNNING){
+                        throw new IllegalStateException("Interrupted in run, state " + _state + " event " + this
+                                   + " future: isCancelled = " + _future.isCancelled() + ", isDone = " + _future.isDone());
+                    }
+
                     _log.warn("I was interrupted in run, state "+_state+" event "+this);
                     return;
                 }


Changed 4 years ago by kay

Attachment: SimpleTimer2Test.3.java added

version including forceReschedule

comment:41 Changed 4 years ago by Zlatin Balevsky

@kay:40
so clearly we still have duplicates entering the queue; the interruption happens on the second execution of the duplicate. Can you do a run at DEBUG level and attach a log?

comment:42 in reply to:  41 Changed 4 years ago by kay

Replying to zab:

I tweaked the logging a little to see more of the future :-)
and to avoid OOMs. DummyTimedEvent?.toString() returns "DTE at <counter of executions>"

  • a task is scheduled and _future holds ScheduledFutureTask?@31802fd7
  • then it waits
  • then it runs and increases the counter to 1685
(3995) => WARN: _future: java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@31802fd7 Post scheduling: DTE at 1684 future: isCancelled = false, isDone = false
(3995) => WARN: _future: java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@31802fd7 Scheduling: DTE at 1684 timeout = 0 state: SCHEDULED future: isCancelled = false, isDone = false
 ^^^^  485 more  of the same ^^^^ 
(3996) => WARN: _future: java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@31802fd7 Scheduling: DTE at 1684 timeout = 0 state: SCHEDULED future: isCancelled = false, isDone = false
 ^^^^  71 more  of the same ^^^^ 
(3996) => WARN: _future: java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@31802fd7 Running: DTE at 1684
(3996) => WARN: _future: java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@31802fd7 Scheduling: DTE at 1684 timeout = 0 state: SCHEDULED future: isCancelled = false, isDone = false
(3996) => WARN: _future: java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@31802fd7 Scheduling: DTE at 1684 timeout = 0 state: RUNNING future: isCancelled = false, isDone = false
 (3996) => no. of execs: 1685
(3996) => WARN: _future: java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@31802fd7 Scheduling: DTE at 1685 timeout = 0 state: RUNNING future: isCancelled = false, isDone = false
 ^^^^  283 more  of the same ^^^^ 
(3996) => WARN: _future: java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@31802fd7 Scheduling: DTE at 1685 timeout = 1 state: RUNNING future: isCancelled = false, isDone = false

  • then scheduling catches it at being IDLE, but the future is not finished
  • post scheduling _future holds ScheduledFutureTask?@1daabc0c
  • the new task waits
  • and (in the last line below) the old one ScheduledFutureTask?@31802fd7 is executed again.
(3996) => WARN: _future: java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@31802fd7 Scheduling: DTE at 1685 timeout = 1 state: IDLE future: isCancelled = false, isDone = false
(3996) => WARN: _future: java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@1daabc0c Post scheduling: DTE at 1685 future: isCancelled = false, isDone = false
(3996) => WARN: _future: java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@1daabc0c Scheduling: DTE at 1685 timeout = 0 state: SCHEDULED future: isCancelled = false, isDone = false
 ^^^^  23 more  of the same ^^^^ 
 ^^^^  23 more  of the same ^^^^ 
(3997) => WARN: _future: java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@1daabc0c Scheduling: DTE at 1685 timeout = 0 state: SCHEDULED future: isCancelled = false, isDone = false
(3996) => ERROR: _future: java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@31802fd7 DTE at 1685 timer error: Interrupted in run, state RUNNING event DTE at 1685 future: isCancelled = false, isDone = false

I'm not sure why _state was IDLE, but future@31802fd7 not done?

Nevertheless _future and _state need to be considered together.

Thoughts?

The full log is attached.

Last edited 4 years ago by kay (previous) (diff)

Changed 4 years ago by kay

Attachment: test-log.tgz added

the full log up to the first error

comment:43 Changed 4 years ago by Zlatin Balevsky

@kay:42

I'm not sure why _state was IDLE, but future@31802fd7 not done?

well there is a period of time after _state gets set to IDLE in the finally block before the run() method returns. If another thread tries to schedule in that window it would describe the behavior you see.

I'll analyze the log you've attached and post soon. The fact that the old future runs again feels very wrong…

P.S. please disable log shortening, I need to see all lines.

Last edited 4 years ago by Zlatin Balevsky (previous) (diff)

comment:44 Changed 4 years ago by zzz

We've decided to proceed with the 0.9.23 release, based on 0.9.22-28-rc (-29-rc has unrelated changes) so we can get the fixes (including the original issues in #1694 and the previous changes that uncovered those issues) out to everybody. The 0.9.22 implementation has serious problems with the streaming timers dying silently from ISE. We have no reports of any errors on a running router on -25-rc or higher.

That being said, we aren't finished with this issue and will continue to work on fixes for 0.9.24, up to and including a complete rewrite if necessary.

Thanks especially to OP, zab, and kay for their assistance to date, and hopefully we can continue the collaboration to understand what's going on and make ST2 bulletproof.

Leaving this ticket in the "testing" state for now.

comment:45 in reply to:  43 Changed 4 years ago by kay

Replying to zab:

@kay:42

I'm not sure why _state was IDLE, but future@31802fd7 not done?

well there is a period of time after _state gets set to IDLE in the finally block before the run() method returns. If another thread tries to schedule in that window it would describe the behavior you see.

Yes, that sounds right. and the RESCHEDULE case even calls a schedule from the running thread.

I'll analyze the log you've attached and post soon. The fact that the old future runs again feels very wrong…

Maybe, also good to check if there is a known java bug?

P.S. please disable log shortening, I need to see all lines.

With the number of requests triggered by the test the log runs up quickly to loads of Mb.
The "more of the same" lines in the attached log mean identical lines up to the milliseconds timestamp.

comment:46 Changed 4 years ago by Zlatin Balevsky

@kay:45

Maybe, also good to check if there is a known java bug?

Generally Sun has been very good at testing their stuff, I'm expecting the same from Oracle. But if proven that it's their issue then we need to look at a complete rewrite that doesn't use STPE.

P.S. please disable log shortening, I need to see all lines.

With the number of requests triggered by the test the log runs up quickly to loads of Mb.
The "more of the same" lines in the attached log mean identical lines up to the milliseconds timestamp.

It's still better to have them so I can get a clearer picture. A DEBUG log level would be even better

comment:47 Changed 4 years ago by kay

Please test thoroughly.

I attached a diff that passes the test cleanly.

The only warnings are about negative timeouts which the tests uses on purpose. No errors logged. And 173021 runs of DummyTimedEvent? during the test time of 240 secs.

Changes:

Method run2() is now synchronized and the synchronized statements inside are removed.
The cancel(true) is turned back into cancel(false) and the if(…interrupted()) part is removed.

PS:
Reasoning behind the change:
the delicate part were those _state changes in run2() that could cause a replacement of the reference held by _future.
Once, there was a new future task w/o the old one being truly done or canceled the problem began.
Synchronizing run2 should encapsulate the changes and readings of _state appropriately.

Last edited 4 years ago by kay (previous) (diff)

Changed 4 years ago by kay

Attachment: 1705-patch.diff added

patch

comment:48 Changed 4 years ago by zzz

That's an interesting suggestion, and good to know it works, but it blocks reschedule() and cancel() from other threads while in timeReached(), which defeats one of the primary design goals of the timer.

comment:49 Changed 4 years ago by kay

@zzz comment:48: Thanks for pointing this out! I used an obviously wrong assumption here.
There might be a way to also include this in a test.

comment:50 Changed 4 years ago by zzz

Milestone: 0.9.230.9.24

Yeah, I view it as a sledgehammer fix. I'm not surprised it works, but there's still something to learn about our problem from it. Sledgehammers can teach you things…

comment:51 Changed 4 years ago by Zlatin Balevsky

I'm starting to think the "sledgehammer" approach may not be that bad. After all most calls reschedule() will happen on the timer thread itself. Calls to cancel() will still work if the task has not yet started to execute. Off the top of my head I can't think of any use case where we need to be able to schedule or cancel while timeReached() is running.

comment:52 Changed 4 years ago by kay

Design goals for SimpleTimer2:

1) timeReached of one TimerEvent? instance is only run one at a time
2) calls to force/re/schedule/cancel should be swift, i.e. they should not block during timeReached() execution.

The sledgehammer-patch enforces 1 but violates 2.
The attached "finer-patch" enforces 1 in a similar manner as the sledgehammer-patch,
but also can provide 2.

I let SimpleTimer2Test measure the execution times for force/re/schedule/cancel calls.

The exact numbers and times vary with machine and machine load, but the proportional differences are consistent, roughly the same number of executions of the DummyTimedEvent? task and really substantial differences for the calls to force/re/schedule/cancel.
The following results represent typical examples:

"finer-patch", _repeat=1:
no. of execs: 129231
no. of calls  to force/re/schedule/cancel: 286820012
average time for calls to force/re/schedule/cancel in nanosec: 770

"sledgehammer-patch", _repeat=1:
no. of execs: 125779
no. of calls  to force/re/schedule/cancel: 60481636
average time for calls to force/re/schedule/cancel in nanosec: 4142

When the load of DummyTimedEvent? (e.g. by setting _repeat to 1E6) is increased the results for sledgehammer get worse, whereas the "finer-patch" performs as before.

"finer-patch", _repeat=1E6:
no. of execs: 11
no. of calls  to force/re/schedule/cancel: 246829868
average time for calls to force/re/schedule/cancel in nanosec: 889

"sledgehammer-patch", _repeat=1E6:
no. of execs: 29
no. of calls  to force/re/schedule/cancel: 16814
average time for calls to force/re/schedule/cancel in nanosec: 56408231

An additional change was to replace _cancelAfterRun and _rescheduleAfterRun.
I introduced TimedEventState _nextState, which signals what should be done after a run of TimedEvent? finished.

The patch and the altered test are attached.

PS:
The new test also would throw an exception, if more than one thread was executing timeReached() at the same time.

Last edited 4 years ago by kay (previous) (diff)

Changed 4 years ago by kay

Attachment: finer-patch.diff added

finer patch

Changed 4 years ago by kay

Attachment: SimpleTimer2Test.4.java added

test for the two main design goals

comment:53 Changed 4 years ago by zzz

@zab re: comment 51

Most timers don't need or use out-of-thread reschedule, but some do, and those are the only ones causing trouble. If we eliminate all of them, we don't need a rewrite at all, as this bug will never be triggered.

Note: See TracTickets for help on using tickets.