2006-05-13 02:24:58

by Darren Hart

Subject: rt20 scheduling latency testcase and failure data

I have been noticing unexpected intermittent large latencies. I wrote the
attached test case to try and capture some information on them. The librt.h
file contains convenience functions I use for writing other tests as well, so
much of it is irrelevant, but the test case itself is pretty clear I believe.

The test case emulates a periodic thread that wakes up on time%PERIOD=0, so
rather than sleeping the same amount of time each round, it checks now
against the start of its next period and sleeps for that length of time.
Every so often it will miss its period; I've captured that data and included
a few of the interesting bits below. The results are from a run with a
period of 5ms, although I have seen them with periods as high as 17ms. The
system was under heavy network load for some of the time, but not all.
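
For reference, here is a minimal sketch of the periodic logic just described.
It is illustrative only (not the attached sched_latency_lkml.c); the helper
names are assumptions, and the real-time scheduling setup and the busy-work
loop are omitted:

#include <stdio.h>
#include <time.h>

#define PERIOD_US  5000LL               /* 5 ms, as in the failing runs */
#define US_PER_SEC 1000000LL

static long long now_us(void)
{
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (long long)ts.tv_sec * US_PER_SEC + ts.tv_nsec / 1000;
}

int main(void)
{
        long long base = now_us();
        int i;

        for (i = 1; i <= 10000; i++) {
                long long sched_start = base + i * PERIOD_US;
                long long before = now_us();
                long long sched_delta = sched_start - before;  /* requested sleep */

                if (sched_delta < 0) {
                        /* the "next scheduled start is in the past" case
                         * described below */
                        printf("PERIOD MISSED! next scheduled start is in the past\n");
                        continue;
                }

                struct timespec req = {
                        .tv_sec  = sched_delta / US_PER_SEC,
                        .tv_nsec = (sched_delta % US_PER_SEC) * 1000,
                };
                clock_nanosleep(CLOCK_MONOTONIC, 0, &req, NULL);

                long long actual_delta = now_us() - before;
                long long latency = actual_delta - sched_delta;

                if (latency > PERIOD_US)
                        /* the "slept longer than we planned to" case
                         * described below */
                        printf("PERIOD MISSED! latency: %lld us\n", latency);

                /* the real test does its busy work here before computing
                 * the next period */
        }
        return 0;
}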



One scenario is when we just sleep longer than we planned to. Note below that
the actual delta is much larger than the requested sleep, so the latency is
very high there, high enough that we woke up well into the next period.

ITERATION DELAY(US) MAX_DELAY(US) FAILURES
--------- --------- ------------- --------
1514 11588 11588 1

PERIOD MISSED!
scheduled delta: 4080 us
actual delta: 15666 us
latency: 11585 us
---------------------------------------
previous start: 7581588 us
now: 7582504 us
scheduled start: 7575000 us
next scheduled start is in the past!



Another scenario, far more common in my experience, is to wake with a very
reasonable latency (6us more than requested here), but by the time we
complete the loop and look at how long we need to sleep until the next
period, we see that its deadline has already passed (scheduled_start < now).

ITERATION DELAY(US) MAX_DELAY(US) FAILURES
--------- --------- ------------- --------
1240 9 14 0

PERIOD MISSED!
scheduled delta: 4072 us
actual delta: 4079 us
latency: 6 us
---------------------------------------
previous start: 6200009 us
now: 6210989 us
scheduled start: 6205000 us
next scheduled start is in the past!


I'd appreciate any feedback on the test case, and in particular suggestions on
how I can go about determining where this lost time is being spent.

Thanks,
--
Darren Hart
IBM Linux Technology Center
Realtime Linux Team
Phone: 503 578 3185
T/L: 775 3185


Attachments:
(No filename) (2.43 kB)
librt.h (14.41 kB)
sched_latency_lkml.c (4.01 kB)

2006-05-13 09:20:43

by Florian Schmidt

Subject: Re: rt20 scheduling latency testcase and failure data

On Fri, 12 May 2006 19:24:53 -0700
Darren Hart <[email protected]> wrote:

> The test case emulates a periodic thread that wakes up on time%PERIOD=0, so
> rather than sleeping the same amount of time each round, it checks now
> against the start of its next period and sleeps for that length of time.
> Every so often it will miss its period; I've captured that data and included
> a few of the interesting bits below. The results are from a run with a
> period of 5ms, although I have seen them with periods as high as 17ms. The
> system was under heavy network load for some of the time, but not all.

[snip]

> I'd appreciate any feedback on the test case, and in particular suggestions on
> how I can go about determining where this lost time is being spent.

There's a multitude of ways you can misconfigure your -rt system :)
Tell us more about your setup. Hardware? Full preemption? High
resolution timers? Priority setup? From your code I see you run at prio
98. What about the IRQ handlers? And the softirqs, too? Other software?

Flo

P.S.: I ran the test a few [20 or so] times and didn't get any failures
of the sort you see. Even with a 1ms period:

~/downloads$ ./sched_latency_lkml
-------------------------------
Scheduling Latency
-------------------------------

Running 10000 iterations with a period of 1 ms
Expected running time: 10 s

ITERATION DELAY(US) MAX_DELAY(US) FAILURES
--------- --------- ------------- --------
10000 32 47 0

Start Latency: 305 us: FAIL
Min Latency: 16 us: PASS
Avg Latency: 29 us: PASS
Max Latency: 47 us: PASS
Failed Iterations: 0

~/downloads$ uname -a
Linux mango.fruits 2.6.16-rt20 #4 PREEMPT Wed May 10 12:53:39 CEST 2006 i686 GNU/Linux

Oops, I must admit I have the nvidia binary-only kernel module loaded,
but I suppose this wouldn't make a difference for the better ;)

I got high resolution timers enabled and left the softirq threads at
their defaults.

--
Palimm Palimm!
http://tapas.affenbande.org

2006-05-13 11:55:28

by Mike Galbraith

Subject: Re: rt20 scheduling latency testcase and failure data

On Sat, 2006-05-13 at 11:20 +0200, Florian Paul Schmidt wrote:

> P.S.: I ran the test a few [20 or so] times and didn't get any failures
> of the sort you see. Even with a 1ms period:

Something odd happened here... the first time I booted rt21, I could
reproduce the problem quite regularly. Since reboot though, poof.

Elves and Gremlins.

-Mike

2006-05-13 15:39:58

by Steven Rostedt

Subject: Re: rt20 scheduling latency testcase and failure data


On Sat, 13 May 2006, Mike Galbraith wrote:

> On Sat, 2006-05-13 at 11:20 +0200, Florian Paul Schmidt wrote:
>
> > P.S.: I ran the test a few [20 or so] times and didn't get any failures
> > of the sort you see. Even with a 1ms period:
>
> Something odd happened here... the first time I booted rt21, I could
> reproduce the problem quite regularly. Since reboot though, poof.
>
> Elves and Gremlins.
>

Careful, rt21 has a bug slipped in that might have funny results on SMP
machines:

+ if (!cpus_equal(current->cpus_allowed, irq_affinity[irq]));
+ set_cpus_allowed(current, irq_affinity[irq]);

John (although he later fixed it) added a ; after the if. But the fix is
not yet in Ingo's patch.
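
As a toy illustration of the pitfall (plain userspace C, not the kernel code
above), the stray semicolon makes the if control an empty statement, so the
intended body runs unconditionally:

#include <stdio.h>

int main(void)
{
        int affinity_differs = 0;

        /* buggy shape: the ';' ends the if, so the "body" below is not
         * conditional at all and runs even though the test is false */
        if (affinity_differs);
                printf("rebinding (buggy: printed unconditionally)\n");

        /* fixed shape, as in John's later fix: drop the stray ';' so the
         * body only runs when the condition holds */
        if (affinity_differs)
                printf("rebinding (fixed: not printed)\n");

        return 0;
}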

-- Steve

2006-05-13 16:36:10

by Mike Galbraith

Subject: Re: rt20 scheduling latency testcase and failure data

On Sat, 2006-05-13 at 11:39 -0400, Steven Rostedt wrote:
> Careful, rt21 has a bug slipped in that might have funny results on SMP
> machines:
>
> + if (!cpus_equal(current->cpus_allowed, irq_affinity[irq]));
> + set_cpus_allowed(current, irq_affinity[irq]);
>
> John (although he later fixed it) added a ; after the if. But the fix is
> not yet in Ingo's patch.

I saw that go by, and fixed it before building. Mine is a UP build
anyway.

-Mike

2006-05-13 18:06:27

by Darren Hart

Subject: Re: rt20 scheduling latency testcase and failure data

On Saturday 13 May 2006 02:20, you wrote:
> On Fri, 12 May 2006 19:24:53 -0700
>
> Darren Hart <[email protected]> wrote:
> > The test case emulates a periodic thread that wakes up on time%PERIOD=0,
> > so rather than sleeping the same amount of time each round, it checks now
> > against the start of its next period and sleeps for that length of time.
> > Every so often it will miss its period; I've captured that data and
> > included a few of the interesting bits below. The results are from a run
> > with a period of 5ms, although I have seen them with periods as high as
> > 17ms. The system was under heavy network load for some of the time, but
> > not all.
>
> [snip]
>
> > I'd appreciate any feedback on the test case, and in particular
> > suggestions on how I can go about determining where this lost time is
> > being spent.
>
> There's a multitude of ways how you can misconfigure your -rt system :)
> Tell us more about your setup. Hardware? Full preemption? High
> resolution timers? Priority setup? From your code i see you run at prio
> 98. What about the IRQ handlers? And the softirq's, too? Other software?
>

These tests are running on a 4-way Opteron at 2 GHz with 4 GB RAM. I have
attached the .config and a listing of all the rt tasks running on the system
(which addresses the questions regarding priority setup, IRQ handlers, and
softirqs - all default). I am running with the futex priority-based wakeup
patches from Sébastien Dugué, but I don't think this test exercises those
paths.

I have also included oprofile results from a passing run and a failed run for
both the testcase and the kernel. I run the test case with the attached
script as follows to get a log and profiles.

$ ./run_sl.sh 100 sl.log ./sched_latency_lkml

The oprofile runs don't seem to vary from passing tests to failing tests. The
hot spots are the busy work loop for the test case (as expected) and
acpi_pm_read and __schedule for the kernel.

I haven't yet tried running with the RT Latency / Trace tools. I can try
those if folks think they will be useful.

Thanks,

--Darren Hart

> Flo
>
> P.S.: I ran the test a few [20 or so] times and didn't get any failures
> of the sort you see. Even with a 1ms period:
>
> ~/downloads$ ./sched_latency_lkml
> -------------------------------
> Scheduling Latency
> -------------------------------
>
> Running 10000 iterations with a period of 1 ms
> Expected running time: 10 s
>
> ITERATION DELAY(US) MAX_DELAY(US) FAILURES
> --------- --------- ------------- --------
> 10000 32 47 0
>
> Start Latency: 305 us: FAIL
> Min Latency: 16 us: PASS
> Avg Latency: 29 us: PASS
> Max Latency: 47 us: PASS
> Failed Iterations: 0
>
> ~/downloads$ uname -a
> Linux mango.fruits 2.6.16-rt20 #4 PREEMPT Wed May 10 12:53:39 CEST 2006
> i686 GNU/Linux
>
> Ooops, i must admit i have the nvidia binary only kernel module loaded,
> but i suppose this wouldn't make a difference for the better ;)
>
> I got high resolution timers enabled and left the softirq threads at
> their defaults.

--
Darren Hart
IBM Linux Technology Center
Realtime Linux Team


Attachments:
(No filename) (3.08 kB)
run_sl.sh (703.00 B)
38-sched_latency_lkml-profile-FAILED.txt (462.00 B)
config (52.85 kB)
psrt.log (1.30 kB)
38-vmlinux-profile-FAILED.txt (13.52 kB)
39-sched_latency_lkml-profile-PASSED.txt (553.00 B)
39-vmlinux-profile-PASSED.txt (18.47 kB)

2006-05-13 18:21:15

by Lee Revell

Subject: Re: rt20 scheduling latency testcase and failure data

On Sat, 2006-05-13 at 11:06 -0700, Darren Hart wrote:
> 1 [softirq-timer/0]

What happens if you set the softirq-timer threads to 99?

Lee

2006-05-13 23:01:42

by Darren Hart

Subject: Re: rt20 scheduling latency testcase and failure data

On Saturday 13 May 2006 11:21, Lee Revell wrote:
> On Sat, 2006-05-13 at 11:06 -0700, Darren Hart wrote:
> > 1 [softirq-timer/0]
>
> What happens if you set the softirq-timer threads to 99?
>

After setting all 4 softirq-timer threads to prio 99 I seemed to get only 2
failures in 100 runs. Run #51 slept too long (10ms too long!); in run #90 the
latency appeared after the sleep (nearly 483ms worth). Those latencies seem
huge to me.

ITERATIONS 0-50 Passed

ITERATION 51
-------------------------------
Scheduling Latency
-------------------------------

Running 10000 iterations with a period of 5 ms
Expected running time: 50 s

ITERATION DELAY(US) MAX_DELAY(US) FAILURES
--------- --------- ------------- --------
7000 10197 10197 1

PERIOD MISSED!
scheduled delta: 4079 us
actual delta: 14263 us
latency: 10183 us
---------------------------------------
previous start: 35010197 us
now: 35011117 us
scheduled start: 35005000 us
next scheduled start is in the past!


Start Latency: 114 us: FAIL
Min Latency: 9 us: PASS
Avg Latency: 7 us: PASS
Max Latency: 10197 us: FAIL
Failed Iterations: 1

ITERATIONS 52-89 Passed

ITERATION 90
-------------------------------
Scheduling Latency
-------------------------------

Running 10000 iterations with a period of 5 ms
Expected running time: 50 s

ITERATION DELAY(US) MAX_DELAY(US) FAILURES
--------- --------- ------------- --------
2747 9 20 0

PERIOD MISSED!
scheduled delta: 4072 us
actual delta: 4079 us
latency: 6 us
---------------------------------------
previous start: 13735009 us
now: 14218183 us
scheduled start: 13740000 us
next scheduled start is in the past!


Start Latency: 112 us: FAIL
Min Latency: 8 us: PASS
Avg Latency: 2 us: PASS
Max Latency: 20 us: PASS
Failed Iterations: 0

ITERATIONS 91-99 Passed

Thanks,

--
Darren Hart
IBM Linux Technology Center
Realtime Linux Team

2006-05-14 03:46:42

by Mike Galbraith

Subject: Re: rt20 scheduling latency testcase and failure data

On Sat, 2006-05-13 at 16:01 -0700, Darren Hart wrote:
> On Saturday 13 May 2006 11:21, Lee Revell wrote:
> > On Sat, 2006-05-13 at 11:06 -0700, Darren Hart wrote:
> > > 1 [softirq-timer/0]
> >
> > What happens if you set the softirq-timer threads to 99?
> >
>
> After setting all 4 softirq-timer threads to prio 99 I seemed to get only 2
> failures in 100 runs.

If you disable printf + fflush in iterations loop, problem goes away?

-Mike

2006-05-14 05:48:19

by Mike Galbraith

Subject: Re: rt20 scheduling latency testcase and failure data

On Sun, 2006-05-14 at 05:47 +0200, Mike Galbraith wrote:
> On Sat, 2006-05-13 at 16:01 -0700, Darren Hart wrote:
> > On Saturday 13 May 2006 11:21, Lee Revell wrote:
> > > On Sat, 2006-05-13 at 11:06 -0700, Darren Hart wrote:
> > > > 1 [softirq-timer/0]
> > >
> > > What happens if you set the softirq-timer threads to 99?
> > >
> >
> > After setting all 4 softirq-timer threads to prio 99 I seemed to get only 2
> > failures in 100 runs.
>
> If you disable printf + fflush in iterations loop, problem goes away?

P.S.

I think it probably will, because...

sched_latency [ea53a0b0]D 00000001 0 8261 7858 8260 (NOTLB)
e29a0e70 e29a0e58 00000008 00000001 df6158e0 00000000 623266f4 0000017d
b23d45c4 efd53870 dfcb8dc0 efd53870 00000000 000011e6 ea53a1e8 ea53a0b0
efdf0d30 b2454560 623e5018 0000017d 00000001 efdf0d30 00000100 00000000
Call Trace:
[<b1038454>] __rt_mutex_adjust_prio+0x1f/0x24 (112)
[<b1038ad8>] task_blocks_on_rt_mutex+0x1b6/0x1c9 (16)
[<b13bfeb1>] schedule+0x34/0x10b (24)
[<b13c0963>] rt_mutex_slowlock+0xc7/0x258 (28)
[<b13c0bb6>] rt_mutex_lock+0x3f/0x43 (100)
[<b1039075>] rt_down+0x12/0x32 (20)
[<b13c14a7>] lock_kernel+0x1d/0x23 (16)
[<b1228246>] tty_write+0x119/0x21b (12)
[<b122b758>] write_chan+0x0/0x338 (24)
[<b10352bd>] hrtimer_wakeup+0x0/0x1c (20)
[<b10671f0>] vfs_write+0xc1/0x19b (24)
[<b1067bfa>] sys_write+0x4b/0x74 (40)
[<b1002eeb>] sysenter_past_esp+0x54/0x75 (40)

...generated via SysRq-T, induces...

PERIOD MISSED!
scheduled delta: 4964 us
actual delta: 4974 us
latency: 10 us
---------------------------------------
previous start: 1750012 us
now: 13122245 us
scheduled start: 1755000 us
next scheduled start is in the past!


2006-05-14 07:04:42

by Darren Hart

Subject: Re: rt20 scheduling latency testcase and failure data

> > > On Saturday 13 May 2006 11:21, Lee Revell wrote:
> > If you disable printf + fflush in iterations loop, problem goes away?

Unfortunately not; after disabling the printf and fflush, my very first run
resulted in:

ITERATION 0
-------------------------------
Scheduling Latency
-------------------------------

Running 10000 iterations with a period of 5 ms
Expected running time: 50 s

ITERATION DELAY(US) MAX_DELAY(US) FAILURES
--------- --------- ------------- --------


PERIOD MISSED!
scheduled delta: 4078 us
actual delta: 14213 us
latency: 10135 us
---------------------------------------
previous start: 42050139 us
now: 42051059 us
scheduled start: 42045000 us
next scheduled start is in the past!


Start Latency: 99 us: PASS
Min Latency: 8 us: PASS
Avg Latency: 8 us: PASS
Max Latency: 10139 us: FAIL


> P.S.
>
> I think it probably will, because...
>
> sched_latency [ea53a0b0]D 00000001 0 8261 7858 8260
> (NOTLB) e29a0e70 e29a0e58 00000008 00000001 df6158e0 00000000 623266f4
> 0000017d b23d45c4 efd53870 dfcb8dc0 efd53870 00000000 000011e6 ea53a1e8
> ea53a0b0 efdf0d30 b2454560 623e5018 0000017d 00000001 efdf0d30 00000100
> 00000000 Call Trace:
> [<b1038454>] __rt_mutex_adjust_prio+0x1f/0x24 (112)
> [<b1038ad8>] task_blocks_on_rt_mutex+0x1b6/0x1c9 (16)
> [<b13bfeb1>] schedule+0x34/0x10b (24)
> [<b13c0963>] rt_mutex_slowlock+0xc7/0x258 (28)
> [<b13c0bb6>] rt_mutex_lock+0x3f/0x43 (100)
> [<b1039075>] rt_down+0x12/0x32 (20)
> [<b13c14a7>] lock_kernel+0x1d/0x23 (16)
> [<b1228246>] tty_write+0x119/0x21b (12)
> [<b122b758>] write_chan+0x0/0x338 (24)
> [<b10352bd>] hrtimer_wakeup+0x0/0x1c (20)
> [<b10671f0>] vfs_write+0xc1/0x19b (24)
> [<b1067bfa>] sys_write+0x4b/0x74 (40)
> [<b1002eeb>] sysenter_past_esp+0x54/0x75 (40)
>

What is it about this dump that made you suspect the printf? Or was it just
that printing the trace seemed to trigger a failure - so it seemed reasonable
that the process may have been blocked on writing to the console? I could
see that causing a failure like the one below, but not like the one I posted
above. (The one above has no printfs between the time measurements
surrounding the clock_nanosleep() call and it overslept by 10ms). Also,
shouldn't I have seen something in the oprofile reports I posted earlier if
the printf was causing the latencies?

Thanks for the comments, thoughts, and suggestions.

> ...generated via SysRq-T, induces...
>
> PERIOD MISSED!
> scheduled delta: 4964 us
> actual delta: 4974 us
> latency: 10 us
> ---------------------------------------
> previous start: 1750012 us
> now: 13122245 us
> scheduled start: 1755000 us
> next scheduled start is in the past!

--
Darren Hart
IBM Linux Technology Center
Realtime Linux Team

2006-05-14 07:38:27

by Mike Galbraith

Subject: Re: rt20 scheduling latency testcase and failure data

On Sun, 2006-05-14 at 00:04 -0700, Darren Hart wrote:
> > > > On Saturday 13 May 2006 11:21, Lee Revell wrote:
> > > If you disable printf + fflush in iterations loop, problem goes away?
>
> Unfortunately not, after disabling the printf and fflush, my very first run
> resulted in:

Drat.

> What is it about this dump that made you suspect the printf? Or was it just
> that printing the trace seemed to trigger a failure - so it seemed reasonable
> that the process may have been blocked on writing to the console?

Yeah, realtime task in D state.

-Mike

2006-05-15 05:43:23

by Mike Galbraith

Subject: Re: rt20 scheduling latency testcase and failure data

On Sat, 2006-05-13 at 11:06 -0700, Darren Hart wrote:

> I haven't yet tried running with the RT Latency / Trace tools. I can try
> those if folks think they will be useful.

FWIW, enabling tracing made the 10ms failure variant fairly repeatable
here.

-Mike

2006-05-15 08:05:15

by Ingo Molnar

Subject: Re: rt20 scheduling latency testcase and failure data


* Steven Rostedt <[email protected]> wrote:

> Careful, rt21 has a bug slipped in that might have funny results on
> SMP machines:
>
> + if (!cpus_equal(current->cpus_allowed, irq_affinity[irq]));
> + set_cpus_allowed(current, irq_affinity[irq]);
>
> John (although he later fixed it) added a ; after the if. But the fix
> is not yet in Ingo's patch.

ouch - i missed that. I've released -rt22 with this fix.

Ingo

2006-05-15 08:13:50

by Ingo Molnar

Subject: Re: rt20 scheduling latency testcase and failure data


* Darren Hart <[email protected]> wrote:

> On Saturday 13 May 2006 11:21, Lee Revell wrote:
> > On Sat, 2006-05-13 at 11:06 -0700, Darren Hart wrote:
> > > 1 [softirq-timer/0]
> >
> > What happens if you set the softirq-timer threads to 99?
> >
>
> After setting all 4 softirq-timer threads to prio 99 I seemed to get
> only 2 failures in 100 runs. #51 slept too long (10ms too long!), the
> latency appeared after the sleep in #90 (nearly 483ms worth). Those
> latencies seem huge to me.

have you tried to use the latency tracer to capture this latency? It is
programmable to a high degree. (I've attached trace-it.c that shows how
to use it)

(If the latency is particularly long you might want to increase
kernel/latency.c:MAX_TRACE.)

once you have a latency trace, you can use grep to produce a
high-level overview of what's happening. E.g. syscall activity:

grep ' [<>] ' latency_trace.txt

or context-switches done:

grep ' : __schedule <' latency_trace

looking at the high-level traces should give you a quick idea of what's
going on, and then you can zoom into the time period that triggers the
long latency. (but feel free to also send these traces to us, preferably
in bzip2 -9 format.)

Ingo


Attachments:
(No filename) (1.21 kB)
trace-it.c (2.32 kB)

2006-05-15 11:11:31

by Sébastien Dugué

Subject: Re: rt20 scheduling latency testcase and failure data

On Fri, 2006-05-12 at 19:24 -0700, Darren Hart wrote:
> I have been noticing unexpected intermittent large latencies. I wrote the
> attached test case to try and capture some information on them. The librt.h
> file contains convenience functions I use for writing other tests as well, so
> much of it is irrelevant, but the test case itself is pretty clear I believe.
>
> The test case emulates a periodic thread that wakes up on time%PERIOD=0, so
> rather than sleeping the same amount of time each round, it checks now
> against the start of its next period and sleeps for that length of time.
> Every so often it will miss its period; I've captured that data and included
> a few of the interesting bits below. The results are from a run with a
> period of 5ms, although I have seen them with periods as high as 17ms. The
> system was under heavy network load for some of the time, but not all.
>

Hi Darren,

FWIW:

I've been running your test program on my box under a stress-kernel
load and did not observe any failure like you describe, not even a max
latency overshooting the 100 us limit (max latencies in the 60~70 us range).

I even decreased PERIOD to 1 ms and still saw no failure.

I'm running rt20 with the futex priority based wakeup patch on
a dual 2.8 GHz HT Xeon box. All hardirq and softirq threads are at their
default priority.

How do you generate the network load you mention? Maybe I could try at
least with the same load you're using.

Sébastien.

2006-05-15 11:17:11

by Sébastien Dugué

Subject: Re: rt20 scheduling latency testcase and failure data

On Sat, 2006-05-13 at 11:06 -0700, Darren Hart wrote:
> These tests are running on a 4 way opteron at 2 GHz with 4 GB RAM. I have
> attached the .config and a listing of all the rt tasks running on the system
> (which addresses the questions regarding priority setup, IRQ handlers, and
> softirqs - all default). I am running with the futex priority based wakeup
> patches from Sébastien Dugué, but I don't think this test exercises those
> paths.

Which watchdog are you using here? Have you tried without the
watchdog?

Sébastien.



2006-05-15 14:34:53

by Darren Hart

Subject: Re: rt20 scheduling latency testcase and failure data

On Monday 15 May 2006 04:15, Sébastien Dugué wrote:

> I've been running your test program on my box under a stress-kernel
> load and did not observe any failure as you describe, not even a max
> latency overshooting the 100 us limit (max latencies in the 60~70 us).
>
> I even went to decrease PERIOD to 1 ms and still no failure.
>
> I'm running rt20 with the futex priority based wakeup patch on
> a dual 2.8 GHz HT Xeon box. All hardirq and softirq threads are at their
> default priority.

Interesting, I'll have to try this on some more hardware and see if I can
reproduce there.

>
> How do you generate the network load you mention? Maybe I could try at
> least with the same load you're using.

I was simply copying a 60MB file to the test machine via scp, in a bash while
loop. I haven't been doing this on my most recent runs however, and they
still fail. So I don't believe the net load is directly related.

I am going to work with Ingo's trace-it.c today and report back.

--
Darren Hart
IBM Linux Technology Center
Realtime Linux Team

2006-05-15 16:52:31

by Lee Revell

Subject: Re: rt20 scheduling latency testcase and failure data

On Mon, 2006-05-15 at 07:43 +0200, Mike Galbraith wrote:
> On Sat, 2006-05-13 at 11:06 -0700, Darren Hart wrote:
>
> > I haven't yet tried running with the RT Latency / Trace tools. I can try
> > those if folks think they will be useful.
>
> FWIW, enabling tracing made the 10ms failure variant fairly repeatable
> here.

Make sure you're not reading /proc/latency_trace during the test - it
will reliably cause missed deadlines.

Lee

2006-05-15 21:49:30

by Darren Hart

Subject: Re: rt20 scheduling latency testcase and failure data

On Monday 15 May 2006 04:20, Sébastien Dugué wrote:
> On Sat, 2006-05-13 at 11:06 -0700, Darren Hart wrote:
> > These tests are running on a 4 way opteron at 2 GHz with 4 GB RAM. I
> > have attached the .config and a listing of all the rt tasks running on
> > the system (which addresses the questions regarding priority setup, IRQ
> > handlers, and softirqs - all default). I am running with the futex
> > priority-based wakeup patches from Sébastien Dugué, but I don't think
> > this test exercises those paths.
>
> Which watchdog are you using here? Have you tried without the
> watchdog?

Those are the softlockup watchdog threads (kernel/softlockup.c). They run
once a second and report a bug if the watchdog has failed to run for 10 seconds.
It is difficult to reproduce but at run 126 (1,260,000 iterations) it finally
failed. Note that I am only counting runs that completely miss an entire
period as a failure for the purposes of this test. I want to knock out the
10+ms latencies before I concern myself too much with the >100us failures :-)

--------------------
ITERATION 26
--------------------
-------------------------------
Scheduling Latency
-------------------------------

Running 10000 iterations with a period of 5 ms
Expected running time: 50 s

ITERATION DELAY(US) MAX_DELAY(US) FAILURES
--------- --------- ------------- --------


PERIOD MISSED!
scheduled delta: 4076 us
actual delta: 14892 us
latency: 10815 us
---------------------------------------
previous start: 18365818 us
now: 18366739 us
scheduled start: 18360000 us
next scheduled start is in the past!


Start Latency: 106 us: FAIL
Min Latency: 8 us: PASS
Avg Latency: 4 us: PASS
Max Latency: 10818 us: FAIL
Failed Iterations: 1


It's interesting that this 10ms latency seems to be the most common result. I'm
going to take a look at Ingo's tracing script now; more info a bit later...

--
Darren Hart
IBM Linux Technology Center
Realtime Linux Team

2006-05-16 01:30:35

by Darren Hart

Subject: Re: rt20 scheduling latency testcase and failure data

On Monday 15 May 2006 01:13, Ingo Molnar wrote:
>
> have you tried to use the latency tracer to capture this latency? It is
> programmable to a high degree. (I've attached trace-it.c that shows how
> to use it)
> ...
> looking at the highlevel traces should give you a quick idea of what's
> going on, and then you can zoom into the time period that triggers the
> long latency. (but feel free to also send these traces to us, preferably
> in bzip2 -9 format.)

Following Ingo's example I have included the modified test case (please see
the original mail for librt.h) that starts the trace before each sleep and
disables it after we wake up. If we have missed a period, we print the
trace.

My initial trace (lt_01.log.bz2) contained entries like the following:

<idle>-0 2...1 24us!: default_idle (cpu_idle)
<idle>-0 2D..1 1609us : smp_apic_timer_interrupt (c0100c80 0 0)

<idle>-0 2...1 1635us!: default_idle (cpu_idle)
<idle>-0 2D..1 4756us : smp_apic_timer_interrupt (c0100c80 0 0)

I noticed that there were long latencies (as high as 3ms) all related to
irq handling in the idle loop. I thought it might have something to do with
the CPU going into some kind of a sleep state and taking a long time to wake
up. John Stultz suggested I boot with idle=poll.

The idle=poll trace (lt_02.log.bz2) had much smaller maximum latencies, like
the following:

<idle>-0 0D..1 1341us+: write_watchdog_counter (nmi_watchdog_tick)
<idle>-0 0D..1 1441us : do_nmi (poll_idle)

<idle>-0 0D..1 1445us+: write_watchdog_counter (nmi_watchdog_tick)
<idle>-0 0D..1 1545us : do_nmi (poll_idle)

All the longest latencies were related to the nmi watchdogs. Booting with
idle=poll nmi_watchdog=0 unfortunately seemed to return to the original
behavior - so perhaps the second test was just a fluke. This trace
(lt_03.log.bz2) had entries like the following:

<idle>-0 1...1 19us!: poll_idle (cpu_idle)
<idle>-0 1D..1 1089us : smp_apic_timer_interrupt (c0100ccf 0 0)

<idle>-0 1...1 1113us!: poll_idle (cpu_idle)
<idle>-0 1D..1 4765us : smp_apic_timer_interrupt (c0100ccf 0 0)

sched_la-2978 1D... 4798us!: init_fpu (math_state_restore)
sched_la-2978 1D... 5088us : smp_apic_timer_interrupt (4b3b98e8 0 0)

sched_la-2978 1D.h. 5103us!: irq_exit (smp_apic_timer_interrupt)
sched_la-2978 1.... 5719us > sys_clock_gettime (00000001 b7f65378 0000007b)

Finally I booted with idle=poll and nmi_watchdog=0. The very first run
failed (I've noticed that the first iteration seems to always hit PERIOD
MISSED! while the second usually passes fine). It's still running a 1M
iteration run with no more failures so far (100K so far).

The latency tracer is a very interesting tool. I have a few
questions/assumptions I'd like to run by you to make sure I understand the
output of the latency trace:

o ! in the delay column means there is a long latency here?
o + in the delay column means there is a > 1us latency here?
o > means entering the kernel from a sys_call?
o < means returning from the sys_call?
o : is not < or >
o Why do I only see ~5ms of trace when it appears that I am seeing more like
~25ms between the time when I start and stop the trace?
o What are the maximum latencies that get printed to /var/log/messages? What
exactly do they measure? From what point to what point?
--
Darren Hart
IBM Linux Technology Center
Realtime Linux Team


Attachments:
(No filename) (3.36 kB)
lt_01.log.bz2 (4.25 kB)
lt_02.log.bz2 (5.58 kB)
lt_03.log.bz2 (3.74 kB)
sched_latency_lkml.c (5.18 kB)

2006-05-16 07:18:40

by Sébastien Dugué

Subject: Re: rt20 scheduling latency testcase and failure data

On Mon, 2006-05-15 at 18:30 -0700, Darren Hart wrote:
> Following Ingo's example I have included the modified test case (please see
> the original mail for librt.h) that starts the trace before each sleep and
> disables it after we wake up. If we have missed a period, we print the
> trace.
>
<snip>
> The very first run
> failed, (I've noticed that the first iteration seems to always hit PERIOD
> MISSED! while the second usually passes fine).

I've noticed also that, on some occasions, the first run would fail,
but subsequent runs would be fine. Strange!

I finally managed to hit a missed period under heavy heavy load:

Running 100000 iterations with a period of 5 ms
Expected running time: 500 s

ITERATION DELAY(US) MAX_DELAY(US) FAILURES
--------- --------- ------------- --------
211 70 97 0

PERIOD MISSED!
scheduled delta: 3128 us
actual delta: 3191 us
latency: 62 us
---------------------------------------
previous start: 1055070 us
now: 1060172 us
scheduled start: 1060000 us
next scheduled start is in the past!


Start Latency: 198 us: FAIL
Min Latency: 6 us: PASS
Avg Latency: 0 us: PASS
Max Latency: 97 us: PASS
Failed Iterations: 0


I'll try to trace it now.

> It's still running a 1M
> iteration run with no more failures so far (100K so far).
>
> The latency tracer is a very interesting tool. I have a few
> questions/assumptions I'd like to run by you to make sure I understand the
> output of the latency trace:
>
> o ! in the delay column means there is a long latency here?

! means latency > 100 us

> o + in the delay column means there is a > 1us latency here?

+ means latency > 1 us

> o > means entering the kernel from a sys_call?

yep

> o < means returning from the sys_call?

or from interrupt

> o : is not < or >

yep


Sébastien.

2006-05-18 08:40:01

by Sébastien Dugué

Subject: Re: rt20 scheduling latency testcase and failure data

Darren,

On Mon, 2006-05-15 at 18:30 -0700, Darren Hart wrote:
> Following Ingo's example I have included the modified test case (please see
> the original mail for librt.h) that starts the trace before each sleep and
> disables it after we wake up. If we have missed a period, we print the
> trace.
>

Your test program fails (at least on my box) as the overhead of
starting and stopping the trace in the 5 ms period is just too high.

By moving latency_trace_start() to the start of the thread
function and latency_trace_stop() to the end, everything runs fine.
I did not see any missed period even under heavy load.

Sébastien.



2006-05-18 08:47:45

by Ingo Molnar

Subject: Re: rt20 scheduling latency testcase and failure data


* Sébastien Dugué <[email protected]> wrote:

> Darren,
>
> On Mon, 2006-05-15 at 18:30 -0700, Darren Hart wrote:
> > Following Ingo's example I have included the modified test case (please see
> > the original mail for librt.h) that starts the trace before each sleep and
> > disables it after we wake up. If we have missed a period, we print the
> > trace.
> >
>
> Your test program fails (at least on my box) as the overhead of
> starting and stopping the trace in the 5 ms period is just too high.
>
> By moving the latency_trace_start() at the start of the thread
> function and latency_trace_stop() at the end, everything runs fine. I
> did not have any period missed even under heavy load.

could you send us the fixed testcase?

thanks for tracking this down. FYI, the latency of stopping the trace is
that expensive because we are copying large amounts of trace data
around, to ensure that /proc/latency_trace is always consistent and is
updated atomically, and to make sure that we can update the trace from
interrupt contexts too - without /proc/latency_trace accesses blocking
them. The latency of stopping the trace is hidden from the tracer itself
- but it cannot prevent indirect effects such as your app missing
periods, if the periods are in the 5msec range.

Ingo

2006-05-18 08:53:47

by Sébastien Dugué

Subject: Re: rt20 scheduling latency testcase and failure data

On Thu, 2006-05-18 at 10:47 +0200, Ingo Molnar wrote:
> * Sébastien Dugué <[email protected]> wrote:
>
> > Darren,
> >
> > On Mon, 2006-05-15 at 18:30 -0700, Darren Hart wrote:
> > > Following Ingo's example I have included the modified test case (please see
> > > the original mail for librt.h) that starts the trace before each sleep and
> > > disables it after we wake up. If we have missed a period, we print the
> > > trace.
> > >
> >
> > Your test program fails (at least on my box) as the overhead of
> > starting and stopping the trace in the 5 ms period is just too high.
> >
> > By moving the latency_trace_start() at the start of the thread
> > function and latency_trace_stop() at the end, everything runs fine. I
> > did not have any period missed even under heavy load.
>
> could you send us the fixed testcase?

No problem, see attachment.

>
> thanks for tracking this down. FYI, the latency of stopping the trace is
> that expensive because we are copying large amounts of trace data
> around, to ensure that /proc/latency_trace is always consistent and is
> updated atomically, and to make sure that we can update the trace from
> interrupt contexts too - without /proc/latency_trace accesses blocking
> them. The latency of stopping the trace is hidden from the tracer itself
> - but it cannot prevent indirect effects such as your app from missing
> periods, if the periods are in the 5msec range.
>

Thanks for the explanation; I will have to look deeper into the code
to understand how it works though.

Sébastien.


Attachments:
sched_latency_lkml2.c (5.17 kB)

2006-05-18 08:56:47

by Ingo Molnar

Subject: Re: rt20 scheduling latency testcase and failure data


* Sébastien Dugué <[email protected]> wrote:

> > thanks for tracking this down. FYI, the latency of stopping the trace is
> > that expensive because we are copying large amounts of trace data
> > around, to ensure that /proc/latency_trace is always consistent and is
> > updated atomically, and to make sure that we can update the trace from
> > interrupt contexts too - without /proc/latency_trace accesses blocking
> > them. The latency of stopping the trace is hidden from the tracer itself
> > - but it cannot prevent indirect effects such as your app from missing
> > periods, if the periods are in the 5msec range.
> >
>
> Thanks for the explanation, will have to look deeper into the code
> to understand how it works though.

there's another complexity on SMP: if trace_all_cpus is set then the
per-cpu trace buffers are also sorted chronologically while being
copied into the current-max-trace-buffer, to produce easier-to-read
latency_trace output.

Ingo

2006-05-18 09:14:11

by Darren Hart

Subject: Re: rt20 scheduling latency testcase and failure data

On Tuesday 16 May 2006 00:22, Sébastien Dugué wrote:
> On Mon, 2006-05-15 at 18:30 -0700, Darren Hart wrote:
> > Following Ingo's example I have included the modified test case (please
> > see the original mail for librt.h) that starts the trace before each
> > sleep and disables it after we wake up. If we have missed a period, we
> > print the trace.
>

I found several flaws in my first two iterations of the test case,
particularly false negatives: its printing would cause large latencies
waiting for the tty, or even the disk if output was being redirected. The
new version uses buffered output, and I cleaned up the periodic loop using a
MARK() system. You can run the test like this:

./sched_latency_lkml -v4

It will refrain from spewing output until after the test completes, or a
missed period is detected. If a missed period is detected it will stop and
print the latency trace as well. Note that I restart the trace at each
iteration by just calling gettimeofday(0,1) - I don't stop it first. This
was because I found that stopping it with gettimeofday(0,0) could take as
long as 30ms!
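
To show the shape of that change, here is a rough sketch of the buffered
MARK() logging and the trace restart described above. The real helpers live
in the attached sched_latency_lkml.c, so the names, buffer sizes, and other
details below are assumptions:

#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#define LOG_LINES 65536
#define LOG_LEN   128

static char logbuf[LOG_LINES][LOG_LEN];
static int  nlines;

/* Record a timestamped marker in memory; nothing touches the tty or the
 * disk inside the timing loop, so the logging itself cannot add latency. */
#define MARK(fmt, ...) \
        do { \
                if (nlines < LOG_LINES) \
                        snprintf(logbuf[nlines++], LOG_LEN, \
                                 "MARK %d: " fmt, __LINE__, ##__VA_ARGS__); \
        } while (0)

/* (Re)start the -rt latency tracer via the gettimeofday(0,1) hook
 * described above, skipping the expensive gettimeofday(0,0) stop.  The
 * raw syscall is used so the magic arguments reach the kernel; on a
 * kernel without the tracer patch the call just fails and does nothing. */
static void trace_restart(void)
{
        syscall(SYS_gettimeofday, NULL, (void *)1);
}

/* Only after the run completes, or a missed period is detected, is the
 * in-memory log flushed to stdout. */
static void dump_log(void)
{
        int i;

        for (i = 0; i < nlines; i++)
                printf("%06d: %s\n", i, logbuf[i]);
}

int main(void)
{
        trace_restart();
        MARK("%d us (%d us) delta %d", 197, 0, 0);  /* illustrative values */
        /* ... the periodic loop would call trace_restart() and MARK() on
         * each iteration here ... */
        dump_log();
        return 0;
}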

The output looks something like this:

000000: ----- ITERATION 1 -----
000001: MARK 97: 197 us (112606039526 us) delta 0
000002: MARK 102: 214 us (16 us) delta 16
000003: MARK 110: 5024 us (4826 us) delta 4810
000004: MARK 121: 5029 us (4831 us) delta 5
000005:
000006: ----- ITERATION 2 -----
000007: MARK 97: 5957 us (5759 us) delta 928
000008: MARK 102: 5977 us (19 us) delta 19
000009: MARK 110: 10018 us (4060 us) delta 4041
000010: MARK 121: 10023 us (4065 us) delta 4

Format:
MARK LINE TIME_SINCE_START TIME_SINCE_TRACE_START TIME_SINCE_LAST_MARK

TIME_SINCE_TRACE_START is handy for indexing the latency_traces if you happen
to get one. Without all the added i/o and calls to stop the trace, I have
yet to see a missed period with this version of the program. While getting to
this point, I did see some things that concerned me:

sched_la-4856 3D... 4083us!: math_state_restore (device_not_available)
sched_la-4856 3D... 16033us : smp_apic_timer_interrupt (4b3b98e8 0 0)

Am I reading that right? 12ms to complete math_state_restore()? What
does "device_not_available" mean here?

Here are some other similar traces:

sched_la-5008 2D... 4104us!: math_state_restore (device_not_available)
sched_la-5008 2.... 4992us > sys_clock_gettime (00000001 b7fc8378 0000007b)
--
sched_la-4187 3D... 4786us : math_state_restore (device_not_available)
sched_la-4187 3D... 4786us!: init_fpu (math_state_restore)
sched_la-4187 3D... 5524us : smp_apic_timer_interrupt (4b3b98e8 0 0)
--
sched_la-4729 1D... 4078us!: math_state_restore (device_not_available)
sched_la-4729 1.... 4978us > sys_clock_gettime (00000001 b7f5a348 0000007b)

I'll run the test in a loop overnight and see if I can trigger a missed
period. In the meantime, can anyone explain why the math_state_restore()
might be causing so much latency?


Thanks,

--
Darren Hart
IBM Linux Technology Center
Realtime Linux Team
Phone: 503 578 3185
T/L: 775 3185


Attachments:
(No filename) (2.99 kB)
sched_latency_lkml.c (5.62 kB)

2006-05-18 09:15:39

by Sébastien Dugué

Subject: Re: rt20 scheduling latency testcase and failure data

On Thu, 2006-05-18 at 10:56 +0200, Ingo Molnar wrote:
> * Sébastien Dugué <[email protected]> wrote:
>
> > > thanks for tracking this down. FYI, the latency of stopping the trace is
> > > that expensive because we are copying large amounts of trace data
> > > around, to ensure that /proc/latency_trace is always consistent and is
> > > updated atomically, and to make sure that we can update the trace from
> > > interrupt contexts too - without /proc/latency_trace accesses blocking
> > > them. The latency of stopping the trace is hidden from the tracer itself
> > > - but it cannot prevent indirect effects such as your app from missing
> > > periods, if the periods are in the 5msec range.
> > >
> >
> > Thanks for the explanation, will have to look deeper into the code
> > to understand how it works though.
>
> there's another complexity on SMP: if trace_all_cpus is set then the
> per-cpu trace buffers are sorted chronologically as well while the
> copying into the current-max-trace-buffer, to produce easier to read
> latency_trace output.
>
Well, that's not the case here, but thanks for the info.

Sébastien.

2006-05-18 09:38:37

by Darren Hart

Subject: Re: rt20 scheduling latency testcase and failure data

On Thursday 18 May 2006 01:58, Sébastien Dugué wrote:
> On Thu, 2006-05-18 at 10:47 +0200, Ingo Molnar wrote:
> > * Sébastien Dugué <[email protected]> wrote:
> > > Darren,
> > >
> > > On Mon, 2006-05-15 at 18:30 -0700, Darren Hart wrote:
> > > > Following Ingo's example I have included the modified test case
> > > > (please see the original mail for librt.h) that starts the trace
> > > > before each sleep and disables it after we wake up. If we have
> > > > missed a period, we print the trace.
> > >
> > > Your test program fails (at least on my box) as the overhead of
> > > starting and stopping the trace in the 5 ms period is just too high.
> > >
> > > By moving the latency_trace_start() at the start of the thread
> > > function and latency_trace_stop() at the end, everything runs fine. I
> > > did not have any period missed even under heavy load.
> >
> > could you send us the fixed testcase?
>
> No problem, see attachment.

I found several similar problems in my original test case; please see my
earlier mail from today, where I included a completely rewritten test case
with buffered output and new periodic logic.

The case attached here seems to try to print the trace without first stopping
it. I don't think that will result in the desired output. My new test case
addresses that issue as well.

I'd appreciate any feedback, thanks.

--
Darren Hart
IBM Linux Technology Center
Realtime Linux Team
Phone: 503 578 3185
T/L: 775 3185

2006-05-18 09:53:50

by Sébastien Dugué

Subject: Re: rt20 scheduling latency testcase and failure data

On Thu, 2006-05-18 at 02:38 -0700, Darren Hart wrote:
> On Thursday 18 May 2006 01:58, Sébastien Dugué wrote:
> > On Thu, 2006-05-18 at 10:47 +0200, Ingo Molnar wrote:
> > > * Sébastien Dugué <[email protected]> wrote:
> > > > Darren,
> > > >
> > > > On Mon, 2006-05-15 at 18:30 -0700, Darren Hart wrote:
> > > > > Following Ingo's example I have included the modified test case
> > > > > (please see the original mail for librt.h) that starts the trace
> > > > > before each sleep and disables it after we wake up. If we have
> > > > > missed a period, we print the trace.
> > > >
> > > > Your test program fails (at least on my box) as the overhead of
> > > > starting and stopping the trace in the 5 ms period is just too high.
> > > >
> > > > By moving the latency_trace_start() at the start of the thread
> > > > function and latency_trace_stop() at the end, everything runs fine. I
> > > > did not have any period missed even under heavy load.
> > >
> > > could you send us the fixed testcase?
> >
> > No problem, see attachment.
>
> I found several similar problems in my original test case, please see my
> earlier mail from today where I included a completely rewritten test case
> with buffered output and new periodic logic.
>
> The case attached here seems to try to print the trace without first stopping
> it. I don't think that will result in the desired output. My new test case
> addresses that issue as well.
>
> I'd appreciate any feedback, thanks.
>

Just saw your post, testing now...

Sébastien.

2006-05-18 11:24:47

by Ingo Molnar

Subject: Re: rt20 scheduling latency testcase and failure data


* Darren Hart <[email protected]> wrote:

> yet to see a missed period with the version of the program. While
> getting to this point, I did see some things that concerned me:
>
> sched_la-4856 3D... 4083us!: math_state_restore (device_not_available)
> sched_la-4856 3D... 16033us : smp_apic_timer_interrupt (4b3b98e8 0 0)
>
> Am I reading that right? 12ms to complete math_state_restore()? What
> does "device_not_available" mean here?

no - the kernel returned to userspace after doing the
math_state_restore, and the next thing you saw was the timer IRQ. The
tracer traces function entries, but not function exits.

> Here are some other similar traces:
>
> sched_la-5008 2D... 4104us!: math_state_restore (device_not_available)
> sched_la-5008 2.... 4992us > sys_clock_gettime (00000001 b7fc8378 0000007b)

same here: we returned to userspace after FPU restore, and the next
thing was a sys_clock_gettime() syscall ~800 usecs later.

Ingo

2006-05-19 05:47:50

by Mike Galbraith

Subject: Re: rt20 scheduling latency testcase and failure data

I just got a trace from rt23 that contains...

sched_la-8326 0.... 4977us : hrtimer_cancel (do_nanosleep)
...
sched_la-8322 1.... 14358us : hrtimer_cancel (do_nanosleep)

...if anyone is interested.

-Mike


2006-05-19 06:25:17

by Mike Galbraith

Subject: Re: rt20 scheduling latency testcase and failure data

On Fri, 2006-05-19 at 07:48 +0200, Mike Galbraith wrote:
> I just got a trace from rt23 that contains...
>
> sched_la-8326 0.... 4977us : hrtimer_cancel (do_nanosleep)
> ...
> sched_la-8322 1.... 14358us : hrtimer_cancel (do_nanosleep)
>
> ...if anyone is interested.

Ahem. Never mind. I was too glued to usecs to notice pid :)