LinuxLists.cc - 2.6.11, IDE: Strange scheduling behaviour: high-pri RT process not scheduled?

2005-03-30 11:53:40

Subject: 2.6.11, IDE: Strange scheduling behaviour: high-pri RT process not scheduled?

I've written a small test program which enables periodic RTC interrupts
at 8192 Hz and then goes into a loop reading /dev/rtc and collecting
timing statistics (using the rdtscl macro).

The program runs at highest realtime scheduling priority (99) with
memory
locked. In the loop, it doesn't do any I/O except for /dev/rtc reads, no

memory allocations, nothing which could block or take time:

rdtscl(old);

for (i = 0; i < SAMPLES; ++i) {
if (read(fd, &rtcval, sizeof (unsigned long)) != sizeof (unsigned
long)) {
fprintf(stderr, "%s: reading rtc data failed: %s\n", argv[0],
strerror(errno));
exit(1);
}
rtcval >>= 8;
--rtcval;
if (rtcval >= INTERRUPTS)
++intr[INTERRUPTS];
else
++intr[rtcval];

rdtscl(new);
delta = new - old;
delta /= RESOLUTION;
if (delta >= INTERVALS)
++cnt[INTERVALS];
else
++cnt[delta];

old = new;
}

The test system runs a 2.6.11 kernel (no SMP) on a Pentium3 500 MHz
embedded hardware. All filesystems are on a CF Card connected to the
primary IDE controller (intel 440BX chipset) in PIO mode 2. The CF
card can transfer up to 2 MB/s.
There are 192 MB RAM, most of it free.
The system is configured without swap space.

As long as the system is idle otherwise, my program works as expected:
Most of the time, after an rtc interrupt the program gets the data from
/dev/rtc within less than 50 microseconds.

Even with the test program running (which causes >16000 context switches

per second), vmstat shows more than 95 % idle CPU. Hence, the CPU needed

by the rtc irq handler and the test program is minimal.

However, things break seriously when exercising the CF card in parallel
(e.g. with a dd if=/dev/hda of=/dev/null):

* The rtc *interrupt handler* is delayed for up to 250 *micro*seconds.
This is very bad for my purpose, but easy to explain: It is roughly the
time needed to transfer 512 Bytes from a CF card which can transfer 2
Mbyte/sec, and obviously, the CPU blocks all interrupts while making pio

transfers. (Why? Is this really necessary?)

(I know because I instrumented the rtc irq handler with rdtscl(), too)

* The *test program* is regularly blocked for 63 *milli*seconds,
sometimes for up to 300 *milli*seconds, which is absolutely
unacceptable.

* vmstat shows around 50 % sys CPU and around 50 % I/O wait, and 0 or
1 % user CPU (zero idle CPU). context switches are down to around
500 from the expected >16000, which also indicates that my program is
scheduled much less often than it should. Most of the time, vmstat shows
one running and one blocked process.

Now the big question:
*** Why doesn't my rt test program get *any* CPU for 63 jiffies? ***
(the system ticks at 1000 HZ)

* There is around 50 % idle CPU (I/O wait). I/O wait should not block my

program, because my program neither performs any disk I/O nor swaps.

* The dd program obviously gets some CPU regularly (because it copies 2
MB/s, and because no other program could cause the 1 % user CPU load).
The dd runs at normal shell scheduling priority, so it should be
preempted immediately by my test program!

Some things I tried:
1.) Using a mdma2 instead of a pio2 CF card: The CF card speed doubles
(4 MB/s instead of 2 MB/s), and the "blind time" of my test program
halves from 63 ms to 32 ms, but basically, the problem remains.
2.) Using a realtime-preempt-2.6.12-rc1-V0.7.41-11 kernel with
PREEMPT_RT:
If my test program runs at rtpri 99, the problem is gone: It is
scheduled within 30 microseconds after the rtc interrupt.
If my test program runs at rtpri 1, it still suffers from delays
in the 30 to 300 millisecond range.

Thanks for any help!

Klaus Kusche
> Entwicklung Software - Steuerung
> Software Development - Control
>
> KEBA AG
> A-4041 Linz
> Gewerbepark Urfahr
> Tel +43 / 732 / 7090-3120
> Fax +43 / 732 / 7090-8919
> E-Mail: [email protected]
> http://www.keba.com
>
>

2005-03-30 12:06:44

by Bartlomiej Zolnierkiewicz

[permalink] [raw]

Subject: Re: 2.6.11, IDE: Strange scheduling behaviour: high-pri RT process not scheduled?

On Wed, 30 Mar 2005 13:52:05 +0200, kus Kusche Klaus <[email protected]> wrote:

> However, things break seriously when exercising the CF card in parallel
> (e.g. with a dd if=/dev/hda of=/dev/null):
>
> * The rtc *interrupt handler* is delayed for up to 250 *micro*seconds.
> This is very bad for my purpose, but easy to explain: It is roughly the
> time needed to transfer 512 Bytes from a CF card which can transfer 2
> Mbyte/sec, and obviously, the CPU blocks all interrupts while making pio
>
> transfers. (Why? Is this really necessary?)
>
> (I know because I instrumented the rtc irq handler with rdtscl(), too)

hdparm -u1 /dev/hda

should help

2005-03-30 12:21:37

by kus Kusche Klaus

[permalink] [raw]

Subject: RE: 2.6.11, IDE: Strange scheduling behaviour: high-pri RT process not scheduled?

> From: Bartlomiej Zolnierkiewicz [mailto:[email protected]]
>
> On Wed, 30 Mar 2005 13:52:05 +0200, kus Kusche Klaus
> <[email protected]> wrote:
> > However, things break seriously when exercising the CF card
> in parallel
> > (e.g. with a dd if=/dev/hda of=/dev/null):
> >
> > * The rtc *interrupt handler* is delayed for up to 250
> *micro*seconds.
> > This is very bad for my purpose, but easy to explain: It is
> roughly the
> > time needed to transfer 512 Bytes from a CF card which can
> transfer 2
> > Mbyte/sec, and obviously, the CPU blocks all interrupts
> while making pio
> >
> > transfers. (Why? Is this really necessary?)
> >
> > (I know because I instrumented the rtc irq handler with
> rdtscl(), too)
>
> hdparm -u1 /dev/hda
>
> should help

Hmmm, thanks, that sounds very reasonable, and I didn't know that flag.

Unfortunately, it doesn't help. The bad timings stay the same (still
delays in the 30-300 ms range), the number of context switches stays the
same, ...

The only thing which changes is the CPU load shown by vmstat:
* With -u0, I have 1 % user, ~50 % sys, ~50 % wa
* With -u1, I have 1 % user, ~98 % sys, 1 % wa

P.S.: Apologies for my badly formatted mails. The company forces us to
use outlook, we may not even change the settings...

Klaus Kusche
>Entwicklung Software - Steuerung
>Software Development - Control
>
>KEBA AG
>A-4041 Linz
>Gewerbepark Urfahr
>Tel +43 / 732 / 7090-3120
>Fax +43 / 732 / 7090-8919
>E-Mail: [email protected]
>http://www.keba.com
>

2005-03-30 14:42:37

by Mark Hahn

[permalink] [raw]

Subject: Re: 2.6.11, IDE: Strange scheduling behaviour: high-pri RT process not scheduled?

> I've written a small test program which enables periodic RTC interrupts
> at 8192 Hz and then goes into a loop reading /dev/rtc and collecting
> timing statistics (using the rdtscl macro).

straightforward test, used for many years in the linux community
(I claim to have been the first to publish it on lkml ;)

> The test system runs a 2.6.11 kernel (no SMP) on a Pentium3 500 MHz
> embedded hardware.

which probably has memory bandwidth of at most a couple hundred MB/s,
which is really horrible by modern standards.

> However, things break seriously when exercising the CF card in parallel
> (e.g. with a dd if=/dev/hda of=/dev/null):
>
> * The rtc *interrupt handler* is delayed for up to 250 *micro*seconds.
> This is very bad for my purpose, but easy to explain: It is roughly the
> time needed to transfer 512 Bytes from a CF card which can transfer 2
> Mbyte/sec, and obviously, the CPU blocks all interrupts while making pio
>
> transfers. (Why? Is this really necessary?)

even with -u1, isn't there still a softirq queue that will delay the wakeup
of your user-space tester?

> * The *test program* is regularly blocked for 63 *milli*seconds,
> sometimes for up to 300 *milli*seconds, which is absolutely
> unacceptable.

guessing that's VM housekeeping.

> Now the big question:
> *** Why doesn't my rt test program get *any* CPU for 63 jiffies? ***
> (the system ticks at 1000 HZ)

because it's user-space. the 'rt' is a bit of a misnomer - it's merely
a higher priority, less preemptable job.

> * The dd program obviously gets some CPU regularly (because it copies 2
> MB/s, and because no other program could cause the 1 % user CPU load).
> The dd runs at normal shell scheduling priority, so it should be
> preempted immediately by my test program!

out of curiosity, does running it with "nice -n 19" change anything?

> 2.) Using a realtime-preempt-2.6.12-rc1-V0.7.41-11 kernel with
> PREEMPT_RT:
> If my test program runs at rtpri 99, the problem is gone: It is
> scheduled within 30 microseconds after the rtc interrupt.
> If my test program runs at rtpri 1, it still suffers from delays
> in the 30 to 300 millisecond range.

so your problem is solved, no?

also, did you try a (plain) preemptable kernel?

2005-03-30 20:38:06

by Lee Revell

[permalink] [raw]

Subject: Re: 2.6.11, IDE: Strange scheduling behaviour: high-pri RT process not scheduled?

On Wed, 2005-03-30 at 09:41 -0500, Mark Hahn wrote:
> >> The test system runs a 2.6.11 kernel (no SMP) on a Pentium3 500 MHz
> > embedded hardware.
>
> which probably has memory bandwidth of at most a couple hundred MB/s,
> which is really horrible by modern standards.

What does that have to do with anything? He said it was an embedded
system.

Lee

2005-03-31 11:05:56

by kus Kusche Klaus

[permalink] [raw]

Subject: RE: 2.6.11, IDE: Strange scheduling behaviour: high-pri RT process not scheduled?

> > The test system runs a 2.6.11 kernel (no SMP) on a Pentium3 500 MHz
> > embedded hardware.
>
> which probably has memory bandwidth of at most a couple hundred MB/s,
> which is really horrible by modern standards.

It's my job to find out if linux can be used for control systems
on this (and smaller: ARM, embedded PPC) hardware.
It's not my job to discuss the hardware.

This hardware is quite fast for embedded control system standards,
where product lifetimes exceed 10 years. AFAIK, intel does not
offer any P4 products with industrial availability and lifetime
gurantee (and they would run too hot anyway).

> > However, things break seriously when exercising the CF card
> in parallel
> > (e.g. with a dd if=/dev/hda of=/dev/null):
> >
> > * The rtc *interrupt handler* is delayed for up to 250
> *micro*seconds.
> > This is very bad for my purpose, but easy to explain: It is
> roughly the
> > time needed to transfer 512 Bytes from a CF card which can
> transfer 2
> > Mbyte/sec, and obviously, the CPU blocks all interrupts
> while making pio
> >
> > transfers. (Why? Is this really necessary?)
>
> even with -u1, isn't there still a softirq queue that will
> delay the wakeup
> of your user-space tester?

I assume your comment belongs to the timings of my test program
(below), not to the timings of the rtc irq handler (above):
The softirq queue should not block the rtc irq handler I think.

The softirq queue should not even block user-space rt programs
for 60-300 ms. If something takes that long, it should perhaps
be implemented as threads with a well-known rtprio, not in the
softirq queue.

> > * The *test program* is regularly blocked for 63 *milli*seconds,
> > sometimes for up to 300 *milli*seconds, which is absolutely
> > unacceptable.
>
> guessing that's VM housekeeping.

I would feel better if I knew instead of guessing.
Moreover, VM housekeeping caused by non-rtpri programs should not
block memory-resident rt programs.
And the VM housekeeping in this case is just too simple to waste
milliseconds on it: Just take one page of free mem (there is
plenty of it) and use it as a buffer...

> > Now the big question:
> > *** Why doesn't my rt test program get *any* CPU for 63 jiffies? ***
> > (the system ticks at 1000 HZ)
>
> because it's user-space. the 'rt' is a bit of a misnomer -
> it's merely
> a higher priority, less preemptable job.
>
> > * The dd program obviously gets some CPU regularly (because
> it copies 2
> > MB/s, and because no other program could cause the 1 % user
> CPU load).
> > The dd runs at normal shell scheduling priority, so it should be
> > preempted immediately by my test program!
>
> out of curiosity, does running it with "nice -n 19" change anything?

no.

> > 2.) Using a realtime-preempt-2.6.12-rc1-V0.7.41-11 kernel with
> > PREEMPT_RT:
> > If my test program runs at rtpri 99, the problem is gone: It is
> > scheduled within 30 microseconds after the rtc interrupt.
> > If my test program runs at rtpri 1, it still suffers
> from delays
> > in the 30 to 300 millisecond range.
>
> so your problem is solved, no?

No, running at an rtpri above the irq handlers is not an option
in practice: My program would block irq handling...

> also, did you try a (plain) preemptable kernel?

What I called "vanilla" was configured with "plain" preemption.

--
Klaus Kusche
Entwicklung Software - Steuerung
Software Development - Control

KEBA AG
A-4041 Linz
Gewerbepark Urfahr
Tel +43 / 732 / 7090-3120
Fax +43 / 732 / 7090-8919
E-Mail: [email protected]
http://www.keba.com

2005-03-31 11:35:47

by Ingo Molnar

[permalink] [raw]

Subject: Re: 2.6.11, IDE: Strange scheduling behaviour: high-pri RT process not scheduled?

* kus Kusche Klaus <[email protected]> wrote:

> I've written a small test program which enables periodic RTC
> interrupts at 8192 Hz and then goes into a loop reading /dev/rtc and
> collecting timing statistics (using the rdtscl macro).

getting /dev/rtc handling right for latency measurement is ... tricky.
The method i'm using under PREEMPT_RT is:

chrt -f 84 -p `pidof 'IRQ 0'`
chrt -f 95 -p `pidof 'IRQ 8'`
./rtc_wakeup -f 1024 -t 100000

you can get rtc_wakeup from:

http://www.affenbande.org/~tapas/wiki/index.php?rtc_wakeup

written by Florian Schmidt.

do you see high latencies even with rtc_wakeup?

Ingo

2005-03-31 15:20:13

by kus Kusche Klaus

[permalink] [raw]

Subject: RE: 2.6.11, IDE: Strange scheduling behaviour: high-pri RT process not scheduled?

> getting /dev/rtc handling right for latency measurement is
> ... tricky.
> The method i'm using under PREEMPT_RT is:
>
> chrt -f 84 -p `pidof 'IRQ 0'`
> chrt -f 95 -p `pidof 'IRQ 8'`
> ./rtc_wakeup -f 1024 -t 100000

I tried it your way.

First impressions with realtime-preempt-2.6.12-rc1-V0.7.42-01,
rtc_wakeup and chrt:

The main difference is not between rtc_wakeup and my test script
(the timings agree very well), but comes from chrt for 'IRQ 8':

System not loaded, original 'IRQ 8' prio 49:
* rtc_wakeup at 89(99): max jitter: 378.2% (461 usec), missed irqs: 0
* rtc_wakeup at 1(11): max jitter: 97.1% (118 usec), missed irqs: 25
System not loaded, 'IRQ 8' prio at 95:
* rtc_wakeup at 89(99): max jitter: 19.1% ( 23 usec), missed irqs: 0
* rtc_wakeup at 1(11): max jitter: 190.7% (232 usec), missed irqs: 52

The following tests are made with 'IRQ 8' at 95, rtc_wakeup at 89(99):
* Heavy mmap load, no oom: max jitter: 42.1% ( 51 usec)
* Heavy mmap load, oom: max jitter: 11989.2% (14635 usec)
(but still "missed irqs: 0", so IRQ 8 was also blocked for 14 ms)
* USB reads, no error: max jitter: 707.7% ( 863 usec)
(happens rarely)
* USB reads, error: max jitter: 7247.2% ( 8846 usec)
(again, no missed irqs)
* USB stick connect: max jitter: 5776.7% ( 7051 usec)
* USB stick disconnect: max jitter: 2966.6% ( 3621 usec)
(both without error!)

Notes:
* My system does not have a 'IRQ 0' kernel thread.
It has IRQ 1, 3, 7, 8, 12 and 14.
This is strange, because /proc/interrupts says
0: 4805658 XT-PIC timer 0/5658
2: 0 XT-PIC cascade 0/0
3: 0 XT-PIC serial 0/0
7: 17538 XT-PIC uhci_hcd:usb1, eth0 0/17538
8: 838523 XT-PIC rtc 0/38523
14: 5775 XT-PIC ide0 0/5775
* The syntax of chrt has changed:
chrt -f -p <prio> <pid>
* I run rtc_wakeup with -f 8192.
This is more realistic in our case.

CF card (IDE) load does not hurt as long as the rtc test runs
with a very high rtpri: The results did not differ significantly
from no-load results.
Attempts with tracing on and results with running the test
at low rtpri will follow tomorrow...

--
Klaus Kusche
Entwicklung Software - Steuerung
Software Development - Control

KEBA AG
A-4041 Linz
Gewerbepark Urfahr
Tel +43 / 732 / 7090-3120
Fax +43 / 732 / 7090-8919
E-Mail: [email protected]
http://www.keba.com

2005-03-31 15:23:48

by Ingo Molnar

[permalink] [raw]

Subject: Re: 2.6.11, IDE: Strange scheduling behaviour: high-pri RT process not scheduled?

* kus Kusche Klaus <[email protected]> wrote:

> The following tests are made with 'IRQ 8' at 95, rtc_wakeup at 89(99):
> * Heavy mmap load, no oom: max jitter: 42.1% ( 51 usec)
> * Heavy mmap load, oom: max jitter: 11989.2% (14635 usec)
> (but still "missed irqs: 0", so IRQ 8 was also blocked for 14 ms)

did you get any kernel messages in that time? (about missed irqs, etc.)
Please do a 'dmesg -n 0' to minimize the effect of kernel messages.

Ingo

2005-04-01 09:08:58

by kus Kusche Klaus

[permalink] [raw]

Subject: RE: 2.6.11, IDE: Strange scheduling behaviour: high-pri RT process not scheduled?

> > The following tests are made with 'IRQ 8' at 95, rtc_wakeup
> at 89(99):
> > * Heavy mmap load, no oom: max jitter: 42.1% ( 51 usec)
> > * Heavy mmap load, oom: max jitter: 11989.2% (14635 usec)
> > (but still "missed irqs: 0", so IRQ 8 was also blocked for 14 ms)
>
> did you get any kernel messages in that time? (about missed
> irqs, etc.)
> Please do a 'dmesg -n 0' to minimize the effect of kernel messages.

Excellent, thanks!

It turned out that the latencies are not caused by the kernel
messages themselves, but by sending them to a serial console
(which was off), in all my high latency cases at rtpri 89(99).

After removing the serial console from the boot parameters,
* the OOM timings are back to normal (around 50 microseconds)
* the USB error and remove timings are back to normal
* the USB plugin timings are in the range of the USB read
(which is up to 1 ms - still bad)

However, latencies at rtprio 2 are still very frustrating
(details will follow).

--
Klaus Kusche
Entwicklung Software - Steuerung
Software Development - Control

KEBA AG
A-4041 Linz
Gewerbepark Urfahr
Tel +43 / 732 / 7090-3120
Fax +43 / 732 / 7090-8919
E-Mail: [email protected]
http://www.keba.com