2005-03-30 13:01:28

by kus Kusche Klaus

[permalink] [raw]
Subject: PREEMPT_RT: Undetected latencies?

I'm experimenting with a realtime-preempt-2.6.12-rc1-V0.7.41-11 kernel,
preemption fully enabled, and a small RTC test program (see my previous
mail for details).

If my program runs at rtprio 99, most of the time everything is fine:
* My program records intervals of 122 +- 40 microseconds (which
corresponds nicely to the 8192 Hz RTC interrupt).
* Both the kernel's RTC histogram and the kernel's RT latency trace stay
below 40 microseconds.

However, during extensive testing, I found some cases which extend the
intervals unacceptably:
1.) When a test program performs massive mmaps (at base shell prio) and
activates the oom killer, my program measures intervals in the 15
millisecond range.
2.) When I get the USB errors described in a separate mail earlier
today, I also get intervals from 10 to 15 milliseconds.
(a combination of heavy USB and CF card traffic also seems to cause
blackouts in the 1 ms range even for rtprio 99, but I have to look into
that again)

Surprisingly, in both cases the worst latencies traced by the RT latency
trace and the RTC histogram are still less than 40 microseconds!
Moreover, the values read from /dev/rtc do not indicate any lost
interrupts. This seems to indicate that neither the RTC IRQ handler
itself nor timers were executed during these 15 millisecond period,
otherwise, either the RTC IRQ handler or the rtc_dropped_irq watchdog
would have counted lost interrupts! It seems that something locked up
the system completely for 15 ms: No interrupts at all, no scheduling,
...

Now, there are several questions:
* What is causing these 15 ms latencies in the two cases above, and how
can they be fixed?
* Why aren't they detected at all by any traces?
Are these classical spinlocks which aren't instrumented?
Are they explicit cli/sti in device drivers?
Any other possibilities?
How can this be traced and analyzed?

A simple explanation would be that the TSC is reset. Is there any code
in the kernel messing with the TSC?

If my program runs at rtprio 1 instead of rtprio 99, many things cause
long intervals:
* normal working load (up to 1.2 ms)
* heavy mmap load without oom (many samples increased by 0.2 ms)
* heavy USB stick I/O without errors (up to 3 ms)
* core dumps (up to 10 ms)
* heavy USB stick I/O with errors (up to 15 ms)
* oom killer (up to 100 ms)
* heavy CF card I/O (up to 300 ms, see separate mail)
Even on an idle system, I see 1 ms intervals now and then!

Again, the latency traces are in the range of 40 to 50 microseconds
only, so these delays are caused by something not covered by the latency
tracer. How to find that out?

One more question: When my test program runs at rtprio 1, there are huge
delays (up to 100 ms) between the moment the rtc read gets ready and the
moment the program is actually scheduled. However, even then the wakeup
latency trace is not showing it. What is actually measured by the wakeup
latency trace?

Thanks for any help!

Klaus Kusche
> Entwicklung Software - Steuerung
> Software Development - Control
>
> KEBA AG
> A-4041 Linz
> Gewerbepark Urfahr
> Tel +43 / 732 / 7090-3120
> Fax +43 / 732 / 7090-8919
> E-Mail: [email protected]
> http://www.keba.com
>
>