2005-01-28 00:16:36

by Thomas Gleixner

Subject: High resolution timers and BH processing on -RT

Hi,

some thoughts and observations.

I'm running -RT + a port of UTIME (the grandfather of HRT, IIRC) on a
couple of machines (x86, ppc, arm - all UP) here.

One part of the problem is the gettimeofday() issue, which seems to be
already addressed by John Stultz's patches.

The more interesting question in terms of realtime is the processing of
the timer queue(s) and the reprogramming of the timer for the next
event.

With -RT enabled the timer queue is solely processed by ksoftirqd, which
handles all the other bottom halves too, so pushing the priority of
ksoftirqd up is not a solution.
If I understood the HRT code correctly, this applies there too.
Please correct me if I'm wrong.

This makes the timer queue dependent on a variety of other softirq
clients, so you cannot predict which of those clients are active or
how much computation time they consume.

One possibility to solve this is separating the softirqs into different
threads. We have done this already and it seems to be an interesting
approach in terms of scalability of bottom half processing.
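
Roughly, each softirq vector gets its own kernel thread instead of being
funneled through a single ksoftirqd. A minimal sketch of the per-vector
thread - illustrative only, the pending-bit handling and the access to
softirq_vec[] are simplified compared to the real code:

static int softirq_thread(void *__nr)
{
        int nr = (long) __nr;

        while (!kthread_should_stop()) {
                set_current_state(TASK_INTERRUPTIBLE);
                if (!(local_softirq_pending() & (1 << nr)))
                        schedule();
                __set_current_state(TASK_RUNNING);

                local_irq_disable();
                if (local_softirq_pending() & (1 << nr)) {
                        /* clear only this vector's pending bit and run
                           its handler, as __do_softirq() does for all */
                        set_softirq_pending(local_softirq_pending() &
                                            ~(1 << nr));
                        local_irq_enable();
                        softirq_vec[nr].action(&softirq_vec[nr]);
                } else
                        local_irq_enable();
        }
        return 0;
}

Each thread can then be given its own RT priority via
sched_setscheduler(), which is what decouples the timer softirq from the
network stack et al.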

Another approach is splitting realtime timers and normal timers. This
introduces the ugliness of different timer queues, but it has the
advantage that the rare realtime queue users can be handled in the
(threaded) top half for accuracy's sake, while the "normal" users still
run on the jiffy-bound vanilla Linux timer queue, which is handled in
the non-predictable path of the ksoftirqd thread.

I know that this topic was discussed various times already, but the -RT
view adds a different dimension. The modifications for the split queues
are minimally invasive in kernel/timer.c, but still introduce the same
amount of noise in the nanosleep, itimer and posix_timer code.

Even if we agree that most timers are removed from the queue before
they expire - especially those of the networking code - moving all
timers into high resolution mode results in non-trivial noise injection
due to the short reprogramming intervals.

Some numbers to make this more transparent.

Machine: PIII Celeron 333MHz
RT - T0: 1ms cyclic
RT - T1: 2ms cyclic
....

Load average is ~4.0 for all tests. The numbers are maximum deviation
from the timeline in microseconds. Test time was ~60 minutes for each
scenario.
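
The cyclic tasks are essentially this loop (simplified userspace sketch;
the real test harness differs in details like logging and shutdown):

#include <stdio.h>
#include <time.h>
#include <sched.h>

#define NSEC_PER_SEC 1000000000LL

/* SCHED_FIFO task at the given priority, woken every period_ns on an
   absolute timeline; tracks the maximum deviation from that timeline */
static void cyclic_test(int prio, long period_ns)
{
        struct sched_param sp = { .sched_priority = prio };
        struct timespec next, now;
        long long dev, max_dev = 0;

        sched_setscheduler(0, SCHED_FIFO, &sp);
        clock_gettime(CLOCK_MONOTONIC, &next);

        for (;;) {
                next.tv_nsec += period_ns;
                while (next.tv_nsec >= NSEC_PER_SEC) {
                        next.tv_nsec -= NSEC_PER_SEC;
                        next.tv_sec++;
                }
                clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
                clock_gettime(CLOCK_MONOTONIC, &now);

                dev = (now.tv_sec - next.tv_sec) * NSEC_PER_SEC
                        + (now.tv_nsec - next.tv_nsec);
                if (dev > max_dev)
                        max_dev = dev;
        }
}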

Running all timers in high resolution mode (ksoftirqd) results in:
[T0 Prio: 60] 2123
[T1 Prio: 59] 2556
[T2 Prio: 58] 2882
[T3 Prio: 57] 2993
[T4 Prio: 56] 2888

Running all timers in high resolution mode (separated timer softirqd,
PRIO=70) results in:
[T0 Prio: 60] 423
[T1 Prio: 59] 372
[T2 Prio: 58] 756
[T3 Prio: 57] 802
[T4 Prio: 56] 1208

Separating realtime-process-related timers into the threaded top half
queue (TIMER IRQ PRIO = 90) and running all other timers inside of
ksoftirqd results in:
[T0 Prio: 60] 84
[T1 Prio: 59] 114
[T2 Prio: 58] 120
[T3 Prio: 57] 117
[T4 Prio: 56] 158

I think that these numbers are significant for low-power machines. This
area is my main concern - not the high-end SMP scenario - as the main
users of realtime systems are found in the embedded world.

Off-topic, but related question:

I discovered that the USB subsystem introduces 2-5 ms of random chaos
when a device is plugged/unplugged. I have not had time to track down
the culprit yet. Any pointers?

tglx



2005-01-28 04:43:20

by Ingo Molnar

Subject: Re: High resolution timers and BH processing on -RT


* Thomas Gleixner <[email protected]> wrote:

> Some numbers to make this more transparent.
>
> Machine: PIII Celeron 333MHz
> RT - T0: 1ms cyclic
> RT - T1: 2ms cyclic
> ....
>
> Load average is ~4.0 for all tests. The numbers are maximum deviation
> from the timeline in microseconds. Test time was ~60 minutes for each
> scenario.
>
> Running all timers in high resolution mode (ksoftirqd) results in:
> [T0 Prio: 60] 2123
> [T1 Prio: 59] 2556
> [T2 Prio: 58] 2882
> [T3 Prio: 57] 2993
> [T4 Prio: 56] 2888
>
> Running all timers in high resolution mode (separated timer softirqd,
> PRIO=70) results in:
> [T0 Prio: 60] 423
> [T1 Prio: 59] 372
> [T2 Prio: 58] 756
> [T3 Prio: 57] 802
> [T4 Prio: 56] 1208

Is this due to algorithmic/PIT-programming overhead, or due to the noise
introduced by other, non-hard-RT timers? I'd guess the latter from the
looks of it, but did your test introduce such noise (via networking and
application workloads)?

Ingo

2005-01-28 08:20:26

by Thomas Gleixner

Subject: Re: High resolution timers and BH processing on -RT

On Fri, 2005-01-28 at 05:43 +0100, Ingo Molnar wrote:
> Is this due to algorithmic/PIT-programming overhead, or due to the noise
> introduced by other, non-hard-RT timers? I'd guess the latter from the
> looks of it, but did your test introduce such noise (via networking and
> application workloads)?
>

Right, it's due to noise from non-RT timers, which I provoked by adding
networking and application load.

This adds randomly expiring timers, and admittedly the PIT reprogramming
overhead contributes its share of that noise.

tglx


2005-01-28 08:24:52

by Ingo Molnar

Subject: Re: High resolution timers and BH processing on -RT


* Thomas Gleixner <[email protected]> wrote:

> > Is this due to algorithmic/PIT-programming overhead, or due to the noise
> > introduced by other, non-hard-RT timers? I'd guess the latter from the
> > looks of it, but did your test introduce such noise (via networking and
> > application workloads)?
>
> Right, it's due to noise from non-RT timers, which I provoked by adding
> networking and application load.
>
> This adds randomly expiring timers, and admittedly the PIT reprogramming
> overhead contributes its share of that noise.

I haven't seen your latest code - what is the basic data structure? The
stock kernel has arrays of timers with increasing granularity and a
cascade mechanism to move timers down the arrays as they slowly expire -
but with a high-resolution API (1 usec accuracy?), what does the basic
data structure look like?

Is the "noise" due to timers expiring "at once" - but isnt it unlikely
for 'normal' timers to expire in exactly the same usec as the real
high-resolution one?

Or is it that we have a 'group' of normal timers expiring which, if
they happen to occur _just_ prior to a HRT event, will generate a larger
delay?

Ingo

2005-01-28 08:30:23

by Thomas Gleixner

Subject: Re: High resolution timers and BH processing on -RT

On Fri, 2005-01-28 at 09:24 +0100, Ingo Molnar wrote:
> * Thomas Gleixner <[email protected]> wrote:
>
> > > Is this due to algorithmic/PIT-programming overhead, or due to the noise
> > > introduced by other, non-hard-RT timers? I'd guess the latter from the
> > > looks of it, but did your test introduce such noise (via networking and
> > > application workloads)?
> >
> > Right, it's due to noise from non-RT timers, which I provoked by adding
> > networking and application load.
> >
> > This adds randomly expiring timers, and admittedly the PIT reprogramming
> > overhead contributes its share of that noise.
>
> I haven't seen your latest code - what is the basic data structure? The
> stock kernel has arrays of timers with increasing granularity and a
> cascade mechanism to move timers down the arrays as they slowly expire -
> but with a high-resolution API (1 usec accuracy?), what does the basic
> data structure look like?

The current testing code just moves the HRT timers to a separate list.
It's a linked list at the moment, which must be optimized once the
general dust settles.
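
In the spirit of this - illustrative, not the actual patch:

struct hrt_timer {
        unsigned long long      expires;        /* absolute expiry, nsec */
        void                    (*function)(void *data);
        void                    *data;
        struct hrt_timer        *next;
};

/* sorted insert; O(n) is tolerable while HRT users are rare, but wants
   a smarter structure once the dust has settled */
static void hrt_enqueue(struct hrt_timer **head, struct hrt_timer *timer)
{
        while (*head && (*head)->expires <= timer->expires)
                head = &(*head)->next;
        timer->next = *head;
        *head = timer;
}

The head of the list is what the timer hardware gets programmed for.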

> Is the "noise" due to timers expiring "at once" - but isnt it unlikely
> for 'normal' timers to expire in exactly the same usec as the real
> high-resolution one?
>
> or is it that we have a 'group' of normal timers expiring, which, if
> they happen to occur _just_ prior a HRT event will generate a larger
> delay?
>

Yep. The timers expire at random times, so short sequences of timer
interrupts going off back to back are likely. Each one needs a
reprogramming of the PIT and processing of the expired timers.
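
For reference, a reprogram is standard 8254 one-shot programming and
costs a few slow ISA port writes every time (sketch; the delta
conversion is simplified):

#define PIT_TICK_RATE   1193182UL       /* 8254 input clock, Hz */

/* arm PIT channel 0 in one-shot mode for an event delta_ns from now */
static void pit_set_next_event(unsigned long delta_ns)
{
        unsigned long count = (unsigned long long) delta_ns
                                * PIT_TICK_RATE / 1000000000ULL;

        outb(0x30, 0x43);               /* channel 0, lo/hi byte, mode 0 */
        outb(count & 0xff, 0x40);       /* LSB */
        outb(count >> 8, 0x40);         /* MSB */
}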

tglx


2005-01-28 08:48:50

by Ingo Molnar

Subject: Re: High resolution timers and BH processing on -RT


* Thomas Gleixner <[email protected]> wrote:

> > Or is it that we have a 'group' of normal timers expiring which, if
> > they happen to occur _just_ prior to a HRT event, will generate a larger
> > delay?
>
> Yep. The timers expire at random times, so short sequences of timer
> interrupts going off back to back are likely. Each one needs a
> reprogramming of the PIT and processing of the expired timers.

I don't really like the static split-up of 'normal' vs. 'HRT' timers -
there might in fact be separate priority requirements between HRT timers
too.

I think one possible solution would be to introduce some notion of
'timer priority' and to expire each timer priority level in a separate
timer expiry thread. Priority 0 (lowest) would be expired in ksoftirqd,
and there would be 3 separate threads for, say, priorities 1-3. Or
something like this. Potentially exposed to user-space as well, via new
APIs. Hm?
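
I.e. something like this - very rough, all names made up:

#define TIMER_PRIO_LEVELS       4

struct timer_prio_level {
        struct list_head        expired;        /* handed over by the tick */
        struct task_struct      *thread;        /* NULL for level 0 */
};

static struct timer_prio_level timer_levels[TIMER_PRIO_LEVELS];

/* tick side: sort an expired timer into its priority level and kick the
   matching expiry thread; level 0 stays with ksoftirqd */
static void queue_expired_timer(struct timer_list *timer, int level)
{
        list_add_tail(&timer->entry, &timer_levels[level].expired);
        if (level)
                wake_up_process(timer_levels[level].thread);
        else
                raise_softirq(TIMER_SOFTIRQ);
}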

To push this even further: in theory timers could inherit the priority
of the task that starts them, and they would be expired in that priority
order - but this probably needs a pretty clever (and most likely
complex) data-structure ...

Ingo

2005-01-28 18:39:04

by George Anzinger

Subject: Re: High resolution timers and BH processing on -RT

Ingo Molnar wrote:
> * Thomas Gleixner <[email protected]> wrote:
>
>
> > > Or is it that we have a 'group' of normal timers expiring which, if
> > > they happen to occur _just_ prior to a HRT event, will generate a larger
> > > delay?
> >
> > Yep. The timers expire at random times, so short sequences of timer
> > interrupts going off back to back are likely. Each one needs a
> > reprogramming of the PIT and processing of the expired timers.

If you can use a machine that has a local APIC, we can leave the PIT out
of it. Really, this is MUCH preferred. If your box has a LAPIC, make sure
it is not disabled by your config setup.

Leaving the PIT out of it, the structure is that HRT timers are put in the
normal timer list and, when they expire, are moved to a HRT list which only
contains timers that will expire prior to the next jiffy. This list is managed
by interrupt, ideally from the LAPIC, or the PIT if need be. Aside from the PIT
reprogramming (once per HRT timer plus once to get back to the 1/HZ period),
there can be delays in getting the timer out of the normal timer list. The main
thing here is that the list MUST be processed as close to the jiffy edge as
possible, as any timers due shortly after the jiffy edge will be shadowed by
this regardless of the HRT interrupt. Of course, if an expired timer is
presented to the HRT code by the normal timer expire code, it is expired
immediately.

A quick comment here on the current RT code. It looks to me like there is a
race in timer delivery. It looks like the softirq is "raised" by the PIT
interrupt code while jiffies is bumped by the timer thread. If the softirq
gets to run prior to the PIT interrupt thread, we could end up in the
run_timer_list code with a stale jiffies value and do nothing. This would
delay normal timers for a jiffy and HRT timers for some time less than a
jiffy, depending on when they were really due.

I think we should move the raising of the timer softirq to the PIT interrupt
thread, after we release the xtime_lock.
>
>
> I don't really like the static split-up of 'normal' vs. 'HRT' timers -
> there might in fact be separate priority requirements between HRT timers
> too.

Yes, and high-priority tasks might want low-res timers...
>
> I think one possible solution would be to introduce some notion of
> 'timer priority' and to expire each timer priority level in a separate
> timer expiry thread. Priority 0 (lowest) would be expired in ksoftirqd,
> and there would be 3 separate threads for, say, priorities 1-3. Or
> something like this. Potentially exposed to user-space as well, via new
> APIs. Hm?
>
> To push this even further: in theory timers could inherit the priority
> of the task that starts them, and they would be expired in that priority
> order - but this probably needs a pretty clever (and most likely
> complex) data-structure ...

A long time ago in another land, I did such a system. The timer priority was
taken from the calling task. At that time (and now, till convinced otherwise) I
thought it a _good thing_ to expire timers in order, regardless of their
priority, so all timers pending delivery were delivered at the priority of the
highest priority timer in the "batch". The basic idea was that the interrupt
code pulled expired timers from the timer list and pushed them into the pending
list, finding the highest priority timer in the list in the process. The timer
delivery thread was then run at that priority. This thread adjusted its
priority downward as needed, but in all cases the timers were delivered in
strict time order.

Since then, as now, the timer delivery usually just _notified_ a task of a
pending signal, so the low priority timers did not really hold things up for
long. Once the high priority timer was delivered and the thread either
finished or dropped its priority, the waiting task (having been woken by the
signal delivery) could switch in.

The primary thing needed for this is a simple and quick way to switch a
task's priority, both from outside and from the task itself.
>

--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/

2005-01-28 18:51:45

by Ingo Molnar

Subject: Re: High resolution timers and BH processing on -RT


* George Anzinger <[email protected]> wrote:

> A quick comment here on the current RT code. It looks to me like
> there is a race in timer delivery. It looks like the softirq is
> "raised" by the PIT interrupt code while jiffies is bumped by the
> timer thread. If the softirq gets to run prior to the PIT interrupt
> thread, we could end up in the run_timer_list code with a stale jiffies
> value and do nothing. This would delay normal timers for a jiffy and
> HRT timers for some time less than a jiffy, depending on when they
> were really due.
>
> I think we should move the raising of the timer softirq to the PIT
> interrupt thread, after we release the xtime_lock.

OK - mind sending a patch for this?

Ingo

2005-01-28 18:59:29

by Ingo Molnar

Subject: Re: High resolution timers and BH processing on -RT


* George Anzinger <[email protected]> wrote:

> The primary thing needed for this is a simple and quick way to switch
> a task's priority, both from outside and from the task itself.

Check out sched.c's mutex_setprio(p, prio) and mutex_getprio(p), which
are used by the PI code in kernel/rt.c. They are pretty robust and
heavily used.
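
For George's batch delivery that could look roughly like this (sketch;
only mutex_setprio() is real, the timer structure and list are made up):

/* deliver expired timers in strict time order; the batch was queued in
   time order by the interrupt code, and the thread starts out at the
   priority of the highest-priority timer in the batch */
static void deliver_batch(struct list_head *pending, int highest_prio)
{
        struct prio_timer *t;

        mutex_setprio(current, highest_prio);

        list_for_each_entry(t, pending, entry) {
                /* adjust downward (or back up) as the batch demands,
                   but never reorder the deliveries */
                if (t->prio != current->prio)
                        mutex_setprio(current, t->prio);
                t->function(t->data);
        }
}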

Ingo

2005-01-31 09:13:36

by Thomas Gleixner

Subject: Re: High resolution timers and BH processing on -RT

On Fri, 2005-01-28 at 10:34 -0800, George Anzinger wrote:
> If you can use a machine that has a local APIC, we can leave the PIT out
> of it. Really, this is MUCH preferred. If your box has a LAPIC, make sure
> it is not disabled by your config setup.

Unfortunately it's an embedded box w/o LAPIC. Embedded systems with low
computing power and restricted peripherals are often used in industrial
environments and are likely to be a major client of RT/HRT
implementations.

> Leaving the PIT out of it, the structure is that HRT timers are put in the
> normal timer list and, when they expire, are moved to a HRT list which only
> contains timers that will expire prior to the next jiffy. This list is managed
> by interrupt, ideally from the LAPIC, or the PIT if need be. Aside from the PIT
> reprogramming (once per HRT timer plus once to get back to the 1/HZ period),
> there can be delays in getting the timer out of the normal timer list. The main
> thing here is that the list MUST be processed as close to the jiffy edge as
> possible, as any timers due shortly after the jiffy edge will be shadowed by
> this regardless of the HRT interrupt. Of course, if an expired timer is
> presented to the HRT code by the normal timer expire code, it is expired
> immediately.

This requires running the 1/HZ timer code at high priority in order to
run as close as possible to the jiffy edge. I know that you made this
move in order to make the HRT code less intrusive, but I think this
trades code beauty for latency issues. In order to reduce the latency
for HRT timers it will be necessary to split the timer list work into
pieces.

At the jiffy interrupt (rough sketch below):
- Move the HRT timers out of the next jiffy bucket, reprogram the HRT
  time source.
- Run the shadowed HRT timers, which are close to the jiffy edge (too
  close to reprogram the HRT time source).
- Move expired low-res timers to a work list.
- Schedule the low-res timer work.
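
Rough shape of that split - all names illustrative:

static void jiffy_interrupt(void)
{
        /* 1+2: pull HRT timers due within this jiffy out of the bucket,
           run the shadowed ones right here in the threaded top half and
           arm the HRT time source for the first of the rest */
        hrt_pull_from_bucket(jiffies + 1);
        hrt_run_shadowed();
        hrt_program_timesource();

        /* 3+4: hand expired low-res timers to a work list and leave
           them to the (lower priority) timer softirq thread */
        move_expired_to_worklist(jiffies);
        raise_softirq(TIMER_SOFTIRQ);
}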

Not sure if this is less intrusive than the previous modifications to
add_timer/mod_timer/del_timer, which can nicely be done with simple
macros and do not touch the low-res timer list, including its
processing, at all.

> > i dont really like the static splitup of 'normal' vs. 'HRT' timers -
> > there might in fact be separate priority requirements between HRT timers
> > too.
>
> Yes, and high priority tasks might want low res timers...

Sure. Is there any API available to select the timer type / priority?

> A long time ago in another land, I did such a system. The timer priority was
> taken from the calling task. At that time (and now, till convinced otherwise) I
> thought it a _good thing_ to expire timers in order, regardless of their
> priority, so all timers pending delivery were delivered at the priority of the
> highest priority timer in the "batch". The basic idea was that the interrupt
> code pulled expired timers from the timer list and pushed them into the pending
> list, finding the highest priority timer in the list in the process. The timer
> delivery thread was then run at that priority. This thread adjusted its
> priority downward as needed, but in all cases the timers were delivered in
> strict time order.
>
> Since then, as now, the timer delivery usually just _notified_ a task of a
> pending signal, so the low priority timers did not really hold things up for
> long. Once the high priority timer was delivered and the thread either
> finished or dropped its priority, the waiting task (having been woken by the
> signal delivery) could switch in.

Makes sense.

tglx