2001-04-11 18:47:14

by Bret Indrelee

[permalink] [raw]
Subject: Re: No 100 HZ timer!

Mikulas Patocka ([email protected]) wrote:
> Adding and removing timers happens much more frequently than PIT tick,
> so
> comparing these times is pointless.
>
> If you have some device and timer protecting it from lockup on buggy
> hardware, you actually
>
> send request to device
> add timer
>
> receive interrupt and read reply
> remove timer
>
> With the curent timer semantics, the cost of add timer and del timer is
> nearly zero. If you had to reprogram the PIT on each request and reply,
> it
> would slow things down.
>
> Note that you call mod_timer also on each packet received - and in worst
> case (which may happen), you end up reprogramming the PIT on each
> packet.

You can still have nearly zero cost for the normal case. Avoiding worst
case behaviour is also pretty easy.

You only reprogram the PIT if you have to change the interval.

Keep all timers in a sorted double-linked list. Do the insert
intelligently, adding it from the back or front of the list depending on
where it is in relation to existing entries.

You only need to reprogram the interval timer when:
1. You've got a new entry at the head of the list
AND
2. You've reprogrammed the interval to something larger than the time to
the new head of list.

In the case of a device timeout, it is usually not going to be inserted at
the head of the list. It is very seldom going to actually timeout.

Choose your interval wisely, only increasing it when you know it will pay
off. The best way of doing this would probably be to track some sort
of LCD for timeouts.

The real trick is to do a lot less processing on every tick than is
currently done. Current generation PCs can easily handle 1000s of
interrupts a second if you keep the overhead small.

-Bret

------------------------------------------------------------------------------
Bret Indrelee | Sometimes, to be deep, we must act shallow!
[email protected] | -Riff in The Quatrix



2001-04-12 17:40:48

by George Anzinger

[permalink] [raw]
Subject: Re: No 100 HZ timer!

Bret Indrelee wrote:
>
> Mikulas Patocka ([email protected]) wrote:
> > Adding and removing timers happens much more frequently than PIT tick,
> > so
> > comparing these times is pointless.
> >
> > If you have some device and timer protecting it from lockup on buggy
> > hardware, you actually
> >
> > send request to device
> > add timer
> >
> > receive interrupt and read reply
> > remove timer
> >
> > With the curent timer semantics, the cost of add timer and del timer is
> > nearly zero. If you had to reprogram the PIT on each request and reply,
> > it
> > would slow things down.
> >
> > Note that you call mod_timer also on each packet received - and in worst
> > case (which may happen), you end up reprogramming the PIT on each
> > packet.
>
> You can still have nearly zero cost for the normal case. Avoiding worst
> case behaviour is also pretty easy.
>
> You only reprogram the PIT if you have to change the interval.
>
> Keep all timers in a sorted double-linked list. Do the insert
> intelligently, adding it from the back or front of the list depending on
> where it is in relation to existing entries.

I think this is too slow, especially for a busy system, but there are
solutions...
>
> You only need to reprogram the interval timer when:
> 1. You've got a new entry at the head of the list
> AND
> 2. You've reprogrammed the interval to something larger than the time to
> the new head of list.

Uh, I think 1. IMPLIES 2. If it is at the head, it must be closer in
than what the system is waiting for (unless, of course its time has
already passed, but lets not consider that here).
>
> In the case of a device timeout, it is usually not going to be inserted at
> the head of the list. It is very seldom going to actually timeout.

Right, and if the system doesn't have many timers, thus putting this at
the head, it has the time to do the extra work.
>
> Choose your interval wisely, only increasing it when you know it will pay
> off. The best way of doing this would probably be to track some sort
> of LCD for timeouts.

I wonder about a more relaxed device timeout semantic that says
something like: wait X + next timer interrupt. This would cause the
timer insert code to find an entry at least X out and put this timer
just after it. There are other ways to do this of course. The question
here is: Would this be worth while?
>
> The real trick is to do a lot less processing on every tick than is
> currently done. Current generation PCs can easily handle 1000s of
> interrupts a second if you keep the overhead small.

I don't see the logic here. Having taken the interrupt, one would tend
to want to do as much as possible, rather than schedule another
interrupt to continue the processing. Rather, I think you are trying to
say that we can afford to take more interrupts for time keeping. Still,
I think what we are trying to get with tick less timers is a system that
takes FEWER interrupts, not more.

George

2001-04-12 21:19:39

by Bret Indrelee

[permalink] [raw]
Subject: Re: No 100 HZ timer!

On Thu, 12 Apr 2001, george anzinger wrote:
> Bret Indrelee wrote:
> > Keep all timers in a sorted double-linked list. Do the insert
> > intelligently, adding it from the back or front of the list depending on
> > where it is in relation to existing entries.
>
> I think this is too slow, especially for a busy system, but there are
> solutions...

It is better than the current solution.

The insert takes the most time, having to scan through the list. If you
had to scan the whole list it would be O(n) with a simple linked list. If
you insert it from the end, it is almost always going to be less than
that.

The time to remove is O(1).

Fetching the first element from the list is also O(1), but you may have to
fetch several items that have all expired. Here you could do something
clever. Just make sure it is O(1) to determine if the list is empty.

[ snip ]
> > The real trick is to do a lot less processing on every tick than is
> > currently done. Current generation PCs can easily handle 1000s of
> > interrupts a second if you keep the overhead small.
>
> I don't see the logic here. Having taken the interrupt, one would tend
> to want to do as much as possible, rather than schedule another
> interrupt to continue the processing. Rather, I think you are trying to
> say that we can afford to take more interrupts for time keeping. Still,
> I think what we are trying to get with tick less timers is a system that
> takes FEWER interrupts, not more.

The system should be CAPABLE of handling 1000s of interrupts without
excessive system load.

The actual number of interrupts would be reduced if we adjusted the
interrupt interval based on the head of the list.

Two different things.

-Bret

------------------------------------------------------------------------------
Bret Indrelee | Sometimes, to be deep, we must act shallow!
[email protected] | -Riff in The Quatrix

2001-04-12 22:21:47

by George Anzinger

[permalink] [raw]
Subject: Re: No 100 HZ timer!

Bret Indrelee wrote:
>
> On Thu, 12 Apr 2001, george anzinger wrote:
> > Bret Indrelee wrote:
> > > Keep all timers in a sorted double-linked list. Do the insert
> > > intelligently, adding it from the back or front of the list depending on
> > > where it is in relation to existing entries.
> >
> > I think this is too slow, especially for a busy system, but there are
> > solutions...
>
> It is better than the current solution.

Uh, where are we talking about. The current time list insert is real
close to O(1) and never more than O(5).
>
> The insert takes the most time, having to scan through the list. If you
> had to scan the whole list it would be O(n) with a simple linked list. If
> you insert it from the end, it is almost always going to be less than
> that.

Right, but compared to the current O(5) max, this is just too long.
>
> The time to remove is O(1).
>
> Fetching the first element from the list is also O(1), but you may have to
> fetch several items that have all expired. Here you could do something
> clever. Just make sure it is O(1) to determine if the list is empty.
>
I would hope to move expired timers to another list or just process
them. In any case they should not be a problem here.

One of the posts that started all this mentioned a tick less system (on
a 360 I think) that used the current time list. They had to scan
forward in time to find the next event and easy 10 ms was a new list to
look at. Conclusion: The current list structure is NOT organized for
tick less time keeping.

George

2001-04-13 04:01:06

by Bret Indrelee

[permalink] [raw]
Subject: Re: No 100 HZ timer!

On Thu, 12 Apr 2001, george anzinger wrote:
> Bret Indrelee wrote:
> >
> > On Thu, 12 Apr 2001, george anzinger wrote:
> > > Bret Indrelee wrote:
> > > > Keep all timers in a sorted double-linked list. Do the insert
> > > > intelligently, adding it from the back or front of the list depending on
> > > > where it is in relation to existing entries.
> > >
> > > I think this is too slow, especially for a busy system, but there are
> > > solutions...
> >
> > It is better than the current solution.
>
> Uh, where are we talking about. The current time list insert is real
> close to O(1) and never more than O(5).

I don't like the cost of the cascades every (as I recall) 256
interrupts. This is more work than is done in the rest of the interrupt
processing, happens during the tick interrupt, and results in a rebuild of
much of the table.

-Bret

------------------------------------------------------------------------------
Bret Indrelee | Sometimes, to be deep, we must act shallow!
[email protected] | -Riff in The Quatrix

2001-04-13 06:06:21

by Ben Greear

[permalink] [raw]
Subject: Re: No 100 HZ timer!

Bret Indrelee wrote:
>
> On Thu, 12 Apr 2001, george anzinger wrote:
> > Bret Indrelee wrote:
> > >
> > > On Thu, 12 Apr 2001, george anzinger wrote:
> > > > Bret Indrelee wrote:
> > > > > Keep all timers in a sorted double-linked list. Do the insert
> > > > > intelligently, adding it from the back or front of the list depending on
> > > > > where it is in relation to existing entries.
> > > >
> > > > I think this is too slow, especially for a busy system, but there are
> > > > solutions...
> > >
> > > It is better than the current solution.
> >
> > Uh, where are we talking about. The current time list insert is real
> > close to O(1) and never more than O(5).
>
> I don't like the cost of the cascades every (as I recall) 256
> interrupts. This is more work than is done in the rest of the interrupt
> processing, happens during the tick interrupt, and results in a rebuild of
> much of the table.
>
> -Bret

Wouldn't a heap be a good data structure for a list of timers? Insertion
is log(n) and finding the one with the least time is O(1), ie pop off the
front.... It can be implemented in an array which should help cache
coherency and all those other things they talked about in school :)

Ben
>
> ------------------------------------------------------------------------------
> Bret Indrelee | Sometimes, to be deep, we must act shallow!
> [email protected] | -Riff in The Quatrix
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
Ben Greear ([email protected]) http://www.candelatech.com
Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL)
http://scry.wanfear.com http://scry.wanfear.com/~greear

2001-04-13 08:43:53

by George Anzinger

[permalink] [raw]
Subject: Re: No 100 HZ timer!

Ben Greear wrote:
>
> Bret Indrelee wrote:
> >
> > On Thu, 12 Apr 2001, george anzinger wrote:
> > > Bret Indrelee wrote:
> > > >
> > > > On Thu, 12 Apr 2001, george anzinger wrote:
> > > > > Bret Indrelee wrote:
> > > > > > Keep all timers in a sorted double-linked list. Do the insert
> > > > > > intelligently, adding it from the back or front of the list depending on
> > > > > > where it is in relation to existing entries.
> > > > >
> > > > > I think this is too slow, especially for a busy system, but there are
> > > > > solutions...
> > > >
> > > > It is better than the current solution.
> > >
> > > Uh, where are we talking about. The current time list insert is real
> > > close to O(1) and never more than O(5).
> >
> > I don't like the cost of the cascades every (as I recall) 256
> > interrupts. This is more work than is done in the rest of the interrupt
> > processing, happens during the tick interrupt, and results in a rebuild of
> > much of the table.

Right, it needs to go, we need to eliminate the "lumps" in time :)
> >
> > -Bret
>
> Wouldn't a heap be a good data structure for a list of timers? Insertion
> is log(n) and finding the one with the least time is O(1), ie pop off the
> front.... It can be implemented in an array which should help cache
> coherency and all those other things they talked about in school :)
>
I must be missing something here. You get log(n) from what? B-trees?
How would you manage them to get the needed balance? Stopping the world
to re-balance is worse than the cascade. I guess I need to read up on
this stuff. A good pointer would be much appreciated.

Meanwhile, I keep thinking of a simple doubly linked list in time
order. To speed it up keep an array of pointers to the first N whole
jiffie points and maybe pointers to coarser points beyond the first N.
Make N, say 512. Build the pointers as needed. This should give
something like O(n/N) insertion and O(1) removal.

George

2001-04-13 10:36:59

by Jamie Lokier

[permalink] [raw]
Subject: Re: No 100 HZ timer!

george anzinger wrote:
> > Wouldn't a heap be a good data structure for a list of timers? Insertion
> > is log(n) and finding the one with the least time is O(1), ie pop off the
> > front.... It can be implemented in an array which should help cache
> > coherency and all those other things they talked about in school :)
> >
> I must be missing something here. You get log(n) from what? B-trees?
> How would you manage them to get the needed balance? Stopping the world
> to re-balance is worse than the cascade. I guess I need to read up on
> this stuff. A good pointer would be much appreciated.

Look for "priority queues" and "heaps". In its simplest form, you use a
heap-ordered tree, which can be implemented using an array (that's
usually how it's presented), or having the objects in the heap point to
each other.

A heap-ordered tree is not as strictly ordered as, well, an ordered tree
:-) The rule is: if A is the parent of B and C, then A expires earlier
than B, and A expires earlier than C. There is no constraint on the
relative expiry times of B and C.

There is no "stop the world" to rebalance, which you may consider an
advantage over the present hierarchy of tables. On the other hand, each
insertion or deletion operation takes O(log n) time, where n is the
number of items in the queue. Although fairly fast, this bound can be
improved if you know the typical insertion/deletion patterns, to O(1)
for selected cases. Also you should know that not all priority queues
are based on heap-ordered trees.

Linux' current hierarchy of tables is a good example of optimisation: it
is optimised for inserting and actually running short term timers, as
well as inserting and deleting (before running) long term timers. These
extremes take O(1) for insertion, removal and expiry, including the
"stop the world" time. This should be considered before and move to a
heap-based priority queue, which may turn out slower.

> Meanwhile, I keep thinking of a simple doubly linked list in time
> order. To speed it up keep an array of pointers to the first N whole
> jiffie points and maybe pointers to coarser points beyond the first N.
> Make N, say 512. Build the pointers as needed. This should give
> something like O(n/N) insertion and O(1) removal.

You've just described the same as the current implementation, but with
lists for longer term timers. I.e. slower. With your coarser points,
you have to sort the front elements of the coarse list into a fine one
from time to time.

The idea of having jiffie-point pointers into a data structure for fast
insertion is a good one for speeding up any data structure for that
common case, though.

-- Jamie

2001-04-13 16:09:07

by George Anzinger

[permalink] [raw]
Subject: Re: No 100 HZ timer!

Jamie Lokier wrote:
>
> george anzinger wrote:
> > > Wouldn't a heap be a good data structure for a list of timers? Insertion
> > > is log(n) and finding the one with the least time is O(1), ie pop off the
> > > front.... It can be implemented in an array which should help cache
> > > coherency and all those other things they talked about in school :)
> > >
> > I must be missing something here. You get log(n) from what? B-trees?
> > How would you manage them to get the needed balance? Stopping the world
> > to re-balance is worse than the cascade. I guess I need to read up on
> > this stuff. A good pointer would be much appreciated.
>
> Look for "priority queues" and "heaps". In its simplest form, you use a
> heap-ordered tree, which can be implemented using an array (that's
> usually how it's presented), or having the objects in the heap point to
> each other.
>
> A heap-ordered tree is not as strictly ordered as, well, an ordered tree
> :-) The rule is: if A is the parent of B and C, then A expires earlier
> than B, and A expires earlier than C. There is no constraint on the
> relative expiry times of B and C.
>
> There is no "stop the world" to rebalance, which you may consider an
> advantage over the present hierarchy of tables. On the other hand, each
> insertion or deletion operation takes O(log n) time, where n is the
> number of items in the queue. Although fairly fast, this bound can be
> improved if you know the typical insertion/deletion patterns, to O(1)
> for selected cases. Also you should know that not all priority queues
> are based on heap-ordered trees.
>
> Linux' current hierarchy of tables is a good example of optimisation: it
> is optimised for inserting and actually running short term timers, as
> well as inserting and deleting (before running) long term timers. These
> extremes take O(1) for insertion, removal and expiry, including the
> "stop the world" time. This should be considered before and move to a
> heap-based priority queue, which may turn out slower.
>
> > Meanwhile, I keep thinking of a simple doubly linked list in time
> > order. To speed it up keep an array of pointers to the first N whole
> > jiffie points and maybe pointers to coarser points beyond the first N.
> > Make N, say 512. Build the pointers as needed. This should give
> > something like O(n/N) insertion and O(1) removal.
>
> You've just described the same as the current implementation, but with
> lists for longer term timers. I.e. slower. With your coarser points,
> you have to sort the front elements of the coarse list into a fine one
> from time to time.
>
> The idea of having jiffie-point pointers into a data structure for fast
> insertion is a good one for speeding up any data structure for that
> common case, though.
>
If we are to do high-res-timers, I think we will always have to do some
sort of a sort on insert. If we can keep the sort to a VERY short list
and only do it on sub jiffie timers, we will have something that is VERY
close to what we have now.

I think that the density of timers tends to increase as they get closer
to "now". This should allow coarser pointers for times more removed
from "now". Still if we sort on insert we will not have to resort
later.

The pointers into the list, however, are perishable and need to be
refreshed as time passes. My though was to do this only when a needed
pointer was found to be "expired" and then only for the needed pointer.
On first need we would have a small overhead, but would remember for
next time.

Thanks for the good input

George

2001-04-13 16:11:58

by Horst von Brand

[permalink] [raw]
Subject: Re: No 100 HZ timer!

Ben Greear <[email protected]> said:

[...]

> Wouldn't a heap be a good data structure for a list of timers? Insertion
> is log(n) and finding the one with the least time is O(1), ie pop off the
> front.... It can be implemented in an array which should help cache
> coherency and all those other things they talked about in school :)

Insertion and deleting the first are both O(log N). Plus the array is fixed
size (bad idea) and the jumping around in the array thrashes the caches.
--
Horst von Brand [email protected]
Casilla 9G, Vin~a del Mar, Chile +56 32 672616

2001-04-13 21:54:42

by George Anzinger

[permalink] [raw]
Subject: Re: No 100 HZ timer!

Horst von Brand wrote:
>
> Ben Greear <[email protected]> said:
>
> [...]
>
> > Wouldn't a heap be a good data structure for a list of timers? Insertion
> > is log(n) and finding the one with the least time is O(1), ie pop off the
> > front.... It can be implemented in an array which should help cache
> > coherency and all those other things they talked about in school :)
>
> Insertion and deleting the first are both O(log N). Plus the array is fixed
> size (bad idea) and the jumping around in the array thrashes the caches.
> --
And your solution is?

George

2001-04-13 23:01:18

by Jamie Lokier

[permalink] [raw]
Subject: Re: No 100 HZ timer!

george anzinger wrote:
> If we are to do high-res-timers, I think we will always have to do some
> sort of a sort on insert.

Yes, that's called a priority queue... (And you don't actually sort on
each insertion).

-- Jamie

2001-04-13 23:11:22

by Jamie Lokier

[permalink] [raw]
Subject: Re: No 100 HZ timer!

george anzinger wrote:
> Horst von Brand wrote:
> >
> > Ben Greear <[email protected]> said:
> >
> > [...]
> >
> > > Wouldn't a heap be a good data structure for a list of timers? Insertion
> > > is log(n) and finding the one with the least time is O(1), ie pop off the
> > > front.... It can be implemented in an array which should help cache
> > > coherency and all those other things they talked about in school :)
> >
> > Insertion and deleting the first are both O(log N). Plus the array is fixed
> > size (bad idea) and the jumping around in the array thrashes the caches.
> > --
> And your solution is?

Note that jumping around the array thrashes no more cache than
traversing a tree (it touches the same number of elements). I prefer
heap-ordered trees though because fixed size is always a bad idea.

Insertion is O(1) if entries can be predicted to be near
enough some place in the list, be that the beginning, the end, or some
marked places in the middle.

By the way, the current timer implementation only appears to be O(1) if
you ignore the overhead of having to do a check on every tick, and the
extra processing on table rollover. For random timer usage patterns,
that overhead adds up to O(log n), the same as a heap.

However for skewed usage patterns (likely in the kernel), the current
table method avoids the O(log n) sorting overhead because long-delay
timers are often removed before percolating down to the smallest tables.
It is possible to produce a general purpose heap which also has this
property.

-- Jamie

2001-04-16 02:15:38

by Ben Greear

[permalink] [raw]
Subject: Re: No 100 HZ timer!

george anzinger wrote:
>
> Horst von Brand wrote:
> >
> > Ben Greear <[email protected]> said:
> >
> > [...]
> >
> > > Wouldn't a heap be a good data structure for a list of timers? Insertion
> > > is log(n) and finding the one with the least time is O(1), ie pop off the
> > > front.... It can be implemented in an array which should help cache
> > > coherency and all those other things they talked about in school :)
> >
> > Insertion and deleting the first are both O(log N). Plus the array is fixed
> > size (bad idea) and the jumping around in the array thrashes the caches.

Jumping around an array can't be much worse than jumping around
a linked list, can it?

It does not have to be fixed length, though it wouldn't be log(n) to
grow the array, it can still be done...and once you reach maximal
size, you won't be growing it any more...

I had forgotten about the log(n) to delete, though log(n) is
still pretty good. As others have suggested, it might be good
to have a linked list for very-soon-to-expire timers. However,
they would have to be few enough that your linear insert was
not worse than a log(n) instert and a log(n) delete...

> > --
> And your solution is?


>
> George

--
Ben Greear ([email protected]) http://www.candelatech.com
Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL)
http://scry.wanfear.com http://scry.wanfear.com/~greear

2001-04-16 02:36:24

by Ben Greear

[permalink] [raw]
Subject: Re: No 100 HZ timer!

Jamie Lokier wrote:
>
> george anzinger wrote:
> > Horst von Brand wrote:
> > >
> > > Ben Greear <[email protected]> said:
> > >
> > > [...]
> > >
> > > > Wouldn't a heap be a good data structure for a list of timers? Insertion
> > > > is log(n) and finding the one with the least time is O(1), ie pop off the
> > > > front.... It can be implemented in an array which should help cache
> > > > coherency and all those other things they talked about in school :)
> > >
> > > Insertion and deleting the first are both O(log N). Plus the array is fixed
> > > size (bad idea) and the jumping around in the array thrashes the caches.
> > > --
> > And your solution is?
>
> Note that jumping around the array thrashes no more cache than
> traversing a tree (it touches the same number of elements). I prefer
> heap-ordered trees though because fixed size is always a bad idea.

With a tree, you will be allocating and de-allocating for every
insert/delete right? That seems like a reasonable performance
hit that an array-based approach might not have...

On cache-coherency issues, wouldn't it be more likely to have a cache hit when you are
accessing one contigious (ie the array) piece of memory? A 4-k page
will hold a lot of indexes!!

To get around the fixed size thing..could have
the array grow itself when needed (and probably never shrink again).
This would suck if you did it often, but I'm assuming that it would
quickly grow to needed size and then stabalize...

>
> Insertion is O(1) if entries can be predicted to be near
> enough some place in the list, be that the beginning, the end, or some
> marked places in the middle.
>
> By the way, the current timer implementation only appears to be O(1) if
> you ignore the overhead of having to do a check on every tick, and the
> extra processing on table rollover. For random timer usage patterns,
> that overhead adds up to O(log n), the same as a heap.
>
> However for skewed usage patterns (likely in the kernel), the current
> table method avoids the O(log n) sorting overhead because long-delay
> timers are often removed before percolating down to the smallest tables.
> It is possible to produce a general purpose heap which also has this
> property.
>
> -- Jamie
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
Ben Greear ([email protected]) http://www.candelatech.com
Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL)
http://scry.wanfear.com http://scry.wanfear.com/~greear

2001-04-16 02:47:08

by Jamie Lokier

[permalink] [raw]
Subject: Re: No 100 HZ timer!

Ben Greear wrote:
> > Note that jumping around the array thrashes no more cache than
> > traversing a tree (it touches the same number of elements). I prefer
> > heap-ordered trees though because fixed size is always a bad idea.
>
> With a tree, you will be allocating and de-allocating for every
> insert/delete right?

No, there is no memory allocation.

> On cache-coherency issues, wouldn't it be more likely to have a cache
> hit when you are accessing one contigious (ie the array) piece of
> memory? A 4-k page will hold a lot of indexes!!

No, because when traversing an array-heap, you don't access contiguous
entries. You might get one or two more cache hits near the root of the
heap.

> To get around the fixed size thing..could have
> the array grow itself when needed (and probably never shrink again).

You could to that, but then you'd have to deal with memory allocation
failures and memory deadlocks, making add_timer rather complicated.
It's not acceptable for add_timer to fail or require kmalloc().

-- Jamie

2001-04-16 12:53:12

by mbs

[permalink] [raw]
Subject: Re: No 100 HZ timer!

all this talk about which data structure to use and how to allocate memory is
waaaay premature.

there needs to be a clear definition of the requirements that we wish to meet,
including whether we are going to do ticked, tickless, or both

a func spec, for lack of a better term needs to be developed

then, when we go to design the thing, THEN is when we decide on the particular
flavor of list/tree/heap/array/dbase that we use.

let's engineer this thing instead of hacking it.



On Sun, 15 Apr 2001, Jamie Lokier wrote:
> Ben Greear wrote:
> > > Note that jumping around the array thrashes no more cache than
> > > traversing a tree (it touches the same number of elements). I prefer
> > > heap-ordered trees though because fixed size is always a bad idea.
> >
> > With a tree, you will be allocating and de-allocating for every
> > insert/delete right?
>
> No, there is no memory allocation.
>
> > On cache-coherency issues, wouldn't it be more likely to have a cache
> > hit when you are accessing one contigious (ie the array) piece of
> > memory? A 4-k page will hold a lot of indexes!!
>
> No, because when traversing an array-heap, you don't access contiguous
> entries. You might get one or two more cache hits near the root of the
> heap.
>
> > To get around the fixed size thing..could have
> > the array grow itself when needed (and probably never shrink again).
>
> You could to that, but then you'd have to deal with memory allocation
> failures and memory deadlocks, making add_timer rather complicated.
> It's not acceptable for add_timer to fail or require kmalloc().
>
> -- Jamie
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
--
/*------------------------------------------------**
** Mark Salisbury | Mercury Computer Systems **
** [email protected] | System OS - Kernel Team **
**------------------------------------------------**
** I will be riding in the Multiple Sclerosis **
** Great Mass Getaway, a 150 mile bike ride from **
** Boston to Provincetown. Last year I raised **
** over $1200. This year I would like to beat **
** that. If you would like to contribute, **
** please contact me. **
**------------------------------------------------*/

2001-04-16 19:21:57

by George Anzinger

[permalink] [raw]
Subject: Re: No 100 HZ timer!

Functional Specification for the high-res-timers project.

http://sourceforge.net/projects/high-res-timers

We are developing code to implement the POSIX clocks & timers as defined
by IEEE Std 1003.1b-1993 Section 14. (For an on line reference see our
home page: http://high-res-timers.sourceforge.net/ )

The API specifies the following functions (for details please see the spec):

int clock_settime(clockid_t clock_id, const struct timespec *tp);
int clock_gettime(clockid_t clock_id, struct timespec *tp);
int clock_getres(clockid_t clock_id, struct timespec *res);

int timer_creat(clockid_t clock_id, struct sigevent *evp,
timer_t *timerid);
int timer_delete(timer_t *timerid);

int timer_settime(timer_t *timerid, int flags,
const struct itimerspec *value,
struct itimerspec *ovalue);
int timer_gettime(timer_t *timerid, struct itimerspec *value);
int timer_getoverrun(timer_t *timerid);

int nanosleep( const struct timesped *rqtp, struct timespec *rmtp);

In addition we expect that we will provide a high resolution timer for
kernel use (heck, we may provide several).

In all this code we will code to allow resolutions to 1 nanosecond second (the
max provided by the timespec structure). The actual resolution of
any given clock will be fixed at compile time and the code will do its
work at a higher resolution to avoid round off errors as much as
possible.

We will provide several "clocks" as defined by the standard. In
particular, the following capabilities will be attached to some clock,
regardless of the actual clock "name" we end up using:

CLOCK_10MS a wall clock supporting timers with 10 ms resolution (same as
linux today).

CLOCK_HIGH_RES a wall clock supporting timers with the highest
resolution the hardware supports.

CLOCK_1US a wall clock supporting timers with 1 micro second resolution
(if the hardware allows it).

CLOCK_UPTIME a clock that give the system up time. (Does this clock
need to support timers?)

CLOCK_REALTIME a wall clock supporting timers with 1/HZ resolution.

At the same time we will NOT support the following clocks:

CLOCK_VIRTUAL a clock measuring the elapsed execution time (real or
wall) of a given task.

CLOCK_PROFILE a clock used to generate profiling events.

CLOCK_??? any clock keyed to a task.

(Note that this does not mean that the clock system will not support the
virtual and profile clocks, but that they will not be accessible thru
the POSIX timer interface.)

It would be NICE if we could provide a way to hook other time support
into the system. In particular a

CLOCK_WWV or CLOCK_GPS

might be nice. The problem with these sorts of clocks is that they
imply an array of function pointers for each clock and function pointers
slow the code down because of their non predictability. Never the less,
we will attempt to allow easy expansion in this direction.

Implications on the current kernel:

The high resolution timers will require a fast clock access with the
maximum supported resolution in order to convert relative times to
absolute times. This same fast clock will be used to support the
various user and system time requests.

There are two ways to provide timers to the kernel. For lack of a
better term we will refer to them as "ticked" and "tick less". Until we
have performance information that implies that one or the other of these
methods is better in all cases we will provide both ticked and tick less
systems. The variety to be used will be selected at configure time.

For tick less systems we will need to provide code to collect execution
times. For the ticked system the current method of collection these
times will be used. This project will NOT attempt to improve the
resolution of these timers, however, the high speed, high resolution
access to the current time will allow others to augment the system in
this area.

For the tick less system the project will also provide a time slice
expiration interrupt.

The timer list(s) (all pending timers) need to be organized so that the
following operations are fast:

a.) list insertion of an arbitrary timer,
b.) removal of canceled and expired timers, and
c.) finding the timer for "NOW" and its immediate followers.

Times in the timer list will be absolute and related to system up time.
These times will be converted to wall time as needed.

The POSIX interface provides for "absolute" timers relative to a given
clock. When these timers are related to a "wall" clock they will need
adjusting when the wall clock time is adjusted. These adjustments are
done for "leap seconds" and the date command. (Note, we are keeping
timers relative to "uptime" which can not be adjusted. This means that
relative timers and absolute timers related to CLOCK_UPTIME are not
affected by the above wall time adjustments.)

In either a ticked or tick less system, it is expected that resolutions
higher than 1/HZ will come with some additional overhead. For this
reason, the CLOCK resolution will be used to round up times for each
timer. When the CLOCK provides 1/HZ (or coarser) resolution, the
project will attempt to meet or exceed the current systems timer
performance.

Safe guards:

Given a system speed, there is a repeating timer rate which will consume
100% of the system in handling the timer interrupts. An attempt will
be made to detect this rate and adjust the timer to prevent system
lockup. This adjustment will look like timer overruns to the user
(i.e. we will take a percent of the interrupts and record the untaken
interrupts as overruns).

What the project will NOT do:

This project will NOT provide higher resolution accounting (i.e. user
and system execution times).

The project will NOT provide higher resolution VIRTUAL or PROFILE timers.




Attachments:
high-res-timers-fun-spec.txt (5.65 kB)

2001-04-16 20:48:14

by Albert D. Cahalan

[permalink] [raw]
Subject: Re: No 100 HZ timer!

> CLOCK_10MS a wall clock supporting timers with 10 ms resolution (same as
> linux today).

Except on the Alpha, and on some ARM systems, etc.
The HZ constant varies from 10 to 1200.

> At the same time we will NOT support the following clocks:
>
> CLOCK_VIRTUAL a clock measuring the elapsed execution time (real or
> wall) of a given task.
...
> For tick less systems we will need to provide code to collect execution
> times. For the ticked system the current method of collection these
> times will be used. This project will NOT attempt to improve the
> resolution of these timers, however, the high speed, high resolution
> access to the current time will allow others to augment the system in
> this area.
...
> This project will NOT provide higher resolution accounting (i.e. user
> and system execution times).

It is nice to have accurate per-process user/system accounting.
Since you'd be touching the code anyway...

> The POSIX interface provides for "absolute" timers relative to a given
> clock. When these timers are related to a "wall" clock they will need
> adjusting when the wall clock time is adjusted. These adjustments are
> done for "leap seconds" and the date command.

This is a BIG can of worms. You have UTC, TAI, GMT, and a loosely
defined POSIX time that is none of the above. This is a horrid mess,
even ignoring gravity and speed. :-)

Can a second be 2 billion nanoseconds?
Can a nanosecond be twice as long as normal?
Can a second appear twice, with the nanoseconds getting reset?
Can a second never appear at all?
Can you compute times more than 6 months into the future?
How far does time deviate from solar time? Is this constrained?

If you deal with leap seconds, you have to have a table of them.
This table grows with time, with adjustments being made with only
about 6 months notice. So the user upgrades after a year or two,
and the installer discovers that the user has been running a
system that is unaware of the most recent leap second. Arrrgh.

Sure you want to touch this? The Austin group argued over it for
a very long time and never did find a really good solution.
Maybe you should just keep the code simple and fast, without any
concern for clock adjustments.

> In either a ticked or tick less system, it is expected that resolutions
> higher than 1/HZ will come with some additional overhead. For this
> reason, the CLOCK resolution will be used to round up times for each
> timer. When the CLOCK provides 1/HZ (or coarser) resolution, the
> project will attempt to meet or exceed the current systems timer
> performance.

Within the kernel at least, it would be good to let drivers specify
desired resolution. Then a near-by value could be selected, perhaps
with some consideration for event type. (for cache reasons)

2001-04-16 21:30:09

by Chris Wedgwood

[permalink] [raw]
Subject: Re: No 100 HZ timer!

On Mon, Apr 16, 2001 at 04:45:39PM -0400, Albert D. Cahalan wrote:

> CLOCK_10MS a wall clock supporting timers with 10 ms resolution (same as
> linux today).

Except on the Alpha, and on some ARM systems, etc.
The HZ constant varies from 10 to 1200.

Or some people use higher values on x86:

cw@charon(cw)$ uptime
14:13:02 up 11 days, 21:44, 2 users, load average: 0.00, 0.00, 0.00
cw@charon(cw)$ cat /proc/interrupts
CPU0
0: 2106736381 XT-PIC timer
1: 187633 XT-PIC keyboard
2: 0 XT-PIC cascade
7: 94128650 XT-PIC eth0, usb-uhci, usb-uhci
10: 18265929 XT-PIC ide2
12: 3107415 XT-PIC PS/2 Mouse
14: 753375 XT-PIC via82cxxx
15: 2 XT-PIC ide1
NMI: 0
ERR: 0



thats 2048 ticks/sec -- I had to hack the kernel in a couple of
places to make things happy with that, but so far it seems to work.

I also had to hack libproc.so.* -- what a terribly obscure way that
it uses to determine what HZ is defined as!

I know this has ben beaten to death before but can someone tell me:

- why we cannot export HZ somewhere

- why is it called HZ, it seems misleading why not something like
TICKS_PER_SECOND or similar

- it seems there are a couple of places that assume HZ==100 still,
or am I just imagining this (ide code, I forget where now)




--cw

2001-04-16 22:28:19

by George Anzinger

[permalink] [raw]
Subject: Re: No 100 HZ timer!

"Albert D. Cahalan" wrote:
>
> > CLOCK_10MS a wall clock supporting timers with 10 ms resolution (same as
> > linux today).
>
> Except on the Alpha, and on some ARM systems, etc.
> The HZ constant varies from 10 to 1200.

I suspect we will want to use 10 ms resolution for a clock named
CLOCK_10MS :)
On the other hand we could have a CLOCK_1_OVER_HZ... Actually with
high-res-timers the actual HZ value becomes a bit less important. It
would be "nice" to keep 1/HZ == jiffie, however.
>
> > At the same time we will NOT support the following clocks:
> >
> > CLOCK_VIRTUAL a clock measuring the elapsed execution time (real or
> > wall) of a given task.
> ...
> > For tick less systems we will need to provide code to collect execution
> > times. For the ticked system the current method of collection these
> > times will be used. This project will NOT attempt to improve the
> > resolution of these timers, however, the high speed, high resolution
> > access to the current time will allow others to augment the system in
> > this area.
> ...
> > This project will NOT provide higher resolution accounting (i.e. user
> > and system execution times).
>
> It is nice to have accurate per-process user/system accounting.
> Since you'd be touching the code anyway...

Yeah sure... and will you pick up the ball on all the platform dependent
code to get high-res-timers on all the other platforms? On second
thought I am reminded of the corollary to the old saw: "The squeaking
wheel get the grease." Which is: "He who complains most about the
squeaking gets to do the greasing." Hint hint.
>
> > The POSIX interface provides for "absolute" timers relative to a given
> > clock. When these timers are related to a "wall" clock they will need
> > adjusting when the wall clock time is adjusted. These adjustments are
> > done for "leap seconds" and the date command.
>
> This is a BIG can of worms. You have UTC, TAI, GMT, and a loosely
> defined POSIX time that is none of the above. This is a horrid mess,
> even ignoring gravity and speed. :-)
>
> Can a second be 2 billion nanoseconds?
> Can a nanosecond be twice as long as normal?
> Can a second appear twice, with the nanoseconds getting reset?
> Can a second never appear at all?
> Can you compute times more than 6 months into the future?
> How far does time deviate from solar time? Is this constrained?
>
> If you deal with leap seconds, you have to have a table of them.
> This table grows with time, with adjustments being made with only
> about 6 months notice. So the user upgrades after a year or two,
> and the installer discovers that the user has been running a
> system that is unaware of the most recent leap second. Arrrgh.
>
> Sure you want to touch this? The Austin group argued over it for
> a very long time and never did find a really good solution.
> Maybe you should just keep the code simple and fast, without any
> concern for clock adjustments.

There is a relatively simple way to handle this, at least from the
high-res-timers point of view. First we convert all timers to absolute
uptime. This is a nice well behaved time. At boot time we peg the wall
clock to the uptime. Then at any given time, wall time is boot wall
time + uptime. Then date, leap seconds, etc. affect the pegged value of
boot wall time. Using the POSIX CLOCK id we allow the user to ask for
either version of time. Now if we define an array of struc clock_id
which contains pointers to such things as functions to return time, any
algorithm you might want can be plugged in to bend time as needed. The
only fly in this mess is the NTP rate adjustment stuff. This code is
supposed to allow the system to adjust its ticker to produce accurate
seconds and so gets at the root of the uptime counter be it in hardware
or software or some combination of the two. But then that's what makes
life interesting :)
>
> > In either a ticked or tick less system, it is expected that resolutions
> > higher than 1/HZ will come with some additional overhead. For this
> > reason, the CLOCK resolution will be used to round up times for each
> > timer. When the CLOCK provides 1/HZ (or coarser) resolution, the
> > project will attempt to meet or exceed the current systems timer
> > performance.
>
> Within the kernel at least, it would be good to let drivers specify
> desired resolution. Then a near-by value could be selected, perhaps
> with some consideration for event type. (for cache reasons)

This could be done, however, I would prefer to provide CLOCK_s to do
this as per the standard. What does the community say? In either case
you get different resolutions, but in the latter case the possible
values are fixed at least at configure time.

George

2001-04-16 23:52:10

by Mark Salisbury

[permalink] [raw]
Subject: Re: No 100 HZ timer!


> Given a system speed, there is a repeating timer rate which will consume
> 100% of the system in handling the timer interrupts. An attempt will
> be made to detect this rate and adjust the timer to prevent system
> lockup. This adjustment will look like timer overruns to the user
> (i.e. we will take a percent of the interrupts and record the untaken
> interrupts as overruns)

just at first blush, there are some things in general but I need to read
this again and more closely....

but, with POSIX timers, there is a nifty little restriction/protection built
into the spec regarding the re-insertion of short interval repeating timers.
that is: a repeating timer will not be re-inserted until AFTER the
associated signal handler has been handled.

this has some interesting consequences for signal handling and signal
delivery implementations, but importantly, it ensures that even a flood of
POSIX timers with very short repeat intervals will be handled cleanly.

I will get more detailed comments to you tomorrow.

Mark Salisbury

2001-04-17 00:48:51

by George Anzinger

[permalink] [raw]
Subject: Re: No 100 HZ timer!

Mark Salisbury wrote:
>
> > Given a system speed, there is a repeating timer rate which will consume
> > 100% of the system in handling the timer interrupts. An attempt will
> > be made to detect this rate and adjust the timer to prevent system
> > lockup. This adjustment will look like timer overruns to the user
> > (i.e. we will take a percent of the interrupts and record the untaken
> > interrupts as overruns)
>
> just at first blush, there are some things in general but I need to read
> this again and more closely....
>
> but, with POSIX timers, there is a nifty little restriction/protection built
> into the spec regarding the re-insertion of short interval repeating timers.
> that is: a repeating timer will not be re-inserted until AFTER the
> associated signal handler has been handled.

Actually what it says is: "Only a single signal shall be queued to the
process for a given timer at any point in time. When a timer for which
a signal is still pending expires, no signal shall be queued, and a
timer overrun shall occur."

It then goes on to talk about the overrun count and how it is to be
managed.

What I am suggesting is that the system should detect when these
interrupts would come so fast as to stall the system and just set up a
percent of them while bumping the overrun count as if they had all
occured.

George

>
> this has some interesting consequences for signal handling and signal
> delivery implementations, but importantly, it ensures that even a flood of
> POSIX timers with very short repeat intervals will be handled cleanly.
>
> I will get more detailed comments to you tomorrow.
>
> Mark Salisbury
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2001-04-17 12:24:15

by mbs

[permalink] [raw]
Subject: Re: No 100 HZ timer!

On Mon, 16 Apr 2001, you wrote:
> Mark Salisbury wrote:
> >
> > > Given a system speed, there is a repeating timer rate which will consume
> > > 100% of the system in handling the timer interrupts. An attempt will
> > > be made to detect this rate and adjust the timer to prevent system
> > > lockup. This adjustment will look like timer overruns to the user
> > > (i.e. we will take a percent of the interrupts and record the untaken
> > > interrupts as overruns)
> >
> > just at first blush, there are some things in general but I need to read
> > this again and more closely....
> >
> > but, with POSIX timers, there is a nifty little restriction/protection built
> > into the spec regarding the re-insertion of short interval repeating timers.
> > that is: a repeating timer will not be re-inserted until AFTER the
> > associated signal handler has been handled.
>
> Actually what it says is: "Only a single signal shall be queued to the
> process for a given timer at any point in time. When a timer for which
> a signal is still pending expires, no signal shall be queued, and a
> timer overrun shall occur."
>
> It then goes on to talk about the overrun count and how it is to be
> managed.
>
I guess I was confusing what the spec said with the way in which I chose to
comply with the spec. I calculate overruns (I know when a timer went off, and
how many of its intervals have passed since it went off by by
time_since_expire/interval) and don't reinsert until return from the signal
handler.

this prevents a flood of high frequency interrupts inherently and allows
processing of the signal handlers to proceed cleanly.

--
/*------------------------------------------------**
** Mark Salisbury | Mercury Computer Systems **
** [email protected] | System OS - Kernel Team **
**------------------------------------------------**
** I will be riding in the Multiple Sclerosis **
** Great Mass Getaway, a 150 mile bike ride from **
** Boston to Provincetown. Last year I raised **
** over $1200. This year I would like to beat **
** that. If you would like to contribute, **
** please contact me. **
**------------------------------------------------*/

2001-04-17 14:32:46

by mbs

[permalink] [raw]
Subject: Re: No 100 HZ timer!

> Functional Specification for the high-res-timers project.
>
> In addition we expect that we will provide a high resolution timer for
> kernel use (heck, we may provide several).

what we do here determines what we can do for the user..

> We will provide several "clocks" as defined by the standard. In
> particular, the following capabilities will be attached to some clock,
> regardless of the actual clock "name" we end up using:
>
> CLOCK_10MS a wall clock supporting timers with 10 ms resolution (same as
> linux today).
>
> CLOCK_HIGH_RES a wall clock supporting timers with the highest
> resolution the hardware supports.
>
> CLOCK_1US a wall clock supporting timers with 1 micro second resolution
> (if the hardware allows it).
>
> CLOCK_UPTIME a clock that give the system up time. (Does this clock
> need to support timers?)
>
> CLOCK_REALTIME a wall clock supporting timers with 1/HZ resolution.
>

Too many clocks. we should have CLOCK_REALTIME and CLOCK_UPTIME for sure, but
the others are just fluff. we should have 1 single clock mechanism for the
whole system with it's resolution and tick/tickless characteristics determined
at compile time.

also CLOCK_UPTIME should be the "true" clock, with CLOCK_REALTIME just a
convenience/compliance offset.



> At the same time we will NOT support the following clocks:
>
> CLOCK_VIRTUAL a clock measuring the elapsed execution time (real or
> wall) of a given task.
>
> CLOCK_PROFILE a clock used to generate profiling events.
>
> CLOCK_??? any clock keyed to a task.

we could do some KOOL things here but they are more related to process time
accounting and should be dealt with in that context and as a separate project.

however our design should take these concepts into account and allow for easy
integration of these types of functionality.

>
> (Note that this does not mean that the clock system will not support the
> virtual and profile clocks, but that they will not be accessible thru
> the POSIX timer interface.)

I think that should sombody choose to implement them, they should be available
at least through the clock_xxx interfaces..

>
> It would be NICE if we could provide a way to hook other time support
> into the system. In particular a
>
> CLOCK_WWV or CLOCK_GPS
>

CLOCK_NNTP

> might be nice. The problem with these sorts of clocks is that they
> imply an array of function pointers for each clock and function pointers
> slow the code down because of their non predictability. Never the less,
> we will attempt to allow easy expansion in this direction.
>
> Implications on the current kernel:
>
> The high resolution timers will require a fast clock access with the
> maximum supported resolution in order to convert relative times to
> absolute times. This same fast clock will be used to support the
> various user and system time requests.
>
> There are two ways to provide timers to the kernel. For lack of a
> better term we will refer to them as "ticked" and "tick less". Until we
> have performance information that implies that one or the other of these
> methods is better in all cases we will provide both ticked and tick less
> systems. The variety to be used will be selected at configure time.
>
> For tick less systems we will need to provide code to collect execution
> times. For the ticked system the current method of collection these
> times will be used. This project will NOT attempt to improve the
> resolution of these timers, however, the high speed, high resolution
> access to the current time will allow others to augment the system in
> this area.

huh? why not?

>
> For the tick less system the project will also provide a time slice
> expiration interrupt.
>
> The timer list(s) (all pending timers) need to be organized so that the
> following operations are fast:
>
> a.) list insertion of an arbitrary timer,
should be O(log(n)) at worst

> b.) removal of canceled and expired timers, and
easy to make O(1)

> c.) finding the timer for "NOW" and its immediate followers.
head of list or top of tree or top of heap or whatever, O(1)

> Times in the timer list will be absolute and related to system up time.
> These times will be converted to wall time as needed.

and ONLY when needed by users


>
> The POSIX interface provides for "absolute" timers relative to a given

do you mean "relative" timers?

> clock. When these timers are related to a "wall" clock they will need
> adjusting when the wall clock time is adjusted. These adjustments are
> done for "leap seconds" and the date command. (Note, we are keeping
> timers relative to "uptime" which can not be adjusted. This means that
> relative timers and absolute timers related to CLOCK_UPTIME are not
> affected by the above wall time adjustments.)

absolute timers will automatically fall out when you adjust CLOCK_UPTIME...
i.e. an absolute time is an absolute time and since CLOCK_UPTIME is the
ultimate arbiter of what absolute time it is, adjusting CLOCK_UPTIME will cause
the absolute times in the timer list to be handled properly without modifying
them. (am I makeing myself clear? I will try to come up with a better
description of what I mean)

> In either a ticked or tick less system, it is expected that resolutions
> higher than 1/HZ will come with some additional overhead. For this
> reason, the CLOCK resolution will be used to round up times for each
> timer. When the CLOCK provides 1/HZ (or coarser) resolution, the
> project will attempt to meet or exceed the current systems timer
> performance.

ONLY in a ticked system.

>
> Safe guards:
>
> Given a system speed, there is a repeating timer rate which will consume
> 100% of the system in handling the timer interrupts. An attempt will
> be made to detect this rate and adjust the timer to prevent system
> lockup. This adjustment will look like timer overruns to the user
> (i.e. we will take a percent of the interrupts and record the untaken
> interrupts as overruns).

see my earlier e-mail, although it is a simple matter to enforce a minimum
allowable interval by parameter checking.


>
> What the project will NOT do:
>
> This project will NOT provide higher resolution accounting (i.e. user
> and system execution times).
>
> The project will NOT provide higher resolution VIRTUAL or PROFILE timers.

as I said, there are some kool things we could do here, and we should design in
allowances for future upgrades which implement these things and let it get done
as a followon.



--
/*------------------------------------------------**
** Mark Salisbury | Mercury Computer Systems **
** [email protected] | System OS - Kernel Team **
**------------------------------------------------**
** I will be riding in the Multiple Sclerosis **
** Great Mass Getaway, a 150 mile bike ride from **
** Boston to Provincetown. Last year I raised **
** over $1200. This year I would like to beat **
** that. If you would like to contribute, **
** please contact me. **
**------------------------------------------------*/

2001-04-17 19:04:37

by George Anzinger

[permalink] [raw]
Subject: Re: No 100 HZ timer!

Mark Salisbury wrote:
>
> > Functional Specification for the high-res-timers project.
> >
> > In addition we expect that we will provide a high resolution timer for
> > kernel use (heck, we may provide several).
>
> what we do here determines what we can do for the user..

I was thinking that it might be good to remove the POSIX API for the
kernel and allow a somewhat simplified interface. For example, the user
gets to resolution by specifying a CLOCK, where we MAY want to allow the
kernel call to directly specify the resolution. This has already been
suggested. I suppose you could say the functional spec should define
the kernel PI (KPI?) as well as the user API, but that is a bit fuzzy at
this time as I think it will depend on how we actually code the user
functionality. Another example: I am leaning toward using a two word
uptime composed of jiffies (i.e. 1/HZ since boot) and a machine
dependent sub jiffie unit. Each ARCH would then define this unit as
well as the conversion routines to move it back and forth to
nanoseconds, microseconds, and 1/HZ. I think this format would play
well in task time accounting, as well as in the timer management.

For calls to something like delay(), however, it sucks. I think these
calls want a single word, possibly microsecond, time specification.
This gives a 16 or 17 minutes max and 1 microsecond min. which probably
covers 99.99% of all kernel delay needs.

Another kernel internal interface should allow the user specified
structures (timespec and timeval). The notion is to put all the
conversion routines in the timer code to maintain the specified
resolution, and (more importantly), to avoid conversion to a format that
just needs an additional conversion.

To summarize,

I think there is a need for two classes of timer interfaces in the
kernel:

a.) For drivers and others that need "delay()" sorts of things, and
b.) For system calls that handle user specified times.
>
> > We will provide several "clocks" as defined by the standard. In
> > particular, the following capabilities will be attached to some clock,
> > regardless of the actual clock "name" we end up using:
> >
> > CLOCK_10MS a wall clock supporting timers with 10 ms resolution (same as
> > linux today).
> >
> > CLOCK_HIGH_RES a wall clock supporting timers with the highest
> > resolution the hardware supports.
> >
> > CLOCK_1US a wall clock supporting timers with 1 micro second resolution
> > (if the hardware allows it).
> >
> > CLOCK_UPTIME a clock that give the system up time. (Does this clock
> > need to support timers?)
> >
> > CLOCK_REALTIME a wall clock supporting timers with 1/HZ resolution.
> >
>
> Too many clocks. we should have CLOCK_REALTIME and CLOCK_UPTIME for sure, but
> the others are just fluff. we should have 1 single clock mechanism for the
> whole system with it's resolution and tick/tickless characteristics determined
> at compile time.

I think you already have let the nose of the camel into the tent :)
Here is what I am thinking:

Suppose an array of structures of type clock. Clock_id is an index into
this array. Here is what is in the structure:

struct clock{
int resolution;
int *gettime();
int *settime();
int *convert_to_uptime();
int *convert_from_uptime();
<room for more bright ideas>;
};

Now the difference between CLOCK_UPTIME and CLOCK_REALTIME is surly in
the gettime/settime and possibly in the resolution. But the difference
between CLOCK_REALTIME and CLOCK_1US, CLOCK_HIGH_RES, CLOCK_10MS is JUST
the resolution! In other words, all they cost is the table entry. Note
that CLOCK_GMT, CLOCK_UST, and CLOCK_GPS, etc. all fit nicely into this
same structure.

We should also provide a way to "register" a new clock so the user can
easily configure in additional clocks. There are ways of doing this
that are really easy to use, e.g. the module_init() macro.
>
> also CLOCK_UPTIME should be the "true" clock, with CLOCK_REALTIME just a
> convenience/compliance offset.

If you mean by "true" that this clock can not be set, starts at 0 at
boot time and can only be affected by rate adjustments to get it to pass
a real second every second, I agree.
>
> > At the same time we will NOT support the following clocks:
> >
> > CLOCK_VIRTUAL a clock measuring the elapsed execution time (real or
> > wall) of a given task.
> >
> > CLOCK_PROFILE a clock used to generate profiling events.
> >
> > CLOCK_??? any clock keyed to a task.
>
> we could do some KOOL things here but they are more related to process time
> accounting and should be dealt with in that context and as a separate project.
>
> however our design should take these concepts into account and allow for easy
> integration of these types of functionality.

I agree.
>
> >
> > (Note that this does not mean that the clock system will not support the
> > virtual and profile clocks, but that they will not be accessible thru
> > the POSIX timer interface.)
>
> I think that should sombody choose to implement them, they should be available
> at least through the clock_xxx interfaces..

With care, this could be done. We might need to add an item to the
structure saying if the clock supports timers. If you want timers on
these clocks you would need a different way of clocking them (or so I
think, haven't thought about this in much detail).
>
> >
> > It would be NICE if we could provide a way to hook other time support
> > into the system. In particular a
> >
> > CLOCK_WWV or CLOCK_GPS
> >
>
> CLOCK_NNTP
>
> > might be nice. The problem with these sorts of clocks is that they
> > imply an array of function pointers for each clock and function pointers
> > slow the code down because of their non predictability. Never the less,
> > we will attempt to allow easy expansion in this direction.
> >
> > Implications on the current kernel:
> >
> > The high resolution timers will require a fast clock access with the
> > maximum supported resolution in order to convert relative times to
> > absolute times. This same fast clock will be used to support the
> > various user and system time requests.
> >
> > There are two ways to provide timers to the kernel. For lack of a
> > better term we will refer to them as "ticked" and "tick less". Until we
> > have performance information that implies that one or the other of these
> > methods is better in all cases we will provide both ticked and tick less
> > systems. The variety to be used will be selected at configure time.
> >
> > For tick less systems we will need to provide code to collect execution
> > times. For the ticked system the current method of collection these
> > times will be used. This project will NOT attempt to improve the
> > resolution of these timers, however, the high speed, high resolution
> > access to the current time will allow others to augment the system in
> > this area.
>
> huh? why not?

Well, first accounting (as I call this usage) is not part of the POSIX
API the project is working on. Second, it is in a real way
independent. That is to say, it can be done today with the current
clock system as well as any new one. And Third, more work is needed in
areas of the system we would not otherwise touch to do them, (wait()
for example).
>
> >
> > For the tick less system the project will also provide a time slice
> > expiration interrupt.
> >
> > The timer list(s) (all pending timers) need to be organized so that the
> > following operations are fast:
> >
> > a.) list insertion of an arbitrary timer,
> should be O(log(n)) at worst
>
> > b.) removal of canceled and expired timers, and
> easy to make O(1)

I thought this was true also, but the priority heap structure that has
been discussed here has a O(log(n)) removal time.
>
> > c.) finding the timer for "NOW" and its immediate followers.
> head of list or top of tree or top of heap or whatever, O(1)
>
> > Times in the timer list will be absolute and related to system up time.
> > These times will be converted to wall time as needed.
>
> and ONLY when needed by users
>
> >
> > The POSIX interface provides for "absolute" timers relative to a given
>
> do you mean "relative" timers?
No, the user specifies a wall time for the event.
>
> > clock. When these timers are related to a "wall" clock they will need
> > adjusting when the wall clock time is adjusted. These adjustments are
> > done for "leap seconds" and the date command. (Note, we are keeping
> > timers relative to "uptime" which can not be adjusted. This means that
> > relative timers and absolute timers related to CLOCK_UPTIME are not
> > affected by the above wall time adjustments.)
>
> absolute timers will automatically fall out when you adjust CLOCK_UPTIME...
> i.e. an absolute time is an absolute time and since CLOCK_UPTIME is the
> ultimate arbiter of what absolute time it is, adjusting CLOCK_UPTIME will cause
> the absolute times in the timer list to be handled properly without modifying
> them. (am I makeing myself clear? I will try to come up with a better
> description of what I mean)

I don't read the spec this way. In my reading, the user can specify
that a timer expire at some fixed wall time based on a given clock. If
that clock moves past that time, we had better expire that timer. It
doesn't matter if the clock moved because of STD to DST or a leap second
or the date command. Likewise, if we have converted that CLOCK time to
CLOCK_UPTIME (as we will when we put the timer in the list) we will need
to redo the calculation when any of the above events occur. Also, as
said earlier, CLOCK_UPTIME can only be set by booting the system. (By
the way, somewhere in the high-res-timer mailing list archive is a note
requesting (as I read it) CLOCK_UPTIME, a clock that never gives the
same time twice (within resolution/ cpu speed limits), never moves
backward, etc. See TIME_MONOTONIC at
http://www.cl.cam.ac.uk/~mgk25/c-time/
but the whole thing is a good read.)
>
> > In either a ticked or tick less system, it is expected that resolutions
> > higher than 1/HZ will come with some additional overhead. For this
> > reason, the CLOCK resolution will be used to round up times for each
> > timer. When the CLOCK provides 1/HZ (or coarser) resolution, the
> > project will attempt to meet or exceed the current systems timer
> > performance.
>
> ONLY in a ticked system.

I am not sure what you mean here. The resolution round up will tend to
group timers and thus, given the same timer expiration rate, cause fewer
interrupts. This is true in both systems, with the exception that in
ticked systems the improvement goes flat once the resolution becomes
less than or equal to the tick rate.
>
> >
> > Safe guards:
> >
> > Given a system speed, there is a repeating timer rate which will consume
> > 100% of the system in handling the timer interrupts. An attempt will
> > be made to detect this rate and adjust the timer to prevent system
> > lockup. This adjustment will look like timer overruns to the user
> > (i.e. we will take a percent of the interrupts and record the untaken
> > interrupts as overruns).
>
> see my earlier e-mail, although it is a simple matter to enforce a minimum
> allowable interval by parameter checking.
>
> >
> > What the project will NOT do:
> >
> > This project will NOT provide higher resolution accounting (i.e. user
> > and system execution times).
> >
> > The project will NOT provide higher resolution VIRTUAL or PROFILE timers.
>
> as I said, there are some kool things we could do here, and we should design in
> allowances for future upgrades which implement these things and let it get done
> as a followon.
>
Right. Remember Linus seems to like small patches.

George

2001-04-17 19:42:40

by Jamie Lokier

[permalink] [raw]
Subject: Re: No 100 HZ timer!

george anzinger wrote:
> > > a.) list insertion of an arbitrary timer,
> > should be O(log(n)) at worst
> >
> > > b.) removal of canceled and expired timers, and
> > easy to make O(1)
>
> I thought this was true also, but the priority heap structure that has
> been discussed here has a O(log(n)) removal time.

Several priority queue structures support removal in O(1) time.
Perhaps you are thinking of the classic array-based heap, for
which removal is O(log n) in the general case.

-- Jamie

2001-04-23 08:06:32

by Ulrich Windl

[permalink] [raw]
Subject: Re: No 100 HZ timer!

IMHO the POSIX is doable to comply with POSIX. Probably not what many
of the RT freaks expect, but doable. I'm tuning the nanoseconds for a
while now...

Ulrich

On 17 Apr 2001, at 11:53, george anzinger wrote:

> I was thinking that it might be good to remove the POSIX API for the
> kernel and allow a somewhat simplified interface. For example, the user

2001-04-23 13:32:26

by mbs

[permalink] [raw]
Subject: Re: No 100 HZ timer!

On Mon, 23 Apr 2001, Ulrich Windl wrote:
> IMHO the POSIX is doable to comply with POSIX. Probably not what many
> of the RT freaks expect, but doable. I'm tuning the nanoseconds for a
> while now...
>
> Ulrich

thanks for calling me a freak :)

yes, linux could have posix timers cobbled on today and comply with the POSIX
spec.

but, we would like to make it better, not just feature complete.

10ms timer resolutions, while completely in compliance w/ the posix spec, are
completely useless for "RT freaks"

remember that 95% of all computers in the world are embedded and of those most
nead at least "soft" RT behavior. seems like a good growth market for linux to
me.

when a PPC G4 is capable of 30 nanosecond resolution, why would you want to
settle for 10 millisecond (30 billionths of a second vs 10 thousandths for
those of you who haven't had to familiarize yourselves with sub-second time
units)

>
> On 17 Apr 2001, at 11:53, george anzinger wrote:
>
> > I was thinking that it might be good to remove the POSIX API for the
> > kernel and allow a somewhat simplified interface. For example, the user
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
--
/*------------------------------------------------**
** Mark Salisbury | Mercury Computer Systems **
** [email protected] | System OS - Kernel Team **
**------------------------------------------------**
** I will be riding in the Multiple Sclerosis **
** Great Mass Getaway, a 150 mile bike ride from **
** Boston to Provincetown. Last year I raised **
** over $1200. This year I would like to beat **
** that. If you would like to contribute, **
** please contact me. **
**------------------------------------------------*/