2001-04-09 15:57:47

by Martin Schwidefsky

[permalink] [raw]
Subject: No 100 HZ timer !




Hi,
seems like my first try with the complete patch hasn't made it through
to the mailing list. This is the second try with only the common part of
the patch. Here we go (again):
---

I have a suggestion that might seem unusual at first but it is important
for Linux on S/390. We are facing the problem that we want to start many
(> 1000) Linux images on a big S/390 machine. Every image has its own
100 HZ timer on every processor the images uses (normally 1). On a single
image system the processor use of the 100 HZ timer is not a big deal but
with > 1000 images you need a lot of processing power just to execute the
100 HZ timers. You quickly end up with 100% CPU only for the timer
interrupts of otherwise idle images. Therefore I had a go at the timer
stuff and now I have a system running without the 100 HZ timer. Unluckly
I need to make changes to common code and I want you opinion on it.

The first problem was how to get rid of the jiffies. The solution is
simple. I simply defined a macro that calculates the jiffies value from
the TOD clock:
#define jiffies ({ \
uint64_t __ticks; \
asm ("STCK %0" : "=m" (__ticks) ); \
__ticks = (__ticks - init_timer_cc) >> 12; \
do_div(__ticks, (1000000/HZ)); \
((unsigned long) __ticks); \
})
With this define you are independent of the jiffies variable which is no
longer needed so I ifdef'ed the definition. There are some places where a
local variable is named jiffies. You may not replace these so I renamed
them to _jiffies. A kernel compiled with only this change works as always.

The second problem is that you need to be able to find out when the next
timer event is due to happen. You'll find a new function "next_timer_event"
in the patch which traverses tv1-tv5 and returns the timer_list of the next
timer event. It is used in timer_bh to indicate to the backend when the
next interrupt should happen. This leads us to the notifier functions.
Each time a new timer is added, a timer is modified, or a timer expires
the architecture backend needs to reset its timeout value. That is what the
"timer_notify" callback is used for. The implementation on S/390 uses the
clock comparator and looks like this:
static void s390_timer_notify(unsigned long expires)
{
S390_lowcore.timer_event =
((__u64) expires*CLK_TICKS_PER_JIFFY) + init_timer_cc;
asm volatile ("SCKC %0" : : "m" (S390_lowcore.timer_event));
}
This causes an interrupt on the cpu which executed s390_timer_notify after
"expires" has passed. That means that timer events are spread over the cpus
in the system. Modified or deleted timer events do not cause a deletion
notification. A cpu might be errornously interrupted to early because of a
timer event that has been modified or deleted. But that doesn't do any
harm,
it is just unnecessary work.

There is a second callback "itimer_notify" that is used to get the per
process timers right. We use the cpu timer for this purpose:
void set_cpu_timer(void)
{
unsigned long min_ticks;
__u64 time_slice;
if (current->pid != 0 && current->need_resched == 0) {
min_ticks = current->counter;
if (current->it_prof_value != 0 &&
current->it_prof_value < min_ticks)
min_ticks = current->it_prof_value;
if (current->it_virt_value != 0 &&
current->it_virt_value < min_ticks)
min_ticks = current->it_virt_value;
time_slice = (__u64) min_ticks*CLK_TICKS_PER_JIFFY;
asm volatile ("spt %0" : : "m" (time_slice));
}
}
The cpu timer is a one shot timer that interrupts after the specified
amount
of time has passed. Not a 100% accurate because VM can schedule the virtual
processor before the "spt" has been done but good enough for per process
timers.

The remaining changes to common code parts deal with the problem that many
ticks may be accounted at once. For example without the 100 HZ timer it is
possible that a process runs for half a second in user space. With the next
interrupt all the ticks between the last update and the interrupt have to
be added to the tick counters. This is why update_wall_time and do_it_prof
have changed and update_process_times2 has been introduced.

That leaves three problems: 1) you need to check on every system entry if
a tick or more has passed and do the update if necessary, 2) you need to
keep track of the elapsed time in user space and in kernel space and 3) you
need to check tq_timer every time the system is left and setup a timer
event for the next timer tick if there is work to do on the timer queue.
These three problems are related and have to be implemented architecture
dependent. A nice thing we get for free is that the user/kernel elapsed
time
measurement gets much more accurate.

The number of interrupts in an idle system due to timer activity drops from
from 100 per second on every cpu to about 5-6 on all (!) cpus if this patch
is used. Exactly what we want to have.

All this new timer code is only used if the config option
CONFIG_NO_HZ_TIMER
is set. Without it everything works as always, especially for architectures
that will not use it.

Now what do you think?

(See attached file: timer_common)

blue skies,
Martin

Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH
Sch?naicherstr. 220, D-71032 B?blingen, Telefon: 49 - (0)7031 - 16-2247
E-Mail: [email protected]


Attachments:
(No filename) (5.42 kB)
timer_common (12.83 kB)
Download all attachments

2001-04-09 17:18:45

by Jeff Dike

[permalink] [raw]
Subject: Re: No 100 HZ timer !

[email protected] said:
> I have a suggestion that might seem unusual at first but it is
> important for Linux on S/390. We are facing the problem that we want
> to start many (> 1000) Linux images on a big S/390 machine. Every
> image has its own 100 HZ timer on every processor the images uses
> (normally 1). On a single image system the processor use of the 100 HZ
> timer is not a big deal but with > 1000 images you need a lot of
> processing power just to execute the 100 HZ timers. You quickly end up
> with 100% CPU only for the timer interrupts of otherwise idle images.

This is going to be a problem for UML as well, and I was considering something
very similar. I did a quick scan of your prose, and the description sounds
like what I had in mind.

So, count me in as a supporter of this.

A small request: Since S/390 is not the only port that needs this, I'd be
happy if it was made as generic as possible (and it may already be, I haven't
gone through the code yet).

Jeff


2001-04-09 18:29:46

by mbs

[permalink] [raw]
Subject: Re: No 100 HZ timer !

this is one of linux biggest weaknesses. the fixed interval timer is a
throwback. it should be replaced with a variable interval timer with interrupts
on demand for any system with a cpu sane/modern enough to have an on-chip
interrupting decrementer. (i.e just about any modern chip)

On Mon, 09 Apr 2001, Jeff Dike wrote:
> [email protected] said:
> > I have a suggestion that might seem unusual at first but it is
> > important for Linux on S/390. We are facing the problem that we want
> > to start many (> 1000) Linux images on a big S/390 machine. Every
> > image has its own 100 HZ timer on every processor the images uses
> > (normally 1). On a single image system the processor use of the 100 HZ
> > timer is not a big deal but with > 1000 images you need a lot of
> > processing power just to execute the 100 HZ timers. You quickly end up
> > with 100% CPU only for the timer interrupts of otherwise idle images.
>
> This is going to be a problem for UML as well, and I was considering something
> very similar. I did a quick scan of your prose, and the description sounds
> like what I had in mind.
>
> So, count me in as a supporter of this.
>
> A small request: Since S/390 is not the only port that needs this, I'd be
> happy if it was made as generic as possible (and it may already be, I haven't
> gone through the code yet).
>
> Jeff
>
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
--
/*------------------------------------------------**
** Mark Salisbury | Mercury Computer Systems **
** [email protected] | System OS - Kernel Team **
**------------------------------------------------**
** I will be riding in the Multiple Sclerosis **
** Great Mass Getaway, a 150 mile bike ride from **
** Boston to Provincetown. Last year I raised **
** over $1200. This year I would like to beat **
** that. If you would like to contribute, **
** please contact me. **
**------------------------------------------------*/

2001-04-09 20:11:21

by Alan

[permalink] [raw]
Subject: Re: No 100 HZ timer !

> this is one of linux biggest weaknesses. the fixed interval timer is a
> throwback. it should be replaced with a variable interval timer with interrupts
> on demand for any system with a cpu sane/modern enough to have an on-chip
> interrupting decrementer. (i.e just about any modern chip)

Its worth doing even on the ancient x86 boards with the PIT. It does require
some driver changes since


while(time_before(jiffies, we_explode))
poll_things();

no longer works

2001-04-09 20:46:46

by mbs

[permalink] [raw]
Subject: Re: No 100 HZ timer !

On Mon, 09 Apr 2001, Alan Cox wrote:
> > this is one of linux biggest weaknesses. the fixed interval timer is a
> > throwback. it should be replaced with a variable interval timer with interrupts
> > on demand for any system with a cpu sane/modern enough to have an on-chip
> > interrupting decrementer. (i.e just about any modern chip)
>
> Its worth doing even on the ancient x86 boards with the PIT. It does require
> some driver changes since
>
>
> while(time_before(jiffies, we_explode))
> poll_things();
>
> no longer works
>

It would be great if this could be one of the 2.5 goals/projects.

it would make all sorts of fun and useful timed event services easier to
implement and provide (potentially) microsecond resolution as opposed to the
current 10Ms.

plus, we would only get one "sysclock" timer interrupt per process quantum
instead of 10.



--
/*------------------------------------------------**
** Mark Salisbury | Mercury Computer Systems **
** [email protected] | System OS - Kernel Team **
**------------------------------------------------**
** I will be riding in the Multiple Sclerosis **
** Great Mass Getaway, a 150 mile bike ride from **
** Boston to Provincetown. Last year I raised **
** over $1200. This year I would like to beat **
** that. If you would like to contribute, **
** please contact me. **
**------------------------------------------------*/

2001-04-09 22:32:09

by Mikulas Patocka

[permalink] [raw]
Subject: Re: No 100 HZ timer !

> > this is one of linux biggest weaknesses. the fixed interval timer is a
> > throwback. it should be replaced with a variable interval timer with interrupts
> > on demand for any system with a cpu sane/modern enough to have an on-chip
> > interrupting decrementer. (i.e just about any modern chip)
>
> Its worth doing even on the ancient x86 boards with the PIT.

Note that programming the PIT is sloooooooow and doing it on every timer
add_timer/del_timer would be a pain.

Mikulas

2001-04-09 22:35:09

by Alan

[permalink] [raw]
Subject: Re: No 100 HZ timer !

> > Its worth doing even on the ancient x86 boards with the PIT.
>
> Note that programming the PIT is sloooooooow and doing it on every timer
> add_timer/del_timer would be a pain.

You only have to do it occasionally.

When you add a timer newer than the current one
(arguably newer by at least 1/2*HZ sec)
When you finish running the timers at an interval and the new interval is
significantly larger than the current one.

Remember each tick we poke the PIT anyway

2001-04-10 05:51:50

by Andi Kleen

[permalink] [raw]
Subject: Re: No 100 HZ timer !

On Mon, Apr 09, 2001 at 02:19:28PM -0400, Mark Salisbury wrote:
> this is one of linux biggest weaknesses. the fixed interval timer is a
> throwback. it should be replaced with a variable interval timer with interrupts
> on demand for any system with a cpu sane/modern enough to have an on-chip
> interrupting decrementer. (i.e just about any modern chip)

Just how would you do kernel/user CPU time accounting then ? It's currently done
on every timer tick, and doing it less often would make it useless.


-Andi

2001-04-10 07:32:17

by Martin Schwidefsky

[permalink] [raw]
Subject: Re: No 100 HZ timer !



>Its worth doing even on the ancient x86 boards with the PIT. It does
require
>some driver changes since
>
>
> while(time_before(jiffies, we_explode))
> poll_things();
>
>no longer works
On S/390 we have a big advantage here. Driver code of this kind does not
exist.
That makes it a lot easier for us compared to other architectures. As I
said in
the original posting, the patch I have is working fine for S/390.

blue skies,
Martin

Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH
Sch?naicherstr. 220, D-71032 B?blingen, Telefon: 49 - (0)7031 - 16-2247
E-Mail: [email protected]


2001-04-10 07:29:17

by Martin Schwidefsky

[permalink] [raw]
Subject: Re: No 100 HZ timer !



>Just how would you do kernel/user CPU time accounting then ? It's
currently done
>on every timer tick, and doing it less often would make it useless.
This part is architecture dependent. For S/390 I choose to do a "STCK" on
every
system entry/exit. Dunno if this can be done on other architectures too, on
S/390
this is reasonably cheap (one STCK costs 15 cycles). That means the
kernel/user CPU
time accounting is MUCH better now.

blue skies,
Martin

Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH
Sch?naicherstr. 220, D-71032 B?blingen, Telefon: 49 - (0)7031 - 16-2247
E-Mail: [email protected]


2001-04-10 09:33:40

by Martin Mares

[permalink] [raw]
Subject: Re: No 100 HZ timer !

Hi!

> Just how would you do kernel/user CPU time accounting then ? It's currently done
> on every timer tick, and doing it less often would make it useless.

Except for machines with very slow timers we really should account time
to processes during context switch instead of sampling on timer ticks.
The current values are in many situations (i.e., lots of processes
or a process frequently waiting for events bound to timer) a pile
of random numbers.

Have a nice fortnight
--
Martin `MJ' Mares <[email protected]> <[email protected]> http://atrey.karlin.mff.cuni.cz/~mj/
We all live in a yellow subroutine...

2001-04-10 10:01:20

by Albert D. Cahalan

[permalink] [raw]
Subject: Re: No 100 HZ timer !

Martin Mares writes:
> [lost]

>> Just how would you do kernel/user CPU time accounting then ?
>> It's currently done on every timer tick, and doing it less
>> often would make it useless.
>
> Except for machines with very slow timers we really should account time
> to processes during context switch instead of sampling on timer ticks.
> The current values are in many situations (i.e., lots of processes
> or a process frequently waiting for events bound to timer) a pile
> of random numbers.

Linux should maintain some sort of per-process decaying average.
This data is required for a Unix98-compliant ps program. (for %CPU)
Currently ps is using total CPU usage over the life of the process.

2001-04-10 11:18:55

by Alan

[permalink] [raw]
Subject: Re: No 100 HZ timer !

> > interrupting decrementer. (i.e just about any modern chip)
>
> Just how would you do kernel/user CPU time accounting then ? It's currently done
> on every timer tick, and doing it less often would make it useless.

On the contrary doing it less often but at the right time massively improves
its accuracy. You do it on reschedule. An rdtsc instruction is cheap and all
of a sudden you have nearly cycle accurate accounting

Alan

2001-04-10 11:41:15

by Martin Schwidefsky

[permalink] [raw]
Subject: Re: No 100 HZ timer !



>> Just how would you do kernel/user CPU time accounting then ? It's
currently done
>> on every timer tick, and doing it less often would make it useless.
>
>On the contrary doing it less often but at the right time massively
improves
>its accuracy. You do it on reschedule. An rdtsc instruction is cheap and
all
>of a sudden you have nearly cycle accurate accounting
If you do the accounting on reschedule, how do you find out how much time
has been spent in user versus kernel mode? Or do the Intel chips have two
counters, one for user space execution and one for the kernel?

blue skies,
Martin

Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH
Sch?naicherstr. 220, D-71032 B?blingen, Telefon: 49 - (0)7031 - 16-2247
E-Mail: [email protected]


2001-04-10 11:44:35

by David Schleef

[permalink] [raw]
Subject: Re: No 100 HZ timer !

On Mon, Apr 09, 2001 at 11:35:44PM +0100, Alan Cox wrote:
> > > Its worth doing even on the ancient x86 boards with the PIT.
> >
> > Note that programming the PIT is sloooooooow and doing it on every timer
> > add_timer/del_timer would be a pain.
>
> You only have to do it occasionally.
>
> When you add a timer newer than the current one
> (arguably newer by at least 1/2*HZ sec)
> When you finish running the timers at an interval and the new interval is
> significantly larger than the current one.
>
> Remember each tick we poke the PIT anyway

Reprogramming takes 3-4 times as long. However, I still agree
it's a good idea.

RTAI will run the 8254 timer in one-shot mode if you ask it to.
However, on machines without a monotonically increasing counter,
i.e., the TSC, you have to use 8254 timer 0 as both the timebase
and the interval counter -- you end up slowly losing time because
of the race condition between reading the timer and writing a
new interval. RTAI's solution is to disable kd_mksound and
use timer 2 as a poor man's TSC. If either of those is too big
of a price, it may suffice to report that the timer granularity
on 486's is 10 ms.

It would be nice to see any redesign in this area make it more
modular. I have hardware that would make it possible to slave
the Linux system clock directly off a high-accuracy timebase,
which would be super-useful for some applications. I've been
doing some of this already, both as a kernel patch and as part
of RTAI; search for 'timekeeper' in the LKML archives if interested.




dave...

2001-04-10 11:54:28

by Alan

[permalink] [raw]
Subject: Re: No 100 HZ timer !

> If you do the accounting on reschedule, how do you find out how much ti=
> me
> has been spent in user versus kernel mode? Or do the Intel chips have t=
> wo
> counters, one for user space execution and one for the kernel?

Unfortunately not so you'd need to do a little bit per syscall, at least for
non trivial syscalls.

2001-04-10 12:02:18

by Andi Kleen

[permalink] [raw]
Subject: Re: No 100 HZ timer !

On Tue, Apr 10, 2001 at 12:18:03PM +0100, Alan Cox wrote:
> > > interrupting decrementer. (i.e just about any modern chip)
> >
> > Just how would you do kernel/user CPU time accounting then ? It's currently done
> > on every timer tick, and doing it less often would make it useless.
>
> On the contrary doing it less often but at the right time massively improves
> its accuracy. You do it on reschedule. An rdtsc instruction is cheap and all
> of a sudden you have nearly cycle accurate accounting

Does not sound very attractive all at all on non virtual machines (I see the point on
UML/VM):
making system entry/context switch/interrupts slower, making add_timer slower, just to
process a few less timer interrupts. That's like robbing the fast paths for a slow path.

-Andi

2001-04-10 12:05:08

by Mikulas Patocka

[permalink] [raw]
Subject: Re: No 100 HZ timer !

> > > > Its worth doing even on the ancient x86 boards with the PIT.
> > >
> > > Note that programming the PIT is sloooooooow and doing it on every timer
> > > add_timer/del_timer would be a pain.
> >
> > You only have to do it occasionally.
> >
> > When you add a timer newer than the current one
> > (arguably newer by at least 1/2*HZ sec)
> > When you finish running the timers at an interval and the new interval is
> > significantly larger than the current one.
> >
> > Remember each tick we poke the PIT anyway
>
> Reprogramming takes 3-4 times as long. However, I still agree
> it's a good idea.

Adding and removing timers happens much more frequently than PIT tick, so
comparing these times is pointless.

If you have some device and timer protecting it from lockup on buggy
hardware, you actually

send request to device
add timer

receive interrupt and read reply
remove timer

With the curent timer semantics, the cost of add timer and del timer is
nearly zero. If you had to reprogram the PIT on each request and reply, it
would slow things down.

Note that you call mod_timer also on each packet received - and in worst
case (which may happen), you end up reprogramming the PIT on each packet.

Mikulas

2001-04-10 12:11:18

by Alan

[permalink] [raw]
Subject: Re: No 100 HZ timer !

> Does not sound very attractive all at all on non virtual machines (I see the point on
> UML/VM):
> making system entry/context switch/interrupts slower, making add_timer slower, just to
> process a few less timer interrupts. That's like robbing the fast paths for a slow path.

Measure the number of clocks executing a timer interrupt. rdtsc is fast. Now
consider the fact that out of this you get KHz or better scheduling
resolution required for games and midi. I'd say it looks good. I agree
the accounting of user/system time needs care to avoid slowing down syscall
paths

Alan


2001-04-10 12:17:59

by mbs

[permalink] [raw]
Subject: Re: No 100 HZ timer !

which kind of U/K accaounting are you referring to?

are you referring to global changes in world time? are you referring to time
used by a process?

I think the reduction of clock interrupts by a factor of 10 would buy us some
performance margin that could be traded for a slightly more complex handler.

On Tue, 10 Apr 2001, Andi Kleen wrote:
> On Mon, Apr 09, 2001 at 02:19:28PM -0400, Mark Salisbury wrote:
> > this is one of linux biggest weaknesses. the fixed interval timer is a
> > throwback. it should be replaced with a variable interval timer with interrupts
> > on demand for any system with a cpu sane/modern enough to have an on-chip
> > interrupting decrementer. (i.e just about any modern chip)
>
> Just how would you do kernel/user CPU time accounting then ? It's currently done
> on every timer tick, and doing it less often would make it useless.
>
>
> -Andi
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
--
/*------------------------------------------------**
** Mark Salisbury | Mercury Computer Systems **
** [email protected] | System OS - Kernel Team **
**------------------------------------------------**
** I will be riding in the Multiple Sclerosis **
** Great Mass Getaway, a 150 mile bike ride from **
** Boston to Provincetown. Last year I raised **
** over $1200. This year I would like to beat **
** that. If you would like to contribute, **
** please contact me. **
**------------------------------------------------*/

2001-04-10 12:21:39

by mbs

[permalink] [raw]
Subject: Re: No 100 HZ timer !

On Mon, 09 Apr 2001, Alan Cox wrote:
>
> Its worth doing even on the ancient x86 boards with the PIT. It does require
> some driver changes since
>
>
> while(time_before(jiffies, we_explode))
> poll_things();
>
> no longer works

jiffies could be replaced easily enough w/ a macro such as

#define jiffies (get_time_in_jiffies())

then driver code would not need to be touched.

--
/*------------------------------------------------**
** Mark Salisbury | Mercury Computer Systems **
** [email protected] | System OS - Kernel Team **
**------------------------------------------------**
** I will be riding in the Multiple Sclerosis **
** Great Mass Getaway, a 150 mile bike ride from **
** Boston to Provincetown. Last year I raised **
** over $1200. This year I would like to beat **
** that. If you would like to contribute, **
** please contact me. **
**------------------------------------------------*/

2001-04-10 12:25:59

by mbs

[permalink] [raw]
Subject: Re: No 100 HZ timer !

On Tue, 10 Apr 2001, Martin Mares wrote:
> Except for machines with very slow timers we really should account time
> to processes during context switch instead of sampling on timer ticks.
> The current values are in many situations (i.e., lots of processes
> or a process frequently waiting for events bound to timer) a pile
> of random numbers.

yup. however, there is a performance penalty even on fast machines for the
fine grained process time usage accounting, and it in the past there has been a
strong reluctance to add overhead to syscalls and other context switches.

It would probably be a good compile config option to allow fine or coarse
process time accounting, that leaves the choice to the person setting up the
system to make the choice based on their needs.


--
/*------------------------------------------------**
** Mark Salisbury | Mercury Computer Systems **
** [email protected] | System OS - Kernel Team **
**------------------------------------------------**
** I will be riding in the Multiple Sclerosis **
** Great Mass Getaway, a 150 mile bike ride from **
** Boston to Provincetown. Last year I raised **
** over $1200. This year I would like to beat **
** that. If you would like to contribute, **
** please contact me. **
**------------------------------------------------*/

2001-04-10 12:32:31

by David Schleef

[permalink] [raw]
Subject: Re: No 100 HZ timer !

On Tue, Apr 10, 2001 at 02:04:17PM +0200, Mikulas Patocka wrote:
>
> Adding and removing timers happens much more frequently than PIT tick, so
> comparing these times is pointless.
>
> If you have some device and timer protecting it from lockup on buggy
> hardware, you actually
>
> send request to device
> add timer
>
> receive interrupt and read reply
> remove timer
>
> With the curent timer semantics, the cost of add timer and del timer is
> nearly zero. If you had to reprogram the PIT on each request and reply, it
> would slow things down.
>
> Note that you call mod_timer also on each packet received - and in worst
> case (which may happen), you end up reprogramming the PIT on each packet.

This just indicates that the interface needs to be richer -- i.e.,
such as having a "lazy timer" that means: "wake me up when _at least_
N ns have elapsed, but there's no hurry." If waking you up at N ns
is expensive, then the wakeup is delayed until something else
interesting happens.

This is effectively what we have now anyway. Perhaps the
current add_timer() should be mapped to lazy timers.




dave...

2001-04-10 12:32:41

by Andi Kleen

[permalink] [raw]
Subject: Re: No 100 HZ timer !

On Tue, Apr 10, 2001 at 01:12:14PM +0100, Alan Cox wrote:
> Measure the number of clocks executing a timer interrupt. rdtsc is fast. Now
> consider the fact that out of this you get KHz or better scheduling
> resolution required for games and midi. I'd say it looks good. I agree

And measure the number of cycles a gigahertz CPU can do between a 1ms timer.
And then check how often the typical application executes something like
gettimeofday.

> the accounting of user/system time needs care to avoid slowing down syscall
> paths

It's also all interrupts, not only syscalls, and also context switch if you
want to be accurate.

On modern PC hardware it might be possible to do user/system accounting using
performance MSRs. They have a bit in the performance counter that allows to
only account user or system. If you find a count that is near equivalent to
the cycles you have both: total = rdtsc, user = msr, system = rdtsc-msr.
At least PPro derived have event 0x16, number of instructions executed, which
might be good enough when multiplied with a factor if your instruction mix is not
too unusual.

Still even with that the more complex checking in add_timer doesn't look too good.


-Andi

2001-04-10 12:34:52

by mbs

[permalink] [raw]
Subject: Re: No 100 HZ timer !

On Tue, 10 Apr 2001, David Schleef wrote:
> i.e., the TSC, you have to use 8254 timer 0 as both the timebase
> and the interval counter -- you end up slowly losing time because
> of the race condition between reading the timer and writing a
> new interval.

actually, I have an algorithm to fix that. (had to implement this on a system
with a single 32 bit decrementer (an ADI21060 SHARC, YUK!)) the algorithm
simulates a free spinning 64 bit incrementer given a 32 bit interrupting
decrementer under exclusive control of the timekeeping code. it also takes
into account the read/calculate/write interval.



> It would be nice to see any redesign in this area make it more
> modular. I have hardware that would make it possible to slave
> the Linux system clock directly off a high-accuracy timebase,
> which would be super-useful for some applications. I've been
> doing some of this already, both as a kernel patch and as part
> of RTAI; search for 'timekeeper' in the LKML archives if interested.
>
>
>
>
> dave...
--
/*------------------------------------------------**
** Mark Salisbury | Mercury Computer Systems **
** [email protected] | System OS - Kernel Team **
**------------------------------------------------**
** I will be riding in the Multiple Sclerosis **
** Great Mass Getaway, a 150 mile bike ride from **
** Boston to Provincetown. Last year I raised **
** over $1200. This year I would like to beat **
** that. If you would like to contribute, **
** please contact me. **
**------------------------------------------------*/

2001-04-10 12:37:35

by Alan

[permalink] [raw]
Subject: Re: No 100 HZ timer !

> It's also all interrupts, not only syscalls, and also context switch if you
> want to be accurate.

We dont need to be that accurate. Our sample rate is currently so low the
data is worthless anyway

2001-04-10 12:41:49

by Andi Kleen

[permalink] [raw]
Subject: Re: No 100 HZ timer !

On Tue, Apr 10, 2001 at 01:36:27PM +0100, Alan Cox wrote:
> > It's also all interrupts, not only syscalls, and also context switch if you
> > want to be accurate.
>
> We dont need to be that accurate. Our sample rate is currently so low the
> data is worthless anyway

Just without checking on context switch you would lose the information of per pid
system/user/total that is currently collected completely.

-Andi

2001-04-10 12:41:45

by mbs

[permalink] [raw]
Subject: Re: No 100 HZ timer !

On Tue, 10 Apr 2001, Alan Cox wrote:
> > Does not sound very attractive all at all on non virtual machines (I see the point on
> > UML/VM):
> > making system entry/context switch/interrupts slower, making add_timer slower, just to
> > process a few less timer interrupts. That's like robbing the fast paths for a slow path.
>
> Measure the number of clocks executing a timer interrupt. rdtsc is fast. Now
> consider the fact that out of this you get KHz or better scheduling
> resolution required for games and midi. I'd say it looks good. I agree
> the accounting of user/system time needs care to avoid slowing down syscall
> paths
>
> Alan

the system I implemented this in went from 25 Millisecond resolution to 25-60
Nanosecond resolution (depending on the CPU underneath). that is a theoretical
factor of 500,000 to 1,000,000 improvement in resolution for timed events, and
the clock overhead after the change was about the same. (+-10% depending on
underlying CPU)

this is on a system that only had one clock tick per process quantum, as
opposed to the 10 in linux.





--
/*------------------------------------------------**
** Mark Salisbury | Mercury Computer Systems **
** [email protected] | System OS - Kernel Team **
**------------------------------------------------**
** I will be riding in the Multiple Sclerosis **
** Great Mass Getaway, a 150 mile bike ride from **
** Boston to Provincetown. Last year I raised **
** over $1200. This year I would like to beat **
** that. If you would like to contribute, **
** please contact me. **
**------------------------------------------------*/

2001-04-10 12:46:12

by Andi Kleen

[permalink] [raw]
Subject: Re: No 100 HZ timer !

On Tue, Apr 10, 2001 at 08:07:04AM -0400, Mark Salisbury wrote:
> which kind of U/K accaounting are you referring to?
>
> are you referring to global changes in world time? are you referring to time
> used by a process?

The later.

>
> I think the reduction of clock interrupts by a factor of 10 would buy us some
> performance margin that could be traded for a slightly more complex handler.

It depends on your workload if you can trade that in. e.g. when a few hundred TCP
sockets are active a lot of timer ticks will run some timer handler. Also generally
the kernel has quite a lot of timers. There is some reduction on a idle system. That
is no doubt useful for VM/UML/VMware where you can have idle sessions hanging around,
but otherwise it's not very interesting to optimize idle systems (except maybe for
power saving purposes)


-Andi

2001-04-10 12:49:42

by mbs

[permalink] [raw]
Subject: Re: No 100 HZ timer !

On Tue, 10 Apr 2001, David Schleef wrote:
> This just indicates that the interface needs to be richer -- i.e.,
> such as having a "lazy timer" that means: "wake me up when _at least_
> N ns have elapsed, but there's no hurry." If waking you up at N ns
> is expensive, then the wakeup is delayed until something else
> interesting happens.

all POSIX timer and timer like functions (timer_xxx, nanosleep, alarm etc)
are defined to operate on a 'no earlier than' semantic. i.e. if you ask to
nanosleep for 500 nsec, you will sleep for no less than 500 nanoseconds, but
possibly up to 20,000,500 nanoseconds. this is because the maximum legal POSIX
time resolution is 20,000,000 nanoseconds (1/50th second or 50hz because of
european electricity and old hardware)

--
/*------------------------------------------------**
** Mark Salisbury | Mercury Computer Systems **
** [email protected] | System OS - Kernel Team **
**------------------------------------------------**
** I will be riding in the Multiple Sclerosis **
** Great Mass Getaway, a 150 mile bike ride from **
** Boston to Provincetown. Last year I raised **
** over $1200. This year I would like to beat **
** that. If you would like to contribute, **
** please contact me. **
**------------------------------------------------*/

2001-04-10 12:53:12

by mbs

[permalink] [raw]
Subject: Re: No 100 HZ timer !

On Tue, 10 Apr 2001, Andi Kleen wrote:
> .... Also generally the kernel has quite a lot of timers.

are these generaly of the long interval, failure timeout type?
i.e. 5 second device timeouts that are killed before they get executed because
the device didn't timeout? if so, they have no effect on the number of timer
interrupts because they generally never get hit. and when they do, you have to
pay the price anyway.

--
/*------------------------------------------------**
** Mark Salisbury | Mercury Computer Systems **
** [email protected] | System OS - Kernel Team **
**------------------------------------------------**
** I will be riding in the Multiple Sclerosis **
** Great Mass Getaway, a 150 mile bike ride from **
** Boston to Provincetown. Last year I raised **
** over $1200. This year I would like to beat **
** that. If you would like to contribute, **
** please contact me. **
**------------------------------------------------*/

2001-04-10 12:55:22

by Andi Kleen

[permalink] [raw]
Subject: Re: No 100 HZ timer !

On Tue, Apr 10, 2001 at 08:42:33AM -0400, Mark Salisbury wrote:
> On Tue, 10 Apr 2001, Andi Kleen wrote:
> > .... Also generally the kernel has quite a lot of timers.
>
> are these generaly of the long interval, failure timeout type?

A lot of them are, but not all.

-Andi

2001-04-10 12:57:02

by Martin Schwidefsky

[permalink] [raw]
Subject: Re: No 100 HZ timer !



>Does not sound very attractive all at all on non virtual machines (I see
the point on
>UML/VM):
>making system entry/context switch/interrupts slower, making add_timer
slower, just to
>process a few less timer interrupts. That's like robbing the fast paths
for a slow path.
The system entry/exit/context switch is slower. The add_timer/mod_timer is
only
a little bit slower in the case a new soonest timer event has been created.
I
think you can forget the additional overhead for add_timer/mod_timer, its
the
additional path length on the system entry/exit that might be problematic.

blue skies,
Martin

Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH
Sch?naicherstr. 220, D-71032 B?blingen, Telefon: 49 - (0)7031 - 16-2247
E-Mail: [email protected]


2001-04-10 14:11:26

by Mikulas Patocka

[permalink] [raw]
Subject: Re: No 100 HZ timer !

> > Note that you call mod_timer also on each packet received - and in worst
> > case (which may happen), you end up reprogramming the PIT on each packet.
>
> This just indicates that the interface needs to be richer -- i.e.,
> such as having a "lazy timer" that means: "wake me up when _at least_
> N ns have elapsed, but there's no hurry." If waking you up at N ns
> is expensive, then the wakeup is delayed until something else
> interesting happens.
>
> This is effectively what we have now anyway. Perhaps the
> current add_timer() should be mapped to lazy timers.

BTW. Why we need to redesign timers at all? The cost of timer interrupt
each 1/100 second is nearly zero (1000 instances on S/390 VM is not common
case - it is not reasonable to degradate performance of timers because of
this).

Timers more precise than 100HZ aren't probably needed - as MIN_RTO is 0.2s
and MIN_DELACK is 0.04s, TCP would hardly benefit from them.

Mikulas



2001-04-10 14:29:08

by mbs

[permalink] [raw]
Subject: Re: No 100 HZ timer !

Mikulas Patocka wrote:

> BTW. Why we need to redesign timers at all? The cost of timer interrupt
> each 1/100 second is nearly zero (1000 instances on S/390 VM is not common
> case - it is not reasonable to degradate performance of timers because of
> this).
>
> Timers more precise than 100HZ aren't probably needed - as MIN_RTO is 0.2s
> and MIN_DELACK is 0.04s, TCP would hardly benefit from them.
>

well, I can think dozens of real time applications off the top of my head that
need beter than 1ms timing resolution (think sensor fusion) 1000 clock
interrupts/sec is wasteful when what you need is 1 very precisely timed
interrupt.

why do we redesign anything? to make it better. TCP is not the only thing in
the system.


if you are in love with the existing system, it shouldn't be hard to make it a
config option.

2001-04-10 14:23:26

by Andi Kleen

[permalink] [raw]
Subject: Re: No 100 HZ timer !

On Tue, Apr 10, 2001 at 04:10:28PM +0200, Mikulas Patocka wrote:
> Timers more precise than 100HZ aren't probably needed - as MIN_RTO is 0.2s
> and MIN_DELACK is 0.04s, TCP would hardly benefit from them.

On networking bandwidth shaping and traffic control generally need higher precision timers.
There are also people who don't like the minimum 10ms delay for non-busy wait, it's
apparently a problem for some apps.

-Andi

2001-04-10 14:44:51

by Martin Schwidefsky

[permalink] [raw]
Subject: Re: No 100 HZ timer !



>BTW. Why we need to redesign timers at all? The cost of timer interrupt
>each 1/100 second is nearly zero (1000 instances on S/390 VM is not common
>case - it is not reasonable to degradate performance of timers because of
>this).
The cost of the timer interrupts on a single image system is neglectable,
true. As I already pointed out in the original proposal we are looking
for a solution that will allow us to minimize the costs of the timer
interrupts when we run many images. For us this case is not unusual and
it is reasonable to degrade performance of a running system by a very
small amount to get rid of the HZ timer. This proposal was never meant
to be the perfect solution for every platform, that is why it is
configuratable with the CONFIG_NO_HZ_TIMER option.

blue skies,
Martin

Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH
Sch?naicherstr. 220, D-71032 B?blingen, Telefon: 49 - (0)7031 - 16-2247
E-Mail: [email protected]


2001-04-10 15:43:51

by Alan

[permalink] [raw]
Subject: Re: No 100 HZ timer !

> Timers more precise than 100HZ aren't probably needed - as MIN_RTO is 0.2s
> and MIN_DELACK is 0.04s, TCP would hardly benefit from them.

There are a considerable number of people who really do need 1Khz resolution.
Midi is one of the example cases. That doesn't mean we have to go to a 1KHz
timer, we may want to combine a 100Hz timer with a smart switch to 1Khz

2001-04-10 17:15:58

by Jamie Lokier

[permalink] [raw]
Subject: Re: No 100 HZ timer !

Mikulas Patocka wrote:
> Timers more precise than 100HZ aren't probably needed - as MIN_RTO is 0.2s
> and MIN_DELACK is 0.04s, TCP would hardly benefit from them.

Indeed, using precise timers for TCP would probably degrade performance
-- you should process a group of timer events together, when resolution
is not that important.

There are plenty of apps that need higher resolution. Software modem
comes to mind (guess why ;), though the device driver supplies the high
resolution timed interrupts in that case.

Games would like to be able to page flip at vertical refresh time --
<1ms accuracy please. Network traffic shaping benefits from better than
1ms timing. Video players want to display their frames preferably
without 10ms jitter.

Even that old classic game "snake" benefits from decent timing. I
worked on an X multiplayer snake implementation which was very
unpleasant and jerky at first. 1. Disable nagle for X connection :-)
Better but still jerky. 2. Write delay loop like this:

calculate next_event_time
select (0, 0, 0, next_event_time - 20ms)
while (gettimeofday() < next_event_time)
/* Busy loop for last 20ms. */

It's no coincidence that I've had to write another very similar event
loop recently. You can see, this sort of thing is a real waste of CPU.

-- Jamie

2001-04-10 17:29:09

by Alan

[permalink] [raw]
Subject: Re: No 100 HZ timer !

> Games would like to be able to page flip at vertical refresh time --
> <1ms accuracy please. Network traffic shaping benefits from better than

This is an X issue. I was talking with Jim Gettys about what is needed to
get the relevant existing X extensions for this working

2001-04-10 17:35:49

by Jamie Lokier

[permalink] [raw]
Subject: Re: No 100 HZ timer !

Alan Cox wrote:
> > Games would like to be able to page flip at vertical refresh time --
> > <1ms accuracy please. Network traffic shaping benefits from better than
>
> This is an X issue. I was talking with Jim Gettys about what is needed to
> get the relevant existing X extensions for this working

Last time I looked at XF86DGA (years ago), it seemed to have the right
hooks in place. Just a matter of server implementation. My
recollection is dusty though.

-- Jamie

2001-04-10 17:49:42

by Victor Yodaiken

[permalink] [raw]
Subject: Re: No 100 HZ timer !

On Tue, Apr 10, 2001 at 04:43:36AM -0700, David Schleef wrote:
> However, on machines without a monotonically increasing counter,
> i.e., the TSC, you have to use 8254 timer 0 as both the timebase
> and the interval counter -- you end up slowly losing time because
> of the race condition between reading the timer and writing a
> new interval. RTAI's solution is to disable kd_mksound and
> use timer 2 as a poor man's TSC. If either of those is too big
> of a price, it may suffice to report that the timer granularity
> on 486's is 10 ms.

Just for the record, Michael Barabanov did this in RTLinux from before
kd_mksound was a function pointer in 1995. Michael had an optimization
attempt using channel 1 for a while, but on very slow machines this
was not sufficient and he went back to channel 2. Of course, the
fundamental problem is that board designers keep putting an 1920s
part in machines built in 2001.

Here's the comment from the RTLinux 0.5 patch -- all available on the archives
on rtlinux.com.

+/* The main procedure; resets the 8254 timer to generate an interrupt. The
+ * tricky part is to keep the global time while reprogramming it. We latch
+ * counters 0 and 2 atomically before and after reprogramming to figure it out.
+ */


--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
http://www.fsmlabs.com http://www.rtlinux.com

2001-04-10 18:17:35

by Alan

[permalink] [raw]
Subject: Re: No 100 HZ timer !

> > This is an X issue. I was talking with Jim Gettys about what is needed to
> > get the relevant existing X extensions for this working
>
> Last time I looked at XF86DGA (years ago), it seemed to have the right
> hooks in place. Just a matter of server implementation. My
> recollection is dusty though.

There is also a timing extension for synchronizing events to happenings. The
stopper is the kernel interface for the vblank handling since the irq must
be handled and cleared in kernel context before we return to X. We now think
we know how to handle that cleanly

Alan

2001-04-10 18:24:48

by Jamie Lokier

[permalink] [raw]
Subject: Re: No 100 HZ timer !

Alan Cox wrote:
> > > This is an X issue. I was talking with Jim Gettys about what is needed to
> > > get the relevant existing X extensions for this working
> >
> > Last time I looked at XF86DGA (years ago), it seemed to have the right
> > hooks in place. Just a matter of server implementation. My
> > recollection is dusty though.
>
> There is also a timing extension for synchronizing events to happenings. The
> stopper is the kernel interface for the vblank handling since the irq must
> be handled and cleared in kernel context before we return to X. We now think
> we know how to handle that cleanly

Except for cards which don't generate an irq on vsync but do let you see
how the raster is proceeding. (I vaguely recall these). For which a
PLL and, wait for it.... high resolution timer is needed.

Perhaps that's a 1990s problem that doesn't need a 2000s solution though :-)

-- Jamie

2001-04-10 19:32:29

by George Anzinger

[permalink] [raw]
Subject: Re: No 100 HZ timer !

Just for your information we have a project going that is trying to come
up with a good solution for all of this:

http://sourceforge.net/projects/high-res-timers

We have a mailing list there where we have discussed much of the same
stuff. The mailing list archives are available at sourceforge.

Lets separate this into findings and tentative conclusions :)

Findings:

a) The University of Kansas and others have done a lot of work here.

b) High resolution timer events can be had with or without changing HZ.

c) High resolution timer events can be had with or without eliminating
the 1/HZ tick.

d) The organization of the timer list should reflect the existence of
the 1/HZ tick or not. The current structure is not optimal for a "tick
less" implementation. Better would be strict expire order with indexes
to "interesting times".

e) The current organization of the timer list generates a hiccup every
2.56 seconds to handle "cascading". Hiccups are bad.

f) As noted, the account timers (task user/system times) would be much
more accurate with the tick less approach. The cost is added code in
both the system call and the schedule path.

Tentative conclusions:

Currently we feel that the tick less approach is not acceptable due to
(f). We felt that this added code would NOT be welcome AND would, in a
reasonably active system, have much higher overhead than any savings in
not having a tick. Also (d) implies a list organization that will, at
the very least, be harder to understand. (We have some thoughts here,
but abandoned the effort because of (f).) We are, of course, open to
discussion on this issue and all others related to the project
objectives.

We would reorganize the current timer list structure to eliminate the
cascade (e) and to add higher resolution entries. The higher resolution
entries would carry an addition word which would be the fraction of a
jiffie that needs to be added to the jiffie value for the timer. This
fraction would be in units defined by the platform to best suit the sub
jiffie interrupt generation code. Each of the timer lists would then be
ordered by time based on this sub jiffie value. In addition, in order
to eliminate the cascade, each timer list would carry all timers for
times that expire on the (jiffie mod (size of list)). Thus, with the
current 256 first order lists, all timers with the same (jiffies & 255)
would be in the same list, again in expire order. We also think that
the list size should be configurable to some power of two. Again we
welcome discussion of these issues.

George

Alan Cox wrote:

>> It's also all interrupts, not only syscalls, and also context switch if you
>> want to be accurate.

>We dont need to be that accurate. Our sample rate is currently so low the
>data is worthless anyway

2001-04-10 19:46:50

by Stephen D. Williams

[permalink] [raw]
Subject: Re: No 100 HZ timer !

When this is rewritten, I would strongly suggest that we find a way to
make 'gettimeofday' nearly free. Certain applications need to use this
frequently while still being portable. One solution when you do have
clock ticks is a read-only mapped Int. Another cheap solution is
library assembly that adds a cycle clock delta since last system call to
a 'gettimeofday' value set on every system call return.

sdw

Andi Kleen wrote:
>
> On Tue, Apr 10, 2001 at 01:12:14PM +0100, Alan Cox wrote:
> > Measure the number of clocks executing a timer interrupt. rdtsc is fast. Now
> > consider the fact that out of this you get KHz or better scheduling
> > resolution required for games and midi. I'd say it looks good. I agree
>
> And measure the number of cycles a gigahertz CPU can do between a 1ms timer.
> And then check how often the typical application executes something like
> gettimeofday.
>
...

sdw
--
[email protected] http://sdw.st
Stephen D. Williams
43392 Wayside Cir,Ashburn,VA 20147-4622 703-724-0118W 703-995-0407Fax
Dec2000

2001-04-10 19:42:41

by Zdenek Kabelac

[permalink] [raw]
Subject: Re: No 100 HZ timer !

Alan Cox wrote:
>
> > Games would like to be able to page flip at vertical refresh time --
> > <1ms accuracy please. Network traffic shaping benefits from better than
>
> This is an X issue. I was talking with Jim Gettys about what is needed to
> get the relevant existing X extensions for this working

I've already proposed my /dev/vbi device (currently works only for MGA
card)
- read returns when VBI occures - works quite well...
(currently in avifile CVS tree)

Anyway in good all days AmigaOS had interrupt service where devices
where sending timer request - they were queued and timer device was
serving
them in order - I don't see the reason why we should implement this
differently.
If there is no real reason to interrupt system more then 100Hz
periodicity
then this is ok - scheduler will simple send time request for
rescheduling in 10ms.

Why we should create 1KHz timers or so when this way seems to be much
more
elegant and will work even on XXGHz systems.


bye

[email protected]

2001-04-10 19:51:32

by Zdenek Kabelac

[permalink] [raw]
Subject: Re: No 100 HZ timer !

Jamie Lokier wrote:
>
> Alan Cox wrote:
>
> Except for cards which don't generate an irq on vsync but do let you see
> how the raster is proceeding. (I vaguely recall these). For which a
> PLL and, wait for it.... high resolution timer is needed.
>
> Perhaps that's a 1990s problem that doesn't need a 2000s solution though :-)

I think it would be wrong if we could not add new very usable features
to the system just because some old hardware doesn't support it.
(I also believe this was the original idea why XFree has no user
interface
for such important thing like VBI screen switch - because not all device
could provide such information)
I think those who use such old system probably don't need any
synchronized
video or game output - simply because it will not work in such system
anyway :)
so we should not worry about this.

[email protected]

2001-04-10 19:57:13

by mbs

[permalink] [raw]
Subject: Re: No 100 HZ timer !

george anzinger wrote:

> f) As noted, the account timers (task user/system times) would be much
> more accurate with the tick less approach. The cost is added code in
> both the system call and the schedule path.
>
> Tentative conclusions:
>
> Currently we feel that the tick less approach is not acceptable due to
> (f). We felt that this added code would NOT be welcome AND would, in a
> reasonably active system, have much higher overhead than any savings in
> not having a tick. Also (d) implies a list organization that will, at
> the very least, be harder to understand. (We have some thoughts here,
> but abandoned the effort because of (f).) We are, of course, open to
> discussion on this issue and all others related to the project
> objectives.

f does not imply tick-less is not acceptable, it implies that better process time
accounting is not acceptable.

list organization is not complex, it is a sorted absolute time list. I would
argue that this is a hell of a lot easier to understand that ticks + offsets.

still, better process time accounting should be a compile CONFIG option, not
ignored and ruled out because some one thinks that is is to expensive in the
general case.

the whole point of linux and CONFIG options is to get you the kernel with the
features you want, not what someone else wants.

there should be a whole range of config options associated with this issue:

CONFIG_JIFFIES == old jiffies implementation
CONFIG_TICKLESS == tickless
CONFIG_HYBRID == old jiffies plus a tickless high-res timer system on
the side but not assoc w/ process and global
timekeeping

CONFIG_USELESS_PROCESS_TIME_ACCOUNTING = old style, cheap time acctg
CONFIG_USEFUL_BUT_COSTS_TIME_ACCOUNTING = accurate but expensive time accounting

this way, users who want tickless and lousy time acctg can have it AND people who
want jiffies and good time acctg could have it.

these features are largely disjoint and easily seperable. it is also relatively
trivial to do this in such a way that drivers depending on the jiffie abstraction
can be supported without modification no matter what the configuration.

Mark Salisbury

2001-04-10 19:59:53

by Andi Kleen

[permalink] [raw]
Subject: Re: No 100 HZ timer !

On Tue, Apr 10, 2001 at 02:45:15PM -0400, Stephen D. Williams wrote:
> When this is rewritten, I would strongly suggest that we find a way to
> make 'gettimeofday' nearly free. Certain applications need to use this
> frequently while still being portable. One solution when you do have
> clock ticks is a read-only mapped Int. Another cheap solution is
> library assembly that adds a cycle clock delta since last system call to
> a 'gettimeofday' value set on every system call return.

The upcoming x86-64 port supports vsyscalls for just that purpose.



-Andi

2001-04-10 22:11:50

by George Anzinger

[permalink] [raw]
Subject: Re: No 100 HZ timer !

mark salisbury wrote:
>
> george anzinger wrote:
>
> > f) As noted, the account timers (task user/system times) would be much
> > more accurate with the tick less approach. The cost is added code in
> > both the system call and the schedule path.
> >
> > Tentative conclusions:
> >
> > Currently we feel that the tick less approach is not acceptable due to
> > (f). We felt that this added code would NOT be welcome AND would, in a
> > reasonably active system, have much higher overhead than any savings in
> > not having a tick. Also (d) implies a list organization that will, at
> > the very least, be harder to understand. (We have some thoughts here,
> > but abandoned the effort because of (f).) We are, of course, open to
> > discussion on this issue and all others related to the project
> > objectives.
>
> f does not imply tick-less is not acceptable, it implies that better process time
> accounting is not acceptable.

My thinking is that a timer implementation that forced (f) would have
problems gaining acceptance (even with me :). I think a tick less
system does force this and this is why we have, at least for the moment,
abandoned it. In no way does this preclude (f) as it is compatible with
either ticks or tick less time keeping. On the other hand, the stated
project objectives do not include (f) unless, of course we do a tick
less time system.
>
> list organization is not complex, it is a sorted absolute time list. I would
> argue that this is a hell of a lot easier to understand that ticks + offsets.

The complexity comes in when you want to maintain indexes into the list
for quick insertion of new timers. To get the current insert
performance, for example, you would need pointers to (at least) each of
the next 256 centasecond boundaries in the list. But the list ages, so
these pointers need to be continually updated. The thought I had was to
update needed pointers (and only those needed) each time an insert was
done and a needed pointer was found to be missing or stale. Still it
adds complexity that the static structure used now doesn't have.
>
> still, better process time accounting should be a compile CONFIG option, not
> ignored and ruled out because some one thinks that is is to expensive in the
> general case.

As I said above, we are not ruling it out, but rather, we are not
requiring it by going tick less.
>
> the whole point of linux and CONFIG options is to get you the kernel with the
> features you want, not what someone else wants.
>
> there should be a whole range of config options associated with this issue:
>
> CONFIG_JIFFIES == old jiffies implementation
> CONFIG_TICKLESS == tickless
> CONFIG_HYBRID == old jiffies plus a tickless high-res timer system on
> the side but not assoc w/ process and global
> timekeeping
>
> CONFIG_USELESS_PROCESS_TIME_ACCOUNTING = old style, cheap time acctg
> CONFIG_USEFUL_BUT_COSTS_TIME_ACCOUNTING = accurate but expensive time accounting
>
> this way, users who want tickless and lousy time acctg can have it AND people who
> want jiffies and good time acctg could have it.

As I said, it is not clear how you could get
CONFIG_USELESS_PROCESS_TIME_ACCOUNTING unless you did a tick every
jiffie. What did you have in mind?
>
> these features are largely disjoint and easily seperable. it is also relatively
> trivial to do this in such a way that drivers depending on the jiffie abstraction
> can be supported without modification no matter what the configuration.
>
For the most part, I agree. I am not sure that it makes a lot of sense
to mix some of these options, however. I think it comes down to the
question of benefit vs cost. If keeping an old version around that is
not any faster or more efficient in any way would seem too costly to
me. We would like to provide a system that is better in every way and
thus eliminate the need to keep the old one around. We could leave it
in as a compile option so folks would have a fall back, I suppose.

An Issue no one has raised is that the tick less system would need to
start a timer each time it scheduled a task. This would lead to either
slow but very precise time slicing or about what we have today with more
schedule overhead.

George


> Mark Salisbury

2001-04-11 00:43:54

by Mark Salisbury

[permalink] [raw]
Subject: Re: No 100 HZ timer !


> mark salisbury wrote:
> >
> > george anzinger wrote:
> >
> > > f) As noted, the account timers (task user/system times) would be much
> > > more accurate with the tick less approach. The cost is added code in
> > > both the system call and the schedule path.
> > >
> > > Tentative conclusions:
> > >
> > > Currently we feel that the tick less approach is not acceptable due to
> > > (f). We felt that this added code would NOT be welcome AND would, in
a
> > > reasonably active system, have much higher overhead than any savings
in
> > > not having a tick. Also (d) implies a list organization that will, at
> > > the very least, be harder to understand. (We have some thoughts here,
> > > but abandoned the effort because of (f).) We are, of course, open to
> > > discussion on this issue and all others related to the project
> > > objectives.
> >
> > f does not imply tick-less is not acceptable, it implies that better
process time
> > accounting is not acceptable.
>
> My thinking is that a timer implementation that forced (f) would have
> problems gaining acceptance (even with me :). I think a tick less
> system does force this and this is why we have, at least for the moment,
> abandoned it. In no way does this preclude (f) as it is compatible with
> either ticks or tick less time keeping. On the other hand, the stated
> project objectives do not include (f) unless, of course we do a tick
> less time system.
> >
> > list organization is not complex, it is a sorted absolute time list. I
would
> > argue that this is a hell of a lot easier to understand that ticks +
offsets.
>
> The complexity comes in when you want to maintain indexes into the list
> for quick insertion of new timers. To get the current insert
> performance, for example, you would need pointers to (at least) each of
> the next 256 centasecond boundaries in the list. But the list ages, so
> these pointers need to be continually updated. The thought I had was to
> update needed pointers (and only those needed) each time an insert was
> done and a needed pointer was found to be missing or stale. Still it
> adds complexity that the static structure used now doesn't have.

actually, I think a head and tail pointer would be sufficient for most
cases. (most new timers are either going to be a new head of list or a new
tail, i.e. long duration timeouts that will never be serviced or short
duration timers that are going to go off "real soon now (tm)") the oddball
cases would be mostly coming from user-space, i.e. nanosleep which a longerr
list insertion disapears in the block/wakeup/context switch overhead

> >
> > still, better process time accounting should be a compile CONFIG option,
not
> > ignored and ruled out because some one thinks that is is to expensive in
the
> > general case.
>
> As I said above, we are not ruling it out, but rather, we are not
> requiring it by going tick less.
> As I said, it is not clear how you could get
> CONFIG_USELESS_PROCESS_TIME_ACCOUNTING unless you did a tick every
> jiffie. What did you have in mind?

time accounting can be limited to the quantum expiration and voluntary
yields in the tickless/useless case.

> For the most part, I agree. I am not sure that it makes a lot of sense
> to mix some of these options, however. I think it comes down to the
> question of benefit vs cost. If keeping an old version around that is
> not any faster or more efficient in any way would seem too costly to
> me. We would like to provide a system that is better in every way and
> thus eliminate the need to keep the old one around. We could leave it
> in as a compile option so folks would have a fall back, I suppose.

I agree that some combinations don't make much sense _TO_ME_ but that
doesn't mean they don't meet sombody's needs.

in my case (embedded, medium hard real time, massively parallel
multicomputer) the only choices that makes sense to my customers is
tickless/useless in deployment and tickless/useful in
development/profiling/optimization.

in other cases ticked/useless makes sense (dumb old slow chips)

I have no idea who would want ticked/useful or hybrid. i suggested hybrid
as an alternative in case linus might be reluctant to gutting and replacing
the sysclock.

>
> An Issue no one has raised is that the tick less system would need to
> start a timer each time it scheduled a task. This would lead to either
> slow but very precise time slicing or about what we have today with more
> schedule overhead.

no. you would re-use the timer with a new expiration time and a re-insert.

also re overhead. the cost of servicing 10 times as many interrupts as
necessary for system function must cost a fair chunk.



2001-04-11 02:39:20

by George Anzinger

[permalink] [raw]
Subject: Re: No 100 HZ timer !

Mark Salisbury wrote:
>
> > mark salisbury wrote:
> > >
> > > george anzinger wrote:
> > >
> > > > f) As noted, the account timers (task user/system times) would be much
> > > > more accurate with the tick less approach. The cost is added code in
> > > > both the system call and the schedule path.
> > > >
> > > > Tentative conclusions:
> > > >
> > > > Currently we feel that the tick less approach is not acceptable due to
> > > > (f). We felt that this added code would NOT be welcome AND would, in
> a
> > > > reasonably active system, have much higher overhead than any savings
> in
> > > > not having a tick. Also (d) implies a list organization that will, at
> > > > the very least, be harder to understand. (We have some thoughts here,
> > > > but abandoned the effort because of (f).) We are, of course, open to
> > > > discussion on this issue and all others related to the project
> > > > objectives.
> > >
> > > f does not imply tick-less is not acceptable, it implies that better
> process time
> > > accounting is not acceptable.
> >
> > My thinking is that a timer implementation that forced (f) would have
> > problems gaining acceptance (even with me :). I think a tick less
> > system does force this and this is why we have, at least for the moment,
> > abandoned it. In no way does this preclude (f) as it is compatible with
> > either ticks or tick less time keeping. On the other hand, the stated
> > project objectives do not include (f) unless, of course we do a tick
> > less time system.
> > >
> > > list organization is not complex, it is a sorted absolute time list. I
> would
> > > argue that this is a hell of a lot easier to understand that ticks +
> offsets.
> >
> > The complexity comes in when you want to maintain indexes into the list
> > for quick insertion of new timers. To get the current insert
> > performance, for example, you would need pointers to (at least) each of
> > the next 256 centasecond boundaries in the list. But the list ages, so
> > these pointers need to be continually updated. The thought I had was to
> > update needed pointers (and only those needed) each time an insert was
> > done and a needed pointer was found to be missing or stale. Still it
> > adds complexity that the static structure used now doesn't have.
>
> actually, I think a head and tail pointer would be sufficient for most
> cases. (most new timers are either going to be a new head of list or a new
> tail, i.e. long duration timeouts that will never be serviced or short
> duration timers that are going to go off "real soon now (tm)") the oddball
> cases would be mostly coming from user-space, i.e. nanosleep which a longerr
> list insertion disapears in the block/wakeup/context switch overhead
>
> > >
> > > still, better process time accounting should be a compile CONFIG option,
> not
> > > ignored and ruled out because some one thinks that is is to expensive in
> the
> > > general case.
> >
> > As I said above, we are not ruling it out, but rather, we are not
> > requiring it by going tick less.
> > As I said, it is not clear how you could get
> > CONFIG_USELESS_PROCESS_TIME_ACCOUNTING unless you did a tick every
> > jiffie. What did you have in mind?
>
> time accounting can be limited to the quantum expiration and voluntary
> yields in the tickless/useless case.
>
> > For the most part, I agree. I am not sure that it makes a lot of sense
> > to mix some of these options, however. I think it comes down to the
> > question of benefit vs cost. If keeping an old version around that is
> > not any faster or more efficient in any way would seem too costly to
> > me. We would like to provide a system that is better in every way and
> > thus eliminate the need to keep the old one around. We could leave it
> > in as a compile option so folks would have a fall back, I suppose.
>
> I agree that some combinations don't make much sense _TO_ME_ but that
> doesn't mean they don't meet sombody's needs.
>
> in my case (embedded, medium hard real time, massively parallel
> multicomputer) the only choices that makes sense to my customers is
> tickless/useless in deployment and tickless/useful in
> development/profiling/optimization.

I suspect you might go for ticked if its overhead was less. The thing
that makes me think the overhead is high for tick less is the accounting
and time slice stuff. This has to be handled each system call and each
schedule call. These happen WAY more often than ticks... Contrary
wise, I would go for tick less if its overhead is the same or less under
a reasonable load. Of course, we would also want to check overload
sensitivity.
>
> in other cases ticked/useless makes sense (dumb old slow chips)
>
> I have no idea who would want ticked/useful or hybrid. i suggested hybrid
> as an alternative in case linus might be reluctant to gutting and replacing
> the sysclock.
>
> >
> > An Issue no one has raised is that the tick less system would need to
> > start a timer each time it scheduled a task. This would lead to either
> > slow but very precise time slicing or about what we have today with more
> > schedule overhead.
>
> no. you would re-use the timer with a new expiration time and a re-insert.

The issue is not the timer structure itself, but the insert and removal
time AND the time accounting that would have to go on around it. The
timer would need to be computed and started each time on the way out of
schedule and stopped and the residue saved on schedule entry. One might
want to roll this into the accounting time stuff...
>
> also re overhead. the cost of servicing 10 times as many interrupts as
> necessary for system function must cost a fair chunk.
>
On a half way loaded system? Do you think it is that high? I sort of
think that, once the system is loaded, there would be a useful timer
tick most of the time. Might be useful to do some measurements of this.

George

2001-04-11 05:49:43

by Karim Yaghmour

[permalink] [raw]
Subject: Re: No 100 HZ timer !

Mark Salisbury wrote:
>
> It would probably be a good compile config option to allow fine or coarse
> process time accounting, that leaves the choice to the person setting up the
> system to make the choice based on their needs.
>

I suggested this a while ago during a discussion about performance
measurement. This would be fairly easy to implement using the patch
provided with the Linux Trace Toolkit since all entry points and
exit points are known (and it already is available in post-mortem
analysis). Implementing the measurement code within the kernel should
be fairly easy to implement and it would be provided as part of the
compile option. All in all, given the measurements I made, I'd place
the overhead at around 1% for the computations. (The overhead is very
likely to be negligeable when eventual fixes are taken into account.)

===================================================
Karim Yaghmour
[email protected]
Embedded and Real-Time Linux Expert
===================================================

2001-04-11 09:08:19

by Martin Schwidefsky

[permalink] [raw]
Subject: Re: No 100 HZ timer !



>f) As noted, the account timers (task user/system times) would be much
>more accurate with the tick less approach. The cost is added code in
>both the system call and the schedule path.
>
>Tentative conclusions:
>
>Currently we feel that the tick less approach is not acceptable due to
>(f). We felt that this added code would NOT be welcome AND would, in a
>reasonably active system, have much higher overhead than any savings in
>not having a tick. Also (d) implies a list organization that will, at
>the very least, be harder to understand. (We have some thoughts here,
>but abandoned the effort because of (f).) We are, of course, open to
>discussion on this issue and all others related to the project
>objectives.
f) might be true on Intel based systems. At least for s/390 the situation
is a little bit different. Here is a extract from the s/390 part of the
timer patch:

.macro UPDATE_ENTER_TIME reload
la %r14,thread+_TSS_UTIME(%r9) # pointer to utime
tm SP_PSW+1(%r15),0x01 # interrupting from user ?
jno 0f # yes -> add to user time
la %r14,8(%r14) # no -> add to system time
0: lm %r0,%r1,0(%r14) # load user/system time
sl %r1,__LC_LAST_MARK+4 # subtract last time mark
bc 3,BASED(1f) # borrow ?
sl %r0,BASED(.Lc1)
1: sl %r0,__LC_LAST_MARK
stck __LC_LAST_MARK # make a new mark
al %r1,__LC_LAST_MARK+4 # add new mark -> added delta
bc 12,BASED(2f) # carry ?
al %r0,BASED(.Lc1)
2: al %r0,__LC_LAST_MARK
stm %r0,%r1,0(%r14) # store updated user/system time
clc __LC_LAST_MARK(8),__LC_JIFFY_TIMER # check if enough time
jl 3f # passed for a jiffy update
l %r1,BASED(.Ltime_warp)
basr %r14,%r1
.if \reload # reload != 0 for system call
lm %r2,%r6,SP_R2(%r15) # reload clobbered parameters
.endif
3:

.macro UPDATE_LEAVE_TIME
l %r1,BASED(.Ltq_timer) # test if tq_timer list is empty
x %r1,0(%r1) # tq_timer->next != tq_timer ?
jz 0f
l %r1,BASED(.Ltq_timer_active)
icm %r0,15,0(%r1) # timer event already added ?
jnz 0f
l %r1,BASED(.Ltq_pending)
basr %r14,%r1
0: lm %r0,%r1,thread+_TSS_STIME(%r9) # load system time
sl %r1,__LC_LAST_MARK+4 # subtract last time mark
bc 3,BASED(1f) # borrow ?
sl %r0,BASED(.Lc1)
1: sl %r0,__LC_LAST_MARK
stck __LC_LAST_MARK # make new mark
al %r1,__LC_LAST_MARK+4 # add new mark -> added delta
bc 12,BASED(2f) # carry ?
al %r0,BASED(.Lc1)
2: al %r0,__LC_LAST_MARK
stm %r0,%r1,thread+_TSS_STIME(%r9) # store system time
.endm

The two macros UPDATE_ENTER_TIME and UPDATE_LEAVE_TIMER are executed
on every system entry/exit. In the case that no special work has to
be done less then 31 instruction are executed in addition to the
normal system entry/exit code. Special work has to be done if more
time then 1/HZ has passed (call time_warp), or if tq_timer contains
an element (call tq_pending).
The accuracy of the timer events has not changed. It still is 1/HZ.
The only thing this patch does is to avoid unneeded interruptions.
I'd be happy if this could be combined with a new, more accurate
timing method.

blue skies,
Martin

Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH
Sch?naicherstr. 220, D-71032 B?blingen, Telefon: 49 - (0)7031 - 16-2247
E-Mail: [email protected]


2001-04-11 11:48:31

by Maciej W. Rozycki

[permalink] [raw]
Subject: Re: No 100 HZ timer !

On Tue, 10 Apr 2001, Zdenek Kabelac wrote:

> I think it would be wrong if we could not add new very usable features
> to the system just because some old hardware doesn't support it.

s/old/crappy/ -- even old hardware can handle vsync IRQs fine. E.g. TVGA
8900C.

--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: [email protected], PGP key available +

2001-04-11 16:11:29

by Jamie Lokier

[permalink] [raw]
Subject: Re: No 100 HZ timer !

Mark Salisbury wrote:
> > The complexity comes in when you want to maintain indexes into the list
> > for quick insertion of new timers. To get the current insert
> > performance, for example, you would need pointers to (at least) each of
> > the next 256 centasecond boundaries in the list. But the list ages, so
> > these pointers need to be continually updated. The thought I had was to
> > update needed pointers (and only those needed) each time an insert was
> > done and a needed pointer was found to be missing or stale. Still it
> > adds complexity that the static structure used now doesn't have.
>
> actually, I think a head and tail pointer would be sufficient for most
> cases. (most new timers are either going to be a new head of list or a new
> tail, i.e. long duration timeouts that will never be serviced or short
> duration timers that are going to go off "real soon now (tm)") the oddball
> cases would be mostly coming from user-space, i.e. nanosleep which a longerr
> list insertion disapears in the block/wakeup/context switch overhead

A pointer-based priority queue is really not a very complex thing, and
there are ways to optimise them for typical cases like the above.

-- Jamie

2001-04-11 16:13:39

by Jamie Lokier

[permalink] [raw]
Subject: Re: No 100 HZ timer !

Maciej W. Rozycki wrote:
> On Tue, 10 Apr 2001, Zdenek Kabelac wrote:
>
> > I think it would be wrong if we could not add new very usable features
> > to the system just because some old hardware doesn't support it.
>
> s/old/crappy/ -- even old hardware can handle vsync IRQs fine. E.g. TVGA
> 8900C.

Think of the original 64k and 256k VGA cards. I think some of those
didn't have an irq, but did have a way to read the progress of the
raster, which you could PLL to using timer interrupts. Some video games
still look fine at 320x200 :-)

-- Jamie

2001-04-11 17:04:41

by George Anzinger

[permalink] [raw]
Subject: Re: No 100 HZ timer !

Jamie Locker wrote:
>
> Mark Salisbury wrote:
> > > The complexity comes in when you want to maintain indexes into the list
> > > for quick insertion of new timers. To get the current insert
> > > performance, for example, you would need pointers to (at least) each of
> > > the next 256 centasecond boundaries in the list. But the list ages, so
> > > these pointers need to be continually updated. The thought I had was to
> > > update needed pointers (and only those needed) each time an insert was
> > > done and a needed pointer was found to be missing or stale. Still it
> > > adds complexity that the static structure used now doesn't have.
> >
> > actually, I think a head and tail pointer would be sufficient for most
> > cases. (most new timers are either going to be a new head of list or a new
> > tail, i.e. long duration timeouts that will never be serviced or short
> > duration timers that are going to go off "real soon now (tm)") the oddball
> > cases would be mostly coming from user-space, i.e. nanosleep which a longerr
> > list insertion disapears in the block/wakeup/context switch overhead
>
> A pointer-based priority queue is really not a very complex thing, and
> there are ways to optimise them for typical cases like the above.
>
Please do enlighten me. The big problem in my mind is that the
pointers, pointing at points in time, are perishable.

George

2001-04-11 18:44:24

by Oliver Xymoron

[permalink] [raw]
Subject: Re: No 100 HZ timer !

On Mon, 9 Apr 2001, Alan Cox wrote:

> > > Its worth doing even on the ancient x86 boards with the PIT.
> >
> > Note that programming the PIT is sloooooooow and doing it on every timer
> > add_timer/del_timer would be a pain.
>
> You only have to do it occasionally.
>
> When you add a timer newer than the current one
> (arguably newer by at least 1/2*HZ sec)

That's only if we want to do no better than the current system. We'd want
a new variable called timer_margin or something, which would be dependent
on interrupt source and processor, and could be tuned up or down via
sysctl.

> When you finish running the timers at an interval and the new interval is
> significantly larger than the current one.

Make that larger or smaller. If we come out of a quiescent state (1 hz
interrupts, say) and start getting 10ms timers, we want to respond to them
right away.

> Remember each tick we poke the PIT anyway

We could also have a HZ_max tunable above which we would not try to
reprogram the interval. On older systems, this could be set at
100-200HZ...

--
"Love the dolphins," she advised him. "Write by W.A.S.T.E.."

2001-04-11 18:57:34

by Jamie Lokier

[permalink] [raw]
Subject: Re: No 100 HZ timer !

george anzinger wrote:
> > A pointer-based priority queue is really not a very complex thing, and
> > there are ways to optimise them for typical cases like the above.
> >
> Please do enlighten me. The big problem in my mind is that the
> pointers, pointing at points in time, are perishable.

Um, I'm not sure what perishability has to do with anything. Timers at
the moment can be added, deleted and destroyed and it's no big deal.

Look in an algorithms book under "priority queue". They usually don't
say how to use a heap-ordered tree -- usually it's an array -- but you
can use a tree if you want. In such a tree, each timer has a link to
_two_ next timers and one previous timer. (The previous timer link is
only needed if you can delete timers before they expire, which is
required for Linux).

I'm not saying saying a heap-ordered tree is fastest. But it's ok, and
doesn't require any more storage than the `struct timer' objects
themselves.

It's possible to optimise this kind of data structure rather a lot,
depending on how you want to use it. Linux' current timer algorithm is
a pretty good example of a priority queue optimised for timer ticks,
where you don't mind doing a small amount of work per tick.

One notable complication with the kernel is that we don't want every
timer to run at its exactly allocated time, except for some special
timers. For example, if you have 100 TCP streams and their
retransmission times are scheduled for 1.0000s, 1.0001s, 1.0002s, etc.,
you'd much rather just process them all at once as is done at the moment
by the 100Hz timer. This is because cache locality is much more
important for TCP retransmits than transmit timing resolution.

There's literature online about this class of problems: look up "event
set" and "simulation" and "fast priority queue".

enjoy,
-- Jamie

2001-04-11 19:25:39

by John Alvord

[permalink] [raw]
Subject: Re: No 100 HZ timer !

On Wed, 11 Apr 2001 20:57:04 +0200, Jamie Lokier
<[email protected]> wrote:

>george anzinger wrote:
>> > A pointer-based priority queue is really not a very complex thing, and
>> > there are ways to optimise them for typical cases like the above.
>> >
>> Please do enlighten me. The big problem in my mind is that the
>> pointers, pointing at points in time, are perishable.
>
>Um, I'm not sure what perishability has to do with anything. Timers at
>the moment can be added, deleted and destroyed and it's no big deal.
>
>Look in an algorithms book under "priority queue". They usually don't
>say how to use a heap-ordered tree -- usually it's an array -- but you
>can use a tree if you want. In such a tree, each timer has a link to
>_two_ next timers and one previous timer. (The previous timer link is
>only needed if you can delete timers before they expire, which is
>required for Linux).
>
>I'm not saying saying a heap-ordered tree is fastest. But it's ok, and
>doesn't require any more storage than the `struct timer' objects
>themselves.
>
>It's possible to optimise this kind of data structure rather a lot,
>depending on how you want to use it. Linux' current timer algorithm is
>a pretty good example of a priority queue optimised for timer ticks,
>where you don't mind doing a small amount of work per tick.
>
>One notable complication with the kernel is that we don't want every
>timer to run at its exactly allocated time, except for some special
>timers. For example, if you have 100 TCP streams and their
>retransmission times are scheduled for 1.0000s, 1.0001s, 1.0002s, etc.,
>you'd much rather just process them all at once as is done at the moment
>by the 100Hz timer. This is because cache locality is much more
>important for TCP retransmits than transmit timing resolution.
>
>There's literature online about this class of problems: look up "event
>set" and "simulation" and "fast priority queue".

I bumped into a funny non-optimization a few years ago. A system with
a timer queue like the above had been "optimized" by keeping old timer
elements... ready for new tasks to link onto and activate. The
granularity was 1 millsecond. Over time, all timer values from 0 to
roughly 10 minutes had been used. That resulted in 60,000 permanent
storage fragments laying about... a significant fragmentation problem.
The end was a forced recycle every month or so.

john alvord

2001-04-12 00:19:10

by Mark Salisbury

[permalink] [raw]
Subject: Re: No 100 HZ timer !

> I suspect you might go for ticked if its overhead was less. The thing
> that makes me think the overhead is high for tick less is the accounting
> and time slice stuff. This has to be handled each system call and each
> schedule call. These happen WAY more often than ticks... Contrary
> wise, I would go for tick less if its overhead is the same or less under
> a reasonable load. Of course, we would also want to check overload
> sensitivity.

as I said, i think the process time accounting is disjoint from the
ticked/tickless issue and should therefore be considered seperately..

in my experience with the tickless method, after all pending timers have
been serviced, then to determine the interval until the next interrupt
should be generated all that is needed is one 64 bit subtraction and a
register load (about 5 - 8 cycles)

I think we should spend some serious time with some quick and dirty
prototypes of both methods, both to characterize all the issues raised and
to refine our ideas when it comes time to implement this.

I also think that we should give some thought to implementing both an
improved ticked system and a tickless system to be chosen as CONFIG options
so that the developer putting together a system can make a choice based on
their needs, and not our whims. I am more than willing to concede that
there is a class of customer to whom a ticked system is appropriate, but I
am quite sure that there is a class for whom the ticked model is completely
unacceptable.

> On a half way loaded system? Do you think it is that high? I sort of
> think that, once the system is loaded, there would be a useful timer
> tick most of the time. Might be useful to do some measurements of this.

any non-useful tick takes away cycles that could be better used performing
FFT's

there are many kinds of loads, some which generate large numbers of timed
events and some that generate none or few.
I think we want to be able to provide good solutions for both cases even.

we should do lots of measurements.

Mark


2001-04-12 05:26:30

by watermodem

[permalink] [raw]
Subject: Re: No 100 HZ timer !

Alan Cox wrote:
>
> > Timers more precise than 100HZ aren't probably needed - as MIN_RTO is 0.2s
> > and MIN_DELACK is 0.04s, TCP would hardly benefit from them.
>
> There are a considerable number of people who really do need 1Khz resolution.
> Midi is one of the example cases. That doesn't mean we have to go to a 1KHz
> timer, we may want to combine a 100Hz timer with a smart switch to 1Khz

As somebody who is now debating how to measure latencies in a
giga-bit ethernet environment with several components doing
L3 switching in much less than 10 micro-seconds ... (hardware)
I agree that some method is need to achieve higher resolutions.
(Sigh... I will likely need to buy something big and expensive)
{this is for a system to make use of L. Yarrow's little protocol}

2001-04-12 08:42:15

by Jamie Lokier

[permalink] [raw]
Subject: Re: No 100 HZ timer !

John Alvord wrote:
> I bumped into a funny non-optimization a few years ago. A system with
> a timer queue like the above had been "optimized" by keeping old timer
> elements... ready for new tasks to link onto and activate. The
> granularity was 1 millsecond. Over time, all timer values from 0 to
> roughly 10 minutes had been used. That resulted in 60,000 permanent
> storage fragments laying about... a significant fragmentation problem.
> The end was a forced recycle every month or so.

This is the sort of thing that Linux does with slab, dentry, inode
caches and so on. In theory the memory is reclaimed as required :-)

It's not a big issue with timers, as the timer elements are fixed size
structures that tend to be embedded in other structures. So the
lifetime of the timer element is the same as the lifetime of the object
associated with the timer.

-- Jamie

2001-04-12 08:45:55

by Jamie Lokier

[permalink] [raw]
Subject: Re: No 100 HZ timer !

watermodem wrote:
> As somebody who is now debating how to measure latencies in a
> giga-bit ethernet environment with several components doing
> L3 switching in much less than 10 micro-seconds ... (hardware)
> I agree that some method is need to achieve higher resolutions.
> (Sigh... I will likely need to buy something big and expensive)
> {this is for a system to make use of L. Yarrow's little protocol}

Use Alteon/Netgear cards, everyone else seems to be :-) We get
measurements on the order of 100ns, if we are just measuring network
latencies. (Data is not transferred over the PCI bus).

-- Jamie

2001-04-12 10:11:16

by Maciej W. Rozycki

[permalink] [raw]
Subject: Re: No 100 HZ timer !

On Wed, 11 Apr 2001, Jamie Lokier wrote:

> Think of the original 64k and 256k VGA cards. I think some of those
> didn't have an irq, but did have a way to read the progress of the
> raster, which you could PLL to using timer interrupts. Some video games
> still look fine at 320x200 :-)

The *original* VGA, i.e. the PS/2 one, did have an IRQ, IIRC (according
to docs -- I haven't ever seen one). Cheap clones might have lacked it,
though.

Then there is workstation (non-PC) hardware from early '90s which we run
on and which uses an IRQ-driven interface to graphics adapters -- not only
for vsync.

--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: [email protected], PGP key available +

2001-04-12 13:06:35

by mbs

[permalink] [raw]
Subject: Re: No 100 HZ timer !


On Wed, 11 Apr 2001, Bret Indrelee wrote:
> Current generation PCs can easily handle 1000s of
> interrupts a second if you keep the overhead small.

the PC centric implementation of the ticked system is one of its major flaws.

there are architectures which the cost of a fixed interval is the same as a
variable interval, i.e. no reload register, so you must explicitely load a
value each interrupt anyway. and if you want accurate intervals, you must
perform a calculation each reload, and not just load a static value, because
you don't know how many cycles it took from the time the interrupt happened
till you get to the reload line because the int may have been masked or lower
pri than another interrupt.

also, why handle 1000's of interrupts if you only need to handle 10?

--
/*------------------------------------------------**
** Mark Salisbury | Mercury Computer Systems **
** [email protected] | System OS - Kernel Team **
**------------------------------------------------**
** I will be riding in the Multiple Sclerosis **
** Great Mass Getaway, a 150 mile bike ride from **
** Boston to Provincetown. Last year I raised **
** over $1200. This year I would like to beat **
** that. If you would like to contribute, **
** please contact me. **
**------------------------------------------------*/

2001-08-01 01:10:04

by George Anzinger

[permalink] [raw]
Subject: Re: No 100 HZ timer !

I have just posted a patch on sourceforge:
http://sourceforge.net/projects/high-res-timers

to the 2.4.7 kernel with both ticked and tick less options, switch able
at any time via a /proc interface. The system is instrumented with
Andrew Mortons time pegs with a couple of enhancements so you can easily
see your clock/ timer overhead (thanks Andrew).

Please take a look at this system and let me know if a tick less system
is worth further effort.

The testing I have done seems to indicate a lower overhead on a lightly
loaded system, about the same overhead with some load, and much more
overhead with a heavy load. To me this seems like the wrong thing to
do. We would like as nearly a flat overhead to load curve as we can get
and the ticked system seems to be much better in this regard. Still
there may be applications where this works.

comments? RESULTS?

George