LinuxLists.cc - HZ, preferably as small as possible

Subject: Re: HZ, preferably as small as possible

"Grover, Andrew" wrote:
>
> I'd like to see HZ closer to 100 than 1000, for CPU power reasons. Processor
> power states like C3 may take 100 microseconds+ to enter/leave - time when
> the CPU isn't doing any work, but is still drawing power as if it were. We
> pop out of C3 whenever there is an interrupt, so reducing timer interrupts
> is good from a power standpoint by amortizing the transition penalty over a
> longer period of power savings.
>
> But on the other hand, increasing HZ has perf/latency benefits, yes? Have
> these been quantified? I'd either like to see a HZ that has balanced
> power/performance, or could we perhaps detect we are on a system that cares
> about power (aka a laptop) and tweak its value at runtime?
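
To put rough numbers on the amortization argument quoted above, here is a minimal sketch. It assumes every timer tick wakes an otherwise idle CPU and costs one full enter/leave transition; the helper name and the per-mille unit are made up for illustration.

```c
#include <assert.h>

/*
 * Hypothetical helper (not kernel code): share of every tick period,
 * in parts per thousand, that an otherwise idle CPU spends entering
 * and leaving a sleep state, assuming one wakeup per timer tick.
 */
static unsigned int transition_overhead_permille(unsigned int hz,
                                                 unsigned int transition_us)
{
    unsigned int period_us;

    assert(hz > 0 && hz <= 1000000u);
    period_us = 1000000u / hz;              /* length of one tick */
    if (transition_us >= period_us)
        return 1000;                        /* no time left to actually sleep */
    return transition_us * 1000u / period_us;
}
```

With a 100 us transition cost this gives 100 per mille (10%) at HZ=1000 and 10 per mille (1%) at HZ=100.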

HZ is used in a LOT of places. I suspect "tweaking" at run
time would be a bit difficult.

The high-res-timers patch gives high-resolution timers
without changing HZ. Interrupts are scheduled as needed,
between the 1/HZ ticks, so a quiet system will have few (if
any) interrupts between the ticks.

--
George Anzinger [email protected]
High-res-timers:
http://sourceforge.net/projects/high-res-timers/
Real time sched: http://sourceforge.net/projects/rtsched/
Preemption patch:
http://www.kernel.org/pub/linux/kernel/people/rml

2002-07-10 21:27:19

by Andrew Morton

Subject: Re: HZ, preferably as small as possible

2002-07-10 21:33:11

by Benjamin LaHaise

Subject: Re: HZ, preferably as small as possible

On Wed, Jul 10, 2002 at 02:28:03PM -0700, Andrew Morton wrote:
> > But on the other hand, increasing HZ has perf/latency benefits, yes? Have
> > these been quantified?
>
> Not that I'm aware of. And I'd regard any such claims with some
> scepticism.

The most obvious one is the reduced latency of select/poll timeouts.

-ben
--
"You will be reincarnated as a toad; and you will be much happier."

2002-07-10 21:37:43

by Andrew Morton

Subject: Re: HZ, preferably as small as possible

Benjamin LaHaise wrote:
>
> On Wed, Jul 10, 2002 at 02:28:03PM -0700, Andrew Morton wrote:
> > > But on the other hand, increasing HZ has perf/latency benefits, yes? Have
> > > these been quantified?
> >
> > Not that I'm aware of. And I'd regard any such claims with some
> > scepticism.
>
> The most obvious one is the reduced latency of select/poll timeouts.

OK, I'll grant that. Why is this useful?

2002-07-10 21:40:08

by Benjamin LaHaise

Subject: Re: HZ, preferably as small as possible

On Wed, Jul 10, 2002 at 02:38:32PM -0700, Andrew Morton wrote:
> OK, I'll grant that. Why is this useful?

Think video playback, where you want to queue the frame to be played as
close to the correct 1/60s time as possible. With HZ=100, the code will
frequently wake up much too late.
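
A rough sketch of the arithmetic behind this, assuming timeouts are rounded up to a whole number of 1/HZ ticks. The function name is made up, and a real kernel may add further slack on top of the rounding.

```c
#include <assert.h>

/*
 * Hypothetical helper: the delay actually delivered when a timeout of
 * timeout_us microseconds is rounded up to whole ticks of 1/hz seconds.
 */
static unsigned long rounded_timeout_us(unsigned long timeout_us,
                                        unsigned long hz)
{
    unsigned long tick_us, ticks;

    assert(hz > 0 && hz <= 1000000ul);
    tick_us = 1000000ul / hz;
    ticks = (timeout_us + tick_us - 1) / tick_us;   /* round up */
    return ticks * tick_us;
}
```

For one 1/60 s frame (about 16667 us) this returns 20000 us at HZ=100 and 17000 us at HZ=1000.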

-ben
--
"You will be reincarnated as a toad; and you will be much happier."

2002-07-10 21:58:29

Subject: Re: HZ, preferably as small as possible

Hi,

On Wed, 10 Jul 2002, Andrew Morton wrote:
> That makes a ton of sense.
>
> > But on the other hand, increasing HZ has perf/latency benefits, yes? Have
> > these been quantified?
>
> Not that I'm aware of. And I'd regard any such claims with some
> scepticism.
>
> > I'd either like to see a HZ that has balanced
> > power/performance, or could we perhaps detect we are on a system that cares
> > about power (aka a laptop) and tweak its value at runtime?

Want a config option? Either int or bool (CONFIG_LOW_HZ). It's not too
much effort.

Regards,
Thunder
--
(Use http://www.ebb.org/ungeek if you can't decode)
------BEGIN GEEK CODE BLOCK------
Version: 3.12
GCS/E/G/S/AT d- s++:-- a? C++$ ULAVHI++++$ P++$ L++++(+++++)$ E W-$
N--- o? K? w-- O- M V$ PS+ PE- Y- PGP+ t+ 5+ X+ R- !tv b++ DI? !D G
e++++ h* r--- y-
------END GEEK CODE BLOCK------

2002-07-10 22:09:18

by Cort Dougan

Subject: Re: HZ, preferably as small as possible

Yes, please do make it a config option. 10x interrupt overhead makes me
worry. It lets users tailor the kernel to their expected load.

2002-07-10 22:38:48

Subject: Re: HZ, preferably as small as possible

Hi,

On Wed, 10 Jul 2002, Thunder from the hill wrote:
> Want a config option? Either int or bool (CONFIG_LOW_HZ). It's not too
> much effort.

I guess I forgot the half of it...

What arches do we want?

Index: arch/i386/Config.help
===================================================================
RCS file: /var/cvs/thunder-2.5/arch/i386/Config.help,v
retrieving revision 1.4
diff -p -u -r1.4 Config.help
--- arch/i386/Config.help 7 Jul 2002 09:59:46 -0000 1.4
+++ arch/i386/Config.help 10 Jul 2002 22:40:17 -0000
@@ -991,3 +991,13 @@ CONFIG_X86_EARLY_PRINTK
to the console much earlier in the boot process than printk. This
is useful when debugging fatal problems early in the boot sequence
(e.g. within setup_arch). If unsure, say N.
+
+Low kernel scheduler rate
+CONFIG_SCHED_LOW_HZ
+ Enable this if you care about your CPU sleeping time. The current
+ interval for scheduling processes in the kernel has recently been
+ increased. The advantage is less latency for many things that depend
+ on the timer, the disadvantage is that your cpu will probably not
+ go to sleep in time (so CPU power management will possibly not work
+ at all)
+
Index: include/asm-i386/param.h
===================================================================
RCS file: /var/cvs/thunder-2.5/include/asm-i386/param.h,v
retrieving revision 1.2
diff -p -u -r1.2 param.h
--- include/asm-i386/param.h 6 Jul 2002 18:17:30 -0000 1.2
+++ include/asm-i386/param.h 10 Jul 2002 22:40:17 -0000
@@ -2,7 +2,11 @@
#define _ASMi386_PARAM_H

#ifdef __KERNEL__
-# define HZ 1000 /* Internal kernel timer frequency */
+# ifdef CONFIG_SCHED_LOW_HZ
+# define HZ 100 /* Internal kernel timer frequency */
+# else
+# define HZ 1000 /* Internal kernel timer frequency */
+# endif
# define USER_HZ 100 /* .. some user interfaces are in "ticks" */
# define CLOCKS_PER_SEC (USER_HZ) /* like times() */
#endif

Regards,
Thunder
--
(Use http://www.ebb.org/ungeek if you can't decode)
------BEGIN GEEK CODE BLOCK------
Version: 3.12
GCS/E/G/S/AT d- s++:-- a? C++$ ULAVHI++++$ P++$ L++++(+++++)$ E W-$
N--- o? K? w-- O- M V$ PS+ PE- Y- PGP+ t+ 5+ X+ R- !tv b++ DI? !D G
e++++ h* r--- y-
------END GEEK CODE BLOCK------

2002-07-10 22:44:55

Subject: Re: HZ, preferably as small as possible

Hi,

On Wed, 10 Jul 2002, Thunder from the hill wrote:
> I guess I forgot the half of it...

I did. Here is the whole version:

Index: arch/i386/Config.help
===================================================================
RCS file: /var/cvs/thunder-2.5/arch/i386/Config.help,v
retrieving revision 1.4
diff -p -u -r1.4 Config.help
--- arch/i386/Config.help 7 Jul 2002 09:59:46 -0000 1.4
+++ arch/i386/Config.help 10 Jul 2002 22:40:17 -0000
@@ -991,3 +991,13 @@ CONFIG_X86_EARLY_PRINTK
to the console much earlier in the boot process than printk. This
is useful when debugging fatal problems early in the boot sequence
(e.g. within setup_arch). If unsure, say N.
+
+Low kernel scheduler rate
+CONFIG_SCHED_LOW_HZ
+ Enable this if you care about your CPU sleeping time. The current
+ interval for scheduling processes in the kernel has recently been
+ increased. The advantage is less latency for many things that depend
+ on the timer, the disadvantage is that your cpu will probably not
+ go to sleep in time (so CPU power management will possibly not work
+ at all)
+
Index: include/asm-i386/param.h
===================================================================
RCS file: /var/cvs/thunder-2.5/include/asm-i386/param.h,v
retrieving revision 1.2
diff -p -u -r1.2 param.h
--- include/asm-i386/param.h 6 Jul 2002 18:17:30 -0000 1.2
+++ include/asm-i386/param.h 10 Jul 2002 22:40:17 -0000
@@ -2,7 +2,11 @@
#define _ASMi386_PARAM_H

#ifdef __KERNEL__
-# define HZ 1000 /* Internal kernel timer frequency */
+# ifdef CONFIG_SCHED_LOW_HZ
+# define HZ 100 /* Internal kernel timer frequency */
+# else
+# define HZ 1000 /* Internal kernel timer frequency */
+# endif
# define USER_HZ 100 /* .. some user interfaces are in "ticks" */
# define CLOCKS_PER_SEC (USER_HZ) /* like times() */
#endif
Index: arch/i386/config.in
===================================================================
RCS file: /var/cvs/thunder-2.5/arch/i386/config.in,v
retrieving revision 1.8
diff -p -u -r1.8 config.in
--- arch/i386/config.in 7 Jul 2002 09:59:47 -0000 1.8
+++ arch/i386/config.in 10 Jul 2002 22:45:28 -0000
@@ -181,6 +181,7 @@ else
bool 'Multiquad NUMA system' CONFIG_MULTIQUAD
fi

+bool 'Low scheduler rates' CONFIG_SCHED_LOW_HZ
bool 'Machine Check Exception' CONFIG_X86_MCE
dep_bool 'Check for non-fatal errors on Athlon/Duron' CONFIG_X86_MCE_NONFATAL $CONFIG_X86_MCE
dep_bool 'check for P4 thermal throttling interrupt.' CONFIG_X86_MCE_P4THERMAL $CONFIG_X86_MCE $CONFIG_X86_UP_APIC

Regards,
Thunder
--
(Use http://www.ebb.org/ungeek if you can't decode)
------BEGIN GEEK CODE BLOCK------
Version: 3.12
GCS/E/G/S/AT d- s++:-- a? C++$ ULAVHI++++$ P++$ L++++(+++++)$ E W-$
N--- o? K? w-- O- M V$ PS+ PE- Y- PGP+ t+ 5+ X+ R- !tv b++ DI? !D G
e++++ h* r--- y-
------END GEEK CODE BLOCK------

2002-07-10 22:47:10

by Eli Carter

Subject: Re: HZ, preferably as small as possible

Thunder from the hill wrote:
> Hi,
>
> On Wed, 10 Jul 2002, Thunder from the hill wrote:
>
>>Want a config option? Either int or bool (CONFIG_LOW_HZ). It's not too
>>much effort.
>
>
> I guess I forgot the half of it...
>
> What arches do we want?
>
> Index: arch/i386/Config.help
> ===================================================================
> RCS file: /var/cvs/thunder-2.5/arch/i386/Config.help,v
> retrieving revision 1.4
> diff -p -u -r1.4 Config.help
> --- arch/i386/Config.help 7 Jul 2002 09:59:46 -0000 1.4
> +++ arch/i386/Config.help 10 Jul 2002 22:40:17 -0000
> @@ -991,3 +991,13 @@ CONFIG_X86_EARLY_PRINTK
> to the console much earlier in the boot process than printk. This
> is useful when debugging fatal problems early in the boot sequence
> (e.g. within setup_arch). If unsure, say N.
> +
> +Low kernel scheduler rate
> +CONFIG_SCHED_LOW_HZ
> + Enable this if you care about your CPU sleeping time. The current
> + interval for scheduling processes in the kernel has recently been
> + increased. The advantage is less latency for many things that depend

Perhaps s/increased/shortened/ ?

> + on the timer, the disadvantage is that your cpu will probably not
> + go to sleep in time (so CPU power management will possibly not work
> + at all)
> +
> Index: include/asm-i386/param.h
[snip]

Eli
--------------------. "If it ain't broke now,
Eli Carter \ it will be soon." -- crypto-gram
eli.carter(a)inet.com `-------------------------------------------------

2002-07-10 23:02:32

Subject: Re: HZ, preferably as small as possible

Hi,

On Wed, 10 Jul 2002, Eli Carter wrote:
> Perhaps s/increased/shortened/ ?

'Course. Sorry, I'm quite out of bounds since I heard that tonight some of
my friends possibly got lost in the storm in Germany. I wished them lots
of fun on their canoe tour at the Mueritz...

I think I'll get back there tonight, so don't expect many responses, TAs
are ugly. Of course I'll try to get on working meanwhile.

The patch w/ the shortened-update is now at

http://luckynet.dynu.com/~thunder/patches/CONFIG_SCHED_LOW_HZ.patch

Regards,
Thunder
--
(Use http://www.ebb.org/ungeek if you can't decode)
------BEGIN GEEK CODE BLOCK------
Version: 3.12
GCS/E/G/S/AT d- s++:-- a? C++$ ULAVHI++++$ P++$ L++++(+++++)$ E W-$
N--- o? K? w-- O- M V$ PS+ PE- Y- PGP+ t+ 5+ X+ R- !tv b++ DI? !D G
e++++ h* r--- y-
------END GEEK CODE BLOCK------

2002-07-10 23:06:08

by Dave Mielke

Subject: Re: HZ, preferably as small as possible

[quoted lines by Thunder from the hill on July 10, 2002, at 16:41]

>+ Enable this if you care about your CPU sleeping time. The current
>+ interval for scheduling processes in the kernel has recently been
>+ increased.

The word "recently" will very quickly become out-of-date. Why not just state
the way it is and why one might want to select the option?

--
Dave Mielke | 2213 Fox Crescent | I believe that the Bible is the
Phone: 1-613-726-0014 | Ottawa, Ontario | Word of God. Please contact me
EMail: [email protected] | Canada K2A 1H7 | if you're concerned about Hell.
http://familyradio.com

2002-07-10 23:11:18

Subject: Re: HZ, preferably as small as possible

Hi,

On Wed, 10 Jul 2002, Dave Mielke wrote:
> [quoted lines by Thunder from the hill on July 10, 2002, at 16:41]
>
> >+ Enable this if you care about your CPU sleeping time. The current
> >+ interval for scheduling processes in the kernel has recently been
> >+ increased.
>
> The word "recently" will very quickly become out-of-date. Why not just state
> the way it is and why one might want to select the option?

I don't think this is a patch for long.

Regards,
Thunder
--
(Use http://www.ebb.org/ungeek if you can't decode)
------BEGIN GEEK CODE BLOCK------
Version: 3.12
GCS/E/G/S/AT d- s++:-- a? C++$ ULAVHI++++$ P++$ L++++(+++++)$ E W-$
N--- o? K? w-- O- M V$ PS+ PE- Y- PGP+ t+ 5+ X+ R- !tv b++ DI? !D G
e++++ h* r--- y-
------END GEEK CODE BLOCK------

2002-07-10 23:47:45

by J.A. Magallon

Subject: Re: HZ, preferably as small as possible

On 2002.07.11 Thunder from the hill wrote:
>Hi,
>
>On Wed, 10 Jul 2002, Andrew Morton wrote:
>> That makes a ton of sense.
>>
>> > But on the other hand, increasing HZ has perf/latency benefits, yes? Have
>> > these been quantified?
>>
>> Not that I'm aware of. And I'd regard any such claims with some
>> scepticism.
>>
>> > I'd either like to see a HZ that has balanced
>> > power/performance, or could we perhaps detect we are on a system that cares
>> > about power (aka a laptop) and tweak its value at runtime?
>
>Want a config option? Either int or bool (CONFIG_LOW_HZ). It's not too
>much effort.
>

How about a <boot> option ? linux hz=[low,high]

It is runtime, but just one time.

--
J.A. Magallon \ Software is like sex: It's better when it's free
mailto:[email protected] \ -- Linus Torvalds, FSF T-shirt
Linux werewolf 2.4.19-rc1-jam2, Mandrake Linux 8.3 (Cooker) for i586
gcc (GCC) 3.1.1 (Mandrake Linux 8.3 3.1.1-0.7mdk)

2002-07-11 00:26:57

by Lincoln Dale

Subject: Re: HZ, preferably as small as possible

At 02:28 PM 10/07/2002 -0700, Andrew Morton wrote:
> > But on the other hand, increasing HZ has perf/latency benefits, yes? Have
> > these been quantified?
>
>Not that I'm aware of. And I'd regard any such claims with some
>scepticism.

for one, i'm using a modified version of the network FIFO queue discipline
to inject "delay" and "drop", similar to what ippipe can do on FreeBSD.
given i'm using a kernel timer for this, HZ >= 1000 is essential for <1.5
millisecond accuracy.

perhaps we really need a high-speed timer mechanism for parts of the kernel
that require it (or a highly-accurate single-fire timer)?

cheers,

lincoln.

2002-07-11 02:11:43

by CaT

Subject: Re: HZ, preferably as small as possible

On Wed, Jul 10, 2002 at 05:42:51PM -0400, Benjamin LaHaise wrote:
> On Wed, Jul 10, 2002 at 02:38:32PM -0700, Andrew Morton wrote:
> > OK, I'll grant that. Why is this useful?
>
> Think video playback, where you want to queue the frame to be played as
> close to the correct 1/60s time as possible. With HZ=100, the code will

Or 1/50 (think PAL), no? (Of course HZ=100 would be sweet for that. ;)

--
GOVERNMENT ANNOUNCEMENT - The government announced today that it is
changing its mascot to a condom because it more clearly reflects the
government's political stance. A condom stands up to inflation, halts
production, destroys the next generation, protects a bunch of pricks
and finally, gives you a sense of security while you're being screwed!

2002-07-11 02:44:05

by Andrew Grover

Subject: RE: HZ, preferably as small as possible

> From: CaT [mailto:[email protected]]
> On Wed, Jul 10, 2002 at 05:42:51PM -0400, Benjamin LaHaise wrote:
> > On Wed, Jul 10, 2002 at 02:38:32PM -0700, Andrew Morton wrote:
> > > OK, I'll grant that. Why is this useful?
> >
> > Think video playback, where you want to queue the frame to
> be played as
> > close to the correct 1/60s time as possible. With HZ=100,
> the code will
>
> Or 1/50 (think PAL), no? (Of course HZ=100 would be sweet for that. ;)

I don't know if I should mention this, but...

Win2k's default timer tick is 10ms (i.e. 100HZ) but it will go as low as 1ms
(1000HZ) if people request timers with that level of granularity. On the
fly.

So, a changing tick *can* be done. If Linux does the same thing, seems like
everyone is happy. What are the obstacles to this for Linux? If code is
based on the assumption of a constant timer tick, I humbly assert that the
code is broken.

Regards -- Andy

2002-07-11 02:58:41

by Jeff Garzik

Subject: Re: HZ, preferably as small as possible

Grover, Andrew wrote:
> So, a changing tick *can* be done. If Linux does the same thing, seems like
> everyone is happy. What are the obstacles to this for Linux? If code is
> based on the assumption of a constant timer tick, I humbly assert that the
> code is broken.

Unfortunately code in Linux has traditionally compiled in a constant HZ
all over the place, and jiffies instead of real time units are at the
heart of all Linux timer-related activities.

I don't see that making 'HZ' a variable is really an option, because
many drivers and scheduler-related code will be wildly inaccurate as
soon as HZ actually changes values.

So that leaves us with the option of changing all the code related to
waiting to be based on msecs and usecs. Which I would love to do, but
that's a lot of work, both code- and audit-wise.
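
As a sketch of what "based on msecs" could look like: callers state their wait in milliseconds and a single helper converts to ticks. The helper name and signature are assumptions for illustration, with the tick rate passed in so the example stands alone.

```c
#include <assert.h>

/*
 * Hypothetical conversion helper: callers state waits in milliseconds
 * and only this function knows about the tick rate. Rounds up so a
 * wait is never shorter than requested.
 */
static unsigned long msecs_to_ticks(unsigned long ms, unsigned long hz)
{
    assert(hz > 0);
    assert(ms <= (~0ul - 999) / hz);   /* keep ms * hz + 999 from overflowing */
    return (ms * hz + 999) / 1000;
}
```

A driver asking for 50 ms would get 5 ticks at HZ=100 and 50 ticks at HZ=1000 without the driver source mentioning HZ at all. The audit cost mentioned above is finding every place that currently hard-codes a jiffies count.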

Jeff

2002-07-11 05:57:00

by Hannu Savolainen

Subject: Re: HZ, preferably as small as possible

Hi,

IMHO the easiest solution is just making HZ selectable (100 or 1000 or
maybe 1024) when configuring the kernel. Also there has to be a variable
that exports the configured HZ value to modules. In that way users can
select HZ depending on their needs.

There are users who don't use power management. Instead they need higher
HZ for various reasons. Kernels compiled with HZ=1000 have been used
successfully since year 0 without any major problems. Making HZ
configurable just makes life easier for such users.

OTOH the higher wakeup rate during low power states can be cured by
temporarily lowering the hw clock rate from 1000 to 100. The timer
interrupt handler just increases jiffies by 10 (instead of 1). All code
compiled with HZ=1000 still works but there may be latency problems during
low power states.
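
A toy model of this idea, purely for illustration (the function is hypothetical and only simulates the counter): the handler adds more than one jiffy per interrupt when the hardware rate has been lowered.

```c
#include <assert.h>

/*
 * Hypothetical model of the tick counter, not actual kernel code.
 * Returns the jiffies value after the given number of timer interrupts
 * when each interrupt advances the counter by jiffies_per_interrupt.
 */
static unsigned long jiffies_after(unsigned int interrupts,
                                   unsigned int jiffies_per_interrupt)
{
    unsigned long jiffies = 0;
    unsigned int i;

    assert(jiffies_per_interrupt > 0);
    for (i = 0; i < interrupts; i++)
        jiffies += jiffies_per_interrupt;   /* body of the timer handler */
    return jiffies;
}
```

One second at the full rate (1000 interrupts, 1 jiffy each) and one second at the reduced rate (100 interrupts, 10 jiffies each) both leave the counter at 1000. Code compiled with HZ=1000 sees the same time base, only in 10 ms steps.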

On Wed, 10 Jul 2002, george anzinger wrote:

> "Grover, Andrew" wrote:
> >
> > I'd like to see HZ closer to 100 than 1000, for CPU power reasons. Processor
> > power states like C3 may take 100 microseconds+ to enter/leave - time when
> > the CPU isn't doing any work, but is still drawing power as if it were. We
> > pop out of C3 whenever there is an interrupt, so reducing timer interrupts
> > is good from a power standpoint by amortizing the transition penalty over a
> > longer period of power savings.
> >
> > But on the other hand, increasing HZ has perf/latency benefits, yes? Have
> > these been quantified? I'd either like to see a HZ that has balanced
> > power/performance, or could we perhaps detect we are on a system that cares
> > about power (aka a laptop) and tweak its value at runtime?
>
> HZ is used in a LOT of places. I suspect "tweaking" at run
> time would be a bit difficult.
This is not a problem at all. Just define HZ as:

extern int system_hz;
#define HZ system_hz

After that all code will use variable HZ. Changing HZ on the fly will be
dangerous. However, HZ can be made a boot-time (LILO) parameter.
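
A user-space sketch of the boot-parameter idea, assuming a numeric option spelled hz=<n>. The option spelling, the default of 100, the accepted values, and parse_hz_option are all invented here; only the two-line HZ definition comes from the text above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

int system_hz = 100;            /* hypothetical default */
#define HZ system_hz

/*
 * Look for a "hz=<n>" word on a boot command line and apply it if it
 * is one of the rates discussed (100, 1000, 1024). Anything else
 * leaves system_hz untouched. Returns the resulting value.
 */
static int parse_hz_option(const char *cmdline)
{
    const char *p = cmdline;

    assert(cmdline != NULL);
    while ((p = strstr(p, "hz=")) != NULL) {
        if (p == cmdline || p[-1] == ' ') {     /* start of a word */
            long v = strtol(p + 3, NULL, 10);
            if (v == 100 || v == 1000 || v == 1024)
                system_hz = (int)v;
            break;
        }
        p += 3;                                 /* e.g. "foohz=", keep looking */
    }
    return system_hz;
}
```

Parsing once at boot and never again sidesteps the danger of changing HZ on the fly, at the cost that every use of HZ becomes a memory load instead of a constant.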

> The high-res-timers patch gives high-resolution timers
> without changing HZ. Interrupts are scheduled as needed,
> between the 1/HZ ticks, so a quiet system will have few (if
> any) interrupts between the ticks.
>
> --
> George Anzinger [email protected]
> High-res-timers:
> http://sourceforge.net/projects/high-res-timers/
> Real time sched: http://sourceforge.net/projects/rtsched/
> Preemption patch:
> http://www.kernel.org/pub/linux/kernel/people/rml
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

Best regards,

Hannu
-----
Hannu Savolainen ([email protected])
http://www.opensound.com (Open Sound System (OSS))
http://www.compusonic.fi (Finnish OSS pages)

2002-07-11 07:07:37

Subject: Re: HZ, preferably as small as possible

"Grover, Andrew" wrote:
>
> > From: CaT [mailto:[email protected]]
> > On Wed, Jul 10, 2002 at 05:42:51PM -0400, Benjamin LaHaise wrote:
> > > On Wed, Jul 10, 2002 at 02:38:32PM -0700, Andrew Morton wrote:
> > > > OK, I'll grant that. Why is this useful?
> > >
> > > Think video playback, where you want to queue the frame to
> > be played as
> > > close to the correct 1/60s time as possible. With HZ=100,
> > the code will
> >
> > Or 1/50 (think PAL), no? (Of course HZ=100 would be sweet for that. ;)
>
> I don't know if I should mention this, but...
>
> Win2k's default timer tick is 10ms (i.e. 100HZ) but it will go as low as 1ms
> (1000HZ) if people request timers with that level of granularity. On the
> fly.

This is what the high-res-timers patch does. It always does
the 1/HZ tick, but if a timer is requested with finer
granularity (resolution) an interrupt is scheduled to take
care of it. Check it out. You will find it here:
http://sourceforge.net/projects/high-res-timers/
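
The scheme described here can be pictured with a small sketch. This is not code from the high-res-timers patch; the function name, the microsecond units, and the convention that 0 means "no timer pending" are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: time of the next timer interrupt, in microseconds.
 * The regular 1/hz tick always fires; an extra interrupt is programmed
 * only if a pending timer expires before the next tick.
 * earliest_timer_us == 0 means no timer is pending.
 */
static uint64_t next_interrupt_us(uint64_t now_us, uint32_t hz,
                                  uint64_t earliest_timer_us)
{
    uint64_t tick_us, next_tick_us;

    assert(hz > 0 && hz <= 1000000u);
    tick_us = 1000000u / hz;
    next_tick_us = (now_us / tick_us + 1) * tick_us;
    if (earliest_timer_us > now_us && earliest_timer_us < next_tick_us)
        return earliest_timer_us;       /* sub-tick interrupt needed */
    return next_tick_us;                /* plain 1/HZ tick is enough */
}
```

At HZ=100 and now = 12000 us, a timer due at 13500 us gets its own interrupt, while a timer due at 45000 us (or no timer at all) leaves only the regular tick at 20000 us.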
>
> So, a changing tick *can* be done. If Linux does the same thing, seems like
> everyone is happy. What are the obstacles to this for Linux? If code is
> based on the assumption of a constant timer tick, I humbly assert that the
> code is broken.
>
> Regards -- Andy
> -

--
George Anzinger [email protected]
High-res-timers:
http://sourceforge.net/projects/high-res-timers/
Real time sched: http://sourceforge.net/projects/rtsched/
Preemption patch:
http://www.kernel.org/pub/linux/kernel/people/rml

2002-07-11 07:13:04

Subject: Re: HZ, preferably as small as possible

Hannu Savolainen wrote:
>
> Hi,
>
> IMHO the easiest solution is just making HZ selectable (100 or 1000 or
> maybe 1024) when configuring the kernel. Also there has to be a variable
> that exports the configured HZ value to modules. In that way users can
> select HZ depending on their needs.
>
> There are users who don't use power management. Instead they need higher
> HZ for various reasons. Kernels compiled with HZ=1000 have been used
> successfully since year 0 without any major problems. Making HZ
> configurable just makes life easier for such users.
>
> OTOH the higher wakeup rate during low power states can be cured by
> temporarily lowering the hw clock rate from 1000 to 100. The timer
> interrupt handler just increases jiffies by 10 (instead of 1). All code
> compiled with HZ=1000 still works but there may be latency problems during
> low power states.
>
> On Wed, 10 Jul 2002, george anzinger wrote:
>
> > "Grover, Andrew" wrote:
> > >
> > > I'd like to see HZ closer to 100 than 1000, for CPU power reasons. Processor
> > > power states like C3 may take 100 microseconds+ to enter/leave - time when
> > > the CPU isn't doing any work, but is still drawing power as if it were. We
> > > pop out of C3 whenever there is an interrupt, so reducing timer interrupts
> > > is good from a power standpoint by amortizing the transition penalty over a
> > > longer period of power savings.
> > >
> > > But on the other hand, increasing HZ has perf/latency benefits, yes? Have
> > > these been quantified? I'd either like to see a HZ that has balanced
> > > power/performance, or could we perhaps detect we are on a system that cares
> > > about power (aka a laptop) and tweak its value at runtime?
> >
> > HZ is used in a LOT of places. I suspect "tweaking" at run
> > time would be a bit difficult.
> This is not a problem at all. Just define HZ as:
>
> extern int system_hz;
> #define HZ system_hz
>
> After that all code will use variable HZ. Changing HZ on the fly will be
> dangerous. However, HZ can be made a boot-time (LILO) parameter.

This is not really advisable. A good deal of the timer
code depends on HZ being a constant so that calculations are
done at compile time. A lot of this code would be
measurably slower if these calculations were required at run
time. For example, often a divide is used with the
understanding that it will be done at compile time, not run
time.
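
A small illustration of the difference, using made-up names (HZ_CONST, usecs_per_tick_const, usecs_per_tick_var) rather than anything from the kernel source:

```c
#include <assert.h>

#define HZ_CONST 1000           /* the traditional compile-time constant */
int system_hz = 1000;           /* the proposed run-time variable */

/* Only legal because 1000000 / HZ_CONST is a constant expression. */
enum { USECS_PER_TICK = 1000000 / HZ_CONST };

static unsigned long usecs_per_tick_const(void)
{
    return 1000000ul / HZ_CONST;    /* can be computed at compile time */
}

static unsigned long usecs_per_tick_var(void)
{
    assert(system_hz > 0);
    /* system_hz is an ordinary global, so this generally needs a
     * divide instruction every time it runs. */
    return 1000000ul / (unsigned long)system_hz;
}
```

Both functions return the same value, but only the first form can also be used where the language demands a constant, such as the enum above, an array size, or a static initializer.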

-g

>
> > The high-res-timers patch gives high-resolution timers
> > without changing HZ. Interrupts are scheduled as needed,
> > between the 1/HZ ticks, so a quiet system will have few (if
> > any) interrupts between the ticks.
> >
> > --
> > George Anzinger [email protected]
> > High-res-timers:
> > http://sourceforge.net/projects/high-res-timers/
> > Real time sched: http://sourceforge.net/projects/rtsched/
> > Preemption patch:
> > http://www.kernel.org/pub/linux/kernel/people/rml
> >
>
> Best regards,
>
> Hannu
> -----
> Hannu Savolainen ([email protected])
> http://www.opensound.com (Open Sound System (OSS))
> http://www.compusonic.fi (Finnish OSS pages)

--
George Anzinger [email protected]
High-res-timers:
http://sourceforge.net/projects/high-res-timers/
Real time sched: http://sourceforge.net/projects/rtsched/
Preemption patch:
http://www.kernel.org/pub/linux/kernel/people/rml

2002-07-11 11:33:20

by Kasper Dupont

Subject: Re: HZ, preferably as small as possible

Lincoln Dale wrote:
>
> (or a highly-accurate single-fire timer)?

That would be my preference, at least on hardware where it can
be done efficiently and accurately.

The x86 PIT can be programmed in one-shot mode, but the delay
cannot be programmed to be more than approximately 55msec. For
longer delays we'd have to get interrupted prematurely just
to reprogram the PIT for another delay. This is of course no
worse than an interrupt every 1 or 10 msec we actually don't
need.
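
The 55 msec figure follows from the PIT's 16-bit counter and its 1193180 Hz input clock (the 1/1193180 sec unit mentioned further down). A minimal sketch of that arithmetic, with hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

#define PIT_HZ        1193180u   /* PIT input clock in Hz */
#define PIT_MAX_COUNT 65536u     /* largest count of the 16-bit counter */

/* Longest delay a single one-shot can cover, in microseconds. */
static uint32_t pit_max_delay_us(void)
{
    return (uint32_t)((uint64_t)PIT_MAX_COUNT * 1000000u / PIT_HZ);
}

/* Number of one-shot programmings needed to cover delay_us. */
static uint32_t pit_shots_needed(uint64_t delay_us)
{
    uint64_t counts;

    assert(delay_us <= 3600ull * 1000000u);     /* keep the product in range */
    counts = delay_us * PIT_HZ / 1000000u;
    if (counts == 0)
        counts = 1;
    return (uint32_t)((counts + PIT_MAX_COUNT - 1) / PIT_MAX_COUNT);
}
```

pit_max_delay_us() comes out at 54925 us. A one-second sleep then takes 19 reprogrammings, against 100 or 1000 periodic ticks over the same second.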

Another problem is that a PIT in one-shot mode cannot measure
time accurately. Each interrupt will arrive slightly off the
wanted time. For the interrupt itself this is no big deal, but
for measuring time the errors will accumulate, so you'd see a
clock drifting beyond anything acceptable.

The answer here is that we need something else for measuring
time, I guess the TSC would be appropriate. If doing all clock
measurements using the TSC the clock would no longer drift in
case of lost timer interrupts. The TSC frequency can be
measured at boot time, and if done smart enough that variable
can be made into a knob that ntpd can control to adjust the
clock speed instead of a jumping clock once in a while. If we
are smart enough we can get walltime more accurate than it has
ever been seen before. :-)
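
To illustrate why a counter-based clock does not drift when interrupts are lost, here is a sketch comparing the two ways of keeping time. The function names and the 500 MHz counter frequency used in the note below are assumptions for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Clock kept by counting handled timer interrupts. */
static uint64_t tick_clock_us(uint64_t interrupts_handled, uint32_t hz)
{
    assert(hz > 0);
    return interrupts_handled * 1000000u / hz;
}

/* Clock derived from a free-running cycle counter such as the TSC. */
static uint64_t counter_clock_us(uint64_t cycles_elapsed, uint32_t counter_khz)
{
    assert(counter_khz > 0);
    return cycles_elapsed * 1000u / counter_khz;
}
```

If 3 of 1000 interrupts are lost at HZ=1000, the tick-counting clock reads 997000 us after one real second. A 500 MHz counter that advanced 500000000 cycles in the same second still reads 1000000 us, and adjusting counter_khz is the kind of knob ntpd could turn.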

The problems remaining now are:
1) Reprogramming the PIT is slow and inaccurate, we'd like
better hardware for producing timer interrupts. (I think I
read somewhere that an APIC could help us here.)
2) We will be measuring time in a lot of different units,
which need to be converted. The PIT using 1/1193180 sec,
the TSC using a varying unit, and finally the user/kernel
interface using secs, msecs, usecs, nsecs.
3) On SMP hardware we will be using different TSCs on
different CPUs. Having TSCs in sync might get more important
than on current kernels.
4) We are introducing new hardware requirements.

I'd like to see oneshot timer interrupts as a compile time
option on any architecture that is capable of doing it. But of
course it is not easy.

Have I missed something somewhere?

--
Kasper Dupont -- who spends too much time on usenet.
For sending spam use mailto:[email protected]

2002-07-11 12:04:51

by Alan

Subject: Re: HZ, preferably as small as possible

> I'd like to see oneshot timer interrupts as a compile time
> option on any architecture that is capable of doing it. But of
> course it is not easy.
>
> Have I missed something somewhere?

The APIC on modern systems has decent timers. There may also be ACPI timers
we can use on ACPI capable systems.

2002-07-11 12:19:23

by Alan

Subject: Re: HZ, preferably as small as possible

> Grover, Andrew wrote:
> > So, a changing tick *can* be done. If Linux does the same thing, seems like
> > everyone is happy. What are the obstacles to this for Linux? If code is
> > based on the assumption of a constant timer tick, I humbly assert that the
> > code is broken.
>
> I don't see that making 'HZ' a variable is really an option, because
> many drivers and scheduler-related code will be wildly inaccurate as
> soon as HZ actually changes values.

HZ never changes value. HZ is the top granularity we choose to operate at.

2002-07-11 12:51:50

Subject: Re: HZ, preferably as small as possible

Hi,

On Thu, 11 Jul 2002, Hannu Savolainen wrote:
> This is not a problem at all. Just define HZ as:
>
> extern int system_hz;
> #define HZ system_hz
>
> After that all code will use variable HZ. Changing HZ on the fly will be
> dangerous. However, HZ can be made a boot-time (LILO) parameter.

OK, that's probably a start. As the next step, I'd recommend that the
maintainers and their supporters try to replace the static HZ with
possibly-dynamic system_hz. The third step would be to have guys like Ingo
to tune system_hz to be really dynamic.

Cool idea, anyway.

Regards,
Thunder
--
(Use http://www.ebb.org/ungeek if you can't decode)
------BEGIN GEEK CODE BLOCK------
Version: 3.12
GCS/E/G/S/AT d- s++:-- a? C++$ ULAVHI++++$ P++$ L++++(+++++)$ E W-$
N--- o? K? w-- O- M V$ PS+ PE- Y- PGP+ t+ 5+ X+ R- !tv b++ DI? !D G
e++++ h* r--- y-
------END GEEK CODE BLOCK------

2002-07-11 13:35:32

by Kasper Dupont

Subject: Re: HZ, preferably as small as possible

Alan Cox wrote:
>
> > I'd like to see oneshot timer interrupts as a compile time
> > option on any architecture that is capable of doing it. But of
> > course it is not easy.
> >
> > Have I missed something somewhere?
>
> The APIC on modern systems has decent timers. There may also be ACPI timers
> we can use on ACPI capable systems.

In what units do they measure time? It would be nice if
they were guaranteed to match the TSC frequency or some
other of the units already being used.

--
Kasper Dupont -- who spends too much time on usenet.
For sending spam use mailto:[email protected]

2002-07-11 13:39:57

by Mark Mielke

Subject: Whoa... (was: Re: HZ, preferably as small as possible)

On Wed, Jul 10, 2002 at 04:09:21PM -0600, Cort Dougan wrote:
> Yes, please do make it a config option. 10x interrupt overhead makes me
> worry. It lets users tailor the kernel to their expected load.

All this talk is getting to me.

I thought we recently (1 month ago? 2 months ago?) concluded that
increases in interrupt frequency only affect performance by a very
small amount, but generate an increase in responsiveness. The only
real argument against that I have seen, is the 'power conservation'
argument. The idea was, that the scheduler itself did not execute
on most interrupts. The clock is updated, and that is about all.

I can invent a reason as to why throughput increases, from user space.
The hard drive sends data to the kernel, the kernel handles the
hardware interrupt, grabs the buffer, and returns control to the
active process/thread. It may be some time until the process/thread
that is *reading* the data gets scheduled. Any reduction in the
average time a process/thread will be scheduled to execute, results in
increased throughput.

mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2002-07-11 15:20:52

by Alan

Subject: Re: HZ, preferably as small as possible

> > The APIC on modern systems has decent timers. There may also be ACPI timers
> > we can use on ACPI capable systems.
>
> In what units do they measure time? It would be nice if
> they were guaranteed to match the TSC frequency or some
> other of the units already being used.

It really doesn't matter providing the resolution is decent. Conversion
between formats of time is a maths operation, and we can handle those 8)

2002-07-11 15:57:40

Subject: Re: HZ, preferably as small as possible

Thunder from the hill wrote:
> Hi,
>
> On Thu, 11 Jul 2002, Hannu Savolainen wrote:
>
>>This is not a problem at all. Just define HZ as:
>>
>>extern int system_hz;
>>#define HZ system_hz
>>
>>After that all code will use variable HZ. Changing HZ on fly will be
>>dangerous. However HZ can be made a boot time (LILO) parameter.
>
>
> OK, that's probably a start. As the next step, I'd recommend that the
> maintainers and their supporters try to replace the static HZ with
> possibly-dynamic system_hz. The third step would be to have guys like Ingo
> tune system_hz to be really dynamic.
>
> Cool idea, anyway.

Just remember, please, to map it to /proc/sys/kernel/xxx
So we could implement the following properly:

_SC_CLK_TCK CLK_TCK Ticks per second (clock_t)

(Taken from the Solaris specs.)

Unless of course we stick to the fact that HZ exposed
to user land remains an arch-specific constant as in 2.5.25, which
I think is the preferable solution.

Pity is that the RedHat beta does mess with this! The 2.5.25 solution from
Linus is far better.

2002-07-11 16:58:45

Subject: Re: HZ, preferably as small as possible

Benjamin LaHaise wrote:
> On Wed, Jul 10, 2002 at 02:28:03PM -0700, Andrew Morton wrote:
>
>>>But on the other hand, increasing HZ has perf/latency benefits, yes? Have
>>>these been quantified?
>>
>>Not that I'm aware of. And I'd regard any such claims with some
>>scepticism.
>
>
> The most obvious one is the reduced latency of select/poll timeouts.

Which you can actually see if running x11perf or simple ico.

2002-07-11 17:05:44

Subject: Re: HZ, preferably as small as possible

Jeff Garzik wrote:
> Grover, Andrew wrote:
>
>> So, a changing tick *can* be done. If Linux does the same thing, seems
>> like
>> everyone is happy. What are the obstacles to this for Linux? If code is
>> based on the assumption of a constant timer tick, I humbly assert that
>> the
>> code is broken.
>
>
> Unfortunately code in Linux has traditionally compiled in a constant HZ
> all over the place, and jiffies instead of real time units are at the
> heart of all Linux timer-related activities.
>
> I don't see that making 'HZ' a variable is really an option, because
> many drivers and scheduler-related code will be wildly inaccurate as
> soon as HZ actually changes values.
>
> So that leaves us with the option of changing all the code related to
> waiting to be based on msecs and usecs. Which I would love to do, but
> that's a lot of work, both code- and audit-wise.

vmstat.c:

hz = sysconf(_SC_CLK_TCK); /* get ticks/s from system */

And yes I know the libproc is *evil* in this area.
The rest should be an implementation detail of sysconf().
Changing this value during the runtime of vmstat is an interesting story
anyway, but it should be up to the sysadmin to do this kind
of stuff only at runlevel 1.
sysconf can indeed be implemented as a single global
file containing configuration data. But sysctl is another story
of course.
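
For reference, a minimal sketch of the user-space pattern under discussion: ask sysconf() for the tick rate at run time instead of compiling in a constant, then use it to scale the clock_t values that times() returns. sysconf(_SC_CLK_TCK) and times() are standard POSIX calls; the helper names are made up for illustration and are not taken from vmstat or libproc.

```c
#include <sys/times.h>
#include <unistd.h>

/* Ticks per second of the clock_t values reported to user space.
 * Returns -1 if the value cannot be queried. */
long user_ticks_per_second(void)
{
    return sysconf(_SC_CLK_TCK);
}

/* Convert a clock_t tick count (as filled in by times()) to seconds. */
double ticks_to_seconds(clock_t ticks, long hz)
{
    return (double)ticks / (double)hz;
}

/* User-mode CPU time consumed so far by the calling process,
 * in seconds, or -1.0 on error. */
double process_user_seconds(void)
{
    struct tms t;
    long hz = user_ticks_per_second();

    if (hz <= 0 || times(&t) == (clock_t)-1)
        return -1.0;
    return ticks_to_seconds(t.tms_utime, hz);
}
```

A program written this way never needs to know the kernel's internal HZ; it only depends on whatever rate the C library reports for clock_t.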

2002-07-11 18:49:05

Subject: Re: HZ, preferably as small as possible

Kasper Dupont wrote:
>
> Lincoln Dale wrote:
> >
> > (or a highly-accurate single-fire timer)?
>
> That would be my preference, at least on hardware where it can
> be done efficiently and accurately.
>
> The x86 PIT can be programmed in one-shot mode, but the delay
> cannot be programmed to be more than approximately 55msec. For
> longer delays we'd have to get interrupted prematurely just
> to reprogram the PIT for another delay. This is of course no
> worse than an interrupt every 1 or 10 msec we actually don't
> need.
>
> Another problem is that a PIT in one shot mode cannot measure
> time accurately. Each interrupt will arrive slightly off the
> wanted time. For the interrupt itself this is no big deal, but
> for measuring time they will accumulate, so you'd see a clock
> drifting beyond anything acceptable.
>
> The answer here is that we need something else for measuring
> time, I guess the TSC would be appropriate. If doing all clock
> measurements using the TSC the clock would no longer drift in
> case of lost timer interrupts. The TSC frequency can be
> measured at boot time, and if done smart enough that variable
> can be made into a knob that ntpd can control to adjust the
> clock speed instead of a jumping clock once in a while. If we
> are smart enough we can get walltime more accurate than it has
> ever been seen before. :-)

The high-res-timers patch does most of this (all but the ntp
knob). It allows you to use either the TSC or the ACPI pm
timer to keep clock time. The former is fast, but some
systems are known to "mess" with the TSC as part of power
management. The pm timer, being I/O, takes more time to
read, but is not "messed" with.
>
> The problems remaining now are:
> 1) Reprogramming the PIT is slow and inaccurate, we'd like
> better hardware for producing timer interrupts. (I think I
> read somewhere that an APIC could help us here.)

Actually the "best" option would be something like the
decrementer in the PPC. It can be set to generate an
interrupt at just about any time. Another HW register
(64-bits) keeps track of (effectively) decrementer clocks
since boot and can be used as the clock source. The best
solution in the x86 platform, would be an additional
register that either counts down at TSC speed to an
interrupt OR compares to the TSC and interrupts on compare.
It should be a cpu register to avoid the latencies of
accessing an I/O register.

> 2) We will be measuring time in a lot of different units,
> which needs to be converted. The PIT using 1/1193180 sec,
> the TSC using a varying unit, and finally the user/kernel
> interface using secs, msecs, usecs, nsecs.

Not really a big problem. The conversion constants are
computed once (or at ntp correction) and from then on all
one does is mpy and shift instructions to do the
conversion. (Again, see the HRT patch.)
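
A minimal sketch of the multiply-and-shift idea, not the HRT patch's actual code: the scale factor is computed once from the counter frequency, and each later conversion from counter cycles to nanoseconds is one multiply and one shift. The names cyc2ns_init and cyc2ns and the choice of shift are assumptions made for illustration.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Precomputed scale: ns = (cycles * mult) >> shift */
struct cyc2ns {
    uint32_t mult;
    uint32_t shift;
};

/* Compute the constants once (or again after a frequency correction).
 * The caller must pick a shift small enough that mult fits in 32 bits. */
struct cyc2ns cyc2ns_init(uint32_t freq_hz, uint32_t shift)
{
    struct cyc2ns c;

    c.shift = shift;
    c.mult = (uint32_t)((NSEC_PER_SEC << shift) / freq_hz);
    return c;
}

/* Convert a cycle count to nanoseconds without dividing. */
uint64_t cyc2ns(const struct cyc2ns *c, uint64_t cycles)
{
    return (cycles * c->mult) >> c->shift;
}
```

With the PIT's 1193180 Hz and a shift of 22, cyc2ns() maps 1193180 cycles to within a few nanoseconds of one second. At that scale the 64-bit product overflows for cycle counts beyond roughly 2^32, so a real implementation would only convert short deltas this way.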

> 3) On SMP hardware we will be using different TSCs on
> different CPUs. Having TSCs in sync might get more important
> than on current kernels.
> 4) We are introducing new hardware requirements.
>
> I'd like to see oneshot timer interrupts as a compile time
> option on any architecture that is capable of doing it. But of
> course it is not easy.

As I imply above, the one shot, if done as an I/O device, is
less than optimal. Better is the PPC decrementer.
>
> Have I missed something somewhere?
>

--
George Anzinger [email protected]
High-res-timers:
http://sourceforge.net/projects/high-res-timers/
Real time sched: http://sourceforge.net/projects/rtsched/
Preemption patch:
http://www.kernel.org/pub/linux/kernel/people/rml

2002-07-11 19:20:09

Subject: Re: HZ, preferably as small as possible

Martin Dalecki writes:
> Jeff Garzik wrote:

>> I don't see that making 'HZ' a variable is really an option, because
>> many drivers and scheduler-related code will be wildly inaccurate as
>> soon as HZ actually changes values.

Definitely:
my_timeout = foo*HZ;

>> So that leaves us with the option of changing all the code related to
>> waiting to be based on msecs and usecs. Which I would love to do, but
>> that's a lot of work, both code- and audit-wise.
>
> vmstat.c:
>
> hz = sysconf(_SC_CLK_TCK); /* get ticks/s from system */

Oops! Sorry I missed that one. Not that it matters for
the 2.5.25 kernel and above, but that code really should
be using the Hertz value supplied by libproc.

> And yes I know the libproc is *evil* in this area.

Hell yes. It's going to remain evil until the 2.4 kernel
is a distant memory. Debian uses a 2.2 kernel in the
upcoming release, so it will be a good long time until
everyone is using a 2.6 kernel. When 2.8 comes out,
Debian will finally stop using 2.4 and I can get rid of
my evil hack.

Hey, I asked for a clean way to get HZ. I didn't even
get "send a patch"; I got BS about the 2.5.25 behavior
being standard, as if it had already been implemented.

> The rest should be an implementation detail of sysconf().

That's broken. It can't even correctly report the
number of processors you have.

2002-07-11 20:37:44

by Bill Davidsen

Subject: Re: HZ, preferably as small as possible

On Thu, 11 Jul 2002, Martin Dalecki wrote:

> vmstat.c:
>
> hz = sysconf(_SC_CLK_TCK); /* get ticks/s from system */
>
> And yes I know the libproc is *evil* in this area.
> The rest should be an implementation detail of sysconf().

Yes, any of the changes need to make the dynamic value available to
programs. Alas, too many programs grab the HZ value and compile it in, and
don't work right on a kernel with a modified rate. I don't know if the
CLK_TCK macro is dynamic or not, I sure hope so.

I'd like to see it set at boot time, and available in /proc/sys for easy
use by scripts. As noted by others, there are a lot of uses in the kernel
source which assume that arithmetic will happen at compile time, and even
if you ignore the overhead it would take a lot of rewriting to make it
dynamic. Setting it at boot time gets most of the gain and none of the
pain (boot time = pick a kernel, not a parameter).

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2002-07-11 21:04:13

Subject: Re: Whoa... (was: Re: HZ, preferably as small as possible)

On Thursday 11 July 2002 15:36, Mark Mielke wrote:
> On Wed, Jul 10, 2002 at 04:09:21PM -0600, Cort Dougan wrote:
> > Yes, please do make it a config option. 10x interrupt overhead makes me
> > worry. It lets users tailor the kernel to their expected load.
>
> All this talk is getting to me.
>
> I thought we recently (1 month ago? 2 months ago?) concluded that
> increases in interrupt frequency only affect performance by a very
> small amount, but generates an increase in responsiveness. The only
> real argument against that I have seen, is the 'power conservation'
> argument. The idea was, that the scheduler itself did not execute
> on most interrupts. The clock is updated, and that is about all.
>
> I can invent a reason as to why throughput increases, from user space.
> The hard drive sends data to the kernel, the kernel handles the
> hardware interrupt, grabs the buffer, and returns control to the
> active process/thread. It may be some time until the process/thread
> that is *reading* the data gets scheduled. Any reduction in the
> average time a process/thread will be scheduled to execute, results in
> increased throughput.

Yes, it's the same reason that -preempt leads, counterintuitively,
to better throughput under parallel loads. Contrary to popular
wisdom, lower latency and higher throughput are not always mutually
exclusive.

Anyway, a 1 ms timer interrupt is still a snail's pace by the
standards of today's processors; it's silly to worry about it. If
somebody wants a cruder scheduling interval than the raw timer
interrupt, that's child's play, just step the interval down. The
only slightly challenging thing is to do that without restricting
choice of rate for the raw timer and scheduler, respectively. Here,
a novel application of Bresenham's algorithm (the line drawing
algorithm) works nicely: at each raw interrupt, subtract the period
of the raw interrupt from an accumulator; if the result is less
than zero, add the period of the scheduler to the accumulator and
drop into the scheduler's part of the timer interrupt.

This Bresenham trick works for arbitrary collections of interrupt
rates, all with different periods. It has the property that,
over time, the total number of invocations at each rate remains
*exactly* correct, and so long as the raw interrupt runs at a
reasonably high rate, displacement isn't that bad either.
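
The accumulator scheme described above can be sketched as follows. This is only an illustration under stated assumptions: the struct and function names are hypothetical, both periods are given in the same unit (microseconds, say), and the raw period is assumed to be no longer than the scheduler period.

```c
/* One divider per derived rate; all share the same raw interrupt. */
struct rate_divider {
    long acc;          /* time left until the next scheduler tick */
    long raw_period;   /* period of the raw timer interrupt */
    long sched_period; /* desired period of the scheduler tick */
};

void rate_divider_init(struct rate_divider *d, long raw_period,
                       long sched_period)
{
    d->acc = 0;
    d->raw_period = raw_period;
    d->sched_period = sched_period;
}

/* Call once per raw interrupt. Returns 1 when the scheduler's part of
 * the timer interrupt should run on this interrupt, 0 otherwise. */
int rate_divider_tick(struct rate_divider *d)
{
    d->acc -= d->raw_period;
    if (d->acc < 0) {
        d->acc += d->sched_period;
        return 1;
    }
    return 0;
}

/* Helper for experiments: how many scheduler ticks fire during
 * n raw interrupts? */
long count_scheduler_ticks(long raw_period, long sched_period, long n)
{
    struct rate_divider d;
    long i, fired = 0;

    rate_divider_init(&d, raw_period, sched_period);
    for (i = 0; i < n; i++)
        fired += rate_divider_tick(&d);
    return fired;
}
```

Because the remainder is carried in the accumulator instead of being discarded, the ratio of the two periods does not have to be an integer: with a 1000 us raw period and a 2500 us scheduler period, 1000 raw interrupts produce 400 scheduler ticks.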

--
Daniel

2002-07-12 00:40:50

by Stevie-O

Subject: Re: HZ, preferably as small as possible

At <time> <date>, <user> [<email>] wrote:
> <stuff>

A lot of people are talking about how HZ needs to be a constant, etc.

I don't do much kernel hacking, so allow me to post a query that would (probably) better belong on #kernelnewbies if I wasn't so damn lazy ;) --

Why must HZ be the same as 'interrupts per second'?

--
Stevie-O

Real programmers link their executables by hand.

2002-07-12 00:47:39

Subject: Re: HZ, preferably as small as possible

Hi,

On Thu, 11 Jul 2002, Stevie O wrote:
> Why must HZ be the same as 'interrupts per second'?

s/interrupts/scheduler calls/

But what exactly is this question supposed to mean? I don't fully understand.
We define HZ to have an interval for the calls of the scheduler. That's
why it is the number of scheduler calls per second, because that's what it
was invented to be.

Regards,
Thunder
--
(Use http://www.ebb.org/ungeek if you can't decode)
------BEGIN GEEK CODE BLOCK------
Version: 3.12
GCS/E/G/S/AT d- s++:-- a? C++$ ULAVHI++++$ P++$ L++++(+++++)$ E W-$
N--- o? K? w-- O- M V$ PS+ PE- Y- PGP+ t+ 5+ X+ R- !tv b++ DI? !D G
e++++ h* r--- y-
------END GEEK CODE BLOCK------

2002-07-12 00:52:38

by Robert Love

Subject: Re: HZ, preferably as small as possible

On Thu, 2002-07-11 at 17:50, Thunder from the hill wrote:

> On Thu, 11 Jul 2002, Stevie O wrote:
> > Why must HZ be the same as 'interrupts per second'?
>
> s/interrupts/scheduler calls/

Uh, HZ is not scheduler calls per second.

Neither exactly is it interrupts per second, but _timer_ interrupts per
second. It is the frequency of the timer interrupt.

> But what exactly is this question supposed to mean? I don't fully understand.
> We define HZ to have an interval for the calls of the scheduler. That's
> why it is the number of scheduler calls per second, because that's what it
> was invented to be.

No no no...

Robert Love

2002-07-12 00:55:53

Subject: Re: HZ, preferably as small as possible

Hi,

On 11 Jul 2002, Robert Love wrote:
> Uh, HZ is not scheduler calls per second.

Sorry, I must be sleeping...

It's unnerving to fix the stuff I'm fixing. I can't even concentrate on
reading any more. Sure, you're right.

Regards,
Thunder
--
(Use http://www.ebb.org/ungeek if you can't decode)
------BEGIN GEEK CODE BLOCK------
Version: 3.12
GCS/E/G/S/AT d- s++:-- a? C++$ ULAVHI++++$ P++$ L++++(+++++)$ E W-$
N--- o? K? w-- O- M V$ PS+ PE- Y- PGP+ t+ 5+ X+ R- !tv b++ DI? !D G
e++++ h* r--- y-
------END GEEK CODE BLOCK------

2002-07-12 00:58:09

by Alan

Subject: Re: HZ, preferably as small as possible

> Uh, HZ is not scheduler calls per second.
>
> Neither exactly is it interrupts per second, but _timer_ interrupts per
> second. It is the frequency of the timer interrupt.

It's not exactly that either. It's the 'rate at which jiffies is incremented'.
The distinction is not pedantic; it's rather critical when you go to a
variable timer tick...

Alan

2002-07-12 01:07:13

Subject: Re: HZ, preferably as small as possible

Stevie O wrote:
>
> At <time> <date>, <user> [<email>] wrote:
> > <stuff>
>
> A lot of people are talking about how HZ needs to be a constant, etc.
>
> I don't do much kernel hacking, so allow me to post a query that would (probably) better belong on #kernelnewbies if I wasn't so damn lazy ;) --
>
> Why must HZ be the same as 'interrupts per second'?

Well, in truth it has nothing to do with interrupts. It is
just that that is the way most systems keep time. The REAL
definition of HZ is in its relationship to jiffies and
seconds.

I.e. jiffies * HZ = seconds, by definition.

Then we define interfaces that promise to return so many
jiffies from now and we keep execution time and time slice
times in jiffies. In order to keep these things true, it is
usual to set up some sort of timer to interrupt once each
jiffie. Now we can actually do this two ways. We can say
that the interrupt is a reminder to look at a "reliable
clock" and update the system time with what we find OR we
can use the interrupt to actually drive the system time.
The former is the more accurate way of doing things as it
eliminates interrupt latency. It also allows us to use a
more sloppy source of interrupts since they are just
reminders to check a clock and not actually driving the
clock. This, by the way, is what the high-res-timers patch
does. Doing things this way also allows one to reprogram
the timer interrupt hardware without worrying too much
about losing track of time. The HRT patch does this to
generate interrupts at sub jiffie intervals, but only when
required.

-g
>
> --
> Stevie-O
>
> Real programmers link their executables by hand.
>

--
George Anzinger [email protected]
High-res-timers:
http://sourceforge.net/projects/high-res-timers/
Real time sched: http://sourceforge.net/projects/rtsched/
Preemption patch:
http://www.kernel.org/pub/linux/kernel/people/rml

2002-07-12 01:24:13

by Roland Dreier

Subject: Re: HZ, preferably as small as possible

>>>>> "george" == george anzinger <[email protected]> writes:

george> Well, in truth it has nothing to do with interrupts. It
george> is just that that is the way most systems keep time. The
george> REAL definition of HZ is in its relationship to jiffies
george> and seconds.

george> I.e. jiffies * HZ = seconds, by definition.

I'm sure you know the truth, but this isn't quite right. Just to be
pedantic and make sure the correct definition is out there:

jiffies / HZ = seconds

For example if HZ is 100 then the jiffy counter is incremented 100
times each second.

Best,
Roland

2002-07-12 01:28:48

by Mark Hahn

Subject: Re: HZ, preferably as small as possible

> > > Why must HZ be the same as 'interrupts per second'?
> >
> > s/interrupts/scheduler calls/
>
> Uh, HZ is not scheduler calls per second.
>
> Neither exactly is it interrupts per second, but _timer_ interrupts per
> second. It is the frequency of the timer interrupt.

is there really code which uses HZ which is not merely fiddling with jiffies?
that is, HZ is merely "jiffies per second". there's no reason the timer
(if any!) couldn't run faster than HZ, even at different ratios depending on
power level.

afaict, jiffies has survived because there's a need for a
moderately fast, strictly monotonically increasing clock.
that doesn't imply that the periodic timer needs to run at HZ
or even that such a clock exists (tickless).
just that the kernel promises to update jiffies at HZ,
even if that means HZ is 1M, and goes by jumps of 10K.

2002-07-12 01:40:59

by Stevie-O

Subject: Re: HZ, preferably as small as possible

At 06:09 PM 7/11/2002 -0700, george anzinger wrote:
>> Why must HZ be the same as 'interrupts per second'?
>
>Well, in truth it has nothing to do with interrupts. It is
>just that that is the way most systems keep time. The REAL
>definition of HZ is in its relationship to jiffies and
>seconds.
>
>I.e. jiffies * HZ = seconds, by definition.
>
>Then we define interfaces that promise to return so many
>jiffies from now and we keep execution time and time slice
>times in jiffies. In order to keep these things true, it is
>usual to set up some sort of timer to interrupt once each
>jiffie. Now we can actually do this two ways. We can say
>that the interrupt is a reminder to look at a "reliable
>clock" and update the system time with what we find OR we
>can use the interrupt to actually drive the system time.
>The former is the more accurate way of doing things as it
>eliminates interrupt latency. It also allows us to use a
>more sloppy source of interrupts since they are just
>reminders to check a clock and not actually driving the
>clock. This, by the way, is what the high-res-timers patch
>does. Doing things this way also allows one to reprogram
>the timer interrupt hardware without worrying too much
>about losing track of time. The HRT patch does this to
>generate interrupts at sub jiffie intervals, but only when
>required.

So why not do it this way:

1. Let HZ = 1000.

2. Program PIT (having programmed the PC speaker in DOS, I personally believe Intel forgot the 'A' at the end of the name) to fire every 10ms.

3. void pit_isr(void) { jiffies += 10; do_other_stuff(); }
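
The three steps above can be simulated in user space to check the bookkeeping. This is a sketch only: EXAMPLE_HZ, TIMER_IRQ_RATE and the function names are stand-ins chosen for illustration, and the loop merely plays the role of the hardware timer.

```c
#define EXAMPLE_HZ      1000UL  /* jiffies per second (step 1) */
#define TIMER_IRQ_RATE   100UL  /* timer interrupts per second (step 2) */
#define JIFFIES_PER_IRQ (EXAMPLE_HZ / TIMER_IRQ_RATE)

static unsigned long example_jiffies;

/* Step 3: each interrupt advances jiffies by more than one. */
void example_timer_isr(void)
{
    example_jiffies += JIFFIES_PER_IRQ;
}

/* Pretend the timer fired for the given number of seconds and
 * return the resulting jiffies count. */
unsigned long simulate_seconds(unsigned long seconds)
{
    unsigned long i;

    example_jiffies = 0;
    for (i = 0; i < seconds * TIMER_IRQ_RATE; i++)
        example_timer_isr();
    return example_jiffies;
}
```

The relation jiffies / HZ = seconds still holds, but jiffies only ever takes values that are multiples of 10, so a timeout can only expire on a 10-jiffy boundary. The finer unit changes the arithmetic, not the wakeup granularity.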

--
Stevie-O

Real programmers use COPY CON PROGRAM.EXE

2002-07-12 02:58:58

by Bernd Eckenfels

Subject: Re: HZ, preferably as small as possible

In article <[email protected]> you wrote:
> Why must HZ be the same as 'interrupts per second'?

Well, it need not. But currently on each timer interrupt the tick timestamp is
increased by one. So to find out how many seconds uptime you have (and other
things which are measured in timer ticks and passed to the userspace) you
need to know how many ticks have passed.

Actually there are a few things here. On the one hand, the kernel should not
pass values in ticks to the userspace.

On the other hand having a changing HZ does not work for timespans measured
in those ticks, as long as those are not adjusted. One could think about
having a doze mode where only every 100th interrupt is generated but it
increases the tick count by 100. Most likely this will break a lot of
averaged measuring and stats counting, though.

Greetings
Bernd

2002-07-12 11:59:45

Subject: Re: HZ, preferably as small as possible

Bill Davidsen wrote:
> On Thu, 11 Jul 2002, Martin Dalecki wrote:
>
>
>>vmstat.c:
>>
>>hz = sysconf(_SC_CLK_TCK); /* get ticks/s from system */
>>
>>And yes I know the libproc is *evil* in this area.
>>The rest should be an implementation detail of sysconf().
>
>
> Yes, any of the changes need to make the dynamic value available to
> programs. Alas, too many programs grab the HZ value and compile it in, and
> don't work right on a kernel with a modified rate. I don't know if the
> CLK_TCK macro is dynamic or not, I sure hope so.
>
> I'd like to see it set at boot time, and available in /proc/sys for easy
> use by scripts. As noted by others, there are a lot of uses in the kernel
> source which assume that arithmetic will happen at compile time, and even
> if you ignore the overhead it would take a lot of rewriting to make it
> dynamic. Setting it at boot time gets most of the gain and none of the
> pain (boot time = pick a kernel, not a parameter).
>

IMHO there were reasons why the standards define a function
to access this information from applications.

2002-07-12 18:27:51

Subject: Re: HZ, preferably as small as possible

Roland Dreier wrote:
>
> >>>>> "george" == george anzinger <[email protected]> writes:
>
> george> Well, in truth it has nothing to do with interrupts. It
> george> is just that that is the way most systems keep time. The
> george> REAL definition of HZ is in its relationship to jiffies
> george> and seconds.
>
> george> I.e. jiffies * HZ = seconds, by definition.
>
> I'm sure you know the truth, but this isn't quite right. Just to be
> pedantic and make sure the correct definition is out there:
>
> jiffies / HZ = seconds
>
> For example if HZ is 100 then the jiffy counter is incremented 100
> times each second.
>
Of course you are right. Must have been a brain fart :)

--
George Anzinger [email protected]
High-res-timers:
http://sourceforge.net/projects/high-res-timers/
Real time sched: http://sourceforge.net/projects/rtsched/
Preemption patch:
http://www.kernel.org/pub/linux/kernel/people/rml

2002-07-15 05:07:18

Subject: Re: HZ, preferably as small as possible

In article <[email protected]>,
Grover, Andrew <[email protected]> wrote:
>
>But on the other hand, increasing HZ has perf/latency benefits, yes? Have
>these been quantified?

I've never had good reason to believe the latency/perf benefits myself,
but I was approached at OLS about problems with something as simple as
DVD playing, where a 100Hz timer means that the DVD player ends up
having to busy-loop on gettimeofday() because it cannot sanely sleep due
to the lack of sufficient sleeping granularity.

You apparently end up visibly missing frames - a frame is just 3 timer
ticks at 100 Hz, and considering that the kernel has to round up by one
due to POSIX requirements _and_ considering that you lose roughly one
for actually processing the frame itself, that doesn't sound _that_
outlandish.
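
The arithmetic behind this can be made concrete with a small model. It assumes, as described above, that a sleep request is rounded up to a whole number of ticks and then extended by one more tick; the function name and the frame-rate and processing-time figures used in the note after the code are illustrative assumptions, not measurements.

```c
/* Worst-case length of a sleep, in microseconds, for a timer running
 * at hz ticks per second, under the model "round the request up to
 * whole ticks, then add one tick". */
long worst_case_sleep_us(long request_us, long hz)
{
    long tick_us = 1000000L / hz;
    long ticks = (request_us + tick_us - 1) / tick_us; /* round up */

    return (ticks + 1) * tick_us;                      /* plus one tick */
}
```

Take an assumed 30 frames per second (about 33.3 ms per frame) and an assumed 10 ms spent processing each frame, so the player wants to sleep about 23.3 ms. Under this model the sleep can last up to 40 ms at 100 Hz, which is past the frame deadline, but at most 25 ms at 1000 Hz.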

> I'd either like to see a HZ that has balanced
>power/performance, or could we perhaps detect we are on a system that cares
>about power (aka a laptop) and tweak its value at runtime?

Runtime tweaking is not really an option with the current setup. There
are also divisions etc that really want it to be a compile-time constant
for efficiency.
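
A small sketch of why a compile-time constant is convenient, using stand-in names (EXAMPLE_HZ and example_system_hz are hypothetical and not kernel symbols): with a constant divisor the compiler can avoid a real division, and the value can appear in static initialisers, neither of which holds for a variable.

```c
#define EXAMPLE_HZ 100UL                 /* stand-in for a constant HZ */

unsigned long example_system_hz = 100UL; /* stand-in for a variable HZ */

/* Constant divisor: an optimising compiler can replace the division
 * with a multiply and shift, since the divisor is known in advance. */
unsigned long ticks_to_msecs_const(unsigned long ticks)
{
    return ticks * 1000UL / EXAMPLE_HZ;
}

/* Variable divisor: a genuine division has to happen at run time. */
unsigned long ticks_to_msecs_var(unsigned long ticks)
{
    return ticks * 1000UL / example_system_hz;
}

/* A file-scope initialiser must be a constant expression, so this
 * works with EXAMPLE_HZ but would not compile with example_system_hz. */
static const unsigned long five_second_timeout = 5 * EXAMPLE_HZ;
```

Both conversion functions return the same result; the difference lies in the code the compiler can generate and in where the value may legally be used.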

As noted, even power/performance-wise a higher Hz can actually _help_.
Especially on laptops. Exactly because you actually sanely _can_ afford
to sleep, which you cannot with a 100Hz timer.

So you lose some, you win some, depending on your needs.

There is, of course, the option to do variable frequency (and make it
integer multiples of the exposed "constant HZ" so that kernel code
doesn't actually need to _care_ about the variability). There are
patches to play with things like that.

Linus

2002-07-15 05:16:18

Subject: Re: HZ, preferably as small as possible

In article <[email protected]>,
Bill Davidsen <[email protected]> wrote:
>On Thu, 11 Jul 2002, Martin Dalecki wrote:
>
>> vmstat.c:
>>
>> hz = sysconf(_SC_CLK_TCK); /* get ticks/s from system */
>>
>> And yes I know the libproc is *evil* in this area.
>> The rest should be an implementation detail of sysconf().
>
>Yes, any of the changes need to make the dynamic value available to
>programs.

No they don't.

Have people looked at the 2.5.x patches?

CLK_TCK is 100 on x86. As it has always been. User land should never
care about whatever random value the kernel happens to use for the
actual timer tick at that particular moment. Especially since the kernel
internal timer tick may well be variable some day.

The fact that libproc believes that HZ can change is _their_ problem.
I've told people over and over that user-level HZ is a constant (and, on
x86, that constant is 100), and that won't change.

So in current 2.5.x times() still counts at 100Hz, and /proc files that
export clock_t still show the same 100Hz rate.

The fact that the kernel internally counts at some different rate should
be _totally_ invisible to user programs (except they get better latency
for stuff like select() and other timeouts).

Linus

2002-07-15 06:53:26

Subject: Re: HZ, preferably as small as possible

Linus Torvalds writes:

> The fact that libproc believes that HZ can change is _their_ problem.
> I've told people over and over that user-level HZ is a constant (and, on
> x86, that constant is 100), and that won't change.

Was HZ supposed to be 1024 or 1200 on alpha?
How about arm... 64, 128, or 1000?

Not even counting user-mode-linux at 20 HZ, there were
about _five_ archs in your official kernel source that
indirectly made HZ a config option.

> So in current 2.5.x times() still counts at 100Hz, and /proc files that
> export clock_t still show the same 100Hz rate.

Good. That works for the 2.5 kernel and above, assuming you
did something about alpha, arm, ia64, s390, and mips.

Unfortunately, the hack must remain for another 4 years or so.
Maybe that's not so bad though. I prefer it over this:

#ifdef __386__
#define HZ 100
#endif
#ifdef __IA64__
#define HZ 1024
#endif
#ifdef __ARM__
#define HZ 128 // if they settle on this
#endif
#ifdef __S390__
#define HZ 10
#endif
...

2002-07-15 08:21:24

by Russell King

Subject: Re: HZ, preferably as small as possible

On Mon, Jul 15, 2002 at 02:56:14AM -0400, Albert D. Cahalan wrote:
> Unfortunately, the hack must remain for another 4 years or so.
> Maybe that's not so bad though. I prefer it over this:
>
> #ifdef __386__
> #define HZ 100
> #endif
> #ifdef __IA64__
> #define HZ 1024
> #endif
> #ifdef __ARM__
> #define HZ 128 // if they settle on this

Ehh? It's been 100 on the majority of ARM. If it's different in libproc,
the libproc is broken. One (broken) machine type decided it would be a
good idea to change it to 1000. Since no one has paid any attention
to this machine for some time, its support code will get dropped if
they don't fix it before 2.6.

--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html

2002-07-15 08:56:46

by Dave Mielke

Subject: Re: HZ, preferably as small as possible

[quoted lines by Linus Torvalds on July 15, 2002, at 05:15]

>The fact that the kernel internally counts at some different rate should
>be _totally_ invisible to user programs (except they get better latency
>for stuff like select() and other timeouts).

I believe your position to be right on. May I ask, however, about a quandary
which we have in BRLTTY? We generate short "tunes" via the PC speaker in order
to give a blind user audible clues regarding certain events. To do this, we
need rather precise control over how long each note is on. Due to the current
lack of granularity, we need to do some rather long busy loops. This has worked
out okay, but it'd of course be much better if we could rely on the kernel to
do it, especially on a busy system, if its granularity is good enough. My
quandary is that while I don't believe that user land should know what
granularity the kernel is using, I'd still like to know if we should busy loop
or let the kernel do it depending on whether or not the kernel's granularity is
good enough for our needs. It'd be nice to have a way, therefore, to query two
values at run time, i.e. the granularity that services like select can offer
and the maximum amount of time that nanosleep will do a very accurate short
wait, although I suppose that these abilities could be abused by some.
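
One way a program could probe this at run time, sketched with standard POSIX calls (clock_getres, clock_gettime, nanosleep): read the resolution the system reports for a clock, and time a short sleep to see how long it really takes. This does not report the granularity of select() directly; it is only an empirical check, and the function names are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Resolution reported for CLOCK_MONOTONIC, in nanoseconds,
 * or -1 on error. */
long reported_resolution_ns(void)
{
    struct timespec res;

    if (clock_getres(CLOCK_MONOTONIC, &res) != 0)
        return -1;
    return res.tv_sec * 1000000000L + res.tv_nsec;
}

/* Sleep for request_ns nanoseconds and return how long the sleep
 * actually took, in nanoseconds, or -1 on error or interruption. */
long measured_sleep_ns(long request_ns)
{
    struct timespec req, t0, t1;

    req.tv_sec = request_ns / 1000000000L;
    req.tv_nsec = request_ns % 1000000000L;

    if (clock_gettime(CLOCK_MONOTONIC, &t0) != 0)
        return -1;
    if (nanosleep(&req, NULL) != 0)
        return -1;
    if (clock_gettime(CLOCK_MONOTONIC, &t1) != 0)
        return -1;
    return (t1.tv_sec - t0.tv_sec) * 1000000000L
         + (t1.tv_nsec - t0.tv_nsec);
}
```

A program could call measured_sleep_ns() a few times at startup with the note length it cares about and fall back to busy-looping only if the overshoot is larger than it can tolerate.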

--
Dave Mielke | 2213 Fox Crescent | I believe that the Bible is the
Phone: 1-613-726-0014 | Ottawa, Ontario | Word of God. Please contact me
EMail: [email protected] | Canada K2A 1H7 | if you're concerned about Hell.
http://familyradio.com

2002-07-15 15:46:00

by David Mosberger

Subject: Re: HZ, preferably as small as possible

>>>>> On Mon, 15 Jul 2002 09:24:11 +0100, Russell King <[email protected]> said:

Russell> On Mon, Jul 15, 2002 at 02:56:14AM -0400, Albert D. Cahalan
Russell> wrote:
>> Unfortunately, the hack must remain for another 4 years or so.
>> Maybe that's not so bad though. I prefer it over this:
>>
>> #ifdef __386__
>> #define HZ 100
>> #endif
>> #ifdef __IA64__
>> #define HZ 1024
>> #endif
>> #ifdef __ARM__
>> #define HZ 128 // if they settle on this

Russell> Ehh? It's been 100 on the majority of ARM. If it's
Russell> different in libproc, the libproc is broken.

libproc should be using AT_CLKTCK (as provided via sysconf(_SC_CLK_TCK))
at any rate.

--david

2002-07-15 16:04:42

Subject: Re: HZ, preferably as small as possible

Russell King writes:
> On Mon, Jul 15, 2002 at 02:56:14AM -0400, Albert D. Cahalan wrote:

>> Unfortunately, the hack must remain for another 4 years or so.
>> Maybe that's not so bad though. I prefer it over this:
>>
>> #ifdef __386__
>> #define HZ 100
>> #endif
>> #ifdef __IA64__
>> #define HZ 1024
>> #endif
>> #ifdef __ARM__
>> #define HZ 128 // if they settle on this
>
> Ehh? It's been 100 on the majority of ARM. If it's different in libproc,
> the libproc is broken.

It's not a different value in libproc. There's autodetection.
I can't just support "the majority of ARM", and people keep
giving me shit about HZ supposedly being a per-arch constant.
(not that there's a sane way to get a per-arch constant from
user code anyway)

> One (broken) machine type decided it would be a
> good idea to change it to 1000. Since no one has paid any attention
> to this machine for some time, its support code will get dropped if
> they don't fix it before 2.6.

You have 64, 128, and 1000. See for yourself.

arch-cl7500/param.h #define HZ 100
arch-epxa10db/param.h #define HZ 100
arch-integrator/param.h #define HZ 100
arch-l7200/param.h #define HZ 128
arch-shark/param.h #define HZ 64
arch-tbox/param.h #define HZ 1000

I need to support all of that with one binary.
So I'm stuck with:

case 9 ... 11 : Hertz = 10; break; /* S/390 (sometimes) */
case 18 ... 22 : Hertz = 20; break; /* user-mode Linux */
case 30 ... 34 : Hertz = 32; break; /* ia64 emulator */
case 48 ... 52 : Hertz = 50; break;
case 58 ... 62 : Hertz = 60; break;
case 63 ... 65 : Hertz = 64; break; /* StrongARM /Shark */
case 95 ... 105 : Hertz = 100; break; /* normal Linux */
case 124 ... 132 : Hertz = 128; break; /* MIPS, ARM */
case 195 ... 204 : Hertz = 200; break; /* normal << 1 */
case 253 ... 260 : Hertz = 256; break;
case 393 ... 408 : Hertz = 400; break; /* normal << 2 */
case 790 ... 808 : Hertz = 800; break; /* normal << 3 */
case 990 ... 1010 : Hertz = 1000; break; /* ARM */
case 1015 ... 1035 : Hertz = 1024; break; /* Alpha, ia64 */
case 1180 ... 1220 : Hertz = 1200; break; /* Alpha */

2002-07-15 16:24:01

by Robert Love

Subject: Re: HZ, preferably as small as possible

On Sun, 2002-07-14 at 22:06, Linus Torvalds wrote:

> I've never had good reason to believe the latency/perf benefits myself,
> but I was approached at OLS about problems with something as simple as
> DVD playing, where a 100Hz timer means that the DVD player ends up
> having to busy-loop on gettimeofday() because it cannot sanely sleep due
> to the lack in sufficient sleeping granularity.

A cleaner solution to this issue is a higher resolution timer, e.g. the
high-res-timers project which has high resolution POSIX timers.

We could still bump HZ, of course...
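
To illustrate what a player would do with a fine-grained sleep, here is a rough sketch using the POSIX clock_nanosleep() call with an absolute deadline, so that wake-up lateness does not accumulate from frame to frame. The 25 frames per second rate and the helper names are assumptions, and the achieved accuracy still depends entirely on the timer resolution the kernel offers.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

#define FRAME_NS 40000000L  /* 25 frames per second, an assumed rate */

/* Advance *t by ns nanoseconds, keeping tv_nsec below one second. */
void timespec_add_ns(struct timespec *t, long ns)
{
    t->tv_nsec += ns;
    while (t->tv_nsec >= 1000000000L) {
        t->tv_nsec -= 1000000000L;
        t->tv_sec++;
    }
}

/*
 * Move the deadline one frame forward and sleep until it is reached.
 * Returns 0 on success or an error number, as clock_nanosleep() does.
 */
int wait_for_next_frame(struct timespec *deadline)
{
    timespec_add_ns(deadline, FRAME_NS);
    return clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, deadline, NULL);
}
```

The caller initialises the deadline once with clock_gettime(CLOCK_MONOTONIC, ...) and then calls wait_for_next_frame() before presenting each frame.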

Robert Love

2002-07-15 17:03:55

by Russell King

Subject: Re: HZ, preferably as small as possible

On Mon, Jul 15, 2002 at 12:07:18PM -0400, Albert D. Cahalan wrote:
> You have 64, 128, and 1000. See for yourself.
>
> arch-cl7500/param.h #define HZ 100
> arch-epxa10db/param.h #define HZ 100
> arch-integrator/param.h #define HZ 100
> arch-l7200/param.h #define HZ 128
> arch-shark/param.h #define HZ 64
> arch-tbox/param.h #define HZ 1000
>
> I need to support all of that with one binary.
> So I'm stuck with:

Let's look more closely:

#ifndef HZ
#define HZ 100
#endif
#if defined(__KERNEL__) && (HZ == 100)
#define hz_to_std(a) (a)
#endif

And:

$ grep hz_to_std arch-*/param.h
arch-l7200/param.h:#define hz_to_std(a) ((a * HZ)/100)
arch-shark/param.h:#define hz_to_std(a) ((a * HZ)/100)

As I said, tbox is broken, so ignore that.

And hz_to_std gets used (fs/proc/array.c):

hz_to_std(task->times.tms_utime),
hz_to_std(task->times.tms_stime),
hz_to_std(task->times.tms_cutime),
hz_to_std(task->times.tms_cstime),

So merely grepping for HZ doesn't actually tell you anything.

All /proc values are in 100Hz units on ARM.

--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html

2002-07-15 18:17:55

Subject: Re: HZ, preferably as small as possible

David Mosberger writes:

> libproc should be using AT_CLKTCK (as provided via sysconf(_SC_CLK_TCK))
> at any rate.

If that would work reliably, sure. The glibc hackers have had
some trouble with doing a correct implementation. I've heard
that recently the kernel has been supplying glibc with HZ via
the ELF note mechanism, but I've no way to tell a broken glibc
from a working one. Thus libproc does things the painful way.

Perhaps you could explain how to access ELF notes from
regular app code. That covers 2.4 kernels AFAIK, and so
the hacks could go away as soon as Debian retires the
2.2 kernel.

2002-07-15 18:29:00

by David Mosberger

Subject: Re: HZ, preferably as small as possible

>>>>> On Mon, 15 Jul 2002 14:20:31 -0400 (EDT), "Albert D. Cahalan" <[email protected]> said:

Albert> Perhaps you could explain how to access ELF notes from
Albert> regular app code. That covers 2.4 kernels AFAIK, and so the
Albert> hacks could go away as soon as Debian retires the 2.2
Albert> kernel.

The ELF auxiliary info table is stored at the top of the user level
stack (above argv and envp). &envp[num_envs] should get you there
(check on the alignment, though).
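
A minimal sketch of that walk, assuming a Linux ELF process and an envp that still points at the original environment block on the stack (the third argument of main, or environ before any setenv/putenv call). The function name is made up for illustration; the ElfW() macro comes from <link.h>.

```c
#include <stddef.h>
#include <elf.h>
#include <link.h>

/*
 * Return the AT_CLKTCK value from the ELF auxiliary vector, or -1 if the
 * entry is not present. envp must be the untouched initial environment.
 */
long clk_tck_from_auxv(char **envp)
{
    char **p = envp;

    /* Skip every environment string; the vector starts after the NULL. */
    while (*p != NULL)
        p++;
    p++;

    for (const ElfW(auxv_t) *av = (const ElfW(auxv_t) *)p;
         av->a_type != AT_NULL; av++) {
        if (av->a_type == AT_CLKTCK)
            return (long)av->a_un.a_val;
    }
    return -1;
}
```

On a kernel that does not supply the entry the function returns -1, which at least lets the caller tell real data from a guess.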

--david

2002-07-15 18:41:11

Subject: Re: HZ, preferably as small as possible

Russell King writes:
> On Mon, Jul 15, 2002 at 12:07:18PM -0400, Albert D. Cahalan wrote:

>> You have 64, 128, and 1000. See for yourself.
>>
>> arch-cl7500/param.h #define HZ 100
>> arch-epxa10db/param.h #define HZ 100
>> arch-integrator/param.h #define HZ 100
>> arch-l7200/param.h #define HZ 128
>> arch-shark/param.h #define HZ 64
>> arch-tbox/param.h #define HZ 1000
>>
>> I need to support all of that with one binary.
>> So I'm stuck with:
>
> Let's look more closely:
>
> #ifndef HZ
> #define HZ 100
> #endif
> #if defined(__KERNEL__) && (HZ == 100)
> #define hz_to_std(a) (a)
> #endif
>
> And:
>
> $ grep hz_to_std arch-*/param.h
> arch-l7200/param.h:#define hz_to_std(a) ((a * HZ)/100)
> arch-shark/param.h:#define hz_to_std(a) ((a * HZ)/100)

Won't that overflow in 3 or 4 days?
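
The estimate can be checked with a little arithmetic, assuming the argument is a 32-bit unsigned tick count that advances HZ times per second (a signed count would halve the figures). The helper name is hypothetical.

```c
#include <stdint.h>

/*
 * Seconds of accumulated time after which (a * hz) no longer fits in
 * 32 unsigned bits, when a itself grows by hz ticks per second.
 */
uint64_t seconds_until_overflow(uint64_t hz)
{
    uint64_t max_ticks = (UINT64_C(1) << 32) / hz;  /* a * hz reaches 2^32 here */
    return max_ticks / hz;                          /* ticks -> seconds */
}
```

For HZ = 128 this gives 262144 seconds, a little over 3 days; for HZ = 64 it gives 1048576 seconds, roughly 12 days.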

> As I said, tbox is broken, so ignore that.

OK.

> And hz_to_std gets used (fs/proc/array.c):
>
> hz_to_std(task->times.tms_utime),
> hz_to_std(task->times.tms_stime),
> hz_to_std(task->times.tms_cutime),
> hz_to_std(task->times.tms_cstime),

Now look in the 2.4.xx kernel source.

> So merely grepping for HZ doesn't actually tell you anything.
>
> All /proc values are in 100Hz units on ARM.

Since kernel 2.5.25 it looks like. I must support
the 2.4.xx kernels at least, and 2.2.xx is still
pretty popular.

2002-07-15 18:50:43

by Russell King

Subject: Re: HZ, preferably as small as possible

On Mon, Jul 15, 2002 at 02:43:00PM -0400, Albert D. Cahalan wrote:
> Russell King writes:
> > $ grep hz_to_std arch-*/param.h
> > arch-l7200/param.h:#define hz_to_std(a) ((a * HZ)/100)
> > arch-shark/param.h:#define hz_to_std(a) ((a * HZ)/100)
>
> Won't that overflow in 3 or 4 days?

Probably. Someone else's problem though (who wrote those)

> > And hz_to_std gets used (fs/proc/array.c):
> >
> > hz_to_std(task->times.tms_utime),
> > hz_to_std(task->times.tms_stime),
> > hz_to_std(task->times.tms_cutime),
> > hz_to_std(task->times.tms_cstime),
>
> Now look in the 2.4.xx kernel source.

Firstly, you can't base any assumptions about ARM from what's in the
main kernels.

It's not in the Marcelo source, but in the -rmk patch, which you need
to have a working kernel on ARM for _any_ kernel whatsoever (because
I haven't yet managed to get Linus to take some trivial bits needed,
neither have I had any response why he won't take them.)

Yes, ARM has always been broken in every kernel there ever has been
from Linus/Marcelo/Alan.

The situation is improving with BK, but it's less than optimal; the
generic changes can't go through BK, therefore I can't really have a
BK tree that builds for ARM (because then the merging of csets gets
horrible.)

This all said, it looks like libproc automatically detects whatever
the kernel uses, so this is all irrelevant in the end.

--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html

2002-07-15 18:51:51

Subject: Re: HZ, preferably as small as possible

On Mon, 15 Jul 2002, Albert D. Cahalan wrote:
>
> It's not a different value in libproc. There's autodetection.
> I can't just support "the majority of ARM", and people keep
> giving me shit about HZ supposedly being a per-arch constant.
> (not that there's a sane way to get a per-arch constant from
> user code anyway)

But that's just _wrong_.

There _is_ a sane way to get the per-arch constant, and there has been for
a long long time.

The kernel exports it with the AT_CLKTCK ELF auxiliary note to every ELF
binary ever loaded, and I think glibc in turn exports that value through
the regular sysconf(_SC_CLK_TCK) thing. (Yeah, I disagree with some of the
glibc sysconf implementation, but it sure should be there, and it's
documented).

If that doesn't work, then it's a glibc bug (well, in theory there could
be a kernel bug too, but since it's a one-liner in the kernel I really
doubt it).
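
In user code that amounts to something like the sketch below. sysconf(_SC_CLK_TCK) is the documented call; the two wrapper names are only illustrative.

```c
#include <unistd.h>

/* Ticks per second used for the tick counts the kernel reports to user space. */
long user_ticks_per_second(void)
{
    return sysconf(_SC_CLK_TCK);
}

/* Convert a tick count (e.g. utime from /proc/<pid>/stat) to seconds. */
double ticks_to_seconds(unsigned long ticks, long ticks_per_second)
{
    return (double)ticks / (double)ticks_per_second;
}
```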

Linus

2002-07-15 18:57:07

Subject: Re: HZ, preferably as small as possible

On 15 Jul 2002, Robert Love wrote:
>
> A cleaner solution to this issue is a higher resolution timer, e.g. the
> high-res-timers project which has high resolution POSIX timers.

But that really doesn't solve the problem either.

You still need to have some limit on the timer resolution. Whether you
call that limit "HZ" or something else is irrelevant in the end. Just
calling them "high-resolution" doesn't make the problem go away, you still
have some resolution (*).

So once you set some magic limit on the fine-grained resolution (let's
call that "MAX_FINE_HZ"), you might as well realize that that really is
100% equivalent to just making HZ _be_ that value. Together with possibly
making the actual timer tick happen at a slower rate according to some
other heuristics (ie "the system doesn't need timers right now, let's just
not do them").

Linus

(*) Which is a lot less than the hw can generate, since you mustn't allow
users to bog down the system in timer interrupts by just using
"itimer(ITIMER_REAL, .. fine-resolution..)".

2002-07-15 19:47:54

by mbs

Subject: Re: HZ, preferably as small as possible

On Monday 15 July 2002 14:56, Linus Torvalds wrote:
> On 15 Jul 2002, Robert Love wrote:
> > A cleaner solution to this issue is a higher resolution timer, e.g. the
> > high-res-timers project which has high resolution POSIX timers.
>
> But that really doesn't solve the problem either.
>
> You still need to have some limit on the timer resolution. Whether you
> call that limit "HZ" or something else is irrelevant in the end. Just
> calling them "high-resolution" doesn't make the problem go away, you still
> have some resolution (*).
>
> So once you set some magic limit on the fine-grained resolution (let's
> call that "MAX_FINE_HZ"), you might as well realize that that really is
> 100% equivalent to just making HZ _be_ that value. Together with possibly
> making the actual timer tick happen at a slower rate according to some
> other heuristics (ie "the system doesn't need timers right now, let's just
> not do them").
>
> Linus
>
> (*) Which is a lot less than the hw can generate, since you mustn't allow
> users to bog down the system in timer interrupts by just using
> "itimer(ITIMER_REAL, .. fine-resolution..)".

actually, that is an interesting philosophical argument.

in an embedded system, it is sometimes more useful to not put artificial
constraints on the system and allow the clock and timer system to work in hw
increments, but document the hell out of it.

this is the "give 'em enough rope to hang themselves, but tell them the
precise length of the rope" model.

in an embedded system a "tickless" system is sometimes preferable to a ticked
system. there is often only one or a very small number of processes/threads
running and the extra overhead of 10 surplus clock ticks per process quantum
is a waste of cycles. (also when using a ppc or similar modern chip(flame
on;-), there is no need to keep a software wall clock, as the cpu has a 64bit
free running counter)

I had this discussion with george A. early in the posix timers project and I
argued/begged for a compile time config option giving the option of ticked
and tickless versions. George chose to go with a ticked system, because it
benchmarked better in a general purpose system, particularly under high
loads, and he didn't have time to implement two systems. he made the right
choice for the general purpose kernel and for probably 80% of the embedded
market. (I'm in the other 20%)

--
/**************************************************
** Mark Salisbury || [email protected] **
**************************************************/

2002-07-15 19:59:54

by Victor Yodaiken

Subject: Re: HZ, preferably as small as possible

On Mon, Jul 15, 2002 at 03:52:37PM -0400, mbs wrote:
> On Monday 15 July 2002 14:56, Linus Torvalds wrote:
> > (*) Which is a lot less than the hw can generate, since you mustn't allow
> > users to bog down the system in timer interrupts by just using
> > "itimer(ITIMER_REAL, .. fine-resolution..)".
>
> actually, that is an interesting philosophical argument.
>
> in an embedded system, it is sometimes more useful to not put artificial

That's why we have RTLinux.

> in an embedded system a "tickless" system is sometimes preferable to a ticked
> system. there is often only one or a very small number of processes/threads
> running and the extra overhead of 10 surplus clock ticks per process quantum
> is a waste of cycles. (also when using a ppc or similar modern chip(flame
> on;-), there is no need to keep a software wall clock, as the cpu has a 64bit
> free running counter)

Right: but "one or a very small number of processes/threads" does not apply to
Linux.

--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
http://www.fsmlabs.com http://www.rtlinux.com

2002-07-15 20:13:12

Subject: Re: HZ, preferably as small as possible

Linus Torvalds writes:
> On Mon, 15 Jul 2002, Albert D. Cahalan wrote:

>> It's not a different value in libproc. There's autodetection.
>> I can't just support "the majority of ARM", and people keep
>> giving me shit about HZ supposedly being a per-arch constant.
>> (not that there's a sane way to get a per-arch constant from
>> user code anyway)
>
> But that's just _wrong_.

If you only support recent kernels and glibc, true.
Debian is about to release a distribution with the 2.2 kernel.

> There _is_ a sane way to get the per-arch constant, and there has been for
> a long long time.

Your "long long time" is very different, because you
always (?) run the very latest kernel.

> The kernel exports it with the AT_CLKTCK ELF auxiliary note to every ELF
> binary ever loaded, and I think glibc in turn exports that value through
> the regular sysconf(_SC_CLK_TCK) thing. (Yeah, I disagree with some of the
> glibc sysconf implementation, but it sure should be there, and it's
> documented).
>
> If that doesn't work, then it's a glibc bug (well, in theory there could
> be a kernel bug too, but since it's a one-liner in the kernel I really
> doubt it).

Yeah, NOW it should work fine. App code sees:

old glibc and old kernel --> guess
old glibc and new kernel --> guess
new glibc and old kernel --> guess
new glibc and new kernel --> useful data

(the guess is correct for unmodified x86)

Two problems with that:

1. must handle the "guess" case
2. can't tell a guess from useful data!

So I can't use the useful data for a few more years.
I can cut that time down to maybe 2 years if I write
code to dig up the ELF notes myself, assuming that
were introduced with the 2.4 kernel.

2002-07-16 10:30:26

by kaih

Subject: Re: HZ, preferably as small as possible

[email protected] (Albert D. Cahalan) wrote on 11.07.02 in <[email protected]>:

> Hell yes. It's going to remain evil until the 2.4 kernel
> is a distant memory. Debian uses a 2.2 kernel in the
> upcoming release, so it will be a good long time until
> everyone is using a 2.6 kernel. When 2.8 comes out,
> Debian will finally stop using 2.4 and I can get rid of
> my evil hack.

Currently, the upcoming version has 4 2.2.20 kernels, 7 2.4.16 kernels,
and 7 2.4.18 kernels.

MfG Kai

2002-07-16 11:38:49

by Vojtech Pavlik

Subject: Re: HZ, preferably as small as possible

On Mon, Jul 15, 2002 at 05:06:45AM +0000, Linus Torvalds wrote:
> In article <[email protected]>,
> Grover, Andrew <[email protected]> wrote:
> >
> >But on the other hand, increasing HZ has perf/latency benefits, yes? Have
> >these been quantified?
>
> I've never had good reason to believe the latency/perf benefits myself,
> but I was approached at OLS about problems with something as simple as
> DVD playing, where a 100Hz timer means that the DVD player ends up
> having to busy-loop on gettimeofday() because it cannot sanely sleep due
> to the lack in sufficient sleeping granularity.
>
> You apparently end up visibly missing frames - a frame is just 3 timer
> ticks at 100 Hz, and considering that the kernel has to round up by one
> due to POSIX requirements _and_ considering that you lose roughly one
> for actually processing the frame itself, that doesn't sound _that_
> outlandish.

Actually, this example is pretty much false I believe.

Since there is always the screen refresh rate going at say 85 Hz, you'll
be missing frames anyway.

The really correct solution would be to use the vertical blank
interrupt, which all recent cards provide, to wake the X process to tell
it that it should flip its xvideo double-buffer (*), and to tell the
DVD player to supply another frame to it, which it would then preferably
DMA over AGP straight into the video card memory.

Now, if you wanted a real smooth video, you'd set your screen refresh
rate to 100 Hz in Europe and 120 Hz in US. (Without that it never can be
100% smooth anyway). And it also has the nice side effect of eliminating
the screen flicker caused by fluorescent lamp interference.

(*) If our interrupt-to-wake latency is too large for X to do the buffer
flip in the vblank, then we'll probably need some more kernel support
for that.

--
Vojtech Pavlik
SuSE Labs

2002-07-17 19:29:42

Subject: Re: HZ, preferably as small as possible

On Monday 15 July 2002 07:06, Linus Torvalds wrote:
> There is, of course, the option to do variable frequency (and make it
> integer multiples of the exposed "constant HZ" so that kernel code
> doesn't actually need to _care_ about the variability). There are
> patches to play with things like that.

We don't have to feel restricted to integer multiples. I'll paste in my
earlier post, for your convenience:

> ...If somebody wants a cruder scheduling interval than the raw timer
> interrupt, that's child's play, just step the interval down. The
> only slightly challenging thing is to do that without restricting
> choice of rate for the raw timer and scheduler, respectively. Here,
> a novel application of Bresenham's algorithm (the line drawing
> algorithm) works nicely: at each raw interrupt, subtract the period
> of the raw interrupt from an accumulator; if the result is less
> than zero, add the period of the scheduler to the accumulator and
> drop into the scheduler's part of the timer interrupt.

[which just increments the timer variable I believe]

> This Bresenham trick works for arbitrary collections of interrupt
> rates, all with different periods. It has the property that,
> over time, the total number of invocations at each rate remains
> *exactly* correct, and so long as the raw interrupt runs at a
> reasonably high rate, displacement isn't that bad either.

This technique is scarcely less efficient than the cruder method.
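
As a sketch of the accumulator step described above (not code from any patch; the struct and function names are made up, and it assumes the raw period is no longer than the scheduler period, both in the same time unit):

```c
/* State for deriving a slower "scheduler" tick from a faster raw tick. */
struct tick_divider {
    long acc;           /* running error term, starts at 0 */
    long raw_period;    /* period of the raw timer interrupt */
    long sched_period;  /* period of the scheduler tick, >= raw_period */
};

/*
 * Call once per raw interrupt. Returns 1 when the scheduler's part of
 * the timer interrupt should run on this tick, 0 otherwise.
 */
int raw_tick(struct tick_divider *d)
{
    d->acc -= d->raw_period;
    if (d->acc < 0) {
        d->acc += d->sched_period;
        return 1;
    }
    return 0;
}
```

With a raw period of 3 and a scheduler period of 10, 1000 raw interrupts (3000 time units) produce exactly 300 scheduler ticks, even though 10 is not a multiple of 3.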

--
Daniel

2002-07-17 20:28:09

by Richard B. Johnson

Subject: Re: HZ, preferably as small as possible

On Wed, 17 Jul 2002, Daniel Phillips wrote:

> On Monday 15 July 2002 07:06, Linus Torvalds wrote:
> > There is, of course, the option to do variable frequency (and make it
> > integer multiples of the exposed "constant HZ" so that kernel code
> > doesn't actually need to _care_ about the variability). There are
> > patches to play with things like that.
>
> We don't have to feel restricted to integer multiples. I'll paste in my
> earlier post, for your convenience:
>
> > ...If somebody wants a cruder scheduling interval than the raw timer
> > interrupt, that's child's play, just step the interval down. The
> > only slightly challenging thing is to do that without restricting
> > choice of rate for the raw timer and scheduler, respectively. Here,
> > a novel application of Bresenham's algorithm (the line drawing
> > algorithm) works nicely: at each raw interrupt, subtract the period
> > of the raw interrupt from an accumulator; if the result is less
> > than zero, add the period of the scheduler to the accumulator and
> > drop into the scheduler's part of the timer interrupt.
>
> [which just increments the timer variable I believe]
>
> > This Bresenham trick works for arbitrary collections of interrupt
> > rates, all with different periods. It has the property that,
> > over time, the total number of invocations at each rate remains
> > *exactly* correct, and so long as the raw interrupt runs at a
> > reasonably high rate, displacement isn't that bad either.
>
> This technique is scarcely less efficient than the cruder method.

It is hardly novel and I can't imagine how Bresenham or whomever
could make such a claim to the obvious. Even the DOS writer(s) used
this technique to get one-second time intervals from the 18.206
ticks/per second. This is simply division by subtraction, but you
don't throw away the remainder. Therefore, in the limit, there is
no remainder. However, at any instant, the time can be off by as
much as the divisor -1. FYI, you make digital filters using this
same method, it's hardly novel.
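
The one-second case can be sketched the same way. The constants are assumptions, namely the usual PC timer input clock of 1193182 Hz and a divisor of 65536, which together give the roughly 18.2 ticks per second mentioned above; the names are illustrative only.

```c
#include <stdint.h>

#define PIT_HZ      1193182u  /* assumed timer input clock, in Hz */
#define PIT_DIVISOR 65536u    /* assumed input clocks per timer tick */

struct sec_counter {
    uint32_t remainder;  /* input clocks not yet counted as a full second */
    uint32_t seconds;    /* whole seconds elapsed */
};

/* Call once per timer tick; the remainder is carried over, never dropped. */
void pit_tick(struct sec_counter *c)
{
    c->remainder += PIT_DIVISOR;
    if (c->remainder >= PIT_HZ) {
        c->remainder -= PIT_HZ;
        c->seconds++;
    }
}
```

The first second is counted on the 19th tick, and after 1193182 ticks the counter reads exactly 65536 seconds with no remainder left over.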

Cheers,
Dick Johnson

Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).

Windows-2000/Professional isn't.

2002-07-17 20:51:58

Subject: Re: HZ, preferably as small as possible

On Wed, 17 Jul 2002, Daniel Phillips wrote:
>
> We don't have to feel restricted to integer multiples. I'll paste in my
> earlier post, for your convenience:

Oh, I agree. I think the integer multiples simplify the problem space,
and should be trivial for most timer hardware (ie most timer hardware is
likely just a counter, so making the countdown value be N times as big
just automatically gives you integer multiples).

The _important_ part I would prefer people to take away is that it is
easier to "slow down" the clock than it is to speed it up. Mainly because
the places that are likely to care about speeding it up are also very very
timing-critical. For example, there is no way in _hell_ that we're going
to reprogram the old-style PC/AT timer inside the "add_timer()" function.
It just is not viable.

In contrast, the places who are interested in slowing the timer down are
also the places likely to not be as timing-critical. The idle loop being
the perfect example (and also being right now the _only_ example where
somebody actually asked for a slower timer tick).

Also note that once you're willing to do this in the slow path, you can
also do real "fixups" to the results since you can afford to take a small
hit when you get back to "fast mode". For example, if we only do this on
PCs while we go into C3 anyway (where the latencies of saving power are quite
noticeable anyway, so that the idle loop already has to do some latency
estimation before it decides to go into C3), then we can easily afford to
completely reset not just the timer counter, but also do fairly complex
things like re-adjusting the whole time-of-day clock.

See how it becomes a much simpler game (and you have more options) if you
take the "slow the timer down when idle" approach instead of taking the
"speed the timer up when you need to" approach?
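
A toy model of the integer-multiple idea, purely as a sketch: nothing here touches real hardware, BASE_RELOAD is a made-up countdown value for one 1/HZ period, and a real counter's width would limit how large the multiple can be.

```c
#define BASE_RELOAD 11932UL  /* hypothetical countdown for one 1/HZ period */

struct sim_clock {
    unsigned long jiffies;  /* time in units of 1/HZ, as the kernel sees it */
    unsigned long reload;   /* countdown value currently programmed */
    unsigned int  mult;     /* 1/HZ periods covered by one interrupt */
};

/* Slow the tick down to one interrupt every n periods (n = 1 is normal). */
void set_tick_multiple(struct sim_clock *c, unsigned int n)
{
    c->mult = n;
    c->reload = BASE_RELOAD * n;
}

/* Interrupt handler: account for every period the interrupt stands for. */
void timer_interrupt(struct sim_clock *c)
{
    c->jiffies += c->mult;
}
```

Code that only reads jiffies never notices the change, which is the point of keeping the slow rate an integer multiple of the exposed HZ.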

Linus

2002-07-17 20:36:08

Subject: Re: HZ, preferably as small as possible

On Wednesday 17 July 2002 22:31, Richard B. Johnson wrote:
> On Wed, 17 Jul 2002, Daniel Phillips wrote:
>
> > On Monday 15 July 2002 07:06, Linus Torvalds wrote:
> > > There is, of course, the option to do variable frequency (and make it
> > > integer multiples of the exposed "constant HZ" so that kernel code
> > > doesn't actually need to _care_ about the variability). There are
> > > patches to play with things like that.
> >
> > We don't have to feel restricted to integer multiples. I'll paste in my
> > earlier post, for your convenience:
> >
> > > ...If somebody wants a cruder scheduling interval than the raw timer
> > > interrupt, that's child's play, just step the interval down. The
> > > only slightly challenging thing is to do that without restricting
> > > choice of rate for the raw timer and scheduler, respectively. Here,
> > > a novel application of Bresenham's algorithm (the line drawing
> > > algorithm) works nicely: at each raw interrupt, subtract the period
> > > of the raw interrupt from an accumulator; if the result is less
> > > than zero, add the period of the scheduler to the accumulator and
> > > drop into the scheduler's part of the timer interrupt.
> >
> > [which just increments the timer variable I believe]
> >
> > > This Bresenham trick works for arbitrary collections of interrupt
> > > rates, all with different periods. It has the property that,
> > > over time, the total number of invocations at each rate remains
> > > *exactly* correct, and so long as the raw interrupt runs at a
> > > reasonably high rate, displacement isn't that bad either.
> >
> > This technique is scarcely less efficient than the cruder method.
>
> It is hardly novel and I can't imagine how Bresenham or whomever
> could make such a claim to the obvious. Even the DOS writer(s) used
> this technique to get one-second time intervals from the 18.206
> ticks/per second. This is simply division by subtraction, but you
> don't throw away the remainder. Therefore, in the limit, there is
> no remainder. However, at any instant, the time can be off by as
> much as the divisor -1. FYI, you make digital filters using this
> same method, it's hardly novel.

It's novel for Linux then, because it seems not to have occurred to
anyone here. I'll take your aggressive response as a vote in favor.

--
Daniel

2002-07-17 21:12:13

Subject: Re: HZ, preferably as small as possible

On Wednesday 17 July 2002 23:02, Linus Torvalds wrote:
> On Wed, 17 Jul 2002, Richard B. Johnson wrote:
> >
> > It is hardly novel and I can't imagine how Bresenham or whomever
> > could make such a claim to the obvious. Even the DOS writer(s) used
> > this technique to get one-second time intervals from the 18.206
> > ticks/per second.
>
> Ehh.. Look at _existing_ linux code to do exactly the same.
>
> See update_wall_time_one_tick() and second_overflow() (which does a lot
> more besides, but it does largely boil down to this "average fractions
> using basic integer math" thing).

I see lots of stuff in there all right, but I don't see anything that
implements the numerator/denominator error analysis technique I
described above. Maybe I just didn't look hard enough.

--
Daniel

2002-07-17 20:58:52

Subject: Re: HZ, preferably as small as possible

On Wed, 17 Jul 2002, Richard B. Johnson wrote:
>
> It is hardly novel and I can't imagine how Bresenham or whomever
> could make such a claim to the obvious. Even the DOS writer(s) used
> this technique to get one-second time intervals from the 18.206
> ticks/per second.

Ehh.. Look at _existing_ linux code to do exactly the same.

See update_wall_time_one_tick() and second_overflow() (which does a lot
more besides, but it does largely boil down to this "average fractions
using basic integer math" thing).

Linus

2002-07-17 20:58:18

by Richard B. Johnson

Subject: Re: HZ, preferably as small as possible

On Wed, 17 Jul 2002, Daniel Phillips wrote:

> On Wednesday 17 July 2002 22:31, Richard B. Johnson wrote:
> > On Wed, 17 Jul 2002, Daniel Phillips wrote:
> >
> > > On Monday 15 July 2002 07:06, Linus Torvalds wrote:
> > > > There is, of course, the option to do variable frequency (and make it
> > > > integer multiples of the exposed "constant HZ" so that kernel code
> > > > doesn't actually need to _care_ about the variability). There are
> > > > patches to play with things like that.
> > >
> > > We don't have to feel restricted to integer multiples. I'll paste in my
> > > earlier post, for your convenience:
> > >
> > > > ...If somebody wants a cruder scheduling interval than the raw timer
> > > > interrupt, that's child's play, just step the interval down. The
> > > > only slightly challenging thing is to do that without restricting
> > > > choice of rate for the raw timer and scheduler, respectively. Here,
> > > > a novel application of Bresenham's algorithm (the line drawing
> > > > algorithm) works nicely: at each raw interrupt, subtract the period
> > > > of the raw interrupt from an accumulator; if the result is less
> > > > than zero, add the period of the scheduler to the accumulator and
> > > > drop into the scheduler's part of the timer interrupt.
> > >
> > > [which just increments the timer variable I believe]
> > >
> > > > This Bresenham trick works for arbitrary collections of interrupt
> > > > rates, all with different periods. It has the property that,
> > > > over time, the total number of invocations at each rate remains
> > > > *exactly* correct, and so long as the raw interrupt runs at a
> > > > reasonably high rate, displacement isn't that bad either.
> > >
> > > This technique is scarcely less efficient than the cruder method.
> >
> > It is hardly novel and I can't imagine how Bresenham or whomever
> > could make such a claim to the obvious. Even the DOS writer(s) used
> > this technique to get one-second time intervals from the 18.206
> > ticks/per second. This is simply division by subtraction, but you
> > don't throw away the remainder. Therefore, in the limit, there is
> > no remainder. However, at any instant, the time can be off by as
> > much as the divisor -1. FYI, you make digital filters using this
> > same method, it's hardly novel.
>
> It's novel for Linux then, because it seems not to have occurred to
> anyone here. I'll take your aggressive response as a vote in favor.
>

It's basically no overhead greater than the minimum if written in
assembly because the carry to less-than-zero is in the flags. In
'C' it requires a subtraction and then a test, but it's trivial code
and it provides for non-integral divisions with integers. I'm all
for it.

Cheers,
Dick Johnson

Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).

Windows-2000/Professional isn't.

2002-07-18 10:37:19

by kaih

Subject: Re: HZ, preferably as small as possible

[email protected] (Richard B. Johnson) wrote on 17.07.02 in <[email protected]>:

> On Wed, 17 Jul 2002, Daniel Phillips wrote:
>
> > On Monday 15 July 2002 07:06, Linus Torvalds wrote:

[Are those attributions really right?]

> > > This Bresenham trick works for arbitrary collections of interrupt
> > > rates, all with different periods. It has the property that,
> > > over time, the total number of invocations at each rate remains
> > > *exactly* correct, and so long as the raw interrupt runs at a
> > > reasonably high rate, displacement isn't that bad either.
> >
> > This technique is scarcely less efficient than the cruder method.
>
> It is hardly novel and I can't imagine how Bresenham or whomever
> could make such a claim to the obvious. Even the DOS writer(s) used

Well, I might point out the original (AFAIAA) paper is "J. E. Bresenham,
IBM Systems Journal 4, 25-30 (1965)".

It's a long time from 1965 to the creation of DOS.

MfG Kai

2002-07-18 12:53:03

by Richard B. Johnson

Subject: Re: HZ, preferably as small as possible

On Wed, 17 Jul 2002, Linus Torvalds wrote:

>
>
> On Wed, 17 Jul 2002, Richard B. Johnson wrote:
> >
> > It is hardly novel and I can't imagine how Bresenham or whomever
> > could make such a claim to the obvious. Even the DOS writer(s) used
> > this technique to get one-second time intervals from the 18.206
> > ticks/per second.
>
> Ehh.. Look at _existing_ linux code to do exactly the same.
>
> See update_wall_time_one_tick() and second_overflow() (which does a lot
> more besides, but it does largely boil down to this "average fractions
> using basic integer math" thing).
>
> Linus
>
Maybe you see something in the code I don't. In fact, the hardware
appears to have been programmed to interrupt at the HZ rate
using the constant, CLOCK_TICK_RATE, defined in ../asm/timex.h.
Maybe the hardware can't be programmed to interrupt at HZ so the
real ticks are adjusted by 'average fractions' code, but it is
very unclear if this is being done.
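
The mismatch in question can be worked out with integer arithmetic.
The sketch below assumes the 2.4-era i386 definitions as commonly
cited: CLOCK_TICK_RATE is 1193180 and the value loaded into the timer
is CLOCK_TICK_RATE/HZ rounded to nearest. The function names are
hypothetical. It only shows how far the programmed period is from
1/HZ; it does not model how the kernel corrects for it.

```c
/* Assumed i386 timer input clock in Hz (CLOCK_TICK_RATE in asm/timex.h). */
#define CLOCK_TICK_RATE 1193180LL

/* Divisor loaded into the timer: CLOCK_TICK_RATE / hz, rounded to nearest. */
static long long pit_latch(long long hz)
{
    return (CLOCK_TICK_RATE + hz / 2) / hz;
}

/*
 * Relative error of the programmed tick period against the nominal 1/hz,
 * in parts per billion. Positive means each tick is slightly too long.
 */
static long long tick_error_ppb(long long hz)
{
    long long latch = pit_latch(hz);
    return (latch * hz - CLOCK_TICK_RATE) * 1000000000LL / CLOCK_TICK_RATE;
}
```

Under those assumptions, HZ=100 gives a divisor of 11932 and a period
about 16.8 ppm too long (roughly 1.4 seconds per day if nothing
corrected it), while HZ=1000 gives 1193 and a period about 151 ppm too
short. That residue is what an "average fractions" scheme has to
absorb.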

Here is a 20-year-old source snippet of some synthetic division
code used to correct the DOS time by substituting part of INT 08.

The dividend, DIVIDEND, was cached in a variable called _pll
(accessible from 'C' code as pll). This could be tuned to slowly
adjust the time to make it as exact as you wanted. I replaced
the actual time-keeping code with a CALL ONE_SEC to shorten
this example. It executes at 1 second intervals. The history
in the comments is interesting.

Old-fashioned Intel operand order: DEST <--- SOURCE

DIVIDEND EQU 18206 ; 18.206
DIVISOR EQU 1000 ; 18206/1000 = 18.206

_pll DW DIVIDEND
_false DW 0
;
; This is the local timer tick.
;
; The timer tick used to interrupt at 18.206 ticks/second. When
; the AT got redone, this was changed to 18.158 because the
; clock to timer channel 0 got changed to 1.190 MHz and the
; requirement of using a 3.579545 MHz (color subcarrier) crystal
; was eliminated.
;
; Was: 3.579545 / 3 = 1.193181667 / 65536 = 18.206
; Now: 1.190 / 65536 = 18.158
; Also: 1.1934 / 65536 = 18.210
;
; As usual, things are not well in AT-Land. The 'C' runtime library
; and MS-DOS still think that the proper divisor is 18.206. So we
; have to use that value or the time-stamps on the hour will be
; about 1/2 second in error. This means that the AT-Time is wrong
; by about 1/2 second per hour!
;
;
EVEN
TIMER PROC FAR
SAV_REG AX, DS ; Save registers used
MOV AX,DGROUP ; Local data area
MOV DS,AX ; Set segment
MOV AL,01001011B
; ||||||||_____ Select in-service register
; ||||||_______ No poll
; |||||________ Always 1
; ||||_________ Always 0
; |||__________ Standard mask
; |____________ Always 0
OUT INT_CTL,AL ; Select OCW3
IN AL,INT_CTL ; Get results
AND AL,00000001B ; Is interrupt pending?
JNZ OKAY ; Yes
INC WORD PTR _false ; Not good, record
OKAY: MOV AL,SPC_EOI ; Specific EOI
OUT INT_CTL,AL ; Reset controller
SUB WORD PTR [LCL_SECONDS],DIVISOR ; Subtract 1000 ticks
JNC NOSEC ; 1 second is not up yet
MOV AX,WORD PTR [_pll] ; Get loop variable
ADD WORD PTR [LCL_SECONDS],AX ; Synth div = 1000/18206
CALL ONE_SEC ; Dummy
;
NOSEC: RES_REG AX, DS ; Restore registers used
IRET ; Done
TIMER ENDP
;-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
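
For readers who do not read 8086 assembly, the divider at the core of
the routine can be sketched in C roughly as follows. Only DIVISOR, the
value 18206, and the roles of LCL_SECONDS and _pll come from the
listing; the struct and function names are hypothetical, and the
interrupt-controller handling is left out. The 16-bit unsigned
wraparound stands in for the carry flag tested by JNC.

```c
#include <stdint.h>

#define DIVISOR 1000u       /* subtracted on every timer tick */

struct sec_div {
    uint16_t acc;           /* plays the role of LCL_SECONDS */
    uint16_t pll;           /* plays the role of _pll; 18206 = 18.206 ticks/s */
    unsigned long seconds;  /* stands in for the work done by ONE_SEC */
};

/* Call once per timer interrupt; returns 1 when a second has elapsed. */
static int timer_tick(struct sec_div *s)
{
    uint16_t before = s->acc;

    s->acc = (uint16_t)(s->acc - DIVISOR);   /* SUB ...,DIVISOR */
    if (before >= DIVISOR)                   /* no borrow: JNC NOSEC */
        return 0;

    /* Borrow occurred: add the dividend back, which wraps acc into range. */
    s->acc = (uint16_t)(s->acc + s->pll);    /* ADD ...,AX */
    s->seconds++;                            /* CALL ONE_SEC */
    return 1;
}
```

Starting from acc = 0 with pll = 18206, 18206 calls produce exactly
1000 one-second events and leave acc back at 0. Changing pll at run
time adjusts the long-run rate without touching the handler, which is
the tuning described above.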

Cheers,
Dick Johnson

Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).

Windows-2000/Professional isn't.

2002-07-18 13:21:26