2005-09-19 16:48:42

by Thomas Gleixner

[permalink] [raw]
Subject: [ANNOUNCE] ktimers subsystem

ktimers seperate the "timer API" from the "timeout API". ktimers are
used for:
- nanosleep
- posixtimers
- itimers

The following text explains the rationale behind ktimers. It contains

- a general analysis of the current Linux time(r) core system and
patches / projects related to it.

- a detailed description of necessary changes to the Linux time(r)
core system

- detailed explanation of the ktimer subsystem

- a short explanation of possible follow up patches to demonstrate the
further possibilities of the ktimers subsystem implementation


Why do we need ktimers ?
========================

Authors: Thomas Gleixner, Ingo Molnar

A lot of discussion took place about Linux timekeeping and timers
recently. The efforts to integrate the High Resolution Timer patches
into the -rt tree gave a deep insight into the big picture and initiated
the ktimers implementation. This document is an analysis of all related
issues, with the goal of inclusion of ktimers into mainline.


Linux time(rs) status, very short summary
-----------------------------------------

The current upstream timer implementation of Linux is based on a
periodic system tick (jiffy). This periodic tick initially had a period
of 10ms. During the 2.5 development series this was changed for some
architectures to 1ms and recently corrected to 4ms.

The upstream "time of day" (tod) timekeeping code builds on top of the
periodic ticks and takes architecture dependent sub-tick resolution
mechanisms into account to provide finer resolution. It also implements
synchronization with external time sources such as NTP.


Patches and projects related to timers and timekeeping, in history order
------------------------------------------------------------------------

- UTIME Usec Resolution Timers
- HRT High Resolution Timers
- VST Variable Scheduling Timeouts
- DTCK Dynamic Ticks
- NEWTOD New timeofday core including reworked NTP

- some related architecture specific code already integrated into the
kernel (mostly s390 related time interpolator code)
- a couple of others - rather odd attempts to change timer
resolution. Mostly single purpose patches.

All of those patches have one thing in common. They are restricted to a
few architectures and address only single problems of timers and
timekeeping.


UTIME: Microsecond resolution timers
-------------------------------------

The patch history goes back to Linux 2.0 and is maintained by Dr.
Douglas Niehaus at Kansas University as part of the KURT (Kansas
University Realtime) project.

Supported platforms: x86, (PPC, ARM partial)

Implementation:

Originally the usec resolution support was available for all users of
the timer core system, but during the course of development it turned
out to be a waste of resources and was restricted to realtime processes.
The implementation is restricted to nanosleep and itimers. Posix timer
support is planned. Initially it was implemented on top of the timer
wheel, but recently converted to a seperate list for high resolution
timers.

A related and quite interesting point of activity in this project is the
research on fine granular synchronization of machines in a network.


HRT: High Resolution Timers
---------------------------

George Anzinger forked the UTIME parts of KURT some years ago and
started the High Resolution Timer project. The usage is restricted to
posix timers with clock = CLOCK_REALTIME_HR and CLOCK_MONOTONIC_HR.

Supported platforms: x86, PPC, PPC, PPC64, SH, ARM partial

Implementation:

High resolution timers are kept in the timer wheel until they reach the
jiffy where they expire. On expiry they are moved to a seperate list and
arm the high resolution timer source. The timer function is handled in a
seperate softirq. The high resolution portion of the timers is managed
as a seperate field in the timer structure which holds the fractional
part. Initially implemented as subjiffies with a given resolution it
changed to a variable holding the time source cycles to reduce
conversion overhead. This has impact on the accuracy of cyclic
schedules and also has some limitations for high resolution time sources
with variable frequencies, e.g. TSC on x86.
Possible changes are work in progress.


VST: Variable Scheduling Timeouts
---------------------------------

A patch closely related to and depending on HRT, also maintained by
George Anzinger. It provides the suppression of timer ticks during idle
periods.

Supported platforms: x86, (One ARM platform supported, a related patch
snippet recently sneaked into the ARM core interrupt handling code)

Implementation:

Whenever the system goes into idle state the timer list(s) are scanned
to find the next timer and, if it is reasonably far away, VST turns off
the periodic 1/HZ timer interrupts and sets up a timer interrupt at the
expiry time of said timer. VST also provides a callback list which is
used to notify about idle enter / leave events. On the next interrupt,
be it the VST timer or some other interrupt, the periodic 1/HZ timer is
restarted and the elapsed time (ticks) is properly accounted for.


DTCK: Dynamic Ticks
-------------------

An implementation similar to VST, but not depending on other patches.
Maintained by Con Kolivas. It also provides the suppression of timer
ticks during idle periods.

Supported platforms: x86

Implementation:

Similar to VST, but completely jiffy bound. Very actively maintained in
recent months, with good progress. The core implementation is leaner
than VST. It contains some x86 specific bits which have to be seperated
out. It uses the already existing NO_IDLE_HZ code (s390) instead of
introducing new duplicated functionality. The configuration interface is
integrated into sysfs. A generic notification interface is not available
(yet?).


NEWTOD: New timeofday core including reworked NTP
-------------------------------------------------

John Stultz maintains a set of patches which are related, but have been
split for easier review and discussion.

- Reworked implementation of NTP synchronization
- Seperation of time of day timekeeping from the timer core

Supported platforms: x86

Implementation:

The time of day code is completely seperated from the periodic tick. The
code provides a runtime configurable time source selection, which is
intended to be of generic (architecture spanning) use. One of the
possible time sources is the periodic tick of course.

The code gets rid of one of the fundamental flaws of Linux time keeping:
the wrong order of deduction. The current upstream code derives almost
everything except jiffies from the wall clock time (xtime). This is
controversial to almost every time reference in the world. Usually time
references are built on a raw hardware clock which provides a more or
less accurate monotonic time source. This time source is corrected vs.
frequency skew. On top of resulting "constant frequency" monotonic time
source the human time conversions are implemented, e.g. timezones.
John's patch addresses this nicely and builds the correct order of clock
source -> frequency correction -> wall clock adjustment. This is one of
the essential preliminaries to implement nonintrusive, simple and
effective high resolution time support.

The lively discussion of the patch is not questioning the general idea.
The main point of criticizm is related to the enforced usage of 64-bit
variables and 64-bit arithmetic in hot execution paths. This is seen as
a penalty for 32-bit architectures and for low computing power CPUs
which are often used in embedded devices, but architectural simplicity
is a strong argument in favor of 64-bit arithmetics and we have not seen
a substantial proof of the overhead. (See also the detailed comparison
of timespec versus 64-bit nsec_t further below.)


Timer related observations
--------------------------

Ticks are a convenient mechanism for a lot of time triggered functions
like scheduler-ticks, timeouts etc. which require limited resolution and
precision.

For time of day timekeeping, which requires sub tick resolution
preferrable in human time units, ticks introduce a bunch of ugliness
especially when it comes to time synchronizing with high resolution time
sources. Another astonishing implementation detail of the current time
keeping is the fact that we get the monotonic clock (defined by POSIX as
a continous clock source which can not be set) by subtracting a variable
offset from the real time clock, which can be set by the user and
corrected by NTP or other mechanisms.

Another well-known drawback of the current tick based implementation is
the fact that ticks happen even on a completely idle system. This is an
undesired behaviour for battery powered devices. Resolving this
currently needs a lot of quirks to the upstream time(rs) system.

The current POSIX timer implementation is also quite complex due to its
implementation on top of the tick timers. We are forced to e.g. keep
track of armed CLOCK_REALTIME timers to readjust them when the clock has
been set. The POSIX timer API, as defined by Posix Specification 1003.1,
is inconsistent in the treatment of relative and absolute CLOCK_REALTIME
timers. Absolute timers are influenced by clock_set, relative timers are
not. This also applies to nanosleep. The specification of nanosleep
states on the other hand that the sleeping time must not be less than
the given timeout measured by CLOCK_REALTIME. Setting the clock while a
nanosleep is scheduled leads to an interesting situation:

Process 1 Process2
t1=get_timeofday()
nanosleep(20s)
set_timeofday(relative -20s)
t2=get_timeofday()
t2 - t1 =~ 0s

Implementing high resolution timers on top of a the current system also
requires a lot of quirks to keep the timer API usable for both high
resolution and tick based timers.

Such kinds of 'interaction artifacts' between the tick based data
structures and algorithms and the high-resolution data structures, even
if looked at without knowing anything about the time(r) subsystem,
already point in the direction of separating 'high resolution time(r)'
and 'low resolution timeout' APIs and subsystems.

As mentioned earlier, the switch to 1ms ticks during 2.5 development
series turned out to cause certain regressions and was recently
corrected to 4ms. One common type of regression was 'timer
soft-interrupt overrun', i.e. when processing related to a timer tick
did not finish within one jiffy, causing a domino effect on the timing
quality of the system.

What are the reasons? During the work on integrating high resolution
timers into the -rt tree we observed a lot of details related to this
problem. When changing the period of the timer tick (changing HZ) the
size of the primary timer wheel in the core code remains unchanged. This
results in the fact that the primary timer wheel [into which wheel the
secondary wheels are 'cascaded' periodically] becomes capable of
handling a smaller time span than before. The CONFIG_BASE_SMALL option
makes it even worse. Here is a table of the capacity limits of the
primary wheel:

HZ CONFIG_BASE_SMALL=n CONFIG_BASE_SMALL=y

100 2560 ms 640 ms
250 1024 ms 256 ms
1000 256 ms 64 ms

So one source of regression is the increased necessity to move
non-expired timers from the outer wheels to the primary wheel.

This alone does not explain all the regressions yet though. We did some
instrumentation and statistics on the timer code related to common use
cases, where the regressions showed up - machines with high networking
and/or disk I/O load.

This revealed a reasonable explanation for this behaviour. Both
networking and disk I/O arm a lot of timeout timers (the maximum number
of armed timers during the tests observed was ~400000). The majority of
those timeouts are in the range of 0.5 seconds. As frequently seen with
timeout timers, most of those timers never expire, but under high load
they get easily into a time line where they have to be moved from the
outer wheel to the primary timer wheel, when HZ=1000. Have a look at the
timer cascading code [cascade() in kernel/timer.c] to see the penalty...

Another source of regression is the fact that quite a lot of timer
functions execute long lasting codepaths. E.g. in the networking code
rt_secret_rebuild() does a loop over rt_hash_mask (1024 in my case),
over entries and over some subsequent variable sized loops inside each
step. On a 300MHZ PPC system this accumulated to a worst case total of
>5ms (!). The networking code contains more of those loops in timer
functions and the worst case szenario is when all of those loops happen
in the same jiffy and block the timely delivery of other timers. There
are other culprits, but those in the networking code are the most
obvious ones. This went almost unnoticed on HZ=100 systems, but on
HZ=1000 based machines those effects surfaced. The change to HZ=250 is
just hiding the problem rather than solving it.

Another weird effect of the changes to the time tick period is the fact
that a lot of places in the code are using HZ incorrectly. Even today,
more than a year after the switchover. Many of those usages still assume
HZ=100 or even have a completely wrong understanding of the mechanism
provided by the Linux kernel timer API (which is unrelated to the HZ
changes of course).

These observations together finally led to the complete seperation of
the high resolution timer data structures from the jiffy wheel, in the
HRT-RT integration work. (to further reduce latencies we also separated
softirq threads, but that is another topic.)


What is solved by the available patches ?
-----------------------------------------

As said before each of the patches addresses a particular part of the
time(r) related problems. Some of them are competing implementations.

We don't want to put down the efforts of the particular projects, but one
outstanding patch is John Stultz's work on the new time of day
subsystem. It really addresses one of the substantial linux time(r)
problems in a very generic and architecture-independent way, upon which
the other efforts can build cleanly.

The other patches mostly relate to tickless systems and high resolution
timers, and are providing great proof of concept implementations but
suffer from the bindings to particular architectures and the
restrictions that the current upstream Linux timer core code imposes
upon them.


What changes are required?
--------------------------

The conclusions of my recent work on Linux time(rs) related problems and
the analysis of related patches are:

1. The HZ/jiffy based usage of time in the kernel code has to be
converted to human time units.

2. A clean seperation of all related APIs and subsystems is necessary
even if they have interdependencies and shared functionality


| HZ/jiffy conversion

The conversion of users of HZ/jiffy based timing to human time units is
necessary to allow changes to the core timer subsystem without breaking
the users all over the kernel. Looking at the code most HZ/jiffie timers
are using more or less correct conversions from human time units anyway.
A positive side effect of such a cleanup is the necessary auditing for
correctness.


| API seperation

- time sources
- time synchronization
- time of day API
- timers API
- timeout API

- time sources:

The number and the resolution of available time sources varies a lot
over architectures and particular architecture specific platform
implementations. Some of them are only run time detectable. NEWTOD
provides a excellent code base for time source abstraction, but a couple
of details have to be discussed:

- resolution selection
- resolution and architecture dependend interface
- support for tick bound and tick less systems including a clean
integration into the interrupt handling code.
- usability of timesources for differrent purposes (timekeeping,
high/low res event scheduling)
- 32- vs. 64-bit arithmetic

- time synchronisation:

Time syncronization corrects the frequency skew of the time source. It
must provide a plugable interface for time synchronisation mechanisms to
allow the flexible implementation of time synchronization sources:

- None
- NTP
- GPS
- RTC
- ...

- time of day API:

The time of day API makes use of the eventually frequency corrected time
sources to implement the "human readable" interface. It is also
responsible for the translation of the monotonic time source - time
since system (re)booted - to the wall clock - real time - time. (Real
time in this context must not be confused with "real time" in the sense
of determinism.)


- timer API:

The timer API provides finegrained precision timers with relation to the
time of day subsystem. It provides the functionality of:

- precise interval scheduling
- precise timeouts

with or without high resolution timing support depending on system
configuration and system capabilities.


- timeout API:

The timeout API provides a coarse resolution interface for timeout
purposes. As pointed out before the majority of timers are related to
timeouts. What's the nature of timeouts ?

- Timeout timers are usually armed to cover an error condition

- Most of those timers never expire (the non timer related good
condition arrives before expiry)

- The demands on resolution are usually quite low. It does not make
any difference if an error condition is detected a few or even a
few hundred milliseconds earlier or later. The relevant point is
that the error is detected at all.

On a heavy loaded web server ~95%+ of all timers (almost all of them
armed by network or I/O kernel code) are removed before expiry. The
remaining timers which really expire are mostly timers requested from
application code. The major usage there is some periodic supervisor
code, which checks program status or other application relevant
information, and delays.


Conclusion
----------

Before inclusion of extensions to the current timer implementation e.g.
dynamic ticks, a API seperation and cleanup has to be done. Integrating
new functionality on top of the current code will just introduce a lot
of quirks and oddities which make the necessary cleanup and rework
harder.

John Stultz timeofday patches provide an excellent and solid base to
solve the first 3 of 5 points of the API seperation changes.

Ktimers add the timer API seperation with a clean way to integrate high
resolution time keeping.

The combination of both patches provides the grounds and leads the way
to the cleanup of the timeout API and the implementation of
dyntick/tickless support without introducing additional ugliness.


What is solved by ktimers ?
===========================

ktimers seperate the "timer API" from the "timeout API". ktimers are
used for:

- nanosleep
- posixtimers
- itimers

The implementation was done with following constraints in mind:

- Not bound to jiffies
- Multiple time sources
- Per CPU timer queues
- Simplification of absolute CLOCK_REALTIME posix timers
- High resolution timer aware
- Allows the timeout API to reschedule the next event
(for tickless systems)

Ktimers enqueue the timers into a time sorted list, which is implemented
with a rbtree, which is effiecient and already used in other performance
critical parts of the kernel. This is a bit slower than the timer wheel,
but due to the fact that the vast majority of timers is actually
expiring it has to be waged versus the cascading penalty.

The code supports multiple time sources. Currently implemented are
CLOCK_REALTIME and CLOCK_MONOTONIC. They provide seperate timer queues
and support functions.

The time ordered implementation and storage of the expiry time as the
time of the selected time source removes the hard work of
reprogramming all armed absolute CLOCK_REALTIME posix timers when the
clock was set

During the initial implementation phase the choice of a time storage
format had to be done. Dispite the previous discussion about 64-bit time
storage vs. timespec structures, the decision was made to use plain
64-bit variables. The rationale behind this is:

1. Simple calculations (add, sub, compare), which are the ones used in
the fastpaths are simpler and better to handle than struct
timespec.

2. The storage size is the same for 32-bit systems, but half the size
on 64-bit machines

3. The resulting binary code size is smaller due to the simpler fast
path operations. The comparison of the resulting binary code size
of a function which resembles parts of the hotpaths in the
enqueue and expiry code make this very clear. All compiled with
gcc-3.4 -O2.

AMD64 I386 ARM PPC32 M68K
nsec_t_ops e2 11c fc 1ac ce
timespec_ops 19c 144 1c0 280 156

Smaller binary code usually executes faster.

4. The kludge introduced by timespec arithmetics is horrible to
maintain. The simple and straight forward 64-bit calculation have
much less of surprises and potential error sources hidden.

5. The areas where the 64-bit nsec_t value has to be converted to
timespec / timeval are very restricted and can be optimized
further. Except for one odd POSIX timer related case (cyclic timer
with no signal delivery) the calculation is simple and straight
forward. The most discussed code in John Stultz timeofday patch is
the conversion in gettimeofday(). This can be easily solved by a
low overhead storage in both formats. The POSIX timespec / timeval
interface to userspace for the apparently often used gettimeofday
syscall must not be used as an red herring to clutter the complete
kernel timer and time keeping subsystems with this uneffective
representation of time.

ktimers are available in a patch series for easier review:

- ktimer_base.patch

The basic implementation of ktimers. The timer queues are called
inside the existing timer softirq. The time ordered queue
implementation allows to remove all the abs_list functionality from
posix-timers.c. Converted interfaces: itimers, nanosleep, posix
timers. There is no change to the current kernel time keeping system
required. The base patch utilizes existing interfaces.

The following add on patches are not provided for ad hoc inclusion as
they contain third party patches. The reason for providing this series
is to demonstrate the future use of ktimers and the simple
extensibility for the impelemtation of high resolution timers.
Especially John Stultz timeofday patch is a complete seperate issue
and just used due to the ability to provide high resolution timers in
a simple and non intrusive way.

The full patch series is available from
http://www.tglx.de/private/tglx/ktimers/patches-ktimer-tod-hrt.tar.bz2

- ktimer_hres.patch

Generic extension to the base patch to support high resolution
timers. The high resolution changes are the seperation of the
softirq, the high resolution interrupt function and the timer
reprogramming management.

- timeofday_b5 patch

Integration of John Stultz timeofday patches to have a clean
abstraction of time sources for a non intrusive implementation of
high resolution timers. This patch will be replaced by the reworked
version which is currently developed by John Stultz.

- timeofday_fixup patch

Fixup clashing inlines

- ktimer_tod.patch

extend the timeofday API and switch ktimers to use the timeofday API

- hres_i386_support.patch

Patch to support high resolution timers on i386 with local APIC
timer used for high resolution events. Proof of concept with the
restriction to local APIC as high resolution event source at the
moment. The main point is to prove that high resolution timers do
not require large and intrusive patches anymore due to the cleanup
of the time system.

Test coverage:

The complete patch is tested with the posix timer tests, which all
pass. It survives a couple of stress tests and shows no flaws when
integrated into the -rt tree. Of course this is brand new code, but
it is designed simple and robust from ground and got a thorough
review by a couple of people.

Some notes to the patch size(s):

- ktimer_base.patch:
17 files changed, 1328 insertions(+), 859 deletions(-)
code (-) 642 (+) 1058
comments (-) 217 (+) 270
The added code is mostly the base functionality of the ktimers
itself. The most cleanups happen in posix-timers.c, where all the
code related to the clock_was_set adjustment of absolute
CLOCK_REALTIME timers is removed. Converts itimers, nanosleep and
posixtimers to ktimer users

- ktimer_hres_patch
3 files changed, 330 insertions(+), 7 deletions(-)
code (-) 6 (+) 232
comments (-) 1 (+) 98
Add the generic infrastructure for high resolution timers. No non
POSIX clocks introduced, all ktimer users are automatically converted

- timeofday_b5.patch
73 files changed, 2926 insertions(+), 2675 deletions(-)
code (-) 2100 (+) 2146
comments (-) 575 (+) 780
A balanced patch providing a great improvement of functionality and
abstraction.

- hres_i386_support.patch
13 files changed, 635 insertions(+), 10 deletions(-)
code (-) 9 (+) 386
comments (-) 1 (+) 249
The largest addon is a header file containing scaled math operations, which
needs to be cleaned up.

- Total patch size
97 files changed, 5238 insertions(+), 3544 deletions(-)
code (-) 2751 (+) 3829
comments (-) 793 (+) 1409

Comparision numbers:
- hrt-common.patch
13 files changed, 1464 insertions(+), 108 deletions(-)
code (-) 91 (+) 879
comments (-) 17 (+) 585
Most code is added to the
Only posixtimers are supported. Add seperate non POSIX clocks
(CLOCK_REALTIME_HR and CLOCK_MONOTONIC_HR)

- i386-hrt.patch (reduced to apic code)
13 files changed, 1371 insertions(+), 63 deletions(-)
code (-) 51 (+) 910
comments (-) 12 (+) 461

- combined
26 files changed, 2835 insertions(+), 171 deletions(-)
code (-) 142 (+) 1789
comments (-) 29 (+) 1046

Summary:

The ktimer/timeofday/hrt combination adds ~1050 lines of source and
provides a clean API seperation and a lot of code/functionality
cleanup.

The high resolution timer patches add ~1750 lines of code for high
resolution time keeping without further functional improvements or
API cleanups.



2005-09-19 21:47:32

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem

On Mon, 2005-09-19 at 23:04 +0200, [email protected] wrote:
> ktimers seperate the "timer API" from the "timeout API". ktimers are

Sorry for double posting.

mailer / operator madness

tglx


2005-09-19 22:04:38

by Christoph Lameter

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem

On Mon, 19 Sep 2005 [email protected] wrote:

> sources. Another astonishing implementation detail of the current time
> keeping is the fact that we get the monotonic clock (defined by POSIX as
> a continous clock source which can not be set) by subtracting a variable
> offset from the real time clock, which can be set by the user and
> corrected by NTP or other mechanisms.

The benefit or drawback of that implementation depends which time is more
important: realtime or monotonic time. I think the most used time value is
realtime and not monotonic time. Having the real time value in xtime
saves one addition when retrieving realtime.

2005-09-19 22:17:05

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem

On Mon, 2005-09-19 at 15:03 -0700, Christoph Lameter wrote:
> On Mon, 19 Sep 2005 [email protected] wrote:
>
> > sources. Another astonishing implementation detail of the current time
> > keeping is the fact that we get the monotonic clock (defined by POSIX as
> > a continous clock source which can not be set) by subtracting a variable
> > offset from the real time clock, which can be set by the user and
> > corrected by NTP or other mechanisms.
>
> The benefit or drawback of that implementation depends which time is more
> important: realtime or monotonic time. I think the most used time value is
> realtime and not monotonic time. Having the real time value in xtime
> saves one addition when retrieving realtime.

Thats only partially true.

Granted, the most used time in user space is clock_realtime
(gettimeofday() / clock_gettime(CLOCK_REALTIME).

But do we really want to discuss one add instruction ?

The most used time in kernel space is clock_monotonic.
Thats partially a result of the rather odd POSIX specs regarding
relative CLOCK_REALTIME timers.

Also the basic prerequisite for for high resolution timers is a fast and
simple access to clock_monotonic rather than to a backward corrected
clock_realtime representation.

Kernel code speed in hot pathes must have precedence over code executed
on behalf of userspace if its not completely out of bounds. One add/sub
is definitely not.

We should rather ask glibc people why gettimeofday() / clock_getttime()
is called inside the library code all over the place for non obvious
reasons.

tglx


2005-09-19 22:24:57

by Christoph Lameter

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem

On Tue, 20 Sep 2005, Thomas Gleixner wrote:

> Also the basic prerequisite for for high resolution timers is a fast and
> simple access to clock_monotonic rather than to a backward corrected
> clock_realtime representation.

Yup that may be a reason to tolerate the add for realtime.

> We should rather ask glibc people why gettimeofday() / clock_getttime()
> is called inside the library code all over the place for non obvious
> reasons.

You can ask lots of application vendors the same question because its all
over lots of user space code. The fact is that gettimeofday() /
clock_gettime() efficiency is very critical to the performance of many
applications on Linux. That is why the addtion of one add instruction may
better be carefully considered. Many platforms can execute gettimeofday
without having to enter the kernel.

2005-09-19 22:39:27

by Chris Friesen

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem

Thomas Gleixner wrote:

> We should rather ask glibc people why gettimeofday() / clock_getttime()
> is called inside the library code all over the place for non obvious
> reasons.

From an app point of view, there are any number of reasons to check the
time frequently.

--debugging
--flight-recorder style logs
--if you've got timers in your application, you may want to check to
make sure that you didn't get woken up early (the linux behaviour of
returning unused time in select is not portable)
--the app might be tracking it's own behaviour, measuring how long code
paths take for its own accounting purposes
--emulators (vmware, UML, etc.) often want to check the time quite
frequently


Chris

2005-09-19 22:44:08

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem

On Mon, 2005-09-19 at 15:24 -0700, Christoph Lameter wrote:

> > We should rather ask glibc people why gettimeofday() / clock_getttime()
> > is called inside the library code all over the place for non obvious
> > reasons.
>
> You can ask lots of application vendors the same question because its all
> over lots of user space code. The fact is that gettimeofday() /
> clock_gettime() efficiency is very critical to the performance of many
> applications on Linux. That is why the addtion of one add instruction may
> better be carefully considered.

Hmm. I don't understand the argument line completely.

1. The kernel has to provide ugly mechanisms because a lot of
applications implementations are doing the Wrong Thing ?

2. All gettimeofday implementations I have looked at do a lot of math
anyway, so its definitely more interesting to look at those oddities
rather than discussing a single add. John Stulz timeofday rework have a
clean solution for this - please do not argue about the div64 in his
original patches which he is reworking at the moment.

> Many platforms can execute gettimeofday
> without having to enter the kernel.

Which ones ? How is this achieved with respect to all the time adjust,
correction... code ?

tglx


2005-09-19 22:50:57

by john stultz

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem

On Tue, 2005-09-20 at 00:44 +0200, Thomas Gleixner wrote:
> On Mon, 2005-09-19 at 15:24 -0700, Christoph Lameter wrote:
>
> > > We should rather ask glibc people why gettimeofday() / clock_getttime()
> > > is called inside the library code all over the place for non obvious
> > > reasons.
> >
> > You can ask lots of application vendors the same question because its all
> > over lots of user space code. The fact is that gettimeofday() /
> > clock_gettime() efficiency is very critical to the performance of many
> > applications on Linux. That is why the addtion of one add instruction may
> > better be carefully considered.
>
> Hmm. I don't understand the argument line completely.
>
> 1. The kernel has to provide ugly mechanisms because a lot of
> applications implementations are doing the Wrong Thing ?
>
> 2. All gettimeofday implementations I have looked at do a lot of math
> anyway, so its definitely more interesting to look at those oddities
> rather than discussing a single add. John Stulz timeofday rework have a
> clean solution for this - please do not argue about the div64 in his
> original patches which he is reworking at the moment.

The simplest solution is to keep wall-time maintained in a timespec as
well as the nsec_t based system_time/wall_time_offset combo. This avoids
the extra add in the hotpath, and only costs an extra add at interrupt
time.

I'll have an updated patch that includes some of Roman's suggestions
from earlier soon.

> > Many platforms can execute gettimeofday
> > without having to enter the kernel.
>
> Which ones ? How is this achieved with respect to all the time adjust,
> correction... code ?

Many arches have userspace gtod implementations (x86-64, ppc64, and ia64
as well). Although my timeofday code allows for this as well (I had it
working for x86-64 awhile back).

thanks
-john


2005-09-19 22:54:41

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem

On Mon, 2005-09-19 at 16:39 -0600, Christopher Friesen wrote:
> Thomas Gleixner wrote:
>
> > We should rather ask glibc people why gettimeofday() / clock_getttime()
> > is called inside the library code all over the place for non obvious
> > reasons.
>
> From an app point of view, there are any number of reasons to check the
> time frequently.
>
> --debugging

Non standard case.

> --flight-recorder style logs

If you want to implement such stuff efficiently you rely on rdtscll() on
x86 or other monotonic easy accessible time souces and not on a
permanent call to gettimeofday.

> --if you've got timers in your application, you may want to check to
> make sure that you didn't get woken up early (the linux behaviour of
> returning unused time in select is not portable)

#ifdef is portable


Please beware me of red herrings. If application developers code with
respect to random OS worst case behaviour then they should not complain
that OS N is having an additional add instruction in one of the pathes.

tglx


2005-09-19 22:58:23

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem

On Mon, 2005-09-19 at 15:50 -0700, john stultz wrote:
> > 2. All gettimeofday implementations I have looked at do a lot of math
> > anyway, so its definitely more interesting to look at those oddities
> > rather than discussing a single add. John Stulz timeofday rework have a
> > clean solution for this - please do not argue about the div64 in his
> > original patches which he is reworking at the moment.
>
> The simplest solution is to keep wall-time maintained in a timespec as
> well as the nsec_t based system_time/wall_time_offset combo. This avoids
> the extra add in the hotpath, and only costs an extra add at interrupt
> time.

<NITPICKING>
The crucial question is what's the hot path ? Depending on the
application type I want to avoid the add in the interrupt code. :)
</NITPICKING>


> Many arches have userspace gtod implementations (x86-64, ppc64, and ia64
> as well). Although my timeofday code allows for this as well (I had it
> working for x86-64 awhile back).

Was not aware of that. Thanks for clarification.

tglx


2005-09-19 23:05:28

by Christoph Lameter

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem

On Tue, 20 Sep 2005, Thomas Gleixner wrote:

> Hmm. I don't understand the argument line completely.
>
> 1. The kernel has to provide ugly mechanisms because a lot of
> applications implementations are doing the Wrong Thing ?

Lets skip the "wrong thing"... Or are you saying that glibc and all the
apps are all wrong?

Applications call gettimeofday for a variety of reasons. One is because it
is widely available over different platformsn and application want to
schedule things, need timestamps etc etc.

> > Many platforms can execute gettimeofday
> > without having to enter the kernel.
>
> Which ones ? How is this achieved with respect to all the time adjust,
> correction... code ?

IA64 f.e. has a special instruction that allows access to kernel user
space without having to do a context switch.

2005-09-19 23:12:15

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem

On Mon, 2005-09-19 at 16:04 -0700, Christoph Lameter wrote:
> On Tue, 20 Sep 2005, Thomas Gleixner wrote:
>
> > Hmm. I don't understand the argument line completely.
> >
> > 1. The kernel has to provide ugly mechanisms because a lot of
> > applications implementations are doing the Wrong Thing ?
>
> Lets skip the "wrong thing"... Or are you saying that glibc and all the
> apps are all wrong?
>
> Applications call gettimeofday for a variety of reasons. One is because it
> is widely available over different platformsn and application want to
> schedule things, need timestamps etc etc.

Accepted. But I still doubt that the number of calls to gettimeofday is
in anyway justified. The question I'm asking if it is really worth a
long and epic discussion about a single add instruction ?


> > > Many platforms can execute gettimeofday
> > > without having to enter the kernel.
> >
> > Which ones ? How is this achieved with respect to all the time adjust,
> > correction... code ?
>
> IA64 f.e. has a special instruction that allows access to kernel user
> space without having to do a context switch.

Ok, was not aware of that and John kindly clarified this already.

tglx


2005-09-20 00:44:29

by George Anzinger

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem

Christoph Lameter wrote:
> On Mon, 19 Sep 2005 [email protected] wrote:
>
>
>>sources. Another astonishing implementation detail of the current time
>>keeping is the fact that we get the monotonic clock (defined by POSIX as
>>a continuous clock source which can not be set) by subtracting a variable
>>offset from the real time clock, which can be set by the user and
>>corrected by NTP or other mechanisms.

Why is this astonishing? What it really indicates is the nature of
Linux where in we have just (with 2.6) introduced the concept of
monotonic time. As such, and with few users, it made a LOT of sense to
not upset too much code by making it the primary clock. In the end, the
difference between the two clocks is a constant offset and it is only an
add in one path or the other.

An argument from the other side is that ntp works with CLOCK_REALTIME
and so that is where and what it corrects.

Agreed, this can be turned around, however, one needs folks like John
Stultz who take the time to understand ntp as well as all the other
clock issues to turn things like this around. Still, we should consider
carefully IF we want to turn it around.

A far more astonishing thing, IMHO, is the cascade in the timers code...
>
>
> The benefit or drawback of that implementation depends which time is more
> important: realtime or monotonic time. I think the most used time value is
> realtime and not monotonic time. Having the real time value in xtime
> saves one addition when retrieving realtime.
> -
Both sides of this argument have merit. Much as we would like to, we
can not change user usage. AND, in the end, they are, and will continue
to make far more calls to get the time than the kernel does. So, in raw
cpu power (or time) consumed, the user get time will win over kernel
usage. Also, the time to do a gettimeofday is easily computed with the
most simple program...
--
George Anzinger [email protected]
HRT (High-res-timers): http://sourceforge.net/projects/high-res-timers/

2005-09-20 04:57:48

by Chris Friesen

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem

Thomas Gleixner wrote:
> On Mon, 2005-09-19 at 16:39 -0600, Christopher Friesen wrote:
>
>>Thomas Gleixner wrote:

>>>We should rather ask glibc people why gettimeofday() / clock_getttime()
>>>is called inside the library code all over the place for non obvious
>>>reasons.

>>--flight-recorder style logs

> If you want to implement such stuff efficiently you rely on rdtscll() on
> x86 or other monotonic easy accessible time souces and not on a
> permanent call to gettimeofday.

Not portable across architectures, and doesn't work across all smp/numa
environments. Also not easy to compare with other nodes on the network,
whereas with ntp-synch'd nodes you can use gettimeofday() for quite
accurate correlations.

> Please beware me of red herrings. If application developers code with
> respect to random OS worst case behaviour then they should not complain
> that OS N is having an additional add instruction in one of the pathes.

Actually I'm not complaining about additional add instructions. I was
just suggesting some reasons why apps might reasonably want to know the
time frequently.

Chris

2005-09-20 05:11:11

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem

On Mon, 2005-09-19 at 22:57 -0600, Christopher Friesen wrote:
> >>--flight-recorder style logs
>
> > If you want to implement such stuff efficiently you rely on rdtscll() on
> > x86 or other monotonic easy accessible time souces and not on a
> > permanent call to gettimeofday.
>
> Not portable across architectures, and doesn't work across all smp/numa
> environments. Also not easy to compare with other nodes on the network,
> whereas with ntp-synch'd nodes you can use gettimeofday() for quite
> accurate correlations.

Sorry was a stupid argument. Withdrawn herby

> > Please beware me of red herrings. If application developers code with
> > respect to random OS worst case behaviour then they should not complain
> > that OS N is having an additional add instruction in one of the pathes.
>
> Actually I'm not complaining about additional add instructions. I was
> just suggesting some reasons why apps might reasonably want to know the
> time frequently.

ok

tglx


2005-09-20 07:17:16

by Ingo Molnar

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem


* Christoph Lameter <[email protected]> wrote:

> > We should rather ask glibc people why gettimeofday() / clock_getttime()
> > is called inside the library code all over the place for non obvious
> > reasons.
>
> You can ask lots of application vendors the same question because its
> all over lots of user space code. The fact is that gettimeofday() /
> clock_gettime() efficiency is very critical to the performance of many
> applications on Linux. That is why the addtion of one add instruction
> may better be carefully considered. Many platforms can execute
> gettimeofday without having to enter the kernel.

i think this line of argument got into a bit of a wrong direction: do we
seriously consider a single 'add' as an argument to _not_ go to a much
cleaner implementation? The answer is very simple: we dont. In the core
kernel we frequently skip other, much worthier micro-optimizations in
favor of cleanliness. The time subsystem has been limping along for
many, many years, and to bring new life into it we need John's and
Thomas's stuff. Simple as that. I'd give up much more than just a single
cycle add overhead for that ...

it's probably not even worth keeping the timespec 'cached' in parallel
to nsec_t - but in any case, speedups like that should be considered
totally separately - cleanliness, the main problem of the whole time
subsystem, comes first. _Once_ cleanliness has been achieved, we can
consider micro-optimizations anew, and judge them by how much they bring
and how they affect cleanliness.

[ Even when not considering cleanliness at all, the best opportunities
for optimizations are elsewhere. E.g. we could speed up
sys_gettimeofday() much more by skipping a number of hardware bug
workarounds that affect the quality of e.g. the TSC, and other timer
hardware details that are simpler on modern hardware. So if someone is
after sys_gettimeofday() performance, dont look for a single add (or
even a single division), go for the bigger picture first. E.g. the
vsyscall people went for the bigger picture on modern platforms and
sped sys_time up by doing it in userspace most of the time and thus
skipping hundreds of cycles of syscall entry overhead - not a cycle
like an add is. ]

Ingo

2005-09-20 08:37:55

by Ingo Molnar

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem


* Thomas Gleixner <[email protected]> wrote:

> > Applications call gettimeofday for a variety of reasons. One is because it
> > is widely available over different platformsn and application want to
> > schedule things, need timestamps etc etc.
>
> Accepted. But I still doubt that the number of calls to gettimeofday
> is in anyway justified. The question I'm asking if it is really worth
> a long and epic discussion about a single add instruction ?

it is absolutely and emphatically not worth it.

even in a hypothetical scenario [which this patchset is _not_ analogous
to] where a new, clean subsystem introduces significant overhead, but
the old subsystem is unclean, we frequently go with the new one -
because it's so much easier to speed up something that is clean, robust
and well-designed, than something that has been cobbled together!

Ingo

2005-09-21 19:51:06

by Roman Zippel

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem

Hi,

On Mon, 19 Sep 2005 [email protected] wrote:

First a quick question:

> This revealed a reasonable explanation for this behaviour. Both
> networking and disk I/O arm a lot of timeout timers (the maximum number
> of armed timers during the tests observed was ~400000).

This triggers the obvious question: where are these timers coming from?
You don't think that having that much timers in first place is little
insane (especially if these are kernel timers)?

Ok, now to the long part:

> The conclusions of my recent work on Linux time(rs) related problems and
> the analysis of related patches are:
>
> 1. The HZ/jiffy based usage of time in the kernel code has to be
> converted to human time units.
>
> 2. A clean seperation of all related APIs and subsystems is necessary
> even if they have interdependencies and shared functionality

How you get to these conclusions is still rather unclear, I don't even
really know what the problem is from just reading the pretext. You talk
about previous efforts, you talk about users abusing the timer system, but
what is actually the problem of the timer system itself?

Later it becomes clear that you want high resolution timers, what doesn't
become clear is why it's such an essential feature that everyone has to
pay the price for it (this does not only include changes in runtime
behaviour, but also the proposed API changes).

What is seriously missing here is the big picture. First off how does it
currently look like? You rather shortly mention scheduler ticks and your
analysis basically says only that it's "a bunch of ugliness". There is no
mention why they are needed in first place and there is no real
explanation why they are such a big problem for high resolution timers.

Second, an API cleanup is all nice, but the more interesting part is still
what is behind this API and this part you pretty much leave in the dark.
Basically how does the new big picture look like and how do high
resolution timer fit into it? (You are more busy defending the 64bit math,
than actually explaining why and where it's needed in the first place.)

Sorry, if this sounds harsh, but your announcement is more a random
collection of information about timers than an explanation of why ktimers
are desirable. I'm not against high resolution timers per se, but this
doesn't explain why it has to be high resolution all the way. It also
doesn't explain how it will interact with Johns work, e.g. I'm only scared
if I see this in the ktimer_hres patch:

+extern int arch_hrtimer_init(int highres);
+extern int arch_hrtimer_reprogram(nsec_t expires);
+extern void arch_hrtimer_trigger_ints(void);


Ok, so what's missing? From a basic design overview I would expect some
information about types of time within the kernel and their relationship.
We basically have three types:
- scheduler time
- wallclock time
- process time

The scheduler time (aka jiffies) is not just used for timeouts, it's the
basic time unit to schedule cpu time. It's major requirement is simplicity
- a 32bit value can always be read without locking and calculations based
on it are simple.
I exclude posix clocks here as it can be used with both wallclock and
process time. The main difference between them is that the latter is user
programmable. Here we get to the core problem of timer ticks: the current
timer system is designed around a simple timer model, which is not
reprogrammable, so the timer resolution available to user space is
limited to the timer tick resolution.

Johns patches now introduce two major new concepts as a generic mechanism
(and not just hidden somewhere in arch code): 1) a timer source
abstraction, 2) making wallclock updates independent of the timer tick.
BTW here you completely miss the "main point of criticizm", the 64bit math
is a problem, but the main problem is that he completely changed the NTP
kernel model. I don't deny that the NTP code could use some updates
itself, but that's a completely separate problem. Regarding the timer
system it's only important how to synchronize NTP time with the kernel
wallclock time, as soon as you get that right, the whole 64bit math
problematic becomes irrelevant.

The existence of the timer source abstraction is a major requirement for
further improvements (in this regard it's already suspicious, that you put
major changes before Johns patch). The next major change would be to add
the possibility to reprogram a timer source, the scheduler can use this to
skip timer ticks and e.g. itimer can offer higher resolution timers. The
main point here is before we get to any API decisions, we need to develop
a model how a single time source can drive multiple users. Your split
between user timers and kernel timeouts leaves this question completely
open.

The next step (_after_ reprogrammable timer sources) would be increasing
the timer resolution. Here I'm not at all convinced, that we need to
change everything to nanosecond resolution, we can easily make this a
config option which either ties process time resolution to scheduler time
or makes it independent. The first would make process time a 32bit ms
value (basically current behaviour), the latter can make it to a 64bit ns
value. Anyone trying to introduce nsec_t in common code really needs to
come up with some better arguments why calculations in ns are necessary
unconditionally, instead of making the resolution configurable.

In summary please provide a larger picture for your changes, it's
especially important to desribe the relationship between the various
systems. The API definition is only the last step and is derived from
these relationships.

bye, Roman

2005-09-21 22:41:13

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem

Hi,

On Wed, 2005-09-21 at 21:50 +0200, Roman Zippel wrote:
> First a quick question:
>
> > This revealed a reasonable explanation for this behaviour. Both
> > networking and disk I/O arm a lot of timeout timers (the maximum number
> > of armed timers during the tests observed was ~400000).
>
> This triggers the obvious question: where are these timers coming from?
> You don't think that having that much timers in first place is little
> insane (especially if these are kernel timers)?

Quick answer: Networking and disk I/O. Insane load on a 4 way SMP
machine. Check yourself. :)

FYI. I'm not the only one who observed this.

> Ok, now to the long part:
>
> > The conclusions of my recent work on Linux time(rs) related problems and
> > the analysis of related patches are:
> >
> > 1. The HZ/jiffy based usage of time in the kernel code has to be
> > converted to human time units.
> >
> > 2. A clean seperation of all related APIs and subsystems is necessary
> > even if they have interdependencies and shared functionality
>
> How you get to these conclusions is still rather unclear, I don't even
> really know what the problem is from just reading the pretext. You talk
> about previous efforts, you talk about users abusing the timer system, but
> what is actually the problem of the timer system itself?

Actually there are several problems in the timer system:

1. HZ/jiffie boundness in the current implementation is problematic vs.
the conversion of human time units, which is the way programmers think
and the way time values are defined in data sheets, to HZ/jiffies. There
are dozens of places which get this completely wrong - either due to the
missunderstanding or dumbness of programmers or due to the fact that HZ
is not longer =100 and nobody cared to change that affected piece of
code. Using human time units in the API solves this in a clear way.
There is no problem to have conversion functions which convert those to
jiffies/HZ or whatever internal representation during compile or if
necessary run time, but relying on non constant units of time in an API
is utterly wrong and error prone.


2. The all in one solution for "timeouts" and "timers".

I think I made this point rather clear, but nevertheless:

Timeouts are coarse grained functions to catch error conditions. The
vast majority of those never expire. The implementation emphasis is fast
insertion/removal.

Timers are (possibly/desirably) fine grained functions to control
program flow in a time ordered way. The vast majority of those expire.
The implementation emphasis is time accuracy. Don't confuse accuracy
with high resolution!

Intermingling those types (1.) /(2.) leads to an obvious conflict in
interests and to restrictions of extensibility / flexibility especially
when (3.) applies.

3. There _are_ provable abuses of the current timer ("timeout") system.
Functions which run longer than the next timer tick have to be
considered insane. I just pointed this out for reference and I sent a
note to the subsystem maintainer(s) to make them aware of that before
writing this. If you read the writeup without being biased in front you
might notice that there are explicit examples of those insane abuses of
timers. If you consider those as legitimate we can stop the discussion
right now. I also pointed out that the HZ=1000->HZ=250 change just hides
this rather than solving the underlying problem.

> Later it becomes clear that you want high resolution timers, what doesn't
> become clear is why it's such an essential feature that everyone has to
> pay the price for it (this does not only include changes in runtime
> behaviour, but also the proposed API changes).

1. Yes, the final goal are high resolution timers. I never denied that.

2. The integration of high resolution timers is a long standing issue,
which was rejected mostly due to the intrusiveness of the proposed
patches.

3. ktimers itself are designed to be aware of a possible extension for
HRT, but they provide a benefit without high resolution timers and
nobody has to pay a price for them when they are not
configured/implemented. The removal of the abstime list reprogramming in
posix-timers.c is definitely a worthwhile cleanup and totaly unrelated
to HRT.

> What is seriously missing here is the big picture. First off how does it
> currently look like? Y

Point taken. The writeup was intended for people who are familiar with
the current situation.

> You rather shortly mention scheduler ticks and your
> analysis basically says only that it's "a bunch of ugliness".

Wrong. I did not say there is a "bunch of ugliness" in general. Even not
between the lines. Period.

The only places where "ugl*" is used in the whole writeup are:
"ticks introduce a bunch of ugliness especially when it comes to time
synchronizing with high resolution time
sources."

"The combination of both patches provides the grounds and leads the way
to the cleanup of the timeout API and the implementation of
dyntick/tickless support without introducing additional ugliness."

I'm sure that both assertions are true, especially in the context they
were made.

Roman, please keep this at a serious level. I dont have the intention to
participate on another "UFT-8, Reiser4, sizeof(*p)" lkml debate club.

> There is no mention why they are needed in first place and there is no real
> explanation why they are such a big problem for hich resolution timers.

Maybe above does shed more light on it ?

> Second, an API cleanup is all nice, but the more interesting part is still
> what is behind this API and this part you pretty much leave in the dark.

Whats in the dark ?

> Basically how does the new big picture look like and how do high
> resolution timer fit into it? (You are more busy defending the 64bit math,
> than actually explaining why and where it's needed in the first place.)

I also explained why I wanted to seperate "timeout" and "timers" APIs. I
explained why I choose rbtree and I explained why I used 64bit math and
at least why 64 bit math is not that evil as commonly seen.

There is also code and if you need more details, call my sales
departent....

> Sorry, if this sounds harsh, but your announcement is more a random
> collection of information about timers than an explanation of why ktimers
> are desirable.

First of all, this is volunteer work and I _did_ take the time to write
up a detailed explanation at all rather than throwing a random patch
with a 10 line bla into the arena. Do you expect that I write a PhD
thesis on that ?

Second this writeup was not targeted for John User. ....

> I'm not against high resolution timers per se, but this
> doesn't explain why it has to be high resolution all the way.

Where is high resolution all the way. Care to read the patch ? It's high
resolution aware and it does take out odd areas of code by design.

> It also doesn't explain how it will interact with Johns work,

"The following add on patches are not provided for ad hoc inclusion as
they contain third party patches. The reason for providing this series
is to demonstrate the future use of ktimers and the simple extensibility
for the impelemtation of high resolution timers. Especially John Stultz
timeofday patch is a complete seperate issue
and just used due to the ability to provide high resolution timers in a
simple and non intrusive way."

Isn't this clear enough ?

> e.g. I'm only scared
> if I see this in the ktimer_hres patch:
>
> +extern int arch_hrtimer_init(int highres);
> +extern int arch_hrtimer_reprogram(nsec_t expires);
> +extern void arch_hrtimer_trigger_ints(void);

Whats scary ? This is proof of concept. See above !

Have you a simpler solution and did you care to read the comment in
arch/i386/kernel/hrtimer.c ?

> Ok, so what's missing? From a basic design overview I would expect some
> information about types of time within the kernel and their relationship.
> We basically have three types:
> - scheduler time
> - wallclock time
> - process time

What about monotonic time ?

> The scheduler time (aka jiffies) is not just used for timeouts, it's the
> basic time unit to schedule cpu time. It's major requirement is simplicity
> - a 32bit value can always be read without locking and calculations based
> on it are simple.
> I exclude posix clocks here as it can be used with both wallclock and
> process time.

What about monotonic time ?

> The main difference between them is that the latter is user
> programmable.

wallclock is reprogrammable too and it introduces a bunch of horrible
functions in posix-timers.c. grep for abs_list. I explained why its
horrible already.

> Here we get to the core problem of timer ticks: the current
> timer system is designed around a simple timer model, which is not
> reprogrammable, so the timer resolution available to user space is
> limited to the timer tick resolution.
>
> Johns patches now introduce two major new concepts as a generic mechanism
> (and not just hidden somewhere in arch code): 1) a timer source
> abstraction, 2) making wallclock updates independent of the timer tick.

1. I'm well aware of the addressed problems in Johns patches.

2.I dont see any hidden arch code in the ktimers patch. Do you ?

fs/exec.c | 9
fs/proc/array.c | 6
include/asm-generic/div64.h | 18
include/linux/ktimer.h | 142 +++++++
include/linux/posix-timers.h | 87 ++--
include/linux/sched.h | 4
include/linux/time.h | 65 ++-
include/linux/timer.h | 2
init/main.c | 1
kernel/Makefile | 3
kernel/exit.c | 2
kernel/fork.c | 5
kernel/itimer.c | 83 +---
kernel/ktimers.c | 826 ++++++++++++++++++++++++++++++++++++++++++
kernel/posix-cpu-timers.c | 23 -
kernel/posix-timers.c | 832 ++++++++-----------------------------------
kernel/timer.c | 59 ---


May I politely remind you, that I provided the complete patch series
just to show the future use and clearly stated that it is just a proof
of concept implemetation on top of ktimers.

> BTW here you completely miss the "main point of criticizm", the 64bit math
> is a problem, but the main problem is that he completely changed the NTP
> kernel model. I don't deny that the NTP code could use some updates
> itself, but that's a completely separate problem. Regarding the timer
> system it's only important how to synchronize NTP time with the kernel
> wallclock time, as soon as you get that right, the whole 64bit math
> problematic becomes irrelevant.

Roman, what are you trying to achieve ? Finding a playground for
rabulistic discussions ?

> The existence of the timer source abstraction is a major requirement for
> further improvements (in this regard it's already suspicious, that you put
> major changes before Johns patch).

Whats suspicious on that ? Seperating the "timeout" API and the "timer"
API has nothing to do with Johns patches.

> The next major change would be to add the possibility to reprogram a
> timer source, the scheduler can use this to
> skip timer ticks and e.g. itimer can offer higher resolution timers. The
> main point here is before we get to any API decisions, we need to develop
> a model how a single time source can drive multiple users. Your split
> between user timers and kernel timeouts leaves this question completely
> open.

Did I claim, that ktimers solve this problem?

No.

The patches are related but address different aspects of the overall
problem without conflicting with each other. Quite the contrary: they
complement each other.

I clearly stated that the reprogramming of timer events, which are not
addressed by ktimers and I never claimed ktimers does, is a completely
different problem.

> The next step (_after_ reprogrammable timer sources) would be increasing
> the timer resolution.

Please let me correct you here. Adding reprogrammable timer events
before you have the core timer system ready is wrong by design.
Providing reprogrammable timer events is simple, but when the timer core
system which depends on timer events is not ready for that you implement
useless things and implement likely stuff which is not matching the
requirements of the generic system.

> Here I'm not at all convinced, that we need to
> change everything to nanosecond resolution, we can easily make this a
> config option which either ties process time resolution to scheduler time
> or makes it independent. The first would make process time a 32bit ms
> value (basically current behaviour), the latter can make it to a 64bit ns
> value. Anyone trying to introduce nsec_t in common code really needs to
> come up with some better arguments why calculations in ns are necessary
> unconditionally, instead of making the resolution configurable.

Please provide a whatever time unit based and configurable / flexible
solution for that instead of making unprovable claims !

> In summary please provide a larger picture for your changes, it's
> especially important to desribe the relationship between the various
> systems. The API definition is only the last step and is derived from
> these relationships.

Sorry. When you are not able to get the larger picture in your mind, I
doubt that you are the right person to discuss this topic. This kind of
argument is not working with me, especially not when repeated all over
the place.

Please provide an alternative solution (i.e. code to review) yourself.

tglx



2005-09-22 10:32:30

by Pavel Machek

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem

Hi!

> > Also the basic prerequisite for for high resolution timers is a fast and
> > simple access to clock_monotonic rather than to a backward corrected
> > clock_realtime representation.
>
> Yup that may be a reason to tolerate the add for realtime.
>
> > We should rather ask glibc people why gettimeofday() / clock_getttime()
> > is called inside the library code all over the place for non obvious
> > reasons.
>
> You can ask lots of application vendors the same question because its all
> over lots of user space code. The fact is that gettimeofday() /
> clock_gettime() efficiency is very critical to the performance of many
> applications on Linux. That is why the addtion of one add instruction may
> better be carefully considered. Many platforms can execute gettimeofday
> without having to enter the kernel.

Eh? One addition is going to be lost in noise compared to syscall overhead.
(For vsyscall, you may be closer to truth, but I doubt it. You could still gain
more than one addition by using some strange calling convention).


--
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms

2005-09-22 12:59:28

by Ingo Molnar

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem


* Thomas Gleixner <[email protected]> wrote:

> > > This revealed a reasonable explanation for this behaviour. Both
> > > networking and disk I/O arm a lot of timeout timers (the maximum number
> > > of armed timers during the tests observed was ~400000).
> >
> > This triggers the obvious question: where are these timers coming from?
> > You don't think that having that much timers in first place is little
> > insane (especially if these are kernel timers)?
>
> Quick answer: Networking and disk I/O. Insane load on a 4 way SMP
> machine. Check yourself. :)

a busy network server can easily have millions of timers pending. I once
had to increase a server's 16 million tw timer sysctl limit ...

Ingo

2005-09-22 23:10:06

by Roman Zippel

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem

Hi,

On Thu, 22 Sep 2005, Thomas Gleixner wrote:

> > > This revealed a reasonable explanation for this behaviour. Both
> > > networking and disk I/O arm a lot of timeout timers (the maximum number
> > > of armed timers during the tests observed was ~400000).
> >
> > This triggers the obvious question: where are these timers coming from?
> > You don't think that having that much timers in first place is little
> > insane (especially if these are kernel timers)?
>
> Quick answer: Networking and disk I/O. Insane load on a 4 way SMP
> machine. Check yourself. :)

This no answer at all, you only repeat what you already said above. :(
Care to share your knowledge?

First off, I can understand that you're rather upset with what I wrote,
unfortunately you got overly defensive, so could you please next time
not reply immediately and first sleep over it, an overly emotional reply
is funny to read but not exactly useful.
The main problem with your text is that you jump from one topic to the
next, making it impossible to create a coherent picture from it. Maybe
it's obvious for you, but I'd like you to fill in these holes to better
understand the design desicions.

> Actually there are several problems in the timer system:
>
> 1. HZ/jiffie boundness in the current implementation is problematic vs.
> the conversion of human time units, which is the way programmers think
> and the way time values are defined in data sheets, to HZ/jiffies.

This is an API problem. How is this related to core timer system (and more
specially to ktimers).

> 2. The all in one solution for "timeouts" and "timers".
>
> I think I made this point rather clear, but nevertheless:
>
> Timeouts are coarse grained functions to catch error conditions. The
> vast majority of those never expire. The implementation emphasis is fast
> insertion/removal.
>
> Timers are (possibly/desirably) fine grained functions to control
> program flow in a time ordered way. The vast majority of those expire.
> The implementation emphasis is time accuracy. Don't confuse accuracy
> with high resolution!
>
> Intermingling those types (1.) /(2.) leads to an obvious conflict in
> interests and to restrictions of extensibility / flexibility especially
> when (3.) applies.

You don't make it obvious at all. You jump from problems with _kernel_
timers, to the introduction of a new subsytem to manage _process_ timers.
Why don't you fix the problems with kernel timers separately?
Only because kernel timer require less precision, doesn't mean you can
make them imprecise. The timer accuracy is defined by the timer source and
I expect it to be the same for both kernel and process timers.

> 3. There _are_ provable abuses of the current timer ("timeout") system.
> Functions which run longer than the next timer tick have to be
> considered insane.

Fine. Again, how exactly does this problem with _kernel_ timer relate to
the introduction of ktimers?

> The only places where "ugl*" is used in the whole writeup are:
> "ticks introduce a bunch of ugliness especially when it comes to time
> synchronizing with high resolution time
> sources."
>
> "The combination of both patches provides the grounds and leads the way
> to the cleanup of the timeout API and the implementation of
> dyntick/tickless support without introducing additional ugliness."
>
> I'm sure that both assertions are true, especially in the context they
> were made.

It's nice that you're sure of it, but as long don't provide the means to
verify them, they are just assertions.
You never really expand on what this "bunch of ugliness" is, you talk
about timer abusers and API problems, but what is the problem with timer
ticks related to high resolution timers?
How are dyntick/tickless support and ktimers supposed to share the same
timer source? This never becomes clear from you document and neither from
your example code.

> > Basically how does the new big picture look like and how do high
> > resolution timer fit into it? (You are more busy defending the 64bit math,
> > than actually explaining why and where it's needed in the first place.)
>
> I also explained why I wanted to seperate "timeout" and "timers" APIs. I
> explained why I choose rbtree and I explained why I used 64bit math and
> at least why 64 bit math is not that evil as commonly seen.

I don't say that 64bit math is evil, I just question that it's required -
small, but important difference.
The main problem with your ktimer patch is that it's another all-in-one
patch, it simply changes too many aspects at once. If you want to
introduce a new API, you can do so by first introducing a small layer
which maps to the old layer. This makes it easier to see and prove any
potential improvement.

> > Sorry, if this sounds harsh, but your announcement is more a random
> > collection of information about timers than an explanation of why ktimers
> > are desirable.
>
> First of all, this is volunteer work and I _did_ take the time to write
> up a detailed explanation at all rather than throwing a random patch
> with a 10 line bla into the arena. Do you expect that I write a PhD
> thesis on that ?

No, but I hope you didn't just expect hoorray calls. I appreciate that you
try to explain this, but you should also expect criticism. You make it
sound I'm doing this just for fun and to annoy you, but I'm trying to keep
the quality level of Linux up and half finished ideas out of it. There is
a reason that the answer took a few days, I really thought very carefully
about this based on your document and patches. It's still possible I
missed something, but feel free to point this out (preferably in a
civilized manner).

> Second this writeup was not targeted for John User. ....

Well, I'm not sure whom you targeted, but it wasn't coherent enough for
the avarage LKML reader (at least for those wanting more than just cursory
information).

> > I'm not against high resolution timers per se, but this
> > doesn't explain why it has to be high resolution all the way.
>
> Where is high resolution all the way. Care to read the patch ? It's high
> resolution aware and it does take out odd areas of code by design.

It's not just high resolution aware, it makes all calculation in high
resolution _unconditionally_, which makes it high resolution all the way.

> > It also doesn't explain how it will interact with Johns work,
>
> "The following add on patches are not provided for ad hoc inclusion as
> they contain third party patches. The reason for providing this series
> is to demonstrate the future use of ktimers and the simple extensibility
> for the impelemtation of high resolution timers. Especially John Stultz
> timeofday patch is a complete seperate issue
> and just used due to the ability to provide high resolution timers in a
> simple and non intrusive way."
>
> Isn't this clear enough ?

No and I explained why I think that these are not separate issues at all.

> > Ok, so what's missing? From a basic design overview I would expect some
> > information about types of time within the kernel and their relationship.
> > We basically have three types:
> > - scheduler time
> > - wallclock time
> > - process time
>
> What about monotonic time ?

It's derived from wallclock time.

> > The main difference between them is that the latter is user
> > programmable.
>
> wallclock is reprogrammable too and it introduces a bunch of horrible
> functions in posix-timers.c. grep for abs_list. I explained why its
> horrible already.

I said _user_ programmable, wallclock time is usually NTP controlled.

> > Johns patches now introduce two major new concepts as a generic mechanism
> > (and not just hidden somewhere in arch code): 1) a timer source
> > abstraction, 2) making wallclock updates independent of the timer tick.
>
> 1. I'm well aware of the addressed problems in Johns patches.
>
> 2.I dont see any hidden arch code in the ktimers patch. Do you ?

That's not what I meant (and if you had taken the time to think about it,
instead of just being angry at me, I'm sure you would have noticed
yourself), this is e.g. about code in arch/i386/kernel/timers/ or
arch/ppc/kernel/time.c.

> > BTW here you completely miss the "main point of criticizm", the 64bit math
> > is a problem, but the main problem is that he completely changed the NTP
> > kernel model. I don't deny that the NTP code could use some updates
> > itself, but that's a completely separate problem. Regarding the timer
> > system it's only important how to synchronize NTP time with the kernel
> > wallclock time, as soon as you get that right, the whole 64bit math
> > problematic becomes irrelevant.
>
> Roman, what are you trying to achieve ? Finding a playground for
> rabulistic discussions ?

Ok, I'm at a loss here, what are you trying to tell me? Is the above in
any way incorrect?

> > The existence of the timer source abstraction is a major requirement for
> > further improvements (in this regard it's already suspicious, that you put
> > major changes before Johns patch).
>
> Whats suspicious on that ? Seperating the "timeout" API and the "timer"
> API has nothing to do with Johns patches.

Related changes should be done in a logical order, which I'm obviously
disagree about with you.

> > The next major change would be to add the possibility to reprogram a
> > timer source, the scheduler can use this to
> > skip timer ticks and e.g. itimer can offer higher resolution timers. The
> > main point here is before we get to any API decisions, we need to develop
> > a model how a single time source can drive multiple users. Your split
> > between user timers and kernel timeouts leaves this question completely
> > open.
>
> Did I claim, that ktimers solve this problem?
>
> No.
>
> The patches are related but address different aspects of the overall
> problem without conflicting with each other. Quite the contrary: they
> complement each other.
>
> I clearly stated that the reprogramming of timer events, which are not
> addressed by ktimers and I never claimed ktimers does, is a completely
> different problem.

No, it's part of the same problem, how are scheduler and your ktimers
supposed to share the same time source?

> > The next step (_after_ reprogrammable timer sources) would be increasing
> > the timer resolution.
>
> Please let me correct you here. Adding reprogrammable timer events
> before you have the core timer system ready is wrong by design.
> Providing reprogrammable timer events is simple, but when the timer core
> system which depends on timer events is not ready for that you implement
> useless things and implement likely stuff which is not matching the
> requirements of the generic system.

So what are these requirements? Please be more specific.

> > Here I'm not at all convinced, that we need to
> > change everything to nanosecond resolution, we can easily make this a
> > config option which either ties process time resolution to scheduler time
> > or makes it independent. The first would make process time a 32bit ms
> > value (basically current behaviour), the latter can make it to a 64bit ns
> > value. Anyone trying to introduce nsec_t in common code really needs to
> > come up with some better arguments why calculations in ns are necessary
> > unconditionally, instead of making the resolution configurable.
>
> Please provide a whatever time unit based and configurable / flexible
> solution for that instead of making unprovable claims !

What unprovable claims? What would change in the basic principles, if you
would do them with 32bit ms values instead of 64bit ns values? The basic
math should be the same and should demonstrate the basic principles
equally well and since the current timer code has only ms (at HZ=1000)
precision the behaviour should be the same as well.

> > In summary please provide a larger picture for your changes, it's
> > especially important to desribe the relationship between the various
> > systems. The API definition is only the last step and is derived from
> > these relationships.
>
> Sorry. When you are not able to get the larger picture in your mind, I
> doubt that you are the right person to discuss this topic. This kind of
> argument is not working with me, especially not when repeated all over
> the place.
>
> Please provide an alternative solution (i.e. code to review) yourself.

You seriously trying to tell me, that anyone doing reviews must provide
alternative solution first?
Please refrain from personal attacks, it should be obvious that we have
different ideas how to implement this, what I'm trying to do is to get you
to explain your picture better and I'm trying to explain my picture, so we
can understand each other better. We won't get very far as long as you are
just pissed at me for disagreeing with you.

bye, Roman

2005-09-22 23:31:39

by Chris Friesen

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem

Roman Zippel wrote:

> This no answer at all, you only repeat what you already said above. :(
> Care to share your knowledge?

Ingo already gave an example. "a busy network server can easily have
millions of timers pending. I once had to increase a server's 16 million
tw timer sysctl limit ..."

> I don't say that 64bit math is evil, I just question that it's required -
> small, but important difference.

<snip>

> It's not just high resolution aware, it makes all calculation in high
> resolution _unconditionally_, which makes it high resolution all the way.

<snip>

> What unprovable claims? What would change in the basic principles, if you
> would do them with 32bit ms values instead of 64bit ns values? The basic
> math should be the same and should demonstrate the basic principles
> equally well and since the current timer code has only ms (at HZ=1000)
> precision the behaviour should be the same as well.

I see two assumptions that lead to the API using nanoseconds:

1) it is desireable to have a human-time-unit timer API, so that people
can specify timeouts in easily-understood units
2) eventually we will use sub-ms resolution timers, so it makes sense to
just jump to nanoseconds as our base timing unit

Are these reasonable starting points, or is there disagreement on these?

Maybe it would make sense to have the API be in nanoseconds and
internally use 32bit ms for now, and only change to 64bit nanos when we
actually move to sub-ms resolution timers.

Chris

2005-09-23 00:26:00

by Roman Zippel

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem

Hi,

On Thu, 22 Sep 2005, Christopher Friesen wrote:

> Roman Zippel wrote:
>
> > This no answer at all, you only repeat what you already said above. :(
> > Care to share your knowledge?
>
> Ingo already gave an example. "a busy network server can easily have millions
> of timers pending. I once had to increase a server's 16 million tw timer
> sysctl limit ..."

I hoped for a more concrete example (i.e. pointer to source), but this one
at least gave me enough hints where to look.
There are ways to avoid this huge number of added timers, but this
requires a better analysis of the problem.

> I see two assumptions that lead to the API using nanoseconds:
>
> 1) it is desireable to have a human-time-unit timer API, so that people can
> specify timeouts in easily-understood units
> 2) eventually we will use sub-ms resolution timers, so it makes sense to just
> jump to nanoseconds as our base timing unit
>
> Are these reasonable starting points, or is there disagreement on these?
>
> Maybe it would make sense to have the API be in nanoseconds and internally use
> 32bit ms for now, and only change to 64bit nanos when we actually move to
> sub-ms resolution timers.

Actually the decision to use ns has nothing to do with API issues.
<linux/jiffies.h> has already a lot of options to specify timeouts for
kernel timer. The official userspace API is mostly timespec/timeval.
The nsec_t type is an _internal_ type to manage time, so this makes it
possible to do something like this:

#ifdef CONFIG_HIRES_TIMER
typedef u64 ktime_t;
#else
typedef u32 ktime_t;
#endif

bye, Roman

2005-09-23 02:25:09

by john stultz

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem

On Fri, 2005-09-23 at 01:09 +0200, Roman Zippel wrote:
> On Thu, 22 Sep 2005, Thomas Gleixner wrote:
> > > It also doesn't explain how it will interact with Johns work,
> >
> > "The following add on patches are not provided for ad hoc inclusion as
> > they contain third party patches. The reason for providing this series
> > is to demonstrate the future use of ktimers and the simple extensibility
> > for the impelemtation of high resolution timers. Especially John Stultz
> > timeofday patch is a complete seperate issue
> > and just used due to the ability to provide high resolution timers in a
> > simple and non intrusive way."
> >
> > Isn't this clear enough ?
>
> No and I explained why I think that these are not separate issues at all.

You had a long response, and some of the terminology is confusing, so
I'm not sure exactly how to help address your concern.

You did accurately summarized that in my patches I introduced two major
generic concepts:
1. The time source abstraction of a free-running counter to be used for
timekeeping. (Note: not "timer source" - timers generate interrupts,
time sources do not necessarilly)
2. Allowing the wall-clock (as well as a monotonic system clock) to run
independent of timer ticks, via the time source abstraction.

Your complaint was that in making these change the NTP model has been
changed, although I'm not sure if I agree, but hopefully the patches I
sent out today can continue that discussion. :)

It is my claim that these two concepts allow for correct and robust
timekeeping, and additionally give some flexibility so larger change
(such as dynamic ticks) can be made without worrying too much about
their effect on timekeeping. However, it does not provide anything
interface wise that is not currently exist, it just merely cleans some
of the interfaces up.

So, ignoring correctness in the face of lost ticks and other timekeeping
problems, my patch is not strictly necessary for what Thomas is doing.
It just so happens that my do_monotonic_clock() interface provided a
correct monotonic system time in nanoseconds and that meshed very well
with Thomas' work.


> > > Ok, so what's missing? From a basic design overview I would expect some
> > > information about types of time within the kernel and their relationship.
> > > We basically have three types:
> > > - scheduler time
> > > - wallclock time
> > > - process time


My clarity in writing is sometimes an issue, but let me take a shot at
it(Thomas, or anyone, feel free to correct me).

Currently we have two main domains of time in the kernel: xtime and
jiffies.

jiffies:
jiffies is a simple HZ frequency software maintained (interrupt based)
counter. Since it is fairly low-res, it easily fits in a single long
int(well, a reasonable portion of it does), and requires no locking to
atomically access, making it very easy and fast to use. It is used for
almost all in-kernel time accounting, from scheduler and processes
accounting to soft-timers. Since it is not exported to userspace, the
counter is less robust on some arches in the face of things like lost
ticks. It is not NTP corrected, has a limited range on some arches (in
it's single long int form) and discrepancies between the requested HZ
value and the actual tick frequency (ACTHZ)can cause additional
confusion when mapping jiffies to actual time.


xtime:
xtime provides nsec resolution software maintained NTP adjusted wall
clock which is exported to userspace. Along with wall_to_monotonic we
get a NTP adjusted monotonic clock. This is expected to be very robust
and accurate, however since it is so finely grained it requires 64 bits
(or a timepsec) to store it, which can cause performance concerns in the
cases where nsec resolution is unnecessary.


Now that the time domains are covered, how do we use them?

Soft-timers / Timeouts (aka: In kernel software maintained timers):
Soft-timers provide a internal kernel mechanism for running code at a
later specified time. Soft timers use jiffies for expiration, so they
are fast to use, but are limited to HZ resolution. As Thomas already
discussed (as well as LWN's article) they are very frequently used for
timeouts that are removed before they expire. Additionally, since they
are jiffies based, they have problems when mapping back and forth with
wall time, and do not robustly handle lost ticks (however due to their
common use, this is not normally an issue).


When userspace requests for action at a future time are made to the
kernel, they are made using some form of human time unit (flat usecs or
timespecs, whatever). Currently inside the kernel, we must convert these
requests to jiffies and use the soft-timer subsystem. This limits both
the range and low resolution of the request. Additionally, the
discrepancies between HZ, ACTHZ, NTP adjustments and lost ticks can
cause for additional inaccuracies in the conversion.


ktimers (From my understanding, again Thomas, correct me as needed):
ktimers provide a completely separate soft-timer list, which can use
either the wall-clock or the monotonic-clock as its domain for addition
and expiration. Since users may specify nanosecond resolution requests,
ktimers preserve the request in a nanosecond form. This eliminates any
discrepancies between jiffies and wall or monotonic time, and allows for
future sub-HZ latencies for expiration (in combination with a high-res
hardware-timer interrupt source). Additionally, in-kernel users who
desire high-precision wall/monotonic clock based timers could find
ktimers useful.


So the existing fast interface remains with the same jiffies time
domain. ktimers just add a secondary high-resolution interface that maps
to the wall/monotonic_clock domain.

> > > The existence of the timer source abstraction is a major requirement for
> > > further improvements (in this regard it's already suspicious, that you put
> > > major changes before Johns patch).
> >
> > Whats suspicious on that ? Seperating the "timeout" API and the "timer"
> > API has nothing to do with Johns patches.
>
> Related changes should be done in a logical order, which I'm obviously
> disagree about with you.

However, in this case the ktimer patch Thomas mailed out is really
independent from my change. I believe my change helps insure his
interfaces behave properly (you don't want your monotonic clock jumping
backwards occasionally!), but they do not affect his code's logic.


The fact that I make wall time update independent of timer ticks is
really just for simplicity and correctness. It is in no way a
requirement for wall/monotonic domain based timers or high-res timers
(my code does not provide any higher resolution interface then what is
already there).

Maybe if you were talking about the dynamic tick changes, would it make
sense to wait and do my changes first, and Thomas is not proposing that
at this moment.


> > > The next major change would be to add the possibility to reprogram a
> > > timer source, the scheduler can use this to
> > > skip timer ticks and e.g. itimer can offer higher resolution timers. The
> > > main point here is before we get to any API decisions, we need to develop
> > > a model how a single time source can drive multiple users. Your split
> > > between user timers and kernel timeouts leaves this question completely
> > > open.
> >
> > Did I claim, that ktimers solve this problem?
> >
> > No.
> >
> > The patches are related but address different aspects of the overall
> > problem without conflicting with each other. Quite the contrary: they
> > complement each other.
> >
> > I clearly stated that the reprogramming of timer events, which are not
> > addressed by ktimers and I never claimed ktimers does, is a completely
> > different problem.
>
> No, it's part of the same problem, how are scheduler and your ktimers
> supposed to share the same time source?

They don't share a time source.

ktimers are in the xtime domain, the scheduler is in the jiffies domain.


Sorry, this was really much longer then I wanted it to be. Hopefully it
wasn't too repetitious, and reasonably clear.

thanks
-john

2005-09-23 06:49:35

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem

On Fri, 2005-09-23 at 02:25 +0200, Roman Zippel wrote:
> > Maybe it would make sense to have the API be in nanoseconds and internally use
> > 32bit ms for now, and only change to 64bit nanos when we actually move to
> > sub-ms resolution timers.
>
> Actually the decision to use ns has nothing to do with API issues.
> <linux/jiffies.h> has already a lot of options to specify timeouts for
> kernel timer. The official userspace API is mostly timespec/timeval.
> The nsec_t type is an _internal_ type to manage time, so this makes it
> possible to do something like this:
>
> #ifdef CONFIG_HIRES_TIMER
> typedef u64 ktime_t;
> #else
> typedef u32 ktime_t;
> #endif

Sure that's possible, but the 32bit storage format has its limitiations
and it is not possible to keep the code compatible for both use cases.

Posix timers - both CLOCK_REALTIME and CLOCK_MONOTONIC - can be
programmed in absolute time. In a 32bit representation with ms
resolution we can store ~49 days, so we can not fit the value which come
up from user space wihtout correction/conversion except we limit the use
cases to 49 days uptime and clock realtime < 49days since the epoch.

If we can not fit the given value into the internal representation, we
have to do exactly what the current implementation of clock realtime in
posix-timers.c has to do. Storing information about xtime / monotonic
offset, adding the timer to yet another list (abs_list) convert to
jiffies and in case the clock gets set, run through all the affected
timers in abs_list recalculate the expiry value and requeue them.

The idea of ktimers is to use the requested time given by a timespec in
human time without any corrections, so we actually can avoid the above.

Also doing time ordered insertion into a list introduces incompabilities
between 32/64 bit storage formats.

I carefully waged the necessary quirk load vs. the cleanliness,
simplicity and robustness of a pure 64 bit implementation. The resulting
payload for 32bit systems, which is in the range of 1-3 instructions per
fast path operation (add, sub, compare) is not worth the trouble IMO to
give up a clean, simple and robust design, which also allows high
resolution timers with no big change to the base implementation.


tglx


2005-09-23 08:27:04

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem

On Fri, 2005-09-23 at 01:09 +0200, Roman Zippel wrote:
> > Quick answer: Networking and disk I/O. Insane load on a 4 way SMP
> > machine. Check yourself. :)
>
> This no answer at all, you only repeat what you already said above. :(
> Care to share your knowledge?

Each network connection, each disk I/O operation arms a timeout timer to
cover error conditions. Increasing the load on those increases the
number of armed timers. At the same time this increased load keeps the
timers longer active as it takes more time to detect that the "good"
condition arrived on time.

> You don't make it obvious at all. You jump from problems with _kernel_
> timers, to the introduction of a new subsytem to manage _process_ timers.
> Why don't you fix the problems with kernel timers separately?
> Only because kernel timer require less precision, doesn't mean you can
> make them imprecise.

I did not make them imprecise. ktimers changes excactly nothing there.

> The timer accuracy is defined by the timer source and
> I expect it to be the same for both kernel and process timers.

As John pointed out correctly, this is a seperation of time domains. Is
Johns explanation enough ?

> > > Basically how does the new big picture look like and how do high
> > > resolution timer fit into it? (You are more busy defending the 64bit math,
> > > than actually explaining why and where it's needed in the first place.)
> >
> > I also explained why I wanted to seperate "timeout" and "timers" APIs. I
> > explained why I choose rbtree and I explained why I used 64bit math and
> > at least why 64 bit math is not that evil as commonly seen.
>
> I don't say that 64bit math is evil, I just question that it's required -
> small, but important difference.
> The main problem with your ktimer patch is that it's another all-in-one
> patch, it simply changes too many aspects at once. If you want to
> introduce a new API, you can do so by first introducing a small layer
> which maps to the old layer. This makes it easier to see and prove any
> potential improvement.

I dont see ktimers as an all in one change everything patch.

It addresses exactly _one_ problem and nothing else. It seperates time
domains and their representation and handling.

I dont see how you want to map this new API to the old layer.

Lets look at the patch:

ktimers.c/h introduce the new API (no change to existing code)

The other files changed are the conversions of the users to this new
API. Mostly small changes except for posix-timers.c, where the abs_list
handling was removed as it was not longer necessary.

Providing a new API wihtout making use of it and showing the benefits is
rather pointless. I consider the simplification of posix-timers as a
valuable benefit.

> > > I'm not against high resolution timers per se, but this
> > > doesn't explain why it has to be high resolution all the way.
> >
> > Where is high resolution all the way. Care to read the patch ? It's high
> > resolution aware and it does take out odd areas of code by design.
>
> It's not just high resolution aware, it makes all calculation in high
> resolution _unconditionally_, which makes it high resolution all the way.

Please see the other mail explaining 32/64bit issues.

> > > It also doesn't explain how it will interact with Johns work,
> >
> > "The following add on patches are not provided for ad hoc inclusion as
> > they contain third party patches. The reason for providing this series
> > is to demonstrate the future use of ktimers and the simple extensibility
> > for the impelemtation of high resolution timers. Especially John Stultz
> > timeofday patch is a complete seperate issue
> > and just used due to the ability to provide high resolution timers in a
> > simple and non intrusive way."
> >
> > Isn't this clear enough ?
>
> No and I explained why I think that these are not separate issues at all.

These issues are seperate even if ktimers uses values provided by the
time of day subsystem.

ktimers need the current time in the representation of CLOCK_REALTIME
and CLOCK_MONOTIC.

ktimers doe not rely on Johns work. Ktimers can make use of Johns work
as it uses the values provided by the existing timeofday code now.

The usage is simpler if Johns work is in place. Nothing else.

> > > The main difference between them is that the latter is user
> > > programmable.
> >
> > wallclock is reprogrammable too and it introduces a bunch of horrible
> > functions in posix-timers.c. grep for abs_list. I explained why its
> > horrible already.
>
> I said _user_ programmable, wallclock time is usually NTP controlled.

I consider sys_adjtimex() and sys_settimeofday() as user interfaces.
Both affect wallclock and therefor affect timers related to wallclock.

> > 2.I dont see any hidden arch code in the ktimers patch. Do you ?
>
> That's not what I meant (and if you had taken the time to think about it,
> instead of just being angry at me, I'm sure you would have noticed
> yourself), this is e.g. about code in arch/i386/kernel/timers/ or
> arch/ppc/kernel/time.c.

I dont see what you want. The submitted ktimers patch does not contain a
single change to arch/xxx files.

If you refer to the proof of concept high resolution timer
implementation, then I really do not understand why you are insisting on
that. It is proof of concept to verify the usability of the ktimer base
implementation for high resolution timing - nothing more. I used Johns
patches for that proof of concept implementation as they made life
simpler.

So there is no point in discussing this.

I clearly said that from the beginning. I provided this addon patch
series to give interested developers a possibility to look at it and
compare and contrast it to the existing high resolution timer
implementations.

> > > The existence of the timer source abstraction is a major requirement for
> > > further improvements (in this regard it's already suspicious, that you put
> > > major changes before Johns patch).
> >
> > Whats suspicious on that ? Seperating the "timeout" API and the "timer"
> > API has nothing to do with Johns patches.
>
> Related changes should be done in a logical order, which I'm obviously
> disagree about with you.

Again. The patches are orthogonal. So where is the logical order ?

> > I clearly stated that the reprogramming of timer events, which are not
> > addressed by ktimers and I never claimed ktimers does, is a completely
> > different problem.
>
> No, it's part of the same problem, how are scheduler and your ktimers
> supposed to share the same time source?

<SNIP...>

Admittedly everything which is dealing with aspects of time is related,
but it can and must be seperated into different subsystems, which make
use of the provided interfaces.

1. Time tick

- constant frequency tick

Provides interface for:
reading the current tick count

2. Time of day

- handles frequency adjustments of the timesource
- keeps track of monotonic time
- provides the representation of wall clock time
- handles the adjustment of wall clock time

Provides interfaces for:
reading monotonic time
reading wallclock time
adjusting the frequency of the time source
setting wallclock time


Makes possibly use of the interface:
(Depends on the availability of time sources)
time tick:read tick count

3. Timeout API

- Time tick based timer handling
- Solely used for in kernel purposes

Provides interfaces for:
adding timers
modifying timers
deleting timers

Makes use of the interface:
time tick:read tick count

4. Timer API

- monotonic clock based timers
- realtime (wallclock) clock based timers
- Mainly intended for application timers

Provides interfaces for:
adding timers
modifying timers
deleting timers

Makes use of the interfaces:
timeofday: read monotonic time
timeofday: read wallclock time

So we have four seperate building blocks related to time, but clearly
seperated.

The current implementation in the kernel is providing only 1,2,3. The
timeofday API is somewhat intermingled with the tick code. The Timer API
is implemented with a bunch of workarounds by using the Timeout API.

ktimers provide the seperation of Timeout API and Timer API and therefor
the seperation of the time domains. ktimers do not need any changes to
1,2,3 but can benefit from what ever improvement is made in the
timeofday domain.

Johns patches address the clear seperation of time ticks and time of day
timekeeping and do not affect the timeout API nor ktimers. Once Johns
work is in place the ktimer code simply makes use of the new interfaces.

High resolution timers and dynsmic ticks need this seperations to make
them less intrusive. Of course we need an additional abstraction layer,
which handles the timer event sources.

tglx



2005-09-23 15:21:24

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem

On Fri, Sep 23, 2005 at 01:09:46AM +0200, Roman Zippel wrote:
> On Thu, 22 Sep 2005, Thomas Gleixner wrote:

[ . . . ]

> > > The main difference between them is that the latter is user
> > > programmable.
> >
> > wallclock is reprogrammable too and it introduces a bunch of horrible
> > functions in posix-timers.c. grep for abs_list. I explained why its
> > horrible already.
>
> I said _user_ programmable, wallclock time is usually NTP controlled.

I believe Thomas is concerned about workloads that need a short-term
stable timebase. For example, a process-control application might need
to accurately measure a (say) 1500-millisecond time interval. Both
user-programmability and NTP adjustments to a given timebase could
destroy the needed measurement accuracy.

Such a workload does not need the long-term tie to wallclock time that
NTP provides, but it does need the accurate short-term timekeeping that
NTP cannot provide -- NTP sacrifices short-term accuracy in order to
adjust the clock as needed to gain long-term stability.

Thomas, John, please jump in if I am missing the point here.

Thanx, Paul

2005-09-24 02:43:28

by Roman Zippel

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem

Hi,

On Fri, 23 Sep 2005, Thomas Gleixner wrote:

> Each network connection, each disk I/O operation arms a timeout timer to
> cover error conditions. Increasing the load on those increases the
> number of armed timers. At the same time this increased load keeps the
> timers longer active as it takes more time to detect that the "good"
> condition arrived on time.

You're still rather vague here...
Anyway, if the amount of active timer should be a problem, there are ways
to avoid them. For disk io it's rather simple as the timeout is mostly
constant:

start i/o:
req->expire = jiffies + timeout;
list_add_tail(req->list, &timeout_list);
if (!timer_pending(timer)) {
timer->expire = req->expire;
add_timer(timer);
}

timeout function:
req = list_head(&timeout_list);
if (time_after_eq(req->expire, jiffies)) {
// error...
} else {
timer->expire = req->expire;
add_timer(timer);
}

Network timer are a bit more difficult as the timeouts are more dynamic,
but one can at least delay arming the timer in most cases, by running a
timer every x ticks:

start timer:
if (timeout < x)
add_timer();
else
list_add_tail();

timer function:
list_for_each_entry()
add_timer();

Should the action be successfull before the timer runs, it only needs to
remove it from the private list.


> Admittedly everything which is dealing with aspects of time is related,
> but it can and must be seperated into different subsystems, which make
> use of the provided interfaces.
>
> 1. Time tick
>
> - constant frequency tick
>
> Provides interface for:
> reading the current tick count
>
> 2. Time of day
>
> - handles frequency adjustments of the timesource
> - keeps track of monotonic time
> - provides the representation of wall clock time
> - handles the adjustment of wall clock time
>
> Provides interfaces for:
> reading monotonic time
> reading wallclock time
> adjusting the frequency of the time source
> setting wallclock time
>
>
> Makes possibly use of the interface:
> (Depends on the availability of time sources)
> time tick:read tick count
>
> 3. Timeout API
>
> - Time tick based timer handling
> - Solely used for in kernel purposes
>
> Provides interfaces for:
> adding timers
> modifying timers
> deleting timers
>
> Makes use of the interface:
> time tick:read tick count
>
> 4. Timer API
>
> - monotonic clock based timers
> - realtime (wallclock) clock based timers
> - Mainly intended for application timers
>
> Provides interfaces for:
> adding timers
> modifying timers
> deleting timers
>
> Makes use of the interfaces:
> timeofday: read monotonic time
> timeofday: read wallclock time
>
> So we have four seperate building blocks related to time, but clearly
> seperated.

First, please don't say API if you talk about subsystems, it's not the
same. APIs are part of a subsystem and describe the relationships between
subsystems. I think that caused a major part of the confusion.


I don't completely agree with your picture above and while these
subsystems are mostly separate, they are not independent:

current dependencies:

1. time source ---> 2. wallclock ---> 4. process timer
\-> 3. kernel timer -/

We have currently a very simple time source, which just provides a
nonprogrammable timer interrupt. Process timer are limited in their
(programmable) resolution to kernel.

This maybe also explains better the 3 types of time I mentioned (or
domains if you prefer):
- wallclock time: NTP controlled, only readable and not programmable
- scheduler time: monotonic and unsynchronized, readable and programmable
in HZ resolution
- process time: derived from the first two.

Johns work now (hopefully) integrates wallclock functionality into the
time source:

time source ---> wallclock time
ntp library -/

This means the NTP code becomes a library which can be used by the time
source to provide a synchronized wallclock time.

The next step would be to make the time source programmable, so that we
can get rid of the kernel timer dependency from the process timer:

time source ---> kernel timer
\-> process timer

In this context the main functionality of your patch now finally becomes
understandable: making process timer independent of kernel timer. This
never became clear from your announcement, it talks about a lot of
unrelated problems, but it never gets to the actual problem.

bye, Roman

2005-09-24 03:16:05

by Roman Zippel

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem

Hi,

On Fri, 23 Sep 2005, Thomas Gleixner wrote:

> The idea of ktimers is to use the requested time given by a timespec in
> human time without any corrections, so we actually can avoid the above.
>
> Also doing time ordered insertion into a list introduces incompabilities
> between 32/64 bit storage formats.

Except that the (time) range of the list would be limited I don't really
see a big difference.
Anyway, the biggest cost is the conversion from/to the 64bit ns value and
if its main use is sorting, you can use something like this:

typedef union {
u64 tv64;
struct {
#ifdef __BIG_ENDIAN
u32 sec, nsec;
#else
u32 nsec, sec;
#endif
} tv;
} ktimespec;

To compare two time values the tv64 value is sufficient.

bye, Roman

2005-09-24 03:39:08

by Roman Zippel

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem

Hi,

On Fri, 23 Sep 2005, Paul E. McKenney wrote:

> I believe Thomas is concerned about workloads that need a short-term
> stable timebase. For example, a process-control application might need
> to accurately measure a (say) 1500-millisecond time interval. Both
> user-programmability and NTP adjustments to a given timebase could
> destroy the needed measurement accuracy.

NTP adjustments a quite small and not applied all at once, this means as
soon as the time is synchronized, we could switch CLOCK_MONOTONIC to it.

bye, Roman

2005-09-24 05:02:29

by Ingo Molnar

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem


* Roman Zippel <[email protected]> wrote:

> On Fri, 23 Sep 2005, Thomas Gleixner wrote:
>
> > Each network connection, each disk I/O operation arms a timeout timer to
> > cover error conditions. Increasing the load on those increases the
> > number of armed timers. At the same time this increased load keeps the
> > timers longer active as it takes more time to detect that the "good"
> > condition arrived on time.
>
> You're still rather vague here...

as i said before, millions of timers are easily possible, and i
personally saw in excess of 16 million active timers. I hope there was
nothing vague about that ;-)

Ingo

2005-09-24 05:16:15

by Ingo Molnar

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem


* Roman Zippel <[email protected]> wrote:

> On Fri, 23 Sep 2005, Thomas Gleixner wrote:
>
> > The idea of ktimers is to use the requested time given by a timespec in
> > human time without any corrections, so we actually can avoid the above.
> >
> > Also doing time ordered insertion into a list introduces incompabilities
> > between 32/64 bit storage formats.
>
> Except that the (time) range of the list would be limited I don't really
> see a big difference.
> Anyway, the biggest cost is the conversion from/to the 64bit ns value
> [...]

Where do you get that notion from? Have you personally measured the
performance and code size impact of it? If yes, would you mind to share
the resulting data with us?

Our data is that the use of 64-bit nsec_t significantly reduces the size
of a representative piece of code (object size in bytes):

AMD64 I386 ARM PPC32 M68K
nsec_t_ops 226 284 252 428 206
timespec_ops 412 324 448 640 342

i.e. a ~40% size reduction when going to nsec_t on m68k, in that
particular function. Even larger, ~45% code size reduction on a true
64-bit platform.

Ingo

2005-09-24 09:05:19

by James Bruce

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem

Roman Zippel wrote:
> On Fri, 23 Sep 2005, Thomas Gleixner wrote:
>>Each network connection, each disk I/O operation arms a timeout timer to
>>cover error conditions. Increasing the load on those increases the
>>number of armed timers. At the same time this increased load keeps the
>>timers longer active as it takes more time to detect that the "good"
>>condition arrived on time.
>
> You're still rather vague here...
> Anyway, if the amount of active timer should be a problem, there are ways
> to avoid them. <snip>

What does this have to do with the ktimers work itself? It's true that
other parts of the kernel shouldn't create more timers than necessary,
but the timer subsystem should be able to handle a lot of timers
regardless of that. To put it in perspective: A server doesn't run very
efficiently with a load of 1000, and that should be avoided by proper
application design. Yet we still test the scheduler on such workloads,
don't we? It's nice to know a subsystem doesn't fall over when its
stressed.

If you really feel timers are overused, please bring it up with the
maintainers of *those subsystems* which are overusing it. There's no
point in raising the issue with Thomas since he's not responsible for
how other people use/misuse an existing API.

Perhaps the real issue is that you feel we should police the kernel
usage of timers, instead of moving to a more scalable implementation.
This is one of those rare cases however where we can have cleaner, more
modular code, barely longer than before, which is also more scalable.
The only thing left to measure is the performance impact, but the
authors haven't gotten that far yet. Instead of jumping to conclusions
now, let's wait until we have some real numbers, shall we?

Jim Bruce

2005-09-24 10:35:46

by Roman Zippel

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem

Hi,

On Sat, 24 Sep 2005, Ingo Molnar wrote:

> > Anyway, the biggest cost is the conversion from/to the 64bit ns value
> > [...]
>
> Where do you get that notion from? Have you personally measured the
> performance and code size impact of it? If yes, would you mind to share
> the resulting data with us?
>
> Our data is that the use of 64-bit nsec_t significantly reduces the size
> of a representative piece of code (object size in bytes):
>
> AMD64 I386 ARM PPC32 M68K
> nsec_t_ops 226 284 252 428 206
> timespec_ops 412 324 448 640 342
>
> i.e. a ~40% size reduction when going to nsec_t on m68k, in that
> particular function. Even larger, ~45% code size reduction on a true
> 64-bit platform.

Without any source these numbers are not verifiable. You don't even
mention here what that "representative piece of code" is...

Anyway, Thomas mentioned that this would be from the insert/remove code
and here you omitted the most important part of my mail:

typedef union {
u64 tv64;
struct {
#ifdef __BIG_ENDIAN
u32 sec, nsec;
#else
u32 nsec, sec;
#endif
} tv;
} ktimespec;

IOW this would allow to keep the time value in timespec format and use
your nsec_t_ops for sorting.

bye, Roman

2005-09-24 13:56:27

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem

On Sat, 2005-09-24 at 12:35 +0200, Roman Zippel wrote:
> Hi,
>
> On Sat, 24 Sep 2005, Ingo Molnar wrote:
>
> > > Anyway, the biggest cost is the conversion from/to the 64bit ns value
> > > [...]
> >
> > Where do you get that notion from? Have you personally measured the
> > performance and code size impact of it? If yes, would you mind to share
> > the resulting data with us?
> >
> > Our data is that the use of 64-bit nsec_t significantly reduces the size
> > of a representative piece of code (object size in bytes):
> >
> > AMD64 I386 ARM PPC32 M68K
> > nsec_t_ops 226 284 252 428 206
> > timespec_ops 412 324 448 640 342
> >
> > i.e. a ~40% size reduction when going to nsec_t on m68k, in that
> > particular function. Even larger, ~45% code size reduction on a true
> > 64-bit platform.
>
> Without any source these numbers are not verifiable. You don't even
> mention here what that "representative piece of code" is...

struct base {
nsec_t now;
struct ktimer *timers[16];
struct ktimer *running;
};

void nsec_t_ops(struct base *base, struct ktimer *next,
nsec_t *tim, int mode)
{
int i;
nsec_t now = base->now;

for (i = 0; i < 16; i++) {

void (*fn)(void *);
void *data;
struct ktimer *timer = base->timers[i];

if (timer->expires > now)
break;
timer->expired = now;
fn = timer->function;
data = timer->data;
base->running = timer;
fn(data);
base->running = NULL;
}

switch(mode) {
case 0:
next->expires = *tim;
break;
case 1:
next->expires = now + *tim;
break;
case 2:
next->expires += *tim;
break;
case 3:
while (next->expires > now) {
next->expires += *tim;
}
break;
}
base->timers[0] = next;
}


versus:

#define NSEC_PER_SEC 1000000000
struct base {
struct timespec now;
struct ktimer *timers[16];
struct ktimer *running;
};

#define timespec_gt(a,b) \
(((a).tv_sec > (b).tv_sec) ? 1 : \
(((a).tv_sec < (b).tv_sec) ? 0 : \
((a).tv_nsec > (b).tv_nsec)))

#define timespec_addptr(a,b) \
(a)->tv_sec = ((a)->tv_sec + (b)->tv_sec); \
(a)->tv_nsec = ((a)->tv_nsec + (b)->tv_nsec); \
if ((a)->tv_nsec >= NSEC_PER_SEC){ \
(a)->tv_nsec -= NSEC_PER_SEC; \
(a)->tv_sec++; \
}

#define timespec_addppp(c,a,b) \
(c)->tv_sec = ((a)->tv_sec + (b)->tv_sec); \
(c)->tv_nsec = ((a)->tv_nsec + (b)->tv_nsec); \
if ((c)->tv_nsec >= NSEC_PER_SEC){ \
(c)->tv_nsec -= NSEC_PER_SEC; \
(c)->tv_sec++; \
}


void timespec_ops(struct base *base, struct ktimer *next,
struct timespec *tim, int mode)
{
int i;
struct timespec now = base->now;

for (i = 0; i < 16; i++) {

void (*fn)(void *);
void *data;
struct ktimer *timer = base->timers[i];

if (timespec_gt(timer->expires, now))
break;
timer->expired = now;
fn = timer->function;
data = timer->data;
base->running = timer;
fn(data);
base->running = NULL;
}

switch(mode) {
case 0:
next->expires = *tim;
break;
case 1:
timespec_addppp(&next->expires, &now, tim);
break;
case 2:
timespec_addptr(&next->expires, tim);
break;
case 3:
while (timespec_gt(now, next->expires)) {
timespec_addptr(&next->expires, tim);
}
break;
}
base->timers[0] = next;
}


> Anyway, Thomas mentioned that this would be from the insert/remove code
> and here you omitted the most important part of my mail:
>
> typedef union {
> u64 tv64;
> struct {
> #ifdef __BIG_ENDIAN
> u32 sec, nsec;
> #else
> u32 nsec, sec;
> #endif
> } tv;
> } ktimespec;
>
> IOW this would allow to keep the time value in timespec format and use
> your nsec_t_ops for sorting.

Yes, it works for comparisons.

But for any other operation this construct has the same problem than
struct timespec itself. You need at least an add function which is
always an add and a comparison / correction vs. nsec >= NSEC_PER_SEC.

The 64 bit nsec_t value can just be used as is without inventing a
wrapper macro for each operation.

The only point, where (k)timespec has an advantage is that the userspace
value must not be converted to nsec_t, but deducing therefor this is the
better overall solution is a fallacy.

nsec_t ktimespec

syscall:
32x32 mul
64bit add 2 x 32bit move

arm timer:
64 bit add 2 x 32 bit add
32 bit compare
32 bit sub
32 bit add

The 3 operation compensate for the 32x32
multiplication.

For interval timers you have the
32 bit compare
32 bit sub
32 bit add
additional overhead for each rearm.

The backward conversion from nsec_t to timespec is almost a non issue.
The vast majority of callers dont provide the second argument to
nanosleep(), setitimer(), set_timer() which makes the conversion
necessary and I think we optimize for the common use case.

Besides that the representation of time in nsec_t values is much
clearer.

I know that we have to deal with timespecs vs. userspace, but keeping
this representation for kernel internal usage reminds me on the BCD
calculations which were a similar 2^x vs 10^x oddity in the early days
of microprocessors. Of course they were obstinate and survived a
surprisingly long time.

tglx


2005-09-24 16:51:52

by Daniel Walker

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem

On Sat, 2005-09-24 at 15:56 +0200, Thomas Gleixner wrote:
> On Sat, 2005-09-24 at 12:35 +0200, Roman Zippel wrote:
> > Hi,
> >
> > On Sat, 24 Sep 2005, Ingo Molnar wrote:
> >
> > > > Anyway, the biggest cost is the conversion from/to the 64bit ns value
> > > > [...]
> > >
> > > Where do you get that notion from? Have you personally measured the
> > > performance and code size impact of it? If yes, would you mind to share
> > > the resulting data with us?
> > >
> > > Our data is that the use of 64-bit nsec_t significantly reduces the size
> > > of a representative piece of code (object size in bytes):
> > >
> > > AMD64 I386 ARM PPC32 M68K
> > > nsec_t_ops 226 284 252 428 206
> > > timespec_ops 412 324 448 640 342
> > >
> > > i.e. a ~40% size reduction when going to nsec_t on m68k, in that
> > > particular function. Even larger, ~45% code size reduction on a true
> > > 64-bit platform.
> >
> > Without any source these numbers are not verifiable. You don't even
> > mention here what that "representative piece of code" is...

These numbers are misleading .. Doing a total code comparison shows that
a 2.6.14-rc2+ktimers kernel is slightly bigger than a vanilla 2.6.14-rc2
kernel (gcc 4.0, defconfig) .. So your argument that "small is faster"
must mean ktimers is slower, or at least not faster ..

Making a speed argument based on code size doesn't make much sense to
me, if it's actually faster then show that it's faster.

Daniel


2005-09-24 23:45:31

by Roman Zippel

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem

Hi,

On Sat, 24 Sep 2005, Thomas Gleixner wrote:

> #define timespec_gt(a,b) \
> (((a).tv_sec > (b).tv_sec) ? 1 : \
> (((a).tv_sec < (b).tv_sec) ? 0 : \
> ((a).tv_nsec > (b).tv_nsec)))
>
> #define timespec_addptr(a,b) \
> (a)->tv_sec = ((a)->tv_sec + (b)->tv_sec); \
> (a)->tv_nsec = ((a)->tv_nsec + (b)->tv_nsec); \
> if ((a)->tv_nsec >= NSEC_PER_SEC){ \
> (a)->tv_nsec -= NSEC_PER_SEC; \
> (a)->tv_sec++; \
> }
>
> #define timespec_addppp(c,a,b) \
> (c)->tv_sec = ((a)->tv_sec + (b)->tv_sec); \
> (c)->tv_nsec = ((a)->tv_nsec + (b)->tv_nsec); \
> if ((c)->tv_nsec >= NSEC_PER_SEC){ \
> (c)->tv_nsec -= NSEC_PER_SEC; \
> (c)->tv_sec++; \
> }

Alternative for ktimespec:

#define timespec_gt(a,b) ((a).tv64 > (b).tv64)

#if BITS_PER_LONG == 64
#define timespec_addptr(a,b) \
(a).tv64 += (b).tv64; \
if ((a).tv.nsec >= NSEC_PER_SEC) { \
(a).tv64 += (u32)-NSEC_PER_SEC; \
}

#define timespec_addppp(c,a,b) \
(c).tv64 = (a).tv64 + (b).tv64; \
if ((c).tv.nsec >= NSEC_PER_SEC) { \
(c).tv64 += (u32)-NSEC_PER_SEC; \
}
#else
#define timespec_addptr(a,b) \
(a).tv.sec = ((a).tv.sec + (b).tv.sec); \
(a).tv.nsec = ((a).tv.nsec + (b).tv.nsec); \
if ((a).tv.nsec >= NSEC_PER_SEC) { \
(a).tv.nsec -= NSEC_PER_SEC; \
(a).tv.sec++; \
}

#define timespec_addppp(c,a,b) \
(c).tv.sec = ((a).tv.sec + (b).tv.sec); \
(c).tv.nsec = ((a).tv.nsec + (b).tv.nsec); \
if ((c).tv.nsec >= NSEC_PER_SEC) { \
(c).tv.nsec -= NSEC_PER_SEC; \
(c).tv.sec++; \
}
#endif

Adding the necessary conversion to the makes the difference even smaller.

> The only point, where (k)timespec has an advantage is that the userspace
> value must not be converted to nsec_t, but deducing therefor this is the
> better overall solution is a fallacy.

That's your opinion...

> nsec_t ktimespec
>
> syscall:
> 32x32 mul
> 64bit add 2 x 32bit move
>
> arm timer:
> 64 bit add 2 x 32 bit add
> 32 bit compare
> 32 bit sub
> 32 bit add
>
> The 3 operation compensate for the 32x32
> multiplication.

The multiply is not necessarly cheap, if the arch has no 32x32->64
instruction, gcc will generate a call to __muldi3().
Overall for the common case both variations don't differ much in speed
and size (for a single code path). For a few timers it likely doesn't
matter and for a lot of timers the tree insert likely dominates.

> The backward conversion from nsec_t to timespec is almost a non issue.
> The vast majority of callers dont provide the second argument to
> nanosleep(), setitimer(), set_timer() which makes the conversion
> necessary and I think we optimize for the common use case.

You know very well, that the conversion back to timespec is the killer in
your calculation. You graciously decide that the "vast majority" doesn't
want to read the timer, how did you get to that conclusion?

> Besides that the representation of time in nsec_t values is much
> clearer.

Well, that depends on the bigger picture, mainly how the timesource
manages the time. We want to optimize them for a fast get(ns)timeofday, so
we have already timespec based interfaces. Tick based sources will keep a
cached xtime timespec, so they either have to convert that to ns or
maintain another cached value just for your ktimers.
As long as you can't get rid of timespec completely (which is impossible),
there is a value in keeping it as much as possible as timespec.

bye, Roman

2005-09-25 15:50:38

by Sid Boyce

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem

OT, but something that's been bugging me for quite a while.
I cut and paste the patch from the email to a file ktimers.patch.
"# patch -l -p1 <ktimer.patch" and it returns ---
(Patch is indented 1 space.)
patching file fs/exec.c
patch: **** malformed patch at line 16: }

If I prepend 2 tabs to the line, it complains about line 17, I do the
same to line 17 and on it moves to the next. from the manpage it reads
like the "-l" should take care of the tabs so it only compares the text.
Can anyone suggest how to apply the patch? Googling didn't help.
Regards
Sid.
--
Sid Boyce ... Hamradio License G3VBV, licensed Private Pilot
Retired IBM/Amdahl Mainframes and Sun/Fujitsu Servers Tech Support
Specialist
Microsoft Windows Free Zone - Linux used for all Computing Tasks

2005-09-25 18:13:56

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem

On Sun, 25 Sep 2005, Sid Boyce wrote:

> OT, but something that's been bugging me for quite a while.
> I cut and paste the patch from the email to a file ktimers.patch.
> "# patch -l -p1 <ktimer.patch" and it returns ---
> (Patch is indented 1 space.)
> patching file fs/exec.c
> patch: **** malformed patch at line 16: }
>
> If I prepend 2 tabs to the line, it complains about line 17, I do the same to
> line 17 and on it moves to the next. from the manpage it reads like the "-l"
> should take care of the tabs so it only compares the text.
> Can anyone suggest how to apply the patch? Googling didn't help.

Save the entire email as a text file and apply it. Cut and paste usually
introduces white space damage.

2005-09-25 20:59:40

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem

On Sun, 2005-09-25 at 01:45 +0200, Roman Zippel wrote:

> > The backward conversion from nsec_t to timespec is almost a non issue.
> > The vast majority of callers dont provide the second argument to
> > nanosleep(), setitimer(), set_timer() which makes the conversion
> > necessary and I think we optimize for the common use case.
>
> You know very well, that the conversion back to timespec is the killer in
> your calculation. You graciously decide that the "vast majority" doesn't
> want to read the timer, how did you get to that conclusion?

I graciously put instrumentation into _all_ the relevant syscalls on a
desktop and a server machine. The result is that less than 1% of the
calls provide the read back variable.

tglx


2005-09-25 21:01:27

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem

On Sun, 2005-09-25 at 01:45 +0200, Roman Zippel wrote:

> The multiply is not necessarly cheap, if the arch has no 32x32->64
> instruction, gcc will generate a call to __muldi3().

Can you please point out which architectures do not have a 32x32->64
instruction ?

tglx


2005-09-26 00:02:59

by Sid Boyce

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem

Zwane Mwaikambo wrote:
> On Sun, 25 Sep 2005, Sid Boyce wrote:
>
>
>>OT, but something that's been bugging me for quite a while.
>>I cut and paste the patch from the email to a file ktimers.patch.
>>"# patch -l -p1 <ktimer.patch" and it returns ---
>> (Patch is indented 1 space.)
>>patching file fs/exec.c
>>patch: **** malformed patch at line 16: }
>>
>>If I prepend 2 tabs to the line, it complains about line 17, I do the same to
>>line 17 and on it moves to the next. from the manpage it reads like the "-l"
>>should take care of the tabs so it only compares the text.
>>Can anyone suggest how to apply the patch? Googling didn't help.
>
>
> Save the entire email as a text file and apply it. Cut and paste usually
> introduces white space damage.
>
>
>

Thanks.
regards
Sid.
--
Sid Boyce ... Hamradio License G3VBV, licensed Private Pilot
Retired IBM/Amdahl Mainframes and Sun/Fujitsu Servers Tech Support
Specialist
Microsoft Windows Free Zone - Linux used for all Computing Tasks

2005-09-27 16:48:43

by Roman Zippel

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem

Hi,

On Sun, 25 Sep 2005, Thomas Gleixner wrote:

> On Sun, 2005-09-25 at 01:45 +0200, Roman Zippel wrote:
>
> > The multiply is not necessarly cheap, if the arch has no 32x32->64
> > instruction, gcc will generate a call to __muldi3().
>
> Can you please point out which architectures do not have a 32x32->64
> instruction ?

I have no complete overview. I know that Motorola actually removed that
instruction in the M68060 (it causes an emulation trap) and it's still not
back in newer ColdFire cpus. For arm it's an optional instruction in
earlier versions (v3). For ppc it's splitted into two instructions.

For the rest you might want to check <asm/div64.h>, if div64 has to be
emulated, there are good chances this instruction has to be emulated as
well (especially in smaller embedded archs).

bye, Roman

2005-09-27 16:54:56

by Roman Zippel

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem

Hi,

On Sun, 25 Sep 2005, Thomas Gleixner wrote:

> > You know very well, that the conversion back to timespec is the killer in
> > your calculation. You graciously decide that the "vast majority" doesn't
> > want to read the timer, how did you get to that conclusion?
>
> I graciously put instrumentation into _all_ the relevant syscalls on a
> desktop and a server machine. The result is that less than 1% of the
> calls provide the read back variable.

That sill means it is used and if an application actually depends on it,
it would be penalized by your implementation.
These timers may open up new application (in kernel or user space), where
this conversion may be needed, so _only_ looking at the current numbers is
a bit misleading.

bye, Roman

2005-09-27 18:38:09

by Tim Bird

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem

Roman Zippel wrote:
> On Sun, 25 Sep 2005, Thomas Gleixner wrote:
>>Can you please point out which architectures do not have a 32x32->64
>>instruction ?

<snip>
> For the rest you might want to check <asm/div64.h>, if div64 has to be
> emulated, there are good chances this instruction has to be emulated as
> well (especially in smaller embedded archs).

Hmmm. In my experience, there are several embedded platforms
with a 32x32->64 instruction, which are lacking a div64 instruction.
I don't think checking for div64 is a very good metric here.

=============================
Tim Bird
Architecture Group Chair, CE Linux Forum
Senior Staff Engineer, Sony Electronics
=============================

2005-09-27 19:03:53

by Tim Bird

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem

Roman Zippel wrote:
> On Sun, 25 Sep 2005, Thomas Gleixner wrote:
>> Roman Zippel wrote:
>>> You know very well, that the conversion back to timespec
>>> is the killer in your calculation. You graciously
>>> decide that the "vast majority" doesn't
>>> want to read the timer, how did you get to that
>>> conclusion?
>>
>> I graciously put instrumentation into _all_ the
>> relevant syscalls on a desktop and a server machine.
>> The result is that less than 1% of the
>> calls provide the read back variable.
>
> That still means it is used and if an application
> actually depends on it, it would be penalized by
> your implementation. These timers may open up new
> application (in kernel or user space), where
> this conversion may be needed, so _only_ looking
> at the current numbers is a bit misleading.

Oh good heavens! One can always point to real or
hypothetical cases where a change like this
will result in worse performance. Will you only
be satisfied if there is provably NO performance
degradation for ANY app on ANY platform? Even
if the code is easier to maintain, and allows
for improvements in functionality and equal or
better performance for the majority of apps.
and platforms?

We're talking about a tradeoff here, and I,
of all people, should be worried about the
possible impact on low-end embedded hardware.
However, having seen some of the problems with
the current timer system in the kernel, I'm in
favor of looking at some abstraction improvements.

Unless I missed something, ktimers has not been
recommended for mainlining yet. I suspect (without
having measured it myself yet) that the
core abstraction that it proposes (timers
vs. timeouts) is an important one for improving
the kernel timing system.

Personally, I'd like to see it go into
-mm or some other experimental tree, to give
it a proper shakedown. If some nasty corner
cases show up, then let them show up under testing
rather than via conjecture.
-- Tim

=============================
Tim Bird
Architecture Group Chair, CE Linux Forum
Senior Staff Engineer, Sony Electronics
=============================

2005-09-27 20:37:07

by George Anzinger

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem

Tim Bird wrote:
> Roman Zippel wrote:
>
>>On Sun, 25 Sep 2005, Thomas Gleixner wrote:
>>
>>>Can you please point out which architectures do not have a 32x32->64
>>>instruction ?
>
>
> <snip>
>
>>For the rest you might want to check <asm/div64.h>, if div64 has to be
>>emulated, there are good chances this instruction has to be emulated as
>>well (especially in smaller embedded archs).
>
>
> Hmmm. In my experience, there are several embedded platforms
> with a 32x32->64 instruction, which are lacking a div64 instruction.
> I don't think checking for div64 is a very good metric here.

Also, even having a div64 instruction does not eliminate the asm/div64.h
as it checks for results that are >32-bits and does the right thing.
For example, letting the x86 do this divide with such a result, results
in a trap.

This is why we need to be very careful where we use the div_ll_l_rem()
which accesses just this instruction.

--
George Anzinger [email protected]
HRT (High-res-timers): http://sourceforge.net/projects/high-res-timers/

2005-09-28 16:37:11

by Roman Zippel

[permalink] [raw]
Subject: Re: [ANNOUNCE] ktimers subsystem

Hi,

On Tue, 27 Sep 2005, Tim Bird wrote:

> > That still means it is used and if an application
> > actually depends on it, it would be penalized by
> > your implementation. These timers may open up new
> > application (in kernel or user space), where
> > this conversion may be needed, so _only_ looking
> > at the current numbers is a bit misleading.
>
> Oh good heavens! One can always point to real or
> hypothetical cases where a change like this
> will result in worse performance. Will you only
> be satisfied if there is provably NO performance
> degradation for ANY app on ANY platform?

I want to get the focus at the complete picture, as this is a rather
critical area and I will be satisfied, as soon as I can see all
consequences and possibilities have been considered.

> Even
> if the code is easier to maintain, and allows
> for improvements in functionality and equal or
> better performance for the majority of apps.
> and platforms?

If that's case, you're hopefully not afraid of a few questions? Why do I
have to take the code as is and just believe the claims about it?
I like improvements as everyone, but I also want to verify them and look
at the alternatives and I can't see anything wrong with it.

> Unless I missed something, ktimers has not been
> recommended for mainlining yet. I suspect (without
> having measured it myself yet) that the
> core abstraction that it proposes (timers
> vs. timeouts) is an important one for improving
> the kernel timing system.

I'm not saying that the idea is wrong, the general direction is fine, but
some course correction should be possible?

bye, Roman