2006-05-02 13:29:59

by Russell King

[permalink] [raw]
Subject: sched_clock() uses are broken

Hi,

Someone just asked a question about sched_clock() on the ARM list, asking
whether there was any way to find the range of values returned by this
function.

My initial thought was "why would you want that", but on considering the
problem a bit further, I could see why the question arises. In looking
deeper, I discovered a rather sad state of affairs with sched_clock().

The kernel assumes that the following will work:

unsigned long long t0, t1, time_passed_ns;

t0 = sched_clock();
/* do something */
t1 = sched_clock();

time_passed_ns = t1 - t0;

This is all very well if sched_clock() returns values filling the
entire range of an unsigned long long - two complement maths will
naturally return the time difference in the case where t1 < t0.

However, this is not the case. On x86 with TSC, it returns a 54 bit
number. This means that when t1 < t0, time_passed_ns becomes a very
large number which no longer represents the amount of time.

All uses in kernel/sched.c seem to be aflicted by this problem.

There are several solutions to this - the most obvious being that we
need a function which returns the nanosecond difference between two
sched_clock() return values, and this function needs to know how to
handle the case where sched_clock() has wrapped.

IOW:

t0 = sched_clock();
/* do something */
t1 = sched_clock();

time_passed = sched_clock_diff(t1, t0);

Comments?

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core


2006-05-02 16:44:04

by Andi Kleen

[permalink] [raw]
Subject: Re: sched_clock() uses are broken

Russell King <[email protected]> writes:
>
> However, this is not the case. On x86 with TSC, it returns a 54 bit
> number. This means that when t1 < t0, time_passed_ns becomes a very
> large number which no longer represents the amount of time.

Good point. For a 1Ghz system this would happen every ~0.57 years.

The problem is there is AFAIK no non destructive[1] way to find out how
many bits the TSC has

Destructive would be to overwrite it with -1 and see how many stick.

> All uses in kernel/sched.c seem to be aflicted by this problem.
>
> There are several solutions to this - the most obvious being that we
> need a function which returns the nanosecond difference between two
> sched_clock() return values, and this function needs to know how to
> handle the case where sched_clock() has wrapped.

Ok it can be done with a simple test.

>
> IOW:
>
> t0 = sched_clock();
> /* do something */
> t1 = sched_clock();
>
> time_passed = sched_clock_diff(t1, t0);
>
> Comments?

Agreed it's a problem, but probably a small one. At worst you'll get
a small scheduling hickup every half year, which should be hardly
that big an issue.

Might chose to just ignore it with a big fat comment?

-Andi

2006-05-02 16:50:19

by Russell King

[permalink] [raw]
Subject: Re: sched_clock() uses are broken

On Tue, May 02, 2006 at 06:43:45PM +0200, Andi Kleen wrote:
> Russell King <[email protected]> writes:
> >
> > However, this is not the case. On x86 with TSC, it returns a 54 bit
> > number. This means that when t1 < t0, time_passed_ns becomes a very
> > large number which no longer represents the amount of time.
>
> Good point. For a 1Ghz system this would happen every ~0.57 years.
>
> The problem is there is AFAIK no non destructive[1] way to find out how
> many bits the TSC has
>
> Destructive would be to overwrite it with -1 and see how many stick.
>
> > All uses in kernel/sched.c seem to be aflicted by this problem.
> >
> > There are several solutions to this - the most obvious being that we
> > need a function which returns the nanosecond difference between two
> > sched_clock() return values, and this function needs to know how to
> > handle the case where sched_clock() has wrapped.
>
> Ok it can be done with a simple test.
>
> >
> > IOW:
> >
> > t0 = sched_clock();
> > /* do something */
> > t1 = sched_clock();
> >
> > time_passed = sched_clock_diff(t1, t0);
> >
> > Comments?
>
> Agreed it's a problem, but probably a small one. At worst you'll get
> a small scheduling hickup every half year, which should be hardly
> that big an issue.
>
> Might chose to just ignore it with a big fat comment?

You're right assuming you have a 64-bit TSC, but ARM has at best a
32-bit cycle counter which rolls over about every 179 seconds - with
gives a range of values from sched_clock from 0 to 178956970625 or
0x29AAAAAA81.

That's rather more of a problem than having it happen every 208 days.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core

2006-05-02 16:54:38

by Chris Friesen

[permalink] [raw]
Subject: Re: sched_clock() uses are broken

Andi Kleen wrote:

> Agreed it's a problem, but probably a small one. At worst you'll get
> a small scheduling hickup every half year, which should be hardly
> that big an issue.

Presumably this would be bad for soft-realtime embedded things. Set-top
boxes, etc.

Chris

2006-05-02 17:01:23

by Andi Kleen

[permalink] [raw]
Subject: Re: sched_clock() uses are broken

On Tuesday 02 May 2006 18:50, Russell King wrote:

> You're right assuming you have a 64-bit TSC, but ARM has at best a
> 32-bit cycle counter which rolls over about every 179 seconds - with
> gives a range of values from sched_clock from 0 to 178956970625 or
> 0x29AAAAAA81.
>
> That's rather more of a problem than having it happen every 208 days.

Ok but you know it's always 32bit right? You can fix it up then
with your proposal of a sched_diff()

The problem would be fixing it up with a unknown number of bits.

-Andi

2006-05-02 17:01:22

by Andi Kleen

[permalink] [raw]
Subject: Re: sched_clock() uses are broken

On Tuesday 02 May 2006 18:54, Christopher Friesen wrote:
> Andi Kleen wrote:
>
> > Agreed it's a problem, but probably a small one. At worst you'll get
> > a small scheduling hickup every half year, which should be hardly
> > that big an issue.
>
> Presumably this would be bad for soft-realtime embedded things. Set-top
> boxes, etc.

SCHED_RR/FIFO are not affected. AFAIK it's only used by the interactivity
detector in the normal scheduler. Worst case that happens is that a
process is not detected to be interactive when it should once, which
gives it only a small penalty. On the next time slice everything will be ok again.

-Andi

2006-05-02 17:15:50

by Nicolas Pitre

[permalink] [raw]
Subject: Re: sched_clock() uses are broken

On Tue, 2 May 2006, Russell King wrote:

> On Tue, May 02, 2006 at 06:43:45PM +0200, Andi Kleen wrote:
> > Russell King <[email protected]> writes:
> > >
> > > However, this is not the case. On x86 with TSC, it returns a 54 bit
> > > number. This means that when t1 < t0, time_passed_ns becomes a very
> > > large number which no longer represents the amount of time.
> >
> > Good point. For a 1Ghz system this would happen every ~0.57 years.
> >
> > The problem is there is AFAIK no non destructive[1] way to find out how
> > many bits the TSC has
> >
> > Destructive would be to overwrite it with -1 and see how many stick.
> >
> > > All uses in kernel/sched.c seem to be aflicted by this problem.
> > >
> > > There are several solutions to this - the most obvious being that we
> > > need a function which returns the nanosecond difference between two
> > > sched_clock() return values, and this function needs to know how to
> > > handle the case where sched_clock() has wrapped.
> >
> > Ok it can be done with a simple test.

Better yet the sched_clock() implementation just needs to return a value
shifted left so the wrap around always happens on 64 bits and the
difference between two consecutive samples is always right.

> > >
> > > IOW:
> > >
> > > t0 = sched_clock();
> > > /* do something */
> > > t1 = sched_clock();
> > >
> > > time_passed = sched_clock_diff(t1, t0);
> > >
> > > Comments?
> >
> > Agreed it's a problem, but probably a small one. At worst you'll get
> > a small scheduling hickup every half year, which should be hardly
> > that big an issue.

... on x86 that is.

> > Might chose to just ignore it with a big fat comment?
>
> You're right assuming you have a 64-bit TSC, but ARM has at best a
> 32-bit cycle counter which rolls over about every 179 seconds - with
> gives a range of values from sched_clock from 0 to 178956970625 or
> 0x29AAAAAA81.
>
> That's rather more of a problem than having it happen every 208 days.

Yet that counter isn't necessarily nanosecond based. So rescaling the
returned value to nanosecs requires expensive divisions which could be
done only once within sched_clock_diff() instead of twice as often in
each sched_clock() calls.


Nicolas

2006-05-02 17:18:26

by Nicolas Pitre

[permalink] [raw]
Subject: Re: sched_clock() uses are broken

On Tue, 2 May 2006, Andi Kleen wrote:

> On Tuesday 02 May 2006 18:50, Russell King wrote:
>
> > You're right assuming you have a 64-bit TSC, but ARM has at best a
> > 32-bit cycle counter which rolls over about every 179 seconds - with
> > gives a range of values from sched_clock from 0 to 178956970625 or
> > 0x29AAAAAA81.
> >
> > That's rather more of a problem than having it happen every 208 days.
>
> Ok but you know it's always 32bit right? You can fix it up then
> with your proposal of a sched_diff()
>
> The problem would be fixing it up with a unknown number of bits.

Just shift it left so you know you always have the most significant bits
valid. The sched_diff() would take care of scaling it back to nanosecs.


Nicolas

2006-05-02 18:56:04

by Russell King

[permalink] [raw]
Subject: Re: sched_clock() uses are broken

On Tue, May 02, 2006 at 01:18:25PM -0400, Nicolas Pitre wrote:
> On Tue, 2 May 2006, Andi Kleen wrote:
>
> > On Tuesday 02 May 2006 18:50, Russell King wrote:
> >
> > > You're right assuming you have a 64-bit TSC, but ARM has at best a
> > > 32-bit cycle counter which rolls over about every 179 seconds - with
> > > gives a range of values from sched_clock from 0 to 178956970625 or
> > > 0x29AAAAAA81.
> > >
> > > That's rather more of a problem than having it happen every 208 days.
> >
> > Ok but you know it's always 32bit right? You can fix it up then
> > with your proposal of a sched_diff()
> >
> > The problem would be fixing it up with a unknown number of bits.
>
> Just shift it left so you know you always have the most significant bits
> valid. The sched_diff() would take care of scaling it back to nanosecs.

sched_clock is currently defined to return nanoseconds so this isn't
a possibility.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core

2006-05-02 19:05:26

by Nicolas Pitre

[permalink] [raw]
Subject: Re: sched_clock() uses are broken

On Tue, 2 May 2006, Russell King wrote:

> On Tue, May 02, 2006 at 01:18:25PM -0400, Nicolas Pitre wrote:
> > On Tue, 2 May 2006, Andi Kleen wrote:
> >
> > > On Tuesday 02 May 2006 18:50, Russell King wrote:
> > >
> > > > You're right assuming you have a 64-bit TSC, but ARM has at best a
> > > > 32-bit cycle counter which rolls over about every 179 seconds - with
> > > > gives a range of values from sched_clock from 0 to 178956970625 or
> > > > 0x29AAAAAA81.
> > > >
> > > > That's rather more of a problem than having it happen every 208 days.
> > >
> > > Ok but you know it's always 32bit right? You can fix it up then
> > > with your proposal of a sched_diff()
> > >
> > > The problem would be fixing it up with a unknown number of bits.
> >
> > Just shift it left so you know you always have the most significant bits
> > valid. The sched_diff() would take care of scaling it back to nanosecs.
>
> sched_clock is currently defined to return nanoseconds so this isn't
> a possibility.

If we're discussing the addition of a sched_clock_diff(), why whouldn't
shed_clock() return anything it wants in that context? It could be
redefined to have a return value meaningful only to shed_clock_diff()?


Nicolas

2006-05-02 19:08:21

by Russell King

[permalink] [raw]
Subject: Re: sched_clock() uses are broken

On Tue, May 02, 2006 at 03:05:22PM -0400, Nicolas Pitre wrote:
> On Tue, 2 May 2006, Russell King wrote:
> > On Tue, May 02, 2006 at 01:18:25PM -0400, Nicolas Pitre wrote:
> > > On Tue, 2 May 2006, Andi Kleen wrote:
> > >
> > > > On Tuesday 02 May 2006 18:50, Russell King wrote:
> > > >
> > > > > You're right assuming you have a 64-bit TSC, but ARM has at best a
> > > > > 32-bit cycle counter which rolls over about every 179 seconds - with
> > > > > gives a range of values from sched_clock from 0 to 178956970625 or
> > > > > 0x29AAAAAA81.
> > > > >
> > > > > That's rather more of a problem than having it happen every 208 days.
> > > >
> > > > Ok but you know it's always 32bit right? You can fix it up then
> > > > with your proposal of a sched_diff()
> > > >
> > > > The problem would be fixing it up with a unknown number of bits.
> > >
> > > Just shift it left so you know you always have the most significant bits
> > > valid. The sched_diff() would take care of scaling it back to nanosecs.
> >
> > sched_clock is currently defined to return nanoseconds so this isn't
> > a possibility.
>
> If we're discussing the addition of a sched_clock_diff(), why whouldn't
> shed_clock() return anything it wants in that context? It could be
> redefined to have a return value meaningful only to shed_clock_diff()?

If we're talking about doing that, we need to replace sched_clock()
to ensure that we all users are aware that it has changed.

I did think about that for my original fix proposal, but stepped back
because that's a bigger change - and is something for post-2.6.17.
The smallest fix (suitable for -rc kernels) is as I detailed.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core

2006-05-02 19:23:04

by Nicolas Pitre

[permalink] [raw]
Subject: Re: sched_clock() uses are broken

On Tue, 2 May 2006, Russell King wrote:

> On Tue, May 02, 2006 at 03:05:22PM -0400, Nicolas Pitre wrote:
> > If we're discussing the addition of a sched_clock_diff(), why whouldn't
> > shed_clock() return anything it wants in that context? It could be
> > redefined to have a return value meaningful only to shed_clock_diff()?
>
> If we're talking about doing that, we need to replace sched_clock()
> to ensure that we all users are aware that it has changed.
>
> I did think about that for my original fix proposal, but stepped back
> because that's a bigger change - and is something for post-2.6.17.
> The smallest fix (suitable for -rc kernels) is as I detailed.

Oh agreed.


Nicolas

2006-05-02 21:35:55

by Russell King

[permalink] [raw]
Subject: Re: sched_clock() uses are broken

On Tue, May 02, 2006 at 03:23:01PM -0400, Nicolas Pitre wrote:
> On Tue, 2 May 2006, Russell King wrote:
>
> > On Tue, May 02, 2006 at 03:05:22PM -0400, Nicolas Pitre wrote:
> > > If we're discussing the addition of a sched_clock_diff(), why whouldn't
> > > shed_clock() return anything it wants in that context? It could be
> > > redefined to have a return value meaningful only to shed_clock_diff()?
> >
> > If we're talking about doing that, we need to replace sched_clock()
> > to ensure that we all users are aware that it has changed.
> >
> > I did think about that for my original fix proposal, but stepped back
> > because that's a bigger change - and is something for post-2.6.17.
> > The smallest fix (suitable for -rc kernels) is as I detailed.
>
> Oh agreed.

(Added Ingo - don't know if you saw my original message.)

Actually, I'm not entirely convinced that the smallest fix is going to
be the best solution - it looks too complex.

Note the code change in update_cpu_clock and sched_current_time - arguably
both are buggy as they stand today when the values wrap.

Comments anyone?

diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -988,6 +988,11 @@ static inline int set_cpus_allowed(task_
extern unsigned long long sched_clock(void);
extern unsigned long long current_sched_time(const task_t *current_task);

+static inline unsigned long long sched_clock_diff(unsigned long long t1, unsigned long long t0)
+{
+ return t1 - t0;
+}
+
/* sched_exec is called by processes performing an exec */
#ifdef CONFIG_SMP
extern void sched_exec(void);
diff --git a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -731,7 +731,7 @@ static inline void __activate_idle_task(
static int recalc_task_prio(task_t *p, unsigned long long now)
{
/* Caller must always ensure 'now >= p->timestamp' */
- unsigned long long __sleep_time = now - p->timestamp;
+ unsigned long long __sleep_time = sched_clock_diff(now, p->timestamp);
unsigned long sleep_time;

if (batch_task(p))
@@ -807,7 +807,7 @@ static void activate_task(task_t *p, run
if (!local) {
/* Compensate for drifting sched_clock */
runqueue_t *this_rq = this_rq();
- now = (now - this_rq->timestamp_last_tick)
+ now = sched_clock_diff(now, this_rq->timestamp_last_tick)
+ rq->timestamp_last_tick;
}
#endif
@@ -2512,8 +2512,10 @@ EXPORT_PER_CPU_SYMBOL(kstat);
static inline void update_cpu_clock(task_t *p, runqueue_t *rq,
unsigned long long now)
{
- unsigned long long last = max(p->timestamp, rq->timestamp_last_tick);
- p->sched_time += now - last;
+ unsigned long long d1, d2;
+ d1 = sched_clock_diff(now, p->timestamp);
+ d2 = sched_clock_diff(now, rq->timestamp_last_tick);
+ p->sched_time += min(d1, d2);
}

/*
@@ -2522,11 +2524,13 @@ static inline void update_cpu_clock(task
*/
unsigned long long current_sched_time(const task_t *tsk)
{
- unsigned long long ns;
+ unsigned long long now, d1, d2, ns;
unsigned long flags;
local_irq_save(flags);
- ns = max(tsk->timestamp, task_rq(tsk)->timestamp_last_tick);
- ns = tsk->sched_time + (sched_clock() - ns);
+ now = sched_clock();
+ d1 = sched_clock_diff(now, tsk->timestamp);
+ d2 = sched_clock_diff(now, task_rq(tsk)->timestamp_last_tick);
+ ns = tsk->sched_time + min(d1, d2);
local_irq_restore(flags);
return ns;
}
@@ -2925,7 +2929,7 @@ asmlinkage void __sched schedule(void)
runqueue_t *rq;
prio_array_t *array;
struct list_head *queue;
- unsigned long long now;
+ unsigned long long now, sleep;
unsigned long run_time;
int cpu, idx, new_prio;

@@ -2960,11 +2964,10 @@ need_resched_nonpreemptible:

schedstat_inc(rq, sched_cnt);
now = sched_clock();
- if (likely((long long)(now - prev->timestamp) < NS_MAX_SLEEP_AVG)) {
- run_time = now - prev->timestamp;
- if (unlikely((long long)(now - prev->timestamp) < 0))
- run_time = 0;
- } else
+ sleep = sched_clock_diff(now, prev->timestamp);
+ if (likely(sleep < NS_MAX_SLEEP_AVG))
+ run_time = sleep;
+ else
run_time = NS_MAX_SLEEP_AVG;

/*
@@ -3039,8 +3042,8 @@ go_idle:
next = list_entry(queue->next, task_t, run_list);

if (!rt_task(next) && interactive_sleep(next->sleep_type)) {
- unsigned long long delta = now - next->timestamp;
- if (unlikely((long long)(now - next->timestamp) < 0))
+ unsigned long long delta = sched_clock_diff(now, next->timestamp);
+ if (unlikely((long long)delta < 0))
delta = 0;

if (next->sleep_type == SLEEP_INTERACTIVE)


--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core

2006-05-02 23:59:24

by Nick Piggin

[permalink] [raw]
Subject: Re: sched_clock() uses are broken

Russell King wrote:

> There are several solutions to this - the most obvious being that we
> need a function which returns the nanosecond difference between two
> sched_clock() return values, and this function needs to know how to
> handle the case where sched_clock() has wrapped.
>
> IOW:
>
> t0 = sched_clock();
> /* do something */
> t1 = sched_clock();
>
> time_passed = sched_clock_diff(t1, t0);
>
> Comments?
>

There is another problem John pointed out to me: sched_clock (at least
on i386) can do tsc frequency scaling on the raw tsc value. I'm not
sure if this is still a problem (I'm not aware that it has been fixed),
however it would mean that between two sched_clock()s, the values
returned can be basically completely arbitrary.

What is needed is something like:
t0 = get_cycles_unsynchronized();
t1 = get_cycles_unsynchronized();
ns = cycles_to_ns(t1, t0);

Where unsynchronized means not synchronized between CPUs.

This would still cause the `ns' value to be skewed if a frequency change
occured between t0 and t1, however at least it should be within some
realistic range (something like ns +/- ns * max freq / min freq).

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-05-03 00:53:48

by Nick Piggin

[permalink] [raw]
Subject: Re: sched_clock() uses are broken

Andi Kleen wrote:
> On Tuesday 02 May 2006 18:54, Christopher Friesen wrote:
>
>>Andi Kleen wrote:
>>
>>
>>>Agreed it's a problem, but probably a small one. At worst you'll get
>>>a small scheduling hickup every half year, which should be hardly
>>>that big an issue.
>>
>>Presumably this would be bad for soft-realtime embedded things. Set-top
>>boxes, etc.
>
>
> SCHED_RR/FIFO are not affected. AFAIK it's only used by the interactivity
> detector in the normal scheduler. Worst case that happens is that a
> process is not detected to be interactive when it should once, which
> gives it only a small penalty. On the next time slice everything will be ok again.

Other problem is that some people didn't RTFM and have started trying to
use it for precise accounting :(

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-05-03 07:08:46

by Mike Galbraith

[permalink] [raw]
Subject: Re: sched_clock() uses are broken

On Wed, 2006-05-03 at 03:07 +1000, Nick Piggin wrote:
> Other problem is that some people didn't RTFM and have started trying to
> use it for precise accounting :(

Are you talking about me perchance? I don't really care about precision
_that_ much, though I certainly do want to tighten timeslice accounting.

Given that most people are going to end up using the pm_timer anyway, I
don't see the point of even having a sched_clock(). If it's jiffy
resolution, it's useless. If it's wildly inaccurate (as it is in the
SMP case, monotonicity issues aside) it's more than useless.

-Mike

2006-05-03 07:40:50

by Andi Kleen

[permalink] [raw]
Subject: Re: sched_clock() uses are broken

On Wednesday 03 May 2006 09:09, Mike Galbraith wrote:

> Given that most people are going to end up using the pm_timer anyway, I
> don't see the point of even having a sched_clock(). If it's jiffy
> resolution, it's useless. If it's wildly inaccurate (as it is in the
> SMP case, monotonicity issues aside) it's more than useless.

For sched_clock TSC is always used and it's fine - sched_clock
doesn't require the guarantees that make TSC often useless otherwise

e.g. the scheduler is designed to only use it on one CPU
and can also tolerate variations scaling with frequency.

-Andi

2006-05-03 09:10:36

by Mike Galbraith

[permalink] [raw]
Subject: Re: sched_clock() uses are broken

On Wed, 2006-05-03 at 09:40 +0200, Andi Kleen wrote:
> On Wednesday 03 May 2006 09:09, Mike Galbraith wrote:
>
> > Given that most people are going to end up using the pm_timer anyway, I
> > don't see the point of even having a sched_clock(). If it's jiffy
> > resolution, it's useless. If it's wildly inaccurate (as it is in the
> > SMP case, monotonicity issues aside) it's more than useless.
>
> For sched_clock TSC is always used and it's fine - sched_clock
> doesn't require the guarantees that make TSC often useless otherwise

Regrettable, that's not true.

no command line

now: 4294742814000000 X:6906->timestamp: 4294742813000000
now: 4294743815000000 konqueror:7409->timestamp: 4294743814000000
now: 4294744816000000 kicker:7352->timestamp: 4294744815000000
now: 4294745817000000 konsole:7363->timestamp: 4294745815000000
now: 4294746818000000 konqueror:7409->timestamp: 4294746817000000
now: 4294747819000000 kmix:7388->timestamp: 4294747818000000
now: 4294748820000000 kmix:7388->timestamp: 4294748818000000
now: 4294749821000000 konsole:7363->timestamp: 4294749820000000

command line clock=tsc

now: 124079605551 gconfd-2:7372->timestamp: 124079563934
now: 125079899394 swapper:0->timestamp: 125077929715
now: 126080194639 swapper:0->timestamp: 126077228724
now: 127080489088 swapper:0->timestamp: 127077510347
now: 128080784525 swapper:0->timestamp: 128080615408
now: 129081079685 swapper:0->timestamp: 129080104338
now: 130081375553 evolution:7440->timestamp: 130080376731

If the (expletive deleted) pm timer is enabled in your .config (it is
enabled by default and shall forever remain enabled for all who don't
know the incantation to make pm timer option appear in acpi menu) it is
used unless specifically overridden, which defeats the original purpose
for ever having a high resolution sched_clock().

-Mike

2006-05-03 09:16:35

by Andi Kleen

[permalink] [raw]
Subject: Re: sched_clock() uses are broken

On Wednesday 03 May 2006 11:11, Mike Galbraith wrote:
> On Wed, 2006-05-03 at 09:40 +0200, Andi Kleen wrote:
> > On Wednesday 03 May 2006 09:09, Mike Galbraith wrote:
> >
> > > Given that most people are going to end up using the pm_timer anyway, I
> > > don't see the point of even having a sched_clock(). If it's jiffy
> > > resolution, it's useless. If it's wildly inaccurate (as it is in the
> > > SMP case, monotonicity issues aside) it's more than useless.
> >
> > For sched_clock TSC is always used and it's fine - sched_clock
> > doesn't require the guarantees that make TSC often useless otherwise
>
> Regrettable, that's not true.

Hmm, maybe I'm thinking too much x86-64. At least on x86-64 it's true.

I don't see a big reason to not do this on i386 either, except
on systems that truly don't have a TSC (386/486)

Ok i suppose if you don't want cruft you can always go to 64bit @)

-Andi

2006-05-03 09:31:04

by Mike Galbraith

[permalink] [raw]
Subject: Re: sched_clock() uses are broken

On Wed, 2006-05-03 at 11:16 +0200, Andi Kleen wrote:
> On Wednesday 03 May 2006 11:11, Mike Galbraith wrote:
> > On Wed, 2006-05-03 at 09:40 +0200, Andi Kleen wrote:
> > > On Wednesday 03 May 2006 09:09, Mike Galbraith wrote:
> > >
> > > > Given that most people are going to end up using the pm_timer anyway, I
> > > > don't see the point of even having a sched_clock(). If it's jiffy
> > > > resolution, it's useless. If it's wildly inaccurate (as it is in the
> > > > SMP case, monotonicity issues aside) it's more than useless.
> > >
> > > For sched_clock TSC is always used and it's fine - sched_clock
> > > doesn't require the guarantees that make TSC often useless otherwise
> >
> > Regrettable, that's not true.
>
> Hmm, maybe I'm thinking too much x86-64. At least on x86-64 it's true.
>
> I don't see a big reason to not do this on i386 either, except
> on systems that truly don't have a TSC (386/486)

It should be this way on any system that has a half way functional high
resolution source. Without it, the starvation scenario which
sched_clock() was invented to solve returns. Making that the default
wasn't (um um um) the most brilliant selection among available options.

> Ok i suppose if you don't want cruft you can always go to 64bit @)

Unemployed guys can't buy new toys without wives getting all grumpy ;-)

-Mike

2006-05-04 03:52:17

by George Anzinger

[permalink] [raw]
Subject: Re: sched_clock() uses are broken

Nicolas Pitre wrote:
> On Tue, 2 May 2006, Russell King wrote:
>
>
>>On Tue, May 02, 2006 at 06:43:45PM +0200, Andi Kleen wrote:
>>
>>>Russell King <[email protected]> writes:
>>>
>>>>However, this is not the case. On x86 with TSC, it returns a 54 bit
>>>>number. This means that when t1 < t0, time_passed_ns becomes a very
>>>>large number which no longer represents the amount of time.
>>>
>>>Good point. For a 1Ghz system this would happen every ~0.57 years.
>>>
>>>The problem is there is AFAIK no non destructive[1] way to find out how
>>>many bits the TSC has
>>>
>>>Destructive would be to overwrite it with -1 and see how many stick.
>>>
>>>
>>>>All uses in kernel/sched.c seem to be aflicted by this problem.
>>>>
>>>>There are several solutions to this - the most obvious being that we
>>>>need a function which returns the nanosecond difference between two
>>>>sched_clock() return values, and this function needs to know how to
>>>>handle the case where sched_clock() has wrapped.
>>>
>>>Ok it can be done with a simple test.
>
>
> Better yet the sched_clock() implementation just needs to return a value
> shifted left so the wrap around always happens on 64 bits and the
> difference between two consecutive samples is always right.
>
>
>>>>IOW:
>>>>
>>>> t0 = sched_clock();
>>>> /* do something */
>>>> t1 = sched_clock();
>>>>
>>>> time_passed = sched_clock_diff(t1, t0);
>>>>
>>>>Comments?
>>>
>>>Agreed it's a problem, but probably a small one. At worst you'll get
>>>a small scheduling hickup every half year, which should be hardly
>>>that big an issue.
>
>
> ... on x86 that is.
>
>
>>>Might chose to just ignore it with a big fat comment?
>>
>>You're right assuming you have a 64-bit TSC, but ARM has at best a
>>32-bit cycle counter which rolls over about every 179 seconds - with
>>gives a range of values from sched_clock from 0 to 178956970625 or
>>0x29AAAAAA81.
>>
>>That's rather more of a problem than having it happen every 208 days.
>
>
> Yet that counter isn't necessarily nanosecond based. So rescaling the
> returned value to nanosecs requires expensive divisions which could be
> done only once within sched_clock_diff() instead of twice as often in
> each sched_clock() calls.

Oh phooey!! Scaling can be done with a mpy and a shift. See the new clock
code where the TSC (or what ever) is scaled to ns.


--
George Anzinger [email protected]
HRT (High-res-timers): http://sourceforge.net/projects/high-res-timers/

2006-05-04 14:18:36

by Nicolas Pitre

[permalink] [raw]
Subject: Re: sched_clock() uses are broken

On Wed, 3 May 2006, George Anzinger wrote:

> Nicolas Pitre wrote:
> > Yet that counter isn't necessarily nanosecond based. So rescaling the
> > returned value to nanosecs requires expensive divisions which could be done
> > only once within sched_clock_diff() instead of twice as often in each
> > sched_clock() calls.
>
> Oh phooey!! Scaling can be done with a mpy and a shift. See the new clock
> code where the TSC (or what ever) is scaled to ns.

I know.

And if you want to preserve more than 32 bits of precision you need 4
mpy_and_add insns with a shift on ARM at least.

But the point remains that it is more efficient to do it once rather
than twice or more.


Nicolas

2006-05-04 20:02:23

by Florian Schmidt

[permalink] [raw]
Subject: Re: sched_clock() uses are broken

On Tue, 2 May 2006 14:29:53 +0100
Russell King <[email protected]> wrote:

> However, this is not the case. On x86 with TSC, it returns a 54 bit
> number. This means that when t1 < t0, time_passed_ns becomes a very
> large number which no longer represents the amount of time.

Hi,

is this specific to the sched_clock() or does the rdtsc on these systems
only use 54 bit?

Flo

--
Palimm Palimm!
http://tapas.affenbande.org

2006-05-07 12:33:46

by Nick Piggin

[permalink] [raw]
Subject: Re: sched_clock() uses are broken

Mike Galbraith wrote:
> On Wed, 2006-05-03 at 03:07 +1000, Nick Piggin wrote:
>
>>Other problem is that some people didn't RTFM and have started trying to
>>use it for precise accounting :(
>
>
> Are you talking about me perchance? I don't really care about precision
> _that_ much, though I certainly do want to tighten timeslice accounting.

No, sched_clock is fine to be used in CPU scheduling choices, which are
heuristic anyway (although strictly speaking, even using it for timeslicing
within a single CPU could cause slight unfairness).

I'm talking about the update_cpu_clock() / task_struct->sched_time stuff.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-05-07 12:43:18

by Russell King

[permalink] [raw]
Subject: Re: sched_clock() uses are broken

On Sun, May 07, 2006 at 10:33:41PM +1000, Nick Piggin wrote:
> Mike Galbraith wrote:
> >On Wed, 2006-05-03 at 03:07 +1000, Nick Piggin wrote:
> >
> >>Other problem is that some people didn't RTFM and have started trying to
> >>use it for precise accounting :(
> >
> >
> >Are you talking about me perchance? I don't really care about precision
> >_that_ much, though I certainly do want to tighten timeslice accounting.
>
> No, sched_clock is fine to be used in CPU scheduling choices, which are
> heuristic anyway (although strictly speaking, even using it for timeslicing
> within a single CPU could cause slight unfairness).

Except maybe if it rolls over every 178 seconds, which is my original
point. Maybe someone could comment on my initial patch sent 5 days
ago?

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core

2006-05-07 12:56:21

by Nick Piggin

[permalink] [raw]
Subject: Re: sched_clock() uses are broken

Russell King wrote:
> On Sun, May 07, 2006 at 10:33:41PM +1000, Nick Piggin wrote:
>

>>No, sched_clock is fine to be used in CPU scheduling choices, which are
>>heuristic anyway (although strictly speaking, even using it for timeslicing
>>within a single CPU could cause slight unfairness).
>
>
> Except maybe if it rolls over every 178 seconds, which is my original
> point. Maybe someone could comment on my initial patch sent 5 days
> ago?

Well yes that's true. I meant the "sched_clock interface as defined". Now
there are obviously issues (including the one you raised) that makes the
sched_clock interface unreasonable to implement.

I stand by my first reply to your comment WRT the API. Seems like most of
the rest of the debate was unrelated or concerning implementation details.
kernel/sched.c patches implementing the new API would get an ack from me.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-05-07 13:00:34

by Nick Piggin

[permalink] [raw]
Subject: Re: sched_clock() uses are broken

Nick Piggin wrote:

> I stand by my first reply to your comment WRT the API.

Actually, on rereading, it seems like I was a bit confused about
your proposal. I don't think you specified anyway the units
returned by your new sched_clock(). So it is identical to my
"corrected" interface :\

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-05-07 13:18:35

by Russell King

[permalink] [raw]
Subject: Re: sched_clock() uses are broken

On Sun, May 07, 2006 at 11:00:29PM +1000, Nick Piggin wrote:
> Nick Piggin wrote:
>
> >I stand by my first reply to your comment WRT the API.
>
> Actually, on rereading, it seems like I was a bit confused about
> your proposal. I don't think you specified anyway the units
> returned by your new sched_clock(). So it is identical to my
> "corrected" interface :\

Okay, so that presumably means we have to either stick with what we
currently have, or go the whole hog and re-implement the sched_clock()
support?

IOW, my patch on 2nd May isn't of any use as it currently stands?

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core

2006-05-07 13:35:52

by Nick Piggin

[permalink] [raw]
Subject: Re: sched_clock() uses are broken

Russell King wrote:
> On Sun, May 07, 2006 at 11:00:29PM +1000, Nick Piggin wrote:
>
>>Nick Piggin wrote:
>>
>>
>>>I stand by my first reply to your comment WRT the API.
>>
>>Actually, on rereading, it seems like I was a bit confused about
>>your proposal. I don't think you specified anyway the units
>>returned by your new sched_clock(). So it is identical to my
>>"corrected" interface :\
>
>
> Okay, so that presumably means we have to either stick with what we
> currently have, or go the whole hog and re-implement the sched_clock()
> support?
>
> IOW, my patch on 2nd May isn't of any use as it currently stands?

IMO it would probably be best to try to re implement it in one go.
It shouldn't have spread too far out of kernel/sched.c, and the arch
code should mostly be implementable in terms of their sched_clock().
Mundane but not difficult.

Making arch code actually try to do the right thing may require a
bit more thinking, to handle both the variable time counter issue
and your time counter wrap problem. That wouldn't be your problem
though, outside arch/arm/

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-05-07 13:55:50

by Russell King

[permalink] [raw]
Subject: Re: sched_clock() uses are broken

On Sun, May 07, 2006 at 11:30:15PM +1000, Nick Piggin wrote:
> Russell King wrote:
> >On Sun, May 07, 2006 at 11:00:29PM +1000, Nick Piggin wrote:
> >
> >>Nick Piggin wrote:
> >>
> >>
> >>>I stand by my first reply to your comment WRT the API.
> >>
> >>Actually, on rereading, it seems like I was a bit confused about
> >>your proposal. I don't think you specified anyway the units
> >>returned by your new sched_clock(). So it is identical to my
> >>"corrected" interface :\
> >
> >
> >Okay, so that presumably means we have to either stick with what we
> >currently have, or go the whole hog and re-implement the sched_clock()
> >support?
> >
> >IOW, my patch on 2nd May isn't of any use as it currently stands?
>
> IMO it would probably be best to try to re implement it in one go.
> It shouldn't have spread too far out of kernel/sched.c, and the arch
> code should mostly be implementable in terms of their sched_clock().
> Mundane but not difficult.
>
> Making arch code actually try to do the right thing may require a
> bit more thinking, to handle both the variable time counter issue
> and your time counter wrap problem. That wouldn't be your problem
> though, outside arch/arm/

Also, any comments on update_cpu_clock() and current_sched_time() both
appearing to be buggy, or am I barking up the wrong tree with those?

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core

2006-05-07 14:18:12

by Nick Piggin

[permalink] [raw]
Subject: Re: sched_clock() uses are broken

Russell King wrote:

> Also, any comments on update_cpu_clock() and current_sched_time() both
> appearing to be buggy, or am I barking up the wrong tree with those?

Can't remember off the top of my head who put those in, sorry.

Aside from the fact that they appear to be fundamentally buggy anyway
because we don't require exact ns intervals from sched_clock() at the
best of times, I think your wrapping fix for them looks correct.

Thanks,
Nick

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-05-07 16:03:54

by Andi Kleen

[permalink] [raw]
Subject: Re: sched_clock() uses are broken

On Sunday 07 May 2006 16:04, Nick Piggin wrote:
> Russell King wrote:
>
> > Also, any comments on update_cpu_clock() and current_sched_time() both
> > appearing to be buggy, or am I barking up the wrong tree with those?
>
> Can't remember off the top of my head who put those in, sorry.
>
> Aside from the fact that they appear to be fundamentally buggy anyway
> because we don't require exact ns intervals from sched_clock() at the
> best of times, I think your wrapping fix for them looks correct.

I must say I always hated them because adding code to the context
switch for such an obscure POSIX bu^w"feature" is just a bad idea.
Maybe it would be best to remove it again instead of adding the
scaling needed to make it actually work

(since we got along with such a broken implementation for so long I
would guess that nobody uses these timers anyways)

-Andi

2006-05-07 16:54:09

by Russell King

[permalink] [raw]
Subject: Re: sched_clock() uses are broken

On Sun, May 07, 2006 at 11:30:15PM +1000, Nick Piggin wrote:
> Russell King wrote:
> >On Sun, May 07, 2006 at 11:00:29PM +1000, Nick Piggin wrote:
> >
> >>Nick Piggin wrote:
> >>
> >>
> >>>I stand by my first reply to your comment WRT the API.
> >>
> >>Actually, on rereading, it seems like I was a bit confused about
> >>your proposal. I don't think you specified anyway the units
> >>returned by your new sched_clock(). So it is identical to my
> >>"corrected" interface :\
> >
> >
> >Okay, so that presumably means we have to either stick with what we
> >currently have, or go the whole hog and re-implement the sched_clock()
> >support?
> >
> >IOW, my patch on 2nd May isn't of any use as it currently stands?
>
> IMO it would probably be best to try to re implement it in one go.
> It shouldn't have spread too far out of kernel/sched.c, and the arch
> code should mostly be implementable in terms of their sched_clock().
> Mundane but not difficult.

Having looked at this several times over the last couple of days, I've
come to the conclusion that I'm not the right person to fix this problem.
I've tried several methods of converting the code, but every time I
remain unconvinced that the changes are provably correct as far as not
missing something, so I end up throwing the changes away and starting
again.

Yes, I admit defeat.

Maybe someone who cares about this stuff[1] (or who sees the problem)
should look into it.

[1] - eg, the original poster on linux-arm-kernel who indirectly pointed
out the sched_clock() issue.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core

2006-05-07 17:32:04

by Mike Galbraith

[permalink] [raw]
Subject: Re: sched_clock() uses are broken

On Sun, 2006-05-07 at 22:33 +1000, Nick Piggin wrote:
> Mike Galbraith wrote:
> > On Wed, 2006-05-03 at 03:07 +1000, Nick Piggin wrote:
> >
> >>Other problem is that some people didn't RTFM and have started trying to
> >>use it for precise accounting :(
> >
> >
> > Are you talking about me perchance? I don't really care about precision
> > _that_ much, though I certainly do want to tighten timeslice accounting.
>
> No, sched_clock is fine to be used in CPU scheduling choices, which are
> heuristic anyway (although strictly speaking, even using it for timeslicing
> within a single CPU could cause slight unfairness).
>
> I'm talking about the update_cpu_clock() / task_struct->sched_time stuff.

Oh. I kinda sorta agree. If the continuous nanosecond thing happens,
it'll make things _much_ easier. I actually have exactly one testcase
that actually requires nanoseconds as things stand, and that one I've
worked around in the past. The rest of my desire to increase accuracy
stems from instrumenting timeslice enforcement. Statistical at best,
and truly sad at worst.

-Mike

2006-05-07 17:37:15

by Mike Galbraith

[permalink] [raw]
Subject: Re: sched_clock() uses are broken

On Sun, 2006-05-07 at 13:43 +0100, Russell King wrote:
> On Sun, May 07, 2006 at 10:33:41PM +1000, Nick Piggin wrote:
> > Mike Galbraith wrote:
> > >On Wed, 2006-05-03 at 03:07 +1000, Nick Piggin wrote:
> > >
> > >>Other problem is that some people didn't RTFM and have started trying to
> > >>use it for precise accounting :(
> > >
> > >
> > >Are you talking about me perchance? I don't really care about precision
> > >_that_ much, though I certainly do want to tighten timeslice accounting.
> >
> > No, sched_clock is fine to be used in CPU scheduling choices, which are
> > heuristic anyway (although strictly speaking, even using it for timeslicing
> > within a single CPU could cause slight unfairness).
>
> Except maybe if it rolls over every 178 seconds, which is my original
> point. Maybe someone could comment on my initial patch sent 5 days
> ago?

Simply ignore the wrap... unless you have a scenario where the wrap
event itself is significant event.

-Mike

2006-05-07 17:52:12

by Mike Galbraith

[permalink] [raw]
Subject: Re: sched_clock() uses are broken

On Sun, 2006-05-07 at 17:53 +0100, Russell King wrote:
> On Sun, May 07, 2006 at 11:30:15PM +1000, Nick Piggin wrote:
> > Russell King wrote:
> > >On Sun, May 07, 2006 at 11:00:29PM +1000, Nick Piggin wrote:
> > >
> > >>Nick Piggin wrote:
> > >>
> > >>
> > >>>I stand by my first reply to your comment WRT the API.
> > >>
> > >>Actually, on rereading, it seems like I was a bit confused about
> > >>your proposal. I don't think you specified anyway the units
> > >>returned by your new sched_clock(). So it is identical to my
> > >>"corrected" interface :\
> > >
> > >
> > >Okay, so that presumably means we have to either stick with what we
> > >currently have, or go the whole hog and re-implement the sched_clock()
> > >support?
> > >
> > >IOW, my patch on 2nd May isn't of any use as it currently stands?
> >
> > IMO it would probably be best to try to re implement it in one go.
> > It shouldn't have spread too far out of kernel/sched.c, and the arch
> > code should mostly be implementable in terms of their sched_clock().
> > Mundane but not difficult.
>
> Having looked at this several times over the last couple of days, I've
> come to the conclusion that I'm not the right person to fix this problem.
> I've tried several methods of converting the code, but every time I
> remain unconvinced that the changes are provably correct as far as not
> missing something, so I end up throwing the changes away and starting
> again.
>
> Yes, I admit defeat.

Ah, you feel my pain. I look forward to the continuous nanosecond clock
that the time guys are currently talking about, and will hopefully _not_
decide is not needed.

-Mike

2006-05-08 04:15:05

by Mike Galbraith

[permalink] [raw]
Subject: Re: sched_clock() uses are broken

On Sun, 2006-05-07 at 19:32 +0200, Mike Galbraith wrote:
> On Sun, 2006-05-07 at 22:33 +1000, Nick Piggin wrote:
> > Mike Galbraith wrote:
> > > On Wed, 2006-05-03 at 03:07 +1000, Nick Piggin wrote:
> > >
> > >>Other problem is that some people didn't RTFM and have started trying to
> > >>use it for precise accounting :(
> > >
> > >
> > > Are you talking about me perchance? I don't really care about precision
> > > _that_ much, though I certainly do want to tighten timeslice accounting.
> >
> > No, sched_clock is fine to be used in CPU scheduling choices, which are
> > heuristic anyway (although strictly speaking, even using it for timeslicing
> > within a single CPU could cause slight unfairness).
> >
> > I'm talking about the update_cpu_clock() / task_struct->sched_time stuff.
>
> Oh. I kinda sorta agree.

Here's part of the reason why. The below is from update_cpu_clock()
when now <= last. I print now, last_tick, timestamp, and a total
timewarp count. This is a UP kernel, CPU is 3GHz P4 running flat out,
no clockmod (though it is compiled in).

now: 28781601231 tick: 28781610060 stamp: 28781484480 total: 111
now: 39653939737 tick: 39653953110 stamp: 39652958475 total: 112
now: 48382789863 tick: 48382802652 stamp: 48382788437 total: 113
now: 64846190533 tick: 64846203307 stamp: 64846185356 total: 114
now: 87418287278 tick: 87418298913 stamp: 87417324842 total: 115
now: 88679398808 tick: 88679409618 stamp: 88679395865 total: 116
now: 97397256650 tick: 97397270613 stamp: 97397254208 total: 117
now: 108467457132 tick: 108467470627 stamp: 108467410356 total: 118
now: 217186858841 tick: 217186868844 stamp: 217186858049 total: 119
now: 267909122394 tick: 267909132306 stamp: 267909120976 total: 120
now: 432481173047 tick: 432481183477 stamp: 432481169893 total: 121
now: 435389124209 tick: 435389134985 stamp: 435389107999 total: 122
now: 437341748522 tick: 437341757801 stamp: 437341714053 total: 124
now: 472829745453 tick: 472829755435 stamp: 472829743083 total: 125

Most happen at boot, but they continue to happen occasionally. Even
tossing these in the bit bucket, I've given up on timeslice accuracy,
because the numbers just don't add up right. This and the pm timer
thing make sched_clock() fairly annoying.

-Mike

2006-05-08 04:37:44

by Mike Galbraith

[permalink] [raw]
Subject: Re: sched_clock() uses are broken

On Mon, 2006-05-08 at 06:15 +0200, Mike Galbraith wrote:
> On Sun, 2006-05-07 at 19:32 +0200, Mike Galbraith wrote:
> > On Sun, 2006-05-07 at 22:33 +1000, Nick Piggin wrote:
> > > Mike Galbraith wrote:
> > > > On Wed, 2006-05-03 at 03:07 +1000, Nick Piggin wrote:
> > > >
> > > >>Other problem is that some people didn't RTFM and have started trying to
> > > >>use it for precise accounting :(
> > > >
> > > >
> > > > Are you talking about me perchance? I don't really care about precision
> > > > _that_ much, though I certainly do want to tighten timeslice accounting.
> > >
> > > No, sched_clock is fine to be used in CPU scheduling choices, which are
> > > heuristic anyway (although strictly speaking, even using it for timeslicing
> > > within a single CPU could cause slight unfairness).
> > >
> > > I'm talking about the update_cpu_clock() / task_struct->sched_time stuff.
> >
> > Oh. I kinda sorta agree.
>

Sorry for yet another reply, but running the old starvation testcase
that caused sched_clock() to be born in the first place tickled my
funny-bone. With that running and hitting 300k context switches...

now: 2100508962835 tick: 2100508972067 stamp: 2100508961220 total: 2906
now: 2101531243883 tick: 2101531251877 stamp: 2101531238543 total: 2924
now: 2102695422392 tick: 2102695431699 stamp: 2102695418265 total: 2940

Accounting? Not :)

-Mike

2006-05-08 04:46:18

by Nick Piggin

[permalink] [raw]
Subject: Re: sched_clock() uses are broken

Mike Galbraith wrote:

>Sorry for yet another reply, but running the old starvation testcase
>that caused sched_clock() to be born in the first place tickled my
>funny-bone. With that running and hitting 300k context switches...
>
>now: 2100508962835 tick: 2100508972067 stamp: 2100508961220 total: 2906
>now: 2101531243883 tick: 2101531251877 stamp: 2101531238543 total: 2924
>now: 2102695422392 tick: 2102695431699 stamp: 2102695418265 total: 2940
>
>Accounting? Not :)
>

Yeah I agree with Andi that this accounting stuff is probably done
for some POSIX conformance that doesn't matter. Actually it is worse
than that because if anyone _really_ did need it, then they'll get a
horrible surprise when their system mysteriously fails in production.

It should either get ripped out, or perhaps converted to use jiffies
until a sane high resolution, low overhead scheme is developed (if
ever). And that would exclude something that does this accounting in
fastpaths for the 99.99% of processes that never use it.

--

Send instant messages to your online friends http://au.messenger.yahoo.com

2006-05-08 05:24:59

by Mike Galbraith

[permalink] [raw]
Subject: Re: sched_clock() uses are broken

On Mon, 2006-05-08 at 14:46 +1000, Nick Piggin wrote:
> It should either get ripped out, or perhaps converted to use jiffies
> until a sane high resolution, low overhead scheme is developed (if
> ever). And that would exclude something that does this accounting in
> fastpaths for the 99.99% of processes that never use it.

The accounting is really light compared to the interactivity part. That
doesn't need to be in the fast path, and in my tree it isn't.

FWIW (0), yy tree is missing every last shred of the interactivity code,
and not missing it one bit.

[root@Homer]:> diffstat xx
sched.c | 480
+++++++++++++++++++++++++++++-----------------------------------
1 files changed, 219 insertions(+), 261 deletions(-)

And that's with full throttling, and absolute starvation proofing.

Ho hum. Back to work on my never-going-anywhere-but-fun tree :)

later,

-Mike (shutting the hell up now;)

2006-05-08 05:30:24

by Nick Piggin

[permalink] [raw]
Subject: Re: sched_clock() uses are broken

Mike Galbraith wrote:

>On Mon, 2006-05-08 at 14:46 +1000, Nick Piggin wrote:
>
>>It should either get ripped out, or perhaps converted to use jiffies
>>until a sane high resolution, low overhead scheme is developed (if
>>ever). And that would exclude something that does this accounting in
>>fastpaths for the 99.99% of processes that never use it.
>>
>
>The accounting is really light compared to the interactivity part. That
>doesn't need to be in the fast path, and in my tree it isn't.
>

And it is worse than useless if it is wrong.

>
>FWIW (0), yy tree is missing every last shred of the interactivity code,
>and not missing it one bit.
>

Join the club ;)

>
>[root@Homer]:> diffstat xx
> sched.c | 480
>+++++++++++++++++++++++++++++-----------------------------------
> 1 files changed, 219 insertions(+), 261 deletions(-)
>
>And that's with full throttling, and absolute starvation proofing.
>
>Ho hum. Back to work on my never-going-anywhere-but-fun tree :)
>

Join the club ;)

Nah, the interactivity stuff isn't too bad (as a function of bug reports
to lkml). One day when there is nothing else wrong with the kernel, it will
probably end up getting reworked. No need to stir it up now though, really.

--

Send instant messages to your online friends http://au.messenger.yahoo.com