LinuxLists.cc - [PATCH 3/5] lockstat: core infrastructure

[permalink] [raw]

Subject: Re: [PATCH 3/5] lockstat: core infrastructure

On Tue, 2007-05-29 at 14:52 +0200, Peter Zijlstra wrote:
> + now = sched_clock();
> + waittime = now - hlock->waittime_stamp;
> +

It looks like your using sched_clock() through out .. It's a little
troubling considering the constraints on the function .. Most
architecture implement a jiffies sched_clock() w/ 1 millisecond or worse
resolution.. I'd imagine a millisecond hold time is pretty rare, even a
millisecond wait time might be fairly rare too .. There's also no
guarantee that sched_clock timestamps off two different cpu's can be
compared (or at least that's my understanding) ..

Daniel

2007-05-30 03:45:04

by Andrew Morton

[permalink] [raw]

Subject: Re: [PATCH 3/5] lockstat: core infrastructure

On Tue, 29 May 2007 14:52:51 +0200 Peter Zijlstra <[email protected]> wrote:

> Introduce the core lock statistics code.
>

I must say that an aggregate addition of 27 ifdefs is a bit sad. And there
is some easy stuff here.

> +#ifdef CONFIG_PROVE_LOCKING
> +int prove_locking = 1;
> +module_param(prove_locking, int, 0644);
> +#endif

#else
#define prove_locking 0
#endif

> +
> +#ifdef CONFIG_LOCK_STAT
> +int lock_stat = 1;
> +module_param(lock_stat, int, 0644);
> +#endif

ditto.

>
> ...
>
> +struct lock_class_stats lock_stats(struct lock_class *class)
> +{
> + struct lock_class_stats stats;
> + int cpu, i;
> +
> + memset(&stats, 0, sizeof(struct lock_class_stats));
> + for_each_possible_cpu(cpu) {
> + struct lock_class_stats *pcs =
> + &per_cpu(lock_stats, cpu)[class - lock_classes];
> +
> + for (i = 0; i < ARRAY_SIZE(stats.contention_point); i++)
> + stats.contention_point[i] += pcs->contention_point[i];
> +
> + lock_time_add(&pcs->read_waittime, &stats.read_waittime);
> + lock_time_add(&pcs->write_waittime, &stats.write_waittime);
> +
> + lock_time_add(&pcs->read_holdtime, &stats.read_holdtime);
> + lock_time_add(&pcs->write_holdtime, &stats.write_holdtime);
> + }
> +
> + return stats;
> +}

hm, that isn't trying to be very efficient.

> @@ -2035,6 +2131,11 @@ static int __lock_acquire(struct lockdep
> int chain_head = 0;
> u64 chain_key;
>
> +#ifdef CONFIG_PROVE_LOCKING
> + if (!prove_locking)
> + check = 1;
> +#endif

Removable

> +#ifdef CONFIG_LOCK_STAT
> +static void lock_release_holdtime(struct held_lock *hlock)
> +{
> + struct lock_class_stats *stats;
> + unsigned long long holdtime;
> +
> + if (!lock_stat)
> + return;
> +
> + holdtime = sched_clock() - hlock->holdtime_stamp;
> +
> + stats = get_lock_stats(hlock->class);
> +
> + if (hlock->read)
> + lock_time_inc(&stats->read_holdtime, holdtime);
> + else
> + lock_time_inc(&stats->write_holdtime, holdtime);
> +
> + put_lock_stats(stats);
> +}
> +#else
> +static void lock_release_holdtime(struct held_lock *hlock)

inline

> +{
> +}
> +#endif
> +
> ...
>
> @@ -2456,6 +2712,14 @@ void lock_acquire(struct lockdep_map *lo
> {
> unsigned long flags;
>
> +#ifdef CONFIG_LOCK_STAT
> + if (unlikely(!lock_stat))
> +#endif

removable

> +#ifdef CONFIG_PROVE_LOCKING
> + if (unlikely(!prove_locking))
> +#endif

removable

> @@ -2475,6 +2739,14 @@ void lock_release(struct lockdep_map *lo
> {
> unsigned long flags;
>
> +#ifdef CONFIG_LOCK_STAT
> + if (unlikely(!lock_stat))
> +#endif

removable

> +#ifdef CONFIG_PROVE_LOCKING
> + if (unlikely(!prove_locking))
> +#endif
> + return;
> +
> if (unlikely(current->lockdep_recursion))
> return;
>
>
> ...
>
> +#ifdef CONFIG_LOCK_STAT
> +
> +extern void lock_contended(struct lockdep_map *lock, unsigned long ip);
> +extern void lock_acquired(struct lockdep_map *lock);
> +
> +#define LOCK_CONTENDED(_lock, try, lock) \
> +do { \
> + if (!try(_lock)) { \
> + lock_contended(&(_lock)->dep_map, _RET_IP_); \
> + lock(_lock); \
> + lock_acquired(&(_lock)->dep_map); \
> + } \
> +} while (0)
> +
> +#else /* CONFIG_LOCK_STAT */
> +
> +#define lock_contended(l, i) do { } while (0)
> +#define lock_acquired(l) do { } while (0)

inlines are better.

> +#define LOCK_CONTENDED(_lock, try, lock) \
> + lock(_lock)
> +
> +#endif /* CONFIG_LOCK_STAT */
> +
> },
>
> ...
>
> +#ifdef CONFIG_PROVE_LOCKING
> + {
> + .ctl_name = KERN_PROVE_LOCKING,
> + .procname = "prove_locking",
> + .data = &prove_locking,
> + .maxlen = sizeof(int),
> + .mode = 0644,
> + .proc_handler = &proc_dointvec,
> + },
> +#endif
> +#ifdef CONFIG_LOCK_STAT
> + {
> + .ctl_name = KERN_LOCK_STAT,
> + .procname = "lock_stat",
> + .data = &lock_stat,
> + .maxlen = sizeof(int),
> + .mode = 0644,
> + .proc_handler = &proc_dointvec,
> + },
> +#endif

Please use CTL_UNNUNBERED for new sysctls.

> { .ctl_name = 0 }
> };
> Index: linux-2.6-git/include/linux/sysctl.h
> ===================================================================
> --- linux-2.6-git.orig/include/linux/sysctl.h
> +++ linux-2.6-git/include/linux/sysctl.h
> @@ -166,6 +166,8 @@ enum
> KERN_NMI_WATCHDOG=75, /* int: enable/disable nmi watchdog */
> KERN_PANIC_ON_NMI=76, /* int: whether we will panic on an unrecovered */
> KERN_POWEROFF_CMD=77, /* string: poweroff command line */
> + KERN_PROVE_LOCKING=78, /* int: enable lock dependancy checking */
> + KERN_LOCK_STAT=79, /* int: enable lock statistics */
> };

And lose these.

So I'm inclined to ask if you can redo these pathces with a view to reducing
the ifdef density with a bit of restructuring.

We could do that as a followon patch I guess. Nicer not to though.

2007-05-30 13:19:37

[permalink] [raw]

Subject: Re: [PATCH 3/5] lockstat: core infrastructure

On Tue, 2007-05-29 at 13:28 -0700, Daniel Walker wrote:
> On Tue, 2007-05-29 at 14:52 +0200, Peter Zijlstra wrote:
> > + now = sched_clock();
> > + waittime = now - hlock->waittime_stamp;
> > +
>
> It looks like your using sched_clock() through out .. It's a little
> troubling considering the constraints on the function .. Most
> architecture implement a jiffies sched_clock() w/ 1 millisecond or worse
> resolution.. I'd imagine a millisecond hold time is pretty rare, even a
> millisecond wait time might be fairly rare too .. There's also no
> guarantee that sched_clock timestamps off two different cpu's can be
> compared (or at least that's my understanding) ..

All valid points, however.. calling anything more expensive 2-3 times
per lock acquisition is going to be _very_ painful.

Also, IMHO the contention count vs the acquisition count is the most
interesting number, the times are just a nice bonus (if and when they
work).

2007-05-30 13:25:28

[permalink] [raw]

Subject: Re: [PATCH 3/5] lockstat: core infrastructure

* Daniel Walker <[email protected]> wrote:

> [...] Most architecture implement a jiffies sched_clock() w/ 1
> millisecond or worse resolution.. [...]

weird that you count importance by 'number of architectures', while 98%
of the installed server base is x86 or x86_64, where sched_clock() is
pretty precise ;-)

Ingo

2007-05-30 13:41:48

by Steven Rostedt

[permalink] [raw]

Subject: Re: [PATCH 3/5] lockstat: core infrastructure

>
> > [...] Most architecture implement a jiffies sched_clock() w/ 1
> > millisecond or worse resolution.. [...]
>
> weird that you count importance by 'number of architectures', while 98%
> of the installed server base is x86 or x86_64, where sched_clock() is
> pretty precise ;-)
>

I can understand Daniel's POV (from working at TimeSys once upon a time).
He works for Monta Vista which does a lot with embedded systems running on
something other than x86. So from Daniel's POV, those "number of
architectures" is of importance. While, from those that do server work
where x86 is now becoming the dominant platform, our focus is on those.

As long as the work doesn't "break" an arch. We can argue that sched_clock
is "good enough". If someone wants better accounting of locks on some
other arch, they can simply change sched_clock to be more precise.

-- Steve

2007-05-30 13:49:41

[permalink] [raw]

Subject: Re: [PATCH 3/5] lockstat: core infrastructure

* Steven Rostedt <[email protected]> wrote:

> [...] We can argue that sched_clock is "good enough". If someone
> wants better accounting of locks on some other arch, they can simply
> change sched_clock to be more precise.

exactly. Imprecise sched_clock() if there's a better fast clock source
available is a bug in the architecture code. If the only available
clocksource is 1 msec resolution then there's no solution and nothing to
talk about - lock statistics will be 1msec granular just as much as
scheduling.

Ingo

2007-05-30 15:23:30

[permalink] [raw]

Subject: Re: [PATCH 3/5] lockstat: core infrastructure

On Wed, 2007-05-30 at 15:24 +0200, Ingo Molnar wrote:
> * Daniel Walker <[email protected]> wrote:
>
> > [...] Most architecture implement a jiffies sched_clock() w/ 1
> > millisecond or worse resolution.. [...]
>
> weird that you count importance by 'number of architectures', while 98%
> of the installed server base is x86 or x86_64, where sched_clock() is
> pretty precise ;-)

I work with many other architectures, so it's important to me. That's
aside anyway, just consider there are several Linux architectures. When
we write code we should consider those other architectures .. Non-x86
Linux is used pretty frequently. Dare I say that x86 and x86_64
installations combined don't out weigh all the other architecture
installations, that's only a guess but I wouldn't be surprised if that's
true.

Daniel

2007-05-30 17:09:27

[permalink] [raw]

Subject: Re: [PATCH 3/5] lockstat: core infrastructure

On Wed, 2007-05-30 at 15:49 +0200, Ingo Molnar wrote:
> * Steven Rostedt <[email protected]> wrote:
>
> > [...] We can argue that sched_clock is "good enough". If someone
> > wants better accounting of locks on some other arch, they can simply
> > change sched_clock to be more precise.
>
> exactly. Imprecise sched_clock() if there's a better fast clock source
> available is a bug in the architecture code. If the only available
> clocksource is 1 msec resolution then there's no solution and nothing to
> talk about - lock statistics will be 1msec granular just as much as
> scheduling.

I don't agree .. sched_clock() is obsoleted by timekeepings clocksource
structure.. sched_clock() was a quick way to get lowlevel time stamps
just for the scheduler. The timekeeping clocksource structure is a more
complete solution.

>From the architecture perspective there are two low level clock hooks to
implement one is sched_clock() , and at least one clocksource structure.
Both do essentially the same thing. With timekeepings clocksource
structure actually being easier to implement cause the math is built in.

It's clear to me that architectures will implement clocksource
structures .. However, they will not implement sched_clock() because
there is no benefit associated with it. As you have said the scheduler
works fine with a jiffies resolution clock, even a large number of x86
machines use a jiffies sched_clock() ..

So I don't think it's a bug if sched_clock() is lowres, and it shouldn't
be a bug in the future..

Daniel

2007-05-30 17:16:52

[permalink] [raw]

Subject: Re: [PATCH 3/5] lockstat: core infrastructure

On Wed, 2007-05-30 at 10:06 -0700, Daniel Walker wrote:
> On Wed, 2007-05-30 at 15:49 +0200, Ingo Molnar wrote:
> > * Steven Rostedt <[email protected]> wrote:
> >
> > > [...] We can argue that sched_clock is "good enough". If someone
> > > wants better accounting of locks on some other arch, they can simply
> > > change sched_clock to be more precise.
> >
> > exactly. Imprecise sched_clock() if there's a better fast clock source
> > available is a bug in the architecture code. If the only available
> > clocksource is 1 msec resolution then there's no solution and nothing to
> > talk about - lock statistics will be 1msec granular just as much as
> > scheduling.
>
>
> I don't agree .. sched_clock() is obsoleted by timekeepings clocksource
> structure.. sched_clock() was a quick way to get lowlevel time stamps
> just for the scheduler. The timekeeping clocksource structure is a more
> complete solution.
>
> >From the architecture perspective there are two low level clock hooks to
> implement one is sched_clock() , and at least one clocksource structure.
> Both do essentially the same thing. With timekeepings clocksource
> structure actually being easier to implement cause the math is built in.

I think you are mistaken here; the two are similar but not identical.

I see sched_clock() as fast first, accurate second. Whereas the
clocksource thing is accurate first, fast second.

There is room for both of them.

2007-05-30 17:28:11

[permalink] [raw]

Subject: Re: [PATCH 3/5] lockstat: core infrastructure

On Wed, 2007-05-30 at 19:16 +0200, Peter Zijlstra wrote:

> > >From the architecture perspective there are two low level clock hooks to
> > implement one is sched_clock() , and at least one clocksource structure.
> > Both do essentially the same thing. With timekeepings clocksource
> > structure actually being easier to implement cause the math is built in.
>
> I think you are mistaken here; the two are similar but not identical.
>
> I see sched_clock() as fast first, accurate second. Whereas the
> clocksource thing is accurate first, fast second.

This is true .. However, if there is a speed different it's small.
In the past I've replace sched_clock() with a clocksource, and there was
no noticeable speed different .. Just recently I replaced x86's
sched_clock() math with the clocksource math with no noticable
difference .. At least not from my benchmarks ..

> There is room for both of them.

There is room, but we don't need sched_clock() .. Certainly we shouldn't
force architectures to implement sched_clock() by calling it a "bug" if
it's lowres.

Daniel

2007-06-01 13:13:23

[permalink] [raw]

Subject: Re: [PATCH 3/5] lockstat: core infrastructure

* Daniel Walker <[email protected]> wrote:

> On Wed, 2007-05-30 at 19:16 +0200, Peter Zijlstra wrote:

> > I think you are mistaken here; the two are similar but not
> > identical.
> >
> > I see sched_clock() as fast first, accurate second. Whereas the
> > clocksource thing is accurate first, fast second.
>
> This is true .. However, if there is a speed different it's small.

Ugh. Have you ever compared pmtimer (or even hpet) against TSC based
sched_clock()? What you write is so wrong that it's not even funny. You
keep repeating this nonsense despite having been told multiple times
that you are dead wrong.

Ingo

2007-06-01 13:29:47

by Andi Kleen

[permalink] [raw]

Subject: Re: [PATCH 3/5] lockstat: core infrastructure

Daniel Walker <[email protected]> writes:

> On Wed, 2007-05-30 at 19:16 +0200, Peter Zijlstra wrote:
>
> > > >From the architecture perspective there are two low level clock hooks to
> > > implement one is sched_clock() , and at least one clocksource structure.
> > > Both do essentially the same thing. With timekeepings clocksource
> > > structure actually being easier to implement cause the math is built in.
> >
> > I think you are mistaken here; the two are similar but not identical.
> >
> > I see sched_clock() as fast first, accurate second. Whereas the
> > clocksource thing is accurate first, fast second.
>
> This is true .. However, if there is a speed different it's small.

pmtimer (factor 1000+) or HPET (factor 10-100+ depending on CPU)
accesses are much slower than TSC or jiffies read. Talking to the
southbridge is slow.

-Andi

2007-06-01 15:29:47

[permalink] [raw]

Subject: Re: [PATCH 3/5] lockstat: core infrastructure

On Fri, 2007-06-01 at 15:12 +0200, Ingo Molnar wrote:
> * Daniel Walker <[email protected]> wrote:
>
> > On Wed, 2007-05-30 at 19:16 +0200, Peter Zijlstra wrote:
>
> > > I think you are mistaken here; the two are similar but not
> > > identical.
> > >
> > > I see sched_clock() as fast first, accurate second. Whereas the
> > > clocksource thing is accurate first, fast second.
> >
> > This is true .. However, if there is a speed different it's small.
>
> Ugh. Have you ever compared pmtimer (or even hpet) against TSC based
> sched_clock()? What you write is so wrong that it's not even funny. You
> keep repeating this nonsense despite having been told multiple times
> that you are dead wrong.

Yes I have, and your right there is a difference, and a big
difference .. Above I was referring only to the TSC clocksource, since
that's an apples to apples comparison .. I would never compare the TSC
to the acpi_pm, that's no contest ..

The acpi_pm as sched_clock() with hackbench was about %25 slower, the
pit was 10x slower approximately. (I did this months ago.)

Daniel

2007-06-01 15:53:42

[permalink] [raw]

Subject: Re: [PATCH 3/5] lockstat: core infrastructure

On Fri, 2007-06-01 at 08:26 -0700, Daniel Walker wrote:
> On Fri, 2007-06-01 at 15:12 +0200, Ingo Molnar wrote:
> > * Daniel Walker <[email protected]> wrote:
> >
> > > On Wed, 2007-05-30 at 19:16 +0200, Peter Zijlstra wrote:
> >
> > > > I think you are mistaken here; the two are similar but not
> > > > identical.
> > > >
> > > > I see sched_clock() as fast first, accurate second. Whereas the
> > > > clocksource thing is accurate first, fast second.
> > >
> > > This is true .. However, if there is a speed different it's small.
> >
> > Ugh. Have you ever compared pmtimer (or even hpet) against TSC based
> > sched_clock()? What you write is so wrong that it's not even funny. You
> > keep repeating this nonsense despite having been told multiple times
> > that you are dead wrong.
>
> Yes I have, and your right there is a difference, and a big
> difference .. Above I was referring only to the TSC clocksource, since
> that's an apples to apples comparison .. I would never compare the TSC
> to the acpi_pm, that's no contest ..
>
> The acpi_pm as sched_clock() with hackbench was about %25 slower, the
> pit was 10x slower approximately. (I did this months ago.)

The whole issue is that you don't have any control over what clocksource
you'll end up with. If it so happens that pmtimer gets selected your
whole box will crawl if its used liberaly, like the patch under
discussion does.

So, having two interfaces, one fast and one accurate is the right answer
IMHO.

And I agree, that if the arch has a fast clock but doesn't use it for
sched_clock() that would be a shortcoming of that arch.

2007-06-01 16:14:04

[permalink] [raw]

Subject: Re: [PATCH 3/5] lockstat: core infrastructure

On Fri, 2007-06-01 at 17:52 +0200, Peter Zijlstra wrote:
> On Fri, 2007-06-01 at 08:26 -0700, Daniel Walker wrote:
> > On Fri, 2007-06-01 at 15:12 +0200, Ingo Molnar wrote:
> > > * Daniel Walker <[email protected]> wrote:
> > >
> > > > On Wed, 2007-05-30 at 19:16 +0200, Peter Zijlstra wrote:
> > >
> > > > > I think you are mistaken here; the two are similar but not
> > > > > identical.
> > > > >
> > > > > I see sched_clock() as fast first, accurate second. Whereas the
> > > > > clocksource thing is accurate first, fast second.
> > > >
> > > > This is true .. However, if there is a speed different it's small.
> > >
> > > Ugh. Have you ever compared pmtimer (or even hpet) against TSC based
> > > sched_clock()? What you write is so wrong that it's not even funny. You
> > > keep repeating this nonsense despite having been told multiple times
> > > that you are dead wrong.
> >
> > Yes I have, and your right there is a difference, and a big
> > difference .. Above I was referring only to the TSC clocksource, since
> > that's an apples to apples comparison .. I would never compare the TSC
> > to the acpi_pm, that's no contest ..
> >
> > The acpi_pm as sched_clock() with hackbench was about %25 slower, the
> > pit was 10x slower approximately. (I did this months ago.)
>
> The whole issue is that you don't have any control over what clocksource
> you'll end up with. If it so happens that pmtimer gets selected your
> whole box will crawl if its used liberaly, like the patch under
> discussion does.

You can have control over it, which I think the whole point of this
discussion ..

> So, having two interfaces, one fast and one accurate is the right answer
> IMHO.

In the case of lockstat you have two cases fast and functional, and
non-functional .. Right now your patch has no slow and functional state.

The non-functional state is even the majority from my perspective.

> And I agree, that if the arch has a fast clock but doesn't use it for
> sched_clock() that would be a shortcoming of that arch.

As I said before there is no reason why and architectures should be
forced to implement sched_clock() .. Is there some specific reason why
you think it should be mandatory?

Daniel

2007-06-01 18:20:26

[permalink] [raw]

Subject: Re: [PATCH 3/5] lockstat: core infrastructure

* Daniel Walker <[email protected]> wrote:

> > > > I see sched_clock() as fast first, accurate second. Whereas the
> > > > clocksource thing is accurate first, fast second.
> > >
> > > This is true .. However, if there is a speed different it's small.
> >
> > Ugh. Have you ever compared pmtimer (or even hpet) against TSC based
> > sched_clock()? What you write is so wrong that it's not even funny.
> > You keep repeating this nonsense despite having been told multiple
> > times that you are dead wrong.
>
> Yes I have, and your right there is a difference, and a big difference
> .. Above I was referring only to the TSC clocksource, since that's an
> apples to apples comparison .. I would never compare the TSC to the
> acpi_pm, that's no contest ..

You still dont get it i think: in real life we end up using the TSC in
sched_clock() _much more often_ than we end up using the TSC for
clocksource! So your flawed suggestion does not fix anything, it in fact
introduces a really bad regression: instead of using the TSC (or
jiffies) we'd end up using the pmtimer or hpet for every lock operation
when lockstat is enabled, bringing the box to a screeching halt in
essence.

so what you suggest has a far worse effect on the _majority_ of systems
that are even interested in running lockstat, than the case you
mentioned that some seldom-used arch which is lazy about sched_clock()
falls back to jiffies granularity. It's not a big deal: the stats will
have the same granularity. (the op counts in lockstat will still be
quite useful)

sched_clock() is a 'fast but occasionally inaccurate clock', while the
GTOD clocksource is an accurate clock (but very often slow).

Ingo

2007-06-01 18:32:21

[permalink] [raw]

Subject: Re: [PATCH 3/5] lockstat: core infrastructure

* Daniel Walker <[email protected]> wrote:

> > So, having two interfaces, one fast and one accurate is the right
> > answer IMHO.
>
> In the case of lockstat you have two cases fast and functional, and
> non-functional .. Right now your patch has no slow and functional
> state.

let me explain it to you:

1) there is absolutely no problem here to begin with. If a rare
architecture is lazy enough to not bother implementing a finegrained
sched_clock() then it certainly does not care about the granularity of
lockstat fields either. If it does, it can improve scheduling and get
more finegrained lockstat by implementing a proper sched_clock()
function - all for the same price! ;-)

2) the 'solution' you suggested for this non-problem is _far worse_ than
the granularity non-problem, on the _majority_ of server systems today!
Think about it! Your suggestion would make lockstat _totally unusable_.
Not "slow and functional" like you claim but "dead-slow and unusable".

in light of all this it is puzzling to me how you can still call Peter's
code "non-functional" with a straight face. I have just tried lockstat
with jiffies granular sched_clock() and it was still fully functional.
So if you want to report some bug then please do it in a proper form.

> As I said before there is no reason why and architectures should be
> forced to implement sched_clock() .. Is there some specific reason why
> you think it should be mandatory?

Easy: it's not mandatory, but it's certainly "nice" even today, even
without lockstat. It will get you:

- better scheduling
- better printk timestamps
- higher-quality blktrace timestamps

With lockstat, append "more finegrained lockstat output" to that list of
benefits too. That's why every sane server architecture has a
sched_clock() implementation - go check the kernel source. Now i wouldnt
mind to clean the API up and call it get_stat_clock() or whatever - but
that was not your suggestion at all - your suggestion was flawed: to
implement sched_clock() via the GTOD clocksource.

Ingo

2007-06-01 18:44:44

[permalink] [raw]

Subject: Re: [PATCH 3/5] lockstat: core infrastructure

On Fri, 2007-06-01 at 09:11 -0700, Daniel Walker wrote:
> On Fri, 2007-06-01 at 17:52 +0200, Peter Zijlstra wrote:

> > The whole issue is that you don't have any control over what clocksource
> > you'll end up with. If it so happens that pmtimer gets selected your
> > whole box will crawl if its used liberaly, like the patch under
> > discussion does.
>
> You can have control over it, which I think the whole point of this
> discussion ..

No you don't, clocksource will gladly discard the TSC when its not found
stable enough (the majority of the systems today). While it would be
good enough for sched_clock().

2007-06-01 18:51:53

[permalink] [raw]

Subject: Re: [PATCH 3/5] lockstat: core infrastructure

* Peter Zijlstra <[email protected]> wrote:

> On Fri, 2007-06-01 at 09:11 -0700, Daniel Walker wrote:
> > On Fri, 2007-06-01 at 17:52 +0200, Peter Zijlstra wrote:
>
> > > The whole issue is that you don't have any control over what clocksource
> > > you'll end up with. If it so happens that pmtimer gets selected your
> > > whole box will crawl if its used liberaly, like the patch under
> > > discussion does.
> >
> > You can have control over it, which I think the whole point of this
> > discussion ..
>
> No you don't, clocksource will gladly discard the TSC when its not
> found stable enough (the majority of the systems today). While it
> would be good enough for sched_clock().

yeah, precisely. [ There is another thing as well: most embedded
architectures do not even implement LOCKDEP_SUPPORT today, so it wouldnt
be possible to enable lockstat on them anyway. So this whole topic is
ridiculous to begin with. How about fixing some real, non-imaginery bugs
instead? ;-) ]

Ingo

2007-06-01 19:26:30

by Matt Mackall

[permalink] [raw]

Subject: Re: [PATCH 3/5] lockstat: core infrastructure

On Fri, Jun 01, 2007 at 08:30:53PM +0200, Ingo Molnar wrote:
> - better scheduling
> - better printk timestamps
> - higher-quality blktrace timestamps
- more entropy in /dev/random

--
Mathematics is the supreme nostalgia of our time.

2007-06-01 19:33:35

[permalink] [raw]

Subject: Re: [PATCH 3/5] lockstat: core infrastructure

On Fri, 2007-06-01 at 20:19 +0200, Ingo Molnar wrote:
> * Daniel Walker <[email protected]> wrote:
>
> > > > > I see sched_clock() as fast first, accurate second. Whereas the
> > > > > clocksource thing is accurate first, fast second.
> > > >
> > > > This is true .. However, if there is a speed different it's small.
> > >
> > > Ugh. Have you ever compared pmtimer (or even hpet) against TSC based
> > > sched_clock()? What you write is so wrong that it's not even funny.
> > > You keep repeating this nonsense despite having been told multiple
> > > times that you are dead wrong.
> >
> > Yes I have, and your right there is a difference, and a big difference
> > .. Above I was referring only to the TSC clocksource, since that's an
> > apples to apples comparison .. I would never compare the TSC to the
> > acpi_pm, that's no contest ..
>
> You still dont get it i think: in real life we end up using the TSC in
> sched_clock() _much more often_ than we end up using the TSC for
> clocksource! So your flawed suggestion does not fix anything, it in fact
> introduces a really bad regression: instead of using the TSC (or
> jiffies) we'd end up using the pmtimer or hpet for every lock operation
> when lockstat is enabled, bringing the box to a screeching halt in
> essence.

My position isn't that we should use the high level clocksource
interface as it is now without changes .. That's never been my position
since I've been working with it.. The high level interface will need to
evolve.

I'm saying we should use the clocksource structure as the main hook into
the low level architecture code.

> so what you suggest has a far worse effect on the _majority_ of systems
> that are even interested in running lockstat, than the case you
> mentioned that some seldom-used arch which is lazy about sched_clock()
> falls back to jiffies granularity. It's not a big deal: the stats will
> have the same granularity. (the op counts in lockstat will still be
> quite useful)

My suggestion is only as good as the implementation .. Your making some
fairly sweeping assumption about how lockstat _would_ use the
clocksources in it's final form ..

So clearly lockstat has a contraint which is that it can't use slow
clocks..

> sched_clock() is a 'fast but occasionally inaccurate clock', while the
> GTOD clocksource is an accurate clock (but very often slow).

I think we're just taking different perspectives .. The tsc clocksource
is just as fast as the tsc sched_clock() , you can interchange the two
without ill effects .. That's one perspective ..

You can use a yet to be written API that uses the GTOD and a
clocksource, and allows slow clocksources to be used in place of fast
ones with really bad effects, that's another perspective ..

Daniel

2007-06-01 19:33:47

[permalink] [raw]

Subject: Re: [PATCH 3/5] lockstat: core infrastructure

On Fri, 2007-06-01 at 20:30 +0200, Ingo Molnar wrote:
> * Daniel Walker <[email protected]> wrote:
>
> > > So, having two interfaces, one fast and one accurate is the right
> > > answer IMHO.
> >
> > In the case of lockstat you have two cases fast and functional, and
> > non-functional .. Right now your patch has no slow and functional
> > state.
>
> let me explain it to you:
>
> 1) there is absolutely no problem here to begin with. If a rare
> architecture is lazy enough to not bother implementing a finegrained
> sched_clock() then it certainly does not care about the granularity of
> lockstat fields either. If it does, it can improve scheduling and get
> more finegrained lockstat by implementing a proper sched_clock()
> function - all for the same price! ;-)

There is a problem, which we are discussing ... sched_clock() can be
lowres in lots of different situations, and lockstat fails to account
for that .. That in turn makes it's timing non-functional.

> 2) the 'solution' you suggested for this non-problem is _far worse_ than
> the granularity non-problem, on the _majority_ of server systems today!
> Think about it! Your suggestion would make lockstat _totally unusable_.
> Not "slow and functional" like you claim but "dead-slow and unusable".

I'm not sure how to respond to this.. You taking a big ball of
assumptions, and molding it into what ever you want ..

> in light of all this it is puzzling to me how you can still call Peter's
> code "non-functional" with a straight face. I have just tried lockstat
> with jiffies granular sched_clock() and it was still fully functional.
> So if you want to report some bug then please do it in a proper form.

Clearly you can't have sane microsecond level timestamps with a clock
that doesn't support microsecond resolution.. This is even something
Peter acknowledged in his first email to me.

> > As I said before there is no reason why and architectures should be
> > forced to implement sched_clock() .. Is there some specific reason why
> > you think it should be mandatory?
>
> Easy: it's not mandatory, but it's certainly "nice" even today, even
> without lockstat. It will get you:
>
> - better scheduling
> - better printk timestamps
> - higher-quality blktrace timestamps
>
> With lockstat, append "more finegrained lockstat output" to that list of
> benefits too. That's why every sane server architecture has a
> sched_clock() implementation - go check the kernel source. Now i wouldnt
> mind to clean the API up and call it get_stat_clock() or whatever - but
> that was not your suggestion at all - your suggestion was flawed: to
> implement sched_clock() via the GTOD clocksource.

At this point it's not clear to me you know what my suggestion was ..
Your saying you want a better API for sched_clock(), and yes I agree
with that 100% sched_clock() needs a better API .. The paragraph above
it looks like your on the verge of agreeing with me ..

You think my words are puzzling, try it from this end..

Daniel

2007-06-01 19:33:58