2008-11-07 05:50:44

by Mathieu Desnoyers

Subject: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

Assume the time source is a global clock which ensures that time will never
*ever* go backward. Use a smp_rmb() to make sure the __m_cnt_hi value is read
before the cnt_lo mmio read.

Remove the now-unnecessary volatile qualifier; the barrier takes care of
memory ordering.

Mathieu:
> Yup, you are right. However, the case where one CPU sees the clock source
> a little bit off-sync (late) still poses a problem. Example follows :
>
> CPU A                                 B
> read __m_cnt_hi (0x80000000)
> read hw cnt low (0x00000001)
> (wrap detected :
>  (s32)(0x80000000 ^ 0x1) < 0)
> write __m_cnt_hi = 0x00000001
> return 0x0000000100000001
>                                       read __m_cnt_hi (0x00000001)
>                                       (late) read hw cnt low (0xFFFFFFFA)
>                                       (wrap detected :
>                                        (s32)(0x00000001 ^ 0xFFFFFFFA) < 0)
>                                       write __m_cnt_hi = 0x80000001
>                                       return 0x80000001FFFFFFFA
>                                       (time jumps)
>
> A similar situation can be generated by out-of-order hi/low bits reads.

Nicolas:
This, of course, should and can be prevented. No big deal.

Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: Nicolas Pitre <[email protected]>
CC: Ralf Baechle <[email protected]>
CC: [email protected]
CC: [email protected]
CC: David Miller <[email protected]>
CC: Linus Torvalds <[email protected]>
CC: Andrew Morton <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: [email protected]
---
include/linux/cnt32_to_63.h | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

Index: linux-2.6-lttng/include/linux/cnt32_to_63.h
===================================================================
--- linux-2.6-lttng.orig/include/linux/cnt32_to_63.h 2008-11-04 01:39:03.000000000 -0500
+++ linux-2.6-lttng/include/linux/cnt32_to_63.h 2008-11-04 01:48:50.000000000 -0500
@@ -65,12 +65,17 @@ union cnt32_to_63 {
  * implicitly by making the multiplier even, therefore saving on a runtime
  * clear-bit instruction. Otherwise caller must remember to clear the top
  * bit explicitly.
+ *
+ * Assume the time source is a global clock read from memory mapped I/O which
+ * ensures that time will never *ever* go backward. Use a smp_rmb() to make
+ * sure the __m_cnt_hi value is read before the cnt_lo mmio read.
  */
 #define cnt32_to_63(cnt_lo) \
 ({ \
-	static volatile u32 __m_cnt_hi; \
+	static u32 __m_cnt_hi; \
 	union cnt32_to_63 __x; \
 	__x.hi = __m_cnt_hi; \
+	smp_rmb(); /* read __m_cnt_hi before mmio cnt_lo */ \
 	__x.lo = (cnt_lo); \
 	if (unlikely((s32)(__x.hi ^ __x.lo) < 0)) \
 		__m_cnt_hi = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31); \

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68


2008-11-07 06:08:13

by Andrew Morton

Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

On Fri, 07 Nov 2008 00:23:44 -0500 Mathieu Desnoyers <[email protected]> wrote:

> #define cnt32_to_63(cnt_lo) \
> ({ \
> -	static volatile u32 __m_cnt_hi; \
> +	static u32 __m_cnt_hi; \
> 	union cnt32_to_63 __x; \
> 	__x.hi = __m_cnt_hi; \
> +	smp_rmb(); /* read __m_cnt_hi before mmio cnt_lo */ \
> 	__x.lo = (cnt_lo); \
> 	if (unlikely((s32)(__x.hi ^ __x.lo) < 0)) \
> 		__m_cnt_hi = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31); \

Oh dear. We have a macro which secretly maintains
per-instantiation-site global state? And doesn't even implement locking
to protect that state?

I mean, the darned thing is called from sched_clock(), which can be
concurrently called on separate CPUs and which can be called from
interrupt context (with an arbitrary nesting level!) while it was running
in process context.

Who let that thing into Linux?


Look:

/*
 * Caller must provide locking to protect *caller_state
 */
u64 cnt32_to_63(u32 *caller_state, u32 cnt_lo);

But even that looks pretty crappy.
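
To make the objection concrete: each textual expansion of the macro allocates
its own hidden hi-word state, invisible to the caller and protected by no
lock. A minimal sketch (CLOCK_A and CLOCK_B are hypothetical mmio addresses):

	#include <linux/cnt32_to_63.h>
	#include <linux/io.h>

	/* Each expansion instantiates a separate static __m_cnt_hi. */
	u64 read_clock_a(void)
	{
		return cnt32_to_63(readl(CLOCK_A));	/* hidden state #1 */
	}

	u64 read_clock_b(void)
	{
		return cnt32_to_63(readl(CLOCK_B));	/* hidden state #2 */
	}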

2008-11-07 08:12:37

by Nicolas Pitre

Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

On Thu, 6 Nov 2008, Andrew Morton wrote:

> On Fri, 07 Nov 2008 00:23:44 -0500 Mathieu Desnoyers <[email protected]> wrote:
>
> > #define cnt32_to_63(cnt_lo) \
> > ({ \
> > -	static volatile u32 __m_cnt_hi; \
> > +	static u32 __m_cnt_hi; \
> > 	union cnt32_to_63 __x; \
> > 	__x.hi = __m_cnt_hi; \
> > +	smp_rmb(); /* read __m_cnt_hi before mmio cnt_lo */ \
> > 	__x.lo = (cnt_lo); \
> > 	if (unlikely((s32)(__x.hi ^ __x.lo) < 0)) \
> > 		__m_cnt_hi = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31); \
>
> Oh dear. We have a macro which secretly maintains
> per-instantiation-site global state? And doesn't even implement locking
> to protect that state?

Please do me a favor and look for those very infrequent posts I've sent
to lkml lately. I've explained it all at least 3 times so far, to Peter
Zijlstra, to David Howells, to Mathieu Desnoyers, and now to you.

> I mean, the darned thing is called from sched_clock(), which can be
> concurrently called on separate CPUs and which can be called from
> interrupt context (with an arbitrary nesting level!) while it was running
> in process context.

Yes! And this is so on *purpose*. Please take some time to read the
comment that goes along with it, and if you're still not convinced then
look for those explanation emails I've already posted.

> /*
> * Caller must provide locking to protect *caller_state
> */

NO! This is meant to be LOCK FREE!


Nicolas

2008-11-07 08:41:24

by Andrew Morton

Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

On Fri, 07 Nov 2008 03:12:18 -0500 (EST) Nicolas Pitre <[email protected]> wrote:

> On Thu, 6 Nov 2008, Andrew Morton wrote:
>
> > On Fri, 07 Nov 2008 00:23:44 -0500 Mathieu Desnoyers <[email protected]> wrote:
> >
> > > #define cnt32_to_63(cnt_lo) \
> > > ({ \
> > > -	static volatile u32 __m_cnt_hi; \
> > > +	static u32 __m_cnt_hi; \
> > > 	union cnt32_to_63 __x; \
> > > 	__x.hi = __m_cnt_hi; \
> > > +	smp_rmb(); /* read __m_cnt_hi before mmio cnt_lo */ \
> > > 	__x.lo = (cnt_lo); \
> > > 	if (unlikely((s32)(__x.hi ^ __x.lo) < 0)) \
> > > 		__m_cnt_hi = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31); \
> >
> > Oh dear. We have a macro which secretly maintains
> > per-instantiation-site global state? And doesn't even implement locking
> > to protect that state?
>
> Please do me a favor and look for those very infrequent posts I've sent
> to lkml lately.

No. Reading the kernel code (and, at a pinch, the changelogs) should
suffice. If it does not suffice, the kernel code is in error.

> I've explained it all at least 3 times so far, to Peter
> Zijlstra, to David Howells, to Mathieu Desnoyers, and now to you.

If four heads have exploded (thus far) over one piece of code, perhaps
the blame doesn't lie with those heads.

> > I mean, the darned thing is called from sched_clock(), which can be
> > concurrently called on separate CPUs and which can be called from
> > interrupt context (with an arbitrary nesting level!) while it was running
> > in process context.
>
> Yes! And this is so on *purpose*. Please take some time to read the
> comment that goes along with it,

OK.

> and if you're still not convinced then
> look for those explanation emails I've already posted.

No.

> > /*
> > * Caller must provide locking to protect *caller_state
> > */
>
> NO! This is meant to be LOCK FREE!

We have a macro which must only have a single usage in any particular
kernel build (and nothing to detect a violation of that).

We have a macro which secretly hides internal state, on a per-expansion-site
basis, no less.

It apparently tries to avoid races via ordering tricks, as long
as it is called with sufficient frequency. But nothing guarantees
that it _is_ called sufficiently frequently?

There is absolutely no reason why the first two of these quite bad things
needed to be done. In fact there is no reason why it needed to be
implemented as a macro at all.

As I said in the text which you deleted and ignored, this would be
better if it was implemented as a C function which requires that the
caller explicitly pass in a reference to the state storage.

2008-11-07 10:56:29

by David Howells

Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

Mathieu Desnoyers <[email protected]> wrote:

> Assume the time source is a global clock which ensures that time will never
> *ever* go backward. Use a smp_rmb() to make sure the __m_cnt_hi value is read
> before the cnt_lo mmio read.

If you have an smp_rmb(), then don't you need an smp_wmb()/smp_mb() to match
it to make it work? And is your assumption valid that smp_rmb() will affect
memory vs the I/O access to read the clock? There's no requirement that
cnt_lo will have been read from an MMIO location at all.

David

2008-11-07 11:00:18

by David Howells

Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

Andrew Morton <[email protected]> wrote:

> I mean, the darned thing is called from sched_clock(), which can be
> concurrently called on separate CPUs and which can be called from
> interrupt context (with an arbitrary nesting level!) while it was running
> in process context.
>
> Who let that thing into Linux?

Having crawled all over it, and argued with Nicolas and Panasonic about it, I
think it's safe in sched_clock(), provided sched_clock() never gets preempted -
which appears to be the case.

David

2008-11-07 11:04:22

by David Howells

Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

Nicolas Pitre <[email protected]> wrote:

> > I mean, the darned thing is called from sched_clock(), which can be
> > concurrently called on separate CPUs and which can be called from
> > interrupt context (with an arbitrary nesting level!) while it was running
> > in process context.
>
> Yes! And this is so on *purpose*. Please take some time to read the
> comment that goes along with it, and if you're still not convinced then
> look for those explanation emails I've already posted.

I agree with Nicolas on this. It's abominably clever, but I think he's right.

The one place I remain unconvinced is over the issue of preemption of a process
that is in the middle of cnt32_to_63(), where if the preempted process is
asleep for long enough, I think it can wind time backwards when it resumes, but
that's not a problem for the one place I want to use it (sched_clock()) because
that is (almost) always called with preemption disabled in one way or another.

The one place it isn't is a debugging case that I'm not too worried about.

> > /*
> > * Caller must provide locking to protect *caller_state
> > */
>
> NO! This is meant to be LOCK FREE!

Absolutely.

David

2008-11-07 11:21:18

by David Howells

Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

Andrew Morton <[email protected]> wrote:

> We have a macro which must only have a single usage in any particular
> kernel build (and nothing to detect a violation of that).

That's not true. It's a macro containing a _static_ local variable, therefore
the macro may be used multiple times, and each time it's used the compiler
will allocate a new piece of storage.

> It apparently tries to avoid races via ordering tricks, as long
> as it is called with sufficient frequency. But nothing guarantees
> that it _is_ called sufficiently frequently?

The comment attached to it clearly states this restriction. Therefore the
caller must guarantee it. That is something Mathieu's code and my code must
deal with, not Nicolas's.

> There is absolutely no reason why the first two of these quite bad things
> needed to be done. In fact there is no reason why it needed to be
> implemented as a macro at all.

There's a very good reason to implement it as either a macro or an inline
function: it's faster. Moving the whole thing out of line would impose an
additional function call overhead - with a 64-bit return value on 32-bit
platforms. For my case - sched_clock() - I'm willing to burn a bit of extra
space to get the extra speed.

> As I said in the text which you deleted and ignored, this would be
> better if it was implemented as a C function which requires that the
> caller explicitly pass in a reference to the state storage.

I'd be quite happy if it was:

static inline u64 cnt32_to_63(u32 cnt_lo, u32 *__m_cnt_hi)
{
	union cnt32_to_63 __x;
	__x.hi = *__m_cnt_hi;
	__x.lo = cnt_lo;
	if (unlikely((s32)(__x.hi ^ __x.lo) < 0))
		*__m_cnt_hi =
			__x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31);
	return __x.val;
}

I imagine this would compile pretty much the same as the macro. I think it
would make it more obvious about the independence of the storage.

Alternatively, perhaps Nicolas just needs to mention this in the comment more
clearly.

David
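
For illustration, a caller of this proposed variant would own the hi-word
state explicitly - a sketch (TIMER_VALUE_REG is a hypothetical mmio address;
note, as David observes later in the thread, that evaluating the argument
before the call changes where the mmio read happens):

	static u32 timer_cnt_hi;	/* one hi-word state per counter */

	u64 timer_read_63(void)
	{
		/* readl() is evaluated before the call, i.e. before the
		 * function body reads *__m_cnt_hi -- see the ordering
		 * discussion further down the thread. */
		return cnt32_to_63(readl(TIMER_VALUE_REG), &timer_cnt_hi);
	}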

2008-11-07 15:01:22

by Nicolas Pitre

Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

On Fri, 7 Nov 2008, David Howells wrote:

> Andrew Morton <[email protected]> wrote:
>
> > As I said in the text which you deleted and ignored, this would be
> > better if it was implemented as a C function which requires that the
> > caller explicitly pass in a reference to the state storage.

The whole purpose of that thing is to be utterly fast and lightweight.
Having an out of line C call would trash the major advantage of this
code.

> I'd be quite happy if it was:
>
> static inline u64 cnt32_to_63(u32 cnt_lo, u32 *__m_cnt_hi)
> {
> 	union cnt32_to_63 __x;
> 	__x.hi = *__m_cnt_hi;
> 	__x.lo = cnt_lo;
> 	if (unlikely((s32)(__x.hi ^ __x.lo) < 0))
> 		*__m_cnt_hi =
> 			__x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31);
> 	return __x.val;
> }
>
> I imagine this would compile pretty much the same as the macro.

Depends. As everybody has noticed now, the read ordering is important,
and if gcc decides to not inline this for whatever reason then the
ordering is lost. This is why this was a macro to start with.

> I think it
> would make it more obvious about the independence of the storage.

I don't think having the associated storage be outside the macro makes
any sense either. There is simply no valid reason for having it shared
between multiple invocations of the macro, as well as making its
interface more complex for no gain.

> Alternatively, perhaps Nicolas just needs to mention this in the comment more
> clearly.

I wrote that code so to me it is crystal clear already. Any suggestions
as to how this could be improved?


Nicolas

2008-11-07 15:51:53

by Andrew Morton

Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

On Fri, 07 Nov 2008 10:01:01 -0500 (EST) Nicolas Pitre <[email protected]> wrote:

> On Fri, 7 Nov 2008, David Howells wrote:
>
> > Andrew Morton <[email protected]> wrote:
> >
> > > As I said in the text which you deleted and ignored, this would be
> > > better if it was implemented as a C function which requires that the
> > > caller explicitly pass in a reference to the state storage.
>
> The whole purpose of that thing is to be utterly fast and lightweight.

Well I'm glad it wasn't designed to demonstrate tastefulness.

btw, do you know how damned irritating and frustrating it is for a code
reviewer to have his comments deliberately ignored and deleted in
replies?

> Having an out of line C call would trash the major advantage of this
> code.

Not really.

> > I'd be quite happy if it was:
> >
> > static inline u64 cnt32_to_63(u32 cnt_lo, u32 *__m_cnt_hi)
> > {
> > 	union cnt32_to_63 __x;
> > 	__x.hi = *__m_cnt_hi;
> > 	__x.lo = cnt_lo;
> > 	if (unlikely((s32)(__x.hi ^ __x.lo) < 0))
> > 		*__m_cnt_hi =
> > 			__x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31);
> > 	return __x.val;
> > }
> >
> > I imagine this would compile pretty much the same as the macro.
>
> Depends. As everybody has noticed now, the read ordering is important,
> and if gcc decides to not inline this

If gcc did that then it would need to generate static instances of
inlined functions within individual compilation units. It would be a
disaster for the kernel. For a start, functions which are "inlined" in kernel
modules wouldn't be able to access their static storage and modprobing
them would fail.

> for whatever reason then the
> ordering is lost.

Uninlining won't affect any ordering I can see.

> This is why this was a macro to start with.
>
> > I think it
> > would make it more obvious about the independence of the storage.
>
> I don't think having the associated storage be outside the macro makes
> any sense either. There is simply no valid reason for having it shared
> between multiple invocations of the macro, as well as making its
> interface more complex for no gain.

oh god.

> > Alternatively, perhaps Nicolas just needs to mention this in the comment more
> > clearly.
>
> I wrote that code so to me it is crystal clear already. Any suggestions
> as to how this could be improved?
>

Does mn10300's get_cycles() really count backwards? The first two
callsites I looked at (crypto/tcrypt.c and fs/ext4/mballoc.c) assume
that it is an upcounter.

2008-11-07 16:09:25

by David Howells

Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

Nicolas Pitre <[email protected]> wrote:

> The whole purpose of that thing is to be utterly fast and lightweight.
> Having an out of line C call would trash the major advantage of this
> code.

No argument there.

> > I imagine this would compile pretty much the same as the macro.

Having said that, I realise it's wrong. The macro potentially takes a h/w
read operation (cnt_lo) and does it at a place of its choosing - which the
compiler may not be permitted to move if cnt_lo resolves to a bit of volatile
inline asm with memory constraints. Converting it to an inline function
forces cnt_lo to be resolved first.
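
A sketch of that evaluation-order difference (TIMER_REG and state are
hypothetical; the comments mark where the mmio read lands):

	/* macro form: (cnt_lo) expands in place, so the mmio read happens
	 * after __x.hi = __m_cnt_hi has been performed */
	u64 a = cnt32_to_63(readl(TIMER_REG));

	/* function form: the argument is evaluated before the call, so the
	 * mmio read now happens before the function reads *__m_cnt_hi */
	u64 b = cnt32_to_63(readl(TIMER_REG), &state);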

> Depends. As everybody has noticed now, the read ordering is important,
> and if gcc decides to not inline this for whatever reason then the
> ordering is lost. This is why this was a macro to start with.

Good point. I wonder if you should've put a compiler barrier in there to at
least make the point.

> I don't think having the associated storage be outside the macro make any
> sense either.

It can have a comment attached to it to say what it represents. On the other
hand, it'd probably need further comments attaching to it to fend off people
who start thinking they can make use of this variable in other ways...


> > Alternatively, perhaps Nicolas just needs to mention this in the comment
> > more clearly.
>
> I wrote that code so to me it is crystal clear already. Any suggestions
> as to how this could be improved?

It ought to be, but clearly it isn't. Sometimes the obvious is all too easy to
overlook. I'll think about it, but perhaps something like:

* This macro uses a static internal variable to retain the upper counter.
* This has two consequences: firstly, it may be used in multiple ways by
* different callers for different things without interference; and secondly,
* each caller will get its own, independent counter, and so an out of line
* wrapper must be used if multiple callers want to use the same pair of
* counters.

It's a bit heavy-handed, but...

David

2008-11-07 16:22:40

by David Howells

Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

Andrew Morton <[email protected]> wrote:

> If gcc did that then it would need to generate static instances of
> inlined functions within individual compilation units. It would be a
> disaster for the kernel. For a start, functions which are "inlined" in kernel
> modules wouldn't be able to access their static storage and modprobing
> them would fail.

Do you expect a static inline function that lives in a header file and that
has a static variable in it to share that static variable over all instances
of that function in a program? Or do you expect the static variable to be
limited at the file level? Or just at the invocation level?

> Does mn10300's get_cycles() really count backwards?

Yes, because the value is generated by a pair of cascaded 16-bit hardware
down-counters.

> The first two callsites I looked at (crypto/tcrypt.c and fs/ext4/mballoc.c)
> assume that it is an upcounter.

Hmmm... It didn't occur to me that get_cycles() was available for use outside
of arch code. Possibly it wasn't so used when I first came up with the code.

I should probably make it count the other way.

David
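
One way to flip it - a sketch, with read_hw_downcounter() standing in for
the real register access:

	static inline u32 get_cycles_up(void)
	{
		/* a 32-bit down-counter inverted bitwise counts up:
		 * ~x == 0xFFFFFFFF - x, so this increments as x decrements */
		return ~read_hw_downcounter();
	}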

2008-11-07 16:31:04

by Andrew Morton

Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

On Fri, 07 Nov 2008 16:21:55 +0000 David Howells <[email protected]> wrote:

> Andrew Morton <[email protected]> wrote:
>
> > If gcc did that then it would need to generate static instances of
> > inlined functions within individual compilation units. It would be a
> > disaster for the kernel. For a start, functions which are "inlined" in kernel
> > modules wouldn't be able to access their static storage and modprobing
> > them would fail.
>
> Do you expect a static inline function that lives in a header file and that
> has a static variable in it to share that static variable over all instances
> of that function in a program? Or do you expect the static variable to be
> limited at the file level? Or just at the invocation level?

I'd expect it to behave in the same way as it would if the function was
implemented out-of-line.

But it occurs to me that the modprobe-doesn't-work thing would happen if
the function _is_ inlined anyway, so we won't be doing that.

Whatever. Killing this many puppies because gcc may do something so
bizarrely wrong isn't justifiable.

2008-11-07 16:48:14

by Nicolas Pitre

Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

On Fri, 7 Nov 2008, Andrew Morton wrote:

> On Fri, 07 Nov 2008 10:01:01 -0500 (EST) Nicolas Pitre <[email protected]> wrote:
>
> > On Fri, 7 Nov 2008, David Howells wrote:
> >
> > > Andrew Morton <[email protected]> wrote:
> > >
> > > > As I said in the text which you deleted and ignored, this would be
> > > > better if it was implemented as a C function which requires that the
> > > > caller explicitly pass in a reference to the state storage.
> >
> > The whole purpose of that thing is to be utterly fast and lightweight.
>
> Well I'm glad it wasn't designed to demonstrate tastefulness.

Fast tricks aren't always meant to be beautiful. That's why we have
abstraction layers.

> btw, do you know how damned irritating and frustrating it is for a code
> reviewer to have his comments deliberately ignored and deleted in
> replies?

Do you know how irritating and frustrating it is when reviewers don't
care reading the damn comments along with the code? Maybe it wasn't the
case, but you gave the impression of jumping to conclusion without even
bothering about the associated explanation in the code until I directed
you at it.

> > Having an out of line C call would trash the major advantage of this
> > code.
>
> Not really.

Your opinion.

> > > I imagine this would compile pretty much the same as the macro.
> >
> > Depends. As everybody has noticed now, the read ordering is important,
> > and if gcc decides to not inline this
>
> If gcc did that then it would need to generate static instances of
> inlined functions within individual compilation units. It would be a
> disaster for the kernel. For a start, functions which are "inlined" in kernel
> modules wouldn't be able to access their static storage and modprobing
> them would fail.

That doesn't mean that access ordering is preserved within the function,
unless the interface is pointer based, but that won't work if the
counter is accessed through some special instruction rather than a
memory location.

> > for whatever reason then the
> > ordering is lost.
>
> Uninlining won't affect any ordering I can see.

See above.

> > This is why this was a macro to start with.
> >
> > > I think it
> > > would make it more obvious about the independence of the storage.
> >
> > I don't think having the associated storage be outside the macro makes
> > any sense either. There is simply no valid reason for having it shared
> > between multiple invocations of the macro, as well as making its
> > interface more complex for no gain.
>
> oh god.

Thank you. ;-)

> > > Alternatively, perhaps Nicolas just needs to mention this in the comment more
> > > clearly.
> >
> > I wrote that code so to me it is crystal clear already. Any suggestions
> > as to how this could be improved?
> >
>
> Does mn10300's get_cycles() really count backwards? The first two
> callsites I looked at (crypto/tcrypt.c and fs/ext4/mballoc.c) assume
> that it is an upcounter.

I know nothing about mn10300.


Nicolas

2008-11-07 16:48:30

by Mathieu Desnoyers

Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

* David Howells ([email protected]) wrote:
> Andrew Morton <[email protected]> wrote:
>
> > We have a macro which must only have a single usage in any particular
> > kernel build (and nothing to detect a violation of that).
>
> That's not true. It's a macro containing a _static_ local variable, therefore
> the macro may be used multiple times, and each time it's used the compiler
> will allocate a new piece of storage.
>
> > It apparently tries to avoid races via ordering tricks, as long
> > as it is called with sufficient frequency. But nothing guarantees
> > that it _is_ called sufficiently frequently?
>
> The comment attached to it clearly states this restriction. Therefore the
> caller must guarantee it. That is something Mathieu's code and my code must
> deal with, not Nicolas's.
>
> > There is absolutely no reason why the first two of these quite bad things
> > needed to be done. In fact there is no reason why it needed to be
> > implemented as a macro at all.
>
> There's a very good reason to implement it as either a macro or an inline
> function: it's faster. Moving the whole thing out of line would impose an
> additional function call overhead - with a 64-bit return value on 32-bit
> platforms. For my case - sched_clock() - I'm willing to burn a bit of extra
> space to get the extra speed.
>
> > As I said in the text which you deleted and ignored, this would be
> > better if it was implemented as a C function which requires that the
> > caller explicitly pass in a reference to the state storage.
>
> I'd be quite happy if it was:
>
> static inline u64 cnt32_to_63(u32 cnt_lo, u32 *__m_cnt_hi)
> {
> 	union cnt32_to_63 __x;
> 	__x.hi = *__m_cnt_hi;
> 	__x.lo = cnt_lo;
> 	if (unlikely((s32)(__x.hi ^ __x.lo) < 0))
> 		*__m_cnt_hi =
> 			__x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31);
> 	return __x.val;
> }
>

Almost there. At least, with this kind of implementation, we would not
have to resort to various tricks to make sure a single code path is
called at a certain frequency. We would simply have to make sure the
__m_cnt_hi value is updated at a certain frequency. Thinking about
"data" rather than "code" makes much more sense.

The only missing thing here is the correct ordering. The problem is, as
I presented in more depth in my previous discussion with Nicolas, that
the __m_cnt_hi value has to be read before cnt_lo. First off, using this
macro with get_cycles() is simply buggy, because the macro expects
_perfect_ order of timestamps, no skew whatsoever, or otherwise time
could jump. This macro is therefore good only for mmio reads. One should
use per-cpu variables to keep the state of get_cycles() reads (as I did
in my other patch).

The following approach should work :

static inline u64 cnt32_to_63(u32 io_addr, u32 *__m_cnt_hi)
{
	union cnt32_to_63 __x;
	__x.hi = *__m_cnt_hi;	/* memory read for high bits internal state */
	smp_rmb();		/*
				 * read high bits before low bits ensures time
				 * does not go backward.
				 */
	__x.lo = readl(cnt_lo);	/* mmio read */
	if (unlikely((s32)(__x.hi ^ __x.lo) < 0))
		*__m_cnt_hi =
			__x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31);
	return __x.val;
}

But any get_cycles() user of cnt32_to_63() should be shot down. The
bright side is : there is no way get_cycles() can be used with this
new code. :)

e.g. of incorrect users for arm (unless they are UP only, but that seems
like a weird design argument) :

mach-sa1100/include/mach/SA-1100.h:
	#define OSCR __REG(0x90000010)	/* OS timer Counter Reg. */
mach-sa1100/generic.c:
	unsigned long long v = cnt32_to_63(OSCR);
mach-pxa/include/mach/pxa-regs.h:
	#define OSCR __REG(0x40A00010)	/* OS Timer Counter Register */
mach-pxa/time.c:
	unsigned long long v = cnt32_to_63(OSCR);

Correct user :
mach-versatile/core.c:
	unsigned long long v = cnt32_to_63(readl(VERSATILE_REFCOUNTER));

The new call would look like :

/* Hi 32-bits of versatile refcounter state, kept for cnt32_to_63. */
static u32 versatile_refcounter_hi;

unsigned long long v = cnt32_to_63(VERSATILE_REFCOUNTER, &versatile_refcounter_hi);

Mathieu


> I imagine this would compile pretty much the same as the macro. I think it
> would make it more obvious about the independence of the storage.
>
> Alternatively, perhaps Nicolas just needs to mention this in the comment more
> clearly.
>
> David

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-11-07 16:56:25

by Mathieu Desnoyers

Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

* David Howells ([email protected]) wrote:
> Nicolas Pitre <[email protected]> wrote:
>
> > > I mean, the darned thing is called from sched_clock(), which can be
> > > concurrently called on separate CPUs and which can be called from
> > > interrupt context (with an arbitrary nesting level!) while it was running
> > > in process context.
> >
> > Yes! And this is so on *purpose*. Please take some time to read the
> > comment that goes along with it, and if you're still not convinced then
> > look for those explanation emails I've already posted.
>
> I agree with Nicolas on this. It's abominably clever, but I think he's right.
>
> The one place I remain unconvinced is over the issue of preemption of a process
> that is in the middle of cnt32_to_63(), where if the preempted process is
> asleep for long enough, I think it can wind time backwards when it resumes, but
> that's not a problem for the one place I want to use it (sched_clock()) because
> that is (almost) always called with preemption disabled in one way or another.
>
> The one place it isn't is a debugging case that I'm not too worried about.
>

I am also concerned about the case where preemption is not disabled.

Then I think the function should document that it must be called with
preempt disabled.

Mathieu

> > > /*
> > > * Caller must provide locking to protect *caller_state
> > > */
> >
> > NO! This is meant to be LOCK FREE!
>
> Absolutely.
>
> David

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-11-07 16:57:00

by David Howells

Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

Nicolas Pitre <[email protected]> wrote:

> > Well I'm glad it wasn't designed to demonstrate tastefulness.
>
> Fast tricks aren't always meant to be beautiful. That's why we have
> abstraction layers.

And comments. The comment attached to cnt32_to_63() is well written and
informative.

> I know nothing about mn10300.

That's aimed my way.

David

2008-11-07 17:05:52

by David Howells

Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

Mathieu Desnoyers <[email protected]> wrote:

> First off, using this macro with get_cycles() is simply buggy, because the
> macro expects _perfect_ order of timestamps, no skew whatsoever, or
> otherwise time could jump.

Erm... Why can't I pass it get_cycles()? Are you saying that sched_clock()
in MN10300 is wrong for its use of get_cycles() with cnt32_to_63()?

> __x.lo = readl(cnt_lo); /* mmio read */

readl() might insert an extra barrier instruction. Not only that, io_addr
must be unsigned long.

David

2008-11-07 17:11:20

by David Howells

Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()


Andrew Morton <[email protected]> wrote:

> I'd expect it to behave in the same way as it would if the function was
> implemented out-of-line.
>
> But it occurs to me that the modprobe-doesn't-work thing would happen if
> the function _is_ inlined anyway, so we won't be doing that.
>
> Whatever. Killing this many puppies because gcc may do something so
> bizarrely wrong isn't justifiable.

With gcc, you get one instance of the static variable from inside a static
(inline or outofline) function per .o file that invokes it, and these do not
merge even though they're common symbols. I asked around and the opinion
seems to be that this is correct C. I suppose it's the equivalent of cutting
and pasting a function between several files - why should the compiler assume
it's the same function in each?

David
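
A sketch of the semantics David describes (shared.h, a.c and b.c are
hypothetical):

	/* shared.h */
	static inline int bump(void)
	{
		static int count;	/* one copy per .o that calls bump() */
		return ++count;
	}

	/* a.c: #include "shared.h"; bump(); bump();  -> 1, then 2 */
	/* b.c: #include "shared.h"; bump();          -> 1 again (own copy) */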

2008-11-07 17:14:20

by Mathieu Desnoyers

Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

* David Howells ([email protected]) wrote:
> Mathieu Desnoyers <[email protected]> wrote:
>
> > Assume the time source is a global clock which ensures that time will never
> > *ever* go backward. Use a smp_rmb() to make sure the __m_cnt_hi value is read
> > before the cnt_lo mmio read.
>
> If you have an smp_rmb(), then don't you need an smp_wmb()/smp_mb() to match
> it to make it work? And is your assumption valid that smp_rmb() will affect
> memory vs the I/O access to read the clock? There's no requirement that
> cnt_lo will have been read from an MMIO location at all.
>
> David

I want to make sure

__m_cnt_hi
is read before
mmio cnt_lo read

for the detailed reasons explained in my previous discussion with
Nicolas here :
http://lkml.org/lkml/2008/10/21/1

I use smp_rmb() to do this on SMP systems (hrm, actually, a rmb() could
be required so it works also on UP systems safely wrt interrupts).

The write side is between the hardware counter, which is assumed to
increment monotonically between each read, and the value __m_cnt_hi
updated by the CPU. I don't see where we could put a wmb() there.

Without barrier, the smp race looks as follows :


CPU A                                   B
read hw cnt low (0xFFFFFFFA)
                                        read __m_cnt_hi (0x80000000)
                                        read hw cnt low (0x00000001)
                                        (wrap detected :
                                         (s32)(0x80000000 ^ 0x1) < 0)
                                        write __m_cnt_hi = 0x00000001
read __m_cnt_hi (0x00000001)
                                        return 0x0000000100000001
(wrap detected :
 (s32)(0x00000001 ^ 0xFFFFFFFA) < 0)
write __m_cnt_hi = 0x80000001
return 0x80000001FFFFFFFA
(time jumps)

And UP interrupt race :

Thread context                          Interrupts
read hw cnt low (0xFFFFFFFA)
                                        read __m_cnt_hi (0x80000000)
                                        read hw cnt low (0x00000001)
                                        (wrap detected :
                                         (s32)(0x80000000 ^ 0x1) < 0)
                                        write __m_cnt_hi = 0x00000001
                                        return 0x0000000100000001
read __m_cnt_hi (0x00000001)
(wrap detected :
 (s32)(0x00000001 ^ 0xFFFFFFFA) < 0)
write __m_cnt_hi = 0x80000001
return 0x80000001FFFFFFFA
(time jumps)

New code to fix it here with full rmb() :

static inline u64 cnt32_to_63(u32 io_addr, u32 *__m_cnt_hi)
{
	union cnt32_to_63 __x;
	__x.hi = *__m_cnt_hi;	/* memory read for high bits internal state */
	rmb();			/*
				 * read high bits before low bits ensures time
				 * does not go backward. Sync across
				 * CPUs and for interrupts.
				 */
	__x.lo = readl(cnt_lo);	/* mmio read */
	if (unlikely((s32)(__x.hi ^ __x.lo) < 0))
		*__m_cnt_hi =
			__x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31);
	return __x.val;
}

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-11-07 17:17:19

by Mathieu Desnoyers

Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

* David Howells ([email protected]) wrote:
> Mathieu Desnoyers <[email protected]> wrote:
>
> > First off, using this macro with get_cycles() is simply buggy, because the
> > macro expects _perfect_ order of timestamps, no skew whatsoever, or
> > otherwise time could jump.
>
> Erm... Why can't I pass it get_cycles()? Are you saying that sched_clock()
> in MN10300 is wrong for its use of get_cycles() with cnt32_to_63()?
>

Yes. Do you think the synchronization of the cycles counters is
_perfect_ across CPUs so that there is no possible way whatsoever that
two cycle counter values appear to go backward between CPUs ? (also
taking into account delays in __m_cnt_hi write-back...)

As I showed in my previous example, if you are unlucky enough to hit the
spot where the cycle counters go backward at the time warp edge, time
will jump by 2^32, so about 4.29s at 1GHz.

> > __x.lo = readl(cnt_lo); /* mmio read */
>
> readl() might insert an extra barrier instruction. Not only that, io_addr
> must be unsigned long.

If we expect the only correct use-case to be with readl(), I don't see
the problem with added synchronization.

>

Ah, right, then the parameters should be updated accordingly.

static inline u64 cnt32_to_63(unsigned long io_addr, u32 *__m_cnt_hi)
{
	union cnt32_to_63 __x;
	__x.hi = *__m_cnt_hi;	/* memory read for high bits internal state */
	rmb();			/*
				 * read high bits before low bits ensures time
				 * does not go backward. Sync across
				 * CPUs and for interrupts.
				 */
	__x.lo = readl(io_addr);	/* mmio read */
	if (unlikely((s32)(__x.hi ^ __x.lo) < 0))
		*__m_cnt_hi =
			__x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31);
	return __x.val;
}

Mathieu

> David

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-11-07 17:22:39

by Andrew Morton

Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

On Fri, 07 Nov 2008 11:47:47 -0500 (EST) Nicolas Pitre <[email protected]> wrote:

> > btw, do you know how damned irritating and frustrating it is for a code
> > reviewer to have his comments deliberately ignored and deleted in
> > replies?
>
> Do you know how irritating and frustrating it is when reviewers don't
> care reading the damn comments along with the code?

As you still seek to ignore it, I shall repeat my earlier question.
Please do not delete it again.

It apparently tries to avoid races via ordering tricks, as long
as it is called with sufficient frequency. But nothing guarantees
that it _is_ called sufficiently frequently?

Things like tickless kernels and SCHED_RR can surely cause
sched_clock() to not be called for arbitrary periods.

Userspace cli() will definitely do this, but it is expected to break
stuff and is not as legitimate a thing to do.


I'm just giving up on the tastefulness issue.

2008-11-07 17:28:49

by Andrew Morton

Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

On Fri, 07 Nov 2008 17:10:00 +0000 David Howells <[email protected]> wrote:

>
> Andrew Morton <[email protected]> wrote:
>
> > I'd expect it to behave in the same way as it would if the function was
> > implemented out-of-line.
> >
> > But it occurs to me that the modprobe-doesn't-work thing would happen if
> > the function _is_ inlined anyway, so we won't be doing that.
> >
> > Whatever. Killing this many puppies because gcc may do something so
> > bizarrely wrong isn't justifiable.
>
> With gcc, you get one instance of the static variable from inside a static
> (inline or outofline) function per .o file that invokes it, and these do not
> merge even though they're common symbols. I asked around and the opinion
> seems to be that this is correct C. I suppose it's the equivalent of cutting
> and pasting a function between several files - why should the compiler assume
> it's the same function in each?
>

OK, thanks, I guess that makes sense. For static inline. I wonder if
`extern inline' or plain old `inline' should change it.

It's one of those things I hope I never need to know about, but perhaps
we do somewhere have static storage in an inline. Wouldn't surprise
me, and I bet that if we do, it's a bug.

2008-11-07 17:33:42

by Steven Rostedt

Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()


On Fri, 7 Nov 2008, Mathieu Desnoyers wrote:
>
> __m_cnt_hi
> is read before
> mmio cnt_lo read
>
> for the detailed reasons explained in my previous discussion with
> Nicolas here :
> http://lkml.org/lkml/2008/10/21/1
>
> I use smp_rmb() to do this on SMP systems (hrm, actually, a rmb() could
> be required so it works also on UP systems safely wrt interrupts).

smp_rmb turns into a compiler barrier on UP and should prevent the below
description.

-- Steve

>
> The write side is between the hardware counter, which is assumed to
> increment monotonically between each read, and the value __m_cnt_hi
> updated by the CPU. I don't see where we could put a wmb() there.
>
> Without barrier, the smp race looks as follows :
>
>
> CPU A                                 B
> read hw cnt low (0xFFFFFFFA)
>                                       read __m_cnt_hi (0x80000000)
>                                       read hw cnt low (0x00000001)
>                                       (wrap detected :
>                                        (s32)(0x80000000 ^ 0x1) < 0)
>                                       write __m_cnt_hi = 0x00000001
> read __m_cnt_hi (0x00000001)
>                                       return 0x0000000100000001
> (wrap detected :
>  (s32)(0x00000001 ^ 0xFFFFFFFA) < 0)
> write __m_cnt_hi = 0x80000001
> return 0x80000001FFFFFFFA
> (time jumps)
>
> And UP interrupt race :
>
> Thread context                        Interrupts
> read hw cnt low (0xFFFFFFFA)
>                                       read __m_cnt_hi (0x80000000)
>                                       read hw cnt low (0x00000001)
>                                       (wrap detected :
>                                        (s32)(0x80000000 ^ 0x1) < 0)
>                                       write __m_cnt_hi = 0x00000001
>                                       return 0x0000000100000001
> read __m_cnt_hi (0x00000001)
> (wrap detected :
>  (s32)(0x00000001 ^ 0xFFFFFFFA) < 0)
> write __m_cnt_hi = 0x80000001
> return 0x80000001FFFFFFFA
> (time jumps)
>
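
For reference, the fallback Steven refers to follows the usual 2.6-era
pattern (a sketch of the generic definitions, not any particular arch):

	#ifdef CONFIG_SMP
	#define smp_rmb()	rmb()		/* real read barrier */
	#else
	#define smp_rmb()	barrier()	/* compiler barrier only:
						 * preserves program order,
						 * which is what the UP
						 * interrupt case needs */
	#endif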

2008-11-07 18:00:54

by Mathieu Desnoyers

Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

* Andrew Morton ([email protected]) wrote:
> On Fri, 07 Nov 2008 17:10:00 +0000 David Howells <[email protected]> wrote:
>
> >
> > Andrew Morton <[email protected]> wrote:
> >
> > > I'd expect it to behave in the same way as it would if the function was
> > > implemented out-of-line.
> > >
> > > But it occurs to me that the modprobe-doesn't-work thing would happen if
> > > the function _is_ inlined anyway, so we won't be doing that.
> > >
> > > Whatever. Killing this many puppies because gcc may do something so
> > > bizarrely wrong isn't justifiable.
> >
> > With gcc, you get one instance of the static variable from inside a static
> > (inline or outofline) function per .o file that invokes it, and these do not
> > merge even though they're common symbols. I asked around and the opinion
> > seems to be that this is correct C. I suppose it's the equivalent of cutting
> > and pasting a function between several files - why should the compiler assume
> > it's the same function in each?
> >
>
> OK, thanks, I guess that makes sense. For static inline. I wonder if
> `extern inline' or plain old `inline' should change it.
>
> It's one of those things I hope I never need to know about, but perhaps
> we do somewhere have static storage in an inline. Wouldn't surprise
> me, and I bet that if we do, it's a bug.

Tracepoints actually use that. It could be changed so they use :

DECLARE_TRACE() (in include/trace/group.h)
DEFINE_TRACE() (in the appropriate kernel c file)
trace_somename(); (in the code)

instead. That would actually make more sense and remove the need for
multiple declarations when the same tracepoint name is used in many
spots (this is a problem kmemtrace has, it generates a lot of tracepoint
declarations).
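
(A sketch of that split, using the TPPROTO/TPARGS tracepoint macros of that
era - the tracepoint name is illustrative:)

	/* include/trace/group.h */
	DECLARE_TRACE(block_rq_issue,
		TPPROTO(struct request_queue *q, struct request *rq),
		TPARGS(q, rq));

	/* block/blk-core.c */
	DEFINE_TRACE(block_rq_issue);

	/* call site: trace_block_rq_issue(q, rq); */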

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-11-07 18:23:28

by Andrew Morton

Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

On Fri, 7 Nov 2008 13:00:41 -0500
Mathieu Desnoyers <[email protected]> wrote:

> * Andrew Morton ([email protected]) wrote:
> > On Fri, 07 Nov 2008 17:10:00 +0000 David Howells <[email protected]> wrote:
> >
> > >
> > > Andrew Morton <[email protected]> wrote:
> > >
> > > > I'd expect it to behave in the same way as it would if the function was
> > > > implemented out-of-line.
> > > >
> > > > But it occurs to me that the modprobe-doesn't-work thing would happen if
> > > > the function _is_ inlined anyway, so we won't be doing that.
> > > >
> > > > Whatever. Killing this many puppies because gcc may do something so
> > > > bizarrely wrong isn't justifiable.
> > >
> > > With gcc, you get one instance of the static variable from inside a static
> > > (inline or outofline) function per .o file that invokes it, and these do not
> > > merge even though they're common symbols. I asked around and the opinion
> > > seems to be that this is correct C. I suppose it's the equivalent of cutting
> > > and pasting a function between several files - why should the compiler assume
> > > it's the same function in each?
> > >
> >
> > OK, thanks, I guess that makes sense. For static inline. I wonder if
> > `extern inline' or plain old `inline' should change it.
> >
> > It's one of those things I hope I never need to know about, but perhaps
> > we do somewhere have static storage in an inline. Wouldn't surprise
> > me, and I bet that if we do, it's a bug.
>
> Tracepoints actually use that.

Referring to include/linux/tracepoint.h:DEFINE_TRACE()?

It does look a bit fragile. Does every .c file which included
include/trace/block.h get a copy of __tracepoint_block_rq_issue,
whether or not it used that tracepoint? Hopefully not.

> It could be changed so they use :
>
> DECLARE_TRACE() (in include/trace/group.h)
> DEFINE_TRACE() (in the appropriate kernel c file)
> trace_somename(); (in the code)
>
> instead. That would actually make more sense and remove the need for
> multiple declarations when the same tracepoint name is used in many
> spots (this is a problem kmemtrace has, it generates a lot of tracepoint
> declarations).

I'm unsure of the requirements here. Do you _want_ each call to
trace_block_rq_issue() to share some in-memory state? If so then yes,
there's a problem with calls to trace_block_rq_issue() from within
separate compilation units.

otoh, if all calls to trace_block_rq_issue() are supposed to have
independent state (which seems to be the case) then that could be
addressed by making trace_block_rq_issue() a macro which defines
static storage, as cnt32_to_63() shouldn't have done ;)

2008-11-07 18:30:38

by Harvey Harrison

Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

On Fri, 2008-11-07 at 10:21 -0800, Andrew Morton wrote:
> On Fri, 7 Nov 2008 13:00:41 -0500
> Mathieu Desnoyers <[email protected]> wrote:
> > > It's one of those things I hope I never need to know about, but perhaps
> > > we do somewhere have static storage in an inline. Wouldn't surprise
> > > me, and I bet that if we do, it's a bug.
> >
> > Tracepoints actually use that.
>
> Referring to include/linux/tracepoint.h:DEFINE_TRACE()?
>
> It does look a bit fragile. Does every .c file which included
> include/trace/block.h get a copy of __tracepoint_block_rq_issue,
> whether or not it used that tracepoint? Hopefully not.
>
> > It could be changed so they use :
> >
> > DECLARE_TRACE() (in include/trace/group.h)
> > DEFINE_TRACE() (in the appropriate kernel c file)
> > trace_somename(); (in the code)
> >
> > instead. That would actually make more sense and remove the need for
> > multiple declarations when the same tracepoint name is used in many
> > spots (this is a problem kmemtrace has, it generates a lot of tracepoint
> > declarations).

Could this scheme also help with the thousands of sparse warnings that
kmemtrace produces because of the current arrangement, all of the form:

include/linux/kmemtrace.h:33:2: warning: Initializer entry defined twice
include/linux/kmemtrace.h:33:2: also defined here

As you could have unique names for the tracepoints now, rather than the
'unique' static storage? Or am I off-base here?

Harvey


2008-11-07 18:34:01

by Mathieu Desnoyers

Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

* Andrew Morton ([email protected]) wrote:
> On Fri, 7 Nov 2008 13:00:41 -0500
> Mathieu Desnoyers <[email protected]> wrote:
>
> > * Andrew Morton ([email protected]) wrote:
> > > On Fri, 07 Nov 2008 17:10:00 +0000 David Howells <[email protected]> wrote:
> > >
> > > >
> > > > Andrew Morton <[email protected]> wrote:
> > > >
> > > > > I'd expect it to behave in the same way as it would if the function was
> > > > > implemented out-of-line.
> > > > >
> > > > > But it occurs to me that the modprobe-doesn't-work thing would happen if
> > > > > the function _is_ inlined anyway, so we won't be doing that.
> > > > >
> > > > > Whatever. Killing this many puppies because gcc may do something so
> > > > > bizarrely wrong isn't justifiable.
> > > >
> > > > With gcc, you get one instance of the static variable from inside a static
> > > > (inline or outofline) function per .o file that invokes it, and these do not
> > > > merge even though they're common symbols. I asked around and the opinion
> > > > seems to be that this is correct C. I suppose it's the equivalent of cutting
> > > > and pasting a function between several files - why should the compiler assume
> > > > it's the same function in each?
> > > >
> > >
> > > OK, thanks, I guess that makes sense. For static inline. I wonder if
> > > `extern inline' or plain old `inline' should change it.
> > >
> > > It's one of those things I hope I never need to know about, but perhaps
> > > we do somewhere have static storage in an inline. Wouldn't surprise
> > > me, and I bet that if we do, it's a bug.
> >
> > Tracepoints actually use that.
>
> Referring to include/linux/tracepoint.h:DEFINE_TRACE()?
>
> It does look a bit fragile. Does every .c file which included
> include/trace/block.h get a copy of __tracepoint_block_rq_issue,
> whether or not it used that tracepoint? Hopefully not.
>

No, __tracepoint_block_rq_issue is only instantiated if the static
inline function is used. One instance per use.

> > It could be changed so they use :
> >
> > DECLARE_TRACE() (in include/trace/group.h)
> > DEFINE_TRACE() (in the appropriate kernel c file)
> > trace_somename(); (in the code)
> >
> > instead. That would actually make more sense and remove the need for
> > multiple declarations when the same tracepoint name is used in many
> > spots (this is a problem kmemtrace has, it generates a lot of tracepoint
> > declarations).
>
> I'm unsure of the requirements here. Do you _want_ each call to
> trace_block_rq_issue() to share some in-memory state? If so then yes,
> there's a problem with calls to trace_block_rq_issue() from within
> separate compilation units.
>
> otoh, if all calls to trace_block_rq_issue() are supposed to have
> independent state (which seems to be the case) then that could be
> addressed by making trace_block_rq_issue() a macro which defines
> static storage, as cnt32_to_63() shouldn't have done ;)
>

They could share the same data, given it *has* to be the same. I'll
try to fix this.

Mathieu

>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-11-07 18:38:00

by Linus Torvalds

Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()



On Fri, 7 Nov 2008, Andrew Morton wrote:
>
> Referring to include/linux/tracepoint.h:DEFINE_TRACE()?
>
> It does look a bit fragile. Does every .c file which included
> include/trace/block.h get a copy of __tracepoint_block_rq_issue,
> whether or not it used that tracepoint? Hopefully not.

Look at "ratelimit()" too. Broken, broken. Of course, I don't think it's
actually _used_ anywhere, so..

Linus

2008-11-07 18:47:38

by Andrew Morton

Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

On Fri, 7 Nov 2008 10:36:27 -0800 (PST)
Linus Torvalds <[email protected]> wrote:

>
>
> On Fri, 7 Nov 2008, Andrew Morton wrote:
> >
> > Referring to include/linux/tracepoint.h:DEFINE_TRACE()?
> >
> > It does look a bit fragile. Does every .c file which included
> > include/trace/block.h get a copy of __tracepoint_block_rq_issue,
> > whether or not it used that tracepoint? Hopefully not.
>
> Look at "ratelimit()" too. Broken, broken.

Yup. Easy enough to fix, but...

> Of course, I don't think it's
> actually _used_ anywhere, so..
>

removing it altogether would be best, I think. It's a bit of an odd
thing.

I'll see what this

--- a/include/linux/ratelimit.h~a
+++ a/include/linux/ratelimit.h
@@ -17,11 +17,4 @@ struct ratelimit_state {
 		struct ratelimit_state name = {interval, burst,}
 
 extern int __ratelimit(struct ratelimit_state *rs);
-
-static inline int ratelimit(void)
-{
-	static DEFINE_RATELIMIT_STATE(rs, DEFAULT_RATELIMIT_INTERVAL,
-			DEFAULT_RATELIMIT_BURST);
-	return __ratelimit(&rs);
-}
 #endif

breaks.

2008-11-07 18:48:12

by Mathieu Desnoyers

Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

* Harvey Harrison ([email protected]) wrote:
> On Fri, 2008-11-07 at 10:21 -0800, Andrew Morton wrote:
> > On Fri, 7 Nov 2008 13:00:41 -0500
> > Mathieu Desnoyers <[email protected]> wrote:
> > > > It's one of those things I hope I never need to know about, but perhaps
> > > > we do somewhere have static storage in an inline. Wouldn't surprise
> > > > me, and I bet that if we do, it's a bug.
> > >
> > > Tracepoints actually use that.
> >
> > Referring to include/linux/tracepoint.h:DEFINE_TRACE()?
> >
> > It does look a bit fragile. Does every .c file which included
> > include/trace/block.h get a copy of __tracepoint_block_rq_issue,
> > whether or not it used that tracepoint? Hopefully not.
> >
> > > It could be changed so they use :
> > >
> > > DECLARE_TRACE() (in include/trace/group.h)
> > > DEFINE_TRACE() (in the appropriate kernel c file)
> > > trace_somename(); (in the code)
> > >
> > > instead. That would actually make more sense and remove the need for
> > > multiple declarations when the same tracepoint name is used in many
> > > spots (this is a problem kmemtrace has, it generates a lot of tracepoint
> > > declarations).
>
> Could this scheme also help with the thousands of sparse warnings that
> kmemtrace produces because of the current arrangement, all of the form:
>
> include/linux/kmemtrace.h:33:2: warning: Initializer entry defined twice
> include/linux/kmemtrace.h:33:2: also defined here
>
> As you could have unique names for the tracepoints now, rather than the
> 'unique' static storage? Or am I off-base here?
>

Exactly.

Mathieu

> Harvey
>
>
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-11-07 19:18:47

by Mathieu Desnoyers

Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

* Steven Rostedt ([email protected]) wrote:
>
> On Fri, 7 Nov 2008, Mathieu Desnoyers wrote:
> >
> > __m_cnt_hi
> > is read before
> > mmio cnt_lo read
> >
> > for the detailed reasons explained in my previous discussion with
> > Nicolas here :
> > http://lkml.org/lkml/2008/10/21/1
> >
> > I use smp_rmb() to do this on SMP systems (hrm, actually, a rmb() could
> > be required so it works also on UP systems safely wrt interrupts).
>
> smp_rmb turns into a compiler barrier on UP and should prevent the below
> description.
>

Ah, right, preserving program order on UP should be enough. smp_rmb()
then.

Thanks,

Mathieu

> -- Steve
>
> >
> > The write side is between the hardware counter, which is assumed to
> > increment monotonically between each read, and the value __m_cnt_hi
> > updated by the CPU. I don't see where we could put a wmb() there.
> >
> > Without barrier, the smp race looks as follows :
> >
> >
> > CPU A                               B
> > read hw cnt low (0xFFFFFFFA)
> >                                     read __m_cnt_hi (0x80000000)
> >                                     read hw cnt low (0x00000001)
> >                                     (wrap detected :
> >                                      (s32)(0x80000000 ^ 0x1) < 0)
> >                                     write __m_cnt_hi = 0x00000001
> > read __m_cnt_hi (0x00000001)
> >                                     return 0x0000000100000001
> > (wrap detected :
> >  (s32)(0x00000001 ^ 0xFFFFFFFA) < 0)
> > write __m_cnt_hi = 0x80000001
> > return 0x80000001FFFFFFFA
> > (time jumps)
> >
> > And UP interrupt race :
> >
> > Thread context                      Interrupts
> > read hw cnt low (0xFFFFFFFA)
> >                                     read __m_cnt_hi (0x80000000)
> >                                     read hw cnt low (0x00000001)
> >                                     (wrap detected :
> >                                      (s32)(0x80000000 ^ 0x1) < 0)
> >                                     write __m_cnt_hi = 0x00000001
> >                                     return 0x0000000100000001
> > read __m_cnt_hi (0x00000001)
> > (wrap detected :
> >  (s32)(0x00000001 ^ 0xFFFFFFFA) < 0)
> > write __m_cnt_hi = 0x80000001
> > return 0x80000001FFFFFFFA
> > (time jumps)
> >

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-11-07 19:33:14

by Peter Zijlstra

Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

On Fri, 2008-11-07 at 14:18 -0500, Mathieu Desnoyers wrote:
> * Steven Rostedt ([email protected]) wrote:
> >
> > On Fri, 7 Nov 2008, Mathieu Desnoyers wrote:
> > >
> > > __m_cnt_hi
> > > is read before
> > > mmio cnt_lo read
> > >
> > > for the detailed reasons explained in my previous discussion with
> > > Nicolas here :
> > > http://lkml.org/lkml/2008/10/21/1
> > >
> > > I use smp_rmb() to do this on SMP systems (hrm, actually, a rmb() could
> > > be required so it works also on UP systems safely wrt interrupts).
> >
> > smp_rmb turns into a compiler barrier on UP and should prevent the below
> > description.
> >
>
> Ah, right, preserving program order on UP should be enough. smp_rmb()
> then.


I'm not quite sure I'm following here. Is this a global hardware clock
you're reading from multiple cpus, if so, are you sure smp_rmb() will
indeed be enough to sync the read?

(In which case the smp_wmb() is provided by the hardware increasing the
clock?)

If these are per-cpu clocks then even in the smp case we'd be good with
a plain barrier() because you'd only ever want to read your own cpu's
clock (and have a separate __m_cnt_hi per cpu).

Or am I totally missing out on something?

2008-11-07 20:04:14

by Nicolas Pitre

[permalink] [raw]
Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

On Fri, 7 Nov 2008, Andrew Morton wrote:

> On Fri, 07 Nov 2008 11:47:47 -0500 (EST) Nicolas Pitre <[email protected]> wrote:
>
> > > btw, do you know how damned irritating and frustrating it is for a code
> > > reviewer to have his comments deliberately ignored and deleted in
> > > replies?
> >
> > Do you know how irritating and frustrating it is when reviewers don't
> > care reading the damn comments along with the code?
>
> As you still seek to ignore it, I shall repeat my earlier question.
> Please do not delete it again.
>
> It apparently tries to avoid races via ordering tricks, as long
> as it is called with sufficient frequency. But nothing guarantees
> that it _is_ called sufficiently frequently?
>
> Things like tickless kernels and SCHED_RR can surely cause
> sched_clock() to not be called for arbitrary periods.

On the machines this was initially written for, the critical period is
in the order of minutes. And if you're afraid you might lack enough
scheduling activity for that long, you simply have to keep the
algorithm "warm" with a simple kernel timer whose only purpose is to
ensure it is called often enough.
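
A minimal sketch of such a keep-warm timer (illustrative names, and the
one-second period is arbitrary -- it only needs to stay well under the
counter's half period):

#include <linux/timer.h>
#include <linux/jiffies.h>
#include <linux/sched.h>	/* sched_clock() */

static struct timer_list keepwarm_timer;

static void keepwarm(unsigned long data)
{
	sched_clock();		/* calls cnt32_to_63(), refreshing the high word */
	mod_timer(&keepwarm_timer, jiffies + HZ);
}

static void keepwarm_init(void)
{
	setup_timer(&keepwarm_timer, keepwarm, 0);
	mod_timer(&keepwarm_timer, jiffies + HZ);
}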

> Userspace cli() will definitely do this, but it is expected to break
> stuff and is not as legitimate a thing to do.

Why do you bring it up then?

> I'm just giving up on the tastefulness issue.

Taste is a pretty subjective matter.


Nicolas

2008-11-07 20:07:57

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

* Peter Zijlstra ([email protected]) wrote:
> On Fri, 2008-11-07 at 14:18 -0500, Mathieu Desnoyers wrote:
> > * Steven Rostedt ([email protected]) wrote:
> > >
> > > On Fri, 7 Nov 2008, Mathieu Desnoyers wrote:
> > > >
> > > > __m_cnt_hi
> > > > is read before
> > > > mmio cnt_lo read
> > > >
> > > > for the detailed reasons explained in my previous discussion with
> > > > Nicolas here :
> > > > http://lkml.org/lkml/2008/10/21/1
> > > >
> > > > I use smp_rmb() to do this on SMP systems (hrm, actually, a rmb() could
> > > > be required so it works also on UP systems safely wrt interrupts).
> > >
> > > smp_rmb turns into a compiler barrier on UP and should prevent the below
> > > description.
> > >
> >
> > Ah, right, preserving program order on UP should be enough. smp_rmb()
> > then.
>
>
> I'm not quite sure I'm following here. Is this a global hardware clock
> you're reading from multiple cpus, if so, are you sure smp_rmb() will
> indeed be enough to sync the read?
>
> (In which case the smp_wmb() is provided by the hardware increasing the
> clock?)
>
> If these are per-cpu clocks then even in the smp case we'd be good with
> a plain barrier() because you'd only ever want to read your own cpu's
> clock (and have a separate __m_cnt_hi per cpu).
>
> Or am I totally missing out on something?
>

This is the global hardware clock scenario.

We have to order an uncached mmio read wrt a cached variable read/write.
The uncached mmio read vs smp_rmb() barrier (e.g. lfence instruction)
should be ensured by program order because the read will skip the cache
and go directly to the bus. Luckily we only do a mmio read and no mmio
write, so mmiowb() is not required.

You might be right in that it could require more barriers.

Given adequate program order, we can assume that the mmio read will
happen "on the spot", but that the cached read may be delayed.

What we want is :

readl(io_addr)
read __m_cnt_hi
write __m_cnt_hi

With the two reads in the correct order. If we consider two consecutive
executions on the same CPU :

readl(io_addr)
read __m_cnt_hi
write __m_cnt_hi

readl(io_addr)
read __m_cnt_hi
write __m_cnt_hi

We might have to order the read/write pair wrt the following readl, such
as :

smp_rmb(); /* Waits for every cached memory reads to complete */
readl(io_addr);
barrier(); /* Make sure the compiler leaves mmio read before cached read */
read __m_cnt_hi
write __m_cnt_hi

smp_rmb(); /* Waits for every cached memory reads to complete */
readl(io_addr)
barrier(); /* Make sure the compiler leaves mmio read before cached read */
read __m_cnt_hi
write __m_cnt_hi

Would that make more sense ?

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-11-07 20:08:26

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()


On Fri, 7 Nov 2008, Mathieu Desnoyers wrote:
>
> I want to make sure
>
> __m_cnt_hi
> is read before
> mmio cnt_lo read

Hmm, let me make sure I understand why there is no wmb.

Paul, can you verify this?

Mathieu, you do the following:

read a
smp_rmb
reab b
if (test b)
write a

So the idea is that you must read b to test it. And since we must read a
before reading b we can see that we write a before either?

The question remains, can the write happen before either of the reads?

But since the read b is reading the hw clock, perhaps that just implies a
wmb on the hardware side?

-- Steve

2008-11-07 20:12:53

by Russell King

[permalink] [raw]
Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

On Fri, Nov 07, 2008 at 11:47:58AM -0500, Mathieu Desnoyers wrote:
> But any get_cycles() user of cnt32_to_63() should be shot down. The
> bright side is : there is no way get_cycles() can be used with this
> new code. :)
>
> e.g. of incorrect users for arm (unless they are UP only, but that seems
> like a weird design argument) :
>
> mach-sa1100/include/mach/SA-1100.h:#define OSCR __REG(0x90000010)
> /* OS timer Counter Reg. */
> mach-sa1100/generic.c: unsigned long long v = cnt32_to_63(OSCR);
> mach-pxa/include/mach/pxa-regs.h:#define OSCR __REG(0x40A00010) /* OS
> Timer Counter Register */
> mach-pxa/time.c: unsigned long long v = cnt32_to_63(OSCR);

It's strange for you to make that assertion when PXA was the exact
platform that Nicolas created this code for - and that's a platform
where preempt has been widely used.

The two you mention are both ARMv5 or older architectures, and the
first real SMP ARM architecture is ARMv6. So architecturally they
are UP only.

So, tell me why you say "unless they are UP only, but that seems like
a weird design argument"? If the platforms can only ever be UP only,
what's wrong with UP only code being used with them? (Not that I'm
saying anything there about cnt32_to_63.)

I'd like to see you modify the silicon of a PXA or SA11x0 SoC to add
more than one processor to the chip - maybe you could use evostick to
glue two dies together and a microscope to aid bonding wires between
the two? (Of course, you'd need to design something to ensure cache
coherence as well, and arbitrate the internal bus between the two
dies.) ;)

Personally, I think that's highly unlikely.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:

2008-11-07 20:18:38

by Nicolas Pitre

[permalink] [raw]
Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

On Fri, 7 Nov 2008, Mathieu Desnoyers wrote:

> * David Howells ([email protected]) wrote:
> > Nicolas Pitre <[email protected]> wrote:
> >
> > > > I mean, the darned thing is called from sched_clock(), which can be
> > > > concurrently called on separate CPUs and which can be called from
> > > > interrupt context (with an arbitrary nesting level!) while it was running
> > > > in process context.
> > >
> > > Yes! And this is so on *purpose*. Please take some time to read the
> > > comment that goes along with it, and if you're still not convinced then
> > > look for those explanation emails I've already posted.
> >
> > I agree with Nicolas on this. It's abominably clever, but I think he's right.
> >
> > The one place I remain unconvinced is over the issue of preemption of a process
> > that is in the middle of cnt32_to_63(), where if the preempted process is
> > asleep for long enough, I think it can wind time backwards when it resumes, but
> > that's not a problem for the one place I want to use it (sched_clock()) because
> > that is (almost) always called with preemption disabled in one way or another.
> >
> > The one place it isn't is a debugging case that I'm not too worried about.
> >
>
> I am also concerned about the non-preemption off case.
>
> Then I think the function should document that it must be called with
> preempt disabled.

I explained several times already why I disagree. Preemption is not a
problem unless you're preempted away for long enough, or IOW if your
counter is too fast.

And no, ^Z on a process doesn't create preemption. This is a signal that
gets acted upon far away from the middle of cnt32_to_63().


Nicolas

2008-11-07 20:37:36

by Nicolas Pitre

[permalink] [raw]
Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

On Fri, 7 Nov 2008, Mathieu Desnoyers wrote:

> I want to make sure
>
> __m_cnt_hi
> is read before
> mmio cnt_lo read
>
> for the detailed reasons explained in my previous discussion with
> Nicolas here :
> http://lkml.org/lkml/2008/10/21/1
>
> I use smp_rmb() to do this on SMP systems (hrm, actually, a rmb() could
> be required so it works also on UP systems safely wrt interrupts).
>
> The write side is between the hardware counter, which is assumed to
> increment monotonically between each read, and the value __m_cnt_hi
> updated by the CPU. I don't see where we could put a wmb() there.
>
> Without barrier, the smp race looks as follow :
>
>
> CPU A B
> read hw cnt low (0xFFFFFFFA)
> read __m_cnt_hi (0x80000000)
> read hw cnt low (0x00000001)
> (wrap detected :
> (s32)(0x80000000 ^ 0x1) < 0)
> write __m_cnt_hi = 0x00000001
> read __m_cnt_hi (0x00000001)
> return 0x0000000100000001
> (wrap detected :
> (s32)(0x00000001 ^ 0xFFFFFFFA) < 0)
> write __m_cnt_hi = 0x80000001
> return 0x80000001FFFFFFFA
> (time jumps)

Could you have hardware doing such things? You would get a non-cached
and more expensive read on CPU B which is not in program order with the
read that should have happened before, and before that second out-of-order
read could be performed, you'd have a full sequence in program
order performed on CPU A.


Nicolas

2008-11-07 20:45:59

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

* Mathieu Desnoyers ([email protected]) wrote:
> * Peter Zijlstra ([email protected]) wrote:
> > On Fri, 2008-11-07 at 14:18 -0500, Mathieu Desnoyers wrote:
> > > * Steven Rostedt ([email protected]) wrote:
> > > >
> > > > On Fri, 7 Nov 2008, Mathieu Desnoyers wrote:
> > > > >
> > > > > __m_cnt_hi
> > > > > is read before
> > > > > mmio cnt_lo read
> > > > >
> > > > > for the detailed reasons explained in my previous discussion with
> > > > > Nicolas here :
> > > > > http://lkml.org/lkml/2008/10/21/1
> > > > >
> > > > > I use smp_rmb() to do this on SMP systems (hrm, actually, a rmb() could
> > > > > be required so it works also on UP systems safely wrt interrupts).
> > > >
> > > > smp_rmb turns into a compiler barrier on UP and should prevent the below
> > > > description.
> > > >
> > >
> > > Ah, right, preserving program order on UP should be enough. smp_rmb()
> > > then.
> >
> >
> > I'm not quite sure I'm following here. Is this a global hardware clock
> > you're reading from multiple cpus, if so, are you sure smp_rmb() will
> > indeed be enough to sync the read?
> >
> > (In which case the smp_wmb() is provided by the hardware increasing the
> > clock?)
> >
> > If these are per-cpu clocks then even in the smp case we'd be good with
> > a plain barrier() because you'd only ever want to read your own cpu's
> > clock (and have a separate __m_cnt_hi per cpu).
> >
> > Or am I totally missing out on something?
> >
>
> This is the global hardware clock scenario.
>
> We have to order an uncached mmio read wrt a cached variable read/write.
> The uncached mmio read vs smp_rmb() barrier (e.g. lfence instruction)
> should be ensured by program order because the read will skip the cache
> and go directly to the bus. Luckily we only do a mmio read and no mmio
> write, so mmiowb() is not required.
>
> You might be right in that it could require more barriers.
>
> Given adequate program order, we can assume that the mmio read will
> happen "on the spot", but that the cached read may be delayed.
>
> What we want is :
>
> readl(io_addr)
> read __m_cnt_hi
> write __m_cnt_hi
>
> With the two reads in the correct order. If we consider two consecutive
> executions on the same CPU :
>
> readl(io_addr)
> read __m_cnt_hi
> write __m_cnt_hi
>
> readl(io_addr)
> read __m_cnt_hi
> write __m_cnt_hi
>
> We might have to order the read/write pair wrt the following readl, such
> as :
>
> smp_rmb(); /* Waits for every cached memory reads to complete */
> readl(io_addr);
> barrier(); /* Make sure the compiler leaves mmio read before cached read */
> read __m_cnt_hi
> write __m_cnt_hi
>
> smp_rmb(); /* Waits for every cached memory reads to complete */
> readl(io_addr)
> barrier(); /* Make sure the compiler leaves mmio read before cached read */
> read __m_cnt_hi
> write __m_cnt_hi
>
> Would that make more sense ?
>

Oh, actually, I got things reversed in this email : the readl(io_addr)
must be done _after_ the __m_cnt_hi read.

Therefore, two consecutive executions would look like :

barrier(); /* Make sure the compiler does not reorder __m_cnt_hi and
previous mmio read. */
read __m_cnt_hi
smp_rmb(); /* Waits for every cached memory reads to complete */
readl(io_addr);
write __m_cnt_hi


barrier(); /* Make sure the compiler does not reorder __m_cnt_hi and
previous mmio read. */
read __m_cnt_hi
smp_rmb(); /* Waits for every cached memory reads to complete */
readl(io_addr);
write __m_cnt_hi
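
Spelled out as C, that sequence would look like the following sketch
(the leading barrier() is left out here -- it is revisited below in this
thread -- and whether smp_rmb() or a full rmb() is the right barrier is
the open question):

#include <linux/types.h>
#include <asm/io.h>
#include <asm/system.h>

static u32 __m_cnt_hi;

static u64 read_counter_63(void __iomem *io_addr)
{
	u32 hi, lo;

	hi = __m_cnt_hi;	/* cached read, done first */
	smp_rmb();		/* order it before the mmio read below */
	lo = readl(io_addr);	/* uncached mmio read of the hw counter */
	if ((s32)(hi ^ lo) < 0)	/* top bits disagree : update the high word */
		__m_cnt_hi = hi = (hi ^ 0x80000000) + (hi >> 31);
	return ((u64)hi << 32) | lo;	/* caller still masks bit 63 */
}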

Mathieu

> Mathieu
>
> --
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-11-07 20:54:26

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

On Fri, Nov 07, 2008 at 03:45:46PM -0500, Mathieu Desnoyers wrote:
> * Mathieu Desnoyers ([email protected]) wrote:
> > * Peter Zijlstra ([email protected]) wrote:
> > > On Fri, 2008-11-07 at 14:18 -0500, Mathieu Desnoyers wrote:
> > > > * Steven Rostedt ([email protected]) wrote:
> > > > >
> > > > > On Fri, 7 Nov 2008, Mathieu Desnoyers wrote:
> > > > > >
> > > > > > __m_cnt_hi
> > > > > > is read before
> > > > > > mmio cnt_lo read
> > > > > >
> > > > > > for the detailed reasons explained in my previous discussion with
> > > > > > Nicolas here :
> > > > > > http://lkml.org/lkml/2008/10/21/1
> > > > > >
> > > > > > I use smp_rmb() to do this on SMP systems (hrm, actually, a rmb() could
> > > > > > be required so it works also on UP systems safely wrt interrupts).
> > > > >
> > > > > smp_rmb turns into a compiler barrier on UP and should prevent the below
> > > > > description.
> > > > >
> > > >
> > > > Ah, right, preserving program order on UP should be enough. smp_rmb()
> > > > then.
> > >
> > >
> > > I'm not quite sure I'm following here. Is this a global hardware clock
> > > you're reading from multiple cpus, if so, are you sure smp_rmb() will
> > > indeed be enough to sync the read?
> > >
> > > (In which case the smp_wmb() is provided by the hardware increasing the
> > > clock?)
> > >
> > > If these are per-cpu clocks then even in the smp case we'd be good with
> > > a plain barrier() because you'd only ever want to read your own cpu's
> > > clock (and have a separate __m_cnt_hi per cpu).
> > >
> > > Or am I totally missing out on something?
> > >
> >
> > This is the global hardware clock scenario.
> >
> > We have to order an uncached mmio read wrt a cached variable read/write.
> > The uncached mmio read vs smp_rmb() barrier (e.g. lfence instruction)
> > should be ensured by program order because the read will skip the cache
> > and go directly to the bus. Luckily we only do a mmio read and no mmio
> > write, so mmiowb() is not required.
> >
> > You might be right in that it could require more barriers.
> >
> > Given adequate program order, we can assume that the mmio read will
> > happen "on the spot", but that the cached read may be delayed.
> >
> > What we want is :
> >
> > readl(io_addr)
> > read __m_cnt_hi
> > write __m_cnt_hi
> >
> > With the two reads in the correct order. If we consider two consecutive
> > executions on the same CPU :
> >
> > readl(io_addr)
> > read __m_cnt_hi
> > write __m_cnt_hi
> >
> > readl(io_addr)
> > read __m_cnt_hi
> > write __m_cnt_hi
> >
> > We might have to order the read/write pair wrt the following readl, such
> > as :
> >
> > smp_rmb(); /* Waits for every cached memory reads to complete */
> > readl(io_addr);
> > barrier(); /* Make sure the compiler leaves mmio read before cached read */
> > read __m_cnt_hi
> > write __m_cnt_hi
> >
> > smp_rmb(); /* Waits for every cached memory reads to complete */
> > readl(io_addr)
> > barrier(); /* Make sure the compiler leaves mmio read before cached read */
> > read __m_cnt_hi
> > write __m_cnt_hi
> >
> > Would that make more sense ?
> >
>
> Oh, actually, I got things reversed in this email : the readl(io_addr)
> must be done _after_ the __m_cnt_hi read.
>
> Therefore, two consecutive executions would look like :
>
> barrier(); /* Make sure the compiler does not reorder __m_cnt_hi and
> previous mmio read. */
> read __m_cnt_hi
> smp_rmb(); /* Waits for every cached memory reads to complete */

If these are MMIO reads, then you need rmb() rather than smp_rmb(),
at least on architectures that can reorder writes (Power, Itanium,
and I believe also ARM, ...).

Thanx, Paul

> readl(io_addr);
> write __m_cnt_hi
>
>
> barrier(); /* Make sure the compiler does not reorder __m_cnt_hi and
> previous mmio read. */
> read __m_cnt_hi
> smp_rmb(); /* Waits for every cached memory reads to complete */
> readl(io_addr);
> write __m_cnt_hi
>
> Mathieu
>
> > Mathieu
> >
> > --
> > Mathieu Desnoyers
> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
>
> --
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-11-07 20:55:42

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

* Nicolas Pitre ([email protected]) wrote:
> On Fri, 7 Nov 2008, Mathieu Desnoyers wrote:
>
> > I want to make sure
> >
> > __m_cnt_hi
> > is read before
> > mmio cnt_lo read
> >
> > for the detailed reasons explained in my previous discussion with
> > Nicolas here :
> > http://lkml.org/lkml/2008/10/21/1
> >
> > I use smp_rmb() to do this on SMP systems (hrm, actually, a rmb() could
> > be required so it works also on UP systems safely wrt interrupts).
> >
> > The write side is between the hardware counter, which is assumed to
> > increment monotonically between each read, and the value __m_cnt_hi
> > updated by the CPU. I don't see where we could put a wmb() there.
> >
> > Without barrier, the smp race looks as follow :
> >
> >
> > CPU A B
> > read hw cnt low (0xFFFFFFFA)
> > read __m_cnt_hi (0x80000000)
> > read hw cnt low (0x00000001)
> > (wrap detected :
> > (s32)(0x80000000 ^ 0x1) < 0)
> > write __m_cnt_hi = 0x00000001
> > read __m_cnt_hi (0x00000001)
> > return 0x0000000100000001
> > (wrap detected :
> > (s32)(0x00000001 ^ 0xFFFFFFFA) < 0)
> > write __m_cnt_hi = 0x80000001
> > return 0x80000001FFFFFFFA
> > (time jumps)
>
> Could you have hardware doing such things? You would get a non-cached
> and more expensive read on CPU B which is not in program order with the
> read that should have happened before, and before that second out-of-order
> read could be performed, you'd have a full sequence in program
> order performed on CPU A.
>

Hrm, yes ? Well, it's the whole point in barriers/cache coherency
mechanisms, out-of-order reads... etc.

First off, read hw cnt low _is_ an uncached memory read (this is the
mmio read). __m_cnt_hi is a cached read, and therefore can be delayed if
the cache-line is busy. And we have no control over how much time can pass
between the two reads given the CPU may stall waiting for a cache-line.

So the scenario above happens if CPU A has __m_cnt_hi in its cacheline,
but for some reason CPU B has to defer the cacheline read of __m_cnt_hi
due to heavy cacheline traffic and decides to proceed to the mmio read
before the cacheline has been brought to the CPU because "hey, there is
no data dependency between those two reads !".

See Documentation/memory-barriers.txt.

Mathieu

>
> Nicolas

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-11-07 20:55:57

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

On Fri, Nov 07, 2008 at 03:08:12PM -0500, Steven Rostedt wrote:
>
> On Fri, 7 Nov 2008, Mathieu Desnoyers wrote:
> >
> > I want to make sure
> >
> > __m_cnt_hi
> > is read before
> > mmio cnt_lo read
>
> Hmm, let me make sure I understand why there is no wmb.
>
> Paul, can you verify this?
>
> Mathieu, you do the following:
>
> read a
> smp_rmb
> reab b
> if (test b)
> write a
>
> So the idea is that you must read b to test it. And since we must read a
> before reading b we can see that we write a before either?
>
> The question remains, can the write happen before either of the reads?
>
> But since the read b is reading the hw clock, perhaps that just implies a
> wmb on the hardware side?

The hardware must do an equivalent of a wmb(), but this might well be
done in logic or firmware on the device itself.

Thanx, Paul

2008-11-07 21:05:43

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()


On Fri, 7 Nov 2008, Paul E. McKenney wrote:
> > > Would that make more sense ?
> > >
> >
> > Oh, actually, I got things reversed in this email : the readl(io_addr)
> > must be done _after_ the __m_cnt_hi read.
> >
> > Therefore, two consecutive executions would look like :
> >
> > barrier(); /* Make sure the compiler does not reorder __m_cnt_hi and
> > previous mmio read. */
> > read __m_cnt_hi
> > smp_rmb(); /* Waits for every cached memory reads to complete */
>
> If these are MMIO reads, then you need rmb() rather than smp_rmb(),
> at least on architectures that can reorder writes (Power, Itanium,
> and I believe also ARM, ...).

The read is from a clock source. The only writes that are happening are
by the clock itself.

On a UP system, is a rmb still needed? That is, can you have two reads on
the same CPU from the clock source that will produce a backwards clock?
That to me sounds like the clock interface is broken.

-- Steve

2008-11-07 21:16:43

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

* Paul E. McKenney ([email protected]) wrote:
> On Fri, Nov 07, 2008 at 03:45:46PM -0500, Mathieu Desnoyers wrote:
> > * Mathieu Desnoyers ([email protected]) wrote:
> > > * Peter Zijlstra ([email protected]) wrote:
> > > > On Fri, 2008-11-07 at 14:18 -0500, Mathieu Desnoyers wrote:
> > > > > * Steven Rostedt ([email protected]) wrote:
> > > > > >
> > > > > > On Fri, 7 Nov 2008, Mathieu Desnoyers wrote:
> > > > > > >
> > > > > > > __m_cnt_hi
> > > > > > > is read before
> > > > > > > mmio cnt_lo read
> > > > > > >
> > > > > > > for the detailed reasons explained in my previous discussion with
> > > > > > > Nicolas here :
> > > > > > > http://lkml.org/lkml/2008/10/21/1
> > > > > > >
> > > > > > > I use smp_rmb() to do this on SMP systems (hrm, actually, a rmb() could
> > > > > > > be required so it works also on UP systems safely wrt interrupts).
> > > > > >
> > > > > > smp_rmb turns into a compiler barrier on UP and should prevent the below
> > > > > > description.
> > > > > >
> > > > >
> > > > > Ah, right, preserving program order on UP should be enough. smp_rmb()
> > > > > then.
> > > >
> > > >
> > > > I'm not quite sure I'm following here. Is this a global hardware clock
> > > > you're reading from multiple cpus, if so, are you sure smp_rmb() will
> > > > indeed be enough to sync the read?
> > > >
> > > > (In which case the smp_wmb() is provided by the hardware increasing the
> > > > clock?)
> > > >
> > > > If these are per-cpu clocks then even in the smp case we'd be good with
> > > > a plain barrier() because you'd only ever want to read your own cpu's
> > > > clock (and have a separate __m_cnt_hi per cpu).
> > > >
> > > > Or am I totally missing out on something?
> > > >
> > >
> > > This is the global hardware clock scenario.
> > >
> > > We have to order an uncached mmio read wrt a cached variable read/write.
> > > The uncached mmio read vs smp_rmb() barrier (e.g. lfence instruction)
> > > should be ensured by program order because the read will skip the cache
> > > and go directly to the bus. Luckily we only do a mmio read and no mmio
> > > write, so mmiowb() is not required.
> > >
> > > You might be right in that it could require more barriers.
> > >
> > > Given adequate program order, we can assume that the mmio read will
> > > happen "on the spot", but that the cached read may be delayed.
> > >
> > > What we want is :
> > >
> > > readl(io_addr)
> > > read __m_cnt_hi
> > > write __m_cnt_hi
> > >
> > > With the two reads in the correct order. If we consider two consecutive
> > > executions on the same CPU :
> > >
> > > readl(io_addr)
> > > read __m_cnt_hi
> > > write __m_cnt_hi
> > >
> > > readl(io_addr)
> > > read __m_cnt_hi
> > > write __m_cnt_hi
> > >
> > > We might have to order the read/write pair wrt the following readl, such
> > > as :
> > >
> > > smp_rmb(); /* Waits for every cached memory reads to complete */
> > > readl(io_addr);
> > > barrier(); /* Make sure the compiler leaves mmio read before cached read */
> > > read __m_cnt_hi
> > > write __m_cnt_hi
> > >
> > > smp_rmb(); /* Waits for every cached memory reads to complete */
> > > readl(io_addr)
> > > barrier(); /* Make sure the compiler leaves mmio read before cached read */
> > > read __m_cnt_hi
> > > write __m_cnt_hi
> > >
> > > Would that make more sense ?
> > >
> >
> > Oh, actually, I got things reversed in this email : the readl(io_addr)
> > must be done _after_ the __m_cnt_hi read.
> >
> > Therefore, two consecutive executions would look like :
> >
> > barrier(); /* Make sure the compiler does not reorder __m_cnt_hi and
> > previous mmio read. */
> > read __m_cnt_hi
> > smp_rmb(); /* Waits for every cached memory reads to complete */
>
> If these are MMIO reads, then you need rmb() rather than smp_rmb(),
> at least on architectures that can reorder writes (Power, Itanium,
> and I believe also ARM, ...).
>
> Thanx, Paul
>


I just dug into the barrier() question at the beginning of the code. I
think it's not necessary after all, because the worst a compiler could
do is probably the following :

Read nr | code

1 read a
1 rmb()
2 read a <------ ugh. Compiler could decide to prefetch the a value
                 and only update it if the test is true :(
1 read b
1 if (test b) {
1 write a
2 read a
}

2 rmb()
2 read b
2 if (test b)
2 write a

But it would not mix the order of a/b reads. So I think just the rmb()
would be enough.
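
For what it's worth, forcing a single read of a with ACCESS_ONCE()
(available in kernels of this vintage) would remove even that prefetch
worry -- a sketch, with a and b standing in for __m_cnt_hi and the hw
counter:

#include <linux/compiler.h>	/* ACCESS_ONCE() */
#include <linux/types.h>
#include <asm/io.h>
#include <asm/system.h>

static u32 a;				/* stands in for __m_cnt_hi */

static u64 sample(void __iomem *b_addr)
{
	u32 a_once = ACCESS_ONCE(a);	/* exactly one read of a, no replay */
	u32 b;

	rmb();				/* keep the a read before the b read */
	b = readl(b_addr);		/* read b */
	if ((s32)(a_once ^ b) < 0)	/* test b */
		a = a_once = (a_once ^ 0x80000000) + (a_once >> 31);	/* write a */
	return ((u64)a_once << 32) | b;
}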

Mathieu


> > readl(io_addr);
> > write __m_cnt_hi
> >
> >
> > barrier(); /* Make sure the compiler does not reorder __m_cnt_hi and
> > previous mmio read. */
> > read __m_cnt_hi
> > smp_rmb(); /* Waits for every cached memory reads to complete */
> > readl(io_addr);
> > write __m_cnt_hi
> >
> > Mathieu
> >
> > > Mathieu
> > >
> > > --
> > > Mathieu Desnoyers
> > > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
> >
> > --
> > Mathieu Desnoyers
> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-11-07 21:23:19

by Nicolas Pitre

[permalink] [raw]
Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

On Fri, 7 Nov 2008, Mathieu Desnoyers wrote:

> First off, read hw cnt low _is_ an uncached memory read (this is the
> mmio read). __m_cnt_hi is a cached read, and therefore can be delayed if
> the cache-line is busy. And we have no control over how much time can pass
> between the two reads given the CPU may stall waiting for a cache-line.
>
> So the scenario above happens if CPU A has __m_cnt_hi in its cacheline,
> but for some reason CPU B has to defer the cacheline read of __m_cnt_hi
> due to heavy cacheline traffic and decides to proceed to the mmio read
> before the cacheline has been brought to the CPU because "hey, there is
> no data dependency between those two reads !".

OK that makes sense.


Nicolas

2008-11-07 21:32:38

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

* Steven Rostedt ([email protected]) wrote:
>
> On Fri, 7 Nov 2008, Mathieu Desnoyers wrote:
> >
> > I want to make sure
> >
> > __m_cnt_hi
> > is read before
> > mmio cnt_lo read
>
> Hmm, let me make sure I understand why there is no wmb.
>
> Paul, can you verify this?
>
> Mathieu, you do the following:
>
> read a
> smp_rmb
> reab b
> if (test b)
> write a
>
> So the idea is that you must read b to test it. And since we must read a
> before reading b we can see that we write a before either?
>
> The question remains, can the write happen before either of the reads?
>

write a cannot happen before read a (same variable).
write a must happen after read b because it depends on the b value. It
makes sure that the side-effect of "write a" is seen by other CPUs
*after* we have read the b value.

> But since the read b is reading the hw clock, perhaps that just implies a
> wmb on the hardware side?
>

It makes sense. The hardware clock has no cache coherency problem, so
it could be seen as doing wmb() after each data update.

Mathieu

> -- Steve

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-11-07 21:41:22

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

* Russell King ([email protected]) wrote:
> On Fri, Nov 07, 2008 at 11:47:58AM -0500, Mathieu Desnoyers wrote:
> > But any get_cycles() user of cnt32_to_63() should be shot down. The
> > bright side is : there is no way get_cycles() can be used with this
> > new code. :)
> >
> > e.g. of incorrect users for arm (unless they are UP only, but that seems
> > like a weird design argument) :
> >
> > mach-sa1100/include/mach/SA-1100.h:#define OSCR __REG(0x90000010)
> > /* OS timer Counter Reg. */
> > mach-sa1100/generic.c: unsigned long long v = cnt32_to_63(OSCR);
> > mach-pxa/include/mach/pxa-regs.h:#define OSCR __REG(0x40A00010) /* OS
> > Timer Counter Register */
> > mach-pxa/time.c: unsigned long long v = cnt32_to_63(OSCR);
>
> It's strange for you to make that assertion when PXA was the exact
> platform that Nicolas created this code for - and that's a platform
> where preempt has been widely used.
>
> The two you mention are both ARMv5 or older architectures, and the
> first real SMP ARM architecture is ARMv6. So architecturally they
> are UP only.
>

Ok. And hopefully they do not execute instructions speculatively ?
Because then an instruction sync would be required between the __m_cnt_hi
read and get_cycles.

If you design such stuff with portability in mind, you'd use per-cpu
variables, which ends up being a single variable in the single-cpu
special-case.

> So, tell me why you say "unless they are UP only, but that seems like
> a weird design argument"? If the platforms can only ever be UP only,
> what's wrong with UP only code being used with them? (Not that I'm
> saying anything there about cnt32_to_63.)

That's fine, as long as the code does not end up in include/linux and
stays in arch/arm/up-only-subarch/.

When one tries to create architecture-agnostic code (which is what is
likely to be palatable to arch agnostic headers), designing with UP in
mind does not make much sense.

>
> I'd like to see you modify the silicon of a PXA or SA11x0 SoC to add
> more than one processor to the chip - maybe you could use evostick to
> glue two dies together and a microscope to aid bonding wires between
> the two? (Of course, you'd need to design something to ensure cache
> coherence as well, and arbitrate the internal bus between the two
> dies.) ;)
>
> Personally, I think that's highly unlikely.
>

Very unlikely indeed. ;)

Mathieu

> --
> Russell King
> Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
> maintainer of:

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-11-07 22:21:19

by Russell King

[permalink] [raw]
Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

On Fri, Nov 07, 2008 at 04:36:10PM -0500, Mathieu Desnoyers wrote:
> * Russell King ([email protected]) wrote:
> > On Fri, Nov 07, 2008 at 11:47:58AM -0500, Mathieu Desnoyers wrote:
> > > But any get_cycles() user of cnt32_to_63() should be shot down. The
> > > bright side is : there is no way get_cycles() can be used with this
> > > new code. :)
> > >
> > > e.g. of incorrect users for arm (unless they are UP only, but that seems
> > > like a weird design argument) :
> > >
> > > mach-sa1100/include/mach/SA-1100.h:#define OSCR __REG(0x90000010)
> > > /* OS timer Counter Reg. */
> > > mach-sa1100/generic.c: unsigned long long v = cnt32_to_63(OSCR);
> > > mach-pxa/include/mach/pxa-regs.h:#define OSCR __REG(0x40A00010) /* OS
> > > Timer Counter Register */
> > > mach-pxa/time.c: unsigned long long v = cnt32_to_63(OSCR);
> >
> > It's strange for you to make that assertion when PXA was the exact
> > platform that Nicolas created this code for - and that's a platform
> > where preempt has been widely used.
> >
> > The two you mention are both ARMv5 or older architectures, and the
> > first real SMP ARM architecture is ARMv6. So architecturally they
> > are UP only.
>
> Ok. And hopefully they do not execute instructions speculatively ?

Again, that's ARMv6 and later.

> Because then an instruction sync would be required between the __m_cnt_hi
> read and get_cycles.

What get_cycles? This is the ARM implementation of get_cycles():

static inline cycles_t get_cycles (void)
{
return 0;
}

Maybe you're using a name for one thing which means something else to
other people? Please don't use confusing vocabulary.

> If you design such stuff with portability in mind, you'd use per-cpu
> variables, which ends up being a single variable in the single-cpu
> special-case.

Explain how and why sched_clock(), which is a global time source, should
use per-cpu variables.

> > So, tell me why you say "unless they are UP only, but that seems like
> > a weird design argument"? If the platforms can only ever be UP only,
> > what's wrong with UP only code being used with them? (Not that I'm
> > saying anything there about cnt32_to_63.)
>
> That's fine, as long as the code does not end up in include/linux and
> stays in arch/arm/up-only-subarch/.

Well, that's where it was - private to ARM. Then David Howells came
along and unilaterally - and without reference to anyone as far as I
can see - moved it to include/linux.

Neither Nicolas, nor me had any idea that it was going to move into
include/linux - the first we knew of it was when pulling the change
from Linus' tree.

Look, if people in the kernel community can't or won't communicate
with others (either through malice, purpose or accident), you can
expect this kind of crap to happen.

> When one tries to create architecture-agnostic code (which is what is
> likely to be palatable to arch agnostic headers), designing with UP in
> mind does not make much sense.

It wasn't architecture agnostic code. It was ARM specific. We have
a "version control system" which stores "comments" for changes to the
kernel tree. Please use it to find out the true story. I'll save
you the trouble, here's the commits with full comments:

$ git log include/linux/cnt32_to_63.h
commit b4f151ff899362fec952c45d166252c9912c041f
Author: David Howells <[email protected]>
Date: Wed Sep 24 17:48:26 2008 +0100

MN10300: Move asm-arm/cnt32_to_63.h to include/linux/

Move asm-arm/cnt32_to_63.h to include/linux/ so that MN10300 can make
use of it too.

Signed-off-by: David Howells <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

$ git log -- arch/arm/include/asm/cnt32_to_63.h include/asm-arm/cnt32_to_63.h
commit bc173c5789e1fc6065fd378edc815914b40ee86b
Author: David Howells <[email protected]>
Date: Fri Sep 26 16:22:58 2008 +0100

ARM: Delete ARM's own cnt32_to_63.h

Delete ARM's own cnt32_to_63.h as the copy in include/linux/ should now be
used instead.

Signed-off-by: David Howells <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

commit 4baa9922430662431231ac637adedddbb0cfb2d7
Author: Russell King <[email protected]>
Date: Sat Aug 2 10:55:55 2008 +0100

[ARM] move include/asm-arm to arch/arm/include/asm

Move platform independent header files to arch/arm/include/asm, leaving
those in asm/arch* and asm/plat* alone.

Signed-off-by: Russell King <[email protected]>

commit 838ccbc35eae5b44d47724e5f694dbec4a26d269
Author: Nicolas Pitre <[email protected]>
Date: Mon Dec 4 20:19:31 2006 +0100

[ARM] 3978/1: macro to provide a 63-bit value from a 32-bit hardware counter

This is done in a completely lockless fashion. Bits 0 to 31 of the count
are provided by the hardware while bits 32 to 62 are stored in memory.
The top bit in memory is used to synchronize with the hardware count
half-period. When the top bit of both counters (hardware and in memory)
differ then the memory is updated with a new value, incrementing it when
the hardware counter wraps around. Because a word store in memory is
atomic then the incremented value will always be in synch with the top
bit indicating to any potential concurrent reader if the value in memory
is up to date or not wrt the needed increment. And any race in updating
the value in memory is harmless as the same value would be stored more
than once.

Signed-off-by: Nicolas Pitre <[email protected]>
Signed-off-by: Russell King <[email protected]>

So, stop slinging mud onto Nicolas and me over this. The resulting
mess is clearly not our creation.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:

2008-11-07 22:41:39

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

* Russell King ([email protected]) wrote:
> On Fri, Nov 07, 2008 at 04:36:10PM -0500, Mathieu Desnoyers wrote:
> > * Russell King ([email protected]) wrote:
> > > On Fri, Nov 07, 2008 at 11:47:58AM -0500, Mathieu Desnoyers wrote:
> > > > But any get_cycles() user of cnt32_to_63() should be shot down. The
> > > > bright side is : there is no way get_cycles() can be used with this
> > > > new code. :)
> > > >
> > > > e.g. of incorrect users for arm (unless they are UP only, but that seems
> > > > like a weird design argument) :
> > > >
> > > > mach-sa1100/include/mach/SA-1100.h:#define OSCR __REG(0x90000010)
> > > > /* OS timer Counter Reg. */
> > > > mach-sa1100/generic.c: unsigned long long v = cnt32_to_63(OSCR);
> > > > mach-pxa/include/mach/pxa-regs.h:#define OSCR __REG(0x40A00010) /* OS
> > > > Timer Counter Register */
> > > > mach-pxa/time.c: unsigned long long v = cnt32_to_63(OSCR);
> > >
> > > It's strange for you to make that assertion when PXA was the exact
> > > platform that Nicolas created this code for - and that's a platform
> > > where preempt has been widely used.
> > >
> > > The two you mention are both ARMv5 or older architectures, and the
> > > first real SMP ARM architecture is ARMv6. So architecturally they
> > > are UP only.
> >
> > Ok. And hopefully they do not execute instructions speculatively ?
>
> Again, that's ARMv6 and later.
>
> > Because then an instruction sync would be required between the __m_cnt_hi
> > read and get_cycles.
>
> What get_cycles? This is the ARM implementation of get_cycles():
>
> static inline cycles_t get_cycles (void)
> {
> return 0;
> }
>
> Maybe you're using a name for one thing which means something else to
> other people? Please don't use confusing vocabulary.
>

get_cycles() is expected to be a cpu register read which reads a cycle
counter. As far as I can tell,

#define OSCR __REG(0x90000010) /* OS timer Counter Reg. */

seems to fit this definition pretty closely. Or maybe not ?

I am personally not trying to find someone to blame. Just trying to
figure out how to fix the existing code or, at the very least, to make
sure nobody will ask me to use a piece of code not suitable for
tracing clock source purposes.

Mathieu

> > If you design such stuff with portability in mind, you'd use per-cpu
> > variables, which ends up being a single variable in the single-cpu
> > special-case.
>
> Explain how and why sched_clock(), which is a global time source, should
> use per-cpu variables.
>
> > > So, tell me why you say "unless they are UP only, but that seems like
> > > a weird design argument"? If the platforms can only ever be UP only,
> > > what's wrong with UP only code being used with them? (Not that I'm
> > > saying anything there about cnt32_to_63.)
> >
> > That's fine, as long as the code does not end up in include/linux and
> > stays in arch/arm/up-only-subarch/.
>
> Well, that's where it was - private to ARM. Then David Howells came
> along and unilaterally - and without reference to anyone as far as I
> can see - moved it to include/linux.
>
> Neither Nicolas, nor me had any idea that it was going to move into
> include/linux - the first we knew of it was when pulling the change
> from Linus' tree.
>
> Look, if people in the kernel community can't or won't communicate
> with others (either through malice, purpose or accident), you can
> expect this kind of crap to happen.
>
> > When one tries to create architecture-agnostic code (which is what is
> > likely to be palatable to arch agnostic headers), designing with UP in
> > mind does not make much sense.
>
> It wasn't architecture agnostic code. It was ARM specific. We have
> a "version control system" which stores "comments" for changes to the
> kernel tree. Please use it to find out the true story. I'll save
> you the trouble, here's the commits with full comments:
>
> $ git log include/linux/cnt32_to_63.h
> commit b4f151ff899362fec952c45d166252c9912c041f
> Author: David Howells <[email protected]>
> Date: Wed Sep 24 17:48:26 2008 +0100
>
> MN10300: Move asm-arm/cnt32_to_63.h to include/linux/
>
> Move asm-arm/cnt32_to_63.h to include/linux/ so that MN10300 can make
> use of it too.
>
> Signed-off-by: David Howells <[email protected]>
> Signed-off-by: Linus Torvalds <[email protected]>
>
> $ git log -- arch/arm/include/asm/cnt32_to_63.h include/asm-arm/cnt32_to_63.h
> commit bc173c5789e1fc6065fd378edc815914b40ee86b
> Author: David Howells <[email protected]>
> Date: Fri Sep 26 16:22:58 2008 +0100
>
> ARM: Delete ARM's own cnt32_to_63.h
>
> Delete ARM's own cnt32_to_63.h as the copy in include/linux/ should now be
> used instead.
>
> Signed-off-by: David Howells <[email protected]>
> Signed-off-by: Linus Torvalds <[email protected]>
>
> commit 4baa9922430662431231ac637adedddbb0cfb2d7
> Author: Russell King <[email protected]>
> Date: Sat Aug 2 10:55:55 2008 +0100
>
> [ARM] move include/asm-arm to arch/arm/include/asm
>
> Move platform independent header files to arch/arm/include/asm, leaving
> those in asm/arch* and asm/plat* alone.
>
> Signed-off-by: Russell King <[email protected]>
>
> commit 838ccbc35eae5b44d47724e5f694dbec4a26d269
> Author: Nicolas Pitre <[email protected]>
> Date: Mon Dec 4 20:19:31 2006 +0100
>
> [ARM] 3978/1: macro to provide a 63-bit value from a 32-bit hardware counter
>
> This is done in a completely lockless fashion. Bits 0 to 31 of the count
> are provided by the hardware while bits 32 to 62 are stored in memory.
> The top bit in memory is used to synchronize with the hardware count
> half-period. When the top bit of both counters (hardware and in memory)
> differ then the memory is updated with a new value, incrementing it when
> the hardware counter wraps around. Because a word store in memory is
> atomic then the incremented value will always be in synch with the top
> bit indicating to any potential concurrent reader if the value in memory
> is up to date or not wrt the needed increment. And any race in updating
> the value in memory is harmless as the same value would be stored more
> than once.
>
> Signed-off-by: Nicolas Pitre <[email protected]>
> Signed-off-by: Russell King <[email protected]>
>
> So, stop slinging mud onto Nicolas and me over this. The resulting
> mess is clearly not our creation.
>
> --
> Russell King
> Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
> maintainer of:

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-11-07 23:37:20

by David Howells

[permalink] [raw]
Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

Mathieu Desnoyers <[email protected]> wrote:

> Yes. Do you think the synchronization of the cycles counters is
> _perfect_ across CPUs so that there is no possible way whatsoever that
> two cycle counter values appear to go backward between CPUs ? (also
taking into account delays in __m_cnt_hi write-back...)

Given there's currently only one CPU allowed, yes, I think it's perfect:-)

It's something to re-evaluate should Panasonic decide to do SMP.

> If we expect the only correct use-case to be with readl(), I don't see
> the problem with added synchronization.

It might be expensive if you don't actually want to call readl(). But that's
on a par with using funky instructions to read the TSC, I guess.

David

2008-11-07 23:43:41

by David Howells

[permalink] [raw]
Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

Russell King <[email protected]> wrote:

> Well, that's where it was - private to ARM. Then David Howells came
> along and unilaterally - and without reference to anyone as far as I
> can see - moved it to include/linux.
>
> Neither Nicolas, nor me had any idea that it was going to move into
> include/linux - the first we knew of it was when pulling the change
> from Linus' tree.
>
> Look, if people in the kernel community can't or won't communicate
> with others (either through malice, purpose or accident), you can
> expect this kind of crap to happen.

Excuse me, Russell, but I sent Nicolas an email prior to doing so asking him
if he had any objections:

To: Nicolas Pitre <[email protected]>
cc: [email protected]
Subject: Moving asm-arm/cnt32_to_63.h to include/linux/
Date: Thu, 31 Jul 2008 16:04:04 +0100

Hi Nicolas,

Mind if I move include/asm-arm/cnt32_to_63.h to include/linux/?

I need to use it for MN10300.

David

He didn't respond. Not only that, but I copied Nicolas on the patch to make
the move and the patch to make MN10300 use it when I submitted it to Linus on
the 24th September, so it's not like he didn't have plenty of time. He
certainly saw that because he joined in the discussion of the second patch.
Furthermore, he could've asked Linus to refuse the patch, or to revert it if
it had already gone in.

I suppose I should've cc'd the ARM list too... but why should it adversely
affect ARM?

David

2008-11-07 23:52:00

by David Howells

[permalink] [raw]
Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

Steven Rostedt <[email protected]> wrote:

> > I use smp_rmb() to do this on SMP systems (hrm, actually, a rmb() could
> > be required so it works also on UP systems safely wrt interrupts).
>
> smp_rmb turns into a compiler barrier on UP and should prevent the below
> description.

Note that that does not guarantee that the two reads will be done in the order
you want. The compiler barrier _only_ affects the compiler. It does not stop
the CPU from doing the reads in any order it wants. You need something
stronger than smp_rmb() if you need the reads to be so ordered.

David

2008-11-07 23:56:53

by David Howells

[permalink] [raw]
Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

Nicolas Pitre <[email protected]> wrote:

> I explained several times already why I disagree. Preemption is not a
> problem unless you're preempted away for long enough, or IOW if your
> counter is too fast.

That's my point. Say a nice -19 process gets preempted... what guarantees
that it will resume within the time it takes the counter to wrap? Even if
the preempting process goes back to sleep, in that time a bunch of other
processes could have woken up and could starve it for a long period of time.

David

2008-11-08 00:17:55

by Russell King

[permalink] [raw]
Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

On Fri, Nov 07, 2008 at 11:41:55PM +0000, David Howells wrote:
> Russell King <[email protected]> wrote:
>
> > Well, that's where it was - private to ARM. Then David Howells came
> > along and unilaterally - and without reference to anyone as far as I
> > can see - moved it to include/linux.
> >
> > Neither Nicolas, nor me had any idea that it was going to move into
> > include/linux - the first we knew of it was when pulling the change
> > from Linus' tree.
> >
> > Look, if people in the kernel community can't or won't communicate
> > with others (either through malice, purpose or accident), you can
> > expect this kind of crap to happen.
>
> Excuse me, Russell, but I sent Nicolas an email prior to doing so asking him
> if he had any objections:
>
> To: Nicolas Pitre <[email protected]>
> cc: [email protected]
> Subject: Moving asm-arm/cnt32_to_63.h to include/linux/
> Date: Thu, 31 Jul 2008 16:04:04 +0100
>
> Hi Nicolas,
>
> Mind if I move include/asm-arm/cnt32_to_63.h to include/linux/?
>
> I need to use it for MN10300.
>
> David
>
> He didn't respond. Not only that, but I copied Nicolas on the patch to make
> the move and the patch to make MN10300 use it when I submitted it to Linus on
> the 24th September, so it's not like he didn't have plenty of time. He
> certainly saw that because he joined in the discussion of the second patch.
> Furthermore, he could've asked Linus to refuse the patch, or to revert it if
> it had already gone in.
>
> I suppose I should've cc'd the ARM list too... but why should it adversely
> affect ARM?

I take back the "Neither Nicolas" bit but the rest of my comment stands
and remains valid.

In light of akpm's demands to know how this got into the kernel, I decided
I'd put the story forward, especially as people in this thread are confused
about what it was designed for, and making random unfounded claims that
its existing ARM uses are buggy when they aren't.

It sounds to me as if the right answer is for it to move back to being an
ARM private thing with a MN10300 private copy, rather than it pretending
to be something everyone can use.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:

2008-11-08 00:34:48

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

On Fri, Nov 07, 2008 at 04:04:48PM -0500, Steven Rostedt wrote:
>
> On Fri, 7 Nov 2008, Paul E. McKenney wrote:
> > > > Would that make more sense ?
> > > >
> > >
> > > Oh, actually, I got things reversed in this email : the readl(io_addr)
> > > must be done _after_ the __m_cnt_hi read.
> > >
> > > Therefore, two consecutive executions would look like :
> > >
> > > barrier(); /* Make sure the compiler does not reorder __m_cnt_hi and
> > > previous mmio read. */
> > > read __m_cnt_hi
> > > smp_rmb(); /* Waits for every cached memory reads to complete */
> >
> > If these are MMIO reads, then you need rmb() rather than smp_rmb(),
> > at least on architectures that can reorder writes (Power, Itanium,
> > and I believe also ARM, ...).
>
> The read is from a clock source. The only writes that are happening are
> by the clock itself.
>
> On a UP system, is a rmb still needed? That is, can you have two reads on
> the same CPU from the clock source that will produce a backwards clock?
> That to me sounds like the clock interface is broken.

I do not believe that all CPUs are guaranteed to execute a sequence
of MMIO reads in order.

Thanx, Paul

2008-11-08 00:46:46

by David Howells

[permalink] [raw]
Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

Russell King <[email protected]> wrote:

> It sounds to me as if the right answer is for it to move back to being an
> ARM private thing with a MN10300 private copy, rather than it pretending
> to be something everyone can use.

The only problem I have with that is that there are then two independent
copies of it:-(

David

2008-11-08 00:55:39

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()


On Fri, 7 Nov 2008, David Howells wrote:

> Steven Rostedt <[email protected]> wrote:
>
> > > I use smp_rmb() to do this on SMP systems (hrm, actually, a rmb() could
> > > be required so it works also on UP systems safely wrt interrupts).
> >
> > smp_rmb turns into a compiler barrier on UP and should prevent the below
> > description.
>
> Note that that does not guarantee that the two reads will be done in the order
> you want. The compiler barrier _only_ affects the compiler. It does not stop
> the CPU from doing the reads in any order it wants. You need something
> stronger than smp_rmb() if you need the reads to be so ordered.

For reading hardware devices that can indeed be correct. But for normal
memory access on a uniprocessor, if the CPU were to reorder the reads that
would affect the actual algorithm then that CPU is broken.

read a
<--- interrupt - should see read a here before read b is done.
read b

Now, since one of the reads is of a hardware clock, that statement might
be too strong. But given that it is a clock, and
not some memory-mapped device register, I still think smp_rmb is
sufficient.

-- Steve

2008-11-08 15:24:22

by Nicolas Pitre

[permalink] [raw]
Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

On Sat, 8 Nov 2008, Russell King wrote:

> On Fri, Nov 07, 2008 at 11:41:55PM +0000, David Howells wrote:
> > Russell King <[email protected]> wrote:
> >
> > > Well, that's where it was - private to ARM. Then David Howells came
> > > along and unilaterally - and without reference to anyone as far as I
> > > can see - moved it to include/linux.
> > >
> > > Neither Nicolas, nor me had any idea that it was going to move into
> > > include/linux - the first we knew of it was when pulling the change
> > > from Linus' tree.
> > >
> > > Look, if people in the kernel community can't or won't communicate
> > > with others (either through malice, purpose or accident), you can
> > > expect this kind of crap to happen.
> >
> > Excuse me, Russell, but I sent Nicolas an email prior to doing so asking him
> > if he had any objections:
> >
> > To: Nicolas Pitre <[email protected]>
> > cc: [email protected]
> > Subject: Moving asm-arm/cnt32_to_63.h to include/linux/
> > Date: Thu, 31 Jul 2008 16:04:04 +0100
> >
> > Hi Nicolas,
> >
> > Mind if I move include/asm-arm/cnt32_to_63.h to include/linux/?
> >
> > I need to use it for MN10300.
> >
> > David
> >
> > He didn't respond. Not only that, but I copied Nicolas on the patch to make
> > the move and the patch to make MN10300 use it when I submitted it to Linus on
> > the 24th September, so it's not like he didn't have plenty of time. He
> > certainly saw that because he joined in the discussion of the second patch.
> > Furthermore, he could've asked Linus to refuse the patch, or to revert it if
> > it had already gone in.

I was OK with the patch moving that code and I think I told you so as
well. But...

> > I suppose I should've cc'd the ARM list too... but why should it adversely
> > affect ARM?
>
> I take back the "Neither Nicolas" bit but the rest of my comment stands
> and remains valid.
>
> In light of akpm's demands to know how this got into the kernel, I decided
> I'd put the story forward, especially as people in this thread are confused
> about what it was designed for, and making random unfounded claims that
> its existing ARM uses are buggy when they aren't.

... I must agree with Russell that this is apparently creating more
confusion with people than anything else.

> It sounds to me as if the right answer is for it to move back to being an
> ARM private thing with a MN10300 private copy, rather than it pretending
> to be something everyone can use.

I think this is OK even if not everyone can use this. The main purpose for
this code was to provide much increased accuracy for sched_clock() on
processors with only a 32-bit hardware counter.

Given that sched_clock() is already used in contexts where preemption is
disabled, I don't mind the addition of a clarification to the associated
comment mentioning that it must be called at least once per
half period of the base counter ***and*** not be preempted
away for longer than the half period of the counter minus the longest
period between two calls. The comment already mentions a kernel timer
which can be used to control the longest period between two calls.
Implicit disabling of preemption is _NOT_ the goal of this code.

I also don't mind having a real barrier for this code to be useful on
other platforms. On the platform this was written for, any kind of
barrier is defined as a compiler barrier, which is perfectly fine and
achieves the same effect as the current usage of volatile.

I also don't mind making the high part of the counter always be a per-CPU
variable. Again this won't change anything on the target this was
intended for, and it would make this code useful for more use cases, and
possibly help make the needed barrier on SMP more lightweight. The
usage requirement therefore becomes per CPU even if the base counter is
global. There are per-CPU timers with add_timer_on() so this can be
ensured pretty easily, as sketched below.
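
A minimal sketch of such a keepalive (hypothetical names; assumes the
2.6-era timer API, and that sched_clock() runs the cnt32_to_63() update
path):

/*
 * Illustrative only: fire well within each half period of the 32-bit
 * counter so the state bit kept in __m_cnt_hi is refreshed in time.
 */
static struct timer_list cnt_keepalive_timer;

static void cnt_keepalive(unsigned long data)
{
	sched_clock();	/* executes cnt32_to_63(), updating the hi part */
	mod_timer(&cnt_keepalive_timer, jiffies + 10 * HZ);
}

static void cnt_keepalive_init(void)
{
	setup_timer(&cnt_keepalive_timer, cnt_keepalive, 0);
	mod_timer(&cnt_keepalive_timer, jiffies + 10 * HZ);
	/* with a per-CPU cnt_hi, start one timer per CPU via add_timer_on() */
}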

And if after all this the code doesn't suit your needs then just don't
use it. Its documentation should be clear enough so if people start
using it in contexts where it isn't appropriate then it's not the code's
fault.


Nicolas

2008-11-08 23:21:21

by Nicolas Pitre

[permalink] [raw]
Subject: [PATCH] clarify usage expectations for cnt32_to_63()

Currently, all existing users of cnt32_to_63() are fine since the CPU
architectures where it is used don't do read access reordering, and user
mode preemption is disabled already. It is nevertheless a good idea to
better elaborate usage requirements wrt preemption, and use an explicit
memory barrier for proper results on CPUs that may perform instruction
reordering.

Signed-off-by: Nicolas Pitre <[email protected]>
---

On Sat, 8 Nov 2008, Nicolas Pitre wrote:

> I think this is OK even if not everyone can use this. The main purpose for
> this code was to provide much increased accuracy for sched_clock() on
> processors with only a 32-bit hardware counter.
>
> Given that sched_clock() is already used in contexts where preemption is
> disabled, I don't mind the addition of a clarification to the associated
> comment mentioning that it must be called at least once per
> half period of the base counter ***and*** not be preempted
> away for longer than the half period of the counter minus the longest
> period between two calls. The comment already mentions a kernel timer
> which can be used to control the longest period between two calls.
> Implicit disabling of preemption is _NOT_ the goal of this code.
>
> I also don't mind having a real barrier for this code to be useful on
> other platforms. On the platform this was written for, any kind of
> barrier is defined as a compiler barrier, which is perfectly fine and
> achieves the same effect as the current usage of volatile.

So here it is.

I used a rmb() so this is also safe for mixed usage in and out of
interrupt context. On the architecture I care about this is turned into
a simple compiler barrier and therefore doesn't make a difference, while
smp_rmb() is a noop, which isn't right.

I won't make a per_cpu_cnt32_to_63() version myself as I have no need
for that. But if someone wants to use this with a per-CPU counter which
is not coherent between CPUs then be my guest. The usage requirements
would be the same, but on each used CPU instead of globally.

diff --git a/include/linux/cnt32_to_63.h b/include/linux/cnt32_to_63.h
index 8c0f950..584289d 100644
--- a/include/linux/cnt32_to_63.h
+++ b/include/linux/cnt32_to_63.h
@@ -53,11 +53,19 @@ union cnt32_to_63 {
* needed increment. And any race in updating the value in memory is harmless
* as the same value would simply be stored more than once.
*
- * The only restriction for the algorithm to work properly is that this
- * code must be executed at least once per each half period of the 32-bit
- * counter to properly update the state bit in memory. This is usually not a
- * problem in practice, but if it is then a kernel timer could be scheduled
- * to manage for this code to be executed often enough.
+ * The restrictions for the algorithm to work properly are:
+ *
+ * 1) this code must be called at least once per each half period of the
+ * 32-bit counter;
+ *
+ * 2) this code must not be preempted for a duration longer than the
+ * 32-bit counter half period minus the longest period between two
+ * calls to this code.
+ *
+ * Those requirements ensure proper update to the state bit in memory.
+ * This is usually not a problem in practice, but if it is then a kernel
+ * timer should be scheduled to manage for this code to be executed often
+ * enough.
*
* Note that the top bit (bit 63) in the returned value should be considered
* as garbage. It is not cleared here because callers are likely to use a
@@ -68,9 +76,10 @@ union cnt32_to_63 {
*/
#define cnt32_to_63(cnt_lo) \
({ \
- static volatile u32 __m_cnt_hi; \
+ static u32 __m_cnt_hi; \
union cnt32_to_63 __x; \
__x.hi = __m_cnt_hi; \
+ rmb(); \
__x.lo = (cnt_lo); \
if (unlikely((s32)(__x.hi ^ __x.lo) < 0)) \
__m_cnt_hi = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31); \
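
The wrap detection in this hunk can be exercised with a small userspace
model of the same arithmetic (a sketch only; the barrier is omitted
since the model is single-threaded):

#include <stdio.h>
#include <stdint.h>

static uint32_t m_cnt_hi;	/* models __m_cnt_hi */

static uint64_t cnt32_to_63_model(uint32_t cnt_lo)
{
	uint32_t hi = m_cnt_hi;

	/* top bits differ: a half period boundary was crossed */
	if ((int32_t)(hi ^ cnt_lo) < 0)
		m_cnt_hi = hi = (hi ^ 0x80000000) + (hi >> 31);
	/* mask bit 63, which the real macro leaves to the caller */
	return (((uint64_t)hi << 32) | cnt_lo) & 0x7fffffffffffffffULL;
}

int main(void)
{
	/* samples straddling a 32-bit wrap: the output stays monotonic */
	uint32_t lo[] = { 0x7ffffff0, 0x80000010, 0xfffffff0, 0x00000010 };
	int i;

	for (i = 0; i < 4; i++)
		printf("lo=%08x -> %016llx\n", lo[i],
		       (unsigned long long)cnt32_to_63_model(lo[i]));
	return 0;
}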

2008-11-09 02:26:05

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH] clarify usage expectations for cnt32_to_63()

* Nicolas Pitre ([email protected]) wrote:
> Currently, all existing users of cnt32_to_63() are fine since the CPU
> architectures where it is used don't do read access reordering, and user
> mode preemption is disabled already. It is nevertheless a good idea to
> better elaborate usage requirements wrt preemption, and use an explicit
> memory barrier for proper results on CPUs that may perform instruction
> reordering.
>
> Signed-off-by: Nicolas Pitre <[email protected]>
> ---
>
> On Sat, 8 Nov 2008, Nicolas Pitre wrote:
>
> > I think this is OK even if not everyone can use this. The main purpose for
> > this code was to provide much increased accuracy for sched_clock() on
> > processors with only a 32-bit hardware counter.
> >
> > Given that sched_clock() is already used in contexts where preemption is
> > disabled, I don't mind the addition of a clarification to the associated
> > comment mentioning that it must be called at least once per
> > half period of the base counter ***and*** not be preempted
> > away for longer than the half period of the counter minus the longest
> > period between two calls. The comment already mentions a kernel timer
> > which can be used to control the longest period between two calls.
> > Implicit disabling of preemption is _NOT_ the goal of this code.
> >
> > I also don't mind having a real barrier for this code to be useful on
> > other platforms. On the platform this was written for, any kind of
> > barrier is defined as a compiler barrier, which is perfectly fine and
> > achieves the same effect as the current usage of volatile.
>
> So here it is.
>
> I used a rmb() so this is also safe for mixed usage in and out of
> interrupt context. On the architecture I care about this is turned into
> a simple compiler barrier and therefore doesn't make a difference, while
> smp_rmb() is a noop, which isn't right.
>

Hum ? smp_rmb() is turned into a compiler barrier on !SMP architectures.
Turning it into a NOP would be broken. Actually, ARM defines it as a
barrier().

I *think* that smp_rmb() would be enough, supposing the access to memory
is done in program order wrt local interrupts in UP. This is basically
Steven's question, which has not received any clear answer yet. I'd like
to know what others think about it.

Mathieu

> I won't make a per_cpu_cnt32_to_63() version myself as I have no need
> for that. But if someone wants to use this with a per-CPU counter which
> is not coherent between CPUs then be my guest. The usage requirements
> would be the same, but on each used CPU instead of globally.
>
> diff --git a/include/linux/cnt32_to_63.h b/include/linux/cnt32_to_63.h
> index 8c0f950..584289d 100644
> --- a/include/linux/cnt32_to_63.h
> +++ b/include/linux/cnt32_to_63.h
> @@ -53,11 +53,19 @@ union cnt32_to_63 {
> * needed increment. And any race in updating the value in memory is harmless
> * as the same value would simply be stored more than once.
> *
> - * The only restriction for the algorithm to work properly is that this
> - * code must be executed at least once per each half period of the 32-bit
> - * counter to properly update the state bit in memory. This is usually not a
> - * problem in practice, but if it is then a kernel timer could be scheduled
> - * to manage for this code to be executed often enough.
> + * The restrictions for the algorithm to work properly are:
> + *
> + * 1) this code must be called at least once per each half period of the
> + * 32-bit counter;
> + *
> + * 2) this code must not be preempted for a duration longer than the
> + * 32-bit counter half period minus the longest period between two
> + * calls to this code.
> + *
> + * Those requirements ensure proper update to the state bit in memory.
> + * This is usually not a problem in practice, but if it is then a kernel
> + * timer should be scheduled to manage for this code to be executed often
> + * enough.
> *
> * Note that the top bit (bit 63) in the returned value should be considered
> * as garbage. It is not cleared here because callers are likely to use a
> @@ -68,9 +76,10 @@ union cnt32_to_63 {
> */
> #define cnt32_to_63(cnt_lo) \
> ({ \
> - static volatile u32 __m_cnt_hi; \
> + static u32 __m_cnt_hi; \
> union cnt32_to_63 __x; \
> __x.hi = __m_cnt_hi; \
> + rmb(); \
> __x.lo = (cnt_lo); \
> if (unlikely((s32)(__x.hi ^ __x.lo) < 0)) \
> __m_cnt_hi = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31); \

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-11-09 02:54:23

by Nicolas Pitre

[permalink] [raw]
Subject: Re: [PATCH] clarify usage expectations for cnt32_to_63()

On Sat, 8 Nov 2008, Mathieu Desnoyers wrote:

> > I used a rmb() so this is also safe for mixed usage in and out of
> > interrupt context. On the architecture I care about this is turned into
> > a simple compiler barrier and therefore doesn't make a difference, while
> > smp_rmb() is a noop, which isn't right.
> >
>
> Hum ? smp_rmb() is turned into a compiler barrier on !SMP architectures.
> Turning it into a NOP would be broken. Actually, ARM defines it as a
> barrier().

Oh, right. I got confused somehow with read_barrier_depends().

> I *think* that smp_rmb() would be enough, supposing the access to memory
> is done in program order wrt local interrupts in UP. This is basically
> Steven's question, which has not received any clear answer yet. I'd like
> to know what others think about it.

In the meantime a pure rmb() is the safest thing to do now. Once we
can convince ourselves that out-of-order reads are always rolled back
upon the arrival of an interrupt then this could be relaxed.


Nicolas

2008-11-09 05:06:33

by Nicolas Pitre

[permalink] [raw]
Subject: Re: [PATCH] clarify usage expectations for cnt32_to_63()

On Sat, 8 Nov 2008, Nicolas Pitre wrote:

> On Sat, 8 Nov 2008, Mathieu Desnoyers wrote:
>
> > I *think* that smp_rmb() would be enough, supposing the access to memory
> > is done in program order wrt local interrupts in UP. This is basically
> > Steven's question, which has not received any clear answer yet. I'd like
> > to know what others think about it.
>
> In the meantime a pure rmb() is the safest thing to do now. Once we
> can convince ourselves that out-of-order reads are always rolled back
> upon the arrival of an interrupt then this could be relaxed.

After thinking about it some more, a smp_rmb() must be correct.

On UP, it would be completely insane if an exception didn't resume
the whole sequence, since the CPU cannot presume anything when returning
from it.

If the instruction flow says:

READ A
READ B

And if speculative execution causes B to be read before A, and you get
an interrupt after B was read but before A was read, then the program
counter may only point at READ A upon returning from the exception and B
will be read again. Doing otherwise would require the CPU to remember
any reordering that it might have performed across every exception, which
is completely insane.

So smp_rmb() it shall be.


Nicolas

2008-11-09 05:28:19

by Nicolas Pitre

[permalink] [raw]
Subject: [PATCH v2] clarify usage expectations for cnt32_to_63()

Currently, all existing users of cnt32_to_63() are fine since the CPU
architectures where it is used don't do read access reordering, and user
mode preemption is disabled already. It is nevertheless a good idea to
better elaborate usage requirements wrt preemption, and use an explicit
memory barrier on SMP to avoid different CPUs accessing the counter
value in the wrong order. On UP a simple compiler barrier is
sufficient.

Signed-off-by: Nicolas Pitre <[email protected]>
---

On Sun, 9 Nov 2008, Nicolas Pitre wrote:

> On Sat, 8 Nov 2008, Nicolas Pitre wrote:
>
> > On Sat, 8 Nov 2008, Mathieu Desnoyers wrote:
> >
> > > I *think* that smp_rmb() would be enough, supposing the access to memory
> > > is done in program order wrt local interrupts in UP. This is basically
> > > Steven's question, which has not received any clear answer yet. I'd like
> > > to know what others think about it.
> >
> > In the meantime a pure rmb() is the safest thing to do now. Once we
> > can convince ourselves that out-of-order reads are always rolled back
> > upon the arrival of an interrupt then this could be relaxed.
>
> After thinking about it some more, a smp_rmb() must be correct.
>
> On UP, it would be completely insane if an exception didn't resume
> the whole sequence, since the CPU cannot presume anything when returning
> from it.
>
> If the instruction flow says:
>
> READ A
> READ B
>
> And if speculative execution causes B to be read before A, and you get
> an interrupt after B was read but before A was read, then the program
> counter may only point at READ A upon returning from the exception and B
> will be read again. Doing otherwise would require the CPU to remember
> any reordering that it might have performed across every exception, which
> is completely insane.
>
> So smp_rmb() it shall be.

diff --git a/include/linux/cnt32_to_63.h b/include/linux/cnt32_to_63.h
index 8c0f950..7605fdd 100644
--- a/include/linux/cnt32_to_63.h
+++ b/include/linux/cnt32_to_63.h
@@ -16,6 +16,7 @@
#include <linux/compiler.h>
#include <linux/types.h>
#include <asm/byteorder.h>
+#include <asm/system.h>

/* this is used only to give gcc a clue about good code generation */
union cnt32_to_63 {
@@ -53,11 +54,19 @@ union cnt32_to_63 {
* needed increment. And any race in updating the value in memory is harmless
* as the same value would simply be stored more than once.
*
- * The only restriction for the algorithm to work properly is that this
- * code must be executed at least once per each half period of the 32-bit
- * counter to properly update the state bit in memory. This is usually not a
- * problem in practice, but if it is then a kernel timer could be scheduled
- * to manage for this code to be executed often enough.
+ * The restrictions for the algorithm to work properly are:
+ *
+ * 1) this code must be called at least once per each half period of the
+ * 32-bit counter;
+ *
+ * 2) this code must not be preempted for a duration longer than the
+ * 32-bit counter half period minus the longest period between two
+ * calls to this code.
+ *
+ * Those requirements ensure proper update to the state bit in memory.
+ * This is usually not a problem in practice, but if it is then a kernel
+ * timer should be scheduled to manage for this code to be executed often
+ * enough.
*
* Note that the top bit (bit 63) in the returned value should be considered
* as garbage. It is not cleared here because callers are likely to use a
@@ -68,9 +77,10 @@ union cnt32_to_63 {
*/
#define cnt32_to_63(cnt_lo) \
({ \
- static volatile u32 __m_cnt_hi; \
+ static u32 __m_cnt_hi; \
union cnt32_to_63 __x; \
__x.hi = __m_cnt_hi; \
+ smp_rmb(); \
__x.lo = (cnt_lo); \
if (unlikely((s32)(__x.hi ^ __x.lo) < 0)) \
__m_cnt_hi = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31); \

2008-11-09 06:49:16

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH v2] clarify usage expectations for cnt32_to_63()

* Nicolas Pitre ([email protected]) wrote:
> Currently, all existing users of cnt32_to_63() are fine since the CPU
> architectures where it is used don't do read access reordering, and user
> mode preemption is disabled already. It is nevertheless a good idea to
> better elaborate usage requirements wrt preemption, and use an explicit
> memory barrier on SMP to avoid different CPUs accessing the counter
> value in the wrong order. On UP a simple compiler barrier is
> sufficient.
>
> Signed-off-by: Nicolas Pitre <[email protected]>
> ---
>
...
> @@ -68,9 +77,10 @@ union cnt32_to_63 {
> */
> #define cnt32_to_63(cnt_lo) \
> ({ \
> - static volatile u32 __m_cnt_hi; \
> + static u32 __m_cnt_hi; \

It's important to get the smp_rmb() here, which this patch provides, so
consider this patch to be acked-by me. The added documentation is needed
too.

But I also think that declaring the static u32 __m_cnt_hi here is
counter-intuitive for developers who wish to use it.

I'd recommend letting the declaration be done outside of cnt32_to_63 so
the same variable can be passed as a parameter to more than one execution
site.

Mathieu


> union cnt32_to_63 __x; \
> __x.hi = __m_cnt_hi; \
> + smp_rmb(); \
> __x.lo = (cnt_lo); \
> if (unlikely((s32)(__x.hi ^ __x.lo) < 0)) \
> __m_cnt_hi = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31); \

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-11-09 11:53:06

by David Howells

[permalink] [raw]
Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

Steven Rostedt <[email protected]> wrote:

> > Note that that does not guarantee that the two reads will be done in the
> > order you want. The compiler barrier _only_ affects the compiler. It
> > does not stop the CPU from doing the reads in any order it wants. You
> > need something stronger than smp_rmb() if you need the reads to be so
> > ordered.
>
> For reading hardware devices that can indeed be correct. But for normal
> memory access on a uniprocessor, if the CPU were to reorder the reads in a
> way that would affect the actual algorithm, then that CPU is broken.
>
> read a
> <--- interrupt - should see read a here before read b is done.
> read b

Life isn't that simple. Go and read the section labelled "The things cpus get
up to" in Documentation/memory-barriers.txt.

The two reads we're talking about are independent of each other. Independent
reads and writes can be reordered and merged at will by the CPU, subject to
restrictions imposed by barriers, cacheability attributes, MMIO attributes and
suchlike.

You can get read b happening before read a, but in such a case both
instructions will be in the CPU's execution pipeline. When an interrupt
occurs, the CPU will presumably finish clearing what's in its pipeline before
going and servicing the interrupt handler.

If a CPU is strictly ordered with respect to reads, do you actually need read
barriers?

The fact that a pair of reads might be part of an algorithm that is critically
dependent on the ordering of those reads isn't something the CPU cares about.
It doesn't know there's an algorithm there.

> Now, the fact that one of the reads is of a hardware clock means this
> statement might be too strong. But since it is a clock, and not some
> memory-mapped device register, I still think smp_rmb is sufficient.

To quote again from memory-barriers.txt, section "CPU memory barriers":

Mandatory barriers should not be used to control SMP effects, since
mandatory barriers unnecessarily impose overhead on UP systems. They
may, however, be used to control MMIO effects on accesses through
relaxed memory I/O windows. These are required even on non-SMP
systems as they affect the order in which memory operations appear to
a device by prohibiting both the compiler and the CPU from reordering
them.

Section "Accessing devices":

(2) If the accessor functions are used to refer to an I/O memory window with
relaxed memory access properties, then _mandatory_ memory barriers are
required to enforce ordering.

David

2008-11-09 13:34:42

by Nicolas Pitre

[permalink] [raw]
Subject: Re: [PATCH v2] clarify usage expectations for cnt32_to_63()

On Sun, 9 Nov 2008, Mathieu Desnoyers wrote:

> * Nicolas Pitre ([email protected]) wrote:
> > Currently, all existing users of cnt32_to_63() are fine since the CPU
> > architectures where it is used don't do read access reordering, and user
> > mode preemption is disabled already. It is nevertheless a good idea to
> > better elaborate usage requirements wrt preemption, and use an explicit
> > memory barrier on SMP to avoid different CPUs accessing the counter
> > value in the wrong order. On UP a simple compiler barrier is
> > sufficient.
> >
> > Signed-off-by: Nicolas Pitre <[email protected]>
> > ---
> >
> ...
> > @@ -68,9 +77,10 @@ union cnt32_to_63 {
> > */
> > #define cnt32_to_63(cnt_lo) \
> > ({ \
> > - static volatile u32 __m_cnt_hi; \
> > + static u32 __m_cnt_hi; \
>
> It's important to get the smp_rmb() here, which this patch provides, so
> consider this patch to be acked-by me. The added documentation is needed
> too.

Thanks.

> But I also think that declaring the static u32 __m_cnt_hi here is
> counter-intuitive for developers who wish to use it.

I'm rather not convinced of that. And this is a much bigger change
affecting all callers so I'd defer such change even if I was convinced
of it.

> I'd recommend letting the declaration be done outside of cnt32_to_63 so
> the same variable can be passed as a parameter to more than one execution
> site.

Do you really have such instances where multiple call sites are needed?
That sounds even more confusing to me than the current model. Better
encapsulate the usage of this macro within some function which has a
stronger meaning, such as sched_clock(), and call _that_ from multiple
sites instead.


Nicolas

2008-11-09 13:44:34

by Russell King

[permalink] [raw]
Subject: Re: [PATCH v2] clarify usage expectations for cnt32_to_63()

On Sun, Nov 09, 2008 at 08:34:23AM -0500, Nicolas Pitre wrote:
> Do you really have such instances where multiple call sites are needed?
> That sounds even more confusing to me than the current model. Better
> encapsulate the usage of this macro within some function which has a
> stronger meaning, such as sched_clock(), and call _that_ from multiple
> sites instead.

What if sched_clock() is inline and uses cnt32_to_63()? I think that's
where the problem lies. Better add a comment that it shouldn't be used
inside another inline function.
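
A minimal sketch of the pitfall (MY_COUNTER is a hypothetical register,
not from this thread): if such a function lives in a header, every .c
file that inlines it gets its own static __m_cnt_hi from the macro
expansion, and the hi parts silently diverge between translation units.

/* in some header -- broken with the static-in-macro scheme */
static inline unsigned long long my_sched_clock(void)
{
	/* each including .c file gets a private __m_cnt_hi here */
	return cnt32_to_63(readl(MY_COUNTER)) & 0x7fffffffffffffffULL;
}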

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:

2008-11-09 14:31:52

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()


On Sun, 9 Nov 2008, David Howells wrote:

> Steven Rostedt <[email protected]> wrote:
>
> > > Note that that does not guarantee that the two reads will be done in the
> > > order you want. The compiler barrier _only_ affects the compiler. It
> > > does not stop the CPU from doing the reads in any order it wants. You
> > > need something stronger than smp_rmb() if you need the reads to be so
> > > ordered.
> >
> > For reading hardware devices that can indeed be correct. But for normal
> > memory access on a uniprocessor, if the CPU were to reorder the reads in a
> > way that would affect the actual algorithm, then that CPU is broken.

Please read what I said above again.

"For reading hardware devices that can indeed be correct."

There I agree that accessing devices will require a rmb.

"But for normal memory access on a uniprocessor, if the CPU were to
reorder the reads that would effect the actual algorithm then that CPU is
broken."

Here I'm talking about accessing normal RAM. If the CPU decides to read b
before reading a then that will break the code.

> >
> > read a
> > <--- interrupt - should see read a here before read b is done.
> > read b
>
> Life isn't that simple. Go and read the section labelled "The things cpus get
> up to" in Documentation/memory-barriers.txt.

I've read it. Several times ;-)

>
> The two reads we're talking about are independent of each other. Independent
> reads and writes can be reordered and merged at will by the CPU, subject to
> restrictions imposed by barriers, cacheability attributes, MMIO attributes and
> suchlike.
>
> You can get read b happening before read a, but in such a case both
> instructions will be in the CPU's execution pipeline. When an interrupt
> occurs, the CPU will presumably finish clearing what's in its pipeline before
> going and servicing the interrupt handler.

The above sounds like you just answered my question, and a smp_rmb is
enough. If an interrupt occurs, then the read a and read b will be
completed. It really does not matter in which order, as long as the
interrupt itself does not see the read b before the read a.


>
> If a CPU is strictly ordered with respect to reads, do you actually need read
> barriers?
>
> The fact that a pair of reads might be part of an algorithm that is critically
> dependent on the ordering of those reads isn't something the CPU cares about.
> It doesn't know there's an algorithm there.
>
> > Now, the fact that one of the reads is of a hardware clock means this
> > statement might be too strong. But since it is a clock, and not some
> > memory-mapped device register, I still think smp_rmb is sufficient.
>
> To quote again from memory-barriers.txt, section "CPU memory barriers":
>
> Mandatory barriers should not be used to control SMP effects, since
> mandatory barriers unnecessarily impose overhead on UP systems. They
> may, however, be used to control MMIO effects on accesses through
> relaxed memory I/O windows. These are required even on non-SMP
> systems as they affect the order in which memory operations appear to
> a device by prohibiting both the compiler and the CPU from reordering
> them.
>
> Section "Accessing devices":
>
> (2) If the accessor functions are used to refer to an I/O memory window with
> relaxed memory access properties, then _mandatory_ memory barriers are
> required to enforce ordering.

My confidence that a smp_rmb is enough is not as strong when one of the
reads is a clock. And it may not be. I'll have to think about this a bit
more. Again, the question arises with:

read a (memory)
<---- interrupt
read b (clock)

Will the read of b be seen before the interrupt occurred, and before a
is read? That is what would break the algorithm on UP. If we cannot
rule this scenario out, then a rmb is needed.

-- Steve

2008-11-09 16:23:05

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH v2] clarify usage expectations for cnt32_to_63()

* Nicolas Pitre ([email protected]) wrote:
> On Sun, 9 Nov 2008, Mathieu Desnoyers wrote:
>
> > * Nicolas Pitre ([email protected]) wrote:
> > > Currently, all existing users of cnt32_to_63() are fine since the CPU
> > > architectures where it is used don't do read access reordering, and user
> > > mode preemption is disabled already. It is nevertheless a good idea to
> > > better elaborate usage requirements wrt preemption, and use an explicit
> > > memory barrier on SMP to avoid different CPUs accessing the counter
> > > value in the wrong order. On UP a simple compiler barrier is
> > > sufficient.
> > >
> > > Signed-off-by: Nicolas Pitre <[email protected]>
> > > ---
> > >
> > ...
> > > @@ -68,9 +77,10 @@ union cnt32_to_63 {
> > > */
> > > #define cnt32_to_63(cnt_lo) \
> > > ({ \
> > > - static volatile u32 __m_cnt_hi; \
> > > + static u32 __m_cnt_hi; \
> >
> > It's important to get the smp_rmb() here, which this patch provides, so
> > consider this patch to be acked-by me. The added documentation is needed
> > too.
>
> Thanks.
>
> > But I also think that declaring the static u32 __m_cnt_hi here is
> > counter-intuitive for developers who wish to use it.
>
> I'm rather not convinced of that. And this is a much bigger change
> affecting all callers so I'd defer such change even if I was convinced
> of it.
>
> > I'd recommend letting the declaration be done outside of cnt32_to_63 so
> > the same variable can be passed as a parameter to more than one execution
> > site.
>
> Do you really have such instances where multiple call sites are needed?
> That sounds even more confusing to me than the current model. Better
> encapsulate the usage of this macro within some function which has a
> stronger meaning, such as sched_clock(), and call _that_ from multiple
> sites instead.
>
>
> Nicolas

I see a few reasons for it :

- If we want to inline the whole read function so we don't pay the extra
runtime cost of a function call, this would become required.
- If we want to create a per cpu timer which updates the value
periodically without calling the function. We may want to add some
WARN_ON or some sanity tests in this periodic update that would not be
part of the standard read code. If we don't have access to this
variable outside of the macro, this becomes impossible.

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-11-09 16:24:54

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()

* David Howells ([email protected]) wrote:
> Steven Rostedt <[email protected]> wrote:
>
> > > Note that that does not guarantee that the two reads will be done in the
> > > order you want. The compiler barrier _only_ affects the compiler. It
> > > does not stop the CPU from doing the reads in any order it wants. You
> > > need something stronger than smp_rmb() if you need the reads to be so
> > > ordered.
> >
> > For reading hardware devices that can indeed be correct. But for normal
> > memory access on a uniprocessor, if the CPU were to reorder the reads in a
> > way that would affect the actual algorithm, then that CPU is broken.
> >
> > read a
> > <--- interrupt - should see read a here before read b is done.
> > read b
>
> Life isn't that simple. Go and read the section labelled "The things cpus get
> up to" in Documentation/memory-barriers.txt.
>
> The two reads we're talking about are independent of each other. Independent
> reads and writes can be reordered and merged at will by the CPU, subject to
> restrictions imposed by barriers, cacheability attributes, MMIO attributes and
> suchlike.
>
> You can get read b happening before read a, but in such a case both
> instructions will be in the CPU's execution pipeline. When an interrupt
> occurs, the CPU will presumably finish clearing what's in its pipeline before
> going and servicing the interrupt handler.
>
> If a CPU is strictly ordered with respect to reads, do you actually need read
> barriers?
>
> The fact that a pair of reads might be part of an algorithm that is critically
> dependent on the ordering of those reads isn't something the CPU cares about.
> It doesn't know there's an algorithm there.
>
> > Now, the fact that one of the reads is of a hardware clock means this
> > statement might be too strong. But since it is a clock, and not some
> > memory-mapped device register, I still think smp_rmb is sufficient.
>
> To quote again from memory-barriers.txt, section "CPU memory barriers":
>
> Mandatory barriers should not be used to control SMP effects, since
> mandatory barriers unnecessarily impose overhead on UP systems. They
> may, however, be used to control MMIO effects on accesses through
> relaxed memory I/O windows. These are required even on non-SMP
> systems

<emphasis>
> as they affect the order in which memory operations appear to a device
</emphasis>

In this particular case, we don't care about the order of memory
operations as seen by the device, given we only read the mmio time
source atomically. So considering what you said above about the fact
that the CPU will flush all the pending operations in the pipeline
before proceeding to service an interrupt, a simple barrier() should be
enough to make the two operations appear in correct order wrt local
interrupts. I therefore don't think a full rmb() is required to ensure
correct read order on UP, because, again, in this case we don't need to
order accesses as seen by the device.

Mathieu

> by prohibiting both the compiler and the CPU from reordering
> them.
>
> Section "Accessing devices":
>
> (2) If the accessor functions are used to refer to an I/O memory window with
> relaxed memory access properties, then _mandatory_ memory barriers are
> required to enforce ordering.
>
> David

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-11-10 04:20:23

by Nicolas Pitre

[permalink] [raw]
Subject: Re: [PATCH v2] clarify usage expectations for cnt32_to_63()

On Sun, 9 Nov 2008, Mathieu Desnoyers wrote:

> * Nicolas Pitre ([email protected]) wrote:
> > Do you really have such instances where multiple call sites are needed?
> > That sounds even more confusing to me than the current model. Better
> > encapsulate the usage of this macro within some function which has a
> > stronger meaning, such as sched_clock(), and call _that_ from multiple
> > sites instead.
>
> I see a few reasons for it :
>
> - If we want to inline the whole read function so we don't pay the extra
> runtime cost of a function call, this would become required.

You can inline it as you want as long as it remains in the same .c file.
The static variable is still shared amongst all call sites in that case.

> - If we want to create a per cpu timer which updates the value
> periodically without calling the function. We may want to add some
> WARN_ON or some sanity tests in this periodic update that would not be
> part of the standard read code. If we don't have access to this
> variable outside of the macro, this becomes impossible.

I don't see how you could update the variable without calling the
function somehow or duplicating it. As to the sanity check argument:
you can perform such checks on the result rather than on the internal
variable, which IMHO would be more logical; see the sketch below.
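
Such a check on the result could look like this sketch (hypothetical
names; last_val would need to be per-CPU in a per-CPU setup):

static u64 last_val;

static u64 checked_read(u32 cnt_lo)
{
	u64 v = cnt32_to_63(cnt_lo) & 0x7fffffffffffffffULL;

	WARN_ON_ONCE(v < last_val);	/* expanded time must not go backward */
	last_val = v;
	return v;
}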

And if you want a per CPU version, then it's better to create a
per_cpu_cnt32_to_63() variant which could use a relaxed barrier in that
case.


Nicolas

2008-11-10 04:44:50

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v2] clarify usage expectations for cnt32_to_63()

On Sun, 09 Nov 2008 23:20:00 -0500 (EST) Nicolas Pitre <[email protected]> wrote:

> On Sun, 9 Nov 2008, Mathieu Desnoyers wrote:
>
> > * Nicolas Pitre ([email protected]) wrote:
> > > Do you really have such instances where multiple call sites are needed?
> > > That sounds even more confusing to me than the current model. Better
> > > encapsulate the usage of this macro within some function which has a
> > > stronger meaning, such as sched_clock(), and call _that_ from multiple
> > > sites instead.
> >
> > I see a few reasons for it :
> >
> > - If we want to inline the whole read function so we don't pay the extra
> > runtime cost of a function call, this would become required.
>
> You can inline it as you want as long as it remains in the same .c file.
> The static variable is still shared amongst all call sites in that case.

Please don't rely upon deep compiler behaviour like that. It is
unobvious to the reader and it might break if someone uses it incorrectly,
or if the compiler implementation changes, or if a non-gcc compiler is
used, etc.

It is far better to make the management of the state explicit and at
the control of the caller. Get the caller to allocate the state and
pass its address into this function. Simple, clear, explicit and
robust.

2008-11-10 21:35:27

by Nicolas Pitre

[permalink] [raw]
Subject: Re: [PATCH v2] clarify usage expectations for cnt32_to_63()

On Sun, 9 Nov 2008, Andrew Morton wrote:

> On Sun, 09 Nov 2008 23:20:00 -0500 (EST) Nicolas Pitre <[email protected]> wrote:
>
> > On Sun, 9 Nov 2008, Mathieu Desnoyers wrote:
> >
> > > * Nicolas Pitre ([email protected]) wrote:
> > > > Do you really have such instances where multiple call sites are needed?
> > > > That sounds even more confusing to me than the current model. Better
> > > > encapsulate the usage of this macro within some function which has a
> > > > stronger meaning, such as sched_clock(), and call _that_ from multiple
> > > > sites instead.
> > >
> > > I see a few reasons for it :
> > >
> > > - If we want to inline the whole read function so we don't pay the extra
> > > runtime cost of a function call, this would become required.
> >
> > You can inline it as you want as long as it remains in the same .c file.
> > The static variable is still shared amongst all call sites in that case.
>
> Please don't rely upon deep compiler behaviour like that. It is
> unobvious to the reader and it might break if someone uses it incorrectly,
> or if the compiler implementation changes, or if a non-gcc compiler is
> used, etc.

If a compiler doesn't reference the same storage for a static variable
used by a function that gets inlined in the same compilation unit then
it is utterly broken.

> It is far better to make the management of the state explicit and at
> the control of the caller. Get the caller to allocate the state and
> pass its address into this function. Simple, clear, explicit and
> robust.

Sigh... What about this compromise then?

diff --git a/include/linux/cnt32_to_63.h b/include/linux/cnt32_to_63.h
index 7605fdd..74ce767 100644
--- a/include/linux/cnt32_to_63.h
+++ b/include/linux/cnt32_to_63.h
@@ -32,8 +32,9 @@ union cnt32_to_63 {


/**
- * cnt32_to_63 - Expand a 32-bit counter to a 63-bit counter
+ * __cnt32_to_63 - Expand a 32-bit counter to a 63-bit counter
* @cnt_lo: The low part of the counter
+ * @cnt_hi_p: Pointer to storage for the extended part of the counter
*
* Many hardware clock counters are only 32 bits wide and therefore have
* a relatively short period making wrap-arounds rather frequent. This
@@ -75,16 +76,31 @@ union cnt32_to_63 {
* clear-bit instruction. Otherwise caller must remember to clear the top
* bit explicitly.
*/
-#define cnt32_to_63(cnt_lo) \
+#define __cnt32_to_63(cnt_lo, cnt_hi_p) \
({ \
- static u32 __m_cnt_hi; \
union cnt32_to_63 __x; \
- __x.hi = __m_cnt_hi; \
+ __x.hi = *(cnt_hi_p); \
smp_rmb(); \
__x.lo = (cnt_lo); \
if (unlikely((s32)(__x.hi ^ __x.lo) < 0)) \
- __m_cnt_hi = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31); \
+ *(cnt_hi_p) = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31); \
__x.val; \
})

+/**
+ * cnt32_to_63 - Expand a 32-bit counter to a 63-bit counter
+ * @cnt_lo: The low part of the counter
+ *
+ * This is the same as __cnt32_to_63() except that the storage for the
+ * extended part of the counter is implicit. Because this uses a static
+ * variable, a user of this code must not be an inline function unless
+ * that function is contained in a single .c file for a given counter.
+ * All usage requirements for __cnt32_to_63() also apply here as well.
+ */
+#define cnt32_to_63(cnt_lo) \
+({ \
+ static u32 __m_cnt_hi; \
+ __cnt32_to_63(cnt_lo, &__m_cnt_hi); \
+})
+
#endif

Nicolas
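
For illustration, a caller of the explicit-state variant above might
look like this sketch (MY_OSCR is a hypothetical stand-in for a real
counter register; the usage requirements from the comment block still
apply):

static u32 my_cnt_hi;	/* state is now visible to and owned by the caller */

static unsigned long long my_clock63(void)
{
	/* mask bit 63, which is documented as garbage */
	return __cnt32_to_63(readl(MY_OSCR), &my_cnt_hi)
			& 0x7fffffffffffffffULL;
}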

2008-11-10 22:00:58

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v2] clarify usage expectations for cnt32_to_63()

On Mon, 10 Nov 2008 16:34:54 -0500 (EST)
Nicolas Pitre <[email protected]> wrote:

> > It is far better to make the management of the state explicit and at
> > the control of the caller. Get the caller to allocate the state and
> > pass its address into this function. Simple, clear, explicit and
> > robust.
>
> Sigh... What about this compromise then?
>
> diff --git a/include/linux/cnt32_to_63.h b/include/linux/cnt32_to_63.h
> index 7605fdd..74ce767 100644
> --- a/include/linux/cnt32_to_63.h
> +++ b/include/linux/cnt32_to_63.h
> @@ -32,8 +32,9 @@ union cnt32_to_63 {
>
>
> /**
> - * cnt32_to_63 - Expand a 32-bit counter to a 63-bit counter
> + * __cnt32_to_63 - Expand a 32-bit counter to a 63-bit counter
> * @cnt_lo: The low part of the counter
> + * @cnt_hi_p: Pointer to storage for the extended part of the counter
> *
> * Many hardware clock counters are only 32 bits wide and therefore have
> * a relatively short period making wrap-arounds rather frequent. This
> @@ -75,16 +76,31 @@ union cnt32_to_63 {
> * clear-bit instruction. Otherwise caller must remember to clear the top
> * bit explicitly.
> */
> -#define cnt32_to_63(cnt_lo) \
> +#define __cnt32_to_63(cnt_lo, cnt_hi_p) \
> ({ \
> - static u32 __m_cnt_hi; \
> union cnt32_to_63 __x; \
> - __x.hi = __m_cnt_hi; \
> + __x.hi = *(cnt_hi_p); \
> smp_rmb(); \
> __x.lo = (cnt_lo); \
> if (unlikely((s32)(__x.hi ^ __x.lo) < 0)) \
> - __m_cnt_hi = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31); \
> + *(cnt_hi_p) = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31); \
> __x.val; \
> })

This references its second argument twice, which can cause correctness
or efficiency problems.

There is no reason that this had to be implemented in cpp.
Implementing it in C will fix the above problem.



To the reader of the code, a call to cnt32_to_63() looks exactly like a
plain old function call. Hiding the instantiation of the state storage
inside this macro misleads the reader and hence is bad practice. This
is one of the reasons why the management of that state should be
performed by the caller and made explicit.

I cannot think of any other cases in the kernel where this trick of
instantiating static storage at a macro expansion site is performed.
It is unusual. It will surprise readers. Surprising readers is
undesirable.
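
To make the double reference concrete, here is a hedged example with a
made-up side-effecting helper:

/* hi_ptr_of() is a made-up helper with a side effect (say, a tracepoint) */
u64 v = __cnt32_to_63(readl(MY_COUNTER), hi_ptr_of(smp_processor_id()));

/*
 * The macro expands hi_ptr_of(...) twice: once to read the hi part, and
 * again in the wrap branch, so the side effect fires once or twice
 * depending on whether a wrap was just detected.
 */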

2008-11-10 23:16:16

by Nicolas Pitre

[permalink] [raw]
Subject: Re: [PATCH v2] clarify usage expectations for cnt32_to_63()

On Mon, 10 Nov 2008, Andrew Morton wrote:

> On Mon, 10 Nov 2008 16:34:54 -0500 (EST)
> Nicolas Pitre <[email protected]> wrote:
>
> > > It is far better to make the management of the state explicit and at
> > > the control of the caller. Get the caller to allocate the state and
> > > pass its address into this function. Simple, clear, explicit and
> > > robust.
> >
> > Sigh... What about this compromise then?
> >
> > diff --git a/include/linux/cnt32_to_63.h b/include/linux/cnt32_to_63.h
> > index 7605fdd..74ce767 100644
> > --- a/include/linux/cnt32_to_63.h
> > +++ b/include/linux/cnt32_to_63.h
> > @@ -32,8 +32,9 @@ union cnt32_to_63 {
> >
> >
> > /**
> > - * cnt32_to_63 - Expand a 32-bit counter to a 63-bit counter
> > + * __cnt32_to_63 - Expand a 32-bit counter to a 63-bit counter
> > * @cnt_lo: The low part of the counter
> > + * @cnt_hi_p: Pointer to storage for the extended part of the counter
> > *
> > * Many hardware clock counters are only 32 bits wide and therefore have
> > * a relatively short period making wrap-arounds rather frequent. This
> > @@ -75,16 +76,31 @@ union cnt32_to_63 {
> > * clear-bit instruction. Otherwise caller must remember to clear the top
> > * bit explicitly.
> > */
> > -#define cnt32_to_63(cnt_lo) \
> > +#define __cnt32_to_63(cnt_lo, cnt_hi_p) \
> > ({ \
> > - static u32 __m_cnt_hi; \
> > union cnt32_to_63 __x; \
> > - __x.hi = __m_cnt_hi; \
> > + __x.hi = *(cnt_hi_p); \
> > smp_rmb(); \
> > __x.lo = (cnt_lo); \
> > if (unlikely((s32)(__x.hi ^ __x.lo) < 0)) \
> > - __m_cnt_hi = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31); \
> > + *(cnt_hi_p) = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31); \
> > __x.val; \
> > })
>
> This references its second argument twice, which can cause correctness
> or efficiency problems.
>
> There is no reason that this had to be implemented in cpp.
> Implementing it in C will fix the above problem.

No, it won't, for correctness and efficiency reasons.

And I've explained why already.

No need to discuss this further if you can't get it.


Nicolas

2008-11-10 23:23:44

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v2] clarify usage expectations for cnt32_to_63()

On Mon, 10 Nov 2008 18:15:32 -0500 (EST)
Nicolas Pitre <[email protected]> wrote:

> On Mon, 10 Nov 2008, Andrew Morton wrote:
>
> > On Mon, 10 Nov 2008 16:34:54 -0500 (EST)
> > Nicolas Pitre <[email protected]> wrote:
> >
> > > > It is far better to make the management of the state explicit and at
> > > > the control of the caller. Get the caller to allocate the state and
> > > > pass its address into this function. Simple, clear, explicit and
> > > > robust.
> > >
> > > Sigh... What about this compromise then?
> > >
> > > diff --git a/include/linux/cnt32_to_63.h b/include/linux/cnt32_to_63.h
> > > index 7605fdd..74ce767 100644
> > > --- a/include/linux/cnt32_to_63.h
> > > +++ b/include/linux/cnt32_to_63.h
> > > @@ -32,8 +32,9 @@ union cnt32_to_63 {
> > >
> > >
> > > /**
> > > - * cnt32_to_63 - Expand a 32-bit counter to a 63-bit counter
> > > + * __cnt32_to_63 - Expand a 32-bit counter to a 63-bit counter
> > > * @cnt_lo: The low part of the counter
> > > + * @cnt_hi_p: Pointer to storage for the extended part of the counter
> > > *
> > > * Many hardware clock counters are only 32 bits wide and therefore have
> > > * a relatively short period making wrap-arounds rather frequent. This
> > > @@ -75,16 +76,31 @@ union cnt32_to_63 {
> > > * clear-bit instruction. Otherwise caller must remember to clear the top
> > > * bit explicitly.
> > > */
> > > -#define cnt32_to_63(cnt_lo) \
> > > +#define __cnt32_to_63(cnt_lo, cnt_hi_p) \
> > > ({ \
> > > - static u32 __m_cnt_hi; \
> > > union cnt32_to_63 __x; \
> > > - __x.hi = __m_cnt_hi; \
> > > + __x.hi = *(cnt_hi_p); \
> > > smp_rmb(); \
> > > __x.lo = (cnt_lo); \
> > > if (unlikely((s32)(__x.hi ^ __x.lo) < 0)) \
> > > - __m_cnt_hi = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31); \
> > > + *(cnt_hi_p) = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31); \
> > > __x.val; \
> > > })
> >
> > This references its second argument twice, which can cause correctness
> > or efficiency problems.
> >
> > There is no reason that this had to be implemented in cpp.
> > Implementing it in C will fix the above problem.
>
> No, it won't, for correctness and efficiency reasons.
>
> And I've explained why already.

I'd be very surprised if you've really found a case where a macro is
faster than an inlined function. I don't think that has happened
before.

2008-11-10 23:38:44

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2] clarify usage expectations for cnt32_to_63()


On Mon, 10 Nov 2008, Andrew Morton wrote:
> On Mon, 10 Nov 2008 18:15:32 -0500 (EST)
> Nicolas Pitre <[email protected]> wrote:
> > >
> > > This references its second argument twice, which can cause correctness
> > > or efficiency problems.
> > >
> > > There is no reason that this had to be implemented in cpp.
> > > Implementing it in C will fix the above problem.
> >
> > No, it won't, for correctness and efficiency reasons.
> >
> > And I've explained why already.
>
> I'd be very surprised if you've really found a case where a macro is
> faster than an inlined function. I don't think that has happened
> before.

But that's the way my Grandpa did it. With macros!

-- Steve

2008-11-11 00:27:17

by Nicolas Pitre

[permalink] [raw]
Subject: Re: [PATCH v2] clarify usage expectations for cnt32_to_63()

On Mon, 10 Nov 2008, Andrew Morton wrote:

> On Mon, 10 Nov 2008 18:15:32 -0500 (EST)
> Nicolas Pitre <[email protected]> wrote:
>
> > On Mon, 10 Nov 2008, Andrew Morton wrote:
> >
> > > This references its second argument twice, which can cause correctness
> > > or efficiency problems.
> > >
> > > There is no reason that this had to be implemented in cpp.
> > > Implementing it in C will fix the above problem.
> >
> > No, it won't, for correctness and efficiency reasons.
> >
> > And I've explained why already.
>
> I'd be very surprised if you've really found a case where a macro is
> faster than an inlined function. I don't think that has happened
> before.

That has nothing to do with "a macro is faster" at all. It's all
about the order in which the provided arguments are evaluated. And the
first one might be anything: a memory value, an IO operation, an
expression, etc. An inline function would work correctly only with
pointers and would therefore totally break apart on x86, for example.
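
A sketch of the evaluation-order point (hypothetical names): C
evaluates function arguments before the body runs, so an inline
function taking cnt_lo by value samples the hardware before the hi
part is read:

static inline u64 cnt32_to_63_fn(u32 cnt_lo, u32 *cnt_hi_p)
{
	u32 hi = *cnt_hi_p;
	smp_rmb();
	/* too late: cnt_lo was already evaluated at the call site,
	 * i.e. before the hi read above -- the order is inverted */
	if (unlikely((s32)(hi ^ cnt_lo) < 0))
		*cnt_hi_p = hi = (hi ^ 0x80000000) + (hi >> 31);
	return ((u64)hi << 32) | cnt_lo;
}

static u64 fn_read_clock(void)
{
	static u32 fn_cnt_hi;

	/* readl() runs first, then the function body reads fn_cnt_hi */
	return cnt32_to_63_fn(readl(MY_COUNTER), &fn_cnt_hi);
}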


Nicolas

2008-11-11 18:28:21

by Mathieu Desnoyers

[permalink] [raw]
Subject: [PATCH] convert cnt32_to_63 to inline

* Nicolas Pitre ([email protected]) wrote:
> On Mon, 10 Nov 2008, Andrew Morton wrote:
>
> > On Mon, 10 Nov 2008 18:15:32 -0500 (EST)
> > Nicolas Pitre <[email protected]> wrote:
> >
> > > On Mon, 10 Nov 2008, Andrew Morton wrote:
> > >
> > > > This references its second argument twice, which can cause correctness
> > > > or efficiency problems.
> > > >
> > > > There is no reason that this had to be implemented in cpp.
> > > > Implementing it in C will fix the above problem.
> > >
> > > No, it won't, for correctness and efficiency reasons.
> > >
> > > And I've explained why already.
> >
> > I'd be very surprised if you've really found a case where a macro is
> > faster than an inlined function. I don't think that has happened
> > before.
>
> That has nothing to do with "a macro is faster" at all. It's all
> about the order in which the provided arguments are evaluated. And the
> first one might be anything: a memory value, an IO operation, an
> expression, etc. An inline function would work correctly only with
> pointers and would therefore totally break apart on x86, for example.
>
>
> Nicolas

Let's see what this gives once implemented. Only compile-tested. Assumes
pxa, sa1100 and mn10300 are all UP-only. A correct smp_rmb() is used for
ARM Versatile.

Turn cnt32_to_63 into an inline function.
Change all callers to new API.
Document barrier usage.

Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: Nicolas Pitre <[email protected]>
CC: Andrew Morton <[email protected]>
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: [email protected]
---
arch/arm/mach-pxa/time.c | 14 ++++++++++++-
arch/arm/mach-sa1100/generic.c | 15 +++++++++++++-
arch/arm/mach-versatile/core.c | 12 ++++++++++-
arch/mn10300/kernel/time.c | 19 +++++++++++++-----
include/linux/cnt32_to_63.h | 42 +++++++++++++++++++++++++++++------------
5 files changed, 82 insertions(+), 20 deletions(-)

Index: linux.trees.git/arch/arm/mach-pxa/time.c
===================================================================
--- linux.trees.git.orig/arch/arm/mach-pxa/time.c 2008-11-11 12:20:42.000000000 -0500
+++ linux.trees.git/arch/arm/mach-pxa/time.c 2008-11-11 13:05:01.000000000 -0500
@@ -37,6 +37,10 @@
#define OSCR2NS_SCALE_FACTOR 10

static unsigned long oscr2ns_scale;
+static u32 sched_clock_cnt_hi; /*
+ * Shared cnt_hi OK with cycle counter only
+ * for UP systems.
+ */

static void __init set_oscr2ns_scale(unsigned long oscr_rate)
{
@@ -54,7 +58,15 @@ static void __init set_oscr2ns_scale(uns

unsigned long long sched_clock(void)
{
- unsigned long long v = cnt32_to_63(OSCR);
+ u32 cnt_lo, cnt_hi;
+ unsigned long long v;
+
+ preempt_disable_notrace();
+ cnt_hi = sched_clock_cnt_hi;
+ barrier(); /* read cnt_hi before cnt_lo */
+ cnt_lo = OSCR;
+ v = cnt32_to_63(cnt_hi, cnt_lo, &sched_clock_cnt_hi);
+ preempt_enable_notrace();
return (v * oscr2ns_scale) >> OSCR2NS_SCALE_FACTOR;
}

Index: linux.trees.git/include/linux/cnt32_to_63.h
===================================================================
--- linux.trees.git.orig/include/linux/cnt32_to_63.h 2008-11-11 12:20:17.000000000 -0500
+++ linux.trees.git/include/linux/cnt32_to_63.h 2008-11-11 13:10:44.000000000 -0500
@@ -32,7 +32,9 @@ union cnt32_to_63 {

/**
* cnt32_to_63 - Expand a 32-bit counter to a 63-bit counter
- * @cnt_lo: The low part of the counter
+ * @cnt_hi: The high part of the counter (read first)
+ * @cnt_lo: The low part of the counter (read after cnt_hi)
+ * @cnt_hi_ptr: Pointer to high part of the counter
*
* Many hardware clock counters are only 32 bits wide and therefore have
* a relatively short period making wrap-arounds rather frequent. This
@@ -57,7 +59,10 @@ union cnt32_to_63 {
* code must be executed at least once per each half period of the 32-bit
* counter to properly update the state bit in memory. This is usually not a
* problem in practice, but if it is then a kernel timer could be scheduled
- * to manage for this code to be executed often enough.
+ * to manage for this code to be executed often enough. If a per-cpu cnt_hi is
+ * used, the value must be updated at least once per 32-bit half-period on each
+ * CPU. If cnt_hi is shared between CPUs, it suffices to update it once per
+ * 32-bit half-period on any CPU.
*
* Note that the top bit (bit 63) in the returned value should be considered
* as garbage. It is not cleared here because callers are likely to use a
@@ -65,16 +70,29 @@ union cnt32_to_63 {
* implicitly by making the multiplier even, therefore saving on a runtime
* clear-bit instruction. Otherwise caller must remember to clear the top
* bit explicitly.
+ *
+ * Preemption must be disabled when reading the cnt_hi and cnt_lo values and
+ * calling this function.
+ *
+ * The cnt_hi parameter _must_ be read before cnt_lo. This implies using the
+ * proper barriers :
+ * - smp_rmb() if cnt_lo is read from mmio and the cnt_hi variable is shared
+ * across CPUs.
+ * - use a per-cpu variable for cnt_hi if cnt_lo is read from per-cpu cycle
+ * counters, or to read the counters with only a barrier().
*/
-#define cnt32_to_63(cnt_lo) \
-({ \
- static volatile u32 __m_cnt_hi; \
- union cnt32_to_63 __x; \
- __x.hi = __m_cnt_hi; \
- __x.lo = (cnt_lo); \
- if (unlikely((s32)(__x.hi ^ __x.lo) < 0)) \
- __m_cnt_hi = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31); \
- __x.val; \
-})
+static inline u64 cnt32_to_63(u32 cnt_hi, u32 cnt_lo, u32 *cnt_hi_ptr)
+{
+ union cnt32_to_63 __x = {
+ {
+ .hi = cnt_hi,
+ .lo = cnt_lo,
+ },
+ };
+
+ if (unlikely((s32)(__x.hi ^ __x.lo) < 0))
+ *cnt_hi_ptr = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31);
+ return __x.val; /* Remember to clear the top bit in the caller */
+}

#endif
Index: linux.trees.git/arch/arm/mach-sa1100/generic.c
===================================================================
--- linux.trees.git.orig/arch/arm/mach-sa1100/generic.c 2008-11-11 12:20:42.000000000 -0500
+++ linux.trees.git/arch/arm/mach-sa1100/generic.c 2008-11-11 13:05:10.000000000 -0500
@@ -34,6 +34,11 @@
unsigned int reset_status;
EXPORT_SYMBOL(reset_status);

+static u32 sched_clock_cnt_hi; /*
+ * Shared cnt_hi OK with cycle counter only
+ * for UP systems.
+ */
+
#define NR_FREQS 16

/*
@@ -133,7 +138,15 @@ EXPORT_SYMBOL(cpufreq_get);
*/
unsigned long long sched_clock(void)
{
- unsigned long long v = cnt32_to_63(OSCR);
+ u32 cnt_lo, cnt_hi;
+ unsigned long long v;
+
+ preempt_disable_notrace();
+ cnt_hi = sched_clock_cnt_hi;
+ barrier(); /* read cnt_hi before cnt_lo */
+ cnt_lo = OSCR;
+ v = cnt32_to_63(cnt_hi, cnt_lo, &sched_clock_cnt_hi);
+ preempt_enable_notrace();

/* the <<1 gets rid of the cnt_32_to_63 top bit saving on a bic insn */
v *= 78125<<1;
Index: linux.trees.git/arch/arm/mach-versatile/core.c
===================================================================
--- linux.trees.git.orig/arch/arm/mach-versatile/core.c 2008-11-11 12:20:42.000000000 -0500
+++ linux.trees.git/arch/arm/mach-versatile/core.c 2008-11-11 12:57:55.000000000 -0500
@@ -60,6 +60,8 @@
#define VA_VIC_BASE __io_address(VERSATILE_VIC_BASE)
#define VA_SIC_BASE __io_address(VERSATILE_SIC_BASE)

+static u32 sched_clock_cnt_hi;
+
static void sic_mask_irq(unsigned int irq)
{
irq -= IRQ_SIC_START;
@@ -238,7 +240,15 @@ void __init versatile_map_io(void)
*/
unsigned long long sched_clock(void)
{
- unsigned long long v = cnt32_to_63(readl(VERSATILE_REFCOUNTER));
+ u32 cnt_lo, cnt_hi;
+ unsigned long long v;
+
+ preempt_disable_notrace();
+ cnt_hi = sched_clock_cnt_hi;
+ smp_rmb();
+ cnt_lo = readl(VERSATILE_REFCOUNTER);
+ v = cnt32_to_63(cnt_hi, cnt_lo, &sched_clock_cnt_hi);
+ preempt_enable_notrace();

/* the <<1 gets rid of the cnt_32_to_63 top bit saving on a bic insn */
v *= 125<<1;
Index: linux.trees.git/arch/mn10300/kernel/time.c
===================================================================
--- linux.trees.git.orig/arch/mn10300/kernel/time.c 2008-11-11 12:41:42.000000000 -0500
+++ linux.trees.git/arch/mn10300/kernel/time.c 2008-11-11 13:04:42.000000000 -0500
@@ -29,6 +29,11 @@ unsigned long mn10300_iobclk; /* system
unsigned long mn10300_tsc_per_HZ; /* number of ioclks per jiffy */
#endif /* CONFIG_MN10300_RTC */

+static u32 sched_clock_cnt_hi; /*
+ * shared cnt_hi OK with cycle counter only
+ * for UP systems.
+ */
+
static unsigned long mn10300_last_tsc; /* time-stamp counter at last time
* interrupt occurred */

@@ -52,18 +57,22 @@ unsigned long long sched_clock(void)
unsigned long long ll;
unsigned l[2];
} tsc64, result;
- unsigned long tsc, tmp;
+ unsigned long tmp;
unsigned product[3]; /* 96-bit intermediate value */
+ u32 cnt_lo, cnt_hi;

- /* read the TSC value
- */
- tsc = 0 - get_cycles(); /* get_cycles() counts down */
+ preempt_disable_notrace();
+ cnt_hi = sched_clock_cnt_hi;
+ barrier(); /* read cnt_hi before cnt_lo */
+ cnt_lo = 0 - get_cycles(); /* get_cycles() counts down */

/* expand to 64-bits.
* - sched_clock() must be called once a minute or better or the
* following will go horribly wrong - see cnt32_to_63()
*/
- tsc64.ll = cnt32_to_63(tsc) & 0x7fffffffffffffffULL;
+ tsc64.ll = cnt32_to_63(cnt_hi, cnt_lo, &sched_clock_cnt_hi);
+ tsc64.ll &= 0x7fffffffffffffffULL;
+ preempt_enable_notrace();

/* scale the 64-bit TSC value to a nanosecond value via a 96-bit
* intermediate


--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
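
For reference, a minimal sketch of the calling convention the patch above
documents (hypothetical caller: READ_HW_COUNTER() and my_clock_cnt_hi are
illustrative placeholders, not real kernel symbols; on UP-only platforms the
smp_rmb() degrades to a barrier(), as in the sa1100 and mn10300 hunks):

static u32 my_clock_cnt_hi;		/* shared high-bits state */

static u64 my_clock_read(void)
{
	u32 cnt_hi, cnt_lo;
	u64 v;

	preempt_disable_notrace();
	cnt_hi = my_clock_cnt_hi;
	smp_rmb();			/* read cnt_hi before cnt_lo */
	cnt_lo = READ_HW_COUNTER();	/* mmio read of the low 32 bits */
	v = cnt32_to_63(cnt_hi, cnt_lo, &my_clock_cnt_hi);
	preempt_enable_notrace();

	return v & 0x7fffffffffffffffULL;	/* top bit is garbage */
}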

2008-11-11 19:16:37

by Russell King

[permalink] [raw]
Subject: Re: [PATCH] convert cnt32_to_63 to inline

On Tue, Nov 11, 2008 at 01:28:00PM -0500, Mathieu Desnoyers wrote:
> Let's see what it looks like once implemented. Only compile-tested. Assumes
> pxa, sa110 and mn10300 are all UP-only. Correct smp_rmb() calls are used for
> ARM Versatile.

Versatile is also UP only.

The following are results from PXA built with gcc 3.4.3:

1. two additional registers used in sched_clock()
2. 8 additional bytes of code (which would be needless if gcc were more intelligent)

both of these I put down to inefficiencies in gcc's register allocation.

3. worse instruction scheduling - two inter-dependent loads next to each
other causing a pipeline stall

Actual reading of variables/hardware is unaffected by this patch.

Old code:

c: e59f3050 ldr r3, [pc, #80] ; load address of oscr2ns_scale
10: e59fc050 ldr ip, [pc, #80] ; load address of __m_cnt_hi
14: e5932000 ldr r2, [r3] ; read oscr2ns_scale
18: e59f304c ldr r3, [pc, #76] ; load address of OSCR
1c: e59c1000 ldr r1, [ip] ; read __m_cnt_hi
20: e1a07002 mov r7, r2
24: e3a08000 mov r8, #0 ; 0x0
28: e5933000 ldr r3, [r3] ; read OSCR register
...
58: e1820b04 orr r0, r2, r4, lsl #22
5c: e1a01524 lsr r1, r4, #10
60: e89da9f0 ldm sp, {r4, r5, r6, r7, r8, fp, sp, pc}


New code:

c: e59f0058 ldr r0, [pc, #88] ; load address of oscr2ns_scale
10: e5901000 ldr r1, [r0] ; read oscr2ns_scale <= pipeline stall
14: e59f3054 ldr r3, [pc, #84] ; load address of __m_cnt_hi
18: e1a08001 mov r8, r1
1c: e5932000 ldr r2, [r3] ; read __m_cnt_hi
20: e59f304c ldr r3, [pc, #76] ; load address of OSCR
24: e1a09002 mov r9, r2
28: e3a0a000 mov sl, #0 ; 0x0
2c: e5933000 ldr r3, [r3] ; read OSCR
...
58: e1825b04 orr r5, r2, r4, lsl #22
5c: e1a06524 lsr r6, r4, #10
60: e1a01006 mov r1, r6
64: e1a00005 mov r0, r5
68: e89daff0 ldm sp, {r4, r5, r6, r7, r8, r9, sl, fp, sp, pc}

Versatile:

1. 12 additional bytes of code
2. same number of registers
3. worse instruction scheduling causing pipeline stall

Actual reading of variables/hardware is unaffected by this patch.

So, we have two platforms where this patch makes things visibly worse
with no material benefit.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:

2008-11-11 20:11:46

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH] convert cnt32_to_63 to inline

* Russell King ([email protected]) wrote:
> On Tue, Nov 11, 2008 at 01:28:00PM -0500, Mathieu Desnoyers wrote:
> > Let's see what it looks like once implemented. Only compile-tested. Assumes
> > pxa, sa110 and mn10300 are all UP-only. Correct smp_rmb() calls are used for
> > ARM Versatile.
>
> Versatile is also UP only.
>
> The following are results from PXA built with gcc 3.4.3:
>
> 1. two additional registers used in sched_clock()
> 2. 8 additional bytes of code (which would be needless if gcc were more intelligent)
>
> both of these I put down to inefficiencies in gcc's register allocation.
>
> 3. worse instruction scheduling - two inter-dependent loads next to each
> other causing a pipeline stall
>
> Actual reading of variables/hardware is unaffected by this patch.
>
> Old code:
>
> c: e59f3050 ldr r3, [pc, #80] ; load address of oscr2ns_scale
> 10: e59fc050 ldr ip, [pc, #80] ; load address of __m_cnt_hi
> 14: e5932000 ldr r2, [r3] ; read oscr2ns_scale
> 18: e59f304c ldr r3, [pc, #76] ; load address of OSCR
> 1c: e59c1000 ldr r1, [ip] ; read __m_cnt_hi
> 20: e1a07002 mov r7, r2
> 24: e3a08000 mov r8, #0 ; 0x0
> 28: e5933000 ldr r3, [r3] ; read OSCR register
> ...
> 58: e1820b04 orr r0, r2, r4, lsl #22
> 5c: e1a01524 lsr r1, r4, #10
> 60: e89da9f0 ldm sp, {r4, r5, r6, r7, r8, fp, sp, pc}
>
>
> New code:
>
> c: e59f0058 ldr r0, [pc, #88] ; load address of oscr2ns_scale
> 10: e5901000 ldr r1, [r0] ; read oscr2ns_scale <= pipeline stall
> 14: e59f3054 ldr r3, [pc, #84] ; load address of __m_cnt_hi
> 18: e1a08001 mov r8, r1
> 1c: e5932000 ldr r2, [r3] ; read __m_cnt_hi
> 20: e59f304c ldr r3, [pc, #76] ; load address of OSCR
> 24: e1a09002 mov r9, r2
> 28: e3a0a000 mov sl, #0 ; 0x0
> 2c: e5933000 ldr r3, [r3] ; read OSCR
> ...
> 58: e1825b04 orr r5, r2, r4, lsl #22
> 5c: e1a06524 lsr r6, r4, #10
> 60: e1a01006 mov r1, r6
> 64: e1a00005 mov r0, r5
> 68: e89daff0 ldm sp, {r4, r5, r6, r7, r8, r9, sl, fp, sp, pc}
>
> Versatile:
>
> 1. 12 additional bytes of code
> 2. same number of registers
> 3. worse instruction scheduling causing pipeline stall
>
> Actual reading of variables/hardware is unaffected by this patch.
>
> So, we have two platforms where this patch makes things visibly worse
> with no material benefit.
>

I think the added barrier() calls are causing these pipeline stalls. They
don't allow the compiler to read variables such as oscr2ns_scale before
the barrier, because gcc cannot assume they won't be modified. However, to
ensure that the OSCR read is done after the __m_cnt_hi read, this barrier
seems required to be safe against gcc optimizations.

Have you compared my patch to Nicolas' patch, which adds a smp_rmb() in
the macro, or to a vanilla tree?

I wonder if reading those values sooner would help gcc?
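
(For reference, barrier() is defined in the kernel's compiler headers as a
pure memory clobber, which is why it is indiscriminate; a minimal sketch of
its effect on this code path, with OSCR standing for the mmio read:)

#define barrier() __asm__ __volatile__("": : :"memory")

	hi = __m_cnt_hi;	/* load must be emitted before the clobber */
	barrier();		/* gcc discards every memory value it has
				 * cached here, so later loads such as
				 * oscr2ns_scale cannot be hoisted above
				 * this point and get re-issued after it */
	lo = OSCR;		/* guaranteed to be read after __m_cnt_hi */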

Mathieu


--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-11-11 21:01:26

by Nicolas Pitre

[permalink] [raw]
Subject: Re: [PATCH] convert cnt32_to_63 to inline

On Tue, 11 Nov 2008, Mathieu Desnoyers wrote:

> * Nicolas Pitre ([email protected]) wrote:
> > That has nothing to do with "a macro is faster" at all. It's all
> > about the order used to evaluate provided arguments. And the first one
> > might be anything like a memory value, an IO operation, an expression,
> > etc. An inline function would work correctly with pointers only and
> > therefore totally break apart on x86 for example.
> >
> >
> > Nicolas
>
> Let's see what it gives once implemented. Only compile-tested. Assumes
> pxa, sa110 and mn10300 are all UP-only. Correct smp_rmb() are used for
> arm versatile.
>
> Turn cnt32_to_63 into an inline function.
> Change all callers to the new API.
> Document barrier usage.

Look, I'm not interested at all in this mess.

The _WHOLE_ point of the cnt32_to_63 macro was to abstract and
encapsulate the subtlety of the algorithm. It initially started as an
open-coded implementation in PXA's sched_clock(). Then I was asked to
make it a macro that could be reused for other ARM platforms. Then David
wanted to reuse it on platforms other than ARM.

Now you are simply destroying all the value of having that macro in the
first place. The argument is that the macro could be misused because it
has a static variable inside it, etc. etc. The solution: spread the
subtlety all around instead of keeping it in the macro, and risk having
it go wrong or get broken by future changes surrounding it. And it
_will_ happen, because the increased exposure makes the whole idea even
more fragile compared to having it concentrated in only one spot. This
is total nonsense and I can't believe you truly think your patch makes
things better...

You're even disabling preemption where it is really unneeded, roughly
doubling the whole thing's initial cost. Look at the generated assembly
and count the cycles if you don't believe me!

No thank you. If this trend continues I'm going to make it back private
to ARM again so you can pessimize your own code as much as you want.

NACK.


Nicolas

2008-11-11 21:15:21

by Russell King

[permalink] [raw]
Subject: Re: [PATCH] convert cnt32_to_63 to inline

On Tue, Nov 11, 2008 at 04:00:46PM -0500, Nicolas Pitre wrote:
> No thank you. If this trend continues I'm going to make it back private
> to ARM again so you could pessimize your own code as much as you want.

As I've already stated several days ago, I think that's the right
course of action. Given all the concerns raised, it's clearly not
something that should have been allowed to become generic.

So, let's just close this discussion off by taking that course of
action.

What's required is (in order):
1. a local copy for MN10300 needs to be created and that code converted to it
2. these two commits then need to be reverted:

bc173c5789e1fc6065fd378edc815914b40ee86b
b4f151ff899362fec952c45d166252c9912c041f

Then our usage is again limited to sched_clock(), which is well understood
and known to be problem free.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:

2008-11-11 21:53:35

by Russell King

[permalink] [raw]
Subject: Re: [PATCH] convert cnt32_to_63 to inline

On Tue, Nov 11, 2008 at 03:11:30PM -0500, Mathieu Desnoyers wrote:
> I think the added barrier() calls are causing these pipeline stalls. They
> don't allow the compiler to read variables such as oscr2ns_scale before
> the barrier, because gcc cannot assume they won't be modified. However, to
> ensure that the OSCR read is done after the __m_cnt_hi read, this barrier
> seems required to be safe against gcc optimizations.
>
> Have you compared my patch to Nicolas' patch, which adds a smp_rmb() in
> the macro, or to a vanilla tree?

Nicolas' patch compared to unmodified - there are fewer side effects,
which come down to two pipeline stalls, whereas we had none with
the unmodified code.

One pipeline stall for loading the address of __m_cnt_hi and reading
its value, followed by the same thing for oscr2ns_scale.

I think this is showing the problem of compiler barriers - they are
indiscriminate. They are total and complete barriers - not only do
they act on the data, but also on the compiler's ability to emit code
for generating the addresses of the data to be loaded.

Clearly, none of the addresses of OSCR, __m_cnt_hi and oscr2ns_scale is
ever going to change at run time - they are all stored in the literal
pool - but by putting compiler barriers in, the compiler is being
prevented from reading from the literal pool at the most appropriate
point.

So, I've tried this:

unsigned long long sched_clock(void)
{
+ unsigned long *oscr2ns_ptr = &oscr2ns_scale;
unsigned long long v = cnt32_to_63(OSCR);
- return (v * oscr2ns_scale) >> OSCR2NS_SCALE_FACTOR;
+ return (v * *oscr2ns_ptr) >> OSCR2NS_SCALE_FACTOR;
}

to try to explicitly code the loads. This unfortunately results in
three pipeline stalls. I also tried swapping the two lines starting
with 'unsigned long', without any improvement over not having those
extra hacks to work around the barrier.

So, let's summarise this:

1. the existing code works, is correct on ARM, and is efficient.
2. throwing barriers into the function makes it less efficient.
3. re-engineering the code appears to make things worse.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:

2008-11-11 22:33:12

by David Howells

[permalink] [raw]
Subject: Re: [PATCH] convert cnt32_to_63 to inline

Mathieu Desnoyers <[email protected]> wrote:

> @@ -52,18 +57,22 @@ unsigned long long sched_clock(void)
> ...
> + preempt_disable_notrace();

Please, no! sched_clock() is called with preemption or interrupts disabled
everywhere except from some debugging code (lock tracing IIRC). If you need
to insert this preemption disablement somewhere, please insert it there. At
least then sched_clock() will be called consistently.

David

2008-11-11 22:38:29

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] convert cnt32_to_63 to inline

On Tue, 2008-11-11 at 22:31 +0000, David Howells wrote:
> Mathieu Desnoyers <[email protected]> wrote:
>
> > @@ -52,18 +57,22 @@ unsigned long long sched_clock(void)
> > ...
> > + preempt_disable_notrace();
>
> Please, no! sched_clock() is called with preemption or interrupts disabled
> everywhere except from some debugging code (lock tracing IIRC). If you need
> to insert this preemption disablement somewhere, please insert it there. At
> least then sched_clock() will be called consistently.

Agreed. You could do a WARN_ON(!in_atomic()); in sched_clock() depending
on DEBUG_PREEMPT or something to ensure this.
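
(A sketch of what such a debug check could look like; the
CONFIG_DEBUG_PREEMPT guard and the irqs_disabled() clause are illustrative
assumptions, not a settled proposal:)

unsigned long long sched_clock(void)
{
#ifdef CONFIG_DEBUG_PREEMPT
	/* Catch callers that enter with preemption and interrupts enabled. */
	WARN_ON_ONCE(!in_atomic() && !irqs_disabled());
#endif
	/* ... existing counter read and ns scaling ... */
}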

2008-11-12 01:13:58

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] convert cnt32_to_63 to inline


On Tue, 11 Nov 2008, Peter Zijlstra wrote:

> On Tue, 2008-11-11 at 22:31 +0000, David Howells wrote:
> > Mathieu Desnoyers <[email protected]> wrote:
> >
> > > @@ -52,18 +57,22 @@ unsigned long long sched_clock(void)
> > > ...
> > > + preempt_disable_notrace();
> >
> > Please, no! sched_clock() is called with preemption or interrupts disabled
> > everywhere except from some debugging code (lock tracing IIRC). If you need
> > to insert this preemption disablement somewhere, please insert it there. At
> > least then sched_clock() will be called consistently.
>
> Agreed. You could do a WARN_ON(!in_atomic()); in sched_clock() depending
> on DEBUG_PREEMPT or something to ensure this.

It would also be nice if this requirement (calling sched_clock with
preemption disabled) were documented somewhere more obvious.

Doing as Peter suggested, adding a WARN_ON and documenting that this must
be called with preemption disabled, would be nice.

-- Steve

2008-11-12 03:53:41

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH] convert cnt32_to_63 to inline

* Russell King ([email protected]) wrote:
> On Tue, Nov 11, 2008 at 03:11:30PM -0500, Mathieu Desnoyers wrote:
> > I think the added barrier() calls are causing these pipeline stalls. They
> > don't allow the compiler to read variables such as oscr2ns_scale before
> > the barrier, because gcc cannot assume they won't be modified. However, to
> > ensure that the OSCR read is done after the __m_cnt_hi read, this barrier
> > seems required to be safe against gcc optimizations.
> >
> > Have you compared my patch to Nicolas' patch, which adds a smp_rmb() in
> > the macro, or to a vanilla tree?
>
> Nicolas' patch compared to unmodified - there are fewer side effects,
> which come down to two pipeline stalls, whereas we had none with
> the unmodified code.
>
> One pipeline stall for loading the address of __m_cnt_hi and reading
> its value, followed by the same thing for oscr2ns_scale.
>
> I think this is showing the problem of compiler barriers - they are
> indiscriminate. They are total and complete barriers - not only do
> they act on the data, but also on the compiler's ability to emit code
> for generating the addresses of the data to be loaded.
>
> Clearly, none of the addresses of OSCR, __m_cnt_hi and oscr2ns_scale is
> ever going to change at run time - they are all stored in the literal
> pool - but by putting compiler barriers in, the compiler is being
> prevented from reading from the literal pool at the most appropriate
> point.
>
> So, I've tried this:
>
> unsigned long long sched_clock(void)
> {
> + unsigned long *oscr2ns_ptr = &oscr2ns_scale;
> unsigned long long v = cnt32_to_63(OSCR);
> - return (v * oscr2ns_scale) >> OSCR2NS_SCALE_FACTOR;
> + return (v * *oscr2ns_ptr) >> OSCR2NS_SCALE_FACTOR;
> }
>
> to try to explicitly code the loads. This unfortunately results in
> three pipeline stalls. I also tried swapping the two lines starting
> with 'unsigned long', without any improvement over not having those
> extra hacks to work around the barrier.
>
> So, let's summarise this:
>
> 1. the existing code works, is correct on ARM, and is efficient.
> 2. throwing barriers into the function makes it less efficient.
> 3. re-engineering the code appears to make things worse.
>

Hrm, if we want to obtain results similar to what gcc currently
generates, casting to (volatile u32), or doing a *(volatile u32 *)
dereference, could probably be used to force gcc to perform specific
memory accesses in order. Those are generally discouraged because they
are not suitable for SMP systems, but we are talking here about
UP-specific optimizations that will end up in UP-only code, so why not?

http://gcc.gnu.org/onlinedocs/gcc-4.0.4/gcc/Volatiles.html

"The minimum either standard specifies is that at a sequence point all
previous accesses to volatile objects have stabilized and no subsequent
accesses have occurred. Thus an implementation is free to reorder and
combine volatile accesses which occur between sequence points, but
cannot do so for accesses across a sequence point."

Therefore, regarding program order, each such access is ensured to be
performed between the sequence points (e.g., semicolons) surrounding it.

So that would be an argument for leaving the variable read in
architecture-specific code, because it heavily depends on the
architecture context (whether it's UP-only or must support SMP, whether
it has unordered memory reads...).
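
(Concretely, a minimal UP-only sketch of the volatile-cast idea, reusing the
three-argument inline form proposed earlier in this thread; OSCR stands for
the mmio counter read and the function name is illustrative:)

static u32 __m_cnt_hi;

static inline u64 up_only_clock_read(void)
{
	u32 hi, lo;

	/*
	 * Both accesses are volatile, so gcc must keep them in program
	 * order; ordinary loads (e.g. a scale factor) may still be
	 * scheduled freely around them, unlike with a full barrier().
	 */
	hi = *(volatile u32 *)&__m_cnt_hi;
	lo = OSCR;		/* mmio accessor, already volatile */

	return cnt32_to_63(hi, lo, &__m_cnt_hi) & 0x7fffffffffffffffULL;
}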

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68