2011-02-14 23:09:13

by Paul E. McKenney

Subject: Re: [PATCH 0/2] jump label: 2.6.38 updates

On Mon, Feb 14, 2011 at 06:03:01PM -0500, Mathieu Desnoyers wrote:
> * Matt Fleming ([email protected]) wrote:
> > On Mon, 14 Feb 2011 13:46:00 -0800 (PST)
> > David Miller <[email protected]> wrote:
> >
> > > From: Steven Rostedt <[email protected]>
> > > Date: Mon, 14 Feb 2011 16:39:36 -0500
> > >
> > > > Thus it is not about global, as global is updated by normal means
> > > > and will update the caches. atomic_t is updated via the ll/sc that
> > > > ignores the cache and causes all this to break down. IOW... broken
> > > > hardware ;)
> > >
> > > I don't see how cache coherency can possibly work if the hardware
> > > behaves this way.
> >
> > Cache coherency is still maintained provided writes/reads both go
> > through the cache ;-)
> >
> > The problem is that for read-modify-write operations the arbitration
> > logic that decides who "wins" and is allowed to actually perform the
> > write, assuming two or more CPUs are competing for a single memory
> > address, is not implemented in the cache controller, I think. I'm not a
> > hardware engineer and I never understood how the arbitration logic
> > worked but I'm guessing that's the reason that the ll/sc instructions
> > bypass the cache.
> >
> > Which is why the atomic_t functions worked out really well for that
> > arch, such that any accesses to an atomic_t * had to go through the
> > wrapper functions.

???

What CPU family are we talking about here? For cache coherent CPUs,
cache coherence really is supposed to work, even for mixed atomic and
non-atomic instructions to the same variable.

Thanx, Paul

> If this is true, then we have bugs in lots of xchg/cmpxchg users (which
> do not reside in atomic.h), e.g.:
>
> fs/fs_struct.c:
>
> int current_umask(void)
> {
>         return current->fs->umask;
> }
> EXPORT_SYMBOL(current_umask);
>
> kernel/sys.c:
>
> SYSCALL_DEFINE1(umask, int, mask)
> {
>         mask = xchg(&current->fs->umask, mask & S_IRWXUGO);
>         return mask;
> }
>
> The solution to this would be to force all xchg/cmpxchg users to switch
> to atomic.h variables, which would force the ll semantic on read. But I'd
> really like to see where this is documented first -- or which PowerPC
> engineer we should talk to.
>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> Operating System Efficiency R&D Consultant
> EfficiOS Inc.
> http://www.efficios.com


2011-02-14 23:29:57

by Mathieu Desnoyers

Subject: Re: [PATCH 0/2] jump label: 2.6.38 updates

[ added Segher Boessenkool and Paul Mackerras to CC list ]

* Paul E. McKenney ([email protected]) wrote:
> On Mon, Feb 14, 2011 at 06:03:01PM -0500, Mathieu Desnoyers wrote:
> > * Matt Fleming ([email protected]) wrote:
> > > On Mon, 14 Feb 2011 13:46:00 -0800 (PST)
> > > David Miller <[email protected]> wrote:
> > >
> > > > From: Steven Rostedt <[email protected]>
> > > > Date: Mon, 14 Feb 2011 16:39:36 -0500
> > > >
> > > > > Thus it is not about global, as global is updated by normal means
> > > > > and will update the caches. atomic_t is updated via the ll/sc that
> > > > > ignores the cache and causes all this to break down. IOW... broken
> > > > > hardware ;)
> > > >
> > > > I don't see how cache coherency can possibly work if the hardware
> > > > behaves this way.
> > >
> > > Cache coherency is still maintained provided writes/reads both go
> > > through the cache ;-)
> > >
> > > The problem is that for read-modify-write operations the arbitration
> > > logic that decides who "wins" and is allowed to actually perform the
> > > write, assuming two or more CPUs are competing for a single memory
> > > address, is not implemented in the cache controller, I think. I'm not a
> > > hardware engineer and I never understood how the arbitration logic
> > > worked but I'm guessing that's the reason that the ll/sc instructions
> > > bypass the cache.
> > >
> > > Which is why the atomic_t functions worked out really well for that
> > > arch, such that any accesses to an atomic_t * had to go through the
> > > wrapper functions.
>
> ???
>
> What CPU family are we talking about here? For cache coherent CPUs,
> cache coherence really is supposed to work, even for mixed atomic and
> non-atomic instructions to the same variable.
>

I'm really curious to know which CPU families too. I've used git blame
to see where these lwz/stw instructions were added to powerpc, and it
points to:

commit 9f0cbea0d8cc47801b853d3c61d0e17475b0cc89
Author: Segher Boessenkool <[email protected]>
Date: Sat Aug 11 10:15:30 2007 +1000

[POWERPC] Implement atomic{, 64}_{read, write}() without volatile

Instead, use asm() like all other atomic operations already do.

Also use inline functions instead of macros; this actually
improves code generation (some code becomes a little smaller,
probably because of improved alias information -- just a few
hundred bytes total on a default kernel build, nothing shocking).

Signed-off-by: Segher Boessenkool <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>

So let's ping the relevant people to see if there was any reason for
making these atomic read/set operations different from other
architectures in the first place.

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2011-02-15 11:53:43

by Will Newton

Subject: Re: [PATCH 0/2] jump label: 2.6.38 updates

On Mon, Feb 14, 2011 at 11:09 PM, Paul E. McKenney
<[email protected]> wrote:

Hi Paul,

> What CPU family are we talking about here? For cache coherent CPUs,
> cache coherence really is supposed to work, even for mixed atomic and
> non-atomic instructions to the same variable.

Is there a specific situation you can think of where this would be a
problem? I have to admit to a certain amount of unease with the design
our hardware guys came up with, but I don't have a specific case where
it won't work, just cases where it is less than optimal.

2011-02-18 19:04:07

by Paul E. McKenney

Subject: Re: [PATCH 0/2] jump label: 2.6.38 updates

On Tue, Feb 15, 2011 at 11:53:37AM +0000, Will Newton wrote:
> On Mon, Feb 14, 2011 at 11:09 PM, Paul E. McKenney
> <[email protected]> wrote:
>
> Hi Paul,
>
> > What CPU family are we talking about here? For cache coherent CPUs,
> > cache coherence really is supposed to work, even for mixed atomic and
> > non-atomic instructions to the same variable.
>
> Is there a specific situation you can think of where this would be a
> problem? I have to admit to a certain amount of unease with the design
> our hardware guys came up with, but I don't have a specific case where
> it won't work, just cases where it is less than optimal.

OK, you did ask...

One case is when a given block of memory was subject to atomic
instructions, then was freed and reallocated as a structure used by
normal instructions. It would be quite bad if the last pre-free atomic
operation failed to play nice with the first post-allocate non-atomic
instruction. The reverse situation is of course important as well,
where a block subject to non-atomic instructions is freed and reallocated
as a structure subject to atomic instructions. I would guess you would
handle these cases by making the memory allocator deal with any hardware
caching issues, but however it is handled, it does need to be handled.
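
For concreteness, here is a minimal userspace sketch of that first case,
with GCC/Clang __atomic builtins standing in for the kernel's atomic ops;
the structures and names are made up for illustration:

#include <stdlib.h>

struct refcounted {
        int refs;               /* accessed with atomic RMW operations */
        char payload[56];
};

struct plain {
        long a, b;              /* accessed with ordinary loads and stores */
};

static void put_ref(struct refcounted *p)
{
        /* The last pre-free access is an atomic decrement... */
        if (__atomic_sub_fetch(&p->refs, 1, __ATOMIC_SEQ_CST) == 0)
                free(p);
}

int main(void)
{
        struct refcounted *r = calloc(1, sizeof(*r));

        r->refs = 1;
        put_ref(r);

        /*
         * ...and the allocator may hand the same bytes straight back,
         * where the first post-allocate access is a plain store.  If
         * atomic operations bypassed the cache, the allocator would
         * have to flush or invalidate here for this to be safe.
         */
        struct plain *q = malloc(sizeof(*q));

        q->a = 42;
        free(q);
        return 0;
}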

Another case is a leaky-bucket token protocol, where there is a rate
limit of some sort. There is an integer that is positive when progress
is permitted, and negative otherwise. This integer is periodically reset
to its upper limit, and this reset operation can use a non-atomic store.
When attempting to carry out a rate-limited operation, you can use
atomic_add_return() if underflow cannot happen, but you must use
cmpxchg() if underflow is a possibility. Now you -could- use atomic
xchg() to reset the integer, but you don't have to. You -could- also
use atomic_cmpxchg() to check and atomic_set() to reset the limit, but
again, you don't have to. And there might well be places in the Linux
kernel that mix atomic and non-atomic operations in this case.
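
Here is a rough userspace rendering of that pattern, with __atomic
builtins standing in for atomic_add_return() and cmpxchg(); the bucket
itself is hypothetical:

#include <stdio.h>

#define BUCKET_LIMIT 5

static int tokens = BUCKET_LIMIT;       /* positive: progress permitted */

/* Periodic reset: a plain, non-atomic store. */
static void bucket_refill(void)
{
        tokens = BUCKET_LIMIT;
}

/* Underflow harmless: an atomic decrement-and-test suffices. */
static int bucket_take_fast(void)
{
        return __atomic_sub_fetch(&tokens, 1, __ATOMIC_SEQ_CST) >= 0;
}

/* Underflow must be prevented: loop with compare-and-exchange. */
static int bucket_take_checked(void)
{
        int old = __atomic_load_n(&tokens, __ATOMIC_SEQ_CST);

        while (old > 0) {
                if (__atomic_compare_exchange_n(&tokens, &old, old - 1, 0,
                                                __ATOMIC_SEQ_CST,
                                                __ATOMIC_SEQ_CST))
                        return 1;       /* got a token */
        }
        return 0;                       /* bucket empty */
}

int main(void)
{
        for (int i = 0; i < 7; i++)
                printf("take %d -> %d\n", i, bucket_take_checked());
        bucket_refill();
        printf("fast take after refill: %d\n", bucket_take_fast());
        return 0;
}

The point is that bucket_refill()'s plain store and the atomic
read-modify-writes all target the same integer, which is exactly the
mixing that coherent hardware is expected to get right.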

Yet another case is a variation on the lockless queue that can have
concurrent enqueues but where only one task may dequeue at a time, for
example, dequeuing might be guarded by a lock. Suppose that a dequeue
removes all the elements on the queue in one shot. Such a queue might
have a head and tail pointer, where the tail pointer references the
->next pointer of the last element, or references the head pointer if
the queue is empty. Each element also has a flag that indicates whether
it is a normal element or a dummy element.

Enqueues are handled in the normal way for this sort of queue:

1. Initialize the element to be added, including NULLing out
its ->next pointer.

2. Atomically exchange the queue's tail pointer with a pointer
to the element's ->next pointer, placing the old tail pointer
into a local variable (call it "oldtail").

3. Nonatomically set the pointer referenced by oldtail to
point to the newly added element.
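
In code, the enqueue side might look like the following userspace
sketch, with __atomic_exchange_n() standing in for xchg() and
illustrative type names:

#include <stddef.h>

struct node {
        struct node *next;
        int is_dummy;           /* normal vs. dummy element flag */
};

struct queue {
        struct node *head;
        struct node **tail;     /* -> last element's ->next field, or
                                 * -> head when the queue is empty */
};

static void enqueue(struct queue *q, struct node *n)
{
        struct node **oldtail;

        n->next = NULL;         /* step 1: initialize, NULL the ->next */

        /* Step 2: atomically swing the tail to this element's ->next. */
        oldtail = __atomic_exchange_n(&q->tail, &n->next, __ATOMIC_SEQ_CST);

        /* Step 3: non-atomically link the predecessor to us. */
        *oldtail = n;
}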

Then a bulk dequeue could be written as follows:

1. Pick up the head pointer, placing it in a local variable
(call it "oldhead"). If NULL, return an empty list, otherwise
continue through the following steps.

2. Store NULL into the head pointer. This can be done nonatomically,
because no one else will be concurrently storing into this
pointer -- there is at least one element on the list, and so
the enqueuers will be instead storing to the ->next pointer
of the last element.

3. Atomically exchange the queue's tail pointer with a pointer
to the queue's head pointer, placing the old value of the
tail pointer into a local variable (again, call it "oldtail").

4. Return a list with oldhead as the head pointer and oldtail
as the tail pointer.

The caller cannot rely on NULL pointers to find the end of the
list, as an enqueuer might be delayed between steps 2 and 3.
Instead, the caller must check to see if the address of the NULL
pointer is equal to oldtail, in which case, the caller has in
fact reached the end of the list. Otherwise, the caller must
wait for the pointer to become non-NULL.
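
Continuing that sketch, the bulk dequeue and the traversal rule could
be rendered as follows (again illustrative, and the dequeuer is assumed
to hold whatever lock serializes dequeues):

#include <stdio.h>

struct list_snapshot {
        struct node *head;
        struct node **tail;
};

static struct list_snapshot dequeue_all(struct queue *q)
{
        struct list_snapshot s;

        /* Step 1: snapshot the head; NULL means the queue is empty. */
        s.head = q->head;
        if (!s.head) {
                s.tail = NULL;
                return s;
        }

        /*
         * Step 2: plain store.  The queue is non-empty, so concurrent
         * enqueuers are storing through the last element's ->next,
         * never through ->head.
         */
        q->head = NULL;

        /* Step 3: atomically point the tail back at ->head. */
        s.tail = __atomic_exchange_n(&q->tail, &q->head, __ATOMIC_SEQ_CST);
        return s;
}

static void for_each_node(struct list_snapshot *s, void (*fn)(struct node *))
{
        struct node **pp = &s->head;

        if (!s->head)
                return;
        while (pp != s->tail) { /* the tail's address marks the true end */
                struct node *n;

                /* Wait out an enqueuer delayed between its steps 2 and 3. */
                while (!(n = __atomic_load_n(pp, __ATOMIC_SEQ_CST)))
                        ;       /* spin; real code would cpu_relax() */
                fn(n);
                pp = &n->next;
        }
}

static void print_node(struct node *n)
{
        printf("visited %p\n", (void *)n);
}

int main(void)
{
        struct queue q = { .head = NULL, .tail = &q.head };
        struct node a = { 0 }, b = { 0 };

        enqueue(&q, &a);
        enqueue(&q, &b);

        struct list_snapshot s = dequeue_all(&q);

        for_each_node(&s, print_node);  /* visits a, then b */
        return 0;
}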

Yes, you can replace the non-atomic loads and stores in the enqueuer's
step #3 and in the bulk dequeue's steps #1 and #2 with atomic exchange
instructions -- in fact you can replace either or both. And you could
also require that the caller use atomic instructions when looking at
each element's ->next pointer.

There are other algorithms, but this should be a decent start.

And yes, you -can- make these algorithms use only atomic instructions, but
you don't -have- to. So it is quite likely that similar algorithms exist
somewhere in the 10+ million lines of code making up the Linux kernel.

Thanx, Paul