2006-12-06 16:44:04

by David Howells

[permalink] [raw]
Subject: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

From: David Howells <[email protected]>

Implement generic UP cmpxchg() where an arch doesn't otherwise support it.
This assuming that the arch doesn't have support SMP without providing its own
cmpxchg() implementation.

This is required because cmpxchg() is used by the reduced work queue patches to
adjust the management data in a work_struct.


Also provide ARMv6 with a cmpxchg() implementation using LDREX/STREXEQ. Pre-v6
ARM doesn't support SMP according to ARM's atomic.h, so the generic
IRQ-disablement based cmpxchg() is entirely adequate there (if it isn't, then
atomic_cmpxchg() is also broken on ARM).


Furthermore, if the generic IRQ-disablement cmpxchg() is being used, then a
preprocessor symbol is defined (ARCH_USING_IRQ_BASED_CMPXCHG) so that code
using it can make alternative arrangements where such would be more efficient.

__queue_work() in kernel/workqueue.c then makes use of this as it disables
interrupts before attempting to modify the work queue pointer in the work item.

With the reduction patches, cmpxchg() is called indirectly from within
__queue_work(), and so it's more efficient to use __queue_work()'s interrupt
disablement and just make the desired change directly if the aforementioned
ARCH_USING_IRQ_BASED_CMPXCHG is set rather than using cmpxchg() at all.

Unfortunately, the compiler's optimiser can't discard excessive interrupt
disablement, so optimisation here has to be done manually, if at all.

Signed-Off-By: David Howells <[email protected]>
---

include/asm-arm/system.h | 40 ++++++++++++++++++++++++++++++++++
include/asm-generic/cmpxchg.h | 48 +++++++++++++++++++++++++++++++++++++++++
kernel/workqueue.c | 37 +++++++++++++++++++++++++++++++-
3 files changed, 124 insertions(+), 1 deletions(-)

diff --git a/include/asm-arm/system.h b/include/asm-arm/system.h
index f05fbe3..f16e42d 100644
--- a/include/asm-arm/system.h
+++ b/include/asm-arm/system.h
@@ -325,6 +325,46 @@ #endif
extern void disable_hlt(void);
extern void enable_hlt(void);

+/*
+ * We only implement cmpxchg in ASM on ARMv6 where we have LDREX/STREX
+ * available, and we only implement it for word-sized exchanges
+ */
+#if __LINUX_ARM_ARCH__ >= 6
+extern void __bad_cmpxchg(volatile void *, int);
+
+#define cmpxchg(ptr, old, new) \
+({ \
+ __typeof__ (ptr) ____p = (ptr); \
+ __typeof__(*ptr) ____old = (old); \
+ __typeof__(*ptr) ____new = (new); \
+ __typeof__(*ptr) ____oldval; \
+ __typeof__(*ptr) ____res; \
+ \
+ switch (sizeof(____res)) { \
+ case 4: \
+ do { \
+ __asm__ __volatile__("@ cmpxchg\n" \
+ "ldrex %1, [%2]\n" \
+ "mov %0, #0\n" \
+ "teq %1, %3\n" \
+ "strexeq %0, %4, [%2]\n" \
+ : "=&r" (____res), "=&r" (____oldval) \
+ : "r" (____p), "Ir" (____old), "r" (____new) \
+ : "cc"); \
+ } while(____res); \
+ break; \
+ default: \
+ __bad_cmpxchg(____p, sizeof(____res)); \
+ ____oldval = 0; \
+ break; \
+ } \
+ ____oldval; \
+})
+
+#else
+#include <asm-generic/cmpxchg.h>
+#endif
+
#endif /* __ASSEMBLY__ */

#define arch_align_stack(x) (x)
diff --git a/include/asm-generic/cmpxchg.h b/include/asm-generic/cmpxchg.h
new file mode 100644
index 0000000..4316092
--- /dev/null
+++ b/include/asm-generic/cmpxchg.h
@@ -0,0 +1,48 @@
+/* Generic cmpxchg for those arches that don't implement it themselves
+ *
+ * Copyright (C) 2006 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#ifndef _ASM_GENERIC_CMPXCHG_H
+#define _ASM_GENERIC_CMPXCHG_H
+
+#if !defined(cmpxchg) && !defined(CONFIG_SMP)
+
+/**
+ * cmpxchg - Atomically conditionally exchange one value for another.
+ * @ptr - Pointer to the value to be altered.
+ * @old - The value to change from.
+ * @new - The value to change to.
+ *
+ * This function atomically compares the current value at the word pointed to
+ * by @ptr, and if it's the same as @old, changes it to @new. If it's not the
+ * same then it's left unchanged.
+ *
+ * The value that was in the word pointed to by @ptr is returned, whether or
+ * not it was changed to @new.
+ */
+#define cmpxchg(ptr, old, new) \
+({ \
+ unsigned long ____flags; \
+ __typeof__ (ptr) ____p = (ptr); \
+ __typeof__(*ptr) ____old = (old); \
+ __typeof__(*ptr) ____new = (new); \
+ __typeof__(*ptr) ____res; \
+ raw_local_irq_save(____flags); \
+ ____res = *____p; \
+ if (likely(____res == (____old))) \
+ *____p = (____new); \
+ raw_local_irq_restore(____flags); \
+ ____res; \
+})
+
+#define ARCH_USING_IRQ_BASED_CMPXCHG
+
+#endif /* !cmpxchg && !SMP */
+#endif /* _ASM_GENERIC_CMPXCHG_H */
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 8d1e7cb..fdf0c30 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -80,6 +80,11 @@ static inline int is_single_threaded(str
return list_empty(&wq->list);
}

+/*
+ * set the work queue pointer in the work item's management data
+ * - must retain the NAR flag, but the pending flag can just be set
+ */
+#ifndef ARCH_USING_IRQ_BASED_CMPXCHG
static inline void set_wq_data(struct work_struct *work, void *wq)
{
unsigned long new, old, res;
@@ -98,6 +103,36 @@ static inline void set_wq_data(struct wo
}
}

+#define __set_wq_data(w, wq) set_wq_data((w), (wq))
+
+#else
+/*
+ * set the work queue pointer when the caller guarantees atomicity by disabling
+ * interrupts on a UP box (where there's no CMPXCHG equivalent)
+ */
+static inline void __set_wq_data(struct work_struct *work, void *wq)
+{
+ unsigned long tmp;
+
+ tmp = work->management & WORK_STRUCT_FLAG_MASK;
+ tmp |= (unsigned long) wq;
+ tmp |= 1UL << WORK_STRUCT_PENDING;
+ work->management = tmp;
+}
+
+static inline void set_wq_data(struct work_struct *work, void *wq)
+{
+ unsigned long flags;
+
+ local_irq_save(flags);
+ __set_wq_data(work, wq);
+ local_irq_restore(flags);
+}
+#endif
+
+/*
+ * get the workqueue data from the work item's management data
+ */
static inline void *get_wq_data(struct work_struct *work)
{
return (void *) (work->management & WORK_STRUCT_WQ_DATA_MASK);
@@ -110,7 +145,7 @@ static void __queue_work(struct cpu_work
unsigned long flags;

spin_lock_irqsave(&cwq->lock, flags);
- set_wq_data(work, cwq);
+ __set_wq_data(work, cwq);
list_add_tail(&work->entry, &cwq->worklist);
cwq->insert_sequence++;
wake_up(&cwq->more_work);


2006-12-06 17:23:12

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it



On Wed, 6 Dec 2006, David Howells wrote:
>
> Implement generic UP cmpxchg() where an arch doesn't otherwise support it.
> This assuming that the arch doesn't have support SMP without providing its own
> cmpxchg() implementation.

This is too ugly to live. At least the kernel/workqueue.c part.

The requirement that everybody implement a workable cmpxchg (and falling
back to spinlocks if required as per the atomic bit operations) looks ok,
but don't show that in generic code.

Linus

2006-12-06 18:56:50

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

I'd really appreciate a cmpxchg that is generically available for
all arches. It will allow lockless implementation for various performance
criticial portions of the kernel.


2006-12-06 19:00:36

by Russell King

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

On Wed, Dec 06, 2006 at 10:56:14AM -0800, Christoph Lameter wrote:
> I'd really appreciate a cmpxchg that is generically available for
> all arches. It will allow lockless implementation for various performance
> criticial portions of the kernel.

Let's recap on cmpxchg.

For CPUs with no atomic operation other than SWP, it is not lockless.
For CPUs with load locked + store conditional, it is expensive.

I've shown in the past that a generic implementation based around ll/sc
can be cheaply implemented using cmpxchg. The reverse is *not* true.

If you want an operation for performance critical portions of the
kernel, please allow architecture maintainers the freedom to use their
best performance enhancements.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:

2006-12-06 19:06:28

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it



On Wed, 6 Dec 2006, Christoph Lameter wrote:
>
> I'd really appreciate a cmpxchg that is generically available for
> all arches. It will allow lockless implementation for various performance
> criticial portions of the kernel.

I suspect ARM may have been the last one without one, no?

That said, cmpxchg won't necessarily be "high-performance" unless the hw
supports it naturally in hardware, so..

Russell, are you ok with the code DavidH posted (the "try 2" one)? I'd
like to get an ack from the ARM maintainer before applying it, but it
looked ok.

Linus

2006-12-06 19:08:37

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

On Wed, Dec 06, 2006 at 11:05:22AM -0800, Linus Torvalds wrote:
>
>
> On Wed, 6 Dec 2006, Christoph Lameter wrote:
> >
> > I'd really appreciate a cmpxchg that is generically available for
> > all arches. It will allow lockless implementation for various performance
> > criticial portions of the kernel.
>
> I suspect ARM may have been the last one without one, no?

No. sparc32 doesn't have one, for instance.

2006-12-06 19:12:59

by Lennert Buytenhek

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

On Wed, Dec 06, 2006 at 04:43:14PM +0000, David Howells wrote:

> Pre-v6 ARM doesn't support SMP according to ARM's atomic.h,

That's not quite true, there exist ARMv5 processors that in theory
can support SMP.

2006-12-06 19:17:19

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

On Wed, 6 Dec 2006, Russell King wrote:

> On Wed, Dec 06, 2006 at 10:56:14AM -0800, Christoph Lameter wrote:
> > I'd really appreciate a cmpxchg that is generically available for
> > all arches. It will allow lockless implementation for various performance
> > criticial portions of the kernel.
>
> Let's recap on cmpxchg.
>
> For CPUs with no atomic operation other than SWP, it is not lockless.

But then its also just requires disable/enable interrupts on UP which may
be cheaper than an atomic operation.

> For CPUs with load locked + store conditional, it is expensive.

Because it locks the bus? I am not that familiar with those architectures
but it seems that those will have a general problem anyways.

> If you want an operation for performance critical portions of the
> kernel, please allow architecture maintainers the freedom to use their
> best performance enhancements.

And thereby denying the kernel developers to use a simple atomic SMP
operation? Adding additional defines for each arch and each performance
critical piece of kernel logic?

2006-12-06 19:26:51

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

On Wed, Dec 06, 2006 at 11:05:22AM -0800, Linus Torvalds wrote:
> On Wed, 6 Dec 2006, Christoph Lameter wrote:
> >
> > I'd really appreciate a cmpxchg that is generically available for
> > all arches. It will allow lockless implementation for various performance
> > criticial portions of the kernel.
>
> I suspect ARM may have been the last one without one, no?

It's just been pointed out to me that the parisc one isn't safe.

<dhowells> imagine variable X is set to 3
<dhowells> CPU A issues cmpxchg(&X, 3, 5)
<dhowells> you'd expect that to change X to 5
<dhowells> but what if CPU B assigns 6 to X between cmpxchg reading X
and it setting X?

Given parisc's paucity of atomic operations (load-and-zero-32bit and
load-and-zero-64bit), cmpxchg() is impossible to implement safely.
There has to be something we can hook to exclude another processor
modifying the variable. I'm OK with using atomic_cmpxchg(); we have
atomic_set locked against it.

Of course, using cmpxchg() isn't really lockless. It's just hidden
locking.

2006-12-06 19:26:36

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it



On Wed, 6 Dec 2006, Al Viro wrote:
>
> No. sparc32 doesn't have one, for instance.

Ok. For SMP-safety, it's important that any architecture that can't do
this needs to _share_ the same spinlock (on SMP only, of course) that it
uses for the bitops.

It would be good (but perhaps not as strict a requirement) if the atomic
counters also use the same lock. But that is probably impossible on
sparc32 (since it has a per-counter "lock"-like thing, iirc). So doing a
cmpxchg() on an atomic_t would be a bug.

Linus

2006-12-06 19:29:42

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

On Wed, Dec 06, 2006 at 11:25:35AM -0800, Linus Torvalds wrote:
> Ok. For SMP-safety, it's important that any architecture that can't do
> this needs to _share_ the same spinlock (on SMP only, of course) that it
> uses for the bitops.

That doesn't help, since assignment can't be guarded by any lock.

> It would be good (but perhaps not as strict a requirement) if the atomic
> counters also use the same lock. But that is probably impossible on
> sparc32 (since it has a per-counter "lock"-like thing, iirc). So doing a
> cmpxchg() on an atomic_t would be a bug.

sparc32 switched over to the parisc way of doing things, so they could
expand their atomic_t to a full 32 bits. They still have the old
atomic_24_t lying around for their arch-private use, but atomic_t uses a
hashed spinlock.

2006-12-06 19:29:10

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it



On Wed, 6 Dec 2006, Christoph Lameter wrote:
>
> > For CPUs with load locked + store conditional, it is expensive.
>
> Because it locks the bus? I am not that familiar with those architectures
> but it seems that those will have a general problem anyways.

load_locked + store_conditional should _not_ be any more expensive than
any atomic sequence will always be.

Atomic sequences in SMP are obviously never "cheap". But cmpxchg shouldn't
be any more expensive than any other atomic sequence if you have
load-locked and store-conditional.

There are obviously _implementation_ bugs. The early alpha's had such an
atrocious ldl/stc that it wasn't even funny. That might be true in other
implementations too, but it's definitely not cmpxchg-specific if so. It
will affect _any_ atomic ops on such an architecture (atomic_inc() and
friends)

Linus

2006-12-06 19:29:58

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

On Wed, 6 Dec 2006, Matthew Wilcox wrote:

> It's just been pointed out to me that the parisc one isn't safe.
>
> <dhowells> imagine variable X is set to 3
> <dhowells> CPU A issues cmpxchg(&X, 3, 5)
> <dhowells> you'd expect that to change X to 5
> <dhowells> but what if CPU B assigns 6 to X between cmpxchg reading X
> and it setting X?

The same could happen with a regular cmpxchg. Cmpxchg changes it to 5 and
then other cpu performs a store before the next instruction.

2006-12-06 19:35:49

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it



On Wed, 6 Dec 2006, Matthew Wilcox wrote:
>
> Given parisc's paucity of atomic operations (load-and-zero-32bit and
> load-and-zero-64bit), cmpxchg() is impossible to implement safely.
> There has to be something we can hook to exclude another processor
> modifying the variable. I'm OK with using atomic_cmpxchg(); we have
> atomic_set locked against it.

How do you to the atomic bitops?

Also, I don't see quite why you think "cmpxchg()" and "atomic_cmpxchg()"
would be different. ANY cmpxchg() needs to be atomic - if it's not,
there's no point to the operation at all, since you'd just write it as

if (*p == x)
*p = y;

instead, and not use cmpxchg().

So yes, architectures without native support (where "native" includes
load-locked + store-conditional) always need to

- on UP, just disable interrupts
- on SMP, use a spinlock (with interrupts disabled), and share that
spinlock with all the other atomic ops (bitops at a minimum - the
"atomic_t" operations have traditionally been in another "locking
space" because of sparc32 historic braindamage, but I'd suggest
sharing the same spinlock between them all).

And yeah, it sucks. You _can_ (if you really want to) make the spinlock be
hashed based on the address of the atomic data structure. That at least
allows you to do _multiple_ spinlocks, but let's face it, your real
problem is _likely_ going to be cacheline bouncing, not contention, and
then using a hashed lock won't be likely to buy you all that much.

Linus

2006-12-06 19:36:44

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

On Wed, Dec 06, 2006 at 11:29:42AM -0800, Christoph Lameter wrote:
> On Wed, 6 Dec 2006, Matthew Wilcox wrote:
>
> > It's just been pointed out to me that the parisc one isn't safe.
> >
> > <dhowells> imagine variable X is set to 3
> > <dhowells> CPU A issues cmpxchg(&X, 3, 5)
> > <dhowells> you'd expect that to change X to 5
> > <dhowells> but what if CPU B assigns 6 to X between cmpxchg reading X
> > and it setting X?
>
> The same could happen with a regular cmpxchg. Cmpxchg changes it to 5 and
> then other cpu performs a store before the next instruction.

For someone who's advocating use of cmpxchg, it seems you don't
understand its semantics! In the scenario dhowells pointed out, X would
be left set to 5. X should have the value 6 under any legitimate
implementation:

CPU A CPU B
cmpxchg(3,5)
X = 6


CPU A CPU B
X = 6
cmpxhcg(3,5)


CPU A
cmpxchg(3,
X = 6
5)


Given that even yourself got confused about how to use it, perhaps it's
not a good idea to expose this primitive to most programmers anyway?

2006-12-06 19:41:56

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

On Wed, Dec 06, 2006 at 11:34:52AM -0800, Linus Torvalds wrote:
> On Wed, 6 Dec 2006, Matthew Wilcox wrote:
> > Given parisc's paucity of atomic operations (load-and-zero-32bit and
> > load-and-zero-64bit), cmpxchg() is impossible to implement safely.
> > There has to be something we can hook to exclude another processor
> > modifying the variable. I'm OK with using atomic_cmpxchg(); we have
> > atomic_set locked against it.
>
> How do you to the atomic bitops?

The same way we do atomic_t.

What I hadn't realised (because I hadn't read dhowell's implementation
... because it hasn't shown up on git2.kernel.org yet) is that he
doesn't actually *use* this unlocked-assignment that would cause the
problem. He uses bitops which use the same locks.

> Also, I don't see quite why you think "cmpxchg()" and "atomic_cmpxchg()"
> would be different. ANY cmpxchg() needs to be atomic - if it's not,
> there's no point to the operation at all, since you'd just write it as
>
> if (*p == x)
> *p = y;
>
> instead, and not use cmpxchg().

The difference is that we can (and do) acquire a lock for atomic_set.
We can't acquire one for X = 6.

> - on SMP, use a spinlock (with interrupts disabled), and share that
> spinlock with all the other atomic ops (bitops at a minimum - the
> "atomic_t" operations have traditionally been in another "locking
> space" because of sparc32 historic braindamage, but I'd suggest
> sharing the same spinlock between them all).

Yep, we agree.

> And yeah, it sucks. You _can_ (if you really want to) make the spinlock be
> hashed based on the address of the atomic data structure. That at least
> allows you to do _multiple_ spinlocks, but let's face it, your real
> problem is _likely_ going to be cacheline bouncing, not contention, and
> then using a hashed lock won't be likely to buy you all that much.

We do hash based on the address -- and we try to arrange things such
that different spinlocks are acquired for different cachelines. I don't
know if anyone's benchmarked it recently to see how well it performs.
It's a bit of a waltzing bear [1] at times ;-)

[1] The wonder is not how well it waltzes, but that it waltzes at all

2006-12-06 19:43:53

by David Howells

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

Matthew Wilcox <[email protected]> wrote:

> > Ok. For SMP-safety, it's important that any architecture that can't do
> > this needs to _share_ the same spinlock (on SMP only, of course) that it
> > uses for the bitops.
>
> That doesn't help, since assignment can't be guarded by any lock.

It's not a problem for workqueues, since the only direct assignment to the
management member variable is during initialisation.

But in general cmpxchg() might be a problem with respect to assignment.

atomic_cmpxchg() should be safe wrt atomic_set().

David

2006-12-06 19:46:37

by David Howells

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

Linus Torvalds <[email protected]> wrote:

> Also, I don't see quite why you think "cmpxchg()" and "atomic_cmpxchg()"
> would be different. ANY cmpxchg() needs to be atomic - if it's not,
> there's no point to the operation at all, since you'd just write it as

It's not that atomic_cmpxchg() is different to cmpxchg(), it's that
atomic_set() is different to direct assignment.

atomic_set() on PA-RISC, for example, has spinlocks in it.

David

2006-12-06 19:48:00

by David Howells

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

Lennert Buytenhek <[email protected]> wrote:

> > Pre-v6 ARM doesn't support SMP according to ARM's atomic.h,
>
> That's not quite true, there exist ARMv5 processors that in theory
> can support SMP.

I meant that the Linux ARM arch doesn't support it:

#else /* ARM_ARCH_6 */

#include <asm/system.h>

#ifdef CONFIG_SMP
#error SMP not supported on pre-ARMv6 CPUs
#endif

David

2006-12-06 19:48:51

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

On Wed, 6 Dec 2006, Matthew Wilcox wrote:

> On Wed, Dec 06, 2006 at 11:29:42AM -0800, Christoph Lameter wrote:
> > On Wed, 6 Dec 2006, Matthew Wilcox wrote:
> >
> > > It's just been pointed out to me that the parisc one isn't safe.
> > >
> > > <dhowells> imagine variable X is set to 3
> > > <dhowells> CPU A issues cmpxchg(&X, 3, 5)
> > > <dhowells> you'd expect that to change X to 5
> > > <dhowells> but what if CPU B assigns 6 to X between cmpxchg reading X
> > > and it setting X?
> >
> > The same could happen with a regular cmpxchg. Cmpxchg changes it to 5 and
> > then other cpu performs a store before the next instruction.
>
> For someone who's advocating use of cmpxchg, it seems you don't
> understand its semantics! In the scenario dhowells pointed out, X would
> be left set to 5. X should have the value 6 under any legitimate
> implementation:

Nope this is a UP implementation. There is no cpu B.

2006-12-06 19:50:53

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

On Wed, Dec 06, 2006 at 11:47:40AM -0800, Christoph Lameter wrote:
> Nope this is a UP implementation. There is no cpu B.

Follow the thread back. I'm talking about parisc. Machines exist (are
still being sold) with up to 128 cores. I think the largest we've
confirmed work are 8 CPUs.

2006-12-06 19:54:58

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it



On Wed, 6 Dec 2006, Matthew Wilcox wrote:
>
> That doesn't help, since assignment can't be guarded by any lock.

True. Pure assignment will be lost, and is only ok for the case of a pure
initializer (where nobody can see the old state).

Your problem will be, of course, that any architecture that does this in
hardware will just DoTheRightThing, and as such, broken architectures with
bad locking primitives will have to test and do source-level analysis
more.

Linus

2006-12-06 19:57:43

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it



On Wed, 6 Dec 2006, Linus Torvalds wrote:
>
> Your problem will be, of course, that any architecture that does this in
> hardware will just DoTheRightThing, and as such, broken architectures with
> bad locking primitives will have to test and do source-level analysis
> more.

(The underlying thread here being that the workqueue stuff _should_ be
safe in this regard. But the "testing will not catch bugs like these is
certainly true in general)

Linus

2006-12-06 19:58:28

by Russell King

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

On Wed, Dec 06, 2006 at 11:16:55AM -0800, Christoph Lameter wrote:
> On Wed, 6 Dec 2006, Russell King wrote:
>
> > On Wed, Dec 06, 2006 at 10:56:14AM -0800, Christoph Lameter wrote:
> > > I'd really appreciate a cmpxchg that is generically available for
> > > all arches. It will allow lockless implementation for various performance
> > > criticial portions of the kernel.
> >
> > Let's recap on cmpxchg.
> >
> > For CPUs with no atomic operation other than SWP, it is not lockless.
>
> But then its also just requires disable/enable interrupts on UP which may
> be cheaper than an atomic operation.

No. SWP is atomic on the CPU it's being issued on, especially wrt
interrupts. Only on one ARM CPU (which is UP) does it have a
questionable use, and there we do it via interrupt disable/enable.

> > For CPUs with load locked + store conditional, it is expensive.
>
> Because it locks the bus? I am not that familiar with those architectures
> but it seems that those will have a general problem anyways.

No. That certainly would be bad for performance. I can talk
authoritively from the ARM implementation.

When you use a special "ldrex" (load exclusive) instruction, the
CPU remembers the "address + cpu" pair. If another access occurs
to the same address, this state is reset.

Only if this state is preserved will a "strex" (store exclusive)
instruction succeed. This instruction returns status indicating
whether it succeeded.

So, to implement cmpxchg, you need to do this:

; r1 = temporary register
; r2 = address
; r4 = new value
; r3 = returned status
ldrex r1, [r2]
cmp r1, old_value
streqex r3, r4, [r2]

> > If you want an operation for performance critical portions of the
> > kernel, please allow architecture maintainers the freedom to use their
> > best performance enhancements.
>
> And thereby denying the kernel developers to use a simple atomic SMP
> operation? Adding additional defines for each arch and each performance
> critical piece of kernel logic?

No. If you read what I said, you'll see that you can _cheaply_ use
cmpxchg in a ll/sc based implementation. Take an atomic increment
operation.

do {
old = load_locked(addr);
} while (store_exclusive(old, old + 1, addr);

On a cmpxchg, that "store_exclusive" (loosely) becomes your cmpxchg
instruction, comparing the first arg, and if equal storing the second.
The "load_locked" macro becomes a standard pointer deref. Ergo, x86
becomes:

do {
load value
manipulate it
conditional store
} while not stored

On ll/sc, the load_locked() macro is the load locked instruction. The
store_exclusive() macro is the exclusive store and it doesn't need to
use the first parameter at all. Ergo, ARM becomes:

do {
ldrex r1, [r2]
manipulate r1
strex r0, r1, [r2]
} while failed

Notice that both are optimal.

Now let's consider the cmpxchg case.

do {
val = *addr;
} while (cmpxchg(val, val + 1, addr);

The x86 case is _identical_ to the ll/sc based implementation. Absolutely
entirely. No impact what so ever.

Let's look at the ll/sc case. The cmpxchg code implemented on this has
to reload the original value, compare it, if equal store the new value.
So:

do {
val = *addr;
(r2 = addr,
ldrex r1, [r2]
compare r1, r0
strexeq r4, r3, [r2] (store exclusive if equal)
} while store failed or comparecondition failed

Note how the cmpxchg has _forced_ the ll/sc implementation to become
more complex.

So, let's recap.

Implementing ll/sc based accessor macros allows both ll/sc _and_ cmpxchg
architectures to produce optimal code.

Implementing an cmpxchg based accessor macro allows cmpxchg architectures
to produce optimal code and ll/sc non-optimal code.

See my point?

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:

2006-12-06 20:00:30

by Russell King

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

On Wed, Dec 06, 2006 at 11:05:22AM -0800, Linus Torvalds wrote:
>
>
> On Wed, 6 Dec 2006, Christoph Lameter wrote:
> >
> > I'd really appreciate a cmpxchg that is generically available for
> > all arches. It will allow lockless implementation for various performance
> > criticial portions of the kernel.
>
> I suspect ARM may have been the last one without one, no?
>
> That said, cmpxchg won't necessarily be "high-performance" unless the hw
> supports it naturally in hardware, so..
>
> Russell, are you ok with the code DavidH posted (the "try 2" one)? I'd
> like to get an ack from the ARM maintainer before applying it, but it
> looked ok.

I've only recently got in today, so it'll be a while (== tomorrow) before
I can get around to doing any tests. (It's evening this side of the
pond.)

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:

2006-12-06 20:09:14

by Lennert Buytenhek

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

On Wed, Dec 06, 2006 at 07:47:22PM +0000, David Howells wrote:

> > > Pre-v6 ARM doesn't support SMP according to ARM's atomic.h,
> >
> > That's not quite true, there exist ARMv5 processors that in theory
> > can support SMP.
>
> I meant that the Linux ARM arch doesn't support it:

Ah, yes, not currently. That's true.

2006-12-06 20:12:11

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

On Wed, 6 Dec 2006, Matthew Wilcox wrote:

> On Wed, Dec 06, 2006 at 11:47:40AM -0800, Christoph Lameter wrote:
> > Nope this is a UP implementation. There is no cpu B.
>
> Follow the thread back. I'm talking about parisc. Machines exist (are
> still being sold) with up to 128 cores. I think the largest we've
> confirmed work are 8 CPUs.

And you only have an atomic get and clear atomic op? Amazing. Is this
really still in production?


2006-12-06 20:17:08

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

On Wed, Dec 06, 2006 at 12:11:01PM -0800, Christoph Lameter wrote:
> On Wed, 6 Dec 2006, Matthew Wilcox wrote:
> > Follow the thread back. I'm talking about parisc. Machines exist (are
> > still being sold) with up to 128 cores. I think the largest we've
> > confirmed work are 8 CPUs.
>
> And you only have an atomic get and clear atomic op? Amazing. Is this
> really still in production?

Yes. And yes.

2006-12-06 21:36:29

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

On Wed, Dec 06, 2006 at 07:58:20PM +0000, Russell King wrote:
> No. If you read what I said, you'll see that you can _cheaply_ use
> cmpxchg in a ll/sc based implementation. Take an atomic increment
> operation.
>
> do {
> old = load_locked(addr);
> } while (store_exclusive(old, old + 1, addr);

[...]

> Implementing ll/sc based accessor macros allows both ll/sc _and_ cmpxchg
> architectures to produce optimal code.
>
> Implementing an cmpxchg based accessor macro allows cmpxchg architectures
> to produce optimal code and ll/sc non-optimal code.

And for those of us with only load-and-zero, that's simply:

#define load_locked(addr) spin_lock(hash(addr)), *addr
#define store_exclusive(addr, old, new) \
*addr = new, spin_unlock(hash(addr)), 0

which is also optimal for us.

2006-12-06 21:52:45

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

On Wed, 6 Dec 2006, Matthew Wilcox wrote:

> And for those of us with only load-and-zero, that's simply:
>
> #define load_locked(addr) spin_lock(hash(addr)), *addr
> #define store_exclusive(addr, old, new) \
> *addr = new, spin_unlock(hash(addr)), 0
>
> which is also optimal for us.

This means we tolerate the assignment race for SMP that was pointed out
earlier?

cmpxchg emulation may then also be tolerable just replace the irq
enable/disable in David's implementation with taking a spin lock?

2006-12-06 22:05:36

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

On Wed, Dec 06, 2006 at 01:52:20PM -0800, Christoph Lameter wrote:
> On Wed, 6 Dec 2006, Matthew Wilcox wrote:
>
> > And for those of us with only load-and-zero, that's simply:
> >
> > #define load_locked(addr) spin_lock(hash(addr)), *addr
> > #define store_exclusive(addr, old, new) \
> > *addr = new, spin_unlock(hash(addr)), 0
> >
> > which is also optimal for us.
>
> This means we tolerate the assignment race for SMP that was pointed out
> earlier?

What gave you that impression? It simply wasn't part of this example.

To be honest, it'd be much easier if we only defined these operations on
atomic_t's. We have all the infrastructure in place for them, and
they're fairly well understood. If you need different sizes, I'm OK
with an atomic_pointer_t, or whatever.

2006-12-06 22:16:13

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

On Wed, 6 Dec 2006, Matthew Wilcox wrote:

> To be honest, it'd be much easier if we only defined these operations on
> atomic_t's. We have all the infrastructure in place for them, and
> they're fairly well understood. If you need different sizes, I'm OK
> with an atomic_pointer_t, or whatever.

An pointer is probably one of the most important entities to use with
cmpxchg (aside from the ints and longs that we already support).

A pointer can be defined to point to any other type. So we would need
atomic_pointer_t(<type_pointed_to>)?

2006-12-06 22:38:59

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it



On Wed, 6 Dec 2006, Christoph Lameter wrote:
>
> This means we tolerate the assignment race for SMP that was pointed out
> earlier?

The assignment "race" doesn't really exist in real code, I suspect.

There's two cases of assignment:

- pure initialization. This one isn't racy, since it's literally the
initial setting of some data structure, and if it's visible in
uninitialized state on other CPU's before being initialized, you have
bigger problems than some silly assignment race ;)

So the race doesn't exist for this class of assignment at all, and
that's the _common_ case of a pure assignment (ie the "we start out
with this field having value X")

- mixing assignment on one CPU (and not using "atomic_set()") and using
"cmpxchg()".

This would be a problem on architectures that use an external spinlock
or something, but I don't really see how it could be a valid code
sequence ANYWAY.

Any code that does this had better have some _higherlevel_
serialization around that data structure, since while it's not a _bug_
on architectures that do "cmpxchg()" in hardware (as far as the cmpxchg
itself is concerned), the fact is, in the absense of any other locking,
what the hell is such code supposed to _mean_?

In other words - the second case would be a bug in that it bypasses the
serialization of the spinlock, but why would you ever do that anyway? What
could _possibly_ be a valid use of such a "set value blindly" kind of
sequence? Since another CPU is clearly doing something to the value (ie
the cmpxchg), the "set it to value X" is just not a well-defined
operation in the first place, since you don't know what the end result is
(will it be "X", or will the "cmpxchg" on the other CPU have set it to
something else?).

So I don't think the race really exists for any sane code anyway. Even if
people don't use "atomic_set()".

But hey, maybe I'm just ignoring some really odd usage case.

Linus

2006-12-07 00:38:29

by Roman Zippel

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

Hi,

On Wed, 6 Dec 2006, Matthew Wilcox wrote:

> To be honest, it'd be much easier if we only defined these operations on
> atomic_t's. We have all the infrastructure in place for them, and
> they're fairly well understood. If you need different sizes, I'm OK
> with an atomic_pointer_t, or whatever.

FWIW Seconded.

bye, Roman

2006-12-07 00:47:18

by Ralf Baechle

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

On Wed, Dec 06, 2006 at 11:16:55AM -0800, Christoph Lameter wrote:

> But then its also just requires disable/enable interrupts on UP which may
> be cheaper than an atomic operation.
>
> > For CPUs with load locked + store conditional, it is expensive.
>
> Because it locks the bus? I am not that familiar with those architectures
> but it seems that those will have a general problem anyways.

On a decent implementation ll/sc will have the same cost as ordinary
non-atomic load and store instructions. A likely uniprocessor
implementation uses a single flip-flop ("llbit") in the CPU which is set
by ll and cleared by any exception handler, especially interrupt. A later
store conditional will then simply fail if that bit is cleared. That
is extremly trivial stuff. On SMP it's somewhat more complex; A
processor will have to remember the address used with ll and start
snooping the bus for writes to it. The store conditional will then
go and upgrade the cache line to exclusive state if the llbit is still
set and perform the store. The llbit would be cleared if the processor
has snooped any other write to the cacheline. Details are fun but that's
bascially how it's implemented.

Of course load linked / store conditional are typically used in loops
so there is a little extra overhead from that especially where when the
branch is misspredicted.

Also note there is no locked cycle required to implement load linked /
store conditional.

Ralf

2006-12-07 00:55:10

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it



On Thu, 7 Dec 2006, Roman Zippel wrote:
> On Wed, 6 Dec 2006, Matthew Wilcox wrote:
>
> > To be honest, it'd be much easier if we only defined these operations on
> > atomic_t's. We have all the infrastructure in place for them, and
> > they're fairly well understood. If you need different sizes, I'm OK
> > with an atomic_pointer_t, or whatever.
>
> FWIW Seconded.

I disagree.

Any _real_ CPU will simply never care about _anything_ else than just the
size of the datum in question. There's absolutely no point to only allow
it on certain types, especially as we _know_ those certain types are
already going to be more than one, and we also know that they are going to
be different sizes. In other words, in reality, we have to handle a
sizeable subset of the whole generic situation, and the "on certain types
only" situation is only going to be awkward and irritating.

For example, would we have a different "cmpxchg_ptr()" function for the
atomic pointer thing? With any reasonable CPU just depending on the _size_
of the type, I don't see what the problem is with just doing the
bog-standard "cmpxchg_8/16/32/64" and having the bog-standard case-
statement in a header file to do it all automatically for you, and then we
don't need to worry about it.

Linus

2006-12-07 01:05:54

by Roman Zippel

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

Hi,

On Wed, 6 Dec 2006, Linus Torvalds wrote:

> > > To be honest, it'd be much easier if we only defined these operations on
> > > atomic_t's. We have all the infrastructure in place for them, and
> > > they're fairly well understood. If you need different sizes, I'm OK
> > > with an atomic_pointer_t, or whatever.
> >
> > FWIW Seconded.
>
> I disagree.
>
> Any _real_ CPU will simply never care about _anything_ else than just the
> size of the datum in question.

..or alignment which a dedicated atomic type would allow to be attached.

bye, Roman

2006-12-07 01:09:04

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

From: Al Viro <[email protected]>
Date: Wed, 6 Dec 2006 19:08:28 +0000

> On Wed, Dec 06, 2006 at 11:05:22AM -0800, Linus Torvalds wrote:
> >
> >
> > On Wed, 6 Dec 2006, Christoph Lameter wrote:
> > >
> > > I'd really appreciate a cmpxchg that is generically available for
> > > all arches. It will allow lockless implementation for various performance
> > > criticial portions of the kernel.
> >
> > I suspect ARM may have been the last one without one, no?
>
> No. sparc32 doesn't have one, for instance.

That's correct. It has an atomic swap, but not a cmpxchg.

2006-12-07 01:19:27

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it



On Thu, 7 Dec 2006, Roman Zippel wrote:
> >
> > Any _real_ CPU will simply never care about _anything_ else than just the
> > size of the datum in question.
>
> ..or alignment which a dedicated atomic type would allow to be attached.

Can you give any example of a real CPU where alignment matters?

Sure, it needs to be naturally aligned, but that's true of _any_ type in
the kernel. We don't do unaligneds without "get_unaligned()" and friends.

Btw, if you want to leave out 8-bit and 16-bit data, that's fine. So
generally you'd only need to handle 32-bit (and 64-bit on a 64-bit
architecture) accesses anyway.

Linus

2006-12-07 01:25:01

by Roman Zippel

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

Hi,

On Wed, 6 Dec 2006, Linus Torvalds wrote:

> > > Any _real_ CPU will simply never care about _anything_ else than just the
> > > size of the datum in question.
> >
> > ..or alignment which a dedicated atomic type would allow to be attached.
>
> Can you give any example of a real CPU where alignment matters?
>
> Sure, it needs to be naturally aligned, but that's true of _any_ type in
> the kernel. We don't do unaligneds without "get_unaligned()" and friends.

m68060 produces a trap for unaligned atomic access, unfortunately standard
alignment is smaller than this.

bye, Roman

2006-12-07 01:37:01

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it



On Thu, 7 Dec 2006, Roman Zippel wrote:
>
> m68060 produces a trap for unaligned atomic access, unfortunately standard
> alignment is smaller than this.

Umm. on 68060, since it's 32-bit, you'd only have the 32-bit case. Are you
saying that you can't do a 32-bit atomic access at any 32-bit aligned
boundary? Or are you saying that gcc aligns normal 32-bit entities at
16-bit alignment? Neither of those sound very likely.

Linus

2006-12-07 01:44:31

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

On Wed, Dec 06, 2006 at 05:36:29PM -0800, Linus Torvalds wrote:
> Or are you saying that gcc aligns normal 32-bit entities at
> 16-bit alignment? Neither of those sound very likely.

alignof(u32) is 2 on m68k. Crazy, huh?

2006-12-07 01:53:09

by Roman Zippel

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

Hi,

On Wed, 6 Dec 2006, Linus Torvalds wrote:

> > m68060 produces a trap for unaligned atomic access, unfortunately standard
> > alignment is smaller than this.
>
> Umm. on 68060, since it's 32-bit, you'd only have the 32-bit case. Are you
> saying that you can't do a 32-bit atomic access at any 32-bit aligned
> boundary? Or are you saying that gcc aligns normal 32-bit entities at
> 16-bit alignment? Neither of those sound very likely.

The latter.
And yes, I'm starting to hate it too, especially after I've seen all the
weird bugs this causes in userspace, which then only trigger on m68k.
I'm thinking of changing it when switching to proper TLS support, but it's
not there yet.
BTW there is another reason I actually like the atomic type -
documentation. It makes it more clear that this intended to be used as
atomic value, so an unhealthy mixture of different accesses is less likely
and if they pile up at some place it's a good indicator to maybe switch
back to spinlocks.

bye, Roman

2006-12-07 02:09:04

by Douglas McNaught

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

Matthew Wilcox <[email protected]> writes:

> On Wed, Dec 06, 2006 at 05:36:29PM -0800, Linus Torvalds wrote:
>> Or are you saying that gcc aligns normal 32-bit entities at
>> 16-bit alignment? Neither of those sound very likely.
>
> alignof(u32) is 2 on m68k. Crazy, huh?

The original 68000 had a 16-bit bus (but 32-bit registers), which is
probably why it's that way.

-Doug

2006-12-07 09:23:57

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

Linus Torvalds wrote:
>
> On Thu, 7 Dec 2006, Roman Zippel wrote:
>
>>On Wed, 6 Dec 2006, Matthew Wilcox wrote:
>>
>>
>>>To be honest, it'd be much easier if we only defined these operations on
>>>atomic_t's. We have all the infrastructure in place for them, and
>>>they're fairly well understood. If you need different sizes, I'm OK
>>>with an atomic_pointer_t, or whatever.
>>
>>FWIW Seconded.
>
>
> I disagree.
>
> Any _real_ CPU will simply never care about _anything_ else than just the
> size of the datum in question. There's absolutely no point to only allow
> it on certain types, especially as we _know_ those certain types are
> already going to be more than one, and we also know that they are going to
> be different sizes. In other words, in reality, we have to handle a
> sizeable subset of the whole generic situation, and the "on certain types
> only" situation is only going to be awkward and irritating.
>
> For example, would we have a different "cmpxchg_ptr()" function for the
> atomic pointer thing? With any reasonable CPU just depending on the _size_
> of the type, I don't see what the problem is with just doing the
> bog-standard "cmpxchg_8/16/32/64" and having the bog-standard case-
> statement in a header file to do it all automatically for you, and then we
> don't need to worry about it.

What's wrong with using atomic_cmpxchg? For unsigned long / pointers,
there is a patch to implement atomic_long_cmpxchg.

Some architectures simply can't implement cmpxchg on memory that may
be modified in arbitrary ways. If you add some more conditions to use
cmpxchg, then you weaken it (ie. can't use it for synchronisation
with userspace) and on top of that you don't get the easy static
checking that atomic_t gives you.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-12-07 09:31:57

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

Russell King wrote:
> On Wed, Dec 06, 2006 at 11:16:55AM -0800, Christoph Lameter wrote:

> No. If you read what I said, you'll see that you can _cheaply_ use
> cmpxchg in a ll/sc based implementation. Take an atomic increment
> operation.
>
> do {
> old = load_locked(addr);
> } while (store_exclusive(old, old + 1, addr);
>
> On a cmpxchg, that "store_exclusive" (loosely) becomes your cmpxchg
> instruction, comparing the first arg, and if equal storing the second.
> The "load_locked" macro becomes a standard pointer deref. Ergo, x86
> becomes:
>
> do {
> load value
> manipulate it
> conditional store
> } while not stored
>
> On ll/sc, the load_locked() macro is the load locked instruction. The
> store_exclusive() macro is the exclusive store and it doesn't need to
> use the first parameter at all. Ergo, ARM becomes:
>
> do {
> ldrex r1, [r2]
> manipulate r1
> strex r0, r1, [r2]
> } while failed
>
> Notice that both are optimal.
>
> Now let's consider the cmpxchg case.
>
> do {
> val = *addr;
> } while (cmpxchg(val, val + 1, addr);
>
> The x86 case is _identical_ to the ll/sc based implementation. Absolutely
> entirely. No impact what so ever.
>
> Let's look at the ll/sc case. The cmpxchg code implemented on this has
> to reload the original value, compare it, if equal store the new value.
> So:
>
> do {
> val = *addr;
> (r2 = addr,
> ldrex r1, [r2]
> compare r1, r0
> strexeq r4, r3, [r2] (store exclusive if equal)
> } while store failed or comparecondition failed
>
> Note how the cmpxchg has _forced_ the ll/sc implementation to become
> more complex.
>
> So, let's recap.
>
> Implementing ll/sc based accessor macros allows both ll/sc _and_ cmpxchg
> architectures to produce optimal code.
>
> Implementing an cmpxchg based accessor macro allows cmpxchg architectures
> to produce optimal code and ll/sc non-optimal code.
>
> See my point?

Wrong. Your ll/sc implementation with cmpxchg is buggy. The cmpxchg
load_locked is not locked at all, and there can be interleaving writes
between the load and cmpxchg which do not cause the store_conditional
to fail.

It might be reasonable to implement this watered down version, but:
don't some architectures have restrictions on what instructions can
be issued between the ll and the sc?

But in general I agree with you, in that a higher level primitive is
preferable (eg. atomic_add_unless).

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-12-07 13:20:36

by Ivan Kokshaysky

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

On Thu, Dec 07, 2006 at 08:31:08PM +1100, Nick Piggin wrote:
> It might be reasonable to implement this watered down version, but:
> don't some architectures have restrictions on what instructions can
> be issued between the ll and the sc?

Yes. On Alpha you cannot execute other load/stores, taken branches etc.
So in general case the code between ll and sc cannot be written in C...

Ivan.

2006-12-07 15:03:22

by Russell King

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

On Thu, Dec 07, 2006 at 08:31:08PM +1100, Nick Piggin wrote:
> Russell King wrote:
> >On Wed, Dec 06, 2006 at 11:16:55AM -0800, Christoph Lameter wrote:
>
> >No. If you read what I said, you'll see that you can _cheaply_ use
> >cmpxchg in a ll/sc based implementation. Take an atomic increment
> >operation.
> >
> > do {
> > old = load_locked(addr);
> > } while (store_exclusive(old, old + 1, addr);
> >
> >On a cmpxchg, that "store_exclusive" (loosely) becomes your cmpxchg
> >instruction, comparing the first arg, and if equal storing the second.
> >The "load_locked" macro becomes a standard pointer deref. Ergo, x86
> >becomes:
> >
> > do {
> > load value
> > manipulate it
> > conditional store
> > } while not stored
> >
> >On ll/sc, the load_locked() macro is the load locked instruction. The
> >store_exclusive() macro is the exclusive store and it doesn't need to
> >use the first parameter at all. Ergo, ARM becomes:
> >
> > do {
> > ldrex r1, [r2]
> > manipulate r1
> > strex r0, r1, [r2]
> > } while failed
> >
> >Notice that both are optimal.
> >
> >Now let's consider the cmpxchg case.
> >
> > do {
> > val = *addr;
> > } while (cmpxchg(val, val + 1, addr);
> >
> >The x86 case is _identical_ to the ll/sc based implementation. Absolutely
> >entirely. No impact what so ever.
> >
> >Let's look at the ll/sc case. The cmpxchg code implemented on this has
> >to reload the original value, compare it, if equal store the new value.
> >So:
> >
> > do {
> > val = *addr;
> > (r2 = addr,
> > ldrex r1, [r2]
> > compare r1, r0
> > strexeq r4, r3, [r2] (store exclusive if equal)
> > } while store failed or comparecondition failed
> >
> >Note how the cmpxchg has _forced_ the ll/sc implementation to become
> >more complex.
> >
> >So, let's recap.
> >
> >Implementing ll/sc based accessor macros allows both ll/sc _and_ cmpxchg
> >architectures to produce optimal code.
> >
> >Implementing an cmpxchg based accessor macro allows cmpxchg architectures
> >to produce optimal code and ll/sc non-optimal code.
> >
> >See my point?
>
> Wrong. Your ll/sc implementation with cmpxchg is buggy. The cmpxchg
> load_locked is not locked at all,

Intentional - cmpxchg architectures don't generally have a load locked.

> and there can be interleaving writes
> between the load and cmpxchg which do not cause the store_conditional
> to fail.

In which case the cmpxchg fails and we do the atomic operation again,
in exactly the same way that we do the operation again if the 'sc'
fails in the ll/sc case.

I do not see any problem.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:

2006-12-07 15:06:16

by Russell King

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

On Wed, Dec 06, 2006 at 11:05:22AM -0800, Linus Torvalds wrote:
>
>
> On Wed, 6 Dec 2006, Christoph Lameter wrote:
> >
> > I'd really appreciate a cmpxchg that is generically available for
> > all arches. It will allow lockless implementation for various performance
> > criticial portions of the kernel.
>
> I suspect ARM may have been the last one without one, no?
>
> That said, cmpxchg won't necessarily be "high-performance" unless the hw
> supports it naturally in hardware, so..
>
> Russell, are you ok with the code DavidH posted (the "try 2" one)? I'd
> like to get an ack from the ARM maintainer before applying it, but it
> looked ok.

Things seem to have moved on since this request. Do I need to do
anything? I dunno.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:

2006-12-07 15:42:05

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it



On Thu, 7 Dec 2006, Nick Piggin wrote:
>
> It might be reasonable to implement this watered down version, but:
> don't some architectures have restrictions on what instructions can
> be issued between the ll and the sc?

Yes. You really probably do not want to expose ll/sc on a C level because
of this.

On alpha, the architecture manual says (I didn't go back and check, but
I'm pretty sure) that a ld.l and st.c cannot have a taken branch in
between then, for example. That basically means that you can't allow the
compiler to reorder the basic blocks (which it often will with a
while-loop).

Now, I actually suspect that this was not a microarchitectural flaw, and
that a branch would _work_ there, and that the architecture manual was
just being anal, but strictly speaking, it means that these things had
better always be in assembly, and you can sadly not expose them (on alpha,
at least) as higher-level primitives as such - you can only expose the
end result ("cmpxchg" or similar).

Linus

2006-12-08 15:33:00

by Russell King

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

On Thu, Dec 07, 2006 at 03:06:06PM +0000, Russell King wrote:
> On Wed, Dec 06, 2006 at 11:05:22AM -0800, Linus Torvalds wrote:
> >
> >
> > On Wed, 6 Dec 2006, Christoph Lameter wrote:
> > >
> > > I'd really appreciate a cmpxchg that is generically available for
> > > all arches. It will allow lockless implementation for various performance
> > > criticial portions of the kernel.
> >
> > I suspect ARM may have been the last one without one, no?
> >
> > That said, cmpxchg won't necessarily be "high-performance" unless the hw
> > supports it naturally in hardware, so..
> >
> > Russell, are you ok with the code DavidH posted (the "try 2" one)? I'd
> > like to get an ack from the ARM maintainer before applying it, but it
> > looked ok.
>
> Things seem to have moved on since this request. Do I need to do
> anything? I dunno.

ARM's still broken.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:

2006-12-08 16:07:03

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

On Fri, 8 Dec 2006, Russell King wrote:

> I'm trying to suggest a better implementation for atomic ops rather
> than just bowing to this x86-centric "cmpxchg is the best, everyone
> must implement it" mentality.

cmpxchg is the simplest solution to realize many other atomic operations
and its widely available on a wide variety of platforms. It is the most
universal atomic instruction that I know of. Other atomic operations may
be more efficient but certainly cmpxchg is the most universal.

Having multiple instructions with restrictions of what can be done in
between just complicates the use and seems to be arch specific. I have not
seen a better solution. Are you really advocating the weirdly complex
ll/sc be adopted by other architectures?

2006-12-08 16:32:13

by Russell King

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

On Fri, Dec 08, 2006 at 08:06:23AM -0800, Christoph Lameter wrote:
> On Fri, 8 Dec 2006, Russell King wrote:
>
> > I'm trying to suggest a better implementation for atomic ops rather
> > than just bowing to this x86-centric "cmpxchg is the best, everyone
> > must implement it" mentality.
>
> cmpxchg is the simplest solution to realize many other atomic operations
> and its widely available on a wide variety of platforms. It is the most
> universal atomic instruction that I know of. Other atomic operations may
> be more efficient but certainly cmpxchg is the most universal.
>
> Having multiple instructions with restrictions of what can be done in
> between just complicates the use and seems to be arch specific. I have not
> seen a better solution. Are you really advocating the weirdly complex
> ll/sc be adopted by other architectures?

You're advocating cmpxchg is adopted by all architectures. It isn't
available on many architectures, and those which it can be requires
unnecessarily complicated coding.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:

2006-12-08 16:43:29

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

On Fri, 8 Dec 2006, Russell King wrote:

> You're advocating cmpxchg is adopted by all architectures. It isn't
> available on many architectures, and those which it can be requires
> unnecessarily complicated coding.

Not having cmpxchg is even worse because it requires the introduction and
maintenance of large sets of arch specific operations. Much more complex.


2006-12-08 16:47:45

by Russell King

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

On Fri, Dec 08, 2006 at 08:43:09AM -0800, Christoph Lameter wrote:
> On Fri, 8 Dec 2006, Russell King wrote:
>
> > You're advocating cmpxchg is adopted by all architectures. It isn't
> > available on many architectures, and those which it can be requires
> > unnecessarily complicated coding.
>
> Not having cmpxchg is even worse because it requires the introduction and
> maintenance of large sets of arch specific operations. Much more complex.

And which bit of "not available on many architectures" have you not grasped
yet?

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:

2006-12-08 16:54:57

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

On Fri, 8 Dec 2006, Russell King wrote:

> > Not having cmpxchg is even worse because it requires the introduction and
> > maintenance of large sets of arch specific operations. Much more complex.
>
> And which bit of "not available on many architectures" have you not grasped
> yet?

We discussed various forms of emulating that functionality on this thread.
Seems to work satisfactorily. You can discover the information you skipped
by going back to some earlier messages of this thread.




2006-12-08 16:57:36

by David Howells

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

Christoph Lameter <[email protected]> wrote:

> cmpxchg is the simplest solution to realize many other atomic operations and
> its widely available on a wide variety of platforms.

But by no means all. Where it doesn't exist it can only be properly emulated
by something such as LL/SC if they are there. If not, then you can't do a
safe cmpxchg on SMP.

> It is the most universal atomic instruction that I know of.

I think TAS-type things and XCHG-type things are more common.

In fact I think more things have LL/SC than have CMPXCHG.

David

2006-12-08 16:59:21

by Russell King

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

On Fri, Dec 08, 2006 at 08:53:22AM -0800, Christoph Lameter wrote:
> On Fri, 8 Dec 2006, Russell King wrote:
>
> > > Not having cmpxchg is even worse because it requires the introduction and
> > > maintenance of large sets of arch specific operations. Much more complex.
> >
> > And which bit of "not available on many architectures" have you not grasped
> > yet?
>
> We discussed various forms of emulating that functionality on this thread.
> Seems to work satisfactorily. You can discover the information you skipped
> by going back to some earlier messages of this thread.

Oh for god sake. I've put forward a coherent argument. I've disproved
every point put forward by people for my approach.

ARM continues to be broken because people like you are stuck with doing
things only one way. Grow up and open your mind to other possibilities.

(It should be noted that LDREX/STREX is broken on some ARM implementations
at the moment - they will cause a livelock due to silicon bugs if both
CPUs have exactly the same number of cycles between LDREX and STREX.
This makes loops-within-loops implementations using cmpxchg _extremely_
expensive on ARM.)

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:

2006-12-08 17:06:49

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

On Fri, 8 Dec 2006, David Howells wrote:

> > It is the most universal atomic instruction that I know of.
>
> I think TAS-type things and XCHG-type things are more common.

Huh? The most popular architectures are i386 x86_64 sparc ia64 etc which
all have one or the other form of cmpxchg (some issues with early sparc
and i386).

And yes the xchg was the first multiprocessor instruction and therefore
is also available on very old processors.

> In fact I think more things have LL/SC than have CMPXCHG.

LL/SC can be easily used to come up with a cmpxchg equivalent.

2006-12-08 08:56:53

by Russell King

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

On Fri, Dec 08, 2006 at 12:18:52PM +1100, Nick Piggin wrote:
> Russell King wrote:
> >On Thu, Dec 07, 2006 at 08:31:08PM +1100, Nick Piggin wrote:
>
> >>>Implementing ll/sc based accessor macros allows both ll/sc _and_ cmpxchg
> >>>architectures to produce optimal code.
> >>>
> >>>Implementing an cmpxchg based accessor macro allows cmpxchg architectures
> >>>to produce optimal code and ll/sc non-optimal code.
> >>>
> >>>See my point?
> >>
> >>Wrong. Your ll/sc implementation with cmpxchg is buggy. The cmpxchg
> >>load_locked is not locked at all,
> >
> >
> >Intentional - cmpxchg architectures don't generally have a load locked.
>
> Exactly, so it is wrong -- you can't implement that behaviour with
> load + cmpxchg.

I disagree. I _have_ implemented the required behaviour. I really
don't understand your point saying that it is wrong.

> >>and there can be interleaving writes
> >>between the load and cmpxchg which do not cause the store_conditional
> >>to fail.
> >
> >
> >In which case the cmpxchg fails and we do the atomic operation again,
> >in exactly the same way that we do the operation again if the 'sc'
> >fails in the ll/sc case.
>
> Not if cmpxchg sees the same value, it won't fail, regardless of how
> many writes have hit that memory address.

Don't see anything wrong with that. If that was a problem, atomic
implementations using cmpxchg on x86 would be impossible.

I think you're trying to implement ll/sc semantics on CPUs without
ll/sc which is exactly not what I'm trying to do. I'd argue that's
impossible.

I'm trying to suggest a better implementation for atomic ops rather
than just bowing to this x86-centric "cmpxchg is the best, everyone
must implement it" mentality.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:

2006-12-07 16:51:58

by Ralf Baechle

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

On Thu, Dec 07, 2006 at 08:31:08PM +1100, Nick Piggin wrote:

> Wrong. Your ll/sc implementation with cmpxchg is buggy. The cmpxchg
> load_locked is not locked at all, and there can be interleaving writes
> between the load and cmpxchg which do not cause the store_conditional
> to fail.
>
> It might be reasonable to implement this watered down version, but:
> don't some architectures have restrictions on what instructions can
> be issued between the ll and the sc?

On MIPS the restriction is no loads or stores or sync instructions between
ll/sc. Also there may be at most 2048 bytes between the address of the
ll and sc instructions. Which means ll/sc sequences should better be
written in assembler or gcc might do a bit too creative things ...

Ralf

2006-12-08 17:18:31

by Russell King

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

On Fri, Dec 08, 2006 at 09:06:00AM -0800, Christoph Lameter wrote:
> On Fri, 8 Dec 2006, David Howells wrote:
>
> > > It is the most universal atomic instruction that I know of.
> >
> > I think TAS-type things and XCHG-type things are more common.
>
> Huh? The most popular architectures are i386 x86_64 sparc ia64 etc which
> all have one or the other form of cmpxchg (some issues with early sparc
> and i386).

According to the latest figures, 621 million ARM processors were
shipped in Q3, which equates to about 78 processors per second.
(taken from http://www.arm.com/news/15300.html).

I'd like to know the figures for these other so-called "popular
architectures".

But in any case that's an utterly irrelevant point to the argument
at hand.

> > In fact I think more things have LL/SC than have CMPXCHG.
>
> LL/SC can be easily used to come up with a cmpxchg equivalent.

As proven previously the reverse is also true. And as shown previously
the cheaper out of the two for all platforms is the LL/SC based
implementation, where the architecture specific implementation can
be _either_ LL/SC based or cmpxchg based depending on what is
supported in their hardware.

I'm going to keep saying this until people get it.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:

2006-12-08 01:19:45

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

Russell King wrote:
> On Thu, Dec 07, 2006 at 08:31:08PM +1100, Nick Piggin wrote:

>>>Implementing ll/sc based accessor macros allows both ll/sc _and_ cmpxchg
>>>architectures to produce optimal code.
>>>
>>>Implementing an cmpxchg based accessor macro allows cmpxchg architectures
>>>to produce optimal code and ll/sc non-optimal code.
>>>
>>>See my point?
>>
>>Wrong. Your ll/sc implementation with cmpxchg is buggy. The cmpxchg
>>load_locked is not locked at all,
>
>
> Intentional - cmpxchg architectures don't generally have a load locked.

Exactly, so it is wrong -- you can't implement that behaviour with
load + cmpxchg.

>>and there can be interleaving writes
>>between the load and cmpxchg which do not cause the store_conditional
>>to fail.
>
>
> In which case the cmpxchg fails and we do the atomic operation again,
> in exactly the same way that we do the operation again if the 'sc'
> fails in the ll/sc case.

Not if cmpxchg sees the same value, it won't fail, regardless of how
many writes have hit that memory address.

> I do not see any problem.

This was not the big problem -- as I said, if this was the only problem
we could opt for a "watered down" version that doesn't actually load
locked [the ll/sc interface would be much cooler than cmpxchg, IMO :)]

The main problem is the restrictions between the ll and sc. This is why
I implemented atomic_cmpxchg rather than atomic_ll/sc. However I agree
that in critical code, a higher level API should be used instead (eg.
see atomic_add_unless, which can be implemented optimally on RISCs).

Nick

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-12-08 17:25:33

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

On Fri, 8 Dec 2006, Russell King wrote:

> As proven previously the reverse is also true. And as shown previously
> the cheaper out of the two for all platforms is the LL/SC based
> implementation, where the architecture specific implementation can
> be _either_ LL/SC based or cmpxchg based depending on what is
> supported in their hardware.

As also shown in this thread: There are restrictions on what you can do
between ll/sc. You would not want to use C code there. ll/sc is an thing
that needs to be restricted to asm code. So this is not a viable proposal
at all. ll/sc is useful to construct various atomic functions but cannot
be directly used in C code. cmpxchg can be effectively realized using
ll/sc.

2006-12-08 18:47:25

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it



On Fri, 8 Dec 2006, David Howells wrote:
>
> In fact I think more things have LL/SC than have CMPXCHG.

But you cannot expose ll/sc to C, so that's a bogus argument.

If you do ll/sc, you need to program in assembly language.

Linus

2006-12-08 19:04:32

by Russell King

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

On Fri, Dec 08, 2006 at 10:46:17AM -0800, Linus Torvalds wrote:
>
>
> On Fri, 8 Dec 2006, David Howells wrote:
> >
> > In fact I think more things have LL/SC than have CMPXCHG.
>
> But you cannot expose ll/sc to C, so that's a bogus argument.

Yes you can. Well, you can on ARM at least. Between the load exclusive
you can do anything you like until you hit the store exclusive. If you
haven't touched the location (with anything other than another load
exclusive) and no other CPU has issued a load exclusive, your store
exclusive succeeds. Whether you have branches, loads or stores to
other locations, etc. None of that matters.

Not so bogus as you make out.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:

2006-12-08 19:17:02

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it



On Fri, 8 Dec 2006, Christoph Lameter wrote:
>
> As also shown in this thread: There are restrictions on what you can do
> between ll/sc

This, btw, is almost certainly true on ARM too.

There are three major reasons for restrictions on ll/sc:

- bus-cycle induced things (eg variations of "you cannot do a store in
between the ll and the sc, because it will touch the cache and clear
the bit", where "the store" might be a load too, and "the cache" might
be just "the bus interface")

- trap handling usually clears the internal lock bit too, which means
that depending on the micro-architecture, even internal microtraps
(like even just branch misprediction, but more commonly things like TLB
misses etc) can cause a sc to always fail.

- timing. Livelock in particular.

The last one is the one that hits everybody, regardless of
microarchitecture. The rule may be that the LL/SC need to be within a
certain number of cycles (which can be very small - like ten) in order to
guarantee that the cacheline can't be stolen.

All of which means that _nobody_ can really do this reliably in C. Even if
there are no other microarchitectural rules (and it sounds like that might
be true on ARM), the timing issue means that you can _still_ only use it
for very specific and simple sequences, and trying to expose it as a
higher-level thing is not going to work in general for anything even
remotely complicated.

(The timing may also mean that you end up having to do random back-off
etc, just to make sure _somebody_ makes progress. Ie it might not be a
matter of "within ten cycles", but "you need to randomize the timing").

In other words, it's simply not an option to expose LL/SC as an interface.
It would be VERY convenient to do, since cmpxchg can emulate ll/sc (the
"ll" part is a normal load, the "sc" part is a "compare that the old value
still matches, and store the new one if so"). But because you can't expose
LL/SC anyway in any reasonably portable way, that just doesn't work.

So, you really do end up with three possibilities:

- do things with TRULY PORTABLE interfaces. And like it or not, cmpxchg
is the closest thing you can get to that. It's trivial to do cmpxchg
using ll/sc (modulo the "random backoff part" if you need it, which is
still pretty simple, but no longer totally trivial), and architectures
that have neither ll/sc _nor_ a native cmpxchg can just go screw
themselves with spinlocks - they really aren't worth worrying about in
SMP. At some point you have to tell hardware designers that their
hardware just sucks.

- have ugly conditional code in generic code. I personally think this is
a _much_ worse option in most cases.

- have a much higher-level interface and make it _all_ architecture-
dependent (possibly with a "generic" version for sane architectures).
This works, but the more high-level it is, the more you end up having
the same thign written in many different ways, and nasty maintenance.

So we generally set the bar pretty low. Things like semaphore locking
primitives are high-level enough already that we prefer to try to make
them use common lower-level interfaces (spinlocks, cmpxchg etc).
Something like kernel/workqueue.c is _way_ too high a level to do
arch-specific.

So right now, I think the "cmpxchg" or the "bitmask set" approach are the
alternatives. Russell - LL/SC simply isn't on the table as an interface,
whether you like it or not.

Linus

2006-12-08 19:31:35

by Russell King

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

On Fri, Dec 08, 2006 at 11:15:58AM -0800, Linus Torvalds wrote:
> On Fri, 8 Dec 2006, Christoph Lameter wrote:
> >
> > As also shown in this thread: There are restrictions on what you can do
> > between ll/sc
>
> This, btw, is almost certainly true on ARM too.
>
> There are three major reasons for restrictions on ll/sc:
>
> - bus-cycle induced things (eg variations of "you cannot do a store in
> between the ll and the sc, because it will touch the cache and clear
> the bit", where "the store" might be a load too, and "the cache" might
> be just "the bus interface")

No such restriction on ARM.

> - trap handling usually clears the internal lock bit too, which means
> that depending on the micro-architecture, even internal microtraps
> (like even just branch misprediction, but more commonly things like TLB
> misses etc) can cause a sc to always fail.

Also not true. The architectural implementation is:

ldrex: tags the _physical_ address + cpu number,
transitions to exclusive access.

strex: if in exclusive access state and matches the previous
tagged physical address + cpu number pair, store
succeeds.

This is typically implemented in hardware, and in the case of a SMP
system, external to the CPU cores themselves. So all it's doing is
looking at the exclusive accesses. It is not embedded into the CPU
core so that micro-architectural stuff affects it.

> - timing. Livelock in particular.

That is a problem that we're facing, and the solution is rather simple.
You need to introduce a CPU-number specific number of cycles before
retrying the operation on failure.

> All of which means that _nobody_ can really do this reliably in C.

I utterly disagree. I could code atomic_add() as:

extern void cpu_specific_delay(void);

static inline int atomic_add_return(int i, atomic_t *v)
{
do {
asm("ldrex %0, [%1]" : "=r" (val) : "r" (v));
val += i;
asm("strex %0, %1, [%2]" : "=r" (res) : "r" (val), "r" (v));
if (res)
cpu_specific_delay();
} while (res);

return val;
}

and actually we /are/ going to have to go down this path to break the
livelock problem, like it or not. Ditto for the ARM bitops operations,
so we might as well have ARM at least implement a generic ll/sc thing.

Coding the cpu specific delays into every strex site is going to make
them far too heavy to put inline.

> So right now, I think the "cmpxchg" or the "bitmask set" approach are the
> alternatives. Russell - LL/SC simply isn't on the table as an interface,
> whether you like it or not.

Buggerit then. cmpxchg sucks as an interface, and I am going to
continue to assert that.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:

2006-12-08 19:37:41

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it



On Fri, 8 Dec 2006, Russell King wrote:
>
> Yes you can. Well, you can on ARM at least. Between the load exclusive
> you can do anything you like until you hit the store exclusive. If you
> haven't touched the location (with anything other than another load
> exclusive) and no other CPU has issued a load exclusive, your store
> exclusive succeeds.

Is that actually true?

Almost all LL/SC implementations have granularity rules, where "touch the
location" is not a byte-granular thing, but actually ends up being
something like "touch the same cachline".

They also often have _other_ rules like: "the cacheline has to stay in the
L1 in exclusive state" etc. Which means that in a direct-mapped L1 cache,
you can't even load anything that might be in the same way, because it
would cause a cache eviction that invalidates the SC.

It's possible that ARM has really strong LL/SC, but quite frankly, that
sounds unlikely. I've never heard of anybody ever _architecturally_ saying
that they support that strong requirements, even if certain micro-
architectures might actually support stronger semantics than the ones
guaranteed by the architectural rules.

Linus

2006-12-08 19:38:49

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it



On Fri, 8 Dec 2006, Russell King wrote:
>
> I utterly disagree. I could code atomic_add() as:

Sure. And Alpha could do that too. If you write the C code a specific way,
you can make it work. That does NOT mean that you can expose it widely as
a portable interface - it's still just a very _nonportable_ interface that
you use internally within one architecture to implement other interfaces.

Linus

2006-12-08 19:44:12

by Russell King

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

On Fri, Dec 08, 2006 at 11:37:45AM -0800, Linus Torvalds wrote:
>
>
> On Fri, 8 Dec 2006, Russell King wrote:
> >
> > I utterly disagree. I could code atomic_add() as:
>
> Sure. And Alpha could do that too. If you write the C code a specific way,
> you can make it work. That does NOT mean that you can expose it widely as
> a portable interface - it's still just a very _nonportable_ interface that
> you use internally within one architecture to implement other interfaces.

However, nothing stops you wrapping the non-portable nature of ll/sc up
into the store part though.

If you can efficiently implement cmpxchg inside an ll/sc based portable
interface (yes you can) and you can implement problematical ll/sc
structures inside a cmpxchg() interface, you can do it either way around.
Only one way doesn't penalise broken ll/sc based implementations though.

That is the essence of my argument.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:

2006-12-08 20:00:06

by Russell King

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

On Fri, Dec 08, 2006 at 11:35:39AM -0800, Linus Torvalds wrote:
>
>
> On Fri, 8 Dec 2006, Russell King wrote:
> >
> > Yes you can. Well, you can on ARM at least. Between the load exclusive
> > you can do anything you like until you hit the store exclusive. If you
> > haven't touched the location (with anything other than another load
> > exclusive) and no other CPU has issued a load exclusive, your store
> > exclusive succeeds.
>
> Is that actually true?
>
> Almost all LL/SC implementations have granularity rules, where "touch the
> location" is not a byte-granular thing, but actually ends up being
> something like "touch the same cachline".

ARM's implementation works by using some CPU core peripheral hardware to hold
two pieces of information:

1. the physical address being accessed by the load exclusive instruction.
2. the cpu number.

When an exclusive load is performed, this hardware "remembers" bits
[31:N] (7 >= N >= 2 *) of the physical address and the CPU number.

When an exclusive store is performed, the address and CPU number are
compared to the stored versions, and the store is only allowed to succeed
if they match. Hence, it is independent of the actual cache state,
and is relatively cheap to implement.

The only instructions which affect the exclusive access state are the
exclusive access instructions themselves. Hence, to the same memory
location with the following sequence:

load exclusive
load normal
modify
store normal
store exclusive

the store exclusive _will_ succeed even though you've modified the value
in that memory location via standard loads/stores.

(* - while it is true that the physical address is stored,
implementations may themselves choose not to store between the
lower 7 to 2 bits inclusive - which incidentally means putting
two atomic_t's next together is probably a Bad Thing(tm)... and
rather sucks. However, non-exclusive accesses within that address
region do not affect the exclusive access state.)

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:

2006-12-08 20:02:57

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it



On Fri, 8 Dec 2006, Russell King wrote:
>
> No such restriction on ARM.
>
> Also not true. The architectural implementation is:

I checked the ARM manuals, and quite fankly, they don't back you up.

They do not claim that the physical address tag is byte-granular, and in
fact they make it pretty clear that the same tag is used for all the
sizes, which implies that the tag granularity is NOT byte granular, but
likely something else.

Both the manuals I checked also say: "Other events might cause the tag to
be cleared", without going into particular details other than saying that
a region that is marked non-shared might still clear the tag on access by
other CPU's - but they leave it open whether that's by design or not.

In other words, if there actually is an architectural guarantee that
ldrex/strex are really as strong as you imply, it's not in the standard
architecture manuals from ARM at least for the ARM11562 or the ARM1136.

So I suspect you're wrong, and that the ldrex/strex tags actually are not
all that different from other archtiectures which tend to have cacheline
granularities or more (I _think_ the original alpha granularity was the
whole address space, and any cache traffic would clear it. That's _really_
pathetically weak, but hey, I might remember wrong, and it was the very
first implementation. I doubt ARM is _that_ weak, but I doubt it's as
strong as you claim).

Linus

2006-12-08 20:35:18

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it



On Fri, 8 Dec 2006, Russell King wrote:
>>
> The only instructions which affect the exclusive access state are the
> exclusive access instructions themselves.

Not according to the docs I found.

The ARM1136 manual explicitly states that any attempt to modify that
address clears the tag (for shared memory regions, by _any_ CPU, and for
nonshared regions by _that_ CPU).

And btw, that _has_ to be true, because otherwise the whole ldrex/strex
sequence would be totally unusable as a way to do atomic bit operations on
UP in the presense of interrupts (well, you could have a clrex instruction
in the interrupt handler, but as far as I know you don't, so that seems to
be a moot point - you only seem to do it in __switch_to).

In other words, I _really_ think you're wrong.

So the very code sequence you quote MUST NOT WORK the way you claim it
does. And not only that, since the granularity of the mark is not just for
the bytes in question, but potentially apparently up to 128 bytes, any
store even _close_ to the memory you had xclusive access to will break the
exclusive access.

Really, Russell. Your stance makes no sense. It doesn't make any sense
from a microarchitectural standpoint (it's not how you'd normally
implement these things), but it ALSO makes no sense from the way you
already use those instructions (as a way to protect against other
processors - including your own - touching that word).

Linus

2006-12-08 22:34:08

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

Russell King wrote:
> On Fri, Dec 08, 2006 at 12:18:52PM +1100, Nick Piggin wrote:
>
>>Russell King wrote:
>>
>>>On Thu, Dec 07, 2006 at 08:31:08PM +1100, Nick Piggin wrote:
>>
>>>>>Implementing ll/sc based accessor macros allows both ll/sc _and_ cmpxchg
>>>>>architectures to produce optimal code.
>>>>>
>>>>>Implementing an cmpxchg based accessor macro allows cmpxchg architectures
>>>>>to produce optimal code and ll/sc non-optimal code.
>>>>>
>>>>>See my point?
>>>>
>>>>Wrong. Your ll/sc implementation with cmpxchg is buggy. The cmpxchg
>>>>load_locked is not locked at all,
>>>
>>>
>>>Intentional - cmpxchg architectures don't generally have a load locked.
>>
>>Exactly, so it is wrong -- you can't implement that behaviour with
>>load + cmpxchg.
>
>
> I disagree. I _have_ implemented the required behaviour. I really
> don't understand your point saying that it is wrong.
>
>
>>>>and there can be interleaving writes
>>>>between the load and cmpxchg which do not cause the store_conditional
>>>>to fail.
>>>
>>>
>>>In which case the cmpxchg fails and we do the atomic operation again,
>>>in exactly the same way that we do the operation again if the 'sc'
>>>fails in the ll/sc case.
>>
>>Not if cmpxchg sees the same value, it won't fail, regardless of how
>>many writes have hit that memory address.
>
>
> Don't see anything wrong with that. If that was a problem, atomic
> implementations using cmpxchg on x86 would be impossible.
>
> I think you're trying to implement ll/sc semantics on CPUs without
> ll/sc which is exactly not what I'm trying to do. I'd argue that's
> impossible.

Yes, I did think that from reading your emails. It is not a problem
as such, but it is important to be clear on semantics.

> I'm trying to suggest a better implementation for atomic ops rather
> than just bowing to this x86-centric "cmpxchg is the best, everyone
> must implement it" mentality.

Even if ARM is able to handle any arbitrary C code between the
"load locked" and store conditional API, other architectures can not
by definition.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-12-11 11:04:41

by David Howells

[permalink] [raw]
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

Russell King <[email protected]> wrote:

> Yes you can. Well, you can on ARM at least. Between the load exclusive
> you can do anything you like until you hit the store exclusive.

How come atomic_set() on arm6 is implemented as:

static inline void atomic_set(atomic_t *v, int i)
{
unsigned long tmp;

__asm__ __volatile__("@ atomic_set\n"
"1: ldrex %0, [%1]\n"
" strex %0, %2, [%1]\n"
" teq %0, #0\n"
" bne 1b"
: "=&r" (tmp)
: "r" (&v->counter), "r" (i)
: "cc");
}

Why LDREX/STREX and not direct assignment?

David