Currently unlock_buffer() contains an smp_mb__after_clear_bit(), which is
weird because bit_spin_unlock() uses smp_mb__before_clear_bit():
From include/linux/bit_spinlock.h:
static inline void bit_spin_unlock(int bitnum, unsigned long *addr)
{
smp_mb__before_clear_bit();
clear_bit(bitnum, addr);
preempt_enable();
__release(bitlock);
}
For most architectures there is no difference, because
smp_mb__after_clear_bit() and smp_mb__before_clear_bit() are both
memory barriers and clear_buffer_locked() is an atomic operation.
However, they differ under IA64.
Note that this potential race has never been seen under IA64. It was
discovered by inspection by Zoltan Menyhart <[email protected]>.
Regardless of whether this is a true race or not, I think the unlock sequence
needs to be the same for bit locks and unlock_buffer(). Maybe
unlock_buffer() and lock_buffer() had better use bit spinlock operations?
Change unlock_buffer() to work the same way as bit_spin_unlock.
Signed-off-by: Christoph Lameter <[email protected]>
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c 2006-03-27 14:09:54.000000000 -0800
+++ linux-2.6/fs/buffer.c 2006-03-27 19:40:32.000000000 -0800
@@ -78,8 +78,8 @@ EXPORT_SYMBOL(__lock_buffer);
void fastcall unlock_buffer(struct buffer_head *bh)
{
+ smp_mb__before_clear_bit();
clear_buffer_locked(bh);
- smp_mb__after_clear_bit();
wake_up_bit(&bh->b_state, BH_Lock);
}
Christoph Lameter wrote:
> Currently unlock_buffer() contains an smp_mb__after_clear_bit(), which is
> weird because bit_spin_unlock() uses smp_mb__before_clear_bit():
>
> From include/linux/bit_spinlock.h:
>
> static inline void bit_spin_unlock(int bitnum, unsigned long *addr)
> {
> smp_mb__before_clear_bit();
> clear_bit(bitnum, addr);
> preempt_enable();
> __release(bitlock);
> }
>
> For most architectures there is no difference because both
> smp_mb__after_clear_bit() and smp_mb__before_clear_bit() are both
> memory barriers and clear_buffer_locked() is an atomic operation.
> However, they differ under IA64.
>
> Note that this potential race has never been seen under IA64. It was
> discovered by inspection by Zoltan Menyhart <[email protected]>.
>
> Regardless if this is a true race or not, I think the unlock sequence
> needs to be the same for bit locks and unlock_buffer(). Maybe
> unlock_buffer and lock_buffer better use bit spinlock operations?
>
> Change unlock_buffer() to work the same way as bit_spin_unlock.
>
OK, that's fair enough and I guess you do need a barrier there.
However, should the mb__after barrier still remain? The comment
in wake_up_bit suggests yes, and there is similar code in
unlock_page.
Let's see... I guess the race is that the waitqueue load could be
completed before the clearing of the bit becomes visible, so the queue may
appear to be empty, but another CPU may find the bit locked after
it has added the task to the waitqueue. I think?
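A rough sketch of that interleaving, purely for illustration (the two
columns are shorthand for the unlock_buffer() and __wait_on_bit() paths,
not new functions):

/*
 * CPU0: unlock_buffer()               CPU1: waiter
 *
 * clear_bit(BH_Lock)     [store]      test_and_set_bit(BH_Lock) -> busy
 *                                     add task to waitqueue     [store]
 *                                     recheck bit: the clear is not yet
 *                                     visible, so go to sleep
 * load waitqueue head    [load]
 *
 * Without a barrier between the clear and the waitqueue load, the load
 * may be satisfied before the store is visible, the queue looks empty,
 * no wakeup is issued, and CPU1 sleeps indefinitely.
 */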
Also, I think there is still the issue of ia64 not having the
correct memory consistency semantics. To start with, all the bitops
and atomic ops which both modify their operand and return a value
should be full memory barriers before and after the operation,
according to Documentation/atomic_ops.txt.
Secondly, I don't think smp_mb__before/after are just defined to
have acquire or release semantics; they should be full barriers.
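For reference, this is roughly how those macros read at the time; the ia64
lines are the ones quoted elsewhere in this thread, the i386 lines are
from memory and worth double-checking against the 2.6.16 tree:

/* ia64: clear_bit() itself only has acquire semantics, so the "before"
 * macro has to supply a real barrier and the "after" one is a no-op.
 */
#define smp_mb__before_clear_bit()	smp_mb()
#define smp_mb__after_clear_bit()	do { /* skip */; } while (0)

/* i386: clear_bit() uses a lock-prefixed instruction, which already
 * orders everything, so compiler barriers are enough.
 */
#define smp_mb__before_clear_bit()	barrier()
#define smp_mb__after_clear_bit()	barrier()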
> Signed-off-by: Christoph Lameter <[email protected]>
>
> Index: linux-2.6/fs/buffer.c
> ===================================================================
> --- linux-2.6.orig/fs/buffer.c 2006-03-27 14:09:54.000000000 -0800
> +++ linux-2.6/fs/buffer.c 2006-03-27 19:40:32.000000000 -0800
> @@ -78,8 +78,8 @@ EXPORT_SYMBOL(__lock_buffer);
>
> void fastcall unlock_buffer(struct buffer_head *bh)
> {
> + smp_mb__before_clear_bit();
> clear_buffer_locked(bh);
> - smp_mb__after_clear_bit();
> wake_up_bit(&bh->b_state, BH_Lock);
> }
>
>
--
SUSE Labs, Novell Inc.
Nick Piggin wrote on Tuesday, March 28, 2006 12:11 AM
> Also, I think there is still the issue of ia64 not having the
> correct memory consistency semantics. To start with, all the bitops
> and atomic ops which both modify their operand and return a value
> should be full memory barriers before and after the operation,
> according to Documentation/atomic_ops.txt.
I suppose the usage of atomic ops is abused; they are used in both the lock
and the unlock paths. And it naturally sucks because it now requires a full
memory barrier. A better way is to define 3 variants: one for the lock
path, one for the unlock path, and one with a full memory fence.
- Ken
Chen, Kenneth W wrote:
> Nick Piggin wrote on Tuesday, March 28, 2006 12:11 AM
>
>>Also, I think there is still the issue of ia64 not having the
>>correct memory consistency semantics. To start with, all the bitops
>>and atomic ops which both modify their operand and return a value
>>should be full memory barriers before and after the operation,
>>according to Documentation/atomic_ops.txt.
>
> I suppose the usage of atomic ops is abused, it is used in both lock
> and unlock path. And it naturally suck because it now requires full
> memory barrier. A better way is to define 3 variants: one for lock
> path, one for unlock path, and one with full memory fence.
I agree. As I wrote a few days ago:
Why not to use separate bit operations for different purposes?
- e.g. "test_and_set_bit_N_acquire()" for lock acquisition
- "test_and_set_bit()", "clear_bit()" as they are today
- "release_N_clear_bit()"...
Thanks,
Zoltan
On Tue, 28 Mar 2006, Zoltan Menyhart wrote:
> Why not to use separate bit operations for different purposes?
>
> - e.g. "test_and_set_bit_N_acquire()" for lock acquisition
> - "test_and_set_bit()", "clear_bit()" as they are today
> - "release_N_clear_bit()"...
>
That would force IA64 specifics onto all other architectures.
Could we simply define these smp_mb__*_clear_bit() macros to be noops
and then make the atomic bit ops have full barriers? That would satisfy
Nick's objections.
Index: linux-2.6.16/include/asm-ia64/bitops.h
===================================================================
--- linux-2.6.16.orig/include/asm-ia64/bitops.h 2006-03-19 21:53:29.000000000 -0800
+++ linux-2.6.16/include/asm-ia64/bitops.h 2006-03-28 15:45:08.000000000 -0800
@@ -45,6 +45,7 @@
old = *m;
new = old | bit;
} while (cmpxchg_acq(m, old, new) != old);
+ smp_mb();
}
/**
@@ -65,7 +66,7 @@
/*
* clear_bit() has "acquire" semantics.
*/
-#define smp_mb__before_clear_bit() smp_mb()
+#define smp_mb__before_clear_bit() do { } while (0)
#define smp_mb__after_clear_bit() do { /* skip */; } while (0)
/**
@@ -92,6 +93,7 @@
old = *m;
new = old & mask;
} while (cmpxchg_acq(m, old, new) != old);
+ smp_mb();
}
/**
@@ -128,6 +130,7 @@
old = *m;
new = old ^ bit;
} while (cmpxchg_acq(m, old, new) != old);
+ smp_mb();
}
/**
@@ -167,6 +170,7 @@
old = *m;
new = old | bit;
} while (cmpxchg_acq(m, old, new) != old);
+ smp_mb();
return (old & bit) != 0;
}
@@ -212,6 +216,7 @@
old = *m;
new = old & mask;
} while (cmpxchg_acq(m, old, new) != old);
+ smp_mb();
return (old & ~mask) != 0;
}
@@ -257,6 +262,7 @@
old = *m;
new = old ^ bit;
} while (cmpxchg_acq(m, old, new) != old);
+ smp_mb();
return (old & bit) != 0;
}
Christoph Lameter wrote on Tuesday, March 28, 2006 3:48 PM
> On Tue, 28 Mar 2006, Zoltan Menyhart wrote:
>
> > Why not to use separate bit operations for different purposes?
> >
> > - e.g. "test_and_set_bit_N_acquire()" for lock acquisition
> > - "test_and_set_bit()", "clear_bit()" as they are today
> > - "release_N_clear_bit()"...
> >
>
> That would force IA64 specifics onto all other architectures.
>
> Could we simply define these smp_mb__*_clear_bit() macros to be noops
> and then make the atomic bit ops have full barriers? That would satisfy
> Nick's objections.
>
> --- linux-2.6.16.orig/include/asm-ia64/bitops.h 2006-03-19 21:53:29.000000000 -0800
> +++ linux-2.6.16/include/asm-ia64/bitops.h 2006-03-28 15:45:08.000000000 -0800
> @@ -45,6 +45,7 @@
> old = *m;
> new = old | bit;
> } while (cmpxchg_acq(m, old, new) != old);
> + smp_mb();
> }
There is a better way to do it. The pointer is already cast as volatile,
so old = *m has acq semantics built in; we can just change cmpxchg_acq to
cmpxchg_rel, and then effectively it is a full memory barrier without doing the
expensive smp_mb().
- Ken
Christoph Lameter wrote on Tuesday, March 28, 2006 3:48 PM
> Could we simply define these smp_mb__*_clear_bit() macros to be noops
> and then make the atomic bit ops have full barriers? That would satisfy
> Nick's objections.
Oh, it also penalizes all the other 1,055 call sites of clear_bit(), though I don't
know how many actually need a memory barrier. I suspect some need a "lock"
barrier, some need an "unlock" barrier, and of course some need a full fence.
Why not make unlock_buffer() use test_and_clear_bit(), utilizing its implied
full memory fence and throwing away the return value? OK, OK, this is obscure.
Then introduce a clear_bit_memory_fence API or some such.
- Ken
diff -Nurp linux-2.6.16/fs/buffer.c linux-2.6.16.ken/fs/buffer.c
--- linux-2.6.16/fs/buffer.c 2006-03-19 21:53:29.000000000 -0800
+++ linux-2.6.16.ken/fs/buffer.c 2006-03-28 17:20:02.000000000 -0800
@@ -78,8 +78,7 @@ EXPORT_SYMBOL(__lock_buffer);
void fastcall unlock_buffer(struct buffer_head *bh)
{
- clear_buffer_locked(bh);
- smp_mb__after_clear_bit();
+ test_clear_buffer_locked(bh);
wake_up_bit(&bh->b_state, BH_Lock);
}
On Tue, 28 Mar 2006, Chen, Kenneth W wrote:
> Why not make unlock_buffer use test_and_clear_bit()? Utilizing it's implied
> full memory fence and throw away the return value? OK, OK, this is obscured.
> Then introduce clear_bit_memory_fence API or some sort.
Only for IA64's sake? Better to clean up the bitops as you suggested earlier.
The open-ended acquires there leave a weird feeling.
Something like this? (builds fine, not tested yet)
Fix up the bitops for ia64.
Signed-off-by: Christoph Lameter <[email protected]>
Index: linux-2.6.16/include/asm-ia64/bitops.h
===================================================================
--- linux-2.6.16.orig/include/asm-ia64/bitops.h 2006-03-19 21:53:29.000000000 -0800
+++ linux-2.6.16/include/asm-ia64/bitops.h 2006-03-28 16:35:51.000000000 -0800
@@ -38,13 +38,14 @@ set_bit (int nr, volatile void *addr)
volatile __u32 *m;
CMPXCHG_BUGCHECK_DECL
+ /* Volatile load = acq */
m = (volatile __u32 *) addr + (nr >> 5);
bit = 1 << (nr & 31);
do {
CMPXCHG_BUGCHECK(m);
old = *m;
new = old | bit;
- } while (cmpxchg_acq(m, old, new) != old);
+ } while (cmpxchg_rel(m, old, new) != old);
}
/**
@@ -65,7 +66,7 @@ __set_bit (int nr, volatile void *addr)
/*
* clear_bit() has "acquire" semantics.
*/
-#define smp_mb__before_clear_bit() smp_mb()
+#define smp_mb__before_clear_bit() do { /* skip */; } while (0)
#define smp_mb__after_clear_bit() do { /* skip */; } while (0)
/**
@@ -85,13 +86,14 @@ clear_bit (int nr, volatile void *addr)
volatile __u32 *m;
CMPXCHG_BUGCHECK_DECL
+ /* Volatile load = acq */
m = (volatile __u32 *) addr + (nr >> 5);
mask = ~(1 << (nr & 31));
do {
CMPXCHG_BUGCHECK(m);
old = *m;
new = old & mask;
- } while (cmpxchg_acq(m, old, new) != old);
+ } while (cmpxchg_rel(m, old, new) != old);
}
/**
@@ -121,13 +123,14 @@ change_bit (int nr, volatile void *addr)
volatile __u32 *m;
CMPXCHG_BUGCHECK_DECL
+ /* Volatile load = acq */
m = (volatile __u32 *) addr + (nr >> 5);
bit = (1 << (nr & 31));
do {
CMPXCHG_BUGCHECK(m);
old = *m;
new = old ^ bit;
- } while (cmpxchg_acq(m, old, new) != old);
+ } while (cmpxchg_rel(m, old, new) != old);
}
/**
@@ -160,13 +163,14 @@ test_and_set_bit (int nr, volatile void
volatile __u32 *m;
CMPXCHG_BUGCHECK_DECL
+ /* Volatile load = acq */
m = (volatile __u32 *) addr + (nr >> 5);
bit = 1 << (nr & 31);
do {
CMPXCHG_BUGCHECK(m);
old = *m;
new = old | bit;
- } while (cmpxchg_acq(m, old, new) != old);
+ } while (cmpxchg_rel(m, old, new) != old);
return (old & bit) != 0;
}
@@ -205,13 +209,14 @@ test_and_clear_bit (int nr, volatile voi
volatile __u32 *m;
CMPXCHG_BUGCHECK_DECL
+ /* Volatile load = acq */
m = (volatile __u32 *) addr + (nr >> 5);
mask = ~(1 << (nr & 31));
do {
CMPXCHG_BUGCHECK(m);
old = *m;
new = old & mask;
- } while (cmpxchg_acq(m, old, new) != old);
+ } while (cmpxchg_rel(m, old, new) != old);
return (old & ~mask) != 0;
}
Christoph Lameter wrote on Tuesday, March 28, 2006 4:47 PM
> > Why not make unlock_buffer use test_and_clear_bit()? Utilizing it's implied
> > full memory fence and throw away the return value? OK, OK, this is obscured.
> > Then introduce clear_bit_memory_fence API or some sort.
>
> Only for IA64's sake? Better clean up the bitops as you suggested earlier.
> The open ended acquires there leaves a weird feeling.
>
> Something like this? (builds fine not tested yet)
It's a warm and fuzzy feeling with the changes in set_bit(), clear_bit(), and
change_bit(). The API was never meant to have an implied memory fence in them,
though the usage might be assuming one way or the other because of x86
semantics.
How many of these things are used as (1) simple atomic op, (2) lock,
(3) unlock, and (4) full fence?
clear_bit - 1,070 hits
set_bit - 1,450 hits
change_bit - 8 hits
The effect of changing them to full memory fences could be widespread, though
I don't have any numbers yet to say how much it will matter for performance.
- Ken
Christoph Lameter wrote:
>On Tue, 28 Mar 2006, Zoltan Menyhart wrote:
>
>
>>Why not to use separate bit operations for different purposes?
>>
>>- e.g. "test_and_set_bit_N_acquire()" for lock acquisition
>>- "test_and_set_bit()", "clear_bit()" as they are today
>>- "release_N_clear_bit()"...
>>
>>
>
>That would force IA64 specifics onto all other architectures.
>
>Could we simply define these smp_mb__*_clear_bit() macros to be noops
>and then make the atomic bit ops have full barriers? That would satisfy
>Nick's objections.
>
>
Well yes, I think the current operations should not be changed because
that would probably require large and difficult audits of the tree.
However, I think it might be reasonable to use bit lock operations in
places like page lock and buffer lock (i.e. with acquire and release
semantics). It improves ia64 without harming other architectures, and
also makes the code more expressive.
>Index: linux-2.6.16/include/asm-ia64/bitops.h
>===================================================================
>--- linux-2.6.16.orig/include/asm-ia64/bitops.h 2006-03-19 21:53:29.000000000 -0800
>+++ linux-2.6.16/include/asm-ia64/bitops.h 2006-03-28 15:45:08.000000000 -0800
>@@ -45,6 +45,7 @@
> old = *m;
> new = old | bit;
> } while (cmpxchg_acq(m, old, new) != old);
>+ smp_mb();
> }
>
>
Can this allow the actual bitop to move up past previous memory ops?
Would it be sufficient to just put a release barrier above the acquire
barrier?
On Wed, 29 Mar 2006, Nick Piggin wrote:
> However, I think it might be reasonable to use bit lock operations in
> places like page lock and buffer lock (i.e. with acquire and release
> semantics). It improves ia64 without harming other architectures, and
> also makes the code more expressive.
How would we express the acquire and release semantics?
Christoph Lameter wrote:
>On Wed, 29 Mar 2006, Nick Piggin wrote:
>
>
>>However, I think it might be reasonable to use bit lock operations in
>>places like page lock and buffer lock (i.e. with acquire and release
>>semantics). It improves ia64 without harming other architectures, and
>>also makes the code more expressive.
>>
>
>How would we express the acquire and release semantics?
>
Hmm, not sure. Maybe a few new bitops with _lock / _unlock postfixes?
For page lock and buffer lock we'd just need test_and_set_bit_lock,
clear_bit_unlock, smp_mb__after_clear_bit_unlock.
I don't know, _for_lock might be a better name. But it's getting long.
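For illustration, a minimal sketch of what generic fallbacks for these
(so far hypothetical) primitives could look like, so that only
architectures with real acquire/release instructions need to override
them:

/* Sketch only -- names follow the suggestion above, not an existing API. */
#define test_and_set_bit_lock(nr, addr)	test_and_set_bit(nr, addr)

#define clear_bit_unlock(nr, addr)		\
do {						\
	smp_mb__before_clear_bit();		\
	clear_bit(nr, addr);			\
} while (0)

/* Callers that wake waiters afterwards still need the "after" barrier. */
#define smp_mb__after_clear_bit_unlock()	smp_mb__after_clear_bit()

An architecture like ia64 could then implement clear_bit_unlock() with a
cmpxchg_rel loop and give smp_mb__after_clear_bit_unlock() a real smp_mb(),
without touching the other architectures.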
Nick Piggin wrote on Tuesday, March 28, 2006 6:36 PM
> Hmm, not sure. Maybe a few new bitops with _lock / _unlock postfixes?
> For page lock and buffer lock we'd just need test_and_set_bit_lock,
> clear_bit_unlock, smp_mb__after_clear_bit_unlock.
>
> I don't know, _for_lock might be a better name. But it's getting long.
I think the kernel needs all 4 variants:
clear_bit
clear_bit_lock
clear_bit_unlock
clear_bit_fence
And the variants need to be permuted across all the other bit ops ... I think it
would indeed be a better API and be more explicit about the ordering.
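A sketch of how the four clear_bit variants might map onto the ia64
primitives already used in this thread (the names and the fence sequence
are assumptions, not an existing API):

/* Sketch only, following the ia64 cmpxchg-loop style: */
static __inline__ void
clear_bit_unlock (int nr, volatile void *addr)
{
	__u32 mask, old, new;
	volatile __u32 *m;
	CMPXCHG_BUGCHECK_DECL

	m = (volatile __u32 *) addr + (nr >> 5);
	mask = ~(1 << (nr & 31));
	do {
		CMPXCHG_BUGCHECK(m);
		old = *m;
		new = old & mask;
	} while (cmpxchg_rel(m, old, new) != old);	/* release: earlier accesses drain first */
}

/*
 * clear_bit()        -> cmpxchg_acq loop, as today
 * clear_bit_lock()   -> cmpxchg_acq loop (acquire is what taking a lock needs)
 * clear_bit_unlock() -> cmpxchg_rel loop, as above
 * clear_bit_fence()  -> smp_mb() plus a cmpxchg_rel loop, or rel followed by
 *                       an acquire reload (see the MODE_FENCE discussion below)
 */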
- Ken
Nick Piggin wrote on Tuesday, March 28, 2006 12:11 AM
> OK, that's fair enough and I guess you do need a barrier there.
> However, should the mb__after barrier still remain? The comment
> in wake_up_bit suggests yes, and there is similar code in
> unlock_page.
Question on unlock_page:
void fastcall unlock_page(struct page *page)
{
smp_mb__before_clear_bit();
if (!TestClearPageLocked(page))
BUG();
smp_mb__after_clear_bit();
wake_up_page(page, PG_locked);
}
Assuming test_and_clear_bit() on all arches does what the API calls
for, with a full memory fence around the atomic op, why do
you need smp_mb__before_clear_bit() and smp_mb__after_clear_bit()?
Aren't they redundant?
- Ken
On Tue, 28 Mar 2006, Chen, Kenneth W wrote:
> Nick Piggin wrote on Tuesday, March 28, 2006 6:36 PM
> > Hmm, not sure. Maybe a few new bitops with _lock / _unlock postfixes?
> > For page lock and buffer lock we'd just need test_and_set_bit_lock,
> > clear_bit_unlock, smp_mb__after_clear_bit_unlock.
> >
> > I don't know, _for_lock might be a better name. But it's getting long.
>
> I think kernel needs all 4 variants:
>
> clear_bit
> clear_bit_lock
> clear_bit_unlock
> clear_bit_fence
>
> And the variant need to permutated on all other bit ops ... I think it
> would be indeed a better API and be more explicit about the ordering.
How about clear_bit(why, bit, address) in order to keep
the variants down? Get rid of the smp_mb__*_xxxx stuff.
Christoph Lameter wrote:
> On Tue, 28 Mar 2006, Zoltan Menyhart wrote:
>
>
>>Why not to use separate bit operations for different purposes?
>>
>>- e.g. "test_and_set_bit_N_acquire()" for lock acquisition
>>- "test_and_set_bit()", "clear_bit()" as they are today
>>- "release_N_clear_bit()"...
>>
>
>
> That would force IA64 specifics onto all other architectures.
>
> Could we simply define these smp_mb__*_clear_bit() macros to be noops
> and then make the atomic bit ops have full barriers? That would satisfy
> Nick's objections.
I do not want to force IA64 specifics...
I think the lock / unlock operations are as they are for historical reasons.
IA64 only helps me to see what is really necessary for the lock
implementation and what is just a heritage of the PCs.
I will try to summarize my thoughts about the locks in an
architecture-independent way. If we can agree upon how it should work, then
we can enter into the details later :-)
1. lock():
The lock acquisition semantics guarantees that the lock is acquired by
its new owner and the new ownership is made globally visible prior to all
subsequent orderable instructions (issued by the new owner) becoming
visible to the others.
Example:
|
v
|
store_X ----------+
| |
v |
| |
======== lock ====+=======
| | ^ ^
v v ^ ^
| | |
load_Y ---------------+ |
store_Z ----------------+
|
v
|
Note that "store_X" is not guaranteed to be globally performed before
the lock.
2. unlock():
The lock release semantics guarantees that all previous orderable
instructions (issued by the lock owner) are made globally visible prior
to make visible to the others the lock becoming free.
Example:
|
v
|
store_A ----------+
| |
v v ^
| v |
======= unlock =======+===
| |
v |
| |
load_B ---------------+
|
v
|
Note that "load_B" can be globally performed before the unlock.
The mutual exclusion means:
- "store_A" is before unlocking
- unlocking is before locking by the next owner
- locking is before "load_Y" or "store_Z"
therefore "store_A" is before "load_Y" or "store_Z".
No relation is defined between "store_X" and any other operation,
nor between "load_B" and any other operation.
Can we agree that locking and unlocking mean no more than what I
described, especially that they do not imply bidirectional fencing?
Should someone need bidirectional fencing, s/he would have to
use some additional sync. methods.
In order to separate the theory (how it should work) and the practice
(how it should be implemented), I put the bit-op stuff in another
mail.
Thanks.
Zoltan
Part 2.
There are a couple of different ways of using / implementing bit-op-s.
I try to summarize them:
1. "Stand alone" bit operations:
By "stand alone" I mean that a modification to a bit field member is
independent of any other operation before / after the bit operation
in question. There is no ordering with respect to the other operations.
They do not provide any SMP synch. semantics, there is no need to
implement them as some atomic operations.
2. Atomic bit operations:
As Linux became SMP safe, a couple of bit operations were
re-implemented by use of atomic operations.
An early implementation on PCs determined the basic semantics of
these operations, including their fencing behavior.
atomic_ops.txt:
" obj->dead = 1;
if (test_and_set_bit(0, &obj->flags))
/* ... */;
obj->killed = 1;
The implementation of test_and_set_bit() must guarantee that
"obj->dead = 1;" is visible to cpus before the atomic memory operation
done by test_and_set_bit() becomes visible. Likewise, the atomic
memory operation done by test_and_set_bit() must become visible before
"obj->killed = 1;" is visible."
I am not convinced that this is a good sync. strategy.
Unfortunately, the acquisition / release semantics were not defined
in Linux in an architecture-independent and explicit way.
People simply abused the fact that the bidirectional fencing is
implicitly provided by some architectures for free.
Theoretically, code fragments like the above have to be cleaned up
to use explicit ordering primitives. The description in atomic_ops.txt
should state that this is not the way to do it... (see the Fencing bit
operations below).
There are perhaps ~20 "test_and_set_bit()" uses like this here and there.
The atomic bit operations that require ordering of the preceding /
following operations are mainly used for locking(-like) purposes:
3. Fencing bit operations:
Should someone still insist on using the atomic bit operations as today,
s/he should change to use e.g.:
obj->dead = 1;
if (test_and_set_bit_N_fence(0, &obj->flags))
/* ... */;
obj->killed = 1;
4. Bit-lock operations:
I summarized the ordering requirements here:
http://marc.theaimsgroup.com/?l=linux-ia64&m=114362989421046&w=2
In order to let the architectures implement these bit-lock
operations efficiently, the usage has to indicate the _minimal_
required ordering semantics, e.g.:
test_and_set_bit_N_acquire()
or ordered_test_and_set_bit(acquire, ...)
release_N_clear_bit()
etc.
The style like:
do_some_ordering_before()
bit_op()
do_some_ordering_after()
is quite correct; however, the compiler is not sufficiently
intelligent to let the architecture-dependent parts merge, e.g. into
a single machine instruction (if one exists).
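To make that last point concrete, a sketch of the two styles on ia64
(the instruction names on the right are only illustrative):

/*
 * Ordering expressed around a generic op -- two separate things to emit:
 *
 *	smp_mb__before_clear_bit();	->  mf
 *	clear_bit(nr, addr);		->  cmpxchg4.acq loop
 *
 * Ordering folded into the operation itself -- the architecture can pick
 * the completer on the instruction and drop the extra fence:
 *
 *	clear_bit_unlock(nr, addr);	->  cmpxchg4.rel loop
 *	   (hypothetical name, as proposed elsewhere in this thread)
 */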
Regards,
Zoltan
For what it's worth, I've been involved in some recent work whose goals
currently include properly specifying such things at the user level,
currently primarily in the context of C++, with the hope that this will
eventually migrate into C. (See
http://www.hpl.hp.com/personal/Hans_Boehm/c++mm/ for details.)
Some potentially relevant observations:
- Java uses the terms "acquire" and "release" to express approximately
what "lock" and "unlock" mean here. The major difference is that
performing a "release" operation on a particular object only ensures
visibility to threads that later perform an "acquire" operation on the
same object. I'm not sure that there are any architectures on which
that difference would currently result in a different implementation,
though it may matter for software DSM implementations. I think this
terminology has been in fairly widespread use in the architecture
community since at least 1990 or so. (Google "release consistency".)
- C# uses similar terminology, as we have been doing so far for C++.
I'm not convinced these are Itanium specific terms.
- At user level, the ordering semantics required for something like
pthread_mutex_lock() are unfortunately unclear. If you try to interpret
the current standard, you arrive at the conclusion that
pthread_mutex_lock() basically needs a full barrier, though
pthread_mutex_unlock() doesn't. (See
http://www.hpl.hp.com/techreports/2005/HPL-2005-217.html .) There is
some evidence that this is unintentional, inconsistently implemented,
and probably undesired.
- The C++ equivalent of this is not very far along and still
controversial. But I think the current majority sentiment is in favor
of expressing the ordering constraints in a way that ensures they cannot
accidentally become dynamically computed arguments. i.e. in favor of
making them either part of the operation name or template arguments, but
not regular arguments.
Hans
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of
> Christoph Lameter
> Sent: Tuesday, March 28, 2006 11:11 PM
> To: Chen, Kenneth W
> Cc: 'Nick Piggin'; Zoltan Menyhart; [email protected];
> [email protected]; [email protected]
> Subject: RE: Fix unlock_buffer() to work the same way as bit_unlock()
>
> On Tue, 28 Mar 2006, Chen, Kenneth W wrote:
>
> > Nick Piggin wrote on Tuesday, March 28, 2006 6:36 PM
> > > Hmm, not sure. Maybe a few new bitops with _lock /
> _unlock postfixes?
> > > For page lock and buffer lock we'd just need
> test_and_set_bit_lock,
> > > clear_bit_unlock, smp_mb__after_clear_bit_unlock.
> > >
> > > I don't know, _for_lock might be a better name. But it's
> getting long.
> >
> > I think kernel needs all 4 variants:
> >
> > clear_bit
> > clear_bit_lock
> > clear_bit_unlock
> > clear_bit_fence
> >
> > And the variant need to permutated on all other bit ops ...
> I think
> > it would be indeed a better API and be more explicit about
> the ordering.
>
> How about clear_bit(why, bit, address) in order to keep the
> variants down? Get rid of the smp_mb__*_xxxx stuff.
>
>
>
On Wed, Mar 29, 2006 at 10:33:57AM -0800, Boehm, Hans wrote:
...
> - At user level, the ordering semantics required for something like
> pthread_mutex_lock() are unfortunately unclear. If you try to interpret
> the current standard, you arrive at the conclusion that
> pthread_mutex_lock() basically needs a full barrier, though
> pthread_mutex_unlock() doesn't. (See
> http://www.hpl.hp.com/techreports/2005/HPL-2005-217.html .)
Was the talk you presented at the May 2005 Gelato meeting in Cupertino
based on an earlier version of this paper?
That was a very good presentation that exposed the deficiencies
in the programming models and languages. If the slides and/or
a recording are available, that might be helpful here too.
thanks,
grant
That was actually based on a different, earlier paper.
Somewhat improved slides for the talk Grant is referring to are at
http://www.hpl.hp.com/personal/Hans_Boehm/misc_slides/pldi05_threads.pdf
. The problem is also relevant for kernel development, though the title
doesn't fit, and it clearly needs to be addressed at the language spec
and compiler level. (Note that the claim about gcc on slide 14 is
actually incorrect as it stands (I misread the .s file), but the claim
is correct if you add a conditional to the body of the example loop.
Thus you won't be led far astray.) The PLDI paper on which the talk is
based contained a conjecture about required ordering for Posix locks,
which is disproved by the TR below.
It's hard to get this stuff right. But we knew that.
Hans
> -----Original Message-----
> From: Grundler, Grant G
> Sent: Wednesday, March 29, 2006 11:12 AM
> To: Boehm, Hans
> Cc: Christoph Lameter; Chen, Kenneth W; Nick Piggin; Zoltan
> Menyhart; [email protected]; [email protected];
> [email protected]
> Subject: Re: Fix unlock_buffer() to work the same way as bit_unlock()
>
> On Wed, Mar 29, 2006 at 10:33:57AM -0800, Boehm, Hans wrote:
> ...
> > - At user level, the ordering semantics required for something like
> > pthread_mutex_lock() are unfortunately unclear. If you try to
> > interpret the current standard, you arrive at the conclusion that
> > pthread_mutex_lock() basically needs a full barrier, though
> > pthread_mutex_unlock() doesn't. (See
> > http://www.hpl.hp.com/techreports/2005/HPL-2005-217.html .)
>
> Was the talk you presented at the May 2005 Gelato meeting in
> Cupertino based on an earlier version of this paper?
>
> That was a very good presentation that exposed the
> deficiencies in the programming models and languages. If the
> slides and/or a recording are available, that might be
> helpful here too.
>
> thanks,
> grant
>
On Wed, 29 Mar 2006, Boehm, Hans wrote:
> Somewhat improved slides for the talk Grant is referring to are at
> http://www.hpl.hp.com/personal/Hans_Boehm/misc_slides/pldi05_threads.pdf
Hmm.. A paper on that subject would be better. Cannot get much from the
slides.
> It's hard to get this stuff right. But we knew that.
Could you come up with a proposal for how to correctly define an API for bit
ops in such a way that they work for all architectures and allow us to
utilize all the features that processors may have?
The slides and paper are really discussing a different issue, namely the
extent to which it is safe to use a compiler like gcc which was
primarily designed for sequential code on concurrent code, using a model
similar to what pthreads (or I believe the kernel) uses. The short
answer is that it isn't. There is some low probability that the compiler
will introduce a data race. And that is very hard to anticipate based
on the source, and seems to be similarly hard to avoid by any known
programming guidelines. The solution strategy is to fix language
standards and compilers. We're working on the first part for now.
You can find the paper corresponding to those slides at
http://portal.acm.org/citation.cfm?doid=1065010.1065042 or
http://www.hpl.hp.com/techreports/2004/HPL-2004-209.html .
Returning to the original topic, I don't think I'm the one to design the
bitops API, since I'm not sufficiently familiar with the kernel issues.
I did design a vaguely comparable user-level interface that addresses
atomic operations in general, not specifically bit vector operations.
That's described at http://www.hpl.hp.com/research/linux/atomic_ops/, or
more specifically at
http://www.hpl.hp.com/research/linux/atomic_ops/README.txt . I'm making
another pass over the C++ proposal version as we speak, but it's mostly
similar in spirit. Design decisions that have turned out to be dubious
are:
1. Including all ordering types for simple load and store operations.
Some don't make sense.
2. The set of ordering constraints there is probably too large. None,
acquire, release, and full really seem to be the important ones.
dd_acquire_read is nice, but probably nonsensical. Hardware tends to
give you data dependent ordering for free, but the compiler doesn't
preserve it.
Hans
> -----Original Message-----
> From: Christoph Lameter
> [mailto:[email protected]] On Behalf Of
> Christoph Lameter
> Sent: Wednesday, March 29, 2006 2:18 PM
> To: Boehm, Hans
> Cc: Grundler, Grant G; Chen, Kenneth W; Nick Piggin; Zoltan
> Menyhart; [email protected]; [email protected];
> [email protected]
> Subject: RE: Fix unlock_buffer() to work the same way as bit_unlock()
>
> On Wed, 29 Mar 2006, Boehm, Hans wrote:
>
> > Somewhat improved slides for the talk Grant is referring to are at
> >
> http://www.hpl.hp.com/personal/Hans_Boehm/misc_slides/pldi05_threads.p
> > df
>
> Hmm.. A paper on that subject would be better. Cannot get
> much from the slides.
>
> > It's hard to get this stuff right. But we knew that.
>
> Could you come up with a proposal for how to correctly define an
> API for bit ops in such a way that they work for all
> architectures and allow us to utilize all the features that
> processors may have?
>
Hmmm... Maybe we therefore need to add a mode to each bit operation in
the kernel?
With that we can also get rid of the __* version of bitops.
Possible modes are
NON_ATOMIC Do not perform any atomic ops at all.
ATOMIC Atomic but unordered
ACQUIRE Atomic with acquire semantics (or lock semantics)
RELEASE Atomic with release semantics (or unlock semantics)
FENCE Atomic with full fence.
This would require another bitops overhaul.
Maybe we can preserve the existing code with bitops like __* mapped to
*(..., NON_ATOMIC) and * mapped to *(..., FENCE), and then gradually fix the
rest of the kernel.
Nick?
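A sketch of that compatibility mapping (not part of the patch below; the
*_mode names are made up here, since the patch itself reuses the original
names with an extra argument):

/* Transitional wrappers, sketch only: keep old callers compiling while
 * the tree is converted to pass an explicit mode.
 */
#define __set_bit(nr, addr)	set_bit_mode(nr, addr, MODE_NON_ATOMIC)
#define __clear_bit(nr, addr)	clear_bit_mode(nr, addr, MODE_NON_ATOMIC)
#define set_bit(nr, addr)	set_bit_mode(nr, addr, MODE_FENCE)
#define clear_bit(nr, addr)	clear_bit_mode(nr, addr, MODE_FENCE)

Callers that have been audited would then switch to the explicit
MODE_ACQUIRE / MODE_RELEASE forms one by one.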
Index: linux-2.6/include/asm-ia64/bitops.h
===================================================================
--- linux-2.6.orig/include/asm-ia64/bitops.h 2006-03-27 09:45:11.000000000 -0800
+++ linux-2.6/include/asm-ia64/bitops.h 2006-03-29 15:27:54.000000000 -0800
@@ -14,6 +14,30 @@
#include <asm/bitops.h>
#include <asm/intrinsics.h>
+#define MODE_NON_ATOMIC 0
+#define MODE_ATOMIC 1
+#define MODE_ACQUIRE 2
+#define MODE_RELEASE 3
+#define MODE_FENCE 4
+
+__u32 cmpxchg_mode(volatile __u32 *m, __u32 old, __u32 new, int mode)
+{
+ switch (mode) {
+ case MODE_NONE :
+ case MODE_ACQUIRE :
+ return cmpxchg_acq(m, old, new);
+
+ case MODE_FENCE :
+ smp_mb();
+ /* Fall through */
+
+ case MODE_RELEASE :
+ return cmpxchg_rel(m, old, new);
+
+ }
+}
+
+
/**
* set_bit - Atomically set a bit in memory
* @nr: the bit to set
@@ -32,7 +56,7 @@
* bit 0 is the LSB of addr; bit 32 is the LSB of (addr+1).
*/
static __inline__ void
-set_bit (int nr, volatile void *addr)
+set_bit (int nr, volatile void *addr, int mode)
{
__u32 bit, old, new;
volatile __u32 *m;
@@ -40,35 +64,20 @@ set_bit (int nr, volatile void *addr)
m = (volatile __u32 *) addr + (nr >> 5);
bit = 1 << (nr & 31);
+
+ if (mode == MODE_NON_ATOMIC) {
+ *m |= bit;
+ return;
+ }
+
do {
CMPXCHG_BUGCHECK(m);
old = *m;
new = old | bit;
- } while (cmpxchg_acq(m, old, new) != old);
+ } while (cmpxchg_mode(m, old, new, mode) != old);
}
/**
- * __set_bit - Set a bit in memory
- * @nr: the bit to set
- * @addr: the address to start counting from
- *
- * Unlike set_bit(), this function is non-atomic and may be reordered.
- * If it's called on the same region of memory simultaneously, the effect
- * may be that only one operation succeeds.
- */
-static __inline__ void
-__set_bit (int nr, volatile void *addr)
-{
- *((__u32 *) addr + (nr >> 5)) |= (1 << (nr & 31));
-}
-
-/*
- * clear_bit() has "acquire" semantics.
- */
-#define smp_mb__before_clear_bit() smp_mb()
-#define smp_mb__after_clear_bit() do { /* skip */; } while (0)
-
-/**
* clear_bit - Clears a bit in memory
* @nr: Bit to clear
* @addr: Address to start counting from
@@ -79,7 +88,7 @@ __set_bit (int nr, volatile void *addr)
* in order to ensure changes are visible on other processors.
*/
static __inline__ void
-clear_bit (int nr, volatile void *addr)
+clear_bit (int nr, volatile void *addr, int mode)
{
__u32 mask, old, new;
volatile __u32 *m;
@@ -87,22 +96,17 @@ clear_bit (int nr, volatile void *addr)
m = (volatile __u32 *) addr + (nr >> 5);
mask = ~(1 << (nr & 31));
+
+ if (mode == MODE_NON_ATOMIC) {
+ *m &= mask; /* mask already excludes the bit */
+ return;
+ }
+
do {
CMPXCHG_BUGCHECK(m);
old = *m;
new = old & mask;
- } while (cmpxchg_acq(m, old, new) != old);
-}
-
-/**
- * __clear_bit - Clears a bit in memory (non-atomic version)
- */
-static __inline__ void
-__clear_bit (int nr, volatile void *addr)
-{
- volatile __u32 *p = (__u32 *) addr + (nr >> 5);
- __u32 m = 1 << (nr & 31);
- *p &= ~m;
+ } while (cmpxchg_mode(m, old, new, mode) != old);
}
/**
@@ -115,7 +119,7 @@ __clear_bit (int nr, volatile void *addr
* restricted to acting on a single-word quantity.
*/
static __inline__ void
-change_bit (int nr, volatile void *addr)
+change_bit (int nr, volatile void *addr, int mode)
{
__u32 bit, old, new;
volatile __u32 *m;
@@ -123,6 +127,12 @@ change_bit (int nr, volatile void *addr)
m = (volatile __u32 *) addr + (nr >> 5);
bit = (1 << (nr & 31));
+
+ if (mode == MODE_NON_ATOMIC) {
+ *m ^= bit;
+ return;
+ }
+
do {
CMPXCHG_BUGCHECK(m);
old = *m;
@@ -131,21 +141,6 @@ change_bit (int nr, volatile void *addr)
}
/**
- * __change_bit - Toggle a bit in memory
- * @nr: the bit to set
- * @addr: the address to start counting from
- *
- * Unlike change_bit(), this function is non-atomic and may be reordered.
- * If it's called on the same region of memory simultaneously, the effect
- * may be that only one operation succeeds.
- */
-static __inline__ void
-__change_bit (int nr, volatile void *addr)
-{
- *((__u32 *) addr + (nr >> 5)) ^= (1 << (nr & 31));
-}
-
-/**
* test_and_set_bit - Set a bit and return its old value
* @nr: Bit to set
* @addr: Address to count from
@@ -154,7 +149,7 @@ __change_bit (int nr, volatile void *add
* It also implies a memory barrier.
*/
static __inline__ int
-test_and_set_bit (int nr, volatile void *addr)
+test_and_set_bit (int nr, volatile void *addr, int mode)
{
__u32 bit, old, new;
volatile __u32 *m;
@@ -162,6 +157,13 @@ test_and_set_bit (int nr, volatile void
m = (volatile __u32 *) addr + (nr >> 5);
bit = 1 << (nr & 31);
+
+ if (mode == MODE_NON_ATOMIC) {
+ int oldbitset = *m & bit;
+ *m |= bit;
+ return oldbitset;
+ }
+
do {
CMPXCHG_BUGCHECK(m);
old = *m;
@@ -171,26 +173,6 @@ test_and_set_bit (int nr, volatile void
}
/**
- * __test_and_set_bit - Set a bit and return its old value
- * @nr: Bit to set
- * @addr: Address to count from
- *
- * This operation is non-atomic and can be reordered.
- * If two examples of this operation race, one can appear to succeed
- * but actually fail. You must protect multiple accesses with a lock.
- */
-static __inline__ int
-__test_and_set_bit (int nr, volatile void *addr)
-{
- __u32 *p = (__u32 *) addr + (nr >> 5);
- __u32 m = 1 << (nr & 31);
- int oldbitset = (*p & m) != 0;
-
- *p |= m;
- return oldbitset;
-}
-
-/**
* test_and_clear_bit - Clear a bit and return its old value
* @nr: Bit to set
* @addr: Address to count from
@@ -199,7 +181,7 @@ __test_and_set_bit (int nr, volatile voi
* It also implies a memory barrier.
*/
static __inline__ int
-test_and_clear_bit (int nr, volatile void *addr)
+test_and_clear_bit (int nr, volatile void *addr, int mode)
{
__u32 mask, old, new;
volatile __u32 *m;
@@ -207,6 +189,13 @@ test_and_clear_bit (int nr, volatile voi
m = (volatile __u32 *) addr + (nr >> 5);
mask = ~(1 << (nr & 31));
+
+ if (mode == MODE_NON_ATOMIC) {
+ int oldbitset = *m & ~mask; /* ~mask is the bit itself */
+ *m &= mask;
+ return oldbitset;
+ }
+
do {
CMPXCHG_BUGCHECK(m);
old = *m;
@@ -216,26 +205,6 @@ test_and_clear_bit (int nr, volatile voi
}
/**
- * __test_and_clear_bit - Clear a bit and return its old value
- * @nr: Bit to set
- * @addr: Address to count from
- *
- * This operation is non-atomic and can be reordered.
- * If two examples of this operation race, one can appear to succeed
- * but actually fail. You must protect multiple accesses with a lock.
- */
-static __inline__ int
-__test_and_clear_bit(int nr, volatile void * addr)
-{
- __u32 *p = (__u32 *) addr + (nr >> 5);
- __u32 m = 1 << (nr & 31);
- int oldbitset = *p & m;
-
- *p &= ~m;
- return oldbitset;
-}
-
-/**
* test_and_change_bit - Change a bit and return its old value
* @nr: Bit to set
* @addr: Address to count from
@@ -244,7 +213,7 @@ __test_and_clear_bit(int nr, volatile vo
* It also implies a memory barrier.
*/
static __inline__ int
-test_and_change_bit (int nr, volatile void *addr)
+test_and_change_bit (int nr, volatile void *addr, int mode)
{
__u32 bit, old, new;
volatile __u32 *m;
@@ -252,6 +221,13 @@ test_and_change_bit (int nr, volatile vo
m = (volatile __u32 *) addr + (nr >> 5);
bit = (1 << (nr & 31));
+
+ if (mode == MODE_NON_ATOMIC) {
+ old = *m;
+ *m = old ^ bit;
+ return (old & bit) != 0;
+ }
+
do {
CMPXCHG_BUGCHECK(m);
old = *m;
@@ -260,20 +236,6 @@ test_and_change_bit (int nr, volatile vo
return (old & bit) != 0;
}
-/*
- * WARNING: non atomic version.
- */
-static __inline__ int
-__test_and_change_bit (int nr, void *addr)
-{
- __u32 old, bit = (1 << (nr & 31));
- __u32 *m = (__u32 *) addr + (nr >> 5);
-
- old = *m;
- *m = old ^ bit;
- return (old & bit) != 0;
-}
-
static __inline__ int
test_bit (int nr, const volatile void *addr)
{
Christoph Lameter wrote on Wednesday, March 29, 2006 3:33 PM
> Hmmm... Maybe we therefore need to add a mode to each bit operation in
> the kernel?
>
> With that we can also get rid of the __* version of bitops.
>
> Possible modes are
>
> NON_ATOMIC Do not perform any atomic ops at all.
>
> ATOMIC Atomic but unordered
>
> ACQUIRE Atomic with acquire semantics (or lock semantics)
>
> RELEASE Atomic with release semantics (or unlock semantics)
>
> FENCE Atomic with full fence.
>
> This would require another bitops overhaul.
>
> Maybe we can preserve the existing code with bitops like __* mapped to
> *(..., NON_ATOMIC) and * mapped to *(..., FENCE) and the gradually fix the
> rest of the kernel.
Is gcc smart enough to take a constant argument and collapse an inline of an
inline function? I hope it does.
Lots of other comments on the actual code, but I will defer those until
some consensus is reached on the API. This would be nice to have.
- Ken
On Wed, Mar 29, 2006 at 03:49:08PM -0800, Chen, Kenneth W wrote:
> Is gcc smart enough to turn constant argument and collapse inline of
> inline function? I hope it does.
Yes, I'm pretty sure at -O2 (normal kernel usage) it is.
grant
On Wed, 29 Mar 2006, Chen, Kenneth W wrote:
> Is gcc smart enough to turn constant argument and collapse inline of
> inline function? I hope it does.
Sure we use that all the time. Look at the cmpxchg definition in
asm-ia64/system.h
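For illustration, the kind of collapse being relied on; this is a generic
user-space sketch using a gcc builtin, not kernel code:

/* With a constant 'atomic' argument, gcc at -O2 folds the branch away
 * and only the selected arm survives in the generated code.
 */
static inline void demo_set_bit(volatile unsigned int *m,
				unsigned int bit, int atomic)
{
	unsigned int old;

	if (!atomic) {			/* cf. MODE_NON_ATOMIC */
		*m |= bit;
		return;
	}
	do {				/* cf. the cmpxchg_acq loop */
		old = *m;
	} while (__sync_val_compare_and_swap(m, old, old | bit) != old);
}

/* demo_set_bit(p, 0x4, 0) compiles to a plain load/or/store;
 * demo_set_bit(p, 0x4, 1) compiles to just the compare-and-swap loop.
 */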
Chen, Kenneth W wrote:
>Nick Piggin wrote on Tuesday, March 28, 2006 6:36 PM
>
>>Hmm, not sure. Maybe a few new bitops with _lock / _unlock postfixes?
>>For page lock and buffer lock we'd just need test_and_set_bit_lock,
>>clear_bit_unlock, smp_mb__after_clear_bit_unlock.
>>
>>I don't know, _for_lock might be a better name. But it's getting long.
>>
>
>I think kernel needs all 4 variants:
>
>clear_bit
>clear_bit_lock
>clear_bit_unlock
>clear_bit_fence
>
>And the variant need to permutated on all other bit ops ... I think it
>would be indeed a better API and be more explicit about the ordering.
>
>
We could just introduce them as required, though? clear_bit_fence shouldn't
be required for ia64 any longer, if you change bitops to be full barriers,
right?
And for now, let's just not let people open critical sections unless doing
a test_and_set, nor close them unless doing a plain clear?
It seems that the memory ordering model is almost too much for people
to cope with already, so would it be reasonable to require some performance
justification before adding more?
Chen, Kenneth W wrote:
>Nick Piggin wrote on Tuesday, March 28, 2006 12:11 AM
>
>>OK, that's fair enough and I guess you do need a barrier there.
>>However, should the mb__after barrier still remain? The comment
>>in wake_up_bit suggests yes, and there is similar code in
>>unlock_page.
>>
>
>Question on unlock_page:
>
>void fastcall unlock_page(struct page *page)
>{
> smp_mb__before_clear_bit();
> if (!TestClearPageLocked(page))
> BUG();
> smp_mb__after_clear_bit();
> wake_up_page(page, PG_locked);
>}
>
>Assuming test_and_clear_bit() on all arch does what the API is
>called for with full memory fence around the atomic op, why do
>you need smp_mb__before_clear_bit and smp_mb__after_clear_bit?
>Aren't they redundant?
>
>
Yep. I pointed this out earlier.
I'd say it may have initially just been a ClearPageLocked, and
was changed for debugging reasons.
We could instead change it to
BUG_ON(!PageLocked(page));
ClearPageLocked(page); /* this does clear_bit_for_unlock */
smp_mb__after_clear_bit_unlock();
wake_up_page
Zoltan Menyhart wrote:
>
> 4. Bit-lock operations:
>
> I summarized the ordering requirements here:
> http://marc.theaimsgroup.com/?l=linux-ia64&m=114362989421046&w=2
>
> In order to let the architectures implement these bit-lock
> operations efficiently, the usage has to indicate the _minimal_
> required ordering semantics, e.g.:
>
> test_and_set_bit_N_acquire()
> or ordered_test_and_set_bit(acquire, ...)
> release_N_clear_bit()
> etc.
>
The problem is simply that we don't expose acquire or release
ordering operations to architecture-independent kernel code (outside of
locking, which is a great wrapper). The reason is to avoid proliferation
of all these semantics.
If you do this then the powerpc guys will say they want all
their weird crap in there too. If you remove separate read
and write barriers, then x86 and sparc64 folks will get upset
etc etc.
Changing semantics would probably require some fairly hefty
discussions. Can you first fix ia64, then (perhaps) introduce
lock semantics for the couple of bitops that can use it, then
can we see some performance justification for proposed
changes to the API?
Zoltan Menyhart wrote:
> Christoph Lameter wrote:
>
>> [...]
>> void fastcall unlock_buffer(struct buffer_head *bh)
>> {
>> + smp_mb__before_clear_bit();
>> clear_buffer_locked(bh);
>> - smp_mb__after_clear_bit();
>> wake_up_bit(&bh->b_state, BH_Lock);
>> }
>
>
> The sequence:
>
> smp_mb__before_clear_bit();
> clear_buffer_locked(bh);
>
> is correct (yet not efficient for all architectures).
>
Yes, this is explicitly documented in the wake_up_bit interface. I
don't really think it needs to be changed, does it? Bill did most
of this work I think, so I've cc'ed him.
> We have got here two unrelated operations: ending a critical section
> and waking up the eventual waiters. What we need is a barrier between
> these two unrelated operations.
> It is not "smp_mb__after_clear_bit()" but a simple "smp_mb()".
> The correct code is:
>
> void fastcall unlock_buffer(struct buffer_head *bh)
> {
> smp_mb__before_clear_bit();
> clear_buffer_locked(bh);
> smp_mb();
> wake_up_bit(&bh->b_state, BH_Lock);
> }
>
If you were going to do that, I'd prefer my suggestion.
clear_buffer_locked(); /* clear_bit_unlock */
smp_mb__after_clear_bit_unlock();
wake_up_bit()
Nick
Christoph Lameter wrote:
> Hmmm... Maybe we therefore need to add a mode to each bit operation in
> the kernel?
>
> With that we can also get rid of the __* version of bitops.
>
> Possible modes are
>
> NON_ATOMIC Do not perform any atomic ops at all.
>
> ATOMIC Atomic but unordered
>
> ACQUIRE Atomic with acquire semantics (or lock semantics)
>
> RELEASE Atomic with release semantics (or unlock semantics)
>
> FENCE Atomic with full fence.
>
> This would require another bitops overhaul.
>
> Maybe we can preserve the existing code with bitops like __* mapped to
> *(..., NON_ATOMIC) and * mapped to *(..., FENCE) and the gradually fix the
> rest of the kernel.
From a semantic point of view, the forms:
bit_foo(..., mode)
and
bit_foo_mode(...)
are equivalent.
However, I do not think your implementation would be efficient due to
selecting the ordering mode at run time:
> + switch (mode) {
> + case MODE_NONE :
> + case MODE_ACQUIRE :
> + return cmpxchg_acq(m, old, new);
> + case MODE_FENCE :
> + smp_mb();
> + /* Fall through */
> + case MODE_RELEASE :
> + return cmpxchg_rel(m, old, new);
> + if (mode == MODE_NON_ATOMIC) {
> + *m |= bit;
> + return;
> + }
etc.
In addition, we may want to inline these primitives...
A compile-time selection of the appropriate code sequence would help.
Thanks,
Zoltan
On Thu, 30 Mar 2006, Zoltan Menyhart wrote:
> From a semantic point of view, the forms:
>
> bit_foo(..., mode)
> and
> bit_foo_mode(...)
>
> are equivalent.
Correct but the above form leads to less macro definitions.
> However, I do not think your implementation would be efficient due to
> selecting the ordering mode at run time:
The compiler will select that at compile time. One has the option of also
generating run time selection by specifying a variable instead of a
constant when calling these functions.
> In addition, we may want to inline these primitives...
Of course.
> A compile-time selection of the appropriate code sequence would help.
They are compile time selected.
> From: Christoph Lameter
>
> On Thu, 30 Mar 2006, Zoltan Menyhart wrote:
>
> > From a semantic point of view, the forms:
> >
> > bit_foo(..., mode)
> > and
> > bit_foo_mode(...)
> >
> > are equivalent.
>
> Correct but the above form leads to less macro definitions.
>
> > However, I do not think your implementation would be
> efficient due to
> > selecting the ordering mode at run time:
>
> The compiler will select that at compile time. One has the
> option of also generating run time seletion by specifying a
> variable instead of a constant when callig these functions.
I would view the latter as a disadvantage, since I can't think of a case
in which you wouldn't want it reported as an error instead, at least if
you care about performance. If you know of one, I'd be very interested.
The first form does have the advantage that it's possible to build up
more complicated primitives from simpler ones without repeating the
definition four times.
I'm not sure there's a clear winner here.
In the C++ case, I currently expect we will go with template arguments,
which are guaranteed to be static, but are an option you don't have ...
Hans
On Thu, 30 Mar 2006, Boehm, Hans wrote:
> > The compiler will select that at compile time. One has the
> > option of also generating run time seletion by specifying a
> > variable instead of a constant when callig these functions.
> I would view the latter as a disadvantage, since I can't think of a case
> in which you wouldn't want it reported as an error instead, at least if
> you care about performance. If you know of one, I'd be very interested.
In that case: We could check that a constant is passed at compile time.
> The first form does have the advantage that it's possible to build up
> more complicated primitives from simpler ones without repeating the
> definition four times.
What is the first form? The advantage of passing a parameter is more
compact code and fewer definitions.
Zoltan Menyhart wrote:
> However, I do not think your implementation would be efficient due to
> selecting the ordering mode at run time:
>
>> + switch (mode) {
>> + case MODE_NONE :
>> + case MODE_ACQUIRE :
>> + return cmpxchg_acq(m, old, new);
>> + case MODE_FENCE :
>> + smp_mb();
>> + /* Fall through */
>> + case MODE_RELEASE :
>> + return cmpxchg_rel(m, old, new);
>
BTW. Isn't MODE_FENCE wrong? Seems like a read or write could be moved
above cmpxchg_rel?
I think you need rel+acq rather than acq+rel (if I'm right, then the
same goes for your earlier bitops patches, btw).
--
SUSE Labs, Novell Inc.
On Thu, 30 Mar 2006, Nick Piggin wrote:
> Zoltan Menyhart wrote:
>
> > However, I do not think your implementation would be efficient due to
> > selecting the ordering mode at run time:
> >
> > > + switch (mode) {
> > > + case MODE_NONE :
> > > + case MODE_ACQUIRE :
> > > + return cmpxchg_acq(m, old, new);
> > > + case MODE_FENCE :
> > > + smp_mb();
> > > + /* Fall through */
> > > + case MODE_RELEASE :
> > > + return cmpxchg_rel(m, old, new);
> >
>
> BTW. Isn't MODE_FENCE wrong? Seems like a read or write could be moved
> above cmpxchg_rel?
Hmmm.... We should call this MODE_BARRIER I guess...
> I think you need rel+acq rather than acq+rel (if I'm right, then the
> same goes for your earlier bitops patches, btw).
The question is what does fence imply for the order of the atomic
operation. Will the operation be performed before or after the barrier?
Or do we need barriers on both sides?
If so then we may simulate that with a release and then do an acquire load
afterwards:
cmpxchg_rel(m, old, new);
return *m; /* m volatile so it has acquire semantics */
> From: Christoph Lameter
> On Thu, 30 Mar 2006, Nick Piggin wrote:
>
> > Zoltan Menyhart wrote:
> >
> > > However, I do not think your implementation would be
> efficient due
> > > to selecting the ordering mode at run time:
> > >
> > > > + switch (mode) {
> > > > + case MODE_NONE :
> > > > + case MODE_ACQUIRE :
> > > > + return cmpxchg_acq(m, old, new);
> > > > + case MODE_FENCE :
> > > > + smp_mb();
> > > > + /* Fall through */
> > > > + case MODE_RELEASE :
> > > > + return cmpxchg_rel(m, old, new);
> > >
> >
> > BTW. Isn't MODE_FENCE wrong? Seems like a read or write
> could be moved
> > above cmpxchg_rel?
>
> Hmmm.... We should call this MODE_BARRIER I guess...
>
I arrived at the conclusion that "fence" is a better term, at least in
user-level code. "Barrier" seems to generate confusion with
"pthread_barrier_wait" and similar constructs, which are a different
kind of beast.
Hans