2022-08-22 10:06:08

by Mikulas Patocka

Subject: [PATCH] wait_on_bit: add an acquire memory barrier

Hi

I'd like to ask what you think about this patch. Do you want to commit
it - or do you think that the barrier should be added to the callers of
wait_on_bit?

Mikulas



From: Mikulas Patocka <[email protected]>

There are several places in the kernel where wait_on_bit is not followed
by a memory barrier (for example, in drivers/md/dm-bufio.c:new_read). On
architectures with weak memory ordering, it may happen that memory
accesses that follow wait_on_bit are reordered before wait_on_bit and they
may return invalid data.

Fix this class of bugs by adding an acquire memory barrier to wait_on_bit,
wait_on_bit_io, wait_on_bit_timeout and wait_on_bit_action. The code that
uses these functions should clear the bit using the function
clear_bit_unlock.

Signed-off-by: Mikulas Patocka <[email protected]>
Cc: [email protected]

---
include/linux/wait_bit.h | 16 ++++++++++++----
kernel/sched/wait_bit.c | 2 ++
2 files changed, 14 insertions(+), 4 deletions(-)

Index: linux-2.6/include/linux/wait_bit.h
===================================================================
--- linux-2.6.orig/include/linux/wait_bit.h 2022-08-20 14:33:44.000000000 +0200
+++ linux-2.6/include/linux/wait_bit.h 2022-08-20 15:41:43.000000000 +0200
@@ -71,8 +71,10 @@ static inline int
wait_on_bit(unsigned long *word, int bit, unsigned mode)
{
might_sleep();
- if (!test_bit(bit, word))
+ if (!test_bit(bit, word)) {
+ smp_acquire__after_ctrl_dep(); /* should pair with clear_bit_unlock */
return 0;
+ }
return out_of_line_wait_on_bit(word, bit,
bit_wait,
mode);
@@ -96,8 +98,10 @@ static inline int
wait_on_bit_io(unsigned long *word, int bit, unsigned mode)
{
might_sleep();
- if (!test_bit(bit, word))
+ if (!test_bit(bit, word)) {
+ smp_acquire__after_ctrl_dep(); /* should pair with clear_bit_unlock */
return 0;
+ }
return out_of_line_wait_on_bit(word, bit,
bit_wait_io,
mode);
@@ -123,8 +127,10 @@ wait_on_bit_timeout(unsigned long *word,
unsigned long timeout)
{
might_sleep();
- if (!test_bit(bit, word))
+ if (!test_bit(bit, word)) {
+ smp_acquire__after_ctrl_dep(); /* should pair with clear_bit_unlock */
return 0;
+ }
return out_of_line_wait_on_bit_timeout(word, bit,
bit_wait_timeout,
mode, timeout);
@@ -151,8 +157,10 @@ wait_on_bit_action(unsigned long *word,
unsigned mode)
{
might_sleep();
- if (!test_bit(bit, word))
+ if (!test_bit(bit, word)) {
+ smp_acquire__after_ctrl_dep(); /* should pair with clear_bit_unlock */
return 0;
+ }
return out_of_line_wait_on_bit(word, bit, action, mode);
}

Index: linux-2.6/kernel/sched/wait_bit.c
===================================================================
--- linux-2.6.orig/kernel/sched/wait_bit.c 2022-08-20 14:33:44.000000000 +0200
+++ linux-2.6/kernel/sched/wait_bit.c 2022-08-20 15:41:39.000000000 +0200
@@ -49,6 +49,8 @@ __wait_on_bit(struct wait_queue_head *wq
ret = (*action)(&wbq_entry->key, mode);
} while (test_bit(wbq_entry->key.bit_nr, wbq_entry->key.flags) && !ret);

+ smp_acquire__after_ctrl_dep(); /* should pair with clear_bit_unlock */
+
finish_wait(wq_head, &wbq_entry->wq_entry);

return ret;


2022-08-22 17:13:55

by Linus Torvalds

Subject: Re: [PATCH] wait_on_bit: add an acquire memory barrier

On Mon, Aug 22, 2022 at 2:39 AM Mikulas Patocka <[email protected]> wrote:
>
> I'd like to ask what you think about this patch?

I really don't like it. It adds a pointless read barrier only because
you didn't want to do it properly.

On x86, it doesn't matter, since rmb is a no-op and only a scheduling
barrier (and not noticeable in this case anyway).

On other architectures, it might.

But on all architectures it's just ugly.

I suggested in an earlier thread that you just do it right with an
explicit smp_load_acquire() and a manual bit test.

So why don't we just create a "test_bit_acquire()" and be done with
it? We literally created clear_bit_unlock() for the opposite reason,
and your comments about the new barrier hack even point to it.

Why is "clear_bit_unlock()" worthy of a real helper, but
"test_bit_acquire()" is not and people who want it have to use this
horrendous hack?

Please stop adding random barriers already. Just do it right. I've
said this before, why do you then keep doing this and asking for
comments?

My reply will remain the same: JUST DO IT RIGHT.

Linus

2022-08-22 17:47:13

by Linus Torvalds

Subject: Re: [PATCH] wait_on_bit: add an acquire memory barrier

On Mon, Aug 22, 2022 at 10:08 AM Linus Torvalds
<[email protected]> wrote:
>
> So why don't we just create a "test_bit_acquire()" and be done with
> it? We literally created clear_bit_unlock() for the opposite reason,
> and your comments about the new barrier hack even point to it.

Here's a patch that is

(a) almost entirely untested (I checked that one single case builds
and seems to generate the expected code)

(b) needs some more loving

but seems to superficially work.

At a minimum this needs to be split into two (so the bitop and the
wait_on_bit parts split up), and that whole placement of
<asm/barrier.h> and generic_bit_test_acquire() need at least some
thinking about, but on the whole it seems reasonable.

For example, it would make more sense to have this in
<asm-generic/bitops/lock.h>, but not all architectures include that,
and some do their own version. I didn't want to mess with
architecture-specific headers, so this illogically just uses
generic-non-atomic.h.

Maybe just put it in <linux/bitops.h> directly?

So I'm not at all claiming that this is a great patch. It definitely
needs more work, and a lot more testing.

But I think this is at least the right _direction_ to take here.

And yes, I think it also would have been better if
"clear_bit_unlock()" would have been called "clear_bit_release()", and
we'd have more consistent naming with our ordered atomics. But it's
probably not worth changing.

Linus


Attachments:
patch.diff (4.10 kB)

2022-08-22 18:09:41

by Joel Fernandes

Subject: Re: [PATCH] wait_on_bit: add an acquire memory barrier

On Mon, Aug 22, 2022 at 1:39 PM Linus Torvalds
<[email protected]> wrote:
>
> On Mon, Aug 22, 2022 at 10:08 AM Linus Torvalds
> <[email protected]> wrote:
> >
> > So why don't we just create a "test_bit_acquire()" and be done with
> > it? We literally created clear_bit_unlock() for the opposite reason,
> > and your comments about the new barrier hack even point to it.
>
> Here's a patch that is
>
> (a) almost entirely untested (I checked that one single case builds
> and seems to generate the expected code)
>
> (b) needs some more loving
>
> but seems to superficially work.
>
> At a minimum this needs to be split into two (so the bitop and the
> wait_on_bit parts split up), and that whole placement of
> <asm/barrier.h> and generic_bit_test_acquire() need at least some
> thinking about, but on the whole it seems reasonable.
>
> For example, it would make more sense to have this in
> <asm-generic/bitops/lock.h>, but not all architectures include that,
> and some do their own version. I didn't want to mess with
> architecture-specific headers, so this illogically just uses
> generic-non-atomic.h.
>
> Maybe just put it in <linux/bitops.h> directly?
>
> So I'm not at all claiming that this is a great patch. It definitely
> needs more work, and a lot more testing.
>
> But I think this is at least the right _direction_ to take here.
>
> And yes, I think it also would have been better if
> "clear_bit_unlock()" would have been called "clear_bit_release()", and
> we'd have more consistent naming with our ordered atomics. But it's
> probably not worth changing.

Also, as a suggestion to Mikulas or whoever works on this - update the
ORDERING section of Documentation/atomic_bitops.txt too?

Thanks,

- Joel

2022-08-25 21:45:37

by Mikulas Patocka

Subject: Re: [PATCH] wait_on_bit: add an acquire memory barrier



On Mon, 22 Aug 2022, Linus Torvalds wrote:

> On Mon, Aug 22, 2022 at 10:08 AM Linus Torvalds
> <[email protected]> wrote:
> >
> > So why don't we just create a "test_bit_acquire()" and be done with
> > it? We literally created clear_bit_unlock() for the opposite reason,
> > and your comments about the new barrier hack even point to it.
>
> Here's a patch that is
>
> (a) almost entirely untested (I checked that one single case builds
> and seems to generate the expected code)
>
> (b) needs some more loving
>
> but seems to superficially work.
>
> At a minimum this needs to be split into two (so the bitop and the
> wait_on_bit parts split up), and that whole placement of
> <asm/barrier.h> and generic_bit_test_acquire() need at least some
> thinking about, but on the whole it seems reasonable.
>
> For example, it would make more sense to have this in
> <asm-generic/bitops/lock.h>, but not all architectures include that,
> and some do their own version. I didn't want to mess with
> architecture-specific headers, so this illogically just uses
> generic-non-atomic.h.
>
> Maybe just put it in <linux/bitops.h> directly?
>
> So I'm not at all claiming that this is a great patch. It definitely
> needs more work, and a lot more testing.
>
> But I think this is at least the right _direction_ to take here.
>
> And yes, I think it also would have been better if
> "clear_bit_unlock()" would have been called "clear_bit_release()", and
> we'd have more consistent naming with our ordered atomics. But it's
> probably not worth changing.
>
> Linus

Hi

Here I reworked your patch, so that test_bit_acquire is defined just like
test_bit. There's some code duplication (in
include/asm-generic/bitops/generic-non-atomic.h and in
arch/x86/include/asm/bitops.h), but that duplication exists in the
test_bit function too.

I tested it on x86-64 and arm64. On x86-64 it generates the "bt"
instruction for variable-bit test and "shr; and $1" for constant bit test.
On arm64 it generates the "ldar" instruction for both constant and
variable bit test.

For me, the kernel 6.0-rc2 doesn't boot in an arm64 virtual machine at all
(with or without this patch), so I only compile-tested it on arm64. I have
to bisect it.

Mikulas



From: Mikulas Patocka <[email protected]>

There are several places in the kernel where wait_on_bit is not followed
by a memory barrier (for example, in drivers/md/dm-bufio.c:new_read). On
architectures with weak memory ordering, it may happen that memory
accesses that follow wait_on_bit are reordered before wait_on_bit and they
may return invalid data.

Fix this class of bugs by introducing a new function "test_bit_acquire"
that works like test_bit, but has acquire memory ordering semantics.

Signed-off-by: Mikulas Patocka <[email protected]>
Cc: [email protected]

arch/x86/include/asm/bitops.h | 13 +++++++++++++
include/asm-generic/bitops/generic-non-atomic.h | 14 ++++++++++++++
include/asm-generic/bitops/instrumented-non-atomic.h | 12 ++++++++++++
include/asm-generic/bitops/non-atomic.h | 1 +
include/asm-generic/bitops/non-instrumented-non-atomic.h | 1 +
include/linux/bitops.h | 1 +
include/linux/buffer_head.h | 2 +-
include/linux/wait_bit.h | 8 ++++----
kernel/sched/wait_bit.c | 2 +-
9 files changed, 48 insertions(+), 6 deletions(-)

Index: linux-2.6/include/asm-generic/bitops/generic-non-atomic.h
===================================================================
--- linux-2.6.orig/include/asm-generic/bitops/generic-non-atomic.h
+++ linux-2.6/include/asm-generic/bitops/generic-non-atomic.h
@@ -4,6 +4,7 @@
#define __ASM_GENERIC_BITOPS_GENERIC_NON_ATOMIC_H

#include <linux/bits.h>
+#include <asm/barrier.h>

#ifndef _LINUX_BITOPS_H
#error only <linux/bitops.h> can be included directly
@@ -127,6 +128,18 @@ generic_test_bit(unsigned long nr, const
return 1UL & (addr[BIT_WORD(nr)] >> (nr & (BITS_PER_LONG-1)));
}

+/**
+ * generic_test_bit - Determine whether a bit is set with acquire semantics
+ * @nr: bit number to test
+ * @addr: Address to start counting from
+ */
+static __always_inline bool
+generic_test_bit_acquire(unsigned long nr, const volatile unsigned long *addr)
+{
+ unsigned long *p = ((unsigned long *)addr) + BIT_WORD(nr);
+ return 1UL & (smp_load_acquire(p) >> (nr & (BITS_PER_LONG-1)));
+}
+
/*
* const_*() definitions provide good compile-time optimizations when
* the passed arguments can be resolved at compile time.
@@ -137,6 +150,7 @@ generic_test_bit(unsigned long nr, const
#define const___test_and_set_bit generic___test_and_set_bit
#define const___test_and_clear_bit generic___test_and_clear_bit
#define const___test_and_change_bit generic___test_and_change_bit
+#define const_test_bit_acquire generic_test_bit_acquire

/**
* const_test_bit - Determine whether a bit is set
Index: linux-2.6/include/asm-generic/bitops/non-atomic.h
===================================================================
--- linux-2.6.orig/include/asm-generic/bitops/non-atomic.h
+++ linux-2.6/include/asm-generic/bitops/non-atomic.h
@@ -13,6 +13,7 @@
#define arch___test_and_change_bit generic___test_and_change_bit

#define arch_test_bit generic_test_bit
+#define arch_test_bit_acquire generic_test_bit_acquire

#include <asm-generic/bitops/non-instrumented-non-atomic.h>

Index: linux-2.6/include/linux/bitops.h
===================================================================
--- linux-2.6.orig/include/linux/bitops.h
+++ linux-2.6/include/linux/bitops.h
@@ -59,6 +59,7 @@ extern unsigned long __sw_hweight64(__u6
#define __test_and_clear_bit(nr, addr) bitop(___test_and_clear_bit, nr, addr)
#define __test_and_change_bit(nr, addr) bitop(___test_and_change_bit, nr, addr)
#define test_bit(nr, addr) bitop(_test_bit, nr, addr)
+#define test_bit_acquire(nr, addr) bitop(_test_bit_acquire, nr, addr)

/*
* Include this here because some architectures need generic_ffs/fls in
Index: linux-2.6/include/linux/wait_bit.h
===================================================================
--- linux-2.6.orig/include/linux/wait_bit.h
+++ linux-2.6/include/linux/wait_bit.h
@@ -71,7 +71,7 @@ static inline int
wait_on_bit(unsigned long *word, int bit, unsigned mode)
{
might_sleep();
- if (!test_bit(bit, word))
+ if (!test_bit_acquire(bit, word))
return 0;
return out_of_line_wait_on_bit(word, bit,
bit_wait,
@@ -96,7 +96,7 @@ static inline int
wait_on_bit_io(unsigned long *word, int bit, unsigned mode)
{
might_sleep();
- if (!test_bit(bit, word))
+ if (!test_bit_acquire(bit, word))
return 0;
return out_of_line_wait_on_bit(word, bit,
bit_wait_io,
@@ -123,7 +123,7 @@ wait_on_bit_timeout(unsigned long *word,
unsigned long timeout)
{
might_sleep();
- if (!test_bit(bit, word))
+ if (!test_bit_acquire(bit, word))
return 0;
return out_of_line_wait_on_bit_timeout(word, bit,
bit_wait_timeout,
@@ -151,7 +151,7 @@ wait_on_bit_action(unsigned long *word,
unsigned mode)
{
might_sleep();
- if (!test_bit(bit, word))
+ if (!test_bit_acquire(bit, word))
return 0;
return out_of_line_wait_on_bit(word, bit, action, mode);
}
Index: linux-2.6/kernel/sched/wait_bit.c
===================================================================
--- linux-2.6.orig/kernel/sched/wait_bit.c
+++ linux-2.6/kernel/sched/wait_bit.c
@@ -47,7 +47,7 @@ __wait_on_bit(struct wait_queue_head *wq
prepare_to_wait(wq_head, &wbq_entry->wq_entry, mode);
if (test_bit(wbq_entry->key.bit_nr, wbq_entry->key.flags))
ret = (*action)(&wbq_entry->key, mode);
- } while (test_bit(wbq_entry->key.bit_nr, wbq_entry->key.flags) && !ret);
+ } while (test_bit_acquire(wbq_entry->key.bit_nr, wbq_entry->key.flags) && !ret);

finish_wait(wq_head, &wbq_entry->wq_entry);

Index: linux-2.6/include/asm-generic/bitops/non-instrumented-non-atomic.h
===================================================================
--- linux-2.6.orig/include/asm-generic/bitops/non-instrumented-non-atomic.h
+++ linux-2.6/include/asm-generic/bitops/non-instrumented-non-atomic.h
@@ -12,5 +12,6 @@
#define ___test_and_change_bit arch___test_and_change_bit

#define _test_bit arch_test_bit
+#define _test_bit_acquire arch_test_bit_acquire

#endif /* __ASM_GENERIC_BITOPS_NON_INSTRUMENTED_NON_ATOMIC_H */
Index: linux-2.6/arch/x86/include/asm/bitops.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/bitops.h
+++ linux-2.6/arch/x86/include/asm/bitops.h
@@ -207,6 +207,12 @@ static __always_inline bool constant_tes
(addr[nr >> _BITOPS_LONG_SHIFT])) != 0;
}

+static __always_inline bool constant_test_bit_acquire(long nr, const volatile unsigned long *addr)
+{
+ unsigned long *p = ((unsigned long *)addr) + BIT_WORD(nr);
+ return 1UL & (smp_load_acquire(p) >> (nr & (BITS_PER_LONG-1)));
+}
+
static __always_inline bool variable_test_bit(long nr, volatile const unsigned long *addr)
{
bool oldbit;
@@ -226,6 +232,13 @@ arch_test_bit(unsigned long nr, const vo
variable_test_bit(nr, addr);
}

+static __always_inline bool
+arch_test_bit_acquire(unsigned long nr, const volatile unsigned long *addr)
+{
+ return __builtin_constant_p(nr) ? constant_test_bit_acquire(nr, addr) :
+ variable_test_bit(nr, addr);
+}
+
/**
* __ffs - find first set bit in word
* @word: The word to search
Index: linux-2.6/include/asm-generic/bitops/instrumented-non-atomic.h
===================================================================
--- linux-2.6.orig/include/asm-generic/bitops/instrumented-non-atomic.h
+++ linux-2.6/include/asm-generic/bitops/instrumented-non-atomic.h
@@ -142,4 +142,16 @@ _test_bit(unsigned long nr, const volati
return arch_test_bit(nr, addr);
}

+/**
+ * _test_bit_acquire - Determine whether a bit is set with acquire semantics
+ * @nr: bit number to test
+ * @addr: Address to start counting from
+ */
+static __always_inline bool
+_test_bit_acquire(unsigned long nr, const volatile unsigned long *addr)
+{
+ instrument_atomic_read(addr + BIT_WORD(nr), sizeof(long));
+ return arch_test_bit_acquire(nr, addr);
+}
+
#endif /* _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H */
Index: linux-2.6/include/linux/buffer_head.h
===================================================================
--- linux-2.6.orig/include/linux/buffer_head.h
+++ linux-2.6/include/linux/buffer_head.h
@@ -156,7 +156,7 @@ static __always_inline int buffer_uptoda
* make it consistent with folio_test_uptodate
* pairs with smp_mb__before_atomic in set_buffer_uptodate
*/
- return (smp_load_acquire(&bh->b_state) & (1UL << BH_Uptodate)) != 0;
+ return test_bit_acquire(BH_Uptodate, &bh->b_state);
}

#define bh_offset(bh) ((unsigned long)(bh)->b_data & ~PAGE_MASK)

2022-08-25 21:57:40

by Linus Torvalds

Subject: Re: [PATCH] wait_on_bit: add an acquire memory barrier

On Thu, Aug 25, 2022 at 2:03 PM Mikulas Patocka <[email protected]> wrote:
>
> Here I reworked your patch, so that test_bit_acquire is defined just like
> test_bit. There's some code duplication (in
> include/asm-generic/bitops/generic-non-atomic.h and in
> arch/x86/include/asm/bitops.h), but that duplication exists in the
> test_bit function too.

This looks fine to me, and I like how you fixed up buffer_uptodate()
while at it.

> I tested it on x86-64 and arm64. On x86-64 it generates the "bt"
> instruction for variable-bit test and "shr; and $1" for constant bit test.

That shr/and is almost certainly pessimal for small constant values at
least, and it's better done as "movq %rax" followed by "test %rax".
But I guess it depends on the bit value (and thus the constant size).

Doing a "testb $imm8" would likely be optimal, but you'll never get
that with smp_load_acquire() on x86 unless you use inline asm, because
of how we're doing it with a volatile pointer.

Anyway, you could try something like this:

static __always_inline bool constant_test_bit(long nr,
		const volatile unsigned long *addr)
{
	bool oldbit;

	asm volatile("testb %2,%1"
		     CC_SET(nz)
		     : CC_OUT(nz) (oldbit)
		     : "m" (((unsigned char *)addr)[nr >> 3]),
		       "Ir" (1 << (nr & 7))
		     : "memory");
	return oldbit;
}

for both the regular test_bit() and for the acquire version (since all
loads are acquires on x86, and using an asm basically forces a memory
load, so it just does that "volatile" part).

But that's a separate optimization and independent of the acquire thing.

> For me, the kernel 6.0-rc2 doesn't boot in an arm64 virtual machine at all
> (with or without this patch), so I only compile-tested it on arm64. I have
> to bisect it.

Hmm. I'm running it on real arm64 hardware (rc2+ - not your patch), so
I wonder what's up..

Linus

2022-08-26 01:22:40

by Alan Stern

Subject: Re: [PATCH] wait_on_bit: add an acquire memory barrier

On Thu, Aug 25, 2022 at 05:03:40PM -0400, Mikulas Patocka wrote:
> From: Mikulas Patocka <[email protected]>
>
> There are several places in the kernel where wait_on_bit is not followed
> by a memory barrier (for example, in drivers/md/dm-bufio.c:new_read). On
> architectures with weak memory ordering, it may happen that memory
> accesses that follow wait_on_bit are reordered before wait_on_bit and they
> may return invalid data.
>
> Fix this class of bugs by introducing a new function "test_bit_acquire"
> that works like test_bit, but has acquire memory ordering semantics.
>
> Signed-off-by: Mikulas Patocka <[email protected]>

...

> --- linux-2.6.orig/include/asm-generic/bitops/generic-non-atomic.h
> +++ linux-2.6/include/asm-generic/bitops/generic-non-atomic.h
> @@ -4,6 +4,7 @@
> #define __ASM_GENERIC_BITOPS_GENERIC_NON_ATOMIC_H
>
> #include <linux/bits.h>
> +#include <asm/barrier.h>
>
> #ifndef _LINUX_BITOPS_H
> #error only <linux/bitops.h> can be included directly
> @@ -127,6 +128,18 @@ generic_test_bit(unsigned long nr, const
> return 1UL & (addr[BIT_WORD(nr)] >> (nr & (BITS_PER_LONG-1)));
> }
>
> +/**
> + * generic_test_bit - Determine whether a bit is set with acquire semantics

Trivial: Name in the kerneldoc isn't the same as the actual function name.

(Also, "with acquire semantics" is supposed to modify "Determine", not
"is set" -- we don't set bits using acquire semantics. You could change
this to "Determine, with acquire semantics, whether a bit is set".)

Alan Stern

> + * @nr: bit number to test
> + * @addr: Address to start counting from
> + */
> +static __always_inline bool
> +generic_test_bit_acquire(unsigned long nr, const volatile unsigned long *addr)
> +{
> + unsigned long *p = ((unsigned long *)addr) + BIT_WORD(nr);
> + return 1UL & (smp_load_acquire(p) >> (nr & (BITS_PER_LONG-1)));
> +}

2022-08-26 11:39:21

by Will Deacon

Subject: Re: [PATCH] wait_on_bit: add an acquire memory barrier

On Thu, Aug 25, 2022 at 05:03:40PM -0400, Mikulas Patocka wrote:
> Here I reworked your patch, so that test_bit_acquire is defined just like
> test_bit. There's some code duplication (in
> include/asm-generic/bitops/generic-non-atomic.h and in
> arch/x86/include/asm/bitops.h), but that duplication exists in the
> test_bit function too.
>
> I tested it on x86-64 and arm64. On x86-64 it generates the "bt"
> instruction for variable-bit test and "shr; and $1" for constant bit test.
> On arm64 it generates the "ldar" instruction for both constant and
> variable bit test.
>
> For me, the kernel 6.0-rc2 doesn't boot in an arm64 virtual machine at all
> (with or without this patch), so I only compile-tested it on arm64. I have
> to bisect it.

It's working fine for me and I haven't had any other reports that it's not
booting. Please could you share some more details about your setup so we
can try to reproduce the problem?

> From: Mikulas Patocka <[email protected]>
>
> There are several places in the kernel where wait_on_bit is not followed
> by a memory barrier (for example, in drivers/md/dm-bufio.c:new_read). On
> architectures with weak memory ordering, it may happen that memory
> accesses that follow wait_on_bit are reordered before wait_on_bit and they
> may return invalid data.
>
> Fix this class of bugs by introducing a new function "test_bit_acquire"
> that works like test_bit, but has acquire memory ordering semantics.
>
> Signed-off-by: Mikulas Patocka <[email protected]>
> Cc: [email protected]
>
> arch/x86/include/asm/bitops.h | 13 +++++++++++++
> include/asm-generic/bitops/generic-non-atomic.h | 14 ++++++++++++++
> include/asm-generic/bitops/instrumented-non-atomic.h | 12 ++++++++++++
> include/asm-generic/bitops/non-atomic.h | 1 +
> include/asm-generic/bitops/non-instrumented-non-atomic.h | 1 +
> include/linux/bitops.h | 1 +
> include/linux/buffer_head.h | 2 +-
> include/linux/wait_bit.h | 8 ++++----
> kernel/sched/wait_bit.c | 2 +-
> 9 files changed, 48 insertions(+), 6 deletions(-)

This looks good to me, thanks for doing it! Just one thing that jumped out
at me:

> Index: linux-2.6/include/linux/buffer_head.h
> ===================================================================
> --- linux-2.6.orig/include/linux/buffer_head.h
> +++ linux-2.6/include/linux/buffer_head.h
> @@ -156,7 +156,7 @@ static __always_inline int buffer_uptoda
> * make it consistent with folio_test_uptodate
> * pairs with smp_mb__before_atomic in set_buffer_uptodate
> */
> - return (smp_load_acquire(&bh->b_state) & (1UL << BH_Uptodate)) != 0;
> + return test_bit_acquire(BH_Uptodate, &bh->b_state);

Do you think it would be worth adding set_bit_release() and then relaxing
set_buffer_uptodate() to use that rather than the smp_mb__before_atomic()?

Will

2022-08-26 11:56:24

by Mikulas Patocka

Subject: Re: [PATCH] wait_on_bit: add an acquire memory barrier



On Fri, 26 Aug 2022, Will Deacon wrote:

> On Thu, Aug 25, 2022 at 05:03:40PM -0400, Mikulas Patocka wrote:
> > Here I reworked your patch, so that test_bit_acquire is defined just like
> > test_bit. There's some code duplication (in
> > include/asm-generic/bitops/generic-non-atomic.h and in
> > arch/x86/include/asm/bitops.h), but that duplication exists in the
> > test_bit function too.
> >
> > I tested it on x86-64 and arm64. On x86-64 it generates the "bt"
> > instruction for variable-bit test and "shr; and $1" for constant bit test.
> > On arm64 it generates the "ldar" instruction for both constant and
> > variable bit test.
> >
> > For me, the kernel 6.0-rc2 doesn't boot in an arm64 virtual machine at all
> > (with or without this patch), so I only compile-tested it on arm64. I have
> > to bisect it.
>
> It's working fine for me and I haven't had any other reports that it's not
> booting. Please could you share some more details about your setup so we
> can try to reproduce the problem?

I'm bisecting it now. I'll post the offending commit when I'm done.

It gets stuck without printing anything at this point:
Loading Linux 6.0.0-rc2 ...
Loading initial ramdisk ...
EFI stub: Booting Linux Kernel...
EFI stub: Using DTB from configuration table
EFI stub: Exiting boot services...

I uploaded my .config here:
https://people.redhat.com/~mpatocka/testcases/arm64-config/config-6.0.0-rc2
so you can try it on your own.

The host system is MacchiatoBIN board with Debian 10.12.

> This looks good to me, thanks for doing it! Just one thing that jumped out
> at me:
>
> > Index: linux-2.6/include/linux/buffer_head.h
> > ===================================================================
> > --- linux-2.6.orig/include/linux/buffer_head.h
> > +++ linux-2.6/include/linux/buffer_head.h
> > @@ -156,7 +156,7 @@ static __always_inline int buffer_uptoda
> > * make it consistent with folio_test_uptodate
> > * pairs with smp_mb__before_atomic in set_buffer_uptodate
> > */
> > - return (smp_load_acquire(&bh->b_state) & (1UL << BH_Uptodate)) != 0;
> > + return test_bit_acquire(BH_Uptodate, &bh->b_state);
>
> Do you think it would be worth adding set_bit_release() and then relaxing
> set_buffer_uptodate() to use that rather than the smp_mb__before_atomic()?
>
> Will

Yes, we could add this (but it would be better to add it in a separate
patch, so that backporting of the original patch to -stable is easier).

Mikulas

2022-08-26 13:30:02

by Mikulas Patocka

Subject: [PATCH v3] wait_on_bit: add an acquire memory barrier



On Thu, 25 Aug 2022, Linus Torvalds wrote:

> On Thu, Aug 25, 2022 at 2:03 PM Mikulas Patocka <[email protected]> wrote:
> >
> > Here I reworked your patch, so that test_bit_acquire is defined just like
> > test_bit. There's some code duplication (in
> > include/asm-generic/bitops/generic-non-atomic.h and in
> > arch/x86/include/asm/bitops.h), but that duplication exists in the
> > test_bit function too.
>
> This looks fine to me, and I like how you fixed up buffer_uptodate()
> while at it.
>
> > I tested it on x86-64 and arm64. On x86-64 it generates the "bt"
> > instruction for variable-bit test and "shr; and $1" for constant bit test.
>
> That shr/and is almost certainly pessimal for small constant values at
> least, and it's better done as "movq %rax" followed by "test %rax".
> But I guess it depends on the bit value (and thus the constant size).
>
> Doing a "testb $imm8" would likely be optimal, but you'll never get
> that with smp_load_acquire() on x86 unless you use inline asm, because
> of how we're doing it with a volatile pointer.
>
> Anyway, you could try something like this:
>
> static __always_inline bool constant_test_bit(long nr,
> 		const volatile unsigned long *addr)
> {
> 	bool oldbit;
>
> 	asm volatile("testb %2,%1"
> 		     CC_SET(nz)
> 		     : CC_OUT(nz) (oldbit)
> 		     : "m" (((unsigned char *)addr)[nr >> 3]),
> 		       "Ir" (1 << (nr & 7))
> 		     : "memory");
> 	return oldbit;
> }
>
> for both the regular test_bit() and for the acquire version (since all
> loads are acquires on x86, and using an asm basically forces a memory
> load, so it just does that "volatile" part).

I wouldn't do this for regular test_bit because if you read memory with
different size/alignment from what you wrote, various CPUs suffer from
store->load forwarding penalties.

But for test_bit_acquire this optimization is likely harmless because the
bit will typically not be tested just a few instructions after it was
written.

> But that's a separate optimization and independent of the acquire thing.
>
> > For me, the kernel 6.0-rc2 doesn't boot in an arm64 virtual machine at all
> > (with or without this patch), so I only compile-tested it on arm64. I have
> > to bisect it.
>
> Hmm. I'm running it on real arm64 hardware (rc2+ - not your patch), so
> I wonder what's up..
>
> Linus
>

This is version 3 of the patch. Changes:
* use assembler "testb" in constant_test_bit_acquire
* fix some comments as suggested by Alan Stern
* fix Documentation/atomic_bitops.txt (note that since the commit
415d832497098030241605c52ea83d4e2cfa7879, test_and_set/clear_bit is
always ordered, so fix this claim as well)

Mikulas




From: Mikulas Patocka <[email protected]>

There are several places in the kernel where wait_on_bit is not followed
by a memory barrier (for example, in drivers/md/dm-bufio.c:new_read). On
architectures with weak memory ordering, it may happen that memory
accesses that follow wait_on_bit are reordered before wait_on_bit and they
may return invalid data.

Fix this class of bugs by introducing a new function "test_bit_acquire"
that works like test_bit, but has acquire memory ordering semantics.

Signed-off-by: Mikulas Patocka <[email protected]>
Cc: [email protected]

Documentation/atomic_bitops.txt | 10 ++-----
arch/x86/include/asm/bitops.h | 21 +++++++++++++++
include/asm-generic/bitops/generic-non-atomic.h | 14 ++++++++++
include/asm-generic/bitops/instrumented-non-atomic.h | 12 ++++++++
include/asm-generic/bitops/non-atomic.h | 1
include/asm-generic/bitops/non-instrumented-non-atomic.h | 1
include/linux/bitops.h | 1
include/linux/buffer_head.h | 2 -
include/linux/wait_bit.h | 8 ++---
kernel/sched/wait_bit.c | 2 -
10 files changed, 60 insertions(+), 12 deletions(-)

Index: linux-2.6/include/asm-generic/bitops/generic-non-atomic.h
===================================================================
--- linux-2.6.orig/include/asm-generic/bitops/generic-non-atomic.h
+++ linux-2.6/include/asm-generic/bitops/generic-non-atomic.h
@@ -4,6 +4,7 @@
#define __ASM_GENERIC_BITOPS_GENERIC_NON_ATOMIC_H

#include <linux/bits.h>
+#include <asm/barrier.h>

#ifndef _LINUX_BITOPS_H
#error only <linux/bitops.h> can be included directly
@@ -127,6 +128,18 @@ generic_test_bit(unsigned long nr, const
return 1UL & (addr[BIT_WORD(nr)] >> (nr & (BITS_PER_LONG-1)));
}

+/**
+ * generic_test_bit_acquire - Determine, with acquire semantics, whether a bit is set
+ * @nr: bit number to test
+ * @addr: Address to start counting from
+ */
+static __always_inline bool
+generic_test_bit_acquire(unsigned long nr, const volatile unsigned long *addr)
+{
+ unsigned long *p = ((unsigned long *)addr) + BIT_WORD(nr);
+ return 1UL & (smp_load_acquire(p) >> (nr & (BITS_PER_LONG-1)));
+}
+
/*
* const_*() definitions provide good compile-time optimizations when
* the passed arguments can be resolved at compile time.
@@ -137,6 +150,7 @@ generic_test_bit(unsigned long nr, const
#define const___test_and_set_bit generic___test_and_set_bit
#define const___test_and_clear_bit generic___test_and_clear_bit
#define const___test_and_change_bit generic___test_and_change_bit
+#define const_test_bit_acquire generic_test_bit_acquire

/**
* const_test_bit - Determine whether a bit is set
Index: linux-2.6/include/asm-generic/bitops/non-atomic.h
===================================================================
--- linux-2.6.orig/include/asm-generic/bitops/non-atomic.h
+++ linux-2.6/include/asm-generic/bitops/non-atomic.h
@@ -13,6 +13,7 @@
#define arch___test_and_change_bit generic___test_and_change_bit

#define arch_test_bit generic_test_bit
+#define arch_test_bit_acquire generic_test_bit_acquire

#include <asm-generic/bitops/non-instrumented-non-atomic.h>

Index: linux-2.6/include/linux/bitops.h
===================================================================
--- linux-2.6.orig/include/linux/bitops.h
+++ linux-2.6/include/linux/bitops.h
@@ -59,6 +59,7 @@ extern unsigned long __sw_hweight64(__u6
#define __test_and_clear_bit(nr, addr) bitop(___test_and_clear_bit, nr, addr)
#define __test_and_change_bit(nr, addr) bitop(___test_and_change_bit, nr, addr)
#define test_bit(nr, addr) bitop(_test_bit, nr, addr)
+#define test_bit_acquire(nr, addr) bitop(_test_bit_acquire, nr, addr)

/*
* Include this here because some architectures need generic_ffs/fls in
Index: linux-2.6/include/linux/wait_bit.h
===================================================================
--- linux-2.6.orig/include/linux/wait_bit.h
+++ linux-2.6/include/linux/wait_bit.h
@@ -71,7 +71,7 @@ static inline int
wait_on_bit(unsigned long *word, int bit, unsigned mode)
{
might_sleep();
- if (!test_bit(bit, word))
+ if (!test_bit_acquire(bit, word))
return 0;
return out_of_line_wait_on_bit(word, bit,
bit_wait,
@@ -96,7 +96,7 @@ static inline int
wait_on_bit_io(unsigned long *word, int bit, unsigned mode)
{
might_sleep();
- if (!test_bit(bit, word))
+ if (!test_bit_acquire(bit, word))
return 0;
return out_of_line_wait_on_bit(word, bit,
bit_wait_io,
@@ -123,7 +123,7 @@ wait_on_bit_timeout(unsigned long *word,
unsigned long timeout)
{
might_sleep();
- if (!test_bit(bit, word))
+ if (!test_bit_acquire(bit, word))
return 0;
return out_of_line_wait_on_bit_timeout(word, bit,
bit_wait_timeout,
@@ -151,7 +151,7 @@ wait_on_bit_action(unsigned long *word,
unsigned mode)
{
might_sleep();
- if (!test_bit(bit, word))
+ if (!test_bit_acquire(bit, word))
return 0;
return out_of_line_wait_on_bit(word, bit, action, mode);
}
Index: linux-2.6/kernel/sched/wait_bit.c
===================================================================
--- linux-2.6.orig/kernel/sched/wait_bit.c
+++ linux-2.6/kernel/sched/wait_bit.c
@@ -47,7 +47,7 @@ __wait_on_bit(struct wait_queue_head *wq
prepare_to_wait(wq_head, &wbq_entry->wq_entry, mode);
if (test_bit(wbq_entry->key.bit_nr, wbq_entry->key.flags))
ret = (*action)(&wbq_entry->key, mode);
- } while (test_bit(wbq_entry->key.bit_nr, wbq_entry->key.flags) && !ret);
+ } while (test_bit_acquire(wbq_entry->key.bit_nr, wbq_entry->key.flags) && !ret);

finish_wait(wq_head, &wbq_entry->wq_entry);

Index: linux-2.6/include/asm-generic/bitops/non-instrumented-non-atomic.h
===================================================================
--- linux-2.6.orig/include/asm-generic/bitops/non-instrumented-non-atomic.h
+++ linux-2.6/include/asm-generic/bitops/non-instrumented-non-atomic.h
@@ -12,5 +12,6 @@
#define ___test_and_change_bit arch___test_and_change_bit

#define _test_bit arch_test_bit
+#define _test_bit_acquire arch_test_bit_acquire

#endif /* __ASM_GENERIC_BITOPS_NON_INSTRUMENTED_NON_ATOMIC_H */
Index: linux-2.6/arch/x86/include/asm/bitops.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/bitops.h
+++ linux-2.6/arch/x86/include/asm/bitops.h
@@ -207,6 +207,20 @@ static __always_inline bool constant_tes
(addr[nr >> _BITOPS_LONG_SHIFT])) != 0;
}

+static __always_inline bool constant_test_bit_acquire(long nr, const volatile unsigned long *addr)
+{
+ bool oldbit;
+
+ asm volatile("testb %2,%1"
+ CC_SET(nz)
+ : CC_OUT(nz) (oldbit)
+ : "m" (((unsigned char *)addr)[nr >> 3]),
+ "i" (1 << (nr & 7))
+ :"memory");
+
+ return oldbit;
+}
+
static __always_inline bool variable_test_bit(long nr, volatile const unsigned long *addr)
{
bool oldbit;
@@ -226,6 +240,13 @@ arch_test_bit(unsigned long nr, const vo
variable_test_bit(nr, addr);
}

+static __always_inline bool
+arch_test_bit_acquire(unsigned long nr, const volatile unsigned long *addr)
+{
+ return __builtin_constant_p(nr) ? constant_test_bit_acquire(nr, addr) :
+ variable_test_bit(nr, addr);
+}
+
/**
* __ffs - find first set bit in word
* @word: The word to search
Index: linux-2.6/include/asm-generic/bitops/instrumented-non-atomic.h
===================================================================
--- linux-2.6.orig/include/asm-generic/bitops/instrumented-non-atomic.h
+++ linux-2.6/include/asm-generic/bitops/instrumented-non-atomic.h
@@ -142,4 +142,16 @@ _test_bit(unsigned long nr, const volati
return arch_test_bit(nr, addr);
}

+/**
+ * _test_bit_acquire - Determine, with acquire semantics, whether a bit is set
+ * @nr: bit number to test
+ * @addr: Address to start counting from
+ */
+static __always_inline bool
+_test_bit_acquire(unsigned long nr, const volatile unsigned long *addr)
+{
+ instrument_atomic_read(addr + BIT_WORD(nr), sizeof(long));
+ return arch_test_bit_acquire(nr, addr);
+}
+
#endif /* _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H */
Index: linux-2.6/include/linux/buffer_head.h
===================================================================
--- linux-2.6.orig/include/linux/buffer_head.h
+++ linux-2.6/include/linux/buffer_head.h
@@ -156,7 +156,7 @@ static __always_inline int buffer_uptoda
* make it consistent with folio_test_uptodate
* pairs with smp_mb__before_atomic in set_buffer_uptodate
*/
- return (smp_load_acquire(&bh->b_state) & (1UL << BH_Uptodate)) != 0;
+ return test_bit_acquire(BH_Uptodate, &bh->b_state);
}

#define bh_offset(bh) ((unsigned long)(bh)->b_data & ~PAGE_MASK)
Index: linux-2.6/Documentation/atomic_bitops.txt
===================================================================
--- linux-2.6.orig/Documentation/atomic_bitops.txt
+++ linux-2.6/Documentation/atomic_bitops.txt
@@ -58,13 +58,11 @@ Like with atomic_t, the rule of thumb is

- RMW operations that have a return value are fully ordered.

- - RMW operations that are conditional are unordered on FAILURE,
- otherwise the above rules apply. In the case of test_and_set_bit_lock(),
- if the bit in memory is unchanged by the operation then it is deemed to have
- failed.
+ - RMW operations that are conditional are fully ordered.

-Except for a successful test_and_set_bit_lock() which has ACQUIRE semantics and
-clear_bit_unlock() which has RELEASE semantics.
+Except for a successful test_and_set_bit_lock() which has ACQUIRE semantics,
+clear_bit_unlock() which has RELEASE semantics and test_bit_acquire which has
+ACQUIRE semantics.

Since a platform only has a single means of achieving atomic operations
the same barriers as for atomic_t are used, see atomic_t.txt.

2022-08-26 14:42:51

by Mikulas Patocka

Subject: Re: [PATCH] wait_on_bit: add an acquire memory barrier



On Fri, 26 Aug 2022, Mikulas Patocka wrote:

>
>
> On Fri, 26 Aug 2022, Will Deacon wrote:
>
> > On Thu, Aug 25, 2022 at 05:03:40PM -0400, Mikulas Patocka wrote:
> > >
> > > For me, the kernel 6.0-rc2 doesn't boot in an arm64 virtual machine at all
> > > (with or without this patch), so I only compile-tested it on arm64. I have
> > > to bisect it.
> >
> > It's working fine for me and I haven't had any other reports that it's not
> > booting. Please could you share some more details about your setup so we
> > can try to reproduce the problem?
>
> I'm bisecting it now. I'll post the offending commit when I'm done.

So, the bad commit is aacd149b62382c63911060b8f64c1e3d89bd405a ("arm64:
head: avoid relocating the kernel twice for KASLR").

> It gets stuck without printing anything at this point:
> Loading Linux 6.0.0-rc2 ...
> Loading initial ramdisk ...
> EFI stub: Booting Linux Kernel...
> EFI stub: Using DTB from configuration table
> EFI stub: Exiting boot services...
>
> I uploaded my .config here:
> https://people.redhat.com/~mpatocka/testcases/arm64-config/config-6.0.0-rc2
> so you can try it on your own.
>
> The host system is MacchiatoBIN board with Debian 10.12.

Mikulas

2022-08-26 17:05:05

by Linus Torvalds

Subject: Re: [PATCH v3] wait_on_bit: add an acquire memory barrier

On Fri, Aug 26, 2022 at 6:17 AM Mikulas Patocka <[email protected]> wrote:
>
> I wouldn't do this for regular test_bit because if you read memory with
> different size/alignment from what you wrote, various CPUs suffer from
> store->load forwarding penalties.

All of the half-way modern CPUs do ok with store->load forwarding as
long as the load is fully contained in the store, so I suspect the
'testb' model is pretty much universally better than loading a word
into a register and testing it there.

So narrowing the load is fine (but you generally never want to narrow
a *store*, because that results in huge problems with subsequent wider
loads).

But it's not a huge deal, and this way if somebody actually runs the
numbers and does any comparisons, we have both versions, and if the
'testb' is better, we can just rename the x86
constant_test_bit_acquire() to just constant_test_bit() and use it for
both cases.

> But for test_bit_acquire this optimization is likely harmless because the
> bit will not be tested a few instructions after writing it.

Note that if we really do that, then we've already lost because of the
volatile access, ie if we cared about a "write bit, test bit" pattern,
we should use other operations.

Now, the new "const_test_bit()" logic (commits bb7379bfa680 "bitops:
define const_*() versions of the non-atomics" and 0e862838f290
"bitops: unify non-atomic bitops prototypes across architectures")
means that as long as you are setting and testing a bit in a local
variable, it gets elided entirely. But in general use you're going to
see that load from memory, and then the wider load is likely worse
(because of bigger constants, and because it requires a register).

So maybe there is room to tweak it further, but this version of the
patch looks good to me, and I've applied it.

Thanks,
Linus

2022-08-26 17:34:52

by Linus Torvalds

Subject: Re: [PATCH v3] wait_on_bit: add an acquire memory barrier

On Fri, Aug 26, 2022 at 9:45 AM Linus Torvalds
<[email protected]> wrote:
>
> So narrowing the load is fine (but you generally never want to narrow
> a *store*, because that results in huge problems with subsequent wider
> loads).

.. so making that statement made me go look around, and it's exactly
what we use for clear_bit() on x86.

Which is actually ok too, because it's a locked operation, and at that
point the whole "store buffer forwarding" issue pretty much goes out
the window anyway.

But because I looked at where the new test_bit_acquire() triggers and
where there are other bitops around it, I found this beauty in
fs/buffer.c: clean_bdev_aliases():

if (!buffer_mapped(bh) ||
(bh->b_blocknr < block))
goto next;
if (bh->b_blocknr >= block + len)
break;
clear_buffer_dirty(bh);
wait_on_buffer(bh);
clear_buffer_req(bh);

where it basically works on four different bits (buffer_mapped, dirty,
lock, req) right next to each other.

That code sequence really doesn't matter, but it was interesting
seeing the generated code. Not pretty, but the ugliest part was
actually how the might_sleep() calls in those helper functions result
in __cond_resched() when PREEMPT_VOLUNTARY is on, and how horrid that
is for register allocation.

And in bh_submit_read() we have another ugly thing:

mov (%rdi),%rax
test $0x4,%al
je <bh_submit_read+0x7d>
push %rbx
mov %rdi,%rbx
testb $0x1,(%rdi)

where we have a mix of those two kinds of "test_bit()" (and the
"testb" version most definitely looks better). But that's due to the
source code being

BUG_ON(!buffer_locked(bh));
if (buffer_uptodate(bh)) {

ie we have that *ancient* BUG_ON() messing things up. Oh well.

None of the buffer-head code matters any more, but it's certainly
showing the effects of that patch of yours, and the code isn't pretty.

But none of the ugliness I found was actually _due_ to your patch. It
was all due to other things, and your patch just makes code generation
better in the cases I looked at.

Linus

2022-08-26 18:20:43

by Will Deacon

Subject: Re: [PATCH] wait_on_bit: add an acquire memory barrier

On Fri, Aug 26, 2022 at 10:08:42AM -0400, Mikulas Patocka wrote:
> On Fri, 26 Aug 2022, Mikulas Patocka wrote:
> > On Fri, 26 Aug 2022, Will Deacon wrote:
> > > On Thu, Aug 25, 2022 at 05:03:40PM -0400, Mikulas Patocka wrote:
> > > > For me, the kernel 6.0-rc2 doesn't boot in an arm64 virtual machine at all
> > > > (with or without this patch), so I only compile-tested it on arm64. I have
> > > > to bisect it.
> > >
> > > It's working fine for me and I haven't had any other reports that it's not
> > > booting. Please could you share some more details about your setup so we
> > > can try to reproduce the problem?
> >
> > I'm bisecting it now. I'll post the offending commit when I'm done.
>
> So, the bad commit is aacd149b62382c63911060b8f64c1e3d89bd405a ("arm64:
> head: avoid relocating the kernel twice for KASLR").

Thanks. Any chance you could give this a go, please?

https://lore.kernel.org/r/[email protected]

Will

2022-08-26 18:23:47

by Mikulas Patocka

Subject: Re: [PATCH v3] wait_on_bit: add an acquire memory barrier



On Fri, 26 Aug 2022, Linus Torvalds wrote:

> That code sequence really doesn't matter, but it was interesting
> seeing the generated code. Not pretty, but the ugliest part was
> actually how the might_sleep() calls in those helper functions result
> in __cond_resched() when PREEMPT_VOLUNTARY is on, and how horrid that
> is for register allocation.

Perhaps use __attribute__((no_caller_saved_registers)) on
__cond_resched()?

Mikulas

2022-08-26 18:27:14

by Mikulas Patocka

Subject: Re: [PATCH] wait_on_bit: add an acquire memory barrier



On Fri, 26 Aug 2022, Will Deacon wrote:

> > So, the bad commit is aacd149b62382c63911060b8f64c1e3d89bd405a ("arm64:
> > head: avoid relocating the kernel twice for KASLR").
>
> Thanks. Any chance you could give this a go, please?
>
> https://lore.kernel.org/r/[email protected]
>
> Will

This doesn't help.

Mikulas

2022-08-26 18:47:49

by Ard Biesheuvel

Subject: Re: [PATCH] wait_on_bit: add an acquire memory barrier

On Fri, 26 Aug 2022 at 20:25, Mikulas Patocka <[email protected]> wrote:
>
>
>
> On Fri, 26 Aug 2022, Will Deacon wrote:
>
> > > So, the bad commit is aacd149b62382c63911060b8f64c1e3d89bd405a ("arm64:
> > > head: avoid relocating the kernel twice for KASLR").
> >
> > Thanks. Any chance you could give this a go, please?
> >
> > https://lore.kernel.org/r/[email protected]
> >
> > Will
>
> This doesn't help.
>

Could you try booting with earlycon?

Just 'earlycon' and if that does not help,
'earlycon=uart8250,mmio32,<uart PA>' [IIRC, mcbin uses 16550
compatible UARTs, right?]

2022-08-26 19:26:06

by Geert Uytterhoeven

Subject: Re: [PATCH v3] wait_on_bit: add an acquire memory barrier

Hi Mikulas,

On Fri, Aug 26, 2022 at 3:18 PM Mikulas Patocka <[email protected]> wrote:
> On Thu, 25 Aug 2022, Linus Torvalds wrote:
> > On Thu, Aug 25, 2022 at 2:03 PM Mikulas Patocka <[email protected]> wrote:
> > > Here I reworked your patch, so that test_bit_acquire is defined just like
> > > test_bit. There's some code duplication (in
> > > include/asm-generic/bitops/generic-non-atomic.h and in
> > > arch/x86/include/asm/bitops.h), but that duplication exists in the
> > > test_bit function too.
> >
> > This looks fine to me, and I like how you fixed up buffer_uptodate()
> > while at it.
> >
> > > I tested it on x86-64 and arm64. On x86-64 it generates the "bt"
> > > instruction for variable-bit test and "shr; and $1" for constant bit test.
> >
> > That shr/and is almost certainly pessimal for small constant values at
> > least, and it's better done as "movq %rax" followed by "test %rax".
> > But I guess it depends on the bit value (and thus the constant size).
> >
> > Doing a "testb $imm8" would likely be optimal, but you'll never get
> > that with smp_load_acquire() on x86 unless you use inline asm, because
> > of how we're doing it with a volatile pointer.
> >
> > Anyway, you could try something like this:
> >
> > static __always_inline bool constant_test_bit(long nr, const
> > volatile unsigned long *addr)
> > {
> > bool oldbit;
> >
> > asm volatile("testb %2,%1"
> > CC_SET(nz)
> > : CC_OUT(nz) (oldbit)
> > : "m" (((unsigned char *)addr)[nr >> 3]),
> > "Ir" (1 << (nr & 7))
> > :"memory");
> > return oldbit;
> > }
> >
> > for both the regular test_bit() and for the acquire (since all loads
> > are acquires on x86, and using an asm basically forces a memory load
> > so it just does that "volatile" part.
>
> I wouldn't do this for regular test_bit because if you read memory with
> different size/alignment from what you wrote, various CPUs suffer from
> store->load forwarding penalties.
>
> But for test_bit_acquire this optimization is likely harmless because the
> bit will not be tested a few instructions after writing it.
>
> > But that's a separate optimization and independent of the acquire thing.
> >
> > > For me, the kernel 6.0-rc2 doesn't boot in an arm64 virtual machine at all
> > > (with or without this patch), so I only compile-tested it on arm64. I have
> > > to bisect it.
> >
> > Hmm. I'm running it on real arm64 hardware (rc2+ - not your patch), so
> > I wonder what's up..
> >
> > Linus
> >
>
> This is version 3 of the patch. Changes:
> * use assembler "testb" in constant_test_bit_acquire
> * fix some comments as suggested by Alan Stern
> * fix Documentation/atomic_bitops.txt (note that since the commit
> 415d832497098030241605c52ea83d4e2cfa7879, test_and_set/clear_bit is
> always ordered, so fix this claim as well)
>
> Mikulas
>
>
>
>
> From: Mikulas Patocka <[email protected]>
>
> There are several places in the kernel where wait_on_bit is not followed
> by a memory barrier (for example, in drivers/md/dm-bufio.c:new_read). On
> architectures with weak memory ordering, it may happen that memory
> accesses that follow wait_on_bit are reordered before wait_on_bit and they
> may return invalid data.
>
> Fix this class of bugs by introducing a new function "test_bit_acquire"
> that works like test_bit, but has acquire memory ordering semantics.
>
> Signed-off-by: Mikulas Patocka <[email protected]>

Thanks for your patch, which is now commit 8238b4579866b7c1
("wait_on_bit: add an acquire memory barrier").

[email protected] reports lots of build failures on m68k:

include/asm-generic/bitops/non-instrumented-non-atomic.h:15:33:
error: implicit declaration of function 'arch_test_bit_acquire'; did
you mean '_test_bit_acquire'? [-Werror=implicit-function-declaration]

which I've bisected to this commit.

http://kisskb.ellerman.id.au/kisskb/head/3e5c673f0d75bc22b3c26eade87e4db4f374cd34


> Cc: [email protected]


>
> Documentation/atomic_bitops.txt | 10 ++-----
> arch/x86/include/asm/bitops.h | 21 +++++++++++++++
> include/asm-generic/bitops/generic-non-atomic.h | 14 ++++++++++
> include/asm-generic/bitops/instrumented-non-atomic.h | 12 ++++++++
> include/asm-generic/bitops/non-atomic.h | 1
> include/asm-generic/bitops/non-instrumented-non-atomic.h | 1
> include/linux/bitops.h | 1
> include/linux/buffer_head.h | 2 -
> include/linux/wait_bit.h | 8 ++---
> kernel/sched/wait_bit.c | 2 -
> 10 files changed, 60 insertions(+), 12 deletions(-)
>
> Index: linux-2.6/include/asm-generic/bitops/generic-non-atomic.h
> ===================================================================
> --- linux-2.6.orig/include/asm-generic/bitops/generic-non-atomic.h
> +++ linux-2.6/include/asm-generic/bitops/generic-non-atomic.h
> @@ -4,6 +4,7 @@
> #define __ASM_GENERIC_BITOPS_GENERIC_NON_ATOMIC_H
>
> #include <linux/bits.h>
> +#include <asm/barrier.h>
>
> #ifndef _LINUX_BITOPS_H
> #error only <linux/bitops.h> can be included directly
> @@ -127,6 +128,18 @@ generic_test_bit(unsigned long nr, const
> return 1UL & (addr[BIT_WORD(nr)] >> (nr & (BITS_PER_LONG-1)));
> }
>
> +/**
> + * generic_test_bit_acquire - Determine, with acquire semantics, whether a bit is set
> + * @nr: bit number to test
> + * @addr: Address to start counting from
> + */
> +static __always_inline bool
> +generic_test_bit_acquire(unsigned long nr, const volatile unsigned long *addr)
> +{
> + unsigned long *p = ((unsigned long *)addr) + BIT_WORD(nr);
> + return 1UL & (smp_load_acquire(p) >> (nr & (BITS_PER_LONG-1)));
> +}
> +
> /*
> * const_*() definitions provide good compile-time optimizations when
> * the passed arguments can be resolved at compile time.
> @@ -137,6 +150,7 @@ generic_test_bit(unsigned long nr, const
> #define const___test_and_set_bit generic___test_and_set_bit
> #define const___test_and_clear_bit generic___test_and_clear_bit
> #define const___test_and_change_bit generic___test_and_change_bit
> +#define const_test_bit_acquire generic_test_bit_acquire
>
> /**
> * const_test_bit - Determine whether a bit is set
> Index: linux-2.6/include/asm-generic/bitops/non-atomic.h
> ===================================================================
> --- linux-2.6.orig/include/asm-generic/bitops/non-atomic.h
> +++ linux-2.6/include/asm-generic/bitops/non-atomic.h
> @@ -13,6 +13,7 @@
> #define arch___test_and_change_bit generic___test_and_change_bit
>
> #define arch_test_bit generic_test_bit
> +#define arch_test_bit_acquire generic_test_bit_acquire
>
> #include <asm-generic/bitops/non-instrumented-non-atomic.h>
>
> Index: linux-2.6/include/linux/bitops.h
> ===================================================================
> --- linux-2.6.orig/include/linux/bitops.h
> +++ linux-2.6/include/linux/bitops.h
> @@ -59,6 +59,7 @@ extern unsigned long __sw_hweight64(__u6
> #define __test_and_clear_bit(nr, addr) bitop(___test_and_clear_bit, nr, addr)
> #define __test_and_change_bit(nr, addr) bitop(___test_and_change_bit, nr, addr)
> #define test_bit(nr, addr) bitop(_test_bit, nr, addr)
> +#define test_bit_acquire(nr, addr) bitop(_test_bit_acquire, nr, addr)
>
> /*
> * Include this here because some architectures need generic_ffs/fls in
> Index: linux-2.6/include/linux/wait_bit.h
> ===================================================================
> --- linux-2.6.orig/include/linux/wait_bit.h
> +++ linux-2.6/include/linux/wait_bit.h
> @@ -71,7 +71,7 @@ static inline int
> wait_on_bit(unsigned long *word, int bit, unsigned mode)
> {
> might_sleep();
> - if (!test_bit(bit, word))
> + if (!test_bit_acquire(bit, word))
> return 0;
> return out_of_line_wait_on_bit(word, bit,
> bit_wait,
> @@ -96,7 +96,7 @@ static inline int
> wait_on_bit_io(unsigned long *word, int bit, unsigned mode)
> {
> might_sleep();
> - if (!test_bit(bit, word))
> + if (!test_bit_acquire(bit, word))
> return 0;
> return out_of_line_wait_on_bit(word, bit,
> bit_wait_io,
> @@ -123,7 +123,7 @@ wait_on_bit_timeout(unsigned long *word,
> unsigned long timeout)
> {
> might_sleep();
> - if (!test_bit(bit, word))
> + if (!test_bit_acquire(bit, word))
> return 0;
> return out_of_line_wait_on_bit_timeout(word, bit,
> bit_wait_timeout,
> @@ -151,7 +151,7 @@ wait_on_bit_action(unsigned long *word,
> unsigned mode)
> {
> might_sleep();
> - if (!test_bit(bit, word))
> + if (!test_bit_acquire(bit, word))
> return 0;
> return out_of_line_wait_on_bit(word, bit, action, mode);
> }
> Index: linux-2.6/kernel/sched/wait_bit.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched/wait_bit.c
> +++ linux-2.6/kernel/sched/wait_bit.c
> @@ -47,7 +47,7 @@ __wait_on_bit(struct wait_queue_head *wq
> prepare_to_wait(wq_head, &wbq_entry->wq_entry, mode);
> if (test_bit(wbq_entry->key.bit_nr, wbq_entry->key.flags))
> ret = (*action)(&wbq_entry->key, mode);
> - } while (test_bit(wbq_entry->key.bit_nr, wbq_entry->key.flags) && !ret);
> + } while (test_bit_acquire(wbq_entry->key.bit_nr, wbq_entry->key.flags) && !ret);
>
> finish_wait(wq_head, &wbq_entry->wq_entry);
>
> Index: linux-2.6/include/asm-generic/bitops/non-instrumented-non-atomic.h
> ===================================================================
> --- linux-2.6.orig/include/asm-generic/bitops/non-instrumented-non-atomic.h
> +++ linux-2.6/include/asm-generic/bitops/non-instrumented-non-atomic.h
> @@ -12,5 +12,6 @@
> #define ___test_and_change_bit arch___test_and_change_bit
>
> #define _test_bit arch_test_bit
> +#define _test_bit_acquire arch_test_bit_acquire
>
> #endif /* __ASM_GENERIC_BITOPS_NON_INSTRUMENTED_NON_ATOMIC_H */
> Index: linux-2.6/arch/x86/include/asm/bitops.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/bitops.h
> +++ linux-2.6/arch/x86/include/asm/bitops.h
> @@ -207,6 +207,20 @@ static __always_inline bool constant_tes
> (addr[nr >> _BITOPS_LONG_SHIFT])) != 0;
> }
>
> +static __always_inline bool constant_test_bit_acquire(long nr, const volatile unsigned long *addr)
> +{
> + bool oldbit;
> +
> + asm volatile("testb %2,%1"
> + CC_SET(nz)
> + : CC_OUT(nz) (oldbit)
> + : "m" (((unsigned char *)addr)[nr >> 3]),
> + "i" (1 << (nr & 7))
> + :"memory");
> +
> + return oldbit;
> +}
> +
> static __always_inline bool variable_test_bit(long nr, volatile const unsigned long *addr)
> {
> bool oldbit;
> @@ -226,6 +240,13 @@ arch_test_bit(unsigned long nr, const vo
> variable_test_bit(nr, addr);
> }
>
> +static __always_inline bool
> +arch_test_bit_acquire(unsigned long nr, const volatile unsigned long *addr)
> +{
> + return __builtin_constant_p(nr) ? constant_test_bit_acquire(nr, addr) :
> + variable_test_bit(nr, addr);
> +}
> +
> /**
> * __ffs - find first set bit in word
> * @word: The word to search
> Index: linux-2.6/include/asm-generic/bitops/instrumented-non-atomic.h
> ===================================================================
> --- linux-2.6.orig/include/asm-generic/bitops/instrumented-non-atomic.h
> +++ linux-2.6/include/asm-generic/bitops/instrumented-non-atomic.h
> @@ -142,4 +142,16 @@ _test_bit(unsigned long nr, const volati
> return arch_test_bit(nr, addr);
> }
>
> +/**
> + * _test_bit_acquire - Determine, with acquire semantics, whether a bit is set
> + * @nr: bit number to test
> + * @addr: Address to start counting from
> + */
> +static __always_inline bool
> +_test_bit_acquire(unsigned long nr, const volatile unsigned long *addr)
> +{
> + instrument_atomic_read(addr + BIT_WORD(nr), sizeof(long));
> + return arch_test_bit_acquire(nr, addr);
> +}
> +
> #endif /* _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H */
> Index: linux-2.6/include/linux/buffer_head.h
> ===================================================================
> --- linux-2.6.orig/include/linux/buffer_head.h
> +++ linux-2.6/include/linux/buffer_head.h
> @@ -156,7 +156,7 @@ static __always_inline int buffer_uptoda
> * make it consistent with folio_test_uptodate
> * pairs with smp_mb__before_atomic in set_buffer_uptodate
> */
> - return (smp_load_acquire(&bh->b_state) & (1UL << BH_Uptodate)) != 0;
> + return test_bit_acquire(BH_Uptodate, &bh->b_state);
> }
>
> #define bh_offset(bh) ((unsigned long)(bh)->b_data & ~PAGE_MASK)
> Index: linux-2.6/Documentation/atomic_bitops.txt
> ===================================================================
> --- linux-2.6.orig/Documentation/atomic_bitops.txt
> +++ linux-2.6/Documentation/atomic_bitops.txt
> @@ -58,13 +58,11 @@ Like with atomic_t, the rule of thumb is
>
> - RMW operations that have a return value are fully ordered.
>
> - - RMW operations that are conditional are unordered on FAILURE,
> - otherwise the above rules apply. In the case of test_and_set_bit_lock(),
> - if the bit in memory is unchanged by the operation then it is deemed to have
> - failed.
> + - RMW operations that are conditional are fully ordered.
>
> -Except for a successful test_and_set_bit_lock() which has ACQUIRE semantics and
> -clear_bit_unlock() which has RELEASE semantics.
> +Except for a successful test_and_set_bit_lock() which has ACQUIRE semantics,
> +clear_bit_unlock() which has RELEASE semantics and test_bit_acquire which has
> +ACQUIRE semantics.
>
> Since a platform only has a single means of achieving atomic operations
> the same barriers as for atomic_t are used, see atomic_t.txt.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2022-08-26 19:34:26

by Mikulas Patocka

Subject: Re: [PATCH] wait_on_bit: add an acquire memory barrier



On Fri, 26 Aug 2022, Ard Biesheuvel wrote:

> Could you try booting with earlycon?
>
> Just 'earlycon' and if that does not help,

It doesn't help.

> 'earlycon=uart8250,mmio32,<uart PA>' [IIRC, mcbin uses 16550
> compatible UARTs, right?]

mcbin is the host system (running a stable kernel fine). The crash happens
in a virtual machine. The vm uses /dev/ttyAMA0 as a console:

Serial: AMBA PL011 UART driver
9000000.pl011: ttyAMA0 at MMIO 0x9000000 (irq = 45, base_baud = 0) is a PL011 rev1
printk: console [ttyAMA0] enabled

I tried earlycon=pl011,mmio32,0x9000000 - but it doesn't help, it hangs
without printing anything.

Mikulas

2022-08-26 20:27:36

by Linus Torvalds

Subject: Re: [PATCH v3] wait_on_bit: add an acquire memory barrier

On Fri, Aug 26, 2022 at 12:23 PM Geert Uytterhoeven
<[email protected]> wrote:
>
> include/asm-generic/bitops/non-instrumented-non-atomic.h:15:33:
> error: implicit declaration of function 'arch_test_bit_acquire'; did
> you mean '_test_bit_acquire'? [-Werror=implicit-function-declaration]
>

Ahh. m68k isn't using any of the generic bitops headers.

*Most* architectures have that

#include <asm-generic/bitops/non-atomic.h>

and get it that way, but while it's common, it's most definitely not universal:

[torvalds@ryzen linux]$ git grep -L bitops/non-atomic.h
arch/*/include/asm/bitops.h
arch/alpha/include/asm/bitops.h
arch/hexagon/include/asm/bitops.h
arch/ia64/include/asm/bitops.h
arch/m68k/include/asm/bitops.h
arch/s390/include/asm/bitops.h
arch/sparc/include/asm/bitops.h
arch/x86/include/asm/bitops.h

and of that list only x86 has the new arch_test_bit_acquire().

So I assume it's not just m68k, but also alpha, hexagon, ia64, s390
and sparc that have this issue (unless they maybe have some other path
that includes the generic ones, I didn't check).

This was actually why my original suggested patch used the
'generic-non-atomic.h' header for it, because that is actually
included regardless of any architecture headers directly from
<linux/bitops.h>.

And it never occurred to me that Mikulas' updated patch then had
this arch_test_bit_acquire() issue.

Something like the attached patch *MAY* fix it, but I really haven't
thought about it a lot, and it's pretty ugly. Maybe it would be better
to just add the

#define arch_test_bit_acquire generic_test_bit_acquire

to the affected <asm/bitops.h> files instead, and then let those
architectures decide on their own that maybe they want to use their
own test_bit() function because it is _already_ an acquire one.

Mikulas?

Geert - any opinions on that "maybe the arch should just do that
#define itself"? I don't think it actually matters for m68k - you end
up with pretty much the same thing either way, because
"smp_load_acquire()" is just a load anyway.

Linus


Attachments:
patch.diff (1.08 kB)

2022-08-26 21:01:21

by Mikulas Patocka

Subject: [PATCH] provide arch_test_bit_acquire for architectures that define test_bit



On Fri, 26 Aug 2022, Geert Uytterhoeven wrote:

> Hi Mikulas,
>
> [email protected] reports lots of build failures on m68k:
>
> include/asm-generic/bitops/non-instrumented-non-atomic.h:15:33:
> error: implicit declaration of function 'arch_test_bit_acquire'; did
> you mean '_test_bit_acquire'? [-Werror=implicit-function-declaration]
>
> which I've bisected to this commit.
>
> http://kisskb.ellerman.id.au/kisskb/head/3e5c673f0d75bc22b3c26eade87e4db4f374cd34

Does this patch fix it? It is untested.

I'm not sure about the hexagon architecture; it is presumably in-order,
so test_bit and test_bit_acquire would be equivalent, but I'm not
certain - I'm adding the hexagon maintainer to the recipient field.

Mikulas



provide arch_test_bit_acquire for architectures that define test_bit

Some architectures define their own arch_test_bit and they also need
arch_test_bit_acquire, otherwise they won't compile.

Signed-off-by: Mikulas Patocka <[email protected]>
Cc: [email protected]
Fixes: 8238b4579866 ("wait_on_bit: add an acquire memory barrier")

---
arch/alpha/include/asm/bitops.h | 7 +++++++
arch/hexagon/include/asm/bitops.h | 15 +++++++++++++++
arch/ia64/include/asm/bitops.h | 7 +++++++
arch/m68k/include/asm/bitops.h | 7 +++++++
arch/s390/include/asm/bitops.h | 7 +++++++
5 files changed, 43 insertions(+)

Index: linux-2.6/arch/m68k/include/asm/bitops.h
===================================================================
--- linux-2.6.orig/arch/m68k/include/asm/bitops.h
+++ linux-2.6/arch/m68k/include/asm/bitops.h
@@ -163,6 +163,13 @@ arch_test_bit(unsigned long nr, const vo
return (addr[nr >> 5] & (1UL << (nr & 31))) != 0;
}

+static __always_inline bool
+arch_test_bit_acquire(unsigned long nr, const volatile unsigned long *addr)
+{
+ unsigned long *p = ((unsigned long *)addr) + BIT_WORD(nr);
+ return 1UL & (smp_load_acquire(p) >> (nr & (BITS_PER_LONG-1)));
+}
+
static inline int bset_reg_test_and_set_bit(int nr,
volatile unsigned long *vaddr)
{
Index: linux-2.6/arch/alpha/include/asm/bitops.h
===================================================================
--- linux-2.6.orig/arch/alpha/include/asm/bitops.h
+++ linux-2.6/arch/alpha/include/asm/bitops.h
@@ -289,6 +289,13 @@ arch_test_bit(unsigned long nr, const vo
return (1UL & (((const int *) addr)[nr >> 5] >> (nr & 31))) != 0UL;
}

+static __always_inline bool
+arch_test_bit_acquire(unsigned long nr, const volatile unsigned long *addr)
+{
+ unsigned long *p = ((unsigned long *)addr) + BIT_WORD(nr);
+ return 1UL & (smp_load_acquire(p) >> (nr & (BITS_PER_LONG-1)));
+}
+
/*
* ffz = Find First Zero in word. Undefined if no zero exists,
* so code should check against ~0UL first..
Index: linux-2.6/arch/hexagon/include/asm/bitops.h
===================================================================
--- linux-2.6.orig/arch/hexagon/include/asm/bitops.h
+++ linux-2.6/arch/hexagon/include/asm/bitops.h
@@ -179,6 +179,21 @@ arch_test_bit(unsigned long nr, const vo
return retval;
}

+static __always_inline bool
+arch_test_bit_acquire(unsigned long nr, const volatile unsigned long *addr)
+{
+ int retval;
+
+ asm volatile(
+ "{P0 = tstbit(%1,%2); if (P0.new) %0 = #1; if (!P0.new) %0 = #0;}\n"
+ : "=&r" (retval)
+ : "r" (addr[BIT_WORD(nr)]), "r" (nr % BITS_PER_LONG)
+ : "p0", "memory"
+ );
+
+ return retval;
+}
+
/*
* ffz - find first zero in word.
* @word: The word to search
Index: linux-2.6/arch/ia64/include/asm/bitops.h
===================================================================
--- linux-2.6.orig/arch/ia64/include/asm/bitops.h
+++ linux-2.6/arch/ia64/include/asm/bitops.h
@@ -337,6 +337,13 @@ arch_test_bit(unsigned long nr, const vo
return 1 & (((const volatile __u32 *) addr)[nr >> 5] >> (nr & 31));
}

+static __always_inline bool
+arch_test_bit_acquire(unsigned long nr, const volatile unsigned long *addr)
+{
+ unsigned long *p = ((unsigned long *)addr) + BIT_WORD(nr);
+ return 1UL & (smp_load_acquire(p) >> (nr & (BITS_PER_LONG-1)));
+}
+
/**
* ffz - find the first zero bit in a long word
* @x: The long word to find the bit in
Index: linux-2.6/arch/s390/include/asm/bitops.h
===================================================================
--- linux-2.6.orig/arch/s390/include/asm/bitops.h
+++ linux-2.6/arch/s390/include/asm/bitops.h
@@ -185,6 +185,13 @@ arch_test_bit(unsigned long nr, const vo
return *p & mask;
}

+static __always_inline bool
+arch_test_bit_acquire(unsigned long nr, const volatile unsigned long *addr)
+{
+ unsigned long *p = ((unsigned long *)addr) + BIT_WORD(nr);
+ return 1UL & (smp_load_acquire(p) >> (nr & (BITS_PER_LONG-1)));
+}
+
static inline bool arch_test_and_set_bit_lock(unsigned long nr,
volatile unsigned long *ptr)
{

2022-08-26 21:08:27

by Mikulas Patocka

Subject: [PATCH] provide arch_test_bit_acquire for architectures that define test_bit



On Fri, 26 Aug 2022, Linus Torvalds wrote:

> On Fri, Aug 26, 2022 at 12:23 PM Geert Uytterhoeven
> <[email protected]> wrote:
> >
> > include/asm-generic/bitops/non-instrumented-non-atomic.h:15:33:
> > error: implicit declaration of function 'arch_test_bit_acquire'; did
> > you mean '_test_bit_acquire'? [-Werror=implicit-function-declaration]
> >
>
> Ahh. m68k isn't using any of the generic bitops headers.
>
> *Most* architectures have that
>
> #include <asm-generic/bitops/non-atomic.h>
>
> and get it that way, but while it's common, it's most definitely not universal:
>
> [torvalds@ryzen linux]$ git grep -L bitops/non-atomic.h
> arch/*/include/asm/bitops.h
> arch/alpha/include/asm/bitops.h
> arch/hexagon/include/asm/bitops.h
> arch/ia64/include/asm/bitops.h
> arch/m68k/include/asm/bitops.h
> arch/s390/include/asm/bitops.h
> arch/sparc/include/asm/bitops.h
> arch/x86/include/asm/bitops.h
>
> and of that list only x86 has the new arch_test_bit_acquire().
>
> So I assume it's not just m68k, but also alpha, hexagon, ia64, s390
> and sparc that have this issue (unless they maybe have some other path
> that includes the generic ones, I didn't check).

For sparc, arch/sparc/include/asm/bitops_32.h and
arch/sparc/include/asm/bitops_64.h include
asm-generic/bitops/non-atomic.h.

For the others, the generic version is not included.

I'm wondering why the architectures redefine test_bit when their
definition is equivalent to the generic one. We could just delete
arch_test_bit and use "#define arch_test_bit generic_test_bit" as well.

> This was actually why my original suggested patch used the
> 'generic-non-atomic.h' header for it, because that is actually
> included regardless of any architecture headers directly from
> <linux/bitops.h>.
>
> And it never occurred to me that Mikulas' updated patch would then
> have this arch_test_bit_acquire() issue.
>
> Something like the attached patch *MAY* fix it, but I really haven't
> thought about it a lot, and it's pretty ugly. Maybe it would be better
> to just add the
>
> #define arch_test_bit_acquire generic_test_bit_acquire
>
> to the affected <asm/bitops.h> files instead, and then let those
> architectures decide on their own that maybe they want to use their
> own test_bit() function because it is _already_ an acquire one.
>
> Mikulas?
>
> Geert - any opinions on that "maybe the arch should just do that
> #define itself"? I don't think it actually matters for m68k - you end
> up with pretty much the same thing either way, because
> "smp_load_acquire()" is just a load anyway.
>
> Linus

Another untested patch ... tomorrow, I'll try to compile it, at least for
architectures where Debian provides cross-compiling gcc.




From: Mikulas Patocka <[email protected]>

Some architectures define their own arch_test_bit and they also need
arch_test_bit_acquire, otherwise they won't compile. We also clean up the
code by using the generic test_bit if that is equivalent to the
arch-specific version.

Signed-off-by: Mikulas Patocka <[email protected]>
Cc: [email protected]
Fixes: 8238b4579866 ("wait_on_bit: add an acquire memory barrier")

---
arch/alpha/include/asm/bitops.h | 7 ++-----
arch/hexagon/include/asm/bitops.h | 15 +++++++++++++++
arch/ia64/include/asm/bitops.h | 7 ++-----
arch/m68k/include/asm/bitops.h | 7 ++-----
arch/s390/include/asm/bitops.h | 10 ++--------
arch/sh/include/asm/bitops-op32.h | 12 ++----------
6 files changed, 25 insertions(+), 33 deletions(-)

Index: linux-2.6/arch/alpha/include/asm/bitops.h
===================================================================
--- linux-2.6.orig/arch/alpha/include/asm/bitops.h
+++ linux-2.6/arch/alpha/include/asm/bitops.h
@@ -283,11 +283,8 @@ arch___test_and_change_bit(unsigned long
return (old & mask) != 0;
}

-static __always_inline bool
-arch_test_bit(unsigned long nr, const volatile unsigned long *addr)
-{
- return (1UL & (((const int *) addr)[nr >> 5] >> (nr & 31))) != 0UL;
-}
+#define arch_test_bit generic_test_bit
+#define arch_test_bit_acquire generic_test_bit_acquire

/*
* ffz = Find First Zero in word. Undefined if no zero exists,
Index: linux-2.6/arch/hexagon/include/asm/bitops.h
===================================================================
--- linux-2.6.orig/arch/hexagon/include/asm/bitops.h
+++ linux-2.6/arch/hexagon/include/asm/bitops.h
@@ -179,6 +179,21 @@ arch_test_bit(unsigned long nr, const vo
return retval;
}

+static __always_inline bool
+arch_test_bit_acquire(unsigned long nr, const volatile unsigned long *addr)
+{
+ int retval;
+
+ asm volatile(
+ "{P0 = tstbit(%1,%2); if (P0.new) %0 = #1; if (!P0.new) %0 = #0;}\n"
+ : "=&r" (retval)
+ : "r" (addr[BIT_WORD(nr)]), "r" (nr % BITS_PER_LONG)
+ : "p0", "memory"
+ );
+
+ return retval;
+}
+
/*
* ffz - find first zero in word.
* @word: The word to search
Index: linux-2.6/arch/ia64/include/asm/bitops.h
===================================================================
--- linux-2.6.orig/arch/ia64/include/asm/bitops.h
+++ linux-2.6/arch/ia64/include/asm/bitops.h
@@ -331,11 +331,8 @@ arch___test_and_change_bit(unsigned long
return (old & bit) != 0;
}

-static __always_inline bool
-arch_test_bit(unsigned long nr, const volatile unsigned long *addr)
-{
- return 1 & (((const volatile __u32 *) addr)[nr >> 5] >> (nr & 31));
-}
+#define arch_test_bit generic_test_bit
+#define arch_test_bit_acquire generic_test_bit_acquire

/**
* ffz - find the first zero bit in a long word
Index: linux-2.6/arch/s390/include/asm/bitops.h
===================================================================
--- linux-2.6.orig/arch/s390/include/asm/bitops.h
+++ linux-2.6/arch/s390/include/asm/bitops.h
@@ -176,14 +176,8 @@ arch___test_and_change_bit(unsigned long
return old & mask;
}

-static __always_inline bool
-arch_test_bit(unsigned long nr, const volatile unsigned long *addr)
-{
- const volatile unsigned long *p = __bitops_word(nr, addr);
- unsigned long mask = __bitops_mask(nr);
-
- return *p & mask;
-}
+#define arch_test_bit generic_test_bit
+#define arch_test_bit_acquire generic_test_bit_acquire

static inline bool arch_test_and_set_bit_lock(unsigned long nr,
volatile unsigned long *ptr)
Index: linux-2.6/arch/m68k/include/asm/bitops.h
===================================================================
--- linux-2.6.orig/arch/m68k/include/asm/bitops.h
+++ linux-2.6/arch/m68k/include/asm/bitops.h
@@ -157,11 +157,8 @@ arch___change_bit(unsigned long nr, vola
change_bit(nr, addr);
}

-static __always_inline bool
-arch_test_bit(unsigned long nr, const volatile unsigned long *addr)
-{
- return (addr[nr >> 5] & (1UL << (nr & 31))) != 0;
-}
+#define arch_test_bit generic_test_bit
+#define arch_test_bit_acquire generic_test_bit_acquire

static inline int bset_reg_test_and_set_bit(int nr,
volatile unsigned long *vaddr)
Index: linux-2.6/arch/sh/include/asm/bitops-op32.h
===================================================================
--- linux-2.6.orig/arch/sh/include/asm/bitops-op32.h
+++ linux-2.6/arch/sh/include/asm/bitops-op32.h
@@ -135,16 +135,8 @@ arch___test_and_change_bit(unsigned long
return (old & mask) != 0;
}

-/**
- * arch_test_bit - Determine whether a bit is set
- * @nr: bit number to test
- * @addr: Address to start counting from
- */
-static __always_inline bool
-arch_test_bit(unsigned long nr, const volatile unsigned long *addr)
-{
- return 1UL & (addr[BIT_WORD(nr)] >> (nr & (BITS_PER_LONG-1)));
-}
+#define arch_test_bit generic_test_bit
+#define arch_test_bit_acquire generic_test_bit_acquire

#include <asm-generic/bitops/non-instrumented-non-atomic.h>


2022-08-26 23:29:32

by Linus Torvalds

Subject: Re: [PATCH] provide arch_test_bit_acquire for architectures that define test_bit

On Fri, Aug 26, 2022 at 1:43 PM Mikulas Patocka <[email protected]> wrote:
>
> > I'm wondering why the architectures redefine test_bit when their
> > definition is equivalent to the generic one. We could just delete
> > arch_test_bit and use "#define arch_test_bit generic_test_bit" as well.

I think generic_test_bit() came after many of them, and when it
didn't, people copied earlier architectures where they had already
done their own.

> Another untested patch ... tomorrow, I'll try to compile it, at least for
> architectures where Debian provides cross-compiling gcc.

Looks good to me, except I'd just do

#define arch_test_bit_acquire arch_test_bit

on hexagon rather than duplicate that function.

From my reading, Hexagon doesn't have any fancy memory ordering; it's
just the usual UP with barriers basically for instruction cache
coherence etc.

Linus

2022-08-26 23:34:01

by Linus Torvalds

Subject: Re: [PATCH] provide arch_test_bit_acquire for architectures that define test_bit

On Fri, Aug 26, 2022 at 4:10 PM Linus Torvalds
<[email protected]> wrote:
>
> Looks good to me, except I'd just do
>
> #define arch_test_bit_acquire arch_test_bit
>
> on hexagon rather than duplicate that function.

Oh, except you didn't quite duplicate it - you added the "memory"
clobber to it to make sure it's ordered.

Which looks correct to me, even if the "almost entirely duplicated" is
a bit annoying.

Linus

2022-08-27 06:58:38

by Ard Biesheuvel

Subject: Re: [PATCH] wait_on_bit: add an acquire memory barrier

On Fri, 26 Aug 2022 at 21:10, Mikulas Patocka <[email protected]> wrote:
>
>
>
> On Fri, 26 Aug 2022, Ard Biesheuvel wrote:
>
> > Could you try booting with earlycon?
> >
> > Just 'earlycon' and if that does not help,
>
> It doesn't help.
>
> > 'earlycon=uart8250,mmio32,<uart PA>' [IIRC, mcbin uses 16550
> > compatible UARTs, right?]
>
> mcbin is the host system (running a stable kernel fine). The crash happens
> in a virtual machine. The vm uses /dev/ttyAMA0 as a console:
>
> Serial: AMBA PL011 UART driver
> 9000000.pl011: ttyAMA0 at MMIO 0x9000000 (irq = 45, base_baud = 0) is a PL011 rev1
> printk: console [ttyAMA0] enabled
>
> I tried earlycon=pl011,mmio32,0x9000000 - but it doesn't help, it hangs
> without printing anything.
>

If you are using pl011, you should drop the mmio32 - it only takes a
physical address (and optionally baud rate etc, but QEMU doesn't need
those)
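So, following that note, the earlycon parameter for a PL011 at the MMIO address quoted above would presumably reduce to:

```
earlycon=pl011,0x9000000
```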

2022-08-27 07:03:06

by Ard Biesheuvel

Subject: Re: [PATCH] wait_on_bit: add an acquire memory barrier

On Sat, 27 Aug 2022 at 08:32, Ard Biesheuvel <[email protected]> wrote:
>
> On Fri, 26 Aug 2022 at 21:10, Mikulas Patocka <[email protected]> wrote:
> >
> >
> >
> > On Fri, 26 Aug 2022, Ard Biesheuvel wrote:
> >
> > > Could you try booting with earlycon?
> > >
> > > Just 'earlycon' and if that does not help,
> >
> > It doesn't help.
> >
> > > 'earlycon=uart8250,mmio32,<uart PA>' [IIRC, mcbin uses 16550
> > > compatible UARTs, right?]
> >
> > mcbin is the host system (running a stable kernel fine). The crash happens
> > in a virtual machine. The vm uses /dev/ttyAMA0 as a console:
> >
> > Serial: AMBA PL011 UART driver
> > 9000000.pl011: ttyAMA0 at MMIO 0x9000000 (irq = 45, base_baud = 0) is a PL011 rev1
> > printk: console [ttyAMA0] enabled
> >
> > I tried earlycon=pl011,mmio32,0x9000000 - but it doesn't help, it hangs
> > without printing anything.
> >
>
> If you are using pl011, you should drop the mmio32 - it only takes a
> physical address (and optionally baud rate etc, but QEMU doesn't need
> those)

Could you try the diff below please?

--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -371,7 +371,9 @@ SYM_FUNC_END(create_idmap)
SYM_FUNC_START_LOCAL(create_kernel_mapping)
adrp x0, init_pg_dir
mov_q x5, KIMAGE_VADDR // compile time __va(_text)
+#ifdef CONFIG_RELOCATABLE
add x5, x5, x23 // add KASLR displacement
+#endif
adrp x6, _end // runtime __pa(_end)
adrp x3, _text // runtime __pa(_text)
sub x6, x6, x3 // _end - _text

2022-08-27 08:56:53

by Mikulas Patocka

Subject: Re: [PATCH] wait_on_bit: add an acquire memory barrier



On Sat, 27 Aug 2022, Ard Biesheuvel wrote:

> On Sat, 27 Aug 2022 at 08:32, Ard Biesheuvel <[email protected]> wrote:
> >
> > On Fri, 26 Aug 2022 at 21:10, Mikulas Patocka <[email protected]> wrote:
> > >
> > >
> > >
> > > On Fri, 26 Aug 2022, Ard Biesheuvel wrote:
> > >
> > > > Could you try booting with earlycon?
> > > >
> > > > Just 'earlycon' and if that does not help,
> > >
> > > It doesn't help.
> > >
> > > > 'earlycon=uart8250,mmio32,<uart PA>' [IIRC, mcbin uses 16550
> > > > compatible UARTs, right?]
> > >
> > > mcbin is the host system (running a stable kernel fine). The crash happens
> > > in a virtual machine. The vm uses /dev/ttyAMA0 as a console:
> > >
> > > Serial: AMBA PL011 UART driver
> > > 9000000.pl011: ttyAMA0 at MMIO 0x9000000 (irq = 45, base_baud = 0) is a PL011 rev1
> > > printk: console [ttyAMA0] enabled
> > >
> > > I tried earlycon=pl011,mmio32,0x9000000 - but it doesn't help, it hangs
> > > without printing anything.
> > >
> >
> > If you are using pl011, you should drop the mmio32 - it only takes a
> > physical address (and optionally baud rate etc, but QEMU doesn't need
> > those)
>
> Could you try the diff below please?
>
> --- a/arch/arm64/kernel/head.S
> +++ b/arch/arm64/kernel/head.S
> @@ -371,7 +371,9 @@ SYM_FUNC_END(create_idmap)
> SYM_FUNC_START_LOCAL(create_kernel_mapping)
> adrp x0, init_pg_dir
> mov_q x5, KIMAGE_VADDR // compile time __va(_text)
> +#ifdef CONFIG_RELOCATABLE
> add x5, x5, x23 // add KASLR displacement
> +#endif
> adrp x6, _end // runtime __pa(_end)
> adrp x3, _text // runtime __pa(_text)
> sub x6, x6, x3 // _end - _text
>

Yes - this patch fixes the crash.

Mikulas

2022-08-27 11:43:58

by Mikulas Patocka

Subject: Re: [PATCH] provide arch_test_bit_acquire for architectures that define test_bit



On Fri, 26 Aug 2022, Linus Torvalds wrote:

> On Fri, Aug 26, 2022 at 1:43 PM Mikulas Patocka <[email protected]> wrote:
> >
> > I'm wondering why the architectures redefine test_bit when their
> > definition is equivalent to the generic one. We could just delete
> > arch_test_bit and use "#define arch_test_bit generic_test_bit" as well.
>
> I think generic_test_bit() came after many of them, and when it
> didn't, people copied earlier architectures where they had already
> done their own.
>
> > Another untested patch ... tomorrow, I'll try to compile it, at least for
> > architectures where Debian provides cross-compiling gcc.

I compile-tested this patch on alpha, s390x, m68k, sh, sparc32, sparc64.
So you can commit it to close these build-failure reports.

Mikulas

2022-08-27 17:41:21

by Linus Torvalds

Subject: Re: [PATCH] provide arch_test_bit_acquire for architectures that define test_bit

On Sat, Aug 27, 2022 at 4:38 AM Mikulas Patocka <[email protected]> wrote:
>
> I compile-tested this patch on alpha, s390x, m68k, sh, sparc32, sparc64.
> So you can commit it to close these build-failure reports.

Thanks, done.

Linus