Let's have a look at this piece of code in __bread_slow:
get_bh(bh);
bh->b_end_io = end_buffer_read_sync;
submit_bh(REQ_OP_READ, 0, bh);
wait_on_buffer(bh);
if (buffer_uptodate(bh))
return bh;
Neither wait_on_buffer nor buffer_uptodate contain a memory barrier.
Consequently, if someone calls sb_bread and then reads the buffer data,
the read of buffer data may be speculatively executed before
wait_on_buffer(bh) and it may return invalid data.
Also, there is this pattern present several times:
wait_on_buffer(bh);
if (!buffer_uptodate(bh))
err = -EIO;
It may be possible that buffer_uptodate is executed before wait_on_buffer
and it may return spurious error.
Fix these bugs by adding a read memory barrier to wait_on_buffer().
Signed-off-by: Mikulas Patocka <[email protected]>
Cc: [email protected]
Index: linux-2.6/include/linux/buffer_head.h
===================================================================
--- linux-2.6.orig/include/linux/buffer_head.h
+++ linux-2.6/include/linux/buffer_head.h
@@ -353,6 +353,11 @@ static inline void wait_on_buffer(struct
might_sleep();
if (buffer_locked(bh))
__wait_on_buffer(bh);
+ /*
+ * Make sure that the following accesses to buffer state or buffer data
+ * are not reordered with buffer_locked(bh).
+ */
+ smp_rmb();
}
static inline int trylock_buffer(struct buffer_head *bh)
Hi Mikulas,
On Sun, 31 Jul 2022 at 13:43, Mikulas Patocka <[email protected]> wrote:
>
> Let's have a look at this piece of code in __bread_slow:
> get_bh(bh);
> bh->b_end_io = end_buffer_read_sync;
> submit_bh(REQ_OP_READ, 0, bh);
> wait_on_buffer(bh);
> if (buffer_uptodate(bh))
> return bh;
> Neither wait_on_buffer nor buffer_uptodate contain a memory barrier.
> Consequently, if someone calls sb_bread and then reads the buffer data,
> the read of buffer data may be speculatively executed before
> wait_on_buffer(bh) and it may return invalid data.
>
This has little to do with speculation, so better to drop this S bomb
from your commit message. This is about concurrency and weak memory
ordering.
> Also, there is this pattern present several times:
> wait_on_buffer(bh);
> if (!buffer_uptodate(bh))
> err = -EIO;
> It may be possible that buffer_uptodate is executed before wait_on_buffer
> and it may return spurious error.
>
> Fix these bugs by adding a read memory barrier to wait_on_buffer().
>
> Signed-off-by: Mikulas Patocka <[email protected]>
> Cc: [email protected]
>
> Index: linux-2.6/include/linux/buffer_head.h
> ===================================================================
> --- linux-2.6.orig/include/linux/buffer_head.h
> +++ linux-2.6/include/linux/buffer_head.h
> @@ -353,6 +353,11 @@ static inline void wait_on_buffer(struct
> might_sleep();
> if (buffer_locked(bh))
> __wait_on_buffer(bh);
> + /*
> + * Make sure that the following accesses to buffer state or buffer data
> + * are not reordered with buffer_locked(bh).
> + */
> + smp_rmb();
> }
>
> static inline int trylock_buffer(struct buffer_head *bh)
>
This doesn't seem like a very robust fix to me, tbh - I suppose this
makes the symptom you encountered go away, but the underlying issue
remains afaict.
Given that the lock and uptodate fields etc are just bits in a
bitfield, wouldn't it be better to use cmpxchg() with acquire/release
semantics (as appropriate) to manage these bits?
On Sun, 31 Jul 2022, Ard Biesheuvel wrote:
> This has little to do with speculation, so better to drop this S bomb
> from your commit message. This is about concurrency and weak memory
> ordering.
Yes.
> This doesn't seem like a very robust fix to me, tbh - I suppose this
> makes the symptom you encountered go away, but the underlying issue
> remains afaict.
>
> Given that the lock and uptodate fields etc are just bits in a
> bitfield, wouldn't it be better to use cmpxchg() with acquire/release
> semantics (as appropriate) to manage these bits?
The kernel already uses clear_bit_unlock, test_and_set_bit_lock and
wait_on_bit_lock_io to manage the BH_Lock bit - and they have
acquire/release semantics.
The only problem is that test_bit doesn't provide any memory barriers.
Should we add the barrier to buffer_locked() instead of wait_on_buffer()?
Perhaps it would fix more bugs - in reiserfs, there's this piece of code:
if (buffer_locked(bh)) {
spin_unlock(lock);
wait_on_buffer(bh);
spin_lock(lock);
}
if (!buffer_uptodate(bh)) {
ret = -EIO;
}
or this:
if (buffer_locked(bh)) {
int depth;
PROC_INFO_INC(sb, scan_bitmap.wait);
depth = reiserfs_write_unlock_nested(sb);
__wait_on_buffer(bh);
reiserfs_write_lock_nested(sb, depth);
}
BUG_ON(!buffer_uptodate(bh));
BUG_ON(atomic_read(&bh->b_count) == 0);
That assumes that buffer_locked provides a barrier.
Mikulas
> The only problem is that test_bit doesn't provide any memory barriers.
> Should we add the barrier to buffer_locked() instead of wait_on_buffer()?
> Perhaps it would fix more bugs - in reiserfs, there's this piece of code:
Her I'm sending the second version of the patch that changes buffer_locked
to provide an acquire semantics.
Mikulas
From: Mikulas Patocka <[email protected]>
Let's have a look at this piece of code in __bread_slow:
get_bh(bh);
bh->b_end_io = end_buffer_read_sync;
submit_bh(REQ_OP_READ, 0, bh);
wait_on_buffer(bh);
if (buffer_uptodate(bh))
return bh;
Neither wait_on_buffer nor buffer_uptodate contain any memory barrier.
Consequently, if someone calls sb_bread and then reads the buffer data,
the read of buffer data may be executed before wait_on_buffer(bh) on
architectures with weak memory ordering and it may return invalid data.
Also, there is this pattern present several times:
wait_on_buffer(bh);
if (!buffer_uptodate(bh))
err = -EIO;
It may be possible that buffer_uptodate is executed before wait_on_buffer
and it may return spurious error.
Fix these bugs by chaning the function buffer_locked to have the acquire
semantics - so that code that follows buffer_locked cannot be moved before
it. We must also add a read barrier after wait_on_bit_io because
wait_on_bit_io doesn't provide any barrier. (perhaps, should this
smp_rmb() be moved into wait_on_bit_io?)
Signed-off-by: Mikulas Patocka <[email protected]>
Cc: [email protected]
Index: linux-2.6/include/linux/buffer_head.h
===================================================================
--- linux-2.6.orig/include/linux/buffer_head.h
+++ linux-2.6/include/linux/buffer_head.h
@@ -120,7 +120,6 @@ static __always_inline int test_clear_bu
BUFFER_FNS(Uptodate, uptodate)
BUFFER_FNS(Dirty, dirty)
TAS_BUFFER_FNS(Dirty, dirty)
-BUFFER_FNS(Lock, locked)
BUFFER_FNS(Req, req)
TAS_BUFFER_FNS(Req, req)
BUFFER_FNS(Mapped, mapped)
@@ -135,6 +134,17 @@ BUFFER_FNS(Meta, meta)
BUFFER_FNS(Prio, prio)
BUFFER_FNS(Defer_Completion, defer_completion)
+static __always_inline void set_buffer_locked(struct buffer_head *bh)
+{
+ set_bit(BH_Lock, &bh->b_state);
+}
+
+static __always_inline int buffer_locked(const struct buffer_head *bh)
+{
+ unsigned long state = smp_load_acquire(&bh->b_state);
+ return test_bit(BH_Lock, &state);
+}
+
#define bh_offset(bh) ((unsigned long)(bh)->b_data & ~PAGE_MASK)
/* If we *know* page->private refers to buffer_heads */
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -120,6 +120,7 @@ EXPORT_SYMBOL(buffer_check_dirty_writeba
void __wait_on_buffer(struct buffer_head * bh)
{
wait_on_bit_io(&bh->b_state, BH_Lock, TASK_UNINTERRUPTIBLE);
+ smp_rmb();
}
EXPORT_SYMBOL(__wait_on_buffer);
[ Will and Paul, question for you below ]
On Sun, Jul 31, 2022 at 8:08 AM Mikulas Patocka <[email protected]> wrote:
>
> Also, there is this pattern present several times:
> wait_on_buffer(bh);
> if (!buffer_uptodate(bh))
> err = -EIO;
> It may be possible that buffer_uptodate is executed before wait_on_buffer
> and it may return spurious error.
I'm not convinced that's actually valid.
They are testing the same memory location, and I don't think our
memory ordering model allows for _that_ to be out-of-order. Memory
barriers are for accesses to different memory locations.
Even alpha is specified to be locally ordered wrt *one* memory
location, including for reads (See table 5-1: "Processor issue order",
and also 5.6.2.2: "Litmus test 2"). So if a previous read has seen a
new value, a subsequent read is not allowed to see an older one - even
without a memory barrier.
Will, Paul? Maybe that's only for overlapping loads/stores, not for
loads/loads. Because maybe alpha for once isn't the weakest possible
ordering.
I didn't find this actually in our documentation, so maybe other
architectures allow it? Our docs talk about "overlapping loads and
stores", and maybe that was meant to imply that "load overlaps with
load" case, but it really reads like load-vs-store, not load-vs-load.
But the patch looks fine, though I agree that the ordering in
__wait_on_buffer should probably be moved into
wait_on_bit/wait_on_bit_io.
And would we perhaps want the bitops to have the different ordering
versions? Like "set_bit_release()" and "test_bit_acquire()"? That
would seem to be (a) cleaner and (b) possibly generate better code for
architectures where that makes a difference?
Linus
On Sun, Jul 31, 2022 at 09:51:47AM -0700, Linus Torvalds wrote:
> [ Will and Paul, question for you below ]
>
> On Sun, Jul 31, 2022 at 8:08 AM Mikulas Patocka <[email protected]> wrote:
> >
> > Also, there is this pattern present several times:
> > wait_on_buffer(bh);
> > if (!buffer_uptodate(bh))
> > err = -EIO;
> > It may be possible that buffer_uptodate is executed before wait_on_buffer
> > and it may return spurious error.
>
> I'm not convinced that's actually valid.
>
> They are testing the same memory location, and I don't think our
> memory ordering model allows for _that_ to be out-of-order. Memory
> barriers are for accesses to different memory locations.
Yes, aligned same-sized marked accesses to a given location are always
executed so as to be consistent with some global order. Please note the
"marked accesses". The compiler can rearrange unmarked reads and in
some cases can discard unmarked writes. For more information, please
see tools/memory-model/Documentation/recipes.txt's "Single CPU or single
memory location" section.
> Even alpha is specified to be locally ordered wrt *one* memory
> location, including for reads (See table 5-1: "Processor issue order",
> and also 5.6.2.2: "Litmus test 2"). So if a previous read has seen a
> new value, a subsequent read is not allowed to see an older one - even
> without a memory barrier.
>
> Will, Paul? Maybe that's only for overlapping loads/stores, not for
> loads/loads. Because maybe alpha for once isn't the weakest possible
> ordering.
The "bad boy" in this case is Itanium, which can do some VLIW reordering
of accesses. Or could, I am not sure that newer Itanium hardware
does this. But this is why Itanium compilers made volatile loads use
the ld,acq instruction.
Which means that aligned same-sized marked accesses to a single location
really do execute consistently with some global ordering, even on Itanium.
That said, I confess that I am having a hard time finding the
buffer_locked() definition. So if buffer_locked() uses normal C-language
accesses to sample the BH_Lock bit of the ->b_state field, then yes,
there could be a problem. The compiler would then be free to reorder
calls to buffer_locked() because it could then assume that no other
thread was touching that ->b_state field.
> I didn't find this actually in our documentation, so maybe other
> architectures allow it? Our docs talk about "overlapping loads and
> stores", and maybe that was meant to imply that "load overlaps with
> load" case, but it really reads like load-vs-store, not load-vs-load.
I placed the relevant text from recipes.txt at the end of this email.
> But the patch looks fine, though I agree that the ordering in
> __wait_on_buffer should probably be moved into
> wait_on_bit/wait_on_bit_io.
>
> And would we perhaps want the bitops to have the different ordering
> versions? Like "set_bit_release()" and "test_bit_acquire()"? That
> would seem to be (a) cleaner and (b) possibly generate better code for
> architectures where that makes a difference?
As always, I defer to the architecture maintainers on this one.
Thanx, Paul
------------------------------------------------------------------------
Single CPU or single memory location
------------------------------------
If there is only one CPU on the one hand or only one variable
on the other, the code will execute in order. There are (as
usual) some things to be careful of:
1. Some aspects of the C language are unordered. For example,
in the expression "f(x) + g(y)", the order in which f and g are
called is not defined; the object code is allowed to use either
order or even to interleave the computations.
2. Compilers are permitted to use the "as-if" rule. That is, a
compiler can emit whatever code it likes for normal accesses,
as long as the results of a single-threaded execution appear
just as if the compiler had followed all the relevant rules.
To see this, compile with a high level of optimization and run
the debugger on the resulting binary.
3. If there is only one variable but multiple CPUs, that variable
must be properly aligned and all accesses to that variable must
be full sized. Variables that straddle cachelines or pages void
your full-ordering warranty, as do undersized accesses that load
from or store to only part of the variable.
4. If there are multiple CPUs, accesses to shared variables should
use READ_ONCE() and WRITE_ONCE() or stronger to prevent load/store
tearing, load/store fusing, and invented loads and stores.
There are exceptions to this rule, including:
i. When there is no possibility of a given shared variable
being updated by some other CPU, for example, while
holding the update-side lock, reads from that variable
need not use READ_ONCE().
ii. When there is no possibility of a given shared variable
being either read or updated by other CPUs, for example,
when running during early boot, reads from that variable
need not use READ_ONCE() and writes to that variable
need not use WRITE_ONCE().
On Sun, 31 Jul 2022, Linus Torvalds wrote:
> [ Will and Paul, question for you below ]
>
> On Sun, Jul 31, 2022 at 8:08 AM Mikulas Patocka <[email protected]> wrote:
> >
> > Also, there is this pattern present several times:
> > wait_on_buffer(bh);
> > if (!buffer_uptodate(bh))
> > err = -EIO;
> > It may be possible that buffer_uptodate is executed before wait_on_buffer
> > and it may return spurious error.
>
> I'm not convinced that's actually valid.
>
> They are testing the same memory location, and I don't think our
> memory ordering model allows for _that_ to be out-of-order. Memory
> barriers are for accesses to different memory locations.
You are right. And the bit tests are volatile, so the compiler can't
reorder them.
But the compiler can reorder non-volatile accesses around volatile
accesses (gcc does this, clang afaik doesn't), so the bit tests need at
least a compiler barrier after them.
> But the patch looks fine, though I agree that the ordering in
> __wait_on_buffer should probably be moved into
> wait_on_bit/wait_on_bit_io.
Yes, there are more bugs where the code does wait_on_bit and then reads
some data without any barrier. Adding the barrier to wait_on_bit fixes
that.
I'll send two patches, one for wait_on_bit and the other for
buffer_locked.
Do you think that wait_event also needs a read memory barrier? It is
defined as:
#define wait_event(wq_head, condition) \
do { \
might_sleep(); \
if (condition) \
break; \
__wait_event(wq_head, condition); \
} while (0)
Mikulas
> And would we perhaps want the bitops to have the different ordering
> versions? Like "set_bit_release()" and "test_bit_acquire()"? That
> would seem to be (a) cleaner and (b) possibly generate better code for
> architectures where that makes a difference?
>
> Linus
>
wait_on_bit tests the bit without any memory barriers, consequently the
code that follows wait_on_bit may be moved before testing the bit on
architectures with weak memory ordering. When the code tests for some
event using wait_on_bit and then performs a load operation, the load may
be unexpectedly moved before wait_on_bit and it may return data that
existed before the event occurred.
Such bugs exist in fs/buffer.c:__wait_on_buffer,
drivers/md/dm-bufio.c:new_read,
drivers/media/usb/dvb-usb-v2/dvb_usb_core.c:dvb_usb_start_feed,
drivers/bluetooth/btusb.c:btusb_mtk_hci_wmt_sync
and perhaps in other places.
We fix this class of bugs by adding a read barrier after test_bit().
Signed-off-by: Mikulas Patocka <[email protected]>
Cc: [email protected]
Index: linux-2.6/include/linux/wait_bit.h
===================================================================
--- linux-2.6.orig/include/linux/wait_bit.h
+++ linux-2.6/include/linux/wait_bit.h
@@ -71,8 +71,10 @@ static inline int
wait_on_bit(unsigned long *word, int bit, unsigned mode)
{
might_sleep();
- if (!test_bit(bit, word))
+ if (!test_bit(bit, word)) {
+ smp_rmb();
return 0;
+ }
return out_of_line_wait_on_bit(word, bit,
bit_wait,
mode);
@@ -96,8 +98,10 @@ static inline int
wait_on_bit_io(unsigned long *word, int bit, unsigned mode)
{
might_sleep();
- if (!test_bit(bit, word))
+ if (!test_bit(bit, word)) {
+ smp_rmb();
return 0;
+ }
return out_of_line_wait_on_bit(word, bit,
bit_wait_io,
mode);
@@ -123,8 +127,10 @@ wait_on_bit_timeout(unsigned long *word,
unsigned long timeout)
{
might_sleep();
- if (!test_bit(bit, word))
+ if (!test_bit(bit, word)) {
+ smp_rmb();
return 0;
+ }
return out_of_line_wait_on_bit_timeout(word, bit,
bit_wait_timeout,
mode, timeout);
@@ -151,8 +157,10 @@ wait_on_bit_action(unsigned long *word,
unsigned mode)
{
might_sleep();
- if (!test_bit(bit, word))
+ if (!test_bit(bit, word)) {
+ smp_rmb();
return 0;
+ }
return out_of_line_wait_on_bit(word, bit, action, mode);
}
Index: linux-2.6/kernel/sched/wait_bit.c
===================================================================
--- linux-2.6.orig/kernel/sched/wait_bit.c
+++ linux-2.6/kernel/sched/wait_bit.c
@@ -51,6 +51,8 @@ __wait_on_bit(struct wait_queue_head *wq
finish_wait(wq_head, &wbq_entry->wq_entry);
+ smp_rmb();
+
return ret;
}
EXPORT_SYMBOL(__wait_on_bit);
Let's have a look at this piece of code in __bread_slow:
get_bh(bh);
bh->b_end_io = end_buffer_read_sync;
submit_bh(REQ_OP_READ, 0, bh);
wait_on_buffer(bh);
if (buffer_uptodate(bh))
return bh;
Neither wait_on_buffer nor buffer_uptodate contain a memory barrier.
Consequently, if someone calls sb_bread and then reads the buffer data,
the read of buffer data may be executed before wait_on_buffer(bh) on
architectures with weak memory ordering and it may return invalid data.
Fix this bug by changing the function buffer_locked to have the acquire
semantics - so that code that follows buffer_locked cannot be moved before
it.
Signed-off-by: Mikulas Patocka <[email protected]>
Cc: [email protected]
Index: linux-2.6/include/linux/buffer_head.h
===================================================================
--- linux-2.6.orig/include/linux/buffer_head.h
+++ linux-2.6/include/linux/buffer_head.h
@@ -120,7 +120,6 @@ static __always_inline int test_clear_bu
BUFFER_FNS(Uptodate, uptodate)
BUFFER_FNS(Dirty, dirty)
TAS_BUFFER_FNS(Dirty, dirty)
-BUFFER_FNS(Lock, locked)
BUFFER_FNS(Req, req)
TAS_BUFFER_FNS(Req, req)
BUFFER_FNS(Mapped, mapped)
@@ -135,6 +134,17 @@ BUFFER_FNS(Meta, meta)
BUFFER_FNS(Prio, prio)
BUFFER_FNS(Defer_Completion, defer_completion)
+static __always_inline void set_buffer_locked(struct buffer_head *bh)
+{
+ set_bit(BH_Lock, &bh->b_state);
+}
+
+static __always_inline int buffer_locked(const struct buffer_head *bh)
+{
+ unsigned long state = smp_load_acquire(&bh->b_state);
+ return test_bit(BH_Lock, &state);
+}
+
#define bh_offset(bh) ((unsigned long)(bh)->b_data & ~PAGE_MASK)
/* If we *know* page->private refers to buffer_heads */
On Sun, Jul 31, 2022 at 1:39 PM Mikulas Patocka <[email protected]> wrote:
>
> Do you think that wait_event also needs a read memory barrier?
Not really, no.
__wait_event() uses prepare_to_wait(), and it uses set_current_state()
very much so that the process state setting is serialized with the
test afterwards.
And the only race wait_event should worry about is exactly the "go to
sleep" vs "make the event true and then wake up" race, so that it
doesn't wait forever.
There is no guarantee of memory ordering wrt anything else, and I
don't think there is any need for such a guarantee.
If somebody wants that guarantee, they should probably make the
condition for wait_event() to be a "load_acquire()" or similar. Those
cases do exist.
Linus
On Sun, Jul 31, 2022 at 1:43 PM Mikulas Patocka <[email protected]> wrote:
>
> +
> +static __always_inline int buffer_locked(const struct buffer_head *bh)
> +{
> + unsigned long state = smp_load_acquire(&bh->b_state);
> + return test_bit(BH_Lock, &state);
This should not use 'test_bit()'. I suspect that generates horrendous
code, because it's a volatile access, so now you'll load it into a
register, and I suspect it will generate s pointless spill just to do
a volatile load.
I didn't check.
So once you've loaded b_state, just test the bit directly with
return (state & (1u << BH_Lock)) != 0;
or whatever.
Linus
On Sun, Jul 31, 2022 at 1:41 PM Mikulas Patocka <[email protected]> wrote:
>
> - if (!test_bit(bit, word))
> + if (!test_bit(bit, word)) {
> + smp_rmb();
Logically, I don't think that makes sense.
Maybe you're checking the buffer being up-to-date before you *write* to it?
So smp_rmb() seems entirely wrong.
I think it should consistently aim for just doing
unsigned long state = smp_read_acquire(word);
if (!(state & (1 << bit)))
return 0;
or whatever.
We should strive to *not* add new uses of the legacy memory barriers.
They are garbage from last century when people didn't know better.
Then people learnt to use acquire and release, and things improved.
Let's live in that improved world.
Linus
On Sun, Jul 31, 2022 at 04:43:08PM -0400, Mikulas Patocka wrote:
> Let's have a look at this piece of code in __bread_slow:
> get_bh(bh);
> bh->b_end_io = end_buffer_read_sync;
> submit_bh(REQ_OP_READ, 0, bh);
> wait_on_buffer(bh);
> if (buffer_uptodate(bh))
> return bh;
> Neither wait_on_buffer nor buffer_uptodate contain a memory barrier.
> Consequently, if someone calls sb_bread and then reads the buffer data,
> the read of buffer data may be executed before wait_on_buffer(bh) on
> architectures with weak memory ordering and it may return invalid data.
I think we should be consistent between PageUptodate() and
buffer_uptodate(). Here's how it's done for pages currently:
static inline bool folio_test_uptodate(struct folio *folio)
bool ret = test_bit(PG_uptodate, folio_flags(folio, 0));
/*
* Must ensure that the data we read out of the folio is loaded
* _after_ we've loaded folio->flags to check the uptodate bit.
* We can skip the barrier if the folio is not uptodate, because
* we wouldn't be reading anything from it.
*
* See folio_mark_uptodate() for the other side of the story.
*/
if (ret)
smp_rmb();
return ret;
...
static __always_inline void folio_mark_uptodate(struct folio *folio)
/*
* Memory barrier must be issued before setting the PG_uptodate bit,
* so that all previous stores issued in order to bring the folio
* uptodate are actually visible before folio_test_uptodate becomes true.
*/
smp_wmb();
set_bit(PG_uptodate, folio_flags(folio, 0));
I'm happy for these to also be changed to use acquire/release; no
attachment to the current code. But bufferheads & pages should have the
same semantics, or we'll be awfully confused.
On Mon, 1 Aug 2022 at 00:14, Matthew Wilcox <[email protected]> wrote:
>
> On Sun, Jul 31, 2022 at 04:43:08PM -0400, Mikulas Patocka wrote:
> > Let's have a look at this piece of code in __bread_slow:
> > get_bh(bh);
> > bh->b_end_io = end_buffer_read_sync;
> > submit_bh(REQ_OP_READ, 0, bh);
> > wait_on_buffer(bh);
> > if (buffer_uptodate(bh))
> > return bh;
> > Neither wait_on_buffer nor buffer_uptodate contain a memory barrier.
> > Consequently, if someone calls sb_bread and then reads the buffer data,
> > the read of buffer data may be executed before wait_on_buffer(bh) on
> > architectures with weak memory ordering and it may return invalid data.
>
> I think we should be consistent between PageUptodate() and
> buffer_uptodate(). Here's how it's done for pages currently:
>
> static inline bool folio_test_uptodate(struct folio *folio)
> bool ret = test_bit(PG_uptodate, folio_flags(folio, 0));
> /*
> * Must ensure that the data we read out of the folio is loaded
> * _after_ we've loaded folio->flags to check the uptodate bit.
> * We can skip the barrier if the folio is not uptodate, because
> * we wouldn't be reading anything from it.
> *
> * See folio_mark_uptodate() for the other side of the story.
> */
> if (ret)
> smp_rmb();
>
> return ret;
>
> ...
>
> static __always_inline void folio_mark_uptodate(struct folio *folio)
> /*
> * Memory barrier must be issued before setting the PG_uptodate bit,
> * so that all previous stores issued in order to bring the folio
> * uptodate are actually visible before folio_test_uptodate becomes true.
> */
> smp_wmb();
> set_bit(PG_uptodate, folio_flags(folio, 0));
>
> I'm happy for these to also be changed to use acquire/release; no
> attachment to the current code. But bufferheads & pages should have the
> same semantics, or we'll be awfully confused.
I suspect that adding acquire/release annotations at the bitops level
is not going to get us anywhere, given that for the uptodate flag, it
is the set operation that has release semantics, whereas for a lock
flag, it will be the clear operation. Reverting to the legacy barrier
instructions to try and avoid this ambiguity will likely only make
things worse.
I was cc'ed only on patch #1 of your v3, so I'm not sure where this is
headed, but I strongly +1 Matthew's point above that this should be
done at the level that defines how the bit fields should be
interpreted wrt to the contents of the data structure that they
describe/guard.
On Mon, 1 Aug 2022 at 00:31, Ard Biesheuvel <[email protected]> wrote:
>
...
>
> I was cc'ed only on patch #1 of your v3, so I'm not sure where this is
...
Apologies for causing confusion here by replying to [PATCH v3 2/2]
that I was not cc'ed on [PATCH v3 2/2], which is of course nonsense.
Gmail's threading gets a bit wonky sometimes, and the patch was
threaded under v2 as their subject lines are sufficiently similar, and
I assumed I was replying to v2.
On Sun, Jul 31, 2022 at 10:30:11AM -0700, Paul E. McKenney wrote:
> That said, I confess that I am having a hard time finding the
> buffer_locked() definition. So if buffer_locked() uses normal C-language
> accesses to sample the BH_Lock bit of the ->b_state field, then yes,
> there could be a problem. The compiler would then be free to reorder
> calls to buffer_locked() because it could then assume that no other
> thread was touching that ->b_state field.
You're having a hard time finding it because it's constructed with the C
preprocessor. I really wish we generated header files using CPP once
and then included the generated/ file. That would make them greppable.
You're looking for include/linux/buffer_head.h and it's done like this:
enum bh_state_bits {
...
BH_Lock, /* Is locked */
...
#define BUFFER_FNS(bit, name) \
...
static __always_inline int buffer_##name(const struct buffer_head *bh) \
{ \
return test_bit(BH_##bit, &(bh)->b_state); \
}
BUFFER_FNS(Lock, locked)
(fwiw, the page/folio versions of these functions don't autogenerate
the lock or uptodate ones because they need extra functions called)
On Sun, Jul 31, 2022 at 04:40:59PM -0400, Mikulas Patocka wrote:
> wait_on_bit tests the bit without any memory barriers, consequently the
> code that follows wait_on_bit may be moved before testing the bit on
> architectures with weak memory ordering. When the code tests for some
> event using wait_on_bit and then performs a load operation, the load may
> be unexpectedly moved before wait_on_bit and it may return data that
> existed before the event occurred.
>
> Such bugs exist in fs/buffer.c:__wait_on_buffer,
> drivers/md/dm-bufio.c:new_read,
> drivers/media/usb/dvb-usb-v2/dvb_usb_core.c:dvb_usb_start_feed,
> drivers/bluetooth/btusb.c:btusb_mtk_hci_wmt_sync
> and perhaps in other places.
>
> We fix this class of bugs by adding a read barrier after test_bit().
>
> Signed-off-by: Mikulas Patocka <[email protected]>
> Cc: [email protected]
>
> Index: linux-2.6/include/linux/wait_bit.h
> ===================================================================
> --- linux-2.6.orig/include/linux/wait_bit.h
> +++ linux-2.6/include/linux/wait_bit.h
> @@ -71,8 +71,10 @@ static inline int
> wait_on_bit(unsigned long *word, int bit, unsigned mode)
> {
> might_sleep();
> - if (!test_bit(bit, word))
> + if (!test_bit(bit, word)) {
> + smp_rmb();
Any new code using smp_rmb or an acquire access should always include a
comment that explains where the matching smp_wmb or release access is.
Alan Stern
> return 0;
> + }
> return out_of_line_wait_on_bit(word, bit,
> bit_wait,
> mode);
> @@ -96,8 +98,10 @@ static inline int
> wait_on_bit_io(unsigned long *word, int bit, unsigned mode)
> {
> might_sleep();
> - if (!test_bit(bit, word))
> + if (!test_bit(bit, word)) {
> + smp_rmb();
> return 0;
> + }
> return out_of_line_wait_on_bit(word, bit,
> bit_wait_io,
> mode);
> @@ -123,8 +127,10 @@ wait_on_bit_timeout(unsigned long *word,
> unsigned long timeout)
> {
> might_sleep();
> - if (!test_bit(bit, word))
> + if (!test_bit(bit, word)) {
> + smp_rmb();
> return 0;
> + }
> return out_of_line_wait_on_bit_timeout(word, bit,
> bit_wait_timeout,
> mode, timeout);
> @@ -151,8 +157,10 @@ wait_on_bit_action(unsigned long *word,
> unsigned mode)
> {
> might_sleep();
> - if (!test_bit(bit, word))
> + if (!test_bit(bit, word)) {
> + smp_rmb();
> return 0;
> + }
> return out_of_line_wait_on_bit(word, bit, action, mode);
> }
>
> Index: linux-2.6/kernel/sched/wait_bit.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched/wait_bit.c
> +++ linux-2.6/kernel/sched/wait_bit.c
> @@ -51,6 +51,8 @@ __wait_on_bit(struct wait_queue_head *wq
>
> finish_wait(wq_head, &wbq_entry->wq_entry);
>
> + smp_rmb();
> +
> return ret;
> }
> EXPORT_SYMBOL(__wait_on_bit);
>
On Sun, Jul 31, 2022 at 11:48:32PM +0100, Matthew Wilcox wrote:
> On Sun, Jul 31, 2022 at 10:30:11AM -0700, Paul E. McKenney wrote:
> > That said, I confess that I am having a hard time finding the
> > buffer_locked() definition. So if buffer_locked() uses normal C-language
> > accesses to sample the BH_Lock bit of the ->b_state field, then yes,
> > there could be a problem. The compiler would then be free to reorder
> > calls to buffer_locked() because it could then assume that no other
> > thread was touching that ->b_state field.
>
> You're having a hard time finding it because it's constructed with the C
> preprocessor. I really wish we generated header files using CPP once
> and then included the generated/ file. That would make them greppable.
>
> You're looking for include/linux/buffer_head.h and it's done like this:
>
> enum bh_state_bits {
> ...
> BH_Lock, /* Is locked */
> ...
>
> #define BUFFER_FNS(bit, name) \
> ...
> static __always_inline int buffer_##name(const struct buffer_head *bh) \
> { \
> return test_bit(BH_##bit, &(bh)->b_state); \
> }
>
> BUFFER_FNS(Lock, locked)
>
> (fwiw, the page/folio versions of these functions don't autogenerate
> the lock or uptodate ones because they need extra functions called)
Thank you!
Another thing that would have helped me find it would have been to leave
the "BH_" prefix on the bit name in the BUFFER_FNS() invocation, as in
ditch the "BH_##" in the definition and then just say "BUFFER_FNS(BH_Lock,
locked)".
But what is life without a challenge? ;-/
Anyway, on x86 this will use the constant_test_bit() function, which
uses a volatile declaration for its parameter. So it should avoid
vulnerability to the vicissitudes of the compiler.
So I feel much better now.
Thanx, Paul
On Sun, 31 Jul 2022, Linus Torvalds wrote:
> On Sun, Jul 31, 2022 at 1:41 PM Mikulas Patocka <[email protected]> wrote:
> >
> > - if (!test_bit(bit, word))
> > + if (!test_bit(bit, word)) {
> > + smp_rmb();
>
> Logically, I don't think that makes sense.
>
> Maybe you're checking the buffer being up-to-date before you *write* to it?
None of the CPUs have speculative writes - so the write can't be moved
before the "test_bit" function. So, we are only concerned about reads.
> So smp_rmb() seems entirely wrong.
>
> I think it should consistently aim for just doing
>
> unsigned long state = smp_read_acquire(word);
> if (!(state & (1 << bit)))
> return 0;
>
> or whatever.
>
> We should strive to *not* add new uses of the legacy memory barriers.
> They are garbage from last century when people didn't know better.
>
> Then people learnt to use acquire and release, and things improved.
> Let's live in that improved world.
>
> Linus
OK - I'm sending new patches that introduce the function test_bit_acquire.
Mikulas
Let's have a look at this piece of code in __bread_slow:
get_bh(bh);
bh->b_end_io = end_buffer_read_sync;
submit_bh(REQ_OP_READ, 0, bh);
wait_on_buffer(bh);
if (buffer_uptodate(bh))
return bh;
Neither wait_on_buffer nor buffer_uptodate contain any memory barrier.
Consequently, if someone calls sb_bread and then reads the buffer data,
the read of buffer data may be executed before wait_on_buffer(bh) on
architectures with weak memory ordering and it may return invalid data.
Fix this bug by changing the function buffer_locked to have the acquire
semantics - so that code that follows buffer_locked cannot be moved before
it.
Signed-off-by: Mikulas Patocka <[email protected]>
Cc: [email protected]
Index: linux-2.6/include/linux/buffer_head.h
===================================================================
--- linux-2.6.orig/include/linux/buffer_head.h
+++ linux-2.6/include/linux/buffer_head.h
@@ -120,7 +120,6 @@ static __always_inline int test_clear_bu
BUFFER_FNS(Uptodate, uptodate)
BUFFER_FNS(Dirty, dirty)
TAS_BUFFER_FNS(Dirty, dirty)
-BUFFER_FNS(Lock, locked)
BUFFER_FNS(Req, req)
TAS_BUFFER_FNS(Req, req)
BUFFER_FNS(Mapped, mapped)
@@ -135,6 +134,17 @@ BUFFER_FNS(Meta, meta)
BUFFER_FNS(Prio, prio)
BUFFER_FNS(Defer_Completion, defer_completion)
+static __always_inline void set_buffer_locked(struct buffer_head *bh)
+{
+ set_bit(BH_Lock, &bh->b_state);
+}
+
+static __always_inline int buffer_locked(const struct buffer_head *bh)
+{
+ /* pairs with smp_mb__after_atomic() in unlock_buffer() */
+ return test_bit_acquire(BH_Lock, &bh->b_state);
+}
+
#define bh_offset(bh) ((unsigned long)(bh)->b_data & ~PAGE_MASK)
/* If we *know* page->private refers to buffer_heads */
wait_on_bit tests the bit without any memory barriers, consequently the
code that follows wait_on_bit may be moved before testing the bit on
architectures with weak memory ordering. When the code tests for some
event using wait_on_bit and then performs a load operation, the load may
be unexpectedly moved before wait_on_bit and it may return data that
existed before the event occurred.
Such bugs exist in fs/buffer.c:__wait_on_buffer,
drivers/md/dm-bufio.c:new_read,
drivers/media/usb/dvb-usb-v2/dvb_usb_core.c:dvb_usb_start_feed,
drivers/bluetooth/btusb.c:btusb_mtk_hci_wmt_sync
and perhaps in other places.
We fix this class of bugs by adding a new function test_bit_acquire that
reads the bit and provides acquire memory ordering semantics.
Signed-off-by: Mikulas Patocka <[email protected]>
Cc: [email protected]
---
arch/s390/include/asm/bitops.h | 10 ++++++++++
arch/x86/include/asm/bitops.h | 7 ++++++-
include/asm-generic/bitops/instrumented-non-atomic.h | 11 +++++++++++
include/asm-generic/bitops/non-atomic.h | 13 +++++++++++++
include/linux/wait_bit.h | 8 ++++----
kernel/sched/wait_bit.c | 6 +++---
6 files changed, 47 insertions(+), 8 deletions(-)
Index: linux-2.6/arch/x86/include/asm/bitops.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/bitops.h 2022-08-01 12:27:43.000000000 +0200
+++ linux-2.6/arch/x86/include/asm/bitops.h 2022-08-01 12:27:43.000000000 +0200
@@ -203,8 +203,10 @@ arch_test_and_change_bit(long nr, volati
static __always_inline bool constant_test_bit(long nr, const volatile unsigned long *addr)
{
- return ((1UL << (nr & (BITS_PER_LONG-1))) &
+ bool r = ((1UL << (nr & (BITS_PER_LONG-1))) &
(addr[nr >> _BITOPS_LONG_SHIFT])) != 0;
+ barrier();
+ return r;
}
static __always_inline bool variable_test_bit(long nr, volatile const unsigned long *addr)
@@ -224,6 +226,9 @@ static __always_inline bool variable_tes
? constant_test_bit((nr), (addr)) \
: variable_test_bit((nr), (addr)))
+#define arch_test_bit_acquire(nr, addr) \
+ arch_test_bit(nr, addr)
+
/**
* __ffs - find first set bit in word
* @word: The word to search
Index: linux-2.6/include/asm-generic/bitops/instrumented-non-atomic.h
===================================================================
--- linux-2.6.orig/include/asm-generic/bitops/instrumented-non-atomic.h 2022-08-01 12:27:43.000000000 +0200
+++ linux-2.6/include/asm-generic/bitops/instrumented-non-atomic.h 2022-08-01 12:28:33.000000000 +0200
@@ -135,4 +135,15 @@ static __always_inline bool test_bit(lon
return arch_test_bit(nr, addr);
}
+/**
+ * test_bit_acquire - Determine whether a bit is set with acquire semantics
+ * @nr: bit number to test
+ * @addr: Address to start counting from
+ */
+static __always_inline bool test_bit_acquire(long nr, const volatile unsigned long *addr)
+{
+ instrument_atomic_read(addr + BIT_WORD(nr), sizeof(long));
+ return arch_test_bit_acquire(nr, addr);
+}
+
#endif /* _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H */
Index: linux-2.6/include/asm-generic/bitops/non-atomic.h
===================================================================
--- linux-2.6.orig/include/asm-generic/bitops/non-atomic.h 2022-08-01 12:27:43.000000000 +0200
+++ linux-2.6/include/asm-generic/bitops/non-atomic.h 2022-08-01 12:27:43.000000000 +0200
@@ -119,4 +119,17 @@ arch_test_bit(unsigned int nr, const vol
}
#define test_bit arch_test_bit
+/**
+ * arch_test_bit - Determine whether a bit is set with acquire semantics
+ * @nr: bit number to test
+ * @addr: Address to start counting from
+ */
+static __always_inline int
+arch_test_bit_acquire(unsigned int nr, const volatile unsigned long *addr)
+{
+ unsigned val = smp_load_acquire(&addr[BIT_WORD(nr)]);
+ return 1UL & (val >> (nr & (BITS_PER_LONG-1)));
+}
+#define test_bit_acquire arch_test_bit_acquire
+
#endif /* _ASM_GENERIC_BITOPS_NON_ATOMIC_H_ */
Index: linux-2.6/arch/s390/include/asm/bitops.h
===================================================================
--- linux-2.6.orig/arch/s390/include/asm/bitops.h 2022-08-01 12:27:43.000000000 +0200
+++ linux-2.6/arch/s390/include/asm/bitops.h 2022-08-01 12:27:43.000000000 +0200
@@ -184,6 +184,16 @@ static inline bool arch_test_bit(unsigne
return *addr & mask;
}
+static inline bool arch_test_bit_acquire(unsigned long nr,
+ const volatile unsigned long *ptr)
+{
+ const volatile unsigned long *addr = __bitops_word(nr, ptr);
+ unsigned long val = smp_load_acquire(addr);
+ unsigned long mask = __bitops_mask(nr);
+
+ return val & mask;
+}
+
static inline bool arch_test_and_set_bit_lock(unsigned long nr,
volatile unsigned long *ptr)
{
Index: linux-2.6/include/linux/wait_bit.h
===================================================================
--- linux-2.6.orig/include/linux/wait_bit.h 2022-08-01 12:27:43.000000000 +0200
+++ linux-2.6/include/linux/wait_bit.h 2022-08-01 12:27:43.000000000 +0200
@@ -71,7 +71,7 @@ static inline int
wait_on_bit(unsigned long *word, int bit, unsigned mode)
{
might_sleep();
- if (!test_bit(bit, word))
+ if (!test_bit_acquire(bit, word))
return 0;
return out_of_line_wait_on_bit(word, bit,
bit_wait,
@@ -96,7 +96,7 @@ static inline int
wait_on_bit_io(unsigned long *word, int bit, unsigned mode)
{
might_sleep();
- if (!test_bit(bit, word))
+ if (!test_bit_acquire(bit, word))
return 0;
return out_of_line_wait_on_bit(word, bit,
bit_wait_io,
@@ -123,7 +123,7 @@ wait_on_bit_timeout(unsigned long *word,
unsigned long timeout)
{
might_sleep();
- if (!test_bit(bit, word))
+ if (!test_bit_acquire(bit, word))
return 0;
return out_of_line_wait_on_bit_timeout(word, bit,
bit_wait_timeout,
@@ -151,7 +151,7 @@ wait_on_bit_action(unsigned long *word,
unsigned mode)
{
might_sleep();
- if (!test_bit(bit, word))
+ if (!test_bit_acquire(bit, word))
return 0;
return out_of_line_wait_on_bit(word, bit, action, mode);
}
Index: linux-2.6/kernel/sched/wait_bit.c
===================================================================
--- linux-2.6.orig/kernel/sched/wait_bit.c 2022-08-01 12:27:43.000000000 +0200
+++ linux-2.6/kernel/sched/wait_bit.c 2022-08-01 12:27:43.000000000 +0200
@@ -25,7 +25,7 @@ int wake_bit_function(struct wait_queue_
if (wait_bit->key.flags != key->flags ||
wait_bit->key.bit_nr != key->bit_nr ||
- test_bit(key->bit_nr, key->flags))
+ test_bit_acquire(key->bit_nr, key->flags))
return 0;
return autoremove_wake_function(wq_entry, mode, sync, key);
@@ -45,9 +45,9 @@ __wait_on_bit(struct wait_queue_head *wq
do {
prepare_to_wait(wq_head, &wbq_entry->wq_entry, mode);
- if (test_bit(wbq_entry->key.bit_nr, wbq_entry->key.flags))
+ if (test_bit_acquire(wbq_entry->key.bit_nr, wbq_entry->key.flags))
ret = (*action)(&wbq_entry->key, mode);
- } while (test_bit(wbq_entry->key.bit_nr, wbq_entry->key.flags) && !ret);
+ } while (test_bit_acquire(wbq_entry->key.bit_nr, wbq_entry->key.flags) && !ret);
finish_wait(wq_head, &wbq_entry->wq_entry);
On Mon, 1 Aug 2022, Matthew Wilcox wrote:
> On Mon, Aug 01, 2022 at 06:43:55AM -0400, Mikulas Patocka wrote:
> > Let's have a look at this piece of code in __bread_slow:
> > get_bh(bh);
> > bh->b_end_io = end_buffer_read_sync;
> > submit_bh(REQ_OP_READ, 0, bh);
> > wait_on_buffer(bh);
> > if (buffer_uptodate(bh))
> > return bh;
> > Neither wait_on_buffer nor buffer_uptodate contain any memory barrier.
> > Consequently, if someone calls sb_bread and then reads the buffer data,
> > the read of buffer data may be executed before wait_on_buffer(bh) on
> > architectures with weak memory ordering and it may return invalid data.
> >
> > Fix this bug by changing the function buffer_locked to have the acquire
> > semantics - so that code that follows buffer_locked cannot be moved before
> > it.
>
> I think this is the wrong approach. Instead, buffer_set_uptodate()
> should have the smp_wmb() and buffer_uptodate should have the smp_rmb().
> Just like the page flags. As I said last night.
Linus said that he prefers acquire/release to smp_rmb/smp_wmb. So, sort it
out with him :)
In most cases, the buffer is set uptodate while it is locked, so that
there is no race on the uptodate flag (the race exists on the locked
flag). Are there any cases where the uptodate flag is modified on unlocked
buffer, so that it needs special treatment too?
Mikulas
On Mon, Aug 01, 2022 at 06:43:55AM -0400, Mikulas Patocka wrote:
> Let's have a look at this piece of code in __bread_slow:
> get_bh(bh);
> bh->b_end_io = end_buffer_read_sync;
> submit_bh(REQ_OP_READ, 0, bh);
> wait_on_buffer(bh);
> if (buffer_uptodate(bh))
> return bh;
> Neither wait_on_buffer nor buffer_uptodate contain any memory barrier.
> Consequently, if someone calls sb_bread and then reads the buffer data,
> the read of buffer data may be executed before wait_on_buffer(bh) on
> architectures with weak memory ordering and it may return invalid data.
>
> Fix this bug by changing the function buffer_locked to have the acquire
> semantics - so that code that follows buffer_locked cannot be moved before
> it.
I think this is the wrong approach. Instead, buffer_set_uptodate()
should have the smp_wmb() and buffer_uptodate should have the smp_rmb().
Just like the page flags. As I said last night.
Hi Linus, Paul,
Apologies for the slow response here; believe it or not, I was attending
a workshop about memory ordering.
On Sun, Jul 31, 2022 at 10:30:11AM -0700, Paul E. McKenney wrote:
> On Sun, Jul 31, 2022 at 09:51:47AM -0700, Linus Torvalds wrote:
> > Even alpha is specified to be locally ordered wrt *one* memory
> > location, including for reads (See table 5-1: "Processor issue order",
> > and also 5.6.2.2: "Litmus test 2"). So if a previous read has seen a
> > new value, a subsequent read is not allowed to see an older one - even
> > without a memory barrier.
> >
> > Will, Paul? Maybe that's only for overlapping loads/stores, not for
> > loads/loads. Because maybe alpha for once isn't the weakest possible
> > ordering.
>
> The "bad boy" in this case is Itanium, which can do some VLIW reordering
> of accesses. Or could, I am not sure that newer Itanium hardware
> does this. But this is why Itanium compilers made volatile loads use
> the ld,acq instruction.
>
> Which means that aligned same-sized marked accesses to a single location
> really do execute consistently with some global ordering, even on Itanium.
Although this is true, there's a really subtle issue which crops up if you
try to compose this read-after-read ordering with dependencies in the case
where the two reads read the same value (which is encapsulated by the
unusual RSW litmus test that I've tried to convert to C below):
/* Global definitions; assume everything zero-initialised */
struct foo {
int *x;
};
int x;
struct foo foo;
struct foo *ptr;
/* CPU 0 */
WRITE_ONCE(x, 1);
WRITE_ONCE(foo.x, &x);
/*
* Release ordering to ensure that somebody following a non-NULL ptr will
* see a fully-initialised 'foo'. smp_[w]mb() would work as well.
*/
smp_store_release(&ptr, &foo);
/* CPU 1 */
int *xp1, *xp2, val;
struct foo *foop;
/* Load the global pointer and check that it's not NULL. */
foop = READ_ONCE(ptr);
if (!foop)
return;
/*
* Load 'foo.x' via the pointer we just loaded. This is ordered after the
* previous READ_ONCE() because of the address dependency.
*/
xp1 = READ_ONCE(foop->x);
/*
* Load 'foo.x' directly via the global 'foo'.
* _This is loading the same address as the previous READ_ONCE() and
* therefore cannot return a stale (NULL) value!_
*/
xp2 = READ_ONCE(foo.x);
/*
* Load 'x' via the pointer we just loaded.
* _We may see zero here!_
*/
val = READ_ONCE(*xp2);
So in this case, the two same-location reads on CPU1 are actually executed
out of order, but you can't tell just by looking at them in isolation
because they returned the same value (i.e. xp1 == xp2 == &x). However, you
*can* tell when you compose them in funny situations such as above (and I
believe that this is demonstrably true on PPC and Arm; effectively the
read-after-read ordering machinery only triggers in response to incoming
snoops).
There's probably a more-compelling variant using an (RCpc) load acquire
instead of the last address dependency on CPU 1 as well.
Anyway, I think all I'm trying to say is that I'd tend to shy away from
relying on read-after-read ordering if it forms part of a more involved
ordering relationship.
> > But the patch looks fine, though I agree that the ordering in
> > __wait_on_buffer should probably be moved into
> > wait_on_bit/wait_on_bit_io.
> >
> > And would we perhaps want the bitops to have the different ordering
> > versions? Like "set_bit_release()" and "test_bit_acquire()"? That
> > would seem to be (a) cleaner and (b) possibly generate better code for
> > architectures where that makes a difference?
>
> As always, I defer to the architecture maintainers on this one.
FWIW, that makes sense to me. We already have release/acquire/releaxed
variants of the bitops in the atomic_t API so it seems natural to have
a counterpart in the actual bitops API as well.
Will
On Mon, Aug 01, 2022 at 06:42:15AM -0400, Mikulas Patocka wrote:
> wait_on_bit tests the bit without any memory barriers, consequently the
> code that follows wait_on_bit may be moved before testing the bit on
> architectures with weak memory ordering. When the code tests for some
> event using wait_on_bit and then performs a load operation, the load may
> be unexpectedly moved before wait_on_bit and it may return data that
> existed before the event occurred.
>
> Such bugs exist in fs/buffer.c:__wait_on_buffer,
> drivers/md/dm-bufio.c:new_read,
> drivers/media/usb/dvb-usb-v2/dvb_usb_core.c:dvb_usb_start_feed,
> drivers/bluetooth/btusb.c:btusb_mtk_hci_wmt_sync
> and perhaps in other places.
>
> We fix this class of bugs by adding a new function test_bit_acquire that
> reads the bit and provides acquire memory ordering semantics.
>
> Signed-off-by: Mikulas Patocka <[email protected]>
> Cc: [email protected]
>
> ---
> arch/s390/include/asm/bitops.h | 10 ++++++++++
> arch/x86/include/asm/bitops.h | 7 ++++++-
> include/asm-generic/bitops/instrumented-non-atomic.h | 11 +++++++++++
> include/asm-generic/bitops/non-atomic.h | 13 +++++++++++++
> include/linux/wait_bit.h | 8 ++++----
> kernel/sched/wait_bit.c | 6 +++---
> 6 files changed, 47 insertions(+), 8 deletions(-)
>
> Index: linux-2.6/arch/x86/include/asm/bitops.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/bitops.h 2022-08-01 12:27:43.000000000 +0200
> +++ linux-2.6/arch/x86/include/asm/bitops.h 2022-08-01 12:27:43.000000000 +0200
> @@ -203,8 +203,10 @@ arch_test_and_change_bit(long nr, volati
>
> static __always_inline bool constant_test_bit(long nr, const volatile unsigned long *addr)
> {
> - return ((1UL << (nr & (BITS_PER_LONG-1))) &
> + bool r = ((1UL << (nr & (BITS_PER_LONG-1))) &
> (addr[nr >> _BITOPS_LONG_SHIFT])) != 0;
> + barrier();
> + return r;
Hmm, I find it a bit weird to have a barrier() here given that 'addr' is
volatile and we don't need a barrier() like this in the definition of
READ_ONCE(), for example.
> Index: linux-2.6/include/linux/wait_bit.h
> ===================================================================
> --- linux-2.6.orig/include/linux/wait_bit.h 2022-08-01 12:27:43.000000000 +0200
> +++ linux-2.6/include/linux/wait_bit.h 2022-08-01 12:27:43.000000000 +0200
> @@ -71,7 +71,7 @@ static inline int
> wait_on_bit(unsigned long *word, int bit, unsigned mode)
> {
> might_sleep();
> - if (!test_bit(bit, word))
> + if (!test_bit_acquire(bit, word))
> return 0;
> return out_of_line_wait_on_bit(word, bit,
> bit_wait,
Yet another approach here would be to leave test_bit as-is and add a call to
smp_acquire__after_ctrl_dep() since that exists already -- I don't have
strong opinions about it, but it saves you having to add another stub to
x86.
Will
On Mon, 1 Aug 2022, Will Deacon wrote:
> On Mon, Aug 01, 2022 at 06:42:15AM -0400, Mikulas Patocka wrote:
>
> > Index: linux-2.6/arch/x86/include/asm/bitops.h
> > ===================================================================
> > --- linux-2.6.orig/arch/x86/include/asm/bitops.h 2022-08-01 12:27:43.000000000 +0200
> > +++ linux-2.6/arch/x86/include/asm/bitops.h 2022-08-01 12:27:43.000000000 +0200
> > @@ -203,8 +203,10 @@ arch_test_and_change_bit(long nr, volati
> >
> > static __always_inline bool constant_test_bit(long nr, const volatile unsigned long *addr)
> > {
> > - return ((1UL << (nr & (BITS_PER_LONG-1))) &
> > + bool r = ((1UL << (nr & (BITS_PER_LONG-1))) &
> > (addr[nr >> _BITOPS_LONG_SHIFT])) != 0;
> > + barrier();
> > + return r;
>
> Hmm, I find it a bit weird to have a barrier() here given that 'addr' is
> volatile and we don't need a barrier() like this in the definition of
> READ_ONCE(), for example.
gcc doesn't reorder two volatile accesses, but it can reorder non-volatile
accesses around volatile accesses.
The purpose of the compiler barrier is to make sure that the non-volatile
accesses that follow test_bit are not reordered by the compiler before the
volatile access to addr.
> > Index: linux-2.6/include/linux/wait_bit.h
> > ===================================================================
> > --- linux-2.6.orig/include/linux/wait_bit.h 2022-08-01 12:27:43.000000000 +0200
> > +++ linux-2.6/include/linux/wait_bit.h 2022-08-01 12:27:43.000000000 +0200
> > @@ -71,7 +71,7 @@ static inline int
> > wait_on_bit(unsigned long *word, int bit, unsigned mode)
> > {
> > might_sleep();
> > - if (!test_bit(bit, word))
> > + if (!test_bit_acquire(bit, word))
> > return 0;
> > return out_of_line_wait_on_bit(word, bit,
> > bit_wait,
>
> Yet another approach here would be to leave test_bit as-is and add a call to
> smp_acquire__after_ctrl_dep() since that exists already -- I don't have
> strong opinions about it, but it saves you having to add another stub to
> x86.
It would be the same as my previous patch with smp_rmb() that Linus didn't
like. But I think smp_rmb (or smp_acquire__after_ctrl_dep) would be
correct here.
> Will
Mikulas
On Mon, Aug 01, 2022 at 12:12:47PM -0400, Mikulas Patocka wrote:
>
>
> On Mon, 1 Aug 2022, Will Deacon wrote:
>
> > On Mon, Aug 01, 2022 at 06:42:15AM -0400, Mikulas Patocka wrote:
> >
> > > Index: linux-2.6/arch/x86/include/asm/bitops.h
> > > ===================================================================
> > > --- linux-2.6.orig/arch/x86/include/asm/bitops.h 2022-08-01 12:27:43.000000000 +0200
> > > +++ linux-2.6/arch/x86/include/asm/bitops.h 2022-08-01 12:27:43.000000000 +0200
> > > @@ -203,8 +203,10 @@ arch_test_and_change_bit(long nr, volati
> > >
> > > static __always_inline bool constant_test_bit(long nr, const volatile unsigned long *addr)
> > > {
> > > - return ((1UL << (nr & (BITS_PER_LONG-1))) &
> > > + bool r = ((1UL << (nr & (BITS_PER_LONG-1))) &
> > > (addr[nr >> _BITOPS_LONG_SHIFT])) != 0;
> > > + barrier();
> > > + return r;
> >
> > Hmm, I find it a bit weird to have a barrier() here given that 'addr' is
> > volatile and we don't need a barrier() like this in the definition of
> > READ_ONCE(), for example.
>
> gcc doesn't reorder two volatile accesses, but it can reorder non-volatile
> accesses around volatile accesses.
>
> The purpose of the compiler barrier is to make sure that the non-volatile
> accesses that follow test_bit are not reordered by the compiler before the
> volatile access to addr.
>
Better to have a constant_test_bit_acquire()? I don't think all
test_bit() call sites need the ordering?
Regards,
Boqun
> > > Index: linux-2.6/include/linux/wait_bit.h
> > > ===================================================================
> > > --- linux-2.6.orig/include/linux/wait_bit.h 2022-08-01 12:27:43.000000000 +0200
> > > +++ linux-2.6/include/linux/wait_bit.h 2022-08-01 12:27:43.000000000 +0200
> > > @@ -71,7 +71,7 @@ static inline int
> > > wait_on_bit(unsigned long *word, int bit, unsigned mode)
> > > {
> > > might_sleep();
> > > - if (!test_bit(bit, word))
> > > + if (!test_bit_acquire(bit, word))
> > > return 0;
> > > return out_of_line_wait_on_bit(word, bit,
> > > bit_wait,
> >
> > Yet another approach here would be to leave test_bit as-is and add a call to
> > smp_acquire__after_ctrl_dep() since that exists already -- I don't have
> > strong opinions about it, but it saves you having to add another stub to
> > x86.
>
> It would be the same as my previous patch with smp_rmb() that Linus didn't
> like. But I think smp_rmb (or smp_acquire__after_ctrl_dep) would be
> correct here.
>
> > Will
>
> Mikulas
>
On Mon, Aug 01, 2022 at 04:41:09PM +0100, Will Deacon wrote:
> Hi Linus, Paul,
>
> Apologies for the slow response here; believe it or not, I was attending
> a workshop about memory ordering.
Nice!!! Anything that I can/should know from that gathering? ;-)
> On Sun, Jul 31, 2022 at 10:30:11AM -0700, Paul E. McKenney wrote:
> > On Sun, Jul 31, 2022 at 09:51:47AM -0700, Linus Torvalds wrote:
> > > Even alpha is specified to be locally ordered wrt *one* memory
> > > location, including for reads (See table 5-1: "Processor issue order",
> > > and also 5.6.2.2: "Litmus test 2"). So if a previous read has seen a
> > > new value, a subsequent read is not allowed to see an older one - even
> > > without a memory barrier.
> > >
> > > Will, Paul? Maybe that's only for overlapping loads/stores, not for
> > > loads/loads. Because maybe alpha for once isn't the weakest possible
> > > ordering.
> >
> > The "bad boy" in this case is Itanium, which can do some VLIW reordering
> > of accesses. Or could, I am not sure that newer Itanium hardware
> > does this. But this is why Itanium compilers made volatile loads use
> > the ld,acq instruction.
> >
> > Which means that aligned same-sized marked accesses to a single location
> > really do execute consistently with some global ordering, even on Itanium.
>
> Although this is true, there's a really subtle issue which crops up if you
> try to compose this read-after-read ordering with dependencies in the case
> where the two reads read the same value (which is encapsulated by the
> unusual RSW litmus test that I've tried to convert to C below):
RSW from the infamous test6.pdf, correct?
> /* Global definitions; assume everything zero-initialised */
> struct foo {
> int *x;
> };
>
> int x;
> struct foo foo;
> struct foo *ptr;
>
>
> /* CPU 0 */
> WRITE_ONCE(x, 1);
Your x is RSW's z?
> WRITE_ONCE(foo.x, &x);
And your foo.x is RSW's x? If so, the above WRITE_ONCE() could happen at
compile time, correct? Or in the initialization clause of a litmus test?
> /*
> * Release ordering to ensure that somebody following a non-NULL ptr will
> * see a fully-initialised 'foo'. smp_[w]mb() would work as well.
> */
> smp_store_release(&ptr, &foo);
Your ptr is RSW's y, correct?
> /* CPU 1 */
> int *xp1, *xp2, val;
> struct foo *foop;
>
> /* Load the global pointer and check that it's not NULL. */
> foop = READ_ONCE(ptr);
> if (!foop)
> return;
A litmus tests can do this via the filter clause.
> /*
> * Load 'foo.x' via the pointer we just loaded. This is ordered after the
> * previous READ_ONCE() because of the address dependency.
> */
> xp1 = READ_ONCE(foop->x);
>
> /*
> * Load 'foo.x' directly via the global 'foo'.
> * _This is loading the same address as the previous READ_ONCE() and
> * therefore cannot return a stale (NULL) value!_
> */
> xp2 = READ_ONCE(foo.x);
OK, same location, but RSW calls only for po, not addr from the initial
read to this read, got it. (My first attempt left out this nuance,
in case you were wondering.)
> /*
> * Load 'x' via the pointer we just loaded.
> * _We may see zero here!_
> */
> val = READ_ONCE(*xp2);
And herd7/LKMM agree with this, at least assuming I got the litmus
test right. (I used RSW's variables as a cross-check.)
> So in this case, the two same-location reads on CPU1 are actually executed
> out of order, but you can't tell just by looking at them in isolation
> because they returned the same value (i.e. xp1 == xp2 == &x). However, you
> *can* tell when you compose them in funny situations such as above (and I
> believe that this is demonstrably true on PPC and Arm; effectively the
> read-after-read ordering machinery only triggers in response to incoming
> snoops).
Again, I leave the hardware specifics to the respective maintainers.
> There's probably a more-compelling variant using an (RCpc) load acquire
> instead of the last address dependency on CPU 1 as well.
>
> Anyway, I think all I'm trying to say is that I'd tend to shy away from
> relying on read-after-read ordering if it forms part of a more involved
> ordering relationship.
Agreed. As soon as you add that second variable, you need more ordering.
The read-after-read ordering applies only within the confines of a single
variable, and you need more ordering when you add that second variable.
And as Will's RSW example shows, more ordering than you might think.
> > > But the patch looks fine, though I agree that the ordering in
> > > __wait_on_buffer should probably be moved into
> > > wait_on_bit/wait_on_bit_io.
> > >
> > > And would we perhaps want the bitops to have the different ordering
> > > versions? Like "set_bit_release()" and "test_bit_acquire()"? That
> > > would seem to be (a) cleaner and (b) possibly generate better code for
> > > architectures where that makes a difference?
> >
> > As always, I defer to the architecture maintainers on this one.
>
> FWIW, that makes sense to me. We already have release/acquire/releaxed
> variants of the bitops in the atomic_t API so it seems natural to have
> a counterpart in the actual bitops API as well.
Just to join you guys in liking acquire and release better than smp_wmb()
and smp_rmb(). My RCU life got much better after update-side uses of
smp_wmb() were replaced by rcu_assign_pointer() and read-side uses of
smp_read_barrier_depends() were replaced by rcu_dereference().
Thanx, Paul
------------------------------------------------------------------------
C rsw
{
a=0;
x=z;
y=a;
z=0;
}
P0(int *x, int **y, int *z)
{
WRITE_ONCE(*z, 1);
WRITE_ONCE(*y, x);
}
P1(int *x, int **y, int *z)
{
r1 = READ_ONCE(*y);
r2 = READ_ONCE(*r1);
r3 = READ_ONCE(*x);
r4 = READ_ONCE(*r3);
}
filter(1:r1=x)
exists(1:r2=z /\ 1:r3=z /\ 1:r4=0)
------------------------------------------------------------------------
$ herd7 -conf linux-kernel.cfg /tmp/rsw.litmus
Test rsw Allowed
States 2
1:r2=z; 1:r3=z; 1:r4=0;
1:r2=z; 1:r3=z; 1:r4=1;
Ok
Witnesses
Positive: 1 Negative: 1
Condition exists (1:r2=z /\ 1:r3=z /\ 1:r4=0)
Observation rsw Sometimes 1 1
Time rsw 0.00
Hash=03618e3079eb0d371ca0f1ab679f3eae
From: Boqun Feng
> Sent: 01 August 2022 19:17
>
> On Mon, Aug 01, 2022 at 12:12:47PM -0400, Mikulas Patocka wrote:
> >
> >
> > On Mon, 1 Aug 2022, Will Deacon wrote:
> >
> > > On Mon, Aug 01, 2022 at 06:42:15AM -0400, Mikulas Patocka wrote:
> > >
> > > > Index: linux-2.6/arch/x86/include/asm/bitops.h
> > > > ===================================================================
> > > > --- linux-2.6.orig/arch/x86/include/asm/bitops.h 2022-08-01 12:27:43.000000000 +0200
> > > > +++ linux-2.6/arch/x86/include/asm/bitops.h 2022-08-01 12:27:43.000000000 +0200
> > > > @@ -203,8 +203,10 @@ arch_test_and_change_bit(long nr, volati
> > > >
> > > > static __always_inline bool constant_test_bit(long nr, const volatile unsigned long *addr)
> > > > {
> > > > - return ((1UL << (nr & (BITS_PER_LONG-1))) &
> > > > + bool r = ((1UL << (nr & (BITS_PER_LONG-1))) &
> > > > (addr[nr >> _BITOPS_LONG_SHIFT])) != 0;
> > > > + barrier();
> > > > + return r;
> > >
> > > Hmm, I find it a bit weird to have a barrier() here given that 'addr' is
> > > volatile and we don't need a barrier() like this in the definition of
> > > READ_ONCE(), for example.
> >
> > gcc doesn't reorder two volatile accesses, but it can reorder non-volatile
> > accesses around volatile accesses.
> >
> > The purpose of the compiler barrier is to make sure that the non-volatile
> > accesses that follow test_bit are not reordered by the compiler before the
> > volatile access to addr.
> >
>
> Better to have a constant_test_bit_acquire()? I don't think all
> test_bit() call sites need the ordering?
It is also unlikely that the compiler will 'usefully' move a read
across the test_bit() call - which is likely to be in a conditional.
So barrier() is unlikely to significantly affect the generated code.
Indeed, perhaps test_bit() should always enforce read ordering
even one weakly ordered cpu?
It is used with set_bit() and clear_bit() which are expensive
locked operations - so a slightly more expensive test_bit()
probably doesn't matter.
Remember these aren't functions to replace &= and |=.
(In spite of some code paths.)
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On Mon, Aug 01, 2022 at 12:12:47PM -0400, Mikulas Patocka wrote:
> On Mon, 1 Aug 2022, Will Deacon wrote:
> > On Mon, Aug 01, 2022 at 06:42:15AM -0400, Mikulas Patocka wrote:
> >
> > > Index: linux-2.6/arch/x86/include/asm/bitops.h
> > > ===================================================================
> > > --- linux-2.6.orig/arch/x86/include/asm/bitops.h 2022-08-01 12:27:43.000000000 +0200
> > > +++ linux-2.6/arch/x86/include/asm/bitops.h 2022-08-01 12:27:43.000000000 +0200
> > > @@ -203,8 +203,10 @@ arch_test_and_change_bit(long nr, volati
> > >
> > > static __always_inline bool constant_test_bit(long nr, const volatile unsigned long *addr)
> > > {
> > > - return ((1UL << (nr & (BITS_PER_LONG-1))) &
> > > + bool r = ((1UL << (nr & (BITS_PER_LONG-1))) &
> > > (addr[nr >> _BITOPS_LONG_SHIFT])) != 0;
> > > + barrier();
> > > + return r;
> >
> > Hmm, I find it a bit weird to have a barrier() here given that 'addr' is
> > volatile and we don't need a barrier() like this in the definition of
> > READ_ONCE(), for example.
>
> gcc doesn't reorder two volatile accesses, but it can reorder non-volatile
> accesses around volatile accesses.
>
> The purpose of the compiler barrier is to make sure that the non-volatile
> accesses that follow test_bit are not reordered by the compiler before the
> volatile access to addr.
If we need these accesses to be ordered reliably, then we need a CPU barrier
and that will additionally prevent the compiler reordering. So I still don't
think we need the barrier() here.
> > > Index: linux-2.6/include/linux/wait_bit.h
> > > ===================================================================
> > > --- linux-2.6.orig/include/linux/wait_bit.h 2022-08-01 12:27:43.000000000 +0200
> > > +++ linux-2.6/include/linux/wait_bit.h 2022-08-01 12:27:43.000000000 +0200
> > > @@ -71,7 +71,7 @@ static inline int
> > > wait_on_bit(unsigned long *word, int bit, unsigned mode)
> > > {
> > > might_sleep();
> > > - if (!test_bit(bit, word))
> > > + if (!test_bit_acquire(bit, word))
> > > return 0;
> > > return out_of_line_wait_on_bit(word, bit,
> > > bit_wait,
> >
> > Yet another approach here would be to leave test_bit as-is and add a call to
> > smp_acquire__after_ctrl_dep() since that exists already -- I don't have
> > strong opinions about it, but it saves you having to add another stub to
> > x86.
>
> It would be the same as my previous patch with smp_rmb() that Linus didn't
> like. But I think smp_rmb (or smp_acquire__after_ctrl_dep) would be
> correct here.
Right, I saw Linus' objection to smp_rmb() and I'm not sure where
smp_acquire__after_ctrl_dep() fits in with his line of reasoning. On the one
hand, it's talking about acquire ordering, but on the other, it's ugly as
sin :)
Will
On Mon, Aug 01, 2022 at 12:20:35PM -0700, Paul E. McKenney wrote:
> On Mon, Aug 01, 2022 at 04:41:09PM +0100, Will Deacon wrote:
> > Apologies for the slow response here; believe it or not, I was attending
> > a workshop about memory ordering.
>
> Nice!!! Anything that I can/should know from that gathering? ;-)
Oh come off it, you know this stuff already ;)
> > On Sun, Jul 31, 2022 at 10:30:11AM -0700, Paul E. McKenney wrote:
> > > On Sun, Jul 31, 2022 at 09:51:47AM -0700, Linus Torvalds wrote:
> > > > Even alpha is specified to be locally ordered wrt *one* memory
> > > > location, including for reads (See table 5-1: "Processor issue order",
> > > > and also 5.6.2.2: "Litmus test 2"). So if a previous read has seen a
> > > > new value, a subsequent read is not allowed to see an older one - even
> > > > without a memory barrier.
> > > >
> > > > Will, Paul? Maybe that's only for overlapping loads/stores, not for
> > > > loads/loads. Because maybe alpha for once isn't the weakest possible
> > > > ordering.
> > >
> > > The "bad boy" in this case is Itanium, which can do some VLIW reordering
> > > of accesses. Or could, I am not sure that newer Itanium hardware
> > > does this. But this is why Itanium compilers made volatile loads use
> > > the ld,acq instruction.
> > >
> > > Which means that aligned same-sized marked accesses to a single location
> > > really do execute consistently with some global ordering, even on Itanium.
> >
> > Although this is true, there's a really subtle issue which crops up if you
> > try to compose this read-after-read ordering with dependencies in the case
> > where the two reads read the same value (which is encapsulated by the
> > unusual RSW litmus test that I've tried to convert to C below):
>
> RSW from the infamous test6.pdf, correct?
That's the badger. I've no doubt that you're aware of it already, but I
thought it was a useful exercise to transcribe it to C and have it on the
mailing list for folks to look at.
> > /* Global definitions; assume everything zero-initialised */
> > struct foo {
> > int *x;
> > };
> >
> > int x;
> > struct foo foo;
> > struct foo *ptr;
> >
> >
> > /* CPU 0 */
> > WRITE_ONCE(x, 1);
>
> Your x is RSW's z?
Yes.
> > WRITE_ONCE(foo.x, &x);
>
> And your foo.x is RSW's x? If so, the above WRITE_ONCE() could happen at
> compile time, correct? Or in the initialization clause of a litmus test?
Yes, although I think it's a tiny bit more like real code to have it done
here, although it means that the "surprising" outcome relies on this being
reordered before the store to x.
> > /*
> > * Release ordering to ensure that somebody following a non-NULL ptr will
> > * see a fully-initialised 'foo'. smp_[w]mb() would work as well.
> > */
> > smp_store_release(&ptr, &foo);
>
> Your ptr is RSW's y, correct?
Yes.
> > /* CPU 1 */
> > int *xp1, *xp2, val;
> > struct foo *foop;
> >
> > /* Load the global pointer and check that it's not NULL. */
> > foop = READ_ONCE(ptr);
> > if (!foop)
> > return;
>
> A litmus tests can do this via the filter clause.
Indeed, but I was trying to make this look like C code for non-litmus
speakers!
> > /*
> > * Load 'foo.x' via the pointer we just loaded. This is ordered after the
> > * previous READ_ONCE() because of the address dependency.
> > */
> > xp1 = READ_ONCE(foop->x);
> >
> > /*
> > * Load 'foo.x' directly via the global 'foo'.
> > * _This is loading the same address as the previous READ_ONCE() and
> > * therefore cannot return a stale (NULL) value!_
> > */
> > xp2 = READ_ONCE(foo.x);
>
> OK, same location, but RSW calls only for po, not addr from the initial
> read to this read, got it. (My first attempt left out this nuance,
> in case you were wondering.)
Right, there is only po from the initial read to this read. If there was an
address dependency, then we'd have a chain of address dependencies from the
first read to the last read on this CPU and the result (of x == 0) would be
forbidden.
> > /*
> > * Load 'x' via the pointer we just loaded.
> > * _We may see zero here!_
> > */
> > val = READ_ONCE(*xp2);
>
> And herd7/LKMM agree with this, at least assuming I got the litmus
> test right. (I used RSW's variables as a cross-check.)
That's promising, but see below...
> C rsw
>
> {
> a=0;
> x=z;
> y=a;
> z=0;
> }
>
> P0(int *x, int **y, int *z)
> {
> WRITE_ONCE(*z, 1);
> WRITE_ONCE(*y, x);
> }
Ah wait, you need a barrier between these two writes, don't you? I used
an smp_store_release() but smp[w]_mb() should do too.
Will
On Tue, 2 Aug 2022, Will Deacon wrote:
> On Mon, Aug 01, 2022 at 12:12:47PM -0400, Mikulas Patocka wrote:
> > On Mon, 1 Aug 2022, Will Deacon wrote:
> > > On Mon, Aug 01, 2022 at 06:42:15AM -0400, Mikulas Patocka wrote:
> > >
> > > > Index: linux-2.6/arch/x86/include/asm/bitops.h
> > > > ===================================================================
> > > > --- linux-2.6.orig/arch/x86/include/asm/bitops.h 2022-08-01 12:27:43.000000000 +0200
> > > > +++ linux-2.6/arch/x86/include/asm/bitops.h 2022-08-01 12:27:43.000000000 +0200
> > > > @@ -203,8 +203,10 @@ arch_test_and_change_bit(long nr, volati
> > > >
> > > > static __always_inline bool constant_test_bit(long nr, const volatile unsigned long *addr)
> > > > {
> > > > - return ((1UL << (nr & (BITS_PER_LONG-1))) &
> > > > + bool r = ((1UL << (nr & (BITS_PER_LONG-1))) &
> > > > (addr[nr >> _BITOPS_LONG_SHIFT])) != 0;
> > > > + barrier();
> > > > + return r;
> > >
> > > Hmm, I find it a bit weird to have a barrier() here given that 'addr' is
> > > volatile and we don't need a barrier() like this in the definition of
> > > READ_ONCE(), for example.
> >
> > gcc doesn't reorder two volatile accesses, but it can reorder non-volatile
> > accesses around volatile accesses.
> >
> > The purpose of the compiler barrier is to make sure that the non-volatile
> > accesses that follow test_bit are not reordered by the compiler before the
> > volatile access to addr.
>
> If we need these accesses to be ordered reliably, then we need a CPU barrier
> and that will additionally prevent the compiler reordering. So I still don't
> think we need the barrier() here.
This is x86-specific code. x86 has strong memory ordering, so we only care
about compiler reordering.
We could use smp_rmb() (or smp_load_acquire()) instead of barrier() here,
but smp_rmb() and smp_load_acquire() on x86 is identical to barrier()
anyway.
Mikulas
On Tue, Aug 02, 2022 at 07:38:17AM -0400, Mikulas Patocka wrote:
>
>
> On Tue, 2 Aug 2022, Will Deacon wrote:
>
> > On Mon, Aug 01, 2022 at 12:12:47PM -0400, Mikulas Patocka wrote:
> > > On Mon, 1 Aug 2022, Will Deacon wrote:
> > > > On Mon, Aug 01, 2022 at 06:42:15AM -0400, Mikulas Patocka wrote:
> > > >
> > > > > Index: linux-2.6/arch/x86/include/asm/bitops.h
> > > > > ===================================================================
> > > > > --- linux-2.6.orig/arch/x86/include/asm/bitops.h 2022-08-01 12:27:43.000000000 +0200
> > > > > +++ linux-2.6/arch/x86/include/asm/bitops.h 2022-08-01 12:27:43.000000000 +0200
> > > > > @@ -203,8 +203,10 @@ arch_test_and_change_bit(long nr, volati
> > > > >
> > > > > static __always_inline bool constant_test_bit(long nr, const volatile unsigned long *addr)
> > > > > {
> > > > > - return ((1UL << (nr & (BITS_PER_LONG-1))) &
> > > > > + bool r = ((1UL << (nr & (BITS_PER_LONG-1))) &
> > > > > (addr[nr >> _BITOPS_LONG_SHIFT])) != 0;
> > > > > + barrier();
> > > > > + return r;
> > > >
> > > > Hmm, I find it a bit weird to have a barrier() here given that 'addr' is
> > > > volatile and we don't need a barrier() like this in the definition of
> > > > READ_ONCE(), for example.
> > >
> > > gcc doesn't reorder two volatile accesses, but it can reorder non-volatile
> > > accesses around volatile accesses.
> > >
> > > The purpose of the compiler barrier is to make sure that the non-volatile
> > > accesses that follow test_bit are not reordered by the compiler before the
> > > volatile access to addr.
> >
> > If we need these accesses to be ordered reliably, then we need a CPU barrier
> > and that will additionally prevent the compiler reordering. So I still don't
> > think we need the barrier() here.
>
> This is x86-specific code. x86 has strong memory ordering, so we only care
> about compiler reordering.
Indeed, but what I'm trying to say is that the _caller_ would have a memory
barrier in this case, and so there's no need for one in here. test_bit() does
not have ordering semantics.
Will
On Tue, Aug 02, 2022 at 09:54:55AM +0100, Will Deacon wrote:
> On Mon, Aug 01, 2022 at 12:20:35PM -0700, Paul E. McKenney wrote:
> > On Mon, Aug 01, 2022 at 04:41:09PM +0100, Will Deacon wrote:
> > > Apologies for the slow response here; believe it or not, I was attending
> > > a workshop about memory ordering.
> >
> > Nice!!! Anything that I can/should know from that gathering? ;-)
>
> Oh come off it, you know this stuff already ;)
Thank you for the kind words, but the most devastating learning disability
of all is thinking that you already know everything about the topic
in question. ;-)
> > > On Sun, Jul 31, 2022 at 10:30:11AM -0700, Paul E. McKenney wrote:
> > > > On Sun, Jul 31, 2022 at 09:51:47AM -0700, Linus Torvalds wrote:
> > > > > Even alpha is specified to be locally ordered wrt *one* memory
> > > > > location, including for reads (See table 5-1: "Processor issue order",
> > > > > and also 5.6.2.2: "Litmus test 2"). So if a previous read has seen a
> > > > > new value, a subsequent read is not allowed to see an older one - even
> > > > > without a memory barrier.
> > > > >
> > > > > Will, Paul? Maybe that's only for overlapping loads/stores, not for
> > > > > loads/loads. Because maybe alpha for once isn't the weakest possible
> > > > > ordering.
> > > >
> > > > The "bad boy" in this case is Itanium, which can do some VLIW reordering
> > > > of accesses. Or could, I am not sure that newer Itanium hardware
> > > > does this. But this is why Itanium compilers made volatile loads use
> > > > the ld,acq instruction.
> > > >
> > > > Which means that aligned same-sized marked accesses to a single location
> > > > really do execute consistently with some global ordering, even on Itanium.
> > >
> > > Although this is true, there's a really subtle issue which crops up if you
> > > try to compose this read-after-read ordering with dependencies in the case
> > > where the two reads read the same value (which is encapsulated by the
> > > unusual RSW litmus test that I've tried to convert to C below):
> >
> > RSW from the infamous test6.pdf, correct?
>
> That's the badger. I've no doubt that you're aware of it already, but I
> thought it was a useful exercise to transcribe it to C and have it on the
> mailing list for folks to look at.
I have seen it, but this was nevertheless a useful reminder.
> > > /* Global definitions; assume everything zero-initialised */
> > > struct foo {
> > > int *x;
> > > };
> > >
> > > int x;
> > > struct foo foo;
> > > struct foo *ptr;
> > >
> > >
> > > /* CPU 0 */
> > > WRITE_ONCE(x, 1);
> >
> > Your x is RSW's z?
>
> Yes.
>
> > > WRITE_ONCE(foo.x, &x);
> >
> > And your foo.x is RSW's x? If so, the above WRITE_ONCE() could happen at
> > compile time, correct? Or in the initialization clause of a litmus test?
>
> Yes, although I think it's a tiny bit more like real code to have it done
> here, although it means that the "surprising" outcome relies on this being
> reordered before the store to x.
>
> > > /*
> > > * Release ordering to ensure that somebody following a non-NULL ptr will
> > > * see a fully-initialised 'foo'. smp_[w]mb() would work as well.
> > > */
> > > smp_store_release(&ptr, &foo);
> >
> > Your ptr is RSW's y, correct?
>
> Yes.
>
> > > /* CPU 1 */
> > > int *xp1, *xp2, val;
> > > struct foo *foop;
> > >
> > > /* Load the global pointer and check that it's not NULL. */
> > > foop = READ_ONCE(ptr);
> > > if (!foop)
> > > return;
> >
> > A litmus tests can do this via the filter clause.
>
> Indeed, but I was trying to make this look like C code for non-litmus
> speakers!
>
> > > /*
> > > * Load 'foo.x' via the pointer we just loaded. This is ordered after the
> > > * previous READ_ONCE() because of the address dependency.
> > > */
> > > xp1 = READ_ONCE(foop->x);
> > >
> > > /*
> > > * Load 'foo.x' directly via the global 'foo'.
> > > * _This is loading the same address as the previous READ_ONCE() and
> > > * therefore cannot return a stale (NULL) value!_
> > > */
> > > xp2 = READ_ONCE(foo.x);
> >
> > OK, same location, but RSW calls only for po, not addr from the initial
> > read to this read, got it. (My first attempt left out this nuance,
> > in case you were wondering.)
>
> Right, there is only po from the initial read to this read. If there was an
> address dependency, then we'd have a chain of address dependencies from the
> first read to the last read on this CPU and the result (of x == 0) would be
> forbidden.
>
> > > /*
> > > * Load 'x' via the pointer we just loaded.
> > > * _We may see zero here!_
> > > */
> > > val = READ_ONCE(*xp2);
> >
> > And herd7/LKMM agree with this, at least assuming I got the litmus
> > test right. (I used RSW's variables as a cross-check.)
>
> That's promising, but see below...
>
> > C rsw
> >
> > {
> > a=0;
> > x=z;
> > y=a;
> > z=0;
> > }
> >
> > P0(int *x, int **y, int *z)
> > {
> > WRITE_ONCE(*z, 1);
> > WRITE_ONCE(*y, x);
> > }
>
> Ah wait, you need a barrier between these two writes, don't you? I used
> an smp_store_release() but smp[w]_mb() should do too.
You are quite right, thank you! Here is the fixed version and output,
which LKMM still says is allowed.
Thanx, Paul
------------------------------------------------------------------------
C rsw
{
a=0;
x=z;
y=a;
z=0;
}
P0(int *x, int **y, int *z)
{
WRITE_ONCE(*z, 1);
smp_store_release(y, x);
}
P1(int *x, int **y, int *z)
{
r1 = READ_ONCE(*y);
r2 = READ_ONCE(*r1);
r3 = READ_ONCE(*x);
r4 = READ_ONCE(*r3);
}
filter(1:r1=x)
exists(1:r2=z /\ 1:r3=z /\ 1:r4=0)
------------------------------------------------------------------------
$ herd7 -conf linux-kernel.cfg /tmp/rsw.litmus
Test rsw Allowed
States 2
1:r2=z; 1:r3=z; 1:r4=0;
1:r2=z; 1:r3=z; 1:r4=1;
Ok
Witnesses
Positive: 1 Negative: 1
Condition exists (1:r2=z /\ 1:r3=z /\ 1:r4=0)
Observation rsw Sometimes 1 1
Time rsw 0.01
Hash=588486c0f4d521fa3ce559a19ed118d5
On Tue, Aug 02, 2022 at 06:49:21AM -0700, Paul E. McKenney wrote:
> On Tue, Aug 02, 2022 at 09:54:55AM +0100, Will Deacon wrote:
> > On Mon, Aug 01, 2022 at 12:20:35PM -0700, Paul E. McKenney wrote:
> > > On Mon, Aug 01, 2022 at 04:41:09PM +0100, Will Deacon wrote:
> > > > Apologies for the slow response here; believe it or not, I was attending
> > > > a workshop about memory ordering.
> > >
> > > Nice!!! Anything that I can/should know from that gathering? ;-)
> >
> > Oh come off it, you know this stuff already ;)
>
> Thank you for the kind words, but the most devastating learning disability
> of all is thinking that you already know everything about the topic
> in question. ;-)
>
> > > > On Sun, Jul 31, 2022 at 10:30:11AM -0700, Paul E. McKenney wrote:
> > > > > On Sun, Jul 31, 2022 at 09:51:47AM -0700, Linus Torvalds wrote:
> > > > > > Even alpha is specified to be locally ordered wrt *one* memory
> > > > > > location, including for reads (See table 5-1: "Processor issue order",
> > > > > > and also 5.6.2.2: "Litmus test 2"). So if a previous read has seen a
> > > > > > new value, a subsequent read is not allowed to see an older one - even
> > > > > > without a memory barrier.
> > > > > >
> > > > > > Will, Paul? Maybe that's only for overlapping loads/stores, not for
> > > > > > loads/loads. Because maybe alpha for once isn't the weakest possible
> > > > > > ordering.
> > > > >
> > > > > The "bad boy" in this case is Itanium, which can do some VLIW reordering
> > > > > of accesses. Or could, I am not sure that newer Itanium hardware
> > > > > does this. But this is why Itanium compilers made volatile loads use
> > > > > the ld,acq instruction.
> > > > >
> > > > > Which means that aligned same-sized marked accesses to a single location
> > > > > really do execute consistently with some global ordering, even on Itanium.
> > > >
> > > > Although this is true, there's a really subtle issue which crops up if you
> > > > try to compose this read-after-read ordering with dependencies in the case
> > > > where the two reads read the same value (which is encapsulated by the
> > > > unusual RSW litmus test that I've tried to convert to C below):
> > >
> > > RSW from the infamous test6.pdf, correct?
> >
> > That's the badger. I've no doubt that you're aware of it already, but I
> > thought it was a useful exercise to transcribe it to C and have it on the
> > mailing list for folks to look at.
>
> I have seen it, but this was nevertheless a useful reminder.
>
> > > > /* Global definitions; assume everything zero-initialised */
> > > > struct foo {
> > > > int *x;
> > > > };
> > > >
> > > > int x;
> > > > struct foo foo;
> > > > struct foo *ptr;
> > > >
> > > >
> > > > /* CPU 0 */
> > > > WRITE_ONCE(x, 1);
> > >
> > > Your x is RSW's z?
> >
> > Yes.
> >
> > > > WRITE_ONCE(foo.x, &x);
> > >
> > > And your foo.x is RSW's x? If so, the above WRITE_ONCE() could happen at
> > > compile time, correct? Or in the initialization clause of a litmus test?
> >
> > Yes, although I think it's a tiny bit more like real code to have it done
> > here, although it means that the "surprising" outcome relies on this being
> > reordered before the store to x.
> >
> > > > /*
> > > > * Release ordering to ensure that somebody following a non-NULL ptr will
> > > > * see a fully-initialised 'foo'. smp_[w]mb() would work as well.
> > > > */
> > > > smp_store_release(&ptr, &foo);
> > >
> > > Your ptr is RSW's y, correct?
> >
> > Yes.
> >
> > > > /* CPU 1 */
> > > > int *xp1, *xp2, val;
> > > > struct foo *foop;
> > > >
> > > > /* Load the global pointer and check that it's not NULL. */
> > > > foop = READ_ONCE(ptr);
> > > > if (!foop)
> > > > return;
> > >
> > > A litmus tests can do this via the filter clause.
> >
> > Indeed, but I was trying to make this look like C code for non-litmus
> > speakers!
> >
> > > > /*
> > > > * Load 'foo.x' via the pointer we just loaded. This is ordered after the
> > > > * previous READ_ONCE() because of the address dependency.
> > > > */
> > > > xp1 = READ_ONCE(foop->x);
> > > >
> > > > /*
> > > > * Load 'foo.x' directly via the global 'foo'.
> > > > * _This is loading the same address as the previous READ_ONCE() and
> > > > * therefore cannot return a stale (NULL) value!_
> > > > */
> > > > xp2 = READ_ONCE(foo.x);
> > >
> > > OK, same location, but RSW calls only for po, not addr from the initial
> > > read to this read, got it. (My first attempt left out this nuance,
> > > in case you were wondering.)
> >
> > Right, there is only po from the initial read to this read. If there was an
> > address dependency, then we'd have a chain of address dependencies from the
> > first read to the last read on this CPU and the result (of x == 0) would be
> > forbidden.
> >
> > > > /*
> > > > * Load 'x' via the pointer we just loaded.
> > > > * _We may see zero here!_
> > > > */
> > > > val = READ_ONCE(*xp2);
> > >
> > > And herd7/LKMM agree with this, at least assuming I got the litmus
> > > test right. (I used RSW's variables as a cross-check.)
> >
> > That's promising, but see below...
> >
> > > C rsw
> > >
> > > {
> > > a=0;
> > > x=z;
> > > y=a;
> > > z=0;
> > > }
> > >
> > > P0(int *x, int **y, int *z)
> > > {
> > > WRITE_ONCE(*z, 1);
> > > WRITE_ONCE(*y, x);
> > > }
> >
> > Ah wait, you need a barrier between these two writes, don't you? I used
> > an smp_store_release() but smp[w]_mb() should do too.
>
> You are quite right, thank you! Here is the fixed version and output,
> which LKMM still says is allowed.
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> C rsw
>
> {
> a=0;
> x=z;
> y=a;
> z=0;
> }
>
> P0(int *x, int **y, int *z)
> {
> WRITE_ONCE(*z, 1);
> smp_store_release(y, x);
And like test6.pdf says, this is still allowed even if these are a pair
of WRITE_ONCE() invocations separated by smp_mb(). So I took that form.
Thanx, Paul
> }
>
> P1(int *x, int **y, int *z)
> {
> r1 = READ_ONCE(*y);
> r2 = READ_ONCE(*r1);
> r3 = READ_ONCE(*x);
> r4 = READ_ONCE(*r3);
> }
>
> filter(1:r1=x)
> exists(1:r2=z /\ 1:r3=z /\ 1:r4=0)
>
> ------------------------------------------------------------------------
>
> $ herd7 -conf linux-kernel.cfg /tmp/rsw.litmus
> Test rsw Allowed
> States 2
> 1:r2=z; 1:r3=z; 1:r4=0;
> 1:r2=z; 1:r3=z; 1:r4=1;
> Ok
> Witnesses
> Positive: 1 Negative: 1
> Condition exists (1:r2=z /\ 1:r3=z /\ 1:r4=0)
> Observation rsw Sometimes 1 1
> Time rsw 0.01
> Hash=588486c0f4d521fa3ce559a19ed118d5
On Tue, 2 Aug 2022, Will Deacon wrote:
> On Tue, Aug 02, 2022 at 07:38:17AM -0400, Mikulas Patocka wrote:
> >
> >
> > On Tue, 2 Aug 2022, Will Deacon wrote:
> >
> > > On Mon, Aug 01, 2022 at 12:12:47PM -0400, Mikulas Patocka wrote:
> > > > On Mon, 1 Aug 2022, Will Deacon wrote:
> > > > > On Mon, Aug 01, 2022 at 06:42:15AM -0400, Mikulas Patocka wrote:
> > > > >
> > > > > > Index: linux-2.6/arch/x86/include/asm/bitops.h
> > > > > > ===================================================================
> > > > > > --- linux-2.6.orig/arch/x86/include/asm/bitops.h 2022-08-01 12:27:43.000000000 +0200
> > > > > > +++ linux-2.6/arch/x86/include/asm/bitops.h 2022-08-01 12:27:43.000000000 +0200
> > > > > > @@ -203,8 +203,10 @@ arch_test_and_change_bit(long nr, volati
> > > > > >
> > > > > > static __always_inline bool constant_test_bit(long nr, const volatile unsigned long *addr)
> > > > > > {
> > > > > > - return ((1UL << (nr & (BITS_PER_LONG-1))) &
> > > > > > + bool r = ((1UL << (nr & (BITS_PER_LONG-1))) &
> > > > > > (addr[nr >> _BITOPS_LONG_SHIFT])) != 0;
> > > > > > + barrier();
> > > > > > + return r;
> > > > >
> > > > > Hmm, I find it a bit weird to have a barrier() here given that 'addr' is
> > > > > volatile and we don't need a barrier() like this in the definition of
> > > > > READ_ONCE(), for example.
> > > >
> > > > gcc doesn't reorder two volatile accesses, but it can reorder non-volatile
> > > > accesses around volatile accesses.
> > > >
> > > > The purpose of the compiler barrier is to make sure that the non-volatile
> > > > accesses that follow test_bit are not reordered by the compiler before the
> > > > volatile access to addr.
> > >
> > > If we need these accesses to be ordered reliably, then we need a CPU barrier
> > > and that will additionally prevent the compiler reordering. So I still don't
> > > think we need the barrier() here.
> >
> > This is x86-specific code. x86 has strong memory ordering, so we only care
> > about compiler reordering.
>
> Indeed, but what I'm trying to say is that the _caller_ would have a memory
> barrier in this case, and so there's no need for one in here. test_bit() does
> not have ordering semantics.
>
> Will
But constant_test_bit() is also used for test_bit_acquire(), and for
test_bit_acquire(), the barrier is needed. Without the barrier, it doesn't
have acquire semantics, because the compiler (not CPU) can move the
following non-volatile accesses before the volatile access to "addr[nr >>
_BITOPS_LONG_SHIFT]".
See this piece of code in arch/x86/include/asm/bitops.h in the patch:
+#define arch_test_bit_acquire(nr, addr) \
+ arch_test_bit(nr, addr)
We could split constant_test_bit() to two functions: constant_test_bit()
and constant_test_bit_acquire() and put the barrier only in
constant_test_bit_acquire(). But I chose not to do it because code
duplication is bad and because the overhead of the compiler barrier is
none.
Mikulas
On Mon, Aug 01, 2022 at 11:01:40AM -0400, Mikulas Patocka wrote:
>
>
> On Mon, 1 Aug 2022, Matthew Wilcox wrote:
>
> > On Mon, Aug 01, 2022 at 06:43:55AM -0400, Mikulas Patocka wrote:
> > > Let's have a look at this piece of code in __bread_slow:
> > > get_bh(bh);
> > > bh->b_end_io = end_buffer_read_sync;
> > > submit_bh(REQ_OP_READ, 0, bh);
> > > wait_on_buffer(bh);
> > > if (buffer_uptodate(bh))
> > > return bh;
> > > Neither wait_on_buffer nor buffer_uptodate contain any memory barrier.
> > > Consequently, if someone calls sb_bread and then reads the buffer data,
> > > the read of buffer data may be executed before wait_on_buffer(bh) on
> > > architectures with weak memory ordering and it may return invalid data.
> > >
> > > Fix this bug by changing the function buffer_locked to have the acquire
> > > semantics - so that code that follows buffer_locked cannot be moved before
> > > it.
> >
> > I think this is the wrong approach. Instead, buffer_set_uptodate()
> > should have the smp_wmb() and buffer_uptodate should have the smp_rmb().
> > Just like the page flags. As I said last night.
>
> Linus said that he prefers acquire/release to smp_rmb/smp_wmb. So, sort it
> out with him :)
>
> In most cases, the buffer is set uptodate while it is locked, so that
> there is no race on the uptodate flag (the race exists on the locked
> flag). Are there any cases where the uptodate flag is modified on unlocked
> buffer, so that it needs special treatment too?
I think you misunderstand the purpose of locked/uptodate. At least
for pages, the lock flag does not order access to the data in the page.
Indeed, the contents of the page can be changed while you hold the lock.
But the uptodate flag does order access to the data. At the point where
you can observe the uptodate flag set, you know the contents of the page
have been completely read from storage. And you don't need to hold the
lock to check the uptodate flag. So this is wrong:
buffer_lock()
*data = 0x12345678;
buffer_set_uptodate_not_ordered()
buffer_unlock_ordered()
because a reader can do:
while (!buffer_test_uptodate()) {
buffer_lock();
buffer_unlock();
}
x = *data;
and get x != 0x12345678 because the compiler can move the
buffer_set_uptodate_not_ordered() before the store to *data.
On Fri, 5 Aug 2022, Matthew Wilcox wrote:
> On Mon, Aug 01, 2022 at 11:01:40AM -0400, Mikulas Patocka wrote:
> >
> > In most cases, the buffer is set uptodate while it is locked, so that
> > there is no race on the uptodate flag (the race exists on the locked
> > flag). Are there any cases where the uptodate flag is modified on unlocked
> > buffer, so that it needs special treatment too?
>
> I think you misunderstand the purpose of locked/uptodate. At least
> for pages, the lock flag does not order access to the data in the page.
> Indeed, the contents of the page can be changed while you hold the lock.
> But the uptodate flag does order access to the data. At the point where
> you can observe the uptodate flag set, you know the contents of the page
> have been completely read from storage. And you don't need to hold the
> lock to check the uptodate flag. So this is wrong:
>
> buffer_lock()
> *data = 0x12345678;
> buffer_set_uptodate_not_ordered()
> buffer_unlock_ordered()
>
> because a reader can do:
>
> while (!buffer_test_uptodate()) {
> buffer_lock();
> buffer_unlock();
> }
> x = *data;
>
> and get x != 0x12345678 because the compiler can move the
> buffer_set_uptodate_not_ordered() before the store to *data.
Thanks for explanation. Would you like this patch?
From: Mikulas Patocka <[email protected]>
Let's have a look at this piece of code in __bread_slow:
get_bh(bh);
bh->b_end_io = end_buffer_read_sync;
submit_bh(REQ_OP_READ, 0, bh);
wait_on_buffer(bh);
if (buffer_uptodate(bh))
return bh;
Neither wait_on_buffer nor buffer_uptodate contain any memory barrier.
Consequently, if someone calls sb_bread and then reads the buffer data,
the read of buffer data may be executed before wait_on_buffer(bh) on
architectures with weak memory ordering and it may return invalid data.
Fix this bug by adding a write memory barrier to set_buffer_uptodate and a
read memory barrier to buffer_uptodate (in the same way as
folio_test_uptodate and folio_mark_uptodate).
We also add a barrier to buffer_locked - it pairs with a barrier in
unlock_buffer.
Signed-off-by: Mikulas Patocka <[email protected]>
Cc: [email protected]
Index: linux-2.6/include/linux/buffer_head.h
===================================================================
--- linux-2.6.orig/include/linux/buffer_head.h
+++ linux-2.6/include/linux/buffer_head.h
@@ -117,10 +117,8 @@ static __always_inline int test_clear_bu
* of the form "mark_buffer_foo()". These are higher-level functions which
* do something in addition to setting a b_state bit.
*/
-BUFFER_FNS(Uptodate, uptodate)
BUFFER_FNS(Dirty, dirty)
TAS_BUFFER_FNS(Dirty, dirty)
-BUFFER_FNS(Lock, locked)
BUFFER_FNS(Req, req)
TAS_BUFFER_FNS(Req, req)
BUFFER_FNS(Mapped, mapped)
@@ -135,6 +133,49 @@ BUFFER_FNS(Meta, meta)
BUFFER_FNS(Prio, prio)
BUFFER_FNS(Defer_Completion, defer_completion)
+static __always_inline void set_buffer_uptodate(struct buffer_head *bh)
+{
+ /*
+ * make it consistent with folio_mark_uptodate
+ * pairs with smp_acquire__after_ctrl_dep in buffer_uptodate
+ */
+ smp_wmb();
+ set_bit(BH_Uptodate, &bh->b_state);
+}
+
+static __always_inline void clear_buffer_uptodate(struct buffer_head *bh)
+{
+ clear_bit(BH_Uptodate, &bh->b_state);
+}
+
+static __always_inline int buffer_uptodate(const struct buffer_head *bh)
+{
+ bool ret = test_bit(BH_Uptodate, &bh->b_state);
+ /*
+ * make it consistent with folio_test_uptodate
+ * pairs with smp_wmb in set_buffer_uptodate
+ */
+ if (ret)
+ smp_acquire__after_ctrl_dep();
+ return ret;
+}
+
+static __always_inline void set_buffer_locked(struct buffer_head *bh)
+{
+ set_bit(BH_Lock, &bh->b_state);
+}
+
+static __always_inline int buffer_locked(const struct buffer_head *bh)
+{
+ bool ret = test_bit(BH_Lock, &bh->b_state);
+ /*
+ * pairs with smp_mb__after_atomic in unlock_buffer
+ */
+ if (!ret)
+ smp_acquire__after_ctrl_dep();
+ return ret;
+}
+
#define bh_offset(bh) ((unsigned long)(bh)->b_data & ~PAGE_MASK)
/* If we *know* page->private refers to buffer_heads */
On Sun, Aug 07, 2022 at 07:37:22AM -0400, Mikulas Patocka wrote:
> @@ -135,6 +133,49 @@ BUFFER_FNS(Meta, meta)
> BUFFER_FNS(Prio, prio)
> BUFFER_FNS(Defer_Completion, defer_completion)
>
> +static __always_inline void set_buffer_uptodate(struct buffer_head *bh)
> +{
> + /*
> + * make it consistent with folio_mark_uptodate
> + * pairs with smp_acquire__after_ctrl_dep in buffer_uptodate
> + */
> + smp_wmb();
> + set_bit(BH_Uptodate, &bh->b_state);
> +}
> +
> +static __always_inline void clear_buffer_uptodate(struct buffer_head *bh)
> +{
> + clear_bit(BH_Uptodate, &bh->b_state);
> +}
> +
> +static __always_inline int buffer_uptodate(const struct buffer_head *bh)
> +{
> + bool ret = test_bit(BH_Uptodate, &bh->b_state);
> + /*
> + * make it consistent with folio_test_uptodate
> + * pairs with smp_wmb in set_buffer_uptodate
> + */
> + if (ret)
> + smp_acquire__after_ctrl_dep();
> + return ret;
> +}
This all works for me. While we have the experts paying attention,
would it be better to do
return smp_load_acquire(&bh->b_state) & (1L << BH_Uptodate) > 0;
> +static __always_inline void set_buffer_locked(struct buffer_head *bh)
> +{
> + set_bit(BH_Lock, &bh->b_state);
> +}
> +
> +static __always_inline int buffer_locked(const struct buffer_head *bh)
> +{
> + bool ret = test_bit(BH_Lock, &bh->b_state);
> + /*
> + * pairs with smp_mb__after_atomic in unlock_buffer
> + */
> + if (!ret)
> + smp_acquire__after_ctrl_dep();
> + return ret;
> +}
Are there places that think that lock/unlock buffer implies a memory
barrier?
On Sun, 7 Aug 2022, Matthew Wilcox wrote:
> On Sun, Aug 07, 2022 at 07:37:22AM -0400, Mikulas Patocka wrote:
> > @@ -135,6 +133,49 @@ BUFFER_FNS(Meta, meta)
> > BUFFER_FNS(Prio, prio)
> > BUFFER_FNS(Defer_Completion, defer_completion)
> >
> > +static __always_inline void set_buffer_uptodate(struct buffer_head *bh)
> > +{
> > + /*
> > + * make it consistent with folio_mark_uptodate
> > + * pairs with smp_acquire__after_ctrl_dep in buffer_uptodate
> > + */
> > + smp_wmb();
> > + set_bit(BH_Uptodate, &bh->b_state);
> > +}
> > +
> > +static __always_inline void clear_buffer_uptodate(struct buffer_head *bh)
> > +{
> > + clear_bit(BH_Uptodate, &bh->b_state);
> > +}
> > +
> > +static __always_inline int buffer_uptodate(const struct buffer_head *bh)
> > +{
> > + bool ret = test_bit(BH_Uptodate, &bh->b_state);
> > + /*
> > + * make it consistent with folio_test_uptodate
> > + * pairs with smp_wmb in set_buffer_uptodate
> > + */
> > + if (ret)
> > + smp_acquire__after_ctrl_dep();
> > + return ret;
> > +}
>
> This all works for me. While we have the experts paying attention,
> would it be better to do
>
> return smp_load_acquire(&bh->b_state) & (1L << BH_Uptodate) > 0;
Yes, it may be nicer.
> > +static __always_inline void set_buffer_locked(struct buffer_head *bh)
> > +{
> > + set_bit(BH_Lock, &bh->b_state);
> > +}
> > +
> > +static __always_inline int buffer_locked(const struct buffer_head *bh)
> > +{
> > + bool ret = test_bit(BH_Lock, &bh->b_state);
> > + /*
> > + * pairs with smp_mb__after_atomic in unlock_buffer
> > + */
> > + if (!ret)
> > + smp_acquire__after_ctrl_dep();
> > + return ret;
> > +}
>
> Are there places that think that lock/unlock buffer implies a memory
> barrier?
There's this in fs/reiserfs:
if (!buffer_dirty(bh) && !buffer_locked(bh)) {
reiserfs_free_jh(bh); <--- this could be moved before buffer_locked
if (buffer_locked((journal->j_header_bh))) {
...
}
journal->j_last_flush_trans_id = trans_id;
journal->j_first_unflushed_offset = offset;
jh = (struct reiserfs_journal_header *)(journal->j_header_bh->b_data); <--- this could be moved before buffer_locked
Mikulas
On Mon, Aug 08, 2022 at 10:26:10AM -0400, Mikulas Patocka wrote:
> On Sun, 7 Aug 2022, Matthew Wilcox wrote:
> > > +static __always_inline void set_buffer_locked(struct buffer_head *bh)
> > > +{
> > > + set_bit(BH_Lock, &bh->b_state);
> > > +}
> > > +
> > > +static __always_inline int buffer_locked(const struct buffer_head *bh)
> > > +{
> > > + bool ret = test_bit(BH_Lock, &bh->b_state);
> > > + /*
> > > + * pairs with smp_mb__after_atomic in unlock_buffer
> > > + */
> > > + if (!ret)
> > > + smp_acquire__after_ctrl_dep();
> > > + return ret;
> > > +}
> >
> > Are there places that think that lock/unlock buffer implies a memory
> > barrier?
>
> There's this in fs/reiserfs:
>
> if (!buffer_dirty(bh) && !buffer_locked(bh)) {
> reiserfs_free_jh(bh); <--- this could be moved before buffer_locked
It might be better to think of buffer_locked() as
buffer_someone_has_exclusive_access(). I can't see the problem with
moving the reads in reiserfs_free_jh() before the read of buffer_locked.
> if (buffer_locked((journal->j_header_bh))) {
> ...
> }
> journal->j_last_flush_trans_id = trans_id;
> journal->j_first_unflushed_offset = offset;
> jh = (struct reiserfs_journal_header *)(journal->j_header_bh->b_data); <--- this could be moved before buffer_locked
I don't think b_data is going to be changed while someone else holds
the buffer locked. That's initialised by set_bh_page(), which is an
initialisation-time thing, before the BH is visible to any other thread.
On Mon, 8 Aug 2022, Matthew Wilcox wrote:
> On Mon, Aug 08, 2022 at 10:26:10AM -0400, Mikulas Patocka wrote:
> > On Sun, 7 Aug 2022, Matthew Wilcox wrote:
> > > > +static __always_inline void set_buffer_locked(struct buffer_head *bh)
> > > > +{
> > > > + set_bit(BH_Lock, &bh->b_state);
> > > > +}
> > > > +
> > > > +static __always_inline int buffer_locked(const struct buffer_head *bh)
> > > > +{
> > > > + bool ret = test_bit(BH_Lock, &bh->b_state);
> > > > + /*
> > > > + * pairs with smp_mb__after_atomic in unlock_buffer
> > > > + */
> > > > + if (!ret)
> > > > + smp_acquire__after_ctrl_dep();
> > > > + return ret;
> > > > +}
> > >
> > > Are there places that think that lock/unlock buffer implies a memory
> > > barrier?
> >
> > There's this in fs/reiserfs:
> >
> > if (!buffer_dirty(bh) && !buffer_locked(bh)) {
> > reiserfs_free_jh(bh); <--- this could be moved before buffer_locked
>
> It might be better to think of buffer_locked() as
> buffer_someone_has_exclusive_access(). I can't see the problem with
> moving the reads in reiserfs_free_jh() before the read of buffer_locked.
>
> > if (buffer_locked((journal->j_header_bh))) {
> > ...
> > }
> > journal->j_last_flush_trans_id = trans_id;
> > journal->j_first_unflushed_offset = offset;
> > jh = (struct reiserfs_journal_header *)(journal->j_header_bh->b_data); <--- this could be moved before buffer_locked
>
> I don't think b_data is going to be changed while someone else holds
> the buffer locked. That's initialised by set_bh_page(), which is an
> initialisation-time thing, before the BH is visible to any other thread.
So, do you think that we don't need a barrier in buffer_locked()?
There is also this (where the BUG_ON(!buffer_uptodate(bh)) saves it).
if (buffer_locked(bh)) {
int depth;
PROC_INFO_INC(sb, scan_bitmap.wait);
depth = reiserfs_write_unlock_nested(sb);
__wait_on_buffer(bh);
reiserfs_write_lock_nested(sb, depth);
}
BUG_ON(!buffer_uptodate(bh));
BUG_ON(atomic_read(&bh->b_count) == 0);
if (info->free_count == UINT_MAX)
reiserfs_cache_bitmap_metadata(sb, bh, info); <--- this could be moved before buffer_locked if there were no BUG_ONs
Mikulas
On Mon, Aug 08, 2022 at 10:57:45AM -0400, Mikulas Patocka wrote:
>
>
> On Mon, 8 Aug 2022, Matthew Wilcox wrote:
>
> > On Mon, Aug 08, 2022 at 10:26:10AM -0400, Mikulas Patocka wrote:
> > > On Sun, 7 Aug 2022, Matthew Wilcox wrote:
> > > > > +static __always_inline void set_buffer_locked(struct buffer_head *bh)
> > > > > +{
> > > > > + set_bit(BH_Lock, &bh->b_state);
> > > > > +}
> > > > > +
> > > > > +static __always_inline int buffer_locked(const struct buffer_head *bh)
> > > > > +{
> > > > > + bool ret = test_bit(BH_Lock, &bh->b_state);
> > > > > + /*
> > > > > + * pairs with smp_mb__after_atomic in unlock_buffer
> > > > > + */
> > > > > + if (!ret)
> > > > > + smp_acquire__after_ctrl_dep();
> > > > > + return ret;
> > > > > +}
> > > >
> > > > Are there places that think that lock/unlock buffer implies a memory
> > > > barrier?
> > >
> > > There's this in fs/reiserfs:
> > >
> > > if (!buffer_dirty(bh) && !buffer_locked(bh)) {
> > > reiserfs_free_jh(bh); <--- this could be moved before buffer_locked
> >
> > It might be better to think of buffer_locked() as
> > buffer_someone_has_exclusive_access(). I can't see the problem with
> > moving the reads in reiserfs_free_jh() before the read of buffer_locked.
> >
> > > if (buffer_locked((journal->j_header_bh))) {
> > > ...
> > > }
> > > journal->j_last_flush_trans_id = trans_id;
> > > journal->j_first_unflushed_offset = offset;
> > > jh = (struct reiserfs_journal_header *)(journal->j_header_bh->b_data); <--- this could be moved before buffer_locked
> >
> > I don't think b_data is going to be changed while someone else holds
> > the buffer locked. That's initialised by set_bh_page(), which is an
> > initialisation-time thing, before the BH is visible to any other thread.
>
> So, do you think that we don't need a barrier in buffer_locked()?
The question to ask here is "What prevents another call to buffer_locked()
from returning false?"
> There is also this (where the BUG_ON(!buffer_uptodate(bh)) saves it).
> if (buffer_locked(bh)) {
Right here, for example. If something prevents any change that might
cause buffer_locked() to return false here, we don't need a barrier.
If there is nothing preventing such a change, how is a barrier going
to help?
One way this code could be correct is if the above check is a heuristic,
so that a false positive just consumes a bit more CPU and a false negative
just delays this action.
I must leave final judgment to those having better understanding of this
code than do I.
Thanx, Paul
> int depth;
> PROC_INFO_INC(sb, scan_bitmap.wait);
> depth = reiserfs_write_unlock_nested(sb);
> __wait_on_buffer(bh);
> reiserfs_write_lock_nested(sb, depth);
> }
> BUG_ON(!buffer_uptodate(bh));
> BUG_ON(atomic_read(&bh->b_count) == 0);
>
> if (info->free_count == UINT_MAX)
> reiserfs_cache_bitmap_metadata(sb, bh, info); <--- this could be moved before buffer_locked if there were no BUG_ONs
>
> Mikulas
>
On Mon, Aug 08, 2022 at 10:57:45AM -0400, Mikulas Patocka wrote:
> On Mon, 8 Aug 2022, Matthew Wilcox wrote:
>
> > On Mon, Aug 08, 2022 at 10:26:10AM -0400, Mikulas Patocka wrote:
> > > On Sun, 7 Aug 2022, Matthew Wilcox wrote:
> > > > > +static __always_inline void set_buffer_locked(struct buffer_head *bh)
> > > > > +{
> > > > > + set_bit(BH_Lock, &bh->b_state);
> > > > > +}
> > > > > +
> > > > > +static __always_inline int buffer_locked(const struct buffer_head *bh)
> > > > > +{
> > > > > + bool ret = test_bit(BH_Lock, &bh->b_state);
> > > > > + /*
> > > > > + * pairs with smp_mb__after_atomic in unlock_buffer
> > > > > + */
> > > > > + if (!ret)
> > > > > + smp_acquire__after_ctrl_dep();
> > > > > + return ret;
> > > > > +}
> > > >
> > > > Are there places that think that lock/unlock buffer implies a memory
> > > > barrier?
> > >
> > > There's this in fs/reiserfs:
> > >
> > > if (!buffer_dirty(bh) && !buffer_locked(bh)) {
> > > reiserfs_free_jh(bh); <--- this could be moved before buffer_locked
> >
> > It might be better to think of buffer_locked() as
> > buffer_someone_has_exclusive_access(). I can't see the problem with
> > moving the reads in reiserfs_free_jh() before the read of buffer_locked.
> >
> > > if (buffer_locked((journal->j_header_bh))) {
> > > ...
> > > }
> > > journal->j_last_flush_trans_id = trans_id;
> > > journal->j_first_unflushed_offset = offset;
> > > jh = (struct reiserfs_journal_header *)(journal->j_header_bh->b_data); <--- this could be moved before buffer_locked
> >
> > I don't think b_data is going to be changed while someone else holds
> > the buffer locked. That's initialised by set_bh_page(), which is an
> > initialisation-time thing, before the BH is visible to any other thread.
>
> So, do you think that we don't need a barrier in buffer_locked()?
That's my feeling. Of course, you might not be the only one confused,
and if fs authors in general have made the mistake of thinking that
buffer_locked is serialising, then it might be better to live up to
that expectation.
> There is also this (where the BUG_ON(!buffer_uptodate(bh)) saves it).
> if (buffer_locked(bh)) {
> int depth;
> PROC_INFO_INC(sb, scan_bitmap.wait);
> depth = reiserfs_write_unlock_nested(sb);
> __wait_on_buffer(bh);
> reiserfs_write_lock_nested(sb, depth);
> }
> BUG_ON(!buffer_uptodate(bh));
> BUG_ON(atomic_read(&bh->b_count) == 0);
>
> if (info->free_count == UINT_MAX)
> reiserfs_cache_bitmap_metadata(sb, bh, info); <--- this could be moved before buffer_locked if there were no BUG_ONs
It could be moved before buffer_locked(), but I don't see the harm in
that. Look at how reiserfs_read_bitmap_block() gets the bh:
bh = sb_bread(sb, block);
__bread_gfp() has either already read the buffer (and it's uptodate),
in which case it returns it. Or it calls __bread_slow() which will do
the read and check uptodate before returning it. I wouldn't be surprised
to find that this buffer_locked() check is actually dead code, but I have
no desire to dive into reiserfs far enough to find out whether it's dead
code or not.
On Mon, 8 Aug 2022, Matthew Wilcox wrote:
> On Mon, Aug 08, 2022 at 10:57:45AM -0400, Mikulas Patocka wrote:
> > On Mon, 8 Aug 2022, Matthew Wilcox wrote:
> >
> > > On Mon, Aug 08, 2022 at 10:26:10AM -0400, Mikulas Patocka wrote:
> > > > On Sun, 7 Aug 2022, Matthew Wilcox wrote:
> > > > > > +static __always_inline void set_buffer_locked(struct buffer_head *bh)
> > > > > > +{
> > > > > > + set_bit(BH_Lock, &bh->b_state);
> > > > > > +}
> > > > > > +
> > > > > > +static __always_inline int buffer_locked(const struct buffer_head *bh)
> > > > > > +{
> > > > > > + bool ret = test_bit(BH_Lock, &bh->b_state);
> > > > > > + /*
> > > > > > + * pairs with smp_mb__after_atomic in unlock_buffer
> > > > > > + */
> > > > > > + if (!ret)
> > > > > > + smp_acquire__after_ctrl_dep();
> > > > > > + return ret;
> > > > > > +}
> > > > >
> > > > > Are there places that think that lock/unlock buffer implies a memory
> > > > > barrier?
> > > >
> > > > There's this in fs/reiserfs:
> > > >
> > > > if (!buffer_dirty(bh) && !buffer_locked(bh)) {
> > > > reiserfs_free_jh(bh); <--- this could be moved before buffer_locked
> > >
> > > It might be better to think of buffer_locked() as
> > > buffer_someone_has_exclusive_access(). I can't see the problem with
> > > moving the reads in reiserfs_free_jh() before the read of buffer_locked.
> > >
> > > > if (buffer_locked((journal->j_header_bh))) {
> > > > ...
> > > > }
> > > > journal->j_last_flush_trans_id = trans_id;
> > > > journal->j_first_unflushed_offset = offset;
> > > > jh = (struct reiserfs_journal_header *)(journal->j_header_bh->b_data); <--- this could be moved before buffer_locked
> > >
> > > I don't think b_data is going to be changed while someone else holds
> > > the buffer locked. That's initialised by set_bh_page(), which is an
> > > initialisation-time thing, before the BH is visible to any other thread.
> >
> > So, do you think that we don't need a barrier in buffer_locked()?
>
> That's my feeling. Of course, you might not be the only one confused,
> and if fs authors in general have made the mistake of thinking that
> buffer_locked is serialising, then it might be better to live up to
> that expectation.
In my spadfs filesystem, I used lock_buffer/unlock_buffer to prevent the
system from seeing or writing back incomplete data. The patterns is
lock_buffer(bh);
... do several changes to the buffer that should appear atomically
unlock_buffer(bh);
mark_buffer_dirty(bh);
but it seems to be ok, because both lock_buffer and unlock_buffer have
acquire/release semantics. I'm not sure about buffer_locked - perhaps it
really doesn't need the barriers - spin_is_locked, mutex_is_locked and
rwsem_is_locked also don't have any barriers.
Here I'm sending the patch without the change to buffer_locked.
Mikulas
From: Mikulas Patocka <[email protected]>
Let's have a look at this piece of code in __bread_slow:
get_bh(bh);
bh->b_end_io = end_buffer_read_sync;
submit_bh(REQ_OP_READ, 0, bh);
wait_on_buffer(bh);
if (buffer_uptodate(bh))
return bh;
Neither wait_on_buffer nor buffer_uptodate contain any memory barrier.
Consequently, if someone calls sb_bread and then reads the buffer data,
the read of buffer data may be executed before wait_on_buffer(bh) on
architectures with weak memory ordering and it may return invalid data.
Fix this bug by adding a memory barrier to set_buffer_uptodate and an
acquire barrier to buffer_uptodate (in a similar way as
folio_test_uptodate and folio_mark_uptodate).
Signed-off-by: Mikulas Patocka <[email protected]>
Cc: [email protected]
Index: linux-2.6/include/linux/buffer_head.h
===================================================================
--- linux-2.6.orig/include/linux/buffer_head.h
+++ linux-2.6/include/linux/buffer_head.h
@@ -117,7 +117,6 @@ static __always_inline int test_clear_bu
* of the form "mark_buffer_foo()". These are higher-level functions which
* do something in addition to setting a b_state bit.
*/
-BUFFER_FNS(Uptodate, uptodate)
BUFFER_FNS(Dirty, dirty)
TAS_BUFFER_FNS(Dirty, dirty)
BUFFER_FNS(Lock, locked)
@@ -135,6 +134,30 @@ BUFFER_FNS(Meta, meta)
BUFFER_FNS(Prio, prio)
BUFFER_FNS(Defer_Completion, defer_completion)
+static __always_inline void set_buffer_uptodate(struct buffer_head *bh)
+{
+ /*
+ * make it consistent with folio_mark_uptodate
+ * pairs with smp_load_acquire in buffer_uptodate
+ */
+ smp_mb__before_atomic();
+ set_bit(BH_Uptodate, &bh->b_state);
+}
+
+static __always_inline void clear_buffer_uptodate(struct buffer_head *bh)
+{
+ clear_bit(BH_Uptodate, &bh->b_state);
+}
+
+static __always_inline int buffer_uptodate(const struct buffer_head *bh)
+{
+ /*
+ * make it consistent with folio_test_uptodate
+ * pairs with smp_mb__before_atomic in set_buffer_uptodate
+ */
+ return (smp_load_acquire(&bh->b_state) & (1UL << BH_Uptodate)) != 0;
+}
+
#define bh_offset(bh) ((unsigned long)(bh)->b_data & ~PAGE_MASK)
/* If we *know* page->private refers to buffer_heads */
On Tue, Aug 09, 2022 at 02:32:13PM -0400, Mikulas Patocka wrote:
> From: Mikulas Patocka <[email protected]>
>
> Let's have a look at this piece of code in __bread_slow:
> get_bh(bh);
> bh->b_end_io = end_buffer_read_sync;
> submit_bh(REQ_OP_READ, 0, bh);
> wait_on_buffer(bh);
> if (buffer_uptodate(bh))
> return bh;
> Neither wait_on_buffer nor buffer_uptodate contain any memory barrier.
> Consequently, if someone calls sb_bread and then reads the buffer data,
> the read of buffer data may be executed before wait_on_buffer(bh) on
> architectures with weak memory ordering and it may return invalid data.
>
> Fix this bug by adding a memory barrier to set_buffer_uptodate and an
> acquire barrier to buffer_uptodate (in a similar way as
> folio_test_uptodate and folio_mark_uptodate).
>
> Signed-off-by: Mikulas Patocka <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
> Cc: [email protected]
>
> Index: linux-2.6/include/linux/buffer_head.h
> ===================================================================
> --- linux-2.6.orig/include/linux/buffer_head.h
> +++ linux-2.6/include/linux/buffer_head.h
> @@ -117,7 +117,6 @@ static __always_inline int test_clear_bu
> * of the form "mark_buffer_foo()". These are higher-level functions which
> * do something in addition to setting a b_state bit.
> */
> -BUFFER_FNS(Uptodate, uptodate)
> BUFFER_FNS(Dirty, dirty)
> TAS_BUFFER_FNS(Dirty, dirty)
> BUFFER_FNS(Lock, locked)
> @@ -135,6 +134,30 @@ BUFFER_FNS(Meta, meta)
> BUFFER_FNS(Prio, prio)
> BUFFER_FNS(Defer_Completion, defer_completion)
>
> +static __always_inline void set_buffer_uptodate(struct buffer_head *bh)
> +{
> + /*
> + * make it consistent with folio_mark_uptodate
> + * pairs with smp_load_acquire in buffer_uptodate
> + */
> + smp_mb__before_atomic();
> + set_bit(BH_Uptodate, &bh->b_state);
> +}
> +
> +static __always_inline void clear_buffer_uptodate(struct buffer_head *bh)
> +{
> + clear_bit(BH_Uptodate, &bh->b_state);
> +}
> +
> +static __always_inline int buffer_uptodate(const struct buffer_head *bh)
> +{
> + /*
> + * make it consistent with folio_test_uptodate
> + * pairs with smp_mb__before_atomic in set_buffer_uptodate
> + */
> + return (smp_load_acquire(&bh->b_state) & (1UL << BH_Uptodate)) != 0;
> +}
> +
> #define bh_offset(bh) ((unsigned long)(bh)->b_data & ~PAGE_MASK)
>
> /* If we *know* page->private refers to buffer_heads */
>
On Tue, Aug 9, 2022 at 11:32 AM Mikulas Patocka <[email protected]> wrote:
>
> Let's have a look at this piece of code in __bread_slow:
> get_bh(bh);
> bh->b_end_io = end_buffer_read_sync;
> submit_bh(REQ_OP_READ, 0, bh);
> wait_on_buffer(bh);
> if (buffer_uptodate(bh))
> return bh;
> Neither wait_on_buffer nor buffer_uptodate contain any memory barrier.
> Consequently, if someone calls sb_bread and then reads the buffer data,
> the read of buffer data may be executed before wait_on_buffer(bh) on
> architectures with weak memory ordering and it may return invalid data.
>
> Fix this bug by adding a memory barrier to set_buffer_uptodate and an
> acquire barrier to buffer_uptodate (in a similar way as
> folio_test_uptodate and folio_mark_uptodate).
Ok, I've applied this to my tree.
I still feel that we should probably take a long look at having the
proper "acquire/release" uses everywhere for the buffer / page / folio
flags, but that wouldn't really work for backporting to stable, so I
think that's a "future fixes/cleanup" thing.
Thanks,
Linus