LinuxLists.cc - Reasoning about memory ordering

2018-02-23 12:32:50

Subject: Reasoning about memory ordering

Hello,

I'm cc'ing a bunch of people I know are well-versed in
the black arts of memory ordering!

Currently in btrfs we have roughly the following sequence:

T1: T2:
i_size_write(inode, newsize);
set_bit(BTRFS_INODE_READDIO_NEED_LOCK, &inode->runtime_flags); atomic_inc(&inode->i_dio_count);
smp_mb(); if (iov_iter_rw(iter) == READ) {
if (test_bit(BTRFS_INODE_READDIO_NEED_LOCK, &BTRFS_I(inode)->runtime_flags)) {
if (atomic_read(&inode->i_dio_count)) { if (atomic_dec_and_test(&inode->i_dio_count))
wait_queue_head_t *wq = bit_waitqueue(&inode->i_state, __I_DIO_WAKEUP); wake_up_bit(&inode->i_state, __I_DIO_WAKEUP);
DEFINE_WAIT_BIT(q, &inode->i_state, __I_DIO_WAKEUP); }
if (offset >= i_size_read(inode))
do { return;
prepare_to_wait(wq, &q.wq_entry, TASK_UNINTERRUPTIBLE); }
if (atomic_read(&inode->i_dio_count))
schedule();
} while (atomic_read(&inode->i_dio_count));
finish_wait(wq, &q.wq_entry);
}

smp_mb__before_atomic();
clear_bit(BTRFS_INODE_READDIO_NEED_LOCK, &inode->runtime_flags);

The semantics I'm after are:

1. If T1 goes to sleep, then T2 would see the
BTRFS_INODE_READDIO_NEED_LOCK and hence will execute the
atomic_dec_and_test and possibly wake up T1. This flag serves as a way
to indicate to possibly multiple T2 (dio readers) that T1 is blocked
and they should unblock it and resort to acquiring some locks (this is not
visible in this excerpt of code for brevity). It's sort of a back-off
mechanism.

2. BTRFS_INODE_READDIO_NEED_LOCK bit must be set _before_ going to sleep

3. BTRFS_INODE_READDIO_NEED_LOCK must be cleared _after_ the thread has
been woken up.

4. After T1 is woken up, it's possible that a new T2 comes and doesn't see
the BTRFS_INODE_READDIO_NEED_LOCK flag set but this is fine, since the check
for i_size should cause T2 to just return (it will also execute atomic_dec_and_test)

Given this is the current state of the code (it's part of btrfs) I believe
the following could/should be done:

1. The smp_mb after the set_bit in T1 could be removed, since there is
already an implied full mm in prepare_to_wait. That is if we go to sleep,
then T2 is guaranteed to see the flag/i_size_write happening by merit of
the implied memory barrier in prepare_to_wait/schedule. But what if it doesn't
go to sleep? I still would like the i_size_write to be visible to T2

2. The bit clearing code in T1 should be possible to be replaced by
clear_bit_unlock (this was suggested by PeterZ on IRC).

3. I suspect there is a memory barrier in T2 that is missing. Perhaps
there should be an smp_mb__before_atomic right before the test_bit so that
it's ordered with the implied smp_mb in T1's prepare_to_wait.

2018-02-23 15:40:06

by Alan Cox

[permalink] [raw]

Subject: Re: Reasoning about memory ordering

> Given this is the current state of the code (it's part of btrfs) I believe
> the following could/should be done:

Is there benchmarking data to show that a custom lock is justified
(especiaally given it's going to mean btrfs and rtlinux don't play
together nicely since it won't be able to see the mutex lock and do
priority boosting ?)

Alan

2018-02-23 16:02:21

by Nikolay Borisov

[permalink] [raw]

Subject: Re: Reasoning about memory ordering

On 23.02.2018 17:38, Alan Cox wrote:
>> Given this is the current state of the code (it's part of btrfs) I believe
>> the following could/should be done:
>
> Is there benchmarking data to show that a custom lock is justified
> (especiaally given it's going to mean btrfs and rtlinux don't play
> together nicely since it won't be able to see the mutex lock and do
> priority boosting ?)

No, unfortunately this is not code that was written by me. I'm just
arguing it's not entirely correct. Generally in btrfs we'd like to allow
concurrent DIO read/write accesses to unrelated portionsin a file. Hence
I can't just use inode_Lock (DIO writes are synchronized with
inode_lock). Also I cannot use inode_shared_Lock to synchronize multiple
DIO reader against writers (including truncate) since this tanks
perfromance and I already tried[0] this approach and received a NAK...

[0] https://patchwork.kernel.org/patch/10232015/

>
> Alan
>
>
>

2018-02-23 17:33:12

by Andrea Parri

[permalink] [raw]

Subject: Re: Reasoning about memory ordering

On Fri, Feb 23, 2018 at 02:30:22PM +0200, Nikolay Borisov wrote:
> Hello,
>
> I'm cc'ing a bunch of people I know are well-versed in
> the black arts of memory ordering!
>
> Currently in btrfs we have roughly the following sequence:
>
> T1: T2:
> i_size_write(inode, newsize);
> set_bit(BTRFS_INODE_READDIO_NEED_LOCK, &inode->runtime_flags); atomic_inc(&inode->i_dio_count);
> smp_mb(); if (iov_iter_rw(iter) == READ) {
> if (test_bit(BTRFS_INODE_READDIO_NEED_LOCK, &BTRFS_I(inode)->runtime_flags)) {
> if (atomic_read(&inode->i_dio_count)) { if (atomic_dec_and_test(&inode->i_dio_count))
> wait_queue_head_t *wq = bit_waitqueue(&inode->i_state, __I_DIO_WAKEUP); wake_up_bit(&inode->i_state, __I_DIO_WAKEUP);
> DEFINE_WAIT_BIT(q, &inode->i_state, __I_DIO_WAKEUP); }
> if (offset >= i_size_read(inode))
> do { return;
> prepare_to_wait(wq, &q.wq_entry, TASK_UNINTERRUPTIBLE); }
> if (atomic_read(&inode->i_dio_count))
> schedule();
> } while (atomic_read(&inode->i_dio_count));
> finish_wait(wq, &q.wq_entry);
> }
>
> smp_mb__before_atomic();
> clear_bit(BTRFS_INODE_READDIO_NEED_LOCK, &inode->runtime_flags);
>
> The semantics I'm after are:
>
> 1. If T1 goes to sleep, then T2 would see the
> BTRFS_INODE_READDIO_NEED_LOCK and hence will execute the
> atomic_dec_and_test and possibly wake up T1. This flag serves as a way
> to indicate to possibly multiple T2 (dio readers) that T1 is blocked
> and they should unblock it and resort to acquiring some locks (this is not
> visible in this excerpt of code for brevity). It's sort of a back-off
> mechanism.

I don't see how this could be guaranteed, even in a sequentially consistent
world (disclaimer: I'm certainly not familiar with btrfs): what is wrong in

T1 T2

atomic_inc(i_dio_count)
test_bit(NEED_LOCK, flags) // unset
set_bit(NEED_LOCK, flags)
atomic_read(i_dio_count) // >1
--> go to sleep

Thanks,
Andrea

>
> 2. BTRFS_INODE_READDIO_NEED_LOCK bit must be set _before_ going to sleep
>
> 3. BTRFS_INODE_READDIO_NEED_LOCK must be cleared _after_ the thread has
> been woken up.
>
> 4. After T1 is woken up, it's possible that a new T2 comes and doesn't see
> the BTRFS_INODE_READDIO_NEED_LOCK flag set but this is fine, since the check
> for i_size should cause T2 to just return (it will also execute atomic_dec_and_test)
>
> Given this is the current state of the code (it's part of btrfs) I believe
> the following could/should be done:
>
> 1. The smp_mb after the set_bit in T1 could be removed, since there is
> already an implied full mm in prepare_to_wait. That is if we go to sleep,
> then T2 is guaranteed to see the flag/i_size_write happening by merit of
> the implied memory barrier in prepare_to_wait/schedule. But what if it doesn't
> go to sleep? I still would like the i_size_write to be visible to T2
>
> 2. The bit clearing code in T1 should be possible to be replaced by
> clear_bit_unlock (this was suggested by PeterZ on IRC).
>
> 3. I suspect there is a memory barrier in T2 that is missing. Perhaps
> there should be an smp_mb__before_atomic right before the test_bit so that
> it's ordered with the implied smp_mb in T1's prepare_to_wait.

2018-02-23 17:48:54

by Nikolay Borisov

[permalink] [raw]

Subject: Re: Reasoning about memory ordering

On 23.02.2018 19:31, Andrea Parri wrote:
> On Fri, Feb 23, 2018 at 02:30:22PM +0200, Nikolay Borisov wrote:
>> Hello,
>>
>> I'm cc'ing a bunch of people I know are well-versed in
>> the black arts of memory ordering!
>>
>> Currently in btrfs we have roughly the following sequence:
>>
>> T1: T2:
>> i_size_write(inode, newsize);
>> set_bit(BTRFS_INODE_READDIO_NEED_LOCK, &inode->runtime_flags); atomic_inc(&inode->i_dio_count);
>> smp_mb(); if (iov_iter_rw(iter) == READ) {
>> if (test_bit(BTRFS_INODE_READDIO_NEED_LOCK, &BTRFS_I(inode)->runtime_flags)) {
>> if (atomic_read(&inode->i_dio_count)) { if (atomic_dec_and_test(&inode->i_dio_count))
>> wait_queue_head_t *wq = bit_waitqueue(&inode->i_state, __I_DIO_WAKEUP); wake_up_bit(&inode->i_state, __I_DIO_WAKEUP);
>> DEFINE_WAIT_BIT(q, &inode->i_state, __I_DIO_WAKEUP); }
>> if (offset >= i_size_read(inode))
>> do { return;
>> prepare_to_wait(wq, &q.wq_entry, TASK_UNINTERRUPTIBLE); }
>> if (atomic_read(&inode->i_dio_count))
>> schedule();
>> } while (atomic_read(&inode->i_dio_count));
>> finish_wait(wq, &q.wq_entry);
>> }
>>
>> smp_mb__before_atomic();
>> clear_bit(BTRFS_INODE_READDIO_NEED_LOCK, &inode->runtime_flags);
>>
>> The semantics I'm after are:
>>
>> 1. If T1 goes to sleep, then T2 would see the
>> BTRFS_INODE_READDIO_NEED_LOCK and hence will execute the
>> atomic_dec_and_test and possibly wake up T1. This flag serves as a way
>> to indicate to possibly multiple T2 (dio readers) that T1 is blocked
>> and they should unblock it and resort to acquiring some locks (this is not
>> visible in this excerpt of code for brevity). It's sort of a back-off
>> mechanism.
>
> I don't see how this could be guaranteed, even in a sequentially consistent
> world (disclaimer: I'm certainly not familiar with btrfs): what is wrong in
>
> T1 T2
>
> atomic_inc(i_dio_count)
> test_bit(NEED_LOCK, flags) // unset
> set_bit(NEED_LOCK, flags)
> atomic_read(i_dio_count) // >1
> --> go to sleep

You are correct, so looking at btrfs_direct_IO again it seems this kind
of execution is fine since we also do inode_dio_end (i.e. the
atomic_dec_andtest/wake_up) sequence even outside of the test_bit()
conditional. So I guess the first requirement is really
unsatisfiable/not required.

>
> Thanks,
> Andrea
>
>
>>
>> 2. BTRFS_INODE_READDIO_NEED_LOCK bit must be set _before_ going to sleep
>>
>> 3. BTRFS_INODE_READDIO_NEED_LOCK must be cleared _after_ the thread has
>> been woken up.
>>
>> 4. After T1 is woken up, it's possible that a new T2 comes and doesn't see
>> the BTRFS_INODE_READDIO_NEED_LOCK flag set but this is fine, since the check
>> for i_size should cause T2 to just return (it will also execute atomic_dec_and_test)
>>
>> Given this is the current state of the code (it's part of btrfs) I believe
>> the following could/should be done:
>>
>> 1. The smp_mb after the set_bit in T1 could be removed, since there is
>> already an implied full mm in prepare_to_wait. That is if we go to sleep,
>> then T2 is guaranteed to see the flag/i_size_write happening by merit of
>> the implied memory barrier in prepare_to_wait/schedule. But what if it doesn't
>> go to sleep? I still would like the i_size_write to be visible to T2
>>
>> 2. The bit clearing code in T1 should be possible to be replaced by
>> clear_bit_unlock (this was suggested by PeterZ on IRC).
>>
>> 3. I suspect there is a memory barrier in T2 that is missing. Perhaps
>> there should be an smp_mb__before_atomic right before the test_bit so that
>> it's ordered with the implied smp_mb in T1's prepare_to_wait.
>