2019-07-16 16:05:56

by Jan Stancek

Subject: [PATCH] locking/rwsem: use read_acquire in read_slowpath exit when queue is empty

LTP mtest06 has been observed to rarely hit "still mapped when deleted"
and the following BUG_ON on arm64:
page:ffff7e02fa37e480 refcount:3 mapcount:1 mapping:ffff80be3d678ab0 index:0x0
xfs_address_space_operations [xfs]
flags: 0xbfffe000000037(locked|referenced|uptodate|lru|active)
page dumped because: VM_BUG_ON_PAGE(page_mapped(page))
------------[ cut here ]------------
kernel BUG at mm/filemap.c:171!
Internal error: Oops - BUG: 0 [#1] SMP
CPU: 220 PID: 154292 Comm: mmap1 Not tainted 5.2.0-0ecfebd.cki #1
Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.10 05/17/2019
pstate: 40400089 (nZcv daIf +PAN -UAO)
pc : unaccount_page_cache_page+0x17c/0x1a0
lr : unaccount_page_cache_page+0x17c/0x1a0
Call trace:
unaccount_page_cache_page+0x17c/0x1a0
delete_from_page_cache_batch+0xa0/0x300
truncate_inode_pages_range+0x1b8/0x640
truncate_inode_pages_final+0x88/0xa8
evict+0x1a0/0x1d8
iput+0x150/0x240
dentry_unlink_inode+0x120/0x130
__dentry_kill+0xd8/0x1d0
dentry_kill+0x88/0x248
dput+0x168/0x1b8
__fput+0xe8/0x208
____fput+0x20/0x30
task_work_run+0xc0/0xf0
do_notify_resume+0x2b0/0x328
work_pending+0x8/0x10

The extra mapcount originated from the pagefault handler, which handled a
pagefault for a vma that had already been detached. The vma is detached
under the mmap_sem write lock by detach_vmas_to_be_unmapped(), which
also invalidates the vmacache.

When the pagefault handler (under the mmap_sem read lock) called find_vma(),
vmacache_valid() wrongly reported the vmacache as valid.
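
For reference, the check that a stale seqnum defeats looks roughly like
this (paraphrased from mm/vmacache.c as of v5.2; simplified sketch, not
the exact source):

	static bool vmacache_valid(struct mm_struct *mm)
	{
		struct task_struct *curr = current;

		if (mm != curr->mm)		/* simplified vmacache_valid_mm() */
			return false;

		if (mm->vmacache_seqnum != curr->vmacache.seqnum) {
			/* mm changed under us: resync and flush the per-task cache */
			curr->vmacache.seqnum = mm->vmacache_seqnum;
			vmacache_flush(curr);
			return false;
		}
		return true;
	}

If the reader still sees the pre-increment mm->vmacache_seqnum, the
comparison above passes and find_vma() can return the already detached
vma from the per-task cache.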

When rwsem down_read() returns via the 'queue empty' path (as of v5.2),
it does so without issuing a read_acquire on sem->count:

  down_read
    __down_read
      rwsem_down_read_failed
        __rwsem_down_read_failed_common
          raw_spin_lock_irq(&sem->wait_lock);
          if (list_empty(&sem->wait_list)) {
                  if (atomic_long_read(&sem->count) >= 0) {
                          raw_spin_unlock_irq(&sem->wait_lock);
                          return sem;

The suspected problem here is that the last *_acquire on the down_read()
side happens before the write side issues its *_release:
1. writer: has the lock
2. reader: down_read() issues *read_acquire on entry
3. writer: mm->vmacache_seqnum++; downgrades lock (*fetch_add_release)
4. reader: __rwsem_down_read_failed_common() finds it can take lock and returns
5. reader: observes stale mm->vmacache_seqnum
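
For reference, the writer side in steps 1-3 is the munmap path, roughly
(simplified sketch of the v5.2 code flow, not the exact source):

	__do_munmap()				/* mmap_sem held for write */
	    detach_vmas_to_be_unmapped()
	        vmacache_invalidate(mm);	/* mm->vmacache_seqnum++ */
	    downgrade_write(&mm->mmap_sem);	/* *fetch_add_release on sem->count */
	    unmap_region();			/* mmap_sem now held for read */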

I can reproduce the problem by running LTP mtest06 in a loop and building
a kernel (-j $NCPUS) in parallel. It reproduces from v4.20 up to v5.2
on an arm64 HPE Apollo 70 (224 CPUs, 256GB RAM, 2 nodes) and triggers
reliably within ~1 hour. A patched kernel ran fine for 5+ hours with a
clean dmesg. Tests were done against v5.2, since commit cf69482d62d9
("locking/rwsem: Enable readers spinning on writer") makes it much harder
to reproduce.

Related: https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap1.c
Related: commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
Fixes: 4b486b535c33 ("locking/rwsem: Exit read lock slowpath if queue empty & no writer")

Signed-off-by: Jan Stancek <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
kernel/locking/rwsem.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 37524a47f002..757b198d7a5b 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1030,7 +1030,7 @@ static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
 	 * exit the slowpath and return immediately as its
 	 * RWSEM_READER_BIAS has already been set in the count.
 	 */
-	if (adjustment && !(atomic_long_read(&sem->count) &
+	if (adjustment && !(atomic_long_read_acquire(&sem->count) &
 	    (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
 		raw_spin_unlock_irq(&sem->wait_lock);
 		rwsem_set_reader_owned(sem);
--
1.8.3.1


2019-07-16 16:55:32

by Waiman Long

Subject: Re: [PATCH] locking/rwsem: use read_acquire in read_slowpath exit when queue is empty

On 7/16/19 12:04 PM, Jan Stancek wrote:
> LTP mtest06 has been observed to rarely hit "still mapped when deleted"
> and following BUG_ON on arm64:
> page:ffff7e02fa37e480 refcount:3 mapcount:1 mapping:ffff80be3d678ab0 index:0x0
> xfs_address_space_operations [xfs]
> flags: 0xbfffe000000037(locked|referenced|uptodate|lru|active)
> page dumped because: VM_BUG_ON_PAGE(page_mapped(page))
> ------------[ cut here ]------------
> kernel BUG at mm/filemap.c:171!
> Internal error: Oops - BUG: 0 [#1] SMP
> CPU: 220 PID: 154292 Comm: mmap1 Not tainted 5.2.0-0ecfebd.cki #1
> Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.10 05/17/2019
> pstate: 40400089 (nZcv daIf +PAN -UAO)
> pc : unaccount_page_cache_page+0x17c/0x1a0
> lr : unaccount_page_cache_page+0x17c/0x1a0
> Call trace:
> unaccount_page_cache_page+0x17c/0x1a0
> delete_from_page_cache_batch+0xa0/0x300
> truncate_inode_pages_range+0x1b8/0x640
> truncate_inode_pages_final+0x88/0xa8
> evict+0x1a0/0x1d8
> iput+0x150/0x240
> dentry_unlink_inode+0x120/0x130
> __dentry_kill+0xd8/0x1d0
> dentry_kill+0x88/0x248
> dput+0x168/0x1b8
> __fput+0xe8/0x208
> ____fput+0x20/0x30
> task_work_run+0xc0/0xf0
> do_notify_resume+0x2b0/0x328
> work_pending+0x8/0x10
>
> The extra mapcount originated from pagefault handler, which handled
> pagefault for vma that has already been detached. vma is detached
> under mmap_sem write lock by detach_vmas_to_be_unmapped(), which
> also invalidates vmacache.
>
> When pagefault handler (under mmap_sem read lock) called find_vma(),
> vmacache_valid() wrongly reported vmacache as valid.
>
> After rwsem down_read() returns via 'queue empty' path (as of v5.2),
> it does so without issuing read_acquire on sem->count:
> down_read
> __down_read
> rwsem_down_read_failed
> __rwsem_down_read_failed_common
> raw_spin_lock_irq(&sem->wait_lock);
> if (list_empty(&sem->wait_list)) {
> if (atomic_long_read(&sem->count) >= 0) {
> raw_spin_unlock_irq(&sem->wait_lock);
> return sem;
>
> Suspected problem here is that last *_acquire on down_read() side
> happens before write side issues *_release:
> 1. writer: has the lock
> 2. reader: down_read() issues *read_acquire on entry
> 3. writer: mm->vmacache_seqnum++; downgrades lock (*fetch_add_release)
> 4. reader: __rwsem_down_read_failed_common() finds it can take lock and returns
> 5. reader: observes stale mm->vmacache_seqnum
>
> I can reproduce the problem by running LTP mtest06 in a loop and building
> kernel (-j $NCPUS) in parallel. It does reproduce since v4.20 up to v5.2
> on arm64 HPE Apollo 70 (224 CPUs, 256GB RAM, 2 nodes). It triggers reliably
> within ~hour. Patched kernel ran fine for 5+ hours with clean dmesg.
> Tests were done against v5.2, since commit cf69482d62d9 ("locking/rwsem:
> Enable readers spinning on writer") makes it much harder to reproduce.
The explanation makes sense to me. Thanks for the investigation.
>
> Related: https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap1.c
> Related: commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
> Fixes: 4b486b535c33 ("locking/rwsem: Exit read lock slowpath if queue empty & no writer")
>
> Signed-off-by: Jan Stancek <[email protected]>
> Cc: Waiman Long <[email protected]>
> Cc: Davidlohr Bueso <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> ---
> kernel/locking/rwsem.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> index 37524a47f002..757b198d7a5b 100644
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -1030,7 +1030,7 @@ static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
> * exit the slowpath and return immediately as its
> * RWSEM_READER_BIAS has already been set in the count.
> */
> - if (adjustment && !(atomic_long_read(&sem->count) &
> + if (adjustment && !(atomic_long_read_acquire(&sem->count) &
> (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
> raw_spin_unlock_irq(&sem->wait_lock);
> rwsem_set_reader_owned(sem);

The chance of taking this path is not that high. So instead of
increasing the cost of the test by adding an acquire barrier, how about
just adding smp_mb__after_spinlock() before spin_unlock_irq()? This
should have the same effect, making sure that no stale data will be
used in the read-lock critical section.
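
For illustration, a minimal sketch of that alternative, applied to the
same exit path as the diff above (a sketch of the suggestion, not a
posted patch):

	if (adjustment && !(atomic_long_read(&sem->count) &
	    (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
		/* full barrier: orders the sem->count read above against
		 * the upcoming read-side critical section */
		smp_mb__after_spinlock();
		raw_spin_unlock_irq(&sem->wait_lock);
		rwsem_set_reader_owned(sem);
		lockevent_inc(rwsem_rlock_fast);
		return sem;
	}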

-Longman

2019-07-16 18:35:27

by Jan Stancek

Subject: Re: [PATCH] locking/rwsem: use read_acquire in read_slowpath exit when queue is empty



----- Original Message -----
> On 7/16/19 12:04 PM, Jan Stancek wrote:
> > LTP mtest06 has been observed to rarely hit "still mapped when deleted"
> > and following BUG_ON on arm64:
> > page:ffff7e02fa37e480 refcount:3 mapcount:1 mapping:ffff80be3d678ab0
> > index:0x0
> > xfs_address_space_operations [xfs]
> > flags: 0xbfffe000000037(locked|referenced|uptodate|lru|active)
> > page dumped because: VM_BUG_ON_PAGE(page_mapped(page))
> > ------------[ cut here ]------------
> > kernel BUG at mm/filemap.c:171!
> > Internal error: Oops - BUG: 0 [#1] SMP
> > CPU: 220 PID: 154292 Comm: mmap1 Not tainted 5.2.0-0ecfebd.cki #1
> > Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.10
> > 05/17/2019
> > pstate: 40400089 (nZcv daIf +PAN -UAO)
> > pc : unaccount_page_cache_page+0x17c/0x1a0
> > lr : unaccount_page_cache_page+0x17c/0x1a0
> > Call trace:
> > unaccount_page_cache_page+0x17c/0x1a0
> > delete_from_page_cache_batch+0xa0/0x300
> > truncate_inode_pages_range+0x1b8/0x640
> > truncate_inode_pages_final+0x88/0xa8
> > evict+0x1a0/0x1d8
> > iput+0x150/0x240
> > dentry_unlink_inode+0x120/0x130
> > __dentry_kill+0xd8/0x1d0
> > dentry_kill+0x88/0x248
> > dput+0x168/0x1b8
> > __fput+0xe8/0x208
> > ____fput+0x20/0x30
> > task_work_run+0xc0/0xf0
> > do_notify_resume+0x2b0/0x328
> > work_pending+0x8/0x10
> >
> > The extra mapcount originated from pagefault handler, which handled
> > pagefault for vma that has already been detached. vma is detached
> > under mmap_sem write lock by detach_vmas_to_be_unmapped(), which
> > also invalidates vmacache.
> >
> > When pagefault handler (under mmap_sem read lock) called find_vma(),
> > vmacache_valid() wrongly reported vmacache as valid.
> >
> > After rwsem down_read() returns via 'queue empty' path (as of v5.2),
> > it does so without issuing read_acquire on sem->count:
> > down_read
> > __down_read
> > rwsem_down_read_failed
> > __rwsem_down_read_failed_common
> > raw_spin_lock_irq(&sem->wait_lock);
> > if (list_empty(&sem->wait_list)) {
> > if (atomic_long_read(&sem->count) >= 0) {
> > raw_spin_unlock_irq(&sem->wait_lock);
> > return sem;
> >
> > Suspected problem here is that last *_acquire on down_read() side
> > happens before write side issues *_release:
> > 1. writer: has the lock
> > 2. reader: down_read() issues *read_acquire on entry
> > 3. writer: mm->vmacache_seqnum++; downgrades lock (*fetch_add_release)
> > 4. reader: __rwsem_down_read_failed_common() finds it can take lock and
> > returns
> > 5. reader: observes stale mm->vmacache_seqnum
> >
> > I can reproduce the problem by running LTP mtest06 in a loop and building
> > kernel (-j $NCPUS) in parallel. It does reproduce since v4.20 up to v5.2
> > on arm64 HPE Apollo 70 (224 CPUs, 256GB RAM, 2 nodes). It triggers reliably
> > within ~hour. Patched kernel ran fine for 5+ hours with clean dmesg.
> > Tests were done against v5.2, since commit cf69482d62d9 ("locking/rwsem:
> > Enable readers spinning on writer") makes it much harder to reproduce.
> The explanation makes sense to me. Thanks for the investigation.
> >
> > Related:
> > https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap1.c
> > Related: commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in
> > munmap")
> > Fixes: 4b486b535c33 ("locking/rwsem: Exit read lock slowpath if queue empty
> > & no writer")
> >
> > Signed-off-by: Jan Stancek <[email protected]>
> > Cc: Waiman Long <[email protected]>
> > Cc: Davidlohr Bueso <[email protected]>
> > Cc: Will Deacon <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > ---
> > kernel/locking/rwsem.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> > index 37524a47f002..757b198d7a5b 100644
> > --- a/kernel/locking/rwsem.c
> > +++ b/kernel/locking/rwsem.c
> > @@ -1030,7 +1030,7 @@ static inline bool rwsem_reader_phase_trylock(struct
> > rw_semaphore *sem,
> > * exit the slowpath and return immediately as its
> > * RWSEM_READER_BIAS has already been set in the count.
> > */
> > - if (adjustment && !(atomic_long_read(&sem->count) &
> > + if (adjustment && !(atomic_long_read_acquire(&sem->count) &
> > (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
> > raw_spin_unlock_irq(&sem->wait_lock);
> > rwsem_set_reader_owned(sem);
>
> The chance of taking this path is not that high. So instead of
> increasing the cost of the test by adding an acquire barrier,

Yes, that is good point.

> how about
> just adding smp_mb__after_spinlock() before spin_unlock_irq(). This
> should have the same effect of making sure that no stale data will be
> used in the read-lock critical section.

I'll redo the tests with smp_mb__after_spinlock().

Thanks,
Jan

>
> -Longman
>
>

2019-07-16 19:00:28

by Peter Zijlstra

Subject: Re: [PATCH] locking/rwsem: use read_acquire in read_slowpath exit when queue is empty

On Tue, Jul 16, 2019 at 12:53:14PM -0400, Waiman Long wrote:
> On 7/16/19 12:04 PM, Jan Stancek wrote:

> > Suspected problem here is that last *_acquire on down_read() side
> > happens before write side issues *_release:
> > 1. writer: has the lock
> > 2. reader: down_read() issues *read_acquire on entry
> > 3. writer: mm->vmacache_seqnum++; downgrades lock (*fetch_add_release)
> > 4. reader: __rwsem_down_read_failed_common() finds it can take lock and returns
> > 5. reader: observes stale mm->vmacache_seqnum
> >
> > I can reproduce the problem by running LTP mtest06 in a loop and building
> > kernel (-j $NCPUS) in parallel. It does reproduce since v4.20 up to v5.2
> > on arm64 HPE Apollo 70 (224 CPUs, 256GB RAM, 2 nodes). It triggers reliably
> > within ~hour. Patched kernel ran fine for 5+ hours with clean dmesg.
> > Tests were done against v5.2, since commit cf69482d62d9 ("locking/rwsem:
> > Enable readers spinning on writer") makes it much harder to reproduce.

> > Fixes: 4b486b535c33 ("locking/rwsem: Exit read lock slowpath if queue empty & no writer")
> > Signed-off-by: Jan Stancek <[email protected]>
> > Cc: Waiman Long <[email protected]>
> > Cc: Davidlohr Bueso <[email protected]>
> > Cc: Will Deacon <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > ---
> > kernel/locking/rwsem.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> > index 37524a47f002..757b198d7a5b 100644
> > --- a/kernel/locking/rwsem.c
> > +++ b/kernel/locking/rwsem.c
> > @@ -1030,7 +1030,7 @@ static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
> > * exit the slowpath and return immediately as its
> > * RWSEM_READER_BIAS has already been set in the count.
> > */
> > - if (adjustment && !(atomic_long_read(&sem->count) &
> > + if (adjustment && !(atomic_long_read_acquire(&sem->count) &
> > (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
> > raw_spin_unlock_irq(&sem->wait_lock);
> > rwsem_set_reader_owned(sem);
>
> The chance of taking this path is not that high. So instead of
> increasing the cost of the test by adding an acquire barrier, how about
> just adding smp_mb__after_spinlock() before spin_unlock_irq(). This
> should have the same effect of making sure that no stale data will be
> used in the read-lock critical section.

That's actually more expensive on something like ARM64, I expect.

The far cheaper alternative is smp_acquire__after_ctrl_dep(); however, in
general Will seems to prefer using load-acquire over separate barriers,
and for x86 it doesn't matter anyway. For PowerPC these two are a wash:
both end up with LWSYNC (over SYNC for your alternative).
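
For reference, a rough mapping of the options being discussed (v5.2-era
definitions; treat this as an approximation and see each architecture's
barrier.h for the authoritative versions):

/*
 *                                  arm64          x86          PowerPC
 * atomic_long_read_acquire()       LDAR           plain load   load + LWSYNC
 * smp_acquire__after_ctrl_dep()    DMB ISHLD      barrier()    LWSYNC
 * smp_mb__after_spinlock()         DMB ISH        no-op        SYNC
 */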


2019-07-16 19:10:29

by Waiman Long

Subject: Re: [PATCH] locking/rwsem: use read_acquire in read_slowpath exit when queue is empty

On 7/16/19 2:58 PM, Peter Zijlstra wrote:
> On Tue, Jul 16, 2019 at 12:53:14PM -0400, Waiman Long wrote:
>> On 7/16/19 12:04 PM, Jan Stancek wrote:
> Fixes: 4b486b535c33 ("locking/rwsem: Exit read lock slowpath if queue empty & no writer")
> Signed-off-by: Jan Stancek <[email protected]>
> Cc: Waiman Long <[email protected]>
> Cc: Davidlohr Bueso <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> ---
> kernel/locking/rwsem.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> index 37524a47f002..757b198d7a5b 100644
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -1030,7 +1030,7 @@ static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
> * exit the slowpath and return immediately as its
> * RWSEM_READER_BIAS has already been set in the count.
> */
> - if (adjustment && !(atomic_long_read(&sem->count) &
> + if (adjustment && !(atomic_long_read_acquire(&sem->count) &
> (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
> raw_spin_unlock_irq(&sem->wait_lock);
> rwsem_set_reader_owned(sem);
>> The chance of taking this path is not that high. So instead of
>> increasing the cost of the test by adding an acquire barrier, how about
>> just adding smp_mb__after_spinlock() before spin_unlock_irq(). This
>> should have the same effect of making sure that no stale data will be
>> used in the read-lock critical section.
> That's actually more expensive on something like ARM64 I expect.
>
> The far cheaper alternative is smp_acquire__after_ctrl_dep(), however in
> general Will seems to prefer using load-acquire over separate barriers,
> and for x86 it doesn't matter anyway. For PowerPC these two are a wash,
> both end up with LWSYNC (over SYNC for your alternative).

With lock event counting turned on, my experience with this code path
was that it got hit very infrequently. It is even less frequent with the
latest reader optimistic spinning patch. That is why I prefer making it
a bit more costly when the condition is true without incurring any cost
at all when the condition is false, which is the majority of the cases.
Anyway, this additional cost is for arm64 only, but it is still more than
compensated for by skipping all the wait-list manipulation and self-wakeup
code.

Cheers,
Longman

2019-07-17 12:03:34

by Jan Stancek

Subject: [PATCH v2] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty

LTP mtest06 has been observed to rarely hit "still mapped when deleted"
and the following BUG_ON on arm64:
page:ffff7e02fa37e480 refcount:3 mapcount:1 mapping:ffff80be3d678ab0 index:0x0
xfs_address_space_operations [xfs]
flags: 0xbfffe000000037(locked|referenced|uptodate|lru|active)
page dumped because: VM_BUG_ON_PAGE(page_mapped(page))
------------[ cut here ]------------
kernel BUG at mm/filemap.c:171!
Internal error: Oops - BUG: 0 [#1] SMP
CPU: 220 PID: 154292 Comm: mmap1 Not tainted 5.2.0-0ecfebd.cki #1
Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.10 05/17/2019
pstate: 40400089 (nZcv daIf +PAN -UAO)
pc : unaccount_page_cache_page+0x17c/0x1a0
lr : unaccount_page_cache_page+0x17c/0x1a0
Call trace:
unaccount_page_cache_page+0x17c/0x1a0
delete_from_page_cache_batch+0xa0/0x300
truncate_inode_pages_range+0x1b8/0x640
truncate_inode_pages_final+0x88/0xa8
evict+0x1a0/0x1d8
iput+0x150/0x240
dentry_unlink_inode+0x120/0x130
__dentry_kill+0xd8/0x1d0
dentry_kill+0x88/0x248
dput+0x168/0x1b8
__fput+0xe8/0x208
____fput+0x20/0x30
task_work_run+0xc0/0xf0
do_notify_resume+0x2b0/0x328
work_pending+0x8/0x10

The extra mapcount originated from the pagefault handler, which handled a
pagefault for a vma that had already been detached. The vma is detached
under the mmap_sem write lock by detach_vmas_to_be_unmapped(), which
also invalidates the vmacache.

When the pagefault handler (under the mmap_sem read lock) called find_vma(),
vmacache_valid() wrongly reported the vmacache as valid.

When rwsem down_read() returns via the 'queue empty' path (as of v5.2),
it does so without issuing a read_acquire on sem->count:

  down_read
    __down_read
      rwsem_down_read_failed
        __rwsem_down_read_failed_common
          raw_spin_lock_irq(&sem->wait_lock);
          if (list_empty(&sem->wait_list)) {
                  if (atomic_long_read(&sem->count) >= 0) {
                          raw_spin_unlock_irq(&sem->wait_lock);
                          return sem;

The suspected problem here is that the last *_acquire on the down_read()
side happens before the write side issues its *_release:
1. writer: has the lock
2. reader: down_read() issues *read_acquire on entry
3. writer: mm->vmacache_seqnum++; downgrades lock (*fetch_add_release)
4. reader: __rwsem_down_read_failed_common() finds it can take lock and returns
5. reader: observes stale mm->vmacache_seqnum

I can reproduce the problem by running LTP mtest06 in a loop and building
a kernel (-j $NCPUS) in parallel. It reproduces from v4.20 up to v5.2
on an arm64 HPE Apollo 70 (224 CPUs, 256GB RAM, 2 nodes) and triggers
reliably within ~1 hour. A patched kernel ran fine for 10+ hours with a
clean dmesg. Tests were done against v5.2, since commit cf69482d62d9
("locking/rwsem: Enable readers spinning on writer") makes it much harder
to reproduce.

v2: Move barrier after test (Waiman Long)
    Use smp_acquire__after_ctrl_dep() (Peter Zijlstra)

Related: https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap1.c
Related: commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
Fixes: 4b486b535c33 ("locking/rwsem: Exit read lock slowpath if queue empty & no writer")

Signed-off-by: Jan Stancek <[email protected]>
Cc: [email protected] # v4.20+
Cc: Waiman Long <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
kernel/locking/rwsem.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 37524a47f002..5ac72b60608b 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1032,6 +1032,7 @@ static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
 	 */
 	if (adjustment && !(atomic_long_read(&sem->count) &
 	    (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
+		smp_acquire__after_ctrl_dep();
 		raw_spin_unlock_irq(&sem->wait_lock);
 		rwsem_set_reader_owned(sem);
 		lockevent_inc(rwsem_rlock_fast);
--
1.8.3.1

2019-07-17 13:14:26

by Will Deacon

Subject: Re: [PATCH v2] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty

On Wed, Jul 17, 2019 at 02:02:20PM +0200, Jan Stancek wrote:
> LTP mtest06 has been observed to rarely hit "still mapped when deleted"
> and following BUG_ON on arm64:
> page:ffff7e02fa37e480 refcount:3 mapcount:1 mapping:ffff80be3d678ab0 index:0x0
> xfs_address_space_operations [xfs]
> flags: 0xbfffe000000037(locked|referenced|uptodate|lru|active)
> page dumped because: VM_BUG_ON_PAGE(page_mapped(page))
> ------------[ cut here ]------------
> kernel BUG at mm/filemap.c:171!
> Internal error: Oops - BUG: 0 [#1] SMP
> CPU: 220 PID: 154292 Comm: mmap1 Not tainted 5.2.0-0ecfebd.cki #1
> Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.10 05/17/2019
> pstate: 40400089 (nZcv daIf +PAN -UAO)
> pc : unaccount_page_cache_page+0x17c/0x1a0
> lr : unaccount_page_cache_page+0x17c/0x1a0
> Call trace:
> unaccount_page_cache_page+0x17c/0x1a0
> delete_from_page_cache_batch+0xa0/0x300
> truncate_inode_pages_range+0x1b8/0x640
> truncate_inode_pages_final+0x88/0xa8
> evict+0x1a0/0x1d8
> iput+0x150/0x240
> dentry_unlink_inode+0x120/0x130
> __dentry_kill+0xd8/0x1d0
> dentry_kill+0x88/0x248
> dput+0x168/0x1b8
> __fput+0xe8/0x208
> ____fput+0x20/0x30
> task_work_run+0xc0/0xf0
> do_notify_resume+0x2b0/0x328
> work_pending+0x8/0x10
>
> The extra mapcount originated from pagefault handler, which handled
> pagefault for vma that has already been detached. vma is detached
> under mmap_sem write lock by detach_vmas_to_be_unmapped(), which
> also invalidates vmacache.
>
> When pagefault handler (under mmap_sem read lock) called find_vma(),
> vmacache_valid() wrongly reported vmacache as valid.
>
> After rwsem down_read() returns via 'queue empty' path (as of v5.2),
> it does so without issuing read_acquire on sem->count:
> down_read
> __down_read
> rwsem_down_read_failed
> __rwsem_down_read_failed_common
> raw_spin_lock_irq(&sem->wait_lock);
> if (list_empty(&sem->wait_list)) {
> if (atomic_long_read(&sem->count) >= 0) {
> raw_spin_unlock_irq(&sem->wait_lock);
> return sem;
>
> Suspected problem here is that last *_acquire on down_read() side
> happens before write side issues *_release:
> 1. writer: has the lock
> 2. reader: down_read() issues *read_acquire on entry
> 3. writer: mm->vmacache_seqnum++; downgrades lock (*fetch_add_release)
> 4. reader: __rwsem_down_read_failed_common() finds it can take lock and returns
> 5. reader: observes stale mm->vmacache_seqnum
>
> I can reproduce the problem by running LTP mtest06 in a loop and building
> kernel (-j $NCPUS) in parallel. It does reproduce since v4.20 up to v5.2
> on arm64 HPE Apollo 70 (224 CPUs, 256GB RAM, 2 nodes). It triggers reliably
> within ~hour. Patched kernel ran fine for 10+ hours with clean dmesg.
> Tests were done against v5.2, since commit cf69482d62d9 ("locking/rwsem:
> Enable readers spinning on writer") makes it much harder to reproduce.
>
> v2: Move barrier after test (Waiman Long)
> Use smp_acquire__after_ctrl_dep() (Peter Zijlstra)
>
> Related: https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap1.c
> Related: commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
> Fixes: 4b486b535c33 ("locking/rwsem: Exit read lock slowpath if queue empty & no writer")
>
> Signed-off-by: Jan Stancek <[email protected]>
> Cc: [email protected] # v4.20+
> Cc: Waiman Long <[email protected]>
> Cc: Davidlohr Bueso <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> ---
> kernel/locking/rwsem.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> index 37524a47f002..5ac72b60608b 100644
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -1032,6 +1032,7 @@ static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
> */
> if (adjustment && !(atomic_long_read(&sem->count) &
> (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
> + smp_acquire__after_ctrl_dep();
> raw_spin_unlock_irq(&sem->wait_lock);
> rwsem_set_reader_owned(sem);
> lockevent_inc(rwsem_rlock_fast);

If you add a comment to the code outlining the issue (preferably as a litmus
test involving sem->count and some shared data, which happens to be
vmacache_seqnum in your test), then:

Reviewed-by: Will Deacon <[email protected]>

Thanks,

Will

2019-07-17 14:19:51

by Waiman Long

Subject: Re: [PATCH v2] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty

On 7/17/19 9:13 AM, Will Deacon wrote:
> On Wed, Jul 17, 2019 at 02:02:20PM +0200, Jan Stancek wrote:
>> LTP mtest06 has been observed to rarely hit "still mapped when deleted"
>> and following BUG_ON on arm64:
>> page:ffff7e02fa37e480 refcount:3 mapcount:1 mapping:ffff80be3d678ab0 index:0x0
>> xfs_address_space_operations [xfs]
>> flags: 0xbfffe000000037(locked|referenced|uptodate|lru|active)
>> page dumped because: VM_BUG_ON_PAGE(page_mapped(page))
>> ------------[ cut here ]------------
>> kernel BUG at mm/filemap.c:171!
>> Internal error: Oops - BUG: 0 [#1] SMP
>> CPU: 220 PID: 154292 Comm: mmap1 Not tainted 5.2.0-0ecfebd.cki #1
>> Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.10 05/17/2019
>> pstate: 40400089 (nZcv daIf +PAN -UAO)
>> pc : unaccount_page_cache_page+0x17c/0x1a0
>> lr : unaccount_page_cache_page+0x17c/0x1a0
>> Call trace:
>> unaccount_page_cache_page+0x17c/0x1a0
>> delete_from_page_cache_batch+0xa0/0x300
>> truncate_inode_pages_range+0x1b8/0x640
>> truncate_inode_pages_final+0x88/0xa8
>> evict+0x1a0/0x1d8
>> iput+0x150/0x240
>> dentry_unlink_inode+0x120/0x130
>> __dentry_kill+0xd8/0x1d0
>> dentry_kill+0x88/0x248
>> dput+0x168/0x1b8
>> __fput+0xe8/0x208
>> ____fput+0x20/0x30
>> task_work_run+0xc0/0xf0
>> do_notify_resume+0x2b0/0x328
>> work_pending+0x8/0x10
>>
>> The extra mapcount originated from pagefault handler, which handled
>> pagefault for vma that has already been detached. vma is detached
>> under mmap_sem write lock by detach_vmas_to_be_unmapped(), which
>> also invalidates vmacache.
>>
>> When pagefault handler (under mmap_sem read lock) called find_vma(),
>> vmacache_valid() wrongly reported vmacache as valid.
>>
>> After rwsem down_read() returns via 'queue empty' path (as of v5.2),
>> it does so without issuing read_acquire on sem->count:
>> down_read
>> __down_read
>> rwsem_down_read_failed
>> __rwsem_down_read_failed_common
>> raw_spin_lock_irq(&sem->wait_lock);
>> if (list_empty(&sem->wait_list)) {
>> if (atomic_long_read(&sem->count) >= 0) {
>> raw_spin_unlock_irq(&sem->wait_lock);
>> return sem;
>>
>> Suspected problem here is that last *_acquire on down_read() side
>> happens before write side issues *_release:
>> 1. writer: has the lock
>> 2. reader: down_read() issues *read_acquire on entry
>> 3. writer: mm->vmacache_seqnum++; downgrades lock (*fetch_add_release)
>> 4. reader: __rwsem_down_read_failed_common() finds it can take lock and returns
>> 5. reader: observes stale mm->vmacache_seqnum
>>
>> I can reproduce the problem by running LTP mtest06 in a loop and building
>> kernel (-j $NCPUS) in parallel. It does reproduce since v4.20 up to v5.2
>> on arm64 HPE Apollo 70 (224 CPUs, 256GB RAM, 2 nodes). It triggers reliably
>> within ~hour. Patched kernel ran fine for 10+ hours with clean dmesg.
>> Tests were done against v5.2, since commit cf69482d62d9 ("locking/rwsem:
>> Enable readers spinning on writer") makes it much harder to reproduce.
>>
>> v2: Move barrier after test (Waiman Long)
>> Use smp_acquire__after_ctrl_dep() (Peter Zijlstra)
>>
>> Related: https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap1.c
>> Related: commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
>> Fixes: 4b486b535c33 ("locking/rwsem: Exit read lock slowpath if queue empty & no writer")
>>
>> Signed-off-by: Jan Stancek <[email protected]>
>> Cc: [email protected] # v4.20+
>> Cc: Waiman Long <[email protected]>
>> Cc: Davidlohr Bueso <[email protected]>
>> Cc: Will Deacon <[email protected]>
>> Cc: Peter Zijlstra <[email protected]>
>> Cc: Ingo Molnar <[email protected]>
>> ---
>> kernel/locking/rwsem.c | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
>> index 37524a47f002..5ac72b60608b 100644
>> --- a/kernel/locking/rwsem.c
>> +++ b/kernel/locking/rwsem.c
>> @@ -1032,6 +1032,7 @@ static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
>> */
>> if (adjustment && !(atomic_long_read(&sem->count) &
>> (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
>> + smp_acquire__after_ctrl_dep();
>> raw_spin_unlock_irq(&sem->wait_lock);
>> rwsem_set_reader_owned(sem);
>> lockevent_inc(rwsem_rlock_fast);
> If you add a comment to the code outlining the issue (preferably as a litmus
> test involving sem->count and some shared data which happens to be
> vmacache_seqnum in your test)), then:
>
> Reviewed-by: Will Deacon <[email protected]>
>
> Thanks,
>
> Will

Agreed. A comment just above smp_acquire__after_ctrl_dep() on why this
is needed will be great.

Other than that,

Acked-by: Waiman Long <[email protected]>

2019-07-17 15:36:22

by Waiman Long

Subject: Re: [PATCH v2] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty

On 7/17/19 8:02 AM, Jan Stancek wrote:
> LTP mtest06 has been observed to rarely hit "still mapped when deleted"
> and following BUG_ON on arm64:
> page:ffff7e02fa37e480 refcount:3 mapcount:1 mapping:ffff80be3d678ab0 index:0x0
> xfs_address_space_operations [xfs]
> flags: 0xbfffe000000037(locked|referenced|uptodate|lru|active)
> page dumped because: VM_BUG_ON_PAGE(page_mapped(page))
> ------------[ cut here ]------------
> kernel BUG at mm/filemap.c:171!
> Internal error: Oops - BUG: 0 [#1] SMP
> CPU: 220 PID: 154292 Comm: mmap1 Not tainted 5.2.0-0ecfebd.cki #1
> Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.10 05/17/2019
> pstate: 40400089 (nZcv daIf +PAN -UAO)
> pc : unaccount_page_cache_page+0x17c/0x1a0
> lr : unaccount_page_cache_page+0x17c/0x1a0
> Call trace:
> unaccount_page_cache_page+0x17c/0x1a0
> delete_from_page_cache_batch+0xa0/0x300
> truncate_inode_pages_range+0x1b8/0x640
> truncate_inode_pages_final+0x88/0xa8
> evict+0x1a0/0x1d8
> iput+0x150/0x240
> dentry_unlink_inode+0x120/0x130
> __dentry_kill+0xd8/0x1d0
> dentry_kill+0x88/0x248
> dput+0x168/0x1b8
> __fput+0xe8/0x208
> ____fput+0x20/0x30
> task_work_run+0xc0/0xf0
> do_notify_resume+0x2b0/0x328
> work_pending+0x8/0x10
>
> The extra mapcount originated from pagefault handler, which handled
> pagefault for vma that has already been detached. vma is detached
> under mmap_sem write lock by detach_vmas_to_be_unmapped(), which
> also invalidates vmacache.
>
> When pagefault handler (under mmap_sem read lock) called find_vma(),
> vmacache_valid() wrongly reported vmacache as valid.
>
> After rwsem down_read() returns via 'queue empty' path (as of v5.2),
> it does so without issuing read_acquire on sem->count:
> down_read
> __down_read
> rwsem_down_read_failed
> __rwsem_down_read_failed_common
> raw_spin_lock_irq(&sem->wait_lock);
> if (list_empty(&sem->wait_list)) {
> if (atomic_long_read(&sem->count) >= 0) {
> raw_spin_unlock_irq(&sem->wait_lock);
> return sem;
>
> Suspected problem here is that last *_acquire on down_read() side
> happens before write side issues *_release:
> 1. writer: has the lock
> 2. reader: down_read() issues *read_acquire on entry
> 3. writer: mm->vmacache_seqnum++; downgrades lock (*fetch_add_release)
> 4. reader: __rwsem_down_read_failed_common() finds it can take lock and returns
> 5. reader: observes stale mm->vmacache_seqnum
>
> I can reproduce the problem by running LTP mtest06 in a loop and building
> kernel (-j $NCPUS) in parallel. It does reproduce since v4.20 up to v5.2
> on arm64 HPE Apollo 70 (224 CPUs, 256GB RAM, 2 nodes). It triggers reliably
> within ~hour. Patched kernel ran fine for 10+ hours with clean dmesg.
> Tests were done against v5.2, since commit cf69482d62d9 ("locking/rwsem:
> Enable readers spinning on writer") makes it much harder to reproduce.
>
> v2: Move barrier after test (Waiman Long)
> Use smp_acquire__after_ctrl_dep() (Peter Zijlstra)
>
> Related: https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap1.c
> Related: commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
> Fixes: 4b486b535c33 ("locking/rwsem: Exit read lock slowpath if queue empty & no writer")
>
> Signed-off-by: Jan Stancek <[email protected]>
> Cc: [email protected] # v4.20+
> Cc: Waiman Long <[email protected]>
> Cc: Davidlohr Bueso <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> ---
> kernel/locking/rwsem.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> index 37524a47f002..5ac72b60608b 100644
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -1032,6 +1032,7 @@ static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
> */
> if (adjustment && !(atomic_long_read(&sem->count) &
> (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
> + smp_acquire__after_ctrl_dep();
> raw_spin_unlock_irq(&sem->wait_lock);
> rwsem_set_reader_owned(sem);
> lockevent_inc(rwsem_rlock_fast);

The corresponding change for 5.2 or earlier kernels is:

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index fbe96341beee..2fbbb2d46396 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -246,6 +246,7 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
 		 * been set in the count.
 		 */
 		if (atomic_long_read(&sem->count) >= 0) {
+			smp_acquire__after_ctrl_dep();
 			raw_spin_unlock_irq(&sem->wait_lock);
 			return sem;
 		}

-Longman

2019-07-17 19:23:34

by Jan Stancek

Subject: Re: [PATCH v2] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty

On Wed, Jul 17, 2019 at 10:19:04AM -0400, Waiman Long wrote:
>> If you add a comment to the code outlining the issue (preferably as a litmus
>> test involving sem->count and some shared data which happens to be
>> vmacache_seqnum in your test)), then:
>>
>> Reviewed-by: Will Deacon <[email protected]>
>>
>> Thanks,
>>
>> Will
>
>Agreed. A comment just above smp_acquire__after_ctrl_dep() on why this
>is needed will be great.
>
>Other than that,
>
>Acked-by: Waiman Long <[email protected]>
>

The litmus test looks a bit long; would the following be acceptable?

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 37524a47f002..d9c96651bfc7 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1032,6 +1032,13 @@ static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
 	 */
 	if (adjustment && !(atomic_long_read(&sem->count) &
 	    (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
+		/*
+		 * down_read() issued ACQUIRE on enter, but we can race
+		 * with writer who did RELEASE only after us.
+		 * ACQUIRE here makes sure reader operations happen only
+		 * after all writer ones.
+		 */
+		smp_acquire__after_ctrl_dep();
 		raw_spin_unlock_irq(&sem->wait_lock);
 		rwsem_set_reader_owned(sem);
 		lockevent_inc(rwsem_rlock_fast);


with litmus test in commit log:
----------------------------------- 8< ------------------------------------
C rwsem

{
	atomic_t rwsem_count = ATOMIC_INIT(1);
	int vmacache_seqnum = 10;
}

P0(int *vmacache_seqnum, atomic_t *rwsem_count)
{
	r0 = READ_ONCE(*vmacache_seqnum);
	WRITE_ONCE(*vmacache_seqnum, r0 + 1);
	/* downgrade_write */
	r1 = atomic_fetch_add_release(-1+256, rwsem_count);
}

P1(int *vmacache_seqnum, atomic_t *rwsem_count, spinlock_t *sem_wait_lock)
{
	/* rwsem_read_trylock */
	r0 = atomic_add_return_acquire(256, rwsem_count);
	/* rwsem_down_read_slowpath */
	spin_lock(sem_wait_lock);
	r0 = atomic_read(rwsem_count);
	if ((r0 & 1) == 0) {
		// BUG: needs barrier
		spin_unlock(sem_wait_lock);
		r1 = READ_ONCE(*vmacache_seqnum);
	}
}
exists (1:r1=10)
----------------------------------- 8< ------------------------------------
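
For reference, the test above can be run with herd7 against the Linux
kernel memory model in tools/memory-model (the file name below is
arbitrary), e.g. from a tools/memory-model checkout:

  $ herd7 -conf linux-kernel.cfg rwsem.litmus

The "exists" clause is expected to be reachable there, corresponding to
the stale vmacache_seqnum observation described in the patch.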

Thanks,
Jan

2019-07-17 19:42:28

by Waiman Long

Subject: Re: [PATCH v2] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty

On 7/17/19 3:22 PM, Jan Stancek wrote:
> On Wed, Jul 17, 2019 at 10:19:04AM -0400, Waiman Long wrote:
>>> If you add a comment to the code outlining the issue (preferably as
>>> a litmus
>>> test involving sem->count and some shared data which happens to be
>>> vmacache_seqnum in your test)), then:
>>>
>>> Reviewed-by: Will Deacon <[email protected]>
>>>
>>> Thanks,
>>>
>>> Will
>>
>> Agreed. A comment just above smp_acquire__after_ctrl_dep() on why this
>> is needed will be great.
>>
>> Other than that,
>>
>> Acked-by: Waiman Long <[email protected]>
>>
>
> litmus test looks a bit long, would following be acceptable?
>
> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> index 37524a47f002..d9c96651bfc7 100644
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -1032,6 +1032,13 @@ static inline bool
> rwsem_reader_phase_trylock(struct rw_semaphore *sem,
>           */
>          if (adjustment && !(atomic_long_read(&sem->count) &
>               (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
> +            /*
> +             * down_read() issued ACQUIRE on enter, but we can race
> +             * with writer who did RELEASE only after us.
> +             * ACQUIRE here makes sure reader operations happen only
> +             * after all writer ones.
> +             */


How about that?

		/*
		 * Add an acquire barrier here to make sure no stale data
		 * acquired before the above test where the writer may still
		 * be holding the lock will be reused in the reader critical
		 * section.
		 */

Thanks,
Longman


2019-07-18 08:53:32

by Jan Stancek

Subject: [PATCH v3] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty

LTP mtest06 has been observed to rarely hit "still mapped when deleted"
and the following BUG_ON on arm64:
page:ffff7e02fa37e480 refcount:3 mapcount:1 mapping:ffff80be3d678ab0 index:0x0
xfs_address_space_operations [xfs]
flags: 0xbfffe000000037(locked|referenced|uptodate|lru|active)
page dumped because: VM_BUG_ON_PAGE(page_mapped(page))
------------[ cut here ]------------
kernel BUG at mm/filemap.c:171!
Internal error: Oops - BUG: 0 [#1] SMP
CPU: 220 PID: 154292 Comm: mmap1 Not tainted 5.2.0-0ecfebd.cki #1
Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.10 05/17/2019
pstate: 40400089 (nZcv daIf +PAN -UAO)
pc : unaccount_page_cache_page+0x17c/0x1a0
lr : unaccount_page_cache_page+0x17c/0x1a0
Call trace:
unaccount_page_cache_page+0x17c/0x1a0
delete_from_page_cache_batch+0xa0/0x300
truncate_inode_pages_range+0x1b8/0x640
truncate_inode_pages_final+0x88/0xa8
evict+0x1a0/0x1d8
iput+0x150/0x240
dentry_unlink_inode+0x120/0x130
__dentry_kill+0xd8/0x1d0
dentry_kill+0x88/0x248
dput+0x168/0x1b8
__fput+0xe8/0x208
____fput+0x20/0x30
task_work_run+0xc0/0xf0
do_notify_resume+0x2b0/0x328
work_pending+0x8/0x10

The extra mapcount originated from the pagefault handler, which handled a
pagefault for a vma that had already been detached. The vma is detached
under the mmap_sem write lock by detach_vmas_to_be_unmapped(), which
also invalidates the vmacache.

When the pagefault handler (under the mmap_sem read lock) called find_vma(),
vmacache_valid() wrongly reported the vmacache as valid.

When rwsem down_read() returns via the 'queue empty' path (as of v5.2),
it does so without issuing a read_acquire on sem->count:

  down_read
    __down_read
      rwsem_down_read_failed
        __rwsem_down_read_failed_common
          raw_spin_lock_irq(&sem->wait_lock);
          if (list_empty(&sem->wait_list)) {
                  if (atomic_long_read(&sem->count) >= 0) {
                          raw_spin_unlock_irq(&sem->wait_lock);
                          return sem;

The suspected problem here is that the last *_acquire on the down_read()
side happens before the write side issues its *_release:
1. writer: has the lock
2. reader: down_read() issues *read_acquire on entry
3. writer: mm->vmacache_seqnum++; downgrades lock (*fetch_add_release)
4. reader: __rwsem_down_read_failed_common() finds it can take lock and returns
5. reader: observes stale mm->vmacache_seqnum

----------------------------------- 8< ------------------------------------
C rwsem

{
	atomic_t rwsem_count = ATOMIC_INIT(1);
	int vmacache_seqnum = 10;
}

P0(int *vmacache_seqnum, atomic_t *rwsem_count)
{
	r0 = READ_ONCE(*vmacache_seqnum);
	WRITE_ONCE(*vmacache_seqnum, r0 + 1);
	/* downgrade_write */
	r1 = atomic_fetch_add_release(-1+256, rwsem_count);
}

P1(int *vmacache_seqnum, atomic_t *rwsem_count, spinlock_t *sem_wait_lock)
{
	/* rwsem_read_trylock */
	r0 = atomic_add_return_acquire(256, rwsem_count);
	/* rwsem_down_read_slowpath */
	spin_lock(sem_wait_lock);
	r0 = atomic_read(rwsem_count);
	if ((r0 & 1) == 0) {
		// BUG: needs barrier
		spin_unlock(sem_wait_lock);
		r1 = READ_ONCE(*vmacache_seqnum);
	}
}
exists (1:r1=10)
----------------------------------- >8 ------------------------------------

I can reproduce the problem by running LTP mtest06 in a loop and building
a kernel (-j $NCPUS) in parallel. It reproduces from v4.20 up to v5.2
on an arm64 HPE Apollo 70 (224 CPUs, 256GB RAM, 2 nodes) and triggers
reliably within ~1 hour. A patched kernel ran fine for 10+ hours with a
clean dmesg. Tests were done against v5.2, since commit cf69482d62d9
("locking/rwsem: Enable readers spinning on writer") makes it much harder
to reproduce.

v2: Move barrier after test (Waiman Long)
    Use smp_acquire__after_ctrl_dep() (Peter Zijlstra)
v3: Add comment to barrier (Waiman Long, Will Deacon)
    Add litmus test

Related: https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap1.c
Related: commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
Fixes: 4b486b535c33 ("locking/rwsem: Exit read lock slowpath if queue empty & no writer")

Signed-off-by: Jan Stancek <[email protected]>
Reviewed-by: Will Deacon <[email protected]>
Acked-by: Waiman Long <[email protected]>
Cc: [email protected] # v4.20+
Cc: Waiman Long <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
kernel/locking/rwsem.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 37524a47f002..fe02aef39e9d 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1032,6 +1032,13 @@ static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
 	 */
 	if (adjustment && !(atomic_long_read(&sem->count) &
 	    (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
+		/*
+		 * Add an acquire barrier here to make sure no stale
+		 * data acquired before the above test, where the writer
+		 * may still be holding the lock, will be reused in the
+		 * reader critical section.
+		 */
+		smp_acquire__after_ctrl_dep();
 		raw_spin_unlock_irq(&sem->wait_lock);
 		rwsem_set_reader_owned(sem);
 		lockevent_inc(rwsem_rlock_fast);
--
1.8.3.1

2019-07-18 09:27:27

by Will Deacon

Subject: Re: [PATCH v2] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty

Hi Jan, Waiman, [+Jade and Paul for the litmus test at the end]

On Wed, Jul 17, 2019 at 09:22:00PM +0200, Jan Stancek wrote:
> On Wed, Jul 17, 2019 at 10:19:04AM -0400, Waiman Long wrote:
> > > If you add a comment to the code outlining the issue (preferably as a litmus
> > > test involving sem->count and some shared data which happens to be
> > > vmacache_seqnum in your test)), then:
> > >
> > > Reviewed-by: Will Deacon <[email protected]>
> > >
> > > Thanks,
> > >
> > > Will
> >
> > Agreed. A comment just above smp_acquire__after_ctrl_dep() on why this
> > is needed will be great.
> >
> > Other than that,
> >
> > Acked-by: Waiman Long <[email protected]>
> >
>
> litmus test looks a bit long, would following be acceptable?
>
> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> index 37524a47f002..d9c96651bfc7 100644
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -1032,6 +1032,13 @@ static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
> */
> if (adjustment && !(atomic_long_read(&sem->count) &
> (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
> + /*
> + * down_read() issued ACQUIRE on enter, but we can race
> + * with writer who did RELEASE only after us.
> + * ACQUIRE here makes sure reader operations happen only
> + * after all writer ones.
> + */

How about an abridged form of the litmus test here, just to show the code
flow? e.g.:

/*
 * We need to ensure ACQUIRE semantics when reading sem->count so that
 * we pair with the RELEASE store performed by an unlocking/downgrading
 * writer.
 *
 *  P0 (writer)				P1 (reader)
 *
 *  down_write(sem);
 *  <write shared data>
 *  downgrade_write(sem);
 *    -> fetch_add_release(&sem->count)
 *
 *					down_read_slowpath(sem);
 *					  -> atomic_read(&sem->count)
 *					     <ctrl dep>
 *					  smp_acquire__after_ctrl_dep()
 *					<read shared data>
 */

In writing this, I also noticed that we don't have any explicit ordering
at the end of the reader slowpath when we wait on the queue but get woken
immediately:

if (!waiter.task)
break;

Am I missing something?

> + smp_acquire__after_ctrl_dep();
> raw_spin_unlock_irq(&sem->wait_lock);
> rwsem_set_reader_owned(sem);
> lockevent_inc(rwsem_rlock_fast);
>
>
> with litmus test in commit log:
> ----------------------------------- 8< ------------------------------------
> C rwsem
>
> {
> atomic_t rwsem_count = ATOMIC_INIT(1);
> int vmacache_seqnum = 10;
> }
>
> P0(int *vmacache_seqnum, atomic_t *rwsem_count)
> {
> r0 = READ_ONCE(*vmacache_seqnum);
> WRITE_ONCE(*vmacache_seqnum, r0 + 1);
> /* downgrade_write */
> r1 = atomic_fetch_add_release(-1+256, rwsem_count);
> }
>
> P1(int *vmacache_seqnum, atomic_t *rwsem_count, spinlock_t *sem_wait_lock)
> {
> /* rwsem_read_trylock */
> r0 = atomic_add_return_acquire(256, rwsem_count);
> /* rwsem_down_read_slowpath */
> spin_lock(sem_wait_lock);
> r0 = atomic_read(rwsem_count);
> if ((r0 & 1) == 0) {
> // BUG: needs barrier
> spin_unlock(sem_wait_lock);
> r1 = READ_ONCE(*vmacache_seqnum);
> }
> }
> exists (1:r1=10)
> ----------------------------------- 8< ------------------------------------

Thanks for writing this! It's definitely worth sticking it in the commit
log, but Paul and Jade might also like to include it as part of their litmus
test repository too.

Will

2019-07-18 10:53:15

by Jan Stancek

Subject: Re: [PATCH v2] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty


----- Original Message -----
> Hi Jan, Waiman, [+Jade and Paul for the litmus test at the end]
>
> On Wed, Jul 17, 2019 at 09:22:00PM +0200, Jan Stancek wrote:
> > On Wed, Jul 17, 2019 at 10:19:04AM -0400, Waiman Long wrote:
> > > > If you add a comment to the code outlining the issue (preferably as a
> > > > litmus
> > > > test involving sem->count and some shared data which happens to be
> > > > vmacache_seqnum in your test)), then:
> > > >
> > > > Reviewed-by: Will Deacon <[email protected]>
> > > >
> > > > Thanks,
> > > >
> > > > Will
> > >
> > > Agreed. A comment just above smp_acquire__after_ctrl_dep() on why this
> > > is needed will be great.
> > >
> > > Other than that,
> > >
> > > Acked-by: Waiman Long <[email protected]>
> > >
> >
> > litmus test looks a bit long, would following be acceptable?
> >
> > diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> > index 37524a47f002..d9c96651bfc7 100644
> > --- a/kernel/locking/rwsem.c
> > +++ b/kernel/locking/rwsem.c
> > @@ -1032,6 +1032,13 @@ static inline bool rwsem_reader_phase_trylock(struct
> > rw_semaphore *sem,
> > */
> > if (adjustment && !(atomic_long_read(&sem->count) &
> > (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
> > + /*
> > + * down_read() issued ACQUIRE on enter, but we can race
> > + * with writer who did RELEASE only after us.
> > + * ACQUIRE here makes sure reader operations happen only
> > + * after all writer ones.
> > + */
>
> How about an abridged form of the litmus test here, just to show the cod
> flow? e.g.:
>
> /*
> * We need to ensure ACQUIRE semantics when reading sem->count so that
> * we pair with the RELEASE store performed by an unlocking/downgrading
> * writer.
> *
> * P0 (writer) P1 (reader)
> *
> * down_write(sem);
> * <write shared data>
> * downgrade_write(sem);
> * -> fetch_add_release(&sem->count)
> *
> * down_read_slowpath(sem);
> * -> atomic_read(&sem->count)
> * <ctrl dep>
> * smp_acquire__after_ctrl_dep()
> * <read shared data>
> */

Works for me. The code is at 3 levels of indentation, but I can try
to squeeze it in for v4.

>
> In writing this, I also noticed that we don't have any explicit ordering
> at the end of the reader slowpath when we wait on the queue but get woken
> immediately:
>
> if (!waiter.task)
> break;
>
> Am I missing something?

I'm assuming this isn't a problem, because set_current_state() on the line
above is using smp_store_mb().
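
(For reference, set_current_state() is roughly the following, modulo the
debug variant:

	#define set_current_state(state_value)				\
		smp_store_mb(current->state, (state_value))

i.e. a store to ->state followed by the equivalent of a full memory
barrier.)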

>
> > + smp_acquire__after_ctrl_dep();
> > raw_spin_unlock_irq(&sem->wait_lock);
> > rwsem_set_reader_owned(sem);
> > lockevent_inc(rwsem_rlock_fast);
> >
> >
> > with litmus test in commit log:
> > ----------------------------------- 8< ------------------------------------
> > C rwsem
> >
> > {
> > atomic_t rwsem_count = ATOMIC_INIT(1);
> > int vmacache_seqnum = 10;
> > }
> >
> > P0(int *vmacache_seqnum, atomic_t *rwsem_count)
> > {
> > r0 = READ_ONCE(*vmacache_seqnum);
> > WRITE_ONCE(*vmacache_seqnum, r0 + 1);
> > /* downgrade_write */
> > r1 = atomic_fetch_add_release(-1+256, rwsem_count);
> > }
> >
> > P1(int *vmacache_seqnum, atomic_t *rwsem_count, spinlock_t *sem_wait_lock)
> > {
> > /* rwsem_read_trylock */
> > r0 = atomic_add_return_acquire(256, rwsem_count);
> > /* rwsem_down_read_slowpath */
> > spin_lock(sem_wait_lock);
> > r0 = atomic_read(rwsem_count);
> > if ((r0 & 1) == 0) {
> > // BUG: needs barrier
> > spin_unlock(sem_wait_lock);
> > r1 = READ_ONCE(*vmacache_seqnum);
> > }
> > }
> > exists (1:r1=10)
> > ----------------------------------- 8< ------------------------------------
>
> Thanks for writing this! It's definitely worth sticking it in the commit
> log, but Paul and Jade might also like to include it as part of their litmus
> test repository too.
>
> Will
>

2019-07-18 10:59:04

by Peter Zijlstra

Subject: Re: [PATCH v2] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty

On Thu, Jul 18, 2019 at 10:26:41AM +0100, Will Deacon wrote:

> /*
> * We need to ensure ACQUIRE semantics when reading sem->count so that
> * we pair with the RELEASE store performed by an unlocking/downgrading
> * writer.
> *
> * P0 (writer) P1 (reader)
> *
> * down_write(sem);
> * <write shared data>
> * downgrade_write(sem);
> * -> fetch_add_release(&sem->count)
> *
> * down_read_slowpath(sem);
> * -> atomic_read(&sem->count)
> * <ctrl dep>
> * smp_acquire__after_ctrl_dep()
> * <read shared data>
> */

So I'm thinking all this is excessive; the simple rule is: lock acquire
should imply ACQUIRE, we all know why.

> In writing this, I also noticed that we don't have any explicit ordering
> at the end of the reader slowpath when we wait on the queue but get woken
> immediately:
>
> if (!waiter.task)
> break;
>
> Am I missing something?

Ha!, I ran into the very same one. I keep confusing myself, but I think
you're right and that needs to be smp_load_acquire() to match the
smp_store_release() in rwsem_mark_wake().

(the actual race there is _tiny_ due to the smp_mb() right before it,
but I cannot convince myself that is indeed sufficient)

The signal_pending_state() case is also fun, but I think wait_lock there
is sufficient (even under RCpc).

I've ended up with this..

---
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 37524a47f002..9eb630904a17 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1000,6 +1000,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 	atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
 	adjustment = 0;
 	if (rwsem_optimistic_spin(sem, false)) {
+		/* rwsem_optimistic_spin() implies ACQUIRE through rwsem_*trylock() */
 		/*
 		 * Wake up other readers in the wait list if the front
 		 * waiter is a reader.
@@ -1014,6 +1015,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 		}
 		return sem;
 	} else if (rwsem_reader_phase_trylock(sem, waiter.last_rowner)) {
+		/* rwsem_reader_phase_trylock() implies ACQUIRE */
 		return sem;
 	}
 
@@ -1032,6 +1034,8 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 	 */
 	if (adjustment && !(atomic_long_read(&sem->count) &
 	    (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
+		/* Provide lock ACQUIRE */
+		smp_acquire__after_ctrl_dep();
 		raw_spin_unlock_irq(&sem->wait_lock);
 		rwsem_set_reader_owned(sem);
 		lockevent_inc(rwsem_rlock_fast);
@@ -1065,15 +1069,25 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 	wake_up_q(&wake_q);
 
 	/* wait to be given the lock */
-	while (true) {
+	for (;;) {
 		set_current_state(state);
-		if (!waiter.task)
+		if (!smp_load_acquire(&waiter.task)) {
+			/*
+			 * Matches rwsem_mark_wake()'s smp_store_release() and ensures
+			 * we're ordered against its sem->count operations.
+			 */
 			break;
+		}
 		if (signal_pending_state(state, current)) {
 			raw_spin_lock_irq(&sem->wait_lock);
 			if (waiter.task)
 				goto out_nolock;
 			raw_spin_unlock_irq(&sem->wait_lock);
+			/*
+			 * Ordered by sem->wait_lock against rwsem_mark_wake(), if we
+			 * see its waiter.task store, we must also see its sem->count
+			 * changes.
+			 */
 			break;
 		}
 		schedule();
@@ -1083,6 +1097,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 	__set_current_state(TASK_RUNNING);
 	lockevent_inc(rwsem_rlock);
 	return sem;
+
 out_nolock:
 	list_del(&waiter.list);
 	if (list_empty(&sem->wait_list)) {

2019-07-18 11:06:09

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty

On Thu, Jul 18, 2019 at 06:50:52AM -0400, Jan Stancek wrote:
> > In writing this, I also noticed that we don't have any explicit ordering
> > at the end of the reader slowpath when we wait on the queue but get woken
> > immediately:
> >
> > if (!waiter.task)
> > break;
> >
> > Am I missing something?
>
> I'm assuming this isn't a problem, because set_current_state() on the line
> above is using smp_store_mb().


  /* reader */                            /* writer */

  X = 0;

                                          X = 1;
  rwsem_down_read()                       rwsem_up_write();

    for (;;) {
      set_current_state(TASK_UNINTERRUPTIBLE);

                                            rwsem_mark_wake()
                                              atomic_long_add(adjustment, &sem->count);
                                              smp_store_release(&waiter->task, NULL);

      if (!waiter.task)
        break;

      ...
    }

    r = X;


can, I think, result in r==0 just fine, because there's nothing ordering
the load of waiter.task with the store of X.

It is exceedingly unlikely, but not impossible afaict.
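
A minimal litmus sketch of that interleaving (names are illustrative: X stands
in for the shared data, waiter_task for waiter.task) would be something like
the below; with the plain READ_ONCE() the final state should be allowed, and
turning that load into smp_load_acquire() should forbid it, which is why the
loop wants the acquire:

C rwsem-waiter-task-sketch

{
        int X = 0;
        int waiter_task = 1;
}

P0(int *X, int *waiter_task)
{
        /* writer: publish the data, then wake the waiter */
        WRITE_ONCE(*X, 1);
        smp_store_release(waiter_task, 0);
}

P1(int *X, int *waiter_task)
{
        int r0;
        int r1;

        /* reader: the '!waiter.task' check, then the later data read */
        r0 = READ_ONCE(*waiter_task);
        if (r0 == 0)
                r1 = READ_ONCE(*X);
}

exists (1:r0=0 /\ 1:r1=0)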

2019-07-18 11:11:36

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty


It's simpler like so:

On Thu, Jul 18, 2019 at 01:04:46PM +0200, Peter Zijlstra wrote:
>   /* reader */                            /* writer */
>
>   X = 0;
>
>   rwsem_down_read()
>     for (;;) {
>       set_current_state(TASK_UNINTERRUPTIBLE);
>
>                                           X = 1;
>                                           rwsem_up_write();
>                                             rwsem_mark_wake()
>                                               atomic_long_add(adjustment, &sem->count);
>                                               smp_store_release(&waiter->task, NULL);
>
>       if (!waiter.task)
>         break;
>
>       ...
>     }
>
>     r = X;
>

2019-07-18 11:39:04

by Jan Stancek

[permalink] [raw]
Subject: Re: [PATCH v2] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty


----- Original Message -----
>
> It's simpler like so:
>
> On Thu, Jul 18, 2019 at 01:04:46PM +0200, Peter Zijlstra wrote:
> >   /* reader */                            /* writer */
> >
> >   X = 0;
> >
> >   rwsem_down_read()
> >     for (;;) {
> >       set_current_state(TASK_UNINTERRUPTIBLE);
> >
> >                                           X = 1;
> >                                           rwsem_up_write();
> >                                             rwsem_mark_wake()
> >                                               atomic_long_add(adjustment, &sem->count);
> >                                               smp_store_release(&waiter->task, NULL);
> >
> >       if (!waiter.task)
> >         break;
> >
> >       ...
> >     }
> >
> >     r = X;
> >

I see - it looks possible. Thank you for this example.

2019-07-18 11:46:40

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v2] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty

On Thu, Jul 18, 2019 at 12:58:12PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 18, 2019 at 10:26:41AM +0100, Will Deacon wrote:
>
> > /*
> > * We need to ensure ACQUIRE semantics when reading sem->count so that
> > * we pair with the RELEASE store performed by an unlocking/downgrading
> > * writer.
> > *
> > * P0 (writer)                          P1 (reader)
> > *
> > * down_write(sem);
> > * <write shared data>
> > * downgrade_write(sem);
> > * -> fetch_add_release(&sem->count)
> > *
> > *                                      down_read_slowpath(sem);
> > *                                      -> atomic_read(&sem->count)
> > *                                         <ctrl dep>
> > *                                         smp_acquire__after_ctrl_dep()
> > *                                      <read shared data>
> > */
>
> So I'm thinking all this is excessive; the simple rule is: lock acquire
> should imply ACQUIRE, we all know why.

Fair enough, I just thought this was worth highlighting because you can't
rely on the wait_lock to give you ACQUIRE ordering.

> > In writing this, I also noticed that we don't have any explicit ordering
> > at the end of the reader slowpath when we wait on the queue but get woken
> > immediately:
> >
> > if (!waiter.task)
> > break;
> >
> > Am I missing something?
>
> Ha!, I ran into the very same one. I keep confusing myself, but I think
> you're right and that needs to be smp_load_acquire() to match the
> smp_store_release() in rwsem_mark_wake().
>
> (the actual race there is _tiny_ due to the smp_mb() right before it,
> but I cannot convince myself that is indeed sufficient)
>
> The signal_pending_state() case is also fun, but I think wait_lock there
> is sufficient (even under RCpc).
>
> I've ended up with this..
>
> ---
> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> index 37524a47f002..9eb630904a17 100644
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -1000,6 +1000,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
> atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
> adjustment = 0;
> if (rwsem_optimistic_spin(sem, false)) {
> + /* rwsem_optimistic_spin() implies ACQUIRE through rwsem_*trylock() */

I couldn't figure out if this was dependent on the return value or not,
and looking at osq_lock() I also couldn't see the ACQUIRE barrier when we're
spinning on node->locked. Hmm.

> /*
> * Wake up other readers in the wait list if the front
> * waiter is a reader.
> @@ -1014,6 +1015,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
> }
> return sem;
> } else if (rwsem_reader_phase_trylock(sem, waiter.last_rowner)) {
> + /* rwsem_reader_phase_trylock() implies ACQUIRE */

Can we add "on success" to the end of this, please?

> return sem;
> }
>
> @@ -1032,6 +1034,8 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
> */
> if (adjustment && !(atomic_long_read(&sem->count) &
> (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
> + /* Provide lock ACQUIRE */
> + smp_acquire__after_ctrl_dep();
> raw_spin_unlock_irq(&sem->wait_lock);
> rwsem_set_reader_owned(sem);
> lockevent_inc(rwsem_rlock_fast);
> @@ -1065,15 +1069,25 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
> wake_up_q(&wake_q);
>
> /* wait to be given the lock */
> - while (true) {
> + for (;;) {
> set_current_state(state);
> - if (!waiter.task)
> + if (!smp_load_acquire(&waiter.task)) {
> + /*
> + * Matches rwsem_mark_wake()'s smp_store_release() and ensures
> + * we're ordered against its sem->count operations.
> + */
> break;
> + }

Ack. Also, grepping for 'waiter.task' reveals a similar usage in
drivers/tty/tty_ldsem.c if you're feeling brave enough.

Will

2019-07-18 12:13:01

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v2] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty

On Thu, Jul 18, 2019 at 06:50:52AM -0400, Jan Stancek wrote:
>
> ----- Original Message -----
> > Hi Jan, Waiman, [+Jade and Paul for the litmus test at the end]
> >
> > On Wed, Jul 17, 2019 at 09:22:00PM +0200, Jan Stancek wrote:
> > > On Wed, Jul 17, 2019 at 10:19:04AM -0400, Waiman Long wrote:
> > > > > If you add a comment to the code outlining the issue (preferably as a
> > > > > litmus
> > > > > test involving sem->count and some shared data which happens to be
> > > > > vmacache_seqnum in your test)), then:
> > > > >
> > > > > Reviewed-by: Will Deacon <[email protected]>
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Will
> > > >
> > > > Agreed. A comment just above smp_acquire__after_ctrl_dep() on why this
> > > > is needed will be great.
> > > >
> > > > Other than that,
> > > >
> > > > Acked-by: Waiman Long <[email protected]>
> > > >
> > >
> > > litmus test looks a bit long, would the following be acceptable?
> > >
> > > diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> > > index 37524a47f002..d9c96651bfc7 100644
> > > --- a/kernel/locking/rwsem.c
> > > +++ b/kernel/locking/rwsem.c
> > > @@ -1032,6 +1032,13 @@ static inline bool rwsem_reader_phase_trylock(struct
> > > rw_semaphore *sem,
> > > */
> > > if (adjustment && !(atomic_long_read(&sem->count) &
> > > (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
> > > + /*
> > > + * down_read() issued ACQUIRE on enter, but we can race
> > > + * with writer who did RELEASE only after us.
> > > + * ACQUIRE here makes sure reader operations happen only
> > > + * after all writer ones.
> > > + */
> >
> > How about an abridged form of the litmus test here, just to show the code
> > flow? e.g.:
> >
> > /*
> > * We need to ensure ACQUIRE semantics when reading sem->count so that
> > * we pair with the RELEASE store performed by an unlocking/downgrading
> > * writer.
> > *
> > * P0 (writer)                          P1 (reader)
> > *
> > * down_write(sem);
> > * <write shared data>
> > * downgrade_write(sem);
> > * -> fetch_add_release(&sem->count)
> > *
> > *                                      down_read_slowpath(sem);
> > *                                      -> atomic_read(&sem->count)
> > *                                         <ctrl dep>
> > *                                         smp_acquire__after_ctrl_dep()
> > *                                      <read shared data>
> > */
>
> Works for me. The code is at 3 levels of indentation, but I can try
> to squeeze it in for v4.
>
> >
> > In writing this, I also noticed that we don't have any explicit ordering
> > at the end of the reader slowpath when we wait on the queue but get woken
> > immediately:
> >
> > if (!waiter.task)
> > break;
> >
> > Am I missing something?
>
> I'm assuming this isn't a problem, because set_current_state() on the line
> above is using smp_store_mb().
>
> >
> > > + smp_acquire__after_ctrl_dep();
> > > raw_spin_unlock_irq(&sem->wait_lock);
> > > rwsem_set_reader_owned(sem);
> > > lockevent_inc(rwsem_rlock_fast);
> > >
> > >
> > > with litmus test in commit log:
> > > ----------------------------------- 8< ------------------------------------
> > > C rwsem
> > >
> > > {
> > >         atomic_t rwsem_count = ATOMIC_INIT(1);
> > >         int vmacache_seqnum = 10;
> > > }
> > >
> > > P0(int *vmacache_seqnum, atomic_t *rwsem_count)
> > > {
> > >         r0 = READ_ONCE(*vmacache_seqnum);
> > >         WRITE_ONCE(*vmacache_seqnum, r0 + 1);
> > >         /* downgrade_write */
> > >         r1 = atomic_fetch_add_release(-1+256, rwsem_count);
> > > }
> > >
> > > P1(int *vmacache_seqnum, atomic_t *rwsem_count, spinlock_t *sem_wait_lock)
> > > {
> > >         /* rwsem_read_trylock */
> > >         r0 = atomic_add_return_acquire(256, rwsem_count);
> > >         /* rwsem_down_read_slowpath */
> > >         spin_lock(sem_wait_lock);
> > >         r0 = atomic_read(rwsem_count);
> > >         if ((r0 & 1) == 0) {
> > >                 // BUG: needs barrier
> > >                 spin_unlock(sem_wait_lock);
> > >                 r1 = READ_ONCE(*vmacache_seqnum);
> > >         }
> > > }
> > > exists (1:r1=10)
> > > ----------------------------------- 8< ------------------------------------
> >
> > Thanks for writing this! It's definitely worth sticking it in the commit
> > log, but Paul and Jade might also like to include it as part of their litmus
> > test repository too.

Thank you for forwarding this! It may now be found at:

https://github.com/paulmckrcu/litmus/blob/master/manual/kernel/C-JanStancek-rwsem.litmus

Thanx, Paul

2019-07-18 12:24:05

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty

On Thu, Jul 18, 2019 at 12:45:47PM +0100, Will Deacon wrote:
> On Thu, Jul 18, 2019 at 12:58:12PM +0200, Peter Zijlstra wrote:
> > On Thu, Jul 18, 2019 at 10:26:41AM +0100, Will Deacon wrote:
> >
> > > /*
> > > * We need to ensure ACQUIRE semantics when reading sem->count so that
> > > * we pair with the RELEASE store performed by an unlocking/downgrading
> > > * writer.
> > > *
> > > * P0 (writer)                          P1 (reader)
> > > *
> > > * down_write(sem);
> > > * <write shared data>
> > > * downgrade_write(sem);
> > > * -> fetch_add_release(&sem->count)
> > > *
> > > *                                      down_read_slowpath(sem);
> > > *                                      -> atomic_read(&sem->count)
> > > *                                         <ctrl dep>
> > > *                                         smp_acquire__after_ctrl_dep()
> > > *                                      <read shared data>
> > > */
> >
> > So I'm thinking all this is excessive; the simple rule is: lock acquire
> > should imply ACQUIRE, we all know why.
>
> Fair enough, I just thought this was worth highlighting because you can't
> rely on the wait_lock to give you ACQUIRE ordering.

Right, not in this case, because sem->count is not fully serialized by
it, whereas below the wait-queue is.

> > ---
> > diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> > index 37524a47f002..9eb630904a17 100644
> > --- a/kernel/locking/rwsem.c
> > +++ b/kernel/locking/rwsem.c
> > @@ -1000,6 +1000,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
> > atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
> > adjustment = 0;
> > if (rwsem_optimistic_spin(sem, false)) {
> > + /* rwsem_optimistic_spin() implies ACQUIRE through rwsem_*trylock() */
>
> I couldn't figure out if this was dependent on the return value or not,

I went with the fact that the only way to return true is if taken
becomes true; and that only happens through
rwsem_try_{read,write}_lock_unqueued(), and both imply ACQUIRE on
success.
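
As a sketch of that shape (purely illustrative, not the actual rwsem helpers,
and the bit layout is made up): only the successful cmpxchg is an _acquire
operation, so a true return implies ACQUIRE while a failed attempt orders
nothing.

#include <linux/atomic.h>

/* Illustrative sketch only -- not the rwsem implementation. */
static inline bool example_trylock_unqueued(atomic_long_t *count)
{
        long old = atomic_long_read(count);

        /* pretend bit 0 means "a writer holds the lock" */
        if (old & 1)
                return false;

        /* ACQUIRE semantics only on the successful cmpxchg */
        return atomic_long_try_cmpxchg_acquire(count, &old, old | 1);
}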

> and looking at osq_lock() I also couldn't see the ACQUIRE barrier when we're
> spinning on node->locked. Hmm.

Yes, osq is not a full lock and does not imply these barriers. This came
up somewhere, did we forget to write a comment on that? Lemme go look.

> > /*
> > * Wake up other readers in the wait list if the front
> > * waiter is a reader.
> > @@ -1014,6 +1015,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
> > }
> > return sem;
> > } else if (rwsem_reader_phase_trylock(sem, waiter.last_rowner)) {
> > + /* rwsem_reader_phase_trylock() implies ACQUIRE */
>
> Can we add "on success" to the end of this, please?

Good point.

> > return sem;
> > }
> >
> > @@ -1032,6 +1034,8 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
> > */
> > if (adjustment && !(atomic_long_read(&sem->count) &
> > (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
> > + /* Provide lock ACQUIRE */
> > + smp_acquire__after_ctrl_dep();
> > raw_spin_unlock_irq(&sem->wait_lock);
> > rwsem_set_reader_owned(sem);
> > lockevent_inc(rwsem_rlock_fast);
> > @@ -1065,15 +1069,25 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
> > wake_up_q(&wake_q);
> >
> > /* wait to be given the lock */
> > - while (true) {
> > + for (;;) {
> > set_current_state(state);
> > - if (!waiter.task)
> > + if (!smp_load_acquire(&waiter.task)) {
> > + /*
> > + * Matches rwsem_mark_wake()'s smp_store_release() and ensures
> > + * we're ordered against its sem->count operations.
> > + */
> > break;
> > + }
>
> Ack. Also, grepping for 'waiter.task' reveals a similar usage in
> drivers/tty/tty_ldsem.c if you're feeling brave enough.

*sigh* of course, for every bug there needs to be a second copy
somewhere.

I'll go look there too. Thanks!


Subject: [tip:locking/core] locking/rwsem: Add missing ACQUIRE to read_slowpath exit when queue is empty

Commit-ID: e1b98fa316648420d0434d9ff5b92ad6609ba6c3
Gitweb: https://git.kernel.org/tip/e1b98fa316648420d0434d9ff5b92ad6609ba6c3
Author: Jan Stancek <[email protected]>
AuthorDate: Thu, 18 Jul 2019 10:51:25 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 25 Jul 2019 15:39:23 +0200

locking/rwsem: Add missing ACQUIRE to read_slowpath exit when queue is empty

LTP mtest06 has been observed to occasionally hit "still mapped when
deleted" and the following BUG_ON on arm64.

The extra mapcount originated from the pagefault handler, which handled
a pagefault for a vma that had already been detached. The vma is detached
under the mmap_sem write lock by detach_vmas_to_be_unmapped(), which
also invalidates the vmacache.

When the pagefault handler (under the mmap_sem read lock) calls
find_vma(), vmacache_valid() wrongly reports the vmacache as valid.

After rwsem down_read() returns via the 'queue empty' path (as of v5.2),
it does so without an ACQUIRE on sem->count:

  down_read()
    __down_read()
      rwsem_down_read_failed()
        __rwsem_down_read_failed_common()
          raw_spin_lock_irq(&sem->wait_lock);
          if (list_empty(&sem->wait_list)) {
                  if (atomic_long_read(&sem->count) >= 0) {
                          raw_spin_unlock_irq(&sem->wait_lock);
                          return sem;

The problem can be reproduced by running LTP mtest06 in a loop and
building the kernel (-j $NCPUS) in parallel. It reproduces since
v4.20 on an arm64 HPE Apollo 70 (224 CPUs, 256GB RAM, 2 nodes). It
triggers reliably in about an hour.

The patched kernel ran fine for 10+ hours.

Signed-off-by: Jan Stancek <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Will Deacon <[email protected]>
Acked-by: Waiman Long <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Fixes: 4b486b535c33 ("locking/rwsem: Exit read lock slowpath if queue empty & no writer")
Link: https://lkml.kernel.org/r/50b8914e20d1d62bb2dee42d342836c2c16ebee7.1563438048.git.jstancek@redhat.com
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/locking/rwsem.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index bc91aacaab58..d3ce7c6c42a6 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1036,6 +1036,8 @@ queue:
*/
if (adjustment && !(atomic_long_read(&sem->count) &
(RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
+ /* Provide lock ACQUIRE */
+ smp_acquire__after_ctrl_dep();
raw_spin_unlock_irq(&sem->wait_lock);
rwsem_set_reader_owned(sem);
lockevent_inc(rwsem_rlock_fast);