2020-10-26 15:59:51

by Filipe Manana

Subject: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")

Hello,

I've recently started to hit a warning followed by tasks hanging after
attempts to freeze a filesystem. A git bisection pointed to the
following commit:

commit 4d004099a668c41522242aa146a38cc4eb59cb1e
Author: Peter Zijlstra <[email protected]>
Date: Fri Oct 2 11:04:21 2020 +0200

lockdep: Fix lockdep recursion

This happens very reliably when running all xfstests with lockdep
enabled, with btrfs as the tested filesystem (I haven't tried other
filesystems, but it shouldn't matter). The warning and task hangs always
happen at either test generic/068 or test generic/390, and (oddly) I
always have to run the full test suite for it to trigger; running those
tests individually in an infinite loop doesn't seem to trigger it (at
least not for a couple of hours).

The warning triggered is at fs/super.c:__sb_start_write(), which always
results later in several tasks hanging on a percpu rw_sem:

https://pastebin.com/qnLvf94E

What happens is that percpu_rwsem_is_held() apparently returns a false
positive, which makes __sb_start_write() do a percpu_down_read_trylock()
on a percpu_rw_sem at a higher level. That trylock is expected to always
succeed, because if the calling task is holding the freeze percpu_rw_sem
at level 1, it is supposed to be able to try_lock the semaphore at level
2, since the freeze semaphores are always acquired in increasing level
order.

But the try_lock fails, which triggers the warning at __sb_start_write().
Its caller sb_start_pagefault() ignores the return value, so callers such
as btrfs_page_mkwrite() assume the freeze semaphore was taken, proceed to
do their work, and later call sb_end_pagefault(), which does an up_read()
on the percpu_rwsem at level 2 despite never having been able to
down_read() the semaphore. This obviously corrupts the semaphore's
read_count state, and later causes any task trying to down_write() it to
hang forever.
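
For context, the logic in __sb_start_write() (fs/super.c) is roughly the
following (a simplified sketch, not the exact code; the
percpu_rwsem_is_held() loop is only compiled in with CONFIG_LOCKDEP):

int __sb_start_write(struct super_block *sb, int level, bool wait)
{
	bool force_trylock = false;
	int ret = 1;
	int i;

	if (wait) {
		/*
		 * If lockdep says we already hold a lower freeze level,
		 * fall back to a trylock, which is expected to never fail.
		 */
		for (i = 0; i < level - 1; i++) {
			if (percpu_rwsem_is_held(sb->s_writers.rw_sem + i)) {
				force_trylock = true;
				break;
			}
		}
	}

	if (wait && !force_trylock)
		percpu_down_read(sb->s_writers.rw_sem + level - 1);
	else
		ret = percpu_down_read_trylock(sb->s_writers.rw_sem + level - 1);

	WARN_ON(force_trylock && !ret);		/* <-- the warning that fires */
	return ret;
}

So a false positive from percpu_rwsem_is_held() turns the normal blocking
down_read into a trylock, and that trylock can legitimately fail while a
freeze is in progress.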

After such a hang I ran a drgn script to confirm it:

$ cat dump_freeze_sems.py
import sys
import drgn
from drgn import NULL, Object, cast, container_of, execscript, \
    reinterpret, sizeof
from drgn.helpers.linux import *

mnt_path = b'/home/fdmanana/btrfs-tests/scratch_1'

mnt = None
for mnt in for_each_mount(prog, dst = mnt_path):
    pass

if mnt is None:
    sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
    sys.exit(1)

def dump_sem(level_enum):
    level = level_enum.value_()
    sem = mnt.mnt.mnt_sb.s_writers.rw_sem[level - 1]
    print(f'freeze semaphore at level {level}, {str(level_enum)}')
    print(f' block {sem.block.counter.value_()}')
    for i in for_each_possible_cpu(prog):
        read_count = per_cpu_ptr(sem.read_count, i)
        print(f' read_count at cpu {i} = {read_count}')
    print()

# dump semaphore read counts for all freeze levels (fs.h)
dump_sem(prog['SB_FREEZE_WRITE'])
dump_sem(prog['SB_FREEZE_PAGEFAULT'])
dump_sem(prog['SB_FREEZE_FS'])


$ drgn dump_freeze_sems.py
freeze semaphore at level 1, (enum <anonymous>)SB_FREEZE_WRITE
block 1
read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c74 = 3
read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c74 = 4294967293
read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c74 = 3
read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c74 = 4294967293

freeze semaphore at level 2, (enum <anonymous>)SB_FREEZE_PAGEFAULT
block 1
read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c78 = 0
read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c78 = 4294967295
read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c78 = 0
read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c78 = 0

freeze semaphore at level 3, (enum <anonymous>)SB_FREEZE_FS
block 0
read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c7c = 0
read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c7c = 0
read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c7c = 0
read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c7c = 0

At levels 1 and 3, read_count sums to 0, so it's fine, but at level 2 it
sums to -1. The system remains like that for hours at least, with no
progress at all.
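
(Note: the per-CPU counters are unsigned; read as signed 32-bit values,
4294967293 is -3 and 4294967295 is -1, which is how the sums above come
out to 0 and -1.)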

Is there a known regression with that lockdep commit?
Anything I can do to help debug it in case it's not obvious?

Thanks.


2020-11-03 14:12:19

by Boqun Feng

Subject: Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")

Hi Filipe,

On Mon, Oct 26, 2020 at 11:26:49AM +0000, Filipe Manana wrote:
> Hello,
>
> I've recently started to hit a warning followed by tasks hanging after
> attempts to freeze a filesystem. A git bisection pointed to the
> following commit:
>
> commit 4d004099a668c41522242aa146a38cc4eb59cb1e
> Author: Peter Zijlstra <[email protected]>
> Date: Fri Oct 2 11:04:21 2020 +0200
>
> lockdep: Fix lockdep recursion
>
> This happens very reliably when running all xfstests with lockdep
> enabled, and the tested filesystem is btrfs (haven't tried other
> filesystems, but it shouldn't matter). The warning and task hangs always
> happen at either test generic/068 or test generic/390, and (oddly)
> always have to run all tests for it to trigger, running those tests
> individually on an infinite loop doesn't seem to trigger it (at least
> for a couple hours).
>
> The warning triggered is at fs/super.c:__sb_start_write() which always
> results later in several tasks hanging on a percpu rw_sem:
>
> https://pastebin.com/qnLvf94E
>

In your dmesg, I see the line:

[ 9304.920151] INFO: lockdep is turned off.

That means debug_locks is 0, which usually happens when lockdep finds a
problem (i.e. a deadlock) and turns itself off, because once a problem
has been found it's pointless for lockdep to continue to run.

However, I haven't found a lockdep splat in your dmesg. Do you have a
full dmesg so that I can have a look?

This may be relevant because in commit 4d004099a66, we have

@@ -5056,13 +5081,13 @@ noinstr int lock_is_held_type(const struct lockdep_map *lock, int read)
unsigned long flags;
int ret = 0;

- if (unlikely(current->lockdep_recursion))
+ if (unlikely(!lockdep_enabled()))
return 1; /* avoid false negative lockdep_assert_held() */

Before this commit, lock_is_held_type() and its friends could return
false when debug_locks == 0; after this commit, they always return true
when debug_locks == 0. That could cause the behavior here.
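
For reference, the lockdep_enabled() helper introduced by that commit
looks roughly like this:

static __always_inline bool lockdep_enabled(void)
{
	if (!debug_locks)
		return false;

	if (this_cpu_read(lockdep_recursion))
		return false;

	if (current->lockdep_recursion)
		return false;

	return true;
}

So once debug_locks is cleared, lock_is_held_type() always takes the
early "return 1" path, and percpu_rwsem_is_held() then reports every
freeze semaphore as held.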

In case I'm correct, the following "fix" may be helpful.

Regards,
Boqun

---------->8
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 3e99dfef8408..c0e27fb949ff 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -5471,7 +5464,7 @@ noinstr int lock_is_held_type(const struct lockdep_map *lock, int read)
unsigned long flags;
int ret = 0;

- if (unlikely(!lockdep_enabled()))
+ if (unlikely(debug_locks && !lockdep_enabled()))
return 1; /* avoid false negative lockdep_assert_held() */

raw_local_irq_save(flags);



> What happens is percpu_rwsem_is_held() is apparently returning a false
> positive, so this makes __sb_start_write() do a
> percpu_down_read_trylock() on a percpu_rw_sem at a higher level, which
> is expected to always succeed, because if the calling task is holding a
> freeze percpu_rw_sem at level 1, it's supposed to be able to try_lock
> the semaphore at level 2, since the freeze semaphores are always
> acquired by increasing level order.
>
> But the try_lock fails, it triggers the warning at __sb_start_write(),
> then its caller sb_start_pagefault() ignores the return value and
> callers such as btrfs_page_mkwrite() make the assumption the freeze
> semaphore was taken, proceed to do their stuff, and later call
> sb_end_pagefault(), which which will do an up_read() on the percpu_rwsem
> at level 2 despite not having not been able to down_read() the
> semaphore. This obviously corrupts the semaphore's read_count state, and
> later causes any task trying to down_write() it to hang forever.
>
> After such a hang I ran a drgn script to confirm it:
>
> $ cat dump_freeze_sems.py
> import sys
> import drgn
> from drgn import NULL, Object, cast, container_of, execscript, \
> reinterpret, sizeof
> from drgn.helpers.linux import *
>
> mnt_path = b'/home/fdmanana/btrfs-tests/scratch_1'
>
> mnt = None
> for mnt in for_each_mount(prog, dst = mnt_path):
> pass
>
> if mnt is None:
> sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
> sys.exit(1)
>
> def dump_sem(level_enum):
> level = level_enum.value_()
> sem = mnt.mnt.mnt_sb.s_writers.rw_sem[level - 1]
> print(f'freeze semaphore at level {level}, {str(level_enum)}')
> print(f' block {sem.block.counter.value_()}')
> for i in for_each_possible_cpu(prog):
> read_count = per_cpu_ptr(sem.read_count, i)
> print(f' read_count at cpu {i} = {read_count}')
> print()
>
> # dump semaphore read counts for all freeze levels (fs.h)
> dump_sem(prog['SB_FREEZE_WRITE'])
> dump_sem(prog['SB_FREEZE_PAGEFAULT'])
> dump_sem(prog['SB_FREEZE_FS'])
>
>
> $ drgn dump_freeze_sems.py
> freeze semaphore at level 1, (enum <anonymous>)SB_FREEZE_WRITE
> block 1
> read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c74 = 3
> read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c74 = 4294967293
> read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c74 = 3
> read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c74 = 4294967293
>
> freeze semaphore at level 2, (enum <anonymous>)SB_FREEZE_PAGEFAULT
> block 1
> read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c78 = 0
> read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c78 = 4294967295
> read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c78 = 0
> read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c78 = 0
>
> freeze semaphore at level 3, (enum <anonymous>)SB_FREEZE_FS
> block 0
> read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c7c = 0
> read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c7c = 0
> read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c7c = 0
> read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c7c = 0
>
> At levels 1 and 3, read_count sums to 0, so it's fine, but at level 2 it
> sums to -1. The system remains like that for hours at least, with no
> progress at all.
>
> Is there a known regression with that lockdep commit?
> Anything I can do to help debug it in case it's not obvious?
>
> Thanks.

2020-11-03 14:26:07

by Filipe Manana

Subject: Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")



On 03/11/20 14:08, Boqun Feng wrote:
> Hi Filipe,
>
> On Mon, Oct 26, 2020 at 11:26:49AM +0000, Filipe Manana wrote:
>> Hello,
>>
>> I've recently started to hit a warning followed by tasks hanging after
>> attempts to freeze a filesystem. A git bisection pointed to the
>> following commit:
>>
>> commit 4d004099a668c41522242aa146a38cc4eb59cb1e
>> Author: Peter Zijlstra <[email protected]>
>> Date: Fri Oct 2 11:04:21 2020 +0200
>>
>> lockdep: Fix lockdep recursion
>>
>> This happens very reliably when running all xfstests with lockdep
>> enabled, and the tested filesystem is btrfs (haven't tried other
>> filesystems, but it shouldn't matter). The warning and task hangs always
>> happen at either test generic/068 or test generic/390, and (oddly)
>> always have to run all tests for it to trigger, running those tests
>> individually on an infinite loop doesn't seem to trigger it (at least
>> for a couple hours).
>>
>> The warning triggered is at fs/super.c:__sb_start_write() which always
>> results later in several tasks hanging on a percpu rw_sem:
>>
>> https://pastebin.com/qnLvf94E
>>
>
> In your dmesg, I see line:
>
> [ 9304.920151] INFO: lockdep is turned off.
>
> , that means debug_locks is 0, that usually happens when lockdep find a
> problem (i.e. a deadlock) and it turns itself off, because a problem is
> found and it's pointless for lockdep to continue to run.
>
> And I haven't found a lockdep splat in your dmesg, do you have a full
> dmesg so that I can have a look?

Every time I run into the issue it produces similar results. It always
starts with the WARN_ON() at __sb_start_write(), followed by several
"hung task" traces, and then some time later lockdep is turned off.

>
> This may be relevant because in commit 4d004099a66, we have
>
> @@ -5056,13 +5081,13 @@ noinstr int lock_is_held_type(const struct lockdep_map *lock, int read)
> unsigned long flags;
> int ret = 0;
>
> - if (unlikely(current->lockdep_recursion))
> + if (unlikely(!lockdep_enabled()))
> return 1; /* avoid false negative lockdep_assert_held() */
>
> before this commit lock_is_held_type() and its friends may return false
> if debug_locks==0, after this commit lock_is_held_type() and its friends
> will always return true if debug_locks == 0. That could cause the
> behavior here.
>
> In case I'm correct, the following "fix" may be helpful.

I'll try it and let you know.
Thanks!

>
> Regards,
> Boqun
>
> ----------8
> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> index 3e99dfef8408..c0e27fb949ff 100644
> --- a/kernel/locking/lockdep.c
> +++ b/kernel/locking/lockdep.c
> @@ -5471,7 +5464,7 @@ noinstr int lock_is_held_type(const struct lockdep_map *lock, int read)
> unsigned long flags;
> int ret = 0;
>
> - if (unlikely(!lockdep_enabled()))
> + if (unlikely(debug_locks && !lockdep_enabled()))
> return 1; /* avoid false negative lockdep_assert_held() */
>
> raw_local_irq_save(flags);
>
>
>
>> What happens is percpu_rwsem_is_held() is apparently returning a false
>> positive, so this makes __sb_start_write() do a
>> percpu_down_read_trylock() on a percpu_rw_sem at a higher level, which
>> is expected to always succeed, because if the calling task is holding a
>> freeze percpu_rw_sem at level 1, it's supposed to be able to try_lock
>> the semaphore at level 2, since the freeze semaphores are always
>> acquired by increasing level order.
>>
>> But the try_lock fails, it triggers the warning at __sb_start_write(),
>> then its caller sb_start_pagefault() ignores the return value and
>> callers such as btrfs_page_mkwrite() make the assumption the freeze
>> semaphore was taken, proceed to do their stuff, and later call
>> sb_end_pagefault(), which which will do an up_read() on the percpu_rwsem
>> at level 2 despite not having not been able to down_read() the
>> semaphore. This obviously corrupts the semaphore's read_count state, and
>> later causes any task trying to down_write() it to hang forever.
>>
>> After such a hang I ran a drgn script to confirm it:
>>
>> $ cat dump_freeze_sems.py
>> import sys
>> import drgn
>> from drgn import NULL, Object, cast, container_of, execscript, \
>> reinterpret, sizeof
>> from drgn.helpers.linux import *
>>
>> mnt_path = b'/home/fdmanana/btrfs-tests/scratch_1'
>>
>> mnt = None
>> for mnt in for_each_mount(prog, dst = mnt_path):
>> pass
>>
>> if mnt is None:
>> sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
>> sys.exit(1)
>>
>> def dump_sem(level_enum):
>> level = level_enum.value_()
>> sem = mnt.mnt.mnt_sb.s_writers.rw_sem[level - 1]
>> print(f'freeze semaphore at level {level}, {str(level_enum)}')
>> print(f' block {sem.block.counter.value_()}')
>> for i in for_each_possible_cpu(prog):
>> read_count = per_cpu_ptr(sem.read_count, i)
>> print(f' read_count at cpu {i} = {read_count}')
>> print()
>>
>> # dump semaphore read counts for all freeze levels (fs.h)
>> dump_sem(prog['SB_FREEZE_WRITE'])
>> dump_sem(prog['SB_FREEZE_PAGEFAULT'])
>> dump_sem(prog['SB_FREEZE_FS'])
>>
>>
>> $ drgn dump_freeze_sems.py
>> freeze semaphore at level 1, (enum <anonymous>)SB_FREEZE_WRITE
>> block 1
>> read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c74 = 3
>> read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c74 = 4294967293
>> read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c74 = 3
>> read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c74 = 4294967293
>>
>> freeze semaphore at level 2, (enum <anonymous>)SB_FREEZE_PAGEFAULT
>> block 1
>> read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c78 = 0
>> read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c78 = 4294967295
>> read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c78 = 0
>> read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c78 = 0
>>
>> freeze semaphore at level 3, (enum <anonymous>)SB_FREEZE_FS
>> block 0
>> read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c7c = 0
>> read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c7c = 0
>> read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c7c = 0
>> read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c7c = 0
>>
>> At levels 1 and 3, read_count sums to 0, so it's fine, but at level 2 it
>> sums to -1. The system remains like that for hours at least, with no
>> progress at all.
>>
>> Is there a known regression with that lockdep commit?
>> Anything I can do to help debug it in case it's not obvious?
>>
>> Thanks.
>

2020-11-03 19:46:39

by Filipe Manana

Subject: Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")



On 03/11/20 14:08, Boqun Feng wrote:
> Hi Filipe,
>
> On Mon, Oct 26, 2020 at 11:26:49AM +0000, Filipe Manana wrote:
>> Hello,
>>
>> I've recently started to hit a warning followed by tasks hanging after
>> attempts to freeze a filesystem. A git bisection pointed to the
>> following commit:
>>
>> commit 4d004099a668c41522242aa146a38cc4eb59cb1e
>> Author: Peter Zijlstra <[email protected]>
>> Date: Fri Oct 2 11:04:21 2020 +0200
>>
>> lockdep: Fix lockdep recursion
>>
>> This happens very reliably when running all xfstests with lockdep
>> enabled, and the tested filesystem is btrfs (haven't tried other
>> filesystems, but it shouldn't matter). The warning and task hangs always
>> happen at either test generic/068 or test generic/390, and (oddly)
>> always have to run all tests for it to trigger, running those tests
>> individually on an infinite loop doesn't seem to trigger it (at least
>> for a couple hours).
>>
>> The warning triggered is at fs/super.c:__sb_start_write() which always
>> results later in several tasks hanging on a percpu rw_sem:
>>
>> https://pastebin.com/qnLvf94E
>>
>
> In your dmesg, I see line:
>
> [ 9304.920151] INFO: lockdep is turned off.
>
> , that means debug_locks is 0, that usually happens when lockdep find a
> problem (i.e. a deadlock) and it turns itself off, because a problem is
> found and it's pointless for lockdep to continue to run.
>
> And I haven't found a lockdep splat in your dmesg, do you have a full
> dmesg so that I can have a look?
>
> This may be relevant because in commit 4d004099a66, we have
>
> @@ -5056,13 +5081,13 @@ noinstr int lock_is_held_type(const struct lockdep_map *lock, int read)
> unsigned long flags;
> int ret = 0;
>
> - if (unlikely(current->lockdep_recursion))
> + if (unlikely(!lockdep_enabled()))
> return 1; /* avoid false negative lockdep_assert_held() */
>
> before this commit lock_is_held_type() and its friends may return false
> if debug_locks==0, after this commit lock_is_held_type() and its friends
> will always return true if debug_locks == 0. That could cause the
> behavior here.
>
> In case I'm correct, the following "fix" may be helpful.
>
> Regards,
> Boqun
>
> ----------8
> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> index 3e99dfef8408..c0e27fb949ff 100644
> --- a/kernel/locking/lockdep.c
> +++ b/kernel/locking/lockdep.c
> @@ -5471,7 +5464,7 @@ noinstr int lock_is_held_type(const struct lockdep_map *lock, int read)
> unsigned long flags;
> int ret = 0;
>
> - if (unlikely(!lockdep_enabled()))
> + if (unlikely(debug_locks && !lockdep_enabled()))
> return 1; /* avoid false negative lockdep_assert_held() */
>
> raw_local_irq_save(flags);

Boqun, the patch fixes the problem for me!
You can have Tested-by: Filipe Manana <[email protected]>

Thanks!

>
>
>
>> What happens is percpu_rwsem_is_held() is apparently returning a false
>> positive, so this makes __sb_start_write() do a
>> percpu_down_read_trylock() on a percpu_rw_sem at a higher level, which
>> is expected to always succeed, because if the calling task is holding a
>> freeze percpu_rw_sem at level 1, it's supposed to be able to try_lock
>> the semaphore at level 2, since the freeze semaphores are always
>> acquired by increasing level order.
>>
>> But the try_lock fails, it triggers the warning at __sb_start_write(),
>> then its caller sb_start_pagefault() ignores the return value and
>> callers such as btrfs_page_mkwrite() make the assumption the freeze
>> semaphore was taken, proceed to do their stuff, and later call
>> sb_end_pagefault(), which which will do an up_read() on the percpu_rwsem
>> at level 2 despite not having not been able to down_read() the
>> semaphore. This obviously corrupts the semaphore's read_count state, and
>> later causes any task trying to down_write() it to hang forever.
>>
>> After such a hang I ran a drgn script to confirm it:
>>
>> $ cat dump_freeze_sems.py
>> import sys
>> import drgn
>> from drgn import NULL, Object, cast, container_of, execscript, \
>> reinterpret, sizeof
>> from drgn.helpers.linux import *
>>
>> mnt_path = b'/home/fdmanana/btrfs-tests/scratch_1'
>>
>> mnt = None
>> for mnt in for_each_mount(prog, dst = mnt_path):
>> pass
>>
>> if mnt is None:
>> sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
>> sys.exit(1)
>>
>> def dump_sem(level_enum):
>> level = level_enum.value_()
>> sem = mnt.mnt.mnt_sb.s_writers.rw_sem[level - 1]
>> print(f'freeze semaphore at level {level}, {str(level_enum)}')
>> print(f' block {sem.block.counter.value_()}')
>> for i in for_each_possible_cpu(prog):
>> read_count = per_cpu_ptr(sem.read_count, i)
>> print(f' read_count at cpu {i} = {read_count}')
>> print()
>>
>> # dump semaphore read counts for all freeze levels (fs.h)
>> dump_sem(prog['SB_FREEZE_WRITE'])
>> dump_sem(prog['SB_FREEZE_PAGEFAULT'])
>> dump_sem(prog['SB_FREEZE_FS'])
>>
>>
>> $ drgn dump_freeze_sems.py
>> freeze semaphore at level 1, (enum <anonymous>)SB_FREEZE_WRITE
>> block 1
>> read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c74 = 3
>> read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c74 = 4294967293
>> read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c74 = 3
>> read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c74 = 4294967293
>>
>> freeze semaphore at level 2, (enum <anonymous>)SB_FREEZE_PAGEFAULT
>> block 1
>> read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c78 = 0
>> read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c78 = 4294967295
>> read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c78 = 0
>> read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c78 = 0
>>
>> freeze semaphore at level 3, (enum <anonymous>)SB_FREEZE_FS
>> block 0
>> read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c7c = 0
>> read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c7c = 0
>> read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c7c = 0
>> read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c7c = 0
>>
>> At levels 1 and 3, read_count sums to 0, so it's fine, but at level 2 it
>> sums to -1. The system remains like that for hours at least, with no
>> progress at all.
>>
>> Is there a known regression with that lockdep commit?
>> Anything I can do to help debug it in case it's not obvious?
>>
>> Thanks.
>

2020-11-04 02:26:02

by Boqun Feng

Subject: Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")

On Tue, Nov 03, 2020 at 07:44:29PM +0000, Filipe Manana wrote:
>
>
> On 03/11/20 14:08, Boqun Feng wrote:
> > Hi Filipe,
> >
> > On Mon, Oct 26, 2020 at 11:26:49AM +0000, Filipe Manana wrote:
> >> Hello,
> >>
> >> I've recently started to hit a warning followed by tasks hanging after
> >> attempts to freeze a filesystem. A git bisection pointed to the
> >> following commit:
> >>
> >> commit 4d004099a668c41522242aa146a38cc4eb59cb1e
> >> Author: Peter Zijlstra <[email protected]>
> >> Date: Fri Oct 2 11:04:21 2020 +0200
> >>
> >> lockdep: Fix lockdep recursion
> >>
> >> This happens very reliably when running all xfstests with lockdep
> >> enabled, and the tested filesystem is btrfs (haven't tried other
> >> filesystems, but it shouldn't matter). The warning and task hangs always
> >> happen at either test generic/068 or test generic/390, and (oddly)
> >> always have to run all tests for it to trigger, running those tests
> >> individually on an infinite loop doesn't seem to trigger it (at least
> >> for a couple hours).
> >>
> >> The warning triggered is at fs/super.c:__sb_start_write() which always
> >> results later in several tasks hanging on a percpu rw_sem:
> >>
> >> https://pastebin.com/qnLvf94E
> >>
> >
> > In your dmesg, I see line:
> >
> > [ 9304.920151] INFO: lockdep is turned off.
> >
> > , that means debug_locks is 0, that usually happens when lockdep find a
> > problem (i.e. a deadlock) and it turns itself off, because a problem is
> > found and it's pointless for lockdep to continue to run.
> >
> > And I haven't found a lockdep splat in your dmesg, do you have a full
> > dmesg so that I can have a look?
> >
> > This may be relevant because in commit 4d004099a66, we have
> >
> > @@ -5056,13 +5081,13 @@ noinstr int lock_is_held_type(const struct lockdep_map *lock, int read)
> > unsigned long flags;
> > int ret = 0;
> >
> > - if (unlikely(current->lockdep_recursion))
> > + if (unlikely(!lockdep_enabled()))
> > return 1; /* avoid false negative lockdep_assert_held() */
> >
> > before this commit lock_is_held_type() and its friends may return false
> > if debug_locks==0, after this commit lock_is_held_type() and its friends
> > will always return true if debug_locks == 0. That could cause the
> > behavior here.
> >
> > In case I'm correct, the following "fix" may be helpful.
> >
> > Regards,
> > Boqun
> >
> > ----------8
> > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> > index 3e99dfef8408..c0e27fb949ff 100644
> > --- a/kernel/locking/lockdep.c
> > +++ b/kernel/locking/lockdep.c
> > @@ -5471,7 +5464,7 @@ noinstr int lock_is_held_type(const struct lockdep_map *lock, int read)
> > unsigned long flags;
> > int ret = 0;
> >
> > - if (unlikely(!lockdep_enabled()))
> > + if (unlikely(debug_locks && !lockdep_enabled()))
> > return 1; /* avoid false negative lockdep_assert_held() */
> >
> > raw_local_irq_save(flags);
>
> Boqun, the patch fixes the problem for me!
> You can have Tested-by: Filipe Manana <[email protected]>
>

Thanks. I think it still means that we have a locking issue when running
xfstests (because we don't know why debug_locks gets cleared), so I guess
I will have to reproduce this myself for further analysis. Could you
share your .config?

Anyway, I think this fix still makes some sense, so I will send a proper
patch so that the problem won't block the fs folks.

Regards,
Boqun

> Thanks!
>
> >
> >
> >
> >> What happens is percpu_rwsem_is_held() is apparently returning a false
> >> positive, so this makes __sb_start_write() do a
> >> percpu_down_read_trylock() on a percpu_rw_sem at a higher level, which
> >> is expected to always succeed, because if the calling task is holding a
> >> freeze percpu_rw_sem at level 1, it's supposed to be able to try_lock
> >> the semaphore at level 2, since the freeze semaphores are always
> >> acquired by increasing level order.
> >>
> >> But the try_lock fails, it triggers the warning at __sb_start_write(),
> >> then its caller sb_start_pagefault() ignores the return value and
> >> callers such as btrfs_page_mkwrite() make the assumption the freeze
> >> semaphore was taken, proceed to do their stuff, and later call
> >> sb_end_pagefault(), which which will do an up_read() on the percpu_rwsem
> >> at level 2 despite not having not been able to down_read() the
> >> semaphore. This obviously corrupts the semaphore's read_count state, and
> >> later causes any task trying to down_write() it to hang forever.
> >>
> >> After such a hang I ran a drgn script to confirm it:
> >>
> >> $ cat dump_freeze_sems.py
> >> import sys
> >> import drgn
> >> from drgn import NULL, Object, cast, container_of, execscript, \
> >> reinterpret, sizeof
> >> from drgn.helpers.linux import *
> >>
> >> mnt_path = b'/home/fdmanana/btrfs-tests/scratch_1'
> >>
> >> mnt = None
> >> for mnt in for_each_mount(prog, dst = mnt_path):
> >> pass
> >>
> >> if mnt is None:
> >> sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
> >> sys.exit(1)
> >>
> >> def dump_sem(level_enum):
> >> level = level_enum.value_()
> >> sem = mnt.mnt.mnt_sb.s_writers.rw_sem[level - 1]
> >> print(f'freeze semaphore at level {level}, {str(level_enum)}')
> >> print(f' block {sem.block.counter.value_()}')
> >> for i in for_each_possible_cpu(prog):
> >> read_count = per_cpu_ptr(sem.read_count, i)
> >> print(f' read_count at cpu {i} = {read_count}')
> >> print()
> >>
> >> # dump semaphore read counts for all freeze levels (fs.h)
> >> dump_sem(prog['SB_FREEZE_WRITE'])
> >> dump_sem(prog['SB_FREEZE_PAGEFAULT'])
> >> dump_sem(prog['SB_FREEZE_FS'])
> >>
> >>
> >> $ drgn dump_freeze_sems.py
> >> freeze semaphore at level 1, (enum <anonymous>)SB_FREEZE_WRITE
> >> block 1
> >> read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c74 = 3
> >> read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c74 = 4294967293
> >> read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c74 = 3
> >> read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c74 = 4294967293
> >>
> >> freeze semaphore at level 2, (enum <anonymous>)SB_FREEZE_PAGEFAULT
> >> block 1
> >> read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c78 = 0
> >> read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c78 = 4294967295
> >> read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c78 = 0
> >> read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c78 = 0
> >>
> >> freeze semaphore at level 3, (enum <anonymous>)SB_FREEZE_FS
> >> block 0
> >> read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c7c = 0
> >> read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c7c = 0
> >> read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c7c = 0
> >> read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c7c = 0
> >>
> >> At levels 1 and 3, read_count sums to 0, so it's fine, but at level 2 it
> >> sums to -1. The system remains like that for hours at least, with no
> >> progress at all.
> >>
> >> Is there a known regression with that lockdep commit?
> >> Anything I can do to help debug it in case it's not obvious?
> >>
> >> Thanks.
> >

2020-11-04 03:48:28

by Boqun Feng

Subject: Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")

On Wed, Nov 04, 2020 at 10:22:36AM +0800, Boqun Feng wrote:
> On Tue, Nov 03, 2020 at 07:44:29PM +0000, Filipe Manana wrote:
> >
> >
> > On 03/11/20 14:08, Boqun Feng wrote:
> > > Hi Filipe,
> > >
> > > On Mon, Oct 26, 2020 at 11:26:49AM +0000, Filipe Manana wrote:
> > >> Hello,
> > >>
> > >> I've recently started to hit a warning followed by tasks hanging after
> > >> attempts to freeze a filesystem. A git bisection pointed to the
> > >> following commit:
> > >>
> > >> commit 4d004099a668c41522242aa146a38cc4eb59cb1e
> > >> Author: Peter Zijlstra <[email protected]>
> > >> Date: Fri Oct 2 11:04:21 2020 +0200
> > >>
> > >> lockdep: Fix lockdep recursion
> > >>
> > >> This happens very reliably when running all xfstests with lockdep
> > >> enabled, and the tested filesystem is btrfs (haven't tried other
> > >> filesystems, but it shouldn't matter). The warning and task hangs always
> > >> happen at either test generic/068 or test generic/390, and (oddly)
> > >> always have to run all tests for it to trigger, running those tests
> > >> individually on an infinite loop doesn't seem to trigger it (at least
> > >> for a couple hours).
> > >>
> > >> The warning triggered is at fs/super.c:__sb_start_write() which always
> > >> results later in several tasks hanging on a percpu rw_sem:
> > >>
> > >> https://pastebin.com/qnLvf94E
> > >>
> > >
> > > In your dmesg, I see line:
> > >
> > > [ 9304.920151] INFO: lockdep is turned off.
> > >
> > > , that means debug_locks is 0, that usually happens when lockdep find a
> > > problem (i.e. a deadlock) and it turns itself off, because a problem is
> > > found and it's pointless for lockdep to continue to run.
> > >
> > > And I haven't found a lockdep splat in your dmesg, do you have a full
> > > dmesg so that I can have a look?
> > >
> > > This may be relevant because in commit 4d004099a66, we have
> > >
> > > @@ -5056,13 +5081,13 @@ noinstr int lock_is_held_type(const struct lockdep_map *lock, int read)
> > > unsigned long flags;
> > > int ret = 0;
> > >
> > > - if (unlikely(current->lockdep_recursion))
> > > + if (unlikely(!lockdep_enabled()))
> > > return 1; /* avoid false negative lockdep_assert_held() */
> > >
> > > before this commit lock_is_held_type() and its friends may return false
> > > if debug_locks==0, after this commit lock_is_held_type() and its friends
> > > will always return true if debug_locks == 0. That could cause the
> > > behavior here.
> > >
> > > In case I'm correct, the following "fix" may be helpful.
> > >
> > > Regards,
> > > Boqun
> > >
> > > ----------8
> > > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> > > index 3e99dfef8408..c0e27fb949ff 100644
> > > --- a/kernel/locking/lockdep.c
> > > +++ b/kernel/locking/lockdep.c
> > > @@ -5471,7 +5464,7 @@ noinstr int lock_is_held_type(const struct lockdep_map *lock, int read)
> > > unsigned long flags;
> > > int ret = 0;
> > >
> > > - if (unlikely(!lockdep_enabled()))
> > > + if (unlikely(debug_locks && !lockdep_enabled()))
> > > return 1; /* avoid false negative lockdep_assert_held() */
> > >
> > > raw_local_irq_save(flags);
> >
> > Boqun, the patch fixes the problem for me!
> > You can have Tested-by: Filipe Manana <[email protected]>
> >
>
> Thanks. Although I think it still means that we have a lock issue when
> running xfstests (because we don't know why debug_locks gets cleared),

I might have found a place where we can turn lockdep off silently:

In print_circular_bug(), we turn off lockdep via
debug_locks_off_graph_unlock() and then try to save the trace for the
lockdep splat; however, if we have used up the stack_trace buffer (i.e.
nr_stack_trace_entries), save_trace() will return NULL and we return
silently.
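
Roughly, the path I mean is (abridged):

static noinline void print_circular_bug(...)
{
	...
	if (!debug_locks_off_graph_unlock() || debug_locks_silent)
		return;

	this->trace = save_trace();
	if (!this->trace)
		return;	/* lockdep is now off, but no splat was printed */
	...
}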

Filipe, in order to check whether that happens, could you share your
/proc/lockdep_stats after the full set of xfstests is finished?

Alternatively, it would also be helpful if you could try the following
debug diff with the full set of xfstests:

Thanks! Just trying to understand the real problem.

Regards,
Boqun

-------------->8
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index b71ad8d9f1c9..9ae3e089e5c0 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -539,8 +539,10 @@ static struct lock_trace *save_trace(void)
LOCK_TRACE_SIZE_IN_LONGS;

if (max_entries <= 0) {
- if (!debug_locks_off_graph_unlock())
+ if (!debug_locks_off_graph_unlock()) {
+ WARN_ON_ONCE(1);
return NULL;
+ }

print_lockdep_off("BUG: MAX_STACK_TRACE_ENTRIES too low!");
dump_stack();

> I guess I will have to reproduce this myself for further analysis, could
> you share you .config?
>
> Anyway, I think this fix still makes a bit sense, I will send a proper
> patch so that the problem won't block fs folks.
>
> Regards,
> Boqun
>
> > Thanks!
> >
> > >
> > >
> > >
> > >> What happens is percpu_rwsem_is_held() is apparently returning a false
> > >> positive, so this makes __sb_start_write() do a
> > >> percpu_down_read_trylock() on a percpu_rw_sem at a higher level, which
> > >> is expected to always succeed, because if the calling task is holding a
> > >> freeze percpu_rw_sem at level 1, it's supposed to be able to try_lock
> > >> the semaphore at level 2, since the freeze semaphores are always
> > >> acquired by increasing level order.
> > >>
> > >> But the try_lock fails, it triggers the warning at __sb_start_write(),
> > >> then its caller sb_start_pagefault() ignores the return value and
> > >> callers such as btrfs_page_mkwrite() make the assumption the freeze
> > >> semaphore was taken, proceed to do their stuff, and later call
> > >> sb_end_pagefault(), which which will do an up_read() on the percpu_rwsem
> > >> at level 2 despite not having not been able to down_read() the
> > >> semaphore. This obviously corrupts the semaphore's read_count state, and
> > >> later causes any task trying to down_write() it to hang forever.
> > >>
> > >> After such a hang I ran a drgn script to confirm it:
> > >>
> > >> $ cat dump_freeze_sems.py
> > >> import sys
> > >> import drgn
> > >> from drgn import NULL, Object, cast, container_of, execscript, \
> > >> reinterpret, sizeof
> > >> from drgn.helpers.linux import *
> > >>
> > >> mnt_path = b'/home/fdmanana/btrfs-tests/scratch_1'
> > >>
> > >> mnt = None
> > >> for mnt in for_each_mount(prog, dst = mnt_path):
> > >> pass
> > >>
> > >> if mnt is None:
> > >> sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
> > >> sys.exit(1)
> > >>
> > >> def dump_sem(level_enum):
> > >> level = level_enum.value_()
> > >> sem = mnt.mnt.mnt_sb.s_writers.rw_sem[level - 1]
> > >> print(f'freeze semaphore at level {level}, {str(level_enum)}')
> > >> print(f' block {sem.block.counter.value_()}')
> > >> for i in for_each_possible_cpu(prog):
> > >> read_count = per_cpu_ptr(sem.read_count, i)
> > >> print(f' read_count at cpu {i} = {read_count}')
> > >> print()
> > >>
> > >> # dump semaphore read counts for all freeze levels (fs.h)
> > >> dump_sem(prog['SB_FREEZE_WRITE'])
> > >> dump_sem(prog['SB_FREEZE_PAGEFAULT'])
> > >> dump_sem(prog['SB_FREEZE_FS'])
> > >>
> > >>
> > >> $ drgn dump_freeze_sems.py
> > >> freeze semaphore at level 1, (enum <anonymous>)SB_FREEZE_WRITE
> > >> block 1
> > >> read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c74 = 3
> > >> read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c74 = 4294967293
> > >> read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c74 = 3
> > >> read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c74 = 4294967293
> > >>
> > >> freeze semaphore at level 2, (enum <anonymous>)SB_FREEZE_PAGEFAULT
> > >> block 1
> > >> read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c78 = 0
> > >> read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c78 = 4294967295
> > >> read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c78 = 0
> > >> read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c78 = 0
> > >>
> > >> freeze semaphore at level 3, (enum <anonymous>)SB_FREEZE_FS
> > >> block 0
> > >> read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c7c = 0
> > >> read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c7c = 0
> > >> read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c7c = 0
> > >> read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c7c = 0
> > >>
> > >> At levels 1 and 3, read_count sums to 0, so it's fine, but at level 2 it
> > >> sums to -1. The system remains like that for hours at least, with no
> > >> progress at all.
> > >>
> > >> Is there a known regression with that lockdep commit?
> > >> Anything I can do to help debug it in case it's not obvious?
> > >>
> > >> Thanks.
> > >

2020-11-04 09:50:50

by Filipe Manana

Subject: Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")



On 04/11/20 03:44, Boqun Feng wrote:
> On Wed, Nov 04, 2020 at 10:22:36AM +0800, Boqun Feng wrote:
>> On Tue, Nov 03, 2020 at 07:44:29PM +0000, Filipe Manana wrote:
>>>
>>>
>>> On 03/11/20 14:08, Boqun Feng wrote:
>>>> Hi Filipe,
>>>>
>>>> On Mon, Oct 26, 2020 at 11:26:49AM +0000, Filipe Manana wrote:
>>>>> Hello,
>>>>>
>>>>> I've recently started to hit a warning followed by tasks hanging after
>>>>> attempts to freeze a filesystem. A git bisection pointed to the
>>>>> following commit:
>>>>>
>>>>> commit 4d004099a668c41522242aa146a38cc4eb59cb1e
>>>>> Author: Peter Zijlstra <[email protected]>
>>>>> Date: Fri Oct 2 11:04:21 2020 +0200
>>>>>
>>>>> lockdep: Fix lockdep recursion
>>>>>
>>>>> This happens very reliably when running all xfstests with lockdep
>>>>> enabled, and the tested filesystem is btrfs (haven't tried other
>>>>> filesystems, but it shouldn't matter). The warning and task hangs always
>>>>> happen at either test generic/068 or test generic/390, and (oddly)
>>>>> always have to run all tests for it to trigger, running those tests
>>>>> individually on an infinite loop doesn't seem to trigger it (at least
>>>>> for a couple hours).
>>>>>
>>>>> The warning triggered is at fs/super.c:__sb_start_write() which always
>>>>> results later in several tasks hanging on a percpu rw_sem:
>>>>>
>>>>> https://pastebin.com/qnLvf94E
>>>>>
>>>>
>>>> In your dmesg, I see line:
>>>>
>>>> [ 9304.920151] INFO: lockdep is turned off.
>>>>
>>>> , that means debug_locks is 0, that usually happens when lockdep find a
>>>> problem (i.e. a deadlock) and it turns itself off, because a problem is
>>>> found and it's pointless for lockdep to continue to run.
>>>>
>>>> And I haven't found a lockdep splat in your dmesg, do you have a full
>>>> dmesg so that I can have a look?
>>>>
>>>> This may be relevant because in commit 4d004099a66, we have
>>>>
>>>> @@ -5056,13 +5081,13 @@ noinstr int lock_is_held_type(const struct lockdep_map *lock, int read)
>>>> unsigned long flags;
>>>> int ret = 0;
>>>>
>>>> - if (unlikely(current->lockdep_recursion))
>>>> + if (unlikely(!lockdep_enabled()))
>>>> return 1; /* avoid false negative lockdep_assert_held() */
>>>>
>>>> before this commit lock_is_held_type() and its friends may return false
>>>> if debug_locks==0, after this commit lock_is_held_type() and its friends
>>>> will always return true if debug_locks == 0. That could cause the
>>>> behavior here.
>>>>
>>>> In case I'm correct, the following "fix" may be helpful.
>>>>
>>>> Regards,
>>>> Boqun
>>>>
>>>> ----------8
>>>> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
>>>> index 3e99dfef8408..c0e27fb949ff 100644
>>>> --- a/kernel/locking/lockdep.c
>>>> +++ b/kernel/locking/lockdep.c
>>>> @@ -5471,7 +5464,7 @@ noinstr int lock_is_held_type(const struct lockdep_map *lock, int read)
>>>> unsigned long flags;
>>>> int ret = 0;
>>>>
>>>> - if (unlikely(!lockdep_enabled()))
>>>> + if (unlikely(debug_locks && !lockdep_enabled()))
>>>> return 1; /* avoid false negative lockdep_assert_held() */
>>>>
>>>> raw_local_irq_save(flags);
>>>
>>> Boqun, the patch fixes the problem for me!
>>> You can have Tested-by: Filipe Manana <[email protected]>
>>>
>>
>> Thanks. Although I think it still means that we have a lock issue when
>> running xfstests (because we don't know why debug_locks gets cleared),
>
> I might find a place where we could turn lockdep off silently:
>
> in print_circular_bug(), we turn off lockdep via
> debug_locks_off_graph_unlock(), and then we try to save the trace for
> lockdep splat, however, if we use up the stack_trace buffer (i.e.
> nr_stack_trace_entries), save_trace() will return NULL and we return
> silently.
>
> Filipe, in order to check whethter that happens, could you share me your
> /proc/lockdep_stats after the full set of xfstests is finished?

Here it is:

$ cat /proc/lockdep_stats
lock-classes: 1831 [max: 8192]
direct dependencies: 17774 [max: 32768]
indirect dependencies: 75662
all direct dependencies: 325284
dependency chains: 34223 [max: 65536]
dependency chain hlocks used: 158129 [max: 327680]
dependency chain hlocks lost: 0
in-hardirq chains: 57
in-softirq chains: 658
in-process chains: 33508
stack-trace entries: 160748 [max: 524288]
number of stack traces: 9237
number of stack hash chains: 7076
combined max dependencies: 1280780998
hardirq-safe locks: 43
hardirq-unsafe locks: 1337
softirq-safe locks: 179
softirq-unsafe locks: 1236
irq-safe locks: 187
irq-unsafe locks: 1337
hardirq-read-safe locks: 2
hardirq-read-unsafe locks: 209
softirq-read-safe locks: 9
softirq-read-unsafe locks: 204
irq-read-safe locks: 9
irq-read-unsafe locks: 209
uncategorized locks: 274
unused locks: 0
max locking depth: 15
max bfs queue depth: 337
debug_locks: 0

zapped classes: 2278
zapped lock chains: 17915
large chain blocks: 1

(That's the result after running all fstests with the previous one-line
patch you sent.)

My kernel .config: https://pastebin.com/4xEMvLJ9

I'll try the debugging patch and let you know the results. It will take
about 4 hours or so to get back with them.

Thanks!

>
> Alternatively, it's also helpful if you can try the following debug
> diff, with teh full set of xfstests:
>
> Thanks! Just trying to understand the real problem.
>
> Regards,
> Boqun
>
> -------------->8
> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> index b71ad8d9f1c9..9ae3e089e5c0 100644
> --- a/kernel/locking/lockdep.c
> +++ b/kernel/locking/lockdep.c
> @@ -539,8 +539,10 @@ static struct lock_trace *save_trace(void)
> LOCK_TRACE_SIZE_IN_LONGS;
>
> if (max_entries <= 0) {
> - if (!debug_locks_off_graph_unlock())
> + if (!debug_locks_off_graph_unlock()) {
> + WARN_ON_ONCE(1);
> return NULL;
> + }
>
> print_lockdep_off("BUG: MAX_STACK_TRACE_ENTRIES too low!");
> dump_stack();
>
>> I guess I will have to reproduce this myself for further analysis, could
>> you share you .config?
>>
>> Anyway, I think this fix still makes a bit sense, I will send a proper
>> patch so that the problem won't block fs folks.
>>
>> Regards,
>> Boqun
>>
>>> Thanks!
>>>
>>>>
>>>>
>>>>
>>>>> What happens is percpu_rwsem_is_held() is apparently returning a false
>>>>> positive, so this makes __sb_start_write() do a
>>>>> percpu_down_read_trylock() on a percpu_rw_sem at a higher level, which
>>>>> is expected to always succeed, because if the calling task is holding a
>>>>> freeze percpu_rw_sem at level 1, it's supposed to be able to try_lock
>>>>> the semaphore at level 2, since the freeze semaphores are always
>>>>> acquired by increasing level order.
>>>>>
>>>>> But the try_lock fails, it triggers the warning at __sb_start_write(),
>>>>> then its caller sb_start_pagefault() ignores the return value and
>>>>> callers such as btrfs_page_mkwrite() make the assumption the freeze
>>>>> semaphore was taken, proceed to do their stuff, and later call
>>>>> sb_end_pagefault(), which which will do an up_read() on the percpu_rwsem
>>>>> at level 2 despite not having not been able to down_read() the
>>>>> semaphore. This obviously corrupts the semaphore's read_count state, and
>>>>> later causes any task trying to down_write() it to hang forever.
>>>>>
>>>>> After such a hang I ran a drgn script to confirm it:
>>>>>
>>>>> $ cat dump_freeze_sems.py
>>>>> import sys
>>>>> import drgn
>>>>> from drgn import NULL, Object, cast, container_of, execscript, \
>>>>> reinterpret, sizeof
>>>>> from drgn.helpers.linux import *
>>>>>
>>>>> mnt_path = b'/home/fdmanana/btrfs-tests/scratch_1'
>>>>>
>>>>> mnt = None
>>>>> for mnt in for_each_mount(prog, dst = mnt_path):
>>>>> pass
>>>>>
>>>>> if mnt is None:
>>>>> sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
>>>>> sys.exit(1)
>>>>>
>>>>> def dump_sem(level_enum):
>>>>> level = level_enum.value_()
>>>>> sem = mnt.mnt.mnt_sb.s_writers.rw_sem[level - 1]
>>>>> print(f'freeze semaphore at level {level}, {str(level_enum)}')
>>>>> print(f' block {sem.block.counter.value_()}')
>>>>> for i in for_each_possible_cpu(prog):
>>>>> read_count = per_cpu_ptr(sem.read_count, i)
>>>>> print(f' read_count at cpu {i} = {read_count}')
>>>>> print()
>>>>>
>>>>> # dump semaphore read counts for all freeze levels (fs.h)
>>>>> dump_sem(prog['SB_FREEZE_WRITE'])
>>>>> dump_sem(prog['SB_FREEZE_PAGEFAULT'])
>>>>> dump_sem(prog['SB_FREEZE_FS'])
>>>>>
>>>>>
>>>>> $ drgn dump_freeze_sems.py
>>>>> freeze semaphore at level 1, (enum <anonymous>)SB_FREEZE_WRITE
>>>>> block 1
>>>>> read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c74 = 3
>>>>> read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c74 = 4294967293
>>>>> read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c74 = 3
>>>>> read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c74 = 4294967293
>>>>>
>>>>> freeze semaphore at level 2, (enum <anonymous>)SB_FREEZE_PAGEFAULT
>>>>> block 1
>>>>> read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c78 = 0
>>>>> read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c78 = 4294967295
>>>>> read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c78 = 0
>>>>> read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c78 = 0
>>>>>
>>>>> freeze semaphore at level 3, (enum <anonymous>)SB_FREEZE_FS
>>>>> block 0
>>>>> read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c7c = 0
>>>>> read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c7c = 0
>>>>> read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c7c = 0
>>>>> read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c7c = 0
>>>>>
>>>>> At levels 1 and 3, read_count sums to 0, so it's fine, but at level 2 it
>>>>> sums to -1. The system remains like that for hours at least, with no
>>>>> progress at all.
>>>>>
>>>>> Is there a known regression with that lockdep commit?
>>>>> Anything I can do to help debug it in case it's not obvious?
>>>>>
>>>>> Thanks.
>>>>
>

2020-11-05 00:59:35

by Filipe Manana

Subject: Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")



On 04/11/20 09:49, Filipe Manana wrote:
>
>
> On 04/11/20 03:44, Boqun Feng wrote:
>> On Wed, Nov 04, 2020 at 10:22:36AM +0800, Boqun Feng wrote:
>>> On Tue, Nov 03, 2020 at 07:44:29PM +0000, Filipe Manana wrote:
>>>>
>>>>
>>>> On 03/11/20 14:08, Boqun Feng wrote:
>>>>> Hi Filipe,
>>>>>
>>>>> On Mon, Oct 26, 2020 at 11:26:49AM +0000, Filipe Manana wrote:
>>>>>> Hello,
>>>>>>
>>>>>> I've recently started to hit a warning followed by tasks hanging after
>>>>>> attempts to freeze a filesystem. A git bisection pointed to the
>>>>>> following commit:
>>>>>>
>>>>>> commit 4d004099a668c41522242aa146a38cc4eb59cb1e
>>>>>> Author: Peter Zijlstra <[email protected]>
>>>>>> Date: Fri Oct 2 11:04:21 2020 +0200
>>>>>>
>>>>>> lockdep: Fix lockdep recursion
>>>>>>
>>>>>> This happens very reliably when running all xfstests with lockdep
>>>>>> enabled, and the tested filesystem is btrfs (haven't tried other
>>>>>> filesystems, but it shouldn't matter). The warning and task hangs always
>>>>>> happen at either test generic/068 or test generic/390, and (oddly)
>>>>>> always have to run all tests for it to trigger, running those tests
>>>>>> individually on an infinite loop doesn't seem to trigger it (at least
>>>>>> for a couple hours).
>>>>>>
>>>>>> The warning triggered is at fs/super.c:__sb_start_write() which always
>>>>>> results later in several tasks hanging on a percpu rw_sem:
>>>>>>
>>>>>> https://pastebin.com/qnLvf94E
>>>>>>
>>>>>
>>>>> In your dmesg, I see line:
>>>>>
>>>>> [ 9304.920151] INFO: lockdep is turned off.
>>>>>
>>>>> , that means debug_locks is 0, that usually happens when lockdep find a
>>>>> problem (i.e. a deadlock) and it turns itself off, because a problem is
>>>>> found and it's pointless for lockdep to continue to run.
>>>>>
>>>>> And I haven't found a lockdep splat in your dmesg, do you have a full
>>>>> dmesg so that I can have a look?
>>>>>
>>>>> This may be relevant because in commit 4d004099a66, we have
>>>>>
>>>>> @@ -5056,13 +5081,13 @@ noinstr int lock_is_held_type(const struct lockdep_map *lock, int read)
>>>>> unsigned long flags;
>>>>> int ret = 0;
>>>>>
>>>>> - if (unlikely(current->lockdep_recursion))
>>>>> + if (unlikely(!lockdep_enabled()))
>>>>> return 1; /* avoid false negative lockdep_assert_held() */
>>>>>
>>>>> before this commit lock_is_held_type() and its friends may return false
>>>>> if debug_locks==0, after this commit lock_is_held_type() and its friends
>>>>> will always return true if debug_locks == 0. That could cause the
>>>>> behavior here.
>>>>>
>>>>> In case I'm correct, the following "fix" may be helpful.
>>>>>
>>>>> Regards,
>>>>> Boqun
>>>>>
>>>>> ----------8
>>>>> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
>>>>> index 3e99dfef8408..c0e27fb949ff 100644
>>>>> --- a/kernel/locking/lockdep.c
>>>>> +++ b/kernel/locking/lockdep.c
>>>>> @@ -5471,7 +5464,7 @@ noinstr int lock_is_held_type(const struct lockdep_map *lock, int read)
>>>>> unsigned long flags;
>>>>> int ret = 0;
>>>>>
>>>>> - if (unlikely(!lockdep_enabled()))
>>>>> + if (unlikely(debug_locks && !lockdep_enabled()))
>>>>> return 1; /* avoid false negative lockdep_assert_held() */
>>>>>
>>>>> raw_local_irq_save(flags);
>>>>
>>>> Boqun, the patch fixes the problem for me!
>>>> You can have Tested-by: Filipe Manana <[email protected]>
>>>>
>>>
>>> Thanks. Although I think it still means that we have a lock issue when
>>> running xfstests (because we don't know why debug_locks gets cleared),
>>
>> I might find a place where we could turn lockdep off silently:
>>
>> in print_circular_bug(), we turn off lockdep via
>> debug_locks_off_graph_unlock(), and then we try to save the trace for
>> lockdep splat, however, if we use up the stack_trace buffer (i.e.
>> nr_stack_trace_entries), save_trace() will return NULL and we return
>> silently.
>>
>> Filipe, in order to check whethter that happens, could you share me your
>> /proc/lockdep_stats after the full set of xfstests is finished?
>
> Here it is:
>
> $ cat /proc/lockdep_stats
> lock-classes: 1831 [max: 8192]
> direct dependencies: 17774 [max: 32768]
> indirect dependencies: 75662
> all direct dependencies: 325284
> dependency chains: 34223 [max: 65536]
> dependency chain hlocks used: 158129 [max: 327680]
> dependency chain hlocks lost: 0
> in-hardirq chains: 57
> in-softirq chains: 658
> in-process chains: 33508
> stack-trace entries: 160748 [max: 524288]
> number of stack traces: 9237
> number of stack hash chains: 7076
> combined max dependencies: 1280780998
> hardirq-safe locks: 43
> hardirq-unsafe locks: 1337
> softirq-safe locks: 179
> softirq-unsafe locks: 1236
> irq-safe locks: 187
> irq-unsafe locks: 1337
> hardirq-read-safe locks: 2
> hardirq-read-unsafe locks: 209
> softirq-read-safe locks: 9
> softirq-read-unsafe locks: 204
> irq-read-safe locks: 9
> irq-read-unsafe locks: 209
> uncategorized locks: 274
> unused locks: 0
> max locking depth: 15
> max bfs queue depth: 337
> debug_locks: 0
>
> zapped classes: 2278
> zapped lock chains: 17915
> large chain blocks: 1
>
> (That's the result after running all fstests with the previous one line
> patch you sent.)
>
> My kernel .config: https://pastebin.com/4xEMvLJ9
>
> I'll try the debugging patch and let you know the results. It will take
> some 4 hours or so to get back with the result.

Ok, so I ran 5.10-rc2 plus your two patches (the fix and the debug one):

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index b71ad8d9f1c9..b31d4ad482c7 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -539,8 +539,10 @@ static struct lock_trace *save_trace(void)
LOCK_TRACE_SIZE_IN_LONGS;

if (max_entries <= 0) {
- if (!debug_locks_off_graph_unlock())
+ if (!debug_locks_off_graph_unlock()) {
+ WARN_ON_ONCE(1);
return NULL;
+ }

print_lockdep_off("BUG: MAX_STACK_TRACE_ENTRIES too low!");
dump_stack();
@@ -5465,7 +5467,7 @@ noinstr int lock_is_held_type(const struct lockdep_map *lock, int read)
unsigned long flags;
int ret = 0;

- if (unlikely(!lockdep_enabled()))
+ if (unlikely(debug_locks && !lockdep_enabled()))
return 1; /* avoid false negative lockdep_assert_held() */

raw_local_irq_save(flags);

With 3 runs of all fstests, the WARN_ON_ONCE(1) wasn't triggered.
Unexpected, right?

Should I try something else?

Thanks.


>
> Thanks!
>
>>
>> Alternatively, it's also helpful if you can try the following debug
>> diff, with teh full set of xfstests:
>>
>> Thanks! Just trying to understand the real problem.
>>
>> Regards,
>> Boqun
>>
>> -------------->8
>> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
>> index b71ad8d9f1c9..9ae3e089e5c0 100644
>> --- a/kernel/locking/lockdep.c
>> +++ b/kernel/locking/lockdep.c
>> @@ -539,8 +539,10 @@ static struct lock_trace *save_trace(void)
>> LOCK_TRACE_SIZE_IN_LONGS;
>>
>> if (max_entries <= 0) {
>> - if (!debug_locks_off_graph_unlock())
>> + if (!debug_locks_off_graph_unlock()) {
>> + WARN_ON_ONCE(1);
>> return NULL;
>> + }
>>
>> print_lockdep_off("BUG: MAX_STACK_TRACE_ENTRIES too low!");
>> dump_stack();
>>
>>> I guess I will have to reproduce this myself for further analysis, could
>>> you share you .config?
>>>
>>> Anyway, I think this fix still makes a bit sense, I will send a proper
>>> patch so that the problem won't block fs folks.
>>>
>>> Regards,
>>> Boqun
>>>
>>>> Thanks!
>>>>
>>>>>
>>>>>
>>>>>
>>>>>> What happens is percpu_rwsem_is_held() is apparently returning a false
>>>>>> positive, so this makes __sb_start_write() do a
>>>>>> percpu_down_read_trylock() on a percpu_rw_sem at a higher level, which
>>>>>> is expected to always succeed, because if the calling task is holding a
>>>>>> freeze percpu_rw_sem at level 1, it's supposed to be able to try_lock
>>>>>> the semaphore at level 2, since the freeze semaphores are always
>>>>>> acquired by increasing level order.
>>>>>>
>>>>>> But the try_lock fails, it triggers the warning at __sb_start_write(),
>>>>>> then its caller sb_start_pagefault() ignores the return value and
>>>>>> callers such as btrfs_page_mkwrite() make the assumption the freeze
>>>>>> semaphore was taken, proceed to do their stuff, and later call
>>>>>> sb_end_pagefault(), which which will do an up_read() on the percpu_rwsem
>>>>>> at level 2 despite not having not been able to down_read() the
>>>>>> semaphore. This obviously corrupts the semaphore's read_count state, and
>>>>>> later causes any task trying to down_write() it to hang forever.
>>>>>>
>>>>>> After such a hang I ran a drgn script to confirm it:
>>>>>>
>>>>>> $ cat dump_freeze_sems.py
>>>>>> import sys
>>>>>> import drgn
>>>>>> from drgn import NULL, Object, cast, container_of, execscript, \
>>>>>> reinterpret, sizeof
>>>>>> from drgn.helpers.linux import *
>>>>>>
>>>>>> mnt_path = b'/home/fdmanana/btrfs-tests/scratch_1'
>>>>>>
>>>>>> mnt = None
>>>>>> for mnt in for_each_mount(prog, dst = mnt_path):
>>>>>> pass
>>>>>>
>>>>>> if mnt is None:
>>>>>> sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
>>>>>> sys.exit(1)
>>>>>>
>>>>>> def dump_sem(level_enum):
>>>>>> level = level_enum.value_()
>>>>>> sem = mnt.mnt.mnt_sb.s_writers.rw_sem[level - 1]
>>>>>> print(f'freeze semaphore at level {level}, {str(level_enum)}')
>>>>>> print(f' block {sem.block.counter.value_()}')
>>>>>> for i in for_each_possible_cpu(prog):
>>>>>> read_count = per_cpu_ptr(sem.read_count, i)
>>>>>> print(f' read_count at cpu {i} = {read_count}')
>>>>>> print()
>>>>>>
>>>>>> # dump semaphore read counts for all freeze levels (fs.h)
>>>>>> dump_sem(prog['SB_FREEZE_WRITE'])
>>>>>> dump_sem(prog['SB_FREEZE_PAGEFAULT'])
>>>>>> dump_sem(prog['SB_FREEZE_FS'])
>>>>>>
>>>>>>
>>>>>> $ drgn dump_freeze_sems.py
>>>>>> freeze semaphore at level 1, (enum <anonymous>)SB_FREEZE_WRITE
>>>>>> block 1
>>>>>> read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c74 = 3
>>>>>> read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c74 = 4294967293
>>>>>> read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c74 = 3
>>>>>> read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c74 = 4294967293
>>>>>>
>>>>>> freeze semaphore at level 2, (enum <anonymous>)SB_FREEZE_PAGEFAULT
>>>>>> block 1
>>>>>> read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c78 = 0
>>>>>> read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c78 = 4294967295
>>>>>> read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c78 = 0
>>>>>> read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c78 = 0
>>>>>>
>>>>>> freeze semaphore at level 3, (enum <anonymous>)SB_FREEZE_FS
>>>>>> block 0
>>>>>> read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c7c = 0
>>>>>> read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c7c = 0
>>>>>> read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c7c = 0
>>>>>> read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c7c = 0
>>>>>>
>>>>>> At levels 1 and 3, read_count sums to 0, so it's fine, but at level 2 it
>>>>>> sums to -1. The system remains like that for hours at least, with no
>>>>>> progress at all.
>>>>>>
>>>>>> Is there a known regression with that lockdep commit?
>>>>>> Anything I can do to help debug it in case it's not obvious?
>>>>>>
>>>>>> Thanks.
>>>>>
>>

2020-11-05 04:28:33

by Boqun Feng

[permalink] [raw]
Subject: Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")

On Wed, Nov 04, 2020 at 07:54:40PM +0000, Filipe Manana wrote:
[...]
>
> Ok, so I ran 5.10-rc2 plus your two patches (the fix and the debug one):
>
> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> index b71ad8d9f1c9..b31d4ad482c7 100644
> --- a/kernel/locking/lockdep.c
> +++ b/kernel/locking/lockdep.c
> @@ -539,8 +539,10 @@ static struct lock_trace *save_trace(void)
> LOCK_TRACE_SIZE_IN_LONGS;
>
> if (max_entries <= 0) {
> - if (!debug_locks_off_graph_unlock())
> + if (!debug_locks_off_graph_unlock()) {
> + WARN_ON_ONCE(1);
> return NULL;
> + }
>
> print_lockdep_off("BUG: MAX_STACK_TRACE_ENTRIES too low!");
> dump_stack();
> @@ -5465,7 +5467,7 @@ noinstr int lock_is_held_type(const struct
> lockdep_map *lock, int read)
> unsigned long flags;
> int ret = 0;
>
> - if (unlikely(!lockdep_enabled()))
> + if (unlikely(debug_locks && !lockdep_enabled()))
> return 1; /* avoid false negative lockdep_assert_held() */
>
> raw_local_irq_save(flags);
>
> With 3 runs of all fstests, the WARN_ON_ONCE(1) wasn't triggered.
> Unexpected, right?
>

Kinda, that means we still don't know why lockdep was turned off.

> Should I try something else?
>

Thanks for trying this. Let me set up the reproducer on my side and see
whether I can get more information.

Regards,
Boqun

> Thanks.
>
>
> >
> > Thanks!
> >
> >>
> >> Alternatively, it's also helpful if you can try the following debug
> >> diff, with teh full set of xfstests:
> >>
> >> Thanks! Just trying to understand the real problem.
> >>
> >> Regards,
> >> Boqun
> >>
> >> -------------->8
> >> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> >> index b71ad8d9f1c9..9ae3e089e5c0 100644
> >> --- a/kernel/locking/lockdep.c
> >> +++ b/kernel/locking/lockdep.c
> >> @@ -539,8 +539,10 @@ static struct lock_trace *save_trace(void)
> >> LOCK_TRACE_SIZE_IN_LONGS;
> >>
> >> if (max_entries <= 0) {
> >> - if (!debug_locks_off_graph_unlock())
> >> + if (!debug_locks_off_graph_unlock()) {
> >> + WARN_ON_ONCE(1);
> >> return NULL;
> >> + }
> >>
> >> print_lockdep_off("BUG: MAX_STACK_TRACE_ENTRIES too low!");
> >> dump_stack();
> >>
> >>> I guess I will have to reproduce this myself for further analysis, could
> >>> you share you .config?
> >>>
> >>> Anyway, I think this fix still makes a bit sense, I will send a proper
> >>> patch so that the problem won't block fs folks.
> >>>
> >>> Regards,
> >>> Boqun
> >>>
> >>>> Thanks!
> >>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>> What happens is percpu_rwsem_is_held() is apparently returning a false
> >>>>>> positive, so this makes __sb_start_write() do a
> >>>>>> percpu_down_read_trylock() on a percpu_rw_sem at a higher level, which
> >>>>>> is expected to always succeed, because if the calling task is holding a
> >>>>>> freeze percpu_rw_sem at level 1, it's supposed to be able to try_lock
> >>>>>> the semaphore at level 2, since the freeze semaphores are always
> >>>>>> acquired by increasing level order.
> >>>>>>
> >>>>>> But the try_lock fails, it triggers the warning at __sb_start_write(),
> >>>>>> then its caller sb_start_pagefault() ignores the return value and
> >>>>>> callers such as btrfs_page_mkwrite() make the assumption the freeze
> >>>>>> semaphore was taken, proceed to do their stuff, and later call
> >>>>>> sb_end_pagefault(), which which will do an up_read() on the percpu_rwsem
> >>>>>> at level 2 despite not having not been able to down_read() the
> >>>>>> semaphore. This obviously corrupts the semaphore's read_count state, and
> >>>>>> later causes any task trying to down_write() it to hang forever.
> >>>>>>
> >>>>>> After such a hang I ran a drgn script to confirm it:
> >>>>>>
> >>>>>> $ cat dump_freeze_sems.py
> >>>>>> import sys
> >>>>>> import drgn
> >>>>>> from drgn import NULL, Object, cast, container_of, execscript, \
> >>>>>> reinterpret, sizeof
> >>>>>> from drgn.helpers.linux import *
> >>>>>>
> >>>>>> mnt_path = b'/home/fdmanana/btrfs-tests/scratch_1'
> >>>>>>
> >>>>>> mnt = None
> >>>>>> for mnt in for_each_mount(prog, dst = mnt_path):
> >>>>>> pass
> >>>>>>
> >>>>>> if mnt is None:
> >>>>>> sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
> >>>>>> sys.exit(1)
> >>>>>>
> >>>>>> def dump_sem(level_enum):
> >>>>>> level = level_enum.value_()
> >>>>>> sem = mnt.mnt.mnt_sb.s_writers.rw_sem[level - 1]
> >>>>>> print(f'freeze semaphore at level {level}, {str(level_enum)}')
> >>>>>> print(f' block {sem.block.counter.value_()}')
> >>>>>> for i in for_each_possible_cpu(prog):
> >>>>>> read_count = per_cpu_ptr(sem.read_count, i)
> >>>>>> print(f' read_count at cpu {i} = {read_count}')
> >>>>>> print()
> >>>>>>
> >>>>>> # dump semaphore read counts for all freeze levels (fs.h)
> >>>>>> dump_sem(prog['SB_FREEZE_WRITE'])
> >>>>>> dump_sem(prog['SB_FREEZE_PAGEFAULT'])
> >>>>>> dump_sem(prog['SB_FREEZE_FS'])
> >>>>>>
> >>>>>>
> >>>>>> $ drgn dump_freeze_sems.py
> >>>>>> freeze semaphore at level 1, (enum <anonymous>)SB_FREEZE_WRITE
> >>>>>> block 1
> >>>>>> read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c74 = 3
> >>>>>> read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c74 = 4294967293
> >>>>>> read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c74 = 3
> >>>>>> read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c74 = 4294967293
> >>>>>>
> >>>>>> freeze semaphore at level 2, (enum <anonymous>)SB_FREEZE_PAGEFAULT
> >>>>>> block 1
> >>>>>> read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c78 = 0
> >>>>>> read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c78 = 4294967295
> >>>>>> read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c78 = 0
> >>>>>> read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c78 = 0
> >>>>>>
> >>>>>> freeze semaphore at level 3, (enum <anonymous>)SB_FREEZE_FS
> >>>>>> block 0
> >>>>>> read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c7c = 0
> >>>>>> read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c7c = 0
> >>>>>> read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c7c = 0
> >>>>>> read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c7c = 0
> >>>>>>
> >>>>>> At levels 1 and 3, read_count sums to 0, so it's fine, but at level 2 it
> >>>>>> sums to -1. The system remains like that for hours at least, with no
> >>>>>> progress at all.
> >>>>>>
> >>>>>> Is there a known regression with that lockdep commit?
> >>>>>> Anything I can do to help debug it in case it's not obvious?
> >>>>>>
> >>>>>> Thanks.
> >>>>>
> >>

2020-11-09 08:46:23

by Boqun Feng

[permalink] [raw]
Subject: Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")

Hi Filipe,

On Thu, Nov 05, 2020 at 09:10:12AM +0800, Boqun Feng wrote:
> On Wed, Nov 04, 2020 at 07:54:40PM +0000, Filipe Manana wrote:
> [...]
> >
> > Ok, so I ran 5.10-rc2 plus your two patches (the fix and the debug one):
> >
> > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> > index b71ad8d9f1c9..b31d4ad482c7 100644
> > --- a/kernel/locking/lockdep.c
> > +++ b/kernel/locking/lockdep.c
> > @@ -539,8 +539,10 @@ static struct lock_trace *save_trace(void)
> > LOCK_TRACE_SIZE_IN_LONGS;
> >
> > if (max_entries <= 0) {
> > - if (!debug_locks_off_graph_unlock())
> > + if (!debug_locks_off_graph_unlock()) {
> > + WARN_ON_ONCE(1);
> > return NULL;
> > + }
> >
> > print_lockdep_off("BUG: MAX_STACK_TRACE_ENTRIES too low!");
> > dump_stack();
> > @@ -5465,7 +5467,7 @@ noinstr int lock_is_held_type(const struct
> > lockdep_map *lock, int read)
> > unsigned long flags;
> > int ret = 0;
> >
> > - if (unlikely(!lockdep_enabled()))
> > + if (unlikely(debug_locks && !lockdep_enabled()))
> > return 1; /* avoid false negative lockdep_assert_held() */
> >
> > raw_local_irq_save(flags);
> >
> > With 3 runs of all fstests, the WARN_ON_ONCE(1) wasn't triggered.
> > Unexpected, right?
> >
>
> Kinda, that means we still don't know why lockdep was turned off.
>
> > Should I try something else?
> >
>
> Thanks for trying this. Let me set up the reproducer on my side, and see
> if I could get more information.
>

I could hit this with btrfs/187. When we hit it, lockdep reports the
deadlock below and turns itself off, and I think that is the root cause
of the original problem you hit. I will add some analysis after the
lockdep splat.

[12295.973309] ============================================
[12295.974770] WARNING: possible recursive locking detected
[12295.974770] 5.10.0-rc2-btrfs-next-71 #20 Not tainted
[12295.974770] --------------------------------------------
[12295.974770] zsh/701247 is trying to acquire lock:
[12295.974770] ffff92cef43480b8 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
[12295.974770]
but task is already holding lock:
[12295.974770] ffff92cef434a038 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
[12295.974770]
other info that might help us debug this:
[12295.974770] Possible unsafe locking scenario:

[12295.974770] CPU0
[12295.974770] ----
[12295.974770] lock(&eb->lock);
[12295.974770] lock(&eb->lock);
[12295.974770]
*** DEADLOCK ***

[12295.974770] May be due to missing lock nesting notation

[12295.974770] 2 locks held by zsh/701247:
[12295.974770] #0: ffff92cef3d315b0 (&sig->cred_guard_mutex){+.+.}-{3:3}, at: bprm_execve+0xaa/0x920
[12295.974770] #1: ffff92cef434a038 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
[12295.974770]
stack backtrace:
[12295.974770] CPU: 6 PID: 701247 Comm: zsh Not tainted 5.10.0-rc2-btrfs-next-71 #20
[12295.974770] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.0 12/17/2019
[12295.974770] Call Trace:
[12295.974770] dump_stack+0x8b/0xb0
[12295.974770] __lock_acquire.cold+0x175/0x2e9
[12295.974770] lock_acquire+0x15b/0x490
[12295.974770] ? btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
[12295.974770] ? read_block_for_search+0xf4/0x350 [btrfs]
[12295.974770] _raw_read_lock+0x40/0xa0
[12295.974770] ? btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
[12295.974770] btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
[12295.974770] btrfs_search_slot+0x6ac/0xca0 [btrfs]
[12295.974770] btrfs_lookup_xattr+0x7d/0xd0 [btrfs]
[12295.974770] btrfs_getxattr+0x67/0x130 [btrfs]
[12295.974770] __vfs_getxattr+0x53/0x70
[12295.974770] get_vfs_caps_from_disk+0x68/0x1a0
[12295.974770] ? sched_clock_cpu+0x114/0x180
[12295.974770] cap_bprm_creds_from_file+0x181/0x6c0
[12295.974770] security_bprm_creds_from_file+0x2a/0x40
[12295.974770] begin_new_exec+0xf4/0xc40
[12295.974770] ? load_elf_phdrs+0x6b/0xb0
[12295.974770] load_elf_binary+0x66b/0x1620
[12295.974770] ? read_hv_sched_clock_tsc+0x5/0x20
[12295.974770] ? sched_clock+0x5/0x10
[12295.974770] ? sched_clock_local+0x12/0x80
[12295.974770] ? sched_clock_cpu+0x114/0x180
[12295.974770] bprm_execve+0x3ce/0x920
[12295.974770] do_execveat_common+0x1b0/0x1f0
[12295.974770] __x64_sys_execve+0x39/0x50
[12295.974770] do_syscall_64+0x33/0x80
[12295.974770] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[12295.974770] RIP: 0033:0x7f6aaefc13cb
[12295.974770] Code: 48 3d 00 f0 ff ff 76 e7 f7 d8 64 41 89 00 eb df 0f 1f 80 00 00 00 00 f7 d8 64 41 89 00 eb dc f3 0f 1e fa b8 3b 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 75 4a 0f 00 f7 d8 64 89 01 48
[12295.974770] RSP: 002b:00007ffd33b54d58 EFLAGS: 00000207 ORIG_RAX: 000000000000003b
[12295.974770] RAX: ffffffffffffffda RBX: 00007f6aaf28bc88 RCX: 00007f6aaefc13cb
[12295.974770] RDX: 00007ffd33b5fd98 RSI: 00007f6aaf28bc88 RDI: 00007ffd33b55280
[12295.974770] RBP: 00007ffd33b54d80 R08: 00007ffd33b54ce0 R09: 0000000000000001
[12295.974770] R10: 0000000000000008 R11: 0000000000000207 R12: 00007ffd33b55280

The deadlock scenario reported by this splat is as follows:

        CPU 0                             CPU 2
        =====                             =====
  btrfs_search_slot():
    btrfs_tree_read_lock_atomic():
      read_lock(&eb->lock);
                                          <a random writer>
                                          write_lock(&eb->lock);
                                          // waiting for the lock
    ...
    btrfs_tree_read_lock_atomic():
      read_lock(&eb->lock); // blocked because of the lock's fairness.

In short, you cannot use nested read_lock() on the same lock. However,
I'm not sure whether we have a real problem here, because I don't think
btrfs_search_slot() can pick the same extent buffer twice. Also copying
the author of the code for more information.
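
For illustration, here is a minimal userspace sketch of that scenario.
This is a hedged analogue, not kernel code: glibc's writer-preferring
rwlock kind stands in for the fairness of the kernel's queued rwlocks,
and the btrfs function names above only appear in comments. Built with
'gcc -pthread', it prints two messages and then hangs at the nested read
lock, which is exactly the situation the splat describes:

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t eb_lock;

/* Plays the part of "CPU 2" above: a writer that queues behind the reader. */
static void *writer(void *arg)
{
	(void)arg;
	sleep(1);                        /* let main() take the first read lock */
	printf("writer: waiting for the write lock\n");
	pthread_rwlock_wrlock(&eb_lock); /* blocks behind the first reader */
	pthread_rwlock_unlock(&eb_lock);
	return NULL;
}

int main(void)
{
	pthread_rwlockattr_t attr;
	pthread_t w;

	/* Writer-preferring, non-recursive readers: once a writer is queued,
	 * a new read acquisition waits behind it, even from a thread that
	 * already holds the lock for read. */
	pthread_rwlockattr_init(&attr);
	pthread_rwlockattr_setkind_np(&attr,
			PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP);
	pthread_rwlock_init(&eb_lock, &attr);

	pthread_create(&w, NULL, writer, NULL);

	pthread_rwlock_rdlock(&eb_lock); /* first read_lock(&eb->lock) */
	sleep(2);                        /* the writer is queued by now */
	printf("reader: taking the nested read lock, this deadlocks\n");
	pthread_rwlock_rdlock(&eb_lock); /* second read_lock(): blocks forever */

	printf("never reached\n");
	return 0;
}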

Regards,
Boqun

> Regards,
> Boqun
>
> > Thanks.
> >
> >

2020-11-09 09:59:02

by Filipe Manana

[permalink] [raw]
Subject: Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")



On 09/11/20 08:44, Boqun Feng wrote:
> Hi Filipe,
>
> On Thu, Nov 05, 2020 at 09:10:12AM +0800, Boqun Feng wrote:
>> On Wed, Nov 04, 2020 at 07:54:40PM +0000, Filipe Manana wrote:
>> [...]
>>>
>>> Ok, so I ran 5.10-rc2 plus your two patches (the fix and the debug one):
>>>
>>> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
>>> index b71ad8d9f1c9..b31d4ad482c7 100644
>>> --- a/kernel/locking/lockdep.c
>>> +++ b/kernel/locking/lockdep.c
>>> @@ -539,8 +539,10 @@ static struct lock_trace *save_trace(void)
>>> LOCK_TRACE_SIZE_IN_LONGS;
>>>
>>> if (max_entries <= 0) {
>>> - if (!debug_locks_off_graph_unlock())
>>> + if (!debug_locks_off_graph_unlock()) {
>>> + WARN_ON_ONCE(1);
>>> return NULL;
>>> + }
>>>
>>> print_lockdep_off("BUG: MAX_STACK_TRACE_ENTRIES too low!");
>>> dump_stack();
>>> @@ -5465,7 +5467,7 @@ noinstr int lock_is_held_type(const struct
>>> lockdep_map *lock, int read)
>>> unsigned long flags;
>>> int ret = 0;
>>>
>>> - if (unlikely(!lockdep_enabled()))
>>> + if (unlikely(debug_locks && !lockdep_enabled()))
>>> return 1; /* avoid false negative lockdep_assert_held() */
>>>
>>> raw_local_irq_save(flags);
>>>
>>> With 3 runs of all fstests, the WARN_ON_ONCE(1) wasn't triggered.
>>> Unexpected, right?
>>>
>>
>> Kinda, that means we still don't know why lockdep was turned off.
>>
>>> Should I try something else?
>>>
>>
>> Thanks for trying this. Let me set up the reproducer on my side, and see
>> if I could get more information.
>>
>
> I could hit this with btrfs/187, and when we hit it, lockdep will report
> the deadlock and turn off, and I think this is the root cause for your
> hitting the original problem, I will add some analysis after the lockdep
> splat.
>
> [12295.973309] ============================================
> [12295.974770] WARNING: possible recursive locking detected
> [12295.974770] 5.10.0-rc2-btrfs-next-71 #20 Not tainted
> [12295.974770] --------------------------------------------
> [12295.974770] zsh/701247 is trying to acquire lock:
> [12295.974770] ffff92cef43480b8 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
> [12295.974770]
> but task is already holding lock:
> [12295.974770] ffff92cef434a038 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
> [12295.974770]
> other info that might help us debug this:
> [12295.974770] Possible unsafe locking scenario:
>
> [12295.974770] CPU0
> [12295.974770] ----
> [12295.974770] lock(&eb->lock);
> [12295.974770] lock(&eb->lock);
> [12295.974770]
> *** DEADLOCK ***
>
> [12295.974770] May be due to missing lock nesting notation
>
> [12295.974770] 2 locks held by zsh/701247:
> [12295.974770] #0: ffff92cef3d315b0 (&sig->cred_guard_mutex){+.+.}-{3:3}, at: bprm_execve+0xaa/0x920
> [12295.974770] #1: ffff92cef434a038 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
> [12295.974770]
> stack backtrace:
> [12295.974770] CPU: 6 PID: 701247 Comm: zsh Not tainted 5.10.0-rc2-btrfs-next-71 #20
> [12295.974770] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.0 12/17/2019
> [12295.974770] Call Trace:
> [12295.974770] dump_stack+0x8b/0xb0
> [12295.974770] __lock_acquire.cold+0x175/0x2e9
> [12295.974770] lock_acquire+0x15b/0x490
> [12295.974770] ? btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
> [12295.974770] ? read_block_for_search+0xf4/0x350 [btrfs]
> [12295.974770] _raw_read_lock+0x40/0xa0
> [12295.974770] ? btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
> [12295.974770] btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
> [12295.974770] btrfs_search_slot+0x6ac/0xca0 [btrfs]
> [12295.974770] btrfs_lookup_xattr+0x7d/0xd0 [btrfs]
> [12295.974770] btrfs_getxattr+0x67/0x130 [btrfs]
> [12295.974770] __vfs_getxattr+0x53/0x70
> [12295.974770] get_vfs_caps_from_disk+0x68/0x1a0
> [12295.974770] ? sched_clock_cpu+0x114/0x180
> [12295.974770] cap_bprm_creds_from_file+0x181/0x6c0
> [12295.974770] security_bprm_creds_from_file+0x2a/0x40
> [12295.974770] begin_new_exec+0xf4/0xc40
> [12295.974770] ? load_elf_phdrs+0x6b/0xb0
> [12295.974770] load_elf_binary+0x66b/0x1620
> [12295.974770] ? read_hv_sched_clock_tsc+0x5/0x20
> [12295.974770] ? sched_clock+0x5/0x10
> [12295.974770] ? sched_clock_local+0x12/0x80
> [12295.974770] ? sched_clock_cpu+0x114/0x180
> [12295.974770] bprm_execve+0x3ce/0x920
> [12295.974770] do_execveat_common+0x1b0/0x1f0
> [12295.974770] __x64_sys_execve+0x39/0x50
> [12295.974770] do_syscall_64+0x33/0x80
> [12295.974770] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [12295.974770] RIP: 0033:0x7f6aaefc13cb
> [12295.974770] Code: 48 3d 00 f0 ff ff 76 e7 f7 d8 64 41 89 00 eb df 0f 1f 80 00 00 00 00 f7 d8 64 41 89 00 eb dc f3 0f 1e fa b8 3b 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 75 4a 0f 00 f7 d8 64 89 01 48
> [12295.974770] RSP: 002b:00007ffd33b54d58 EFLAGS: 00000207 ORIG_RAX: 000000000000003b
> [12295.974770] RAX: ffffffffffffffda RBX: 00007f6aaf28bc88 RCX: 00007f6aaefc13cb
> [12295.974770] RDX: 00007ffd33b5fd98 RSI: 00007f6aaf28bc88 RDI: 00007ffd33b55280
> [12295.974770] RBP: 00007ffd33b54d80 R08: 00007ffd33b54ce0 R09: 0000000000000001
> [12295.974770] R10: 0000000000000008 R11: 0000000000000207 R12: 00007ffd33b55280
>
> The deadlock senario reported by this splat is as follow:
>
> CPU 0 CPU 2
> ===== =====
> btrfs_search_slot():
> btrfs_tree_read_lock_atomic():
> read_lock(&eb->lock); <a random writer>
> write_lock(&eb->lock);
> // waiting for the lock
> ...
> btrfs_tree_read_lock_atomic():
> read_lock(&eb->lock); // blocked because the fairness.
>
> In short, you can not use nested read_lock() on the same lock. However,
> I'm not sure whether we have the real problem here, because I don't
> think btrfs_search_slot() can pick the same extent buffer twice in
> btrfs_search_slot(). Also copy the author of the code for more
> information.

Ah yes. We have several lockdep splats for which the fixes are not in
5.10-rcs yet. Some may already be in the integration branch [1], while
others are not yet in any tree but can be found on the btrfs mailing list.

For that particular one, it's fixed by the following patch:

https://lore.kernel.org/linux-btrfs/1cee2922a32c305056a9559ccf7aede49777beae.1604591048.git.josef@toxicpanda.com/

(It belongs to a series)

You'll likely see more, different lockdep splats; this is due to an
in-progress transition from custom btree locks to rw semaphores, and
some preparatory work for that is already in 5.10-rcs.

Btw, I could hit the fs freeze deadlock even when lockdep didn't produce
any previous splat.

Jan was hitting it too with ext4, and Darrick with xfs - though I can't
tell if they hit any lockdep splat before hitting the freeze deadlock.
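
To make that failure mode concrete, here is a purely illustrative
userspace sketch (nothing like the kernel's percpu_rw_semaphore
internals) of the corruption from the start of this thread: a failed
trylock whose return value is ignored, followed by the matching unlock,
leaves the reader count at -1, so a writer that waits for the count to
drain to zero hangs forever:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_int read_count;           /* stands in for the summed per-cpu counts */
static atomic_bool freeze_in_progress;  /* the write side is being taken */

static bool down_read_trylock(void)
{
	if (atomic_load(&freeze_in_progress))
		return false;           /* trylock fails while the fs freezes */
	atomic_fetch_add(&read_count, 1);
	return true;
}

static void up_read(void)
{
	atomic_fetch_sub(&read_count, 1);
}

int main(void)
{
	atomic_store(&freeze_in_progress, true);

	bool locked = down_read_trylock(); /* fails and returns false */
	(void)locked;                      /* BUG: return value ignored, as in the report */
	up_read();                         /* unlock without a successful lock */

	printf("reader count is now %d; a writer waiting for 0 never wakes up\n",
	       atomic_load(&read_count));
	return 0;
}

Compiled with any C11 compiler it prints -1, matching the drgn dump
earlier in the thread where the SB_FREEZE_PAGEFAULT read_count values
sum to -1.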

Thanks for reporting it.

[1] https://github.com/kdave/btrfs-devel/commits/misc-next

>
> Regards,
> Boqun
>
>> Regards,
>> Boqun
>>
>>> Thanks.
>>>
>>>
>

2020-11-10 01:43:36

by Boqun Feng

[permalink] [raw]
Subject: Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")

On Mon, Nov 09, 2020 at 09:57:05AM +0000, Filipe Manana wrote:
>
>
> On 09/11/20 08:44, Boqun Feng wrote:
> > Hi Filipe,
> >
> > On Thu, Nov 05, 2020 at 09:10:12AM +0800, Boqun Feng wrote:
> >> On Wed, Nov 04, 2020 at 07:54:40PM +0000, Filipe Manana wrote:
> >> [...]
> >>>
> >>> Ok, so I ran 5.10-rc2 plus your two patches (the fix and the debug one):
> >>>
> >>> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> >>> index b71ad8d9f1c9..b31d4ad482c7 100644
> >>> --- a/kernel/locking/lockdep.c
> >>> +++ b/kernel/locking/lockdep.c
> >>> @@ -539,8 +539,10 @@ static struct lock_trace *save_trace(void)
> >>> LOCK_TRACE_SIZE_IN_LONGS;
> >>>
> >>> if (max_entries <= 0) {
> >>> - if (!debug_locks_off_graph_unlock())
> >>> + if (!debug_locks_off_graph_unlock()) {
> >>> + WARN_ON_ONCE(1);
> >>> return NULL;
> >>> + }
> >>>
> >>> print_lockdep_off("BUG: MAX_STACK_TRACE_ENTRIES too low!");
> >>> dump_stack();
> >>> @@ -5465,7 +5467,7 @@ noinstr int lock_is_held_type(const struct
> >>> lockdep_map *lock, int read)
> >>> unsigned long flags;
> >>> int ret = 0;
> >>>
> >>> - if (unlikely(!lockdep_enabled()))
> >>> + if (unlikely(debug_locks && !lockdep_enabled()))
> >>> return 1; /* avoid false negative lockdep_assert_held() */
> >>>
> >>> raw_local_irq_save(flags);
> >>>
> >>> With 3 runs of all fstests, the WARN_ON_ONCE(1) wasn't triggered.
> >>> Unexpected, right?
> >>>
> >>
> >> Kinda, that means we still don't know why lockdep was turned off.
> >>
> >>> Should I try something else?
> >>>
> >>
> >> Thanks for trying this. Let me set up the reproducer on my side, and see
> >> if I could get more information.
> >>
> >
> > I could hit this with btrfs/187, and when we hit it, lockdep will report
> > the deadlock and turn off, and I think this is the root cause for your
> > hitting the original problem, I will add some analysis after the lockdep
> > splat.
> >
> > [12295.973309] ============================================
> > [12295.974770] WARNING: possible recursive locking detected
> > [12295.974770] 5.10.0-rc2-btrfs-next-71 #20 Not tainted
> > [12295.974770] --------------------------------------------
> > [12295.974770] zsh/701247 is trying to acquire lock:
> > [12295.974770] ffff92cef43480b8 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
> > [12295.974770]
> > but task is already holding lock:
> > [12295.974770] ffff92cef434a038 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
> > [12295.974770]
> > other info that might help us debug this:
> > [12295.974770] Possible unsafe locking scenario:
> >
> > [12295.974770] CPU0
> > [12295.974770] ----
> > [12295.974770] lock(&eb->lock);
> > [12295.974770] lock(&eb->lock);
> > [12295.974770]
> > *** DEADLOCK ***
> >
> > [12295.974770] May be due to missing lock nesting notation
> >
> > [12295.974770] 2 locks held by zsh/701247:
> > [12295.974770] #0: ffff92cef3d315b0 (&sig->cred_guard_mutex){+.+.}-{3:3}, at: bprm_execve+0xaa/0x920
> > [12295.974770] #1: ffff92cef434a038 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
> > [12295.974770]
> > stack backtrace:
> > [12295.974770] CPU: 6 PID: 701247 Comm: zsh Not tainted 5.10.0-rc2-btrfs-next-71 #20
> > [12295.974770] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.0 12/17/2019
> > [12295.974770] Call Trace:
> > [12295.974770] dump_stack+0x8b/0xb0
> > [12295.974770] __lock_acquire.cold+0x175/0x2e9
> > [12295.974770] lock_acquire+0x15b/0x490
> > [12295.974770] ? btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
> > [12295.974770] ? read_block_for_search+0xf4/0x350 [btrfs]
> > [12295.974770] _raw_read_lock+0x40/0xa0
> > [12295.974770] ? btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
> > [12295.974770] btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
> > [12295.974770] btrfs_search_slot+0x6ac/0xca0 [btrfs]
> > [12295.974770] btrfs_lookup_xattr+0x7d/0xd0 [btrfs]
> > [12295.974770] btrfs_getxattr+0x67/0x130 [btrfs]
> > [12295.974770] __vfs_getxattr+0x53/0x70
> > [12295.974770] get_vfs_caps_from_disk+0x68/0x1a0
> > [12295.974770] ? sched_clock_cpu+0x114/0x180
> > [12295.974770] cap_bprm_creds_from_file+0x181/0x6c0
> > [12295.974770] security_bprm_creds_from_file+0x2a/0x40
> > [12295.974770] begin_new_exec+0xf4/0xc40
> > [12295.974770] ? load_elf_phdrs+0x6b/0xb0
> > [12295.974770] load_elf_binary+0x66b/0x1620
> > [12295.974770] ? read_hv_sched_clock_tsc+0x5/0x20
> > [12295.974770] ? sched_clock+0x5/0x10
> > [12295.974770] ? sched_clock_local+0x12/0x80
> > [12295.974770] ? sched_clock_cpu+0x114/0x180
> > [12295.974770] bprm_execve+0x3ce/0x920
> > [12295.974770] do_execveat_common+0x1b0/0x1f0
> > [12295.974770] __x64_sys_execve+0x39/0x50
> > [12295.974770] do_syscall_64+0x33/0x80
> > [12295.974770] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > [12295.974770] RIP: 0033:0x7f6aaefc13cb
> > [12295.974770] Code: 48 3d 00 f0 ff ff 76 e7 f7 d8 64 41 89 00 eb df 0f 1f 80 00 00 00 00 f7 d8 64 41 89 00 eb dc f3 0f 1e fa b8 3b 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 75 4a 0f 00 f7 d8 64 89 01 48
> > [12295.974770] RSP: 002b:00007ffd33b54d58 EFLAGS: 00000207 ORIG_RAX: 000000000000003b
> > [12295.974770] RAX: ffffffffffffffda RBX: 00007f6aaf28bc88 RCX: 00007f6aaefc13cb
> > [12295.974770] RDX: 00007ffd33b5fd98 RSI: 00007f6aaf28bc88 RDI: 00007ffd33b55280
> > [12295.974770] RBP: 00007ffd33b54d80 R08: 00007ffd33b54ce0 R09: 0000000000000001
> > [12295.974770] R10: 0000000000000008 R11: 0000000000000207 R12: 00007ffd33b55280
> >
> > The deadlock senario reported by this splat is as follow:
> >
> > CPU 0 CPU 2
> > ===== =====
> > btrfs_search_slot():
> > btrfs_tree_read_lock_atomic():
> > read_lock(&eb->lock); <a random writer>
> > write_lock(&eb->lock);
> > // waiting for the lock
> > ...
> > btrfs_tree_read_lock_atomic():
> > read_lock(&eb->lock); // blocked because the fairness.
> >
> > In short, you can not use nested read_lock() on the same lock. However,
> > I'm not sure whether we have the real problem here, because I don't
> > think btrfs_search_slot() can pick the same extent buffer twice in
> > btrfs_search_slot(). Also copy the author of the code for more
> > information.
>
> Ah yes. We have several lockdep splats for which the fixes are not in
> 5.10-rcs yet. Some may be already in the integration branch [1] and
> others not yet in any tree, but can be found in the btrfs mailing list.
>
> For that particular one, it's fixed by the following patch:
>
> https://lore.kernel.org/linux-btrfs/1cee2922a32c305056a9559ccf7aede49777beae.1604591048.git.josef@toxicpanda.com/
>
> (It belongs to a series)
>
> You'll likely see more different lockdep splats, this is due to a
> transition from custom btree locks to rw semaphores that is in progress,
> and some preparatory work for that is already in 5.10-rcs.
>

Thanks for the information!

> Btw, I could hit the fs freeze deadlock even when lockdep didn't produce
> any previous splat.
>

And this is the weird part, which means lockdep got turned off silently.
That probably means there is a bug (either in fs code or in lockdep) that
we don't know about, and although my previous "fix" works per your
testing, it may hide the real problem...

So I just sent out another fix, which is similar to my previous one but
leaves a warning, so that once we hit the problem we get warned and don't
miss the silent lockdep turn-off.

Regards,
Boqun

> Jan was hitting it too with ext4, and Darrick with xfs - though I can't
> tell if they hit any lockdep splat before hitting the freeze deadlock.
>
> Thanks for reporting it.
>
> [1] https://github.com/kdave/btrfs-devel/commits/misc-next
>
> >
> > Regards,
> > Boqun
> >
> >> Regards,
> >> Boqun
> >>
> >>> Thanks.
> >>>
> >>>
> >