LinuxLists.cc - Re: possible deadlock in start_this

2021-02-15 14:31:29

Subject: Re: possible deadlock in start_this_handle (2)

On Mon 15-02-21 23:06:15, Tetsuo Handa wrote:
> On 2021/02/15 21:45, Jan Kara wrote:
> > On Sat 13-02-21 23:26:37, Tetsuo Handa wrote:
> >> Excuse me, but it seems to me that nothing prevents
> >> ext4_xattr_set_handle() from reaching ext4_xattr_inode_lookup_create()
> >> without memalloc_nofs_save() when hitting ext4_get_nojournal() path.
> >> Will you explain when ext4_get_nojournal() path is executed?
> >
> > That's a good question but sadly I don't think that's it.
> > ext4_get_nojournal() is called when the filesystem is created without a
> > journal. In that case we also don't acquire jbd2_handle lockdep map. In the
> > syzbot report we can see:
>
> Since syzbot can test filesystem images, syzbot might have tested a filesystem
> image created both with and without journal within this boot.

a) I think that syzbot reboots the VM between executing different tests to
get reproducible conditions. But in theory I agree the test may have
contained one image with and one image without a journal.

*but*

b) as I wrote in the email you are replying to, the jbd2_handle key is
private per filesystem. Thus for lockdep to complain about
jbd2_handle->fs_reclaim->jbd2_handle deadlock, those jbd2_handle lockdep
maps must come from the same filesystem.

*and*

c) filesystem without journal doesn't use jbd2_handle lockdep map at all so
for such filesystems lockdep creates no dependency for jbd2_handle map.

Honza

>
> >
> > kswapd0/2246 is trying to acquire lock:
> > ffff888041a988e0 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0xf81/0x1380 fs/jbd2/transaction.c:444
> >
> > but task is already holding lock:
> > ffffffff8be892c0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x0/0x30 mm/page_alloc.c:5195
> >
> > So this filesystem has very clearly been created with a journal. Also the
> > journal lockdep tracking machinery uses:
>
> While locks held by kswapd0/2246 are fs_reclaim, shrinker_rwsem, &type->s_umount_key#38
> and jbd2_handle, isn't the dependency lockdep considers problematic is
>
> Chain exists of:
> jbd2_handle --> &ei->xattr_sem --> fs_reclaim
>
> Possible unsafe locking scenario:
>
> CPU0 CPU1
> ---- ----
> lock(fs_reclaim);
> lock(&ei->xattr_sem);
> lock(fs_reclaim);
> lock(jbd2_handle);
>
> where CPU0 is kswapd/2246 and CPU1 is the case of ext4_get_nojournal() path?
> If someone has taken jbd2_handle and &ei->xattr_sem in this order, isn't this
> dependency true?
>
> >
> > rwsem_acquire_read(&journal->j_trans_commit_map, 0, 0, _THIS_IP_);
> >
> > so a lockdep key is per-filesystem. Thus it is not possible that lockdep
> > would combine lock dependencies from two different filesystems.
> >
> > But I guess we could narrow the search for this problem by adding WARN_ONs
> > to ext4_xattr_set_handle() and ext4_xattr_inode_lookup_create() like:
> >
> > WARN_ON(ext4_handle_valid(handle) && !(current->flags & PF_MEMALLOC_NOFS));
> >
> > It would narrow down a place in which PF_MEMALLOC_NOFS flag isn't set
> > properly... At least that seems like the most plausible way forward to me.
>
> You can use CONFIG_DEBUG_AID_FOR_SYZBOT for adding such WARN_ONs on linux-next.
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-02-19 10:19:30

by Tetsuo Handa

[permalink] [raw]

Subject: Re: possible deadlock in start_this_handle (2)

On 2021/02/15 23:29, Jan Kara wrote:
> On Mon 15-02-21 23:06:15, Tetsuo Handa wrote:
>> On 2021/02/15 21:45, Jan Kara wrote:
>>> On Sat 13-02-21 23:26:37, Tetsuo Handa wrote:
>>>> Excuse me, but it seems to me that nothing prevents
>>>> ext4_xattr_set_handle() from reaching ext4_xattr_inode_lookup_create()
>>>> without memalloc_nofs_save() when hitting ext4_get_nojournal() path.
>>>> Will you explain when ext4_get_nojournal() path is executed?
>>>
>>> That's a good question but sadly I don't think that's it.
>>> ext4_get_nojournal() is called when the filesystem is created without a
>>> journal. In that case we also don't acquire jbd2_handle lockdep map. In the
>>> syzbot report we can see:
>>
>> Since syzbot can test filesystem images, syzbot might have tested a filesystem
>> image created both with and without journal within this boot.
>
> a) I think that syzbot reboots the VM between executing different tests to
> get reproducible conditions. But in theory I agree the test may have
> contained one image with and one image without a journal.

syzkaller reboots the VM upon a crash.
syzkaller can test multiple filesystem images within one boot.

https://storage.googleapis.com/syzkaller/cover/ci-qemu-upstream-386.html (this
statistic snapshot is volatile) reports that ext4_get_nojournal() is partially covered
( https://github.com/google/syzkaller/blob/master/docs/coverage.md ) by syzkaller.

/* Just increment the non-pointer handle value */
static handle_t *ext4_get_nojournal(void)
{
86 handle_t *handle = current->journal_info;
unsigned long ref_cnt = (unsigned long)handle;

BUG_ON(ref_cnt >= EXT4_NOJOURNAL_MAX_REF_COUNT);

86 ref_cnt++;
handle = (handle_t *)ref_cnt;

current->journal_info = handle;
2006 return handle;
}

/* Decrement the non-pointer handle value */
static void ext4_put_nojournal(handle_t *handle)
{
unsigned long ref_cnt = (unsigned long)handle;

85 BUG_ON(ref_cnt == 0);

85 ref_cnt--;
handle = (handle_t *)ref_cnt;

current->journal_info = handle;
}

handle_t *__ext4_journal_start_sb(struct super_block *sb, unsigned int line,
int type, int blocks, int rsv_blocks,
int revoke_creds)
{
journal_t *journal;
int err;

2006 trace_ext4_journal_start(sb, blocks, rsv_blocks, revoke_creds,
2006 _RET_IP_);
2006 err = ext4_journal_check_start(sb);
if (err < 0)
return ERR_PTR(err);

2005 journal = EXT4_SB(sb)->s_journal;
1969 if (!journal || (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY))
2006 return ext4_get_nojournal();
1969 return jbd2__journal_start(journal, blocks, rsv_blocks, revoke_creds,
GFP_NOFS, type, line);
}

>
> *but*
>
> b) as I wrote in the email you are replying to, the jbd2_handle key is
> private per filesystem. Thus for lockdep to complain about
> jbd2_handle->fs_reclaim->jbd2_handle deadlock, those jbd2_handle lockdep
> maps must come from the same filesystem.
>
> *and*
>
> c) filesystem without journal doesn't use jbd2_handle lockdep map at all so
> for such filesystems lockdep creates no dependency for jbd2_handle map.
>

What about "EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY)" case?
Does this case happen on filesystem with journal, and can this case be executed
by fuzzing a crafted (a sort of erroneous) filesystem with journal, and are
the jbd2_handle for calling ext4_get_nojournal() case and the jbd2_handle for
calling jbd2__journal_start() case the same?

Also, I worry that jbd2__journal_restart() is problematic, for it calls
stop_this_handle() (which calls memalloc_nofs_restore()) and then calls
start_this_handle() (which fails to call memalloc_nofs_save() if an error
occurs). An error from start_this_handle() from journal restart operation
might need special handling (i.e. either remount-ro or panic) ?

2021-02-19 17:25:50

by harshad shirwadkar

[permalink] [raw]

Subject: Re: possible deadlock in start_this_handle (2)

On Fri, Feb 19, 2021 at 2:20 AM Tetsuo Handa
<[email protected]> wrote:
>
> On 2021/02/15 23:29, Jan Kara wrote:
> > On Mon 15-02-21 23:06:15, Tetsuo Handa wrote:
> >> On 2021/02/15 21:45, Jan Kara wrote:
> >>> On Sat 13-02-21 23:26:37, Tetsuo Handa wrote:
> >>>> Excuse me, but it seems to me that nothing prevents
> >>>> ext4_xattr_set_handle() from reaching ext4_xattr_inode_lookup_create()
> >>>> without memalloc_nofs_save() when hitting ext4_get_nojournal() path.
> >>>> Will you explain when ext4_get_nojournal() path is executed?
> >>>
> >>> That's a good question but sadly I don't think that's it.
> >>> ext4_get_nojournal() is called when the filesystem is created without a
> >>> journal. In that case we also don't acquire jbd2_handle lockdep map. In the
> >>> syzbot report we can see:
> >>
> >> Since syzbot can test filesystem images, syzbot might have tested a filesystem
> >> image created both with and without journal within this boot.
> >
> > a) I think that syzbot reboots the VM between executing different tests to
> > get reproducible conditions. But in theory I agree the test may have
> > contained one image with and one image without a journal.
>
> syzkaller reboots the VM upon a crash.
> syzkaller can test multiple filesystem images within one boot.
>
> https://storage.googleapis.com/syzkaller/cover/ci-qemu-upstream-386.html (this
> statistic snapshot is volatile) reports that ext4_get_nojournal() is partially covered
> ( https://github.com/google/syzkaller/blob/master/docs/coverage.md ) by syzkaller.
>
> /* Just increment the non-pointer handle value */
> static handle_t *ext4_get_nojournal(void)
> {
> 86 handle_t *handle = current->journal_info;
> unsigned long ref_cnt = (unsigned long)handle;
>
> BUG_ON(ref_cnt >= EXT4_NOJOURNAL_MAX_REF_COUNT);
>
> 86 ref_cnt++;
> handle = (handle_t *)ref_cnt;
>
> current->journal_info = handle;
> 2006 return handle;
> }
>
>
> /* Decrement the non-pointer handle value */
> static void ext4_put_nojournal(handle_t *handle)
> {
> unsigned long ref_cnt = (unsigned long)handle;
>
> 85 BUG_ON(ref_cnt == 0);
>
> 85 ref_cnt--;
> handle = (handle_t *)ref_cnt;
>
> current->journal_info = handle;
> }
>
>
> handle_t *__ext4_journal_start_sb(struct super_block *sb, unsigned int line,
> int type, int blocks, int rsv_blocks,
> int revoke_creds)
> {
> journal_t *journal;
> int err;
>
> 2006 trace_ext4_journal_start(sb, blocks, rsv_blocks, revoke_creds,
> 2006 _RET_IP_);
> 2006 err = ext4_journal_check_start(sb);
> if (err < 0)
> return ERR_PTR(err);
>
> 2005 journal = EXT4_SB(sb)->s_journal;
> 1969 if (!journal || (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY))
> 2006 return ext4_get_nojournal();
> 1969 return jbd2__journal_start(journal, blocks, rsv_blocks, revoke_creds,
> GFP_NOFS, type, line);
> }
>
> >
> > *but*
> >
> > b) as I wrote in the email you are replying to, the jbd2_handle key is
> > private per filesystem. Thus for lockdep to complain about
> > jbd2_handle->fs_reclaim->jbd2_handle deadlock, those jbd2_handle lockdep
> > maps must come from the same filesystem.
> >
> > *and*
> >
> > c) filesystem without journal doesn't use jbd2_handle lockdep map at all so
> > for such filesystems lockdep creates no dependency for jbd2_handle map.
> >
>
> What about "EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY)" case?
> Does this case happen on filesystem with journal, and can this case be executed
> by fuzzing a crafted (a sort of erroneous) filesystem with journal, and are
> the jbd2_handle for calling ext4_get_nojournal() case and the jbd2_handle for
> calling jbd2__journal_start() case the same?
EXT4_FC_REPLAY is a mount state that is only set during jbd2 journal
recovery. The only way for jbd2 journal recovery to set EXT4_FC_REPLAY
option is if after a journal crash there are special fast_commit
blocks present in the journal. For these fast_commit blocks to be
present in the journal, the Ext4 file system prior to crash should
have had "fast_commit" feature enabled.

If we have a way to look at the Ext4 partition that syzbot used for
reporting this bug, it is very easy to see if this FC_REPLAY will ever
be set or not. Just running "debugfs <image>" and inside debugfs,
running logdump will show us if there are any fast commit blocks
present in the journal.

Having said that, I have following reason to believe that this option
wasn't set during the syzbot failure:

EXT4_FC_REPLAY will only be set during journal recovery and is cleared
immediately after. Which means EXT4_FC_REPLAY will only be set during
mount and as soon as mount returns the option will be cleared. Looking
at the stack trace, it shows no evidence that we are in the journal
recovery phase. It seems like most of the traces are resulting from
system calls made by the user. I checked if we are accidentally
setting this flag even after journal recovery, but that doesn't seem
to be the case. On a successfully mounted file system, we
unconditionally clear this flag.

- Harshad

>
> Also, I worry that jbd2__journal_restart() is problematic, for it calls
> stop_this_handle() (which calls memalloc_nofs_restore()) and then calls
> start_this_handle() (which fails to call memalloc_nofs_save() if an error
> occurs). An error from start_this_handle() from journal restart operation
> might need special handling (i.e. either remount-ro or panic) ?
>