2023-10-30 08:39:28

by ChenXiaoSong

[permalink] [raw]
Subject: Question about LTS 4.19 patch "89047634f5ce NFS: Don't interrupt file writeout due to fatal errors"

Hi Trond and Greg:

LTS 4.19 reported null-ptr-deref BUG as follows:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000080
Call Trace:
 nfs_inode_add_request+0x1cc/0x5b8
 nfs_setup_write_request+0x1fa/0x1fc
 nfs_writepage_setup+0x2d/0x7d
 nfs_updatepage+0x8b8/0x936
 nfs_write_end+0x61d/0xd45
 generic_perform_write+0x19a/0x3f0
 nfs_file_write+0x2cc/0x6e5
 new_sync_write+0x442/0x560
 __vfs_write+0xda/0xef
 vfs_write+0x176/0x48b
 ksys_write+0x10a/0x1e9
 __se_sys_write+0x24/0x29
 __x64_sys_write+0x79/0x93
 do_syscall_64+0x16d/0x4bb
 entry_SYSCALL_64_after_hwframe+0x5c/0xc1

The reason is: generic_error_remove_page set page->mapping to NULL when
nfs server have a fatal error:

nfs_updatepage
  nfs_writepage_setup
    nfs_setup_write_request
      nfs_try_to_update_request // return NULL
        nfs_wb_page // return 0
          nfs_writepage_locked // return 0
            nfs_do_writepage // return 0
              nfs_page_async_flush // return 0
                nfs_error_is_fatal_on_server
                generic_error_remove_page
                  truncate_inode_page
                    delete_from_page_cache
                      __delete_from_page_cache
                        page_cache_tree_delete
                          page->mapping = NULL // this is point
      nfs_create_request
        req->wb_page    = page // the page is freed
      nfs_inode_add_request
        mapping = page_file_mapping(req->wb_page)
          return page->mapping
        spin_lock(&mapping->private_lock) // mapping is NULL

It is reasonable by reverting the patch "89047634f5ce NFS: Don't
interrupt file writeout due to fatal errors" to fix this bug?


This patch is one patch of patchset [Fix up soft mounts for
NFSv4.x](https://lore.kernel.org/all/[email protected]/),
the patchset replace custom error reporting mechanism. it seams that we
should merge all the patchset to LTS 4.19, or all patchs should not be
merged. And the "Fixes:" label is not correct, this patch is a
refactoring patch, not for fixing bugs.


2023-10-30 08:43:37

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: Question about LTS 4.19 patch "89047634f5ce NFS: Don't interrupt file writeout due to fatal errors"

On Mon, Oct 30, 2023 at 04:39:11PM +0800, ChenXiaoSong wrote:
> Hi Trond and Greg:
>
> LTS 4.19 reported null-ptr-deref BUG as follows:
>
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000080
> Call Trace:
> ?nfs_inode_add_request+0x1cc/0x5b8
> ?nfs_setup_write_request+0x1fa/0x1fc
> ?nfs_writepage_setup+0x2d/0x7d
> ?nfs_updatepage+0x8b8/0x936
> ?nfs_write_end+0x61d/0xd45
> ?generic_perform_write+0x19a/0x3f0
> ?nfs_file_write+0x2cc/0x6e5
> ?new_sync_write+0x442/0x560
> ?__vfs_write+0xda/0xef
> ?vfs_write+0x176/0x48b
> ?ksys_write+0x10a/0x1e9
> ?__se_sys_write+0x24/0x29
> ?__x64_sys_write+0x79/0x93
> ?do_syscall_64+0x16d/0x4bb
> ?entry_SYSCALL_64_after_hwframe+0x5c/0xc1
>
> The reason is: generic_error_remove_page set page->mapping to NULL when nfs
> server have a fatal error:
>
> nfs_updatepage
> ? nfs_writepage_setup
> ??? nfs_setup_write_request
> ????? nfs_try_to_update_request // return NULL
> ??????? nfs_wb_page // return 0
> ????????? nfs_writepage_locked // return 0
> ??????????? nfs_do_writepage // return 0
> ????????????? nfs_page_async_flush // return 0
> ??????????????? nfs_error_is_fatal_on_server
> ??????????????? generic_error_remove_page
> ????????????????? truncate_inode_page
> ??????????????????? delete_from_page_cache
> ????????????????????? __delete_from_page_cache
> ??????????????????????? page_cache_tree_delete
> ????????????????????????? page->mapping = NULL // this is point
> ????? nfs_create_request
> ??????? req->wb_page??? = page // the page is freed
> ????? nfs_inode_add_request
> ??????? mapping = page_file_mapping(req->wb_page)
> ????????? return page->mapping
> ??????? spin_lock(&mapping->private_lock) // mapping is NULL
>
> It is reasonable by reverting the patch "89047634f5ce NFS: Don't interrupt
> file writeout due to fatal errors" to fix this bug?

Try it and see, but note, that came from the 4.19.99 release which was
released years ago, are you sure you are using the most recent 4.19.y
release?

> This patch is one patch of patchset [Fix up soft mounts for NFSv4.x](https://lore.kernel.org/all/[email protected]/),
> the patchset replace custom error reporting mechanism. it seams that we
> should merge all the patchset to LTS 4.19, or all patchs should not be
> merged. And the "Fixes:" label is not correct, this patch is a refactoring
> patch, not for fixing bugs.

If we missed some patches, that should be added on top of the current
tree, please let us know the git commit ids of them after you have
tested them that they work properly, and we will gladly apply them.

thanks,

greg k-h

2023-10-30 08:58:58

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: Question about LTS 4.19 patch "89047634f5ce NFS: Don't interrupt file writeout due to fatal errors"

On Mon, Oct 30, 2023 at 04:54:02PM +0800, ChenXiaoSong wrote:
> On 2023/10/30 16:43, Greg KH wrote:
> > Try it and see, but note, that came from the 4.19.99 release which was
> > released years ago, are you sure you are using the most recent 4.19.y
> > release?
>
> I use the most recent release 1b540579cf66 Linux 4.19.296.
>
> > If we missed some patches, that should be added on top of the current
> > tree, please let us know the git commit ids of them after you have
> > tested them that they work properly, and we will gladly apply them.
> Merging the entire patchset may not be the best option. Perhaps a better
> choice is to revert this patch. And I would like to see Trond's suggestion.
>

If you just revert that one patch, is the issue resolved? And what
about other stable releases that have that change in it?

If this does need to be reverted, please submit a patch that does so and
we can review it that way.

thanks,

greg k-h

2023-10-30 09:04:59

by ChenXiaoSong

[permalink] [raw]
Subject: Re: Question about LTS 4.19 patch "89047634f5ce NFS: Don't interrupt file writeout due to fatal errors"


On 2023/10/30 16:58, Greg KH wrote:
> If you just revert that one patch, is the issue resolved? And what
> about other stable releases that have that change in it?
>
> If this does need to be reverted, please submit a patch that does so and
> we can review it that way.
>
> thanks,
>
> greg k-h


In my opinion, this patch is not for fixing a bug, but is part of a
refactoring patchset. The 'Fixes:' tag should not be added.


2023-10-30 09:19:20

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: Question about LTS 4.19 patch "89047634f5ce NFS: Don't interrupt file writeout due to fatal errors"

On Mon, Oct 30, 2023 at 05:04:35PM +0800, ChenXiaoSong wrote:
>
> On 2023/10/30 16:58, Greg KH wrote:
> > If you just revert that one patch, is the issue resolved? And what
> > about other stable releases that have that change in it?
> >
> > If this does need to be reverted, please submit a patch that does so and
> > we can review it that way.
> >
> > thanks,
> >
> > greg k-h
>
>
> In my opinion, this patch is not for fixing a bug, but is part of a
> refactoring patchset. The 'Fixes:' tag should not be added.

Nothing we can do about that now, right? And to try to ask developers
about a change they made in 2019 is a bit rough, don't you think? If
you think the change is incorrect, please submit a patch to resolve it
after you have tested that it works properly.

thanks,

greg k-h

2023-10-30 14:57:05

by Trond Myklebust

[permalink] [raw]
Subject: Re: Question about LTS 4.19 patch "89047634f5ce NFS: Don't interrupt file writeout due to fatal errors"

On Mon, 2023-10-30 at 17:04 +0800, ChenXiaoSong wrote:
> [You don't often get email from [email protected]. Learn
> why this is important at
> https://aka.ms/LearnAboutSenderIdentification ]
>
> On 2023/10/30 16:58, Greg KH wrote:
> > If you just revert that one patch, is the issue resolved?  And what
> > about other stable releases that have that change in it?
> >
> > If this does need to be reverted, please submit a patch that does
> > so and
> > we can review it that way.
> >
> > thanks,
> >
> > greg k-h
>
>
> In my opinion, this patch is not for fixing a bug, but is part of a
> refactoring patchset. The 'Fixes:' tag should not be added.
>
>

A refactoring is by definition a change that does not affect code
behaviour. It is obvious that this was never intended to be such a
patch.

The reason that the bug is occurring in 4.19.x, and not in the latest
kernels, is because the former is missing another bugfix (one which
actually is missing a "Fixes:" tag).

Can you therefore please check if applying commit 22876f540bdf ("NFS:
Don't call generic_error_remove_page() while holding locks") fixes the
issue.

Note that the latter patch is needed in any case in order to fix a read
deadlock (as indicated on the label).

Thanks,
Trond

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2023-11-17 03:28:56

by ChenXiaoSong

[permalink] [raw]
Subject: Re: Question about LTS 4.19 patch "89047634f5ce NFS: Don't interrupt file writeout due to fatal errors"

On 2023/10/30 22:56, Trond Myklebust wrote:
> A refactoring is by definition a change that does not affect code
> behaviour. It is obvious that this was never intended to be such a
> patch.
>
> The reason that the bug is occurring in 4.19.x, and not in the latest
> kernels, is because the former is missing another bugfix (one which
> actually is missing a "Fixes:" tag).
>
> Can you therefore please check if applying commit 22876f540bdf ("NFS:
> Don't call generic_error_remove_page() while holding locks") fixes the
> issue.
>
> Note that the latter patch is needed in any case in order to fix a read
> deadlock (as indicated on the label).
>
> Thanks,
> Trond
>
After applying commit 22876f540bdf ("NFS: Don't call
generic_error_remove_page() while holding locks"), I encountered an
issue of infinite loop:

write ... nfs_updatepage nfs_writepage_setup nfs_setup_write_request
nfs_try_to_update_request nfs_wb_page if (clear_page_dirty_for_io(page))
// true nfs_writepage_locked // return 0 nfs_do_writepage // return 0
nfs_page_async_flush // return 0 nfs_error_is_fatal_on_server
nfs_write_error_remove_page SetPageError // instead of
generic_error_remove_page // loop begin if
(clear_page_dirty_for_io(page)) // false if (!PagePrivate(page)) //
false ret = nfs_commit_inode = 0 // loop again, never quit

before applying commit 22876f540bdf ("NFS: Don't call
generic_error_remove_page() while holding locks"),
generic_error_remove_page() will clear PG_private, and infinite loop
will never happen:

generic_error_remove_page truncate_inode_page truncate_cleanup_page
do_invalidatepage nfs_invalidate_page nfs_wb_page_cancel
nfs_inode_remove_request ClearPagePrivate(head->wb_page)

If applying this patch, are other patches required? And I cannot
reproducethe read deadlock bug that the patch want to fix, are there
specific conditions required to reproduce this read deadlock bug?