2023-07-25 15:10:41

by Scott Mayhew

Subject: [PATCH 1/1] NFS: Fix potential oops in nfs_inode_remove_request()

Once a folio's private data has been cleared, it's possible for another
process to clear the folio->mapping (e.g. via invalidate_complete_folio2
or evict_mapping_folio), so it wouldn't be safe to call
nfs_page_to_inode() after that.

Fixes: 0c493b5cf16e ("NFS: Convert buffered writes to use folios")
Signed-off-by: Scott Mayhew <[email protected]>
---
fs/nfs/write.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index f4cca8f00c0c..489c3f9dae23 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -785,6 +785,8 @@ static void nfs_inode_add_request(struct nfs_page *req)
*/
static void nfs_inode_remove_request(struct nfs_page *req)
{
+ struct nfs_inode *nfsi = NFS_I(nfs_page_to_inode(req));
+
if (nfs_page_group_sync_on_bit(req, PG_REMOVE)) {
struct folio *folio = nfs_page_to_folio(req->wb_head);
struct address_space *mapping = folio_file_mapping(folio);
@@ -800,7 +802,7 @@ static void nfs_inode_remove_request(struct nfs_page *req)

if (test_and_clear_bit(PG_INODE_REF, &req->wb_flags)) {
nfs_release_request(req);
- atomic_long_dec(&NFS_I(nfs_page_to_inode(req))->nrequests);
+ atomic_long_dec(&nfsi->nrequests);
}
}

--
2.41.0



2023-07-25 15:15:19

by Trond Myklebust

Subject: Re: [PATCH 1/1] NFS: Fix potential oops in nfs_inode_remove_request()

On Tue, 2023-07-25 at 11:08 -0400, Scott Mayhew wrote:
> Once a folio's private data has been cleared, it's possible for
> another
> process to clear the folio->mapping (e.g. via
> invalidate_complete_folio2
> or evict_mapping_folio), so it wouldn't be safe to call
> nfs_page_to_inode() after that.
>
> Fixes: 0c493b5cf16e ("NFS: Convert buffered writes to use folios")
> Signed-off-by: Scott Mayhew <[email protected]>
> ---
>  fs/nfs/write.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index f4cca8f00c0c..489c3f9dae23 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -785,6 +785,8 @@ static void nfs_inode_add_request(struct nfs_page
> *req)
>   */
>  static void nfs_inode_remove_request(struct nfs_page *req)
>  {
> +       struct nfs_inode *nfsi = NFS_I(nfs_page_to_inode(req));
> +
>         if (nfs_page_group_sync_on_bit(req, PG_REMOVE)) {
>                 struct folio *folio = nfs_page_to_folio(req-
> >wb_head);
>                 struct address_space *mapping =
> folio_file_mapping(folio);
> @@ -800,7 +802,7 @@ static void nfs_inode_remove_request(struct
> nfs_page *req)
>  
>         if (test_and_clear_bit(PG_INODE_REF, &req->wb_flags)) {
>                 nfs_release_request(req);
> -               atomic_long_dec(&NFS_I(nfs_page_to_inode(req))-
> >nrequests);
> +               atomic_long_dec(&nfsi->nrequests);

Why not just invert the order of the atomic_long_dec() and the
nfs_release_request()? That way you are also ensuring that the inode is
still pinned in memory by the open context.
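
For clarity, the reordering being suggested would look something like this (a sketch of the idea, not a tested patch):

```c
/* Sketch of the suggested inversion: decrement nrequests while the
 * request (and thus the open context pinning the inode) is still
 * held, and only then drop the request reference. */
if (test_and_clear_bit(PG_INODE_REF, &req->wb_flags)) {
	atomic_long_dec(&NFS_I(nfs_page_to_inode(req))->nrequests);
	nfs_release_request(req);
}
```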
>         }
>  }
>  

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2023-07-25 16:41:56

by Scott Mayhew

Subject: Re: [PATCH 1/1] NFS: Fix potential oops in nfs_inode_remove_request()

On Tue, 25 Jul 2023, Trond Myklebust wrote:

> On Tue, 2023-07-25 at 11:08 -0400, Scott Mayhew wrote:
> > Once a folio's private data has been cleared, it's possible for
> > another
> > process to clear the folio->mapping (e.g. via
> > invalidate_complete_folio2
> > or evict_mapping_folio), so it wouldn't be safe to call
> > nfs_page_to_inode() after that.
> >
> > Fixes: 0c493b5cf16e ("NFS: Convert buffered writes to use folios")
> > Signed-off-by: Scott Mayhew <[email protected]>
> > ---
> >  fs/nfs/write.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> > index f4cca8f00c0c..489c3f9dae23 100644
> > --- a/fs/nfs/write.c
> > +++ b/fs/nfs/write.c
> > @@ -785,6 +785,8 @@ static void nfs_inode_add_request(struct nfs_page
> > *req)
> >   */
> >  static void nfs_inode_remove_request(struct nfs_page *req)
> >  {
> > +       struct nfs_inode *nfsi = NFS_I(nfs_page_to_inode(req));
> > +
> >         if (nfs_page_group_sync_on_bit(req, PG_REMOVE)) {
> >                 struct folio *folio = nfs_page_to_folio(req-
> > >wb_head);
> >                 struct address_space *mapping =
> > folio_file_mapping(folio);
> > @@ -800,7 +802,7 @@ static void nfs_inode_remove_request(struct
> > nfs_page *req)
> >
> >         if (test_and_clear_bit(PG_INODE_REF, &req->wb_flags)) {
> >                 nfs_release_request(req);
> > -               atomic_long_dec(&NFS_I(nfs_page_to_inode(req))-
> > >nrequests);
> > +               atomic_long_dec(&nfsi->nrequests);
>
> Why not just invert the order of the atomic_long_dec() and the
> nfs_release_request()? That way you are also ensuring that the inode is
> still pinned in memory by the open context.

I'm not following. How does inverting the order prevent the
folio->mapping from getting clobbered?

-Scott
> > ????????}
> > ?}
> > ?
>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> [email protected]
>
>


2023-07-25 17:50:50

by Trond Myklebust

Subject: Re: [PATCH 1/1] NFS: Fix potential oops in nfs_inode_remove_request()

On Tue, 2023-07-25 at 12:24 -0400, Scott Mayhew wrote:
> On Tue, 25 Jul 2023, Trond Myklebust wrote:
>
> > On Tue, 2023-07-25 at 11:08 -0400, Scott Mayhew wrote:
> > > Once a folio's private data has been cleared, it's possible for
> > > another
> > > process to clear the folio->mapping (e.g. via
> > > invalidate_complete_folio2
> > > or evict_mapping_folio), so it wouldn't be safe to call
> > > nfs_page_to_inode() after that.
> > >
> > > Fixes: 0c493b5cf16e ("NFS: Convert buffered writes to use
> > > folios")
> > > Signed-off-by: Scott Mayhew <[email protected]>
> > > ---
> > >  fs/nfs/write.c | 4 +++-
> > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> > > index f4cca8f00c0c..489c3f9dae23 100644
> > > --- a/fs/nfs/write.c
> > > +++ b/fs/nfs/write.c
> > > @@ -785,6 +785,8 @@ static void nfs_inode_add_request(struct
> > > nfs_page
> > > *req)
> > >   */
> > >  static void nfs_inode_remove_request(struct nfs_page *req)
> > >  {
> > > +       struct nfs_inode *nfsi = NFS_I(nfs_page_to_inode(req));
> > > +
> > >         if (nfs_page_group_sync_on_bit(req, PG_REMOVE)) {
> > >                 struct folio *folio = nfs_page_to_folio(req-
> > > > wb_head);
> > >                 struct address_space *mapping =
> > > folio_file_mapping(folio);
> > > @@ -800,7 +802,7 @@ static void nfs_inode_remove_request(struct
> > > nfs_page *req)
> > >  
> > >         if (test_and_clear_bit(PG_INODE_REF, &req->wb_flags)) {
> > >                 nfs_release_request(req);
> > > -               atomic_long_dec(&NFS_I(nfs_page_to_inode(req))-
> > > > nrequests);
> > > +               atomic_long_dec(&nfsi->nrequests);
> >
> > Why not just invert the order of the atomic_long_dec() and the
> > nfs_release_request()? That way you are also ensuring that the
> > inode is
> > still pinned in memory by the open context.
>
> I'm not following.  How does inverting the order prevent the
> folio->mapping from getting clobbered?
>

The open/lock context is refcounted by the nfs_page until the latter is
released. That's why the inode is guaranteed to remain around at least
until the call to nfs_release_request().


--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2023-07-26 13:03:29

by Jeffrey Layton

Subject: Re: [PATCH 1/1] NFS: Fix potential oops in nfs_inode_remove_request()

On Tue, 2023-07-25 at 17:41 +0000, Trond Myklebust wrote:
> On Tue, 2023-07-25 at 12:24 -0400, Scott Mayhew wrote:
> > On Tue, 25 Jul 2023, Trond Myklebust wrote:
> >
> > > On Tue, 2023-07-25 at 11:08 -0400, Scott Mayhew wrote:
> > > > Once a folio's private data has been cleared, it's possible for
> > > > another
> > > > process to clear the folio->mapping (e.g. via
> > > > invalidate_complete_folio2
> > > > or evict_mapping_folio), so it wouldn't be safe to call
> > > > nfs_page_to_inode() after that.
> > > >
> > > > Fixes: 0c493b5cf16e ("NFS: Convert buffered writes to use
> > > > folios")
> > > > Signed-off-by: Scott Mayhew <[email protected]>
> > > > ---
> > > >  fs/nfs/write.c | 4 +++-
> > > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> > > > index f4cca8f00c0c..489c3f9dae23 100644
> > > > --- a/fs/nfs/write.c
> > > > +++ b/fs/nfs/write.c
> > > > @@ -785,6 +785,8 @@ static void nfs_inode_add_request(struct
> > > > nfs_page
> > > > *req)
> > > >   */
> > > >  static void nfs_inode_remove_request(struct nfs_page *req)
> > > >  {
> > > > +       struct nfs_inode *nfsi = NFS_I(nfs_page_to_inode(req));
> > > > +
> > > >         if (nfs_page_group_sync_on_bit(req, PG_REMOVE)) {
> > > >                 struct folio *folio = nfs_page_to_folio(req-
> > > > >wb_head);
> > > >                 struct address_space *mapping =
> > > > folio_file_mapping(folio);
> > > > @@ -800,7 +802,7 @@ static void nfs_inode_remove_request(struct
> > > > nfs_page *req)
> > > >
> > > >         if (test_and_clear_bit(PG_INODE_REF, &req->wb_flags)) {
> > > >                 nfs_release_request(req);
> > > > -               atomic_long_dec(&NFS_I(nfs_page_to_inode(req))-
> > > > >nrequests);
> > > > +               atomic_long_dec(&nfsi->nrequests);
> > >
> > > Why not just invert the order of the atomic_long_dec() and the
> > > nfs_release_request()? That way you are also ensuring that the
> > > inode is
> > > still pinned in memory by the open context.
> >
> > I'm not following.  How does inverting the order prevent the
> > folio->mapping from getting clobbered?
> >
>
> The open/lock context is refcounted by the nfs_page until the latter is
> released. That's why the inode is guaranteed to remain around at least
> until the call to nfs_release_request().
>

The problem is not that the inode is going away, but rather that we
can't guarantee that the page is still part of the mapping at this
point, and so we can't safely dereference page->mapping there. I do see
that nfs_release_request releases a reference to the page, but I don't
think that's sufficient to ensure that it remains part of the mapping.

AFAICT, once we clear page->private, the page is subject to be removed
from the mapping. So, I *think* it's safe to just move the call to
nfs_page_to_inode prior to the call to nfs_page_group_sync_on_bit.
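
In other words, something along these lines (a sketch of the idea, which is essentially what the posted patch already does by capturing the inode at function entry):

```c
/* Sketch: capture the inode before the PG_REMOVE path can clear
 * folio->private, so folio->mapping is never dereferenced after
 * the folio may have been evicted from the mapping. */
static void nfs_inode_remove_request(struct nfs_page *req)
{
	struct nfs_inode *nfsi = NFS_I(nfs_page_to_inode(req));

	if (nfs_page_group_sync_on_bit(req, PG_REMOVE)) {
		/* ... detach folio private data, as before ... */
	}

	if (test_and_clear_bit(PG_INODE_REF, &req->wb_flags)) {
		nfs_release_request(req);
		atomic_long_dec(&nfsi->nrequests);
	}
}
```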
--
Jeff Layton <[email protected]>

2023-07-26 14:10:00

by Scott Mayhew

Subject: Re: [PATCH 1/1] NFS: Fix potential oops in nfs_inode_remove_request()

On Wed, 26 Jul 2023, Jeff Layton wrote:

> On Tue, 2023-07-25 at 17:41 +0000, Trond Myklebust wrote:
> > On Tue, 2023-07-25 at 12:24 -0400, Scott Mayhew wrote:
> > > On Tue, 25 Jul 2023, Trond Myklebust wrote:
> > >
> > > > On Tue, 2023-07-25 at 11:08 -0400, Scott Mayhew wrote:
> > > > > Once a folio's private data has been cleared, it's possible for
> > > > > another
> > > > > process to clear the folio->mapping (e.g. via
> > > > > invalidate_complete_folio2
> > > > > or evict_mapping_folio), so it wouldn't be safe to call
> > > > > nfs_page_to_inode() after that.
> > > > >
> > > > > Fixes: 0c493b5cf16e ("NFS: Convert buffered writes to use
> > > > > folios")
> > > > > Signed-off-by: Scott Mayhew <[email protected]>
> > > > > ---
> > > > >  fs/nfs/write.c | 4 +++-
> > > > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> > > > > index f4cca8f00c0c..489c3f9dae23 100644
> > > > > --- a/fs/nfs/write.c
> > > > > +++ b/fs/nfs/write.c
> > > > > @@ -785,6 +785,8 @@ static void nfs_inode_add_request(struct
> > > > > nfs_page
> > > > > *req)
> > > > >   */
> > > > >  static void nfs_inode_remove_request(struct nfs_page *req)
> > > > >  {
> > > > > +       struct nfs_inode *nfsi = NFS_I(nfs_page_to_inode(req));
> > > > > +
> > > > >         if (nfs_page_group_sync_on_bit(req, PG_REMOVE)) {
> > > > >                 struct folio *folio = nfs_page_to_folio(req-
> > > > > >wb_head);
> > > > >                 struct address_space *mapping =
> > > > > folio_file_mapping(folio);
> > > > > @@ -800,7 +802,7 @@ static void nfs_inode_remove_request(struct
> > > > > nfs_page *req)
> > > > >
> > > > >         if (test_and_clear_bit(PG_INODE_REF, &req->wb_flags)) {
> > > > >                 nfs_release_request(req);
> > > > > -               atomic_long_dec(&NFS_I(nfs_page_to_inode(req))-
> > > > > >nrequests);
> > > > > +               atomic_long_dec(&nfsi->nrequests);
> > > >
> > > > Why not just invert the order of the atomic_long_dec() and the
> > > > nfs_release_request()? That way you are also ensuring that the
> > > > inode is
> > > > still pinned in memory by the open context.
> > >
> > > I'm not following.  How does inverting the order prevent the
> > > folio->mapping from getting clobbered?
> > >
> >
> > The open/lock context is refcounted by the nfs_page until the latter is
> > released. That's why the inode is guaranteed to remain around at least
> > until the call to nfs_release_request().
> >
>
> The problem is not that the inode is going away, but rather that we
> can't guarantee that the page is still part of the mapping at this
> point, and so we can't safely dereference page->mapping there. I do see
> that nfs_release_request releases a reference to the page, but I don't
> think that's sufficient to ensure that it remains part of the mapping.
>
> AFAICT, once we clear page->private, the page is subject to be removed
> from the mapping. So, I *think* it's safe to just move the call to
> nfs_page_to_inode prior to the call to nfs_page_group_sync_on_bit.

Yeah, the inode hasn't gone away. I can pick the nfs_commit_data
address off the stack in nfs_commit_release_pages:

crash> nfs_commit_data.inode c0000006774cae00
inode = 0xc00000006c1b05f8,

The nfs_inode is still allocated:

crash> kmem 0xc00000006c1b05f8
CACHE OBJSIZE ALLOCATED TOTAL SLABS SSIZE NAME
c000000030332600 1088 128 5959 101 64k nfs_inode_cache
SLAB MEMORY NODE TOTAL ALLOCATED FREE
c00c0000001b06c0 c00000006c1b0000 0 59 1 58
FREE / [ALLOCATED]
[c00000006c1b0448]

PAGE PHYSICAL MAPPING INDEX CNT FLAGS
c00c0000001b06c0 6c1b0000 c000000030332600 c00000006c1b4480 1 23ffff800000200 slab

The vfs_inode:

crash> px &((struct nfs_inode *)0xc00000006c1b0448)->vfs_inode
$7 = (struct inode *) 0xc00000006c1b05f8

Matches the inodes open by both nfs_flock programs from the test:

crash> foreach nfs_flock files
PID: 4006780 TASK: c00000009d436600 CPU: 43 COMMAND: "nfs_flock"
ROOT: / CWD: /tmp/ltp-aFr4AJt3R9/LTP_nfslock01.9hyHNgoKqq/3/0/
FD FILE DENTRY INODE TYPE PATH
0 c000000196e9a000 c000000004090840 c00000000ae13bf0 CHR /dev/null
1 c0000000bfd1ff00 c0000000963e0e40 c00000006c573900 REG /opt/ltp/output/nfslock01.sh_20230610112802
2 c0000000bfd1ff00 c0000000963e0e40 c00000006c573900 REG /opt/ltp/output/nfslock01.sh_20230610112802
3 c000000196e97700 c000000419ccb040 c00000006c1b05f8 REG /tmp/ltp-aFr4AJt3R9/LTP_nfslock01.9hyHNgoKqq/3/0/flock_idata
^^^^^^^^^^^^^^^^

PID: 4006781 TASK: c00000009d42d500 CPU: 42 COMMAND: "nfs_flock"
ROOT: / CWD: /tmp/ltp-aFr4AJt3R9/LTP_nfslock01.9hyHNgoKqq/3/0/
FD FILE DENTRY INODE TYPE PATH
0 c0000000f0812200 c000000004090840 c00000000ae13bf0 CHR /dev/null
1 c0000000bfd1ff00 c0000000963e0e40 c00000006c573900 REG /opt/ltp/output/nfslock01.sh_20230610112802
2 c0000000bfd1ff00 c0000000963e0e40 c00000006c573900 REG /opt/ltp/output/nfslock01.sh_20230610112802
3 c0000000f0813c00 c000000419ccb040 c00000006c1b05f8 REG /tmp/ltp-aFr4AJt3R9/LTP_nfslock01.9hyHNgoKqq/3/0/flock_idata
^^^^^^^^^^^^^^^^

The file->f_mapping for both struct files matches the inode->i_data:

crash> file.f_mapping c000000196e97700
f_mapping = 0xc00000006c1b0770,
crash> file.f_mapping c0000000f0813c00
f_mapping = 0xc00000006c1b0770,
crash> px &((struct inode *)0xc00000006c1b05f8)->i_data
$8 = (struct address_space *) 0xc00000006c1b0770

and if I look at one of those nfs_flock tasks, the folio
passed in to nfs_read_folio has the same mapping:

crash> bt 4006781
PID: 4006781 TASK: c00000009d42d500 CPU: 42 COMMAND: "nfs_flock"
#0 [c000000177053710] __schedule at c000000000f61d9c
#1 [c0000001770537d0] schedule at c000000000f621f4
#2 [c000000177053840] io_schedule at c000000000f62354
#3 [c000000177053870] folio_wait_bit_common at c00000000042dc60
#4 [c000000177053970] nfs_read_folio at c0080000050108a8 [nfs]
#5 [c000000177053a60] nfs_write_begin at c008000004fff06c [nfs]
#6 [c000000177053b10] generic_perform_write at c00000000042b044
#7 [c000000177053bc0] nfs_file_write at c008000004ffda08 [nfs]
#8 [c000000177053c60] new_sync_write at c00000000057fdd8
#9 [c000000177053d10] vfs_write at c000000000582fd4
#10 [c000000177053d60] ksys_write at c0000000005833a4
#11 [c000000177053db0] system_call_exception at c00000000002f434
#12 [c000000177053e10] system_call_vectored_common at c00000000000bfe8

crash> folio.mapping c00c000000564400
mapping = 0xc00000006c1b0770,

It's just that if we go back to the nfs_page being released by our panic
task, the folio->mapping has been cleared, so we panic when we try to go
folio->mapping->host.
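
To spell out where the dereference happens: the helper is approximately shaped like the following (paraphrased from fs/nfs/internal.h, not verbatim):

```c
/* Approximate shape of the accessor: once folio->mapping has been
 * set to NULL by eviction, the ->host dereference oopses. */
static inline struct inode *nfs_page_to_inode(const struct nfs_page *req)
{
	struct folio *folio = nfs_page_to_folio(req);

	return folio_file_mapping(folio)->host;	/* mapping == NULL -> oops */
}
```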

crash> nfs_page c00000016fb2a600
struct nfs_page {
wb_list = {
next = 0xc00000016fb2a600,
prev = 0xc00000016fb2a600
},
{
wb_page = 0xc00c000001d49580,
wb_folio = 0xc00c000001d49580
},
wb_lock_context = 0xc00000010518b2c0,
wb_index = 0x1,
wb_offset = 0x6940,
wb_pgbase = 0x6940,
wb_bytes = 0x40,
wb_kref = {
refcount = {
refs = {
counter = 0x1
}
}
},
wb_flags = 0x5,
wb_verf = {
data = "\214\205_d\214\210W\036"
},
wb_this_page = 0xc00000016fb2a600,
wb_head = 0xc00000016fb2a600,
wb_nio = 0x0
}
crash> folio.mapping 0xc00c000001d49580
mapping = 0x0,

-Scott

> --
> Jeff Layton <[email protected]>
>


2023-10-02 21:45:11

by Jeffrey Layton

Subject: Re: [PATCH 1/1] NFS: Fix potential oops in nfs_inode_remove_request()

On Tue, 2023-07-25 at 11:08 -0400, Scott Mayhew wrote:
> Once a folio's private data has been cleared, it's possible for another
> process to clear the folio->mapping (e.g. via invalidate_complete_folio2
> or evict_mapping_folio), so it wouldn't be safe to call
> nfs_page_to_inode() after that.
>
> Fixes: 0c493b5cf16e ("NFS: Convert buffered writes to use folios")
> Signed-off-by: Scott Mayhew <[email protected]>
> ---
> fs/nfs/write.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index f4cca8f00c0c..489c3f9dae23 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -785,6 +785,8 @@ static void nfs_inode_add_request(struct nfs_page *req)
> */
> static void nfs_inode_remove_request(struct nfs_page *req)
> {
> + struct nfs_inode *nfsi = NFS_I(nfs_page_to_inode(req));
> +
> if (nfs_page_group_sync_on_bit(req, PG_REMOVE)) {
> struct folio *folio = nfs_page_to_folio(req->wb_head);
> struct address_space *mapping = folio_file_mapping(folio);
> @@ -800,7 +802,7 @@ static void nfs_inode_remove_request(struct nfs_page *req)
>
> if (test_and_clear_bit(PG_INODE_REF, &req->wb_flags)) {
> nfs_release_request(req);
> - atomic_long_dec(&NFS_I(nfs_page_to_inode(req))->nrequests);
> + atomic_long_dec(&nfsi->nrequests);
> }
> }
>

Reviewed-by: Jeff Layton <[email protected]>

2023-10-02 22:50:48

by Benjamin Coddington

Subject: Re: [PATCH 1/1] NFS: Fix potential oops in nfs_inode_remove_request()

On 25 Jul 2023, at 11:08, Scott Mayhew wrote:

> Once a folio's private data has been cleared, it's possible for another
> process to clear the folio->mapping (e.g. via invalidate_complete_folio2
> or evict_mapping_folio), so it wouldn't be safe to call
> nfs_page_to_inode() after that.
>
> Fixes: 0c493b5cf16e ("NFS: Convert buffered writes to use folios")
> Signed-off-by: Scott Mayhew <[email protected]>

Reviewed-by: Benjamin Coddington <[email protected]>
Tested-by: Benjamin Coddington <[email protected]>

Ben