2020-09-12 06:20:47

by Amir Goldstein

Subject: More filesystem need this fix (xfs: use MMAPLOCK around filemap_map_pages())

On Tue, Jun 23, 2020 at 8:21 AM Dave Chinner <[email protected]> wrote:
>
> From: Dave Chinner <[email protected]>
>
> The page faultround path ->map_pages is implemented in XFS via
> filemap_map_pages(). This function checks that pages found in page
> cache lookups have not raced with truncate based invalidation by
> checking page->mapping is correct and page->index is within EOF.
>
> However, we've known for a long time that this is not sufficient to
> protect against races with invalidations done by operations that do
> not change EOF. e.g. hole punching and other fallocate() based
> direct extent manipulations. The way we protect against these
> races is we wrap the page fault operations in a XFS_MMAPLOCK_SHARED
> lock so they serialise against fallocate and truncate before calling
> into the filemap function that processes the fault.
>
> Do the same for XFS's ->map_pages implementation to close this
> potential data corruption issue.
>
> Signed-off-by: Dave Chinner <[email protected]>
> ---
> fs/xfs/xfs_file.c | 15 ++++++++++++++-
> 1 file changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 7b05f8fd7b3d..4b185a907432 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1266,10 +1266,23 @@ xfs_filemap_pfn_mkwrite(
>  	return __xfs_filemap_fault(vmf, PE_SIZE_PTE, true);
>  }
>
> +static void
> +xfs_filemap_map_pages(
> +	struct vm_fault		*vmf,
> +	pgoff_t			start_pgoff,
> +	pgoff_t			end_pgoff)
> +{
> +	struct inode		*inode = file_inode(vmf->vma->vm_file);
> +
> +	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> +	filemap_map_pages(vmf, start_pgoff, end_pgoff);
> +	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> +}
> +
>  static const struct vm_operations_struct xfs_file_vm_ops = {
>  	.fault		= xfs_filemap_fault,
>  	.huge_fault	= xfs_filemap_huge_fault,
> -	.map_pages	= filemap_map_pages,
> +	.map_pages	= xfs_filemap_map_pages,
>  	.page_mkwrite	= xfs_filemap_page_mkwrite,
>  	.pfn_mkwrite	= xfs_filemap_pfn_mkwrite,
>  };
> --
> 2.26.2.761.g0e0b3e54be
>

It appears that ext4, f2fs, gfs2, orangefs and zonefs also need this fix.

zonefs does not support hole punching, so it may not need to use
mmap_sem at all.

It is interesting to look at how this bug came to be duplicated in so
many filesystems, because there are lessons to be learned.

Commit f1820361f83d ("mm: implement ->map_pages for page cache")
added the ->map_pages() operation, and its commit message said:

"...It should be safe to use filemap_map_pages() for ->map_pages() if
filesystem use filemap_fault() for ->fault()."

At the time, all of the aforementioned filesystems used filemap_fault()
for ->fault().

But since then, ext4, xfs, f2fs and just recently gfs2 have added a filesystem
->fault() operation.

orangefs has since added its own vm_operations, and zonefs was added
later, probably copying the mmap_sem handling from ext4. Both have a
filesystem ->fault() operation.

It was surprising for me to see that some of the filesystem developers
who signed off on the added ->fault() operations are not strangers to mm.
The recent gfs2 change was even reviewed by an established mm developer
[1].

So what can we learn from this case study? How could we fix the interface to
avoid repeating the same mistake in the future?

Thanks,
Amir.

[1] https://lore.kernel.org/linux-fsdevel/[email protected]/


2020-09-14 11:41:59

by Jan Kara

Subject: Re: More filesystem need this fix (xfs: use MMAPLOCK around filemap_map_pages())

On Sat 12-09-20 09:19:11, Amir Goldstein wrote:
> On Tue, Jun 23, 2020 at 8:21 AM Dave Chinner <[email protected]> wrote:
> >
> > <snip>
> >
>
> It appears that ext4, f2fs, gfs2, orangefs, zonefs also need this fix
>
> zonefs does not support hole punching, so it may not need to use
> mmap_sem at all.
>
> It is interesting to look at how this bug came to be duplicated in so
> many filesystems, because there are lessons to be learned.
>
> Commit f1820361f83d ("mm: implement ->map_pages for page cache")
> added the ->map_pages() operation, and its commit message said:
>
> "...It should be safe to use filemap_map_pages() for ->map_pages() if
> filesystem use filemap_fault() for ->fault()."
>
> At the time, all of the aforementioned filesystems used filemap_fault()
> for ->fault().
>
> But since then, ext4, xfs, f2fs and just recently gfs2 have added a
> filesystem ->fault() operation.
>
> orangefs has since added its own vm_operations, and zonefs was added
> later, probably copying the mmap_sem handling from ext4. Both have a
> filesystem ->fault() operation.

A standard pattern of copying a bug from one place into many. Sadly it
happens all the time for stuff that's complex enough that only a few
people (if anybody) carry all the details in their heads.

> It was surprising for me to see that some of the filesystem developers
> who signed off on the added ->fault() operations are not strangers to mm.
> The recent gfs2 change was even reviewed by an established mm developer
> [1].

Well, people do miss things... And this stuff is a twisted maze, so it is
easy to miss something even for an experienced developer.

> So what can we learn from this case study? How could we fix the interface to
> avoid repeating the same mistake in the future?

IMO the serialization between the page cache and various fs operations is
just too complex, with too many special corner cases. But that's difficult
to change while keeping all the features and performance. So the best
realistic answer I have (and this is not meant to discourage anybody from
trying to implement a simpler scheme of page-cache - filesystem interaction
:) is that we should have added an fstests test case when the XFS fix
landed, which would then hopefully have caught the attention of other fs
maintainers (at least those that do run fstests).

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2020-09-14 17:25:01

by Andreas Gruenbacher

Subject: Re: More filesystem need this fix (xfs: use MMAPLOCK around filemap_map_pages())

Could the xfs mmap lock documentation please be cleaned up? For
example, the xfs_ilock description says:

> * In addition to i_rwsem in the VFS inode, the xfs inode contains 2
> * multi-reader locks: i_mmap_lock and the i_lock. This routine allows
> * various combinations of the locks to be obtained.

The field in struct xfs_inode is called i_mmaplock though, not
i_mmap_lock. In addition, struct address_space has an i_mmap_rwsem field
which is also referred to as i_mmap_lock. As if that weren't confusing
enough.

Thanks,
Andreas

2020-09-16 21:05:54

by Jan Kara

Subject: Re: More filesystem need this fix (xfs: use MMAPLOCK around filemap_map_pages())

On Sat 12-09-20 09:19:11, Amir Goldstein wrote:
> On Tue, Jun 23, 2020 at 8:21 AM Dave Chinner <[email protected]> wrote:
> >
> > <snip>
> >
>
> It appears that ext4, f2fs, gfs2, orangefs, zonefs also need this fix
>
> zonefs does not support hole punching, so it may not need to use
> mmap_sem at all.

So I've written an ext4 fix for this, but before actually posting it,
Nikolay, who was working on the btrfs fix, asked why exactly
filemap_map_pages() is a problem, and I think he's right that it actually
isn't. The thing is: filemap_map_pages() never does any page mapping or
IO. It only creates PTEs for uptodate pages that are already in the page
cache. As such it is a rather different beast compared to the fault
handler from the fs POV, and does not need protection from hole punching
(the current serialization on the page lock and checking of page->mapping
is enough).

That being said, I agree this is subtle, and the moment someone adds e.g.
a readahead call into filemap_map_pages() we have a real problem. I'm not
sure how to prevent this risk...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2020-09-17 02:04:05

by Dave Chinner

Subject: Re: More filesystem need this fix (xfs: use MMAPLOCK around filemap_map_pages())

On Wed, Sep 16, 2020 at 05:58:51PM +0200, Jan Kara wrote:
> On Sat 12-09-20 09:19:11, Amir Goldstein wrote:
> > On Tue, Jun 23, 2020 at 8:21 AM Dave Chinner <[email protected]> wrote:
> > >
> > > <snip>
> > >
> >
> > It appears that ext4, f2fs, gfs2, orangefs, zonefs also need this fix
> >
> > zonefs does not support hole punching, so it may not need to use
> > mmap_sem at all.
>
> So I've written an ext4 fix for this but before actually posting it Nikolay
> working on btrfs fix asked why exactly is filemap_map_pages() actually a
> problem and I think he's right it actually isn't a problem. The thing is:
> filemap_map_pages() never does any page mapping or IO. It only creates PTEs
> for uptodate pages that are already in page cache.

So....

P0                                      p1

hole punch starts
  takes XFS_MMAPLOCK_EXCL
  truncate_pagecache_range()
    unmap_mapping_range(start, end)
      <clears ptes>
                                        <read fault>
                                        do_fault_around()
                                          ->map_pages
                                          filemap_map_pages()
                                            page mapping valid,
                                            page is up to date
                                            maps PTEs
                                        <fault done>
    truncate_inode_pages_range()
      truncate_cleanup_page(page)
        invalidates page
      delete_from_page_cache_batch(page)
        frees page
                                        <pte now points to a freed page>

That doesn't seem good to me.

Sure, maybe the page hasn't been freed back to the free lists
because of elevated refcounts. But it's been released by the
filesystem and is no longer in the page cache, so nothing good can
come of this situation...

AFAICT, this race condition exists for the truncate case as well,
as filemap_map_pages() doesn't have a "page beyond inode size" check
in it. However, exposure here is very limited in the truncate case
because truncate_setsize()->truncate_pagecache() zaps the PTEs
again after invalidating the page cache.

Either way, adding the XFS_MMAPLOCK_SHARED around
filemap_map_pages() avoids the race condition for fallocate and
truncate operations for XFS...

> As such it is a rather
> different beast compared to the fault handler from fs POV and does not need
> protection from hole punching (current serialization on page lock and
> checking of page->mapping is enough).
> That being said I agree this is subtle and the moment someone adds e.g. a
> readahead call into filemap_map_pages() we have a real problem. I'm not
> sure how to prevent this risk...

Subtle, yes. So subtle, in fact, I fail to see any reason why the
above race cannot occur, as there's no obvious serialisation in the
page cache between PTE zapping and page invalidation to prevent a
fault from coming in and re-establishing the PTEs before invalidation
occurs.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2020-09-17 02:14:56

by Hugh Dickins

Subject: Re: More filesystem need this fix (xfs: use MMAPLOCK around filemap_map_pages())

On Thu, 17 Sep 2020, Dave Chinner wrote:
>
> So....
>
> P0                                      p1
>
> hole punch starts
>   takes XFS_MMAPLOCK_EXCL
>   truncate_pagecache_range()
>     unmap_mapping_range(start, end)
>       <clears ptes>
>                                         <read fault>
>                                         do_fault_around()
>                                           ->map_pages
>                                           filemap_map_pages()
>                                             page mapping valid,
>                                             page is up to date
>                                             maps PTEs
>                                         <fault done>
>     truncate_inode_pages_range()
>       truncate_cleanup_page(page)
>         invalidates page
>       delete_from_page_cache_batch(page)
>         frees page
>                                         <pte now points to a freed page>

No. filemap_map_pages() checks page->mapping after trylock_page(),
before setting up the pte; and truncate_cleanup_page() does a one-page
unmap_mapping_range() if page_mapped(), while holding page lock.

(Of course, there's a different thread, in which less reliance on
page lock is being discussed, but that would be a future thing.)

Hugh

2020-09-17 03:09:46

by Matthew Wilcox

Subject: Re: More filesystem need this fix (xfs: use MMAPLOCK around filemap_map_pages())

On Thu, Sep 17, 2020 at 11:44:54AM +1000, Dave Chinner wrote:
> So....
>
> P0                                      p1
>
> hole punch starts
>   takes XFS_MMAPLOCK_EXCL
>   truncate_pagecache_range()

... locks page ...

>     unmap_mapping_range(start, end)
>       <clears ptes>
>                                         <read fault>
>                                         do_fault_around()
>                                           ->map_pages
>                                           filemap_map_pages()

... trylock page fails ...

>                                             page mapping valid,
>                                             page is up to date
>                                             maps PTEs
>                                         <fault done>
>     truncate_inode_pages_range()
>       truncate_cleanup_page(page)
>         invalidates page
>       delete_from_page_cache_batch(page)
>         frees page
>                                         <pte now points to a freed page>

2020-09-17 05:55:06

by Nikolay Borisov

Subject: Re: More filesystem need this fix (xfs: use MMAPLOCK around filemap_map_pages())



On 17.09.20 г. 4:44 ч., Dave Chinner wrote:
> On Wed, Sep 16, 2020 at 05:58:51PM +0200, Jan Kara wrote:
>> On Sat 12-09-20 09:19:11, Amir Goldstein wrote:
>>> On Tue, Jun 23, 2020 at 8:21 AM Dave Chinner <[email protected]> wrote:

<snip>

>
> <snip>
>
> AFAICT, this race condition exists for the truncate case as well
> as filemap_map_pages() doesn't have a "page beyond inode size" check
> in it.

(It's not relevant to the discussion at hand but for the sake of
completeness):

It does have a check:

max_idx = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE);
if (page->index >= max_idx)
	goto unlock;


<snip>

2020-09-17 06:47:18

by Dave Chinner

Subject: Re: More filesystem need this fix (xfs: use MMAPLOCK around filemap_map_pages())

On Wed, Sep 16, 2020 at 07:04:46PM -0700, Hugh Dickins wrote:
> On Thu, 17 Sep 2020, Dave Chinner wrote:
> >
> > <snip>
>
> No. filemap_map_pages() checks page->mapping after trylock_page(),
> before setting up the pte; and truncate_cleanup_page() does a one-page
> unmap_mapping_range() if page_mapped(), while holding page lock.

Ok, fair, I missed that.

So why does truncate_pagecache() talk about fault races and require
a second unmap range after the invalidation "for correctness" if
this sort of race cannot happen?

Why is that different to truncate_pagecache_range(), which -doesn't-
do that second removal? It's called for more than just hole_punch -
from the filesystem's perspective holepunch should do exactly the
same as truncate to the page cache, and for things like
COLLAPSE_RANGE it is absolutely essential because the data in that
range is -not zero- and will be stale if the mappings are not
invalidated completely....

Also, if page->mapping == NULL is sufficient to detect an invalidated
page in all cases, then why does page_cache_delete() explicitly
leave page->index intact:

page->mapping = NULL;
/* Leave page->index set: truncation lookup relies upon it */


Cheers,

Dave.
--
Dave Chinner
[email protected]

2020-09-17 07:48:54

by Hugh Dickins

Subject: Re: More filesystem need this fix (xfs: use MMAPLOCK around filemap_map_pages())

On Thu, 17 Sep 2020, Dave Chinner wrote:
> On Wed, Sep 16, 2020 at 07:04:46PM -0700, Hugh Dickins wrote:
> > On Thu, 17 Sep 2020, Dave Chinner wrote:
> > > <pte now points to a freed page>
> >
> > No. filemap_map_pages() checks page->mapping after trylock_page(),
> > before setting up the pte; and truncate_cleanup_page() does a one-page
> > unmap_mapping_range() if page_mapped(), while holding page lock.
>
> Ok, fair, I missed that.
>
> So why does truncate_pagecache() talk about fault races and require
> a second unmap range after the invalidation "for correctness" if
> this sort of race cannot happen?

I thought the comment
* unmap_mapping_range is called twice, first simply for
* efficiency so that truncate_inode_pages does fewer
* single-page unmaps. However after this first call, and
* before truncate_inode_pages finishes, it is possible for
* private pages to be COWed, which remain after
* truncate_inode_pages finishes, hence the second
* unmap_mapping_range call must be made for correctness.
explains it fairly well. It's because POSIX demanded that when a file
is truncated, the user will get SIGBUS on trying to access even the
COWed pages beyond EOF in a MAP_PRIVATE mapping. Page lock on the
cache page does not serialize the pages COWed from it very well.

But there's no such SIGBUS requirement in the case of hole-punching,
and trying to unmap those pages racily instantiated just after the
punching cursor passed, would probably do more harm than good.

>
> Why is that different to truncate_pagecache_range(), which -doesn't-
> do that second removal? It's called for more than just hole_punch -
> from the filesystem's perspective holepunch should do exactly the
> same as truncate to the page cache, and for things like
> COLLAPSE_RANGE it is absolutely essential because the data in that
> range is -not zero- and will be stale if the mappings are not
> invalidated completely....

I can't speak to COLLAPSE_RANGE.

>
> Also, if page->mapping == NULL is sufficient to detect an invalidated
> page in all cases, then why does page_cache_delete() explicitly
> leave page->index intact:
>
> page->mapping = NULL;
> /* Leave page->index set: truncation lookup relies upon it */

Because there was, and I think still is (but might it now be xarrayed
away?), code (mainly in mm/truncate.c) which finds it convenient to
check page->index for end of range, without necessitating the overhead
of getting the page lock. I've no doubt it's a (minor) optimization that
could be discarded if there were ever a need to invalidate page->index
when deleting; but nobody has required that yet.

Hugh

2020-09-17 07:57:08

by Jan Kara

Subject: Re: More filesystem need this fix (xfs: use MMAPLOCK around filemap_map_pages())

On Thu 17-09-20 08:37:17, Nikolay Borisov wrote:
> On 17.09.20 г. 4:44 ч., Dave Chinner wrote:
> > On Wed, Sep 16, 2020 at 05:58:51PM +0200, Jan Kara wrote:
> >> On Sat 12-09-20 09:19:11, Amir Goldstein wrote:
> >>> On Tue, Jun 23, 2020 at 8:21 AM Dave Chinner <[email protected]> wrote:
>
> <snip>
>
> >
> > <snip>
> >
> > AFAICT, this race condition exists for the truncate case as well
> > as filemap_map_pages() doesn't have a "page beyond inode size" check
> > in it.
>
> (It's not relevant to the discussion at hand but for the sake of
> completeness):
>
> It does have a check:
>
> max_idx = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE);
> if (page->index >= max_idx)
> goto unlock;

Yes, but this does something meaningful only for truncate. For other
operations such as hole punch this check doesn't bring anything. That's
why only filesystems supporting hole punching and similar advanced
operations need an equivalent of XFS's MMAPLOCK.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2020-09-21 08:28:11

by Dave Chinner

Subject: Re: More filesystem need this fix (xfs: use MMAPLOCK around filemap_map_pages())

On Thu, Sep 17, 2020 at 12:47:10AM -0700, Hugh Dickins wrote:
> On Thu, 17 Sep 2020, Dave Chinner wrote:
> > On Wed, Sep 16, 2020 at 07:04:46PM -0700, Hugh Dickins wrote:
> > > On Thu, 17 Sep 2020, Dave Chinner wrote:
> > > > <pte now points to a freed page>
> > >
> > > No. filemap_map_pages() checks page->mapping after trylock_page(),
> > > before setting up the pte; and truncate_cleanup_page() does a one-page
> > > unmap_mapping_range() if page_mapped(), while holding page lock.
> >
> > Ok, fair, I missed that.
> >
> > So why does truncate_pagecache() talk about fault races and require
> > a second unmap range after the invalidation "for correctness" if
> > this sort of race cannot happen?
>
> I thought the comment
> * unmap_mapping_range is called twice, first simply for
> * efficiency so that truncate_inode_pages does fewer
> * single-page unmaps. However after this first call, and
> * before truncate_inode_pages finishes, it is possible for
> * private pages to be COWed, which remain after
> * truncate_inode_pages finishes, hence the second
> * unmap_mapping_range call must be made for correctness.
> explains it fairly well.

Not to me. It explains what the code is doing, and the why is simply
"correctness".

I have no idea what "correctness" actually means in this context
because there is no reference to what correct behaviour should be.
Nor do I have any idea why COW faults might behave differently to a
normal read/write page fault...

> It's because POSIX demanded that when a file
> is truncated, the user will get SIGBUS on trying to access even the
> COWed pages beyond EOF in a MAP_PRIVATE mapping. Page lock on the
> cache page does not serialize the pages COWed from it very well.

And there's the "why". I don't find the "page lock doesn't
serialise COW faults very well" particularly reassuring in this
case....

> But there's no such SIGBUS requirement in the case of hole-punching,
> and trying to unmap those pages racily instantiated just after the
> punching cursor passed, would probably do more harm than good.

There isn't a SIGBUS requirement for fallocate operations, just a
"don't expose stale data to userspace" requirement.

FWIW, how does a COW fault even work with file backed pages? We can
only have a single page attached to the inode address space for a given
offset, so if there's been a COW fault and a new page faulted in for
the write fault in that VMA, doesn't that imply the user data then
written to that page is never going to be written back to storage
because the COW page is not tracked by the inode address space?

> > Why is that different to truncate_pagecache_range(), which -doesn't-
> > do that second removal? It's called for more than just hole_punch -
> > from the filesystem's perspective holepunch should do exactly the
> > same as truncate to the page cache, and for things like
> > COLLAPSE_RANGE it is absolutely essential because the data in that
> > range is -not zero- and will be stale if the mappings are not
> > invalidated completely....
>
> I can't speak to COLLAPSE_RANGE.

It moves data around, it doesn't replace data with zeros. Hence the
contents of any page that isn't invalidated entirely by
truncate_pagecache_range() are now entirely incorrect...

> > Also, if page->mapping == NULL is sufficient to detect an invalidated
> > page in all cases, then why does page_cache_delete() explicitly
> > leave page->index intact:
> >
> > page->mapping = NULL;
> > /* Leave page->index set: truncation lookup relies upon it */
>
> Because there was, and I think still is (but might it now be xarrayed
> away?), code (mainly in mm/truncate.c) which finds it convenient to
> check page->index for end of range, without necessitating the overhead
> of getting the page lock. I've no doubt it's a (minor) optimization that
> could be discarded if there were ever a need to invalidate page->index
> when deleting; but nobody has required that yet.

And that's exactly my concern w.r.t. fallocate based invalidation:
checking the page is beyond EOF without locking the page or checking
the mapping does not detect pages invalidated by hole punching and
other fallocate() operations because page->index on the invalidated
pages is never beyond EOF....
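
A minimal userspace sketch makes the asymmetry visible (file path and
sizes here are arbitrary): punching a hole with FALLOC_FL_KEEP_SIZE
leaves st_size untouched, so no page in the punched range ever ends up
"beyond EOF":

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Punch a hole in the middle of a 16KiB file and return the file size
 * afterwards. With FALLOC_FL_KEEP_SIZE the size is unchanged, so no
 * page->index in the punched range is beyond EOF -- an EOF check alone
 * cannot see the invalidation. Returns -1 if the fs lacks support.
 */
static long size_after_punch(const char *path)
{
	int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	char buf[4096];
	struct stat st;

	memset(buf, 'x', sizeof(buf));
	for (int i = 0; i < 4; i++)
		if (write(fd, buf, sizeof(buf)) != sizeof(buf))
			return -1;

	/* Invalidate the second page without moving EOF. */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      4096, 4096) < 0) {
		close(fd);
		unlink(path);
		return -1;
	}

	fstat(fd, &st);
	close(fd);
	unlink(path);
	return (long)st.st_size;
}
```

Truncate shrinks st_size, so truncated pages fail the EOF check; punched
pages never do.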

Cheers,

Dave.
--
Dave Chinner
[email protected]

2020-09-21 09:15:37

by Jan Kara

Subject: Re: More filesystem need this fix (xfs: use MMAPLOCK around filemap_map_pages())

On Mon 21-09-20 18:26:00, Dave Chinner wrote:
> On Thu, Sep 17, 2020 at 12:47:10AM -0700, Hugh Dickins wrote:
> > It's because POSIX demanded that when a file
> > is truncated, the user will get SIGBUS on trying to access even the
> > COWed pages beyond EOF in a MAP_PRIVATE mapping. Page lock on the
> > cache page does not serialize the pages COWed from it very well.
>
> And there's the "why". I don't find the "page lock doesn't
> serialise COW faults very well" particularly reassuring in this
> case....
>
> > But there's no such SIGBUS requirement in the case of hole-punching,
> > and trying to unmap those pages racily instantiated just after the
> > punching cursor passed, would probably do more harm than good.
>
> There isn't a SIGBUS requirement for fallocate operations, just a
> "don't expose stale data to userspace" requirement.
>
> FWIW, how does a COW fault even work with file backed pages? We can
> only have a single page attached to the inode address space for a given
> offset, so if there's been a COW fault and a new page faulted in for
> the write fault in that VMA, doesn't that imply the user data then
> written to that page is never going to be written back to storage
> because the COW page is not tracked by the inode address space?

Correct. Private file mappings work so that on the first write fault at some
page offset we allocate an anonymous page for that offset, copy the current
contents of the corresponding file page into it, and from that moment on it
behaves as an anonymous page. Except that on truncate, we have to unmap these
anonymous pages in private file mappings as well...
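
That behaviour is easy to observe from userspace (a minimal sketch; the
path is arbitrary):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Store through a MAP_PRIVATE mapping: the write fault gives this
 * mapping a private anonymous copy of the file page, so the store is
 * never written back to the file. Returns the byte the file itself
 * holds after the store ('A' if the COW copy hid the write).
 */
static char file_byte_after_private_store(const char *path)
{
	int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	char c = 0;
	char *map;

	(void)!write(fd, "A", 1);

	map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
	map[0] = 'B';		/* COW fault: now backed by an anon page */
	munmap(map, 4096);

	(void)!pread(fd, &c, 1, 0);	/* the file still holds 'A' */
	close(fd);
	unlink(path);
	return c;
}
```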

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2020-09-21 16:31:31

by Linus Torvalds

Subject: Re: More filesystem need this fix (xfs: use MMAPLOCK around filemap_map_pages())

On Mon, Sep 21, 2020 at 2:11 AM Jan Kara <[email protected]> wrote:
>
> Except that on truncate, we have to unmap these
> anonymous pages in private file mappings as well...

I'm actually not 100% sure we strictly would need to care.

Once we've faulted in a private file mapping page, that page is
"ours". That's kind of what MAP_PRIVATE means.

If we haven't written to it, we do keep things coherent with the file,
but that's actually not required by POSIX afaik - it's a QoI issue,
and a lot of (bad) Unixes didn't do it at all.

So as long as truncate _clears_ the pages it truncates, I think we'd
actually be ok.

The SIGBUS is supposed to happen, but that's really only relevant for
the _first_ access. Once we've accessed the page, and have it mapped,
the private part really means that there are no guarantees it stays
coherent.

In particular, obviously if we've written to a page, we've lost the
association with the original file entirely. And I'm pretty sure that
a private mapping is allowed to act as if it was a private copy
without that association in the first place.

That said, this _is_ a QoI thing, and in Linux we've generally tried
quite hard to stay as coherent as possible even with private mappings.

In fact, before we had real shared file mappings (in a distant past,
long long ago), we allowed read-only shared mappings because we
internally turned them into read-only private mappings and our private
mappings were coherent.

And those "fake" read-only shared mappings actually were much better
than some other platforms "real" shared mappings (*cough*hpux*cough*)
and actually worked with things that mixed "write()" and "mmap()" and
expected coherency.

Admittedly the only case I'm aware of that did that was nntpd or
something like that. Exactly because a lot of Unixes were *not*
coherent (either because they had actual hardware cache coherency
issues, or because their VM was not set up for it).

Linus

2020-09-21 18:03:43

by Matthew Wilcox

Subject: Re: More filesystem need this fix (xfs: use MMAPLOCK around filemap_map_pages())

On Mon, Sep 21, 2020 at 09:20:25AM -0700, Linus Torvalds wrote:
> On Mon, Sep 21, 2020 at 2:11 AM Jan Kara <[email protected]> wrote:
> >
> > Except that on truncate, we have to unmap these
> > anonymous pages in private file mappings as well...
>
> I'm actually not 100% sure we strictly would need to care.
>
> Once we've faulted in a private file mapping page, that page is
> "ours". That's kind of what MAP_PRIVATE means.
>
> If we haven't written to it, we do keep things coherent with the file,
> but that's actually not required by POSIX afaik - it's a QoI issue,
> and a lot of (bad) Unixes didn't do it at all.
> So as long as truncate _clears_ the pages it truncates, I think we'd
> actually be ok.

We don't even need to do that ...

"If the size of the mapped file changes after the call to mmap()
as a result of some other operation on the mapped file, the effect of
references to portions of the mapped region that correspond to added or
removed portions of the file is unspecified."

https://pubs.opengroup.org/onlinepubs/9699919799/functions/mmap.html

As you say, there's a QoI here, and POSIX permits some shockingly
bad and useless implementations.

2020-09-22 08:35:23

by Jan Kara

Subject: Re: More filesystem need this fix (xfs: use MMAPLOCK around filemap_map_pages())

On Mon 21-09-20 18:59:43, Matthew Wilcox wrote:
> On Mon, Sep 21, 2020 at 09:20:25AM -0700, Linus Torvalds wrote:
> > On Mon, Sep 21, 2020 at 2:11 AM Jan Kara <[email protected]> wrote:
> > >
> > > Except that on truncate, we have to unmap these
> > > anonymous pages in private file mappings as well...
> >
> > I'm actually not 100% sure we strictly would need to care.
> >
> > Once we've faulted in a private file mapping page, that page is
> > "ours". That's kind of what MAP_PRIVATE means.
> >
> > If we haven't written to it, we do keep things coherent with the file,
> > but that's actually not required by POSIX afaik - it's a QoI issue,
> > and a lot of (bad) Unixes didn't do it at all.
> > So as long as truncate _clears_ the pages it truncates, I think we'd
> > actually be ok.
>
> We don't even need to do that ...
>
> "If the size of the mapped file changes after the call to mmap()
> as a result of some other operation on the mapped file, the effect of
> references to portions of the mapped region that correspond to added or
> removed portions of the file is unspecified."
>
> https://pubs.opengroup.org/onlinepubs/9699919799/functions/mmap.html
>
> As you say, there's a QoI here, and POSIX permits some shockingly
> bad and useless implementations.

Something from ftruncate(2) POSIX definition [1] for comparison:

If the effect of ftruncate() is to decrease the size of a memory mapped
file or a shared memory object and whole pages beyond the new end were
previously mapped, then the whole pages beyond the new end shall be
discarded.

References to discarded pages shall result in the generation of a SIGBUS
signal.

[1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/ftruncate.html

Now pick... ;)
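
The ftruncate() half of that bargain can be checked with a few lines of
userspace (a sketch; the path is arbitrary):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf env;

static void bus_handler(int sig)
{
	(void)sig;
	siglongjmp(env, 1);
}

/*
 * Map one page of a file, truncate the file to zero, then touch the
 * mapped page. POSIX requires the discarded page to raise SIGBUS.
 * Returns 1 if SIGBUS was delivered.
 */
static int sigbus_after_truncate(const char *path)
{
	int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	struct sigaction sa;
	char *map;
	int got_sigbus = 0;

	(void)!ftruncate(fd, 4096);
	map = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
	(void)!ftruncate(fd, 0);	/* discard the whole mapped page */

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = bus_handler;
	sigaction(SIGBUS, &sa, NULL);

	if (sigsetjmp(env, 1) == 0)
		(void)*(volatile char *)map;	/* faults: beyond new EOF */
	else
		got_sigbus = 1;

	munmap(map, 4096);
	close(fd);
	unlink(path);
	return got_sigbus;
}
```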

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR