2024-03-08 07:53:05

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [RFC PATCH] mm: Replace ->launder_folio() with flush and wait

On Thu, 7 Mar 2024 at 11:36, David Howells <[email protected]> wrote:

> (2) invalidate_inode_pages2() is used in some places to effect invalidation
> of the pagecache in the case where the server tells us that a third party
> modified the server copy of a file. What the right behaviour should be
> here, I'm not sure, but at the moment, any dirty data will get laundered
> back to the server. Possibly it should be simply invalidated locally or
> the user asked how they want to handle the divergence.

Skipping ->launder_page will mean there's a window where the data
*will* be lost, AFAICS.

Of course concurrent cached writes on different hosts against the same
region (the size of which depends on how the caching is done) will
conflict.

But if concurrent writes are to different regions, then they shouldn't
be lost, no? Without the current ->launder_page thing I don't see how
that could be guaranteed.

Thanks,
Miklos


2024-03-19 14:15:04

by David Howells

[permalink] [raw]
Subject: Re: [RFC PATCH] mm: Replace ->launder_folio() with flush and wait

Miklos Szeredi <[email protected]> wrote:

> > (2) invalidate_inode_pages2() is used in some places to effect
> > invalidation of the pagecache in the case where the server tells us
> > that a third party modified the server copy of a file. What the
> > right behaviour should be here, I'm not sure, but at the moment, any
> > dirty data will get laundered back to the server. Possibly it should
> > be simply invalidated locally or the user asked how they want to
> > handle the divergence.
>
> Skipping ->launder_page will mean there's a window where the data
> *will* be lost, AFAICS.
>
> Of course concurrent cached writes on different hosts against the same
> region (the size of which depends on how the caching is done) will
> conflict.

Indeed. Depending on when you're using invalidate_inode_pages2() and co. and
what circumstances you're using it for, you *are* going to suffer data loss.

For instance, if you have dirty data on the local host and get an invalidation
notification from the server: if you write just your dirty data back, you may
corrupt the file on the server, losing the third party changes; if you write
back your entire copy of the file, you might avoid corrupting the file, but
completely obliterate the third party changes; if you discard your changes,
you lose those instead, but save the third party changes.

I'm working towards supporting disconnected operation where I'll need to add
some sort of user interaction mechanism that will allow the user to say how
they want to handle this.

> But if concurrent writes are to different regions, then they shouldn't
> be lost, no? Without the current ->launder_page thing I don't see how
> that could be guaranteed.

Define "different regions". If they're not on the same folios, then why would
they be lost by simply flushing the data before doing the invalidation? If
they are on different parts of the same folio, all the above still apply when
you flush the whole folio.

Now, you can mitigate the latter case by keeping track of which bytes changed,
but that still allows you to corrupt the file by writing back just your
particular changes.

And then there's the joker in the deck: mmap. The main advantage of
invalidate_inode_pages2() is that it forcibly unmaps the page before
laundering it. However, this doesn't prevent you then corrupting the upstream
copy by writing the changes back.

What particular usage case of invalidate_inode_pages2() are you thinking of?

DIO read/write can only be best effort: flush, invalidate then do the DIO
which may bring the buffers back in because they're mmapped. In which case
doing a flush and a non-laundering invalidate that leaves dirty pages in place
ought to be fine.

David


2024-03-19 16:22:35

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [RFC PATCH] mm: Replace ->launder_folio() with flush and wait

On Tue, 19 Mar 2024 at 15:15, David Howells <[email protected]> wrote:

> What particular usage case of invalidate_inode_pages2() are you thinking of?

FUSE_NOTIFY_INVAL_INODE will trigger invalidate_inode_pages2_range()
to clean up the cache.

The server is free to discard writes resulting from this invalidation
and delay reads in the region until the invalidation finishes. This
would no longer work with your change, since the mapping could
silently be reinstated between the writeback and the removal from the
cache due to the page being unlocked/relocked.

I'm not saying such a filesystem actually exists, but it's a
theoretical possibility. And maybe there are cases which I haven't
thought of.

Thanks,
Miklos

2024-03-19 16:41:19

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [RFC PATCH] mm: Replace ->launder_folio() with flush and wait

On Tue, 19 Mar 2024 at 17:13, Miklos Szeredi <[email protected]> wrote:
>
> On Tue, 19 Mar 2024 at 15:15, David Howells <[email protected]> wrote:
>
> > What particular usage case of invalidate_inode_pages2() are you thinking of?
>
> FUSE_NOTIFY_INVAL_INODE will trigger invalidate_inode_pages2_range()
> to clean up the cache.
>
> The server is free to discard writes resulting from this invalidation
> and delay reads in the region until the invalidation finishes. This
> would no longer work with your change, since the mapping could
> silently be reinstated between the writeback and the removal from the
> cache due to the page being unlocked/relocked.

This would also matter if a distributed filesystem wanted to implement
coherence even if there are mmaps. I.e. a client could get exclusive
access to a region by issuing FUSE_NOTIFY_INVAL_INODE on all other
clients and blocking reads. With your change this would fail.

Again, this is purely theoretical, and without a way to differentiate
between the read-only and write cases it has limited usefulness.
Adding leases to fuse (which I plan to do) would make this much more
useful.

Thanks,
Miklos

2024-03-19 16:59:57

by Bernd Schubert

[permalink] [raw]
Subject: Re: [RFC PATCH] mm: Replace ->launder_folio() with flush and wait



On 3/19/24 17:40, Miklos Szeredi wrote:
> On Tue, 19 Mar 2024 at 17:13, Miklos Szeredi <[email protected]> wrote:
>>
>> On Tue, 19 Mar 2024 at 15:15, David Howells <[email protected]> wrote:
>>
>>> What particular usage case of invalidate_inode_pages2() are you thinking of?
>>
>> FUSE_NOTIFY_INVAL_INODE will trigger invalidate_inode_pages2_range()
>> to clean up the cache.
>>
>> The server is free to discard writes resulting from this invalidation
>> and delay reads in the region until the invalidation finishes. This
>> would no longer work with your change, since the mapping could
>> silently be reinstated between the writeback and the removal from the
>> cache due to the page being unlocked/relocked.
>
> This would also matter if a distributed filesystem wanted to implement
> coherence even if there are mmaps. I.e. a client could get exclusive
> access to a region by issuing FUSE_NOTIFY_INVAL_INODE on all other
> clients and blocking reads. With your change this would fail.
>
> Again, this is purely theoretical, and without a way to differentiate
> between the read-only and write cases it has limited usefulness.
> Adding leases to fuse (which I plan to do) would make this much more
> useful.

Thanks Miklos! Fyi, we are actually planning to extend fuse
notifications from inode to page ranges.


Thanks,
Bernd