LinuxLists.cc - [PATCH 0/3 RFC] fs: Hole punch vs page cache filling races

2021-01-20 16:12:03

Subject: [PATCH 0/3 RFC] fs: Hole punch vs page cache filling races

Hello,

Amir has reported [1] a that ext4 has a potential issues when reads can race
with hole punching possibly exposing stale data from freed blocks or even
corrupting filesystem when stale mapping data gets used for writeout. The
problem is that during hole punching, new page cache pages can get instantiated
in a punched range after truncate_inode_pages() has run but before the
filesystem removes blocks from the file. In principle any filesystem
implementing hole punching thus needs to implement a mechanism to block
instantiating page cache pages during hole punching to avoid this race. This is
further complicated by the fact that there are multiple places that can
instantiate pages in page cache. We can have regular read(2) or page fault
doing this but fadvise(2) or madvise(2) can also result in reading in page
cache pages through force_page_cache_readahead().

There are couple of ways how to fix this. First way (currently implemented by
XFS) is to protect read(2) and *advise(2) calls with i_rwsem so that they are
serialized with hole punching. This is easy to do but as a result all reads
would then be serialized with writes and thus mixed read-write workloads suffer
heavily on ext4. Thus for ext4 I want to use EXT4_I(inode)->i_mmap_sem for
serialization of reads and hole punching. The same serialization that is
already currently used in ext4 to close this race for page faults. This is
conceptually simple but lock ordering is troublesome - since
EXT4_I(inode)->i_mmap_sem is used in page fault path, it ranks below mmap_sem.
Thus we cannot simply grab EXT4_I(inode)->i_mmap_sem in ext4_file_read_iter()
as generic_file_buffered_read() copies data to userspace which may require
grabbing mmap_sem. Also grabbing EXT4_I(inode)->i_mmap_sem in ext4_readpages()
/ ext4_readpage() is problematic because at that point we already have locked
pages instantiated in the page cache. So EXT4_I(inode)->i_mmap_sem would
effectively rank below page lock which is too low in the locking hierarchy. So
for ext4 (and other filesystems with similar locking constraints - F2FS, GFS2,
OCFS2, ...) we'd need another hook in the read path that can wrap around
insertion of pages into page cache but does not contain copying of data into
userspace.

This patch set implements one possibility of such hook - we essentially
abstract generic_file_buffered_read_get_pages() into a hook. I'm not completely
sold on the naming or the API, or even whether this is the best place for the
hook. But I wanted to send something out for further discussion. For example
another workable option for ext4 would be to have an aops hook for adding a
page into page cache (essentially abstract add_to_page_cache_lru()). There will
be slight downside that it would mean per-page acquisition of the lock instead
of a per-batch-of-pages, also if we ever transition to range locking the
mapping, per-batch locking would be more efficient.

What do people think about this?

Honza

[1] https://lore.kernel.org/linux-fsdevel/CAOQ4uxjQNmxqmtA_VbYW0Su9rKRk2zobJmahcyeaEVOFKVQ5dw@mail.gmail.com/

2021-01-21 19:31:01

by Matthew Wilcox

[permalink] [raw]

Subject: Re: [PATCH 0/3 RFC] fs: Hole punch vs page cache filling races

On Wed, Jan 20, 2021 at 05:06:08PM +0100, Jan Kara wrote:
> Hello,
>
> Amir has reported [1] a that ext4 has a potential issues when reads can race
> with hole punching possibly exposing stale data from freed blocks or even
> corrupting filesystem when stale mapping data gets used for writeout. The
> problem is that during hole punching, new page cache pages can get instantiated
> in a punched range after truncate_inode_pages() has run but before the
> filesystem removes blocks from the file. In principle any filesystem
> implementing hole punching thus needs to implement a mechanism to block
> instantiating page cache pages during hole punching to avoid this race. This is
> further complicated by the fact that there are multiple places that can
> instantiate pages in page cache. We can have regular read(2) or page fault
> doing this but fadvise(2) or madvise(2) can also result in reading in page
> cache pages through force_page_cache_readahead().

Doesn't this indicate that we're doing truncates in the wrong order?
ie first we should deallocate the blocks, then we should free the page
cache that was caching the contents of those blocks. We'd need to
make sure those pages in the page cache don't get written back to disc
(either by taking pages in the page cache off the lru list or having
the filesystem handle writeback of pages to a freed extent as a no-op).

2021-01-22 14:44:34

by Jan Kara

[permalink] [raw]

Subject: Re: [PATCH 0/3 RFC] fs: Hole punch vs page cache filling races

On Thu 21-01-21 19:27:55, Matthew Wilcox wrote:
> On Wed, Jan 20, 2021 at 05:06:08PM +0100, Jan Kara wrote:
> > Hello,
> >
> > Amir has reported [1] a that ext4 has a potential issues when reads can race
> > with hole punching possibly exposing stale data from freed blocks or even
> > corrupting filesystem when stale mapping data gets used for writeout. The
> > problem is that during hole punching, new page cache pages can get instantiated
> > in a punched range after truncate_inode_pages() has run but before the
> > filesystem removes blocks from the file. In principle any filesystem
> > implementing hole punching thus needs to implement a mechanism to block
> > instantiating page cache pages during hole punching to avoid this race. This is
> > further complicated by the fact that there are multiple places that can
> > instantiate pages in page cache. We can have regular read(2) or page fault
> > doing this but fadvise(2) or madvise(2) can also result in reading in page
> > cache pages through force_page_cache_readahead().
>
> Doesn't this indicate that we're doing truncates in the wrong order?
> ie first we should deallocate the blocks, then we should free the page
> cache that was caching the contents of those blocks. We'd need to
> make sure those pages in the page cache don't get written back to disc
> (either by taking pages in the page cache off the lru list or having
> the filesystem handle writeback of pages to a freed extent as a no-op).

Well, it depends on how much you wish to complicate the matters :).
Filesystems have metadata information attached to pages (e.g. buffer
heads), once you are removing blocks from a file, this information is
becoming stale. So it makes perfect sense to first evict page cache to
remove this metadata caching information and then remove blocks from
on-disk structures.

You can obviously try to do it the other way around - i.e., first remove
blocks from on-disk structures and then remove the cached information from
the page cache. But then you have to make sure stale cached information
isn't used until everything is in sync. So whichever way you slice it, you
have information in two places, you need to keep it in sync and you need
some synchronization between different updaters of this information
in both places so that they cannot get those two places out of sync...

TLDR: I don't see much benefit in switching the ordering.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-04-02 19:35:44

by Theodore Ts'o

[permalink] [raw]

Subject: Re: [PATCH 0/3 RFC] fs: Hole punch vs page cache filling races

On Wed, Jan 20, 2021 at 05:06:08PM +0100, Jan Kara wrote:
>
> Amir has reported [1] a that ext4 has a potential issues when reads can race
> with hole punching possibly exposing stale data from freed blocks or even
> corrupting filesystem when stale mapping data gets used for writeout. The
> problem is that during hole punching, new page cache pages can get instantiated
> in a punched range after truncate_inode_pages() has run but before the
> filesystem removes blocks from the file. In principle any filesystem
> implementing hole punching thus needs to implement a mechanism to block
> instantiating page cache pages during hole punching to avoid this race. This is
> further complicated by the fact that there are multiple places that can
> instantiate pages in page cache. We can have regular read(2) or page fault
> doing this but fadvise(2) or madvise(2) can also result in reading in page
> cache pages through force_page_cache_readahead().

What's the current status of this patch set? I'm going through
pending patches and it looks like folks don't like Jan's proposed
solution. What are next steps?

Thanks,

- Ted

2021-04-06 21:14:10

by Jan Kara

[permalink] [raw]

Subject: Re: [PATCH 0/3 RFC] fs: Hole punch vs page cache filling races

On Fri 02-04-21 15:34:13, Theodore Ts'o wrote:
> On Wed, Jan 20, 2021 at 05:06:08PM +0100, Jan Kara wrote:
> >
> > Amir has reported [1] a that ext4 has a potential issues when reads can race
> > with hole punching possibly exposing stale data from freed blocks or even
> > corrupting filesystem when stale mapping data gets used for writeout. The
> > problem is that during hole punching, new page cache pages can get instantiated
> > in a punched range after truncate_inode_pages() has run but before the
> > filesystem removes blocks from the file. In principle any filesystem
> > implementing hole punching thus needs to implement a mechanism to block
> > instantiating page cache pages during hole punching to avoid this race. This is
> > further complicated by the fact that there are multiple places that can
> > instantiate pages in page cache. We can have regular read(2) or page fault
> > doing this but fadvise(2) or madvise(2) can also result in reading in page
> > cache pages through force_page_cache_readahead().
>
> What's the current status of this patch set? I'm going through
> pending patches and it looks like folks don't like Jan's proposed
> solution. What are next steps?

Note that I did post v2 here:

https://lore.kernel.org/linux-fsdevel/[email protected]/

It didn't get much comments though. I guess I'll rebase the series, include
the explanations I've added in my reply to Dave and resend.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-04-07 07:19:02

by Theodore Ts'o

[permalink] [raw]

Subject: Re: [PATCH 0/3 RFC] fs: Hole punch vs page cache filling races

On Tue, Apr 06, 2021 at 02:17:02PM +0200, Jan Kara wrote:
>
> Note that I did post v2 here:
>
> https://lore.kernel.org/linux-fsdevel/[email protected]/
>
> It didn't get much comments though. I guess I'll rebase the series, include
> the explanations I've added in my reply to Dave and resend.

Hmm, I wonder if it somehow got hit by the vger.kernel.org
instabilities a while back? As near as I can tell your v2 patches
never were received by:

http://patchwork.ozlabs.org/project/linux-ext4/

(There are no ext4 patches on patchwork.ozlabs.org on February 8th,
although I do see patches hitting patchwork on February 7th and 9th.)

Resending sounds like a plan. :-)

- Ted

2021-04-07 07:25:04

by Theodore Ts'o

[permalink] [raw]

Subject: Re: [PATCH 0/3 RFC] fs: Hole punch vs page cache filling races

On Tue, Apr 06, 2021 at 12:45:28PM -0400, Theodore Ts'o wrote:
>
> Hmm, I wonder if it somehow got hit by the vger.kernel.org
> instabilities a while back? As near as I can tell your v2 patches
> never were received by:
>
> http://patchwork.ozlabs.org/project/linux-ext4/
>
> (There are no ext4 patches on patchwork.ozlabs.org on February 8th,
> although I do see patches hitting patchwork on February 7th and 9th.)

I just noticed that the V2 patches were only sent to linux-fsdevel and
not cc'ed to linux-ext4. So that explains why I didn't see it on
patchwork. :-)

- Ted