2024-04-15 12:34:17

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v2] ext4: fix race condition between buffer write and page_mkwrite

On Mon 15-04-24 12:28:01, Baokun Li wrote:
> On 2023/6/5 23:08, Jan Kara wrote:
> > On Mon 05-06-23 15:55:35, Matthew Wilcox wrote:
> > > On Mon, Jun 05, 2023 at 02:21:41PM +0200, Jan Kara wrote:
> > > > On Mon 05-06-23 11:16:55, Jan Kara wrote:
> > > > > Yeah, I agree, that is also the conclusion I have arrived at when thinking
> > > > > about this problem now. We should be able to just remove the conversion
> > > > > from ext4_page_mkwrite() and rely on write(2) or truncate(2) doing it when
> > > > > growing i_size.
> > > > OK, thinking more about this and searching through the history, I've
> > > > realized why the conversion is originally in ext4_page_mkwrite(). The
> > > > problem is described in commit 7b4cc9787fe35b ("ext4: evict inline data
> > > > when writing to memory map") but essentially it boils down to the fact that
> > > > ext4 writeback code does not expect dirty page for a file with inline data
> > > > because ext4_write_inline_data_end() should have copied the data into the
> > > > inode and cleared the folio's dirty flag.
> > > >
> > > > Indeed messing with xattrs from the writeback path to copy page contents
> > > > into inline data xattr would be ... interesting. Hum, out of good ideas for
> > > > now :-|.
> > > Is it so bad? Now that we don't have writepage in ext4, only
> > > writepages, it seems like we have a considerably more benign locking
> > > environment to work in.
> > Well, yes, without ->writepage() it might be *possible*. But still rather
> > ugly. The problem is that in ->writepages() i_size is not stable. Thus also
> > whether the inode data is inline or not is not stable. We'd need inode_lock
> > for that but that is not easily doable in the writeback path - inode lock
> > would then become fs_reclaim unsafe...
> >
> > Honza
> Hi Honza!
> Hi Ted!
> Hi Matthew!
>
> Long time later came back to this, because while discussing another similar
> ABBA problem with Hou Tao, he mentioned VM_FAULT_RETRY, and then I
> thought that this could be used to solve this problem as well.
>
> The general idea is that if we see a file with inline data in
> ext4_page_mkwrite(),
> we release the mmap_lock and grab the inode_lock to convert the inline data,
> and then return VM_FAULT_RETRY to retry to get the mmap_lock.
>
> The code implementation is as follows, do you have any thoughts?

So the problem with this is that VM_FAULT_RETRY is not always an option -
in particular the caller has to set FAULT_FLAG_ALLOW_RETRY to indicate it
is prepared to handle VM_FAULT_RETRY return. See how
maybe_unlock_mmap_for_io() is carefully checking this. There are callers
(most notably some get_user_pages() users) that don't set
FAULT_FLAG_ALLOW_RETRY so the escape through VM_FAULT_RETRY is sadly not a
reliable solution.

My long-term wish is we were always allowed to use VM_FAULT_RETRY and that
was actually what motivated some get_user_pages() cleanups I did couple
years ago. But dealing with all the cases in various drivers was too
difficult and I've run out of time. Now maybe it would be worth it to
revisit this since things have changed noticeably and maybe now it would be
easier to achive the goal...

Honza

> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 537803250ca9..e044c11c9cf6 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -6056,15 +6056,36 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
> ???? if (unlikely(IS_IMMUTABLE(inode)))
> ???? ??? return VM_FAULT_SIGBUS;
>
> +??? /*
> +??? ?* The ext4 writeback code does not expect dirty page for a file with
> +??? ?* inline data, so inline data needs to be converted. But it needs to
> +??? ?* hold the inode_lock when converting, and trying to lock the inode
> +??? ?* while holding the mmap_lock may result in an ABBA deadlock. So here
> +??? ?* we release the mmap_lock for conversion and retry after conversion.
> +??? ?* Only one retry is allowed to avoid endless loops.
> +??? ?* Acquire xattr_sem to avoid race with inline data conversion.
> +??? ?*/
> +??? down_read(&EXT4_I(inode)->xattr_sem);
> +??? if (ext4_has_inline_data(inode)) {
> +??? ??? up_read(&EXT4_I(inode)->xattr_sem);
> +
> +??? ??? if (!fault_flag_allow_retry_first(vmf->flags))
> +??? ??? ??? return VM_FAULT_SIGBUS;
> +
> +??? ??? release_fault_lock(vmf);
> +
> +??? ??? inode_lock(inode);
> +??? ??? ext4_convert_inline_data(inode);
> +??? ??? inode_unlock(inode);
> +??? ??? return VM_FAULT_RETRY;
> +??? }
> +??? up_read(&EXT4_I(inode)->xattr_sem);
> +
> ???? sb_start_pagefault(inode->i_sb);
> ???? file_update_time(vma->vm_file);
>
> ???? filemap_invalidate_lock_shared(mapping);
>
> -??? err = ext4_convert_inline_data(inode);
> -??? if (err)
> -??? ??? goto out_ret;
> -
> ???? /*
> ???? ?* On data journalling we skip straight to the transaction handle:
> ???? ?* there's no delalloc; page truncated will be checked later; the
>
> --
> With Best Regards,
> Baokun Li
> .
>
--
Jan Kara <[email protected]>
SUSE Labs, CR