2021-06-22 15:21:41

Subject: Do we need to unrevert "fs: do not prefault sys_write() user buffer pages"?

Hi Linus,

I've been looking at generic_perform_write() with an eye to adapting a version
for network filesystems in general. I'm wondering if it's actually safe or
whether it needs 00a3d660cbac05af34cca149cb80fb611e916935 reverting, which is
itself a revert of 998ef75ddb5709bbea0bf1506cd2717348a3c647.

Anyway, I was looking at this bit:

		bytes = min_t(unsigned long, PAGE_SIZE - offset,
						iov_iter_count(i));
	...
		if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
			status = -EFAULT;
			break;
		}

		if (fatal_signal_pending(current)) {
			status = -EINTR;
			break;
		}

		status = a_ops->write_begin(file, mapping, pos, bytes, flags,
						&page, &fsdata);
		if (unlikely(status < 0))
			break;

		if (mapping_writably_mapped(mapping))
			flush_dcache_page(page);

		copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);

and wondering if the iov_iter_fault_in_readable() is actually effective. Yes,
it can make sure that the page we're intending to modify is dragged into the
pagecache and marked uptodate so that it can be read from, but is it possible
for the page to then get reclaimed before we get to
iov_iter_copy_from_user_atomic()? a_ops->write_begin() could potentially take
a long time, say if it has to go and get a lock/lease from a server.
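
The ordering being asked about can be modelled in userspace. This is only a minimal sketch, not the kernel code: every `model_*` name is hypothetical, page residency is reduced to a boolean, and a slow write_begin() is simulated by evicting the source page a fixed number of times. It illustrates the answer given later in the thread, that a page reclaimed between the prefault and the atomic copy just produces a short copy and another pass.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one user buffer page: data plus a residency flag. */
struct model_user_page {
	char data[64];
	bool resident;
};

/* Models the prefault step: make the source page resident. 0 on success. */
static int model_fault_in_readable(struct model_user_page *src)
{
	src->resident = true;
	return 0;
}

/*
 * Models a slow write_begin(): while *evictions is positive, the source
 * page is reclaimed during the call and the counter is decremented.
 */
static void model_write_begin(struct model_user_page *src, int *evictions)
{
	if (*evictions > 0) {
		src->resident = false;
		(*evictions)--;
	}
}

/* Models the atomic copy: it cannot fault, so a non-resident source copies 0 bytes. */
static size_t model_copy_atomic(char *dst, const struct model_user_page *src,
				size_t bytes)
{
	if (!src->resident)
		return 0;
	memcpy(dst, src->data, bytes);
	return bytes;
}

/*
 * Copy one chunk, retrying from the prefault whenever the atomic copy
 * comes up short. Returns the number of passes used, or -1 if max_passes
 * is exhausted.
 */
static int model_write_one_chunk(char *dst, struct model_user_page *src,
				 size_t bytes, int evictions, int max_passes)
{
	int pass;

	for (pass = 1; pass <= max_passes; pass++) {
		if (model_fault_in_readable(src))
			return -1;
		model_write_begin(src, &evictions);
		if (model_copy_atomic(dst, src, bytes) == bytes)
			return pass;
		/* Short copy: the destination page would be released here, then retry. */
	}
	return -1;
}
```

With no evictions the chunk is copied on the first pass; with two simulated evictions it takes three. The `max_passes` bound exists only so the model terminates; it is not something the thread attributes to the kernel loop.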

Also, I've been thinking about Willy's folio/THP stuff that allows bunches of
pages to be glued together into single objects for efficiency. This is
problematic with the above code because the faultahead is limited to a maximum
of PAGE_SIZE, but we might be wanting to modify a larger object than that.

David

2021-06-22 17:57:02

by David Howells

Subject: Re: Do we need to unrevert "fs: do not prefault sys_write() user buffer pages"?

Linus Torvalds wrote:

> End result: doing the fault_in_readable "unnecessarily" at the
> beginning is likely the better optimization. It's basically free when
> it's not necessary, and it avoids an extra fault (and extra
> lock/unlock and retry) when it does end up faulting pages in.

It may also cause the read in to happen in the background whilst write_begin
is being done.

David

2021-06-22 18:15:47

by David Howells

Subject: Re: Do we need to unrevert "fs: do not prefault sys_write() user buffer pages"?

Matthew Wilcox wrote:

> > It may also cause the read in to happen in the background whilst write_begin
> > is being done.
>
> Huh? Last I checked, the fault_in_readable actually read a byte from
> the page. It has to wait for the read to complete before that can
> happen.

Ah, good point.

David
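
As a rough illustration of the point conceded above, a prefault that works by reading a byte is synchronous: the load cannot complete until the page is present. The sketch below is a userspace approximation only. `model_touch_pages` and `MODEL_PAGE_SIZE` are hypothetical names, 4 KiB pages are assumed, and the kernel helper's exact access pattern is not taken from the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */

/*
 * Read one byte from every page overlapped by [buf, buf + len).
 * Each read blocks until its page is resident, so nothing here can
 * proceed "in the background". Returns the number of pages touched.
 */
static size_t model_touch_pages(const char *buf, size_t len)
{
	const volatile char *p = buf;	/* volatile so the reads are not elided */
	uintptr_t addr = (uintptr_t)buf;
	size_t touched = 0;
	size_t off = 0;

	while (off < len) {
		(void)p[off];
		touched++;
		/* Advance to the first byte of the next page. */
		off += MODEL_PAGE_SIZE - ((addr + off) & (MODEL_PAGE_SIZE - 1));
	}
	return touched;
}
```

A two-byte range that straddles a page boundary touches two pages, while a page-aligned range of exactly `MODEL_PAGE_SIZE` bytes touches one.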

2021-06-22 18:39:00

by Matthew Wilcox (Oracle)

Subject: Re: Do we need to unrevert "fs: do not prefault sys_write() user buffer pages"?

On Tue, Jun 22, 2021 at 11:28:30AM -0700, Linus Torvalds wrote:
> On Tue, Jun 22, 2021 at 11:23 AM Matthew Wilcox wrote:
> >
> > It wouldn't be _that_ bad necessarily. filemap_fault:
>
> It's not actually the mm code that is the biggest problem. We
> obviously already have readahead support.
>
> It's the *fault* side.
>
> In particular, since the fault would return without actually filling
> in the page table entry (because the page isn't ready yet, and you
> cannot expose it to other threads!), you also have to jump over the
> instruction that caused this all.

Oh, I was assuming that it'd be a function call like
get_user_pages_fast(), not an instruction that was specially marked to
be jumped over. Gag reflex diminishing now?

2021-06-22 22:06:42

by Matthew Wilcox (Oracle)

Subject: Re: Do we need to unrevert "fs: do not prefault sys_write() user buffer pages"?

On Tue, Jun 22, 2021 at 09:55:09PM +0000, David Laight wrote:
> From: David Howells
> > Sent: 22 June 2021 17:27
> >
> > Al Viro wrote:
> >
> > > On Tue, Jun 22, 2021 at 04:20:40PM +0100, David Howells wrote:
> > >
> > > > and wondering if the iov_iter_fault_in_readable() is actually effective.
> > > > Yes, it can make sure that the page we're intending to modify is dragged
> > > > into the pagecache and marked uptodate so that it can be read from, but is
> > > > it possible for the page to then get reclaimed before we get to
> > > > iov_iter_copy_from_user_atomic()? a_ops->write_begin() could potentially
> > > > take a long time, say if it has to go and get a lock/lease from a server.
> > >
> > > Yes, it is. So what? We'll just retry. You *can't* take faults while
> > > holding some pages locked; not without shitloads of deadlocks.
> >
> > In that case, can we amend the comment immediately above
> > iov_iter_fault_in_readable()?
> >
> > 	/*
> > 	 * Bring in the user page that we will copy from _first_.
> > 	 * Otherwise there's a nasty deadlock on copying from the
> > 	 * same page as we're writing to, without it being marked
> > 	 * up-to-date.
> > 	 *
> > 	 * Not only is this an optimisation, but it is also required
> > 	 * to check that the address is actually valid, when atomic
> > 	 * usercopies are used, below.
> > 	 */
> > 	if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
> >
> > The first part suggests this is for deadlock avoidance. If that's not true,
> > then this should perhaps be changed.
>
> I'd say something like:
> 	/*
> 	 * The actual copy_from_user() is done with a lock held
> 	 * so cannot fault in missing pages.
> 	 * So fault in the pages first.
> 	 * If they get paged out the inatomic usercopy will fail
> 	 * and the whole operation is retried.
> 	 *
> 	 * Hopefully there are enough memory pages available to
> 	 * stop this looping forever.
> 	 */
>
> It is perfectly possible for another application thread to
> invalidate one of the buffer fragments after iov_iter_fault_in_readable()
> returns success - so it will then fail on the second pass.
>
> The maximum number of pages required is twice the maximum number
> of iov fragments.
> If the system is crawling along with no available memory pages
> the same physical page could get used for two user pages.

I would suggest reading the function before you suggest modifications
to it.

		offset = (pos & (PAGE_SIZE - 1));
		bytes = min_t(unsigned long, PAGE_SIZE - offset,
						iov_iter_count(i));
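
The arithmetic in those two lines can be checked in isolation. The following is a minimal userspace sketch assuming 4 KiB pages; `model_chunk_len`, `model_count_chunks` and `MODEL_PAGE_SIZE` are hypothetical names, and plain integers stand in for the file position and iov_iter count. It shows that each pass handles at most one page's worth of bytes, ending at the next page boundary in the file. Since a contiguous range of at most one page of bytes can overlap no more than two pages, each pass needs only a small, fixed number of user pages.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */

/* Bytes handled in one pass: up to the end of the current page, or less. */
static size_t model_chunk_len(unsigned long pos, size_t remaining)
{
	unsigned long offset = pos & (MODEL_PAGE_SIZE - 1);
	size_t room = MODEL_PAGE_SIZE - offset;

	return remaining < room ? remaining : room;
}

/* Number of passes needed to write `remaining` bytes starting at `pos`. */
static unsigned int model_count_chunks(unsigned long pos, size_t remaining)
{
	unsigned int chunks = 0;

	while (remaining > 0) {
		size_t bytes = model_chunk_len(pos, remaining);

		pos += bytes;
		remaining -= bytes;
		chunks++;
	}
	return chunks;
}
```

For example, a 10000-byte write starting at position 4000 is split into chunks of 96, 4096, 4096 and 1712 bytes.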