2021-06-22 15:28:33

by Al Viro

Subject: Re: Do we need to unrevert "fs: do not prefault sys_write() user buffer pages"?

On Tue, Jun 22, 2021 at 04:20:40PM +0100, David Howells wrote:

> and wondering if the iov_iter_fault_in_readable() is actually effective. Yes,
> it can make sure that the page we're intending to modify is dragged into the
> pagecache and marked uptodate so that it can be read from, but is it possible
> for the page to then get reclaimed before we get to
> iov_iter_copy_from_user_atomic()? a_ops->write_begin() could potentially take
> a long time, say if it has to go and get a lock/lease from a server.

Yes, it is. So what? We'll just retry. You *can't* take faults while holding
some pages locked; not without shitloads of deadlocks.


2021-06-22 15:36:56

by Al Viro

Subject: Re: Do we need to unrevert "fs: do not prefault sys_write() user buffer pages"?

On Tue, Jun 22, 2021 at 03:27:43PM +0000, Al Viro wrote:
> On Tue, Jun 22, 2021 at 04:20:40PM +0100, David Howells wrote:
>
> > and wondering if the iov_iter_fault_in_readable() is actually effective. Yes,
> > it can make sure that the page we're intending to modify is dragged into the
> > pagecache and marked uptodate so that it can be read from, but is it possible
> > for the page to then get reclaimed before we get to
> > iov_iter_copy_from_user_atomic()? a_ops->write_begin() could potentially take
> > a long time, say if it has to go and get a lock/lease from a server.
>
> Yes, it is. So what? We'll just retry. You *can't* take faults while holding
> some pages locked; not without shitloads of deadlocks.

Note that the revert you propose is going to do fault-in anyway; we really can't
avoid it. The only thing it does is optimistically trying without that the
first time around, which is going to be an overall loss exactly in "slow
write_begin" case. If source pages are absent, you'll get copyin fail;
iov_iter_copy_from_user_atomic() (or its replacement) is disabling pagefaults
itself.

2021-06-22 16:27:46

by David Howells

Subject: Re: Do we need to unrevert "fs: do not prefault sys_write() user buffer pages"?

Al Viro <[email protected]> wrote:

> On Tue, Jun 22, 2021 at 04:20:40PM +0100, David Howells wrote:
>
> > and wondering if the iov_iter_fault_in_readable() is actually effective.
> > Yes, it can make sure that the page we're intending to modify is dragged
> > into the pagecache and marked uptodate so that it can be read from, but is
> > it possible for the page to then get reclaimed before we get to
> > iov_iter_copy_from_user_atomic()? a_ops->write_begin() could potentially
> > take a long time, say if it has to go and get a lock/lease from a server.
>
> Yes, it is. So what? We'll just retry. You *can't* take faults while
> holding some pages locked; not without shitloads of deadlocks.

In that case, can we amend the comment immediately above
iov_iter_fault_in_readable()?

	/*
	 * Bring in the user page that we will copy from _first_.
	 * Otherwise there's a nasty deadlock on copying from the
	 * same page as we're writing to, without it being marked
	 * up-to-date.
	 *
	 * Not only is this an optimisation, but it is also required
	 * to check that the address is actually valid, when atomic
	 * usercopies are used, below.
	 */
	if (unlikely(iov_iter_fault_in_readable(i, bytes))) {

The first part suggests this is for deadlock avoidance. If that's not true,
then this should perhaps be changed.

David

2021-06-22 17:26:49

by Matthew Wilcox

Subject: Re: Do we need to unrevert "fs: do not prefault sys_write() user buffer pages"?

On Tue, Jun 22, 2021 at 03:36:22PM +0000, Al Viro wrote:
> On Tue, Jun 22, 2021 at 03:27:43PM +0000, Al Viro wrote:
> > On Tue, Jun 22, 2021 at 04:20:40PM +0100, David Howells wrote:
> >
> > > and wondering if the iov_iter_fault_in_readable() is actually effective. Yes,
> > > it can make sure that the page we're intending to modify is dragged into the
> > > pagecache and marked uptodate so that it can be read from, but is it possible
> > > for the page to then get reclaimed before we get to
> > > iov_iter_copy_from_user_atomic()? a_ops->write_begin() could potentially take
> > > a long time, say if it has to go and get a lock/lease from a server.
> >
> > Yes, it is. So what? We'll just retry. You *can't* take faults while holding
> > some pages locked; not without shitloads of deadlocks.
>
> Note that the revert you propose is going to do fault-in anyway; we really can't
> avoid it. The only thing it does is optimistically trying without that the
> first time around, which is going to be an overall loss exactly in "slow
> write_begin" case. If source pages are absent, you'll get copyin fail;
> iov_iter_copy_from_user_atomic() (or its replacement) is disabling pagefaults
> itself.

Let's not overstate the case. I think for the vast majority of write()
calls, the data being written has recently been accessed. So this
userspace access is unnecessary. From the commentary around commits
00a3d660cbac and 998ef75ddb57, it seems that Dave had a CPU which was
particularly inefficient at accessing userspace. I assume Intel have
fixed that by now and the extra load is in the noise. But maybe enough
CPU errata have accumulated that it's slow again?

2021-06-22 17:40:57

by Linus Torvalds

Subject: Re: Do we need to unrevert "fs: do not prefault sys_write() user buffer pages"?

On Tue, Jun 22, 2021 at 10:26 AM Matthew Wilcox <[email protected]> wrote:
>
> On Tue, Jun 22, 2021 at 03:36:22PM +0000, Al Viro wrote:
> >
> > Note that the revert you propose is going to do fault-in anyway; we really can't
> > avoid it. The only thing it does is optimistically trying without that the
> > first time around, which is going to be an overall loss exactly in "slow
> > write_begin" case. If source pages are absent, you'll get copyin fail;
> > iov_iter_copy_from_user_atomic() (or its replacement) is disabling pagefaults
> > itself.
>
> Let's not overstate the case. I think for the vast majority of write()
> calls, the data being written has recently been accessed. So this
> userspace access is unnecessary.

Note that the fault_in_readable is very much necessary - the only
question is whether it happens before the actual access, or after it
in the "oh, it failed, need to retry" case.

There are two cases:

(a) the user page is there and accessible, and fault_in_readable
isn't necessary

(b) not

and as you say, case (a) is generally the common one by far, although
it will depend on the exact load (iow, (b) *could* be the common case:
you can have situations where you mmap() things only to then write the
mapping out, and then accesses will fault a lot).

But if it's case (a), then the fault_in_readable is going to be pretty
cheap. We're talking "tens of CPU cycles", unlikely to really be an
issue.

If the case is (b), then the cost is not actually the access at all,
it's the *fault* and the retry. Now we're talking easily thousands of
cycles.

And that's where it matters whether the fault_in_readable is before or
after. If it's before the actual access, then you'll have just _one_
fault, and it will handle the fault.

If the fault_in_readable is only done in the allegedly unlikely
faulting case and is _after_ the actual user space atomic access,
you'll have *two* faults. First the copy_from_user_atomic() will
fault, and return a partial result. But the page won't actually be
populated, so then the fault_in_readable will have to fault _again_,
in order to finally populate the page. And then we retry
(successfully, except for the unbelievably rare case of racing with
pageout) the actual copy_from_user_atomic().

End result: doing the fault_in_readable "unnecessarily" at the
beginning is likely the better optimization. It's basically free when
it's not necessary, and it avoids an extra fault (and extra
lock/unlock and retry) when it does end up faulting pages in.

Linus
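
The fault counting in the message above can be made concrete with a small
user-space model. This is only a sketch under stated assumptions, not kernel
code and not something posted in the thread: `struct toy_page`,
`toy_fault_in()`, `toy_copy_atomic()` and the two `faults_*()` helpers are
hypothetical names, and a "fault" is modelled as a simple counter increment.
The one behaviour it tries to capture is that a copy with page faults disabled
still traps on an absent page but does not populate it.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a single user page. NOT a kernel structure. */
struct toy_page {
	bool present;	/* is the page mapped and readable? */
	int faults;	/* faults taken on this page so far */
};

/*
 * Stand-in for fault_in_readable(): touching an absent page takes
 * one fault, and that fault populates the page.
 */
static void toy_fault_in(struct toy_page *p)
{
	if (!p->present) {
		p->faults++;
		p->present = true;
	}
}

/*
 * Stand-in for the atomic user copy: page faults are disabled, so an
 * absent page still costs a fault but is NOT populated, and the copy
 * reports failure.
 */
static bool toy_copy_atomic(struct toy_page *p)
{
	if (!p->present) {
		p->faults++;
		return false;
	}
	return true;
}

/* Order 1: fault in first, then do the atomic copy. */
int faults_prefault_first(bool initially_present)
{
	struct toy_page p = { .present = initially_present, .faults = 0 };
	bool ok;

	toy_fault_in(&p);
	ok = toy_copy_atomic(&p);
	assert(ok);	/* no reclaim in this model, so the copy succeeds */
	(void)ok;
	return p.faults;
}

/* Order 2: try the atomic copy first, fault in only if it failed. */
int faults_copy_first(bool initially_present)
{
	struct toy_page p = { .present = initially_present, .faults = 0 };
	bool ok;

	if (!toy_copy_atomic(&p)) {
		toy_fault_in(&p);
		ok = toy_copy_atomic(&p);	/* the retry */
		assert(ok);
		(void)ok;
	}
	return p.faults;
}
```

In this model both orderings take no fault when the page is already present.
When it is absent, `faults_prefault_first(false)` counts one fault and
`faults_copy_first(false)` counts two, which is the asymmetry described for
case (b). The model ignores the real cost of the lock/unlock and retry around
the failed copy, so it understates the difference.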

2021-06-22 21:55:51

by David Laight

Subject: RE: Do we need to unrevert "fs: do not prefault sys_write() user buffer pages"?

From: David Howells
> Sent: 22 June 2021 17:27
>
> Al Viro <[email protected]> wrote:
>
> > On Tue, Jun 22, 2021 at 04:20:40PM +0100, David Howells wrote:
> >
> > > and wondering if the iov_iter_fault_in_readable() is actually effective.
> > > Yes, it can make sure that the page we're intending to modify is dragged
> > > into the pagecache and marked uptodate so that it can be read from, but is
> > > it possible for the page to then get reclaimed before we get to
> > > iov_iter_copy_from_user_atomic()? a_ops->write_begin() could potentially
> > > take a long time, say if it has to go and get a lock/lease from a server.
> >
> > Yes, it is. So what? We'll just retry. You *can't* take faults while
> > holding some pages locked; not without shitloads of deadlocks.
>
> In that case, can we amend the comment immediately above
> iov_iter_fault_in_readable()?
>
> /*
> * Bring in the user page that we will copy from _first_.
> * Otherwise there's a nasty deadlock on copying from the
> * same page as we're writing to, without it being marked
> * up-to-date.
> *
> * Not only is this an optimisation, but it is also required
> * to check that the address is actually valid, when atomic
> * usercopies are used, below.
> */
> if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
>
> The first part suggests this is for deadlock avoidance. If that's not true,
> then this should perhaps be changed.

I'd say something like:
	/*
	 * The actual copy_from_user() is done with a lock held
	 * so cannot fault in missing pages.
	 * So fault in the pages first.
	 * If they get paged out the inatomic usercopy will fail
	 * and the whole operation is retried.
	 *
	 * Hopefully there are enough memory pages available to
	 * stop this looping forever.
	 */

It is perfectly possible for another application thread to
invalidate one of the buffer fragments after iov_iter_fault_in_readable()
returns success - so it will then fail on the second pass.

The maximum number of pages required is twice the maximum number
of iov fragments.
If the system is crawling along with no available memory pages
the same physical page could get used for two user pages.

David


2021-06-22 22:21:20

by Dave Chinner

Subject: Re: Do we need to unrevert "fs: do not prefault sys_write() user buffer pages"?

On Tue, Jun 22, 2021 at 09:55:09PM +0000, David Laight wrote:
> From: David Howells
> > Sent: 22 June 2021 17:27
> >
> > Al Viro <[email protected]> wrote:
> >
> > > On Tue, Jun 22, 2021 at 04:20:40PM +0100, David Howells wrote:
> > >
> > > > and wondering if the iov_iter_fault_in_readable() is actually effective.
> > > > Yes, it can make sure that the page we're intending to modify is dragged
> > > > into the pagecache and marked uptodate so that it can be read from, but is
> > > > it possible for the page to then get reclaimed before we get to
> > > > iov_iter_copy_from_user_atomic()? a_ops->write_begin() could potentially
> > > > take a long time, say if it has to go and get a lock/lease from a server.
> > >
> > > Yes, it is. So what? We'll just retry. You *can't* take faults while
> > > holding some pages locked; not without shitloads of deadlocks.
> >
> > In that case, can we amend the comment immediately above
> > iov_iter_fault_in_readable()?
> >
> > /*
> > * Bring in the user page that we will copy from _first_.
> > * Otherwise there's a nasty deadlock on copying from the
> > * same page as we're writing to, without it being marked
> > * up-to-date.
> > *
> > * Not only is this an optimisation, but it is also required
> > * to check that the address is actually valid, when atomic
> > * usercopies are used, below.
> > */
> > if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
> >
> > The first part suggests this is for deadlock avoidance. If that's not true,
> > then this should perhaps be changed.
>
> I'd say something like:
> /*
> * The actual copy_from_user() is done with a lock held
> * so cannot fault in missing pages.
> * So fault in the pages first.
> * If they get paged out the inatomic usercopy will fail
> * and the whole operation is retried.
> *
> * Hopefully there are enough memory pages available to
> * stop this looping forever.
> */

What about the other 4 or 5 copies of this loop in the kernel?

This is a pattern, not a one off implementation. Comments describing
how the pattern works belong in the API documentation, not on a
single implementation of the pattern...

Cheers,

Dave.
--
Dave Chinner
[email protected]
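
The loop pattern this thread keeps returning to (fault in the source, lock the
destination, copy with faults disabled, retry on a short copy) can be sketched
as a user-space toy. Everything here is an assumption made for illustration:
`struct toy_src`, `toy_fault_in()`, `toy_write_begin()`, `toy_copy_atomic()`
and `toy_write()` are hypothetical names, reclaim is modelled as a countdown
that evicts the source page during the "slow" `toy_write_begin()`, and no real
locking is done. It is not the kernel's implementation of any of the loops
mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy source buffer standing in for a user page. NOT a kernel structure. */
struct toy_src {
	const char *data;
	bool present;		/* is the source page currently resident? */
	int evictions_left;	/* how often "reclaim" wins the race */
};

/* Stand-in for the prefault step: make the source resident. */
static void toy_fault_in(struct toy_src *s)
{
	s->present = true;
}

/*
 * Stand-in for a slow write_begin(): while it runs, the source page
 * may be reclaimed again. After it returns, the destination is
 * considered locked, so no faults may be taken.
 */
static void toy_write_begin(struct toy_src *s)
{
	if (s->evictions_left > 0) {
		s->evictions_left--;
		s->present = false;
	}
}

/* Stand-in for the atomic copy: returns 0 if the source is absent. */
static size_t toy_copy_atomic(char *dst, const struct toy_src *s, size_t len)
{
	if (!s->present)
		return 0;
	memcpy(dst, s->data, len);
	return len;
}

/* Returns the number of passes through the loop before the copy worked. */
int toy_write(char *dst, struct toy_src *s, size_t len)
{
	int attempts = 0;

	for (;;) {
		size_t copied;

		attempts++;
		toy_fault_in(s);	/* may fault: nothing is locked yet */
		toy_write_begin(s);	/* destination "locked" from here */
		copied = toy_copy_atomic(dst, s, len);
		/* write_end() would unlock the destination here */
		if (copied == len)
			return attempts;
		/* short copy: drop the lock and go round again */
	}
}
```

With `evictions_left` set to 2, `toy_write()` needs three passes, and with 0
it needs one. The loop terminates only because the countdown runs out, which
mirrors the "hopefully there are enough memory pages available" caveat in the
proposed comment.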