2021-06-22 18:10:07

by Matthew Wilcox

[permalink] [raw]
Subject: Re: Do we need to unrevert "fs: do not prefault sys_write() user buffer pages"?

On Tue, Jun 22, 2021 at 06:55:33PM +0100, David Howells wrote:
> Linus Torvalds <[email protected]> wrote:
>
> > End result: doing the fault_in_readable "unnecessarily" at the
> > beginning is likely the better optimization. It's basically free when
> > it's not necessary, and it avoids an extra fault (and extra
> > lock/unlock and retry) when it does end up faulting pages in.
>
> It may also cause the read in to happen in the background whilst write_begin
> is being done.

Huh? Last I checked, the fault_in_readable actually read a byte from
the page. It has to wait for the read to complete before that can
happen.


2021-06-22 18:10:37

by Linus Torvalds

[permalink] [raw]
Subject: Re: Do we need to unrevert "fs: do not prefault sys_write() user buffer pages"?

On Tue, Jun 22, 2021 at 11:05 AM Matthew Wilcox <[email protected]> wrote:
>
> Huh? Last I checked, the fault_in_readable actually read a byte from
> the page. It has to wait for the read to complete before that can
> happen.

Yeah, we don't have any kind of async fault-in model.

I'm not sure how that would even look. I don't think it would
necessarily be *impossible* (special marker in the exception table to
let the fault code know that this is a "prepare" fault), but it would
be pretty challenging.

Linus

2021-06-22 18:17:08

by Nadav Amit

[permalink] [raw]
Subject: Re: Do we need to unrevert "fs: do not prefault sys_write() user buffer pages"?



> On Jun 22, 2021, at 11:07 AM, Linus Torvalds <[email protected]> wrote:
>
> On Tue, Jun 22, 2021 at 11:05 AM Matthew Wilcox <[email protected]> wrote:
>>
>> Huh? Last I checked, the fault_in_readable actually read a byte from
>> the page. It has to wait for the read to complete before that can
>> happen.
>
> Yeah, we don't have any kind of async fault-in model.
>
> I'm not sure how that would even look. I don't think it would
> necessarily be *impossible* (special marker in the exception table to
> let the fault code know that this is a "prepare" fault), but it would
> be pretty challenging.

I did send an RFC some time ago for “prepare” fault, but it the RFC
requires some more work.

https://lore.kernel.org/lkml/[email protected]/

2021-06-22 18:24:16

by David Howells

[permalink] [raw]
Subject: Re: Do we need to unrevert "fs: do not prefault sys_write() user buffer pages"?

Linus Torvalds <[email protected]> wrote:

> I'm not sure how that would even look. I don't think it would
> necessarily be *impossible* (special marker in the exception table to
> let the fault code know that this is a "prepare" fault), but it would
> be pretty challenging.

Probably the most obvious way would be to set a flag in task_struct saying
what you're doing and have the point that would otherwise wait for the page to
become unlocked skip to the fault fixup code if the page is locked after
->readahead() has been invoked and the flag is set, then use get_user() in
iov_iter_fault_in_readable().

But, as Willy says, there's a reasonable chance that the source page is
present anyway (presumably you want to write out data you've just constructed
or modified), in which case it's probably not worth the complexity.

David

2021-06-22 18:25:10

by Matthew Wilcox

[permalink] [raw]
Subject: Re: Do we need to unrevert "fs: do not prefault sys_write() user buffer pages"?

On Tue, Jun 22, 2021 at 11:07:56AM -0700, Linus Torvalds wrote:
> On Tue, Jun 22, 2021 at 11:05 AM Matthew Wilcox <[email protected]> wrote:
> >
> > Huh? Last I checked, the fault_in_readable actually read a byte from
> > the page. It has to wait for the read to complete before that can
> > happen.
>
> Yeah, we don't have any kind of async fault-in model.
>
> I'm not sure how that would even look. I don't think it would
> necessarily be *impossible* (special marker in the exception table to
> let the fault code know that this is a "prepare" fault), but it would
> be pretty challenging.

It wouldn't be _that_ bad necessarily. filemap_fault:

page = find_get_page(mapping, offset);
...
} else if (!page) {
fpin = do_sync_mmap_readahead(vmf);

... and we could return at that point if the flag was set. There'd be
some more details to fill in (if there's a !uptodate page in the page
cache, don't wait for it), but it might not be too bad.

2021-06-22 18:29:22

by Linus Torvalds

[permalink] [raw]
Subject: Re: Do we need to unrevert "fs: do not prefault sys_write() user buffer pages"?

On Tue, Jun 22, 2021 at 11:23 AM Matthew Wilcox <[email protected]> wrote:
>
> It wouldn't be _that_ bad necessarily. filemap_fault:

It's not actually the mm code that is the biggest problem. We
obviously already have readahead support.

It's the *fault* side.

In particular, since the fault would return without actually filling
in the page table entry (because the page isn't ready yet, and you
cannot expose it to other threads!), you also have to jump over the
instruction that caused this all.

Doable? Oh, absolutely. But you basically need a new inline asm thing
for each architecture, so that the architecture knows how to skip the
instruction that caused the page fault, and the new exception table
entry type to say 'this is that magic VM prefetch instruction".

So filemap_fault() is the least of the problems.

Honestly, it doesn't look _complicated_, but it does look like a fair
number of small details..

Linus

2021-06-22 18:33:11

by Linus Torvalds

[permalink] [raw]
Subject: Re: Do we need to unrevert "fs: do not prefault sys_write() user buffer pages"?

On Tue, Jun 22, 2021 at 11:23 AM David Howells <[email protected]> wrote:
>
> Probably the most obvious way would be to set a flag in task_struct saying
> what you're doing and have the point that would otherwise wait for the page to
> become unlocked skip to the fault fixup code if the page is locked after
> ->readahead() has been invoked and the flag is set, then use get_user() in
> iov_iter_fault_in_readable().

Yeah, the existing user access exception handling code _almost_
handles it, except for one thing: you'd need to have some way to
distinguish between "prefetch successful" and "fault failed".

And yeah, I guess it could be a flag in task_struct, but at that point
my gag reflex starts acting up.

Linus