2020-09-21 21:19:39

by Peter Xu

Subject: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

(Commit message collected from Jason Gunthorpe)

Reduce the chance of false positives from page_maybe_dma_pinned() by keeping
track of whether the mm_struct has ever been used with pin_user_pages().
mm_structs that have never been passed to pin_user_pages() cannot have a
positive page_maybe_dma_pinned() by definition. This allows cases that might
drive up the page ref_count to avoid any penalty from handling dma_pinned
pages.

Due to complexities with unpinning, this trivial version is a permanent sticky
bit; future work will be needed to make this a counter.
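
(Illustrative sketch only, not part of this patch: a later consumer is
expected to gate the expensive handling on the new field, roughly like

	if (READ_ONCE(src_mm->has_pinned) && page_maybe_dma_pinned(page)) {
		/* treat the page as possibly DMA-pinned, e.g. copy it
		 * rather than sharing it write-protected */
	}

so that mm_structs which never pinned anything skip the dma_pinned
handling entirely.)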

Suggested-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
include/linux/mm_types.h | 10 ++++++++++
kernel/fork.c | 1 +
mm/gup.c | 6 ++++++
3 files changed, 17 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 496c3ff97cce..6f291f8b74c6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -441,6 +441,16 @@ struct mm_struct {
 #endif
 		int map_count;			/* number of VMAs */
 
+		/**
+		 * @has_pinned: Whether this mm has pinned any pages. This can
+		 * either be replaced in the future by @pinned_vm when it
+		 * becomes stable, or grow into a counter on its own. We're
+		 * aggressive on this bit now - even if the pinned pages were
+		 * unpinned later on, we'll still keep this bit set for the
+		 * lifecycle of this mm just for simplicity.
+		 */
+		int has_pinned;
+
 		spinlock_t page_table_lock; /* Protects page tables and some
 					     * counters
 					     */
diff --git a/kernel/fork.c b/kernel/fork.c
index 49677d668de4..7237d418e7b5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1011,6 +1011,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	mm_pgtables_bytes_init(mm);
 	mm->map_count = 0;
 	mm->locked_vm = 0;
+	mm->has_pinned = 0;
 	atomic64_set(&mm->pinned_vm, 0);
 	memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
 	spin_lock_init(&mm->page_table_lock);
diff --git a/mm/gup.c b/mm/gup.c
index e5739a1974d5..2d9019bf1773 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1255,6 +1255,9 @@ static __always_inline long __get_user_pages_locked(struct mm_struct *mm,
 		BUG_ON(*locked != 1);
 	}
 
+	if (flags & FOLL_PIN)
+		WRITE_ONCE(mm->has_pinned, 1);
+
 	/*
 	 * FOLL_PIN and FOLL_GET are mutually exclusive. Traditional behavior
 	 * is to set FOLL_GET if the caller wants pages[] filled in (but has
@@ -2660,6 +2663,9 @@ static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
 				       FOLL_FAST_ONLY)))
 		return -EINVAL;
 
+	if (gup_flags & FOLL_PIN)
+		WRITE_ONCE(current->mm->has_pinned, 1);
+
 	if (!(gup_flags & FOLL_FAST_ONLY))
 		might_lock_read(&current->mm->mmap_lock);

--
2.26.2


2020-09-21 21:46:53

by Jann Horn

Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Mon, Sep 21, 2020 at 11:17 PM Peter Xu <[email protected]> wrote:
>
> (Commit message collected from Jason Gunthorpe)
>
> Reduce the chance of false positive from page_maybe_dma_pinned() by keeping
> track if the mm_struct has ever been used with pin_user_pages(). mm_structs
> that have never been passed to pin_user_pages() cannot have a positive
> page_maybe_dma_pinned() by definition.

There are some caveats here, right? E.g. this isn't necessarily true
for pagecache pages, I think?

2020-09-21 23:50:31

by Peter Xu

Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Mon, Sep 21, 2020 at 11:43:38PM +0200, Jann Horn wrote:
> On Mon, Sep 21, 2020 at 11:17 PM Peter Xu <[email protected]> wrote:
> >
> > (Commit message collected from Jason Gunthorpe)
> >
> > Reduce the chance of false positive from page_maybe_dma_pinned() by keeping
> > track if the mm_struct has ever been used with pin_user_pages(). mm_structs
> > that have never been passed to pin_user_pages() cannot have a positive
> > page_maybe_dma_pinned() by definition.
>
> There are some caveats here, right? E.g. this isn't necessarily true
> for pagecache pages, I think?

Sorry I didn't follow here. Could you help explain with some details?

Thanks,

--
Peter Xu

2020-09-22 00:22:05

by Jann Horn

Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Tue, Sep 22, 2020 at 12:30 AM Peter Xu <[email protected]> wrote:
> On Mon, Sep 21, 2020 at 11:43:38PM +0200, Jann Horn wrote:
> > On Mon, Sep 21, 2020 at 11:17 PM Peter Xu <[email protected]> wrote:
> > > (Commit message collected from Jason Gunthorpe)
> > >
> > > Reduce the chance of false positive from page_maybe_dma_pinned() by keeping
> > > track if the mm_struct has ever been used with pin_user_pages(). mm_structs
> > > that have never been passed to pin_user_pages() cannot have a positive
> > > page_maybe_dma_pinned() by definition.
> >
> > There are some caveats here, right? E.g. this isn't necessarily true
> > for pagecache pages, I think?
>
> Sorry I didn't follow here. Could you help explain with some details?

The commit message says "mm_structs that have never been passed to
pin_user_pages() cannot have a positive page_maybe_dma_pinned() by
definition"; but that is not true for pages which may also be mapped
in a second mm and may have been passed to pin_user_pages() through
that second mm (meaning they must be writable over there and not
shared with us via CoW).

For example:

Process A:

fd_a = open("/foo/bar", O_RDWR);
mapping_a = mmap(NULL, 0x1000, PROT_READ|PROT_WRITE, MAP_SHARED, fd_a, 0);
pin_user_pages(mapping_a, 1, ...);

Process B:

fd_b = open("/foo/bar", O_RDONLY);
mapping_b = mmap(NULL, 0x1000, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd_b, 0);
*(volatile char *)mapping_b;

At this point, process B has never called pin_user_pages(), but
page_maybe_dma_pinned() on the page at mapping_b would return true.


I don't think this is a problem for the use of page_maybe_dma_pinned()
in fork(), but I do think that the commit message is not entirely
correct.

2020-09-22 00:59:42

by John Hubbard

Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On 9/21/20 2:17 PM, Peter Xu wrote:
> (Commit message collected from Jason Gunthorpe)
>
> Reduce the chance of false positive from page_maybe_dma_pinned() by keeping

Not yet, it doesn't. :) More:

> track if the mm_struct has ever been used with pin_user_pages(). mm_structs
> that have never been passed to pin_user_pages() cannot have a positive
> page_maybe_dma_pinned() by definition. This allows cases that might drive up
> the page ref_count to avoid any penalty from handling dma_pinned pages.
>
> Due to complexities with unpining this trivial version is a permanent sticky
> bit, future work will be needed to make this a counter.

How about this instead:

Subsequent patches intend to reduce the chance of false positives from
page_maybe_dma_pinned(), by also considering whether or not a page has
ever been part of an mm struct that has ever had pin_user_pages*()
applied to any of its pages.

In order to allow that, provide a boolean value (even though it's not
implemented exactly as a boolean type) within the mm struct, that is
simply set once and never cleared. This will suffice for an early, rough
implementation that fixes a few problems.

Future work is planned, to provide a more sophisticated solution, likely
involving a counter, and *not* involving something that is set and never
cleared.

>
> Suggested-by: Jason Gunthorpe <[email protected]>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> include/linux/mm_types.h | 10 ++++++++++
> kernel/fork.c | 1 +
> mm/gup.c | 6 ++++++
> 3 files changed, 17 insertions(+)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 496c3ff97cce..6f291f8b74c6 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -441,6 +441,16 @@ struct mm_struct {
> #endif
> int map_count; /* number of VMAs */
>
> + /**
> + * @has_pinned: Whether this mm has pinned any pages. This can
> + * be either replaced in the future by @pinned_vm when it
> + * becomes stable, or grow into a counter on its own. We're
> + * aggresive on this bit now - even if the pinned pages were
> + * unpinned later on, we'll still keep this bit set for the
> + * lifecycle of this mm just for simplicity.
> + */
> + int has_pinned;

I think this would be elegant as an atomic_t, and using atomic_set() and
atomic_read(), which seem even more self-documenting that what you have here.

But it's admittedly a cosmetic point, combined with my perennial fear that
I'm missing something when I look at a READ_ONCE()/WRITE_ONCE() pair. :)

It's completely OK to just ignore this comment, but I didn't want to completely
miss the opportunity to make it a tiny bit cleaner to the reader.
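
Roughly, an untested sketch of that alternative (reusing the names from
this patch; illustrative only, not a replacement patch):

	/* in struct mm_struct */
	atomic_t has_pinned;

	/* set side, e.g. in __get_user_pages_locked() */
	if (flags & FOLL_PIN)
		atomic_set(&mm->has_pinned, 1);

	/* read side, e.g. in a later fork()/copy consumer */
	if (atomic_read(&src_mm->has_pinned) && page_maybe_dma_pinned(page))
		/* ... */;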

thanks,
--
John Hubbard
NVIDIA

2020-09-22 00:59:45

by John Hubbard

Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On 9/21/20 4:53 PM, John Hubbard wrote:
> On 9/21/20 2:17 PM, Peter Xu wrote:
>> (Commit message collected from Jason Gunthorpe)
>>
>> Reduce the chance of false positive from page_maybe_dma_pinned() by keeping
>
> Not yet, it doesn't. :)  More:
>
>> track if the mm_struct has ever been used with pin_user_pages(). mm_structs
>> that have never been passed to pin_user_pages() cannot have a positive
>> page_maybe_dma_pinned() by definition. This allows cases that might drive up
>> the page ref_count to avoid any penalty from handling dma_pinned pages.
>>
>> Due to complexities with unpining this trivial version is a permanent sticky
>> bit, future work will be needed to make this a counter.
>
> How about this instead:
>
> Subsequent patches intend to reduce the chance of false positives from
> page_maybe_dma_pinned(), by also considering whether or not a page has
> even been part of an mm struct that has ever had pin_user_pages*()


arggh, correction: please make that:

"...whether or not a page is part of an mm struct that...".

(Present tense.) Otherwise, people start wondering about the checkered past
of a page's past lives, and it badly distracts from the main point here. :)


thanks,
--
John Hubbard
NVIDIA

2020-09-22 11:55:56

by Jason Gunthorpe

Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Tue, Sep 22, 2020 at 12:47:11AM +0200, Jann Horn wrote:
> On Tue, Sep 22, 2020 at 12:30 AM Peter Xu <[email protected]> wrote:
> > On Mon, Sep 21, 2020 at 11:43:38PM +0200, Jann Horn wrote:
> > > On Mon, Sep 21, 2020 at 11:17 PM Peter Xu <[email protected]> wrote:
> > > > (Commit message collected from Jason Gunthorpe)
> > > >
> > > > Reduce the chance of false positive from page_maybe_dma_pinned() by keeping
> > > > track if the mm_struct has ever been used with pin_user_pages(). mm_structs
> > > > that have never been passed to pin_user_pages() cannot have a positive
> > > > page_maybe_dma_pinned() by definition.
> > >
> > > There are some caveats here, right? E.g. this isn't necessarily true
> > > for pagecache pages, I think?
> >
> > Sorry I didn't follow here. Could you help explain with some details?
>
> The commit message says "mm_structs that have never been passed to
> pin_user_pages() cannot have a positive page_maybe_dma_pinned() by
> definition"; but that is not true for pages which may also be mapped
> in a second mm and may have been passed to pin_user_pages() through
> that second mm (meaning they must be writable over there and not
> shared with us via CoW).

The message does need a few more words to explain this trick can only
be used with COW'able pages.

> Process A:
>
> fd_a = open("/foo/bar", O_RDWR);
> mapping_a = mmap(NULL, 0x1000, PROT_READ|PROT_WRITE, MAP_SHARED, fd_a, 0);
> pin_user_pages(mapping_a, 1, ...);
>
> Process B:
>
> fd_b = open("/foo/bar", O_RDONLY);
> mapping_b = mmap(NULL, 0x1000, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd_b, 0);
> *(volatile char *)mapping_b;
>
> At this point, process B has never called pin_user_pages(), but
> page_maybe_dma_pinned() on the page at mapping_b would return true.

My expectation is the pin_user_pages() should have already broken the
COW for the MAP_PRIVATE, so process B should not have a
page_maybe_dma_pinned()

Jason

2020-09-22 14:31:51

by Peter Xu

Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Tue, Sep 22, 2020 at 08:54:36AM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 22, 2020 at 12:47:11AM +0200, Jann Horn wrote:
> > On Tue, Sep 22, 2020 at 12:30 AM Peter Xu <[email protected]> wrote:
> > > On Mon, Sep 21, 2020 at 11:43:38PM +0200, Jann Horn wrote:
> > > > On Mon, Sep 21, 2020 at 11:17 PM Peter Xu <[email protected]> wrote:
> > > > > (Commit message collected from Jason Gunthorpe)
> > > > >
> > > > > Reduce the chance of false positive from page_maybe_dma_pinned() by keeping
> > > > > track if the mm_struct has ever been used with pin_user_pages(). mm_structs
> > > > > that have never been passed to pin_user_pages() cannot have a positive
> > > > > page_maybe_dma_pinned() by definition.
> > > >
> > > > There are some caveats here, right? E.g. this isn't necessarily true
> > > > for pagecache pages, I think?
> > >
> > > Sorry I didn't follow here. Could you help explain with some details?
> >
> > The commit message says "mm_structs that have never been passed to
> > pin_user_pages() cannot have a positive page_maybe_dma_pinned() by
> > definition"; but that is not true for pages which may also be mapped
> > in a second mm and may have been passed to pin_user_pages() through
> > that second mm (meaning they must be writable over there and not
> > shared with us via CoW).
>
> The message does need a few more words to explain this trick can only
> be used with COW'able pages.
>
> > Process A:
> >
> > fd_a = open("/foo/bar", O_RDWR);
> > mapping_a = mmap(NULL, 0x1000, PROT_READ|PROT_WRITE, MAP_SHARED, fd_a, 0);
> > pin_user_pages(mapping_a, 1, ...);
> >
> > Process B:
> >
> > fd_b = open("/foo/bar", O_RDONLY);
> > mapping_b = mmap(NULL, 0x1000, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd_b, 0);
> > *(volatile char *)mapping_b;
> >
> > At this point, process B has never called pin_user_pages(), but
> > page_maybe_dma_pinned() on the page at mapping_b would return true.
>
> My expectation is the pin_user_pages() should have already broken the
> COW for the MAP_PRIVATE, so process B should not have a
> page_maybe_dma_pinned()

When process B maps with PROT_READ only (w/o PROT_WRITE) then it seems the same
page will be mapped.

I think I get the point from Jann now. Maybe it's easier if I just remove the
whole "mm_structs that have never been passed to pin_user_pages() cannot have a
positive page_maybe_dma_pinned() by definition" sentence if it's misleading,
because the rest seems clear enough about what this new field is used for.

Thanks,

--
Peter Xu

2020-09-22 15:19:38

by Peter Xu

Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Mon, Sep 21, 2020 at 04:53:38PM -0700, John Hubbard wrote:
> On 9/21/20 2:17 PM, Peter Xu wrote:
> > (Commit message collected from Jason Gunthorpe)
> >
> > Reduce the chance of false positive from page_maybe_dma_pinned() by keeping
>
> Not yet, it doesn't. :) More:
>
> > track if the mm_struct has ever been used with pin_user_pages(). mm_structs
> > that have never been passed to pin_user_pages() cannot have a positive
> > page_maybe_dma_pinned() by definition. This allows cases that might drive up
> > the page ref_count to avoid any penalty from handling dma_pinned pages.
> >
> > Due to complexities with unpining this trivial version is a permanent sticky
> > bit, future work will be needed to make this a counter.
>
> How about this instead:
>
> Subsequent patches intend to reduce the chance of false positives from
> page_maybe_dma_pinned(), by also considering whether or not a page has
> even been part of an mm struct that has ever had pin_user_pages*()
> applied to any of its pages.
>
> In order to allow that, provide a boolean value (even though it's not
> implemented exactly as a boolean type) within the mm struct, that is
> simply set once and never cleared. This will suffice for an early, rough
> implementation that fixes a few problems.
>
> Future work is planned, to provide a more sophisticated solution, likely
> involving a counter, and *not* involving something that is set and never
> cleared.

This looks good, thanks. Though I think Jason's version is good too (as long
as we remove the confusing sentence, the one starting with "mm_structs that
have never been passed... "). Before I drop Jason's version, I'd better figure
out what major thing we missed so that maybe we can add another paragraph.
E.g., "future work will be needed to make this a counter" already means
"involving a counter, and *not* involving something that is set and never
cleared" to me... because otherwise it wouldn't be called a counter.

>
> >
> > Suggested-by: Jason Gunthorpe <[email protected]>
> > Signed-off-by: Peter Xu <[email protected]>
> > ---
> > include/linux/mm_types.h | 10 ++++++++++
> > kernel/fork.c | 1 +
> > mm/gup.c | 6 ++++++
> > 3 files changed, 17 insertions(+)
> >
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 496c3ff97cce..6f291f8b74c6 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -441,6 +441,16 @@ struct mm_struct {
> > #endif
> > int map_count; /* number of VMAs */
> > + /**
> > + * @has_pinned: Whether this mm has pinned any pages. This can
> > + * be either replaced in the future by @pinned_vm when it
> > + * becomes stable, or grow into a counter on its own. We're
> > + * aggresive on this bit now - even if the pinned pages were
> > + * unpinned later on, we'll still keep this bit set for the
> > + * lifecycle of this mm just for simplicity.
> > + */
> > + int has_pinned;
>
> I think this would be elegant as an atomic_t, and using atomic_set() and
> atomic_read(), which seem even more self-documenting that what you have here.
>
> But it's admittedly a cosmetic point, combined with my perennial fear that
> I'm missing something when I look at a READ_ONCE()/WRITE_ONCE() pair. :)

Yeah but I hope I'm using it right.. :) I used READ_ONCE/WRITE_ONCE explicitly
because I think they're cheaper than atomic operations, (which will, iiuc, lock
the bus).

>
> It's completely OK to just ignore this comment, but I didn't want to completely
> miss the opportunity to make it a tiny bit cleaner to the reader.

This can always become an atomic in the future, or am I wrong? Actually if
we're going to the counter way I feel like it's a must.

Thanks,

--
Peter Xu

2020-09-22 15:57:41

by Jason Gunthorpe

Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Tue, Sep 22, 2020 at 10:28:02AM -0400, Peter Xu wrote:
> On Tue, Sep 22, 2020 at 08:54:36AM -0300, Jason Gunthorpe wrote:
> > On Tue, Sep 22, 2020 at 12:47:11AM +0200, Jann Horn wrote:
> > > On Tue, Sep 22, 2020 at 12:30 AM Peter Xu <[email protected]> wrote:
> > > > On Mon, Sep 21, 2020 at 11:43:38PM +0200, Jann Horn wrote:
> > > > > On Mon, Sep 21, 2020 at 11:17 PM Peter Xu <[email protected]> wrote:
> > > > > > (Commit message collected from Jason Gunthorpe)
> > > > > >
> > > > > > Reduce the chance of false positive from page_maybe_dma_pinned() by keeping
> > > > > > track if the mm_struct has ever been used with pin_user_pages(). mm_structs
> > > > > > that have never been passed to pin_user_pages() cannot have a positive
> > > > > > page_maybe_dma_pinned() by definition.
> > > > >
> > > > > There are some caveats here, right? E.g. this isn't necessarily true
> > > > > for pagecache pages, I think?
> > > >
> > > > Sorry I didn't follow here. Could you help explain with some details?
> > >
> > > The commit message says "mm_structs that have never been passed to
> > > pin_user_pages() cannot have a positive page_maybe_dma_pinned() by
> > > definition"; but that is not true for pages which may also be mapped
> > > in a second mm and may have been passed to pin_user_pages() through
> > > that second mm (meaning they must be writable over there and not
> > > shared with us via CoW).
> >
> > The message does need a few more words to explain this trick can only
> > be used with COW'able pages.
> >
> > > Process A:
> > >
> > > fd_a = open("/foo/bar", O_RDWR);
> > > mapping_a = mmap(NULL, 0x1000, PROT_READ|PROT_WRITE, MAP_SHARED, fd_a, 0);
> > > pin_user_pages(mapping_a, 1, ...);
> > >
> > > Process B:
> > >
> > > fd_b = open("/foo/bar", O_RDONLY);
> > > mapping_b = mmap(NULL, 0x1000, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd_b, 0);
> > > *(volatile char *)mapping_b;
> > >
> > > At this point, process B has never called pin_user_pages(), but
> > > page_maybe_dma_pinned() on the page at mapping_b would return true.
> >
> > My expectation is the pin_user_pages() should have already broken the
> > COW for the MAP_PRIVATE, so process B should not have a
> > page_maybe_dma_pinned()
>
> When process B maps with PROT_READ only (w/o PROT_WRITE) then it seems the same
> page will be mapped.

I thought MAP_PRIVATE without PROT_WRITE was nonsensical, it only has
meaning for writes initiated by the mapping. MAP_SHARED/PROT_READ is
the same behavior on Linux, IIRC.

But, yes, you certainly can end up with B having
page_maybe_dma_pinned() pages in shared VMA, just not in COW'able
mappings.
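
For instance (illustrative only, riffing on Jann's example), B sees such a
page simply by sharing it rather than COWing it:

	/* process B, mapping the same pagecache page that A pinned: */
	int fd_b = open("/foo/bar", O_RDONLY);
	char *mapping_b = mmap(NULL, 0x1000, PROT_READ, MAP_SHARED, fd_b, 0);
	*(volatile char *)mapping_b;	/* same page, so page_maybe_dma_pinned() is true */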

> I think I get the point from Jann now. Maybe it's easier I just remove the
> whole "mm_structs that have never been passed to pin_user_pages() cannot have a
> positive page_maybe_dma_pinned() by definition" sentence if that's misleading,
> because the rest seem to be clear enough on what this new field is used for.

"for COW" I think is still the important detail here, see for instance
my remark on the PUD/PMD splitting where it is necessary to test for
cow before using this.

Perhaps we should call it "has_pinned_for_cow" to place emphasis on
this detail? Due to the shared pages issue it really doesn't have any
broader utility, e.g. for file-backed pages or otherwise.

Jason

2020-09-22 16:14:22

by Jason Gunthorpe

Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Tue, Sep 22, 2020 at 11:17:36AM -0400, Peter Xu wrote:

> > But it's admittedly a cosmetic point, combined with my perennial fear that
> > I'm missing something when I look at a READ_ONCE()/WRITE_ONCE() pair. :)
>
> Yeah but I hope I'm using it right.. :) I used READ_ONCE/WRITE_ONCE explicitly
> because I think they're cheaper than atomic operations, (which will, iiuc, lock
> the bus).

It is worth thinking a bit about racing fork with
pin_user_pages(). The desired outcome is:

If fork wins the page is write protected, and pin_user_pages_fast()
will COW it.

If pin_user_pages_fast() wins then fork must see the READ_ONCE and
the pin.

As get_user_pages_fast() is lockless it looks like the ordering has to
be like this:

  pin_user_pages_fast()                  fork()
   atomic_set(has_pinned, 1);
   [..]
   atomic_add(page->_refcount)
   ordered check write protect()
                                          ordered set write protect()
                                          atomic_read(page->_refcount)
                                          atomic_read(has_pinned)

Such that in all the degenerate racy cases the outcome is that both
sides COW, never neither.

Thus I think it does have to be atomics purely from an ordering
perspective: observing an increased _refcount requires that has_pinned
!= 0 if we are pinning.

So, to make this 100% this ordering will need to be touched up too.

Jason

2020-09-22 16:33:38

by Linus Torvalds

Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Tue, Sep 22, 2020 at 8:56 AM Jason Gunthorpe <[email protected]> wrote:
>
> I thought MAP_PRIVATE without PROT_WRITE was nonsensical,

MAP_PRIVATE without PROT_WRITE is very common.

It's what happens for every executable mapping, for example.

And no, it's not the same as MAP_SHARED, for a couple of simple
reasons. It does end up being similar for all the *normal* cases, but
there are various cases where it isn't.

- mprotect() and friends. MAP_PRIVATE is fixed, but it might have
been writable in the past, and it might become writable in the future.

- breakpoints and ptrace. This will punch through even a non-writable
mapping and force a COW (since that's the whole point: executables are
not writable, but to do a SW breakpoint you have to write to the page)

So no, MAP_PRIVATE is not nonsensical without PROT_WRITE, and it's not
even remotely unusual.

Linus

2020-09-22 17:56:17

by Peter Xu

Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Tue, Sep 22, 2020 at 01:10:46PM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 22, 2020 at 11:17:36AM -0400, Peter Xu wrote:
>
> > > But it's admittedly a cosmetic point, combined with my perennial fear that
> > > I'm missing something when I look at a READ_ONCE()/WRITE_ONCE() pair. :)
> >
> > Yeah but I hope I'm using it right.. :) I used READ_ONCE/WRITE_ONCE explicitly
> > because I think they're cheaper than atomic operations, (which will, iiuc, lock
> > the bus).
>
> It is worth thinking a bit about racing fork with
> pin_user_pages(). The desired outcome is:
>
> If fork wins the page is write protected, and pin_user_pages_fast()
> will COW it.
>
> If pin_user_pages_fast() wins then fork must see the READ_ONCE and
> the pin.
>
> As get_user_pages_fast() is lockless it looks like the ordering has to
> be like this:
>
> pin_user_pages_fast() fork()
> atomic_set(has_pinned, 1);
> [..]
> atomic_add(page->_refcount)
> ordered check write protect()
> ordered set write protect()
> atomic_read(page->_refcount)
> atomic_read(has_pinned)
>
> Such that in all the degenerate racy cases the outcome is that both
> sides COW, never neither.
>
> Thus I think it does have to be atomics purely from an ordering
> perspective, observing an increased _refcount requires that has_pinned
> != 0 if we are pinning.
>
> So, to make this 100% this ordering will need to be touched up too.

Thanks for spotting this. So something like below should work, right?

diff --git a/mm/memory.c b/mm/memory.c
index 8f3521be80ca..6591f3f33299 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -888,8 +888,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		 * Because we'll need to release the locks before doing cow,
 		 * pass this work to upper layer.
 		 */
-		if (READ_ONCE(src_mm->has_pinned) && wp &&
-		    page_maybe_dma_pinned(page)) {
+		if (wp && page_maybe_dma_pinned(page) &&
+		    READ_ONCE(src_mm->has_pinned)) {
 			/* We've got the page already; we're safe */
 			data->cow_old_page = page;
 			data->cow_oldpte = *src_pte;

I can also add some more comment to emphasize this.

I think the WRITE_ONCE/READ_ONCE can actually be kept, because atomic ops
should contain proper memory barriers already, so the memory access order
should be guaranteed (e.g., atomic_add() will have an implicit wmb(); rmb() for
the other side). However maybe it's even simpler to change has_pinned into an
atomic as John suggested. Thanks,

--
Peter Xu

2020-09-22 18:03:45

by John Hubbard

Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On 9/22/20 8:17 AM, Peter Xu wrote:
> On Mon, Sep 21, 2020 at 04:53:38PM -0700, John Hubbard wrote:
>> On 9/21/20 2:17 PM, Peter Xu wrote:
>>> (Commit message collected from Jason Gunthorpe)
>>>
>>> Reduce the chance of false positive from page_maybe_dma_pinned() by keeping
>>
>> Not yet, it doesn't. :) More:
>>
>>> track if the mm_struct has ever been used with pin_user_pages(). mm_structs
>>> that have never been passed to pin_user_pages() cannot have a positive
>>> page_maybe_dma_pinned() by definition. This allows cases that might drive up
>>> the page ref_count to avoid any penalty from handling dma_pinned pages.
>>>
>>> Due to complexities with unpining this trivial version is a permanent sticky
>>> bit, future work will be needed to make this a counter.
>>
>> How about this instead:
>>
>> Subsequent patches intend to reduce the chance of false positives from
>> page_maybe_dma_pinned(), by also considering whether or not a page has
>> even been part of an mm struct that has ever had pin_user_pages*()
>> applied to any of its pages.
>>
>> In order to allow that, provide a boolean value (even though it's not
>> implemented exactly as a boolean type) within the mm struct, that is
>> simply set once and never cleared. This will suffice for an early, rough
>> implementation that fixes a few problems.
>>
>> Future work is planned, to provide a more sophisticated solution, likely
>> involving a counter, and *not* involving something that is set and never
>> cleared.
>
> This looks good, thanks. Though I think Jason's version is good too (as long
> as we remove the confusing sentence, that's the one starting with "mm_structs
> that have never been passed... "). Before I drop Jason's version, I think I'd
> better figure out what's the major thing we missed so that maybe we can add
> another paragraph. E.g., "future work will be needed to make this a counter"
> already means "involving a counter, and *not* involving something that is set
> and never cleared" to me... Because otherwise it won't be called a counter..
>

That's just a bit of harmless redundancy, intended to help clarify where this
is going. But if the redundancy isn't actually helping, you could simply
truncate it to the first half of the sentence, like this:

"Future work is planned, to provide a more sophisticated solution, likely
involving a counter."


thanks,
--
John Hubbard
NVIDIA

2020-09-22 18:17:24

by Peter Xu

Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Tue, Sep 22, 2020 at 11:02:03AM -0700, John Hubbard wrote:
> On 9/22/20 8:17 AM, Peter Xu wrote:
> > On Mon, Sep 21, 2020 at 04:53:38PM -0700, John Hubbard wrote:
> > > On 9/21/20 2:17 PM, Peter Xu wrote:
> > > > (Commit message collected from Jason Gunthorpe)
> > > >
> > > > Reduce the chance of false positive from page_maybe_dma_pinned() by keeping
> > >
> > > Not yet, it doesn't. :) More:
> > >
> > > > track if the mm_struct has ever been used with pin_user_pages(). mm_structs
> > > > that have never been passed to pin_user_pages() cannot have a positive
> > > > page_maybe_dma_pinned() by definition. This allows cases that might drive up
> > > > the page ref_count to avoid any penalty from handling dma_pinned pages.
> > > >
> > > > Due to complexities with unpining this trivial version is a permanent sticky
> > > > bit, future work will be needed to make this a counter.
> > >
> > > How about this instead:
> > >
> > > Subsequent patches intend to reduce the chance of false positives from
> > > page_maybe_dma_pinned(), by also considering whether or not a page has
> > > even been part of an mm struct that has ever had pin_user_pages*()
> > > applied to any of its pages.
> > >
> > > In order to allow that, provide a boolean value (even though it's not
> > > implemented exactly as a boolean type) within the mm struct, that is
> > > simply set once and never cleared. This will suffice for an early, rough
> > > implementation that fixes a few problems.
> > >
> > > Future work is planned, to provide a more sophisticated solution, likely
> > > involving a counter, and *not* involving something that is set and never
> > > cleared.
> >
> > This looks good, thanks. Though I think Jason's version is good too (as long
> > as we remove the confusing sentence, that's the one starting with "mm_structs
> > that have never been passed... "). Before I drop Jason's version, I think I'd
> > better figure out what's the major thing we missed so that maybe we can add
> > another paragraph. E.g., "future work will be needed to make this a counter"
> > already means "involving a counter, and *not* involving something that is set
> > and never cleared" to me... Because otherwise it won't be called a counter..
> >
>
> That's just a bit of harmless redundancy, intended to help clarify where this
> is going. But if the redundancy isn't actually helping, you could simply
> truncate it to the first half of the sentence, like this:
>
> "Future work is planned, to provide a more sophisticated solution, likely
> involving a counter."

Will do. Thanks.

--
Peter Xu

2020-09-22 19:12:44

by John Hubbard

Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On 9/22/20 8:17 AM, Peter Xu wrote:
> On Mon, Sep 21, 2020 at 04:53:38PM -0700, John Hubbard wrote:
>> On 9/21/20 2:17 PM, Peter Xu wrote:
...
>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>>> index 496c3ff97cce..6f291f8b74c6 100644
>>> --- a/include/linux/mm_types.h
>>> +++ b/include/linux/mm_types.h
>>> @@ -441,6 +441,16 @@ struct mm_struct {
>>> #endif
>>> int map_count; /* number of VMAs */
>>> + /**
>>> + * @has_pinned: Whether this mm has pinned any pages. This can
>>> + * be either replaced in the future by @pinned_vm when it
>>> + * becomes stable, or grow into a counter on its own. We're
>>> + * aggresive on this bit now - even if the pinned pages were
>>> + * unpinned later on, we'll still keep this bit set for the
>>> + * lifecycle of this mm just for simplicity.
>>> + */
>>> + int has_pinned;
>>
>> I think this would be elegant as an atomic_t, and using atomic_set() and
>> atomic_read(), which seem even more self-documenting that what you have here.
>>
>> But it's admittedly a cosmetic point, combined with my perennial fear that
>> I'm missing something when I look at a READ_ONCE()/WRITE_ONCE() pair. :)
>
> Yeah but I hope I'm using it right.. :) I used READ_ONCE/WRITE_ONCE explicitly
> because I think they're cheaper than atomic operations, (which will, iiuc, lock
> the bus).
>

The "cheaper" argument vanishes if the longer-term plan is to use atomic ops.
Given the intent of this patchset, simpler is better, and "simpler that has the
same performance as the longer term solution" is definitely OK.

>>
>> It's completely OK to just ignore this comment, but I didn't want to completely
>> miss the opportunity to make it a tiny bit cleaner to the reader.
>
> This can always become an atomic in the future, or am I wrong? Actually if
> we're going to the counter way I feel like it's a must.
>

Yes it can change. But you can get the simplicity and clarity now, rather
than waiting.

thanks,
--
John Hubbard
NVIDIA

2020-09-22 19:13:33

by Jason Gunthorpe

Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Tue, Sep 22, 2020 at 01:54:15PM -0400, Peter Xu wrote:
> diff --git a/mm/memory.c b/mm/memory.c
> index 8f3521be80ca..6591f3f33299 100644
> +++ b/mm/memory.c
> @@ -888,8 +888,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> * Because we'll need to release the locks before doing cow,
> * pass this work to upper layer.
> */
> - if (READ_ONCE(src_mm->has_pinned) && wp &&
> - page_maybe_dma_pinned(page)) {
> + if (wp && page_maybe_dma_pinned(page) &&
> + READ_ONCE(src_mm->has_pinned)) {
> /* We've got the page already; we're safe */
> data->cow_old_page = page;
> data->cow_oldpte = *src_pte;
>
> I can also add some more comment to emphasize this.

It is not just that, but the ptep_set_wrprotect() has to be done
earlier.

Otherwise it races like:

  pin_user_pages_fast()                  fork()
   atomic_set(has_pinned, 1);
   [..]
                                          atomic_read(page->_refcount) //false
                                          // skipped atomic_read(has_pinned)
   atomic_add(page->_refcount)
   ordered check write protect()
                                          ordered set write protect()

And now we have a write protect on a DMA pinned page, which breaks the
invariant we are trying to create.

The best algorithm I've thought of is something like:

  pte_map_lock()
  if (page) {
      if (wp) {
           ptep_set_wrprotect()
           /* Order with try_grab_compound_head(), either we see
            * page_maybe_dma_pinned(), or they see the wrprotect */
           get_page();

           if (page_maybe_dma_pinned() && READ_ONCE(src_mm->has_pinned)) {
                put_page();
                ptep_clear_wrprotect()

                // do copy
                return
           }
      } else {
           get_page();
      }
      page_dup_rmap()
  }
  pte_unmap_lock()

Then the do_wp_page() path would have to detect that the page is not
write protected under the pte lock inside the fault handler and just
do nothing. Ie the set/clear could be visible to the CPU and trigger a
spurious fault, but never trigger a COW.

Thus 'wp' becomes a 'lock' that prevents GUP from returning this page.

Very tricky, deserves a huge comment near the ptep_clear_wrprotect()

Consider the above algorithm beside the gup_fast() algorithm:

	if (!pte_access_permitted(pte, flags & FOLL_WRITE))
		goto pte_unmap;
	[..]
	head = try_grab_compound_head(page, 1, flags);
	if (!head)
		goto pte_unmap;
	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
		put_compound_head(head, 1, flags);
		goto pte_unmap;
	}

That last *ptep will check that the WP is not set after making
page_maybe_dma_pinned() true.

It still looks reasonable, the extra work is still just the additional
atomic in page_maybe_dma_pinned(), just everything else has to be very
carefully sequenced due to unlocked page table accessors.

> I think the WRITE_ONCE/READ_ONCE can actually be kept, because atomic ops
> should contain proper memory barriers already so the memory access orders
> should be guaranteed

I always have to carefully check ORDERING in
Documentation/atomic_t.txt when asking those questions..

It seems very subtle to me, but yes, try_grab_compound_head() and
page_maybe_dma_pinned() are already paired ordering barriers, so both
the pte_val() on the GUP side and the READ_ONCE(has_pinned) look OK.

Jason

2020-09-23 01:02:09

by Peter Xu

Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Tue, Sep 22, 2020 at 04:11:16PM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 22, 2020 at 01:54:15PM -0400, Peter Xu wrote:
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 8f3521be80ca..6591f3f33299 100644
> > +++ b/mm/memory.c
> > @@ -888,8 +888,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > * Because we'll need to release the locks before doing cow,
> > * pass this work to upper layer.
> > */
> > - if (READ_ONCE(src_mm->has_pinned) && wp &&
> > - page_maybe_dma_pinned(page)) {
> > + if (wp && page_maybe_dma_pinned(page) &&
> > + READ_ONCE(src_mm->has_pinned)) {
> > /* We've got the page already; we're safe */
> > data->cow_old_page = page;
> > data->cow_oldpte = *src_pte;
> >
> > I can also add some more comment to emphasize this.
>
> It is not just that, but the ptep_set_wrprotect() has to be done
> earlier.

Now I understand your point, I think.. So I guess it's not only about
has_pinned, but it should be a race between the fast-gup and the fork() code,
even if has_pinned is always set.

>
> Otherwise it races like:
>
> pin_user_pages_fast() fork()
> atomic_set(has_pinned, 1);
> [..]
> atomic_read(page->_refcount) //false
> // skipped atomic_read(has_pinned)
> atomic_add(page->_refcount)
> ordered check write protect()
> ordered set write protect()
>
> And now have a write protect on a DMA pinned page, which is the
> invarient we are trying to create.
>
> The best algorithm I've thought of is something like:
>
> pte_map_lock()
> if (page) {
> if (wp) {
> ptep_set_wrprotect()
> /* Order with try_grab_compound_head(), either we see
> * page_maybe_dma_pinned(), or they see the wrprotect */
> get_page();

Is this get_page() a must to be after ptep_set_wrprotect() explicitly? IIUC
what we need is to order ptep_set_wrprotect() and page_maybe_dma_pinned() here.
E.g., would a "mb()" work?

Another thing is, do we need similar thing for e.g. gup_pte_range(), so that
to guarantee ordering of try_grab_compound_head() and the pte change check?

>
> if (page_maybe_dma_pinned() && READ_ONCE(src_mm->has_pinned)) {
> put_page();
> ptep_clear_wrprotect()
>
> // do copy
> return
> }
> } else {
> get_page();
> }
> page_dup_rmap()
> pte_unmap_lock()
>
> Then the do_wp_page() path would have to detect that the page is not
> write protected under the pte lock inside the fault handler and just
> do nothing.

Yes, iiuc do_wp_page() should be able to handle spurious write page faults like
this already, as below:

	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
	spin_lock(vmf->ptl);
	...
	if (vmf->flags & FAULT_FLAG_WRITE) {
		if (!pte_write(entry))
			return do_wp_page(vmf);
		entry = pte_mkdirty(entry);
	}

So when spin_lock() returns:

- When it's a real cow (not pinned pages; we write-protected it and it keeps
write-protected), we should do cow here as usual.

- When it's a fake cow (pinned pages), the write bit should have been
recovered before the page table lock is released, and we'll skip do_wp_page()
and retry the page fault immediately.

> Ie the set/clear could be visible to the CPU and trigger a
> spurious fault, but never trigger a COW.
>
> Thus 'wp' becomes a 'lock' that prevents GUP from returning this page.

Another question is, how about read fast-gup for pinning? Because we can't use
the write-protect mechanism to block a read gup. I remember we've discussed
similar things and iirc your point is "pinned pages should always be with
WRITE". However now I still doubt it... Because I feel like read gup is still
legal (as I mentioned previously - when device purely writes to the page and
the processor only reads from it).

>
> Very tricky, deserves a huge comment near the ptep_clear_wrprotect()
>
> Consider the above algorithm beside the gup_fast() algorithm:
>
> if (!pte_access_permitted(pte, flags & FOLL_WRITE))
> goto pte_unmap;
> [..]
> head = try_grab_compound_head(page, 1, flags);
> if (!head)
> goto pte_unmap;
> if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> put_compound_head(head, 1, flags);
> goto pte_unmap;
>
> That last *ptep will check that the WP is not set after making
> page_maybe_dma_pinned() true.
>
> It still looks reasonable, the extra work is still just the additional
> atomic in page_maybe_dma_pinned(), just everything else has to be very
> carefully sequenced due to unlocked page table accessors.

Tricky! I'm still thinking about an easier way, but no real clue so far.
Hopefully we'll figure out something solid soon.

Thanks,

--
Peter Xu

2020-09-23 13:13:06

by Peter Xu

Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Tue, Sep 22, 2020 at 08:27:35PM -0400, Peter Xu wrote:
> On Tue, Sep 22, 2020 at 04:11:16PM -0300, Jason Gunthorpe wrote:
> > On Tue, Sep 22, 2020 at 01:54:15PM -0400, Peter Xu wrote:
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 8f3521be80ca..6591f3f33299 100644
> > > +++ b/mm/memory.c
> > > @@ -888,8 +888,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > > * Because we'll need to release the locks before doing cow,
> > > * pass this work to upper layer.
> > > */
> > > - if (READ_ONCE(src_mm->has_pinned) && wp &&
> > > - page_maybe_dma_pinned(page)) {
> > > + if (wp && page_maybe_dma_pinned(page) &&
> > > + READ_ONCE(src_mm->has_pinned)) {
> > > /* We've got the page already; we're safe */
> > > data->cow_old_page = page;
> > > data->cow_oldpte = *src_pte;
> > >
> > > I can also add some more comment to emphasize this.
> >
> > It is not just that, but the ptep_set_wrprotect() has to be done
> > earlier.
>
> Now I understand your point, I think.. So I guess it's not only about
> has_pinned, but it should be a race between the fast-gup and the fork() code,
> even if has_pinned is always set.
>
> >
> > Otherwise it races like:
> >
> > pin_user_pages_fast() fork()
> > atomic_set(has_pinned, 1);
> > [..]
> > atomic_read(page->_refcount) //false
> > // skipped atomic_read(has_pinned)
> > atomic_add(page->_refcount)
> > ordered check write protect()
> > ordered set write protect()
> >
> > And now have a write protect on a DMA pinned page, which is the
> > invarient we are trying to create.
> >
> > The best algorithm I've thought of is something like:
> >
> > pte_map_lock()
> > if (page) {
> > if (wp) {
> > ptep_set_wrprotect()
> > /* Order with try_grab_compound_head(), either we see
> > * page_maybe_dma_pinned(), or they see the wrprotect */
> > get_page();
>
> Is this get_page() a must to be after ptep_set_wrprotect() explicitly? IIUC
> what we need is to order ptep_set_wrprotect() and page_maybe_dma_pinned() here.
> E.g., would a "mb()" work?
>
> Another thing is, do we need similar thing for e.g. gup_pte_range(), so that
> to guarantee ordering of try_grab_compound_head() and the pte change check?
>
> >
> > if (page_maybe_dma_pinned() && READ_ONCE(src_mm->has_pinned)) {
> > put_page();
> > ptep_clear_wrprotect()
> >
> > // do copy
> > return
> > }
> > } else {
> > get_page();
> > }
> > page_dup_rmap()
> > pte_unmap_lock()
> >
> > Then the do_wp_page() path would have to detect that the page is not
> > write protected under the pte lock inside the fault handler and just
> > do nothing.
>
> Yes, iiuc do_wp_page() should be able to handle spurious write page faults like
> this already, as below:
>
> vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
> spin_lock(vmf->ptl);
> ...
> if (vmf->flags & FAULT_FLAG_WRITE) {
> if (!pte_write(entry))
> return do_wp_page(vmf);
> entry = pte_mkdirty(entry);
> }
>
> So when spin_lock() returns:
>
> - When it's a real cow (not pinned pages; we write-protected it and it keeps
> write-protected), we should do cow here as usual.
>
> - When it's a fake cow (pinned pages), the write bit should have been
> recovered before the page table lock released, and we'll skip do_wp_page()
> and retry the page fault immediately.
>
> > Ie the set/clear could be visible to the CPU and trigger a
> > spurious fault, but never trigger a COW.
> >
> > Thus 'wp' becomes a 'lock' that prevents GUP from returning this page.
>
> Another question is, how about read fast-gup for pinning? Because we can't use
> the write-protect mechanism to block a read gup. I remember we've discussed
> similar things and iirc your point is "pinned pages should always be with
> WRITE". However now I still doubt it... Because I feel like read gup is still
> legal (as I mentioned previously - when device purely writes to the page and
> the processor only reads from it).
>
> >
> > Very tricky, deserves a huge comment near the ptep_clear_wrprotect()
> >
> > Consider the above algorithm beside the gup_fast() algorithm:
> >
> > if (!pte_access_permitted(pte, flags & FOLL_WRITE))
> > goto pte_unmap;
> > [..]
> > head = try_grab_compound_head(page, 1, flags);
> > if (!head)
> > goto pte_unmap;
> > if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> > put_compound_head(head, 1, flags);
> > goto pte_unmap;
> >
> > That last *ptep will check that the WP is not set after making
> > page_maybe_dma_pinned() true.
> >
> > It still looks reasonable, the extra work is still just the additional
> > atomic in page_maybe_dma_pinned(), just everything else has to be very
> > carefully sequenced due to unlocked page table accessors.
>
> Tricky! I'm still thinking about some easier way but no much clue so far.
> Hopefully we'll figure out something solid soon.

Hmm, how about something like below? Would this be acceptable?

------8<--------
diff --git a/mm/gup.c b/mm/gup.c
index 2d9019bf1773..698bc2b520ac 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2136,6 +2136,18 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 	struct dev_pagemap *pgmap = NULL;
 	int nr_start = *nr, ret = 0;
 	pte_t *ptep, *ptem;
+	spinlock_t *ptl = NULL;
+
+	/*
+	 * More strict with FOLL_PIN, otherwise it could race with fork(). The
+	 * page table lock guarantees that fork() will capture all the pinned
+	 * pages when dup_mm() and do proper page copy on them.
+	 */
+	if (flags & FOLL_PIN) {
+		ptl = pte_lockptr(mm, pmd);
+		if (!spin_trylock(ptl))
+			return 0;
+	}
 
 	ptem = ptep = pte_offset_map(&pmd, addr);
 	do {
@@ -2200,6 +2212,8 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 	ret = 1;
 
 pte_unmap:
+	if (ptl)
+		spin_unlock(ptl);
 	if (pgmap)
 		put_dev_pagemap(pgmap);
 	pte_unmap(ptem);
------8<--------

Both of the solutions would fail some fast-gups that might have succeeded in
the past. The latter might even fail more often (because the pmd lock is
definitely coarser than a single pte wrprotect); however afaict it's still a
very, very corner case, as it's fast-gup+FOLL_PIN+lockfail (and, not to
mention, fast-gup should be allowed to fail anyway).

To confirm it can fail, I also checked that we have only one caller of
pin_user_pages_fast_only(), which is i915_gem_userptr_get_pages():

	if (mm == current->mm) {
		pvec = kvmalloc_array(num_pages, sizeof(struct page *),
				      GFP_KERNEL |
				      __GFP_NORETRY |
				      __GFP_NOWARN);
		if (pvec) {
			/* defer to worker if malloc fails */
			if (!i915_gem_object_is_readonly(obj))
				gup_flags |= FOLL_WRITE;
			pinned = pin_user_pages_fast_only(obj->userptr.ptr,
							  num_pages, gup_flags,
							  pvec);
		}
	}

So it looks like it can fall back to something slower too, even if it's purely
unlucky. So it looks safe so far for either solution above.

--
Peter Xu

2020-09-23 14:23:56

by Jan Kara

Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Wed 23-09-20 09:10:43, Peter Xu wrote:
> On Tue, Sep 22, 2020 at 08:27:35PM -0400, Peter Xu wrote:
> > On Tue, Sep 22, 2020 at 04:11:16PM -0300, Jason Gunthorpe wrote:
> > > On Tue, Sep 22, 2020 at 01:54:15PM -0400, Peter Xu wrote:
> > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > index 8f3521be80ca..6591f3f33299 100644
> > > > +++ b/mm/memory.c
> > > > @@ -888,8 +888,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > > > * Because we'll need to release the locks before doing cow,
> > > > * pass this work to upper layer.
> > > > */
> > > > - if (READ_ONCE(src_mm->has_pinned) && wp &&
> > > > - page_maybe_dma_pinned(page)) {
> > > > + if (wp && page_maybe_dma_pinned(page) &&
> > > > + READ_ONCE(src_mm->has_pinned)) {
> > > > /* We've got the page already; we're safe */
> > > > data->cow_old_page = page;
> > > > data->cow_oldpte = *src_pte;
> > > >
> > > > I can also add some more comment to emphasize this.
> > >
> > > It is not just that, but the ptep_set_wrprotect() has to be done
> > > earlier.
> >
> > Now I understand your point, I think.. So I guess it's not only about
> > has_pinned, but it should be a race between the fast-gup and the fork() code,
> > even if has_pinned is always set.
> >
> > >
> > > Otherwise it races like:
> > >
> > > pin_user_pages_fast() fork()
> > > atomic_set(has_pinned, 1);
> > > [..]
> > > atomic_read(page->_refcount) //false
> > > // skipped atomic_read(has_pinned)
> > > atomic_add(page->_refcount)
> > > ordered check write protect()
> > > ordered set write protect()
> > >
> > > And now have a write protect on a DMA pinned page, which is the
> > > invarient we are trying to create.
> > >
> > > The best algorithm I've thought of is something like:
> > >
> > > pte_map_lock()
> > > if (page) {
> > > if (wp) {
> > > ptep_set_wrprotect()
> > > /* Order with try_grab_compound_head(), either we see
> > > * page_maybe_dma_pinned(), or they see the wrprotect */
> > > get_page();
> >
> > Is this get_page() a must to be after ptep_set_wrprotect() explicitly? IIUC
> > what we need is to order ptep_set_wrprotect() and page_maybe_dma_pinned() here.
> > E.g., would a "mb()" work?
> >
> > Another thing is, do we need similar thing for e.g. gup_pte_range(), so that
> > to guarantee ordering of try_grab_compound_head() and the pte change check?
> >
> > >
> > > if (page_maybe_dma_pinned() && READ_ONCE(src_mm->has_pinned)) {
> > > put_page();
> > > ptep_clear_wrprotect()
> > >
> > > // do copy
> > > return
> > > }
> > > } else {
> > > get_page();
> > > }
> > > page_dup_rmap()
> > > pte_unmap_lock()
> > >
> > > Then the do_wp_page() path would have to detect that the page is not
> > > write protected under the pte lock inside the fault handler and just
> > > do nothing.
> >
> > Yes, iiuc do_wp_page() should be able to handle spurious write page faults like
> > this already, as below:
> >
> > vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
> > spin_lock(vmf->ptl);
> > ...
> > if (vmf->flags & FAULT_FLAG_WRITE) {
> > if (!pte_write(entry))
> > return do_wp_page(vmf);
> > entry = pte_mkdirty(entry);
> > }
> >
> > So when spin_lock() returns:
> >
> > - When it's a real cow (not pinned pages; we write-protected it and it keeps
> > write-protected), we should do cow here as usual.
> >
> > - When it's a fake cow (pinned pages), the write bit should have been
> > recovered before the page table lock released, and we'll skip do_wp_page()
> > and retry the page fault immediately.
> >
> > > Ie the set/clear could be visible to the CPU and trigger a
> > > spurious fault, but never trigger a COW.
> > >
> > > Thus 'wp' becomes a 'lock' that prevents GUP from returning this page.
> >
> > Another question is, how about read fast-gup for pinning? Because we can't use
> > the write-protect mechanism to block a read gup. I remember we've discussed
> > similar things and iirc your point is "pinned pages should always be with
> > WRITE". However now I still doubt it... Because I feel like read gup is still
> > legal (as I mentioned previously - when device purely writes to the page and
> > the processor only reads from it).
> >
> > >
> > > Very tricky, deserves a huge comment near the ptep_clear_wrprotect()
> > >
> > > Consider the above algorithm beside the gup_fast() algorithm:
> > >
> > > if (!pte_access_permitted(pte, flags & FOLL_WRITE))
> > > goto pte_unmap;
> > > [..]
> > > head = try_grab_compound_head(page, 1, flags);
> > > if (!head)
> > > goto pte_unmap;
> > > if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> > > put_compound_head(head, 1, flags);
> > > goto pte_unmap;
> > >
> > > That last *ptep will check that the WP is not set after making
> > > page_maybe_dma_pinned() true.
> > >
> > > It still looks reasonable, the extra work is still just the additional
> > > atomic in page_maybe_dma_pinned(), just everything else has to be very
> > > carefully sequenced due to unlocked page table accessors.
> >
> > Tricky! I'm still thinking about some easier way but no much clue so far.
> > Hopefully we'll figure out something solid soon.
>
> Hmm, how about something like below? Would this be acceptable?
>
> ------8<--------
> diff --git a/mm/gup.c b/mm/gup.c
> index 2d9019bf1773..698bc2b520ac 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2136,6 +2136,18 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> struct dev_pagemap *pgmap = NULL;
> int nr_start = *nr, ret = 0;
> pte_t *ptep, *ptem;
> + spinlock_t *ptl = NULL;
> +
> + /*
> + * More strict with FOLL_PIN, otherwise it could race with fork(). The
> + * page table lock guarantees that fork() will capture all the pinned
> + * pages when dup_mm() and do proper page copy on them.
> + */
> + if (flags & FOLL_PIN) {
> + ptl = pte_lockptr(mm, pmd);
> + if (!spin_trylock(ptl))
> + return 0;
> + }

I'd hate to take a spinlock in the GUP-fast path. Also I don't think this is
quite correct because GUP-fast-only can be called from interrupt context
and page table locks are not interrupt safe. That being said, I don't see
what's wrong with the solution Jason proposed of first setting the writeprotect
and then checking page_maybe_dma_pinned() during fork(). That should work
just fine AFAICT... BTW note that the GUP-fast code (and this is deliberate
because e.g. DAX depends on it) first updates page->_refcount and then
rechecks that the PTE didn't change, and the page->_refcount update is actually
done using atomic_add_unless() so that it cannot be reordered wrt the PTE
check. So the fork() code only needs to add barriers to pair with this.
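
E.g. something like the below on the fork() side (just a sketch of what I
mean, not a tested patch; field and helper names as in this series):

	/* copy_one_pte(): write protect first, then sample the pin state */
	ptep_set_wrprotect(src_mm, addr, src_pte);
	smp_mb();	/* pairs with the refcount bump + PTE recheck in gup_pte_range() */
	if (READ_ONCE(src_mm->has_pinned) && page_maybe_dma_pinned(page)) {
		/* page may be DMA-pinned: copy it instead of sharing it */
	}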

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2020-09-23 17:09:30

by Jason Gunthorpe

Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Tue, Sep 22, 2020 at 08:27:35PM -0400, Peter Xu wrote:
> On Tue, Sep 22, 2020 at 04:11:16PM -0300, Jason Gunthorpe wrote:
> > On Tue, Sep 22, 2020 at 01:54:15PM -0400, Peter Xu wrote:
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 8f3521be80ca..6591f3f33299 100644
> > > +++ b/mm/memory.c
> > > @@ -888,8 +888,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > > * Because we'll need to release the locks before doing cow,
> > > * pass this work to upper layer.
> > > */
> > > - if (READ_ONCE(src_mm->has_pinned) && wp &&
> > > - page_maybe_dma_pinned(page)) {
> > > + if (wp && page_maybe_dma_pinned(page) &&
> > > + READ_ONCE(src_mm->has_pinned)) {
> > > /* We've got the page already; we're safe */
> > > data->cow_old_page = page;
> > > data->cow_oldpte = *src_pte;
> > >
> > > I can also add some more comment to emphasize this.
> >
> > It is not just that, but the ptep_set_wrprotect() has to be done
> > earlier.
>
> Now I understand your point, I think.. So I guess it's not only about
> has_pinned, but it should be a race between the fast-gup and the fork() code,
> even if has_pinned is always set.

Yes

> > The best algorithm I've thought of is something like:
> >
> > pte_map_lock()
> > if (page) {
> > if (wp) {
> > ptep_set_wrprotect()
> > /* Order with try_grab_compound_head(), either we see
> > * page_maybe_dma_pinned(), or they see the wrprotect */
> > get_page();
>
> Is this get_page() a must to be after ptep_set_wrprotect()
> explicitly?

No, just before page_maybe_dma_pinned()

> IIUC what we need is to order ptep_set_wrprotect() and
> page_maybe_dma_pinned() here. E.g., would a "mb()" work?

mb() is not needed because page_maybe_dma_pinned() has an atomic
barrier too. I like to see get_page() followed immediately by
page_maybe_dma_pinned() since they are accessing the same atomic and
could be fused together someday

> Another thing is, do we need similar thing for e.g. gup_pte_range(), so that
> to guarantee ordering of try_grab_compound_head() and the pte change
> check?

gup_pte_range() is as I quoted? The gup slow path ends up in
follow_page_pte() which uses the pte lock so is OK.
>
> Another question is, how about read fast-gup for pinning? Because we can't use
> the write-protect mechanism to block a read gup. I remember we've discussed
> similar things and iirc your point is "pinned pages should always be with
> WRITE". However now I still doubt it... Because I feel like read gup is still
> legal (as I mentioned previously - when device purely writes to the page and
> the processor only reads from it).

We need a definition for what FOLL_PIN means. After this work on fork
I propose that FOLL_PIN means:

The page is in-use for DMA and the CPU PTE should not be changed
without explicit involvement of the application (eg via mmap/munmap)

If GUP encounters a read-only page during FOLL_PIN the behavior should
depend on what the fault handler would do. If the fault handler would
trigger COW and replace the PTE then it violates the above. GUP should
do the COW before pinning.

If the fault handler would SIGSEGV then GUP can keep the read-only
page and allow !FOLL_WRITE access. The PTE should not be replaced for
other reasons (though I think there is work there too).

For COW related issues the idea is the mm_struct doing the pin will
never trigger a COW. When other processes hit the COW they copy the
page into their mm and don't touch the source MM's PTE.

Today we do this roughly with FOLL_FORCE and FOLL_WRITE in the users,
but a more nuanced version and documentation would be much clearer.

Unfortunately just doing simple read GUP potentially exposes things to
various COW related data corruption races.

This is a discussion beyond this series though..

Jason

2020-09-23 17:13:33

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Wed, Sep 23, 2020 at 04:20:03PM +0200, Jan Kara wrote:

> I'd hate to take spinlock in the GUP-fast path. Also I don't think this is
> quite correct because GUP-fast-only can be called from interrupt context
> and page table locks are not interrupt safe.

Yes, IIRC, that is a key element of GUP-fast. Was it something to do
with futexes?

> and then checking page_maybe_dma_pinned() during fork(). That should work
> just fine AFAICT... BTW note that GUP-fast code is (and this is deliberate
> because e.g. DAX depends on this) first updating page->_refcount and then
> rechecking that the PTE didn't change, and the page->_refcount update is actually
> done using atomic_add_unless() so that it cannot be reordered wrt the PTE
> check. So the fork() code only needs to add barriers to pair with this.

It is not just DAX, everything needs this check.

After the page is pinned it is prevented from being freed and
recycled. After GUP has the pin it must check that the PTE still
points at the same page, otherwise it might have pinned a page that is
already freed - and that would be a use-after-free issue.

Jason

2020-09-24 13:42:54

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Wed 23-09-20 14:12:07, Jason Gunthorpe wrote:
> On Wed, Sep 23, 2020 at 04:20:03PM +0200, Jan Kara wrote:
>
> > I'd hate to take spinlock in the GUP-fast path. Also I don't think this is
> > quite correct because GUP-fast-only can be called from interrupt context
> > and page table locks are not interrupt safe.
>
> Yes, IIRC, that is a key element of GUP-fast. Was it something to do
> with futexes?

Honestly, I'm not sure.

> > and then checking page_maybe_dma_pinned() during fork(). That should work
> > just fine AFAICT... BTW note that GUP-fast code is (and this is deliberate
> > because e.g. DAX depends on this) first updating page->_refcount and then
> > rechecking that the PTE didn't change, and the page->_refcount update is actually
> > done using atomic_add_unless() so that it cannot be reordered wrt the PTE
> > check. So the fork() code only needs to add barriers to pair with this.
>
> It is not just DAX, everything needs this check.
>
> After the page is pinned it is prevented from being freed and
> recycled. After GUP has the pin it must check that the PTE still
> points at the same page, otherwise it might have pinned a page that is
> already freed - and that would be a use-after-free issue.

I don't think a page use-after-free is really the reason - we add page
reference through page_ref_add_unless(page, x, 0) - i.e., it will fail for
already freed page. It's more about being able to make sure the page is not
accessible anymore - and for that, modifying the pte and then checking the page
refcount is a *reliable* way to synchronize with GUP-fast...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2020-09-24 14:04:14

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Thu, Sep 24, 2020 at 09:44:09AM +0200, Jan Kara wrote:
> > After the page is pinned it is prevented from being freed and
> > recycled. After GUP has the pin it must check that the PTE still
> > points at the same page, otherwise it might have pinned a page that is
> > already freed - and that would be a use-after-free issue.
>
> I don't think a page use-after-free is really the reason - we add page
> reference through page_ref_add_unless(page, x, 0) - i.e., it will fail for
> already freed page.

I mean, the page could have been freed and already reallocated with a
positive refcount, so the add_unless check isn't protective.

The add_unless prevents the page from being freed. The 2nd pte read
ensures it wasn't already freed/reassigned before the pin.

If something drives the page refcount to zero then it is already
synchronized with GUP fast because of the atomic add_unless, no need
to re-check the pte for that case?? But I don't know what the DAX case
is you mentioned.

Jason

2020-09-24 14:38:11

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Wed, Sep 23, 2020 at 02:07:59PM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 22, 2020 at 08:27:35PM -0400, Peter Xu wrote:
> > On Tue, Sep 22, 2020 at 04:11:16PM -0300, Jason Gunthorpe wrote:
> > > On Tue, Sep 22, 2020 at 01:54:15PM -0400, Peter Xu wrote:
> > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > index 8f3521be80ca..6591f3f33299 100644
> > > > +++ b/mm/memory.c
> > > > @@ -888,8 +888,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > > > * Because we'll need to release the locks before doing cow,
> > > > * pass this work to upper layer.
> > > > */
> > > > - if (READ_ONCE(src_mm->has_pinned) && wp &&
> > > > - page_maybe_dma_pinned(page)) {
> > > > + if (wp && page_maybe_dma_pinned(page) &&
> > > > + READ_ONCE(src_mm->has_pinned)) {
> > > > /* We've got the page already; we're safe */
> > > > data->cow_old_page = page;
> > > > data->cow_oldpte = *src_pte;
> > > >
> > > > I can also add some more comment to emphasize this.
> > >
> > > It is not just that, but the ptep_set_wrprotect() has to be done
> > > earlier.
> >
> > Now I understand your point, I think.. So I guess it's not only about
> > has_pinned, but it should be a race between the fast-gup and the fork() code,
> > even if has_pinned is always set.
>
> Yes
>
> > > The best algorithm I've thought of is something like:
> > >
> > > pte_map_lock()
> > > if (page) {
> > > if (wp) {
> > > ptep_set_wrprotect()
> > > /* Order with try_grab_compound_head(), either we see
> > > * page_maybe_dma_pinned(), or they see the wrprotect */
> > > get_page();
> >
> > Is this get_page() a must to be after ptep_set_wrprotect()
> > explicitly?
>
> No, just before page_maybe_dma_pinned()
>
> > IIUC what we need is to order ptep_set_wrprotect() and
> > page_maybe_dma_pinned() here. E.g., would a "mb()" work?
>
> mb() is not needed because page_maybe_dma_pinned() has an atomic
> barrier too. I like to see get_page() followed immediately by
> page_maybe_dma_pinned() since they are accessing the same atomic and
> could be fused together someday

If so, I'd hope you won't disagree that I still move the get_page() out of the
"if (wp)". Not only it's a shared operation no matter whether "if (wp)" or
not, but I'm afraid it would confuse future readers on a special ordering on
the get_page() and the wrprotect(), especially with the comment above.
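
Concretely, I mean something like below (illustrative fragment only, not the
actual patch):

        if (is_cow_mapping(vm_flags) && pte_write(pte)) {
                ptep_set_wrprotect(src_mm, addr, src_pte);
                pte = pte_wrprotect(pte);
        }
        get_page(page);         /* shared by both the common and the cow path */
        /*
         * Keep the check below after the wrprotect above: either fast-gup
         * sees the write-protected pte and bails out, or we see its pin
         * here.  (page_maybe_dma_pinned() touches the same atomic as
         * get_page() and could be fused with it some day.)
         */
        if (wp && page_maybe_dma_pinned(page) && READ_ONCE(src_mm->has_pinned)) {
                /* ... pass the cow work to the upper layer ... */
        }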

>
> > Another thing is, do we need similar thing for e.g. gup_pte_range(), so that
> > to guarantee ordering of try_grab_compound_head() and the pte change
> > check?
>
> gup_pte_range() is as I quoted? The gup slow path ends up in
> follow_page_pte() which uses the pte lock so is OK.
> >
> > Another question is, how about read fast-gup for pinning? Because we can't use
> > the write-protect mechanism to block a read gup. I remember we've discussed
> > similar things and iirc your point is "pinned pages should always be with
> > WRITE". However now I still doubt it... Because I feel like read gup is still
> > legal (as I mentioned previously - when device purely writes to the page and
> > the processor only reads from it).
>
> We need a definition for what FOLL_PIN means. After this work on fork
> I propose that FOLL_PIN means:
>
> The page is in-use for DMA and the CPU PTE should not be changed
> without explicit involvement of the application (eg via mmap/munmap)
>
> If GUP encounters a read-only page during FOLL_PIN the behavior should
> depend on what the fault handler would do. If the fault handler would
> trigger COW and replace the PTE then it violates the above. GUP should
> do the COW before pinning.
>
> If the fault handler would SIGSEGV then GUP can keep the read-only
> page and allow !FOLL_WRITE access. The PTE should not be replaced for
> other reasons (though I think there is work there too).
>
> For COW related issues the idea is the mm_struct doing the pin will
> never trigger a COW. When other processes hit the COW they copy the
> page into their mm and don't touch the source MM's PTE.
>
> Today we do this roughly with FOLL_FORCE and FOLL_WRITE in the users,
> but a more nuanced version and documentation would be much clearer.
>
> Unfortunately just doing simple read GUP potentially exposes things to
> various COW related data corruption races.
>
> This is a discussion beyond this series though..

Yes. It's kind of related here on whether we can still use wrprotect() to
guard against fast-gup, though. So my understanding is that we should still at
least need the other patch [1] that I proposed in the other thread to force
break-cow for read-only gups (that patch is not only for fast-gup, of course).

But I agree that should be another bigger topic. I hope we don't need to pick
that patch up someday by another dma report on read-only pinned pages...

Regarding the solution here, I think we can also cover read-only fast-gup too
in the future - IIUC what we need to do is to make it pte_protnone() instead of
pte_wrprotect(), then in the fault handler we should identify this special
pte_protnone() against numa balancing (change_prot_numa()). I think it should
work fine too, iiuc, because I don't think we should migrate a page at all if
it's pinned for any reason...

So I think I'll focus on the wrprotect() solution for now. Thanks!

[1] https://lore.kernel.org/lkml/20200915151746.GB2949@xz-x1/

--
Peter Xu

2020-09-24 14:47:55

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Thu 24-09-20 11:02:37, Jason Gunthorpe wrote:
> On Thu, Sep 24, 2020 at 09:44:09AM +0200, Jan Kara wrote:
> > > After the page is pinned it is prevented from being freed and
> > > recycled. After GUP has the pin it must check that the PTE still
> > > points at the same page, otherwise it might have pinned a page that is
> > > already freed - and that would be a use-after-free issue.
> >
> > I don't think a page use-after-free is really the reason - we add page
> > reference through page_ref_add_unless(page, x, 0) - i.e., it will fail for
> > already freed page.
>
> I mean, the page could have been freed and already reallocated with a
> positive refcount, so the add_unless check isn't protective.
>
> The add_unless prevents the page from being freed. The 2nd pte read
> ensures it wasn't already freed/reassigned before the pin.

Ah, right!

> If something drives the page refcount to zero then it is already
> synchronized with GUP fast because of the atomic add_unless, no need
> to re-check the pte for that case?? But I don't know what the DAX case
> is you mentioned.

DAX needs to make sure no new references (including GUP-fast) can be created
for a page before truncating page from a file and freeing it.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2020-09-24 16:55:38

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Thu, Sep 24, 2020 at 10:35:17AM -0400, Peter Xu wrote:

> If so, I'd hope you won't disagree that I still move the get_page() out of the
> "if (wp)". Not only it's a shared operation no matter whether "if (wp)" or
> not, but I'm afraid it would confuse future readers on a special ordering on
> the get_page() and the wrprotect(), especially with the comment above.

Sure, you could add a comment before the page_maybe_dma_pinned that it
could be fused with get_page()

> Yes. It's kind of related here on whether we can still use wrprotect() to
> guard against fast-gup, though. So my understanding is that we should still at
> least need the other patch [1] that I proposed in the other thread to force
> break-cow for read-only gups (that patch is not only for fast-gup, of course).

Probably, I haven't intensively studied that patch, and it should go
along with edits to some of the callers..

> But I agree that should be another bigger topic. I hope we don't need to pick
> that patch up someday by another dma report on read-only pinned pages...

In RDMA we found long ago that read only pins don't work well, I think
most other places are likely the same - the problems are easy enough
to hit. Something like your COW break patch on read is really needed
to allow read-only GUP.

> Regarding the solution here, I think we can also cover read-only fast-gup too
> in the future - IIUC what we need to do is to make it pte_protnone() instead of
> pte_wrprotect(), then in the fault handler we should identify this special
> pte_protnone() against numa balancing (change_prot_numa()). I think it should
> work fine too, iiuc, because I don't think we should migrate a page at all if
> it's pinned for any reason...

With your COW breaking patch the read only fast-gup should break the
COW because of the write protect, just like for the write side. Not
seeing why we need to do something more?

Jason

2020-09-24 17:57:39

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Thu, Sep 24, 2020 at 01:51:52PM -0300, Jason Gunthorpe wrote:
> > Regarding the solution here, I think we can also cover read-only fast-gup too
> > in the future - IIUC what we need to do is to make it pte_protnone() instead of
> > pte_wrprotect(), then in the fault handler we should identify this special
> > pte_protnone() against numa balancing (change_prot_numa()). I think it should
> > work fine too, iiuc, because I don't think we should migrate a page at all if
> > it's pinned for any reason...

[1]

>
> With your COW breaking patch the read only fast-gup should break the
> COW because of the write protect, just like for the write side. Not
> seeing why we need to do something more?

Consider this sequence, in which a parent process manages to fork() a child:

buf = malloc();
// RDONLY gup
pin_user_pages(buf, !WRITE);
// pte of buf duplicated on both sides
fork();
mprotect(buf, WRITE);
*buf = 1;
// buf page replaced as cow triggered

Currently when fork() we'll happily share a pinned read-only page with the
child by copying the pte directly. However it also means that, starting from
this point, the child starts to share this pinned page with the parent. Then
if we can somehow trigger a "page unshare"/"cow", a problem could occur.

In this case I'm using cow (triggered by another mprotect()). However I'm not
sure whether this is the only way to replace the pinned page for the parent.

As a summary: imho the important thing is we should not allow any kind of
sharing of any dma page, even if it's pinned for read.

If my above understanding is correct, [1] above may provide a solution for us
(in the future) when we want to block read-only fast-gup in this patch too, just
like how we do that using wrprotect().

--
Peter Xu

2020-09-24 18:16:43

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Thu, Sep 24, 2020 at 01:55:31PM -0400, Peter Xu wrote:
> On Thu, Sep 24, 2020 at 01:51:52PM -0300, Jason Gunthorpe wrote:
> > > Regarding the solution here, I think we can also cover read-only fast-gup too
> > > in the future - IIUC what we need to do is to make it pte_protnone() instead of
> > > pte_wrprotect(), then in the fault handler we should identify this special
> > > pte_protnone() against numa balancing (change_prot_numa()). I think it should
> > > work fine too, iiuc, because I don't think we should migrate a page at all if
> > > it's pinned for any reason...
>
> [1]
>
> >
> > With your COW breaking patch the read only fast-gup should break the
> > COW because of the write protect, just like for the write side. Not
> > seeing why we need to do something more?
>
> Consider this sequence, in which a parent process manages to fork() a child:
>
> buf = malloc();
> // RDONLY gup
> pin_user_pages(buf, !WRITE);
> // pte of buf duplicated on both sides
> fork();
> mprotect(buf, WRITE);
> *buf = 1;
> // buf page replaced as cow triggered
>
> Currently when fork() we'll happily share a pinned read-only page with the
> child by copying the pte directly.

Why? This series prevents that, the page will be maybe_dma_pinned, so
fork() will copy it.

> As a summary: imho the important thing is we should not allow any kind of
> sharing of any dma page, even if it's pinned for read.

Any sharing that results in COW. MAP_SHARED is fine, for instance

My feeling is that for READ, when FOLL_PIN is used, GUP_fast will go to the
slow path any time it sees a read-only page.

The slow path will determine if it is read-only because it could be
COW'd or read-only for some other reason
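
Roughly (sketch only):

        /* in gup_pte_range(): */
        if ((flags & FOLL_PIN) && !pte_write(pte))
                goto pte_unmap;         /* let the slow path break COW or fail cleanly */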

Jason

2020-09-24 18:35:59

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Thu, Sep 24, 2020 at 03:15:01PM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 24, 2020 at 01:55:31PM -0400, Peter Xu wrote:
> > On Thu, Sep 24, 2020 at 01:51:52PM -0300, Jason Gunthorpe wrote:
> > > > Regarding the solution here, I think we can also cover read-only fast-gup too
> > > > in the future - IIUC what we need to do is to make it pte_protnone() instead of
> > > > pte_wrprotect(), then in the fault handler we should identify this special
> > > > pte_protnone() against numa balancing (change_prot_numa()). I think it should
> > > > work fine too, iiuc, because I don't think we should migrate a page at all if
> > > > it's pinned for any reason...
> >
> > [1]
> >
> > >
> > > With your COW breaking patch the read only fast-gup should break the
> > > COW because of the write protect, just like for the write side. Not
> > > seeing why we need to do something more?
> >
> > Consider this sequence, in which a parent process manages to fork() a child:
> >
> > buf = malloc();

Sorry! I think I missed something like:

mprotect(buf, !WRITE);

Here.

> > // RDONLY gup
> > pin_user_pages(buf, !WRITE);
> > // pte of buf duplicated on both sides
> > fork();
> > mprotect(buf, WRITE);
> > *buf = 1;
> > // buf page replaced as cow triggered
> >
> > Currently when fork() we'll happily share a pinned read-only page with the
> > child by copying the pte directly.
>
> Why? This series prevents that, the page will be maybe_dma_pinned, so
> fork() will copy it.

With the extra mprotect(!WRITE), I think we'll see a !pte_write() entry. Then
it'll not go into maybe_dma_pinned() at all since cow==false.
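
To make it concrete, the whole sequence as a (contrived) userspace program -
the pin itself is only a placeholder comment here, standing in for whatever
driver path ends up calling pin_user_pages() on the buffer:

#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
        size_t len = 4096;
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        pid_t pid;

        if (buf == MAP_FAILED)
                return 1;
        memset(buf, 0, len);                    /* fault the anon page in */

        mprotect(buf, len, PROT_READ);          /* pte is now !pte_write() */

        /* driver pins here: pin_user_pages(buf, ..., !FOLL_WRITE) */

        pid = fork();                           /* cow==false, so the pte (and the
                                                   pinned page) is simply shared */
        if (pid == 0) {
                sleep(1);
                _exit(0);
        }

        mprotect(buf, len, PROT_READ | PROT_WRITE);
        buf[0] = 1;                             /* wp fault copies the page: the parent
                                                   now maps a fresh page while DMA keeps
                                                   using the old, still-pinned one */
        waitpid(pid, NULL, 0);
        return 0;
}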

>
> > As a summary: imho the important thing is we should not allow any kind of
> > sharing of any dma page, even if it's pinned for read.
>
> Any sharing that results in COW. MAP_SHARED is fine, for instance

Oh right, MAP_SHARED is definitely special.

Thanks,

--
Peter Xu

2020-09-24 18:41:55

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Thu, Sep 24, 2020 at 02:34:18PM -0400, Peter Xu wrote:

> > > // RDONLY gup
> > > pin_user_pages(buf, !WRITE);
> > > // pte of buf duplicated on both sides
> > > fork();
> > > mprotect(buf, WRITE);
> > > *buf = 1;
> > > // buf page replaced as cow triggered
> > >
> > > Currently when fork() we'll happily share a pinned read-only page with the
> > > child by copying the pte directly.
> >
> > Why? This series prevents that, the page will be maybe_dma_pinned, so
> > fork() will copy it.
>
> With the extra mprotect(!WRITE), I think we'll see a !pte_write() entry. Then
> it'll not go into maybe_dma_pinned() at all since cow==false.

Hum that seems like a problem in this patch, we still need to do the
DMA pinned logic even if the pte is already write protected.

Jason

2020-09-24 21:31:41

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Thu, Sep 24, 2020 at 03:39:53PM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 24, 2020 at 02:34:18PM -0400, Peter Xu wrote:
>
> > > > // RDONLY gup
> > > > pin_user_pages(buf, !WRITE);
> > > > // pte of buf duplicated on both sides
> > > > fork();
> > > > mprotect(buf, WRITE);
> > > > *buf = 1;
> > > > // buf page replaced as cow triggered
> > > >
> > > > Currently when fork() we'll happily share a pinned read-only page with the
> > > > child by copying the pte directly.
> > >
> > > Why? This series prevents that, the page will be maybe_dma_pinned, so
> > > fork() will copy it.
> >
> > With the extra mprotect(!WRITE), I think we'll see a !pte_write() entry. Then
> > it'll not go into maybe_dma_pinned() at all since cow==false.
>
> Hum that seems like a problem in this patch, we still need to do the
> DMA pinned logic even if the pte is already write protected.

Yes I agree. I'll take care of that in the next version too.

--
Peter Xu

2020-09-25 20:35:16

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Thu, Sep 24, 2020 at 2:30 PM Peter Xu <[email protected]> wrote:
>
> > >
> > > With the extra mprotect(!WRITE), I think we'll see a !pte_write() entry. Then
> > > it'll not go into maybe_dma_pinned() at all since cow==false.
> >
> > Hum that seems like a problem in this patch, we still need to do the
> > DMA pinned logic even if the pte is already write protected.
>
> Yes I agree. I'll take care of that in the next version too.

You people seem to be worrying too much about crazy use cases.

The fact is, if people do pinning, they had better be careful
afterwards. I agree that marking things MADV_DONTFORK may not be
great, and there may be apps that do it. But honestly, if people then
do mprotect() to make a VM non-writable after pinning a page for
writing (and before the IO has completed), such an app only has itself
to blame.

So I don't think this issue is even worth worrying about. At some
point, when apps do broken things, the kernel says "you broke it, you
get to keep both pieces". Not "Oh, you're doing unreasonable things,
let me help you".

This has dragged out a lot longer than I hoped it would, and I think
it's been over-complicated.

In fact, looking at this all, I'm starting to think that we don't
actually even need the mm_struct.has_pinned logic, because we can work
with something much simpler: the page mapping count.

A pinned page will have the page count increased by
GUP_PIN_COUNTING_BIAS, and my worry was that this would be ambiguous
with the traditional "fork a lot" UNIX style behavior. And that
traditional case is obviously one of the cases we very much don't want
to slow down.

But a pinned page has _another_ thing that is special about it: the
pinning action broke COW.

So I think we can simply add a

if (page_mapcount(page) != 1)
return false;

to page_maybe_dma_pinned(), and that very naturally protects against
the "is the page count perhaps elevated due to a lot of forking?"

Because pinning forces the mapcount to 1, and while it is pinned,
nothing else should possibly increase it - since the only thing that
would increase it is fork, and the whole point is that we won't be
doing that "page_dup_rmap()" for this page (which is what increases
the mapcount).

So we actually already have a very nice flag for "this page isn't
duplicated by forking".
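
IOW, something like this (just a sketch - the rest of the helper stays as-is):

static inline bool page_maybe_dma_pinned(struct page *page)
{
        /*
         * Pinning broke COW, and fork() refuses to page_dup_rmap() a
         * pinned page, so a mapcount other than 1 means the elevated
         * refcount is just "somebody forked a lot", not a pin.
         */
        if (page_mapcount(page) != 1)
                return false;

        /* ... the existing GUP_PIN_COUNTING_BIAS refcount check ... */
        return page_ref_count(compound_head(page)) >= GUP_PIN_COUNTING_BIAS;
}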

And if we keep the existing early "ptep_set_wrprotect()", we also know
that we cannot be racing with another thread that is pinning at the
same time, because the fast-gup code won't be touching a read-only
pte.

So we'll just have to mark it writable again before we release the
page table lock, and we avoid that race too.
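
So the per-pte fork path would look something like this (again just a sketch,
not the actual patch):

        /* under the pte lock, for a writable pte in a COW mapping: */
        ptep_set_wrprotect(src_mm, addr, src_pte);      /* fast-gup won't touch a r/o pte */

        if (page_maybe_dma_pinned(page)) {
                /* ... allocate a new page, copy, and map it in the child ... */
                /*
                 * The parent keeps the pinned page and must not COW it
                 * later, so flip its pte back to writable before the pte
                 * lock is dropped.
                 */
                set_pte_at(src_mm, addr, src_pte, pte_mkwrite(*src_pte));
        } else {
                /* ordinary COW sharing: both sides keep the write-protected pte */
        }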

And honestly, since this is all getting fairly late in the rc, and it
took longer than I thought, I think we should do the GFP_ATOMIC
approach for now - not great, but since it only triggers for this case
that really should never happen anyway, I think it's probably the best
thing for 5.9, and we can improve on things later.

Linus

2020-09-25 21:09:07

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Fri, Sep 25, 2020 at 12:56 PM Linus Torvalds
<[email protected]> wrote:
>
> And honestly, since this is all getting fairly late in the rc, and it
> took longer than I thought, I think we should do the GFP_ATOMIC
> approach for now - not great, but since it only triggers for this case
> that really should never happen anyway, I think it's probably the best
> thing for 5.9, and we can improve on things later.

I'm not super-happy with this patch, but I'm throwing it out anyway, in case

(a) somebody can test it - I don't have any test cases

(b) somebody can find issues and improve on it

but it's the simplest patch I can come up with for the small-page case.

I have *NOT* tested it. I have tried to think about it, and there are
more lines of comments than there are lines of code, but that only
means that if I didn't think about some case, it's neither in the
comments nor in the code.

I'm happy to take Peter's series too, this is more of an alternative
simplified version to keep the discussion going.

Hmm? What did I miss?

Linus


Attachments:
patch (5.50 kB)

2020-09-25 21:14:54

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Fri, Sep 25, 2020 at 12:56:05PM -0700, Linus Torvalds wrote:
> So I think we can simply add a
>
> if (page_mapcount(page) != 1)
> return false;
>
> to page_maybe_dma_pinned(), and that very naturally protects against
> the "is the page count perhaps elevated due to a lot of forking?"

How about the MAP_SHARED case where the page is pinned by some process but also
shared (so mapcount can be >1)?

> And honestly, since this is all getting fairly late in the rc, and it
> took longer than I thought, I think we should do the GFP_ATOMIC
> approach for now - not great, but since it only triggers for this case
> that really should never happen anyway, I think it's probably the best
> thing for 5.9, and we can improve on things later.

Sorry for that. Maybe I should have moved even faster.

Would the ATOMIC version always work? I mean, I thought it could fail anytime,
so any fork() can start to fail for the tests too.

PS. I do plan to post a GFP_KERNEL version soon today, no matter for this
release or the next one.

--
Peter Xu

2020-09-25 22:21:32

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Fri, Sep 25, 2020 at 2:13 PM Peter Xu <[email protected]> wrote:
>
> On Fri, Sep 25, 2020 at 12:56:05PM -0700, Linus Torvalds wrote:
> > So I think we can simply add a
> >
> > if (page_mapcount(page) != 1)
> > return false;
> >
> > to page_maybe_dma_pinned(), and that very naturally protects against
> > the "is the page count perhaps elevated due to a lot of forking?"
>
> How about the MAP_SHARED case where the page is pinned by some process but also
> shared (so mapcount can be >1)?

MAP_SHARED doesn't matter, since it's not getting COW'ed anyway, and
we keep the page around regardless.

So MAP_SHARED is the easy case. We'll never get to any of this code,
because is_cow_mapping() won't be true.

You can still screw up MAP_SHARED if you do a truncate() on the
underlying file or something like that, but that most *definitely*
falls under the "you only have yourself to blame" heading.

> Would the ATOMIC version always work? I mean, I thought it could fail anytime,
> so any fork() can start to fail for the tests too.

Sure. I'm not really happy about GFP_ATOMIC, but I suspect it works in practice.

Honestly, if somebody first pins megabytes of memory, and then does a
fork(), they are doing some seriously odd and wrong things. So I think
this should be a "we will try to handle it gracefully, but your load
is broken" case.

I am still inclined to add some kind of warning to this case, but I'm
also a bit on the fence wrt the whole "convenience" issue - for some
very occasional use it's probably convenient to not have to worry
about this in user space.

Actually, what I'm even less happy about than the GFP_ATOMIC is how
much annoying boilerplate just "map anonymous page" required, with
the whole cgroup_charge, throttle, anon_rmap, lru_cache_add thing.
Looking at that patch, it all looks _fairly_ simple, but there's a lot
of details that got duplicated from the pte_none() new-page-fault case
(and that the do_cow_page() case also shares).

I understand why it happens, and there's not *that* many cases, but it
made me go "ouch, this is a lot of small details, maybe I missed
some", and I got the feeling that I should try to re-organize a few
helper functions to avoid duplicating the same basic code over and
over again.

But I decided that I wanted to keep the patch minimal and as focused
as possible, so I didn't actually do that. But we clearly have decades
of adding rules that just makes even something as "simple" as "add a
new page to a VM" fairly complex.

Also, to avoid making the patch bigger, I skipped your "pass
destination vma around" patch. I think it's the right thing
conceptually, but everything I looked at also screamed "we don't
actually care about the differences" to me.

I dunno. I'm conflicted. This really _feels_ to me like "we're so
close to just fixing this once and for all", but then I also go "maybe
we should just revert everything and do this for 5.10".

Except "reverting everything" is sadly really really problematic too.
It will fix the rdma issue, but in this case "everything" goes all the
way back to "uhhuh, we have a security issue with COW going the wrong
way". Otherwise I'd have gone that way two weeks ago already.

Linus

2020-09-26 00:46:00

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Fri, Sep 25, 2020 at 02:06:59PM -0700, Linus Torvalds wrote:
> On Fri, Sep 25, 2020 at 12:56 PM Linus Torvalds
> <[email protected]> wrote:
> >
> > And honestly, since this is all getting fairly late in the rc, and it
> > took longer than I thought, I think we should do the GFP_ATOMIC
> > approach for now - not great, but since it only triggers for this case
> > that really should never happen anyway, I think it's probably the best
> > thing for 5.9, and we can improve on things later.
>
> I'm not super-happy with this patch, but I'm throwing it out anyway, in case
>
> (a) somebody can test it - I don't have any test cases

It looks like it will work and resolve the RDMA case that triggered
this discussion. I will send it to our testers, should hear back
around Monday.

They previously said Peter's v1 patch worked, expecting the same here,
unless something unexpected hits the extra pre-conditions.

Though, we might hit the THP case and find it fails...

> (b) somebody can find issues and improve on it

The THP hunks from Peter's series looked pretty straightforward, I'd
include at least the PMD one.

As a tiny optimization, the preconditions in copy_normal_page() could
order the atomics last to try and reduce the atomics done per fork.

> I'm happy to take Peter's series too, this is more of an alternative
> simplified version to keep the discussion going.

I don't completely grok the consequences of the anon_vma check. We
can exclude file backed mappings as they are broken for pinning
anyhow, so what is left that could be MAP_PRIVATE of a non-anon_vma?

Feels obscure, probably OK. If something does break userspace could
use MAP_SHARED and be fixed.

Otherwise, I do prefer Peter's version because of the GFP_KERNEL. To
touch on your other email..

It was my hope we could move away from the "This should never
happen". From a RDMA POV this idea was sort of managable years ago,
but now I have folks writing data science/ML software in Python that
deep under the libraries use RDMA and has pinned pages. It was a
Python program that detected this regression.

Having all that "just work" regardless of what foolish stuff happens
in the Python layer is very appealing.

Jason

2020-09-26 01:19:29

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Fri, Sep 25, 2020 at 5:41 PM Jason Gunthorpe <[email protected]> wrote:
>
> I don't completely grok the consequences of the anon_vma check. We
> can exclude file backed mappings as they are broken for pinning
> anyhow, so what is left that could be MAP_PRIVATE of a non-anon_vma?

It really shouldn't ever happen.

The only way a COW vma can have a NULL anon_vma should be if it has no
pages mapped at all.

Technically I think it could happen if you only ever mapped the
special zero page in there (but that shouldn't then get to the
"vm_normal_page()").

> Otherwise, I do prefer Peter's version because of the GFP_KERNEL. To
> touch on your other email..

Yeah, no, I just hadn't seen a new version, so I started getting antsy
and that's when I decided to see what a minimal patch looks like.

I think that over the weekend I'll do Peter's version but with the
"page_mapcount() == 1" check, because I'm starting to like that
better than the mm->has_pinned.

Comments on that plan?

Linus

2020-09-26 22:30:46

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Fri, Sep 25, 2020 at 6:15 PM Linus Torvalds
<[email protected]> wrote:
>
> I think that over the weekend I'll do Peter's version but with the
> "page_mapcount() == 1" check, because I'm starting to like that
> better than the mm->has_pinned.

Actually, after the first read-through, I feel like I'll just apply
the series as-is.

But I'll look at it some more, and do another read-through and make
the final decision tomorrow.

If anybody has any concerns about the v2 patch series from Peter, holler.

Linus

2020-09-27 00:46:40

by Chen, Rong A

[permalink] [raw]
Subject: [mm] 698ac7610f: will-it-scale.per_thread_ops 8.2% improvement

Greeting,

FYI, we noticed a 8.2% improvement of will-it-scale.per_thread_ops due to commit:


commit: 698ac7610f7928ddfa44a0736e89d776579d8b82 ("[PATCH 1/5] mm: Introduce mm_struct.has_pinned")
url: https://github.com/0day-ci/linux/commits/Peter-Xu/mm-Break-COW-for-pinned-pages-during-fork/20200922-052211
base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git bcf876870b95592b52519ed4aafcf9d95999bc9c

in testcase: will-it-scale
on test machine: 96 threads Intel(R) Xeon(R) CPU @ 2.30GHz with 128G memory
with following parameters:

nr_task: 100%
mode: thread
test: mmap2
cpufreq_governor: performance
ucode: 0x4002f01

test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
test-url: https://github.com/antonblanchard/will-it-scale





Details are as below:
-------------------------------------------------------------------------------------------------->


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp run job.yaml

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase/ucode:
gcc-9/performance/x86_64-rhel-8.3/thread/100%/debian-10.4-x86_64-20200603.cgz/lkp-csl-2sp4/mmap2/will-it-scale/0x4002f01

commit:
v5.8
698ac7610f ("mm: Introduce mm_struct.has_pinned")

v5.8 698ac7610f7928ddfa44a0736e8
---------------- ---------------------------
%stddev %change %stddev
\ | \
2003 +8.2% 2168 will-it-scale.per_thread_ops
192350 +8.3% 208245 will-it-scale.workload
2643 ± 33% -36.3% 1683 meminfo.Active(file)
3.88 ± 2% -0.6 3.25 ± 2% mpstat.cpu.all.idle%
0.00 ± 3% -0.0 0.00 ± 8% mpstat.cpu.all.iowait%
307629 ± 3% +10.5% 340075 ± 3% numa-numastat.node0.local_node
15503 ± 60% +60.2% 24839 ± 25% numa-numastat.node1.other_node
161670 -58.7% 66739 ± 4% vmstat.system.cs
209406 -3.9% 201176 vmstat.system.in
364.00 ± 23% -53.8% 168.00 slabinfo.pid_namespace.active_objs
364.00 ± 23% -53.8% 168.00 slabinfo.pid_namespace.num_objs
985.50 ± 11% +14.6% 1129 ± 8% slabinfo.task_group.active_objs
985.50 ± 11% +14.6% 1129 ± 8% slabinfo.task_group.num_objs
660.25 ± 33% -36.3% 420.25 proc-vmstat.nr_active_file
302055 +2.4% 309211 proc-vmstat.nr_file_pages
281010 +2.6% 288385 proc-vmstat.nr_unevictable
660.25 ± 33% -36.3% 420.25 proc-vmstat.nr_zone_active_file
281010 +2.6% 288385 proc-vmstat.nr_zone_unevictable
20640832 ± 16% +40.0% 28888446 ± 6% cpuidle.C1.time
1743036 ± 6% -54.5% 792376 ± 4% cpuidle.C1.usage
5.048e+08 ± 54% -98.3% 8642335 ± 2% cpuidle.C6.time
706531 ± 51% -94.9% 36224 cpuidle.C6.usage
38313880 -56.5% 16666274 ± 5% cpuidle.POLL.time
18289550 -59.1% 7488947 ± 5% cpuidle.POLL.usage
302.94 ± 5% -32.9% 203.13 ± 6% sched_debug.cfs_rq:/.exec_clock.stddev
31707 ± 6% -43.6% 17867 ± 6% sched_debug.cfs_rq:/.min_vruntime.stddev
0.77 ± 22% +41.1% 1.09 ± 4% sched_debug.cfs_rq:/.nr_spread_over.avg
163292 ± 15% -75.8% 39543 ± 24% sched_debug.cfs_rq:/.spread0.avg
220569 ± 13% -65.9% 75287 ± 16% sched_debug.cfs_rq:/.spread0.max
-1903 +1952.7% -39073 sched_debug.cfs_rq:/.spread0.min
31680 ± 6% -43.7% 17850 ± 6% sched_debug.cfs_rq:/.spread0.stddev
698820 ± 2% -28.4% 500526 ± 3% sched_debug.cpu.avg_idle.avg
1100275 ± 3% -7.2% 1020875 ± 3% sched_debug.cpu.avg_idle.max
250869 -58.4% 104239 ± 4% sched_debug.cpu.nr_switches.avg
766741 ± 25% -64.8% 269919 ± 10% sched_debug.cpu.nr_switches.max
111347 ± 16% -50.8% 54786 ± 11% sched_debug.cpu.nr_switches.min
108077 ± 11% -67.3% 35316 ± 8% sched_debug.cpu.nr_switches.stddev
262769 -59.1% 107346 ± 4% sched_debug.cpu.sched_count.avg
800567 ± 25% -65.9% 272755 ± 10% sched_debug.cpu.sched_count.max
115870 ± 15% -51.6% 56034 ± 11% sched_debug.cpu.sched_count.min
112678 ± 11% -67.8% 36309 ± 9% sched_debug.cpu.sched_count.stddev
122760 -59.2% 50082 ± 4% sched_debug.cpu.sched_goidle.avg
372289 ± 25% -65.9% 126854 ± 10% sched_debug.cpu.sched_goidle.max
53911 ± 15% -51.7% 26040 ± 11% sched_debug.cpu.sched_goidle.min
52986 ± 12% -67.7% 17092 ± 9% sched_debug.cpu.sched_goidle.stddev
138914 -59.1% 56816 ± 4% sched_debug.cpu.ttwu_count.avg
168089 -57.1% 72151 ± 4% sched_debug.cpu.ttwu_count.max
123046 -61.1% 47837 ± 4% sched_debug.cpu.ttwu_count.min
11828 ± 15% -39.8% 7114 ± 4% sched_debug.cpu.ttwu_count.stddev
3050 ± 6% -34.3% 2004 ± 6% sched_debug.cpu.ttwu_local.max
383.97 ± 6% -30.1% 268.54 ± 12% sched_debug.cpu.ttwu_local.stddev
51.21 -0.4 50.79 perf-profile.calltrace.cycles-pp.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64
51.24 -0.4 50.82 perf-profile.calltrace.cycles-pp.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.84 ± 8% -0.1 0.76 ± 4% perf-profile.calltrace.cycles-pp.secondary_startup_64
0.83 ± 8% -0.1 0.74 ± 4% perf-profile.calltrace.cycles-pp.cpu_startup_entry.start_secondary.secondary_startup_64
0.83 ± 8% -0.1 0.74 ± 4% perf-profile.calltrace.cycles-pp.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64
0.83 ± 8% -0.1 0.74 ± 4% perf-profile.calltrace.cycles-pp.start_secondary.secondary_startup_64
47.27 +0.5 47.79 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__munmap
47.27 +0.5 47.79 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__munmap
47.26 +0.5 47.78 perf-profile.calltrace.cycles-pp.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe.__munmap
47.25 +0.5 47.78 perf-profile.calltrace.cycles-pp.__vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe.__munmap
47.28 +0.5 47.81 perf-profile.calltrace.cycles-pp.__munmap
46.75 +0.5 47.30 perf-profile.calltrace.cycles-pp.rwsem_down_write_slowpath.down_write_killable.__vm_munmap.__x64_sys_munmap.do_syscall_64
46.24 +0.5 46.79 perf-profile.calltrace.cycles-pp.osq_lock.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.__vm_munmap
46.77 +0.6 47.33 perf-profile.calltrace.cycles-pp.down_write_killable.__vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe
46.70 +0.6 47.26 perf-profile.calltrace.cycles-pp.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.__vm_munmap.__x64_sys_munmap
0.84 ± 8% -0.1 0.76 ± 4% perf-profile.children.cycles-pp.secondary_startup_64
0.84 ± 8% -0.1 0.76 ± 4% perf-profile.children.cycles-pp.cpu_startup_entry
0.84 ± 8% -0.1 0.76 ± 4% perf-profile.children.cycles-pp.do_idle
0.83 ± 8% -0.1 0.74 ± 4% perf-profile.children.cycles-pp.start_secondary
0.11 ± 4% -0.1 0.04 ± 57% perf-profile.children.cycles-pp.__sched_text_start
0.14 ± 3% -0.1 0.08 ± 6% perf-profile.children.cycles-pp.rwsem_wake
0.10 ± 4% -0.0 0.06 ± 9% perf-profile.children.cycles-pp.wake_up_q
0.10 ± 4% -0.0 0.05 perf-profile.children.cycles-pp.try_to_wake_up
0.07 ± 7% -0.0 0.05 perf-profile.children.cycles-pp._raw_spin_lock_irqsave
0.24 ± 2% +0.0 0.26 perf-profile.children.cycles-pp.mmap_region
0.42 ± 2% +0.0 0.45 perf-profile.children.cycles-pp.do_mmap
0.67 +0.0 0.71 ± 2% perf-profile.children.cycles-pp.rwsem_spin_on_owner
97.96 +0.1 98.09 perf-profile.children.cycles-pp.rwsem_down_write_slowpath
98.02 +0.1 98.14 perf-profile.children.cycles-pp.down_write_killable
97.86 +0.2 98.02 perf-profile.children.cycles-pp.rwsem_optimistic_spin
96.90 +0.2 97.07 perf-profile.children.cycles-pp.osq_lock
47.26 +0.5 47.78 perf-profile.children.cycles-pp.__x64_sys_munmap
47.28 +0.5 47.81 perf-profile.children.cycles-pp.__munmap
47.25 +0.5 47.78 perf-profile.children.cycles-pp.__vm_munmap
0.24 -0.0 0.19 ± 4% perf-profile.self.cycles-pp.rwsem_optimistic_spin
0.06 -0.0 0.03 ±100% perf-profile.self.cycles-pp._raw_spin_lock_irqsave
0.09 +0.0 0.10 perf-profile.self.cycles-pp.find_vma
0.65 +0.0 0.70 ± 2% perf-profile.self.cycles-pp.rwsem_spin_on_owner
96.31 +0.2 96.53 perf-profile.self.cycles-pp.osq_lock
0.66 -21.0% 0.52 ± 2% perf-stat.i.MPKI
0.11 +0.0 0.11 perf-stat.i.branch-miss-rate%
14774053 +4.3% 15410076 perf-stat.i.branch-misses
41.17 +0.9 42.09 perf-stat.i.cache-miss-rate%
19999033 -18.7% 16264313 ± 3% perf-stat.i.cache-misses
48696879 -20.4% 38779964 ± 2% perf-stat.i.cache-references
162553 -58.9% 66790 ± 4% perf-stat.i.context-switches
157.36 -3.9% 151.20 perf-stat.i.cpu-migrations
14271 +23.9% 17676 ± 3% perf-stat.i.cycles-between-cache-misses
0.00 ± 8% +0.0 0.00 ± 4% perf-stat.i.dTLB-load-miss-rate%
155048 ± 5% +46.3% 226826 ± 4% perf-stat.i.dTLB-load-misses
0.00 ± 2% +0.0 0.01 perf-stat.i.dTLB-store-miss-rate%
16761 +19.2% 19972 ± 2% perf-stat.i.dTLB-store-misses
4.118e+08 -15.7% 3.47e+08 perf-stat.i.dTLB-stores
93.99 +1.4 95.42 perf-stat.i.iTLB-load-miss-rate%
4103535 ± 3% -6.7% 3827881 perf-stat.i.iTLB-load-misses
263239 ± 4% -28.6% 188037 ± 2% perf-stat.i.iTLB-loads
18498 ± 3% +7.5% 19885 perf-stat.i.instructions-per-iTLB-miss
0.31 +226.8% 1.01 ± 3% perf-stat.i.metric.K/sec
88.51 +1.8 90.30 perf-stat.i.node-load-miss-rate%
8104644 -13.5% 7009101 perf-stat.i.node-load-misses
1047577 -28.3% 751122 perf-stat.i.node-loads
3521008 -13.1% 3058055 perf-stat.i.node-store-misses
0.64 -20.6% 0.51 ± 2% perf-stat.overall.MPKI
0.10 +0.0 0.10 perf-stat.overall.branch-miss-rate%
41.07 +0.9 41.92 perf-stat.overall.cache-miss-rate%
14273 +23.7% 17662 ± 3% perf-stat.overall.cycles-between-cache-misses
0.00 ± 5% +0.0 0.00 ± 5% perf-stat.overall.dTLB-load-miss-rate%
0.00 ± 2% +0.0 0.01 ± 2% perf-stat.overall.dTLB-store-miss-rate%
93.95 +1.3 95.29 perf-stat.overall.iTLB-load-miss-rate%
18507 ± 3% +7.5% 19890 perf-stat.overall.instructions-per-iTLB-miss
88.55 +1.8 90.32 perf-stat.overall.node-load-miss-rate%
1.191e+08 -7.0% 1.107e+08 perf-stat.overall.path-length
14739033 +4.2% 15364552 perf-stat.ps.branch-misses
19930054 -18.7% 16209879 ± 3% perf-stat.ps.cache-misses
48535919 -20.3% 38662093 ± 2% perf-stat.ps.cache-references
161841 -58.9% 66477 ± 4% perf-stat.ps.context-switches
157.06 -3.9% 150.89 perf-stat.ps.cpu-migrations
154906 ± 5% +46.1% 226269 ± 5% perf-stat.ps.dTLB-load-misses
16730 +19.1% 19927 ± 2% perf-stat.ps.dTLB-store-misses
4.104e+08 -15.7% 3.459e+08 perf-stat.ps.dTLB-stores
4089203 ± 3% -6.7% 3814320 perf-stat.ps.iTLB-load-misses
263134 ± 4% -28.4% 188441 ± 2% perf-stat.ps.iTLB-loads
8076488 -13.5% 6984923 perf-stat.ps.node-load-misses
1044002 -28.3% 748425 perf-stat.ps.node-loads
3509068 -13.1% 3048190 perf-stat.ps.node-store-misses
2270 ±170% -98.9% 24.50 ±166% interrupts.36:PCI-MSI.31981569-edge.i40e-eth0-TxRx-0
3730764 ± 2% -62.0% 1416300 ± 4% interrupts.CAL:Function_call_interrupts
2270 ±170% -98.9% 24.25 ±168% interrupts.CPU0.36:PCI-MSI.31981569-edge.i40e-eth0-TxRx-0
111116 ± 29% -67.4% 36227 ± 8% interrupts.CPU0.CAL:Function_call_interrupts
11091 ± 38% -62.2% 4195 ± 12% interrupts.CPU0.RES:Rescheduling_interrupts
48001 ± 26% -53.7% 22228 ± 14% interrupts.CPU1.CAL:Function_call_interrupts
4189 ± 29% -43.2% 2378 ± 5% interrupts.CPU1.RES:Rescheduling_interrupts
34024 ± 32% -44.7% 18804 ± 4% interrupts.CPU10.CAL:Function_call_interrupts
26171 ± 19% -26.1% 19334 ± 3% interrupts.CPU11.CAL:Function_call_interrupts
2370 ± 13% -20.8% 1876 ± 18% interrupts.CPU11.RES:Rescheduling_interrupts
30568 ± 23% -45.9% 16537 ± 4% interrupts.CPU12.CAL:Function_call_interrupts
3094 ± 27% -46.9% 1643 ± 8% interrupts.CPU12.RES:Rescheduling_interrupts
514.50 ± 10% +35.6% 697.75 ± 16% interrupts.CPU12.TLB:TLB_shootdowns
29819 ± 32% -41.7% 17393 interrupts.CPU13.CAL:Function_call_interrupts
35364 ± 38% -49.6% 17819 ± 9% interrupts.CPU14.CAL:Function_call_interrupts
694.75 ± 11% +23.2% 856.00 ± 9% interrupts.CPU14.TLB:TLB_shootdowns
38361 ± 48% -49.8% 19273 ± 9% interrupts.CPU16.CAL:Function_call_interrupts
30069 ± 23% -31.6% 20559 ± 3% interrupts.CPU17.CAL:Function_call_interrupts
809.75 ± 7% -19.5% 651.75 ± 13% interrupts.CPU17.TLB:TLB_shootdowns
28245 ± 29% -33.1% 18894 ± 6% interrupts.CPU18.CAL:Function_call_interrupts
33560 ± 23% -48.2% 17369 ± 6% interrupts.CPU19.CAL:Function_call_interrupts
2863 ± 19% -40.3% 1709 ± 13% interrupts.CPU19.RES:Rescheduling_interrupts
47118 ± 32% -55.7% 20868 ± 3% interrupts.CPU2.CAL:Function_call_interrupts
3897 ± 38% -48.9% 1991 ± 7% interrupts.CPU2.RES:Rescheduling_interrupts
34735 ± 29% -41.7% 20246 ± 9% interrupts.CPU20.CAL:Function_call_interrupts
37232 ± 23% -46.6% 19883 ± 12% interrupts.CPU21.CAL:Function_call_interrupts
32345 ± 16% -38.6% 19845 ± 6% interrupts.CPU22.CAL:Function_call_interrupts
34083 ± 22% -43.4% 19301 ± 6% interrupts.CPU23.CAL:Function_call_interrupts
61308 ± 16% -76.3% 14529 ± 13% interrupts.CPU24.CAL:Function_call_interrupts
6610 ± 26% -76.3% 1568 ± 11% interrupts.CPU24.RES:Rescheduling_interrupts
51384 ± 32% -75.0% 12848 ± 12% interrupts.CPU25.CAL:Function_call_interrupts
4643 ± 39% -70.6% 1366 ± 9% interrupts.CPU25.RES:Rescheduling_interrupts
48788 ± 25% -71.7% 13826 ± 17% interrupts.CPU26.CAL:Function_call_interrupts
4076 ± 32% -70.5% 1203 ± 18% interrupts.CPU26.RES:Rescheduling_interrupts
45702 ± 14% -70.7% 13369 ± 12% interrupts.CPU27.CAL:Function_call_interrupts
3614 ± 21% -67.3% 1180 ± 16% interrupts.CPU27.RES:Rescheduling_interrupts
51216 ± 14% -71.6% 14546 ± 15% interrupts.CPU28.CAL:Function_call_interrupts
4395 ± 24% -67.9% 1410 ± 21% interrupts.CPU28.RES:Rescheduling_interrupts
614.25 ± 18% +33.3% 818.75 ± 17% interrupts.CPU28.TLB:TLB_shootdowns
44945 ± 23% -66.5% 15059 ± 14% interrupts.CPU29.CAL:Function_call_interrupts
3994 ± 34% -68.2% 1271 ± 10% interrupts.CPU29.RES:Rescheduling_interrupts
39154 ± 24% -41.6% 22857 ± 6% interrupts.CPU3.CAL:Function_call_interrupts
45674 ± 11% -68.3% 14470 ± 8% interrupts.CPU30.CAL:Function_call_interrupts
4097 ± 23% -68.8% 1278 ± 10% interrupts.CPU30.RES:Rescheduling_interrupts
51890 ± 13% -72.6% 14227 ± 16% interrupts.CPU31.CAL:Function_call_interrupts
4557 ± 26% -71.4% 1305 ± 21% interrupts.CPU31.RES:Rescheduling_interrupts
41324 ± 23% -76.0% 9933 ± 11% interrupts.CPU32.CAL:Function_call_interrupts
3284 ± 33% -73.4% 873.75 ± 15% interrupts.CPU32.RES:Rescheduling_interrupts
39758 ± 31% -74.5% 10120 ± 17% interrupts.CPU33.CAL:Function_call_interrupts
3373 ± 42% -74.2% 869.00 ± 15% interrupts.CPU33.RES:Rescheduling_interrupts
513.00 ± 27% +46.0% 748.75 ± 16% interrupts.CPU33.TLB:TLB_shootdowns
40015 ± 14% -72.8% 10885 ± 8% interrupts.CPU34.CAL:Function_call_interrupts
3402 ± 25% -68.2% 1080 ± 13% interrupts.CPU34.RES:Rescheduling_interrupts
635.25 ± 22% +49.3% 948.25 ± 13% interrupts.CPU34.TLB:TLB_shootdowns
45251 ± 19% -75.2% 11204 ± 17% interrupts.CPU35.CAL:Function_call_interrupts
3731 ± 31% -73.4% 992.50 ± 20% interrupts.CPU35.RES:Rescheduling_interrupts
43390 ± 11% -78.3% 9434 ± 15% interrupts.CPU36.CAL:Function_call_interrupts
3536 ± 23% -77.3% 803.75 ± 14% interrupts.CPU36.RES:Rescheduling_interrupts
39820 ± 11% -75.9% 9613 ± 10% interrupts.CPU37.CAL:Function_call_interrupts
2987 ± 21% -70.8% 873.25 ± 9% interrupts.CPU37.RES:Rescheduling_interrupts
42969 ± 32% -76.6% 10068 ± 17% interrupts.CPU38.CAL:Function_call_interrupts
3202 ± 36% -74.4% 818.50 ± 27% interrupts.CPU38.RES:Rescheduling_interrupts
35571 ± 16% -72.4% 9822 ± 9% interrupts.CPU39.CAL:Function_call_interrupts
2986 ± 24% -73.9% 778.75 ± 15% interrupts.CPU39.RES:Rescheduling_interrupts
45001 ± 21% -48.2% 23317 ± 7% interrupts.CPU4.CAL:Function_call_interrupts
3689 ± 24% -43.6% 2080 ± 2% interrupts.CPU4.RES:Rescheduling_interrupts
39302 ± 21% -73.4% 10463 ± 8% interrupts.CPU40.CAL:Function_call_interrupts
2968 ± 32% -71.4% 848.50 ± 18% interrupts.CPU40.RES:Rescheduling_interrupts
40826 ± 19% -75.3% 10070 ± 10% interrupts.CPU41.CAL:Function_call_interrupts
3321 ± 30% -70.9% 967.25 ± 10% interrupts.CPU41.RES:Rescheduling_interrupts
700.25 ± 18% +26.2% 883.75 ± 17% interrupts.CPU41.TLB:TLB_shootdowns
35368 ± 11% -73.7% 9308 ± 15% interrupts.CPU42.CAL:Function_call_interrupts
2839 ± 12% -70.3% 844.50 ± 13% interrupts.CPU42.RES:Rescheduling_interrupts
45459 ± 25% -78.7% 9687 ± 11% interrupts.CPU43.CAL:Function_call_interrupts
3703 ± 29% -74.1% 959.50 ± 16% interrupts.CPU43.RES:Rescheduling_interrupts
41495 ± 15% -77.1% 9522 ± 12% interrupts.CPU44.CAL:Function_call_interrupts
3153 ± 28% -75.0% 789.75 ± 15% interrupts.CPU44.RES:Rescheduling_interrupts
38501 ± 26% -72.5% 10601 ± 14% interrupts.CPU45.CAL:Function_call_interrupts
3024 ± 38% -73.8% 791.00 ± 19% interrupts.CPU45.RES:Rescheduling_interrupts
39083 ± 35% -73.6% 10323 ± 18% interrupts.CPU46.CAL:Function_call_interrupts
3173 ± 37% -73.9% 829.75 ± 24% interrupts.CPU46.RES:Rescheduling_interrupts
44486 ± 20% -75.3% 10968 ± 15% interrupts.CPU47.CAL:Function_call_interrupts
3773 ± 34% -76.4% 890.25 ± 24% interrupts.CPU47.RES:Rescheduling_interrupts
34967 ± 42% -51.0% 17117 ± 10% interrupts.CPU48.CAL:Function_call_interrupts
31969 ± 38% -51.7% 15432 ± 12% interrupts.CPU49.CAL:Function_call_interrupts
33786 ± 12% -29.2% 23918 ± 5% interrupts.CPU5.CAL:Function_call_interrupts
3014 ± 16% -33.2% 2012 ± 9% interrupts.CPU5.RES:Rescheduling_interrupts
30514 ± 29% -46.4% 16343 ± 6% interrupts.CPU51.CAL:Function_call_interrupts
34448 ± 26% -48.7% 17686 ± 4% interrupts.CPU52.CAL:Function_call_interrupts
2811 ± 34% -42.0% 1631 ± 4% interrupts.CPU52.RES:Rescheduling_interrupts
30848 ± 33% -47.9% 16059 ± 3% interrupts.CPU54.CAL:Function_call_interrupts
31017 ± 22% -52.7% 14676 ± 7% interrupts.CPU55.CAL:Function_call_interrupts
2501 ± 41% -41.8% 1455 ± 10% interrupts.CPU55.RES:Rescheduling_interrupts
28249 ± 23% -46.9% 14997 ± 10% interrupts.CPU56.CAL:Function_call_interrupts
2113 ± 18% -36.0% 1352 ± 15% interrupts.CPU56.RES:Rescheduling_interrupts
27658 ± 20% -49.3% 14034 ± 3% interrupts.CPU57.CAL:Function_call_interrupts
26559 ± 34% -40.6% 15778 ± 11% interrupts.CPU58.CAL:Function_call_interrupts
27984 ± 27% -39.9% 16815 ± 12% interrupts.CPU59.CAL:Function_call_interrupts
35098 ± 33% -37.5% 21921 interrupts.CPU6.CAL:Function_call_interrupts
3073 ± 37% -40.6% 1825 ± 8% interrupts.CPU6.RES:Rescheduling_interrupts
29248 ± 38% -48.2% 15149 ± 6% interrupts.CPU60.CAL:Function_call_interrupts
30880 ± 33% -52.3% 14722 ± 10% interrupts.CPU61.CAL:Function_call_interrupts
31218 ± 43% -51.5% 15152 ± 3% interrupts.CPU62.CAL:Function_call_interrupts
29210 ± 40% -46.5% 15627 ± 6% interrupts.CPU63.CAL:Function_call_interrupts
26813 ± 15% -39.0% 16343 ± 13% interrupts.CPU64.CAL:Function_call_interrupts
24791 ± 14% -32.6% 16704 ± 10% interrupts.CPU67.CAL:Function_call_interrupts
29638 ± 33% -42.9% 16914 ± 9% interrupts.CPU68.CAL:Function_call_interrupts
36247 ± 33% -48.5% 18670 ± 8% interrupts.CPU69.CAL:Function_call_interrupts
30379 ± 24% -30.6% 21096 ± 5% interrupts.CPU7.CAL:Function_call_interrupts
3027 ± 25% -32.0% 2059 ± 3% interrupts.CPU7.RES:Rescheduling_interrupts
31064 ± 25% -42.8% 17774 interrupts.CPU70.CAL:Function_call_interrupts
52949 ± 14% -77.0% 12198 ± 10% interrupts.CPU72.CAL:Function_call_interrupts
4057 ± 23% -75.7% 985.00 ± 8% interrupts.CPU72.RES:Rescheduling_interrupts
42694 ± 22% -73.6% 11281 ± 16% interrupts.CPU73.CAL:Function_call_interrupts
3318 ± 39% -74.3% 851.75 ± 16% interrupts.CPU73.RES:Rescheduling_interrupts
49143 ± 24% -76.1% 11756 ± 12% interrupts.CPU74.CAL:Function_call_interrupts
3606 ± 31% -73.8% 946.00 ± 12% interrupts.CPU74.RES:Rescheduling_interrupts
50587 ± 24% -72.5% 13930 ± 20% interrupts.CPU75.CAL:Function_call_interrupts
3655 ± 36% -71.1% 1056 ± 12% interrupts.CPU75.RES:Rescheduling_interrupts
57791 ± 21% -78.4% 12488 ± 11% interrupts.CPU76.CAL:Function_call_interrupts
5109 ± 37% -79.7% 1037 ± 20% interrupts.CPU76.RES:Rescheduling_interrupts
52455 ± 26% -75.4% 12922 ± 5% interrupts.CPU77.CAL:Function_call_interrupts
3997 ± 37% -73.9% 1043 ± 14% interrupts.CPU77.RES:Rescheduling_interrupts
49188 ± 21% -74.5% 12521 ± 10% interrupts.CPU78.CAL:Function_call_interrupts
3867 ± 42% -74.5% 986.25 ± 18% interrupts.CPU78.RES:Rescheduling_interrupts
45517 ± 19% -72.6% 12484 ± 19% interrupts.CPU79.CAL:Function_call_interrupts
3369 ± 34% -71.9% 946.25 ± 20% interrupts.CPU79.RES:Rescheduling_interrupts
30702 ± 22% -39.9% 18462 ± 9% interrupts.CPU8.CAL:Function_call_interrupts
2580 ± 28% -39.9% 1550 ± 10% interrupts.CPU8.RES:Rescheduling_interrupts
35561 ± 30% -69.8% 10728 ± 12% interrupts.CPU80.CAL:Function_call_interrupts
2675 ± 44% -70.6% 787.50 ± 7% interrupts.CPU80.RES:Rescheduling_interrupts
38762 ± 33% -73.3% 10349 ± 18% interrupts.CPU81.CAL:Function_call_interrupts
2892 ± 48% -70.5% 853.25 ± 20% interrupts.CPU81.RES:Rescheduling_interrupts
46500 ± 39% -80.2% 9203 ± 6% interrupts.CPU82.CAL:Function_call_interrupts
3726 ± 41% -83.3% 622.75 ± 12% interrupts.CPU82.RES:Rescheduling_interrupts
42125 ± 25% -76.0% 10103 ± 7% interrupts.CPU83.CAL:Function_call_interrupts
3275 ± 40% -75.4% 804.50 ± 6% interrupts.CPU83.RES:Rescheduling_interrupts
37359 ± 28% -74.7% 9436 ± 7% interrupts.CPU84.CAL:Function_call_interrupts
2762 ± 45% -71.1% 797.50 ± 17% interrupts.CPU84.RES:Rescheduling_interrupts
38900 ± 13% -76.2% 9272 ± 8% interrupts.CPU85.CAL:Function_call_interrupts
2704 ± 27% -77.0% 622.25 ± 10% interrupts.CPU85.RES:Rescheduling_interrupts
40662 ± 24% -77.2% 9274 ± 14% interrupts.CPU86.CAL:Function_call_interrupts
3139 ± 39% -79.5% 643.00 ± 28% interrupts.CPU86.RES:Rescheduling_interrupts
33538 ± 23% -71.7% 9484 ± 14% interrupts.CPU87.CAL:Function_call_interrupts
2406 ± 40% -73.5% 638.25 ± 21% interrupts.CPU87.RES:Rescheduling_interrupts
36240 ± 26% -73.8% 9499 ± 10% interrupts.CPU88.CAL:Function_call_interrupts
2450 ± 39% -70.5% 721.75 ± 5% interrupts.CPU88.RES:Rescheduling_interrupts
41267 ± 29% -77.1% 9463 ± 11% interrupts.CPU89.CAL:Function_call_interrupts
3286 ± 34% -73.2% 879.50 ± 17% interrupts.CPU89.RES:Rescheduling_interrupts
36038 ± 18% -50.6% 17796 ± 3% interrupts.CPU9.CAL:Function_call_interrupts
3140 ± 28% -48.3% 1622 ± 9% interrupts.CPU9.RES:Rescheduling_interrupts
38534 ± 25% -77.5% 8675 ± 9% interrupts.CPU90.CAL:Function_call_interrupts
3008 ± 27% -79.1% 629.50 ± 16% interrupts.CPU90.RES:Rescheduling_interrupts
38422 ± 29% -77.2% 8741 ± 14% interrupts.CPU91.CAL:Function_call_interrupts
3095 ± 51% -78.7% 658.75 ± 14% interrupts.CPU91.RES:Rescheduling_interrupts
38120 ± 45% -73.6% 10059 ± 10% interrupts.CPU92.CAL:Function_call_interrupts
2711 ± 61% -73.3% 722.75 ± 10% interrupts.CPU92.RES:Rescheduling_interrupts
37155 ± 19% -74.1% 9628 ± 12% interrupts.CPU93.CAL:Function_call_interrupts
2724 ± 32% -70.4% 806.00 ± 32% interrupts.CPU93.RES:Rescheduling_interrupts
43458 ± 15% -77.1% 9936 ± 10% interrupts.CPU94.CAL:Function_call_interrupts
2832 ± 25% -76.8% 655.75 ± 18% interrupts.CPU94.RES:Rescheduling_interrupts
54226 ± 22% -76.8% 12596 ± 17% interrupts.CPU95.CAL:Function_call_interrupts
4437 ± 37% -80.6% 860.50 ± 11% interrupts.CPU95.RES:Rescheduling_interrupts
302853 ± 2% -58.2% 126676 ± 2% interrupts.RES:Rescheduling_interrupts



will-it-scale.per_thread_ops

2300 +--------------------------------------------------------------------+
| |
2250 |-+ O O O O O |
2200 |-+ O O O O |
| O O O O O O O O |
2150 |-O O O O O O |
| O O O O |
2100 |-+ + |
| : + + +.. .+. |
2050 |.+ : + .+.. + : : + +. +.. + |
2000 |-+..+ + + : : + + + |
| : : + + |
1950 |-+ +.+.. +..+ |
| + |
1900 +--------------------------------------------------------------------+


[*] bisect-good sample
[O] bisect-bad sample



Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


Thanks,
Rong Chen



2020-09-27 06:25:29

by Leon Romanovsky

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Sat, Sep 26, 2020 at 03:28:32PM -0700, Linus Torvalds wrote:
> On Fri, Sep 25, 2020 at 6:15 PM Linus Torvalds
> <[email protected]> wrote:
> >
> > I think that over the weekend I'll do Peter's version but with the
> > "page_mapcount() == 1" check, because I'm starting to like that
> > better than the mm->has_pinned.
>
> Actually, after the first read-through, I feel like I'll just apply
> the series as-is.
>
> But I'll look at it some more, and do another read-through and make
> the final decision tomorrow.
>
> If anybody has any concerns about the v2 patch series from Peter, holler.

We won't be able to test the series till Tuesday due to religious holidays.

Thanks

>
> Linus

2020-09-27 18:21:22

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Sat, Sep 26, 2020 at 11:24 PM Leon Romanovsky <[email protected]> wrote:
>
> We won't be able to test the series till Tuesday due to religious holidays.

That's fine. I've merged the series up, and it will be in rc7 later
today, and with an rc8 next week we'll have two weeks to find any
issues.

I did edit Peter's patch 3/4 quite a bit:

- remove the pte_mkyoung(), because there's no _access_ at fork()
time (so it's very different in that respect to the fault case)

- added the lru_cache_add_inactive_or_unevictable() call that I think
is needed and Peter's patch was missing

- split up the "copy page" into its own function kind of like I had
done for my minimal patch

- changed the comments around a bit (mostly due to that split-up
changing some of the flow)

but it should be otherwise the same and I think Peter will still
recognize it as his patch (and I left it with his authorship and
sign-off).

Anyway, it's merged locally in my tree, I'll do some local testing
(and I have a few other pull requests), but I'll push out the end
result soonish assuming nothing shows up (and considering that I don't
have any pinning loads, I seriously doubt it will, since I won't see
any of the interesting cases).

So if we can get the actual rdma cases tested early next week, we'll
be in good shape, I think.

Btw, I'm not convinced about the whole "turn the pte read-only and
then back". If the fork races with another thread doing a pinning
fast-GUP on another CPU, there are memory ordering issues etc too.
That's not necessarily visible on x86 (the "turn read-only" being a
locked op will force serialization), but it all looks dodgy as heck.

I'm also not convinced we really want to support that kind of insane
racy load, but whatever. I left the code in place, but it's something
where we might want to rethink things.

Linus

2020-09-27 18:47:48

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Sun, Sep 27, 2020 at 11:16 AM Linus Torvalds
<[email protected]> wrote:
>
> Btw, I'm not convinced about the whole "turn the pte read-only and
> then back". If the fork races with another thread doing a pinning
> fast-GUP on another CPU, there are memory ordering issues etc too.
> That's not necessarily visible on x86 (the "turn read-only" being a
> locked op will force serialization), but it all looks dodgy as heck.

.. looking at it more, I also think it could possibly lose the dirty
bit for the case where another CPU did a HW dirty/accessed bit update
in between the original read of the pte, and then us writing back the
writable pte again.

Us holding the page table lock means that no _software_ accesses will
happen to the PTE, but dirty/accessed bits can be modified by hardware
despite the lock.

That is, of course, a completely crazy case, and I think that since we
only do this for a COW mapping, and only do the PTE changes if the pte
was writable, the pte will always have been dirty already.

So I don't think it's an _actual_ bug, but it's another "this looks
dodgy as heck" marker. It may _work_, but it sure ain't pretty.

But despite having looked at this quite a bit, I don't see anything
that looks actively wrong, so I think the series is fine. This is more
of a note for people to perhaps think about.

Linus

2020-09-28 12:51:22

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Sun, Sep 27, 2020 at 11:45:30AM -0700, Linus Torvalds wrote:
> On Sun, Sep 27, 2020 at 11:16 AM Linus Torvalds
> <[email protected]> wrote:
> >
> > Btw, I'm not convinced about the whole "turn the pte read-only and
> > then back". If the fork races with another thread doing a pinning
> > fast-GUP on another CPU, there are memory ordering issues etc too.
> > That's not necessarily visible on x86 (the "turn read-only" being a
> > locked op will force serialization), but it all looks dodgy as heck.

Oh. Yes, looking again, the atomics in the final arrangement of
copy_present_page() aren't going to be strong enough to order this.

Sorry for missing that; I wasn't able to look very deeply over the weekend.

Not seeing an obvious option besides adding a smp_mb() before
page_maybe_dma_pinned() as Peter once suggested.
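
A minimal sketch of the ordering in question (purely illustrative - the
variable names are just what that copy path would plausibly have in scope,
not the actual copy_present_pte() body):

    ptep_set_wrprotect(src_mm, addr, src_pte);
    /*
     * Order the pte write against the pin-count read below; this pairs
     * with fast-gup bumping the pin count and then re-checking the pte.
     */
    smp_mb();
    if (page_maybe_dma_pinned(page)) {
        /* a pin raced in (or was already there): copy the page instead */
    }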

> .. looking at it more, I also think it could possibly lose the dirty
> bit for the case where another CPU did a HW dirty/accessed bit update
> in between the original read of the pte, and then us writing back the
> writable pte again.

Ah, I see:

set_pte_at(src_mm, addr, src_pte, pte);

wants to be some arch specific single bit ptep_clear_wrprotect()..

Thanks,
Jason

2020-09-28 16:21:10

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Mon, Sep 28, 2020 at 5:49 AM Jason Gunthorpe <[email protected]> wrote:
>
> Not seeing an obvious option besides adding a smp_mb() before
> page_maybe_dma_pinned() as Peter once suggested.

That is going to be prohibitively expensive - needing it for each pte
whether it's pinned or not.

I really think the better option is a "don't do that then". This has
_never_ worked before either except by pure luck.

I also doubt anybody does it. Forking with threads is a bad idea to
begin with. Doing so while pinning pages even more so.

Linus

2020-09-28 17:17:29

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Sun, Sep 27, 2020 at 11:16:34AM -0700, Linus Torvalds wrote:
> - split up the "copy page" into its own function kind of like I had
> done for my minimal patch

I didn't do that mainly because of the wrprotect() (for the fast-gup race),
which can potentially be kept if it's a normal cow. Maybe we'd add one more
comment above the caller of copy_present_page() (even if we already have a
"NOTE!" section inside the helper), since it changes *src_pte and that's hard
to notice from the function name "copy_present_page()". Not that important,
though.

Thanks for doing the changes. I went over the whole patch and it looks indeed
cleaner than before (I also didn't spot anything wrong either).

Regarding the other, even rarer "hardware race on dirty/access bits" - maybe we
can simply always set both of those bits when the page copy happens and we want
to recover the ptes? I also agree it's trivial enough that maybe we don't
even need to care about it.
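
For example (just a sketch of the idea; the variable names follow the
discussion above, not the actual patch):

    /*
     * When putting the parent's writable pte back after copying the
     * page, set dirty+young unconditionally so a hardware A/D update
     * that raced with the wrprotect can't be lost.
     */
    pte = pte_mkyoung(pte_mkdirty(pte));
    set_pte_at(src_mm, addr, src_pte, pte);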

--
Peter Xu

2020-09-28 17:25:12

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Mon, Sep 28, 2020 at 09:17:09AM -0700, Linus Torvalds wrote:
> On Mon, Sep 28, 2020 at 5:49 AM Jason Gunthorpe <[email protected]> wrote:
> >
> > Not seeing an obvious option besides adding a smp_mb() before
> > page_maybe_dma_pinned() as Peter once suggested.
>
> That is going to be prohibitively expensive - needing it for each pte
> whether it's pinned or not.
>
> I really think the better option is a "don't do that then". This has
> _never_ worked before either except by pure luck.

Yes... Actually I am also thinking about the complete solution to cover
read-only fast-gups too, but now I start to doubt this, at least for the fork()
path. E.g. if we'd finally like to use pte_protnone() to replace the current
pte_wrprotect(), we'll be able to also block the read gups, but we'll suffer
the same degradation on normal fork()s, or even more. Seems unacceptable.

The other question is, whether we should emphasize and document somewhere that
MADV_DONTFORK is still (and should always be) the preferred way, because
changes like this series can potentially encourage the other way.

--
Peter Xu

2020-09-28 17:56:13

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Mon, Sep 28, 2020 at 10:23 AM Peter Xu <[email protected]> wrote:
>
> Yes... Actually I am also thinking about the complete solution to cover
> read-only fast-gups too, but now I start to doubt this, at least for the fork()
> path. E.g. if we'd finally like to use pte_protnone() to replace the current
> pte_wrprotect(), we'll be able to also block the read gups, but we'll suffer
> the same degradation on normal fork()s, or even more. Seems unacceptable.

So I think the real question about pinned read gups is what semantics
they should have.

Because honestly, I think we have two options:

- the current "it gets a shared copy from the page tables"

- the "this is an exclusive pin, and it _will_ follow the source VM
changes, and never break"

because honestly, if we get a shared copy at the time of the pinning
(like we do now), then "fork()" is entirely immaterial. The fork() can
have happened ages ago, that page is shared with other processes, and
any process writing to it - including very much the pinning one -
will cause a copy-on-write and get a copy of the page.

IOW, the current - and past - semantics for read pinning is that you
get a copy of the page, but any changes made by the pinning process
may OR MAY NOT show up in your pinned copy.

Again: doing a concurrent fork() is entirely immaterial, because the
page can have been made a read-only COW page by _previous_ fork()
calls (or KSM logic or whatever).

In other words: read pinning gets a page efficiently, but there is
zero guarantee of any future coherence with the process doing
subsequent writes.

That has always been the semantics, and FOLL_PIN didn't change that at
all. You may have had things that worked almost by accident (ie you
had made the page private by writing to it after the fork, so the read
pinning _effectively_ gave you a page that was coherent), but even
that was always accidental rather than anything else. Afaik it could
easily be broken by KSM, for example.

In other words, a read pin isn't really any different from a read GUP.
You get a reference to a page that is valid at the time of the page
lookup, and absolutely nothing more.

Now, the alternative is to make a read pin have the same guarantees as
a write pin, and say "this will stay attached to this MM until unmap
or unpin".

But honestly, that is largely going to _be_ the same as a write pin,
because it absolutely needs to do a page COW at the time of the
pinning to get that initial exclusive guarantee in the first place.
Without that initial exclusivity, you cannot avoid future COW events
breaking the wrong way.

So I think the "you get a reference to the page at the time of the
pin, and the page _may_ or may not change under you if the original
process writes to it" are really the only relevant semantics. Because
if you need those exclusive semantics, you might as well just use a
write pin.

The downside of a write pin is that it not only makes that page
exclusive, it also (a) marks it dirty and (b) requires write access.
That can matter particularly for shared mappings. So if you know
you're doing the pin on a shared mmap, then a read pin is the right
thing, because the page will stay around - not because of the VM it
happens in, but because of the underlying file mapping!

See the difference?

> The other question is, whether we should emphasize and document somewhere that
> MADV_DONTFORK is still (and should always be) the preferred way, because
> changes like this series can potentially encourage the other way.

I really suspect that the concurrent fork() case is fundamentally hard
to handle.

Is it impossible? No. Even without any real locking, we could change
the code to do a seqcount_t, for example. The fastgup code wouldn't
take a lock, but it would just fail and fall back to the slow code if
the sequence count fails.

So the copy_page_range() code would do a write count around the copy:

write_seqcount_begin(&mm->seq);
.. do the copy ..
write_seqcount_end(&mm->seq);

and the fast-gup code would do a

seq = raw_read_seqcount(&mm->seq);
if (seq & 1)
return -EAGAIN;

at the top, and do a

if (__read_seqcount_t_retry(&mm->seq, seq)) {
.. Uhhuh, that failed, drop the ref to the page again ..
return -EAGAIN;
}

after getting the pin reference.

We could make this conditional on FOLL_PIN, or maybe even a new flag
("FOLL_FORK_CONSISTENT").

So I think we can serialize with fork() without serializing each and every PTE.

If we want to and really need to.

Hmm?

Linus

2020-09-28 18:42:11

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Mon, Sep 28, 2020 at 10:54:28AM -0700, Linus Torvalds wrote:
> On Mon, Sep 28, 2020 at 10:23 AM Peter Xu <[email protected]> wrote:
> >
> > Yes... Actually I am also thinking about the complete solution to cover
> > read-only fast-gups too, but now I start to doubt this, at least for the fork()
> > path. E.g. if we'd finally like to use pte_protnone() to replace the current
> > pte_wrprotect(), we'll be able to also block the read gups, but we'll suffer
> > the same degradation on normal fork()s, or even more. Seems unacceptable.
>
> So I think the real question about pinned read gups is what semantics
> they should have.
>
> Because honestly, I think we have two options:
>
> - the current "it gets a shared copy from the page tables"
>
> - the "this is an exclusive pin, and it _will_ follow the source VM
> changes, and never break"

The requirement depends on what the driver is doing. Consider a simple
ring-to-device scheme:

ring = mmap()
pin_user_pages(FOLL_LONGTERM)
ring[0] = [..]
trigger_kernel()

Sort of like io_uring. Here the kernel will pin the zero page and will
never see any user modifications to the buffer. We must do #2.

While something like read(O_DIRECT) which only needs the buffer to be
stable during a system call would be fine with #1 (and data
incoherence in general)

> In other words, a read pin isn't really any different from a read GUP.
> You get a reference to a page that is valid at the time of the page
> lookup, and absolutely nothing more.

Yes, so far all this new pin stuff has really focused on the write
side - the motivating issue was the set_page_dirty() oops

> But honestly, that is largely going to _be_ the same as a write pin,
> because it absolutely needs to do a page COW at the time of the
> pinning to get that initial exclusive guarantee in the first place.
> Without that initial exclusivity, you cannot avoid future COW events
> breaking the wrong way.

Right, I see places using FOLL_WRITE when they only need read. It is a
bit ugly, IMHO.

> The downside of a write pin is that it not only makes that page
> exclusive, it also (a) marks it dirty and (b) requires write access.

RDMA adds FOLL_FORCE because people complained they couldn't do
read-only transfers from .rodata - uglier still :\

> So the copy_page_range() code would do a write count around the copy:
>
> write_seqcount_begin(&mm->seq);
> .. do the copy ..
> write_seqcount_end(&mm->seq);

All of gup_fast and copy_mm could be wrapped in a seqcount so that
gup_fast always goes to the slow path if fork is concurrent.

That doesn't sound too expensive and avoids all the problems you
pointed with the WP scheme.

As you say fork & pin & threads is already questionable, so an
unconditional slow path on race should be OK.
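
Roughly like this, where mm->seq is a hypothetical new seqcount_t and the
call sites are only indicative (a sketch, not a tested patch):

    /* fork() side, around the page table copy in dup_mmap(): */
    write_seqcount_begin(&mm->seq);
    retval = copy_page_range(dst_mm, src_mm, vma);
    write_seqcount_end(&mm->seq);

    /* gup_fast side: */
    seq = raw_read_seqcount(&mm->seq);
    if (seq & 1)
        goto slow_path;             /* fork in progress, don't even try */
    /* ... lockless walk, grabbing pin references ... */
    if (read_seqcount_retry(&mm->seq, seq)) {
        unpin_user_pages(pages, nr_pinned);
        goto slow_path;             /* raced with fork, redo it slowly */
    }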

> If we want to and really need to.

I would like to see some reasonable definition for the
read-side. Having drivers do FOLL_FORCE | FOLL_WRITE for read is just
confusing and weird for a driver-facing API.

It may also help focus the remaining discussion for solving
set_page_dirty() if pin_user_pages() had a solid definition.

I prefer the version where read pin and write pin are symmetric. The
PTE in the MM should not change once pinned.

This is useful and if something only needs the original GUP semantics
then GUP is still there.

Jason

2020-09-28 19:32:23

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Mon, Sep 28, 2020 at 11:39 AM Jason Gunthorpe <[email protected]> wrote:
>
> I prefer the version where read pin and write pin are symmetric. The
> PTE in the MM should not change once pinned.

The thing is, I don't really see how to do that.

Right now the write pin fastpath part depends on the PTE being
writable. That implies "this VM has access to this page".

For a read pin there simply is no other way to do it.

So we'd basically say "fast read pin only works on writable pages",
and then we'd have to go to the slow path if it isn't dirty and
writable.

And the slow path would then do whatever COW is required, but it
wouldn't mark the result dirty (and in the case of a shared mapping,
couldn't mark it writable).

So a read pin action would basically never work for the fast-path for
a few cases, notably a shared read-only mapping - because we could
never mark it in the page tables as "fast pin accessible"

See the problem? A read-only pin is fundamentally different from a
write one, because a write one has that fundamental mark of "I have
private access to this page" in ways a read one simply does not.

So we could make the requirement be that a pinned page is either

(a) from a shared mapping (so the pinning depends on the page cache
association). But we can't test this in the fast path.

or

(b) for a private mapping we require page_mapcount() == 1 and that
it's writable.

but since (a) requires the mapping type, we can't check in the fast
path - we only have the PTE and the page. So the fast-path can only
"emulate" it by that "writable", which is a proper subset of (a) or
(b), but it's not something that is in any way guaranteed.

End result: FOLL_PIN would really only work on private pages, and only
if you don't want to share with the page cache.

And it would basically have no advantages over a writable FOLL_PIN. It
would break the association with any backing store for private pages,
because otherwise it can't follow future writes.

Linus

2020-09-28 19:45:50

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Mon, Sep 28, 2020 at 11:39 AM Jason Gunthorpe <[email protected]> wrote:
>
> All of gup_fast and copy_mm could be wrapped in a seqcount so that
> gup_fast always goes to the slow path if fork is concurrent.
>
> That doesn't sound too expensive and avoids all the problems you
> pointed with the WP scheme.

Ok, I'll start by just removing the "write protect early trick". It
really doesn't work reliably anyway due to memory ordering, and while
I think the dirty bit is ok (and we could probably also set it
unconditionally to make _sure_ it's not dropped like Peter says) it
just makes me convinced it's the wrong approach.

Fixing it at a per-pte level is too expensive, and yeah, if we really
really care about the fork consistency, the sequence count approach
should be much simpler and more obvious.

So I'll do the pte wrprotect/restore removal. Anybody willing to do
and test the sequence count approach?

Linus

2020-09-28 19:51:51

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Mon, Sep 28, 2020 at 12:36 PM Linus Torvalds
<[email protected]> wrote:
>
> So I'll do the pte wrprotect/restore removal. Anybody willing to do
> and test the sequence count approach?

So the wrprotect removal is trivial, with most of it being about the comments.

However, when I look at this, I am - once again - tempted to just add a

if (__page_mapcount(page) > 1)
return 1;

there too. Because we know it's a private mapping (shared mappings we
checked for with the "is_cow_mapping()" earlier), and the only case we
really care about is the one where the page is only mapped in the
current mm (because that's what a write pinning will have done, and as
mentioned, a read pinning doesn't do anything wrt fork() right now
anyway).

So if it's mapped in another mm, the COW clearly hasn't been broken by
a pin, and a read pinned page had already gone through a fork.
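
In context, the checks being described would be roughly along these lines (an
illustration of the idea only, not the attached patch):

    /* We only get here for private (COW) mappings. */
    if (likely(!READ_ONCE(src_mm->has_pinned)))
        return 1;       /* nothing was ever pinned: share as usual */
    if (__page_mapcount(page) > 1)
        return 1;       /* mapped in another mm too: no pin broke this COW */
    if (!page_maybe_dma_pinned(page))
        return 1;
    /* pinned and exclusive to this mm: give the child its own copy */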

But the more I look at this code, the more I go "ok, I want somebody
to actually test this with the rdma case".

So I'll attach my suggested patch, but I won't actually commit it. I'd
really like to have this tested, possibly _together_ with the sequence
count addition..

Linus



2020-09-28 23:30:54

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Mon, Sep 28, 2020 at 12:50:03PM -0700, Linus Torvalds wrote:

> But the more I look at this code, the more I go "ok, I want somebody
> to actually test this with the rdma case".

I suspect it will be OK as our immediate test suite that touches this
will be using malloc memory.

It seems some mmap trick will be needed to get higher map counts?

> So I'll attach my suggested patch, but I won't actually commit it. I'd
> really like to have this tested, possibly _together_ with the sequence
> count addition..

Ok, we will do this once the holidays are over. If Peter doesn't take
it I can look at the seqcount too.

Thanks,
Jason

2020-09-29 00:01:41

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Mon, Sep 28, 2020 at 12:29:55PM -0700, Linus Torvalds wrote:
> So a read pin action would basically never work for the fast-path for
> a few cases, notably a shared read-only mapping - because we could
> never mark it in the page tables as "fast pin accessible"

Agree, I was assuming we'd lose more of the fast path to create this
thing. It would only still be fast if the pages are already writable.

I strongly suspect the case of DMA'ing actual read-only data is the
minority here; the usual case is probably filling a writable buffer
with something interesting and then triggering the DMA. The DMA just
happens to be a read from the driver's point of view, so the driver
doesn't set FOLL_WRITE.

Looking at the FOLL_LONGTERM users, which should be the banner use case
for this, there are very few that do a read pin and use fast.

> And it would basically have no advantages over a writable FOLL_PIN. It
> would break the association with any backing store for private pages,
> because otherwise it can't follow future writes.

Yes, I wasn't clear enough, I'm looking at this from a driver API
perspective. We have this API

pin_user_pages(FOLL_LONGTERM | FOLL_WRITE)

Which now has no decoherence issues with the MM. If the driver
naturally wants to do read-only access it might be tempted to do:

pin_user_pages(FOLL_LONGTERM)

Which is now NOT the same thing and brings all these really surprising
mm coherence issues back.

The driver author might discover this in testing, then be tempted to
hardwire 'FOLL_LONGTERM | FOLL_WRITE'. Now their uAPI is broken for
things that are actually read-only like .rodata.

If they discover this then they add a FOLL_FORCE to the mix.

When someone comes along to read this later it is a big leap to see
pin_user_pages(FOLL_LONGTERM | FOLL_FORCE | FOLL_WRITE)

and realize this is code for "read only mapping". At least it took me
a while to decipher it the first time I saw it.

I think this is really hard to use and ugly. My thinking has been to
just stick:

if (flags & FOLL_LONGTERM)
flags |= FOLL_FORCE | FOLL_WRITE

In pin_user_pages(). It would make the driver API cleaner. If we can
do a bit better somehow by not COW'ing for certain VMA's as you
explained then all the better, but not my primary goal..
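
Concretely, something like this at the top of pin_user_pages() (signature as
in the 5.9-era code; just a sketch of the suggestion, not a real patch):

    long pin_user_pages(unsigned long start, unsigned long nr_pages,
                        unsigned int gup_flags, struct page **pages,
                        struct vm_area_struct **vmas)
    {
        /*
         * Long-term pins want a page that is exclusively ours, so
         * break the COW up front even for read-only users.
         */
        if (gup_flags & FOLL_LONGTERM)
            gup_flags |= FOLL_FORCE | FOLL_WRITE;

        .. the existing FOLL_PIN sanity checks and gup call ..
    }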

Basically, I think if a driver is using FOLL_LONGTERM | FOLL_PIN we
should guarantee that driver a consistent MM and take the gup_fast
performance hit to do it.

AFAICT the giant whack of other cases not using FOLL_LONGTERM really
shouldn't care about read-decoherence. For those cases the user should
really not be racing write's with data under read-only pin, and the
new COW logic looks like it solves the other issues with this.

I know Jann/John have been careful to not have special behaviors for
the DMA case, but I think it makes sense here. It is actually different.

Jason

2020-09-29 00:22:06

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On 9/28/20 4:57 PM, Jason Gunthorpe wrote:
> On Mon, Sep 28, 2020 at 12:29:55PM -0700, Linus Torvalds wrote:
...
> I think this is really hard to use and ugly. My thinking has been to
> just stick:
>
> if (flags & FOLL_LONGTERM)
> flags |= FOLL_FORCE | FOLL_WRITE
>
> In pin_user_pages(). It would make the driver API cleaner. If we can

+1, yes. The other choices so far are, as you say, really difficult to figure
out.

> do a bit better somehow by not COW'ing for certain VMA's as you
> explained then all the better, but not my primary goal..
>
> Basically, I think if a driver is using FOLL_LONGTERM | FOLL_PIN we
> should guarentee that driver a consistent MM and take the gup_fast
> performance hit to do it.
>
> AFAICT the giant wack of other cases not using FOLL_LONGTERM really
> shouldn't care about read-decoherence. For those cases the user should
> really not be racing write's with data under read-only pin, and the
> new COW logic looks like it solves the other issues with this.

I hope this doesn't kill the seqcount() idea, though. That was my favorite
part of the discussion, because it neatly separates out the two racing domains
(fork, gup/pup) and allows easy reasoning about them--without really impacting
performance.

Truly elegant. We should go there.

>
> I know Jann/John have been careful to not have special behaviors for
> the DMA case, but I think it makes sense here. It is actually different.
>

I think that makes sense. Everyone knew that DMA/FOLL_LONGTERM call sites
were at least potentially special, despite the spirited debates in at least
two conferences about the meaning and implications of "long term". :)

And here we are seeing an example of such a special case, which I think is
natural enough.


thanks,
--
John Hubbard
NVIDIA

2020-09-29 00:32:47

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Mon, Sep 28, 2020 at 07:51:07PM -0300, Jason Gunthorpe wrote:
> > So I'll attach my suggested patch, but I won't actually commit it. I'd
> > really like to have this tested, possibly _together_ with the sequence
> > count addition..
>
> Ok, we will do this once the holidays are over. If Peter doesn't take
> it I can look at the seqcount too.

Please feel free to. I can definitely test against the vfio test I used to
run, though I assume the rdma test would be more ideal. Just let me know if
there's anything else I can help with.

Thanks,

--
Peter Xu

2020-10-08 06:17:26

by Leon Romanovsky

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Mon, Sep 28, 2020 at 12:50:03PM -0700, Linus Torvalds wrote:
> On Mon, Sep 28, 2020 at 12:36 PM Linus Torvalds
> <[email protected]> wrote:
> >
> > So I'll do the pte wrprotect/restore removal. Anybody willing to do
> > and test the sequence count approach?
>
> So the wrprotect removal is trivial, with most of it being about the comments.
>
> However, when I look at this, I am - once again - tempted to just add a
>
> if (__page_mapcount(page) > 1)
> return 1;
>
> there too. Because we know it's a private mapping (shared mappings we
> checked for with the "is_cow_mapping()" earlier), and the only case we
> really care about is the one where the page is only mapped in the
> current mm (because that's what a write pinning will have done, and as
> mentioned, a read pinning doesn't do anything wrt fork() right now
> anyway).
>
> So if it's mapped in another mm, the COW clearly hasn't been broken by
> a pin, and a read pinned page had already gone through a fork.
>
> But the more I look at this code, the more I go "ok, I want somebody
> to actually test this with the rdma case".
>
> So I'll attach my suggested patch, but I won't actually commit it. I'd
> really like to have this tested, possibly _together_ with the sequence
> count addition..

Hi Linus,

We tested the suggested patch over the last two weeks in our nightly
regressions and didn't see any new failures. It looks like it is safe to use,
but it would be better to take the patch during/after the merge window to
minimize the risk of delaying v5.9.

Thanks

>
> Linus