Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20;
MIME-Version: 1.0
References: <20221022111403.531902164@infradead.org> <20221022114424.515572025@infradead.org>
 <2c800ed1-d17a-def4-39e1-09281ee78d05@nvidia.com> <Y1ZGNwUEdm15W6Xz@hirez.programming.kicks-ass.net>
 <CAG48ez3fG=WnvbiE5mJaD66_gJj_hohnV8CqBG9eNdjd7pJW3A@mail.gmail.com> <CAHk-=wjoGnHVVp5pcb69C6HtnJ_avqQo_b5NK-X9bgoDDw_XbA@mail.gmail.com>
In-Reply-To: <CAHk-=wjoGnHVVp5pcb69C6HtnJ_avqQo_b5NK-X9bgoDDw_XbA@mail.gmail.com>
From:   Jann Horn <jannh@google.com>
Date:   Mon, 24 Oct 2022 22:23:51 +0200
Message-ID: <CAG48ez3KMuM2aLpc8SvGiqZvv1KFc7E2DBZ4CLzxQeFhrMN_-w@mail.gmail.com>
Subject: Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
To:     Linus Torvalds <torvalds@linux-foundation.org>
Cc:     Peter Zijlstra <peterz@infradead.org>,
        John Hubbard <jhubbard@nvidia.com>, x86@kernel.org,
        willy@infradead.org, akpm@linux-foundation.org,
        linux-kernel@vger.kernel.org, linux-mm@kvack.org,
        aarcange@redhat.com, kirill.shutemov@linux.intel.com,
        jroedel@suse.de, ubizjak@gmail.com
Content-Type: text/plain; charset="UTF-8"
Precedence: bulk

On Mon, Oct 24, 2022 at 10:19 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, Oct 24, 2022 at 12:58 PM Jann Horn <jannh@google.com> wrote:
> >
> > Unless I'm completely misunderstanding what's going on here, the whole
> > "remove_table" thing only happens when you "remove a table", meaning
> > you free an entire *pagetable*. Just zapping PTEs doesn't trigger that
> > logic.
>
> I do have to admit that I'd be happier if this code - and the GUP code
> that also relies on "interrupts off" behavior - would just use a
> sequence counter instead.
>
> Relying on blocking IPI's is clever, but also clearly very subtle and
> somewhat dangerous.
>
> I think our GUP code is a *lot* more important than some "legacy
> x86-32 has problems in case you have an incredibly unlikely race that
> re-populates the page table with a different page that just happens to
> be exactly the same MOD-4GB", so honestly, I don't think the
> load-tearing is even worth worrying about - if you have hardware that
> is good enough at virtualizing things, it's almost certainly already
> 64-bit, and running 32-bit virtual machines with PAE you really only
> have yourself to blame.
>
> So I can't find it in myself to care about the 32-bit tearing thing,
> but this discussion makes me worried about Fast GUP.
>
> Note that even with proper atomic
>
>                 pte_t pte = ptep_get_lockless(ptep);
>
> in gup_pte_range(), and even if the page tables are RCU-free'd, that
> just means that the 'ptep' access itself is safe.
>
> But then you have the whole "the lookup of the page pointer is not
> atomic" wrt that. And right now that GUP code does rely on the "block
> IPI" to make it basically valid.
>
> I don't think it matters if GUP races with munmap or madvise() or
> something like that - if you get the old page, that's still a valid
> page, and the user only has himself to blame.
>
> But if we have memory pressure that causes vmscan to push out a page,
> and it gets replaced with a new page, and GUP gets the old page with
> no serialization, that sounds like a possible source of data
> inconsistency.
>
> I don't know if this can happen, but the whole "interrupts disabled
> doesn't actually block IPI's and synchronize with TLB flushes" really
> sounds like it would affect GUP too. And be much more serious there
> than on some x86-32 platform that nobody should be using anyway.

That's why GUP-fast re-checks the PTE after it has grabbed a reference
to the page. Pasting from a longer writeup that I plan to publish
soon, describing how gup_pte_range() works:

"""
This guarantees that the page tables that are being walked
aren't freed concurrently, but at the end of the walk, we
have to grab a stable reference to the referenced page.
For this we use the grab-reference-and-revalidate trick
from above again: First we (locklessly) load the page
table entry, then we grab a reference to the page that it
points to (which can fail if the refcount is zero, in that
case we bail), then we recheck that the page table entry
is still the same, and if it changed in between, we drop the
page reference and bail.
This can, again, grab a reference to a page after it has
already been freed and reallocated. The reason why this is
fine is that the metadata structure that holds this refcount,
`struct folio` (or `struct page`, depending on which kernel
version you're looking at; in current kernels it's `folio`
but `struct page` and `struct folio` are actually aliases for
the same memory, basically, though that is supposed to maybe
change at some point) is never freed; even when a page is
freed and reallocated, the corresponding `struct folio`
stays. This does have the fun consequence that whenever a
page/folio has a non-zero refcount, the refcount can
spuriously go up and then back down for a little bit.
(Also it's technically not as simple as I just described it,
because the `struct page` that the PTE points to might be
a "tail page" of a `struct folio`.
So actually we first read the PTE, the PTE gives us the
`page*`, then from that we go to the `folio*`, then we
try to grab a reference to the `folio`, then if that worked
we check that the `page` still points to the same `folio`,
and then we recheck that the PTE is still the same.)
"""