From: Linus Torvalds
Date: Sun, 30 Oct 2022 22:00:32 -0700
Subject: Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
To: Nadav Amit
Cc: Peter Zijlstra, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox,
	Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli,
	"Kirill A. Shutemov", jroedel@suse.de, ubizjak@gmail.com,
	Alistair Popple

On Sun, Oct 30, 2022 at 9:09 PM Nadav Amit wrote:
>
> I am sorry for not managing to make it reproducible on your system.

Heh, that's very much *not* your fault.

Honestly, I didn't try very much or very hard. I felt like I understood
the cause of the problem sufficiently that I didn't really need a
reproducer, and I much prefer to just think the solution through and
try to make it really robust.

Or, put another way - I'm just lazy.

> Anyhow, I ran the tests with the patches and there are no failures.

Lovely.

> Thanks for addressing this issue.

Well, I'm not sure the issue is "addressed" yet. I think the patch
series is likely the right thing to do, but others may disagree with
the approach.

And regardless of that, this still leaves some questions open:

 (a) there's the issue of s390, which does its own version of
__tlb_remove_page_size().

I *think* s390 basically does the TLB flush synchronously in
zap_pte_range(), and that it would be for that reason trivial to just
add that 'flags' argument to the s390 __tlb_remove_page_size(), and
make it do

	if (flags & TLB_ZAP_RMAP)
		page_zap_pte_rmap(page);

at the top synchronously too (rough sketch after this list). But some
s390 person would need to look at it.

I *think* the issue is literally that straightforward and not a big
deal, but it's probably not even worth bothering the s390 people until
the VM people have decided "yes, this makes sense".

 (b) the issue I mentioned with the currently useless
"page_mapcount(page) < 0" test with that patch.

Again, this is mostly just janitorial stuff associated with that patch
series.

 (c) whether to worry about back-porting.

I don't *think* this is worth backporting, but if it causes other
changes, then maybe..
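To make (a) concrete: assuming s390 really does flush synchronously
like that, I'd expect the end result to look roughly like the sketch
below. TLB_ZAP_RMAP and page_zap_pte_rmap() are the names from this
series; the rest is just the current s390 helper with the new argument
bolted on, untested and unreviewed, so treat it as the shape of the
change rather than the change itself.

	/*
	 * Hand-wavy sketch only, not a real patch. Because s390 flushes
	 * the TLB synchronously in zap_pte_range(), it never batches
	 * pages, so it can also drop the pte rmap synchronously here
	 * instead of deferring it to the batched page freeing.
	 *
	 * The 'flags' argument and TLB_ZAP_RMAP / page_zap_pte_rmap()
	 * are the proposed additions from this series.
	 */
	static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
						  struct page *page,
						  int page_size,
						  unsigned int flags)
	{
		if (flags & TLB_ZAP_RMAP)
			page_zap_pte_rmap(page);	/* TLB already flushed */

		free_page_and_swap_cache(page);
		return false;	/* never asks the caller to force a flush */
	}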
> I understand from the code that you decided to drop the deferring of
> set_page_dirty(), which could - at least for the munmap case (where
> mmap_lock is taken for write) - prevent the need for "force_flush"
> and potentially save TLB flushes.

I really liked my dirty patch, but your warning case really made it
obvious that it was just broken.

The thing is, moving the "set_page_dirty()" to later is really nice,
and really makes a *lot* of sense from a conceptual standpoint: only
after that TLB flush do we really have no more people who can dirty it.

BUT.

Even if we just used another bit in the array for "dirty", and did the
set_page_dirty() later (but still before getting rid of the rmap), that
wouldn't actually *work*.

Why? Because the race with folio_mkclean() would just come back. Yes,
now we'd have the rmap data, so mkclean would be forced to serialize
with the page table lock. But if we get rid of the "force_flush" for
the dirty bit, that serialization won't help, simply because we've
*dropped* the page table lock before we actually do the
set_page_dirty() again.

So the mkclean serialization needs *both* the late rmap dropping _and_
the page table lock being kept.

So deferring set_page_dirty() is conceptually the right thing to do
from a pure "just track the dirty bit" standpoint, but it doesn't work
with the way we currently expect mkclean to work.

> I was just wondering whether the reason for that is that you wanted
> to have small backportable and conservative patches, or whether you
> changed your mind about it.

See above: I still think it would be the right thing in a perfect
world. But with the current folio_mkclean(), we just can't do it. I had
completely forgotten / repressed that horror-show.

So the current ordering rules are basically that we need to do
set_page_dirty() *and* we need to flush the TLBs before dropping the
page table lock. That's what gets us serialized with "mkclean".

The whole "drop rmap" can then happen at any later time; the only
important thing is that it happens at least after the TLB flush. We
could do the rmap drop still inside the page table lock, but honestly,
it just makes more sense to do it as we free the batched pages anyway.

Am I missing something still?

And again, this is about our horrid serialization between
folio_mkclean() and set_page_dirty(). It's related to how GUP +
set_page_dirty() is also fundamentally problematic. So that dirty bit
situation *may* change if the rules for folio_mkclean() change...
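Anyway, spelling that ordering rule out as a sketch - purely
illustrative glue, not the actual zap_pte_range() code, and
page_zap_pte_rmap() is again the name from this series:

	/*
	 * Illustrative only - not the real zap_pte_range(). The
	 * invariant: set_page_dirty() and the TLB flush both happen
	 * before the page table lock is dropped; the rmap drop can then
	 * happen any time after the flush (in practice when the batched
	 * pages are freed).
	 */
	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);

	ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
	if (pte_dirty(ptent))
		set_page_dirty(page);		/* (1) under the ptl */

	flush_tlb_range(vma, start, end);	/* (2) still under the ptl */
	pte_unmap_unlock(pte, ptl);

	page_zap_pte_rmap(page);		/* (3) late rmap drop: any
						 *     time after (2) */

                Linus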