Date: Thu, 17 Jun 2021 16:51:49 +1000
From: Nicholas Piggin
Subject: Re: [PATCH 4/8] membarrier: Make the post-switch-mm barrier explicit
To: Andy Lutomirski, "Peter Zijlstra (Intel)", Rik van Riel
Cc: Andrew Morton, Dave Hansen, Linux Kernel Mailing List,
 linux-mm@kvack.org, Mathieu Desnoyers, "Paul E. McKenney",
 the arch/x86 maintainers
McKenney" , the arch/x86 maintainers References: <1623816595.myt8wbkcar.astroid@bobo.none> <617cb897-58b1-8266-ecec-ef210832e927@kernel.org> <1623893358.bbty474jyy.astroid@bobo.none> <58b949fb-663e-4675-8592-25933a3e361c@www.fastmail.com> In-Reply-To: MIME-Version: 1.0 Message-Id: <1623911501.q97zemobmw.astroid@bobo.none> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Excerpts from Andy Lutomirski's message of June 17, 2021 3:32 pm: > On Wed, Jun 16, 2021, at 7:57 PM, Andy Lutomirski wrote: >>=20 >>=20 >> On Wed, Jun 16, 2021, at 6:37 PM, Nicholas Piggin wrote: >> > Excerpts from Andy Lutomirski's message of June 17, 2021 4:41 am: >> > > On 6/16/21 12:35 AM, Peter Zijlstra wrote: >> > >> On Wed, Jun 16, 2021 at 02:19:49PM +1000, Nicholas Piggin wrote: >> > >>> Excerpts from Andy Lutomirski's message of June 16, 2021 1:21 pm: >> > >>>> membarrier() needs a barrier after any CPU changes mm. There is = currently >> > >>>> a comment explaining why this barrier probably exists in all case= s. This >> > >>>> is very fragile -- any change to the relevant parts of the schedu= ler >> > >>>> might get rid of these barriers, and it's not really clear to me = that >> > >>>> the barrier actually exists in all necessary cases. >> > >>> >> > >>> The comments and barriers in the mmdrop() hunks? I don't see what = is=20 >> > >>> fragile or maybe-buggy about this. The barrier definitely exists. >> > >>> >> > >>> And any change can change anything, that doesn't make it fragile. = My >> > >>> lazy tlb refcounting change avoids the mmdrop in some cases, but i= t >> > >>> replaces it with smp_mb for example. >> > >>=20 >> > >> I'm with Nick again, on this. You're adding extra barriers for no >> > >> discernible reason, that's not generally encouraged, seeing how ext= ra >> > >> barriers is extra slow. >> > >>=20 >> > >> Both mmdrop() itself, as well as the callsite have comments saying = how >> > >> membarrier relies on the implied barrier, what's fragile about that= ? >> > >>=20 >> > >=20 >> > > My real motivation is that mmgrab() and mmdrop() don't actually need= to >> > > be full barriers. The current implementation has them being full >> > > barriers, and the current implementation is quite slow. So let's tr= y >> > > that commit message again: >> > >=20 >> > > membarrier() needs a barrier after any CPU changes mm. There is cur= rently >> > > a comment explaining why this barrier probably exists in all cases. = The >> > > logic is based on ensuring that the barrier exists on every control = flow >> > > path through the scheduler. It also relies on mmgrab() and mmdrop()= being >> > > full barriers. >> > >=20 >> > > mmgrab() and mmdrop() would be better if they were not full barriers= . As a >> > > trivial optimization, mmgrab() could use a relaxed atomic and mmdrop= () >> > > could use a release on architectures that have these operations. >> >=20 >> > I'm not against the idea, I've looked at something similar before (not >> > for mmdrop but a different primitive). Also my lazy tlb shootdown seri= es=20 >> > could possibly take advantage of this, I might cherry pick it and test= =20 >> > performance :) >> >=20 >> > I don't think it belongs in this series though. Should go together wit= h >> > something that takes advantage of it. >>=20 >> I=E2=80=99m going to see if I can get hazard pointers into shape quickly= . >=20 > Here it is. Not even boot tested! 
>> >
>> > I'm not against the idea, I've looked at something similar before (not
>> > for mmdrop but a different primitive). Also my lazy tlb shootdown series
>> > could possibly take advantage of this, I might cherry pick it and test
>> > performance :)
>> >
>> > I don't think it belongs in this series though. Should go together with
>> > something that takes advantage of it.
>>
>> I'm going to see if I can get hazard pointers into shape quickly.
>
> Here it is. Not even boot tested!
>
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=sched/lazymm&id=ecc3992c36cb88087df9c537e2326efb51c95e31
>
> Nick, I think you can accomplish much the same thing as your patch by:
>
> #define for_each_possible_lazymm_cpu while (false)

I'm not sure what you mean? For powerpc, other CPUs can be using the mm
as lazy at this point. I must be missing something.

>
> although a more clever definition might be even more performant.
>
> I would appreciate everyone's thoughts as to whether this scheme is sane.

powerpc has no use for it; after the series in akpm's tree there's just
a small change required for radix TLB flushing to make the final flush
IPI also purge lazies, and then the shootdown scheme runs with zero
additional IPIs, so there is essentially no benefit to the hazard
pointer games.

I have found the additional IPIs aren't bad anyway, so not something
we'd bother trying to optimise away on hash, which is slowly being
de-prioritized.

I must say, I still see active_mm featuring prominently in your patch,
which comes as a surprise. I would have thought some preparation and
cleanup work to fix the x86 deficiencies you were talking about should
go in first; I'm eager to see those. But either way I don't see a
fundamental reason this couldn't be done to support archs for which
the standard or shootdown refcounting options aren't sufficient.

IIRC I didn't see a fundamental hole in it last time you posted the
idea, but I admittedly didn't go through it super carefully.

Thanks,
Nick

>
> Paul, I'm adding you for two reasons. First, you seem to enjoy bizarre
> locking schemes. Secondly, because maybe RCU could actually work here.
> The basic idea is that we want to keep an mm_struct from being freed at
> an inopportune time. The problem with naively using RCU is that each
> CPU can use one single mm_struct while in an idle extended quiescent
> state (but not a user extended quiescent state). So rcu_read_lock() is
> right out. If RCU could understand this concept, then maybe it could
> help us, but this seems a bit out of scope for RCU.
>
> --Andy
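
(In toy form, the hazard pointer scheme described above might look
something like the following -- hypothetical names, a sketch of the
idea rather than the code in the linked commit:

static DEFINE_PER_CPU(struct mm_struct *, lazymm_hazard);

/* A CPU advertises the mm it is about to use lazily. */
static void lazymm_hazard_set(struct mm_struct *mm)
{
	this_cpu_write(lazymm_hazard, mm);
	smp_mb();	/* publish the hazard before dereferencing mm */
}

/* Teardown must not free an mm that any CPU still advertises. */
static bool lazymm_in_use(struct mm_struct *mm)
{
	int cpu;

	smp_mb();	/* pairs with the barrier in lazymm_hazard_set() */
	for_each_possible_cpu(cpu)
		if (per_cpu(lazymm_hazard, cpu) == mm)
			return true;
	return false;
}

The freeing side would defer or retry until lazymm_in_use() returns
false before calling __mmdrop(), which is what lets the scheme avoid
rcu_read_lock() even for CPUs sitting in the idle extended quiescent
state.)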