Subject: Re: [PATCH v4 3/5] locking/qspinlock: Introduce CNA into the slow
 path of qspinlock
From:   Alex Kogan <alex.kogan@oracle.com>
In-Reply-To: <3ae2b6a2-ffe6-2ca1-e5bf-2292db50e26f@redhat.com>
Date:   Thu, 19 Sep 2019 11:55:21 -0400
Cc:     linux@armlinux.org.uk, peterz@infradead.org, mingo@redhat.com,
        will.deacon@arm.com, arnd@arndb.de, linux-arch@vger.kernel.org,
        linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org,
        tglx@linutronix.de, bp@alien8.de, hpa@zytor.com, x86@kernel.org,
        guohanjun@huawei.com, jglauber@marvell.com,
        steven.sistare@oracle.com, daniel.m.jordan@oracle.com,
        dave.dice@oracle.com, rahul.x.yadav@oracle.com
Message-Id: <87B87982-670F-4F12-9EE0-DC89A059FAEC@oracle.com>
References: <20190906142541.34061-1-alex.kogan@oracle.com>
 <20190906142541.34061-4-alex.kogan@oracle.com>
 <3ae2b6a2-ffe6-2ca1-e5bf-2292db50e26f@redhat.com>
To:     Waiman Long <longman@redhat.com>

>> +/*
>> + * cna_try_find_next - scan the main waiting queue looking for the first
>> + * thread running on the same NUMA node as the lock holder. If found (call it
>> + * thread T), move all threads in the main queue between the lock holder and
>> + * T to the end of the secondary queue and return T; otherwise, return NULL.
>> + *
>> + * Schematically, this may look like the following (nn stands for numa_node and
>> + * et stands for encoded_tail).
>> + *
>> + *     when cna_try_find_next() is called (the secondary queue is empty):
>> + *
>> + *  A+------------+   B+--------+   C+--------+   T+--------+
>> + *   |mcs:next    | -> |mcs:next| -> |mcs:next| -> |mcs:next| -> NULL
>> + *   |mcs:locked=1|    |cna:nn=0|    |cna:nn=2|    |cna:nn=1|
>> + *   |cna:nn=1    |    +--------+    +--------+    +--------+
>> + *   +----------- +
>> + *
>> + *     when cna_try_find_next() returns (the secondary queue contains B and C):
>> + *
>> + *  A+----------------+    T+--------+
>> + *   |mcs:next        | ->  |mcs:next| -> NULL
>> + *   |mcs:locked=B.et | -+  |cna:nn=1|
>> + *   |cna:nn=1        |  |  +--------+
>> + *   +--------------- +  |
>> + *                       |
>> + *                       +->  B+--------+   C+--------+
>> + *                             |mcs:next| -> |mcs:next|
>> + *                             |cna:nn=0|    |cna:nn=2|
>> + *                             |cna:tail| -> +--------+
>> + *                             +--------+
>> + *
>> + * The worst case complexity of the scan is O(n), where n is the number
>> + * of current waiters. However, the fast path, which is expected to be the
>> + * common case, is O(1).
>> + */
>> +static struct mcs_spinlock *cna_try_find_next(struct mcs_spinlock *node,
>> +					      struct mcs_spinlock *next)
>> +{
>> +	struct cna_node *cn = (struct cna_node *)node;
>> +	struct cna_node *cni = (struct cna_node *)next;
>> +	struct cna_node *first, *last = NULL;
>> +	int my_numa_node = cn->numa_node;
>> +
>> +	/* fast path: immediate successor is on the same NUMA node */
>> +	if (cni->numa_node == my_numa_node)
>> +		return next;
>> +
>> +	/* find any next waiter on 'our' NUMA node */
>> +	for (first = cni;
>> +	     cni && cni->numa_node != my_numa_node;
>> +	     last = cni, cni = (struct cna_node *)READ_ONCE(cni->mcs.next))
>> +		;
>> +
>> +	/* if found, splice any skipped waiters onto the secondary queue */
>> +	if (cni && last)
>> +		cna_splice_tail(cn, first, last);
>> +
>> +	return (struct mcs_spinlock *)cni;
>> +}
>
> At the Linux Plumbers Conference last week, Will has raised the concern
> about the latency of the O(n) cna_try_find_next() operation that will
> add to the lock hold time.
While the worst case complexity of the scan is O(n), I _think_ it can be
proven that the amortized complexity is O(1). For intuition, consider a
two-node system with N threads total. In the worst case scenario, the scan
will go over N/2 threads running on a different node. If the scan ultimately
“fails” (no thread from the lock holder’s node is found), the lock will be
passed to the first thread from a different node and then between all those
N/2 threads, with a scan of just one queue node for the next N/2 - 1 passes.
Otherwise, those N/2 threads will be moved to the secondary queue. On the
next lock handover, we pass the lock either to the next thread in the main
queue (as it has to be from our node) or to the first thread in the secondary
queue. In both cases, we scan just one queue node, and in the latter case, we
have again N/2 - 1 passes with a scan of just one queue node each.
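
To make the counting argument above concrete, here is a small user-space
model (not kernel code, and not part of the patch). It assumes a static
queue with no new arrivals, and a simplified handover rule: scan the main
queue for the first waiter on the holder's node, move skipped waiters to a
secondary queue, and when the scan fails put the secondary queue back in
front of the main queue. The names simulate(), struct sim_result and
MAX_WAITERS are made up for this illustration.

```c
#include <assert.h>
#include <string.h>

#define MAX_WAITERS 1024	/* hypothetical limit for this model */

/* Totals collected while the queue drains. */
struct sim_result {
	int handovers;	/* number of lock handovers */
	int scanned;	/* queue entries examined across all scans */
};

/*
 * waiters[] holds the NUMA node of each queued thread in arrival order;
 * holder_nn is the NUMA node of the current lock holder.
 */
static struct sim_result simulate(const int *waiters, int n, int holder_nn)
{
	int mainq[MAX_WAITERS], secq[MAX_WAITERS], tmp[MAX_WAITERS];
	int main_len = n, sec_len = 0;
	struct sim_result res = { 0, 0 };

	assert(n >= 0 && n <= MAX_WAITERS);
	memcpy(mainq, waiters, n * sizeof(int));

	while (main_len > 0 || sec_len > 0) {
		int k = 0;

		/* scan the main queue for the first waiter on the holder's node */
		while (k < main_len) {
			res.scanned++;
			if (mainq[k] == holder_nn)
				break;
			k++;
		}

		if (k < main_len) {
			/* found: move the k skipped waiters to the secondary queue */
			memcpy(secq + sec_len, mainq, k * sizeof(int));
			sec_len += k;
		} else {
			/* not found: secondary queue goes back in front of the main queue */
			memcpy(tmp, secq, sec_len * sizeof(int));
			memcpy(tmp + sec_len, mainq, main_len * sizeof(int));
			main_len += sec_len;
			memcpy(mainq, tmp, main_len * sizeof(int));
			sec_len = 0;
			k = 0;
		}

		/* pass the lock to mainq[k]; drop it and everything before it */
		holder_nn = mainq[k];
		memmove(mainq, mainq + k + 1, (main_len - k - 1) * sizeof(int));
		main_len -= k + 1;
		res.handovers++;
	}
	return res;
}
```

With 8 waiters on node 1 queued ahead of 8 waiters on node 0 and the holder
on node 0, this model reports 16 handovers and 23 scanned entries: one scan
of 9 entries, and every other handover examines at most one entry. This is
only an illustration of the two-node intuition under the assumptions above,
not a proof for the general case.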

> One way to hide some of the latency is to do
> a pre-scan before acquiring the lock. The CNA code could override the
> pv_wait_head_or_lock() function to call cna_try_find_next() as a
> pre-scan and return 0. What do you think?
This is certainly possible, but I do not think it would completely eliminate
the worst case scenario. It will probably make it even less likely, but at
the same time, we will reduce the chance of actually finding a thread from the
same node (that may enter the main queue while we wait for the owner & pending
to go away).

Regards,
-- Alex