Subject: Re: [PATCH v7 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
From: Alex Kogan
Date: Fri, 31 Jan 2020 13:33:09 -0500
To: Peter Zijlstra
Cc: linux@armlinux.org.uk, Ingo Molnar, Will Deacon, Arnd Bergmann, Waiman Long,
    linux-arch@vger.kernel.org, linux-arm-kernel, linux-kernel@vger.kernel.org,
    Thomas Gleixner, Borislav Petkov, hpa@zytor.com, x86@kernel.org, Hanjun Guo,
    Jan Glauber, Steven Sistare, Daniel Jordan, dave.dice@oracle.com
In-Reply-To: <20200131133543.GD14914@hirez.programming.kicks-ass.net>
References: <20191125210709.10293-1-alex.kogan@oracle.com>
            <20191125210709.10293-4-alex.kogan@oracle.com>
            <20200121202919.GM11457@worktop.programming.kicks-ass.net>
            <20200122095127.GC14946@hirez.programming.kicks-ass.net>
            <20200122170456.GY14879@hirez.programming.kicks-ass.net>
            <20200131133543.GD14914@hirez.programming.kicks-ass.net>
>>> + *
>>> + * Returns 0 if a matching node is found; otherwise return the encoded pointer
>>> + * to the last element inspected (such that a subsequent scan can continue there).
>>> + *
>>> + * The worst case complexity of the scan is O(n), where n is the number
>>> + * of current waiters. However, the amortized complexity is close to O(1),
>>> + * as the immediate successor is likely to be running on the same node once
>>> + * threads from other nodes are moved to the secondary queue.
>>> + *
>>> + * XXX does not compute; given equal contention it should average to O(nr_nodes).
>> Let me try to convince you. Under contention, the immediate waiter would be
>> a local one, so the scan would typically take O(1) steps. We need to consider
>> the extra work/steps we take to move nodes to the secondary queue. The
>> number of such nodes is O(n) (to be more precise, it is O(n-m), where m
>> is nr_cpus_per_node), and we spend a constant amount of work, per node,
>> to scan those nodes and move them to the secondary queue. So in 2^threshold
>> lock handovers, we scan 2^threshold x 1 + (n-m) nodes. Assuming
>> 2^threshold > n, the amortized complexity of this function is then O(1).
>
> There is no threshold in this patch.
This does not change the analysis, though.

> That's the next patch, and
> I've been procrastinating on that one, mostly also because of the
> 'reasonable' claim without data.
>
> But Ah!, I think I see, after nr_cpus tries, all remote waiters are on
> the secondary queue and only local waiters will remain. That will indeed
> depress the average a lot.

Ok, cool.

< snip >

>
>>> +{
>>> +	struct mcs_spinlock *head_2nd, *tail_2nd;
>>> +
>>> +	tail_2nd = decode_tail(node->locked);
>>> +	head_2nd = tail_2nd->next;
>> I understand that you are trying to avoid repeating those two lines
>> in both places this function is called from, but you do it at the cost
>> of adding the following unnecessary if-statement, and in general this
>> function looks clunky.
>
> Let me show you the current form:
>
> /*
>  * cna_splice_head -- splice the entire secondary queue onto the head of the
>  * primary queue.
>  *
>  * Returns the new primary head node or NULL on failure.

Maybe explain here why failure can happen? E.g., "The latter can happen
due to a race with a waiter joining an empty primary queue."

>  */
> static struct mcs_spinlock *
> cna_splice_head(struct qspinlock *lock, u32 val,
>		struct mcs_spinlock *node, struct mcs_spinlock *next)
> {
>	struct mcs_spinlock *head_2nd, *tail_2nd;
>	u32 new;
>
>	tail_2nd = decode_tail(node->locked);
>	head_2nd = tail_2nd->next;
>
>	if (!next) {
>		/*
>		 * When the primary queue is empty; our tail becomes the primary tail.
>		 */
>
>		/*
>		 * Speculatively break the secondary queue's circular link; such that
>		 * when the secondary tail becomes the primary tail it all works out.
>		 */
>		tail_2nd->next = NULL;
>
>		/*
>		 * tail_2nd->next = NULL		xchg_tail(lock, tail)
>		 *
>		 * cmpxchg_release(&lock, val, new)	WRITE_ONCE(prev->next, node);
>		 *
>		 * Such that if the cmpxchg() succeeds, our stores won't collide.
>		 */
>		new = ((struct cna_node *)tail_2nd)->encoded_tail | _Q_LOCKED_VAL;
>		if (!atomic_try_cmpxchg_release(&lock->val, &val, new)) {
>			/*
>			 * Note; when this cmpxchg fails due to concurrent
>			 * queueing of a new waiter, then we'll try again in
>			 * cna_pass_lock() if required.
>			 *
>			 * Restore the secondary queue's circular link.
>			 */
>			tail_2nd->next = head_2nd;
>			return NULL;
>		}
>
>	} else {
>		/*
>		 * If the primary queue is not empty; the primary tail doesn't need
>		 * to change and we can simply link our tail to the old head.
>		 */
>		tail_2nd->next = next;
>	}
>
>	/* That which was the secondary queue head, is now the primary queue head */

Rephrase the comment?

>	return head_2nd;
> }
>
> The function as a whole is self-contained and consistent, it deals with
> the splice of the 2nd queue onto the 1st, for all possible cases. You only
> have to think about the list splice in one place, here, instead of two places.
>
> I don't think it will actually result in more branches emitted; the
> compiler should be able to use value propagation to eliminate stuff.

Ok, I can see your point.

>
>>> +static inline bool cna_try_clear_tail(struct qspinlock *lock, u32 val,
>>> +				      struct mcs_spinlock *node)
>>> +{
>>> +	struct mcs_spinlock *next;
>>> +
>>> +	/*
>>> +	 * We're here because the primary queue is empty; check the secondary
>>> +	 * queue for remote waiters.
>>> +	 */
>>> +	if (node->locked > 1) {
>>> +		/*
>>> +		 * When there are waiters on the secondary queue move them back
>>> +		 * onto the primary queue and let them rip.
>>> +		 */
>>> +		next = cna_splice_head(lock, val, node, NULL);
>>> +		if (next) {
>> And, again, this if-statement is here only because you moved the rest of
>> the code into cna_splice_head(). Perhaps, with cna_extract_2dary_head_tail()
>> you can bring that code back?
>
> I don't see the objection, you already had a branch there, from the
> cmpxchg(), this is the same branch, the compiler should fold the lot.
Now you have the branch from cmpxchg(), and another one from "if (next)".
But you are right that the compiler is likely to optimize out the latter.

> We can add an __always_inline if you're worried.

Let's do that.

< snip >

>>> +static inline void cna_pass_lock(struct mcs_spinlock *node,
>>> +				 struct mcs_spinlock *next)
>>> +{
>>> +	struct cna_node *cn = (struct cna_node *)node;
>>> +	u32 partial_order = cn->partial_order;
>>> +	u32 val = _Q_LOCKED_VAL;
>>> +
>>> +	/* cna_wait_head_or_lock() left work for us. */
>>> +	if (partial_order) {
>>> +		partial_order = cna_order_queue(node, decode_tail(partial_order));
>>> +
>>> +		if (!partial_order) {
>>> +			/*
>>> +			 * We found a local waiter; reload @next in case we called
>>> +			 * cna_order_queue() above.
>>> +			 */
>>> +			next = node->next;
>>> +			val |= node->locked;	/* preserve secondary queue */
>> This is wrong. node->locked can be 0, 1 or an encoded tail at this point,
>> and the latter case will be handled incorrectly.
>>
>> Should be
>>	if (node->locked) val = node->locked;
>> instead.
>
> I'm confused, who cares about the locked bit when it has an encoded tail on?
>
> The generic code only cares about it being !0, and the cna code always
> checks if it has a tail (> 1, <= 1) first.

Ah, that may actually work, but I am not sure if this was your intention.

The code above sets val to 1 or to encoded tail + 1 (rather than encoded tail).
decode_tail(tail) does not care about the LSB and will do its calculation
correctly. IOW, decode_tail(tail) is the same as decode_tail(tail + 1).

I think this is a bit fragile and depends on the implementation of
decode_tail(), but if you are fine with that, no objections here.

Regards,
-- Alex