From: Davidlohr Bueso <dave@stgolabs.net>
To: npiggin@gmail.com
Cc: benh@kernel.crashing.org, linux-kernel@vger.kernel.org,
    linuxppc-dev@lists.ozlabs.org, longman@redhat.com, mingo@redhat.com,
    mpe@ellerman.id.au, paulus@samba.org, peterz@infradead.org,
    will@kernel.org, dbueso@suse.de
Subject: [PATCH v2] powerpc/qspinlock: Use generic smp_cond_load_relaxed
Date: Thu, 18 Mar 2021 13:47:02 -0700
Message-Id: <20210318204702.71417-1-dave@stgolabs.net>
In-Reply-To: <1615870473.h7h4jetmjb.astroid@bobo.none>
References: <1615870473.h7h4jetmjb.astroid@bobo.none>
X-Mailing-List: linux-kernel@vger.kernel.org

Commit 49a7d46a06c3 ("powerpc: Implement smp_cond_load_relaxed()") added
busy-wait pausing with a preferred SMT priority pattern, lowering the
priority (reducing decode cycles) during the whole loop slowpath.
However, data shows that while this pattern works well with simple
spinlocks, queued spinlocks benefit more from being kept at medium
priority, with a cpu_relax() instead, which on powerpc is a low+medium
combination.

Data is from three benchmarks on a Power9 9008-22L: 64 CPUs with
2 sockets and 8 threads per core.

1. locktorture.

This is data for the lowest and most artificial/pathological level,
with increasing thread counts pounding on the lock. Metrics are total
ops/minute. Despite some small hits in the 4-8 thread range, scenarios
are either neutral or favorable to this patch.

+=========+==========+==========+=======+
| # tasks | vanilla  | dirty    | %diff |
+=========+==========+==========+=======+
| 2       | 46718565 | 48751350 |  4.35 |
+---------+----------+----------+-------+
| 4       | 51740198 | 50369082 | -2.65 |
+---------+----------+----------+-------+
| 8       | 63756510 | 62568821 | -1.86 |
+---------+----------+----------+-------+
| 16      | 67824531 | 70966546 |  4.63 |
+---------+----------+----------+-------+
| 32      | 53843519 | 61155508 | 13.58 |
+---------+----------+----------+-------+
| 64      | 53005778 | 53104412 |  0.18 |
+---------+----------+----------+-------+
| 128     | 53331980 | 54606910 |  2.39 |
+=========+==========+==========+=======+

2. sockperf (tcp throughput)

Here a client does one-way throughput tests against a localhost server,
with increasing message sizes, pounding on the sk_lock. This patch puts
qspinlock performance back on par with the simple spinlock:

             simple-spinlock        vanilla                 dirty
Hmean  14     73.50 (  0.00%)      54.44 * -25.93%*       73.45 *  -0.07%*
Hmean 100    654.47 (  0.00%)     385.61 * -41.08%*      771.43 *  17.87%*
Hmean 300   2719.39 (  0.00%)    2181.67 * -19.77%*     2666.50 *  -1.94%*
Hmean 500   4400.59 (  0.00%)    3390.77 * -22.95%*     4322.14 *  -1.78%*
Hmean 850   6726.21 (  0.00%)    5264.03 * -21.74%*     6863.12 *   2.04%*

3. dbench (tmpfs)

Configured to run with up to ncpusx8 clients, it shows both latency and
throughput metrics.
For the latency, with the exception of the 64-client case, there is
really nothing to go by:

                          vanilla               dirty
Amean latency-1      1.67 (  0.00%)      1.67 *   0.09%*
Amean latency-2      2.15 (  0.00%)      2.08 *   3.36%*
Amean latency-4      2.50 (  0.00%)      2.56 *  -2.27%*
Amean latency-8      2.49 (  0.00%)      2.48 *   0.31%*
Amean latency-16     2.69 (  0.00%)      2.72 *  -1.37%*
Amean latency-32     2.96 (  0.00%)      3.04 *  -2.60%*
Amean latency-64     7.78 (  0.00%)      8.17 *  -5.07%*
Amean latency-512  186.91 (  0.00%)    186.41 *   0.27%*

For the dbench4 throughput (misleading but traditional) there is a small
but rather consistent improvement:

                          vanilla                dirty
Hmean 1        849.13 (  0.00%)      851.51 *  0.28%*
Hmean 2       1664.03 (  0.00%)     1663.94 * -0.01%*
Hmean 4       3073.70 (  0.00%)     3104.29 *  1.00%*
Hmean 8       5624.02 (  0.00%)     5694.16 *  1.25%*
Hmean 16      9169.49 (  0.00%)     9324.43 *  1.69%*
Hmean 32     11969.37 (  0.00%)    12127.09 *  1.32%*
Hmean 64     15021.12 (  0.00%)    15243.14 *  1.48%*
Hmean 512    14891.27 (  0.00%)    15162.11 *  1.82%*

Measuring the dbench4 per-VFS-operation latency shows only very minor
differences, within the noise level, in the 0-1% range.

Fixes: 49a7d46a06c3 ("powerpc: Implement smp_cond_load_relaxed()")
Acked-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
---
Changes from v1:
- Added a short description and label for smp_cond_load_relaxed, as
  requested by Nick.
- Added Nick's ack.
 arch/powerpc/include/asm/barrier.h   | 16 ----------------
 arch/powerpc/include/asm/qspinlock.h |  7 +++++++
 2 files changed, 7 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
index aecfde829d5d..7ae29cfb06c0 100644
--- a/arch/powerpc/include/asm/barrier.h
+++ b/arch/powerpc/include/asm/barrier.h
@@ -80,22 +80,6 @@ do {									\
 	___p1;								\
 })
 
-#ifdef CONFIG_PPC64
-#define smp_cond_load_relaxed(ptr, cond_expr) ({		\
-	typeof(ptr) __PTR = (ptr);				\
-	__unqual_scalar_typeof(*ptr) VAL;			\
-	VAL = READ_ONCE(*__PTR);				\
-	if (unlikely(!(cond_expr))) {				\
-		spin_begin();					\
-		do {						\
-			VAL = READ_ONCE(*__PTR);		\
-		} while (!(cond_expr));				\
-		spin_end();					\
-	}							\
-	(typeof(*ptr))VAL;					\
-})
-#endif
-
 #ifdef CONFIG_PPC_BOOK3S_64
 #define NOSPEC_BARRIER_SLOT	nop
 #elif defined(CONFIG_PPC_FSL_BOOK3E)

diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
index b052b0624816..9da649e1a488 100644
--- a/arch/powerpc/include/asm/qspinlock.h
+++ b/arch/powerpc/include/asm/qspinlock.h
@@ -72,6 +72,13 @@ static inline void pv_spinlocks_init(void)
 
 #endif
 
+/*
+ * Queued spinlocks rely heavily on smp_cond_load_relaxed to busy-wait,
+ * which was found to have performance problems if implemented with
+ * the preferred spin_begin()/spin_end() SMT priority pattern. Use the
+ * generic version instead.
+ */
+
 #include <asm-generic/qspinlock.h>
 
 #endif /* _ASM_POWERPC_QSPINLOCK_H */
-- 
2.26.2