Message-ID: <588a3276-5481-0a9f-9eac-fed09eede4f2@redhat.com>
Date: Tue, 28 Jun 2022 10:15:06 -0400
Subject: Re: [PATCH v6] x86/paravirt: useless assignment instructions cause
 Unixbench full core performance degradation
From: Waiman Long
To: Guo Hui, peterz@infradead.org
Cc: jgross@suse.com, srivatsa@csail.mit.edu, amakhalov@vmware.com,
 pv-drivers@vmware.com, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
 dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com, will@kernel.org,
 boqun.feng@gmail.com, virtualization@lists.linux-foundation.org,
 wangxiaohua@uniontech.com, linux-kernel@vger.kernel.org
References: <20220628125421.12364-1-guohui@uniontech.com>
In-Reply-To: <20220628125421.12364-1-guohui@uniontech.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 6/28/22 08:54, Guo Hui wrote:
> The
> instructions assigned to the vcpu_is_preempted function parameter
> in the X86 architecture physical machine are redundant instructions,
> causing the multi-core performance of Unixbench to drop by about 4% to 5%.
> The C function is as follows:
> static bool vcpu_is_preempted(long vcpu);
>
> The parameter 'vcpu' in the function osq_lock
> that calls the function vcpu_is_preempted is assigned as follows:
>
> The C code is in the function node_cpu:
> cpu = node->cpu - 1;
>
> The instructions corresponding to the C code are:
> mov 0x14(%rax),%edi
> sub $0x1,%edi
>
> The above instructions are unnecessary
> in the X86 Native operating environment,
> causing high cache-misses and degrading performance.
>
> This patch uses static_key to not execute this instruction
> in the Native runtime environment.
>
> The patch effect is as follows two machines,
> Unixbench runs with full core score:
>
> 1. Machine configuration:
> Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
> CPU core: 40
> Memory: 256G
> OS Kernel: 5.19-rc3
>
> Before using the patch:
> System Benchmarks Index Values         BASELINE        RESULT     INDEX
> Dhrystone 2 using register variables   116700.0   948326591.2   81261.9
> Double-Precision Whetstone                 55.0      211986.3   38543.0
> Execl Throughput                           43.0       43453.2   10105.4
> File Copy 1024 bufsize 2000 maxblocks    3960.0      438936.2    1108.4
> File Copy 256 bufsize 500 maxblocks      1655.0      118197.4     714.2
> File Copy 4096 bufsize 8000 maxblocks    5800.0     1534674.7    2646.0
> Pipe Throughput                         12440.0    46482107.6   37365.0
> Pipe-based Context Switching             4000.0     1915094.2    4787.7
> Process Creation                          126.0       85442.2    6781.1
> Shell Scripts (1 concurrent)               42.4       69400.7   16368.1
> Shell Scripts (8 concurrent)                6.0        8877.2   14795.3
> System Call Overhead                    15000.0     4714906.1    3143.3
>                                                                ========
> System Benchmarks Index Score                                    7923.3
>
> After using the patch:
> System Benchmarks Index Values         BASELINE        RESULT     INDEX
> Dhrystone 2 using register variables   116700.0   947032915.5   81151.1
> Double-Precision Whetstone                 55.0      211971.2   38540.2
> Execl Throughput                           43.0       45054.8   10477.9
> File Copy 1024 bufsize 2000 maxblocks    3960.0      515024.9    1300.6
> File Copy 256 bufsize 500 maxblocks      1655.0      146354.6     884.3
> File Copy 4096 bufsize 8000 maxblocks    5800.0     1679995.9    2896.5
> Pipe Throughput                         12440.0    46466394.2   37352.4
> Pipe-based Context Switching             4000.0     1898221.4    4745.6
> Process Creation                          126.0       85653.1    6797.9
> Shell Scripts (1 concurrent)               42.4       69437.3   16376.7
> Shell Scripts (8 concurrent)                6.0        8898.9   14831.4
> System Call Overhead                    15000.0     4658746.7    3105.8
>                                                                ========
> System Benchmarks Index Score                                    8248.8
>
> 2. Machine configuration:
> Hygon C86 7185 32-core Processor
> CPU core: 128
> Memory: 256G
> OS Kernel: 5.19-rc3
>
> Before using the patch:
> System Benchmarks Index Values         BASELINE        RESULT     INDEX
> Dhrystone 2 using register variables   116700.0  2256644068.3  193371.4
> Double-Precision Whetstone                 55.0      438969.9   79812.7
> Execl Throughput                           43.0       10108.6    2350.8
> File Copy 1024 bufsize 2000 maxblocks    3960.0      275892.8     696.7
> File Copy 256 bufsize 500 maxblocks      1655.0       72082.7     435.5
> File Copy 4096 bufsize 8000 maxblocks    5800.0      925043.4    1594.9
> Pipe Throughput                         12440.0   118905512.5   95583.2
> Pipe-based Context Switching             4000.0     7820945.7   19552.4
> Process Creation                          126.0       31233.3    2478.8
> Shell Scripts (1 concurrent)               42.4       49042.8   11566.7
> Shell Scripts (8 concurrent)                6.0        6656.0   11093.3
> System Call Overhead                    15000.0     6816047.5    4544.0
>                                                                ========
> System Benchmarks Index Score                                    7756.6
>
> After using the patch:
> System Benchmarks Index Values         BASELINE        RESULT     INDEX
> Dhrystone 2 using register variables   116700.0  2252272929.4  192996.8
> Double-Precision Whetstone                 55.0      451847.2   82154.0
> Execl Throughput                           43.0       10595.1    2464.0
> File Copy 1024 bufsize 2000 maxblocks    3960.0      301279.3     760.8
> File Copy 256 bufsize 500 maxblocks      1655.0       79291.3     479.1
> File Copy 4096 bufsize 8000 maxblocks    5800.0     1039755.2    1792.7
> Pipe Throughput                         12440.0   118701468.1   95419.2
> Pipe-based Context Switching             4000.0     8073453.3   20183.6
> Process Creation                          126.0       33440.9    2654.0
> Shell Scripts (1 concurrent)               42.4       52722.6   12434.6
> Shell Scripts (8 concurrent)                6.0        7050.4   11750.6
> System Call Overhead                    15000.0     6834371.5    4556.2
>                                                                ========
> System Benchmarks Index Score                                    8157.8
>
> Signed-off-by: Guo Hui
> ---
>  arch/x86/kernel/paravirt-spinlocks.c |  4 ++++
>  kernel/locking/osq_lock.c            | 12 +++++++++++-
>  2 files changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c
> index 9e1ea99..a2eb375 100644
> --- a/arch/x86/kernel/paravirt-spinlocks.c
> +++ b/arch/x86/kernel/paravirt-spinlocks.c
> @@ -33,6 +33,8 @@ bool pv_is_native_vcpu_is_preempted(void)
>                  __raw_callee_save___native_vcpu_is_preempted;
>  }
>
> +DECLARE_STATIC_KEY_TRUE(vcpu_has_preemption);
> +
>  void __init paravirt_set_cap(void)
>  {
>          if (!pv_is_native_spin_unlock())
> @@ -40,4 +42,6 @@ void __init paravirt_set_cap(void)
>
>          if (!pv_is_native_vcpu_is_preempted())
>                  setup_force_cpu_cap(X86_FEATURE_VCPUPREEMPT);
> +        else
> +                static_branch_disable(&vcpu_has_preemption);
>  }
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index d5610ad..883e815 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -27,6 +27,16 @@ static inline int node_cpu(struct optimistic_spin_node *node)
>          return node->cpu - 1;
>  }
>
> +DEFINE_STATIC_KEY_TRUE(vcpu_has_preemption);
> +
> +static inline bool vcpu_is_preempted_node(struct optimistic_spin_node *node)
> +{
> +        if (static_branch_likely(&vcpu_has_preemption))
> +                return vcpu_is_preempted(node_cpu(node->prev));
> +
> +        return false;
> +}
> +
>  static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
>  {
>          int cpu_nr = encoded_cpu_val - 1;
> @@ -141,7 +151,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>           * polling, be careful.
>           */
>          if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
> -                                  vcpu_is_preempted(node_cpu(node->prev))))
> +                                  vcpu_is_preempted_node(node)))
>                  return true;
>
>          /* unqueue */

How about a further improvement on configurations that don't use
vcpu_is_preempted() at all?

+#ifdef vcpu_is_preempted
+DEFINE_STATIC_KEY_TRUE(vcpu_has_preemption);
+
 static inline int node_cpu(struct optimistic_spin_node *node)
 {
        return node->cpu - 1;
 }

+static inline bool vcpu_is_preempted_node(struct optimistic_spin_node *node)
+{
+       if (static_branch_likely(&vcpu_has_preemption))
+               return vcpu_is_preempted(node_cpu(node->prev));
+
+       return false;
+}
+#else
+static inline bool vcpu_is_preempted_node(struct optimistic_spin_node *node)
+{
+       return false;
+}
+#endif
+
 static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
 {
        int cpu_nr = encoded_cpu_val - 1;
@@ -141,7 +158,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
         * polling, be careful.
         */
        if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
-                                 vcpu_is_preempted(node_cpu(node->prev))))
+                                 vcpu_is_preempted_node(node)))
                return true;

        /* unqueue */

For those configurations, vcpu_is_preempted_node() will just get compiled out.

Cheers,
Longman
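
The compile-out behaviour described above can be modelled in user space. The following is a minimal sketch, not kernel code, and it rests on several assumptions: struct spin_node, model_vcpu_is_preempted() and the MODEL_NO_PV switch are hypothetical names invented for illustration, and a plain bool stands in for the static key, so it shows only the #ifdef structure and not what the static branch machinery emits.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's struct optimistic_spin_node. */
struct spin_node {
	struct spin_node *prev;
	int cpu;		/* encoded CPU number: cpu_nr + 1 */
};

/*
 * Assumption: an architecture that supports the check provides a
 * function and a macro of the same name. MODEL_NO_PV (hypothetical)
 * mimics a configuration that provides neither.
 */
#ifndef MODEL_NO_PV
static bool model_vcpu_is_preempted(int cpu)
{
	return cpu == 3;	/* pretend that only CPU 3 is preempted */
}
#define vcpu_is_preempted model_vcpu_is_preempted
#endif

#ifdef vcpu_is_preempted
/* A plain flag models the static key; it starts out enabled. */
static bool vcpu_has_preemption = true;

static inline int node_cpu(const struct spin_node *node)
{
	return node->cpu - 1;
}

static inline bool vcpu_is_preempted_node(const struct spin_node *node)
{
	/* The load and decrement in node_cpu() run only when enabled. */
	if (vcpu_has_preemption)
		return vcpu_is_preempted(node_cpu(node->prev));

	return false;
}
#else
/* No preemption check available: the helper is a constant. */
static inline bool vcpu_is_preempted_node(const struct spin_node *node)
{
	(void)node;
	return false;
}
#endif
```

Built as is, the #ifdef branch is taken and the helper consults the flag before decoding the CPU number. Built with -DMODEL_NO_PV, the macro is never defined and the helper collapses to a constant false that the compiler can fold out of the spin condition.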