Date: Mon, 27 Jun 2022 11:58:58 -0400
Subject: Re: [PATCH v4] x86/paravirt: useless assignment instructions cause Unixbench full core performance degradation
To: Guo Hui, peterz@infradead.org
Cc: jgross@suse.com, srivatsa@csail.mit.edu, amakhalov@vmware.com,
 pv-drivers@vmware.com, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
 dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com, will@kernel.org,
 boqun.feng@gmail.com, virtualization@lists.linux-foundation.org,
 wangxiaohua@uniontech.com, linux-kernel@vger.kernel.org
References: <20220627142732.31067-1-guohui@uniontech.com>
From: Waiman Long
In-Reply-To: <20220627142732.31067-1-guohui@uniontech.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 6/27/22 10:27, Guo Hui wrote:
> The instructions assigned to the vcpu_is_preempted function parameter
> in the X86 architecture physical machine are redundant instructions,
> causing the multi-core performance of Unixbench to drop by about 4% to 5%.
> The C function is as follows:
> static bool vcpu_is_preempted(long vcpu);
>
> The parameter 'vcpu' in the function osq_lock
> that calls the function vcpu_is_preempted is assigned as follows:
>
> The C code is in the function node_cpu:
> cpu = node->cpu - 1;
>
> The instructions corresponding to the C code are:
> mov 0x14(%rax),%edi
> sub $0x1,%edi
>
> The above instructions are unnecessary
> in the X86 Native operating environment,
> causing high cache-misses and degrading performance.
>
> This patch uses static_key to not execute this instruction
> in the Native runtime environment.
>
> The patch effect is as follows two machines,
> Unixbench runs with full core score:
>
> 1. Machine configuration:
> Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
> CPU core: 40
> Memory: 256G
> OS Kernel: 5.19-rc3
>
> Before using the patch:
> System Benchmarks Index Values         BASELINE        RESULT     INDEX
> Dhrystone 2 using register variables   116700.0   948326591.2   81261.9
> Double-Precision Whetstone                 55.0      211986.3   38543.0
> Execl Throughput                           43.0       43453.2   10105.4
> File Copy 1024 bufsize 2000 maxblocks    3960.0      438936.2    1108.4
> File Copy 256 bufsize 500 maxblocks      1655.0      118197.4     714.2
> File Copy 4096 bufsize 8000 maxblocks    5800.0     1534674.7    2646.0
> Pipe Throughput                         12440.0    46482107.6   37365.0
> Pipe-based Context Switching             4000.0     1915094.2    4787.7
> Process Creation                          126.0       85442.2    6781.1
> Shell Scripts (1 concurrent)               42.4       69400.7   16368.1
> Shell Scripts (8 concurrent)                6.0        8877.2   14795.3
> System Call Overhead                    15000.0     4714906.1    3143.3
>                                                                ========
> System Benchmarks Index Score                                    7923.3
>
> After using the patch:
> System Benchmarks Index Values         BASELINE        RESULT     INDEX
> Dhrystone 2 using register variables   116700.0   947032915.5   81151.1
> Double-Precision Whetstone                 55.0      211971.2   38540.2
> Execl Throughput                           43.0       45054.8   10477.9
> File Copy 1024 bufsize 2000 maxblocks    3960.0      515024.9    1300.6
> File Copy 256 bufsize 500 maxblocks      1655.0      146354.6     884.3
> File Copy 4096 bufsize 8000 maxblocks    5800.0     1679995.9    2896.5
> Pipe Throughput                         12440.0    46466394.2   37352.4
> Pipe-based Context Switching             4000.0     1898221.4    4745.6
> Process Creation                          126.0       85653.1    6797.9
> Shell Scripts (1 concurrent)               42.4       69437.3   16376.7
> Shell Scripts (8 concurrent)                6.0        8898.9   14831.4
> System Call Overhead                    15000.0     4658746.7    3105.8
>                                                                ========
> System Benchmarks Index Score                                    8248.8
>
> 2. Machine configuration:
> Hygon C86 7185 32-core Processor
> CPU core: 128
> Memory: 256G
> OS Kernel: 5.19-rc3
>
> Before using the patch:
> System Benchmarks Index Values         BASELINE        RESULT     INDEX
> Dhrystone 2 using register variables   116700.0  2256644068.3  193371.4
> Double-Precision Whetstone                 55.0      438969.9   79812.7
> Execl Throughput                           43.0       10108.6    2350.8
> File Copy 1024 bufsize 2000 maxblocks    3960.0      275892.8     696.7
> File Copy 256 bufsize 500 maxblocks      1655.0       72082.7     435.5
> File Copy 4096 bufsize 8000 maxblocks    5800.0      925043.4    1594.9
> Pipe Throughput                         12440.0   118905512.5   95583.2
> Pipe-based Context Switching             4000.0     7820945.7   19552.4
> Process Creation                          126.0       31233.3    2478.8
> Shell Scripts (1 concurrent)               42.4       49042.8   11566.7
> Shell Scripts (8 concurrent)                6.0        6656.0   11093.3
> System Call Overhead                    15000.0     6816047.5    4544.0
>                                                                ========
> System Benchmarks Index Score                                    7756.6
>
> After using the patch:
> System Benchmarks Index Values         BASELINE        RESULT     INDEX
> Dhrystone 2 using register variables   116700.0  2252272929.4  192996.8
> Double-Precision Whetstone                 55.0      451847.2   82154.0
> Execl Throughput                           43.0       10595.1    2464.0
> File Copy 1024 bufsize 2000 maxblocks    3960.0      301279.3     760.8
> File Copy 256 bufsize 500 maxblocks      1655.0       79291.3     479.1
> File Copy 4096 bufsize 8000 maxblocks    5800.0     1039755.2    1792.7
> Pipe Throughput                         12440.0   118701468.1   95419.2
> Pipe-based Context Switching             4000.0     8073453.3   20183.6
> Process Creation                          126.0       33440.9    2654.0
> Shell Scripts (1 concurrent)               42.4       52722.6   12434.6
> Shell Scripts (8 concurrent)                6.0        7050.4   11750.6
> System Call Overhead                    15000.0     6834371.5    4556.2
>                                                                ========
> System Benchmarks Index Score                                    8157.8
>
> Signed-off-by: Guo Hui
> ---
>  arch/x86/kernel/paravirt-spinlocks.c |  4 ++++
>  kernel/locking/osq_lock.c            | 12 +++++++++++-
>  2 files changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c
> index 9e1ea99ad..a2eb375e2 100644
> --- a/arch/x86/kernel/paravirt-spinlocks.c
> +++ b/arch/x86/kernel/paravirt-spinlocks.c
> @@ -33,6 +33,8 @@ bool pv_is_native_vcpu_is_preempted(void)
>  		__raw_callee_save___native_vcpu_is_preempted;
>  }
>
> +DECLARE_STATIC_KEY_TRUE(vcpu_has_preemption);
> +
>  void __init paravirt_set_cap(void)
>  {
>  	if (!pv_is_native_spin_unlock())
> @@ -40,4 +42,6 @@ void __init paravirt_set_cap(void)
>
>  	if (!pv_is_native_vcpu_is_preempted())
>  		setup_force_cpu_cap(X86_FEATURE_VCPUPREEMPT);
> +	else
> +		static_branch_disable(&vcpu_has_preemption);
>  }
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index d5610ad52..adb41080d 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -27,6 +27,16 @@ static inline int node_cpu(struct optimistic_spin_node *node)
>  	return node->cpu - 1;
>  }
>
> +DEFINE_STATIC_KEY_TRUE(vcpu_has_preemption);
> +
> +static inline bool vcpu_is_preempted_node(struct optimistic_spin_node *node)
> +{
> +	if (!static_branch_unlikely(&vcpu_has_preemption))
> +		return false;
> +
> +	return vcpu_is_preempted(node_cpu(node->prev));
> +}
> +
>  static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
>  {
>  	int cpu_nr = encoded_cpu_val - 1;
> @@ -141,7 +151,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>  	 * polling, be careful.
>  	 */
>  	if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
> -				  vcpu_is_preempted(node_cpu(node->prev))))
> +				  vcpu_is_preempted_node(node)))
>  		return true;
>
>  	/* unqueue */

The patch looks good. I do have a minor nit though. Usually,
DEFINE_STATIC_KEY_TRUE() is paired with static_branch_likely() and
DEFINE_STATIC_KEY_FALSE() is paired with static_branch_unlikely(). I
think what Peter meant is to use DEFINE_STATIC_KEY_FALSE() and enable
vcpu_has_preemption together with X86_FEATURE_VCPUPREEMPT in the same if
block.

Cheers,
Longman
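
For illustration only, the pairing suggested in the review (a key that
defaults to false, tested with the "unlikely" variant, and enabled only in
the paravirt branch) can be sketched in userspace C. This is not the kernel
implementation and not necessarily what the next patch revision will look
like: every sketch_* name is a hypothetical stand-in, a plain bool replaces
the kernel's runtime code patching, and __builtin_expect() assumes GCC or
Clang.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a kernel static key.
 * The real thing patches the branch at runtime; this is just a flag. */
struct sketch_static_key {
	bool enabled;
};

#define SKETCH_DEFINE_STATIC_KEY_FALSE(name) \
	struct sketch_static_key name = { .enabled = false }

/* Mirrors the intent of static_branch_unlikely(): the key is usually off. */
#define sketch_static_branch_unlikely(key) \
	__builtin_expect((key)->enabled, 0)

static void sketch_static_branch_enable(struct sketch_static_key *key)
{
	key->enabled = true;
}

/* Default is false, so bare metal never needs to touch the key. */
SKETCH_DEFINE_STATIC_KEY_FALSE(vcpu_has_preemption);

/* Stand-in for the hypervisor hook; pretends every vCPU is preempted. */
static bool sketch_vcpu_is_preempted(int cpu)
{
	(void)cpu;
	return true;
}

/* Same shape as vcpu_is_preempted_node() in the quoted patch: when the
 * key is off, return early and never compute "encoded cpu - 1". */
static bool sketch_vcpu_is_preempted_node(int prev_cpu_encoded)
{
	if (!sketch_static_branch_unlikely(&vcpu_has_preemption))
		return false;

	return sketch_vcpu_is_preempted(prev_cpu_encoded - 1);
}

/* Same shape as paravirt_set_cap(): the key is enabled in the same
 * branch that would set X86_FEATURE_VCPUPREEMPT, with no else arm. */
static void sketch_paravirt_set_cap(bool native_hook)
{
	if (!native_hook)
		sketch_static_branch_enable(&vcpu_has_preemption);
}

/* Walk through both boot paths. */
void sketch_demo(void)
{
	assert(!sketch_vcpu_is_preempted_node(3)); /* key off by default */

	sketch_paravirt_set_cap(true);             /* native: key stays off */
	assert(!sketch_vcpu_is_preempted_node(3));

	sketch_paravirt_set_cap(false);            /* paravirt: key turned on */
	assert(sketch_vcpu_is_preempted_node(3));
}
```

With a false default, the native path needs no else branch at init time,
which is the practical difference from the DEFINE_STATIC_KEY_TRUE() plus
static_branch_disable() arrangement in the quoted patch.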