Subject: Re: [PATCH] riscv: Fix build with CONFIG_CC_OPTIMIZE_FOR_SIZE=y
From: Jessica Clarke
Date: Wed, 5 Oct 2022 15:25:32 +0100
To: Guo Ren
Cc: Atish Patra, Conor Dooley, Heiko Stuebner, Conor Dooley, Palmer Dabbelt,
 linux-riscv, Samuel Holland, Albert Ou, Anup Patel, Atish Patra, Dao Lu,
 Jisheng Zhang, Paul Walmsley, linux-kernel@vger.kernel.org
References: <20220922060958.44203-1-samuel@sholland.org>
 <2546376.ElGaqSPkdT@phil> <2E96A836-764D-4D07-AB79-3861B9CC2B1F@jrtc27.com>
 <13396584.uLZWGnKmhe@phil> <1CECF1C3-6FA1-49CC-8A7A-1E18E401B88B@jrtc27.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 5 Oct 2022, at 02:40, Guo Ren wrote:
>
> ,
>
> On Wed, Oct 5, 2022 at 9:01 AM Jessica Clarke wrote:
>>
>> On 5 Oct 2022, at 01:38, Guo Ren wrote:
>>>
>>> On Wed, Oct 5, 2022 at 1:24 AM Jessica Clarke wrote:
>>>>
>>>> On 4 Oct 2022, at 17:52, Atish Patra wrote:
>>>>>
>>>>> On Sat, Oct 1, 2022 at 1:13 PM Conor Dooley wrote:
>>>>>>
>>>>>> On Wed, Sep 28, 2022 at 08:26:01PM -0700, Atish Patra wrote:
>>>>>>> On Wed, Sep 28, 2022 at 2:16 PM Conor Dooley wrote:
>>>>>>>>
>>>>>>>> On Wed, Sep 28, 2022 at 12:21:55AM -0700, Atish Patra wrote:
>>>>>>>>> On Sat, Sep 24, 2022 at 4:15 PM Conor Dooley wrote:
>>>>>>>>>>
>>>>>>>>>> On Fri, Sep 23, 2022 at 11:01:28AM -0700, Atish Patra wrote:
>>>>>>>>>>> On Fri, Sep 23, 2022 at 12:18 AM Heiko Stuebner wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> On Thursday, 22 September 2022, 17:52:46 CEST, Jessica Clarke wrote:
>>>>>>>>>>>>> On 22 Sept 2022, at 16:45, Heiko Stuebner wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thursday, 22 September 2022, 08:09:58 CEST, Samuel Holland wrote:
>>>>>>>>>>>>>>> commit 8eb060e10185 ("arch/riscv: add Zihintpause support") broke
>>>>>>>>>>>>>>> building with CONFIG_CC_OPTIMIZE_FOR_SIZE enabled (gcc 11.1.0):
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   CC      arch/riscv/kernel/vdso/vgettimeofday.o
>>>>>>>>>>>>>>> In file included from :
>>>>>>>>>>>>>>> ./arch/riscv/include/asm/jump_label.h: In function 'cpu_relax':
>>>>>>>>>>>>>>> ././include/linux/compiler_types.h:285:33: warning: 'asm' operand 0 probably does not match constraints
>>>>>>>>>>>>>>>   285 | #define asm_volatile_goto(x...) asm goto(x)
>>>>>>>>>>>>>>>       |                                 ^~~
>>>>>>>>>>>>>>> ./arch/riscv/include/asm/jump_label.h:41:9: note: in expansion of macro 'asm_volatile_goto'
>>>>>>>>>>>>>>>    41 |         asm_volatile_goto(
>>>>>>>>>>>>>>>       |         ^~~~~~~~~~~~~~~~~
>>>>>>>>>>>>>>> ././include/linux/compiler_types.h:285:33: error: impossible constraint in 'asm'
>>>>>>>>>>>>>>>   285 | #define asm_volatile_goto(x...) asm goto(x)
>>>>>>>>>>>>>>>       |                                 ^~~
>>>>>>>>>>>>>>> ./arch/riscv/include/asm/jump_label.h:41:9: note: in expansion of macro 'asm_volatile_goto'
>>>>>>>>>>>>>>>    41 |         asm_volatile_goto(
>>>>>>>>>>>>>>>       |         ^~~~~~~~~~~~~~~~~
>>>>>>>>>>>>>>> make[1]: *** [scripts/Makefile.build:249: arch/riscv/kernel/vdso/vgettimeofday.o] Error 1
>>>>>>>>>>>>>>> make: *** [arch/riscv/Makefile:128: vdso_prepare] Error 2
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Having a static branch in cpu_relax() is problematic because that
>>>>>>>>>>>>>>> function is widely inlined, including in some quite complex functions
>>>>>>>>>>>>>>> like in the VDSO. A quick measurement shows this static branch is
>>>>>>>>>>>>>>> responsible by itself for around 40% of the jump table.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Drop the static branch, which ends up being the same number of
>>>>>>>>>>>>>>> instructions anyway. If Zihintpause is supported, we trade the nop from
>>>>>>>>>>>>>>> the static branch for a div. If Zihintpause is unsupported, we trade the
>>>>>>>>>>>>>>> jump from the static branch for (what gets interpreted as) a nop.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Fixes: 8eb060e10185 ("arch/riscv: add Zihintpause support")
>>>>>>>>>>>>>>> Signed-off-by: Samuel Holland
>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  arch/riscv/include/asm/hwcap.h          |  3 ---
>>>>>>>>>>>>>>>  arch/riscv/include/asm/vdso/processor.h | 25 ++++++++++---------------
>>>>>>>>>>>>>>>  2 files changed, 10 insertions(+), 18 deletions(-)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> diff --git a/arch/riscv/include/asm/hwcap.h b/arch/riscv/include/asm/hwcap.h
>>>>>>>>>>>>>>> index 6f59ec64175e..b21d46e68386 100644
>>>>>>>>>>>>>>> --- a/arch/riscv/include/asm/hwcap.h
>>>>>>>>>>>>>>> +++ b/arch/riscv/include/asm/hwcap.h
>>>>>>>>>>>>>>> @@ -68,7 +68,6 @@ enum riscv_isa_ext_id {
>>>>>>>>>>>>>>>   */
>>>>>>>>>>>>>>>  enum riscv_isa_ext_key {
>>>>>>>>>>>>>>>      RISCV_ISA_EXT_KEY_FPU, /* For 'F' and 'D' */
>>>>>>>>>>>>>>> -    RISCV_ISA_EXT_KEY_ZIHINTPAUSE,
>>>>>>>>>>>>>>>      RISCV_ISA_EXT_KEY_MAX,
>>>>>>>>>>>>>>>  };
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> @@ -88,8 +87,6 @@ static __always_inline int riscv_isa_ext2key(int num)
>>>>>>>>>>>>>>>          return RISCV_ISA_EXT_KEY_FPU;
>>>>>>>>>>>>>>>      case RISCV_ISA_EXT_d:
>>>>>>>>>>>>>>>          return RISCV_ISA_EXT_KEY_FPU;
>>>>>>>>>>>>>>> -    case RISCV_ISA_EXT_ZIHINTPAUSE:
>>>>>>>>>>>>>>> -        return RISCV_ISA_EXT_KEY_ZIHINTPAUSE;
>>>>>>>>>>>>>>>      default:
>>>>>>>>>>>>>>>          return -EINVAL;
>>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>> diff --git a/arch/riscv/include/asm/vdso/processor.h b/arch/riscv/include/asm/vdso/processor.h
>>>>>>>>>>>>>>> index 1e4f8b4aef79..789bdb8211a2 100644
>>>>>>>>>>>>>>> --- a/arch/riscv/include/asm/vdso/processor.h
>>>>>>>>>>>>>>> +++ b/arch/riscv/include/asm/vdso/processor.h
>>>>>>>>>>>>>>> @@ -4,30 +4,25 @@
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  #ifndef __ASSEMBLY__
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -#include
>>>>>>>>>>>>>>>  #include
>>>>>>>>>>>>>>> -#include
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  static inline void cpu_relax(void)
>>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>> -    if (!static_branch_likely(&riscv_isa_ext_keys[RISCV_ISA_EXT_KEY_ZIHINTPAUSE])) {
>>>>>>>>>>>>>>>  #ifdef __riscv_muldiv
>>>>>>>>>>>>>>> -        int dummy;
>>>>>>>>>>>>>>> -        /* In lieu of a halt instruction, induce a long-latency stall. */
>>>>>>>>>>>>>>> -        __asm__ __volatile__ ("div %0, %0, zero" : "=r" (dummy));
>>>>>>>>>>>>>>> +    int dummy;
>>>>>>>>>>>>>>> +    /* In lieu of a halt instruction, induce a long-latency stall. */
>>>>>>>>>>>>>>> +    __asm__ __volatile__ ("div %0, %0, zero" : "=r" (dummy));
>>>>>>>>>>>>>>>  #endif
>>>>>>>>>>>>>>> -    } else {
>>>>>>>>>>>>>>> -        /*
>>>>>>>>>>>>>>> -         * Reduce instruction retirement.
>>>>>>>>>>>>>>> -         * This assumes the PC changes.
>>>>>>>>>>>>>>> -         */
>>>>>>>>>>>>>>> +    /*
>>>>>>>>>>>>>>> +     * Reduce instruction retirement.
>>>>>>>>>>>>>>> +     * This assumes the PC changes.
>>>>>>>>>>>>>>> +     */
>>>>>>>>>>>>>>>  #ifdef __riscv_zihintpause
>>>>>>>>>>>>>>> -        __asm__ __volatile__ ("pause");
>>>>>>>>>>>>>>> +    __asm__ __volatile__ ("pause");
>>>>>>>>>>>>>>>  #else
>>>>>>>>>>>>>>> -        /* Encoding of the pause instruction */
>>>>>>>>>>>>>>> -        __asm__ __volatile__ (".4byte 0x100000F");
>>>>>>>>>>>>>>> +    /* Encoding of the pause instruction */
>>>>>>>>>>>>>>> +    __asm__ __volatile__ (".4byte 0x100000F");
>>>>>>>>>>>>>>>  #endif
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> hmm, though before this part of the code was only ever accessed
>>>>>>>>>>>>>> when the zhintpause extension was really available on the running
>>>>>>>>>>>>>> machine while now the pause instruction is called every time.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So I'm just wondering, can't this run into some "illegal instruction"
>>>>>>>>>>>>>> thingy on machines not supporting the extension?
>>>>>>>>>>>>>
>>>>>>>>>>>>> No. The encoding for pause was deliberately chosen to be one of the
>>>>>>>>>>>>> “useless” encodings of fence, with the hope that existing
>>>>>>>>>>>>> microarchitectures might take a while to execute it and thus it would
>>>>>>>>>>>>> still function as a slow-running instruction. It’s somewhat
>>>>>>>>>>>>> questionable whether the div is even needed, the worst that happens is
>>>>>>>>>>>>> cpu_relax isn’t very relaxed and you spin a bit faster. Any
>>>>>>>>>>>>> implementations where that’s true probably also don’t have fancy
>>>>>>>>>>>>> clock/power management anyway, and div isn’t going to be a low-power
>>>>>>>>>>>>> operation so the only real effect is likely hammering on contended
>>>>>>>>>>>>> atomics a bit more, and who cares about that on the low core count
>>>>>>>>>>>>> systems we have today.
>>>>>>>>>>>>
>>>>>>>>>>>> thanks a lot for that explanation, which made things a lot clearer.
>>>>>>>>>>>>
>>>>>>>>>>>> So as you said, dropping the div part might make the function even smaller,
>>>>>>>>>>>> though somehow part of me would want to add some sort of comment to
>>>>>>>>>>>> the function for when the next developer stumbles over the unconditional
>>>>>>>>>>>> use of pause :-) .
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I agree. If that's what microarch will do, we can drop div altogether.
>>>>>>>>>>> Though microarch may be treated as nop even if it is undesirable.
>>>>>>>>>>> IIRC, the div was introduced for the rocket chip which would induce a
>>>>>>>>>>> long latency stall with div instruction (zero as operands).
>>>>>>>>>>>
>>>>>>>>>>> Does any other core or newer rocket chip actually induce a latency
>>>>>>>>>>> stall with div instruction ?
>>>>>>>>>>> If not, it is equivalent to NOP as well. We can definitely remove the div.
>>>>>>>>>>> The only cores affected will be the older rocket core.
>>>>>>>>>>>
>>>>>>>>>>> Tagging some folks to understand what their core does.
>>>>>>>>>>>
>>>>>>>>>>> @Paul Walmsley @Guo Ren @Conor Dooley ?
>>>>>>>>>>
>>>>>>>>>> I am no microarch expert by _any_ stretch of the imagination, but
>>>>>>>>>> from a quick experiment it looks like the u54s on PolarFire SoC behave
>>>>>>>>>> in the same way, and div w/ zero operands does in fact take significantly
>>>>>>>>>> longer than regular division (looks to be about 3x).
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks. Do you have any data on how much the "pause" instruction takes.
>>>>>>>>
>>>>>>>> So these numbers you may consider as being pulled out of a magic hat
>>>>>>>> as all I am doing is reading the counters from userspace and there is
>>>>>>>> some variance etc. Plus the fact that I just started hacking at some
>>>>>>>> existing code I had lying around as I'm pretty snowed under at the
>>>>>>>> moment.
>>>>>>>>
>>>>>>>> Doing the following takes about 70 cycles on both a PolarFire SoC and an
>>>>>>>> unmatched:
>>>>>>>>     long divisor = 2, dividend = 100000, dest;
>>>>>>>>     asm("div %0, zero, zero" : "=r" (dest));
>>>>>>>> and equates to:
>>>>>>>>     sd a5,-48(s0)
>>>>>>>>     div a5,zero,zero
>>>>>>>>
>>>>>>>> Clocking in at about 40 cycles is some actual divisions, I just did the
>>>>>>>> following a dozen times, doing a trivial computation:
>>>>>>>>     long divisor = 2, dividend = 100000, dest;
>>>>>>>>     asm("div %0, %1, %2" : "=r" (dividend) : "r" (dividend), "r" (divisor))
>>>>>>>>
>>>>>>>> ie, a load of the following:
>>>>>>>>     sd a5,-48(s0)
>>>>>>>>     ld a5,-48(s0)
>>>>>>>>     ld a4,-40(s0)
>>>>>>>>     div a5,a5,a4
>>>>>>>>
>>>>>>>> So clearly the div w/ zero args makes a difference...
>>>>>>>>
>>>>>>>> On PolarFire SoC, `0x100000F` takes approx 6 cycles. On my unmatched, it
>>>>>>>> takes approx 40. Again, I just had an asm block & called the instruction
>>>>>>>> a number times and took the average - here it was 48 times.
>>>>>>>>
>>>>>>>> Take the actual numbers with a fist full of salt, but at least the
>>>>>>>> relative numbers should be of some use to you.
>>>>>>>>
>>>>>>>> Hope that's somewhat helpful, maybe next week I can do something a
>>>>>>>> little more useful for you...
>>>>>>>>
>>>>>>>
>>>>>>> Thanks. It would be good to understand what happens when "pause" is
>>>>>>> executed on these boards ?
>>>>>>
>>>>>> The actual pause instruction? uhh, so with the usual "I don't know what
>>>>>> I am doing" disclaimer, I ran each of the .insn and pause instruction 48
>>>>>> times in a row and checked the time elapsed via rdcycle & then ran that
>>>>>> c program 1000 times in a bash loop. Got the below, the insns were run
>>>>>> first and then the pauses.
>>>>>>       insn   pause
>>>>>> min    2.3     3.2
>>>>>> max    9.5    10.6
>>>>>> avg   27.0    29.1
>>>>>> 5%     2.9     4.2
>>>>>> 95%   18.1    19.1
>>>>>>
>>>>>> Swapping the pause & insn order around made a minor difference, but not
>>>>>> enough to report on. I'd be very wary of drawing any real conclusions
>>>>>> from this data, but at least both are roughly similar (and certainly not
>>>>>> even close to doing the div w/ zero args.
>>>>>>
>>>>>
>>>>> Yeah. That's what I was expecting. So we can't drop the div for now. Otherwise,
>>>>> the existing hardware(don't support Zhintpause) suffers by spinning faster.
>>>>
>>>> But does that actually matter in practice? If it doesn’t noticeable
>>>> affect performance then you don’t need to keep the div. There are a lot
>>>> of architectures that even just define cpu_relax() as barrier().
>>> Div is not semantic accurate for standard code, it should be in
>>> vendors' errata. I agree to leave nop as default and put a pause
>>> instead after the feature is detected.
>>
>> Nobody’s suggesting a literal nop instruction, that would be worse than
>> either div or pause. It’s always safe to execute pause, the question is
>> just whether on existing systems that don’t implement Zihintpause it
>> gets executed too quickly such that performance is degraded due to
>> spinning more aggressively.
>
> Why do you ensure pause can't be an illegal instruction in some old machine?

Because that’s how it’s defined; it uses one of the many hints
(instructions that aren’t a canonical nop but are defined to behave like
one in terms of architectural side-effects) from RV32I/RV64I.

> Why do you ensure div could save power for all microarchitectures?

I don’t. In fact it almost certainly won’t make the core enter a low
power state. It will just help reduce the amount of memory traffic by
taking a while to execute. I would rather not use div at all.

Jess

> nop (default) -> div/ (moved into vendor errata)
>               -> pause (when ZiHintPause feature detected)
>
>>
>> Jess
>>
>>>>
>>>> Jess
>>>>
>>>>> Thanks for running the experiments.
>>>>>
>>>>>> Again, hope that is helpful?
>>>>>> Conor.
>>>>>>
>>>>>>>
>>>>>>>> Conor.
>>>>>>>>
>>>>>>>>> I understand that it is not available in these cores. Just wanted to
>>>>>>>>> understand if microarchitecture
>>>>>>>>> actually takes a while executing the useless encoding as pointed out by Jessica.
>>>>>>>>>
>>>>>>>>> If that's the case, we can remove the div instruction altogether.
>>>>>>>>> Otherwise, this patch will cause some performance regression
>>>>>>>>> for existing SoC (HiFive unleashed has the same core. Not sure about
>>>>>>>>> unmatched though).
>>>>>>>>> This needs to be documented at least.
>>>>>>>>>
>>>>>>>>>> Hope that's helpful,
>>>>>>>>>> Conor.
>>>>>>>>>>
>>>>>>>>>> (I just did a quick check of what pretty much amounted to a bunch of
>>>>>>>>>> div a5,zero,zero in a row versus div a5,a5,a5)
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> (Please add anybody who may have an insight to execution flow on
>>>>>>>>>>> existing Linux capable cores)
>>>>>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Atish
>>>>
>>>
>>>
>>> --
>>> Best Regards
>>> Guo Ren
>>
>
>
> --
> Best Regards
> Guo Ren
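
Editorial sketch (not part of the thread): the point that pause cannot trap rests on its encoding, so it may help to decode the `.4byte 0x100000F` word from the patch by hand. The code below is a minimal illustration, assuming the standard RV32I FENCE field layout (fm, pred, succ, rs1, funct3, rd, opcode). The names fence_fields, decode_fence, is_pause and check_pause_encoding are hypothetical helpers introduced here, not kernel or toolchain APIs. It only inspects bit fields on the host and says nothing about how long any core takes to execute the instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of a RISC-V FENCE instruction (MISC-MEM major opcode). */
struct fence_fields {
    uint32_t opcode; /* bits  6:0,  0x0F for MISC-MEM */
    uint32_t rd;     /* bits 11:7 */
    uint32_t funct3; /* bits 14:12, 0 for FENCE */
    uint32_t rs1;    /* bits 19:15 */
    uint32_t succ;   /* bits 23:20, successor set, bit order I,O,R,W */
    uint32_t pred;   /* bits 27:24, predecessor set, bit order I,O,R,W */
    uint32_t fm;     /* bits 31:28, fence mode */
};

/* Split a 32-bit instruction word into FENCE-format fields. */
struct fence_fields decode_fence(uint32_t insn)
{
    struct fence_fields f;

    f.opcode = insn & 0x7F;
    f.rd     = (insn >> 7) & 0x1F;
    f.funct3 = (insn >> 12) & 0x7;
    f.rs1    = (insn >> 15) & 0x1F;
    f.succ   = (insn >> 20) & 0xF;
    f.pred   = (insn >> 24) & 0xF;
    f.fm     = (insn >> 28) & 0xF;
    return f;
}

/*
 * Return 1 if the word is the pause hint: a FENCE with fm=0, pred=W only,
 * an empty successor set, and rd=rs1=x0.
 */
int is_pause(uint32_t insn)
{
    struct fence_fields f = decode_fence(insn);

    return f.opcode == 0x0F && f.funct3 == 0 && f.rd == 0 && f.rs1 == 0 &&
           f.fm == 0 && f.pred == 0x1 && f.succ == 0x0;
}

/* Self-check against the word used in the patch and two non-pause words. */
void check_pause_encoding(void)
{
    assert(is_pause(0x0100000Fu));  /* .4byte 0x100000F from the patch */
    assert(!is_pause(0x0FF0000Fu)); /* plain "fence iorw, iorw" */
    assert(!is_pause(0x00000013u)); /* canonical nop: addi x0, x0, 0 */
}
```

Calling check_pause_encoding() from any host program should return silently. The word decodes to an ordinary FENCE with an empty successor set, which is why a core without Zihintpause still accepts it as a legal fence-space instruction and does not raise an illegal-instruction exception.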