Message-ID: <5547C992.9000703@redhat.com>
Date: Mon, 04 May 2015 12:33:38 -0700
From: Richard Henderson
To: Peter Zijlstra
CC: Linus Torvalds, Vladimir Makarov, Jakub Jelinek, Ingo Molnar, "H. Peter Anvin", Thomas Gleixner, Linux Kernel Mailing List, Borislav Petkov, "gcc@gcc.gnu.org"
Subject: [RFC] Design for flag bit outputs from asms
References: <20150501151630.GH5029@twins.programming.kicks-ass.net> <20150501163329.GU1751@tucnak.redhat.com> <5543CDC0.6010206@redhat.com> <20150502123958.GK5029@twins.programming.kicks-ass.net>
In-Reply-To: <20150502123958.GK5029@twins.programming.kicks-ass.net>

On 05/02/2015 05:39 AM, Peter Zijlstra wrote:
> static inline bool __test_and_clear_bit(long nr, volatile unsigned long *addr)
> {
> 	bool oldbit;
>
> 	asm volatile ("btr %2, %1"
> 		: "CF" (oldbit), "+m" (*addr)
> 		: "Ir" (nr));
>
> 	return oldbit;
> }
>
> Be the far better solution for this? Bug 59615 comment 7 states that
> they actually modeled the flags in the .md file, so the above should be
> possible to implement.
>
> Now GCC can decide to use "sbb %0, %0" to convert CF into a register
> value or use "jnc" / "jc" for branches, depending on what
> __test_and_clear_bit() was used for.
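For comparison, here is a minimal sketch of the same helper written with the x86 flag-output spelling that GCC later shipped ("=@ccc", usable when the compiler predefines __GCC_ASM_FLAG_OUTPUTS__), plus a portable C fallback that spells out the intended semantics. The name test_and_clear_bit_sketch is hypothetical, and unlike the kernel helper it assumes a single 64-bit word with 0 <= nr < 64; the constraint strings proposed in (1) are spelled differently.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: non-atomically clear bit nr of *addr and return
 * its previous value.  Assumes 0 <= nr < 64 (a single 64-bit word).
 */
static inline bool test_and_clear_bit_sketch(uint64_t nr, volatile uint64_t *addr)
{
#if defined(__GCC_ASM_FLAG_OUTPUTS__) && defined(__x86_64__)
	bool oldbit;

	/* btrq copies the selected bit into CF and then clears it; "=@ccc"
	 * hands CF to the compiler as a boolean output, leaving it free to
	 * materialize it (setc, sbb) or branch on it (jc/jnc). */
	__asm__ __volatile__ ("btrq %2, %1"
			      : "=@ccc" (oldbit), "+m" (*addr)
			      : "Jr" (nr));
	return oldbit;
#else
	/* Portable reference semantics for compilers without flag outputs. */
	uint64_t mask = (uint64_t)1 << nr;
	bool oldbit = (*addr & mask) != 0;

	*addr &= ~mask;
	return oldbit;
#endif
}
```

Either path returns the old bit as a plain bool, which is exactly the "boolean store" model described in (0).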
>
> We don't have to (ab)use asm goto for these things anymore; furthermore
> I think the above will naturally work with our __builtin_expect() hints,
> whereas the asm goto stuff has a hard time with that (afaik).
>
> That's not to say output operands for asm goto would not still be useful
> for other things (like your EXTABLE example).
>

(0) The C level output variable should be an integral type, from bool
on up.

The flags are a scarce resource, easily clobbered.  We cannot allow user
code to keep data in the flags.  While x86 does have lahf/sahf, they
don't exactly perform well.  And other targets like arm don't even have
that bad option.

Therefore, the language level semantics are that the output is a boolean
store into the variable with a condition specified by a magic constraint.

That said, just like the compiler should be able to optimize

	extern void foo(void);

	void bar(int y)
	{
	  int x = (y <= 0);
	  if (x)
	    foo();
	}

such that we only use a single compare against y, the expectation is
that within a similarly constrained context the compiler will not
require two tests for these boolean outputs.

Therefore:

(1) Each target defines a set of constraint strings.

E.g. for x86, wherein we're almost out of constraint letters,

	ja	aux carry flag
	jc	carry flag
	jo	overflow flag
	jp	parity flag
	js	sign flag
	jz	zero flag

E.g. for arm/aarch64 (using "j" here, but other possibilities exist):

	jn	negative flag
	jc	carry flag
	jz	zero flag
	jv	overflow flag

E.g. for s390x (I've thought less about what's useful here)

	j<m>	where m is a hex digit, and is the mask of CC values
		for which the condition is true; exactly corresponding
		to the M1 field in the branch on condition instruction.

(2) A new target hook post-processes the asm_insn, looking for the new
constraint strings.  The hook expands the condition prescribed by the
string, adjusting the asm_insn as required.

E.g.
	bool x, y, z;
	asm ("xyzzy" : "=jc"(x), "=jp"(y), "=jo"(z) : : );

originally

	(parallel [
	    (set (reg:QI 83 [ x ])
	        (asm_operands/v:QI ("xyzzy") ("=jc") 0 [] [] [] z.c:4))
	    (set (reg:QI 84 [ y ])
	        (asm_operands/v:QI ("xyzzy") ("=jp") 1 [] [] [] z.c:4))
	    (set (reg:QI 85 [ z ])
	        (asm_operands/v:QI ("xyzzy") ("=jo") 2 [] [] [] z.c:4))
	    (clobber (reg:QI 18 fpsr))
	    (clobber (reg:QI 17 flags))
	])

becomes

	(parallel [
	    (set (reg:CC 17 flags)
	        (asm_operands/v:CC ("xyzzy") ("=j_") 0 [] [] [] z.c:4))
	    (clobber (reg:QI 18 fpsr))
	])
	(set (reg:QI 83 [ x ])
	    (ne:QI (reg:CCC 17 flags) (const_int 0)))
	(set (reg:QI 84 [ y ])
	    (ne:QI (reg:CCP 17 flags) (const_int 0)))
	(set (reg:QI 85 [ z ])
	    (ne:QI (reg:CCO 17 flags) (const_int 0)))

which ought to assemble to something like

	xyzzy
	setc	%dl
	setp	%cl
	seto	%r15l

Note that RTL-level data flow is preserved via the flags hard register,
and the lifetime of flags would not be extended any further than it
would be for a normal cstore pattern.

Note that the output constraints are adjusted to a single internal "=j_"
which would match the flags register in any mode.  We can collapse
several output flags to a single set of the flags hard register.

(3) Note that ppc is both easier and more complicated.  There we have
eight 4-bit condition register fields, although most of the integer
non-comparisons only write to CR0.  And the vector non-comparisons only
write to CR1, though of course that's of less interest in the context
of kernel code.

For the purposes of cr0, the same scheme could certainly work, although
the hook would not insert a hard register use, but rather a pseudo to be
allocated to cr0 (constraint "x").

That said, it's my understanding that "dot insns", which set cr0, are
expensive in current processor generations.  There's also a lot less of
the x86-style "operate and set a flag based on something useful".

Can anyone think of any drawbacks, pitfalls, or portability issues to
less popular targets that I haven't considered?
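As a rough model of what the x86 half of (1) and (2) asks of the compiler, the sketch below treats one EFLAGS snapshot as the single flags set produced by the asm and derives each boolean output from it, mirroring the (ne:QI (reg:CCx 17 flags) (const_int 0)) stores. The table and flag_output() are hypothetical and cover only the six x86 constraint strings proposed in (1); the bit numbers are the architectural EFLAGS positions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Architectural bit positions of the x86 status flags in EFLAGS. */
enum {
	X86_CF = 0,  /* carry */
	X86_PF = 2,  /* parity */
	X86_AF = 4,  /* auxiliary carry */
	X86_ZF = 6,  /* zero */
	X86_SF = 7,  /* sign */
	X86_OF = 11, /* overflow */
};

/* Hypothetical mapping from the proposed "j?" constraint strings to flags. */
static const struct {
	const char *constraint;
	unsigned bit;
} x86_flag_constraints[] = {
	{ "ja", X86_AF }, { "jc", X86_CF }, { "jo", X86_OF },
	{ "jp", X86_PF }, { "js", X86_SF }, { "jz", X86_ZF },
};

/*
 * Model the boolean store emitted after the asm: return 1 or 0 for the
 * flag selected by 'constraint' in the EFLAGS snapshot, or -1 if the
 * string is not one of the proposed x86 flag constraints.
 */
static int flag_output(const char *constraint, uint32_t eflags)
{
	size_t n = sizeof x86_flag_constraints / sizeof x86_flag_constraints[0];

	for (size_t i = 0; i < n; i++)
		if (strcmp(constraint, x86_flag_constraints[i].constraint) == 0)
			return (int)((eflags >> x86_flag_constraints[i].bit) & 1u);
	return -1;
}
```

Outputs such as "=jc", "=jp" and "=jo" all read the same eflags value here, which is the point of collapsing them onto a single set of the flags hard register.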
r~