From: David Laight
To: 'Yury Norov', Andy Shevchenko
CC: Linus Torvalds, Linux Kernel Mailing List, Guenter Roeck, "Dennis Zhou", Russell King, "Catalin Marinas", Andy Shevchenko, Rasmus Villemoes, Alexey Klimov, Kees Cook, Andy Whitcroft
Subject: RE: [PATCH v2 1/3] lib/find_bit: introduce FIND_FIRST_BIT() macro
Date: Wed, 24 Aug 2022 14:18:40 +0000
References: <20220824012624.2826445-1-yury.norov@gmail.com> <20220824012624.2826445-2-yury.norov@gmail.com>
X-Mailing-List: linux-kernel@vger.kernel.org

...
> And generated code looks almost the same, except that
> on x86_64 your version is bigger.
> Compare before:
> 0000000000000000 <_find_first_bit>:
>    0:   mov    %rsi,%rax
>    3:   test   %rsi,%rsi
>    6:   je     35 <_find_first_bit+0x35>
>    8:   xor    %edx,%edx
>    a:   jmp    19 <_find_first_bit+0x19>
>    c:   add    $0x40,%rdx               // Track bits and
>   10:   add    $0x8,%rdi                // index separately

That add is free - it happens in parallel with other instructions.

>   14:   cmp    %rax,%rdx
>   17:   jae    35 <_find_first_bit+0x35>

The instructions below will (probably/hopefully) be speculatively
executed in parallel with the cmp/jae above.

>   19:   mov    (%rdi),%rcx
>   1c:   test   %rcx,%rcx
>   1f:   je     c <_find_first_bit+0xc>
>   21:   tzcnt  %rcx,%rcx
>   26:   add    %rdx,%rcx
>   29:   cmp    %rcx,%rax
>   2c:   cmova  %rcx,%rax
>   30:   jmp    35 <_find_first_bit+0x35>
>   35:   jmp    3a <_find_first_bit+0x3a>
>   3a:   nopw   0x0(%rax,%rax,1)
>
> And after:
> 0000000000000000 <_find_first_bit>:
>    0:   mov    %rsi,%rax
>    3:   test   %rsi,%rsi
>    6:   je     39 <_find_first_bit+0x39>
>    8:   xor    %edx,%edx
>    a:   jmp    15 <_find_first_bit+0x15>
>    c:   add    $0x40,%rdx               // Track bits only
>   10:   cmp    %rdx,%rax
>   13:   jbe    39 <_find_first_bit+0x39>
>   15:   mov    %rdx,%rcx
>   18:   shr    $0x6,%rcx                // But divide here
>   1c:   mov    (%rdi,%rcx,8),%rcx
>   20:   test   %rcx,%rcx

That is a long register dependency chain involving %rcx.
It will limit the execution speed to at least 6 clocks/iteration.
The older version might be 3 clocks/iteration.
So this could easily run at half the speed.
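For reference, the two loop shapes in the disassembly correspond roughly
to the C below. This is only a sketch for comparing them outside the
kernel: the names find_first_bit_ptr() and find_first_bit_idx() are made
up for illustration, BITS_PER_LONG is defined locally, and the GCC/Clang
builtin __builtin_ctzl() stands in for the kernel's __ffs(). It shows
the structure of the two loops only and says nothing definite about the
code any particular compiler will generate.

```c
#include <stddef.h>

#define BITS_PER_LONG (8 * sizeof(unsigned long))

/*
 * "Before" shape (hypothetical name): the word pointer and the bit
 * offset are advanced by two independent adds, so the next load does
 * not have to wait for the bit counter.
 */
static size_t find_first_bit_ptr(const unsigned long *addr, size_t size)
{
	size_t base;

	for (base = 0; base < size; base += BITS_PER_LONG, addr++) {
		if (*addr) {
			size_t bit = base + (size_t)__builtin_ctzl(*addr);

			/* Clamp to size, like min() in the macro. */
			return bit < size ? bit : size;
		}
	}
	return size;
}

/*
 * "After" shape (hypothetical name): only a bit index is tracked and
 * the word index is recomputed from it (a shift) before every load,
 * so the load address depends on the add and the shift.
 */
static size_t find_first_bit_idx(const unsigned long *addr, size_t size)
{
	size_t idx;

	for (idx = 0; idx < size; idx += BITS_PER_LONG) {
		unsigned long val = addr[idx / BITS_PER_LONG];

		if (val) {
			size_t bit = idx + (size_t)__builtin_ctzl(val);

			/* Clamp to size, like min() in the macro. */
			return bit < size ? bit : size;
		}
	}
	return size;
}
```

Both functions take a bitmap and its length in bits and return the same
value for the same input, so timing them over a large, mostly-zero
bitmap is one way to check whether the extra dependency matters on a
given CPU.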

        David

>   23:   je     c <_find_first_bit+0xc>
>   25:   tzcnt  %rcx,%rcx
>   2a:   add    %rcx,%rdx
>   2d:   cmp    %rdx,%rax
>   30:   cmova  %rdx,%rax
>   34:   jmp    39 <_find_first_bit+0x39>
>   39:   jmp    3e <_find_first_bit+0x3e>
>   3e:   xchg   %ax,%ax                  // Which adds 4 bytes to .text
>
> Thanks,
> Yury
>
> > > +             val = (EXPRESSION);                                     \
> > > +             if (val) {                                              \
> > > +                     sz = min(idx * BITS_PER_LONG + __ffs(word_op(val)), sz);\
> >
> > sz = min(idx + __ffs(...));
> >
> > > +                     break;                                          \
> > > +             }                                                       \
> > > +     }                                                               \
> > > +                                                                     \
> > > +     sz;                                                             \
> > > +})
> >
> >
> > --
> > With Best Regards,
> > Andy Shevchenko

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)