Received: by 2002:ac0:e350:0:0:0:0:0 with SMTP id g16csp2056855imn; Mon, 1 Aug 2022 09:31:43 -0700 (PDT) X-Google-Smtp-Source: AA6agR6rOl1iwM91Vwm3RxD9WhGgoD2b5waWKSqS0mdYTl1AnkrngyxevAKDMHl+aYoMsiIwJWl+ X-Received: by 2002:a17:907:1c1c:b0:730:9473:7a0e with SMTP id nc28-20020a1709071c1c00b0073094737a0emr2416418ejc.492.1659371503614; Mon, 01 Aug 2022 09:31:43 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1659371503; cv=none; d=google.com; s=arc-20160816; b=WKABRo5bUB/KZl+AJehcVjIq4YO5sZxoeTcsRxlURDRti2reXaNCJJ33N2wjKw9lTo tIJOosDkdlcr1GtFBs7X1zVhV5cgMTLM9ylRuoRG43+MOxhuWZ093xrK79KqdNfavDXm x62I/6ocfbp+VoXHaBs0erak8djgQdBSAnIxdM3Iv7osWRHMAVk2QIrP0JDJprGIbDKq REjiXxvYReALGK/Jfa3lbE7kHKQAWdpjk/bjkGplgcgKCNX3S6dn92VrqRmBh6O+4KWu Xnc8qhONf6omPmo+wfBJFq+yMAhWPRUIVnkC/S35WvXH2emF/MLAyfSbPaaK1AB86x1O xMNg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:in-reply-to:content-disposition :mime-version:references:message-id:subject:cc:to:from:date :dkim-signature; bh=Bw2G8fr14ujR+mPl0CVqh1ElxFbp1/PqK8jCTY1CQ8M=; b=fK4+TJNTiAIliHhWbgYFySaZflOKD4J8WikfiGIGK8aa2XYIvGvLc4Vc/tPDywGzQW l167ewmY8gabczbkaLNLUmlYn7CZgV5IabS8b3p6t1OJUeGD6dnNEQJhPB89gt+oIUPQ C6qhM8RG+gOcWOeZNESHj33MQeDGpsejyJFDfOZNT/GpD8LPXKFQ3QtITlJl8Wqi3HsJ IDj3UNv6YCRraRlMdVpf/94WxZhoGIhHJQBNC5V77ctf8eJ7UjXit7jaBrUSTnFDBdY5 /be5pTushxIhOJIIiu9H605kq7/epOphYA+JMp0Q+YEwJ/e7UDEtATES9GhoOR5pCQaF N50A== ARC-Authentication-Results: i=1; mx.google.com; dkim=fail (test mode) header.i=@armlinux.org.uk header.s=pandora-2019 header.b="Z/4fGfwH"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=armlinux.org.uk Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id l12-20020a170906794c00b0073065831b01si4632589ejo.901.2022.08.01.09.31.14; Mon, 01 Aug 2022 09:31:43 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=fail (test mode) header.i=@armlinux.org.uk header.s=pandora-2019 header.b="Z/4fGfwH"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=armlinux.org.uk Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231751AbiHAPsd (ORCPT + 99 others); Mon, 1 Aug 2022 11:48:33 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:49876 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231144AbiHAPsc (ORCPT ); Mon, 1 Aug 2022 11:48:32 -0400 Received: from pandora.armlinux.org.uk (pandora.armlinux.org.uk [IPv6:2001:4d48:ad52:32c8:5054:ff:fe00:142]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id BCA4731DE0 for ; Mon, 1 Aug 2022 08:48:29 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=armlinux.org.uk; s=pandora-2019; h=Sender:In-Reply-To:Content-Type: MIME-Version:References:Message-ID:Subject:Cc:To:From:Date:Reply-To: Content-Transfer-Encoding:Content-ID:Content-Description:Resent-Date: Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Id: List-Help:List-Unsubscribe:List-Subscribe:List-Post:List-Owner:List-Archive; bh=Bw2G8fr14ujR+mPl0CVqh1ElxFbp1/PqK8jCTY1CQ8M=; b=Z/4fGfwHlhxXvzVawdJsguPCDe Ub4PLkiJaq/lWttL5s6MmwV6qKSuE8uhFRd8A7nYbpVWbzabcoPo2MVl+oJN84C8lAvoiRL4RWEq7 XWu/E5zLdcxwSm1jLVRZFwPK5QOCVBUm9qvtlaFm1eqKT6PaokCT4L5zXaQy/uGlRClqtfG20rck8 OKLFK5aKs35abkggiHECmKAReayGf9iHZrpSTOyGwy7G3vHTyNKod4kURS6WNgho+D3VNbDyYgQLV nY960hiRsDIvHD6f5biSGu2iIlYrlw/ycsKss34tx+aRls/m5zHxXdZJ0fp/jElHEwMmaJZ4bmcBi 8RueC1fg==; Received: from shell.armlinux.org.uk ([fd8f:7570:feb6:1:5054:ff:fe00:4ec]:33652) by pandora.armlinux.org.uk with esmtpsa (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1oIXeZ-00026j-4M; Mon, 01 Aug 2022 16:48:23 +0100 Received: from linux by shell.armlinux.org.uk with local (Exim 4.94.2) (envelope-from ) id 1oIXeX-0007AU-6E; Mon, 01 Aug 2022 16:48:21 +0100 Date: Mon, 1 Aug 2022 16:48:21 +0100 From: "Russell King (Oracle)" To: Yury Norov Cc: Linus Torvalds , Dennis Zhou , Guenter Roeck , Catalin Marinas , Linux Kernel Mailing List , Geert Uytterhoeven , Alexey Klimov , linux-m68k@lists.linux-m68k.org Subject: Re: Linux 5.19-rc8 Message-ID: References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: Russell King (Oracle) X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,SPF_HELO_NONE,SPF_NONE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Oh FFS. I see you decided off your own back to remove the ARM version of the find_bit functions, with NO agreement from the arch maintainer. This is not on. On Sat, Jul 30, 2022 at 02:38:38PM -0700, Yury Norov wrote: > On Wed, Jul 27, 2022 at 08:43:22AM +0100, Russell King (Oracle) wrote: > > On Tue, Jul 26, 2022 at 06:33:55PM -0700, Yury Norov wrote: > > > On Tue, Jul 26, 2022 at 5:15 PM Russell King (Oracle) > > > wrote: > > > > > > > > On Tue, Jul 26, 2022 at 01:20:23PM -0700, Linus Torvalds wrote: > > > > > On Tue, Jul 26, 2022 at 12:44 PM Russell King (Oracle) > > > > > wrote: > > > > > > > > > > > > Overall, I would say it's pretty similar (some generic perform > > > > > > marginally better, some native perform marginally better) with the > > > > > > exception of find_first_bit() being much better with the generic > > > > > > implementation, but find_next_zero_bit() being noticably worse. > > > > > > > > > > The generic _find_first_bit() code is actually sane and simple. It > > > > > loops over words until it finds a non-zero one, and then does trivial > > > > > calculations on that last word. > > > > > > > > > > That explains why the generic code does so much better than your byte-wise asm. > > > > > > > > > > In contrast, the generic _find_next_bit() I find almost offensively > > > > > silly - which in turn explains why your byte-wide asm does better. > > > > > > > > > > I think the generic _find_next_bit() should actually do what the m68k > > > > > find_next_bit code does: handle the first special word itself, and > > > > > then just call find_first_bit() on the rest of it. > > > > > > > > > > And it should *not* try to handle the dynamic "bswap and/or bit sense > > > > > invert" thing at all. That should be just four different (trivial) > > > > > cases for the first word. > > > > > > > > Here's the results for the native version converted to use word loads: > > > > > > > > [ 37.319937] > > > > Start testing find_bit() with random-filled bitmap > > > > [ 37.330289] find_next_bit: 2222703 ns, 163781 iterations > > > > [ 37.339186] find_next_zero_bit: 2154375 ns, 163900 iterations > > > > [ 37.348118] find_last_bit: 2208104 ns, 163780 iterations > > > > [ 37.372564] find_first_bit: 17722203 ns, 16370 iterations > > > > [ 37.737415] find_first_and_bit: 358135191 ns, 32453 iterations > > > > [ 37.745420] find_next_and_bit: 1280537 ns, 73644 iterations > > > > [ 37.752143] > > > > Start testing find_bit() with sparse bitmap > > > > [ 37.759032] find_next_bit: 41256 ns, 655 iterations > > > > [ 37.769905] find_next_zero_bit: 4148410 ns, 327026 iterations > > > > [ 37.776675] find_last_bit: 48742 ns, 655 iterations > > > > [ 37.790961] find_first_bit: 7562371 ns, 655 iterations > > > > [ 37.797743] find_first_and_bit: 47366 ns, 1 iterations > > > > [ 37.804527] find_next_and_bit: 59924 ns, 1 iterations > > > > > > > > which is generally faster than the generic version, with the exception > > > > of the sparse find_first_bit (generic was: > > > > [ 25.657304] find_first_bit: 7328573 ns, 656 iterations) > > > > > > > > find_next_{,zero_}bit() in the sparse case are quite a bit faster than > > > > the generic code. > > > > > > Look at find_{first,next}_and_bit results. Those two have no arch version > > > and in both cases use generic code. In theory they should be equally fast > > > before and after, but your testing says that generic case is slower even > > > for them, and the difference is comparable with real arch functions numbers. > > > It makes me feel like: > > > - there's something unrelated, like governor/throttling that affect results; > > > - the numbers are identical, taking the dispersion into account. > > > > > > If the difference really concerns you, I'd suggest running the test > > > several times > > > to measure confidence intervals. > > > > Given that the benchmark is run against random bitmaps and with > > interrupts enabled, there is going to be noise in the results. > > > > Here's the second run: > > > > [26234.429389] > > Start testing find_bit() with random-filled bitmap > > [26234.439722] find_next_bit: 2206687 ns, 164277 iterations > > [26234.448664] find_next_zero_bit: 2188368 ns, 163404 iterations > > [26234.457612] find_last_bit: 2223742 ns, 164278 iterations > > [26234.482056] find_first_bit: 17720726 ns, 16384 iterations > > [26234.859374] find_first_and_bit: 370602019 ns, 32877 iterations > > [26234.867379] find_next_and_bit: 1280651 ns, 74091 iterations > > [26234.874107] > > Start testing find_bit() with sparse bitmap > > [26234.881014] find_next_bit: 46142 ns, 656 iterations > > [26234.891900] find_next_zero_bit: 4158987 ns, 327025 iterations > > [26234.898672] find_last_bit: 49727 ns, 656 iterations > > [26234.912504] find_first_bit: 7107862 ns, 656 iterations > > [26234.919290] find_first_and_bit: 52092 ns, 1 iterations > > [26234.926076] find_next_and_bit: 60856 ns, 1 iterations > > > > And a third run: > > > > [26459.679524] > > Start testing find_bit() with random-filled bitmap > > [26459.689871] find_next_bit: 2199418 ns, 163311 iterations > > [26459.698798] find_next_zero_bit: 2181289 ns, 164370 iterations > > [26459.707738] find_last_bit: 2213638 ns, 163311 iterations > > [26459.732224] find_first_bit: 17764152 ns, 16429 iterations > > [26460.133823] find_first_and_bit: 394886375 ns, 32672 iterations > > [26460.141818] find_next_and_bit: 1269693 ns, 73485 iterations > > [26460.148545] > > Start testing find_bit() with sparse bitmap > > [26460.155433] find_next_bit: 40753 ns, 653 iterations > > [26460.166307] find_next_zero_bit: 4148211 ns, 327028 iterations > > [26460.173078] find_last_bit: 50017 ns, 653 iterations > > [26460.187007] find_first_bit: 7205325 ns, 653 iterations > > [26460.193790] find_first_and_bit: 49358 ns, 1 iterations > > [26460.200577] find_next_and_bit: 62332 ns, 1 iterations > > > > My gut feeling is that yes, there is some variance, but not on an > > order that is significant that would allow us to say "there's no > > difference". > > > > find_next_bit results for random are: 2222703, 2206687, 2199418, > > which is an average of 2209603 and a variance of around 0.5%. > > The difference between this and the single generic figure I have > > is on the order of 20%. > > > > I'll do the same with find_first_bit for random: 17722203, 17720726, > > and 17764152. Average is 17735694. Variance is around 0.1% or 0.2%. > > The difference between this and the single generic figure I have is > > on the order of 5%. Not so large, but still quite a big difference > > compared to the variance. > > > > find_first_bit for sparse: 7562371, 7107862, 7205325. Average is > > 7291853. Variance is higher at about 4%. Difference between this and > > the generic figure is 0.5%, so this one is not significantly > > different. > > > > The best result looks to be find_next_zero_bit for the sparse bitmap > > case. The generic code measures 5.5ms, the native code is sitting > > around 4.1ms. That's a difference of around 34%, and by just looking > > at the range in the figures above we can see this is a significant > > result without needing to do the calculations. Similar is true of > > find_next_bit for the sparse bitmap. > > > > So, I think the results are significant in most cases and variance > > doesn't account for the differences. The only one which isn't is > > find_first_bit for the sparse case. > > Hi Russel, > > + Alexey Klimov > > This is my testings for native vs generic find_bit operations on a15 > and 17. > > The raw numbers are collected by Alexey Klimov on Odroid-xu3. All cpu > frequencies were fixed at 1000Mhz. (Thanks a lot!) > > For each native/generic @ a15/a7 configuration, the find_bit_benchmark > was run 5 times, and the results are summarized below: > > A15 Native Generic Difference > Dense ns ns % sigmas > find_next_bit: 3746929 3231641 14 8.3 > find_next_zero_bit: 3935354 3202608 19 10.4 > find_last_bit: 3134713 3129717 0 0.1 > find_first_bit: 85626542 20498669 76 172.4 > find_first_and_bit: 409252997 414820417 -1 -0.2 > find_next_and_bit: 1678706 1654420 1 0.4 > > Sparse > find_next_bit: 143208 77924 46 29.4 > find_next_zero_bit: 6893375 6059177 12 14.3 > find_last_bit: 67174 68616 -2 -1.0 > find_first_bit: 33689256 8151493 76 47.8 > find_first_and_bit: 124758 156974 -26 -1.3 > find_next_and_bit: 53391 56716 -6 -0.2 > > A7 Native Generic Difference > Dense ns ns % sigmas > find_next_bit: 4207627 5532764 -31 -14.9 > find_next_zero_bit: 4259961 5236880 -23 -10.0 > find_last_bit: 4281386 4201025 2 1.5 > find_first_bit: 236913620 50970424 78 163.3 > find_first_and_bit: 728069762 750580781 -3 -0.7 > find_next_and_bit: 2696263 2766077 -3 -0.9 > > Sparse > find_next_bit: 327241 143776 56 40.7 > find_next_zero_bit: 6987249 10235989 -46 -21.8 > find_last_bit: 97758 94725 3 0.6 > find_first_bit: 94628040 21051964 78 41.8 > find_first_and_bit: 248133 241267 3 0.3 > find_next_and_bit: 136475 154000 -13 -0.5 > > The last column is the difference between native and generic code > performance normalized to a standard deviation: > (mean(native) - mean(generic)) / max(std(native), std(generic)) > > The results look consistent to me because 'and' subtests that are always > generic differ by less than one sigma. > > On A15 generic code is a clear winner. On A7 results are inconsistent > although significant. Maybe it's worth to retest on A7. > > Regarding the choice between native and generic core - I would prefer > generic version even if it's slightly slower because it is tested and > maintained better. And because the results of the test are at least on > par, to me it's a no-brainer. > > Would be really interesting to compare performance of your LDRB->LDR > patch with the generic code using the same procedure. > > Thanks, > Yury > -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!