From: David Laight
To: 'Willy Tarreau'
CC: "linux-kernel@vger.kernel.org", "netdev@vger.kernel.org", "x86@kernel.org", Arnd Bergmann, Thomas Gleixner, Ingo Molnar, "dave.hansen@linux.intel.com"
Subject: RE: Optimising csum_fold()
Date: Tue, 22 Nov 2022 16:55:27 +0000
In-Reply-To: <20221122162451.GB15368@1wt.eu>

From: Willy Tarreau
> Sent: 22 November 2022 16:25
>
> On Tue, Nov 22, 2022 at 01:08:23PM +0000, David Laight wrote:
> > There are currently 20 copies of csum_fold(), some in C some in assembler.
> > The default C version (in asm-generic/checksum.h) is pretty horrid.
> > Some of the asm versions (including x86 and x86-64) aren't much better.
> >
> > There are 3 pretty good C versions:
> >   1:  (~sum - rol32(sum, 16)) >> 16
> >   2:  ~(sum + rol32(sum, 16)) >> 16
> >   3:  (u16)~((sum + rol32(sum, 16)) >> 16)
> > All three are (usually) 4 arithmetic instructions.
> >
> > The first two have the advantage that the high bits are zero.
> > Relevant when the value is being checked rather than set.
> >
> > The first one can generate better instruction scheduling (the rotate
> > and invert can be executed in the same clock).
> >
> > The 3rd one saves an instruction on arm, but may need masking.
> > (I've not compiled an arm kernel to see how often that happens.)
> >
> > The only architectures where (I think) the current asm code is better
> > than the C above are sparc and sparc64.
> > Sparc doesn't have a rotate instruction, but does have a carry flag.
> > This makes the current asm version one instruction shorter.
> >
> > For architectures like mips and risc-v which have neither rotate
> > instructions nor carry flags the C is as good as the current asm.
> > The rotate is 3 instructions - the same as the extra cmp+add.
> >
> > Changing everything to use [1] would improve quite a few architectures
> > while only adding 1 clock to some paths in arm/arm64 and sparc.
> >
> > Unfortunately it is all currently a mess.
> > Most architectures don't include asm-generic/checksum.h at all.
> >
> > Thoughts?
>
> Then why not just have one version per arch, the most efficient one,
> and use it everywhere ? The simple fact that we're discussing the
> tradeoffs means that if we don't want to compromise performance here
> (which I assume to be the case), then it needs to be per-arch and
> that's all. At least that's the way I understand it.

At the moment there are a lot of arch-specific ones that are
definitely sub-optimal.

I started doing some patches; my x86-64 kernel is about 4k smaller with [1].
I was going to post the patches to asm-generic and x86.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
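
For reference, the three C variants quoted above can be sketched as a userspace program. This is only an illustration and not the posted patch: it uses plain uint32_t/uint16_t rather than the kernel's checksum types, defines a local rol32() stand-in for the kernel helper, and the names csum_fold_ref(), csum_fold_v1/v2/v3() and csum_fold_variants_agree() are made up for this example. The reference version is the textbook double fold followed by a complement.

```c
#include <stdint.h>

/* Userspace stand-in (assumption) for the kernel's rol32(): rotate left. */
static inline uint32_t rol32(uint32_t word, unsigned int shift)
{
	return (word << (shift & 31)) | (word >> ((-shift) & 31));
}

/* Reference: fold the high half into the low half twice, then complement. */
static inline uint16_t csum_fold_ref(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/*
 * Variant [1]. Adding the rotated word puts high + low, plus the carry
 * out of the low half, into the upper 16 bits; ~sum - x equals ~(sum + x)
 * modulo 2^32, so the complement comes for free. High bits are zero.
 */
static inline uint32_t csum_fold_v1(uint32_t sum)
{
	return (~sum - rol32(sum, 16)) >> 16;
}

/* Variant [2]. Same value as [1]; the invert happens after the add. */
static inline uint32_t csum_fold_v2(uint32_t sum)
{
	return ~(sum + rol32(sum, 16)) >> 16;
}

/* Variant [3]. Shift first, then invert; the cast masks the high bits. */
static inline uint16_t csum_fold_v3(uint32_t sum)
{
	return (uint16_t)~((sum + rol32(sum, 16)) >> 16);
}

/* Returns non-zero when all three variants match the reference for sum. */
static inline int csum_fold_variants_agree(uint32_t sum)
{
	uint16_t ref = csum_fold_ref(sum);

	return csum_fold_v1(sum) == ref &&
	       csum_fold_v2(sum) == ref &&
	       csum_fold_v3(sum) == ref;
}
```

Calling csum_fold_variants_agree() over a range of inputs is a cheap way to convince yourself the variants are interchangeable before looking at the generated code for a given architecture.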