Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20;
From:   David Laight <David.Laight@ACULAB.COM>
To:     'Christophe Leroy' <christophe.leroy@csgroup.eu>,
        Benjamin Herrenschmidt <benh@kernel.crashing.org>,
        Paul Mackerras <paulus@samba.org>,
        "Michael Ellerman" <mpe@ellerman.id.au>
CC:     "linuxppc-dev@lists.ozlabs.org" <linuxppc-dev@lists.ozlabs.org>,
        "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: RE: [PATCH] powerpc: Use rol32() instead of opencoding in csum_fold()
Thread-Topic: [PATCH] powerpc: Use rol32() instead of opencoding in
 csum_fold()
Thread-Index: AQHYM4tAZCIa7HBaKU201H+8JgoYL6y2y/IA
Date:   Wed, 9 Mar 2022 09:53:04 +0000
Message-ID: <d7840750e3694b30951540b4298da208@AcuMS.aculab.com>
References: <794337eff7bb803d2c4e67d9eee635390c4c48fe.1646812553.git.christophe.leroy@csgroup.eu>
In-Reply-To: <794337eff7bb803d2c4e67d9eee635390c4c48fe.1646812553.git.christophe.leroy@csgroup.eu>
Accept-Language: en-GB, en-US
MIME-Version: 1.0
Content-Language: en-US
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8BIT
Precedence: bulk

From: Christophe Leroy
> Sent: 09 March 2022 07:56
...
> diff --git a/arch/powerpc/include/asm/checksum.h b/arch/powerpc/include/asm/checksum.h
> index 8321f6053a67..4b573a3b7e17 100644
> --- a/arch/powerpc/include/asm/checksum.h
> +++ b/arch/powerpc/include/asm/checksum.h
> @@ -38,14 +38,15 @@ extern __wsum csum_and_copy_to_user(const void *src, void __user *dst,
>   */
>  static inline __sum16 csum_fold(__wsum sum)
>  {
> +	u32 tmp = (__force u32)sum;
> +
> +	/*
> +	 * swap the two 16-bit halves of sum
> +	 * if there is a carry from adding the two 16-bit halves,
> +	 * it will carry from the lower half into the upper half,
> +	 * giving us the correct sum in the upper half.
> +	 */
> +	return (__force __sum16)(~(tmp + rol32(tmp, 16)) >> 16);
>  }
> 
>  static inline u32 from64to32(u64 x)
> --
> 2.34.1

On the face of it that is pretty generic.
Two shifts and an add (plus the invert and mask).

I suspect it generates better code than the current x86 version
which is:

static inline __sum16 csum_fold(__wsum sum)
{
	asm("addl %1, %0		;\n"
	    "adcl $0xffff, %0	;\n"
	    : "=r" (sum)
	    : "r" ((__force u32)sum << 16),
	      "0" ((__force u32)sum & 0xffff0000));
	return (__force __sum16)(~(__force u32)sum >> 16);
}

Which looks like 2 shifts, a mask, add, adc..
(Oh and the adc is two clocks on anything prior to Haswell.)
Quite why it doesn't use 16bit add and adc is anybodies guess.
Would still be shift, add, adc.
So shift, add, shift is no worse.

I wonder if any of the asm versions are actually better?
Some are the same algorithm, some are a lot worse.

Generic is:
static inline __sum16 csum_fold(__wsum csum)
{
	u32 sum = (__force u32)csum;
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (__force __sum16)~sum;
}
Clearly can be improved.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)