LinuxLists.cc - [PATCH] parisc: Fix csum_ipv6

2024-02-13 23:46:43

Subject: [PATCH] parisc: Fix csum_ipv6_magic on 64-bit systems

hppa 64-bit systems calculates the IPv6 checksum using 64-bit add
operations. The last add folds protocol and length fields into the 64-bit
result. While unlikely, this operation can overflow. The overflow can be
triggered with a code sequence such as the following.

/* try to trigger massive overflows */
memset(tmp_buf, 0xff, sizeof(struct in6_addr));
csum_result = csum_ipv6_magic((struct in6_addr *)tmp_buf,
(struct in6_addr *)tmp_buf,
0xffff, 0xff, 0xffffffff);

Fix the problem by adding any overflows from the final add operation into
the calculated checksum. Fortunately, we can do this without additional
cost by replacing the add operation used to fold the checksum into 32 bit
with "add,dc" to add in the missing carry.

Cc: Charlie Jenkins <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Guenter Roeck <[email protected]>
---
This patch does not completely fix the problems with csum_ipv6_magic seen
when running 64-bit parisc images with the C3700 emulation in qemu. That
is due to unaligned 64-bit load operations which (presumably as part of
unaligned trap handling) generate bad carry flags. It is unknown if that
is a problem with the qemu emulation or with the Linux kernel, so it is not
addressed here.

arch/parisc/include/asm/checksum.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/parisc/include/asm/checksum.h b/arch/parisc/include/asm/checksum.h
index e619e67440db..c949aa20fa16 100644
--- a/arch/parisc/include/asm/checksum.h
+++ b/arch/parisc/include/asm/checksum.h
@@ -137,8 +137,8 @@ static __inline__ __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
" add,dc %3, %0, %0\n" /* fold in proto+len | carry bit */
" extrd,u %0, 31, 32, %4\n"/* copy upper half down */
" depdi 0, 31, 32, %0\n"/* clear upper half */
-" add %4, %0, %0\n" /* fold into 32-bits */
-" addc 0, %0, %0\n" /* add carry */
+" add,dc %4, %0, %0\n" /* fold into 32-bits, plus carry */
+" addc 0, %0, %0\n" /* add final carry */

#else

--
2.39.2

2024-02-14 20:22:48

by Charlie Jenkins

[permalink] [raw]

Subject: Re: [PATCH] parisc: Fix csum_ipv6_magic on 64-bit systems

On Tue, Feb 13, 2024 at 03:46:31PM -0800, Guenter Roeck wrote:
> hppa 64-bit systems calculates the IPv6 checksum using 64-bit add
> operations. The last add folds protocol and length fields into the 64-bit
> result. While unlikely, this operation can overflow. The overflow can be
> triggered with a code sequence such as the following.
>
> /* try to trigger massive overflows */
> memset(tmp_buf, 0xff, sizeof(struct in6_addr));
> csum_result = csum_ipv6_magic((struct in6_addr *)tmp_buf,
> (struct in6_addr *)tmp_buf,
> 0xffff, 0xff, 0xffffffff);
>
> Fix the problem by adding any overflows from the final add operation into
> the calculated checksum. Fortunately, we can do this without additional
> cost by replacing the add operation used to fold the checksum into 32 bit
> with "add,dc" to add in the missing carry.
>
> Cc: Charlie Jenkins <[email protected]>
> Cc: Palmer Dabbelt <[email protected]>
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Signed-off-by: Guenter Roeck <[email protected]>
> ---
> This patch does not completely fix the problems with csum_ipv6_magic seen
> when running 64-bit parisc images with the C3700 emulation in qemu. That
> is due to unaligned 64-bit load operations which (presumably as part of
> unaligned trap handling) generate bad carry flags. It is unknown if that
> is a problem with the qemu emulation or with the Linux kernel, so it is not
> addressed here.
>
> arch/parisc/include/asm/checksum.h | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/parisc/include/asm/checksum.h b/arch/parisc/include/asm/checksum.h
> index e619e67440db..c949aa20fa16 100644
> --- a/arch/parisc/include/asm/checksum.h
> +++ b/arch/parisc/include/asm/checksum.h
> @@ -137,8 +137,8 @@ static __inline__ __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
> " add,dc %3, %0, %0\n" /* fold in proto+len | carry bit */
> " extrd,u %0, 31, 32, %4\n"/* copy upper half down */
> " depdi 0, 31, 32, %0\n"/* clear upper half */
> -" add %4, %0, %0\n" /* fold into 32-bits */
> -" addc 0, %0, %0\n" /* add carry */
> +" add,dc %4, %0, %0\n" /* fold into 32-bits, plus carry */
> +" addc 0, %0, %0\n" /* add final carry */
>
> #else
>
> --
> 2.39.2
>

Reviewed-by: Charlie Jenkins <[email protected]>

2024-02-16 12:39:06

by Helge Deller

[permalink] [raw]

Subject: Re: [PATCH] parisc: Fix csum_ipv6_magic on 64-bit systems

* Guenter Roeck <[email protected]>:
> hppa 64-bit systems calculates the IPv6 checksum using 64-bit add
> operations. The last add folds protocol and length fields into the 64-bit
> result. While unlikely, this operation can overflow. The overflow can be
> triggered with a code sequence such as the following.
>
> /* try to trigger massive overflows */
> memset(tmp_buf, 0xff, sizeof(struct in6_addr));
> csum_result = csum_ipv6_magic((struct in6_addr *)tmp_buf,
> (struct in6_addr *)tmp_buf,
> 0xffff, 0xff, 0xffffffff);
>
> Fix the problem by adding any overflows from the final add operation into
> the calculated checksum. Fortunately, we can do this without additional
> cost by replacing the add operation used to fold the checksum into 32 bit
> with "add,dc" to add in the missing carry.

Gunter,

Thanks for your patch for 32- and 64-bit systems.
But I think it's time to sunset the parisc inline assembly for ipv4/ipv6
checksumming.
The patch below converts the code to use standard C-coding (taken from the
s390 header file) and it survives the v8 lib/checksum patch.

Opinions?

Helge

Subject: [PATCH] parisc: Fix csum_ipv6_magic on 32- and 64-bit machines

Guenter noticed that the 32- and 64-bit checksum functions may calculate
wrong values under certain circumstances. He fixed it by usining
corrected carry-flags added, but overall I think it's better to convert
away from inline assembly to generic C-coding. That way the code is
much cleaner and the compiler can do much better optimizations based on
the target architecture (32- vs. 64-bit). Furthermore, the compiler
generates nowadays much better code, so inline assembly usually won't
give any benefit any longer.

Signed-off-by: Helge Deller <[email protected]>
Noticed-by: Guenter Roeck <[email protected]>

diff --git a/arch/parisc/include/asm/checksum.h b/arch/parisc/include/asm/checksum.h
index 3c43baca7b39..c72f14536353 100644
--- a/arch/parisc/include/asm/checksum.h
+++ b/arch/parisc/include/asm/checksum.h
@@ -18,160 +18,93 @@
*/
extern __wsum csum_partial(const void *, int, __wsum);

+
/*
- * Optimized for IP headers, which always checksum on 4 octet boundaries.
- *
- * Written by Randolph Chung <[email protected]>, and then mucked with by
- * LaMont Jones <[email protected]>
+ * Fold a partial checksum without adding pseudo headers.
*/
-static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
+static inline __sum16 csum_fold(__wsum sum)
{
- unsigned int sum;
- unsigned long t0, t1, t2;
+ u32 csum = (__force u32) sum;

- __asm__ __volatile__ (
-" ldws,ma 4(%1), %0\n"
-" addib,<= -4, %2, 2f\n"
-"\n"
-" ldws 4(%1), %4\n"
-" ldws 8(%1), %5\n"
-" add %0, %4, %0\n"
-" ldws,ma 12(%1), %3\n"
-" addc %0, %5, %0\n"
-" addc %0, %3, %0\n"
-"1: ldws,ma 4(%1), %3\n"
-" addib,< 0, %2, 1b\n"
-" addc %0, %3, %0\n"
-"\n"
-" extru %0, 31, 16, %4\n"
-" extru %0, 15, 16, %5\n"
-" addc %4, %5, %0\n"
-" extru %0, 15, 16, %5\n"
-" add %0, %5, %0\n"
-" subi -1, %0, %0\n"
-"2:\n"
- : "=r" (sum), "=r" (iph), "=r" (ihl), "=r" (t0), "=r" (t1), "=r" (t2)
- : "1" (iph), "2" (ihl)
- : "memory");
-
- return (__force __sum16)sum;
+ csum += (csum >> 16) | (csum << 16);
+ csum >>= 16;
+ return (__force __sum16) ~csum;
}

/*
- * Fold a partial checksum
+ * This is a version of ip_compute_csum() optimized for IP headers,
+ * which always checksums on 4 octet boundaries.
*/
-static inline __sum16 csum_fold(__wsum csum)
+static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
{
- u32 sum = (__force u32)csum;
- /* add the swapped two 16-bit halves of sum,
- a possible carry from adding the two 16-bit halves,
- will carry from the lower half into the upper half,
- giving us the correct sum in the upper half. */
- sum += (sum << 16) + (sum >> 16);
- return (__force __sum16)(~sum >> 16);
+ __u64 csum = 0;
+ __u32 *ptr = (u32 *)iph;
+
+ csum += *ptr++;
+ csum += *ptr++;
+ csum += *ptr++;
+ csum += *ptr++;
+ ihl -= 4;
+ while (ihl--)
+ csum += *ptr++;
+ csum += (csum >> 32) | (csum << 32);
+ return csum_fold((__force __wsum)(csum >> 32));
}
-
-static inline __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr,
- __u32 len, __u8 proto,
- __wsum sum)
+
+/*
+ * Computes the checksum of the TCP/UDP pseudo-header.
+ * Returns a 32-bit checksum.
+ */
+static inline __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr, __u32 len,
+ __u8 proto, __wsum sum)
{
- __asm__(
- " add %1, %0, %0\n"
- " addc %2, %0, %0\n"
- " addc %3, %0, %0\n"
- " addc %%r0, %0, %0\n"
- : "=r" (sum)
- : "r" (daddr), "r"(saddr), "r"(proto+len), "0"(sum));
- return sum;
+ __u64 csum = (__force __u64)sum;
+
+ csum += (__force __u32)saddr;
+ csum += (__force __u32)daddr;
+ csum += len;
+ csum += proto;
+ csum += (csum >> 32) | (csum << 32);
+ return (__force __wsum)(csum >> 32);
}

/*
- * computes the checksum of the TCP/UDP pseudo-header
- * returns a 16-bit checksum, already complemented
+ * Computes the checksum of the TCP/UDP pseudo-header.
+ * Returns a 16-bit checksum, already complemented.
*/
-static inline __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr,
- __u32 len, __u8 proto,
- __wsum sum)
+static inline __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr, __u32 len,
+ __u8 proto, __wsum sum)
{
- return csum_fold(csum_tcpudp_nofold(saddr,daddr,len,proto,sum));
+ return csum_fold(csum_tcpudp_nofold(saddr, daddr, len, proto, sum));
}

/*
- * this routine is used for miscellaneous IP-like checksums, mainly
- * in icmp.c
+ * Used for miscellaneous IP-like checksums, mainly icmp.
*/
-static inline __sum16 ip_compute_csum(const void *buf, int len)
+static inline __sum16 ip_compute_csum(const void *buff, int len)
{
- return csum_fold (csum_partial(buf, len, 0));
+ return csum_fold(csum_partial(buff, len, 0));
}

-
#define _HAVE_ARCH_IPV6_CSUM
-static __inline__ __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
- const struct in6_addr *daddr,
- __u32 len, __u8 proto,
- __wsum sum)
+static inline __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
+ const struct in6_addr *daddr,
+ __u32 len, __u8 proto, __wsum csum)
{
- unsigned long t0, t1, t2, t3;
-
- len += proto; /* add 16-bit proto + len */
-
- __asm__ __volatile__ (
-
-#if BITS_PER_LONG > 32
-
- /*
- ** We can execute two loads and two adds per cycle on PA 8000.
- ** But add insn's get serialized waiting for the carry bit.
- ** Try to keep 4 registers with "live" values ahead of the ALU.
- */
-
-" ldd,ma 8(%1), %4\n" /* get 1st saddr word */
-" ldd,ma 8(%2), %5\n" /* get 1st daddr word */
-" add %4, %0, %0\n"
-" ldd,ma 8(%1), %6\n" /* 2nd saddr */
-" ldd,ma 8(%2), %7\n" /* 2nd daddr */
-" add,dc %5, %0, %0\n"
-" add,dc %6, %0, %0\n"
-" add,dc %7, %0, %0\n"
-" add,dc %3, %0, %0\n" /* fold in proto+len | carry bit */
-" extrd,u %0, 31, 32, %4\n"/* copy upper half down */
-" depdi 0, 31, 32, %0\n"/* clear upper half */
-" add %4, %0, %0\n" /* fold into 32-bits */
-" addc 0, %0, %0\n" /* add carry */
-
-#else
-
- /*
- ** For PA 1.x, the insn order doesn't matter as much.
- ** Insn stream is serialized on the carry bit here too.
- ** result from the previous operation (eg r0 + x)
- */
-" ldw,ma 4(%1), %4\n" /* get 1st saddr word */
-" ldw,ma 4(%2), %5\n" /* get 1st daddr word */
-" add %4, %0, %0\n"
-" ldw,ma 4(%1), %6\n" /* 2nd saddr */
-" addc %5, %0, %0\n"
-" ldw,ma 4(%2), %7\n" /* 2nd daddr */
-" addc %6, %0, %0\n"
-" ldw,ma 4(%1), %4\n" /* 3rd saddr */
-" addc %7, %0, %0\n"
-" ldw,ma 4(%2), %5\n" /* 3rd daddr */
-" addc %4, %0, %0\n"
-" ldw,ma 4(%1), %6\n" /* 4th saddr */
-" addc %5, %0, %0\n"
-" ldw,ma 4(%2), %7\n" /* 4th daddr */
-" addc %6, %0, %0\n"
-" addc %7, %0, %0\n"
-" addc %3, %0, %0\n" /* fold in proto+len, catch carry */
-
-#endif
- : "=r" (sum), "=r" (saddr), "=r" (daddr), "=r" (len),
- "=r" (t0), "=r" (t1), "=r" (t2), "=r" (t3)
- : "0" (sum), "1" (saddr), "2" (daddr), "3" (len)
- : "memory");
- return csum_fold(sum);
+ __u64 sum = (__force __u64)csum;
+
+ sum += (__force __u32)saddr->s6_addr32[0];
+ sum += (__force __u32)saddr->s6_addr32[1];
+ sum += (__force __u32)saddr->s6_addr32[2];
+ sum += (__force __u32)saddr->s6_addr32[3];
+ sum += (__force __u32)daddr->s6_addr32[0];
+ sum += (__force __u32)daddr->s6_addr32[1];
+ sum += (__force __u32)daddr->s6_addr32[2];
+ sum += (__force __u32)daddr->s6_addr32[3];
+ sum += len;
+ sum += proto;
+ sum += (sum >> 32) | (sum << 32);
+ return csum_fold((__force __wsum)(sum >> 32));
}

#endif
-

2024-02-16 15:39:17

by Guenter Roeck

[permalink] [raw]

Subject: Re: [PATCH] parisc: Fix csum_ipv6_magic on 64-bit systems

Hi Helge,

On 2/16/24 04:38, Helge Deller wrote:
> * Guenter Roeck <[email protected]>:
>> hppa 64-bit systems calculates the IPv6 checksum using 64-bit add
>> operations. The last add folds protocol and length fields into the 64-bit
>> result. While unlikely, this operation can overflow. The overflow can be
>> triggered with a code sequence such as the following.
>>
>> /* try to trigger massive overflows */
>> memset(tmp_buf, 0xff, sizeof(struct in6_addr));
>> csum_result = csum_ipv6_magic((struct in6_addr *)tmp_buf,
>> (struct in6_addr *)tmp_buf,
>> 0xffff, 0xff, 0xffffffff);
>>
>> Fix the problem by adding any overflows from the final add operation into
>> the calculated checksum. Fortunately, we can do this without additional
>> cost by replacing the add operation used to fold the checksum into 32 bit
>> with "add,dc" to add in the missing carry.
>
>
> Gunter,
>
> Thanks for your patch for 32- and 64-bit systems.
> But I think it's time to sunset the parisc inline assembly for ipv4/ipv6
> checksumming.
> The patch below converts the code to use standard C-coding (taken from the
> s390 header file) and it survives the v8 lib/checksum patch.
>
> Opinions?
>

Works for me.

> Helge
>
> Subject: [PATCH] parisc: Fix csum_ipv6_magic on 32- and 64-bit machines
>
> Guenter noticed that the 32- and 64-bit checksum functions may calculate
> wrong values under certain circumstances. He fixed it by usining

using

> corrected carry-flags added, but overall I think it's better to convert
> away from inline assembly to generic C-coding. That way the code is
> much cleaner and the compiler can do much better optimizations based on
> the target architecture (32- vs. 64-bit). Furthermore, the compiler
> generates nowadays much better code, so inline assembly usually won't
> give any benefit any longer.
>
> Signed-off-by: Helge Deller <[email protected]>
> Noticed-by: Guenter Roeck <[email protected]>

That would be a new tag. Maybe use "Reported-by:", or just drop it.

Maybe also consider adding

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")

Other than that,

Tested-by: Guenter Roeck <[email protected]>

Thanks,
Guenter

2024-02-17 03:00:30

by Charlie Jenkins

[permalink] [raw]

Subject: Re: [PATCH] parisc: Fix csum_ipv6_magic on 64-bit systems

On Fri, Feb 16, 2024 at 01:38:51PM +0100, Helge Deller wrote:
> * Guenter Roeck <[email protected]>:
> > hppa 64-bit systems calculates the IPv6 checksum using 64-bit add
> > operations. The last add folds protocol and length fields into the 64-bit
> > result. While unlikely, this operation can overflow. The overflow can be
> > triggered with a code sequence such as the following.
> >
> > /* try to trigger massive overflows */
> > memset(tmp_buf, 0xff, sizeof(struct in6_addr));
> > csum_result = csum_ipv6_magic((struct in6_addr *)tmp_buf,
> > (struct in6_addr *)tmp_buf,
> > 0xffff, 0xff, 0xffffffff);
> >
> > Fix the problem by adding any overflows from the final add operation into
> > the calculated checksum. Fortunately, we can do this without additional
> > cost by replacing the add operation used to fold the checksum into 32 bit
> > with "add,dc" to add in the missing carry.
>
>
> Gunter,
>
> Thanks for your patch for 32- and 64-bit systems.
> But I think it's time to sunset the parisc inline assembly for ipv4/ipv6
> checksumming.
> The patch below converts the code to use standard C-coding (taken from the
> s390 header file) and it survives the v8 lib/checksum patch.
>
> Opinions?
>
> Helge
>
> Subject: [PATCH] parisc: Fix csum_ipv6_magic on 32- and 64-bit machines
>
> Guenter noticed that the 32- and 64-bit checksum functions may calculate
> wrong values under certain circumstances. He fixed it by usining
> corrected carry-flags added, but overall I think it's better to convert
> away from inline assembly to generic C-coding. That way the code is
> much cleaner and the compiler can do much better optimizations based on
> the target architecture (32- vs. 64-bit). Furthermore, the compiler
> generates nowadays much better code, so inline assembly usually won't
> give any benefit any longer.
>
> Signed-off-by: Helge Deller <[email protected]>
> Noticed-by: Guenter Roeck <[email protected]>
>
> diff --git a/arch/parisc/include/asm/checksum.h b/arch/parisc/include/asm/checksum.h
> index 3c43baca7b39..c72f14536353 100644
> --- a/arch/parisc/include/asm/checksum.h
> +++ b/arch/parisc/include/asm/checksum.h
> @@ -18,160 +18,93 @@
> */
> extern __wsum csum_partial(const void *, int, __wsum);
>
> +
> /*
> - * Optimized for IP headers, which always checksum on 4 octet boundaries.
> - *
> - * Written by Randolph Chung <[email protected]>, and then mucked with by
> - * LaMont Jones <[email protected]>
> + * Fold a partial checksum without adding pseudo headers.
> */
> -static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
> +static inline __sum16 csum_fold(__wsum sum)
> {
> - unsigned int sum;
> - unsigned long t0, t1, t2;
> + u32 csum = (__force u32) sum;
>
> - __asm__ __volatile__ (
> -" ldws,ma 4(%1), %0\n"
> -" addib,<= -4, %2, 2f\n"
> -"\n"
> -" ldws 4(%1), %4\n"
> -" ldws 8(%1), %5\n"
> -" add %0, %4, %0\n"
> -" ldws,ma 12(%1), %3\n"
> -" addc %0, %5, %0\n"
> -" addc %0, %3, %0\n"
> -"1: ldws,ma 4(%1), %3\n"
> -" addib,< 0, %2, 1b\n"
> -" addc %0, %3, %0\n"
> -"\n"
> -" extru %0, 31, 16, %4\n"
> -" extru %0, 15, 16, %5\n"
> -" addc %4, %5, %0\n"
> -" extru %0, 15, 16, %5\n"
> -" add %0, %5, %0\n"
> -" subi -1, %0, %0\n"
> -"2:\n"
> - : "=r" (sum), "=r" (iph), "=r" (ihl), "=r" (t0), "=r" (t1), "=r" (t2)
> - : "1" (iph), "2" (ihl)
> - : "memory");
> -
> - return (__force __sum16)sum;
> + csum += (csum >> 16) | (csum << 16);
> + csum >>= 16;
> + return (__force __sum16) ~csum;
> }

For all of my analysis I am using gcc 12.3.0.

We can do better than this! By inspection this looks like a performance
regression. The generic version of csum_fold in
include/asm-generic/checksum.h is better than this so should be used
instead.

I am not familiar with hppa assembly but this is the assembly for the
original csum_fold here:

0000000000000000 <csum_fold>:
0: 08 03 02 41 copy r3,r1
4: 08 1e 02 43 copy sp,r3
8: 73 c1 00 88 std,ma r1,40(sp)
c: d3 5a 09 fc shrpw r26,r26,16,ret0
10: 0b 9a 0a 1c add,l r26,ret0,ret0
14: 0b 80 09 9c uaddcm r0,ret0,ret0
18: db 9c 09 f0 extrd,u,* ret0,47,16,ret0
1c: 34 7e 00 80 ldo 40(r3),sp
20: e8 40 d0 00 bve (rp)
24: 53 c3 3f 8d ldd,mb -40(sp),r3

This is the assembly for the generic version:
0000000000000000 <csum_fold_generic>:
0: 08 03 02 41 copy r3,r1
4: 08 1e 02 43 copy sp,r3
8: 73 c1 00 88 std,ma r1,40(sp)
c: 0b 40 09 9c uaddcm r0,r26,ret0
10: d3 5a 09 fa shrpw r26,r26,16,r26
14: 0b 5c 04 1c sub ret0,r26,ret0
18: db 9c 09 f0 extrd,u,* ret0,47,16,ret0
1c: 34 7e 00 80 ldo 40(r3),sp
20: e8 40 d0 00 bve (rp)
24: 53 c3 3f 8d ldd,mb -40(sp),r3

This is the assembly for yours:
0000000000000028 <csum_fold_new>:
28: 08 03 02 41 copy r3,r1
2c: 08 1e 02 43 copy sp,r3
30: 73 c1 00 88 std,ma r1,40(sp)
34: d3 5a 09 fc shrpw r26,r26,16,ret0
38: 0b 9a 0a 1c add,l r26,ret0,ret0
3c: d3 9c 19 f0 extrw,u ret0,15,16,ret0
40: 0b 80 09 9c uaddcm r0,ret0,ret0
44: db 9c 0b f0 extrd,u,* ret0,63,16,ret0
48: 34 7e 00 80 ldo 40(r3),sp
4c: e8 40 d0 00 bve (rp)
50: 53 c3 3f 8d ldd,mb -40(sp),r3
54: 00 00 00 00 break 0,0

The assembly is almost the same for the generic and the original code,
except for rearranging the shift and add operation which allows the
addition to become a subtraction and shortens the dependency chain on
some architectures (looks like it doesn't change much here). However,
this new code introduces an additional extrw instruction.

>
> /*
> - * Fold a partial checksum
> + * This is a version of ip_compute_csum() optimized for IP headers,
> + * which always checksums on 4 octet boundaries.
> */
> -static inline __sum16 csum_fold(__wsum csum)
> +static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
> {
> - u32 sum = (__force u32)csum;
> - /* add the swapped two 16-bit halves of sum,
> - a possible carry from adding the two 16-bit halves,
> - will carry from the lower half into the upper half,
> - giving us the correct sum in the upper half. */
> - sum += (sum << 16) + (sum >> 16);
> - return (__force __sum16)(~sum >> 16);
> + __u64 csum = 0;
> + __u32 *ptr = (u32 *)iph;
> +
> + csum += *ptr++;
> + csum += *ptr++;
> + csum += *ptr++;
> + csum += *ptr++;
> + ihl -= 4;
> + while (ihl--)
> + csum += *ptr++;
> + csum += (csum >> 32) | (csum << 32);
> + return csum_fold((__force __wsum)(csum >> 32));
> }

This doesn't leverage add with carry well. This causes the code size of this
to be dramatically larger than the original assembly, which I assume
nicely correlates to an increased execution time.

This is the original assembly in 64-bit (almost the same in 32-bit):
0000000000000028 <ip_fast_csum>:
28: 08 03 02 41 copy r3,r1
2c: 08 1e 02 43 copy sp,r3
30: 73 c1 00 88 std,ma r1,40(sp)
34: 0f 48 10 bc ldw,ma 4(r26),ret0
38: a7 39 60 70 addib,<= -4,r25,78 <ip_fast_csum+0x50>
3c: 0f 48 10 93 ldw 4(r26),r19
40: 0f 50 10 94 ldw 8(r26),r20
44: 0a 7c 06 1c add ret0,r19,ret0
48: 0f 58 10 bf ldw,ma c(r26),r31
4c: 0a 9c 07 1c add,c ret0,r20,ret0
50: 0b fc 07 1c add,c ret0,r31,ret0
54: 0f 48 10 bf ldw,ma 4(r26),r31
58: a7 20 5f ed addib,< 0,r25,54 <ip_fast_csum+0x2c>
5c: 0b fc 07 1c add,c ret0,r31,ret0
60: d3 93 1b f0 extrw,u ret0,31,16,r19
64: d3 94 19 f0 extrw,u ret0,15,16,r20
68: 0a 93 07 1c add,c r19,r20,ret0
6c: d3 94 19 f0 extrw,u ret0,15,16,r20
70: 0a 9c 06 1c add ret0,r20,ret0
74: 97 9c 07 ff subi -1,ret0,ret0
78: db 9c 0b f0 extrd,u,* ret0,63,16,ret0
7c: 34 7e 00 80 ldo 40(r3),sp
80: e8 40 d0 00 bve (rp)
84: 53 c3 3f 8d ldd,mb -40(sp),r3

This is the 64-bit assembly of your proposal:

0000000000000058 <ip_fast_csum_new>:
58: 08 03 02 41 copy r3,r1
5c: 0f c2 12 c1 std rp,-10(sp)
60: 08 1e 02 43 copy sp,r3
64: 73 c1 01 08 std,ma r1,80(sp)
68: 37 39 3f f9 ldo -4(r25),r25
6c: 0c 64 12 d0 std r4,8(r3)
70: 0f 48 10 9f ldw 4(r26),r31
74: db 39 0b e0 extrd,u,* r25,63,32,r25
78: 37 5a 00 20 ldo 10(r26),r26
7c: 0f 41 10 9c ldw -10(r26),ret0
80: 0b fc 0a 1c add,l ret0,r31,ret0
84: 0f 51 10 9f ldw -8(r26),r31
88: 0b 9f 0a 1f add,l r31,ret0,r31
8c: 0f 59 10 9c ldw -4(r26),ret0
90: 0b fc 0a 1c add,l ret0,r31,ret0
94: 37 3f 3f ff ldo -1(r25),r31
98: 8f ff 20 68 cmpib,<> -1,r31,d4 <ip_fast_csum_new+0x7c>
9c: db f9 0b e0 extrd,u,* r31,63,32,r25
a0: d3 9c 07 fa shrpd,* ret0,ret0,32,r26
a4: 37 dd 3f a1 ldo -30(sp),ret1
a8: 0b 9a 0a 1a add,l r26,ret0,r26
ac: 00 00 14 a1 mfia r1
b0: 28 20 00 00 addil L%0,r1,r1
b4: 34 21 00 00 ldo 0(r1),r1
b8: e8 20 f0 00 bve,l (r1),rp
bc: db 5a 03 e0 extrd,u,* r26,31,32,r26
c0: 0c 70 10 c4 ldd 8(r3),r4
c4: 0c 61 10 c2 ldd -10(r3),rp
c8: 34 7e 00 80 ldo 40(r3),sp
cc: e8 40 d0 00 bve (rp)
d0: 53 c3 3f 8d ldd,mb -40(sp),r3
d4: 0f 40 10 9f ldw 0(r26),r31
d8: 37 5a 00 08 ldo 4(r26),r26
dc: e8 1f 1f 65 b,l 94 <ip_fast_csum_new+0x3c>,r0
e0: 0b fc 0a 1c add,l ret0,r31,ret0
e4: 00 00 00 00 break 0,0

The 32-bit assembly is even longer.

Maybe you can get some inspiration from my riscv implementation
arch/riscv/include/asm/checksum.h. You can swap out the line:
csum += csum < ((const unsigned int *)iph)[pos];
With the addc macro defined in arch/parisc/lib/checksum.c.

I will guess that this will improve the code but I haven't checked.

> -
> -static inline __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr,
> - __u32 len, __u8 proto,
> - __wsum sum)
> +
> +/*
> + * Computes the checksum of the TCP/UDP pseudo-header.
> + * Returns a 32-bit checksum.
> + */
> +static inline __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr, __u32 len,
> + __u8 proto, __wsum sum)
> {
> - __asm__(
> - " add %1, %0, %0\n"
> - " addc %2, %0, %0\n"
> - " addc %3, %0, %0\n"
> - " addc %%r0, %0, %0\n"
> - : "=r" (sum)
> - : "r" (daddr), "r"(saddr), "r"(proto+len), "0"(sum));
> - return sum;
> + __u64 csum = (__force __u64)sum;
> +
> + csum += (__force __u32)saddr;
> + csum += (__force __u32)daddr;
> + csum += len;
> + csum += proto;
> + csum += (csum >> 32) | (csum << 32);
> + return (__force __wsum)(csum >> 32);
> }

The assembly from the original:

0000000000000088 <csum_tcpudp_nofold>:
88: 08 03 02 41 copy r3,r1
8c: 08 1e 02 43 copy sp,r3
90: 73 c1 00 88 std,ma r1,40(sp)
94: da f7 0b f8 extrd,u,* r23,63,8,r23
98: 0a f8 0a 17 add,l r24,r23,r23
9c: 0a d9 06 16 add r25,r22,r22
a0: 0a da 07 16 add,c r26,r22,r22
a4: 0a d7 07 16 add,c r23,r22,r22
a8: 0a c0 07 16 add,c r0,r22,r22
ac: da dc 0b e0 extrd,u,* r22,63,32,ret0
b0: 34 7e 00 80 ldo 40(r3),sp
b4: e8 40 d0 00 bve (rp)
b8: 53 c3 3f 8d ldd,mb -40(sp),r3
bc: 00 00 00 00 break 0,0

The 64-bit assembly from your proposal:

0000000000000140 <csum_tcpudp_nofold_new>:
140: 08 03 02 41 copy r3,r1
144: 08 1e 02 43 copy sp,r3
148: 73 c1 00 88 std,ma r1,40(sp)
14c: db 5a 0b e0 extrd,u,* r26,63,32,r26
150: db 39 0b e0 extrd,u,* r25,63,32,r25
154: da f7 0b f8 extrd,u,* r23,63,8,r23
158: db 18 0b e0 extrd,u,* r24,63,32,r24
15c: da d6 0b e0 extrd,u,* r22,63,32,r22
160: 0a f8 0a 18 add,l r24,r23,r24
164: 0b 38 0a 18 add,l r24,r25,r24
168: 0b 58 0a 18 add,l r24,r26,r24
16c: 0a d8 0a 16 add,l r24,r22,r22
170: d2 d6 07 fc shrpd,* r22,r22,32,ret0
174: 0a dc 0a 1c add,l ret0,r22,ret0
178: db 9c 03 e0 extrd,u,* ret0,31,32,ret0
17c: 34 7e 00 80 ldo 40(r3),sp
180: e8 40 d0 00 bve (rp)
184: 53 c3 3f 8d ldd,mb -40(sp),r3

There are a lot of extra extrd instructions, and again is really bad on
32-bit parisc.

Maybe there is a good way to represent this in C, but the assembly is
so simple and clean for this function already.

>
> /*
> - * computes the checksum of the TCP/UDP pseudo-header
> - * returns a 16-bit checksum, already complemented
> + * Computes the checksum of the TCP/UDP pseudo-header.
> + * Returns a 16-bit checksum, already complemented.
> */
> -static inline __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr,
> - __u32 len, __u8 proto,
> - __wsum sum)
> +static inline __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr, __u32 len,
> + __u8 proto, __wsum sum)
> {
> - return csum_fold(csum_tcpudp_nofold(saddr,daddr,len,proto,sum));
> + return csum_fold(csum_tcpudp_nofold(saddr, daddr, len, proto, sum));
> }
>
The implementation of csum_tcpudp_magic is the same as the generic, so
the generic version should be used instead.

> /*
> - * this routine is used for miscellaneous IP-like checksums, mainly
> - * in icmp.c
> + * Used for miscellaneous IP-like checksums, mainly icmp.
> */
> -static inline __sum16 ip_compute_csum(const void *buf, int len)
> +static inline __sum16 ip_compute_csum(const void *buff, int len)
> {
> - return csum_fold (csum_partial(buf, len, 0));
> + return csum_fold(csum_partial(buff, len, 0));
> }

The generic version is better than this so it should be used instead.

>
> -
> #define _HAVE_ARCH_IPV6_CSUM
> -static __inline__ __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
> - const struct in6_addr *daddr,
> - __u32 len, __u8 proto,
> - __wsum sum)
> +static inline __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
> + const struct in6_addr *daddr,
> + __u32 len, __u8 proto, __wsum csum)
> {
> - unsigned long t0, t1, t2, t3;
> -
> - len += proto; /* add 16-bit proto + len */
> -
> - __asm__ __volatile__ (
> -
> -#if BITS_PER_LONG > 32
> -
> - /*
> - ** We can execute two loads and two adds per cycle on PA 8000.
> - ** But add insn's get serialized waiting for the carry bit.
> - ** Try to keep 4 registers with "live" values ahead of the ALU.
> - */
> -
> -" ldd,ma 8(%1), %4\n" /* get 1st saddr word */
> -" ldd,ma 8(%2), %5\n" /* get 1st daddr word */
> -" add %4, %0, %0\n"
> -" ldd,ma 8(%1), %6\n" /* 2nd saddr */
> -" ldd,ma 8(%2), %7\n" /* 2nd daddr */
> -" add,dc %5, %0, %0\n"
> -" add,dc %6, %0, %0\n"
> -" add,dc %7, %0, %0\n"
> -" add,dc %3, %0, %0\n" /* fold in proto+len | carry bit */
> -" extrd,u %0, 31, 32, %4\n"/* copy upper half down */
> -" depdi 0, 31, 32, %0\n"/* clear upper half */
> -" add %4, %0, %0\n" /* fold into 32-bits */
> -" addc 0, %0, %0\n" /* add carry */
> -
> -#else
> -
> - /*
> - ** For PA 1.x, the insn order doesn't matter as much.
> - ** Insn stream is serialized on the carry bit here too.
> - ** result from the previous operation (eg r0 + x)
> - */
> -" ldw,ma 4(%1), %4\n" /* get 1st saddr word */
> -" ldw,ma 4(%2), %5\n" /* get 1st daddr word */
> -" add %4, %0, %0\n"
> -" ldw,ma 4(%1), %6\n" /* 2nd saddr */
> -" addc %5, %0, %0\n"
> -" ldw,ma 4(%2), %7\n" /* 2nd daddr */
> -" addc %6, %0, %0\n"
> -" ldw,ma 4(%1), %4\n" /* 3rd saddr */
> -" addc %7, %0, %0\n"
> -" ldw,ma 4(%2), %5\n" /* 3rd daddr */
> -" addc %4, %0, %0\n"
> -" ldw,ma 4(%1), %6\n" /* 4th saddr */
> -" addc %5, %0, %0\n"
> -" ldw,ma 4(%2), %7\n" /* 4th daddr */
> -" addc %6, %0, %0\n"
> -" addc %7, %0, %0\n"
> -" addc %3, %0, %0\n" /* fold in proto+len, catch carry */
> -
> -#endif
> - : "=r" (sum), "=r" (saddr), "=r" (daddr), "=r" (len),
> - "=r" (t0), "=r" (t1), "=r" (t2), "=r" (t3)
> - : "0" (sum), "1" (saddr), "2" (daddr), "3" (len)
> - : "memory");
> - return csum_fold(sum);
> + __u64 sum = (__force __u64)csum;
> +
> + sum += (__force __u32)saddr->s6_addr32[0];
> + sum += (__force __u32)saddr->s6_addr32[1];
> + sum += (__force __u32)saddr->s6_addr32[2];
> + sum += (__force __u32)saddr->s6_addr32[3];
> + sum += (__force __u32)daddr->s6_addr32[0];
> + sum += (__force __u32)daddr->s6_addr32[1];
> + sum += (__force __u32)daddr->s6_addr32[2];
> + sum += (__force __u32)daddr->s6_addr32[3];
> + sum += len;
> + sum += proto;
> + sum += (sum >> 32) | (sum << 32);
> + return csum_fold((__force __wsum)(sum >> 32));
> }
>
> #endif
> -

Similar story again here where the add with carry is not well translated
into C, resulting in significantly worse assembly. Using __u64 seems to
be a big contributing factor for why the 32-bit assembly is worse.

I am not sure the best way to represent this in a clean way in C.

add with carry is not well understood by GCC 12.3 it seems. These
functions are generally heavily optimized on every architecture, so I
think it is worth it to default to assembly if you aren't able to
achieve comparable performance in C.

- Charlie

2024-02-17 22:48:11

by David Laight

[permalink] [raw]

Subject: RE: [PATCH] parisc: Fix csum_ipv6_magic on 64-bit systems

..
> We can do better than this! By inspection this looks like a performance
> regression. The generic version of csum_fold in
> include/asm-generic/checksum.h is better than this so should be used
> instead.

Yes, that got changed for 6.8-rc1 (I pretty much suggested the patch)
but hadn't noticed Linus has applied it.
That C version is (probably) not worse than any of the asm versions
except sparc32 - which has a carry flag but rotate.
(It is better than the x86-64 asm one.)

..
> This doesn't leverage add with carry well. This causes the code size of this
> to be dramatically larger than the original assembly, which I assume
> nicely correlates to an increased execution time.

It is pretty much impossible to do add with carry from C.
So an asm adc block is pretty much always going to win.

For csum_partial and short to moderate length buffers
on x86 it is hard to beat 10: adc, adc, dec, jnz 10b
which (on modern intel cpu at least) does 8 bytes/clock.
You can get 12 bytes/clock but it only really wins for 256+ bytes.
(See the current x86-64 version.)

For cpu without a carry flag it is likely that a common C
function will be pretty much optimal on all architectures.
(Or maybe a couple of implementations based the actual
cpu implementation - not the architecture.)

Mostly I don't think you can beat 4 instructions/word, but they
will pipeline so with multi-issue you might get a read/clock.
Arm's barrel shifter might give 3: v + *p; x += v, y += v >> 32.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2024-02-19 21:12:37

by Helge Deller

[permalink] [raw]

Subject: Re: [PATCH] parisc: Fix csum_ipv6_magic on 64-bit systems

On 2/17/24 04:00, Charlie Jenkins wrote:
> On Fri, Feb 16, 2024 at 01:38:51PM +0100, Helge Deller wrote:
>> * Guenter Roeck <[email protected]>:
>>> hppa 64-bit systems calculates the IPv6 checksum using 64-bit add
>>> operations. The last add folds protocol and length fields into the 64-bit
>>> result. While unlikely, this operation can overflow. The overflow can be
>>> triggered with a code sequence such as the following.
>>>
>>> /* try to trigger massive overflows */
>>> memset(tmp_buf, 0xff, sizeof(struct in6_addr));
>>> csum_result = csum_ipv6_magic((struct in6_addr *)tmp_buf,
>>> (struct in6_addr *)tmp_buf,
>>> 0xffff, 0xff, 0xffffffff);
>>>
>>> Fix the problem by adding any overflows from the final add operation into
>>> the calculated checksum. Fortunately, we can do this without additional
>>> cost by replacing the add operation used to fold the checksum into 32 bit
>>> with "add,dc" to add in the missing carry.
>>
>>
>> Gunter,
>>
>> Thanks for your patch for 32- and 64-bit systems.
>> But I think it's time to sunset the parisc inline assembly for ipv4/ipv6
>> checksumming.
>> The patch below converts the code to use standard C-coding (taken from the
>> s390 header file) and it survives the v8 lib/checksum patch.
>>
>> Opinions?
>> [...]

> We can do better than this! By inspection this looks like a performance
> regression.
> [...]
> Similar story again here where the add with carry is not well translated
> into C, resulting in significantly worse assembly. Using __u64 seems to
> be a big contributing factor for why the 32-bit assembly is worse.
>
> I am not sure the best way to represent this in a clean way in C.
>
> add with carry is not well understood by GCC 12.3 it seems. These
> functions are generally heavily optimized on every architecture, so I
> think it is worth it to default to assembly if you aren't able to
> achieve comparable performance in C.

Thanks a lot for your analysis!!!
I've now reverted my change to switch to generic code and applied
the 3 suggested patches from Guenter which fix the hppa assembly.
Let's see how they behave in the for-next git tree during the next few days.

Helge