2021-01-19 13:32:02

by Adrian Ratiu

[permalink] [raw]
Subject: [PATCH v4 0/2] xor-neon: Remove GCC warn & pragmas

Dear all,

In v4 a Clang-specific vectorization warning was added at
Arnd suggestion.

This series does not address the Clang vectorize not working
bug itself which is a known pre-existing issued documented
at [1] [2] [3]. Clang vectorization needs to be investigated
in more deepth and fixed separately. The purpouse of this is
to only fix some low-hanging-fruit GCC related isues.

Tested on next-20210118 using GCC 10.2.0 and Clang 10.0.1.

[1] https://bugs.llvm.org/show_bug.cgi?id=40976
[2] https://github.com/ClangBuiltLinux/linux/issues/503
[3] https://github.com/ClangBuiltLinux/linux/issues/496

Kind regards,
Adrian

Adrian Ratiu (1):
arm: lib: xor-neon: move pragma options to makefile

Nathan Chancellor (1):
arm: lib: xor-neon: remove unnecessary GCC < 4.6 warning

arch/arm/lib/Makefile | 2 +-
arch/arm/lib/xor-neon.c | 18 +++++-------------
2 files changed, 6 insertions(+), 14 deletions(-)

--
2.30.0


2021-01-19 13:32:02

by Adrian Ratiu

[permalink] [raw]
Subject: [PATCH v4 1/2] arm: lib: xor-neon: remove unnecessary GCC < 4.6 warning

From: Nathan Chancellor <[email protected]>

Drop warning because kernel now requires GCC >= v4.9 after
commit 6ec4476ac825 ("Raise gcc version requirement to 4.9")
and clarify that -ftree-vectorize now always needs enabling
for GCC by directly testing the presence of CONFIG_CC_IS_GCC.

Another reason to remove the warning is that Clang exposes
itself as GCC < 4.6 so it triggers the warning about GCC
which doesn't make much sense and misleads Clang users by
telling them to update GCC.

Because Clang is now supported by the kernel print a clear
Clang-specific warning.

Link: https://github.com/ClangBuiltLinux/linux/issues/496
Link: https://github.com/ClangBuiltLinux/linux/issues/503
Reported-by: Nick Desaulniers <[email protected]>
Reviewed-by: Nick Desaulniers <[email protected]>
Signed-off-by: Nathan Chancellor <[email protected]>
Signed-off-by: Adrian Ratiu <[email protected]>
---
arch/arm/lib/xor-neon.c | 18 ++++++++++--------
1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
index b99dd8e1c93f..f9f3601cc2d1 100644
--- a/arch/arm/lib/xor-neon.c
+++ b/arch/arm/lib/xor-neon.c
@@ -14,20 +14,22 @@ MODULE_LICENSE("GPL");
#error You should compile this file with '-march=armv7-a -mfloat-abi=softfp -mfpu=neon'
#endif

+/*
+ * TODO: Even though -ftree-vectorize is enabled by default in Clang, the
+ * compiler does not produce vectorized code due to its cost model.
+ * See: https://github.com/ClangBuiltLinux/linux/issues/503
+ */
+#ifdef CONFIG_CC_IS_CLANG
+#warning Clang does not vectorize code in this file.
+#endif
+
/*
* Pull in the reference implementations while instructing GCC (through
* -ftree-vectorize) to attempt to exploit implicit parallelism and emit
* NEON instructions.
*/
-#if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 6)
+#ifdef CONFIG_CC_IS_GCC
#pragma GCC optimize "tree-vectorize"
-#else
-/*
- * While older versions of GCC do not generate incorrect code, they fail to
- * recognize the parallel nature of these functions, and emit plain ARM code,
- * which is known to be slower than the optimized ARM code in asm-arm/xor.h.
- */
-#warning This code requires at least version 4.6 of GCC
#endif

#pragma GCC diagnostic ignored "-Wunused-variable"
--
2.30.0

2021-01-19 21:23:46

by Nick Desaulniers

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] arm: lib: xor-neon: remove unnecessary GCC < 4.6 warning

On Tue, Jan 19, 2021 at 5:17 AM Adrian Ratiu <[email protected]> wrote:
>
> From: Nathan Chancellor <[email protected]>
>
> Drop warning because kernel now requires GCC >= v4.9 after
> commit 6ec4476ac825 ("Raise gcc version requirement to 4.9")
> and clarify that -ftree-vectorize now always needs enabling
> for GCC by directly testing the presence of CONFIG_CC_IS_GCC.
>
> Another reason to remove the warning is that Clang exposes
> itself as GCC < 4.6 so it triggers the warning about GCC
> which doesn't make much sense and misleads Clang users by
> telling them to update GCC.
>
> Because Clang is now supported by the kernel print a clear
> Clang-specific warning.
>
> Link: https://github.com/ClangBuiltLinux/linux/issues/496
> Link: https://github.com/ClangBuiltLinux/linux/issues/503
> Reported-by: Nick Desaulniers <[email protected]>
> Reviewed-by: Nick Desaulniers <[email protected]>

This is not the version of the patch I had reviewed; please drop my
reviewed-by tag when you change a patch significantly, as otherwise it
looks like I approved this patch.

Nacked-by: Nick Desaulniers <[email protected]>

> Signed-off-by: Nathan Chancellor <[email protected]>
> Signed-off-by: Adrian Ratiu <[email protected]>
> ---
> arch/arm/lib/xor-neon.c | 18 ++++++++++--------
> 1 file changed, 10 insertions(+), 8 deletions(-)
>
> diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
> index b99dd8e1c93f..f9f3601cc2d1 100644
> --- a/arch/arm/lib/xor-neon.c
> +++ b/arch/arm/lib/xor-neon.c
> @@ -14,20 +14,22 @@ MODULE_LICENSE("GPL");
> #error You should compile this file with '-march=armv7-a -mfloat-abi=softfp -mfpu=neon'
> #endif
>
> +/*
> + * TODO: Even though -ftree-vectorize is enabled by default in Clang, the
> + * compiler does not produce vectorized code due to its cost model.
> + * See: https://github.com/ClangBuiltLinux/linux/issues/503
> + */
> +#ifdef CONFIG_CC_IS_CLANG
> +#warning Clang does not vectorize code in this file.
> +#endif

Arnd, remind me again why it's a bug that the compiler's cost model
says it's faster to not produce a vectorized version of these loops?
I stand by my previous comment: https://bugs.llvm.org/show_bug.cgi?id=40976#c8

> +
> /*
> * Pull in the reference implementations while instructing GCC (through
> * -ftree-vectorize) to attempt to exploit implicit parallelism and emit
> * NEON instructions.
> */
> -#if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 6)
> +#ifdef CONFIG_CC_IS_GCC
> #pragma GCC optimize "tree-vectorize"
> -#else
> -/*
> - * While older versions of GCC do not generate incorrect code, they fail to
> - * recognize the parallel nature of these functions, and emit plain ARM code,
> - * which is known to be slower than the optimized ARM code in asm-arm/xor.h.
> - */
> -#warning This code requires at least version 4.6 of GCC
> #endif
>
> #pragma GCC diagnostic ignored "-Wunused-variable"
> --
> 2.30.0
>


--
Thanks,
~Nick Desaulniers

2021-01-19 21:39:16

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] arm: lib: xor-neon: remove unnecessary GCC < 4.6 warning

On Tue, Jan 19, 2021 at 10:18 PM 'Nick Desaulniers' via Clang Built
Linux <[email protected]> wrote:
>
> On Tue, Jan 19, 2021 at 5:17 AM Adrian Ratiu <[email protected]> wrote:
> > diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
> > index b99dd8e1c93f..f9f3601cc2d1 100644
> > --- a/arch/arm/lib/xor-neon.c
> > +++ b/arch/arm/lib/xor-neon.c
> > @@ -14,20 +14,22 @@ MODULE_LICENSE("GPL");
> > #error You should compile this file with '-march=armv7-a -mfloat-abi=softfp -mfpu=neon'
> > #endif
> >
> > +/*
> > + * TODO: Even though -ftree-vectorize is enabled by default in Clang, the
> > + * compiler does not produce vectorized code due to its cost model.
> > + * See: https://github.com/ClangBuiltLinux/linux/issues/503
> > + */
> > +#ifdef CONFIG_CC_IS_CLANG
> > +#warning Clang does not vectorize code in this file.
> > +#endif
>
> Arnd, remind me again why it's a bug that the compiler's cost model
> says it's faster to not produce a vectorized version of these loops?
> I stand by my previous comment: https://bugs.llvm.org/show_bug.cgi?id=40976#c8

The point is that without vectorizing the code, there is no point in building
both the default xor code and a "neon" version that has to save/restore
the neon registers but doesn't actually use them.

Arnd

2021-01-19 21:59:18

by Nathan Chancellor

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] arm: lib: xor-neon: remove unnecessary GCC < 4.6 warning

On Tue, Jan 19, 2021 at 03:17:23PM +0200, Adrian Ratiu wrote:
> From: Nathan Chancellor <[email protected]>
>
> Drop warning because kernel now requires GCC >= v4.9 after
> commit 6ec4476ac825 ("Raise gcc version requirement to 4.9")
> and clarify that -ftree-vectorize now always needs enabling
> for GCC by directly testing the presence of CONFIG_CC_IS_GCC.
>
> Another reason to remove the warning is that Clang exposes
> itself as GCC < 4.6 so it triggers the warning about GCC
> which doesn't make much sense and misleads Clang users by
> telling them to update GCC.
>
> Because Clang is now supported by the kernel print a clear
> Clang-specific warning.
>
> Link: https://github.com/ClangBuiltLinux/linux/issues/496
> Link: https://github.com/ClangBuiltLinux/linux/issues/503
> Reported-by: Nick Desaulniers <[email protected]>
> Reviewed-by: Nick Desaulniers <[email protected]>
> Signed-off-by: Nathan Chancellor <[email protected]>
> Signed-off-by: Adrian Ratiu <[email protected]>

The commit message looks like it is written by me but I never added a
Clang specific warning. I appreciate wanting to give me credit but when
you change things about my original commit message, please make it
clear that you did the edits, something like:

Signed-off-by: Nathan Chancellor <[email protected]>
[adrian: Add clang specific warning]
Signed-off-by: Adrian Ratiu <[email protected]>

> ---
> arch/arm/lib/xor-neon.c | 18 ++++++++++--------
> 1 file changed, 10 insertions(+), 8 deletions(-)
>
> diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
> index b99dd8e1c93f..f9f3601cc2d1 100644
> --- a/arch/arm/lib/xor-neon.c
> +++ b/arch/arm/lib/xor-neon.c
> @@ -14,20 +14,22 @@ MODULE_LICENSE("GPL");
> #error You should compile this file with '-march=armv7-a -mfloat-abi=softfp -mfpu=neon'
> #endif
>
> +/*
> + * TODO: Even though -ftree-vectorize is enabled by default in Clang, the
> + * compiler does not produce vectorized code due to its cost model.
> + * See: https://github.com/ClangBuiltLinux/linux/issues/503
> + */
> +#ifdef CONFIG_CC_IS_CLANG
> +#warning Clang does not vectorize code in this file.
> +#endif

I really do not like this. With the GCC specific warning, the user could
just upgrade their GCC. With this warning, it is basically telling them
don't use clang, in which case, it would just be better to disable this
code altogether. I would rather see:

1. Just don't build this file with clang altogether, which I believe was
v1's 2/2 patch.

OR

2. Use the pragma:

#pragma clang loop vectorize(enable)

as Nick suggests in v1's 2/2 patch.

Alternatively, __restrict__ sounds like it might be beneficial for both
GCC and clang:

https://lore.kernel.org/lkml/[email protected]/

> /*
> * Pull in the reference implementations while instructing GCC (through
> * -ftree-vectorize) to attempt to exploit implicit parallelism and emit
> * NEON instructions.
> */
> -#if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 6)
> +#ifdef CONFIG_CC_IS_GCC
> #pragma GCC optimize "tree-vectorize"
> -#else
> -/*
> - * While older versions of GCC do not generate incorrect code, they fail to
> - * recognize the parallel nature of these functions, and emit plain ARM code,
> - * which is known to be slower than the optimized ARM code in asm-arm/xor.h.
> - */
> -#warning This code requires at least version 4.6 of GCC
> #endif
>
> #pragma GCC diagnostic ignored "-Wunused-variable"
> --
> 2.30.0
>

2021-01-19 22:08:22

by David Laight

[permalink] [raw]
Subject: RE: [PATCH v4 1/2] arm: lib: xor-neon: remove unnecessary GCC < 4.6 warning

From: Adrian Ratiu
> Sent: 19 January 2021 13:17
> To: [email protected]
>
> Drop warning because kernel now requires GCC >= v4.9 after
> commit 6ec4476ac825 ("Raise gcc version requirement to 4.9")
> and clarify that -ftree-vectorize now always needs enabling
> for GCC by directly testing the presence of CONFIG_CC_IS_GCC.
...
> -#if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 6)
> +#ifdef CONFIG_CC_IS_GCC
> #pragma GCC optimize "tree-vectorize"
> -#else

I thought that #pragma optimise was very likely to remove
a random subset of the command line parameters leading to
incorrect/unexpected code.

As such the extra option needs to passed in as a per source
file compiler option.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2021-01-19 22:09:23

by Nick Desaulniers

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] arm: lib: xor-neon: remove unnecessary GCC < 4.6 warning

On Tue, Jan 19, 2021 at 2:04 PM David Laight <[email protected]> wrote:
>
> From: Adrian Ratiu
> > Sent: 19 January 2021 13:17
> > To: [email protected]
> >
> > Drop warning because kernel now requires GCC >= v4.9 after
> > commit 6ec4476ac825 ("Raise gcc version requirement to 4.9")
> > and clarify that -ftree-vectorize now always needs enabling
> > for GCC by directly testing the presence of CONFIG_CC_IS_GCC.
> ...
> > -#if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 6)
> > +#ifdef CONFIG_CC_IS_GCC
> > #pragma GCC optimize "tree-vectorize"
> > -#else
>
> I thought that #pragma optimise was very likely to remove
> a random subset of the command line parameters leading to
> incorrect/unexpected code.
>
> As such the extra option needs to passed in as a per source
> file compiler option.

Yes, patch 2 in the series removes it from .c source, and passes
-ftree-vectorize during Kbuild.

>
> David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
>


--
Thanks,
~Nick Desaulniers

2021-01-19 22:09:23

by Nick Desaulniers

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] arm: lib: xor-neon: remove unnecessary GCC < 4.6 warning

On Tue, Jan 19, 2021 at 1:35 PM Arnd Bergmann <[email protected]> wrote:
>
> On Tue, Jan 19, 2021 at 10:18 PM 'Nick Desaulniers' via Clang Built
> Linux <[email protected]> wrote:
> >
> > On Tue, Jan 19, 2021 at 5:17 AM Adrian Ratiu <[email protected]> wrote:
> > > diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
> > > index b99dd8e1c93f..f9f3601cc2d1 100644
> > > --- a/arch/arm/lib/xor-neon.c
> > > +++ b/arch/arm/lib/xor-neon.c
> > > @@ -14,20 +14,22 @@ MODULE_LICENSE("GPL");
> > > #error You should compile this file with '-march=armv7-a -mfloat-abi=softfp -mfpu=neon'
> > > #endif
> > >
> > > +/*
> > > + * TODO: Even though -ftree-vectorize is enabled by default in Clang, the
> > > + * compiler does not produce vectorized code due to its cost model.
> > > + * See: https://github.com/ClangBuiltLinux/linux/issues/503
> > > + */
> > > +#ifdef CONFIG_CC_IS_CLANG
> > > +#warning Clang does not vectorize code in this file.
> > > +#endif
> >
> > Arnd, remind me again why it's a bug that the compiler's cost model
> > says it's faster to not produce a vectorized version of these loops?
> > I stand by my previous comment: https://bugs.llvm.org/show_bug.cgi?id=40976#c8
>
> The point is that without vectorizing the code, there is no point in building
> both the default xor code and a "neon" version that has to save/restore
> the neon registers but doesn't actually use them.
>
> Arnd

Thoughts? Also, Nathan brings up my previous point about restrict.
This would benefit both GCC and Clang as they would not produce 2
"versions" of the loop; one vectorized if the std::distance between x
& y >= 8, one scalar otherwise. But the callers would have to ensure
no overlap otherwise UB.

diff --git a/include/asm-generic/xor.h b/include/asm-generic/xor.h
index b62a2a56a4d4..7db16adc7d89 100644
--- a/include/asm-generic/xor.h
+++ b/include/asm-generic/xor.h
@@ -7,12 +7,21 @@

#include <linux/prefetch.h>

+/* Overrule LLVM's cost model in order to always produce a vectorized loop
+ * version.
+ */
+#if defined(__clang__) && defined(CONFIG_ARM)
+#define FORCE_VECTORIZE _Pragma("clang loop vectorize(enable)")
+#else
+#define CLANG_FORCE_VECTORIZE
+#endif
+
static void
xor_8regs_2(unsigned long bytes, unsigned long *p1, unsigned long *p2)
{
long lines = bytes / (sizeof (long)) / 8;

- do {
+ FORCE_VECTORIZE do {
p1[0] ^= p2[0];
p1[1] ^= p2[1];
p1[2] ^= p2[2];
@@ -32,7 +41,7 @@ xor_8regs_3(unsigned long bytes, unsigned long *p1,
unsigned long *p2,
{
long lines = bytes / (sizeof (long)) / 8;

- do {
+ FORCE_VECTORIZE do {
p1[0] ^= p2[0] ^ p3[0];
p1[1] ^= p2[1] ^ p3[1];
p1[2] ^= p2[2] ^ p3[2];
@@ -53,7 +62,7 @@ xor_8regs_4(unsigned long bytes, unsigned long *p1,
unsigned long *p2,
{
long lines = bytes / (sizeof (long)) / 8;

- do {
+ FORCE_VECTORIZE do {
p1[0] ^= p2[0] ^ p3[0] ^ p4[0];
p1[1] ^= p2[1] ^ p3[1] ^ p4[1];
p1[2] ^= p2[2] ^ p3[2] ^ p4[2];
@@ -75,7 +84,7 @@ xor_8regs_5(unsigned long bytes, unsigned long *p1,
unsigned long *p2,
{
long lines = bytes / (sizeof (long)) / 8;

- do {
+ FORCE_VECTORIZE do {
p1[0] ^= p2[0] ^ p3[0] ^ p4[0] ^ p5[0];
p1[1] ^= p2[1] ^ p3[1] ^ p4[1] ^ p5[1];
p1[2] ^= p2[2] ^ p3[2] ^ p4[2] ^ p5[2];
--
Thanks,
~Nick Desaulniers

2021-01-19 23:14:20

by Nick Desaulniers

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] arm: lib: xor-neon: remove unnecessary GCC < 4.6 warning

On Tue, Jan 19, 2021 at 2:04 PM Nick Desaulniers
<[email protected]> wrote:
>
> On Tue, Jan 19, 2021 at 1:35 PM Arnd Bergmann <[email protected]> wrote:
> >
> > On Tue, Jan 19, 2021 at 10:18 PM 'Nick Desaulniers' via Clang Built
> > Linux <[email protected]> wrote:
> > >
> > > On Tue, Jan 19, 2021 at 5:17 AM Adrian Ratiu <[email protected]> wrote:
> > > > diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
> > > > index b99dd8e1c93f..f9f3601cc2d1 100644
> > > > --- a/arch/arm/lib/xor-neon.c
> > > > +++ b/arch/arm/lib/xor-neon.c
> > > > @@ -14,20 +14,22 @@ MODULE_LICENSE("GPL");
> > > > #error You should compile this file with '-march=armv7-a -mfloat-abi=softfp -mfpu=neon'
> > > > #endif
> > > >
> > > > +/*
> > > > + * TODO: Even though -ftree-vectorize is enabled by default in Clang, the
> > > > + * compiler does not produce vectorized code due to its cost model.
> > > > + * See: https://github.com/ClangBuiltLinux/linux/issues/503
> > > > + */
> > > > +#ifdef CONFIG_CC_IS_CLANG
> > > > +#warning Clang does not vectorize code in this file.
> > > > +#endif
> > >
> > > Arnd, remind me again why it's a bug that the compiler's cost model
> > > says it's faster to not produce a vectorized version of these loops?
> > > I stand by my previous comment: https://bugs.llvm.org/show_bug.cgi?id=40976#c8
> >
> > The point is that without vectorizing the code, there is no point in building
> > both the default xor code and a "neon" version that has to save/restore
> > the neon registers but doesn't actually use them.
> >
> > Arnd
>
> Thoughts? Also, Nathan brings up my previous point about restrict.
> This would benefit both GCC and Clang as they would not produce 2
> "versions" of the loop; one vectorized if the std::distance between x
> & y >= 8, one scalar otherwise. But the callers would have to ensure
> no overlap otherwise UB.
>
> diff --git a/include/asm-generic/xor.h b/include/asm-generic/xor.h
> index b62a2a56a4d4..7db16adc7d89 100644
> --- a/include/asm-generic/xor.h
> +++ b/include/asm-generic/xor.h
> @@ -7,12 +7,21 @@
>
> #include <linux/prefetch.h>
>
> +/* Overrule LLVM's cost model in order to always produce a vectorized loop
> + * version.
> + */
> +#if defined(__clang__) && defined(CONFIG_ARM)
> +#define FORCE_VECTORIZE _Pragma("clang loop vectorize(enable)")
> +#else
> +#define CLANG_FORCE_VECTORIZE

^ err, I had renamed it, but missed this. Should have been
`FORCE_VECTORIZE` but you catch the drift.

> +#endif
> +
> static void
> xor_8regs_2(unsigned long bytes, unsigned long *p1, unsigned long *p2)
> {
> long lines = bytes / (sizeof (long)) / 8;
>
> - do {
> + FORCE_VECTORIZE do {
> p1[0] ^= p2[0];
> p1[1] ^= p2[1];
> p1[2] ^= p2[2];
> @@ -32,7 +41,7 @@ xor_8regs_3(unsigned long bytes, unsigned long *p1,
> unsigned long *p2,
> {
> long lines = bytes / (sizeof (long)) / 8;
>
> - do {
> + FORCE_VECTORIZE do {
> p1[0] ^= p2[0] ^ p3[0];
> p1[1] ^= p2[1] ^ p3[1];
> p1[2] ^= p2[2] ^ p3[2];
> @@ -53,7 +62,7 @@ xor_8regs_4(unsigned long bytes, unsigned long *p1,
> unsigned long *p2,
> {
> long lines = bytes / (sizeof (long)) / 8;
>
> - do {
> + FORCE_VECTORIZE do {
> p1[0] ^= p2[0] ^ p3[0] ^ p4[0];
> p1[1] ^= p2[1] ^ p3[1] ^ p4[1];
> p1[2] ^= p2[2] ^ p3[2] ^ p4[2];
> @@ -75,7 +84,7 @@ xor_8regs_5(unsigned long bytes, unsigned long *p1,
> unsigned long *p2,
> {
> long lines = bytes / (sizeof (long)) / 8;
>
> - do {
> + FORCE_VECTORIZE do {
> p1[0] ^= p2[0] ^ p3[0] ^ p4[0] ^ p5[0];
> p1[1] ^= p2[1] ^ p3[1] ^ p4[1] ^ p5[1];
> p1[2] ^= p2[2] ^ p3[2] ^ p4[2] ^ p5[2];
> --
> Thanks,
> ~Nick Desaulniers



--
Thanks,
~Nick Desaulniers

2021-01-20 13:29:13

by Adrian Ratiu

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] arm: lib: xor-neon: remove unnecessary GCC < 4.6 warning

On Tue, 19 Jan 2021, Nathan Chancellor <[email protected]>
wrote:
> On Tue, Jan 19, 2021 at 03:17:23PM +0200, Adrian Ratiu wrote:
>> From: Nathan Chancellor <[email protected]> Drop
>> warning because kernel now requires GCC >= v4.9 after commit
>> 6ec4476ac825 ("Raise gcc version requirement to 4.9") and
>> clarify that -ftree-vectorize now always needs enabling for GCC
>> by directly testing the presence of CONFIG_CC_IS_GCC. Another
>> reason to remove the warning is that Clang exposes itself as
>> GCC < 4.6 so it triggers the warning about GCC which doesn't
>> make much sense and misleads Clang users by telling them to
>> update GCC. Because Clang is now supported by the kernel
>> print a clear Clang-specific warning. Link:
>> https://github.com/ClangBuiltLinux/linux/issues/496 Link:
>> https://github.com/ClangBuiltLinux/linux/issues/503
>> Reported-by: Nick Desaulniers <[email protected]>
>> Reviewed-by: Nick Desaulniers <[email protected]>
>> Signed-off-by: Nathan Chancellor <[email protected]>
>> Signed-off-by: Adrian Ratiu <[email protected]>
>
> The commit message looks like it is written by me but I never
> added a Clang specific warning. I appreciate wanting to give me
> credit but when you change things about my original commit
> message, please make it clear that you did the edits, something
> like:
>
> Signed-off-by: Nathan Chancellor <[email protected]>
> [adrian: Add clang specific warning] Signed-off-by: Adrian Ratiu
> <[email protected]>
>

Thanks for the suggestion. Makes sense. I contemplated adding
another patch by me on top but thought it was too much
churn. Sorry if my edits were unclear.

>> ---
>> arch/arm/lib/xor-neon.c | 18 ++++++++++-------- 1 file
>> changed, 10 insertions(+), 8 deletions(-)
>> diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
>> index b99dd8e1c93f..f9f3601cc2d1 100644 ---
>> a/arch/arm/lib/xor-neon.c +++ b/arch/arm/lib/xor-neon.c @@
>> -14,20 +14,22 @@ MODULE_LICENSE("GPL");
>> #error You should compile this file with '-march=armv7-a
>> -mfloat-abi=softfp -mfpu=neon' #endif
>> +/* + * TODO: Even though -ftree-vectorize is enabled by
>> default in Clang, the + * compiler does not produce vectorized
>> code due to its cost model. + * See:
>> https://github.com/ClangBuiltLinux/linux/issues/503 + */
>> +#ifdef CONFIG_CC_IS_CLANG +#warning Clang does not vectorize
>> code in this file. +#endif
>
> I really do not like this. With the GCC specific warning, the
> user could just upgrade their GCC. With this warning, it is
> basically telling them don't use clang, in which case, it would
> just be better to disable this code altogether. I would rather
> see:
>
> 1. Just don't build this file with clang altogether, which I
> believe was
> v1's 2/2 patch.
>
> OR
>
> 2. Use the pragma:
>
> #pragma clang loop vectorize(enable)
>
> as Nick suggests in v1's 2/2 patch.
>
> Alternatively, __restrict__ sounds like it might be beneficial
> for both GCC and clang:
>
> https://lore.kernel.org/lkml/[email protected]/
>

Option 1 from v1 got clearly NACKed by Nick a while back so the
only option gonig forward is to also fix clang vectorization
together with these changes so the warning becomes unnecessary.

>> /*
>> * Pull in the reference implementations while instructing GCC (through
>> * -ftree-vectorize) to attempt to exploit implicit parallelism and emit
>> * NEON instructions.
>> */
>> -#if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 6)
>> +#ifdef CONFIG_CC_IS_GCC
>> #pragma GCC optimize "tree-vectorize"
>> -#else
>> -/*
>> - * While older versions of GCC do not generate incorrect code, they fail to
>> - * recognize the parallel nature of these functions, and emit plain ARM code,
>> - * which is known to be slower than the optimized ARM code in asm-arm/xor.h.
>> - */
>> -#warning This code requires at least version 4.6 of GCC
>> #endif
>>
>> #pragma GCC diagnostic ignored "-Wunused-variable"
>> --
>> 2.30.0
>>

2021-01-20 14:52:12

by Adrian Ratiu

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] arm: lib: xor-neon: remove unnecessary GCC < 4.6 warning

On Tue, 19 Jan 2021, Nick Desaulniers <[email protected]>
wrote:
> On Tue, Jan 19, 2021 at 5:17 AM Adrian Ratiu
> <[email protected]> wrote:
>>
>> From: Nathan Chancellor <[email protected]>
>>
>> Drop warning because kernel now requires GCC >= v4.9 after
>> commit 6ec4476ac825 ("Raise gcc version requirement to 4.9")
>> and clarify that -ftree-vectorize now always needs enabling for
>> GCC by directly testing the presence of CONFIG_CC_IS_GCC.
>>
>> Another reason to remove the warning is that Clang exposes
>> itself as GCC < 4.6 so it triggers the warning about GCC which
>> doesn't make much sense and misleads Clang users by telling
>> them to update GCC.
>>
>> Because Clang is now supported by the kernel print a clear
>> Clang-specific warning.
>>
>> Link: https://github.com/ClangBuiltLinux/linux/issues/496 Link:
>> https://github.com/ClangBuiltLinux/linux/issues/503
>> Reported-by: Nick Desaulniers <[email protected]>
>> Reviewed-by: Nick Desaulniers <[email protected]>
>
> This is not the version of the patch I had reviewed; please drop
> my reviewed-by tag when you change a patch significantly, as
> otherwise it looks like I approved this patch.
>
> Nacked-by: Nick Desaulniers <[email protected]>
>

Sorry for not removing the reviewed-by tags from the previous
versions in this v4. I guess the only way forward with this is to
actually make clang vectorization work. Also thanks for the patch
suggestion in the other e-mail!

>> Signed-off-by: Nathan Chancellor <[email protected]>
>> Signed-off-by: Adrian Ratiu <[email protected]>
>> ---
>> arch/arm/lib/xor-neon.c | 18 ++++++++++--------
>> 1 file changed, 10 insertions(+), 8 deletions(-)
>>
>> diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
>> index b99dd8e1c93f..f9f3601cc2d1 100644
>> --- a/arch/arm/lib/xor-neon.c
>> +++ b/arch/arm/lib/xor-neon.c
>> @@ -14,20 +14,22 @@ MODULE_LICENSE("GPL");
>> #error You should compile this file with '-march=armv7-a -mfloat-abi=softfp -mfpu=neon'
>> #endif
>>
>> +/*
>> + * TODO: Even though -ftree-vectorize is enabled by default in Clang, the
>> + * compiler does not produce vectorized code due to its cost model.
>> + * See: https://github.com/ClangBuiltLinux/linux/issues/503
>> + */
>> +#ifdef CONFIG_CC_IS_CLANG
>> +#warning Clang does not vectorize code in this file.
>> +#endif
>
> Arnd, remind me again why it's a bug that the compiler's cost model
> says it's faster to not produce a vectorized version of these loops?
> I stand by my previous comment: https://bugs.llvm.org/show_bug.cgi?id=40976#c8
>
>> +
>> /*
>> * Pull in the reference implementations while instructing GCC (through
>> * -ftree-vectorize) to attempt to exploit implicit parallelism and emit
>> * NEON instructions.
>> */
>> -#if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 6)
>> +#ifdef CONFIG_CC_IS_GCC
>> #pragma GCC optimize "tree-vectorize"
>> -#else
>> -/*
>> - * While older versions of GCC do not generate incorrect code, they fail to
>> - * recognize the parallel nature of these functions, and emit plain ARM code,
>> - * which is known to be slower than the optimized ARM code in asm-arm/xor.h.
>> - */
>> -#warning This code requires at least version 4.6 of GCC
>> #endif
>>
>> #pragma GCC diagnostic ignored "-Wunused-variable"
>> --
>> 2.30.0
>>
>
>
> --
> Thanks,
> ~Nick Desaulniers

2021-01-21 02:37:35

by Nick Desaulniers

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] arm: lib: xor-neon: remove unnecessary GCC < 4.6 warning

On Wed, Jan 20, 2021 at 3:09 PM Nick Desaulniers
<[email protected]> wrote:
>
> On Tue, Jan 19, 2021 at 1:35 PM Arnd Bergmann <[email protected]> wrote:
> >
> > On Tue, Jan 19, 2021 at 10:18 PM 'Nick Desaulniers' via Clang Built
> > Linux <[email protected]> wrote:
> > >
> > > On Tue, Jan 19, 2021 at 5:17 AM Adrian Ratiu <[email protected]> wrote:
> > > > diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
> > > > index b99dd8e1c93f..f9f3601cc2d1 100644
> > > > --- a/arch/arm/lib/xor-neon.c
> > > > +++ b/arch/arm/lib/xor-neon.c
> > > > @@ -14,20 +14,22 @@ MODULE_LICENSE("GPL");
> > > > #error You should compile this file with '-march=armv7-a -mfloat-abi=softfp -mfpu=neon'
> > > > #endif
> > > >
> > > > +/*
> > > > + * TODO: Even though -ftree-vectorize is enabled by default in Clang, the
> > > > + * compiler does not produce vectorized code due to its cost model.
> > > > + * See: https://github.com/ClangBuiltLinux/linux/issues/503
> > > > + */
> > > > +#ifdef CONFIG_CC_IS_CLANG
> > > > +#warning Clang does not vectorize code in this file.
> > > > +#endif
> > >
> > > Arnd, remind me again why it's a bug that the compiler's cost model
> > > says it's faster to not produce a vectorized version of these loops?
> > > I stand by my previous comment: https://bugs.llvm.org/show_bug.cgi?id=40976#c8
> >
> > The point is that without vectorizing the code, there is no point in building
> > both the default xor code and a "neon" version that has to save/restore
> > the neon registers but doesn't actually use them.
>
> Doesn't that already happen today with GCC when the pointer arguments
> are overlapping? The loop is "versioned" and may not actually use the
> NEON registers at all at runtime for such arguments.
> https://godbolt.org/z/q48q8v See also:
> https://bugs.llvm.org/show_bug.cgi?id=40976#c11. Or am I missing
> something?
>
> So I'm thinking if we extend out this pattern to the rest of the
> functions, we can actually avoid calls to
> kernel_neon_begin()/kernel_neon_end() for cases in which pointers
> would be too close to use the vectorized loop version; meaning for GCC
> this would be an optimization (don't save neon registers when you're
> not going to use them). I would probably consider moving
> include/asm-generic/xor.h somewhere under arch/arm/
> perhaps...err...something for the other users of <asm-generic/xor.h>.
>
> diff --git a/arch/arm/include/asm/xor.h b/arch/arm/include/asm/xor.h
> index aefddec79286..abd748d317e8 100644
> --- a/arch/arm/include/asm/xor.h
> +++ b/arch/arm/include/asm/xor.h
> @@ -148,12 +148,12 @@ extern struct xor_block_template const
> xor_block_neon_inner;
> static void
> xor_neon_2(unsigned long bytes, unsigned long *p1, unsigned long *p2)
> {
> - if (in_interrupt()) {
> - xor_arm4regs_2(bytes, p1, p2);
> - } else {
> + if (!in_interrupt() && abs((uintptr_t)p1, (uintptr_t)p2) >= 8) {
> kernel_neon_begin();
> xor_block_neon_inner.do_2(bytes, p1, p2);
> kernel_neon_end();
> + } else {
> + xor_arm4regs_2(bytes, p1, p2);
> }
> }
> diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
> index b99dd8e1c93f..0e8e474c0523 100644
> --- a/arch/arm/lib/xor-neon.c
> +++ b/arch/arm/lib/xor-neon.c
> @@ -14,22 +14,6 @@ MODULE_LICENSE("GPL");
> #error You should compile this file with '-march=armv7-a
> -mfloat-abi=softfp -mfpu=neon'
> #endif
>
> -/*
> - * Pull in the reference implementations while instructing GCC (through
> - * -ftree-vectorize) to attempt to exploit implicit parallelism and emit
> - * NEON instructions.
> - */
> -#if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 6)
> -#pragma GCC optimize "tree-vectorize"

Err...we need to keep this but use the -f flag with __restrict for GCC
to vectorize:
https://godbolt.org/z/9acnEv

--
Thanks,
~Nick Desaulniers

2021-01-21 02:37:46

by Nick Desaulniers

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] arm: lib: xor-neon: remove unnecessary GCC < 4.6 warning

On Tue, Jan 19, 2021 at 1:35 PM Arnd Bergmann <[email protected]> wrote:
>
> On Tue, Jan 19, 2021 at 10:18 PM 'Nick Desaulniers' via Clang Built
> Linux <[email protected]> wrote:
> >
> > On Tue, Jan 19, 2021 at 5:17 AM Adrian Ratiu <[email protected]> wrote:
> > > diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
> > > index b99dd8e1c93f..f9f3601cc2d1 100644
> > > --- a/arch/arm/lib/xor-neon.c
> > > +++ b/arch/arm/lib/xor-neon.c
> > > @@ -14,20 +14,22 @@ MODULE_LICENSE("GPL");
> > > #error You should compile this file with '-march=armv7-a -mfloat-abi=softfp -mfpu=neon'
> > > #endif
> > >
> > > +/*
> > > + * TODO: Even though -ftree-vectorize is enabled by default in Clang, the
> > > + * compiler does not produce vectorized code due to its cost model.
> > > + * See: https://github.com/ClangBuiltLinux/linux/issues/503
> > > + */
> > > +#ifdef CONFIG_CC_IS_CLANG
> > > +#warning Clang does not vectorize code in this file.
> > > +#endif
> >
> > Arnd, remind me again why it's a bug that the compiler's cost model
> > says it's faster to not produce a vectorized version of these loops?
> > I stand by my previous comment: https://bugs.llvm.org/show_bug.cgi?id=40976#c8
>
> The point is that without vectorizing the code, there is no point in building
> both the default xor code and a "neon" version that has to save/restore
> the neon registers but doesn't actually use them.

Doesn't that already happen today with GCC when the pointer arguments
are overlapping? The loop is "versioned" and may not actually use the
NEON registers at all at runtime for such arguments.
https://godbolt.org/z/q48q8v See also:
https://bugs.llvm.org/show_bug.cgi?id=40976#c11. Or am I missing
something?

So I'm thinking if we extend out this pattern to the rest of the
functions, we can actually avoid calls to
kernel_neon_begin()/kernel_neon_end() for cases in which pointers
would be too close to use the vectorized loop version; meaning for GCC
this would be an optimization (don't save neon registers when you're
not going to use them). I would probably consider moving
include/asm-generic/xor.h somewhere under arch/arm/
perhaps...err...something for the other users of <asm-generic/xor.h>.

diff --git a/arch/arm/include/asm/xor.h b/arch/arm/include/asm/xor.h
index aefddec79286..abd748d317e8 100644
--- a/arch/arm/include/asm/xor.h
+++ b/arch/arm/include/asm/xor.h
@@ -148,12 +148,12 @@ extern struct xor_block_template const
xor_block_neon_inner;
static void
xor_neon_2(unsigned long bytes, unsigned long *p1, unsigned long *p2)
{
- if (in_interrupt()) {
- xor_arm4regs_2(bytes, p1, p2);
- } else {
+ if (!in_interrupt() && abs((uintptr_t)p1, (uintptr_t)p2) >= 8) {
kernel_neon_begin();
xor_block_neon_inner.do_2(bytes, p1, p2);
kernel_neon_end();
+ } else {
+ xor_arm4regs_2(bytes, p1, p2);
}
}
diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
index b99dd8e1c93f..0e8e474c0523 100644
--- a/arch/arm/lib/xor-neon.c
+++ b/arch/arm/lib/xor-neon.c
@@ -14,22 +14,6 @@ MODULE_LICENSE("GPL");
#error You should compile this file with '-march=armv7-a
-mfloat-abi=softfp -mfpu=neon'
#endif

-/*
- * Pull in the reference implementations while instructing GCC (through
- * -ftree-vectorize) to attempt to exploit implicit parallelism and emit
- * NEON instructions.
- */
-#if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 6)
-#pragma GCC optimize "tree-vectorize"
-#else
-/*
- * While older versions of GCC do not generate incorrect code, they fail to
- * recognize the parallel nature of these functions, and emit plain ARM code,
- * which is known to be slower than the optimized ARM code in asm-arm/xor.h.
- */
-#warning This code requires at least version 4.6 of GCC
-#endif
-
#pragma GCC diagnostic ignored "-Wunused-variable"
#include <asm-generic/xor.h>
diff --git a/include/asm-generic/xor.h b/include/asm-generic/xor.h
index b62a2a56a4d4..69df62095c33 100644
--- a/include/asm-generic/xor.h
+++ b/include/asm-generic/xor.h
@@ -8,7 +8,7 @@
#include <linux/prefetch.h>

static void
-xor_8regs_2(unsigned long bytes, unsigned long *p1, unsigned long *p2)
+xor_8regs_2(unsigned long bytes, unsigned long * __restrict p1,
unsigned long * __restrict p2)
{
long lines = bytes / (sizeof (long)) / 8;


--
Thanks,
~Nick Desaulniers

2021-01-21 04:28:09

by Nick Desaulniers

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] arm: lib: xor-neon: remove unnecessary GCC < 4.6 warning

On Wed, Jan 20, 2021 at 8:13 PM Arvind Sankar <[email protected]> wrote:
>
> On Wed, Jan 20, 2021 at 03:09:53PM -0800, Nick Desaulniers wrote:
> > So I'm thinking if we extend out this pattern to the rest of the
> > functions, we can actually avoid calls to
> > kernel_neon_begin()/kernel_neon_end() for cases in which pointers
> > would be too close to use the vectorized loop version; meaning for GCC
> > this would be an optimization (don't save neon registers when you're
> > not going to use them). I would probably consider moving
> > include/asm-generic/xor.h somewhere under arch/arm/
> > perhaps...err...something for the other users of <asm-generic/xor.h>.
>
> We can't directly do the patch below since there are other users of the
> asm-generic/xor.h implementations than just the neon file. If it's too
> much work to check and add __restrict everywhere, I think we'd either
> need to copy the code into the xor-neon file,

Yes; I'd rather copy that code into xor-neon, or someone under
arch/arm/ at least.

> or maybe do some ifdeffery
> so __restrict is only used for the neon version.

--
Thanks,
~Nick Desaulniers

2021-01-21 04:33:30

by Arvind Sankar

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] arm: lib: xor-neon: remove unnecessary GCC < 4.6 warning

On Wed, Jan 20, 2021 at 03:09:53PM -0800, Nick Desaulniers wrote:
> On Tue, Jan 19, 2021 at 1:35 PM Arnd Bergmann <[email protected]> wrote:
> >
> > On Tue, Jan 19, 2021 at 10:18 PM 'Nick Desaulniers' via Clang Built
> > Linux <[email protected]> wrote:
> > >
> > > On Tue, Jan 19, 2021 at 5:17 AM Adrian Ratiu <[email protected]> wrote:
> > > > diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
> > > > index b99dd8e1c93f..f9f3601cc2d1 100644
> > > > --- a/arch/arm/lib/xor-neon.c
> > > > +++ b/arch/arm/lib/xor-neon.c
> > > > @@ -14,20 +14,22 @@ MODULE_LICENSE("GPL");
> > > > #error You should compile this file with '-march=armv7-a -mfloat-abi=softfp -mfpu=neon'
> > > > #endif
> > > >
> > > > +/*
> > > > + * TODO: Even though -ftree-vectorize is enabled by default in Clang, the
> > > > + * compiler does not produce vectorized code due to its cost model.
> > > > + * See: https://github.com/ClangBuiltLinux/linux/issues/503
> > > > + */
> > > > +#ifdef CONFIG_CC_IS_CLANG
> > > > +#warning Clang does not vectorize code in this file.
> > > > +#endif
> > >
> > > Arnd, remind me again why it's a bug that the compiler's cost model
> > > says it's faster to not produce a vectorized version of these loops?
> > > I stand by my previous comment: https://bugs.llvm.org/show_bug.cgi?id=40976#c8
> >
> > The point is that without vectorizing the code, there is no point in building
> > both the default xor code and a "neon" version that has to save/restore
> > the neon registers but doesn't actually use them.
>
> Doesn't that already happen today with GCC when the pointer arguments
> are overlapping? The loop is "versioned" and may not actually use the
> NEON registers at all at runtime for such arguments.
> https://godbolt.org/z/q48q8v See also:
> https://bugs.llvm.org/show_bug.cgi?id=40976#c11. Or am I missing
> something?

The gcc version is at least useful when the arguments _don't_ overlap,
which is presumably most/all the time.

There's no point to building this file if it's not going to produce a
vectorized version at all. The warning seems unnecessary, and it's not
really a compiler bug either -- until we agree on a way to get clang to
produce a vectorized version, the best thing would be for the neon
version to not be built at all with clang. Is that too messy to achieve?

>
> So I'm thinking if we extend out this pattern to the rest of the
> functions, we can actually avoid calls to
> kernel_neon_begin()/kernel_neon_end() for cases in which pointers
> would be too close to use the vectorized loop version; meaning for GCC
> this would be an optimization (don't save neon registers when you're
> not going to use them). I would probably consider moving
> include/asm-generic/xor.h somewhere under arch/arm/
> perhaps...err...something for the other users of <asm-generic/xor.h>.

We can't directly do the patch below since there are other users of the
asm-generic/xor.h implementations than just the neon file. If it's too
much work to check and add __restrict everywhere, I think we'd either
need to copy the code into the xor-neon file, or maybe do some ifdeffery
so __restrict is only used for the neon version.

>
> diff --git a/arch/arm/include/asm/xor.h b/arch/arm/include/asm/xor.h
> index aefddec79286..abd748d317e8 100644
> --- a/arch/arm/include/asm/xor.h
> +++ b/arch/arm/include/asm/xor.h
> @@ -148,12 +148,12 @@ extern struct xor_block_template const
> xor_block_neon_inner;
> static void
> xor_neon_2(unsigned long bytes, unsigned long *p1, unsigned long *p2)
> {
> - if (in_interrupt()) {
> - xor_arm4regs_2(bytes, p1, p2);
> - } else {
> + if (!in_interrupt() && abs((uintptr_t)p1, (uintptr_t)p2) >= 8) {
> kernel_neon_begin();
> xor_block_neon_inner.do_2(bytes, p1, p2);
> kernel_neon_end();
> + } else {
> + xor_arm4regs_2(bytes, p1, p2);
> }
> }
> diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
> index b99dd8e1c93f..0e8e474c0523 100644
> --- a/arch/arm/lib/xor-neon.c
> +++ b/arch/arm/lib/xor-neon.c
> @@ -14,22 +14,6 @@ MODULE_LICENSE("GPL");
> #error You should compile this file with '-march=armv7-a
> -mfloat-abi=softfp -mfpu=neon'
> #endif
>
> -/*
> - * Pull in the reference implementations while instructing GCC (through
> - * -ftree-vectorize) to attempt to exploit implicit parallelism and emit
> - * NEON instructions.
> - */
> -#if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 6)
> -#pragma GCC optimize "tree-vectorize"
> -#else
> -/*
> - * While older versions of GCC do not generate incorrect code, they fail to
> - * recognize the parallel nature of these functions, and emit plain ARM code,
> - * which is known to be slower than the optimized ARM code in asm-arm/xor.h.
> - */
> -#warning This code requires at least version 4.6 of GCC
> -#endif
> -
> #pragma GCC diagnostic ignored "-Wunused-variable"
> #include <asm-generic/xor.h>
> diff --git a/include/asm-generic/xor.h b/include/asm-generic/xor.h
> index b62a2a56a4d4..69df62095c33 100644
> --- a/include/asm-generic/xor.h
> +++ b/include/asm-generic/xor.h
> @@ -8,7 +8,7 @@
> #include <linux/prefetch.h>
>
> static void
> -xor_8regs_2(unsigned long bytes, unsigned long *p1, unsigned long *p2)
> +xor_8regs_2(unsigned long bytes, unsigned long * __restrict p1,
> unsigned long * __restrict p2)
> {
> long lines = bytes / (sizeof (long)) / 8;
>
>
> --
> Thanks,
> ~Nick Desaulniers

2021-01-21 08:50:41

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] arm: lib: xor-neon: remove unnecessary GCC < 4.6 warning

On Thu, 21 Jan 2021 at 05:13, Arvind Sankar <[email protected]> wrote:
>
> On Wed, Jan 20, 2021 at 03:09:53PM -0800, Nick Desaulniers wrote:
> > On Tue, Jan 19, 2021 at 1:35 PM Arnd Bergmann <[email protected]> wrote:
> > >
> > > On Tue, Jan 19, 2021 at 10:18 PM 'Nick Desaulniers' via Clang Built
> > > Linux <[email protected]> wrote:
> > > >
> > > > On Tue, Jan 19, 2021 at 5:17 AM Adrian Ratiu <[email protected]> wrote:
> > > > > diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
> > > > > index b99dd8e1c93f..f9f3601cc2d1 100644
> > > > > --- a/arch/arm/lib/xor-neon.c
> > > > > +++ b/arch/arm/lib/xor-neon.c
> > > > > @@ -14,20 +14,22 @@ MODULE_LICENSE("GPL");
> > > > > #error You should compile this file with '-march=armv7-a -mfloat-abi=softfp -mfpu=neon'
> > > > > #endif
> > > > >
> > > > > +/*
> > > > > + * TODO: Even though -ftree-vectorize is enabled by default in Clang, the
> > > > > + * compiler does not produce vectorized code due to its cost model.
> > > > > + * See: https://github.com/ClangBuiltLinux/linux/issues/503
> > > > > + */
> > > > > +#ifdef CONFIG_CC_IS_CLANG
> > > > > +#warning Clang does not vectorize code in this file.
> > > > > +#endif
> > > >
> > > > Arnd, remind me again why it's a bug that the compiler's cost model
> > > > says it's faster to not produce a vectorized version of these loops?
> > > > I stand by my previous comment: https://bugs.llvm.org/show_bug.cgi?id=40976#c8
> > >
> > > The point is that without vectorizing the code, there is no point in building
> > > both the default xor code and a "neon" version that has to save/restore
> > > the neon registers but doesn't actually use them.
> >
> > Doesn't that already happen today with GCC when the pointer arguments
> > are overlapping? The loop is "versioned" and may not actually use the
> > NEON registers at all at runtime for such arguments.
> > https://godbolt.org/z/q48q8v See also:
> > https://bugs.llvm.org/show_bug.cgi?id=40976#c11. Or am I missing
> > something?
>
> The gcc version is at least useful when the arguments _don't_ overlap,
> which is presumably most/all the time.
>

Indeed

> There's no point to building this file if it's not going to produce a
> vectorized version at all. The warning seems unnecessary, and it's not
> really a compiler bug either -- until we agree on a way to get clang to
> produce a vectorized version, the best thing would be for the neon
> version to not be built at all with clang. Is that too messy to achieve?
>

+1

> >
> > So I'm thinking if we extend out this pattern to the rest of the
> > functions, we can actually avoid calls to
> > kernel_neon_begin()/kernel_neon_end() for cases in which pointers
> > would be too close to use the vectorized loop version; meaning for GCC
> > this would be an optimization (don't save neon registers when you're
> > not going to use them). I would probably consider moving
> > include/asm-generic/xor.h somewhere under arch/arm/
> > perhaps...err...something for the other users of <asm-generic/xor.h>.
>
> We can't directly do the patch below since there are other users of the
> asm-generic/xor.h implementations than just the neon file. If it's too
> much work to check and add __restrict everywhere, I think we'd either
> need to copy the code into the xor-neon file, or maybe do some ifdeffery
> so __restrict is only used for the neon version.
>

Don't mess with the code, please. It's really not worth it.

The current implementation works fine with overlapping inputs, but
works better when they don't. I don't see why we would change that.

If Clang cannot be forced to vectorize this code, then just disable it
for Clang - it's not the end of the world, who runs xor_blocks() on a
hot path on 32-bit ARM anyway?



> >
> > diff --git a/arch/arm/include/asm/xor.h b/arch/arm/include/asm/xor.h
> > index aefddec79286..abd748d317e8 100644
> > --- a/arch/arm/include/asm/xor.h
> > +++ b/arch/arm/include/asm/xor.h
> > @@ -148,12 +148,12 @@ extern struct xor_block_template const
> > xor_block_neon_inner;
> > static void
> > xor_neon_2(unsigned long bytes, unsigned long *p1, unsigned long *p2)
> > {
> > - if (in_interrupt()) {
> > - xor_arm4regs_2(bytes, p1, p2);
> > - } else {
> > + if (!in_interrupt() && abs((uintptr_t)p1, (uintptr_t)p2) >= 8) {
> > kernel_neon_begin();
> > xor_block_neon_inner.do_2(bytes, p1, p2);
> > kernel_neon_end();
> > + } else {
> > + xor_arm4regs_2(bytes, p1, p2);
> > }
> > }
> > diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
> > index b99dd8e1c93f..0e8e474c0523 100644
> > --- a/arch/arm/lib/xor-neon.c
> > +++ b/arch/arm/lib/xor-neon.c
> > @@ -14,22 +14,6 @@ MODULE_LICENSE("GPL");
> > #error You should compile this file with '-march=armv7-a
> > -mfloat-abi=softfp -mfpu=neon'
> > #endif
> >
> > -/*
> > - * Pull in the reference implementations while instructing GCC (through
> > - * -ftree-vectorize) to attempt to exploit implicit parallelism and emit
> > - * NEON instructions.
> > - */
> > -#if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 6)
> > -#pragma GCC optimize "tree-vectorize"
> > -#else
> > -/*
> > - * While older versions of GCC do not generate incorrect code, they fail to
> > - * recognize the parallel nature of these functions, and emit plain ARM code,
> > - * which is known to be slower than the optimized ARM code in asm-arm/xor.h.
> > - */
> > -#warning This code requires at least version 4.6 of GCC
> > -#endif
> > -
> > #pragma GCC diagnostic ignored "-Wunused-variable"
> > #include <asm-generic/xor.h>
> > diff --git a/include/asm-generic/xor.h b/include/asm-generic/xor.h
> > index b62a2a56a4d4..69df62095c33 100644
> > --- a/include/asm-generic/xor.h
> > +++ b/include/asm-generic/xor.h
> > @@ -8,7 +8,7 @@
> > #include <linux/prefetch.h>
> >
> > static void
> > -xor_8regs_2(unsigned long bytes, unsigned long *p1, unsigned long *p2)
> > +xor_8regs_2(unsigned long bytes, unsigned long * __restrict p1,
> > unsigned long * __restrict p2)
> > {
> > long lines = bytes / (sizeof (long)) / 8;
> >
> >
> > --
> > Thanks,
> > ~Nick Desaulniers