2019-05-10 09:28:12

by Christophe Leroy

[permalink] [raw]
Subject: [PATCH v2] powerpc: slightly improve cache helpers

Cache instructions (dcbz, dcbi, dcbf and dcbst) take two registers
that are summed to obtain the target address. Using 'Z' constraint
and '%y0' argument gives GCC the opportunity to use both registers
instead of only one with the second being forced to 0.

Suggested-by: Segher Boessenkool <[email protected]>
Signed-off-by: Christophe Leroy <[email protected]>
---
arch/powerpc/include/asm/cache.h | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/cache.h b/arch/powerpc/include/asm/cache.h
index 40ea5b3781c6..df8e4c407366 100644
--- a/arch/powerpc/include/asm/cache.h
+++ b/arch/powerpc/include/asm/cache.h
@@ -85,22 +85,22 @@ extern void _set_L3CR(unsigned long);

static inline void dcbz(void *addr)
{
- __asm__ __volatile__ ("dcbz 0, %0" : : "r"(addr) : "memory");
+ __asm__ __volatile__ ("dcbz %y0" : : "Z"(*(u8 *)addr) : "memory");
}

static inline void dcbi(void *addr)
{
- __asm__ __volatile__ ("dcbi 0, %0" : : "r"(addr) : "memory");
+ __asm__ __volatile__ ("dcbi %y0" : : "Z"(*(u8 *)addr) : "memory");
}

static inline void dcbf(void *addr)
{
- __asm__ __volatile__ ("dcbf 0, %0" : : "r"(addr) : "memory");
+ __asm__ __volatile__ ("dcbf %y0" : : "Z"(*(u8 *)addr) : "memory");
}

static inline void dcbst(void *addr)
{
- __asm__ __volatile__ ("dcbst 0, %0" : : "r"(addr) : "memory");
+ __asm__ __volatile__ ("dcbst %y0" : : "Z"(*(u8 *)addr) : "memory");
}
#endif /* !__ASSEMBLY__ */
#endif /* __KERNEL__ */
--
2.13.3


2019-07-08 03:55:21

by Michael Ellerman

[permalink] [raw]
Subject: Re: [PATCH v2] powerpc: slightly improve cache helpers

On Fri, 2019-05-10 at 09:24:48 UTC, Christophe Leroy wrote:
> Cache instructions (dcbz, dcbi, dcbf and dcbst) take two registers
> that are summed to obtain the target address. Using 'Z' constraint
> and '%y0' argument gives GCC the opportunity to use both registers
> instead of only one with the second being forced to 0.
>
> Suggested-by: Segher Boessenkool <[email protected]>
> Signed-off-by: Christophe Leroy <[email protected]>

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/6c5875843b87c3adea2beade9d1b8b3d4523900a

cheers

2019-07-08 22:48:26

by Nathan Chancellor

[permalink] [raw]
Subject: Re: [PATCH v2] powerpc: slightly improve cache helpers

On Mon, Jul 08, 2019 at 11:19:30AM +1000, Michael Ellerman wrote:
> On Fri, 2019-05-10 at 09:24:48 UTC, Christophe Leroy wrote:
> > Cache instructions (dcbz, dcbi, dcbf and dcbst) take two registers
> > that are summed to obtain the target address. Using 'Z' constraint
> > and '%y0' argument gives GCC the opportunity to use both registers
> > instead of only one with the second being forced to 0.
> >
> > Suggested-by: Segher Boessenkool <[email protected]>
> > Signed-off-by: Christophe Leroy <[email protected]>
>
> Applied to powerpc next, thanks.
>
> https://git.kernel.org/powerpc/c/6c5875843b87c3adea2beade9d1b8b3d4523900a
>
> cheers

This patch causes a regression with clang:

https://travis-ci.com/ClangBuiltLinux/continuous-integration/jobs/213944668

I've attached my local bisect/build log.

Cheers,
Nathan


Attachments:
(No filename) (865.00 B)
bisect.log (2.65 kB)
build.log (91.81 kB)
Download all attachments

2019-07-09 05:07:16

by Christophe Leroy

[permalink] [raw]
Subject: Re: [PATCH v2] powerpc: slightly improve cache helpers



Le 08/07/2019 ? 21:14, Nathan Chancellor a ?crit?:
> On Mon, Jul 08, 2019 at 11:19:30AM +1000, Michael Ellerman wrote:
>> On Fri, 2019-05-10 at 09:24:48 UTC, Christophe Leroy wrote:
>>> Cache instructions (dcbz, dcbi, dcbf and dcbst) take two registers
>>> that are summed to obtain the target address. Using 'Z' constraint
>>> and '%y0' argument gives GCC the opportunity to use both registers
>>> instead of only one with the second being forced to 0.
>>>
>>> Suggested-by: Segher Boessenkool <[email protected]>
>>> Signed-off-by: Christophe Leroy <[email protected]>
>>
>> Applied to powerpc next, thanks.
>>
>> https://git.kernel.org/powerpc/c/6c5875843b87c3adea2beade9d1b8b3d4523900a
>>
>> cheers
>
> This patch causes a regression with clang:

Is that a Clang bug ?

Do you have a disassembly of the code both with and without this patch
in order to compare ?

Segher, any idea ?

Christophe

>
> https://travis-ci.com/ClangBuiltLinux/continuous-integration/jobs/213944668
>
> I've attached my local bisect/build log.
>
> Cheers,
> Nathan
>

2019-07-09 06:51:01

by Nathan Chancellor

[permalink] [raw]
Subject: Re: [PATCH v2] powerpc: slightly improve cache helpers

On Tue, Jul 09, 2019 at 07:04:43AM +0200, Christophe Leroy wrote:
>
>
> Le 08/07/2019 ? 21:14, Nathan Chancellor a ?crit?:
> > On Mon, Jul 08, 2019 at 11:19:30AM +1000, Michael Ellerman wrote:
> > > On Fri, 2019-05-10 at 09:24:48 UTC, Christophe Leroy wrote:
> > > > Cache instructions (dcbz, dcbi, dcbf and dcbst) take two registers
> > > > that are summed to obtain the target address. Using 'Z' constraint
> > > > and '%y0' argument gives GCC the opportunity to use both registers
> > > > instead of only one with the second being forced to 0.
> > > >
> > > > Suggested-by: Segher Boessenkool <[email protected]>
> > > > Signed-off-by: Christophe Leroy <[email protected]>
> > >
> > > Applied to powerpc next, thanks.
> > >
> > > https://git.kernel.org/powerpc/c/6c5875843b87c3adea2beade9d1b8b3d4523900a
> > >
> > > cheers
> >
> > This patch causes a regression with clang:
>
> Is that a Clang bug ?

No idea, it happens with clang-8 and clang-9 though (pretty sure there
were fixes for PowerPC in clang-8 so something before it probably won't
work but I haven't tried).

>
> Do you have a disassembly of the code both with and without this patch in
> order to compare ?

I can give you whatever disassembly you want (or I can upload the raw
files if that is easier).

Cheers,
Nathan

>
> Segher, any idea ?
>
> Christophe
>
> >
> > https://travis-ci.com/ClangBuiltLinux/continuous-integration/jobs/213944668
> >
> > I've attached my local bisect/build log.
> >
> > Cheers,
> > Nathan
> >

2019-07-09 13:37:07

by Segher Boessenkool

[permalink] [raw]
Subject: Re: [PATCH v2] powerpc: slightly improve cache helpers

On Tue, Jul 09, 2019 at 07:04:43AM +0200, Christophe Leroy wrote:
> Le 08/07/2019 ? 21:14, Nathan Chancellor a ?crit?:
> >On Mon, Jul 08, 2019 at 11:19:30AM +1000, Michael Ellerman wrote:
> >>On Fri, 2019-05-10 at 09:24:48 UTC, Christophe Leroy wrote:
> >>>Cache instructions (dcbz, dcbi, dcbf and dcbst) take two registers
> >>>that are summed to obtain the target address. Using 'Z' constraint
> >>>and '%y0' argument gives GCC the opportunity to use both registers
> >>>instead of only one with the second being forced to 0.
> >>>
> >>>Suggested-by: Segher Boessenkool <[email protected]>
> >>>Signed-off-by: Christophe Leroy <[email protected]>
> >>
> >>Applied to powerpc next, thanks.
> >>
> >>https://git.kernel.org/powerpc/c/6c5875843b87c3adea2beade9d1b8b3d4523900a
> >>
> >>cheers
> >
> >This patch causes a regression with clang:
>
> Is that a Clang bug ?

I would think so, but cannot tell from the given information.

> Do you have a disassembly of the code both with and without this patch
> in order to compare ?

That's what we need to start debugging this, yup.

> Segher, any idea ?

There is nothing I recognise, no.


Segher

2019-07-19 03:35:29

by Nathan Chancellor

[permalink] [raw]
Subject: Re: [PATCH v2] powerpc: slightly improve cache helpers

On Mon, Jul 08, 2019 at 11:49:52PM -0700, Nathan Chancellor wrote:
> On Tue, Jul 09, 2019 at 07:04:43AM +0200, Christophe Leroy wrote:
> > Is that a Clang bug ?
>
> No idea, it happens with clang-8 and clang-9 though (pretty sure there
> were fixes for PowerPC in clang-8 so something before it probably won't
> work but I haven't tried).
>
> >
> > Do you have a disassembly of the code both with and without this patch in
> > order to compare ?
>
> I can give you whatever disassembly you want (or I can upload the raw
> files if that is easier).
>
> Cheers,
> Nathan

Hi Christophe and Segher,

What disassembly/files did you need to start taking a look at this? I
can upload/send whatever you need.

If it is easier, we have a self contained clang build script available
to make it easier to reproduce this on your side (does assume an x86_64
host):

https://github.com/ClangBuiltLinux/tc-build

Cheers,
Nathan

2019-07-19 16:46:18

by Segher Boessenkool

[permalink] [raw]
Subject: Re: [PATCH v2] powerpc: slightly improve cache helpers

On Thu, Jul 18, 2019 at 08:24:56PM -0700, Nathan Chancellor wrote:
> On Mon, Jul 08, 2019 at 11:49:52PM -0700, Nathan Chancellor wrote:
> > On Tue, Jul 09, 2019 at 07:04:43AM +0200, Christophe Leroy wrote:
> > > Is that a Clang bug ?
> >
> > No idea, it happens with clang-8 and clang-9 though (pretty sure there
> > were fixes for PowerPC in clang-8 so something before it probably won't
> > work but I haven't tried).
> >
> > >
> > > Do you have a disassembly of the code both with and without this patch in
> > > order to compare ?
> >
> > I can give you whatever disassembly you want (or I can upload the raw
> > files if that is easier).
>
> What disassembly/files did you need to start taking a look at this? I
> can upload/send whatever you need.

A before and after of *only this patch*. And then look at what changed;
it maybe be obvious what is the problem to you, as well, so look at it
yourself first?


Segher

2019-07-19 19:32:32

by Nathan Chancellor

[permalink] [raw]
Subject: Re: [PATCH v2] powerpc: slightly improve cache helpers

On Fri, Jul 19, 2019 at 10:23:03AM -0500, Segher Boessenkool wrote:
> On Thu, Jul 18, 2019 at 08:24:56PM -0700, Nathan Chancellor wrote:
> > On Mon, Jul 08, 2019 at 11:49:52PM -0700, Nathan Chancellor wrote:
> > > On Tue, Jul 09, 2019 at 07:04:43AM +0200, Christophe Leroy wrote:
> > > > Is that a Clang bug ?
> > >
> > > No idea, it happens with clang-8 and clang-9 though (pretty sure there
> > > were fixes for PowerPC in clang-8 so something before it probably won't
> > > work but I haven't tried).
> > >
> > > >
> > > > Do you have a disassembly of the code both with and without this patch in
> > > > order to compare ?
> > >
> > > I can give you whatever disassembly you want (or I can upload the raw
> > > files if that is easier).
> >
> > What disassembly/files did you need to start taking a look at this? I
> > can upload/send whatever you need.
>
> A before and after of *only this patch*. And then look at what changed;
> it maybe be obvious what is the problem to you, as well, so look at it
> yourself first?
>
>
> Segher

Thanks, I will go ahead and disassemble the full kernel given that those
helpers are everywhere and see what I can find. I'll reach out again if
I can't come up with anything.

Thanks for the advice!
Nathan

2019-07-21 07:59:53

by Nathan Chancellor

[permalink] [raw]
Subject: Re: [PATCH v2] powerpc: slightly improve cache helpers

On Fri, Jul 19, 2019 at 09:04:55AM -0700, Nathan Chancellor wrote:
> On Fri, Jul 19, 2019 at 10:23:03AM -0500, Segher Boessenkool wrote:
> > On Thu, Jul 18, 2019 at 08:24:56PM -0700, Nathan Chancellor wrote:
> > > On Mon, Jul 08, 2019 at 11:49:52PM -0700, Nathan Chancellor wrote:
> > > > On Tue, Jul 09, 2019 at 07:04:43AM +0200, Christophe Leroy wrote:
> > > > > Is that a Clang bug ?
> > > >
> > > > No idea, it happens with clang-8 and clang-9 though (pretty sure there
> > > > were fixes for PowerPC in clang-8 so something before it probably won't
> > > > work but I haven't tried).
> > > >
> > > > >
> > > > > Do you have a disassembly of the code both with and without this patch in
> > > > > order to compare ?
> > > >
> > > > I can give you whatever disassembly you want (or I can upload the raw
> > > > files if that is easier).
> > >
> > > What disassembly/files did you need to start taking a look at this? I
> > > can upload/send whatever you need.
> >
> > A before and after of *only this patch*. And then look at what changed;
> > it maybe be obvious what is the problem to you, as well, so look at it
> > yourself first?
> >
> >
> > Segher

Hi Segher,

Looks like the problematic function is dcbz, as there is no segfault
when only that function is reverted to a
pre-6c5875843b87c3adea2beade9d1b8b3d4523900a state.

I was able to expose a singular problematic callsite using the attached
patch (let me know if that is insufficient).

I have attached the disassembly of arch/powerpc/kernel/mem.o with
clear_page (working) and broken_clear_page (broken), along with the side
by side diff. My assembly knowledge is fairly limited as it stands and
it is certainly not up to snuff on PowerPC so I have no idea what I am
looking for. Please let me know if anything immediately looks off or if
there is anything else I can do to help out.

Cheers,
Nathan


Attachments:
(No filename) (1.88 kB)
0001-powerpc-Test-broken-dcbz.patch (2.12 kB)
mem-working.txt (21.98 kB)
mem-broken.txt (22.25 kB)
mem-diff.txt (45.01 kB)
Download all attachments

2019-07-21 18:27:00

by Segher Boessenkool

[permalink] [raw]
Subject: Re: [PATCH v2] powerpc: slightly improve cache helpers

On Sun, Jul 21, 2019 at 12:58:46AM -0700, Nathan Chancellor wrote:
> I have attached the disassembly of arch/powerpc/kernel/mem.o with
> clear_page (working) and broken_clear_page (broken), along with the side
> by side diff. My assembly knowledge is fairly limited as it stands and
> it is certainly not up to snuff on PowerPC so I have no idea what I am
> looking for. Please let me know if anything immediately looks off or if
> there is anything else I can do to help out.

You might want to use a disassembler that shows most simplified mnemonics,
and you crucially should show the relocations. "objdump -dr" works nicely.

> 0000017c clear_user_page:
> 17c: 38 80 00 80 li 4, 128
> 180: 7c 89 03 a6 mtctr 4
> 184: 7c 00 1f ec dcbz 0, 3
> 188: 38 63 00 20 addi 3, 3, 32
> 18c: 42 00 ff f8 bdnz .+65528

That offset is incorrectly disassembled, btw (it's a signed field, not
unsigned).

> 0000017c clear_user_page:
> 17c: 94 21 ff f0 stwu 1, -16(1)
> 180: 38 80 00 80 li 4, 128
> 184: 38 63 ff e0 addi 3, 3, -32
> 188: 7c 89 03 a6 mtctr 4
> 18c: 38 81 00 0f addi 4, 1, 15
> 190: 8c c3 00 20 lbzu 6, 32(3)
> 194: 98 c1 00 0f stb 6, 15(1)
> 198: 7c 00 27 ec dcbz 0, 4
> 19c: 42 00 ff f4 bdnz .+65524

Uh, yeah, well, I have no idea what clang tried here, but that won't
work. It's copying a byte from each target cache line to the stack,
and then does clears the cache line containing that byte on the stack.

I *guess* this is about "Z" and not about "%y", but you'll have to ask
the clang people.

Or it may be that they do not treat inline asm operands as lvalues
properly? That rings some bells. Yeah that looks like it.


Segher

2019-07-22 00:55:32

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH] powerpc: Test broken dcbz

Hi Nathan,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[cannot apply to v5.3-rc1 next-20190719]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Nathan-Chancellor/powerpc-Test-broken-dcbz/20190722-051054
config: powerpc-defconfig (attached as .config)
compiler: powerpc64-linux-gcc (GCC) 7.4.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=7.4.0 make.cross ARCH=powerpc

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <[email protected]>

All errors (new ones prefixed by >>):

arch/powerpc/mm/mem.c: In function 'clear_user_page':
>> arch/powerpc/mm/mem.c:364:2: error: implicit declaration of function 'broken_clear_page'; did you mean 'bdev_read_page'? [-Werror=implicit-function-declaration]
broken_clear_page(page);
^~~~~~~~~~~~~~~~~
bdev_read_page
cc1: all warnings being treated as errors

vim +364 arch/powerpc/mm/mem.c

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation


Attachments:
(No filename) (1.37 kB)
.config.gz (24.73 kB)
Download all attachments

2019-07-22 02:51:59

by Nathan Chancellor

[permalink] [raw]
Subject: Re: [PATCH v2] powerpc: slightly improve cache helpers

Hi Segher,

On Sun, Jul 21, 2019 at 01:01:50PM -0500, Segher Boessenkool wrote:
> On Sun, Jul 21, 2019 at 12:58:46AM -0700, Nathan Chancellor wrote:
> > I have attached the disassembly of arch/powerpc/kernel/mem.o with
> > clear_page (working) and broken_clear_page (broken), along with the side
> > by side diff. My assembly knowledge is fairly limited as it stands and
> > it is certainly not up to snuff on PowerPC so I have no idea what I am
> > looking for. Please let me know if anything immediately looks off or if
> > there is anything else I can do to help out.
>
> You might want to use a disassembler that shows most simplified mnemonics,
> and you crucially should show the relocations. "objdump -dr" works nicely.

Copy, those are attached below if you want to take a further look at
them.

> > 0000017c clear_user_page:
> > 17c: 38 80 00 80 li 4, 128
> > 180: 7c 89 03 a6 mtctr 4
> > 184: 7c 00 1f ec dcbz 0, 3
> > 188: 38 63 00 20 addi 3, 3, 32
> > 18c: 42 00 ff f8 bdnz .+65528
>
> That offset is incorrectly disassembled, btw (it's a signed field, not
> unsigned).
>
> > 0000017c clear_user_page:
> > 17c: 94 21 ff f0 stwu 1, -16(1)
> > 180: 38 80 00 80 li 4, 128
> > 184: 38 63 ff e0 addi 3, 3, -32
> > 188: 7c 89 03 a6 mtctr 4
> > 18c: 38 81 00 0f addi 4, 1, 15
> > 190: 8c c3 00 20 lbzu 6, 32(3)
> > 194: 98 c1 00 0f stb 6, 15(1)
> > 198: 7c 00 27 ec dcbz 0, 4
> > 19c: 42 00 ff f4 bdnz .+65524
>
> Uh, yeah, well, I have no idea what clang tried here, but that won't
> work. It's copying a byte from each target cache line to the stack,
> and then does clears the cache line containing that byte on the stack.
>
> I *guess* this is about "Z" and not about "%y", but you'll have to ask
> the clang people.
>
> Or it may be that they do not treat inline asm operands as lvalues
> properly? That rings some bells. Yeah that looks like it.
>
>
> Segher

Okay, I think I understand... I think this is enough to bring up an LLVM
bug report but I'll ask some of the LLVM folks I know before doing so.

Thank you for all of the input, I really appreciate it,
Nathan


Attachments:
(No filename) (2.41 kB)
mem-working.txt (19.50 kB)
mem-broken.txt (19.72 kB)
mem-diff.txt (39.86 kB)
Download all attachments

2019-07-22 06:31:44

by Segher Boessenkool

[permalink] [raw]
Subject: Re: [PATCH v2] powerpc: slightly improve cache helpers

On Sun, Jul 21, 2019 at 07:41:40PM -0700, Nathan Chancellor wrote:
> Hi Segher,
>
> On Sun, Jul 21, 2019 at 01:01:50PM -0500, Segher Boessenkool wrote:
> > On Sun, Jul 21, 2019 at 12:58:46AM -0700, Nathan Chancellor wrote:
> > > 0000017c clear_user_page:
> > > 17c: 94 21 ff f0 stwu 1, -16(1)
> > > 180: 38 80 00 80 li 4, 128
> > > 184: 38 63 ff e0 addi 3, 3, -32
> > > 188: 7c 89 03 a6 mtctr 4
> > > 18c: 38 81 00 0f addi 4, 1, 15
> > > 190: 8c c3 00 20 lbzu 6, 32(3)
> > > 194: 98 c1 00 0f stb 6, 15(1)
> > > 198: 7c 00 27 ec dcbz 0, 4
> > > 19c: 42 00 ff f4 bdnz .+65524
> >
> > Uh, yeah, well, I have no idea what clang tried here, but that won't
> > work. It's copying a byte from each target cache line to the stack,
> > and then does clears the cache line containing that byte on the stack.
> >
> > I *guess* this is about "Z" and not about "%y", but you'll have to ask
> > the clang people.
> >
> > Or it may be that they do not treat inline asm operands as lvalues
> > properly? That rings some bells. Yeah that looks like it.

The code is
__asm__ __volatile__ ("dcbz %y0" : : "Z"(*(u8 *)addr) : "memory");

so yeah it looks like clang took that *(u8 *)addr as rvalue, and
stored that in stack, and then used *that* as memory.

Maybe clang simply does not not to treat "Z" the same as "m"? (And "Y"
and "Q" and "es" and a whole bunch of "w*", what about those?)


Segher

2019-07-22 10:17:21

by Michael Ellerman

[permalink] [raw]
Subject: Re: [PATCH v2] powerpc: slightly improve cache helpers

Segher Boessenkool <[email protected]> writes:
> On Sun, Jul 21, 2019 at 12:58:46AM -0700, Nathan Chancellor wrote:
>> I have attached the disassembly of arch/powerpc/kernel/mem.o with
>> clear_page (working) and broken_clear_page (broken), along with the side
>> by side diff. My assembly knowledge is fairly limited as it stands and
>> it is certainly not up to snuff on PowerPC so I have no idea what I am
>> looking for. Please let me know if anything immediately looks off or if
>> there is anything else I can do to help out.
>
> You might want to use a disassembler that shows most simplified mnemonics,
> and you crucially should show the relocations. "objdump -dr" works nicely.
>
>> 0000017c clear_user_page:
>> 17c: 38 80 00 80 li 4, 128
>> 180: 7c 89 03 a6 mtctr 4
>> 184: 7c 00 1f ec dcbz 0, 3
>> 188: 38 63 00 20 addi 3, 3, 32
>> 18c: 42 00 ff f8 bdnz .+65528
>
> That offset is incorrectly disassembled, btw (it's a signed field, not
> unsigned).
>
>> 0000017c clear_user_page:
>> 17c: 94 21 ff f0 stwu 1, -16(1)
>> 180: 38 80 00 80 li 4, 128
>> 184: 38 63 ff e0 addi 3, 3, -32
>> 188: 7c 89 03 a6 mtctr 4
>> 18c: 38 81 00 0f addi 4, 1, 15
>> 190: 8c c3 00 20 lbzu 6, 32(3)
>> 194: 98 c1 00 0f stb 6, 15(1)
>> 198: 7c 00 27 ec dcbz 0, 4
>> 19c: 42 00 ff f4 bdnz .+65524
>
> Uh, yeah, well, I have no idea what clang tried here, but that won't
> work. It's copying a byte from each target cache line to the stack,
> and then does clears the cache line containing that byte on the stack.

So it seems like this is a clang bug.

None of the distros we support use clang, but we would still like to
keep it working if we can.

Looking at the original patch, the only upside is that the compiler
can use both RA and RB to compute the address, rather than us forcing RA
to 0.

But at least with my compiler here (GCC 8 vintage) I don't actually see
GCC ever using both GPRs even with the patch. Or at least, there's no
difference before/after the patch as far as I can see.

So my inclination is to revert the original patch. We can try again in a
few years :D

Thoughts?

cheers

2019-07-22 17:31:27

by Segher Boessenkool

[permalink] [raw]
Subject: Re: [PATCH v2] powerpc: slightly improve cache helpers

On Mon, Jul 22, 2019 at 08:15:14PM +1000, Michael Ellerman wrote:
> Segher Boessenkool <[email protected]> writes:
> > On Sun, Jul 21, 2019 at 12:58:46AM -0700, Nathan Chancellor wrote:
> >> 0000017c clear_user_page:
> >> 17c: 94 21 ff f0 stwu 1, -16(1)
> >> 180: 38 80 00 80 li 4, 128
> >> 184: 38 63 ff e0 addi 3, 3, -32
> >> 188: 7c 89 03 a6 mtctr 4
> >> 18c: 38 81 00 0f addi 4, 1, 15
> >> 190: 8c c3 00 20 lbzu 6, 32(3)
> >> 194: 98 c1 00 0f stb 6, 15(1)
> >> 198: 7c 00 27 ec dcbz 0, 4
> >> 19c: 42 00 ff f4 bdnz .+65524
> >
> > Uh, yeah, well, I have no idea what clang tried here, but that won't
> > work. It's copying a byte from each target cache line to the stack,
> > and then does clears the cache line containing that byte on the stack.
>
> So it seems like this is a clang bug.
>
> None of the distros we support use clang, but we would still like to
> keep it working if we can.

Which version? Which versions *are* broken?

> Looking at the original patch, the only upside is that the compiler
> can use both RA and RB to compute the address, rather than us forcing RA
> to 0.
>
> But at least with my compiler here (GCC 8 vintage) I don't actually see
> GCC ever using both GPRs even with the patch. Or at least, there's no
> difference before/after the patch as far as I can see.

The benefit is small, certainly.

> So my inclination is to revert the original patch. We can try again in a
> few years :D
>
> Thoughts?

I think you should give the clang people time to figure out what is
going on.


Segher

2019-07-22 17:50:05

by Nick Desaulniers

[permalink] [raw]
Subject: Re: [PATCH v2] powerpc: slightly improve cache helpers

On Sun, Jul 21, 2019 at 11:19 PM Segher Boessenkool
<[email protected]> wrote:
>
> On Sun, Jul 21, 2019 at 07:41:40PM -0700, Nathan Chancellor wrote:
> > Hi Segher,
> >
> > On Sun, Jul 21, 2019 at 01:01:50PM -0500, Segher Boessenkool wrote:
> > > On Sun, Jul 21, 2019 at 12:58:46AM -0700, Nathan Chancellor wrote:
> > > > 0000017c clear_user_page:
> > > > 17c: 94 21 ff f0 stwu 1, -16(1)
> > > > 180: 38 80 00 80 li 4, 128
> > > > 184: 38 63 ff e0 addi 3, 3, -32
> > > > 188: 7c 89 03 a6 mtctr 4
> > > > 18c: 38 81 00 0f addi 4, 1, 15
> > > > 190: 8c c3 00 20 lbzu 6, 32(3)
> > > > 194: 98 c1 00 0f stb 6, 15(1)
> > > > 198: 7c 00 27 ec dcbz 0, 4
> > > > 19c: 42 00 ff f4 bdnz .+65524
> > >
> > > Uh, yeah, well, I have no idea what clang tried here, but that won't
> > > work. It's copying a byte from each target cache line to the stack,
> > > and then does clears the cache line containing that byte on the stack.
> > >
> > > I *guess* this is about "Z" and not about "%y", but you'll have to ask
> > > the clang people.
> > >
> > > Or it may be that they do not treat inline asm operands as lvalues
> > > properly? That rings some bells. Yeah that looks like it.
>
> The code is
> __asm__ __volatile__ ("dcbz %y0" : : "Z"(*(u8 *)addr) : "memory");
>
> so yeah it looks like clang took that *(u8 *)addr as rvalue, and
> stored that in stack, and then used *that* as memory.

What's the %y modifier supposed to mean here? addr is in the list of
inputs, so what's wrong with using it as an rvalue?

>
> Maybe clang simply does not not to treat "Z" the same as "m"? (And "Y"
> and "Q" and "es" and a whole bunch of "w*", what about those?)

--
Thanks,
~Nick Desaulniers

2019-07-22 23:11:25

by Segher Boessenkool

[permalink] [raw]
Subject: Re: [PATCH v2] powerpc: slightly improve cache helpers

On Mon, Jul 22, 2019 at 10:21:07AM -0700, Nick Desaulniers wrote:
> On Sun, Jul 21, 2019 at 11:19 PM Segher Boessenkool
> <[email protected]> wrote:
> > On Sun, Jul 21, 2019 at 07:41:40PM -0700, Nathan Chancellor wrote:
> > > On Sun, Jul 21, 2019 at 01:01:50PM -0500, Segher Boessenkool wrote:
> > > > On Sun, Jul 21, 2019 at 12:58:46AM -0700, Nathan Chancellor wrote:
> > > > > 0000017c clear_user_page:
> > > > > 17c: 94 21 ff f0 stwu 1, -16(1)
> > > > > 180: 38 80 00 80 li 4, 128
> > > > > 184: 38 63 ff e0 addi 3, 3, -32
> > > > > 188: 7c 89 03 a6 mtctr 4
> > > > > 18c: 38 81 00 0f addi 4, 1, 15
> > > > > 190: 8c c3 00 20 lbzu 6, 32(3)
> > > > > 194: 98 c1 00 0f stb 6, 15(1)
> > > > > 198: 7c 00 27 ec dcbz 0, 4
> > > > > 19c: 42 00 ff f4 bdnz .+65524
> > > >
> > > > Uh, yeah, well, I have no idea what clang tried here, but that won't
> > > > work. It's copying a byte from each target cache line to the stack,
> > > > and then does clears the cache line containing that byte on the stack.
> > > >
> > > > I *guess* this is about "Z" and not about "%y", but you'll have to ask
> > > > the clang people.
> > > >
> > > > Or it may be that they do not treat inline asm operands as lvalues
> > > > properly? That rings some bells. Yeah that looks like it.
> >
> > The code is
> > __asm__ __volatile__ ("dcbz %y0" : : "Z"(*(u8 *)addr) : "memory");
> >
> > so yeah it looks like clang took that *(u8 *)addr as rvalue, and
> > stored that in stack, and then used *that* as memory.
>
> What's the %y modifier supposed to mean here?

It prints a memory address for an indexed operand.

If you write just "%0" it prints addresses that are a single register
as "0(r3)" instead of "0,r3". Some instructions do not allow offset
form.

> addr is in the list of
> inputs, so what's wrong with using it as an rvalue?

It seems to use *(u8 *)addr as rvalue. Asm operands are lvalues. It
matters a lot for memory operands.

> > Maybe clang simply does not not to treat "Z" the same as "m"? (And "Y"
> > and "Q" and "es" and a whole bunch of "w*", what about those?)


Segher

2019-07-23 10:32:34

by Michael Ellerman

[permalink] [raw]
Subject: Re: [PATCH v2] powerpc: slightly improve cache helpers

Segher Boessenkool <[email protected]> writes:
> On Mon, Jul 22, 2019 at 08:15:14PM +1000, Michael Ellerman wrote:
>> Segher Boessenkool <[email protected]> writes:
>> > On Sun, Jul 21, 2019 at 12:58:46AM -0700, Nathan Chancellor wrote:
>> >> 0000017c clear_user_page:
>> >> 17c: 94 21 ff f0 stwu 1, -16(1)
>> >> 180: 38 80 00 80 li 4, 128
>> >> 184: 38 63 ff e0 addi 3, 3, -32
>> >> 188: 7c 89 03 a6 mtctr 4
>> >> 18c: 38 81 00 0f addi 4, 1, 15
>> >> 190: 8c c3 00 20 lbzu 6, 32(3)
>> >> 194: 98 c1 00 0f stb 6, 15(1)
>> >> 198: 7c 00 27 ec dcbz 0, 4
>> >> 19c: 42 00 ff f4 bdnz .+65524
>> >
>> > Uh, yeah, well, I have no idea what clang tried here, but that won't
>> > work. It's copying a byte from each target cache line to the stack,
>> > and then does clears the cache line containing that byte on the stack.
>>
>> So it seems like this is a clang bug.
>>
>> None of the distros we support use clang, but we would still like to
>> keep it working if we can.
>
> Which version? Which versions *are* broken?

AFAIK clang 8 is the first version that we could build with, without
hacks.

>> Looking at the original patch, the only upside is that the compiler
>> can use both RA and RB to compute the address, rather than us forcing RA
>> to 0.
>>
>> But at least with my compiler here (GCC 8 vintage) I don't actually see
>> GCC ever using both GPRs even with the patch. Or at least, there's no
>> difference before/after the patch as far as I can see.
>
> The benefit is small, certainly.

Zero is small, but I guess some things are smaller? :P

>> So my inclination is to revert the original patch. We can try again in a
>> few years :D
>>
>> Thoughts?
>
> I think you should give the clang people time to figure out what is
> going on.

Yeah fair enough, will wait and see what their diagnosis is.

cheers

2019-07-25 14:13:30

by Segher Boessenkool

[permalink] [raw]
Subject: Re: [PATCH v2] powerpc: slightly improve cache helpers

On Tue, Jul 23, 2019 at 09:21:53AM +1000, Michael Ellerman wrote:
> Segher Boessenkool <[email protected]> writes:
> >> can use both RA and RB to compute the address, rather than us forcing RA
> >> to 0.
> >>
> >> But at least with my compiler here (GCC 8 vintage) I don't actually see
> >> GCC ever using both GPRs even with the patch. Or at least, there's no
> >> difference before/after the patch as far as I can see.
> >
> > The benefit is small, certainly.
>
> Zero is small, but I guess some things are smaller? :P

Heh. 0 out of 12 is small.

It actually is quite easy to do trigger the macros to generate two-reg
dcb* instructions; but all the places where that is especially useful,
in loops for example, already use hand-written assembler code (and yes,
using two-reg forms).

You probably will not want to write those routines as plain C ever
given how important those are for performance (memset, clear-a-page),
so the dcb* macros won't ever be very hot, oh well.

> >> So my inclination is to revert the original patch. We can try again in a
> >> few years :D
> >>
> >> Thoughts?
> >
> > I think you should give the clang people time to figure out what is
> > going on.
>
> Yeah fair enough, will wait and see what their diagnosis is.

Thanks!


Segher

2019-07-25 21:32:42

by Nick Desaulniers

[permalink] [raw]
Subject: Re: [PATCH v2] powerpc: slightly improve cache helpers

On Mon, Jul 22, 2019 at 10:58 AM Segher Boessenkool
<[email protected]> wrote:
>
> On Mon, Jul 22, 2019 at 10:21:07AM -0700, Nick Desaulniers wrote:
> > On Sun, Jul 21, 2019 at 11:19 PM Segher Boessenkool
> > <[email protected]> wrote:
> > > On Sun, Jul 21, 2019 at 07:41:40PM -0700, Nathan Chancellor wrote:
> > > > On Sun, Jul 21, 2019 at 01:01:50PM -0500, Segher Boessenkool wrote:
> > > > > On Sun, Jul 21, 2019 at 12:58:46AM -0700, Nathan Chancellor wrote:
> > > > > > 0000017c clear_user_page:
> > > > > > 17c: 94 21 ff f0 stwu 1, -16(1)
> > > > > > 180: 38 80 00 80 li 4, 128
> > > > > > 184: 38 63 ff e0 addi 3, 3, -32
> > > > > > 188: 7c 89 03 a6 mtctr 4
> > > > > > 18c: 38 81 00 0f addi 4, 1, 15
> > > > > > 190: 8c c3 00 20 lbzu 6, 32(3)
> > > > > > 194: 98 c1 00 0f stb 6, 15(1)
> > > > > > 198: 7c 00 27 ec dcbz 0, 4
> > > > > > 19c: 42 00 ff f4 bdnz .+65524
> > > > >
> > > > > Uh, yeah, well, I have no idea what clang tried here, but that won't
> > > > > work. It's copying a byte from each target cache line to the stack,
> > > > > and then does clears the cache line containing that byte on the stack.
> > > > >
> > > > > I *guess* this is about "Z" and not about "%y", but you'll have to ask
> > > > > the clang people.
> > > > >
> > > > > Or it may be that they do not treat inline asm operands as lvalues
> > > > > properly? That rings some bells. Yeah that looks like it.
> > >
> > > The code is
> > > __asm__ __volatile__ ("dcbz %y0" : : "Z"(*(u8 *)addr) : "memory");
> > >
> > > so yeah it looks like clang took that *(u8 *)addr as rvalue, and
> > > stored that in stack, and then used *that* as memory.
> >
> > What's the %y modifier supposed to mean here?
>
> It prints a memory address for an indexed operand.
>
> If you write just "%0" it prints addresses that are a single register
> as "0(r3)" instead of "0,r3". Some instructions do not allow offset
> form.
>
> > addr is in the list of
> > inputs, so what's wrong with using it as an rvalue?
>
> It seems to use *(u8 *)addr as rvalue. Asm operands are lvalues. It
> matters a lot for memory operands.

Hmm...not sure that's specified behavior. Anyways, I've filed:
https://bugs.llvm.org/show_bug.cgi?id=42762
to see if folks more familiar with LLVM's ppc backend have some more thoughts.

I recommend considering reverting commit 6c5875843b87 ("powerpc:
slightly improve cache helpers") until the issue is resolved in clang,
otherwise I'll probably just turn off our CI builds of PPC32 for the
time being.
--
Thanks,
~Nick Desaulniers

2019-07-29 20:30:06

by Nick Desaulniers

[permalink] [raw]
Subject: Re: [PATCH v2] powerpc: slightly improve cache helpers

On Thu, Jul 25, 2019 at 2:30 PM Nick Desaulniers
<[email protected]> wrote:
>
> On Mon, Jul 22, 2019 at 10:58 AM Segher Boessenkool
> <[email protected]> wrote:
> >
> > On Mon, Jul 22, 2019 at 10:21:07AM -0700, Nick Desaulniers wrote:
> > > On Sun, Jul 21, 2019 at 11:19 PM Segher Boessenkool
> > > <[email protected]> wrote:
> > > > On Sun, Jul 21, 2019 at 07:41:40PM -0700, Nathan Chancellor wrote:
> > > > > On Sun, Jul 21, 2019 at 01:01:50PM -0500, Segher Boessenkool wrote:
> > > > > > On Sun, Jul 21, 2019 at 12:58:46AM -0700, Nathan Chancellor wrote:
> > > > > > > 0000017c clear_user_page:
> > > > > > > 17c: 94 21 ff f0 stwu 1, -16(1)
> > > > > > > 180: 38 80 00 80 li 4, 128
> > > > > > > 184: 38 63 ff e0 addi 3, 3, -32
> > > > > > > 188: 7c 89 03 a6 mtctr 4
> > > > > > > 18c: 38 81 00 0f addi 4, 1, 15
> > > > > > > 190: 8c c3 00 20 lbzu 6, 32(3)
> > > > > > > 194: 98 c1 00 0f stb 6, 15(1)
> > > > > > > 198: 7c 00 27 ec dcbz 0, 4
> > > > > > > 19c: 42 00 ff f4 bdnz .+65524
> > > > > >
> > > > > > Uh, yeah, well, I have no idea what clang tried here, but that won't
> > > > > > work. It's copying a byte from each target cache line to the stack,
> > > > > > and then does clears the cache line containing that byte on the stack.
> > > > > >
> > > > > > I *guess* this is about "Z" and not about "%y", but you'll have to ask
> > > > > > the clang people.
> > > > > >
> > > > > > Or it may be that they do not treat inline asm operands as lvalues
> > > > > > properly? That rings some bells. Yeah that looks like it.
> > > >
> > > > The code is
> > > > __asm__ __volatile__ ("dcbz %y0" : : "Z"(*(u8 *)addr) : "memory");
> > > >
> > > > so yeah it looks like clang took that *(u8 *)addr as rvalue, and
> > > > stored that in stack, and then used *that* as memory.
> > >
> > > What's the %y modifier supposed to mean here?
> >
> > It prints a memory address for an indexed operand.
> >
> > If you write just "%0" it prints addresses that are a single register
> > as "0(r3)" instead of "0,r3". Some instructions do not allow offset
> > form.
> >
> > > addr is in the list of
> > > inputs, so what's wrong with using it as an rvalue?
> >
> > It seems to use *(u8 *)addr as rvalue. Asm operands are lvalues. It
> > matters a lot for memory operands.
>
> Hmm...not sure that's specified behavior. Anyways, I've filed:
> https://bugs.llvm.org/show_bug.cgi?id=42762
> to see if folks more familiar with LLVM's ppc backend have some more thoughts.
>
> I recommend considering reverting commit 6c5875843b87 ("powerpc:
> slightly improve cache helpers") until the issue is resolved in clang,
> otherwise I'll probably just turn off our CI builds of PPC32 for the
> time being.

Started a new thread: https://lkml.org/lkml/2019/7/29/1483
--
Thanks,
~Nick Desaulniers