2016-04-15 10:36:35

by Denys Vlasenko

[permalink] [raw]
Subject: [PATCH] qla2xxx: rewrite code to avoid hitting gcc bug 70646

More info here:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70646

Signed-off-by: Denys Vlasenko <[email protected]>
CC: Himanshu Madhani <[email protected]>
CC: James Bottomley <[email protected]>
CC: [email protected]
CC: Josh Poimboeuf <[email protected]>
CC: [email protected]
CC: [email protected]
---
drivers/scsi/qla2xxx/qla_attr.c | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/drivers/scsi/qla2xxx/qla_attr.c b/drivers/scsi/qla2xxx/qla_attr.c
index 4dc06a13..2dd9c72 100644
--- a/drivers/scsi/qla2xxx/qla_attr.c
+++ b/drivers/scsi/qla2xxx/qla_attr.c
@@ -2003,9 +2003,14 @@ static void
qla2x00_get_host_fabric_name(struct Scsi_Host *shost)
{
scsi_qla_host_t *vha = shost_priv(shost);
+ /*
+ * This can trigger gcc 4.9/5.3 bug.
+ * See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70646
uint8_t node_name[WWN_SIZE] = { 0xFF, 0xFF, 0xFF, 0xFF, \
0xFF, 0xFF, 0xFF, 0xFF};
u64 fabric_name = wwn_to_u64(node_name);
+ */
+ u64 fabric_name = (u64)(s64)-1; /* the same as above */

if (vha->device_flags & SWITCH_FOUND)
fabric_name = wwn_to_u64(vha->fabric_node_name);
--
1.8.1.4


2016-04-15 14:40:09

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH] qla2xxx: rewrite code to avoid hitting gcc bug 70646

On Fri, 2016-04-15 at 12:36 +0200, Denys Vlasenko wrote:
> More info here:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70646

This bug is under investigation, so I'd rather not alter code for a gcc
bug until we know if we can supply options to fix it rather than
changing code.

James


> Signed-off-by: Denys Vlasenko <[email protected]>
> CC: Himanshu Madhani <[email protected]>
> CC: James Bottomley <[email protected]>
> CC: [email protected]
> CC: Josh Poimboeuf <[email protected]>
> CC: [email protected]
> CC: [email protected]
> ---
> drivers/scsi/qla2xxx/qla_attr.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/drivers/scsi/qla2xxx/qla_attr.c
> b/drivers/scsi/qla2xxx/qla_attr.c
> index 4dc06a13..2dd9c72 100644
> --- a/drivers/scsi/qla2xxx/qla_attr.c
> +++ b/drivers/scsi/qla2xxx/qla_attr.c
> @@ -2003,9 +2003,14 @@ static void
> qla2x00_get_host_fabric_name(struct Scsi_Host *shost)
> {
> scsi_qla_host_t *vha = shost_priv(shost);
> + /*
> + * This can trigger gcc 4.9/5.3 bug.
> + * See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70646
> uint8_t node_name[WWN_SIZE] = { 0xFF, 0xFF, 0xFF, 0xFF, \
> 0xFF, 0xFF, 0xFF, 0xFF};
> u64 fabric_name = wwn_to_u64(node_name);
> + */
> + u64 fabric_name = (u64)(s64)-1; /* the same as above */
>
> if (vha->device_flags & SWITCH_FOUND)
> fabric_name = wwn_to_u64(vha->fabric_node_name);

2016-04-15 18:56:36

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [PATCH] qla2xxx: rewrite code to avoid hitting gcc bug 70646

On 04/15/2016 04:40 PM, James Bottomley wrote:
> On Fri, 2016-04-15 at 12:36 +0200, Denys Vlasenko wrote:
>> More info here:
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70646
>
> This bug is under investigation, so I'd rather not alter code for a gcc
> bug until we know if we can supply options to fix it rather than
> changing code.


Background. The bug exists in gcc for 2 years, but it is rather
hard to trigger, so nobody noticed.

Unfortunately for kernel, these two commits landed in Linus tree
in March 16 and 17:


On 04/13/2016 05:36 AM, Josh Poimboeuf wrote:
> It occurs with the combination of the following two recent commits:
>
> - bc27fb68aaad ("include/uapi/linux/byteorder, swab: force inlining of some byteswap operations")
> - ef3fb2422ffe ("scsi: fc: use get/put_unaligned64 for wwn access")


and now *many* users of qla2x00 and new-ish gcc are going to
very much notice it, as their kernels will start crashing reliably.

The commits can be reverted, sure, but they per se do not contain
anything unusual. They, together with not very typical construct
in qla2x00_get_host_fabric_name, one
which boils down to "swab64p(constant_array_of_8_bytes)",
just happen to nudge gcc in a right way to finally trigger the bug.

So I came with another idea how to forestall the imminent deluge of
qla2x00 oops reports - this patch.

2016-04-15 19:05:35

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH] qla2xxx: rewrite code to avoid hitting gcc bug 70646

On Fri, 2016-04-15 at 20:56 +0200, Denys Vlasenko wrote:
> On 04/15/2016 04:40 PM, James Bottomley wrote:
> > On Fri, 2016-04-15 at 12:36 +0200, Denys Vlasenko wrote:
> > > More info here:
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70646
> >
> > This bug is under investigation, so I'd rather not alter code for a
> > gcc
> > bug until we know if we can supply options to fix it rather than
> > changing code.
>
>
> Background. The bug exists in gcc for 2 years, but it is rather
> hard to trigger, so nobody noticed.

We know this ... linux-scsi is on the cc for the other thread on this.

> Unfortunately for kernel, these two commits landed in Linus tree
> in March 16 and 17:
>
>
> On 04/13/2016 05:36 AM, Josh Poimboeuf wrote:
> > It occurs with the combination of the following two recent commits:
> >
> > - bc27fb68aaad ("include/uapi/linux/byteorder, swab: force inlining
> > of some byteswap operations")
> > - ef3fb2422ffe ("scsi: fc: use get/put_unaligned64 for wwn access")
>
>
> and now *many* users of qla2x00 and new-ish gcc are going to
> very much notice it, as their kernels will start crashing reliably.
>
> The commits can be reverted, sure, but they per se do not contain
> anything unusual. They, together with not very typical construct
> in qla2x00_get_host_fabric_name, one
> which boils down to "swab64p(constant_array_of_8_bytes)",
> just happen to nudge gcc in a right way to finally trigger the bug.
>
> So I came with another idea how to forestall the imminent deluge of
> qla2x00 oops reports - this patch.

There are actually a raft of checkers that run the upstream code which
aren't seeing any problem; likely because the code is harder to trigger
than you think. So, lets wait until the resolution of the other thread
before we panic, especially since we're only at -rc3.

James


2016-04-15 20:02:39

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCH] qla2xxx: rewrite code to avoid hitting gcc bug 70646

On Fri, Apr 15, 2016 at 12:05:26PM -0700, James Bottomley wrote:
> On Fri, 2016-04-15 at 20:56 +0200, Denys Vlasenko wrote:
> > and now *many* users of qla2x00 and new-ish gcc are going to
> > very much notice it, as their kernels will start crashing reliably.
> >
> > The commits can be reverted, sure, but they per se do not contain
> > anything unusual. They, together with not very typical construct
> > in qla2x00_get_host_fabric_name, one
> > which boils down to "swab64p(constant_array_of_8_bytes)",
> > just happen to nudge gcc in a right way to finally trigger the bug.
> >
> > So I came with another idea how to forestall the imminent deluge of
> > qla2x00 oops reports - this patch.
>
> There are actually a raft of checkers that run the upstream code which
> aren't seeing any problem; likely because the code is harder to trigger
> than you think. So, lets wait until the resolution of the other thread
> before we panic, especially since we're only at -rc3.

Regardless of the outcome of the gcc bug, it seems kind of silly to
byteswap a constant value of 0xffffffffffffffff.

uint8_t node_name[WWN_SIZE] = { 0xFF, 0xFF, 0xFF, 0xFF, \
0xFF, 0xFF, 0xFF, 0xFF};
u64 fabric_name = wwn_to_u64(node_name);

Similar to what Denys suggested, it can just be:

u64 fabric_name = -1;
or
u64 fabric_name = 0xffffffffffffffff;

Wouldn't that be an improvement to the code regardless?

--
Josh

2016-04-15 21:10:00

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [PATCH] qla2xxx: rewrite code to avoid hitting gcc bug 70646

On 04/15/2016 09:05 PM, James Bottomley wrote:
> On Fri, 2016-04-15 at 20:56 +0200, Denys Vlasenko wrote:
>> On 04/15/2016 04:40 PM, James Bottomley wrote:
>>> On Fri, 2016-04-15 at 12:36 +0200, Denys Vlasenko wrote:
>>>> More info here:
>>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70646
>>>
>>> This bug is under investigation, so I'd rather not alter code for a
>>> gcc
>>> bug until we know if we can supply options to fix it rather than
>>> changing code.
>>
>>
>> Background. The bug exists in gcc for 2 years, but it is rather
>> hard to trigger, so nobody noticed.
>
> We know this ... linux-scsi is on the cc for the other thread on this.
>
>> Unfortunately for kernel, these two commits landed in Linus tree
>> in March 16 and 17:
>>
>>
>> On 04/13/2016 05:36 AM, Josh Poimboeuf wrote:
>>> It occurs with the combination of the following two recent commits:
>>>
>>> - bc27fb68aaad ("include/uapi/linux/byteorder, swab: force inlining
>>> of some byteswap operations")
>>> - ef3fb2422ffe ("scsi: fc: use get/put_unaligned64 for wwn access")
>>
>>
>> and now *many* users of qla2x00 and new-ish gcc are going to
>> very much notice it, as their kernels will start crashing reliably.
>>
>> The commits can be reverted, sure, but they per se do not contain
>> anything unusual. They, together with not very typical construct
>> in qla2x00_get_host_fabric_name, one
>> which boils down to "swab64p(constant_array_of_8_bytes)",
>> just happen to nudge gcc in a right way to finally trigger the bug.
>>
>> So I came with another idea how to forestall the imminent deluge of
>> qla2x00 oops reports - this patch.
>
> There are actually a raft of checkers that run the upstream code which
> aren't seeing any problem; likely because the code is harder to trigger
> than you think. So, lets wait until the resolution of the other thread
> before we panic, especially since we're only at -rc3.

I'm not panicking, James.

By sending a workaround, I just want to make sure that *other people*
won't be forced to fix up a problem which surfaced because of *my* patch.


I'm afraid "harder to trigger than you think" is not true.
It is nearly trivial to trigger it now.
I just tried the following on a freshly installed Fedora 21 machine:

$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
$ cd linux
$ make defconfig
$ sed '/SCSI_FC_ATTRS/d;/SCSI_LOWLEVEL/d' -i .config
$ make oldconfig # answer "yes" to everything
$ nice make -j22
$ objdump -dr drivers/scsi/qla2xxx/qla_attr.o | grep -A10 qla2x00_get_host_fabric_name

0000000000001540 <qla2x00_get_host_fabric_name>:
1540: 55 push %rbp
1541: 48 89 e5 mov %rsp,%rbp
1544: 66 66 66 2e 0f 1f 84 data32 data32 nopw %cs:0x0(%rax,%rax,1)
154b: 00 00 00 00 00

0000000000001550 <qla2x00_fw_state_show>:
1550: 55 push %rbp
1551: 48 89 e5 mov %rsp,%rbp
1554: 53 push %rbx
1555: 48 89 d3 mov %rdx,%rbx

See?
I'm sure Fedora 22, 23 and 24 will also do that.

2016-04-15 21:16:04

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH] qla2xxx: rewrite code to avoid hitting gcc bug 70646

On Fri, 2016-04-15 at 15:02 -0500, Josh Poimboeuf wrote:
> On Fri, Apr 15, 2016 at 12:05:26PM -0700, James Bottomley wrote:
> > On Fri, 2016-04-15 at 20:56 +0200, Denys Vlasenko wrote:
> > > and now *many* users of qla2x00 and new-ish gcc are going to
> > > very much notice it, as their kernels will start crashing
> > > reliably.
> > >
> > > The commits can be reverted, sure, but they per se do not contain
> > > anything unusual. They, together with not very typical construct
> > > in qla2x00_get_host_fabric_name, one
> > > which boils down to "swab64p(constant_array_of_8_bytes)",
> > > just happen to nudge gcc in a right way to finally trigger the
> > > bug.
> > >
> > > So I came with another idea how to forestall the imminent deluge
> > > of
> > > qla2x00 oops reports - this patch.
> >
> > There are actually a raft of checkers that run the upstream code
> > which
> > aren't seeing any problem; likely because the code is harder to
> > trigger
> > than you think. So, lets wait until the resolution of the other
> > thread
> > before we panic, especially since we're only at -rc3.
>
> Regardless of the outcome of the gcc bug, it seems kind of silly to
> byteswap a constant value of 0xffffffffffffffff.
>
> uint8_t node_name[WWN_SIZE] = { 0xFF, 0xFF, 0xFF, 0xFF, \
> 0xFF, 0xFF, 0xFF, 0xFF};
> u64 fabric_name = wwn_to_u64(node_name);
>
> Similar to what Denys suggested, it can just be:
>
> u64 fabric_name = -1;
> or
> u64 fabric_name = 0xffffffffffffffff;
>
> Wouldn't that be an improvement to the code regardless?

"Improvement" would be in the eye of the beholder. Semantically it
would be wrong because we're initialising a CPU local representation of
a big endian structure, so we *should* use the conversion.

James

2016-04-15 21:25:12

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH] qla2xxx: rewrite code to avoid hitting gcc bug 70646

On Fri, 2016-04-15 at 23:09 +0200, Denys Vlasenko wrote:
> On 04/15/2016 09:05 PM, James Bottomley wrote:
> > On Fri, 2016-04-15 at 20:56 +0200, Denys Vlasenko wrote:
> > > On 04/15/2016 04:40 PM, James Bottomley wrote:
> > > > On Fri, 2016-04-15 at 12:36 +0200, Denys Vlasenko wrote:
> > > > > More info here:
> > > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70646
> > > >
> > > > This bug is under investigation, so I'd rather not alter code
> > > > for a
> > > > gcc
> > > > bug until we know if we can supply options to fix it rather
> > > > than
> > > > changing code.
> > >
> > >
> > > Background. The bug exists in gcc for 2 years, but it is rather
> > > hard to trigger, so nobody noticed.
> >
> > We know this ... linux-scsi is on the cc for the other thread on
> > this.
> >
> > > Unfortunately for kernel, these two commits landed in Linus tree
> > > in March 16 and 17:
> > >
> > >
> > > On 04/13/2016 05:36 AM, Josh Poimboeuf wrote:
> > > > It occurs with the combination of the following two recent
> > > > commits:
> > > >
> > > > - bc27fb68aaad ("include/uapi/linux/byteorder, swab: force
> > > > inlining
> > > > of some byteswap operations")
> > > > - ef3fb2422ffe ("scsi: fc: use get/put_unaligned64 for wwn
> > > > access")
> > >
> > >
> > > and now *many* users of qla2x00 and new-ish gcc are going to
> > > very much notice it, as their kernels will start crashing
> > > reliably.
> > >
> > > The commits can be reverted, sure, but they per se do not contain
> > > anything unusual. They, together with not very typical construct
> > > in qla2x00_get_host_fabric_name, one
> > > which boils down to "swab64p(constant_array_of_8_bytes)",
> > > just happen to nudge gcc in a right way to finally trigger the
> > > bug.
> > >
> > > So I came with another idea how to forestall the imminent deluge
> > > of
> > > qla2x00 oops reports - this patch.
> >
> > There are actually a raft of checkers that run the upstream code
> > which
> > aren't seeing any problem; likely because the code is harder to
> > trigger
> > than you think. So, lets wait until the resolution of the other
> > thread
> > before we panic, especially since we're only at -rc3.
>
> I'm not panicking, James.
>
> By sending a workaround, I just want to make sure that *other people*
> won't be forced to fix up a problem which surfaced because of *my*
> patch.

Look, if gcc really proves to be intractable, I think what should
happen is revert the triggering patch, which is

commit e3bde9568d992c5f985e6e30731a5f9f9bef7b13
Author: Denys Vlasenko <[email protected]>
Date: Thu Mar 17 14:22:47 2016 -0700

include/linux/unaligned: force inlining of byteswap operations

But, as I've said a couple of times now, there are no bug reports from
the testers about qla2xxx (yet) so we can afford to wait a bit and see
if there's some other resolution that doesn't involve changing kernel
code to work around a local gcc bug.

James