2023-12-07 13:10:27

by Genes Lists

Subject: md raid6 oops in 6.6.4 stable

I have not had a chance to git bisect this, but since it happened in stable
I thought it was important to share sooner rather than later.

One possibly relevant commit between 6.6.3 and 6.6.4 could be:

commit 2c975b0b8b11f1ffb1ed538609e2c89d8abf800e
Author: Song Liu <[email protected]>
Date: Fri Nov 17 15:56:30 2023 -0800

md: fix bi_status reporting in md_end_clone_io

log attached shows page_fault_oops.
Machine was up for 3 days before crash happened.

gene


Attachments:
raid6-crash (4.04 kB)

2023-12-07 13:31:21

by Bagas Sanjaya

Subject: Re: md raid6 oops in 6.6.4 stable

On Thu, Dec 07, 2023 at 08:10:04AM -0500, Genes Lists wrote:
> I have not had chance to git bisect this but since it happened in stable I
> thought it was important to share sooner than later.
>
> One possibly relevant commit between 6.6.3 and 6.6.4 could be:
>
> commit 2c975b0b8b11f1ffb1ed538609e2c89d8abf800e
> Author: Song Liu <[email protected]>
> Date: Fri Nov 17 15:56:30 2023 -0800
>
> md: fix bi_status reporting in md_end_clone_io
>
> log attached shows page_fault_oops.
> Machine was up for 3 days before crash happened.
>

Can you confirm that culprit by bisection?

--
An old man doll... just what I always wanted! - Clara



2023-12-07 13:56:05

by Genes Lists

Subject: Re: md raid6 oops in 6.6.4 stable

On 12/7/23 08:30, Bagas Sanjaya wrote:
> On Thu, Dec 07, 2023 at 08:10:04AM -0500, Genes Lists wrote:
>> I have not had chance to git bisect this but since it happened in stable I
>> thought it was important to share sooner than later.
>>
>> One possibly relevant commit between 6.6.3 and 6.6.4 could be:
>>
>> commit 2c975b0b8b11f1ffb1ed538609e2c89d8abf800e
>> Author: Song Liu <[email protected]>
>> Date: Fri Nov 17 15:56:30 2023 -0800
>>
>> md: fix bi_status reporting in md_end_clone_io
>>
>> log attached shows page_fault_oops.
>> Machine was up for 3 days before crash happened.
>>
>
> Can you confirm that culprit by bisection?
>

That's the plan - however, turnaround could be horribly slow if the
average wait time to crash is on the order of a few days between each
bisect step. Also, the machine is currently in use, so I will need to deal
with that as well. Will do my best.

Fingers crossed someone might just spot something in the meantime.

The commit mentioned above ensures underlying errors are not hidden, so
it may simply have exposed a pre-existing issue rather than being the
actual 'culprit'.
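
If I understand the commit correctly, the change boils down to not letting
a later successful clone clear an error already recorded on the original
bio. A tiny stand-alone illustration of that idea (the helper names and
status values are made up for the example, not the kernel's):

#include <stdio.h>

/* Made-up status values for the example; 0 is success, non-zero an error. */
#define STS_OK    0
#define STS_IOERR 10

/* Old behaviour (paraphrased): the clone's status unconditionally
 * overwrites the original bio's status, so a later good clone can hide
 * an earlier error. */
static unsigned char propagate_old(unsigned char orig, unsigned char clone)
{
	(void)orig;   /* any prior status is ignored */
	return clone;
}

/* New behaviour (paraphrased): only record an error, and never overwrite
 * one that is already set. */
static unsigned char propagate_new(unsigned char orig, unsigned char clone)
{
	if (clone && !orig)
		orig = clone;
	return orig;
}

int main(void)
{
	/* one clone fails, a later clone completes fine */
	unsigned char old_sts = propagate_old(propagate_old(STS_OK, STS_IOERR), STS_OK);
	unsigned char new_sts = propagate_new(propagate_new(STS_OK, STS_IOERR), STS_OK);

	printf("old: %u (error lost)\nnew: %u (error kept)\n", old_sts, new_sts);
	return 0;
}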

thanks

gene

by Thorsten Leemhuis

Subject: Re: md raid6 oops in 6.6.4 stable

On 07.12.23 14:30, Bagas Sanjaya wrote:
> On Thu, Dec 07, 2023 at 08:10:04AM -0500, Genes Lists wrote:
>> I have not had chance to git bisect this but since it happened in stable I
>> thought it was important to share sooner than later.
>>
>> One possibly relevant commit between 6.6.3 and 6.6.4 could be:
>>
>> commit 2c975b0b8b11f1ffb1ed538609e2c89d8abf800e
>> Author: Song Liu <[email protected]>
>> Date: Fri Nov 17 15:56:30 2023 -0800
>>
>> md: fix bi_status reporting in md_end_clone_io
>>
>> log attached shows page_fault_oops.
>> Machine was up for 3 days before crash happened.
>
> Can you confirm that culprit by bisection?

Bagas, I know you are trying to help, but sorry, I'd say this is not
helpful at all -- and maybe even harmful.

From the quoted text it's pretty clear that the reporter knows that a
bisection would be helpful, but is currently unable to perform one --
and even states reasons for reporting it without having it bisected. So
your message afaics doesn't bring anything new to the table; and I might
be wrong about that, but I fear some people in a situation like this
might even be offended by a reply like that, as it states something
already obvious.

Ciao, Thorsten

2023-12-07 14:52:58

by Guoqing Jiang

Subject: Re: md raid6 oops in 6.6.4 stable

Hi,

On 12/7/23 21:55, Genes Lists wrote:
> On 12/7/23 08:30, Bagas Sanjaya wrote:
>> On Thu, Dec 07, 2023 at 08:10:04AM -0500, Genes Lists wrote:
>>> I have not had chance to git bisect this but since it happened in
>>> stable I
>>> thought it was important to share sooner than later.
>>>
>>> One possibly relevant commit between 6.6.3 and 6.6.4 could be:
>>>
>>>    commit 2c975b0b8b11f1ffb1ed538609e2c89d8abf800e
>>>    Author: Song Liu <[email protected]>
>>>    Date:   Fri Nov 17 15:56:30 2023 -0800
>>>
>>>      md: fix bi_status reporting in md_end_clone_io
>>>
>>> log attached shows page_fault_oops.
>>> Machine was up for 3 days before crash happened.

Could you decode the oops ([1])? (I can't find it in lore for some reason.)
And can it be reproduced reliably? If so, please share the steps to
reproduce.

[1]. https://lwn.net/Articles/592724/

Thanks,
Guoqing

2023-12-07 15:58:32

by Genes Lists

Subject: Re: md raid6 oops in 6.6.4 stable

On 12/7/23 09:42, Guoqing Jiang wrote:
> Hi,
>
> On 12/7/23 21:55, Genes Lists wrote:
>> On 12/7/23 08:30, Bagas Sanjaya wrote:
>>> On Thu, Dec 07, 2023 at 08:10:04AM -0500, Genes Lists wrote:
>>>> I have not had chance to git bisect this but since it happened in
>>>> stable I
>>>> thought it was important to share sooner than later.
>>>>
>>>> One possibly relevant commit between 6.6.3 and 6.6.4 could be:
>>>>
>>>>    commit 2c975b0b8b11f1ffb1ed538609e2c89d8abf800e
>>>>    Author: Song Liu <[email protected]>
>>>>    Date:   Fri Nov 17 15:56:30 2023 -0800
>>>>
>>>>      md: fix bi_status reporting in md_end_clone_io
>>>>
>>>> log attached shows page_fault_oops.
>>>> Machine was up for 3 days before crash happened.
>
> Could you decode the oops (I can't find it in lore for some reason)
> ([1])? And
> can it be reproduced reliably? If so, pls share the reproduce step.
>
> [1]. https://lwn.net/Articles/592724/
>
> Thanks,
> Guoqing

- reproducing
An rsync runs twice a day, copying to this server from another. The
copy is from a (large) top-level directory. On the 3rd day after booting
6.6.4, the second of these rsyncs triggered the oops. I need to do
more testing to see if I can reliably reproduce it. I have not seen this
oops on earlier stable kernels.

- decoding the oops with scripts/decode_stacktrace.sh gave errors:
readelf: Error: Not an ELF file - it has the wrong magic bytes at
the start

It appears that the decode script doesn't handle compressed modules.
I changed the readelf line to decompress first. This fixes the above
script complaint and the result is attached.

gene

Attachments:
raid6-stacktrace (5.16 kB)

2023-12-07 16:16:08

by Xiao Ni

Subject: Re: md raid6 oops in 6.6.4 stable

On Thu, Dec 7, 2023 at 9:12 PM Genes Lists <[email protected]> wrote:
>
> I have not had chance to git bisect this but since it happened in stable
> I thought it was important to share sooner than later.
>
> One possibly relevant commit between 6.6.3 and 6.6.4 could be:
>
> commit 2c975b0b8b11f1ffb1ed538609e2c89d8abf800e
> Author: Song Liu <[email protected]>
> Date: Fri Nov 17 15:56:30 2023 -0800
>
> md: fix bi_status reporting in md_end_clone_io
>
> log attached shows page_fault_oops.
> Machine was up for 3 days before crash happened.
>
> gene

Hi all

I looked at the registers in the crash log to try to find some hints. The
RDI is ffff8881019312c0, which should be the struct block_device *part
pointer (the first argument). And the CR2 is ffff8881019312e8. So the
panic happens when it tries to access part->bd_stamp. Hope it's helpful
if the addresses are right.
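
A quick sanity check of that arithmetic (the register values are the ones
from the attached log; nothing else is assumed):

#include <stdio.h>

int main(void)
{
	unsigned long rdi = 0xffff8881019312c0UL;  /* RDI: struct block_device *part */
	unsigned long cr2 = 0xffff8881019312e8UL;  /* CR2: faulting address */

	/* 0x28 is exactly the displacement in "lock cmpxchg %rsi,0x28(%rdi)"
	 * from the decoded trace, i.e. a field 0x28 bytes into *part. */
	printf("fault offset into *part: %#lx\n", cr2 - rdi);
	return 0;
}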

Best Regards
Xiao

2023-12-07 17:37:58

by Song Liu

Subject: Re: md raid6 oops in 6.6.4 stable

On Thu, Dec 7, 2023 at 7:58 AM Genes Lists <[email protected]> wrote:
>
> On 12/7/23 09:42, Guoqing Jiang wrote:
> > Hi,
> >
> > On 12/7/23 21:55, Genes Lists wrote:
> >> On 12/7/23 08:30, Bagas Sanjaya wrote:
> >>> On Thu, Dec 07, 2023 at 08:10:04AM -0500, Genes Lists wrote:
> >>>> I have not had chance to git bisect this but since it happened in
> >>>> stable I
> >>>> thought it was important to share sooner than later.
> >>>>
> >>>> One possibly relevant commit between 6.6.3 and 6.6.4 could be:
> >>>>
> >>>> commit 2c975b0b8b11f1ffb1ed538609e2c89d8abf800e
> >>>> Author: Song Liu <[email protected]>
> >>>> Date: Fri Nov 17 15:56:30 2023 -0800
> >>>>
> >>>> md: fix bi_status reporting in md_end_clone_io
> >>>>
> >>>> log attached shows page_fault_oops.
> >>>> Machine was up for 3 days before crash happened.
> >
> > Could you decode the oops (I can't find it in lore for some reason)
> > ([1])? And
> > can it be reproduced reliably? If so, pls share the reproduce step.
> >
> > [1]. https://lwn.net/Articles/592724/
> >
> > Thanks,
> > Guoqing
>
> - reproducing
> An rsync runs 2 x / day. It copies to this server from another. The
> copy is from a (large) top level directory. On the 3rd day after booting
> 6.6.4, the second of these rysnc's triggered the oops. I need to do
> more testing to see if I can reliably reproduce. I have not seen this
> oops on earlier stable kernels.
>
> - decoding oops with scripts/decode_stacktrace.sh had errors :
> readelf: Error: Not an ELF file - it has the wrong magic bytes at
> the start
>
> It appears that the decode script doesn't handle compressed modules.
> I changed the readelf line to decompress first. This fixes the above
> script complaint and the result is attached.

I probably missed something, but I really don't think the commit
(2c975b0b8b11f1ffb1ed538609e2c89d8abf800e) could trigger this issue.

From the trace:

kernel: RIP: 0010:update_io_ticks+0x2c/0x60
=>
2a:* f0 48 0f b1 77 28 lock cmpxchg %rsi,0x28(%rdi) << trapped here.
[...]
kernel: Call Trace:
kernel: <TASK>
kernel: ? __die+0x23/0x70
kernel: ? page_fault_oops+0x171/0x4e0
kernel: ? exc_page_fault+0x175/0x180
kernel: ? asm_exc_page_fault+0x26/0x30
kernel: ? update_io_ticks+0x2c/0x60
kernel: bdev_end_io_acct+0x63/0x160
kernel: md_end_clone_io+0x75/0xa0 <<< change in md_end_clone_io

The commit only changes how we update bi_status. But bi_status was not
used/checked at all between md_end_clone_io and the trap (lock cmpxchg).
Did I miss something?
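
To make that concrete, here is a minimal userspace sketch of the call chain
as I read it; the struct layouts, field offsets, and helper bodies are
simplified stand-ins for the 6.6 code, not verbatim kernel source:

#include <stdio.h>

/* bd_stamp sits 0x28 bytes in, mirroring the 0x28(%rdi) operand above. */
struct block_device {
	unsigned long pad[5];
	unsigned long bd_stamp;
};

struct bio {
	struct block_device *bi_bdev;
	unsigned char bi_status;
	void *bi_private;
};

struct md_io_clone {
	struct bio *orig_bio;
	unsigned long start_time;
};

/* Stand-in for update_io_ticks(): this is where the oops trapped, on the
 * cmpxchg of part->bd_stamp. */
static void update_io_ticks(struct block_device *part, unsigned long now)
{
	unsigned long expected = part->bd_stamp;

	__atomic_compare_exchange_n(&part->bd_stamp, &expected, now, 0,
				    __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}

/* Stand-in for bdev_end_io_acct()/bio_end_io_acct(). */
static void bio_end_io_acct(struct bio *orig, unsigned long start)
{
	update_io_ticks(orig->bi_bdev, start + 1);
}

/* Stand-in for md_end_clone_io(): the 6.6.4 commit only changes the plain
 * value copy of bi_status below; it never touches the bi_bdev pointer that
 * the accounting path dereferences. */
static void md_end_clone_io(struct bio *clone)
{
	struct md_io_clone *mc = clone->bi_private;
	struct bio *orig = mc->orig_bio;

	if (clone->bi_status && !orig->bi_status)
		orig->bi_status = clone->bi_status;

	bio_end_io_acct(orig, mc->start_time);
}

int main(void)
{
	struct block_device bdev = { { 0 }, 0 };
	struct bio orig = { &bdev, 0, NULL };
	struct md_io_clone mc = { &orig, 1 };
	struct bio clone = { NULL, 10, &mc };

	md_end_clone_io(&clone);
	printf("bd_stamp offset: %#zx, orig status: %u\n",
	       __builtin_offsetof(struct block_device, bd_stamp),
	       orig.bi_status);
	return 0;
}

In this sketch, the only pointer the accounting path dereferences is
orig->bi_bdev, and the commit does not touch that pointer.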

Given that the issue takes very long to reproduce, maybe the issue was
already present before 6.6.4?

Thanks,
Song

2023-12-07 19:28:09

by Genes Lists

Subject: Re: md raid6 oops in 6.6.4 stable

On 12/7/23 12:37, Song Liu wrote:
...
> kernel: md_end_clone_io+0x75/0xa0 <<< change in md_end_clone_io
>
> The commit only changes how we update bi_status. But bi_status was not
> used/checked at all between md_end_clone_io and the trap (lock cmpxchg).
> Did I miss something?
>
> Given the issue takes very long to reproduce. Maybe we have the issue
> before 6.6.4?
>
> Thanks,
> Song

Thanks for clarifying that point.

In the meantime I rebooted the server (shutdown was a struggle) - finally I
fsck'd the filesystem (ext4) sitting on the raid6 - and manually ran the
triggering rsync. This of course completed normally. That's either good
or bad depending on your perspective :)

If I can get it to crash again, I will either start a git bisect (from
6.6.3) or see if 6.7-rc4 shows the same issue.

thanks,

gene


2023-12-08 02:05:55

by Bagas Sanjaya

Subject: Re: md raid6 oops in 6.6.4 stable

On 12/7/23 20:58, Thorsten Leemhuis wrote:
> On 07.12.23 14:30, Bagas Sanjaya wrote:
>> On Thu, Dec 07, 2023 at 08:10:04AM -0500, Genes Lists wrote:
>>> I have not had chance to git bisect this but since it happened in stable I
>>> thought it was important to share sooner than later.
>>>
>>> One possibly relevant commit between 6.6.3 and 6.6.4 could be:
>>>
>>> commit 2c975b0b8b11f1ffb1ed538609e2c89d8abf800e
>>> Author: Song Liu <[email protected]>
>>> Date: Fri Nov 17 15:56:30 2023 -0800
>>>
>>> md: fix bi_status reporting in md_end_clone_io
>>>
>>> log attached shows page_fault_oops.
>>> Machine was up for 3 days before crash happened.
>>
>> Can you confirm that culprit by bisection?
>
> Bagas, I know you are trying to help, but sorry, I'd say this is not
> helpful at all -- any maybe even harmful.
>
> From the quoted texts it's pretty clear that the reporter knows that a
> bisection would be helpful, but currently is unable to perform one --
> and even states reasons for reporting it without having it bisected. So
> your message afaics doesn't bring anything new to the table; and I might
> be wrong with that, but I fear some people in a situation like this
> might even be offended by a reply like that, as it states something
> already obvious.
>

Oops, I didn't fully understand the context. Thanks anyway.

--
An old man doll... just what I always wanted! - Clara