2023-08-02 03:52:06

by Dusty Mabe

Subject: XFS metadata CRC errors on zram block device on ppc64le architecture

In Fedora CoreOS we found an issue when using an XFS filesystem on a zram block device on ppc64le:

- https://github.com/coreos/fedora-coreos-tracker/issues/1489
- https://bugzilla.redhat.com/show_bug.cgi?id=2221314

The dmesg output shows several errors:

```
[ 3247.206007] XFS (zram0): Mounting V5 Filesystem 0b7d6149-614c-4f4c-9a1f-a80a9810f58f
[ 3247.210781] XFS (zram0): Metadata CRC error detected at xfs_agf_read_verify+0x108/0x150 [xfs], xfs_agf block 0x80008
[ 3247.211121] XFS (zram0): Unmount and run xfs_repair
[ 3247.211198] XFS (zram0): First 128 bytes of corrupted metadata buffer:
[ 3247.211293] 00000000: fe ed ba be 00 00 00 00 00 00 00 02 00 00 00 00 ................
[ 3247.211405] 00000010: 00 00 00 00 00 00 00 18 00 00 00 01 00 00 00 00 ................
[ 3247.211515] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 3247.211625] 00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 3247.211735] 00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 3247.211842] 00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 3247.211951] 00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 3247.212063] 00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 3247.212171] XFS (zram0): metadata I/O error in "xfs_read_agf+0xb4/0x180 [xfs]" at daddr 0x80008 len 8 error 74
[ 3247.212485] XFS (zram0): Error -117 reserving per-AG metadata reserve pool.
[ 3247.212497] XFS (zram0): Corruption of in-memory data (0x8) detected at xfs_fs_reserve_ag_blocks+0x1e0/0x220 [xfs] (fs/xfs/xfs_fsops.c:587). Shutting down filesystem.
[ 3247.212828] XFS (zram0): Please unmount the filesystem and rectify the problem(s)
[ 3247.212943] XFS (zram0): Ending clean mount
[ 3247.212970] XFS (zram0): Error -5 reserving per-AG metadata reserve pool.
```

The issue can be reproduced easily with a simple script:

```
[root@p8 ~]# cat test.sh
#!/bin/bash
set -eux -o pipefail
modprobe zram num_devices=0
read dev < /sys/class/zram-control/hot_add
echo 10G > /sys/block/zram"${dev}"/disksize
mkfs.xfs /dev/zram"${dev}"
mkdir -p /tmp/foo
mount -t xfs /dev/zram"${dev}" /tmp/foo
```

We ran a kernel bisect and narrowed it down to offending commit af8b04c6:

```
[root@ibm-p8-kvm-03-guest-02 linux]# git bisect good
af8b04c63708fa730c0257084fab91fb2a9cecc4 is the first bad commit
commit af8b04c63708fa730c0257084fab91fb2a9cecc4
Author: Christoph Hellwig <[email protected]>
Date: Tue Apr 11 19:14:46 2023 +0200

zram: simplify bvec iteration in __zram_make_request

bio_for_each_segment synthetize bvecs that never cross page boundaries, so
don't duplicate that work in an inner loop.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Sergey Senozhatsky <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Cc: Jens Axboe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

drivers/block/zram/zram_drv.c | 42 +++++++++++-------------------------------
1 file changed, 11 insertions(+), 31 deletions(-)
```
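
When bisecting by hand, each booted kernel has to be classified as good or bad. A minimal sketch of such a check, assuming the failure always shows up as the "Metadata CRC error detected" line from the dmesg output above; the function name `zram_xfs_crc_check` is hypothetical and not part of any tool:

```shell
#!/bin/bash
# Hypothetical helper for classifying a boot during a manual bisect.
# Reads kernel log text on stdin.
# Returns 1 (bad) if an XFS metadata CRC error on a zram device is present,
# 0 (good) otherwise.
zram_xfs_crc_check() {
    if grep -Eq 'XFS \(zram[0-9]+\): Metadata CRC error detected'; then
        return 1
    fi
    return 0
}
```

After running the reproducer on the kernel under test, something like `dmesg | zram_xfs_crc_check` could then decide between `git bisect good` and `git bisect bad`.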

Any ideas on how to fix the problem?

Thanks!
Dusty


2023-08-02 09:59:15

by Christoph Hellwig

Subject: Re: XFS metadata CRC errors on zram block device on ppc64le architecture

On Tue, Aug 01, 2023 at 11:31:37PM -0400, Dusty Mabe wrote:
> We ran a kernel bisect and narrowed it down to offending commit af8b04c6:
>
> ```
> [root@ibm-p8-kvm-03-guest-02 linux]# git bisect good
> af8b04c63708fa730c0257084fab91fb2a9cecc4 is the first bad commit
> commit af8b04c63708fa730c0257084fab91fb2a9cecc4
> Author: Christoph Hellwig <[email protected]>
> Date: Tue Apr 11 19:14:46 2023 +0200
>
> zram: simplify bvec iteration in __zram_make_request
>
> bio_for_each_segment synthetize bvecs that never cross page boundaries, so
> don't duplicate that work in an inner loop.

> Any ideas on how to fix the problem?

So the interesting cases are:

- ppc64 usually uses 64k page sizes
- ppc64 is somewhat cache incoherent (compared to say x86)

Let me think of this a bit more.

2023-08-02 13:14:18

by Hannes Reinecke

Subject: Re: XFS metadata CRC errors on zram block device on ppc64le architecture

On 8/2/23 11:41, Christoph Hellwig wrote:
> On Tue, Aug 01, 2023 at 11:31:37PM -0400, Dusty Mabe wrote:
>> We ran a kernel bisect and narrowed it down to offending commit af8b04c6:
>>
>> ```
>> [root@ibm-p8-kvm-03-guest-02 linux]# git bisect good
>> af8b04c63708fa730c0257084fab91fb2a9cecc4 is the first bad commit
>> commit af8b04c63708fa730c0257084fab91fb2a9cecc4
>> Author: Christoph Hellwig <[email protected]>
>> Date: Tue Apr 11 19:14:46 2023 +0200
>>
>> zram: simplify bvec iteration in __zram_make_request
>>
>> bio_for_each_segment synthetize bvecs that never cross page boundaries, so
>> don't duplicate that work in an inner loop.
>
>> Any ideas on how to fix the problem?
>
> So the interesting cases are:
>
> - ppc64 usually uses 64k page sizes
> - ppc64 is somewhat cache incoherent (compared to say x86)
>
> Let me think of this a bit more.

Would need to be confirmed first that 64k pages really are in use
(eg we compile ppc64le with 4k page sizes ...).
Dusty?
For which page size did you compile your kernel?

Cheers,

Hannes


2023-08-02 13:41:52

by Dusty Mabe

Subject: Re: XFS metadata CRC errors on zram block device on ppc64le architecture



On 8/2/23 07:03, Hannes Reinecke wrote:
> On 8/2/23 11:41, Christoph Hellwig wrote:
>> On Tue, Aug 01, 2023 at 11:31:37PM -0400, Dusty Mabe wrote:
>>> We ran a kernel bisect and narrowed it down to offending commit af8b04c6:
>>>
>>> ```
>>> [root@ibm-p8-kvm-03-guest-02 linux]# git bisect good
>>> af8b04c63708fa730c0257084fab91fb2a9cecc4 is the first bad commit
>>> commit af8b04c63708fa730c0257084fab91fb2a9cecc4
>>> Author: Christoph Hellwig <[email protected]>
>>> Date: Tue Apr 11 19:14:46 2023 +0200
>>>
>>> zram: simplify bvec iteration in __zram_make_request
>>>
>>> bio_for_each_segment synthetize bvecs that never cross page boundaries, so
>>> don't duplicate that work in an inner loop.
>>
>>> Any ideas on how to fix the problem?
>>
>> So the interesting cases are:
>>
>> - ppc64 usually uses 64k page sizes
>> - ppc64 is somewhat cache incoherent (compared to say x86)
>>
>> Let me think of this a bit more.
>
> Would need to be confirmed first that 64k pages really are in use
> (eg we compile ppc64le with 4k page sizes ...).
> Dusty?
> For which page size did you compile your kernel?


For Fedora the configuration is to enable 64k pages with CONFIG_PPC_64K_PAGES=y
https://src.fedoraproject.org/rpms/kernel/blob/064c1675a16b4d379b42ab6c3397632ca54ad897/f/kernel-ppc64le-fedora.config#_4791

I used the same configuration when running the git bisect.
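
For anyone wanting to check their own setup, here is a small sketch that reads the page size choice out of a kernel config and prints the page size of the running system. The function name `config_page_size` is made up for illustration, and only the 4k and 64k ppc64 options are handled:

```shell
#!/bin/bash
# Hypothetical helper: report which page size a ppc64 kernel config selects.
# Reads config text on stdin; prints 65536, 4096, or "unknown".
config_page_size() {
    cfg=$(cat)
    if printf '%s\n' "$cfg" | grep -q '^CONFIG_PPC_64K_PAGES=y$'; then
        echo 65536
    elif printf '%s\n' "$cfg" | grep -q '^CONFIG_PPC_4K_PAGES=y$'; then
        echo 4096
    else
        echo unknown
    fi
}

# On a running system the page size can be queried directly.
echo "running page size: $(getconf PAGESIZE)"
```

The function could be fed a config file such as the Fedora one linked above, for example `config_page_size < kernel-ppc64le-fedora.config`.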

Dusty

2023-08-03 21:46:56

by Dusty Mabe

Subject: Re: XFS metadata CRC errors on zram block device on ppc64le architecture



On 8/2/23 08:00, Dusty Mabe wrote:
>
>
> On 8/2/23 07:03, Hannes Reinecke wrote:
>> On 8/2/23 11:41, Christoph Hellwig wrote:
>>> On Tue, Aug 01, 2023 at 11:31:37PM -0400, Dusty Mabe wrote:
>>>> We ran a kernel bisect and narrowed it down to offending commit af8b04c6:
>>>>
>>>> ```
>>>> [root@ibm-p8-kvm-03-guest-02 linux]# git bisect good
>>>> af8b04c63708fa730c0257084fab91fb2a9cecc4 is the first bad commit
>>>> commit af8b04c63708fa730c0257084fab91fb2a9cecc4
>>>> Author: Christoph Hellwig <[email protected]>
>>>> Date: Tue Apr 11 19:14:46 2023 +0200
>>>>
>>>> zram: simplify bvec iteration in __zram_make_request
>>>>
>>>> bio_for_each_segment synthetize bvecs that never cross page boundaries, so
>>>> don't duplicate that work in an inner loop.
>>>
>>>> Any ideas on how to fix the problem?
>>>
>>> So the interesting cases are:
>>>
>>> - ppc64 usually uses 64k page sizes
>>> - ppc64 is somewhat cache incoherent (compared to say x86)
>>>
>>> Let me think of this a bit more.
>>
>> Would need to be confirmed first that 64k pages really are in use
>> (eg we compile ppc64le with 4k page sizes ...).
>> Dusty?
>> For which page size did you compile your kernel?
>
>
> For Fedora the configuration is to enable 64k pages with CONFIG_PPC_64K_PAGES=y
> https://src.fedoraproject.org/rpms/kernel/blob/064c1675a16b4d379b42ab6c3397632ca54ad897/f/kernel-ppc64le-fedora.config#_4791
>
> I used the same configuration when running the git bisect.

Naive question from my side: would this be a candidate for reverting while we investigate the root cause?

Dusty



2023-08-04 03:56:44

by Sergey Senozhatsky

Subject: Re: XFS metadata CRC errors on zram block device on ppc64le architecture

On (23/08/03 17:32), Dusty Mabe wrote:
> >>>> zram: simplify bvec iteration in __zram_make_request
> >>>>
> >>>> bio_for_each_segment synthetize bvecs that never cross page boundaries, so
> >>>> don't duplicate that work in an inner loop.
> >>>
> >>>> Any ideas on how to fix the problem?
> >>>
> >>> So the interesting cases are:
> >>>
> >>> - ppc64 usually uses 64k page sizes
> >>> - ppc64 is somewhat cache incoherent (compared to say x86)
> >>>
> >>> Let me think of this a bit more.
> >>
> >> Would need to be confirmed first that 64k pages really are in use
> >> (eg we compile ppc64le with 4k page sizes ...).
> >> Dusty?
> >> For which page size did you compile your kernel?
> >
> >
> > For Fedora the configuration is to enable 64k pages with CONFIG_PPC_64K_PAGES=y
> > https://src.fedoraproject.org/rpms/kernel/blob/064c1675a16b4d379b42ab6c3397632ca54ad897/f/kernel-ppc64le-fedora.config#_4791
> >
> > I used the same configuration when running the git bisect.
>
> Naive question from my side: would this be a candidate for reverting while we investigate the root cause?

That's certainly a possible solution.

But I don't quite understand why af8b04c63708 doesn't work.

2023-08-04 13:55:07

by Christoph Hellwig

Subject: Re: XFS metadata CRC errors on zram block device on ppc64le architecture

FYI, I've found an arm64 system with 16k page size support, and while
I can't reproduce the exact issue, I do see corruption with I/O tests
on zram that doesn't show up on the same system with 4k pages. I'm trying
to understand the details at the moment.


2023-08-04 14:52:52

by Hannes Reinecke

Subject: Re: XFS metadata CRC errors on zram block device on ppc64le architecture

On 8/4/23 15:42, Christoph Hellwig wrote:
> FYI, I've found an arm64 system with 16k page size support, and while
> I can't reproduce the exact issue, I do see corruption with I/O tests
> on zram that doesn't show up on the same system with 4k pages. I'm trying
> to understand the details at the moment.
>
For some reason zram runs with a logical block size of 4k:

#define ZRAM_LOGICAL_BLOCK_SHIFT 12
#define ZRAM_LOGICAL_BLOCK_SIZE (1 << ZRAM_LOGICAL_BLOCK_SHIFT)

so we'll have sub-page accesses for larger page sizes.
My bet is that the issue goes away if we set the logical block size to
page size ...
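
To make the sub-page access concrete, here is a rough sketch of where the failing read from the original report lands. It assumes that the `daddr` in the dmesg output counts 512-byte sectors; the function names are hypothetical and are not kernel APIs:

```shell
#!/bin/bash
# Sketch: map a 512-byte sector number to (page index, byte offset in page)
# for a given page size. Function names are hypothetical, not kernel APIs.
SECTOR_SIZE=512

page_index() {      # $1 = sector, $2 = page size in bytes
    echo $(( $1 / ($2 / $SECTOR_SIZE) ))
}

offset_in_page() {  # $1 = sector, $2 = page size in bytes
    echo $(( ($1 % ($2 / $SECTOR_SIZE)) * $SECTOR_SIZE ))
}

daddr=0x80008       # AGF block address from the dmesg output in the report
for page_size in 4096 65536; do
    echo "page_size=$page_size index=$(page_index "$daddr" "$page_size") offset=$(offset_in_page "$daddr" "$page_size")"
done
```

With 4k pages the read starts at offset 0 of page 65537, but with 64k pages it starts 4096 bytes into page 4096. Only larger page sizes produce a non-zero offset into the page here, which appears consistent with the failure showing up only in that configuration.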

Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman


Subject: Re: XFS metadata CRC errors on zram block device on ppc64le architecture

[CCing the regression list, as it should be in the loop for regressions:
https://docs.kernel.org/admin-guide/reporting-regressions.html]

[TLDR: I'm adding this report to the list of tracked Linux kernel
regressions; the text you find below is based on a few template
paragraphs you might have encountered already in similar form.
See link in footer if these mails annoy you.]

On 02.08.23 05:31, Dusty Mabe wrote:
> In Fedora CoreOS we found an issue with an interaction of an XFS filesystem on a zram block device on ppc64le:
>
> - https://github.com/coreos/fedora-coreos-tracker/issues/1489
> - https://bugzilla.redhat.com/show_bug.cgi?id=2221314
>
> The dmesg output shows several errors:
>
> ```
> [ 3247.206007] XFS (zram0): Mounting V5 Filesystem 0b7d6149-614c-4f4c-9a1f-a80a9810f58f
> [ 3247.210781] XFS (zram0): Metadata CRC error detected at xfs_agf_read_verify+0x108/0x150 [xfs], xfs_agf block 0x80008
> [ 3247.211121] XFS (zram0): Unmount and run xfs_repair
> [...]
> We ran a kernel bisect and narrowed it down to offending commit af8b04c6:
>
> ```
> [root@ibm-p8-kvm-03-guest-02 linux]# git bisect good
> af8b04c63708fa730c0257084fab91fb2a9cecc4 is the first bad commit
> commit af8b04c63708fa730c0257084fab91fb2a9cecc4
> Author: Christoph Hellwig <[email protected]>
> Date: Tue Apr 11 19:14:46 2023 +0200
>
> zram: simplify bvec iteration in __zram_make_request
>
> [...]

Thanks for the report. To be sure the issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
tracking bot:

#regzbot ^introduced af8b04c63708fa730c0257084fab91fb2a9cec
#regzbot title zram: XFS metadata CRC errors on zram block device on ppc64le
#regzbot ignore-activity

This isn't a regression? This issue or a fix for it are already
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply and tell me -- ideally
while also telling regzbot about it, as explained by the page listed in
the footer of this mail.

Developers: When fixing the issue, remember to add 'Link:' tags pointing
to the report (the parent of this mail). See page linked in footer for
details.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.

Subject: Re: XFS metadata CRC errors on zram block device on ppc64le architecture

[CCing Linus and the regressions list; fwiw, initial report is here:
https://lore.kernel.org/all/[email protected]/
]

On 04.08.23 05:25, Sergey Senozhatsky wrote:
> On (23/08/03 17:32), Dusty Mabe wrote:
>>>>>> zram: simplify bvec iteration in __zram_make_request
>>>>>>
>>>>>> bio_for_each_segment synthetize bvecs that never cross page boundaries, so
>>>>>> don't duplicate that work in an inner loop.
>>>>>
>>>>>> Any ideas on how to fix the problem?
>>>>>
>>>>> So the interesting cases are:
>>>>>
>>>>> - ppc64 usually uses 64k page sizes
>>>>> - ppc64 is somewhat cache incoherent (compared to say x86)
>>>>>
>>>>> Let me think of this a bit more.
>>>>
>>>> Would need to be confirmed first that 64k pages really are in use
>>>> (eg we compile ppc64le with 4k page sizes ...).
>>>> Dusty?
>>>> For which page size did you compile your kernel?
>>>
>>> For Fedora the configuration is to enable 64k pages with CONFIG_PPC_64K_PAGES=y
>>> https://src.fedoraproject.org/rpms/kernel/blob/064c1675a16b4d379b42ab6c3397632ca54ad897/f/kernel-ppc64le-fedora.config#_4791
>>>
>>> I used the same configuration when running the git bisect.
>>
>> Naive question from my side: would this be a candidate for reverting while we investigate the root cause?
>
> That's certainly a possible solution.
>
> But I don't quite understand why af8b04c63708 doesn't work.

Seems Christoph and Hannes (thx to both of you) got a bit closer to
that, but as this apparently is causing data corruption and we are close
to -rc5 I'd like to bring the following up now, as it gets harder to
discuss these things on weekends:

Should Linus revert the culprit for -rc5 if no fix is found within the
next 48 hours?

Ciao, Thorsten

Subject: Re: XFS metadata CRC errors on zram block device on ppc64le architecture

[TLDR: This mail is primarily relevant for Linux regression tracking. A
change or fix related to the regression discussed in this thread was
posted or applied, but it did not use a Closes: tag to point to the
report, as Linus and the documentation call for. Things happen, no
worries -- but now the regression tracking bot needs to be told manually
about the fix. See link in footer if these mails annoy you.]

On 04.08.23 18:22, Linux regression tracking #adding (Thorsten Leemhuis)
wrote:

> On 02.08.23 05:31, Dusty Mabe wrote:
>> In Fedora CoreOS we found an issue with an interaction of an XFS filesystem on a zram block device on ppc64le:
>>
>> - https://github.com/coreos/fedora-coreos-tracker/issues/1489
>> - https://bugzilla.redhat.com/show_bug.cgi?id=2221314

#regzbot monitor:
https://lore.kernel.org/all/[email protected]/
#regzbot fix: zram: take device and not only bvec offset into account
#regzbot ignore-activity

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.